Data Quality: The Achilles' Heel of Big Data

February 21, 2013 · By Emanuel Rasolofomasoandro
Big data — large volumes of data from multiple sources — can be a powerful tool for applications ranging from marketing and customer service to fraud detection and politics. Almost every activity in modern society generates large volumes of data, more and more of which are captured and stored. The cost of data storage and processing continues to drop, and the software tools needed to use the data continue to evolve.
By combining data from websites, call centers, email campaigns, and Facebook and Twitter, companies can gain a comprehensive understanding of what their customers need and want, as well as how best to serve them. Unfortunately, not all who have tried to implement big data have succeeded. Indeed, the Achilles' heel of big data is data quality. Good data doesn't happen by itself; data on its own is inherently messy. Quality data requires someone with data quality expertise as well as the responsibility and authority to make it happen.
A perfect example of this is the experience I went through while working with a client. This Fortune 500 company spent years building a customer data warehouse. In the beginning the company used only web data, but later added call-center data and customer descriptive data from multiple systems. As the warehouse grew, more and more departments within the company started to consume its data, and the data gained greater visibility and usefulness.
Greater visibility also exposed data-quality issues: the total number of customers, customer value, call-center usage, and customer geolocation all differed depending on which view of the data was used. Even worse, there were inconsistencies between the customer data warehouse in question and other data sources within the company. As a result, different executives were given similar reports with differing figures, and suddenly data became political. That is, whose data were correct became a political issue.
In some cases, data was at variance by 5 percent to 10 percent, but in others it was off by as much as 50 percent — which landed many departments in trouble. Data reconciliation became a top priority. Unfortunately, the reconciliation was perceived as an annoying technical issue everyone tried to avoid; reportedly, employees hid when "volunteers" were sought for the project. In the end, executives deemed the reconciliation a simple task and assigned it to the most junior people available.
In reality, a successful big data reconciliation requires a rigorous process and the right tools. The process should begin at the very source of the data, and variances should be checked at each step as the data are further manipulated. With multiple sources, the data should be segmented and examined in chunks, not in bulk. Furthermore, large volumes of data call for appropriate large-scale database tools, not spreadsheets. Unfortunately, the junior people assigned to reconciliation had neither access to large-scale database tools nor the ability to use them. Ultimately, my client's team spent considerable time in vain on a haphazard reconciliation effort lacking rigor and structure. The causes of the variance in the data were never clearly identified, and therefore the issues couldn't be resolved.
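The segmented approach described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration — the source names, the segmentation key, and the 5 percent tolerance are assumptions for the example, not details from the client engagement — but it shows the core idea: compare the two sources chunk by chunk so that discrepancies can be localized rather than lost in bulk totals.

```python
# Minimal sketch of segmented reconciliation between two data sources.
# Source names, segments, and the 5% tolerance are illustrative assumptions.
from collections import defaultdict

def segment_counts(records, key):
    """Count records per segment (e.g., per region or per channel)."""
    counts = defaultdict(int)
    for record in records:
        counts[record[key]] += 1
    return counts

def reconcile(source_a, source_b, key, tolerance=0.05):
    """Compare per-segment counts; return segments whose relative
    variance exceeds the tolerance, with both counts for inspection."""
    a = segment_counts(source_a, key)
    b = segment_counts(source_b, key)
    discrepancies = {}
    for seg in sorted(set(a) | set(b)):
        ca, cb = a.get(seg, 0), b.get(seg, 0)
        variance = abs(ca - cb) / max(ca, cb, 1)
        if variance > tolerance:
            discrepancies[seg] = (ca, cb, round(variance, 3))
    return discrepancies

# Hypothetical extracts: warehouse records vs. call-center records
warehouse = [{"region": "east"}] * 100 + [{"region": "west"}] * 50
call_center = [{"region": "east"}] * 98 + [{"region": "west"}] * 30

print(reconcile(warehouse, call_center, "region"))
# flags "west" (50 vs. 30); "east" (100 vs. 98) is within tolerance
```

Examining segments one at a time — rather than comparing grand totals — is what makes the cause of a variance findable: here the mismatch is immediately traced to the "west" segment instead of surfacing only as an unexplained gap in the overall customer count. In practice, the same per-segment comparison would be pushed down into the database as a grouped query rather than run in application code.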