NEWS ANALYSIS: Big data means big problems if the data isn't accurate. But don't expect a quick technology fix for data quality. Instead, businesses must put strategies in place to make data dependable.
You are embarrassed to discover your company is sending mail to a former customer who is now deceased. You have to hire a "swivel chair" data-entry operator who spends the day re-keying data between incompatible systems, swiveling from screen to screen. Government organizations are supposed to share data on suspicious individuals, but a name falls through the cracks and a bombing occurs. Those are all examples, ranging from embarrassing to fatal, of data that should have been updated, integrated and shared, but wasn't.
Individuals outside the data business are perplexed that, in this era of big data, cloud computing and systems that seem to know far too much about you, these "dirty data" problems still exist. Those inside the data industry, on the other hand, are surprised the problems are not more prevalent.
At this year's MIT Chief Data Officer and Information Quality Symposium, held at the university's Cambridge, Mass., campus, both success and not-so-success stories were in evidence. This was the conference's seventh year, and while last year's event focused on technology for cleaning and normalizing data and ensuring the data on which your company depends is accurate, this year's focus was much more on the business strategies that must be put in place to make data dependable.
"The hardest part is not the technology but the people and social issues," said Deborah Nightingale, director of the MIT Sociotechnical Systems Research Center, in her remarks opening the conference. Nightingale's work focuses on enterprise technology systems, and she was referring both to enterprise systems generally and to the data that powers them.
In an era when "big data" has moved from barely mentioned to industry buzzword, the accuracy of the data underlying those systems is often overlooked. "Big data" was not a favorite term among the approximately 180 conference attendees, but it was much on the minds of the speakers, who noted that the industry is transitioning from a largely structured-data foundation to a mix of structured and unstructured data.
The complexity of working with myriad data types and myriad, often incompatible systems was underscored by Dat Tran, the deputy assistant secretary for data governance and analysis at the U.S. Department of Veterans Affairs.
"The VA does not have an integrated data environment. We have myriad systems and databases, and enterprise data standards do not exist. There is no 360-degree view of the customer," Tran said in a forthright discussion of the obstacles facing a high-profile agency dealing with 11 petabytes of data and 6.3 million patients. Tran echoed Nightingale's remarks, noting that cleaning up the data traffic jam at the VA is less about finding a quick technology fix than about getting all the business and technology groups to agree on a data quality process and timetable.
"Data quality is a business problem; however, effective data governance requires business users and IT to work together," said Tran.
The idea that there is a quick technology fix for data quality was knocked down by numerous speakers and attendees at the conference. The difficult work of mapping out business processes, ensuring that those who initially enter the data feel empowered and responsible for its accuracy, and building a data quality infrastructure that also assures privacy and security predates the advent of big data. The arrival of large outside pools of unstructured data in the enterprise only makes the data quality issue more critical.
Some technological help may be on the near horizon. Peter Kaomea, CIO of the law firm Sullivan & Cromwell, sees internal crowdsourced data cleansing as an emerging model once security and privacy safeguards are in place. Mark Temple-Raston, senior vice president of data management at Citigroup, said advances in natural language processing applied to business communications could give business data the mathematical underpinnings it currently lacks.
The role of the chief data officer in improving data quality was, unsurprisingly, a big topic at the conference. But it was clear that wherever the position resides, whether in the business groups or under the auspices of the CIO, the formula is the same: understand the business processes, communicate the value of data quality, and then implement the technology solutions.
Eric Lundquist is a technology analyst at Ziff Brothers Investments, a private investment firm. Lundquist, who was editor-in-chief at eWEEK (previously PC WEEK) from 1996 to 2008, authored this article for eWEEK to share his thoughts on technology, products and services. No investment advice is offered in this article. All duties are disclaimed. Lundquist works separately for a private investment firm, which may at any time invest in companies whose products are discussed in this article, and no disclosure of securities transactions will be made.