Data Quality Is First Step Toward Reliable Data Analysis

NEWS ANALYSIS: Big data means big problems if the data hasn’t been checked to make sure it has been correctly entered, saved and cleaned.

You are embarrassed to discover your company is sending mail to a former customer now deceased. You have to hire a swivel-chair data input operator who spends the day re-entering data between incompatible systems by swiveling from screen to screen.

Government organizations are supposed to share data on suspicious individuals, but a name falls through the cracks and a bombing occurs. Those cases, ranging from embarrassing to fatal, are all examples of data that should have been updated, integrated and shared but wasn't.

Individuals outside the data business are perplexed that, in this era of big data, cloud computing and systems that seem to know way too much about you, those "dirty" data problems still exist. Individuals inside the data processing industry are surprised the problems are not more prevalent.

At this year's MIT chief data officer and information quality symposium, held July 17 to 19 at the Massachusetts Institute of Technology's Cambridge campus, both the success stories and the epic failures were evident. This is the conference's seventh year, and while past years focused on developing technology to clean and normalize data and to make sure the data on which your company depends is accurate, the focus this year was much more on the business strategies that must be put in place to make data dependable.

"The hardest part is not the technology but the people and social issues," said Deborah Nightingale, the director of the MIT Socio-Technical Systems Research Center, in her remarks that opened the conference. Nightingale has a focus on enterprise technology systems, and she was referring to enterprise systems generally as well as the data that is used to power those systems.

In an era when big data has moved from little-mentioned to current industry buzzword, the accuracy of the data underlying those systems is often overlooked. The term "big data" was not especially favored by the approximately 180 conference attendees, but it was much on the minds of speakers, who noted the industry is in a transition from handling largely structured data to a mix of structured and unstructured data.

The complexity of working with myriad data types and myriad, often incompatible, systems was underscored by Dat Tran, the deputy assistant secretary for data governance and analysis at the U.S. Department of Veterans Affairs.

"The VA does not have an integrated data environment; we have myriad systems and databases, and enterprise data standards do not exist. There is no 360-degree view of the customer," Tran said in a forthright discussion of the obstacles facing a high-profile agency dealing with 11 petabytes of data and 6.3 million patients.

Tran echoed Nightingale's remarks in noting that cleaning up the data traffic jam at the VA is more about getting all the business and technology groups to agree on a data quality process and time schedule than about searching for some quick technology fix.

"Data quality is a business problem; however, effective data governance requires business users and IT to work together," said Tran.

The idea that there is a quick technology fix to data quality was a myth knocked down by numerous speakers and attendees at the conference. The difficult work of mapping out business processes, making sure that those who initially enter the data feel empowered and responsible for data accuracy, and creating a data quality infrastructure that also assures privacy and security existed before the advent of the "big data" buzzword. The existence of large, outside pools of unstructured data inside the enterprise only makes the data quality issue more critical.

Some technological help may be on the near horizon. Peter Kaomea, CIO for the Sullivan and Cromwell law firm, mentioned internal crowd-sourced data cleansing as an emerging model once security and privacy considerations are in place.

Mark Temple-Raston, senior vice president of data management at Citigroup, said advances in natural language processing applied to business communications could provide business data with mathematical underpinnings currently unavailable.

The role of the chief data officer in improving data quality was obviously a big topic at the conference, but it was clear that wherever the position resided, whether in business groups or under the CIO's auspices, the prime process for improving data quality overall was the same: understanding the business processes, communicating the value of data quality and then implementing technology solutions.

Eric Lundquist is a technology analyst at Ziff Brothers Investments, a private investment firm. Lundquist, who was editor in chief at eWEEK (previously PC WEEK) from 1996 to 2008, authored this article for eWEEK to share his thoughts on technology, products and services. No investment advice is offered in this article. All duties are disclaimed. Lundquist works separately for a private investment firm, which may at any time invest in companies whose products are discussed in this article, and no disclosure of securities transactions will be made.