Garbage In, Garbage Out of Control

By Lisa Vaas  |  Posted 2005-06-15

I just read Baselinemag.com's excellent story on the rising threat posed by bad data that's stored in myriad databases across the land: registries of motor vehicles, insurance firms, marketing companies and other commercial sources, as well as public records such as court documents and licenses.

My thoughts? Be afraid. Be very afraid.

Know this: Personal identity information is full of inaccuracies, typos and outdated information that can, at the merely annoying end of the spectrum, plague innocent citizens as they get turned down for insurance or have credit applications denied.

Things turn truly Orwellian, however, when you get into the scenarios outlined in that story, in which innocent victims of dirty data suffer much more traumatically.

Case in point: Steven Calderon got tossed into jail to rot for a week in January 2002 for felonies he didn't commit, including rape and child molestation.

The problem? Police, and Calderon's employer, Fry's Electronics, believed data aggregated and supplied by ChoicePoint, rather than the evidence in front of their eyes, which would have told them that Calderon didn't match the perp's height, weight, driver's license number or fingerprints.

I wish I could tell you that this was a database problem and that technology vendors are all over the problem of cleansing this data.


Granted, ETL (extraction, transformation and loading) vendors are all about fixing the mess that passes for data in these discrete databases. But whatever achievements we get from that camp will still leave us struggling fiercely against the tide when it comes to the urge to merge these soiled little buckets.

Because aggregation is happening all over the place, linking these databases together regardless of the power and range it gives to the propagation of dirty data.

You get it on the technology front, of course, with admittedly splendid analytics applications coming from companies such as SAP.

When I spoke recently with Roman Bukary, leader of SAP's xApps and Analytic Applications product marketing, he told me that this is what it's all about: going from standard analytic reports to composite analytic applications.

What does that mean? It means business users can initiate and take action on workflow applications inside analytic applications. In other words, with the upcoming merging of technologies such as Microsoft's and SAP's in the Mendocino product, you'll be able to flit around in Office and decide to give somebody a pay raise without having to leave to go fiddle with the SAP HR module.

It means that analytics is filtering down to the masses, just as it has been for a long time and just as it should to mean anything to a business. It means that SAP, for example, is partnering with Macromedia to make analytics so sexy and alluring that pie charts will spin into position in saturated four-color Flash rendition.

It's Not the Amount of Information

This is all great. I love the applications. When I saw Mendocino demonstrated at Sapphire, I wanted it. For what, who cares? It just looks like so much fun to play with all that raw enterprise power, all from the comfort of Office.

But then I had a conversation with a computer scientist whom I met last week at IDC's forum on business intelligence, and I sobered up.

Dietrich Falkenthal is interested in visualization technology, of which many companies besides SAP gave gorgeous demonstrations at IDC's gig.

The key, Falkenthal said, is not the amount of information that can be collected from sensors, user inputs or other data sources, but how to make it useful, especially in tactical environments. Visualization technology is important to medical services, law enforcement and the military, for example, because they have a limited time to make decisions.

But what can't be done with current technology is to come up with automated tools to intelligently handle complex real-time data. You can present data in gorgeous spinning pie charts, but if it's the wrong data presented, the wrong conclusions can be reached, Falkenthal pointed out.

"Tools are needed to process a lot of data and take some burden off users. Essentially, to do a smart push of important data that the user doesn't yet know he or she needs. For the most part, it's still garbage in, garbage out, but visualization tools may help."

This is not data cleansing, where records are combed through to eliminate name-spelling variants, for example. This is about incorrect data correlations: something you most certainly don't want police, passport agents, medical professionals or anybody in the military to be acting on.
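
To make that distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical: the record fields, the sample values and the function names are invented for illustration and say nothing about how any vendor or agency actually matches records. The first function is the cleansing step (collapsing name-spelling variants). The other two show why a clean name match is still not a trustworthy correlation unless corroborating fields agree.

```python
import re


def normalize_name(name):
    """Cleansing step: collapse case, punctuation and spacing variants of a name."""
    letters_only = re.sub(r"[^a-z ]", "", name.lower())
    return " ".join(letters_only.split())


def fields_in_conflict(record_a, record_b, fields):
    """Return the corroborating fields on which the two records disagree.

    Fields missing from either record are skipped rather than counted.
    """
    return [
        f for f in fields
        if f in record_a and f in record_b and record_a[f] != record_b[f]
    ]


def same_person(record_a, record_b,
                fields=("license_no", "height_in", "birth_date")):
    """Correlation step: link two records only if names match AND nothing conflicts."""
    if normalize_name(record_a["name"]) != normalize_name(record_b["name"]):
        return False
    return not fields_in_conflict(record_a, record_b, fields)


# Hypothetical sample records, invented for this example.
applicant = {"name": "John  Q. Smith", "license_no": "A1234567", "height_in": 70}
court_record = {"name": "JOHN Q SMITH", "license_no": "Z7654321", "height_in": 64}

names_match = (
    normalize_name(applicant["name"]) == normalize_name(court_record["name"])
)
conflicts = fields_in_conflict(applicant, court_record, ("license_no", "height_in"))
linked = same_person(applicant, court_record)
```

In this made-up case the two names cleanse to the same string, yet the license number and height disagree, so the records are not linked. A system that stopped at the name match would have correlated two different people.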

Research in this area is new, but Falkenthal pointed me to universities such as MIT's Engineering Systems Division or to companies and research labs that are thinking about these issues.

Meanwhile, though, technology is forging ahead, synching up data sources. To compound the problem, nutso legislation is being passed.

The Real ID Act will usher in the nation's first national ID system, with little regard for the government's ability to deploy the technology in ways that would prevent citizens from being preyed on by identity thieves and with no regard for the fact that it relies on data from sources, such as state RMVs, that are increasingly targets for identity theft. And which, of course, contain typos, outdated information, etc.

The bill dictates that all states collect personal information from citizens before allowing them to obtain a driver's license, including, at minimum, name, date of birth, gender, driver's license or identification card number, digital photograph, address, and signature.

Collection of this particular information is not new. Linkage of states' databases is. The bill specifies that states link what are at present discrete databases, creating, in effect, one nationwide database with personal information pertaining to all citizens.

Don't Trust Companies to Protect Your Data

ChoicePoint doesn't take responsibility for aggregating and propagating filthy data. ChoicePoint says it's the data sources (RMVs, courts, etc.) that are responsible for the data. If it's from the government, it must be good stuff, the thinking goes.

Do you trust the government to have the right information on you?

Do you trust the government to protect your data from thieves?

If you answered yes to either question, you're naive.

Back when the Real ID Act was on the brink of passing, I chatted with Marc Rotenberg, executive director of the Electronic Privacy Information Center in Washington. He pointed out that the problem is not that database information can't be encrypted; it's that the government has proven untrustworthy in doing so.

Look at the metric of FISMA, the Federal Information Security Management Act. It's legislation that mandates that government agencies be graded on their ability to protect data. The Department of Homeland Security has gotten four Fs in a row. If they're not securing data, do we really want to trust state RMVs?

Your information is already in these databases. Do you want it in one or two databases, or 50? Do you want every potentially crummy, unencrypted piece of data to be linked to every other potentially crummy, unencrypted piece of data?

I know I'm mixing the topics: we've got dirty data, and we've got unencrypted, unprotected data. But both problems wind up with the same result: people getting thrown into jail for other people's crimes. People getting stopped at the airport because they have Arabic names that look like terrorists' names. Innocent people being unfairly persecuted.

What's the answer? I wouldn't advise looking to technology to solve the problem. I would go back to the wise stance of paranoia and being a fierce watchdog over who gets your information and what they plan to do with it.

My favorite spot for how-tos on limiting the spread of personal information is Junkbusters. There, you'll be told how to get companies to stop renting or sharing your name; how to get off lists sold by companies that profit off your information (that means you'll be corresponding with, oh joy, ChoicePoint et al.); how to browse the Web without leaving a trail of personal information behind you in the form of cookies; and more.

Is it easy? Oh, no. Believe me, I've been through Junkbusters' 12-step program for recovering personal data leakers. One little change in address, and presto! You're back on the list of data leakage.

But it is satisfying, deeply satisfying, to get your personal information as expunged as possible from as many of these dirty data buckets as possible, and I highly recommend it. I really like the idea that I hamper the profits of those who broker my personal information with no remuneration to myself, and who do so with casual disregard for propagating garbage.

Lisa Vaas is Ziff Davis Internet's news editor in charge of operations. She is also the editor of eWEEK.com's Database and Business Intelligence topic center. She has been with eWEEK since 1995, most recently covering enterprise applications and database technology.
