How to Deal with Unstructured Data - Oh Brother, Where Art Thou?

Sure, you've got lots of information. Your servers are full of information. The problem is, how do you deal with it? W.H. Inmon, the father of data warehousing, provides some thoughts on that.

Unstructured data has been around for a long time - certainly longer than the computer. Consider the Bible, the Egyptian hieroglyphics, and the Kama Sutra. They long predated silicon chips. And search engines have been around for a while as well - but not as long as the printed word. When it comes to unlocking the valuable information contained in unstructured data, even with sophisticated search engines, the world really has not come very far. So, why would this be the case?

Garbage In, Garbage Out

There is a missing ingredient that needs to be present in order for search engines to unlock the real value of unstructured data. To help explain that missing ingredient, consider the oldest information technology conundrum of all: GIGO or "Garbage In, Garbage Out." What happens when a powerful search engine is used against textual data that is essentially unscrubbed, unwashed and unintegrated? The answer is that the result of the search engine's work, which is returned to the end user, is also unscrubbed and unwashed.

For a search of text to be really powerful, the text being searched needs to be integrated before the search is done. Once that integration is complete, you no longer start with garbage in - and you no longer need to expect garbage out.

Internet vs. Corporate Data

In the case of searching the Internet, scrubbing the data is a little bit of a reach. Trying to scrub and integrate data across the Internet is probably a futile endeavor. To attempt to integrate data on the Internet is roughly like trying to boil the ocean or, at least, Lake Erie.

But corporate data is another matter for two reasons. First, when it comes to corporate data, there is a finite amount of it - as opposed to the almost infinite amount of Internet data out there. And second, unlike Internet data, corporate data is almost all relevant to the business of the corporation. It is safe to say that only a small part of the data on the Internet is relevant to the business of any one corporation - even a large and diverse corporation such as IBM or Dow Chemical.

Therefore, integrating corporate textual data, or preparing it for the purpose of search and analysis, is a very real and very practical possibility.

What Kinds of Data Need Integration?

So what kinds of corporate data need to be integrated? The only limitation to that lies in the imagination of the user. Some of the obvious kinds of corporate data that could be integrated include:

1. Customer data - relating to customer communications

2. Safety data - relating to accidents, inspections, repairs, warranties and other important events

3. Contracts data - data relating to the specific contracts of the corporation

4. Discovery data - data in the litigation process

5. Compliance data - descriptions of sensitive corporate events and transactions, etc.

Therefore, there are few limits - or, theoretically, no limits at all - to the potential uses of integrated corporate textual data.

The Advantage of Data Integration

One of the major advantages of integrating corporate textual data is that, once it's integrated, it can then be put into a database and reused. In other words, the corporate textual data need only be integrated once. Thereafter, it can be searched and analyzed again as often as desired.

Typically, after the corporate textual data is integrated, it is placed into a data warehouse. Once there, it can be combined with the structured data already in the data warehouse, which makes an entirely new class of queries possible.

Such a query can be called a hybrid query because it runs against both structured and unstructured data. And the resulting data warehouse is truly an integrated data warehouse - it contains data of both structured and unstructured origin.
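To make the idea of a hybrid query concrete, here is a minimal sketch in Python using the standard sqlite3 module. Everything in it is an assumption made for illustration: the table names (customers, orders, communications), the columns, the sample rows, and the use of a simple LIKE match as a stand-in for real textual analytics. It is not a description of any particular product or of the integration process itself; it only shows one query touching structured order data and integrated text at the same time.

```python
import sqlite3


def build_warehouse():
    """Create a tiny in-memory 'warehouse' (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Structured data
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL);
        -- Text that is assumed to have been integrated already
        CREATE TABLE communications (comm_id INTEGER PRIMARY KEY,
                                     customer_id INTEGER,
                                     received_on TEXT, body TEXT);
    """)
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Mrs. Jones"), (2, "Mr. Smith")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(10, 1, 250.0), (11, 2, 90.0)])
    conn.executemany("INSERT INTO communications VALUES (?, ?, ?, ?)", [
        (100, 1, "2007-05-02", "My order was botched and I am very unhappy."),
        (101, 2, "2007-05-03", "Thanks, the delivery arrived on time."),
    ])
    return conn


def customers_mentioning(conn, term, min_amount):
    """Hybrid query: filter on order amount (structured) and on the
    text of customer communications (unstructured origin)."""
    rows = conn.execute("""
        SELECT DISTINCT c.name
        FROM customers c
        JOIN orders o ON o.customer_id = c.customer_id
        JOIN communications m ON m.customer_id = c.customer_id
        WHERE o.amount >= ? AND m.body LIKE ?
        ORDER BY c.name
    """, (min_amount, f"%{term}%")).fetchall()
    return [row[0] for row in rows]
```

Calling customers_mentioning(build_warehouse(), "botched", 100.0) asks for customers with an order of at least 100 who also used the word "botched" in a communication. Neither the orders table nor the text alone could answer that question.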

Customer Communications Analysis

As an example of just one of the many applications that open up to the corporation, consider customer communications analysis. It is normal to receive e-mails from customers. But, once those e-mails are read, normally they are lost. They go into a holding file and just sit there - along with thousands of other e-mails.

The problem is, when the corporation needs those communications, they are difficult to find. This becomes especially important when future communications occur with the customer.

The Case of Mrs. Jones

To illustrate this point, let's look at the case of a customer named Mrs. Jones. Let's suppose she wrote a scathing e-mail last month because an order of hers had been botched. This month, our salesperson wants to call up Mrs. Jones and solicit some more business. Is it important that the salesperson knows about last month's e-mail from Mrs. Jones?

The answer is, of course it is important. If we want to sell Mrs. Jones something new, ANY recent direct communication is important - whether it is directed at or received from Mrs. Jones. So, how does the corporation find the e-mails that are relevant? And how does it filter out the ones that are not?
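As a rough sketch of what that filtering could look like once the e-mails have been integrated, the Python below splits a set of messages into relevant and irrelevant ones for a given customer. The Email record, the split_by_relevance function and the 60-day window are all hypothetical choices made for this illustration. The sketch also assumes that integration has already resolved every sender and recipient to one consistent customer name, which is exactly the kind of work that raw, unintegrated e-mail does not give you for free.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Email:
    """A hypothetical, already-integrated e-mail record."""
    sender: str
    recipient: str
    sent_on: date
    subject: str


def split_by_relevance(emails, customer, today, window_days=60):
    """Return (relevant, irrelevant) lists for one customer.

    An e-mail counts as relevant here if it is a direct communication
    (sent to or received from the customer) and is recent (sent within
    the last `window_days` days). Both rules are assumptions.
    """
    cutoff = today - timedelta(days=window_days)
    relevant, irrelevant = [], []
    for email in emails:
        direct = customer in (email.sender, email.recipient)
        recent = cutoff <= email.sent_on <= today
        if direct and recent:
            relevant.append(email)
        else:
            irrelevant.append(email)
    return relevant, irrelevant
```

With something like this in place, the salesperson could see last month's complaint from Mrs. Jones before picking up the phone, without wading through thousands of unrelated messages.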

This, then, is just one example of the many, many cases where unstructured textual data could be used - if, in fact, that corporate textual data had been placed in a database once it had passed through an integration process designed specifically for textual integration.

W.H. Inmon is considered the father of data warehousing. He has written 49 books, translated into nine languages. His book on data warehousing has sold approximately 500,000 copies around the world and is in its fourth edition. W.H. Inmon founded and took public the world's first company to build and sell ETL. He has written more than 600 articles and is published in most major trade journals.

W.H. Inmon has conducted seminars and spoken at conferences on every continent except Antarctica. He holds nine software patents. His latest company is Forest Rim Technologies Inc., a company dedicated to the access and integration of unstructured data into the data warehouse environment. His weekly newsletter is one of the most widely read in the industry, having 75,000 weekly subscribers.

(For more information about the corporate usage and structuring of unstructured data, refer to W.H. Inmon's recently published book, Tapping into Unstructured Data: Integrating Unstructured Data and Textual Analytics into Business Intelligence, Prentice Hall, 2007.)

W.H. Inmon can be reached at