IBM Open-Sources New Search Technology

IBM will open-source a powerful new information search and analysis technology that can ferret out useful facts hidden in millions of documents, e-mails, images, audio and video.

IBM plans to release as open-source a sophisticated new search and text analysis technology that is able to find relationships, trends and facts buried in a wide range of unstructured data, including e-mails, Web pages, text documents, images, audio and video.

Called UIMA (Unstructured Information Management Architecture), the technology is able to go beyond the keyword analysis typically used by most search engines to discern the semantic meaning within text and other unstructured data, said Nelson Mattos, vice president of information integration with IBM in San Jose, Calif.

IBM implemented UIMA in its WebSphere Information Integrator OmniFind Edition as part of its enterprise search platform, which Mattos said was the first commercially available application for this technology. IBM announced UIMA at the start of the LinuxWorld Conference & Expo in San Francisco this week.

UIMA is the result of four years of development by IBM Research, supported by the Defense Advanced Research Projects Agency (DARPA), the central research and development arm of the U.S. Defense Department.

Major universities and private research organizations, including Carnegie Mellon University, Columbia University and the University of Massachusetts, participated in the development of the technology and are now using UIMA in course work and research projects, according to IBM officials.

BBN Technologies Inc., Science Applications International Corp., the Mayo Clinic and MITRE Corp. also contributed to the research.

"We are announcing that we are going to be open-sourcing that architecture to allow for a broad adoption in the marketplace," Mattos said.

Releasing the UIMA technology as open-source code will make it easier for commercial, government, corporate and academic software developers to produce extensions and applications for the search technology, Mattos said. IBM stands to benefit when it gets opportunities to provide the computing and networking infrastructure that supports these applications, he said.

UIMA will be presented to the Open Source Technology Group and made available through the SourceForge online developer community by the end of 2005. Developers can also download the UIMA framework for free from IBM's alphaWorks division.

The search technology is particularly valuable for business intelligence applications that sift through e-mails or electronic documents to reveal trends that would otherwise be hidden from basic keyword searches, Mattos said.

For example, UIMA can be used to search through call center reports on problems with a particular product, such as a car, to reveal mechanical or maintenance problems, Mattos said.

Such searches may reveal a product quality problem earlier in the production cycle, so changes can be made before it damages the product's reputation or sales, he said.

It also allows companies to analyze "sales versus maintenance cost of a product and realize that while you are doing very well selling certain products, the maintenance cost of those is very high" because there are so many complaints and service calls about them, said Mattos.

Offering UIMA as an open-source technology is a good move because it increases the chances that it can be accepted as an industry standard for searching and analyzing all types of unstructured data, said Dana Gardner, principal analyst with industry researcher Interarbor Solutions.

"There has been a mishmash approach to text analytics, and I think there is a real value to having an interoperable methodology" in the market that brings together many of the best ideas about analyzing unstructured data, Gardner said.

The search engines available today are able to find huge numbers of documents with keyword searches, but they are poor at providing an overview of the information contained in those documents, Gardner said. "We've had a bunch of trees, but no way of viewing the forest when it comes to text analytics."

If UIMA is widely accepted as an industry standard, "it could allow for real-time analysis of an entire corporate intranet, which could be extremely powerful and allows for knowledge to be much more attainable, recoverable and actionable," said Gardner.

It's also true that the technology could be used as a powerful intelligence-gathering tool by the National Security Agency or the Central Intelligence Agency to sift through e-mail messages, phone conversations or many other kinds of data, Gardner observed.

However, "I think that the spooks at the NSA and that ilk probably have these kinds of capabilities already," Gardner said.

"UIMA will be much more valuable by taking it out of the cloistered domain of intelligence and making it available to the much larger domains of business," he said.

John Pallatto

John Pallatto has been editor in chief of QuinStreet Inc.'s since October 2012. He has more than 40 years of experience as a professional journalist working at a daily newspaper and...