IBM Open-Sources UIMA for Unstructured Text Analysis

IBM is open-sourcing UIMA, its search and text analysis technology that mines unstructured data to uncover hidden relationships and trends.

IBM on Jan. 23 plans to carry through with its promise to open-source search and text analysis technology that mines unstructured data—such as documents, images, comment and note fields, e-mail, and rich media such as video and audio—to uncover hidden relationships, trends and facts.

IBM is handing code for the technology, called UIMA (Unstructured Information Management Architecture), over to the world's largest open-source development site.

The company plans to move the project to a full open-source community development model later in the year.

Nelson Mattos, IBM distinguished engineer and vice president of information and interaction, predicted that the impact of IBM's release of UIMA will be similar in magnitude to IBM's release of SQL as a standard for relational databases 30 years ago.

"The moment SQL became a standard and became highly adopted in industry, it opened the door for development of huge numbers of applications," Mattos said.

"We're seeing exactly the same pattern here," Mattos continued. "Today 80 to 85 percent of data is unstructured data. There is no standard to deal with unstructured data, to build applications, to leverage that. UIMA has the potential to be that standard.

"IBM was doing similar moves in the 1970s, when we gave SQL to the standards bodies. We're giving UIMA to open source hoping we can create a standard for a whole new generation of applications."

UIMA already has solid traction, Mattos said. Unveiled by IBM in December 2004, it's already in use in industry and in academia.

For example, the Mayo Clinic has adopted the framework as part of its collaboration with IBM on the processing of unstructured text—in particular, a collection of 20 million clinical notes.

UIMA serves as the thread that stitches together the series of tools required to search and mine disparate unstructured data sources. Thus, Mayo Clinic has combined a series of its own, IBM's and open-source annotators in a plug-and-play fashion, using UIMA as a framework.
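
To make the plug-and-play idea concrete, here is a minimal sketch in Python of how independent annotators might be chained over a shared document. It is an illustration only and does not use or reproduce the UIMA API: the names Document, Annotation, regex_annotator and run_pipeline are all hypothetical, and the two regular expressions are toy stand-ins for real analysis components.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    kind: str   # label such as "year" or "acronym" (hypothetical types)
    begin: int  # start offset in the document text
    end: int    # end offset, exclusive


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered(self, ann: Annotation) -> str:
        """Return the span of text an annotation points at."""
        return self.text[ann.begin:ann.end]


# An annotator is any callable that reads a document and adds annotations.
Annotator = Callable[[Document], None]


def regex_annotator(kind: str, pattern: str) -> Annotator:
    """Build a toy annotator that labels every match of a regex."""
    compiled = re.compile(pattern)

    def annotate(doc: Document) -> None:
        for m in compiled.finditer(doc.text):
            doc.annotations.append(Annotation(kind, m.start(), m.end()))

    return annotate


def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    """Run each annotator in order over the same shared document."""
    for annotate in annotators:
        annotate(doc)
    return doc


# Annotators from different sources can be swapped in or out of this list.
pipeline = [
    regex_annotator("year", r"\b(?:19|20)\d{2}\b"),
    regex_annotator("acronym", r"\b[A-Z]{3,}\b"),
]
doc = run_pipeline(Document("IBM unveiled UIMA in 2004."), pipeline)
found = [(a.kind, doc.covered(a)) for a in doc.annotations]
print(found)
```

Because every annotator in the sketch shares one small contract (take a document, append annotations), components from different authors can be mixed without knowing about one another, which is the property the Mayo Clinic example relies on.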

DARPA (the Defense Advanced Research Projects Agency) is also making use of UIMA. The agency is using it as part of a human language technology research and development program called GALE (Global Autonomous Language Exploitation), the goal of which is to analyze and interpret large volumes of speech and text in multiple languages.

UIMA is also increasingly being used in software, with UIMA-compliant solutions now out from companies including ClearForest, Cognos, Factiva and Nstein.

Source code for the IBM reference implementation of UIMA is available here. IBM's UIMA SDK can be downloaded for free at this site.

