IBM, EU IMPACT Project Looks to Preserve Ancient Texts

IBM and the EU are collaborating on a massive effort to digitize rare historical texts using new text-recognition and crowd-computing technologies.

IBM and the European Union are teaming up to digitize a massive number of rare and culturally significant historical texts. The initiative, called IMPACT (IMProving ACcess to Text), will provide institutions across Europe with new technologies for efficiently digitizing their historical collections, which will then become available, editable and searchable online.

At the core of IMPACT's research is new Web-enabled adaptive optical character recognition (OCR) software. The software is equipped with crowd-computing technology, a model in which individuals enhance a product by contributing their own knowledge and expertise. The popular Website Wikipedia is an example of crowd-computing. Combined, crowd-computing and OCR will allow institutions to digitize texts with idiosyncratic fonts, irregularities and vocabularies while reducing error rates by 35 percent and substitution rates by 75 percent.

"IMPACT is remarkable in that it not only allows these prominent centers of culture to ultimately bring people closer to perhaps-never-before-seen historically significant texts of heritage -- but because it actually allows these people to become part of the preservation process," Tal Drory of IBM Research in Haifa, Israel, wrote in a statement.

"IMPACT offers the first digitization system that combines the power of crowd computing with an adaptive optical character recognition (OCR) correction solution that can achieve excellent recognition rates across all kinds of documents - from the 15th century right up through the 19th century," Drory added.

Today's OCR engines work well with modern printed texts, but faded ink, historical typefaces and damage can lower recognition rates by 50 percent. Historical texts therefore require manual post-production review, which is time-consuming and inefficient. The IMPACT project aims to reduce the need for manual review.

"The only way to make a large-scale digitization project work is to dramatically improve the quality of the initial OCR, and cut down post-processing tasks as much as possible," said Hildelies Balk, head of European Projects at Koninklijke Bibliotheek and leader of the IMPACT consortium, in a statement. "With IMPACT, we're expecting to see remarkable increases in productivity in the digitization process."

A new collaborative correction system, designed by IBM, will allow volunteers across Europe to correct mistakes online. The technology simplifies and speeds up the correction process by allowing users to key in corrections. The system will also compile lists of questionable words, which volunteers will be able to accept or reject with just one keystroke.
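
For readers curious how such an accept-or-reject queue might work, here is a minimal Python sketch. It is not IBM's implementation. The article does not say how questionable words are chosen, so the per-word confidence score, the 0.8 threshold, the `OcrWord` type, and the `questionable_words` and `review` functions are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OcrWord:
    text: str                         # the OCR engine's reading
    confidence: float                 # hypothetical per-word score, 0.0 to 1.0
    suggestion: Optional[str] = None  # hypothetical proposed correction


def questionable_words(words: List[OcrWord], threshold: float = 0.8) -> List[OcrWord]:
    """Collect words whose score falls below the (assumed) review threshold."""
    return [w for w in words if w.confidence < threshold]


def review(word: OcrWord, key: str) -> str:
    """Resolve one flagged word from a single keystroke: 'y' accepts, 'n' rejects."""
    if key == "y":
        # Accept the suggested correction; fall back to the OCR text if none exists.
        return word.suggestion if word.suggestion is not None else word.text
    if key == "n":
        # Reject the suggestion and keep the OCR engine's reading.
        return word.text
    raise ValueError(f"unexpected key: {key!r}")


# Made-up sample data, not taken from any IMPACT collection.
page = [
    OcrWord("the", 0.98),
    OcrWord("kiug", 0.41, suggestion="king"),
    OcrWord("fayre", 0.62, suggestion="fayre"),
]
flagged = questionable_words(page)
corrected = [review(w, "y") for w in flagged]
print(corrected)
```

In this sketch only the low-scoring words reach a volunteer, which is one plausible way a system could keep each decision down to a single keystroke.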

A small book that would take four hours to input manually would take just 15 minutes with the adaptive OCR and collaborative technology.

With IMPACT, IBM and the EU are further expanding their research partnership, which already includes more than two dozen national libraries, research institutes, universities and companies across Europe. Other companies are competing to create IT solutions for research institutions. Microsoft, for example, has built a searchable electronic archive for the state of Washington, and HP has created a digital library for the Massachusetts Institute of Technology.