IBM and the EU are collaborating on a massive effort to digitize rare historical texts using new text-recognition and crowd-computing technologies.
IBM and the European Union are teaming up to digitize a massive number of rare
and culturally significant historical texts. The initiative, called IMPACT
(IMProving ACcess to Text), will provide new technologies to institutions
across Europe, enabling them to efficiently digitize their historical
collections, which will become available, editable and searchable online.
At the core of IMPACT's research is new Web-enabled adaptive optical character
recognition (OCR) software. The software is equipped with crowd-computing
technology, a model in which individuals enhance a product by contributing their
unique knowledge and expertise. The popular website Wikipedia is an example of
crowd computing. Combined, crowd computing and OCR will allow institutions to
digitize idiosyncratic fonts, irregularities and vocabularies while reducing
error rates by 35 percent and substitution rates by 75 percent.
"IMPACT is remarkable in that it not only allows these prominent centers of culture to
ultimately bring people closer to perhaps-never-before-seen historically
significant texts of heritage -- but because it actually allows these
people to become part of the preservation process," Tal Drory of IBM
Research in Haifa, Israel, wrote in a statement.
"IMPACT offers the first digitization system that combines the power of crowd computing
with an adaptive optical character recognition (OCR) correction solution that
can achieve excellent recognition rates across all kinds of documents - from
the 15th century right up through the 19th century," Drory added.
Today's OCR engines work well with modern printed texts, but faded ink, historical
typefaces and damage can lower recognition rates by 50 percent. Manual
post-production review becomes necessary for historical texts, but it is
time-consuming and inefficient. The IMPACT project aims to lower the need for
manual review.
"The only way to make a large-scale digitization project work is to dramatically
improve the quality of the initial OCR, and cut down post-processing tasks as
much as possible," said Hildelies Balk, Head of European Projects at
Koninklijke Bibliotheek and leader of the IMPACT consortium, in a statement.
"With IMPACT, we're expecting to see remarkable increases in productivity
in the digitization process."
A new collaborative correction system, designed by IBM, will allow volunteers
across Europe to correct mistakes online. The technology simplifies and speeds
up the correction process by allowing users to key in corrections. The system
will also compile lists of questionable words, which volunteers will be able to
accept or reject with just one keystroke.
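The accept-or-reject workflow described above can be illustrated with a minimal Python sketch. It is not IBM's implementation: the article does not say how words are judged questionable, so the per-word `confidence` score, the 0.80 threshold, the `Word` class, and the `"a"`/`"r"` keystroke codes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str          # the word as read by the OCR engine
    confidence: float  # hypothetical engine confidence in [0, 1]


def questionable_words(words, threshold=0.80):
    """Return the positions of words whose confidence is below the threshold."""
    return [i for i, w in enumerate(words) if w.confidence < threshold]


def apply_decisions(words, flagged, decisions, corrections=None):
    """Apply volunteer input to the flagged words.

    decisions:   position -> "a" (accept the OCR reading) or "r" (reject it)
    corrections: position -> replacement text keyed in by a volunteer
    Returns the resulting word list and the positions still awaiting a fix.
    """
    corrections = corrections or {}
    result = [w.text for w in words]
    unresolved = []
    for i in flagged:
        if decisions.get(i) == "a":
            continue                    # volunteer confirmed the OCR reading
        if i in corrections:
            result[i] = corrections[i]  # rejected, and a correction was keyed in
        else:
            unresolved.append(i)        # rejected or unreviewed, no fix yet
    return result, unresolved


page = [Word("The", 0.99), Word("olde", 0.55), Word("booke", 0.62),
        Word("vvas", 0.40), Word("printed", 0.97)]
flagged = questionable_words(page)
text, pending = apply_decisions(page, flagged,
                                decisions={1: "a", 2: "a", 3: "r"},
                                corrections={3: "was"})
print(" ".join(text), pending)
```

In this sketch the archaic spellings "olde" and "booke" are flagged but accepted with a single decision each, while the misread "vvas" is rejected and replaced by a keyed-in correction.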
A small book that would take four hours to input manually would take just 15
minutes with the adaptive OCR and collaborative technology.
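As a quick check of what those two figures imply, the following Python lines work out the speedup using only the numbers quoted above (four hours versus 15 minutes). The variable names are illustrative.

```python
manual_minutes = 4 * 60   # four hours of manual input
assisted_minutes = 15     # with adaptive OCR and collaborative correction

speedup = manual_minutes / assisted_minutes
time_saved = 1 - assisted_minutes / manual_minutes

print(f"{speedup:.0f}x faster, {time_saved:.2%} less time")
# prints "16x faster, 93.75% less time"
```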
With IMPACT, IBM and the EU are further expanding their research partnership,
which already includes more than two dozen national libraries, research
institutes, universities, and companies across Europe. Other companies are
competing to create IT solutions for research institutions. Microsoft, for
example, has built a searchable electronic archive for the state of Washington,
and HP has created a digital library for the Massachusetts Institute of
Technology.