Google late yesterday revealed that it has implemented OCR (optical character recognition) technology that converts pictures of text in scanned documents saved in Adobe’s PDF format into words. This makes these files searchable via the Web.
Google Product Manager Evin Levey noted in a blog post that prior to this development, scanned documents were rarely included in search results because Google couldn’t be sure of their content. “We had occasional clues from references to the document—so you might get a search result with a title but no snippet highlighting your query. Today, that changes …”
Check out this query of “Steady success in a volatile world” to see the OCR in action. You can see a snippet of the content, with the full text presented after the “View as HTML” link.
This is a really hard problem to solve, and it is a prime example of Google injecting its search algorithm with some serious intellectual steroids. Hyperbole aside, Google has taken another step on its path to creating software that enables our computers to interpret results as a human would. Levey explains:
“To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter ‘O,’ just a circle or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.”
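To make Levey's circle example a little more concrete, here is a toy sketch in Python. It is not Google's method, which the blog post does not describe; it only illustrates one simple idea, using the neighboring characters to decide whether a circle-like glyph is a zero or the letter "O." The function name `resolve_circle` and the majority-vote rule are assumptions made up for this illustration.

```python
# Toy illustration only: guess whether a circle-like glyph is "0" or "O"
# by looking at the unambiguous characters around it in the same token.
# This is a hypothetical heuristic, not a description of Google's OCR.

AMBIGUOUS = {"O", "0"}


def resolve_circle(token: str) -> str:
    """Rewrite every 'O'/'0' in token as whichever fits its neighbors."""
    # Count the characters we are sure about.
    digits = sum(c.isdigit() and c not in AMBIGUOUS for c in token)
    letters = sum(c.isalpha() and c not in AMBIGUOUS for c in token)

    # No clear evidence either way: leave the token as it was recognized.
    if digits == letters:
        return token

    replacement = "0" if digits > letters else "O"
    return "".join(replacement if c in AMBIGUOUS else c for c in token)


if __name__ == "__main__":
    for raw in ["2O08", "G0OGLE", "O0"]:
        print(raw, "->", resolve_circle(raw))
```

Even this tiny rule gives up when the evidence is split, as with the token "O0", which hints at why Levey calls the process painstaking and error-prone.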
This is a great step for Google and an example of how the company is raising the bar in search technology to attain its goal of organizing the world’s information online and making it accessible.
But I wonder whether this OCR development grew out of Google’s Book Search effort. Although Google has been gradually peeling layers from its complex search onion for the public in 2008, the company remains protective of its book-scanning IP.
If this OCR tool didn’t come out of Book Search, it could almost certainly be blended into it now. The company has scanned 7 million books to date and has millions more to go after inking its latest book deal this week. OCR could help with those tricky diagrams in how-to tomes.
For all we know, Google could give Pitney Bowes and HP a run for their money in terms of scanning technology expertise. It wouldn’t surprise me.
GigaOm wonders if this OCR signals Google’s plans to index the dark Web. What do you think?