Two fine articles on search engine technology landed in our laps to noodle over this week. One piece focuses on Google and the other on Microsoft's Bing platform.
Both deal with the artful but dodgy practice of disambiguation, or tuning search algorithms to parse semantics.
Folks, I mean the difference between Apple, the company, and apple, the fruit. Humans get this stuff pretty easily (we hope). Computers don't.
Let's start with the Google item from Wired, where Google Fellow Amit Singhal, the same man credited with Google's recent real-time search overhaul, discussed the differences between queries that have similar textual constructs but vastly different contextual meanings.
This was fascinating to read, particularly because Google has never pulled back the curtain on its search quality wizard with such candor, or at least not since BusinessWeek got a similar tour last fall. Singhal told Wired:
"People change words in their queries. So someone would say, 'pictures of dogs,' and then they'd say, 'pictures of puppies.' So that told us that maybe 'dogs' and 'puppies' were interchangeable. We also learned that when you boil water, it's hot water. We were relearning semantics from humans, and that was a great advance."
The article continued:
"But there were obstacles. Google's synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein's theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. "Hot dog" would be found in searches that also contained "bread" and "mustard" and "baseball games" -- not poached pooches. That helped the algorithm understand what "hot dog" -- and millions of other terms -- meant. "Today, if you type 'Gandhi bio,' we know that bio means biography," Singhal says. "And if you type 'bio warfare,' it means biological.""
While I have been consuming information about Internet technologies for the last decade, I have no computer science expertise.
My core competencies are journalism and English, so I can confidently say that the number of ways the thousands of words in the English language alone can be combined is staggering.
There are synonyms and homonyms, as well as a tremendous amount of slang. A top United Nations translator could tell you of the challenges that languages other than English pose for search engines trying to parse user intent and signals.
Disambiguation is a Herculean challenge, one that Microsoft's Bing team is also working feverishly to conquer, according to this piece on Search Engine Land, which tries to puzzle out where the Bing team sees search going.
Bing Director Stefan Weitz outlined the challenge for SEL:
"Where we need to get to, and where we're working to get to, is doing better job of having the crawler and the parser really understand the language. When I say Crest White Strips, think of what today's indexes are going to do. They're going to find those words in a PageRank and return those results. The system has to know that Crest is a brand and white strips are a way of whitening teeth. Teeth whitening is done by a dentist. And dentists often don't like using off-the-shelf products."
"You have all the things that I know about Crest White Strips, just from casual human understanding standpoint. The engines today don't know that. So much of the intent calculation we have to do to deliver a good set of results is bound up in this challenge of us imbuing the engines and the index and the parsers with a more human characteristic of understanding what they're reading. That will get us to intent much faster than a lot of the mathematical tricks."
In effect, the goal is to make the search engines more human, a path to true artificial intelligence. This is both exciting and a little scary, particularly for a science fiction buff. Gord Hotchkiss, who spoke to Weitz, explained it this way:
"... It requires the machine to connect a semantic label with known concepts that may surround that label, as in Stefan's example of White Strips. We do this instantaneously and effortlessly (although miscommunication is no rare occurrence, even between humans) but thus far, machines haven't been able to duplicate the feat. So, what are the signals that Microsoft might use to pull this off? Knowing more about the person who's doing the talking is one potential signal."
Google is kicking tail and taking names in search now, but who knows whether the Bing guys or some other outlier we're not watching closely might swoop in with a new search engine construct that leverages AI like we've never seen before.
As to why Google is being more open, I think we can safely say that is a nod to Bing, which has certainly come on strong and attracted search engine press like no search provider since, well, Google.