Google Sets Sights on Clustering, Translation

The search company's leading researcher previews its work on clustering entities and words to better glean users' intentions, and on using statistical machine translation to show Web pages in other languages.

SAN FRANCISCO—Google Inc. on Thursday gave a preview of its next steps to improve Web search, and clustering technology played a leading role.

During a panel discussion of research lab leaders at the Web 2.0 conference here, one of Google's top researchers previewed the search company's work in clustering both entities and words as a way to better glean users' intentions and distill information on the Web.

Another area in Google's research net is statistical machine translation for turning Web pages into other languages, said Peter Norvig, director of search quality at Google.

"[We're] trying to go just beyond keywords and the linking structure of the Web, the innovation that we brought to search, and get behind the deeper meaning," Norvig said during his presentation.

In clustering, Norvig demonstrated a six-month-old project called "named entities abstraction," in which Google's researchers are analyzing the company's large Web index to extract entities, such as the name of a company, from the structure of content and then decipher their relationships to one another.

For example, Norvig said, researchers are looking for ways to break down sentences by looking for a phrase like "such as" and grabbing the names that follow it. The goal is to pull out not only the name but also its clusters, so that a name such as "Java" can be associated both with the computer language and with language in general, Norvig said.
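The "such as" approach Norvig described can be sketched in a few lines of Python. This is only an illustration of the general idea, not Google's implementation: the regular expression, the function names `extract_entities` and `categories_for`, and the two sample sentences are all assumptions made up for this example, and the pattern handles only single-word capitalized names.

```python
import re
from collections import defaultdict

# Hypothetical pattern: "<category> such as <Name>, <Name> and <Name>".
# Only a single lowercase category word and single-word capitalized names
# are recognized; a real system would need far more robust parsing.
SEPARATOR = r",? (?:and|or) |, "
SUCH_AS = re.compile(
    r"\b(?P<category>[a-z]+) such as "
    r"(?P<names>[A-Z]\w*(?:(?:" + SEPARATOR + r")[A-Z]\w*)*)"
)


def extract_entities(text):
    """Map each category word to the set of names listed after 'such as'."""
    clusters = defaultdict(set)
    for match in SUCH_AS.finditer(text):
        names = re.split(SEPARATOR, match.group("names"))
        clusters[match.group("category")].update(names)
    return clusters


def categories_for(name, clusters):
    """Return, in alphabetical order, every category a name was seen under."""
    return sorted(cat for cat, names in clusters.items() if name in names)


# Invented sample text, used only to exercise the sketch.
SAMPLE = (
    "Programmers use languages such as Java, Python and Perl. "
    "Travelers visit islands such as Java and Bali."
)

if __name__ == "__main__":
    clusters = extract_entities(SAMPLE)
    print(categories_for("Java", clusters))
```

In this toy example the same name lands in more than one category, which is the kind of ambiguity the clusters are meant to capture.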

"We want to be able to search and find these [entities] and the relationships between them, rather than you typing in the words specifically," Norvig said.

With word clustering, the focus is on making the search engine better at understanding the multiple meanings of a word, Norvig said. Google started working on word clustering about three years ago.

Apropos of the heated U.S. presidential election, Norvig demonstrated a prototype of word clustering with results both for President Bush and for his Democratic contender, Sen. John Kerry.

Bush appeared in clusters for words around "president" and "White House," to name some examples, but the results drew laughter when he also appeared in descriptive categories such as "idiot" and "chimp."

"This is what the Web says, not my opinion," Norvig said following the laughter.

Kerry appeared within groups for "senator" and for his wife, "Teresa Heinz Kerry," as well as for "Bob Kerry," a former senator with whom some people may confuse him.

None of the clustering approaches is publicly available, though Norvig said in an interview following the panel that they may become Google Labs betas in the future.

Google Labs often publicly prototypes features and services that sometimes become new offerings. News alerts and Google's local search are among the lab's graduates.

"Certainly one application for clusters is in results pages, and it may be something we do at some time," Norvig said in the interview.

A growing number of search startups have targeted the automatic clustering of search results. Vivisimo Inc., one of the best known of these startups, recently launched the Clusty search site, which groups results gathered from other search engines into clusters, or categories, as a way of drilling down into results.

While it might make sense for startups to deploy clustering technology today, Norvig said, Google still views the technology as too immature. It is useful for only a small percentage of search results, he said, so Google is focusing on improving the technology and increasing its usefulness.

"Our take is that the state of the art is not there yet," Norvig said.

With machine translation, Google is bringing to bear its formidable Web index—which at last count included 6 billion documents, images and items—as well as its computing resources. Google is well-known for having one of the largest clusters of Linux-based servers, which number in the thousands.

Google already provides a Web-page translation feature, but Norvig said it is based on technology from a third party. Google's research project is based on homegrown technology that eventually could translate Web pages and links more automatically, he said.
