In an attempt to index more content, more quickly, Google has begun letting Web-site operators point the search engine's crawler at Web pages and tell it when content is updated.
Google Inc. started the beta program, called Google Sitemaps, late Thursday in a move that officials said would help supplement its Web crawling technology in order to better index dynamic content and frequently refreshed pages.
By letting Webmasters submit their sitemaps of URLs and other information, Google also is acknowledging that technology alone cannot find and catalog everything on the Web, search-engine experts said.
“It's an admission that the Web is messy and that there are all kinds of challenges [for crawlers],” said Fredrick Marckini, founder and CEO of search-marketing firm iProspect Inc.
At its core, Google Sitemaps is an XML protocol by which Webmasters can submit a list of their URLs and provide data to Google about the relative importance of sites and how often they are updated, said Shiva Shivakumar, Google's director of engineering.
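As a sketch of what such a submission might look like: the Sitemap 0.84 protocol defines `loc`, `lastmod`, `changefreq`, and `priority` elements for each URL. The snippet below builds a minimal one-entry sitemap with Python's standard library; the example URL and values are hypothetical, and the namespace string is an assumption based on the protocol version named in the article.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the Sitemap 0.84 protocol described in the article.
NS = "http://www.google.com/schemas/sitemap/0.84"

def build_sitemap(entries):
    """Build a sitemap XML string from (loc, lastmod, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc                # the page's URL
        ET.SubElement(url, "lastmod").text = lastmod        # last-modified date
        ET.SubElement(url, "changefreq").text = changefreq  # how often it changes
        ET.SubElement(url, "priority").text = priority      # relative importance
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical entry for a frequently updated news page.
xml = build_sitemap([("http://www.example.com/news/", "2005-06-03", "daily", "0.8")])
print(xml)
```

A Webmaster would generate a file like this for the whole site (Google's Sitemap Generator automates the same idea) and point Google at its URL.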
Google also has released a tool called the Sitemaps Generator, which allows site operators to generate an XML-based sitemap. Google is offering the Python-based tool to open-source developers.
While Google plans to crawl submitted URLs, it is making no promises that it will spider and index every site. The program is open to both single-page sites and sites with millions of Web pages.
“Any dynamic site, any site that changes pretty rapidly or [with] archived content will benefit by telling us and others about it,” Shivakumar said. “It's surprising that hasn't happened before.”
Search engines almost since their creation have offered various ways for Webmasters to submit URLs for possible inclusion in their indexes. Google already provides a way for sites to submit home-page URLs to the Googlebot crawler.
Yahoo Inc. provides both a free and a paid-inclusion program, where sites can pay to feed it URLs for indexing. Other search engines have moved away from paid inclusion.
Ask Jeeves Inc. dropped a paid-inclusion program last year, though it still accepts free URL submissions. Microsoft Corp.'s MSN Search also lets sites submit a URL.
Google is positioning its new approach as an open, potentially industry-wide way for sites to submit not only full site maps but also related information and updates. For example, the protocol includes a mechanism for sites to ping Google's server when a page has been updated, Shivakumar said.
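That ping mechanism amounts to a simple HTTP GET with the sitemap's own URL passed as a query parameter. A minimal sketch follows; the endpoint path is illustrative (the article describes the mechanism but not the exact URL), and only the URL-encoding step is standard:

```python
from urllib.parse import urlencode

# Illustrative ping endpoint; the actual path is not specified in the article.
PING_ENDPOINT = "http://www.google.com/webmasters/sitemaps/ping"

def ping_url(sitemap_url):
    """Build the GET URL that notifies the search engine a sitemap has changed."""
    # The sitemap URL must be percent-encoded to survive as a query parameter.
    return PING_ENDPOINT + "?" + urlencode({"sitemap": sitemap_url})

# A site would issue this request (e.g. via urllib.request.urlopen)
# whenever its pages are updated, rather than waiting for the next crawl.
print(ping_url("http://www.example.com/sitemap.xml"))
```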
Google is offering access to the protocol, called Sitemap 0.84, under a Creative Commons license. Shivakumar said he hopes that other search engines will adopt it and that Web servers eventually will support the protocol natively.
Marckini said that he expects search-engine marketers and savvy Webmasters to quickly adopt the protocol as a way of ensuring that Google is aware of all their sites.
“The problem with this, of course, will be adoption,” Marckini said. “Everyone with a search-engine marketing firm will be armed and ready for it. But for the majority of people, the awareness of this won't be high.”
Shivakumar acknowledged that Google is waiting to see how successful Sitemaps will be with Webmasters and with the wider industry but views the program as a way to boost the completeness and freshness of its index.
“Our real goal is to be able to index all publicly available content,” Shivakumar said.
For site operators, the way search engines crawl, index and rank their sites is often a mystery because of the proprietary technology of the different engines, said David Berkowitz, director of marketing at iCrossing Inc., a search-engine marketing company.
The problem is often worse for sites that use certain content-management systems, database-driven sites or Flash pages that are unfriendly to crawlers, he said.
Google's introduction of a broad submission program doesn't diminish the importance of its crawler technology, but it does open more communication between Google and Webmasters, Berkowitz said.
“Here's one more thing we can do so we'll be more likely to be heard by Google,” he said.