One look at Napster, and it’s easy to see how new Internet technologies can stir up major conflict. But a looming controversy on the Net might center on a piece of Internet plumbing that’s now 7 years old, ancient in Web terms.
The Robot Exclusion Standard is arcane enough that probably only hard-core site developers and search engine specialists ever even think about it. But just because it’s been overlooked doesn’t mean it’s insignificant. The standard figured into at least one important legal dispute on the Net last year, and chances are good it will soon surface again.
As its name suggests, the Robot Exclusion Standard was created to govern robots — computer programs that surf the Web without human supervision. Among other purposes, robots (or “bots”) are used to compile the vast Web databases that make search engines possible.
Yet not everyone appreciates robots’ efficiency. After all, using a robot, someone can extract content from a site without viewing a single ad. A competitor can make an instant copy of your Web site’s data assets. Robots can also harm a site operationally when they request pages faster than the server can deliver them. So in 1994, when the Web was still very much an uncharted frontier, a group of developers cobbled together the Robot Exclusion Standard, which lets Web site owners tell robots where to go and where not to go on their sites.
While the standard created a workable compromise between bot developers and Web sites, it has not eliminated the potential for conflict when their interests collide.
Last year, eBay took auction aggregator Bidder’s Edge to court for sending robots to crawl the eBay site. A federal judge granted an injunction against Bidder’s Edge, relying partly on eBay’s use of the Robot Exclusion Standard.
Bidder’s Edge has shut down its Web service, and the companies have settled their dispute.
How could Bidder’s Edge’s robots crawl eBay’s site in the first place if eBay had taken steps to keep them out? The answer lies in how the Robot Exclusion Standard works. When it first enters a site, a visiting Web robot will ask the site for its “robots.txt” file. The file contains commands that tell robots which directories they’re not permitted to visit. (You can view CNN.com’s file at www.cnn.com/robots.txt.)
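To make that exchange concrete, here is a minimal sketch of what a well-behaved robot might do with such a file, using Python’s standard `urllib.robotparser` module. The file contents, the robot names (`SomeSearchBot`, `ExampleAggregatorBot`) and the paths are all hypothetical, invented purely for illustration; they are not taken from eBay’s or CNN.com’s actual files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A real robot would download this file
# from the root of the site it is about to crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/

User-agent: ExampleAggregatorBot
Disallow: /
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set the robot can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# The robot itself decides whether to ask these questions before fetching.
for agent, path in [
    ("SomeSearchBot", "/index.html"),
    ("SomeSearchBot", "/private/data.html"),
    ("ExampleAggregatorBot", "/index.html"),
]:
    verdict = "allowed" if rules.can_fetch(agent, path) else "disallowed"
    print(f"{agent} -> {path}: {verdict}")
```

In this sketch the first record applies to every robot and fences off two directories, while the second asks one named robot to stay away from the whole site. Note that the check happens entirely inside the robot’s own code; the server only publishes the file.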
However, the standard relies entirely on the courtesy of the visiting robot. It’s completely optional. Nothing prevents robots from simply ignoring the directives in a robots.txt file, and many robots do just that. In that sense, a robots.txt file is less like a locked door than a “no entry” sign hanging in an open doorway.
That’s how the creators of the standard intended it, says Martijn Koster, who helped develop the standard and maintains it to this day. Koster, a software engineer at [email protected], says the developers’ main concern was preventing underpowered Web servers from being overloaded by out-of-control bots.
“It’s about server administrators providing information to a robot; that’s where it ends,” he says. The standard was never intended to be used to simply bar robots from accessing specific content, or to guarantee that visiting robots would comply with the robots.txt directives.
Last year’s eBay decision, however, suggests that where the standard left off, the courts may be willing to step in by granting legal force to the Robot Exclusion Standard.
“We likened it to a no trespassing sign, and the court agreed,” says Jay Monahan, eBay’s legal counsel for intellectual property issues. EBay cited several other factors in its case, but Monahan says he saw the Robot Exclusion Standard as an important part of the company’s argument.
Koster has mixed feelings about sites employing the Robot Exclusion Standard to discriminate between bots and human users. Server operators should be able to turn to the Robot Exclusion Standard to curtail abuse, yet Koster also thinks using a robots.txt file merely to prevent bots from getting at publicly accessible content threatens the openness of the Internet.
“I don’t think that’s in the spirit of free information exchange,” Koster says. Some robots may have legitimate reasons to ignore robot exclusion directives. For example, he says, a company might use robots to hunt for copyright-infringing content.
A bot should be allowed to view any publicly accessible pages as long as it’s not harming a Web site, says Wolfgang Tolle, chief technology officer at Cyveillance. Cyveillance offers exactly the sort of digital asset scouring services Koster describes, although Tolle says Cyveillance’s robots comply with the Robot Exclusion Standard.
Another factor that may complicate legal arguments is that the Robot Exclusion Standard isn’t a “standard” in the strictest sense, since it isn’t sanctioned by any authority.
Koster says he doesn’t intend to push the Robot Exclusion Standard through a standards group, such as the World Wide Web Consortium, which might make it a required part of Web software. As an informal convention, it may carry less weight in a courtroom. To those who want the Internet to remain open, Koster says, that might not be such a bad thing.