Recently, Microsoft executives submitted a new proposal to the Internet Engineering Task Force to protect web data by creating additional rules that distinguish AI training bots from other bots, letting website owners block unwanted AI web crawlers. Because AI models require massive amounts of training data, AI companies typically collect that data from public websites by sending AI crawlers across blog posts, product pages, videos, and other forms of web content.
While tech companies argue that AI bots should be able to crawl publicly available data just like search engines, some website owners see it as an invasion of privacy, as they never consented to this data extraction. The new proposal offers three methods for blocking AI crawlers from invading your website: new robots.txt rules, application layer response header rules, and a Robots HTML meta tag.
New Robots.txt Rules
Microsoft’s proposal would create additional rules for the robots.txt file websites use to instruct web crawlers and search engine bots about which parts of the site they can and cannot crawl.
“While the Robots Exclusion Protocol enables service owners to control how, if at all, automated clients known as crawlers may access the URIs on their services as defined by [RFC8288],” the draft proposal says, “the protocol doesn’t provide controls on how the data returned by their service may be used in training generative AI foundation models. Application developers are requested to honor these tags.”
The proposal suggests the following values for controlling how AI bots interact with websites:
- DisallowAITraining: Tells the parser not to use data for AI training
- AllowAITraining: Tells the parser the data can be used for AI training
These rules follow the same matching logic as the standard allow and disallow rules, and they are case insensitive.
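To make the rule format concrete, here is a minimal Python sketch of how a crawler might read these two rules from a robots.txt file. It assumes the rules take a path value and sit inside User-agent groups like ordinary allow and disallow lines. It handles only plain prefix matching with longest-match precedence (no wildcards), and the names `parse_ai_rules` and `may_train_on` are hypothetical helpers, not part of the proposal.

```python
def parse_ai_rules(robots_txt):
    """Collect AI-training rules per user agent from robots.txt text.

    Returns {user_agent_lowercase: [(allowed, path_prefix), ...]}.
    """
    rules = {}
    current_agents = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()  # rule names are case insensitive
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        else:
            in_rules = True  # simplification: any other line closes the agent list
            if key in ("allowaitraining", "disallowaitraining"):
                for agent in current_agents:
                    rules[agent].append((key == "allowaitraining", value))
    return rules


def may_train_on(rules, user_agent, path):
    """Return True, False, or None when no AI-training rule matches the path."""
    group = rules.get(user_agent.lower(), rules.get("*", []))
    best = None
    for allowed, prefix in group:
        if prefix and path.startswith(prefix):
            # Longest matching prefix wins, as with allow/disallow.
            if best is None or len(prefix) > len(best[1]):
                best = (allowed, prefix)
    return None if best is None else best[0]


EXAMPLE = """\
User-agent: *
DisallowAITraining: /
AllowAITraining: /public/

User-agent: examplebot
disallowaitraining: /private/
"""

rules = parse_ai_rules(EXAMPLE)
print(may_train_on(rules, "otherbot", "/blog/post"))  # False
print(may_train_on(rules, "otherbot", "/public/a"))   # True
```

Returning `None` when nothing matches leaves the default policy to the caller, since the text above does not say how an absent rule should be read.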
Application Layer Response Header
The proposal also states that website owners should be able to set these same rules via an application layer response header, such as an HTTP header that the server sends back along with the requested resource. As with robots.txt, the rules are not case-sensitive.
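For illustration, a crawler-side check of such a header might look like the Python sketch below. The article does not name the header field, so `AI_HEADER_NAME` is a placeholder borrowed from the existing `X-Robots-Tag` convention, and the helper name `training_directive_from_headers` is likewise hypothetical. Only the two rule values come from the proposal.

```python
# Assumption for this sketch only: the article does not say which
# header field carries the rule, so an existing robots header is reused.
AI_HEADER_NAME = "X-Robots-Tag"


def training_directive_from_headers(headers, header_name=AI_HEADER_NAME):
    """Return True (allowed), False (disallowed), or None (no rule present)."""
    # HTTP header names are case insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    value = normalised.get(header_name.lower(), "")
    # A header value may carry several comma-separated directives.
    directives = {token.strip().lower() for token in value.split(",")}
    if "disallowaitraining" in directives:
        # Preferring the restrictive rule when both appear is an assumption.
        return False
    if "allowaitraining" in directives:
        return True
    return None
```

The function accepts any mapping of header names to values, so it can be fed the headers of a response fetched with whichever HTTP client the crawler already uses.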
Robots HTML Meta Tag
The third way the proposal offers to block AI crawlers from a website is to use the following HTML meta tags:
- <meta name="robots" content="DisallowAITraining">
- <meta name="examplebot" content="AllowAITraining">
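As a rough sketch of how a crawler could honor these tags, the Python below scans a page for them using the standard library's `html.parser`. The class name `AITrainingMetaParser`, the function `meta_training_directive`, and the choice to prefer a bot-specific tag over the generic `robots` tag are assumptions made for illustration, not behavior the proposal spells out here.

```python
from html.parser import HTMLParser


class AITrainingMetaParser(HTMLParser):
    """Collect AI-training directives from <meta> tags, keyed by bot name."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # e.g. {"robots": False, "examplebot": True}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        name = attributes.get("name", "").strip().lower()
        if not name:
            return
        content = attributes.get("content", "")
        tokens = {token.strip().lower() for token in content.split(",")}
        if "disallowaitraining" in tokens:
            self.directives[name] = False
        elif "allowaitraining" in tokens:
            self.directives[name] = True


def meta_training_directive(html, bot_name):
    """Return True, False, or None when the page states no AI-training rule."""
    parser = AITrainingMetaParser()
    parser.feed(html)
    found = parser.directives
    # Assumption: a bot-specific tag takes precedence over the generic one.
    return found.get(bot_name.lower(), found.get("robots"))
```

Note that the attribute values must use straight quotes in the page source; typographic quotes would be read as part of the value and the tag would not match.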
If the proposal’s recommendations are enacted, website owners will have more control over which bots can crawl their web pages. If enough web owners take advantage of these new rules and restrict AI bots, generative AI development could slow down.