3 New Rules to Block AI Bots from Invading Your Websites | eWeek

Written By
Sam Rinko
Nov 28, 2024

Recently, Microsoft executives submitted a new proposal to the Internet Engineering Task Force to protect web data by creating additional rules that distinguish AI training bots from other bots, letting website owners block unwanted AI web crawlers. Because AI models require massive amounts of training data, AI companies typically collect that data from public websites by sending AI crawlers across blog posts, product pages, videos, and other forms of web content.

While tech companies argue that AI bots should be able to crawl publicly available data just like search engines, some website owners see it as an invasion of privacy, as they never consented to this data extraction. The new proposal offers three methods for blocking AI crawlers from invading your website: new robots.txt rules, application layer response header rules, and a Robots HTML meta tag.

New Robots.txt Rules

Microsoft’s proposal would create additional rules for the robots.txt file websites use to instruct web crawlers and search engine bots about which parts of the site they can and cannot crawl.

“While the Robots Exclusion Protocol enables service owners to control how, if at all, automated clients known as crawlers may access the URIs on their services as defined by [RFC8288],” the draft proposal says, “the protocol doesn’t provide controls on how the data returned by their service may be used in training generative AI foundation models. Application developers are requested to honor these tags.”

The proposal suggests the following values for controlling how AI bots interact with websites:

  • DisallowAITraining: Tells the parser not to use data for AI training
  • AllowAITraining: Tells the parser the data can be used for AI training

These rules follow the same matching logic as the standard Allow and Disallow rules and are case-insensitive.
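As a rough illustration of how a crawler might honor these values, here is a minimal Python sketch. It rests on several assumptions the article does not confirm: that the rules sit inside a User-agent group and take a path prefix like Allow and Disallow, that the longest matching prefix wins (with the allow rule winning a tie), and that user-agent names are compared exactly. Wildcards are not handled. The function names, the sample file, and the bot name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Rule names are matched case-insensitively, as the proposal describes.
AI_RULES = {"allowaitraining": True, "disallowaitraining": False}

# Hypothetical robots.txt; the exact syntax is an assumption.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /
DisallowAITraining: /
AllowAITraining: /press/
"""


def parse_ai_rules(robots_txt: str) -> Dict[str, List[Tuple[str, bool]]]:
    """Map each lowercased user-agent to its (path_prefix, allowed) AI rules."""
    groups: Dict[str, List[Tuple[str, bool]]] = {}
    agents: List[str] = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            in_rules = True
            if key in AI_RULES and value:
                for agent in agents:
                    groups[agent].append((value, AI_RULES[key]))
    return groups


def may_train(groups: Dict[str, List[Tuple[str, bool]]],
              user_agent: str, path: str) -> Optional[bool]:
    """Return True/False if an AI rule matches the path, None if none does."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best: Optional[Tuple[str, bool]] = None
    for prefix, allowed in rules:
        if not path.startswith(prefix):
            continue
        longer = best is None or len(prefix) > len(best[0])
        tie_allow = best is not None and len(prefix) == len(best[0]) and allowed
        if longer or tie_allow:
            best = (prefix, allowed)
    return None if best is None else best[1]
```

With the sample file, `may_train(parse_ai_rules(EXAMPLE_ROBOTS_TXT), "ExampleBot", "/blog/post")` falls under the site-wide DisallowAITraining rule, while a path under `/press/` matches the longer AllowAITraining prefix.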

Application Layer Response Header

The proposal also states that website owners should be able to set these same rules in an application layer response header, that is, an HTTP header the server sends back along with a web resource. A crawler can read the header from the response itself, or from a HEAD request, which retrieves only the headers of a resource without downloading the actual content. As with robots.txt, the rules are not case-sensitive.
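A crawler-side check for such a header could look roughly like the Python sketch below. The article does not name the header, so the caller passes the name in, and `X-Example-AI-Control` in the usage is purely a placeholder. Treating the value as a comma-separated list and letting DisallowAITraining win when both values appear are also assumptions.

```python
from typing import Mapping, Optional


def ai_training_from_headers(headers: Mapping[str, str],
                             header_name: str) -> Optional[bool]:
    """Return False/True for a Disallow/Allow AI training value, else None.

    header_name is supplied by the caller because the real header name is
    not given in the article; names and values are compared
    case-insensitively.
    """
    wanted = header_name.lower()
    for name, value in headers.items():
        if name.lower() != wanted:
            continue
        # Assumption: several values may be listed, separated by commas.
        tokens = {token.strip().lower() for token in value.split(",")}
        if "disallowaitraining" in tokens:
            return False  # conservative choice: disallow wins over allow
        if "allowaitraining" in tokens:
            return True
    return None


# Hypothetical response headers, e.g. collected from a HEAD request.
example_headers = {
    "Content-Type": "text/html",
    "X-Example-AI-Control": "DisallowAITraining",  # placeholder header name
}
decision = ai_training_from_headers(example_headers, "x-example-ai-control")
```

Returning `None` when the header is absent keeps "no signal" distinct from an explicit allow, so the caller can fall back to robots.txt or the page's meta tags.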

Robots HTML Meta Tag

The third way the proposal offers to block AI crawlers from a website is a robots HTML meta tag that carries the same DisallowAITraining and AllowAITraining values.
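The tag markup itself is not reproduced here, so the Python sketch below assumes the conventional robots meta form, with `name` set to either `robots` or a specific bot and `content` holding the value. The sample page, the bot name `examplebot`, and the rule that a bot-specific tag overrides the generic `robots` tag are all illustrative assumptions.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Hypothetical page; the meta tag form is an assumption.
EXAMPLE_PAGE = """<html><head>
<meta name="robots" content="DisallowAITraining">
<meta name="examplebot" content="AllowAITraining">
</head><body>Hello</body></html>"""


class AIMetaParser(HTMLParser):
    """Collect AI training values from meta tags, keyed by lowercased name."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Dict[str, bool] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if not name:
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()  # values are case-insensitive
            if token == "disallowaitraining":
                self.directives[name] = False  # disallow always wins
            elif token == "allowaitraining":
                self.directives.setdefault(name, True)


def may_train_on_page(html: str, bot_name: str) -> Optional[bool]:
    """Prefer a bot-specific tag, fall back to "robots", else return None."""
    parser = AIMetaParser()
    parser.feed(html)
    found = parser.directives
    return found.get(bot_name.lower(), found.get("robots"))
```

For the sample page, a crawler identifying as `ExampleBot` is allowed by its own tag, while any other bot falls back to the generic `robots` tag and is refused.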

If the proposal’s recommendations are enacted, website owners will have more control over which bots can crawl their web pages. If enough web owners take advantage of these new rules and restrict AI bots, generative AI development could slow down.

