
Unannounced, OpenAI recently added details about its GPTBot crawler to its online documentation. GPTBot is the name of the user agent the company uses to retrieve web pages to train the AI models behind ChatGPT, such as GPT-4. Earlier this week, some websites moved to block GPTBot’s access to their content.
In the new documentation, OpenAI says that web pages crawled with GPTBot “can potentially be used to improve future models” and that allowing GPTBot to access your website “can help AI models become more accurate and improve their general capabilities and security.”
OpenAI claims to have implemented filters that ensure that sources behind paywalls, those that collect personally identifiable information, or any content that violates OpenAI’s policies will not be accessed by GPTBot.
News of potentially being able to block OpenAI’s training scrapers (assuming they respect robots.txt) comes too late to affect ChatGPT or GPT-4’s current training data, which was scraped without announcement years ago. OpenAI collected data ending in September 2021, which is the current “knowledge cutoff” for OpenAI’s language models.
It is worth noting that it is unclear whether blocking GPTBot also prevents browsing versions of ChatGPT or ChatGPT plugins from accessing current websites to relay up-to-date information to the user. That point was not specified in the documentation, and we contacted OpenAI for clarification.
The answer lies with robots.txt
According to OpenAI’s documentation, GPTBot will be identifiable by the user agent token “GPTBot”, with its full string as “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)”.
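For site operators who want to handle the crawler in application code rather than robots.txt, identifying it comes down to matching the “GPTBot” token in the request’s User-Agent header. The sketch below illustrates the idea; the helper name and matching rule are this article’s illustration, not anything OpenAI provides:

```python
# The full user agent string as given in OpenAI's documentation.
GPTBOT_UA = ("Mozilla/5.0 AppleWebKit/537.36 "
             "(KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)")

def is_gptbot(user_agent: str) -> bool:
    """Return True if the request's User-Agent contains the 'GPTBot' token."""
    return "gptbot" in user_agent.lower()

print(is_gptbot(GPTBOT_UA))                                         # True
print(is_gptbot("Mozilla/5.0 (X11; Linux x86_64) Firefox/116.0"))   # False
```

Note that user agent strings are trivially spoofable, which is why OpenAI also publishes the IP ranges the crawler operates from.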
The OpenAI docs also provide instructions on how to block GPTBot from crawling websites using the industry-standard robots.txt file, a text file located in the root directory of a website that tells crawlers (such as those used by search engines) which parts of the site not to crawl.
It’s as simple as adding these two lines to a website’s robots.txt file:
User-agent: GPTBot
Disallow: /
OpenAI also says that admins can restrict GPTBot to certain parts of a site using Allow and Disallow directives in robots.txt:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
In addition, OpenAI has published the IP address ranges from which GPTBot will operate, which can also be blocked with firewalls.
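Firewall-level blocking amounts to checking whether a connecting address falls inside one of those published ranges. The sketch below shows the membership test with Python’s stdlib ipaddress module; the CIDR block here is a documentation placeholder (TEST-NET-1), not one of OpenAI’s real ranges, which live in its docs and may change:

```python
import ipaddress

# Placeholder range for illustration only; substitute OpenAI's published
# GPTBot egress ranges from its documentation.
GPTBOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

def from_gptbot_range(ip: str) -> bool:
    """Return True if the address falls inside any listed GPTBot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in GPTBOT_RANGES)

print(from_gptbot_range("192.0.2.10"))     # True
print(from_gptbot_range("198.51.100.5"))   # False
```

Unlike a user agent match, an IP check can’t be spoofed by the client, but it does require keeping the range list in sync with OpenAI’s published values.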
Despite this option, blocking GPTBot does not guarantee that a website’s data will never end up training AI models. Beyond the problem of scrapers that ignore robots.txt files, there are other large datasets of scraped websites that are not affiliated with OpenAI. These datasets are typically used to train open source (or source-available) LLMs such as Meta’s Llama 2.
Some sites react with urgency
Although ChatGPT is very successful from a technical point of view, it has also been controversial for how it scraped copyrighted data without permission and concentrated that value into a commercial product that circumvents the typical online publishing model. OpenAI has been accused of (and sued for) plagiarism along these lines.
Consequently, it’s not surprising to see some publishers react quickly to the news that they can potentially block their content from future GPT models. For example, on Tuesday, VentureBeat reported that a Substack author and the sci-fi magazine Clarkesworld said they would block GPTBot soon after news of the bot broke.
However, for large website operators, the choice to block large language model (LLM) crawlers is not as easy as it may seem. Blinding some LLMs to certain site data will leave knowledge gaps that may serve some sites very well (such as sites that don’t want to lose visitors if ChatGPT provides their information for them), but it may also harm others. For example, blocking content from future AI models could reduce a website’s or brand’s cultural footprint if AI chatbots become a primary user interface in the future. As a thought experiment, imagine an online business declaring in 2002 that it did not want its website indexed by Google—a self-defeating move when Google was the most popular on-ramp for finding information online.
It’s still early in the generative AI game, and regardless of which way the technology goes — or which individual websites try to opt out of AI model training — at least OpenAI offers the option.