Blocking AI Web Crawlers
In general, websites, posts, tweets and other online content all rely on sharing information. Links to other relevant content make it easier to find that information, either through old-fashioned click-throughs or through the algorithms used by search engines. The flip side is that the same information is easy for others to take. Certificates and encryption can protect sensitive data but are of little use for data that is meant to be seen, such as public-facing text, images, sound and video. Such content has always been open to IP theft on the web. More recently there is the subtler risk of scraping by AI engine bots, with the original data being morphed into ‘new’ AI constructs.
AI engines are efficient at collecting, sorting and classifying data scattered across multiple sources. In June 2023 Microsoft and OpenAI were sued in a California court over allegations that products based on their AI engines were scraping personal and business information from the Internet in secret and without consent. Data scraping is not necessarily illegal, and it is the bread and butter of a data broker, but there are restrictions on how the data can be used. National laws may require a data broker to be licensed, to obtain consent or to restrict the use of any data gathered.
The administrator of a web site can put measures in place to protect the data it hosts from scrapers. An important tool is the ‘robots.txt’ file hosted at the root of the website directory. If a site includes subdomains, each will need its own ‘robots.txt’ file in the root of its directory. Possible options are to block all crawlers, to block crawlers from specified engines (such as Google) or to block all crawlers but then allow specific crawlers to access the site (in effect exempting them from the blanket block). File types can also be specifically blocked, for example to keep images out of Google Images results. All of this requires that the user has access to their root directory. Some hosting providers control file uploads and will need to be contacted to see whether similar blocks can be put in place.
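As an illustrative sketch of what these options look like in a ‘robots.txt’ file (the crawler names are real, but the rules are examples rather than a recommended policy):

    # Block every crawler from the whole site
    User-agent: *
    Disallow: /

    # Exempt one named crawler from the blanket block
    User-agent: Googlebot
    Allow: /

    # Keep images out of Google Images results
    User-agent: Googlebot-Image
    Disallow: /

Under the Robots Exclusion Protocol a crawler honours only the group that most specifically matches its name, which is why the named Googlebot group overrides the wildcard block above.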
The ‘robots.txt’ file might be considered the nuclear option: unless the user knows which specific bots to block or allow, all bots will be blocked, making it harder for the intended audience to find the content. ChatGPT and derived engines use GPTBot. OpenAI has published ‘robots.txt’ settings that block this bot completely, or that allow or disallow it access to named directories within the site.
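A sketch of those GPTBot settings, with placeholder directory names standing in for a site’s real structure:

    # Block GPTBot from the whole site
    User-agent: GPTBot
    Disallow: /

    # Alternatively, allow it into some directories but not others
    User-agent: GPTBot
    Allow: /directory-1/
    Disallow: /directory-2/

Only one of the two groups would normally appear in a given file; they are shown together here for comparison.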
In September 2023 the web service provider Cloudflare introduced a semi-automated service to control crawlers. Their system allows bots to be permitted or refused access according to set categories, including ‘Search Engine Crawler’ and ‘AI Crawler’. The categories are set by Cloudflare, which requires services running web bots to register with its verified bot directory. This is effectively a simplified implementation of ‘robots.txt’, but it takes much of the research into bot motives away from the site administrator, recognises the problems of AI bots and simplifies the procedure for the user.