OpenAI Collects Training Data From “World’s Lamest Content Farm”
“If you were wondering what they’re using to train GPT-5, well, now you know”
That’s John Levine, creator of “the world’s lamest content farm.” Recently, Levine’s “farm” (billions of randomly generated, single-page websites) caught the attention of OpenAI’s training bot, which crawled his pages millions of times in a matter of days.
“Rather than being one giant website, each page has its own domain name,” Levine explained. “A badly written spider, like, for example, OpenAI’s, will say, ‘Oh, look at all these websites that are linked together!’ and will essentially get trapped pinging the sites.” At one point on Wednesday, OpenAI’s GPTBot was crawling Levine’s pages as many as 150 times per second, according to Levine’s activity logs, and it hit his pages more than 3 million times over several days, he said.
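To see why this is a trap: a crawler that deduplicates only by URL gets no help here, because each of the farm’s billions of domains really is a URL it has never seen. A minimal Python sketch of such a naive crawler (purely illustrative, not OpenAI’s actual code; fetch_links is a hypothetical helper that returns the hrefs found on a page):

```python
from collections import deque
from urllib.parse import urljoin

def naive_crawl(seed_url, fetch_links):
    """fetch_links(url) -> list of hrefs on that page (hypothetical helper)."""
    frontier = deque([seed_url])
    seen = set()
    while frontier:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        for href in fetch_links(url):
            link = urljoin(url, href)
            if link not in seen:
                # URL-level dedup doesn't save you: every one of the farm's
                # billions of interlinked domains looks like a brand-new site,
                # so the frontier keeps growing and the crawler stays trapped.
                frontier.append(link)
    return seen
```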
Levine finds the situation amusing and says both Bing’s crawler and an Amazon bot have previously fallen into the same trap. Running a web spider, he points out, is a tricky task that takes experience to get right: “All of these pages look the same and they’re all in the same IP address and they all share the same SSL certificate. It’s not really making an attempt to hide the fact that all 6 billion pages are really the same, but you actually have to have some experience doing this stuff to avoid hammering on people.”
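The safeguards Levine alludes to are straightforward to sketch: group domains by the IP address they resolve to, cap how many pages you fetch per IP, and rate-limit requests to any one host. A minimal sketch under those assumptions (the budget and delay numbers are made up for illustration; a production crawler would also check signals like the shared SSL certificate):

```python
import socket
import time
from collections import defaultdict
from urllib.parse import urlparse

PER_IP_BUDGET = 1000     # max pages fetched per origin IP (assumed policy)
MIN_DELAY_SECONDS = 1.0  # politeness delay between hits to the same IP

pages_fetched = defaultdict(int)
last_hit = defaultdict(float)

def allowed_to_fetch(url):
    """Per-IP crawl budget and rate limit, so billions of lookalike domains
    hosted on one server get treated as one site, not billions of sites."""
    host = urlparse(url).hostname or ""
    try:
        ip = socket.gethostbyname(host)  # farm domains all resolve to one IP
    except socket.gaierror:
        return False
    if pages_fetched[ip] >= PER_IP_BUDGET:
        return False                     # crawl budget for this host exhausted
    wait = MIN_DELAY_SECONDS - (time.monotonic() - last_hit[ip])
    if wait > 0:
        time.sleep(wait)                 # never hammer one server at 150 req/s
    pages_fetched[ip] += 1
    last_hit[ip] = time.monotonic()
    return True
```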
The episode highlights just one of the challenges of assembling training data for AI models. As generative AI evolves, the methods and data sources used to build it deserve continued scrutiny.