“architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”
That’s the output of a large language model trained on human-written text from Wikipedia, then retrained on its own output for nine iterations.
“For now, our store of human-generated data is large enough that current AI models won’t collapse overnight,” the researchers noted. “But to avoid a future where they do, AI developers will need to take more care about what they choose to feed into their systems. This doesn’t mean doing away with synthetic data entirely, but it does mean it will need to be better designed if models built on it are to work as intended.”
I think this is the first article I’ve seen that really explains the problem of “model collapse” as a result of training an LLM with LLM-generated data: AIs may fail to pick up less common lines of text in training datasets, which means subsequent models trained on the output cannot carry forward those nuances.
Emily Wenger, a Duke University assistant professor, used the example of an AI-based dog image generator:
“The AI model will gravitate towards recreating the breeds of dog most common in its training data, so might over-represent the Golden Retriever compared with the Petit Basset Griffon Vendéen, given the relative prevalence of the two breeds. If subsequent models are trained on an AI-generated data set that over-represents Golden Retrievers, the problem is compounded. With enough cycles of over-represented Golden Retriever, the model will forget that obscure dog breeds such as Petit Basset Griffon Vendéen exist and generate pictures of just Golden Retrievers. Eventually, the model will collapse, rendering it unable to generate meaningful content.”
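Here is a toy sketch of the dynamic Wenger describes. It is my own illustration, not code from the researchers. Each "model" is just a table of breed frequencies, and each new generation is "trained" by counting a finite sample drawn from the previous one. The breed shares, the sample size, and the function names are all made-up assumptions.

```python
import random
from collections import Counter

def train_and_sample(weights, n_samples, rng):
    """'Train' the next model by counting samples drawn from the previous one."""
    breeds = list(weights)
    draws = rng.choices(breeds, weights=[weights[b] for b in breeds], k=n_samples)
    counts = Counter(draws)
    # A breed that was never sampled drops out and can never come back.
    return {b: counts[b] / n_samples for b in breeds if counts[b] > 0}

def simulate(initial, generations, n_samples, seed=0):
    """Return the list of frequency tables, one per generation (index 0 = human data)."""
    rng = random.Random(seed)
    history = [dict(initial)]
    for _ in range(generations):
        history.append(train_and_sample(history[-1], n_samples, rng))
    return history

if __name__ == "__main__":
    # Hypothetical starting shares, chosen only for illustration.
    human_data = {
        "Golden Retriever": 0.90,
        "Labrador": 0.09,
        "Petit Basset Griffon Vendéen": 0.01,
    }
    for gen, table in enumerate(simulate(human_data, generations=30, n_samples=50)):
        print(gen, sorted(table, key=table.get, reverse=True))
```

The point of the sketch is structural: once a rare breed fails to appear in one generation's sample, no later generation can recover it, so the set of breeds can only shrink over time.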
“If you were wondering what they’re using to train GPT-5, well, now you know”
That’s John Levine, creator of “the world’s lamest content farm”. Recently, Levine’s “farm” (billions of randomly generated, single-page websites) caught the attention of OpenAI’s training bot, which then crawled the site millions of times in a few days.
Rather than being one giant website, each page has its own domain name. A badly written spider, like, for example, OpenAI’s will say ‘Oh look at all these websites that are linked together!’ and will essentially get trapped pinging the sites. At one point Wednesday, OpenAI’s GPTBot was crawling Levine’s site as many as 150 times per second, according to activity logs I viewed. It hit his pages more than 3 million times over the last several days, he said.
Levine finds the situation amusing, and says both Bing’s crawler bot and an Amazon bot have previously fallen into the same trap. He points out that running a web spider is a tricky task that requires experience to avoid such problems: “All of these pages look the same and they’re all in the same IP address and they all share the same SSL certificate. It’s not really making an attempt to hide the fact that all 6 billion pages are really the same, but you actually have to have some experience doing this stuff to avoid hammering on people.”
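Levine's description hints at one defensive heuristic: count fetches per (IP address, certificate) pair rather than per domain name, so billions of hostnames served from one machine share a single budget. The sketch below is my own minimal illustration of that idea, assuming the crawler already knows each host's resolved IP and certificate fingerprint. The `CrawlBudget` class and the cap of 100 pages are hypothetical, and I am not claiming any real crawler works this way.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class CrawlBudget:
    """Caps fetches per (IP, certificate fingerprint) instead of per hostname."""
    max_pages_per_origin: int = 100  # assumed cap, purely illustrative
    _fetched: dict = field(default_factory=lambda: defaultdict(int), init=False)

    def allow(self, ip: str, cert_fingerprint: str) -> bool:
        """Return True and record the fetch if this origin still has budget."""
        key = (ip, cert_fingerprint)
        if self._fetched[key] >= self.max_pages_per_origin:
            return False
        self._fetched[key] += 1
        return True

if __name__ == "__main__":
    budget = CrawlBudget(max_pages_per_origin=3)
    # Different hostnames, same server: they all draw on one budget.
    for host in ["a.example", "b.example", "c.example", "d.example"]:
        print(host, budget.allow("203.0.113.7", "fingerprint-1"))
```

With a per-hostname budget, every one of those hosts would look like a fresh site. Keying on the shared IP and certificate is what lets the crawler notice it is hammering a single machine.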
This highlights just one challenging aspect of training AI models. As generative AI evolves, we need to question and scrutinize the methods and data sources used.
“The digital divide seems to have flipped,” The Economist reports. Non-white American parents and educators are adopting artificial intelligence technologies faster than their white counterparts, which suggests the technology may offer an unexpected advantage to disadvantaged communities, despite fears that generative AI could widen disparities:
Yet AI is disrupting the digital-divide narrative. It is true that algorithms have disadvantaged black and Hispanic people in health care, policing and the court system. Facial-recognition software continues to struggle with non-white faces. Some AI chatbots have generated racist content. But when it comes to using AI personally, non-white families may be getting an edge.
According to the Walton Family Foundation, funded by members of the family behind Walmart, while 72% of white parents say they use AI personally, 80% of black and 84% of Hispanic parents say they do. Black teachers use AI in the classroom more often, and non-white children are also more likely to use AI at home: 68% of white parents say that their child uses AI chatbots for school, compared with 75% of Hispanic and 81% of black parents.
New Scientist highlighted a potential concern at the intersection of AI and cybersecurity. Researchers David Zollikofer at ETH Zurich and Benjamin Zimmerman at Ohio State University developed a simple computer virus that uses large language models to rewrite its own code to avoid detection and to spread via email attachments.
“We ask ChatGPT to rewrite the file, keeping the semantic structure intact, but changing the way variables are named and changing the logic a bit,” says Zollikofer. This adaptation allows the altered virus to evade routine antivirus scans once the original format has been identified.
Once the virus is rewritten by ChatGPT, the program opens up Outlook in the background of Windows, without the user knowing, and scans the most recent email chains. It then takes the content of those emails and prompts ChatGPT to write a contextually relevant reply, referencing an attachment – the virus – in an innocuous way....
In their experiments, there was around a 50 per cent chance that the AI chatbot’s alterations would cause the virus file to stop working, or, occasionally, for it to realise it was being put to nefarious use and refuse to follow the instructions. But the researchers suggest that the virus would have a good chance of success if it made five to 10 attempts to replicate itself in each computer.
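To see why five to 10 attempts would be enough, here is the back-of-the-envelope arithmetic. It rests on a simplifying assumption of mine that the article does not state: each attempt fails independently with probability 0.5.

```python
def p_at_least_one_success(p_fail: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - p_fail ** attempts

if __name__ == "__main__":
    for attempts in (1, 5, 10):
        print(attempts, p_at_least_one_success(0.5, attempts))
    # prints:
    # 1 0.5
    # 5 0.96875
    # 10 0.9990234375
```

Under that assumption, five tries already give roughly a 97% chance that at least one rewritten copy still works.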
I wonder about the feasibility of implementing such a virus in the real world. LLMs are too big to distribute with malware, leaving the system dependent on an outside service. This could be a public API like OpenAI’s, or a service run by the malware’s creators themselves; either way, it’s a weak point for security experts to exploit. In any case, this looks like a new front in the endless conflict with cybercriminals.
Participants were given the chance of completing tasks in one of three modes: independently, without any AI assistance; human-primary, where ChatGPT could assist them in editing and polishing their own work; or AI-primary, where ChatGPT would write the first draft and the person would then edit it. Some were given the choice between human-primary and independent writing, while others were given the choice of AI-primary or independent writing. Those who worked independently were always given $3 for completing their task. AI-assisted tasks were offered to workers with a random amount between $1.50 and $4.50, at $0.25 intervals.
Participants were willing to give up about $0.85, or 28% of the total they would have been paid, to have a first draft written by AI. Participants saw no meaningful difference in the quality of the work they produced with and without AI assistance, so it seems this is entirely down to the AIs taking over some of the work involved in the writing tasks.
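The numbers are easy to check. This is my own arithmetic, assuming the 28% figure is measured against the $3 paid for independent work; amounts are kept in cents to avoid floating-point noise.

```python
# Offers for AI-assisted tasks: $1.50 to $4.50 in $0.25 steps, in cents.
offers_cents = list(range(150, 451, 25))

independent_pay_cents = 300   # fixed $3 for working independently
willingness_to_pay_cents = 85  # roughly what participants gave up

share = willingness_to_pay_cents / independent_pay_cents

if __name__ == "__main__":
    print(len(offers_cents), "offer levels")  # 13 offer levels
    print(f"{share:.1%}")                     # 28.3%
```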
This result suggests to me that most of the efficiency savings produced by LLM technology are likely to be enjoyed by employers, not workers. Workers can produce more cheaply and quickly, reducing the overall need for labor. I’m not clear what that means for individual workers; possibly those more skilled in working with AI assistants will benefit.
Earlier this spring, the Wall Street Journal suggested a fun DIY project for an election-year summer. Jack Brewster of NewsGuard explained:
It took me two days, $105 and no expertise whatsoever to launch a fully automated, AI-generated local news site capable of publishing thousands of articles a day—with the partisan news coverage framing of my choice, nearly all rewritten without credit from legitimate news sources. I created a website specifically designed to support one political candidate against another in a real race for the U.S. Senate. And I made it all happen in a matter of hours.
I’m nervous and morbidly curious to see what effect LLM technology will have on the upcoming US election.
“The honeymoon phase of generative AI is over,” the company said in its 2024 Generative AI Global Benchmark Study, released on Tuesday. “While leaders remain enthusiastic about its potential to transform businesses, the initial euphoria has given way to a more measured approach.”
That’s Lucidworks, whose recent study, “The State of Generative AI in Global Business: 2024 Benchmark Report”, cites cost, data security, and safety concerns as reasons for businesses’ growing skepticism about generative AI.
According to the results of the survey, 63 percent of global companies plan to increase spending on AI in the next twelve months, compared to 93 percent in 2023 when Lucidworks conducted its first investigation.
The financial benefits of implemented AI projects have been unsatisfactory, with 42% of companies seeing no significant benefit from their generative AI initiatives. So far, few companies have managed to exit the pilot testing phase in their initiatives.