In the last couple of months, generative AI has witnessed an unprecedented surge in popularity. Thanks to OpenAI’s groundbreaking launch of ChatGPT in November 2022, even those previously unfamiliar with the term are now aware of generative AI’s incredible potential. However, amid the excitement surrounding these technological advances, concerns have arisen about AI’s negative impact on the job market. Nevertheless, a recent study highlights a critical fact: AI cannot survive without humans.
Researchers from esteemed institutions including Cambridge, Oxford, the University of Toronto, and Imperial College London have published a research paper titled ‘The Curse of Recursion: Training on Generated Data Makes Models Forget.’ The paper sheds light on the dangers large language models (LLMs) face if they begin training on AI-generated content rather than human-produced content.
Popular tools like ChatGPT, Bing, and Bard rely on existing human-generated data to answer questions and fulfill various content needs. These tools have been trained on data originally created by human beings. For instance, if someone asks Bing to look up information about sea creatures, it references an article written by a human being as its source of information. The training data for LLMs consists of articles, photos, research papers, and other forms of content created by humans. However, with the emergence of AI-assisted tools like ChatGPT, the future might witness a shift toward AI-generated content.
It is reasonable to anticipate a future where AI-generated content dominates as more people rely on these tools for content creation. However, this trajectory poses a threat to the future of LLMs.
The research paper introduces the concept of ‘model collapse’: LLMs become corrupted when trained on AI-generated data, drifting ever further from reality. Model collapse occurs when AI-generated data is mistakenly treated as genuine and used to train subsequent models, degrading their quality and accuracy.
Researchers explain that “learning from data produced by other models causes model collapse — a degenerative process whereby, over time, models forget the true underlying data distribution.” This process is inevitable, even under nearly ideal conditions for long-term learning. As LLMs increasingly rely on AI-generated data, their performance suffers and worsens over time.
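The degenerative loop the researchers describe can be illustrated with a toy simulation (a minimal sketch, not the paper’s actual language-model experiments): fit a simple Gaussian ‘model’ to data, sample synthetic data from the fitted model, refit on that synthetic output, and repeat. Because each fit carries estimation error, the errors compound across generations and the learned distribution drifts away from the true one:

```python
import random
import statistics

random.seed(0)

def fit_gaussian(samples):
    """'Train' a model: estimate the mean and std of the data."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n):
    """'Generate' content: sample synthetic data from the fitted model."""
    return [random.gauss(mean, std) for _ in range(n)]

# Generation 0: "human" data drawn from the true distribution N(0, 1).
data = generate(0.0, 1.0, 1000)

for gen in range(10):
    mean, std = fit_gaussian(data)
    print(f"generation {gen}: mean={mean:+.3f} std={std:.3f}")
    # Each successive model trains only on its predecessor's output,
    # so estimation error from every generation is baked into the next.
    data = generate(mean, std, 1000)
```

With each generation the estimated parameters wander further from the true values of 0 and 1, and no amount of later training can recover the lost information, since the original distribution is never sampled again. Real LLM training involves far more complex distributions, where rare events in the tails disappear first.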
Ilia Shumailov, lead author of the research paper, warns that mistakes in generated data compound over time, leading models trained on such data to misperceive reality even further. This deterioration could have serious implications, such as discrimination based on sensitive attributes like gender or ethnicity.
The research paper emphasizes the crucial role of human-generated content and suggests preserving the original human-created data as a means to mitigate the risk of model collapse. Unfortunately, distinguishing between human-generated and AI-generated data remains a challenge at present.
The paper also asserts that the value of content created by humans will continue to rise in the future, as it serves as a “source of pristine training data for AI” and helps maintain the integrity and accuracy of AI models.
While generative AI has undoubtedly made significant strides, it is essential to recognize the indispensable role of human-generated content. As AI continues to evolve, preserving the authenticity and diversity of human contributions becomes paramount. Striking a balance between human creativity and AI assistance is vital to the long-term success and responsible development of AI technologies.
In conclusion, the recent surge in interest in generative AI has sparked both excitement and concerns. The research paper highlights the potential pitfalls of relying solely on AI-generated content and emphasizes the enduring importance of human-generated data. By acknowledging this, we can pave the way for a future where humans and AI work together harmoniously, enabling AI to reach its full potential while remaining firmly grounded in the reality and wisdom of human knowledge.