Model Collapse: The Risks of AI Learning from AI-Generated Content & How to Prevent It


Artificial Intelligence (AI) has revolutionized various industries, including healthcare, retail, entertainment, and art. However, recent research indicates that AI learning from AI-generated content could lead to serious problems. A research group comprising members from the universities of Oxford and Cambridge, Imperial College London, and the University of Toronto has issued a warning about “model collapse,” a degenerative phenomenon that could ultimately disconnect AI from reality.

In their paper, “The Curse of Recursion: Training on Generated Data Makes Models Forget,” the researchers explain that model collapse occurs when data generated by one model pollutes the training set of the next generation of models. Trained on that polluted data, the new models progressively misperceive reality.

This issue is particularly prominent in learned generative models and tools such as Large Language Models (LLMs), Variational Autoencoders, and Gaussian Mixture Models. Over time, these models begin to forget the true underlying data distribution, producing increasingly inaccurate and distorted representations of real-world data.

Instances of machine learning models being trained on AI-generated data already exist. For example, LLMs are often intentionally trained on outputs from GPT-4. Similarly, the online platform DeviantArt allows AI-created artwork to be published and used as training data for new AI models. The researchers warn that these practices could lead to more cases of model collapse.

To prevent this phenomenon, access to the original data distribution is crucial. AI models require real, human-produced data to accurately understand and simulate our world. The research paper identifies two primary causes of model collapse: statistical approximation error, which arises because each generation learns from only a finite sample of data, and functional approximation error, which arises when the model itself is not expressive enough to capture the true distribution. These errors compound over generations, each one amplifying the inaccuracies introduced by the last.
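To see how the statistical component alone can derail a model, consider this minimal Python sketch (our illustration, not code from the paper): each “generation” fits a simple Gaussian model to a finite sample drawn from its predecessor, so finite sampling is the only source of error.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_gaussian(samples):
    """'Train' a toy model: estimate mean and std from its training set."""
    return samples.mean(), samples.std()

# Generation 0 trains on real, human-produced data drawn from N(0, 1).
real_data = rng.normal(loc=0.0, scale=1.0, size=200)
mu, sigma = fit_gaussian(real_data)

# Each later generation trains only on samples from its predecessor,
# so estimation errors compound at every step.
for gen in range(1, 21):
    synthetic = rng.normal(loc=mu, scale=sigma, size=200)
    mu, sigma = fit_gaussian(synthetic)
    if gen % 5 == 0:
        print(f"generation {gen:2d}: mean = {mu:+.3f}, std = {sigma:.3f}")
```

Because the fitted model has the same functional form as the true distribution, there is no functional approximation error here; any drift of the estimated mean and standard deviation away from 0 and 1 across runs is pure statistical approximation error, with the distribution’s tails being lost first.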

According to the researchers, maintaining access to the original human-generated data provides a “first-mover advantage” in training AI models. This access might prevent detrimental distribution shifts and, ultimately, model collapse. However, distinguishing AI-generated content on a large scale poses a significant challenge and may necessitate widespread community cooperation.

An AI model’s grasp of the world is only as reliable as the data it learns from, which makes the quality and provenance of source data critical. The rise in AI-generated content may prove to be a double-edged sword for the industry: if AI systems continue to learn from AI-generated content, we could end up with intelligent yet “delusional” machines.

In a fittingly ironic twist, our AI “offspring” could become delusional by learning more from each other than from us. It remains to be seen whether we will have to contend with these delusional, AI-driven entities in the near future.

Source: Decrypt
