Talk:Model collapse

Wouldn't this be better as a section within overfitting?

I have just read the paper, and I am not sure if this might not be better as a section within the overfitting page. The process is essentially recurrent overfitting of the model, due to class imbalance in the input, leading to an exaggerated class imbalance in later stages. Unless I am missing something conceptually. Bastianhornung (talk) 09:32, 26 July 2024 (UTC)[reply]

It is about training on synthetic data, including its own output. The media and academic coverage makes it more than notable enough for its own page. Wqwt (talk) 15:15, 2 January 2025 (UTC)[reply]
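
A toy sketch of the process both comments describe (my own illustration, not from the paper or any other source): a class distribution is repeatedly re-estimated from samples drawn from the previous estimate. Zero is an absorbing state for the rare class, so the initial imbalance is eventually exaggerated into outright loss of the tail, which is the "training on its own output" failure mode.

# Hypothetical sketch: re-estimating class frequencies from the model's own samples.
# Once a generation's sample happens to contain no instance of the rare class,
# its estimated probability hits zero and can never recover, so the initial
# imbalance is eventually exaggerated into loss of the tail.
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.90, 0.09, 0.01])   # true class frequencies, with one rare "tail" class
n = 200                                 # samples per generation (kept small on purpose)

for generation in range(30):
    samples = rng.choice(len(probs), size=n, p=probs)
    probs = np.bincount(samples, minlength=len(probs)) / n   # "retrain" on own output
    print(f"gen {generation:2d}: {np.round(probs, 3)}")

Whether this is best described as recurrent overfitting or as something distinct is exactly the framing question raised above; the sketch only shows the mechanism.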

Inbreeding

I have heard this phenomenon colloquially referred to as "inbreeding" or "inbred AI" on reddit. Might be worth putting in the article. 68.237.60.88 (talk) 17:28, 30 January 2025 (UTC)[reply]

Collapse of a single model or a succession of models?

The intro doesn't clear up my confusion on this point: Is this a matter of a single model degrading over time, or of a succession of models, with the later ones performing worse than the earlier ones? My uneducated assumption is that once a model is trained, it becomes relatively fixed (unless retrained), so its performance won't degrade. A subsequent model trained (partially or fully) on the output of the first model, though, will perform worse. Is this correct? It would be good to clarify. Sharpner (talk) 23:32, 19 February 2025 (UTC)[reply]
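
On the question itself, the usual toy demonstration matches the "succession of models" reading. A minimal sketch (my own, assuming a Gaussian toy model rather than anything specific from the article): each generation a fresh model is fitted to samples produced by the previous generation's model. Every individual model is fixed once fitted; what degrades is the chain of models.

# Minimal sketch of generation-to-generation collapse with a Gaussian "model".
# Each generation is a new, separately fitted model; none of them degrades on
# its own. Generation k is fitted only to samples from generation k-1, so the
# fitted spread performs a downward-biased random walk and, over enough
# generations, shrinks toward zero (the collapse).
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 is fitted to "real" data from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(30):
    mu, sigma = data.mean(), data.std()          # fit this generation's model
    print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation sees only this model's output, never the real data.
    data = rng.normal(loc=mu, scale=sigma, size=100)

As I understand it, that matches your assumption: a deployed model does not degrade by sitting there; the degradation shows up in later models trained (partly or fully) on earlier models' output.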

Is the core issue 'synthetic vs. human' or 'grounded vs. un-grounded' data?

The definition of model collapse often relies on the dichotomy between "synthetic data" and "human-generated data." However, this view is superficial. The fundamental distinction lies not in who generated the data, but in whether the data is "grounded" in reality. Grounded data derives from direct interactions with the world, whereas data generated by an AI model is the output of a statistical model of the world, not of the world itself (Goodfellow et al., 2016; von Helmholtz, 1860).

To strengthen this argument, we can draw parallels to human cognitive, social, and even neurological phenomena. Model collapse is the computational counterpart to what occurs in "echo chambers," where information degrades in closed loops (Arendt et al., 2021). It is even more analogous to the phenomenon of sensory deprivation. When a brain is deprived of new, real-world stimuli, it begins to generate its own perceptions—hallucinations—by recycling internal memories and patterns uncontrollably, a process underpinned by the brain's predictive nature and its reliance on internal models when external data is absent (Friston, 2010; Goldstein & Volkow, 2011).

In all these cases, the principle is the same: the degradation of information occurs in any system—biological or artificial—that is forced to learn from its own representations of the world, instead of from the world itself. Therefore, model collapse is not a problem of "synthetic vs. human data," but rather of isolated learning systems vs. open systems continuously grounded in reality.

References

Raphael2718 (talk) 12:40, 28 August 2025 (UTC)[reply]
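
To make the closing claim concrete (a hypothetical sketch of my own, not taken from any of the works cited above): the same kind of generational loop as in the Gaussian example earlier on this page, with an illustrative real_fraction knob that mixes freshly drawn "grounded" samples into each generation's training set. With real_fraction = 0.0 the system is closed and the fitted spread performs the downward-biased walk that drives collapse (noisy over any finite run); with a substantial fraction of grounded data the spread stays anchored near the true value.

# Hypothetical sketch: closed-loop retraining vs. retraining partly "grounded"
# in fresh real-world samples. real_fraction is an illustrative knob, not a
# parameter from any cited source.
import numpy as np

def final_spread(real_fraction, generations=50, n=200, seed=1):
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0                                      # generation 0 matches the true N(0, 1)
    for _ in range(generations):
        n_real = int(real_fraction * n)
        real = rng.normal(0.0, 1.0, size=n_real)              # grounded data
        synthetic = rng.normal(mu, sigma, size=n - n_real)    # previous model's output
        data = np.concatenate([real, synthetic])
        mu, sigma = data.mean(), data.std()                   # refit on the mixture
    return sigma

for frac in (0.0, 0.2, 0.5):
    print(f"real_fraction={frac:.1f}: fitted std after 50 generations ~ {final_spread(frac):.3f}")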

That is interesting, but seems to be in part original research. Do you have a reference dealing specifically with the topic, so we can verify it? TucanHolmes (talk) 08:02, 29 August 2025 (UTC)[reply]