Nvidia has bought Gretel, a start-up specializing in synthetic data. Folded into the world’s largest AI chip maker, Gretel’s offering should further expand Nvidia’s software suite and allow end users to generate extra data wherever it is needed. But what benefits does synthetic data provide? And why might it threaten GenAI in the long run?
There has been no official announcement yet, but Wired has confirmed Nvidia’s acquisition of Gretel based on two anonymous sources. The purchase price is said to be higher than Gretel’s most recent valuation of 320 million dollars. Since Nvidia earns that amount in less than two days of revenue, the financial burden is minimal for the biggest winner in the AI processor market.
That the spotlight fell on a synthetic data startup is hardly surprising. There is now a significant number of such outfits, each with the same raison d’être: to generate more data. The repository of human-generated information available online is running out, and access to what remains is becoming increasingly restricted.
The ‘data squeeze’
Scarcity is nothing new in the GenAI world, at any rate. For two years now, the supply of GPUs has not kept up with demand from AI model builders and the organizations that wish to run LLMs on them. Unlike the chip shortage, however, the data shortage cannot be solved by building a few extra factories. For one thing, the once-involuntary data suppliers for ChatGPT are no longer so generous. Reddit, X and a multitude of news organizations have long since realized what a gold mine of human-generated data they were sitting on. Lucrative deals now exist for those who do grant access, such as Reddit, Axel Springer and the Associated Press.
All of this has been in motion since the end of 2023 and is therefore not new. What has changed, however, is the outlook for the improvements new models will deliver. Since the introduction of GPT-4 in March 2023, there has been intense speculation about the earth-shattering impact GPT-5 would inevitably have. The LLM intended to fill those shoes was known under the internal codename ‘Orion’, but OpenAI had to take detours to compensate for the lack of fresh data, and the model ultimately launched under the less impressive-sounding name GPT-4.5. In a way, that name is perfect: although it is by far the largest LLM OpenAI has ever created, the model was only half as interesting and impressive as one would have expected.
The scaling laws that propelled GPT-2, GPT-3 and GPT-4 have stopped paying off. Inference-time compute, or ‘reasoning’, proved to be a new way forward, producing better, more thoughtful AI outputs at the price of higher running costs. Underneath it all, however, the problem remains that mountains of data and mountains of parameters appear to be the easiest route to progress.
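To see why data, rather than parameters, has become the binding constraint, consider the compute-optimal scaling law fitted in the Chinchilla paper (Hoffmann et al., 2022). The sketch below is a rough illustration only; the constants are the paper’s approximate fitted values and say nothing about any specific OpenAI model:

```python
# Minimal sketch of the Chinchilla scaling law (Hoffmann et al., 2022).
# The constants are the paper's approximate fitted values; this is an
# illustration of the general trend, not a claim about any real model.

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for N parameters and D training tokens."""
    E, A, B = 1.69, 406.4, 410.7   # irreducible loss and fitted scale factors
    alpha, beta = 0.34, 0.28       # fitted exponents for params and data
    return E + A / n_params**alpha + B / n_tokens**beta

# With data held fixed at 1.4T tokens, a tenfold jump in parameters barely
# moves the predicted loss: the data term dominates once tokens stop growing.
for n in (7e10, 7e11):  # 70B vs. 700B parameters
    print(f"{n:.0e} params: loss ~ {chinchilla_loss(n, 1.4e12):.3f}")
```

Run it and the parameter term shrinks to a rounding error while the data term stays put, which is the data squeeze in miniature.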
Tip: Databricks launches API to generate synthetic datasets
Synthetic data: privacy-conscious and more people-friendly?
The solution appears to be synthetic data, like the kind Gretel produces. “Better data makes better models,” goes the sales pitch, and “better” as a qualifier promises more than just “extra”. The appeal of data being synthetic is evident from Gretel’s existing customer base of banks, governments and medical organizations. These parties want to train AI on high-quality data sets, but must steer well clear of using privacy-sensitive data directly. Because synthetic data retains much of the statistical value of the original while being anonymized, it lets them reap the rewards of accurate AI models without ever exposing real records in a leak. And what if the training data does not correctly reflect the demographics? With synthetic data, additional persons from underrepresented groups can be ‘generated’ at will to counter bias.
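As a rough illustration of that balancing idea (not how Gretel actually works, which involves far more sophisticated generative models), here is a minimal sketch that fits a simple Gaussian to an underrepresented group’s numeric columns and samples fictitious extra rows. All data and group labels here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(group_rows: np.ndarray, n_new: int) -> np.ndarray:
    """Fit a multivariate Gaussian to one group's numeric columns and
    sample fresh, fictitious rows with the same overall statistics."""
    mean = group_rows.mean(axis=0)
    cov = np.cov(group_rows, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_new)

# Hypothetical tabular data: 500 rows for group A, only 20 for group B.
group_a = rng.multivariate_normal([50, 30], [[25, 3], [3, 4]], size=500)
group_b = rng.multivariate_normal([45, 35], [[16, 2], [2, 9]], size=20)

# Generate 480 synthetic B rows so both groups are equally represented;
# no original record from group B appears in the training set verbatim.
balanced = np.vstack([group_a, group_b, synthesize(group_b, 480)])
print(balanced.shape)  # (1000, 2)
```

Production-grade tools replace the Gaussian with learned generative models and add privacy guarantees, but the principle is the same: the statistics survive, the individuals do not.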
All of this sounds positive, but it demands a precise approach. Tweaking and expanding a dataset requires a high degree of awareness of its content and potential shortcomings, which is why organizations tend to leave this synthetic generation to specialized startups.
It doesn’t stop there, however. AI model builders are already training their new LLMs on the output of older and/or larger models. All of the DeepSeek distillations, for example, are smaller models that queried the larger DeepSeek-R1 about all sorts of topics and internalized its reasoning process. Meta’s Llama 3 was partly trained on data generated by Llama 2. Amazon Bedrock uses Claude to let organizations generate synthetic information. In short, the floodgates are open.
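The textbook form of this teacher-to-student transfer is knowledge distillation (Hinton et al., 2015). The DeepSeek distillations fine-tune on generated reasoning text rather than on raw logits, so the following PyTorch sketch shows the simplified classic soft-label loss, not any lab’s actual pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Classic soft-label distillation: push the student's output
    distribution toward the teacher's softened distribution."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable.
    return F.kl_div(log_student, soft_teacher,
                    reduction="batchmean") * temperature**2

# Toy example: a batch of 4 positions over a 10-token vocabulary.
teacher = torch.randn(4, 10)           # stands in for the larger model
student = torch.randn(4, 10, requires_grad=True)
loss = distillation_loss(student, teacher)
loss.backward()                        # gradients flow to the student only
```

Whether the teacher supplies logits or whole generated texts, the effect is the same: the student inherits the teacher’s knowledge, including its mistakes.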
Training on model-generated data has huge potential privacy benefits, as well as legal implications. After all, anyone using Llama 3 is also indirectly using Llama 2, which is alleged to have ingested enormous chunks of copyrighted content. At the same time, it is conceivable that AI models will appear that rely on synthetic data alone, for example models trained on medical data sets that can no longer be traced back to the individuals who originally provided the data.
Model collapse?
There is a danger of an ‘ouroboros’ here: a snake eating its own tail. Models can be ‘poisoned’ not only through malicious prompts but also through the data that is passed down to them. While poisoning is usually deliberate sabotage, it can also happen unintentionally: AI models sometimes hallucinate, including while generating data for their LLM descendants. With enough compounding errors, a new LLM risks performing worse than its predecessors. At its core, it is a simple case of garbage in, garbage out.

The logical end state is total ‘model collapse’, in which drivel overtakes anything factual and renders an LLM dysfunctional. Should this happen (and it may have happened with GPT-4.5), AI model makers are forced to fall back to an earlier checkpoint, reassess their data or make architectural changes. When the data is predominantly fictitious, the use cases suffer as well: historical knowledge becomes laced with plausible-looking nonsense, or medical data is transformed incorrectly so that it suddenly suggests biological patterns that do not exist. One can imagine this leading any system astray, making its use in production fundamentally untenable.
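A toy simulation makes the mechanism tangible. The sketch below is a deliberately simplistic stand-in for an LLM, echoing the recursive-training experiments of Shumailov et al.: each ‘generation’ fits a distribution to the previous generation’s output and then trains only on its own samples, and the fitted spread tends to shrink until the tails of the original data are forgotten:

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "real" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=25)

for generation in range(1, 31):
    # Each generation fits a model (here just a mean and std) to the
    # previous generation's output, then trains only on its own samples.
    mu, sigma = data.mean(), data.std()
    data = rng.normal(mu, sigma, size=25)
    if generation % 5 == 0:
        print(f"gen {generation:2d}: fitted std = {sigma:.3f}")
# The fitted spread tends to drift toward zero over generations: rare
# values from the original distribution are progressively forgotten until
# only a narrow, self-confirming core remains.
```

Real LLMs are vastly more complex, but the pattern reported in the literature is the same: rare knowledge disappears first, and each generation grows more confident about a shrinking slice of reality.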
In short, a high degree of expertise is required at every step of the AI process. Currently, attention is focused on the initial building of foundation models on the one hand and the actual implementation of GenAI on the other. Training data briefly took center stage in 2023, when online organizations regularly felt robbed and the topic made headlines, making us all aware of its intricacies. Now that the flow of retrievable online data is drying up, AI players are grasping for an alternative that creates new problems of its own. With Nvidia set to make synthetic data generation even more readily available, the chance of undesirable consequences is enormous.
Also read: SAS dives deeper into synthetic data with Hazy acquisition