GenAI: Running on empty?

The human-generated data that feeds LLMs is drying up. Synthetic data carries its own risks. Here’s what to expect and how to prepare.

If you stopped a statistically significant number of people on the street and asked them to name their biggest gripe with modern life, it’s hard to imagine that “a shortage of information” would crack the top 10 or even, frankly, that it would rank above “too much information” in a Family Feud–like ordering of their complaints. Most of us feel like we’re drowning in data. In business, somehow staying afloat in the rising ocean of digital data is the modern price of survival, while the few who manage to actually ride the informational waves are gobbling up market share.

And yet, in the world of generative AI (GenAI), a looming data shortage is exactly what’s keeping some researchers up at night. A 2024 report from the nonprofit research group Epoch AI projected that large language models (LLMs) could run out of fresh, human-generated training data as soon as 2026, while in January 2025 the ubiquitous Elon Musk declared that “the cumulative sum of human knowledge has been exhausted in AI training,” and that the doomsday scenario envisioned by some AI researchers “happened basically last year,” as reported by The Guardian and other media outlets.

Even if Musk is overdramatizing the situation, GenAI is unquestionably a technology whose breakthroughs in power and sophistication have generally relied on ever-larger datasets to train on. As a result, the prospect of running out of fresh data raises a few troubling questions: How soon could the well actually run dry? How much would it matter if progress stalled? And can synthetic data, new training techniques, or anything else fill the gap?

Scraping the barrel

In only five years, GenAI has gone from a fringe academic discipline to a ubiquitous, even overbearing presence in modern business life—name-checked in every boardroom and baked into every plan. It’s a star, if not the protagonist, of every dream and every nightmare about the future of work and commerce. But beneath the flurry of investment, adoption, and general GenAI activity, a quiet concern has surfaced: What if the fuel driving all this progress is running low?

These fears were given an early public voice in a 2022 paper by researchers at the University of California at Berkeley and Epoch AI, a nonprofit research group focused on long-term trends in artificial intelligence. Depending on how aggressively companies trained their models, the paper estimated, the size of the datasets used in AI training would reach parity with “the available stock of public human text data” at some point from 2026 to 2032.

That wide, six-year window reflects just how many variables are in flux—chief among them the pace at which companies are scaling their models and the unresolved question of what portion of the available stock is fit to use for training purposes. Some researchers feed their models the equivalent of steak and lobster, limiting their training data to copyright-clear, human-written, well-structured text; others are happy to pad their datasets with a typo- and bias-ridden slurry of experimental novels and old Reddit threads scraped from the dark corners of the Internet. The lack of transparency from major AI labs about what they are actually using only adds to the uncertainty of the timetable.

As do the courts. A still-growing thicket of copyright lawsuits, brought by plaintiffs ranging from bestselling authors to major media companies, now threatens to place swathes of valuable training material off-limits—some retroactively. Meanwhile, regulators in Europe are moving toward rules that could restrict the use of publicly accessible data, especially content scraped without explicit consent. If lawmakers or judges conclude that consent was required all along, much of what Musk called “the cumulative sum of human knowledge” could be declared off-limits after the fact—and the moment we run out of usable data might suddenly be behind us.

There is certainly reason to think that running out of training data would be a big deal. For all the mystery and nerd-speak about GenAI, progress so far has followed a strikingly simple rule: The more data you feed a model, the smarter it gets. The relationship between training data and model sophistication has been remarkably predictable, following the empirical “scaling laws” that AI labs use to plan their models: the greater the raw tonnage of text poured in during training, the more fluent and humanlike the results. The release in 2020 of OpenAI’s GPT-3—a 175-billion-parameter model trained on a massive corpus of Internet text—was a watershed moment, not because of a breakthrough in design, but because of its sheer scale. Almost every great leap forward in model power to this point has been achieved by a great leap upward in the amount of training data. If we are, in fact, running out of training data, then logic would suggest that we might also be looking at the end of GenAI progress.
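
That scaling relationship can be pictured in a few lines of code. The curve below is purely illustrative; the constants are placeholders rather than figures from any published study, and it shows only the shape of the relationship: more training tokens, lower loss, with diminishing returns.

```python
# Illustrative only: a power-law curve relating training tokens to model loss.
# The constants E, B, and beta are placeholders, not values from any published study.

def approx_loss(tokens: float, E: float = 1.7, B: float = 400.0, beta: float = 0.3) -> float:
    """Toy scaling curve: loss falls as a power law in the number of training tokens."""
    return E + B / (tokens ** beta)

for tokens in (1e9, 1e10, 1e11, 1e12):
    print(f"{tokens:.0e} tokens -> approx loss {approx_loss(tokens):.3f}")
```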

“Ah well,” you might be tempted to say, “nothing lasts forever.” If the current, dizzying pace of GenAI development starts to taper off in a few years, so what? When LLMs can already generate dazzling content in the blink of an eye—faster than any human and, by most metrics of usefulness, more consistently—how much more training do they need? If progress stalled at today’s level, let alone where we might find ourselves after a few more years of rapid progress, how much of a problem would that actually be?

A big one, potentially. For all the marvels of the last few years, GenAI is still not yet operating at the level its backers have promised or that many businesses are banking on. Its powers of logical reasoning remain hit-or-miss. Factual accuracy is inconsistent. Models are still prone to wild hallucinations and, arguably worse, to fabrications so banal and pedestrian that they evade human scrutiny entirely—until it’s too late. And while these flaws are tolerated today, chalked up to the teething troubles of new tools with awesome potential, they’re also the source of growing skepticism. Business leaders in particular are starting to voice doubts about GenAI’s ROI, particularly for more speculative deployments. If data runs dry and GenAI progress stalls out, those doubts could quickly harden into regret.

How close are we to that tipping point? Epoch AI has lately pushed its timeframe back a little, with the front end of its end-of-data window now falling in 2028 rather than 2026. Epoch researcher Pablo Villalobos, lead author of that 2022 study, says it’s possible that date will be pushed out even further. There are still “some relatively small but very high-quality sources that have not been tapped yet,” he concedes, including digitized documents in libraries.

Then again, our dwindling reserves of such high-hanging fruit “might not be enough” to postpone GenAI’s day of reckoning much beyond that, Villalobos cautions. He notes that in a recent podcast, OpenAI researchers said that in the development of their latest model, GPT-4.5, a shortage of fresh data was more of a constraint than a lack of computing power.

Of course, there is one obvious solution to a looming shortage of written content: have LLMs generate more of it.

The case for synthetic data

Synthetic data is computer-generated information that has the same statistical properties and patterns as real data but doesn’t include any actual real-world records. Amazon recently had success using this method with LLM-generated pairs of questions and answers to fine-tune a customer service model. The data was generated with a specific context in mind, following clear templates and guardrails that ensured consistency and relevance. Because the task was narrow and the outputs were easily reviewed by human beings, the additional training on synthetic data helped the model get better at responding accurately to customer inquiries, even in scenarios it hadn’t seen before.
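
Amazon has not published the details of that pipeline, but the general pattern, templated generation plus guardrails plus human review, can be sketched roughly as follows. The product records, templates, and checks here are hypothetical stand-ins, not the company’s actual system.

```python
# A minimal sketch of template-driven synthetic Q&A generation for fine-tuning.
# All records, templates, and guardrails below are hypothetical examples.
import json

RECORDS = [
    {"product": "Model X Router", "return_window_days": 30},
    {"product": "AcmeSound Earbuds", "return_window_days": 14},
]

TEMPLATES = [
    ("How long do I have to return the {product}?",
     "You can return the {product} within {return_window_days} days of delivery."),
    ("What is the return window for the {product}?",
     "The {product} can be returned up to {return_window_days} days after it arrives."),
]

def passes_guardrails(question: str, answer: str) -> bool:
    """Simple consistency checks before a pair enters the training set."""
    return 10 <= len(question) <= 200 and 10 <= len(answer) <= 300

pairs = []
for record in RECORDS:
    for q_tpl, a_tpl in TEMPLATES:
        q, a = q_tpl.format(**record), a_tpl.format(**record)
        if passes_guardrails(q, a):
            pairs.append({"prompt": q, "completion": a})

# Write in a JSONL layout commonly used for supervised fine-tuning.
with open("synthetic_qa.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
print(f"Generated {len(pairs)} synthetic Q&A pairs for human review.")
```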

Another use case for synthetic data is in coding, where output can often be automatically verified. In this scenario, says Villalobos, “synthetic data is working greatly and will probably continue to do so.”

This is likely also true for many of the businesses using proprietary data to train their own bespoke LLMs—whether building them from scratch or, more commonly, layering retrieval-augmented generation (RAG) atop a commercial foundation model. In many such cases, the proprietary data involved is tightly structured, such as with historical transaction records or internal logs formatted like spreadsheets with dates, locations, and dollar amounts. In contexts like these, LLM-generated synthetic data is often indistinguishable from the real thing and just as effective for training.

RAG-based systems face even less pressure. Because they rely on retrieving snippets of internal content to tailor a response from a trained commercial model, the quantity of proprietary data required is relatively modest. The goal isn’t to retrain the model wholesale but to give it just enough context to sound like it knows your business. As a result, the risk of running out of usable data in these setups is considerably lower. In many cases, it isn’t a concern at all.
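
A bare-bones sketch makes that modest data requirement concrete. Real RAG systems use learned embeddings and a vector database; here a toy word-overlap score stands in for retrieval, and call_llm() is a hypothetical placeholder for a call to a commercial model.

```python
# A bare-bones sketch of retrieval-augmented generation (RAG). The documents are
# invented examples, the word-overlap score is a stand-in for real embeddings,
# and call_llm() is a hypothetical placeholder for a commercial model API.
from collections import Counter

DOCS = [
    "Refunds for enterprise contracts are processed within 45 days.",
    "Our Hamburg warehouse ships orders Monday through Friday.",
    "Support tickets marked P1 are answered within one hour.",
]

def overlap_score(query: str, doc: str) -> int:
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    return sorted(DOCS, key=lambda doc: overlap_score(query, doc), reverse=True)[:k]

def call_llm(prompt: str) -> str:  # placeholder for a real model call
    return f"[model answer grounded in prompt of {len(prompt)} characters]"

question = "How fast are P1 support tickets handled?"
context = "\n".join(retrieve(question))
answer = call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```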

But in less narrowly defined scenarios, above all the development and training of the big commercial models that RAG itself relies on, the risks of training on synthetic data are real.

The most widely cited danger has the dramatic name “model collapse.” In a 2024 study published in Nature, researchers showed that when models are repeatedly trained on synthetic data generated by other models, they gradually lose diversity and accuracy, drifting further from the true distribution of real-world data until they can no longer produce reliably useful output.
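
The mechanism can be seen in a toy numerical experiment, which is only a loose stand-in for the Nature study’s language-model setup: each “generation” is trained on the previous generation’s output, rare tail events get under-represented, and the diversity of the data steadily narrows.

```python
# Toy illustration of "model collapse": each generation is "trained" on samples
# from the previous generation, with the rarest tail events dropped (here,
# crudely, the top and bottom 5%). The spread of the data shrinks generation
# after generation, echoing the drift toward a narrower, less diverse
# distribution described in the 2024 Nature study.
import random, statistics

random.seed(42)
mu, sigma = 0.0, 1.0  # the "real" human-data distribution

for generation in range(1, 9):
    samples = sorted(random.gauss(mu, sigma) for _ in range(2000))
    kept = samples[100:-100]                      # tails go missing
    mu, sigma = statistics.fmean(kept), statistics.stdev(kept)
    print(f"generation {generation}: std of training data = {sigma:.3f}")
```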

It’s a problem that seems to inspire off-color analogies. Mohan Shekar, SAP’s AI and quantum adoption lead for cloud-based ERP, likens the process to “model incest.” With every successive iteration, a model trained on its own output will tend to reinforce biases and flaws that may at first have been barely noticeable, until those minor defects become debilitating deformities.

If even incest is too tame an analogy, how about cannibalism? One of the more rigorous studies of synthetic-data training coined the term MAD, short for “model autophagy [self-eating] disorder,” to describe the degraded condition of models raised to maturity on a diet of their own output. Shayne Longpre, a PhD student at MIT and a founder of the nonprofit Data Provenance Initiative, likens such a model’s long-term prospects to those of “a snake eating its own tail.”

Long before reaching these extreme states, models trained with synthetic data have also been shown to exhibit a dullness and predictability reflecting their lack of fresh input. Such models may still have their uses, especially for mundane work and applications, but as Shekar puts it, “If you’re trying to innovate—really innovate—[a synthetic-data–trained model] won’t get you there. It’s just remixing what you already had.”

And here’s the bigger problem: LLMs are going to train on synthetic data whether we want them to or not. To be useful and to comply with our instructions, they must stay in dialogue with the world. They have to read what we ask them to read, make sense of it, and explain it to us. They have to scour the web for the information we need, all the while absorbing new facts and ideas and learning new modes of thought and expression.

Today, much of that language and those ideas are themselves the product of GenAI, without being flagged as such, meaning that existing LLMs are already, to some unknown extent, training on synthetic data. As machine-written material comprises a larger share of the total data universe, the risk of a sudden descent into MADness or model collapse becomes larger, while at least some degradation in the baseline quality level of LLM output becomes inevitable.

For business leaders, that has real implications. Many are making plans, projections, and investments based on the assumption that commercial LLMs will continue to improve, or at least remain as strong as they are today. But if performance begins to erode or even just plateaus, that assumption collapses. Use cases that looked transformative in 2023 might start to look like pie-in-the-sky longshots. The companies betting big now on AI-native strategies could find themselves locked into brittle workflows that no longer deliver.

The innovation escape hatch

But history isn’t destiny. While it’s true that the GenAI boom to date has been tethered tightly to the supply of training data, there’s no law saying that it has to stay that way.

Some in the AI field, including OpenAI CEO Sam Altman, have long argued that innovation in how models are trained may soon start to matter more than what they’re trained on. The next wave of breakthroughs, the thinking goes, may come from rethinking the architecture and logic of training itself and then applying those new ideas.

It’s a view that Villalobos generally shares. “In addition to [scaling up with more data], you have other types of innovation that might improve models as well,” he says. “And probably at some point, [those] will become the main drivers of progress.” While he cautions that “whatever method you’re using, it will benefit from more data and more compute,” he agrees that advances in training technique could plausibly take over where ramping up the sheer tonnage of training data leaves off.

Yaad Oren, head of research and innovation at SAP, is more confident that such a shift is underway. Recent advances in training methods already mean “you can shrink the amount of data needed to build a robust product,” he says.

One of those recent advances is multimodal training: building models that learn not just from text but also from video, audio, and other inputs. These models can effectively multiply one dataset by another, combining different types of information to create new datasets. While these new datasets are technically the product of existing data, they are so transformed from the original information sources as to be immune to MADness, model incest, and similar claustrophobic maladies.

Oren gives the example of voice recognition in cars during a rainstorm. For car manufacturers trying to train an LLM to understand and follow spoken natural-language instructions from a driver, rain in the background presents a hurdle. One unwieldy solution, says Oren, would be to “record millions of hours of people talking in the rain” to familiarize the model with the soundwaves produced by a person asking for directions in a torrential downpour.

More elegant and practical, though, is to combine an existing dataset of human speech with existing datasets of “different rain and weather sounds,” he says. The result is a model that can decipher speech across a full range of meteorological backdrops—without ever having encountered the combination firsthand.
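
In practice, this kind of combination is standard audio data augmentation: mix clean speech with recorded background noise at a range of signal-to-noise ratios, so that one clip becomes many training examples. The sketch below uses synthetic arrays as stand-ins for real recordings.

```python
# A minimal sketch of the "speech plus rain" idea: mix a clean signal with
# background noise at a chosen signal-to-noise ratio (SNR). The arrays here are
# synthetic stand-ins for real speech and rain recordings.
import numpy as np

rng = np.random.default_rng(0)
sample_rate = 16_000
speech = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)  # stand-in for a speech clip
rain = rng.normal(0.0, 1.0, sample_rate)                                 # stand-in for rain noise

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mix has the requested signal-to-noise ratio."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

# One clean clip becomes many training examples across "weather conditions".
augmented = {snr: mix_at_snr(speech, rain, snr) for snr in (20, 10, 5, 0)}
print({snr: float(np.round(np.mean(clip ** 2), 3)) for snr, clip in augmented.items()})
```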

And that’s just one pair of “modes.” Picture a model trained to detect urgency by combining audio tone, ambient noise, and word choice. Or a system that combines gesture data from a camera, transaction logs, and calendar entries to anticipate human intent. With multimodal learning, nearly every domain—sound, image, text, video, biometric signal, machine output—can be stacked, paired, or cross-pollinated with nearly every other. The number of possible combinations starts to look less like a dry checklist of use cases and more like a sparkling galaxy of possibilities.

All of this is before considering the Next Big Thing: quantum computing. “What quantum brings in,” says Shekar, “is a way to look at all the possible options that exist within your datasets and derive patterns, connections, and possibilities that were not visible before.”

Imagine a legal LLM trained on decades-old case law. Today, such a model might be able to deliver cogent summaries of legal precedents and even suggest legal strategies, but without new reams of case law to study, it is unlikely to develop any further capabilities.

A quantum-powered system, however, could train on that same dataset of musty case law and derive from it a new understanding of rhetorical patterns, correlations between specific legal arguments and verdict outcomes, or hidden threads linking obscure rulings across disparate jurisdictions.

Along with squeezing more training juice from already exploited datasets, quantum computing could even increase the total supply of usable data by accessing the vast, underutilized oceans of so-called unstructured data, says Shekar. Today’s models typically require neatly labeled training material (“This is a cat,” “That is a dog”) in order to learn. But quantum systems, with their exponentially greater processing power, can ingest unlabeled, messy datasets, like a jumbled archive of photos containing cats, dogs, and random holiday snaps of scenery, and still find the patterns they need to master the art of cat spotting. “Instead of needing 50 labeled images to train a model,” says Shekar, “you might be able to throw in 5,000 unlabeled ones and still get a more accurate result.”
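
Setting the quantum question aside, the labeled-versus-unlabeled distinction itself can be illustrated with an ordinary classical method: cluster the unlabeled examples by similarity, then use a handful of labels just to name the groups. The feature vectors below are synthetic stand-ins for image features, and the clustering is a plain k-means, not anything quantum.

```python
# A classical illustration of labeled vs. unlabeled training data (no quantum
# involved): cluster unlabeled 2-D feature vectors, then use a few labeled
# examples to name the clusters. The "cat" and "dog" features are synthetic.
import numpy as np

rng = np.random.default_rng(1)
cats = rng.normal([0.0, 0.0], 0.5, size=(500, 2))   # unlabeled "cat-like" features
dogs = rng.normal([3.0, 3.0], 0.5, size=(500, 2))   # unlabeled "dog-like" features
X = np.vstack([cats, dogs])

# Tiny k-means: assign points to the nearest centroid, then move the centroids.
centroids = X[rng.choice(len(X), size=2, replace=False)]
for _ in range(10):
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# A single labeled example is enough to name the "cat" cluster.
known_cat = np.array([0.1, -0.2])
cat_cluster = int(np.argmin(np.linalg.norm(centroids - known_cat, axis=1)))
print(f"{np.sum(labels == cat_cluster)} of {len(X)} images land in the 'cat' cluster")
```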

That could be a very big deal indeed. AI engineers have long had the same feelings about unstructured data that physicists have about dark matter: an exquisite blend of awe, annoyance, and yearning. If quantum computing finally unlocks it, especially in tandem with multimodal learning and other innovations, today’s fears of a data drought might recede.

Plan for feast and famine

All these scenarios depend on history breaking a certain way. If quantum computing arrives faster than expected, or if new training methods improve data efficiency by orders of magnitude, the data shortage may never materialize. But if quantum stalls out, or if courts and governments retroactively decree that vast swathes of existing datasets should never have been used for training in the first place, the looming shortage could suddenly be upon us.

The good news for business leaders, says Oren, is that this is a rare predicament in which the best course of action is similar regardless of which scenario comes to pass.

Whether the data shortage hits sooner, later, or not at all, the businesses best positioned to thrive will be those with their data houses in order, he says. CIOs need to ensure that their data infrastructure is structured for AI, which for those with any RAG aspirations means building data pipelines that prepare and label internal content for efficient retrieval. The dreaded data silos should be dismantled, their contents integrated, and those stubborn pockets of unstructured data brought into usable form, insofar as that’s possible. At a minimum, customer data and other business records should be verified as complete, accurate, well structured, and legally usable. In any of the best-case scenarios in which AI progress barrels ahead, unimpeded by the issues we’ve discussed, businesses with the cleanest, most accessible data will be first in line to benefit.
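
In concrete terms, “structured for AI” can be as simple as a pipeline that chunks internal documents, attaches provenance metadata, and stages everything for embedding and indexing. The sketch below is illustrative; the folder, file names, and fields are hypothetical.

```python
# A minimal sketch of an internal-content pipeline for RAG readiness: split
# documents into chunks, attach provenance metadata, and stage them for a later
# embedding/indexing step. Folder, file names, and fields are hypothetical.
import json
from pathlib import Path

def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; production pipelines usually split on structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

staged = []
for path in Path("internal_docs").glob("*.txt"):   # hypothetical source folder
    text = path.read_text(encoding="utf-8")
    for i, piece in enumerate(chunk(text)):
        staged.append({
            "source": path.name,          # provenance for audits and legal review
            "chunk_id": i,
            "text": piece,
            "verified": False,            # flipped after a data-quality check
        })

Path("staged_chunks.jsonl").write_text(
    "\n".join(json.dumps(row) for row in staged), encoding="utf-8"
)
print(f"Staged {len(staged)} chunks for embedding and indexing.")
```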

If high-quality training data does become scarce, the value of proprietary datasets will rise. Domain-specific information, like customer interactions, support tickets, contracts, and product documentation, could become vital fuel for building bespoke LLMs in a suddenly supply-constrained AI economy. If that day comes, says Oren, “you’ll have barns full of valuable raw material that you already own.”
