Synthetic Data Has a Governance Problem That Enterprises Are Not Ready For
As enterprises turn to algorithmically generated data to sustain AI development, a second-order problem is emerging — one that threatens model accuracy, regulatory compliance, and strategic differentiation.
Key Takeaways
01. Privacy regulations, data scarcity, and rising demands for model training are pushing organizations toward algorithmically generated datasets at scale.
02. Unchecked synthetic data loops can cause model drift, hidden bias amplification, declining accuracy, and false confidence in high-stakes domains like healthcare and finance.
03. As synthetic generation becomes commoditized, organizations with access to large volumes of proprietary real-world behavioral data could gain an enduring strategic advantage.
A UK hospital deploys an AI diagnostic system trained to identify life-threatening lung conditions from clinical reports. Performance looks strong. Doctors trust its confidence scores. The system scales. Then, several retraining cycles later, the model begins missing collapsed lungs and fluid buildup — while continuing to issue reassuring assessments with near certainty.
The cause was not a software failure or a compute shortage. It was recursive contamination. The model had been trained on AI-generated outputs derived from earlier AI-generated outputs, drifting further from clinical reality with every cycle.
Data is the fuel for modern AI, but its supply is neither infinite nor reliably original. As large models scale, they risk encountering a constraint that is less discussed than compute or talent: the diminishing availability of high-quality, human-generated data.
This leads to the emerging governance problem enterprises are only beginning to confront. As organizations increasingly rely on synthetic data to overcome privacy restrictions, data scarcity, and rising AI training demands, they may also introduce invisible feedback loops that degrade model accuracy, amplify bias, and erode trust in AI systems over time.
Consider another scenario. An AI company preparing to train its next-generation language model audits its data pipeline and finds little that is genuinely new. Prior web-scale corpora have already been absorbed, while a growing share of fresh content is itself machine-generated. The result is a feedback loop in which models are trained on the outputs of earlier models—raising questions about quality, diversity, and the long-term trajectory of AI systems.
For researchers, a collapse is in sight. “If the training data of most future models are also scraped from the web, then they will inevitably train on the data produced by their predecessors… [A model collapse] is inevitable, even in cases with almost ideal conditions for long-term learning, that is, no function estimation error,” reads a 2024 research paper.
This concern isn’t confined to academic research alone—industry experts have been equally vocal about it.
Previously, OpenAI Co-founder and Former Chief Scientist Ilya Sutskever made headlines with a chilling statement questioning the future of AI training. “We’ve achieved peak data, and there’ll be no more,” he said in December 2024. “We have to deal with the data that we have. There’s only one internet.”
Technology leaders face a major challenge: they are running out of data—not in quantity, but in quality.
A Possible Solution
Because they are predominantly trained on the same publicly accessible data, large language models become increasingly similar and harder to distinguish from one another. Keeping training data “fresh” for AI models is an expensive and logistically complex endeavor.
Synthetic data is emerging as a potential large-scale fix. Because it is generated to mimic the statistical properties of real datasets, it is a growing area of interest among technology experts and leaders. Mordor Intelligence forecasts that the synthetic data market will reach $710 million in 2026 and $3.67 billion by 2031.
Two Sides of the Same Coin
Synthetic data delivers when it is tied to reality. The problem arises when it replaces reality, a role it should never play.
Case 1: UK’s Synthetic Open Banking Data
The UK’s Financial Conduct Authority (FCA) spent two years examining how synthetic data could tackle some of the most stubborn data problems in financial services. To do so, it convened the Synthetic Data Expert Group (SDEG) — a panel of 21 experts drawn from across industry, academia, and regulation, including representatives from Barclays, HSBC, Standard Chartered, Mastercard, and the Alan Turing Institute.
Open Banking enables customers to securely share financial information with third parties, such as lenders, money-monitoring apps, and payment services, but only with their consent. Because transactional data is sensitive and personal, testing these systems with real records carries serious privacy and legal risk. This is where synthetic data comes into play.
One SDEG member organization tested this directly.
As a proof of concept (PoC), the project generated individual transaction descriptions, a text field containing transaction information, and an array of synthetic transactional data that, when aggregated to the customer level, replicated recognizable patterns of real income and spending behavior.
“It is not always the case of choosing whether to use synthetic data or real-life data. It is also important to understand the potential implications of having to remove some of the real-life data from existing training or validation sets, and evaluating the impacts on model predictiveness and accuracy,” the report stated.
The PoC’s preliminary findings indicated that, for this use case, a threshold of at least 30% real data, or an optimization of the real-to-synthetic data ratio, was needed to maintain strong model accuracy.
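A check on the real-data share of a training mix can be sketched as follows. This is a minimal illustration, not the FCA's or the SDEG's method: the function names are hypothetical, and the 30% default simply mirrors the PoC's preliminary finding for this one use case, so any other pipeline would need its own validated floor.

```python
def real_data_share(n_real: int, n_synthetic: int) -> float:
    """Return the fraction of training records that come from real data."""
    total = n_real + n_synthetic
    if total == 0:
        raise ValueError("training mix is empty")
    return n_real / total


def meets_real_data_floor(n_real: int, n_synthetic: int, floor: float = 0.30) -> bool:
    """True when the real-data share is at or above the floor.

    The 0.30 default is an assumption taken from the PoC's preliminary
    finding; it is not a general rule.
    """
    return real_data_share(n_real, n_synthetic) >= floor


if __name__ == "__main__":
    # Hypothetical record counts for two candidate training mixes.
    for n_real, n_synthetic in [(3_000, 7_000), (2_000, 8_000)]:
        share = real_data_share(n_real, n_synthetic)
        ok = meets_real_data_floor(n_real, n_synthetic)
        print(f"real share {share:.0%}: {'ok' if ok else 'below floor'}")
```

Running a gate like this before each retraining cycle makes the ratio an explicit, logged decision rather than a side effect of whatever data happened to be available.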

Case 2: Rapid Medical AI Contamination
Generative AI has been rapidly populating medical records with synthetic or partially AI-generated content.
A multi-institutional preprint from NUS, Harvard, Stanford, and Yale, which analyzed over 800,000 synthetic data points, found that models progressively converged toward generic phenotypes regardless of architecture. This degraded the generative capabilities of medical AI, such as medical LLMs, and posed a potential threat to clinical safety.
The study observed rapid and substantial degradation within four generations of training under uncontrolled conditions. The vocabulary in clinical reports collapsed by 98.9%, from 12,078 to approximately 200 unique words, and unique medical terms fell by 66% across datasets.
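Vocabulary collapse of this kind is straightforward to monitor between retraining cycles. The sketch below is an illustration only, not the study's methodology: the tokenizer is a naive lowercase word split, the function names are hypothetical, and the sample reports are invented toy text.

```python
import re


def vocabulary_size(reports):
    """Count distinct lowercase alphabetic tokens across a list of report texts."""
    words = set()
    for text in reports:
        words.update(re.findall(r"[a-z]+", text.lower()))
    return len(words)


def percent_drop(before: int, after: int) -> float:
    """Percentage reduction from `before` to `after`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Toy reports standing in for two model generations (invented text).
    generation_0 = [
        "Small left pneumothorax with pleural effusion.",
        "Cardiac silhouette normal.",
        "No acute findings.",
    ]
    generation_1 = [
        "No acute findings.",
        "No acute findings.",
        "No acute findings.",
    ]
    v0 = vocabulary_size(generation_0)
    v1 = vocabulary_size(generation_1)
    print(f"vocabulary {v0} -> {v1}, drop {percent_drop(v0, v1):.1f}%")
```

A production check would likely use a clinical tokenizer and a medical term list, but even a crude count like this can flag a generation whose language has narrowed sharply.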
Over time, AI models stopped capturing rare yet life-threatening conditions such as collapsed lungs and fluid buildup, and began portraying patients as predominantly middle-aged and male.
“Crucially, this degradation is masked by false diagnostic confidence. Models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations,” it read.
Models prioritized linguistic and visual fluency over pathological fidelity. Consequently, consecutive sentences lost all topical connections, resulting in fragmented parts rather than a cohesive clinical narrative.
While healthcare is the biggest beneficiary of advanced technologies, without human oversight, AI systems risk becoming dangerously unreliable—confidently missing the very diagnoses they are meant to identify.
Research Context
- Gartner: By 2027, 60% of data and analytics leaders could face major failures in managing synthetic data, risking AI governance, compliance, and model accuracy.
- FCA (UK): Open banking experiments found synthetic financial datasets still required at least 30% real-world data to maintain strong predictive accuracy.
- NUS–Harvard–Stanford–Yale Study: Recursive synthetic retraining caused medical AI systems to degrade rapidly, with diagnostic false-reassurance rates rising to 40%.
- Regulatory Push: GDPR, CCPA, India’s DPDP Act, and Brazil’s LGPD are accelerating enterprise reliance on synthetic data as access to real-world personal data tightens.
Why CXOs Should Look into It
In production environments, collections and servicing interactions often contain multiple forms of contamination that organizations may not want AI systems to internalize during training.
These can be categorized into:
- Human agent errors: off-script disclosures, missed mini-Mirandas, tone slips that violate FDCPA or state-level rules.
- Class imbalance: the genuinely interesting edge cases, such as dispute escalations, hardship claims, cease-and-desist invocations, and deceased-borrower scenarios, are rare in raw call logs.
- Personally identifiable information (PII): data that can’t legally cross training boundaries.
For Mayank Agarwal, lead AI engineer at Prodigal, synthetic data can deliver better models in regulated verticals like consumer finance. “It’s often the only way to produce a deployable agent,” he says.
“We can generate compliant-by-construction conversations where every regulatory move is correct. We can rebalance the long tail so the agent has seen 5,000 hardship variations instead of 50. And we can produce adversarial scenarios like abusive consumers, manipulation attempts, and ambiguous payment commitments at volumes that would take years to accumulate organically,” he adds.
The resulting model behaves more cleanly and more consistently than any individual human agent does in real-world operations.
IMPLICATIONS — BY ROLE | |
C-SUITE | The board-level assumption that synthetic data is a compliance-neutral, cost-efficient shortcut must be revised. Enterprises deploying AI at scale in regulated sectors — financial services, healthcare, government — require a data provenance policy that traces synthetic dataset lineage, mandates real-world anchoring ratios, and sets mandatory refresh cycles. The question to ask is not whether synthetic data is being used, but whether anyone in the organization can account for how many generations removed the current training data is from observable reality. |
FUNCTIONAL LEADERS | AI and data teams should immediately audit existing training pipelines for the risk of recursive contamination. Where synthetic data is the majority input, real-world validation datasets must be established as a permanent baseline. Model performance metrics must be expanded beyond accuracy scores to include confidence calibration — the UK hospital case demonstrates that a degraded model can maintain high confidence ratings while becoming clinically useless. |
BOARDS & GOVERNANCE | In GCC markets, where national AI programs are tied directly to economic diversification mandates, the reputational and regulatory stakes of model failure are disproportionately high. Boards should request regular disclosure from management on synthetic data ratios in production AI systems, and should evaluate whether current audit frameworks extend to AI training data governance — most do not. |
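The confidence calibration check recommended for functional leaders can be sketched with expected calibration error (ECE), a common way to measure the gap between how confident a model is and how often it is right. This is a minimal sketch under stated assumptions: the function name, the ten equal-width bins, and the sample values are illustrative choices, not something the article's sources specify.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen label.
    correct:     booleans, True where the prediction was right.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical held-out results: very confident, right only half the time.
    confidences = [0.95, 0.95, 0.95, 0.95]
    correct = [True, True, False, False]
    print(f"ECE = {expected_calibration_error(confidences, correct):.2f}")
```

An accuracy score alone would not separate this model from one that is 50% accurate and says so. Tracking ECE on a real-world validation set after every retraining cycle is one way to surface the decoupling of confidence and accuracy.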
The Question of Imitation
Despite its potential, the big question persists: Is synthetic data fake data?
Calling it fake overlooks both its value and its risks. Fake data usually refers to something random or meaningless, like Lorem Ipsum as a placeholder. Synthetic data, on the other hand, is carefully engineered to mirror the statistical patterns and relationships found in real data, much like a flight simulator: it isn’t real, but it’s realistic enough to train pilots.
Estimates suggest that over 60% of data used for AI applications in 2024 was synthetic, a figure expected to grow across various industries.
“Synthetic data is most effective in sectors where personal or sensitive data is unavoidable and tightly regulated,” shared Dr. Hakim Hacid, Chief Researcher at the AI and Digital Science Research Center, TII.
When a bank doesn’t have enough real fraud cases to train a detection model, synthetic fraud scenarios fill the gap. Pharmaceutical firms generate synthetic patient records to circumvent the constraints of the US Health Insurance Portability and Accountability Act (HIPAA). Autonomous vehicle companies conjure up millions of edge-case driving scenarios that the physical world hasn’t yet produced.
Survey and policy research relies on a different technique: synthetic participants. “By contrast, synthetic participants do not modify or extend an existing dataset at all. Instead, AI systems such as large language models generate entirely new simulated respondents, who produce responses to survey questions or policy scenarios based on defined characteristics or personas,” says Dr. Fatima Koaik, Director, Behavioral Economics and Impact Evaluation at Strategy& Middle East, part of the PwC network.
Today, privacy laws are among the main structural barriers to obtaining high-quality data for AI training.
Europe’s GDPR is the most aggressive: it requires collecting only what is necessary, limits the purposes for which data can be used, and provides a right to erasure, which lets individuals demand that their data be deleted from training sets. The California Consumer Privacy Act (CCPA) follows a similar structure. India’s DPDP Act and Brazil’s LGPD add further complexity. Organizations are not choosing synthetic data purely on benefits; they are choosing it because the alternative is becoming legally prohibitive.
“The primary directive in generating synthetic financial data is therefore protecting the privacy of customers and entities involved in generating a particular synthetic data set,” read a 2024 JPMorgan Chase research paper.
The Potential Silent Drift
When synthetic data is generated, it is based on the latest datasets. However, the real world keeps changing, and so does the data. Over time, the dataset no longer reflects current reality, leading to data drift—the input data the model was trained on has diverged from real-world conditions. The result of that data drift is model drift.
Model drift occurs when a model’s real-world performance degrades because its training data has become outdated. With synthetic data, this risk is amplified — if the synthetic dataset is not refreshed as behaviors, regulations, or environments evolve, the model drifts further from reality with every passing cycle.
“If the real-world data distribution changes over time (e.g., due to evolving user behavior, new regulations, or environmental changes), but the synthetic data generation process is not updated, the model trained on this outdated synthetic data will start to drift,” says Dr. Hacid.
“This feedback loop can reinforce inaccuracies, amplify bias, or oversimplify complex patterns,” he adds.
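One way to detect this kind of drift is to compare the distribution of a feature in the synthetic training set against a fresh real-world sample. The sketch below uses the population stability index (PSI), a measure widely used in credit risk monitoring. It is an illustration, not a method the quoted experts describe: the function name, the bin count, and the sample values are assumptions, and the 0.25 alert level is a common rule of thumb rather than a standard.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two numeric samples, binned on the reference sample's range."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / n_bins

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = max(0, min(idx, n_bins - 1))  # clamp values outside the reference range
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    # Hypothetical feature values: the synthetic set versus a newer real sample.
    synthetic_feature = [float(v) for v in range(100)]
    recent_real_feature = [float(v + 50) for v in range(100)]
    psi = population_stability_index(synthetic_feature, recent_real_feature)
    # 0.25 is a rule-of-thumb alert level, assumed here for illustration.
    print(f"PSI = {psi:.2f}", "(investigate)" if psi > 0.25 else "(stable)")
```

Scheduling a comparison like this per feature, against newly collected real data, turns the refresh decision into a measured trigger instead of a calendar guess.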
Synthetic participants are primarily used to mimic survey responses or behavioral judgments, rather than to generate automated predictions in operational systems. Dr. Koaik pointed out that their reliability can diminish if the AI models that generate them are trained on outdated assumptions regarding the populations they aim to represent.
“For example, if public attitudes, cultural norms, or policy preferences shift over time, synthetic participants calibrated on older data may no longer accurately approximate current human responses. For this reason, periodic revalidation against new human data is important to ensure that synthetic participants remain behaviorally realistic,” shared Dr. Koaik.
As enterprises fine-tune foundation models on proprietary synthetic datasets, tracing the origin of their training data becomes increasingly difficult. Was this synthetic dataset generated by a model that was itself trained on synthetic data two generations ago? In most organizations, no one knows—because no one questions it.
Enterprises should view synthetic data not merely as a cost-saving tool but as a risk-bearing asset requiring its own audit trail and periodic updates. Effective governance involves documenting the origin of each synthetic dataset, including information about the model, its training history, and creation date. It also includes regularly comparing synthetic data to real-world data to detect drift early, and having teams review for any gaps or biases before initiating retraining.
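The audit trail described above can be sketched as a small provenance record that also answers how many generations a dataset sits from real data. Everything here is hypothetical: the field names, the registry structure, and the dataset and model names are illustrative assumptions, not an established schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    created: date
    synthetic: bool
    generator_model: str = ""      # model that produced the data, if synthetic
    parents: Tuple[str, ...] = ()  # datasets the generator model was trained on


def generations_from_real(name: str, registry: Dict[str, DatasetRecord],
                          _seen: frozenset = frozenset()) -> int:
    """Worst-case number of synthetic generations between a dataset and real data."""
    record = registry[name]
    if not record.synthetic:
        return 0
    if name in _seen:
        raise ValueError(f"lineage cycle detected at {name!r}")
    if not record.parents:
        raise ValueError(f"{name!r} is synthetic but has no recorded lineage")
    deeper = _seen | {name}
    return 1 + max(generations_from_real(p, registry, deeper) for p in record.parents)


if __name__ == "__main__":
    # Hypothetical registry: one real dataset and two synthetic descendants.
    registry = {
        "calls_2023": DatasetRecord("calls_2023", date(2023, 12, 31), synthetic=False),
        "synth_v1": DatasetRecord("synth_v1", date(2024, 6, 1), True,
                                  generator_model="gen-a", parents=("calls_2023",)),
        "synth_v2": DatasetRecord("synth_v2", date(2025, 1, 15), True,
                                  generator_model="gen-b", parents=("synth_v1",)),
    }
    for name in registry:
        print(name, generations_from_real(name, registry))
```

Taking the maximum over parents is a deliberately conservative choice: a dataset is treated as being as far from reality as its most derivative ancestor. A synthetic dataset with no recorded lineage raises an error rather than defaulting to zero, since unknown provenance is the failure mode this record is meant to expose.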
By 2027, 60% of data and analytics leaders will face critical failures in managing synthetic data, risking AI governance, model accuracy, and compliance, according to Gartner.
Does Synthetic Data Save You from Privacy Laws?
Synthetic data is increasingly essential. But does it ensure compliance with privacy laws? The answer depends on how the data is generated and handled. The fact that data isn’t “real” does not mean it automatically complies with regulations.
If real customer records are used to generate synthetic data, then personal data has still been processed. “There are no laws that explicitly regulate synthetic data by name, but its use is increasingly shaped by broader regulatory frameworks in both data protection and AI governance,” says Dr. Hacid.
Like any powerful tool, synthetic data has limitations, but it will remain indispensable. The economics are too compelling, and regulatory pressure around real-world privacy is only intensifying. Synthetic data effectively addresses data scarcity. Organizations that treat drift governance as the next challenge to solve will hold a significant and enduring advantage.
“The more interesting question for companies is what happens at the application layer where vertical agents live. There, real-world data becomes the scarce, high-value anchor while synthetic data becomes the volume,” adds Agarwal.
A decade from now, Agarwal believes real-world data will evolve from a widely available resource into a strategic moat. In a future where any company can generate vast volumes of synthetic interactions, the true competitive advantage will belong to organizations with deep reserves of authentic consumer behavior data — how customers actually speak, respond, pay, dispute claims, and engage under real conditions.
The danger is not that synthetic data is artificial. It is that, left unchecked, synthetic systems can begin validating themselves — creating closed loops in which models grow more confident even as they become less accurate.
The hospital AI system that once missed life-threatening conditions is not an isolated warning. It is an early signal of a broader governance challenge emerging across industries. As enterprises increasingly train AI on machine-generated approximations of the world, real-world data may become the last remaining anchor to behavioral truth.
In that future, organizations with the strongest AI systems may not be the ones generating the most synthetic data, but the ones most disciplined about preserving a continuous connection to reality.
