There has been a ton of news this week. Below is my attempt to summarize things that someone following AI may have missed, but by no means is it everything of import that happened. This is true in general for my links. If you like these “quick hits,” let me know—I may break them out into their own weekly post, which would allow me to cover more.
Also, reminder about the DC-area meetup I’m hosting with Brian Chau May 1 on Capitol Hill. RSVP here.
Quick Hits
Meta, Microsoft, and Apple all released open-source language models this past week. More about Microsoft’s, Phi-3, below. Llama 3 was well-covered by many, but is very much worth checking out. Apple’s, however, flew under the radar. These are small models (capping out at 3 billion parameters), likely reflecting Apple’s strategy of running models locally rather than in the cloud. What is especially worth noting is how open these models are: Apple has released not just the weights but the code, the training data, and multiple checkpoints (snapshots of the model at different points in training). This is extremely useful for scientific inquiry into language model interpretability and safety, among many other things. The prospect of a company with Apple’s resources releasing AI models with such openness is promising (no other “Big Tech” player, including those mentioned here, is doing this).
Anthropic released a promising update on its research into “Sleeper Agents.” The original sleeper agents paper, which rightfully got a lot of attention, showed that it was possible to train language models to behave maliciously in such a way that (a) safety and alignment techniques do not eliminate the behavior and (b) it is almost impossible to detect the malicious training until it manifests itself in the wild. In this update, Anthropic has found that very simple interpretability methods can detect this behavior with 99% accuracy, at least in the models used in this experiment (likely not frontier models). It remains to be seen whether this approach will work in more naturalistic settings, but it is a promising update and a reminder that interpretability research can lead to meaningful improvements in safety and reliability. This kind of work requires—you guessed it—open-source models for people outside of AI companies to perform.
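For intuition, here is a rough sketch (mine, not Anthropic’s code) of what a “very simple interpretability method” can look like in practice: a linear probe trained on a model’s internal activations. The activations and labels below are random stand-ins; in the real experiments they would come from the sleeper-agent models themselves.

```python
# Illustrative sketch only: a linear probe on model activations, in the
# spirit of Anthropic's update. The data here is a random stand-in for
# real residual-stream activations and "will this model defect?" labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 4096))   # stand-in activation vectors
labels = rng.integers(0, 2, size=2000)        # stand-in defection labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The "very simple" part: a single linear classifier over the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"Probe accuracy: {probe.score(X_test, y_test):.3f}")
```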
Researchers at Profluent have used AI to create novel Crispr enzymes. Most of you will be familiar with Crispr-Cas9, the bacterial gene-editing enzyme that has been harnessed to accelerate human-designed gene edits. You may not know, however, that Cas9 is just one of many such gene-editing systems found in nature; in fact, researchers discovered 200 such systems in other bacteria using a fascinating AI technique called deep terascale clustering. Now, AI is being used to design new gene-editing systems tailored to specific purposes, with better reliability, and so on. People continue to underestimate the radical potential of bioengineering. The model used to create this new gene editor is open-source.
The Main Idea
It is widely known, by now, that modern AI requires an enormous amount of data. Some AI researchers and observers have speculated that this need for ever-larger amounts of data may be a long-term bottleneck on AI advancement. AI cynics and opponents often describe the use of copyright-protected data for AI training as a form of theft. The New York Times and other outlets have brought lawsuits against AI companies on this basis. Lawmakers at the state and federal level have targeted training data for regulation, proposing legislation that would, for example, require AI companies to disclose the copyright-protected material in their training datasets or forbid the use of such material altogether.
But what if we can use data generated by AI to train future AI systems? This concept, known as “synthetic data,” is one of the most active areas of research in the AI field and has already shown great promise.
Anthropic CEO Dario Amodei, for example, said in a recent CNBC interview that synthetic data could be used to create an “infinite data generation engine.” If synthetic data works, it has major implications for forecasting AI progress, policymaking, and even understanding how language models (and AI more broadly) fit within the US Constitution.
So, what is the state of synthetic data in AI? Let’s dive in.
Synthetic data is data generated by another AI model. It is not new. DeepMind’s AlphaZero, for example, was trained entirely on “self-play” data: games of Go it played against itself. AlphaZero plays the game far better than the best human players, so we know this approach can work. Synthetic data is also used in robotics and self-driving car research to simulate real-world situations; these synthetic scenarios can then be used to train AI models that guide robots and other autonomous technology in the real world.
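To make the self-play idea concrete, here is a deliberately toy sketch: random play on tic-tac-toe rather than anything resembling AlphaZero’s actual search. The point is only that every game a system plays against itself becomes training data it generated for itself.

```python
# Toy illustration of "self-play" synthetic data (vastly simplified:
# random moves on tic-tac-toe instead of guided search on Go).
import random

def winner(board):
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game():
    """Play one random game against itself; return (state, player, result) tuples."""
    board, player, history = [" "] * 9, "X", []
    while True:
        moves = [i for i, cell in enumerate(board) if cell == " "]
        if not moves:
            return [(s, p, "draw") for s, p in history]
        board[random.choice(moves)] = player
        history.append(("".join(board), player))
        w = winner(board)
        if w:
            return [(s, p, f"{w} wins") for s, p in history]
        player = "O" if player == "X" else "X"

# Each game the "model" plays against itself becomes synthetic training data.
dataset = [example for _ in range(1000) for example in self_play_game()]
print(len(dataset), "synthetic training examples from self-play")
```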
It has proven popular in generative AI. OpenAI’s DALL-E 3 image generation model, for example, is trained on images with captions written by a language model specifically designed to caption images. Sora, the company’s announced but unreleased video generation model, is rumored to be trained on synthetic data generated by a video game engine.
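A minimal sketch of the recaptioning idea, assuming an off-the-shelf captioning model from Hugging Face; this illustrates the general technique, not OpenAI’s actual pipeline, and the image paths are made up.

```python
# Sketch of the recaptioning idea behind DALL-E 3's training data: run a
# captioning model over an image corpus and keep the synthetic captions.
# Model choice and paths are illustrative only.
from pathlib import Path
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

synthetic_pairs = []
for image_path in Path("images/").glob("*.jpg"):      # hypothetical corpus
    caption = captioner(str(image_path))[0]["generated_text"]
    synthetic_pairs.append({"image": str(image_path), "caption": caption})

# The (image, synthetic caption) pairs would then be used to train the
# text-to-image model, replacing or supplementing noisy human alt-text.
```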
In language models, the picture is a bit more complex.
Researchers have demonstrated that using too much synthetic data can lead to “model collapse,” a phenomenon where models generate repetitive or otherwise poor outputs because they can no longer accurately reflect the patterns and statistical distributions represented in their training data. Think of this, in rough terms, like using your smartphone to take a picture of something on a computer monitor: It may work in a pinch, but the quality will be low.
Model collapse is a relatively new concept, first described in a May 2023 paper. Unlike the vast majority of machine learning papers, this one was heavily featured in mainstream media outlets. While its findings have been replicated in other settings, it is worth noting that this original paper used synthetic data generated by Meta’s OPT-125m, a language model from 2022. As the name suggests, this is a 125 million parameter model, compared to hundreds of billions or trillions of parameters in current frontier language models. The researchers trained a new language model entirely (or almost entirely) on outputs from this comparatively small and weak model and observed the characteristics we now associate with model collapse.
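For intuition on why this happens, here is a toy numerical illustration (mine, not the paper’s): repeatedly re-estimating a heavy-tailed distribution from samples of the previous generation’s estimate steadily loses the rare items in the tail.

```python
# Toy illustration of "model collapse": each generation is trained only on
# samples from the previous generation, and the tails of the distribution
# disappear. Real LLM training is far more complicated; this only shows
# the statistical intuition.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sample_size, generations = 1000, 2000, 10

# Generation 0: a heavy-tailed "true" distribution over a toy vocabulary.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for gen in range(generations):
    support = int((probs > 0).sum())
    print(f"generation {gen}: {support} of {vocab_size} tokens still represented")
    # "Train" the next generation on samples drawn from the current one:
    samples = rng.choice(vocab_size, size=sample_size, p=probs)
    counts = np.bincount(samples, minlength=vocab_size)
    probs = counts / counts.sum()
```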
It would, however, be a mistake to assume that this is the inevitable result of training on large amounts of synthetic data. The model collapse phenomenon is undoubtedly real, but what that means for the usefulness of synthetic data in model training is still an open question. Does the negative effect of synthetic data diminish if one uses, say, Claude 3 or GPT-4 versus OPT-125m? It would be shocking to me if that were not the case given how superior those models are. What portion of synthetic data is usable? It might be unreasonable to train a model with entirely or even mostly synthetic data, but if, say, 40% of the training corpus can be synthetic without damaging model performance, that is easily trillions of words for modern frontier LLMs.
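Some back-of-the-envelope arithmetic, using an assumed (but roughly frontier-scale) corpus size and the hypothetical 40% figure above, shows why even a partial synthetic fraction matters:

```python
# Back-of-the-envelope only: corpus size and tokens-to-words conversion
# are assumptions, not figures from any particular lab.
total_training_tokens = 15e12      # assumed, roughly frontier-scale corpus
synthetic_fraction = 0.40          # the hypothetical 40% from the text

synthetic_tokens = total_training_tokens * synthetic_fraction
words = synthetic_tokens * 0.75    # rough tokens-to-words conversion
print(f"{synthetic_tokens:.2e} synthetic tokens is roughly {words:.2e} words")
```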
A recent paper from the UAE suggests the amount of synthetic data that can be used is very small, but again, that paper is based on synthetic data generated by language models of GPT-2 performance or smaller. How this dynamic plays out for larger models with high-quality synthetic data is more of an open question. Intriguingly, a recent DeepMind paper on synthetic data does not even mention model collapse; take that for what you will.
More important than my speculation is the ample empirical evidence that synthetic data can be used to improve models. Indeed, essentially everyone who makes language models today, from academics to multi-trillion-dollar firms, is using synthetic data in one way or another.
Microsoft’s Phi-3, an open-source language model released this week, relies heavily on synthetic data to achieve excellent scores on LLM benchmarks with relatively few parameters. “The innovation lies entirely in our dataset for training,” the researchers report. Their approach, described in an earlier paper, relies on using GPT-3.5 (and presumably now GPT-4) to generate synthetic textbooks and code examples, which are then used to train Phi models.
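A minimal sketch of the “synthetic textbook” recipe, using the OpenAI API as the generator. This is my illustration, not Microsoft’s actual pipeline; the topics and prompt are made up, and the real process involves heavy curation and filtering.

```python
# Sketch of the "synthetic textbook" idea: prompt a strong model to write
# textbook-style passages, then keep them as pretraining text. Topics and
# prompt wording are illustrative, not Microsoft's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

topics = [
    "binary search and its complexity",
    "Newton's second law with worked examples",
]
textbook_pages = []

for topic in topics:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write a clear, self-contained textbook section, with "
                       f"worked examples and exercises, on: {topic}",
        }],
    )
    textbook_pages.append(response.choices[0].message.content)

# textbook_pages would then be filtered, deduplicated, and added to the
# pretraining corpus for a small model.
```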
Claude 3 Opus, created using Anthropic’s Constitutional AI methodology, was undoubtedly trained on substantial amounts of synthetic data. Constitutional AI involves specifying a human-written “constitution,” or set of principles for the model to follow. After a model completes its initial training, it is asked to generate responses to various prompts, critique those responses in light of a randomly selected principle from the constitution, and revise its responses in light of those critiques. This process is automated and repeated many times, and the resulting set of exchanges is a form of synthetic data used to further train the model. This is presumably just one of many uses of synthetic data in Claude 3.
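A simplified sketch of that critique-and-revision loop, using the Anthropic API: an outside illustration, not Anthropic’s code. The two-principle “constitution,” the model choice, and the prompts are placeholders.

```python
# Simplified sketch of the Constitutional AI critique-and-revision loop.
# The constitution and prompts are placeholders; the real process runs
# at enormous scale with many more principles.
import random
import anthropic

client = anthropic.Anthropic()     # assumes ANTHROPIC_API_KEY is set
MODEL = "claude-3-haiku-20240307"  # stand-in model choice

constitution = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Choose the response least likely to encourage illegal or unethical activity.",
]

def ask(prompt: str) -> str:
    msg = client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def critique_and_revise(user_prompt: str) -> dict:
    initial = ask(user_prompt)
    principle = random.choice(constitution)
    critique = ask(f"Critique this response in light of the principle "
                   f"'{principle}':\n\n{initial}")
    revision = ask(f"Rewrite the response to address this critique.\n\n"
                   f"Response: {initial}\n\nCritique: {critique}")
    # The (prompt, revised response) pair becomes synthetic training data.
    return {"prompt": user_prompt, "response": revision}
```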
DeepMind has recently published several interesting papers that point the way toward further uses of synthetic data; for example, their recent “Self-Discover” prompting method gives the model a set of 39 reasoning modules (“think about this problem step-by-step,” “break this problem down into smaller, more manageable parts,” “generate innovative and out-of-the-box ideas,” etc.) with which to approach various complex prompts. The model selects several modules, writes a plan to apply each of those modules to the specific task at hand, and then executes that plan. The method was shown to improve reasoning performance on complex tasks compared with competing prompting approaches. Of course, responses generated using this method could also be used as training data for a model in various ways.
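A rough sketch of the three Self-Discover stages (select, plan, execute), with the module list truncated and the prompts paraphrased from my reading of the paper rather than taken from DeepMind’s implementation.

```python
# Rough sketch of the Self-Discover stages: select reasoning modules,
# plan how to apply them, then execute the plan. Prompts are paraphrased
# and the module list is truncated for illustration.
from openai import OpenAI

client = OpenAI()
MODULES = [
    "Think about this problem step by step.",
    "Break this problem down into smaller, more manageable parts.",
    "Generate innovative and out-of-the-box ideas.",
    # ...the paper uses 39 such modules.
]

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def self_discover(task: str) -> str:
    selected = ask(f"Task: {task}\n\nFrom this list, select the reasoning "
                   f"modules most useful for the task:\n" + "\n".join(MODULES))
    plan = ask(f"Task: {task}\n\nWrite a step-by-step reasoning plan that "
               f"applies these modules to the task:\n{selected}")
    answer = ask(f"Task: {task}\n\nFollow this reasoning plan to solve the "
                 f"task:\n{plan}")
    return answer  # responses like these could also be recycled as training data
```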
It is likely that every frontier AI lab is experimenting with a huge diversity of ways to generate synthetic data. Indeed, as most of the high-quality, publicly available, human-generated data is scraped by the labs (and without a doubt, they compete fiercely in this area too), synthetic data may become a primary differentiator for models. A meaningful share of these firms’ compute budgets (the computational power they devote to next-generation models) is likely going to synthetic data generation. As Mark Zuckerberg recently said:
It seems quite possible that in the future, more of what we call training for these big models is actually more along the lines of inference generating synthetic data to then go feed into the model. I don't know what that ratio is going to be but I consider the generation of synthetic data to be more inference than training today. Obviously if you're doing it in order to train a model, it's part of the broader training process.
It seems to me there are a few implications if these claims about synthetic data are correct:
If synthetic data can work at scale, and can in fact improve models (which, again, is empirically the case, though we do not know how far that approach will extend), then data is not nearly as much of a bottleneck on scaling LLMs as some have suggested. We can keep making larger and more capable models within the limits of the compute and ultimately energy that are devoted to LLM training.
Efforts to restrict access to specific human-generated, copyright-protected data (such as the entire archive of every newspaper ever published) are not a meaningful barrier. The New York Times lawsuit will not matter much, at least with regard to further scaling by the frontier labs.
Model training is rather unambiguously a form of speech. Whether performing huge amounts of statistical modeling on scraped internet text counts as speech is at least debatable (I vote yes, by the way). But if a significant amount of the training data is generated synthetically within AI companies, it is hard to see how that is not speech. This is particularly evident when one considers that synthetic data will be used (and probably is already used, at least by Anthropic) to help models generate nuanced answers to political questions. Anthropic’s approach is to generate synthetic data using a literal constitution as the basis; how is that not speech? This places major limits on many proposed AI regulations.
This is an example of how understanding the trends in the machine learning literature has direct implications for public policy. One of the things that makes AI so difficult to regulate is that it is a field of live, fast-moving science, and the fundamental assumptions we made two years ago may not hold up as well today. This underscores the difficulty of regulating AI models themselves: the dynamics that undergird their development can change quickly.
The current speed of AI research and advancement suggests that regulation based on the conduct of AI users is the path forward: Our preferences for social order and what kind of behavior should be unlawful change on a far slower timeline than do the latest techniques in machine learning. Indeed, government is slow and conservative (in the apolitical sense) in part because these societal preferences tend to change slowly. By focusing on policing illicit conduct rather than policing models themselves, we can craft policies that are more resilient in the face of rapid scientific and engineering progress.