We have a pivotal year ahead of us on many different fronts. To that end, I’d like to tell you about the developments I am expecting in both AI and AI policy, as well as some of the things I will be monitoring, even if I do not expect them to come to fruition in 2025. Let’s get to it.
Frontier AI in 2025
For a long time, it was fair to dispute whether large language models “really” reasoned. Are these statistical artifacts, these compressions of the internet, “thinking”? Or are they just predicting the next word? Do they “understand” anything deep about the world, or do they simply reflect statistical patterns in data?
It was always, in my view, a muddled debate. A raw, pretrained language model (with no reinforcement learning-based posttraining), trained to predict the next token of the entire internet, has to answer a simple, but profound question: why is it that all the words ever written were written the way they were? “What does it mean to predict the next token well enough?” asked then–OpenAI chief scientist Ilya Sutskever in a 2023 interview. “It’s actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token…. In order to understand those statistics … you need to understand: What is it about the world that creates this set of statistics?” Of course this process produced models that could reason sometimes.
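To make the training objective concrete, here is a minimal sketch of what “predicting the next token” means in practice. This is an illustrative, PyTorch-style loss function I wrote for this post, not the actual training code of any frontier lab:

```python
# Illustrative sketch only: the cross-entropy objective that pretraining optimizes.
# Nothing here reflects any lab's actual training code.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Score the model's prediction at each position against the token that actually came next.

    logits: (batch, seq_len, vocab_size) model outputs
    tokens: (batch, seq_len) token ids of the training text
    """
    # Predictions at positions 0..T-2 are compared with the true tokens at 1..T-1.
    predictions = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(predictions, targets)
```

Minimizing something like this loss over the entire internet is the “simple, but profound” task Sutskever is describing.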
But this understanding, and any associated reasoning, was profoundly uneven. The pretrained language model is, as many have observed, a kind of homunculus, picking up on untold numbers of human concepts and the relationships between them—and likely, fractal patterns in human communication that not even we pick up on. In retrospect, it is probably more appropriate to see the models we have been using for the last two years as a kind of feedstock of genuine intelligence rather than the thing itself. For the last year or so, my question has been: can we cultivate that feedstock—can we, somehow, turn it into something more useful? And to what extent?
Enter OpenAI o1 (and now, o3). These models, trained with reinforcement learning, blow away other models on most math and coding benchmarks, reaching levels of performance that surprised even many deep learning bulls. In my subjective experience, they are also among the very best at answering questions unrelated to math and coding, including pure humanities questions. Perhaps most intriguingly to me, o1-preview shows superior knowledge calibration to other models (knowing what it does know, and knowing what it does not know) on OpenAI’s recent SimpleQA benchmark, which is designed to test models for their propensity to hallucinate.
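For readers who want the intuition behind “calibration” here: it simply means that the model’s confidence tracks how often it is actually right. A toy way to check that looks like the sketch below; it is an illustration of the concept, not SimpleQA’s actual methodology.

```python
# Toy illustration of calibration: bucket answers by stated confidence and check
# whether accuracy inside each bucket matches that confidence. This is a sketch of
# the concept, not SimpleQA's evaluation code.
from collections import defaultdict

def calibration_table(records, n_buckets=10):
    """records: iterable of (confidence in [0, 1], answer_was_correct: bool) pairs."""
    buckets = defaultdict(list)
    for confidence, correct in records:
        index = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[index].append(correct)
    rows = []
    for index in sorted(buckets):
        outcomes = buckets[index]
        rows.append({
            "confidence_range": (index / n_buckets, (index + 1) / n_buckets),
            "empirical_accuracy": sum(outcomes) / len(outcomes),
            "count": len(outcomes),
        })
    return rows  # a well-calibrated model's accuracy tracks its stated confidence
```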
The o-series models, then, are among the first to demonstrate that this kind of cultivation works. They will not be the last. Already, the jump from o1 to o3 (previewed, but not released, by OpenAI in late December) shows the significant performance gains to be had by scaling this reinforcement learning-based approach.
As always, there are complications clouding the picture. First, benchmarks alone are never a reliable single measure of model quality. Perhaps the models excel at coding and math but show no real gains in other domains like science, law, and medicine. In addition, Nathan Lambert believes that o3’s leap in performance over o1 could be because the former is based on the as-yet unreleased GPT-5, whereas o1 is based on GPT-4. Under this theory, o4, which will also likely be based on “GPT-5,” will be less of a leap over o3 than o3 was over o1.
Nonetheless, I do not think these complications should distract us from the broader trend: OpenAI has devised a way to make language models reason in a robust and convincing manner. Other companies, such as China’s DeepSeek and Google DeepMind, have followed up with their own versions of this same basic approach. These approaches will be scaled, and at the very least, it seems reasonable to expect that they will produce models that are among the top 1% (or better) of humans at math and coding by the end of 2025. At the time of release, this performance may only be available at great cost (OpenAI’s o1-pro, for example, is only accessible through a $200/month subscription). But we have good reason to suspect this will not remain true for long: the cost of a given frontier performance level can be expected to drop by about 99% within 18 months after it is achieved.
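That “99% within 18 months” figure implies a startlingly fast decay once you spread it out month by month. The numbers below simply unpack the arithmetic of that assumed trend; the trend itself is an estimate, not a law of nature.

```python
# Back-of-the-envelope arithmetic for the assumed cost trend: a ~99% drop in the
# price of a given performance level over ~18 months.
import math

remaining_fraction = 1 - 0.99                        # 1% of the original cost remains
months = 18
monthly_factor = remaining_fraction ** (1 / months)  # cost multiplier per month, ~0.774
halving_time = math.log(0.5) / math.log(monthly_factor)

print(f"cost retained each month: {monthly_factor:.3f}")       # ~23% cheaper every month
print(f"cost halves roughly every {halving_time:.1f} months")  # ~2.7 months
```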
No longer will language models be a “helpful assistant.” They will be among the most talented coders on Earth. And right now, we have no idea where the ceiling lies—but we know already that median human performance has been surpassed.
This obviously means that models of this kind will present novel cyberrisks. We do not know how many, or how severe. Many of the most damaging cyberattacks hinge not just on outstanding coding ability, but also on specific—and often non-public—knowledge about a vulnerability in some particular system. Thus, broad access to outstanding coding ability might increase cyberrisk less than many expect; instead, it may increase the demand for non-public knowledge about particular vulnerabilities. I suspect this is a pattern we will see repeat itself in many domains, both malicious and productive. Many areas of science, for example, are not bottlenecked by technical skill, but instead by access to novel datasets to which to apply that technical skill (see Tyler Cowen’s recent illustration of this).
It also seems reasonable to expect, with a bit less confidence, that we will see similarly significant performance gains in other hard sciences, thereby accelerating research productivity in those fields. It may, or may not, mean that AI systems “autonomously advance the frontier of science.” Instead, these systems may automate an increasing number of the scientist’s tasks, meaning that scientists can go from one experiment to the next more quickly. However, on its own, this may not mean much if the experimental productivity of scientists does not also increase. This is why I have written in support of government-funded automated labs for fields like materials science: because AI will enable scientists to generate more promising ideas, we need to improve the throughput of science so that those ideas can be tested more efficiently.
Some domains—like mathematics, computer science, and machine learning itself—do not require time-consuming, real-world experiments. Novel ideas can be tested, and iterated on, purely in silico. In these domains, I expect progress to be much quicker still. For example, I put the odds at about 25% that an AI system will solve or meaningfully advance an unsolved problem in mathematics by the end of 2025. And it is conceivable that frontier AI systems will significantly improve at automating machine learning research and engineering; certainly, I will be closely watching model performance on AI research automation benchmarks RE-Bench and MLE-Bench by the end of the year. I expect major progress.
As all these performance gains play out, the data these models generate in response to real-world user prompts will likely be another flywheel of performance optimization and enhancement. On top of this, the underlying reinforcement learning algorithms themselves can undoubtedly be refined, or even dramatically improved, since this is the beginning of a new paradigm and surely there is a great deal of low-hanging fruit.
I do not like phrases such as “hard takeoff to AGI,” because I think the term AGI is deeply under-specified. But it is conceivable—perhaps even likely—that we are indeed facing this kind of scenario.
I am inclined to feel more excitement than alarm at this prospect, but you could easily—and quite reasonably—come to a different conclusion. And your degree of alarm should probably not be zero, even if you are generally optimistic.
AI Research to Watch
There is more, too. With each of the o-series models, OpenAI has also introduced a “mini” version. DeepMind’s Gemini reasoning system is, currently, only a “mini” model (an adaptation of their Gemini Flash series). These models often achieve excellent pure reasoning performance at a low cost, with the tradeoff that they have far less “world knowledge” than the bigger versions. This means they excel at math and code, but do not come close to matching the larger models in law, medicine, science, etc.
These models have their uses today, but I suspect their real long-term use has yet to be revealed: multi-agent systems. Imagine that you issue a complex request to a full-sized model like o3, but instead of handling the prompt itself, it thinks for a while and creates a detailed plan of action. Then, o3 deputizes a swarm of o3-mini agents, operating in parallel, to handle discrete elements of the plan it has created. Some may write and execute code; some may search the web, scanning dozens or hundreds of webpages; some may process information gathered by other agents.
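To make the shape of that system concrete, here is a minimal sketch of the planner/worker pattern I am describing. The function names and model behavior are placeholders I have invented for illustration, not a real OpenAI API:

```python
# Sketch of a planner/worker multi-agent pattern. call_planner and call_worker are
# hypothetical stand-ins for a large "planner" model and cheap "mini" worker models;
# they are not real API calls.
import asyncio

async def call_planner(request: str) -> list[str]:
    """The full-sized model thinks and breaks the request into discrete subtasks (stubbed)."""
    return [f"subtask {i} of: {request}" for i in range(4)]

async def call_worker(subtask: str) -> str:
    """A small, cheap model handles one subtask: write code, search the web, etc. (stubbed)."""
    await asyncio.sleep(0)  # stand-in for the actual model call
    return f"result of {subtask!r}"

async def handle(request: str) -> str:
    plan = await call_planner(request)                               # plan first
    results = await asyncio.gather(*(call_worker(s) for s in plan))  # swarm runs in parallel
    return "\n".join(results)  # a planner-level model would then synthesize these

if __name__ == "__main__":
    print(asyncio.run(handle("complex request")))
```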
Every time OpenAI makes a new advancement in its o-series models, Noam Brown—one of the most senior researchers at OpenAI working on these models—advertises that the company is hiring for its multi-agent reinforcement learning team. This, I suspect, is what OpenAI sees as the “next big breakthrough” needed for advancing AI—or at least one of them.
I do not know if we should expect to see multi-agent systems released in 2025, but certainly, this is a field of research to watch.
Another increasingly important field of research will be decentralized training—the training of AI models across multiple different compute clusters in different physical locations. We know that DeepMind’s Gemini models have been trained across multiple different clusters—though in the same region, and linked together by extraordinarily expensive fiber. And we know that Microsoft is pursuing the same thing.
But what about less exquisite approaches? What if, say, academics across the country could string together their (individually) small university compute clusters to yield a (collectively) much larger resource? What if startups could do something similar? At the extreme end of this, one could envision fully decentralized training, where I (and millions of others) could lend my laptop’s spare compute cycles, when I am not using it, to some open-source group training a new model. There are currently massive technical hurdles, and it is unclear how far this research can go. But some startups, like Nous Research and Prime Intellect, are making progress. I’ll be keeping my eye on this.
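For the non-specialist reader, the textbook version of the problem looks like the sketch below: every node computes gradients on its own data and the nodes average them before each update. Doing this naively requires fast, expensive interconnect; the bet of the startups above is that you can communicate far less often (and far more cleverly) and still get a good model. The sketch assumes an already-initialized PyTorch distributed setup and is not any particular company’s method.

```python
# Textbook synchronous data parallelism: each node computes gradients locally, then the
# gradients are averaged across all nodes before the optimizer step. Assumes
# torch.distributed has already been initialized; illustrative only.
import torch
import torch.distributed as dist

def distributed_training_step(model, batch, loss_fn, optimizer):
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()

    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients across nodes
            param.grad /= world_size                           # average them

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```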
Both trends also push against centralized control of AI by governments. Instead, they imply a world where access to AI is widely distributed, inexpensive (except for perhaps the most difficult problems in the world, which could indeed involve multi-million dollar inference runs), and fiercely competitive.
This is probably a good thing for everyone except those who believe that centralized (or at least, somewhat centralized) control of AI is essential for safety and security. And it is indeed true: every trend we see in AI seems to limit the options for policymakers. Licensing requirements seem near-impossible to implement, as does regulatory pre-approval of any kind without a massive and likely harmful transformation to the way that knowledge is shared on the internet. Export controls on model weights seem useless if Chinese companies can remain fast followers (note that the dynamics are importantly different for GPU export controls, which I broadly support). Any law based on compute thresholds increasingly seems as though it was built for a different world than the one we, in fact, have stumbled into.
Without a doubt, there are a wide variety of feasible and prudent steps policymakers can take today. I have written about many of them. But will our policymakers pursue them? Do American policymakers, writ large, understand what is happening? That brings us to policy.
AI Policy in 2025
I expect that the main locus of AI lawmaking in the United States will continue to be state governments rather than the federal government. There will, undoubtedly, be significant federal action, such as:
The Trump administration rescinding and replacing the Biden Executive Order on AI and other Biden administration AI policy documents;
Relaxation of environmental permitting and other regulations related to the construction of infrastructure (both executive and legislative action);
Revisions to Biden administration export controls (this is lower confidence).
Certainly, there will be other federal AI policy actions, and the Trump administration’s incoming team of AI and technology policy staff bodes well, in my view, for the quality of those actions.
There will also be plenty of rhetoric. Expect the volume to be turned up on ill-defined concepts like “a Manhattan Project for AGI” and similar ideas. I disfavor such policies, but what value any of this foreign policy chitchat holds is, as ever, unclear.
But in general, my expectation is that state governments will “lead” the way. Just like last year, expect somewhere between 500 and 1,000 individual AI-related bills to be proposed in state legislatures in 2025. The vast majority of these will be either anodyne or dead on arrival. A small sliver will be significant proposals with an actual shot at passing. The vast majority of those will be similar to the Texas “algorithmic discrimination” law I have covered extensively, which will probably appear in half a dozen other states.
In other words, of the significant state laws introduced next year, the vast majority will do approximately nothing to grapple with the reality of frontier AI today. This flavor of policy, by the way—the kind that has nothing to do with the novel risks of current and clearly-near-future AI systems—is what some have chosen to call “evidence-based” policy, in yet another comical contortion of the English language wrought by America’s policymaking community.
And then, of course, there is California, the locus of SB 1047, one of the only bills that did focus on frontier AI risks. As you probably know, I was a vociferous critic of this bill. But one thing I consistently acknowledged is that I admired Senator Wiener and SB 1047 supporters for focusing narrowly on the novel and major risks of frontier AI (however low their probability) rather than on vague “present-day” risks.
I do not know what Senator Wiener and the very small number of other state legislators who think similarly to him will do in 2025. But I am confident that Senator Wiener, at least, will do something; after all, he promised as much in a statement after SB 1047 was vetoed. And Wiener has introduced a placeholder bill, confirming that he will be introducing an AI bill this year (thank you to Joel Burke for reminding me of this, and apologies for the oversight of not including this fact in the original post). Perhaps that bill will be improved. Or perhaps we will repeat the same conflict all over again.
Regardless, just like last year, there is a good chance that whatever comes out of California will be among the most important AI policy debates of 2025.