And the words that are used
For to get the ship confused
Will not be understood as they’re spoken
For the chains of the sea will have busted in the night
And will be buried at the bottom of the ocean
Bob Dylan, “When the Ship Comes In”
Introduction
For all the talk, most enacted and proposed AI model regulations are remarkably similar. The reasoning behind them goes something like this: We don’t want to deter AI innovation, so we’ll place “light-touch” regulation on the “frontier” of AI, where high capital expenditures mean that only large companies will be regulated. And we’ll determine what counts as the “frontier” using some proxy for a model’s size, such as the amount of computing power used to train it. Model size, many believed (and still do), is an imperfect but good-enough proxy for a model’s capabilities.
These assumptions undergirded the “first era” of AI policy. But last week, OpenAI challenged, or even broke, many of those assumptions. In doing so, I suspect they brought the first era of AI policy to a conclusion.
OpenAI released a new series of language models called o1-preview and o1-mini. Though the “o” in o1 is apparently meant to stand for “OpenAI,” there are some indications that the name is also a nod to the American O-1 visa—a work visa granted to “aliens of extraordinary ability.” And these models are, indeed, extraordinary.
Though they are limited in some ways, they are a new milestone in AI, performing among the top 10,000 or so humans on Earth on certain mathematics tests. OpenAI didn’t accomplish this through the “scaling laws,” or the idea that you can improve a model’s performance by making it bigger. Instead, they made these models smarter by teaching them to reason as they answer your prompt. To use them, you must be subscribed to the paid tier of ChatGPT, though at some point at least o1-mini will be made available to free users.
We don’t know exactly how large o1-preview and o1-mini are, but my suspicion is that they are not “frontier” models in terms of the number of parameters they possess or the amount of compute used to train them. Yet they are undoubtedly frontier models in terms of their capabilities.
As a result, these models call into question two assumptions implicit in most model-based AI regulation:
That the only significant way to increase a model’s capability is to increase its size (training data, size of the neural network, etc.); and
That a model’s capabilities are determined by the amount of compute used to train it.
OpenAI’s new models invalidate the first assumption outright and complicate the second. The o1 models show that it is possible to vastly improve a model’s performance without making it bigger and without simply throwing more training compute at it. Indeed, o1-mini shows that even relatively small models can push the capabilities frontier forward.
I would be surprised if either o1-preview or o1-mini crossed the compute thresholds that trigger regulation under any of the following enacted or proposed government actions (a rough sketch of the threshold arithmetic follows this list):
The Biden Executive Order’s reporting requirements triggered at 10^26 floating-point operations of training compute (though OpenAI has indicated that they gave the US AI Safety Institute early access to these models anyway);
The EU AI Act’s “systemic risk” threshold of 10^25 floating-point operations of training compute (though the EU has other ways of classifying models as having a “systemic risk”);
The liability provisions under my seemingly eternal muse, SB 1047, which would kick in at 10^26 floating-point operations of training compute if the bill becomes law (though California’s would-be regulator would probably try to find some way to treat o1 as a “covered model”).
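To make those thresholds concrete, here is a minimal sketch of the back-of-envelope arithmetic analysts typically use: the rule of thumb that training a dense transformer costs roughly 6 × parameters × training tokens in floating-point operations. The parameter and token counts below are hypothetical illustrations of my own, not disclosed figures for any OpenAI model.

```python
# Back-of-envelope training-compute estimates using the common rule of thumb
# that training a dense transformer costs roughly 6 * parameters * tokens FLOPs.
# All parameter and token counts below are HYPOTHETICAL illustrations, not
# disclosed figures for any real model.

EO_THRESHOLD = 1e26  # Biden Executive Order reporting threshold (FLOPs)
EU_THRESHOLD = 1e25  # EU AI Act "systemic risk" threshold (FLOPs)

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute for a dense transformer."""
    return 6 * params * tokens

hypothetical_runs = {
    "8B parameters, 15T tokens": training_flops(8e9, 15e12),
    "70B parameters, 15T tokens": training_flops(70e9, 15e12),
    "1.8T parameters, 13T tokens": training_flops(1.8e12, 13e12),
}

for name, flops in hypothetical_runs.items():
    print(f"{name}: {flops:.1e} FLOPs | "
          f"crosses EU 1e25 threshold: {flops > EU_THRESHOLD} | "
          f"crosses EO 1e26 threshold: {flops > EO_THRESHOLD}")
```

The point is that a model of modest size, trained on a realistic amount of data, can sit comfortably below both lines, and that is before accounting for any capability gained at inference time rather than training time.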
The early efforts to make AI legible to the state—thresholds based on training compute—are probably going to fail. This was predictable; I even predicted it! Now it is obvious. Virtually none of the proposals to regulate AI models will work as intended. But regulating a broader range of models still carries major tradeoffs for economic competitiveness, the way the digital economy works, and even digital freedom—to say nothing of the practical challenges. Thus, pre-emptive model-based regulation may not work at all.
But before we can talk about the policy implications, let’s talk about the models themselves. Close followers of AI can skip the next section and proceed right to the “Using o1” or “Policy Implications” sections.
About o1
OpenAI announced three new models:
OpenAI o1, an as-yet unreleased model
OpenAI o1-preview, an early version of o1—specifically, an early “checkpoint,” or snapshot from an earlier stage of training, of the full o1 model
OpenAI o1-mini, a smaller, faster, cheaper—and mysterious—model; it is not clear whether o1-mini is a distilled, smaller version of o1 or an entirely separate model trained with the same approach as o1 and o1-preview
OpenAI’s technical report for the o1 line of models contains little about how they achieved this new level of performance. Here’s what we know:
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.
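That sentence is close to everything OpenAI has disclosed about the method. To make the phrase concrete, here is one generic, purely illustrative form that “reinforcement learning on chain of thought” can take: reinforce reasoning traces that end in a verifiably correct answer. The sample_fn and update_fn helpers are hypothetical placeholders, and nothing here should be read as OpenAI’s actual recipe.

```python
# Purely illustrative: one generic form that "RL on chain of thought" can take --
# reward reasoning traces that end in a verifiably correct answer. This is a
# sketch of the general idea, NOT OpenAI's (undisclosed) training recipe.

import random

def rl_on_chain_of_thought(model, problems, sample_fn, update_fn, steps: int):
    """Toy training loop with an outcome-based reward.

    problems is a list of (prompt, correct_answer) pairs with checkable answers.
    Hypothetical helpers (placeholders, not a real API):
      sample_fn(model, prompt) -> (chain_of_thought, final_answer)
      update_fn(model, prompt, chain, reward) -> applies a policy update
    """
    for _ in range(steps):
        prompt, correct_answer = random.choice(problems)
        chain, answer = sample_fn(model, prompt)            # model "thinks out loud"
        reward = 1.0 if answer == correct_answer else 0.0   # verifiable outcome reward
        update_fn(model, prompt, chain, reward)             # reinforce useful reasoning
```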
In a recent essay for The New Atlantis, I wrote:
In addition to mitigating plasma instability in fusion reactors, reinforcement learning is at the heart of other recent AI breakthroughs: DeepMind, the AI lab within Google, famously employed RL in the model that achieved superhuman performance at the board game Go.
To what extent can such an optimization system be generalized? What if the same approach could be applied to AI systems that write code, plan and conduct scientific experiments, or craft essays? These are some of the questions at the frontier of language modeling.
With the o1 series, OpenAI has successfully found a way to do what I described above. We do not know the specifics of how OpenAI achieved this, and while I could speculate, others with more technical expertise will do so better than I can.
At the highest level, though, this is achieved through giving the model more time to think about its answer, evaluate various options, and check its work for mistakes before it responds to the user’s prompt. This is known in the field as “test-time compute”—more computing power devoted to running the model, rather than or in addition to training the model. It’s not so different from human cognition; if I gave you three months to read everything about physics you could, and then asked you to answer a series of complex physics questions with the first thing that came to mind, you’d probably do worse than if I gave you some time in a quiet room with a pen and paper.
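For a feel of what spending compute at test time can look like in practice, here is a minimal sketch of one public technique, self-consistency: sample several independent chains of thought and take a majority vote on the final answer. OpenAI has not said that this is how o1 works internally; it simply illustrates the basic trade of more inference compute for more reliability. The sample_fn helper is a hypothetical placeholder.

```python
# A minimal sketch of one PUBLIC way to spend extra compute at inference time:
# sample several chains of thought and majority-vote on the final answer
# ("self-consistency"). Illustrative only; not a claim about o1's internals.

from collections import Counter

def answer_with_budget(prompt: str, sample_fn, n_samples: int) -> str:
    """sample_fn(prompt) is a hypothetical helper returning (chain, answer)."""
    answers = []
    for _ in range(n_samples):
        _chain, answer = sample_fn(prompt)   # each sample is one independent "line of thought"
        answers.append(answer)
    # A larger inference budget (more samples) makes the vote more reliable.
    return Counter(answers).most_common(1)[0][0]
```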
Additional “thinking time” has long been believed to be a potential source of performance improvements, but until last week, no one had publicly shown just how large those improvements could be.
Now, we know. And we know something even more important: the performance improvements seem to scale with how much time is given to the model to think. If you give the model 30 seconds to think about something complex, it will do worse than if you give it a minute, or two minutes. OpenAI’s chart below captures this relationship, specifically as it applies to the model’s performance on the American Invitational Mathematics Examination (AIME), a competition for high schoolers with exceptional math skills:
How far does this go? We can’t be sure, but OpenAI staff seem to think it can go far. As Max Schwarzer, a researcher at the company, posted on X:
The most important thing is that this is just the beginning for this paradigm. Scaling works, there will be more models in the future, and they will be much, much smarter than the ones we're giving access to today.
Even with the models OpenAI has shown publicly, the improvements are more than just on benchmarks. The o1 models can break complex tasks down into simpler constituent sub-tasks far more skillfully than any other model I have seen. They can recognize and recover from mistakes they make. These are both crucial steps toward AI agents, models that can take action on the user’s behalf. As these models scale further, it seems likely that agents will begin to work far better than they do today.
OpenAI’s technical report focused largely on model safety. Broadly, the models were more “aligned” than previous models, which in practice means that they were better at applying OpenAI’s rules in ambiguous situations. Because of their superior scientific reasoning capabilities, OpenAI now believes that the bioweapons risk posed by their models is “moderate,” meaning that they may be able to provide some degree of meaningful help to an expert human (probably by speeding up tasks the expert would otherwise do). They were also willing to deceive users in experimental settings. In short, the models are safer in some important ways, but possess somewhat more latent potential for risky behavior. I’ll have more to say in future posts, but overall I think this is good news.
Importantly, there are still flaws with the o1 models. While some PhDs are showing o1-preview replicating, in a matter of minutes, work that took them a year to do, others have been less impressed. It still sometimes fails to answer basic questions, like how many r’s are in the word “strawberry.” It falls for tricks; for example, if you structure a simple question like a riddle, the model will often pick the “counterintuitive” answer rather than the obvious one. This is likely because it is still, like language models before it, doing “dumb pattern matching”; if a prompt looks like a riddle, the model will believe that it is a riddle, even if it is not. The models also make no notable progress on the ARC Prize, a spatial reasoning benchmark designed to measure models’ ability to reason about things outside of their training data.
Finally, o1-preview and o1-mini perform notably worse than GPT-4o and other, more conventional, language models on many standard prompts. This is surely why OpenAI plans to continue releasing models in the GPT series; GPT-5, for example, is still coming. For now, the o1 series is a parallel track.
Despite all these caveats, the progress is remarkable, and the path to truly transformative AI—AI that can automate scientific research, complex software engineering, and much else—is clearer than ever. My expectation remains that this milestone will be achieved by the end of this decade, and I am more confident in that forecast than ever before.
Using o1
If you want to get a glimpse at the future—or one potential future—take a look at the screenshot below:
This is what a summary of o1’s thinking looked like in response to the prompt: “Please brainstorm some concrete ways that insights from group theory could be applied to science in ways that are currently (as far as you know) NOT being done. Please be as specific and precise as possible, drawing on particular concepts or theorems from group theory.” (For those who have not used this model: o1’s actual response follows the thinking summary, but is not pictured here.)
The model did, indeed, generate some ideas I’ve never considered before (the application of group theory to the natural sciences is an area I’ve been exploring lately). You’d need more expertise than I have to evaluate how viable any of its ideas were, though my strong suspicion is that the model is not at the point where it can make truly novel contributions to these fields.
Still, it is a glimpse. Imagine a future where each one of these chains of thought led to novel academic contributions that could make fundamental improvements to the sciences. Imagine a future where the model could think about such problems for hours, days, weeks, or months. As Noam Brown, another OpenAI researcher, wrote on X:
o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks. Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots
While o1-preview is stronger on most benchmarks, o1-mini is in some ways the more impressive model. It is, first of all, fast. So fast, in fact, that I suspect it is quite small—perhaps even small enough to run locally on consumer hardware. Even if o1-mini isn’t quite that small, it certainly suggests that running such models locally could be possible in 6-12 months.
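As a rough sanity check on the “small enough to run locally” intuition, the usual memory arithmetic is simple: weights alone take about parameters × bits-per-weight ÷ 8 bytes. The parameter counts below are hypothetical, since OpenAI has not disclosed o1-mini’s size.

```python
# Rough memory arithmetic for running a model locally: weights alone need about
# parameters * bits_per_weight / 8 bytes (ignoring the KV cache and activations).
# Parameter counts are HYPOTHETICAL; o1-mini's size has not been disclosed.

def weight_memory_gb(params: float, bits_per_weight: int) -> float:
    return params * bits_per_weight / 8 / 1e9

for params in (3e9, 8e9, 30e9):
    sizes = ", ".join(
        f"{bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB"
        for bits in (16, 4)
    )
    print(f"{params / 1e9:.0f}B parameters -> {sizes}")
```

A quantized model in the single-digit billions of parameters fits in the RAM of an ordinary laptop; that is the regime in which locally run o1-mini-class models would start to become plausible.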
I asked o1-mini to write me a few simple iPhone apps; in multi-thousand token responses, it formulated a robust plan for each app’s development and coded the user interface, all basic functionality, the app’s data structures, and some additional features I hadn’t even asked for. All in under a minute, for pennies. We’ve been heading in this direction for a while, but o1-mini and o1-preview are the first models I could call genuine programmers rather than coding assistants.
The fact that such a small model was able to achieve such impressive performance—surpassing GPT-4o, Claude 3.5 Sonnet, and the largest variant of Meta’s open-source Llama model on the AIME test mentioned earlier—is wildly impressive, and demonstrates the power of OpenAI’s new reinforcement learning-based approach. It is also the clearest indication we have that our initial approach to “frontier model regulation” is woefully misguided.
Policy Implications
Policymakers have found themselves in a bind with AI. Many of them genuinely want to avoid “hampering innovation,” and, Heaven forbid, regulating startups. And yet there are legitimate risks from future AI models, which may well require government action of some form. So, they told themselves, we can regulate AI, for now at least, just by regulating the “frontier” models—the big and expensive models whose well-capitalized corporate creators can surely afford some degree of regulatory burden.
It seems possible that future versions of o1-mini with exceptional, and perhaps dangerous, coding capabilities could be trained with an amount of compute below any threshold that has been proposed. What’s more, the “danger” of a model no longer correlates directly with its training compute; the amount of compute used at inference time will now be a crucial factor in determining a model’s capabilities.
To police inference, government would need to surveil, or require model developers or cloud computing providers to surveil, all use of AI language models. Set aside the practical problems with doing this. Set aside, even, the drastic privacy concerns any reasonable person would have. Think, instead, about the way such a policy would drive much of AI use into the shadows, the way it would incentivize the creation of an AI infrastructure beyond the reach of the state. How safe does that sound to you?
In short, the viability of regulating only large models is now unclear, yet the tradeoffs of regulating a broader range of models remain. This was all relatively easy to anticipate in broad strokes (if not in the specifics), which is why I have generally advised against regulating AI models themselves and in favor of regulating people’s conduct with AI since I began writing on this subject.
Conclusion
The next phase of AI development won’t be pure chaos. Doing what OpenAI has done here is hard, and it will take time for others to replicate. Language model scaling will continue, and these large training runs will be legible to the state. GPT-5 will come out, and it will dwarf GPT-4 in both size and capabilities. Then, the reinforcement learning-based o1 approach will be applied to models of that scale, and yet another new level of capability will be unlocked. All of this will happen in the coming months, and it will probably appear that OpenAI, Anthropic, DeepMind, and Meta—the companies most familiar to and cozy with our government—have an insurmountable lead. And they may. They will probably continue to ship the most capable, accessible, and cost-effective models.
All the while, though, researchers all over the world will learn to extract new capabilities from both existing and next-generation models using approaches like what OpenAI has demonstrated with o1. Some of those researchers will be illegible to the American government. OpenAI’s breakthrough relies on a reinforcement learning-based method; this method will become publicly understood in due time. Indeed, many researchers are working on it, including ones in China. I expect that a Chinese company will produce a similar model within a few months, and perhaps sooner.
If your policy framework relies on the idea that only the largest models would have the potential for danger, and that we can keep advanced models out of our adversaries’ hands, I’m afraid your framework is unlikely to work.
We—analysts, policymakers, and the broader public—need to accept that these “extraordinary aliens” really are extraordinary, are here to stay, and that they are unlikely to remain within our “guardrails.” Instead, we need to build capabilities to make our society robust to the unique risks AI may pose. As I have written before, “those capabilities can include technical standards for AI, a coherent way to reason about AI liability, public computing infrastructure, digital public infrastructure for combatting deepfakes… and much, much else.” To this list I would add new technical protocols, governance mechanisms, and other approaches to make AI more legible to the state without invading user privacy or placing burdensome regulations on AI developers (stay tuned).
You should expect the pace of progress in AI to pick up yet again. You should not expect an impending “AI winter.” You should expect the coming years to be astonishing. You should not, necessarily, expect them to be wholly pleasant. You should expect the state’s grasp on AI to become even more tenuous. You should not expect everything to proceed in an orderly fashion.
We are in a new era, thanks to these aliens of extraordinary ability.
It would be funny, if it weren't so problematic, how quickly (and predictably) the compute thresholds in AI policy have become obsolete.
I don't expect an "AI winter" in which everything disappears. But we might get a "dot-com bust": at first it looks like everything is crashing, and in fact 99% of the companies are worthless... but there is also a Google and an Amazon still alive among the wreckage.