Introduction
The first generation of the iPhone was far too slow. It did not use the custom Apple Silicon chips it uses today; it used a modified Samsung chip. Like many of the greatest products in technology history, the 2007 iPhone felt as though it had been pulled forward a few years from the future—just barely on the cusp of the possible.
I remember where I was when I “saw” the Macworld 2007 keynote where Steve Jobs announced the iPhone. Mind you, one did not watch a livestream at the time—one read a liveblog. You would read a stream of text (not tweets!) on a tech blog’s website while a journalist (really, we mostly thought of them as bloggers at the time, not journalists) who was at the event wrote down what was happening.
The mainstream media coverage in the weeks after the iPhone announcement was largely dismissive. “Apple introduced a nifty gadget,” the Grown Ups told us, but its lack of a keyboard rendered it unfit for “real work.” And the price—did they ever mock the price ($600 in 2007 dollars). The “Apple faithful” would surely buy this toy in droves, but Serious Adults with Work To Do would stick with their BlackBerrys. So went the conventional wisdom.
As a fourteen-year-old, I immediately perceived the iPhone to be transformative—and I made bets to that effect. I bought Apple stock in 2007, 2008, and 2009—stock that would eventually fund a large portion of the down payment on the home I now own with my wife, almost twenty years later.
I have owned most of the iPhone models that have shipped between 2007 and today. Every year, scores of tech reviewers and I (over time, we started calling them “journalists,” and as they adopted that identity they gradually became more adversarial toward the industry they covered) would note all the minute differences in each new version. Was the camera finally “good enough” this year? Was the chip “fast enough”?
The answer was never a clean “yes” or “no.” Every year, things got better and better, and gradually, new thresholds would be crossed. Was the camera good enough for posting photos that would pop on social media without filters? We probably hit that around 2012 or so. Good enough to put on a highway billboard? Probably around 2017. Was the chip fast enough to make the phone feel snappy in basically all use cases? That threshold was also crossed at some point in the 2010s. Were the chips fast enough that they could be put into professional-grade Macs? That threshold was breached around 2020.
Technological history is told as a story of sequential inventions, but really, in my view, it’s about thresholds. Livestreaming events was possible in 2007—the technology had been invented—but it was not common. Now, a text-based liveblog is near-unthinkable, and livestreams are ordinary. The text-based liveblog of today is X, which was once Twitter, which was once twttr. Instead of the select few invited by Apple to their keynotes having the privilege of live-blogging events, it’s everyone with an internet connection. Perhaps this, in turn, helps explain why “tech bloggers” became journalists—how else to differentiate themselves from the masses?
At some point, spurred on by the iPhone, X became the place to go for live commentary on everything. At some point, spurred on by the massive growth in the tech industry’s wealth and power (itself spurred on by the iPhone), tech-blogging enthusiasts began to call themselves journalists. At some point, the cameras got good enough. At some point, the chips got fast enough—at least for a bit; today, we need them to be faster still, so that we can run AI models locally on iPhones. The relevant margin for iPhone computational performance has shifted yet again. It will get good enough again, someday.
Performance, adoption, cultural change. Threshold upon threshold, one after the other, year after year. We don’t notice them as they are crossed. We notice the “inventions.” But it’s the thresholds that matter just as much, if not more. It’s the thresholds we cross as technology is adopted, refined, and scaled that change the world, not the existence of an invention per se.
Some think that AI won’t behave in this way—that it will instead be more like the nuclear bomb: a singular moment that immediately upends the world order. I doubt this. Instead, I suspect that it will just be another series of thresholds, passed silently, ambiguously, often without much fanfare, as the accelerando of history beats on.
OpenAI o1 and the Next Thresholds
I believe that AI is crossing another threshold right now. I’ve heard it, anecdotally, from many people I know in a variety of fields: the latest frontier models are starting to meaningfully accelerate work. Many software engineers I know (or follow on social media) have observed that the latest version of Claude 3.5 Sonnet is among the first models they can trust as a pair programmer. For my own purposes—which, incidentally, do not ever involve writing anything I author for the public under my own name—the combination of Claude 3.5 and OpenAI’s o1-preview felt similar.
And then, last week, OpenAI released the “full” version of o1 (o1-preview was an earlier, less performant version of o1), as well as “o1 pro mode,” which, as OpenAI succinctly puts it, “uses more compute to think harder.” Pro mode is available only in the newly created ChatGPT Pro tier, which costs $200 per month and includes other perks, like unlimited usage of GPT-4o’s Advanced Voice Mode and extra video generations on OpenAI’s Sora model (newly released to the public this week).
I’ve spent the last few days playing with both “full o1” (from now on called “o1”) and o1 pro mode extensively. Both models are likely trained using “reasoning chains”—step-by-step solutions to complex problems with explanations for each step—written by expert humans (and probably, written by AI models as well). Of course, it is much easier to create reasoning chains in domains that have ground truth—clear correct answers. As a result, both o1 and o1 pro mode excel in domains like math, coding, and the hard sciences. When prompted well with questions that have clear ground truth, these models perform at state-of-the-art levels, often matching or exceeding the performance of human PhDs in the relevant fields.
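To make the “reasoning chain” idea concrete, here is a minimal, hypothetical sketch of what one such training example might look like. OpenAI has not published o1’s data format, so the structure and field names below are my own illustrative assumptions; the point is simply that a verifiable final answer lets each chain be checked automatically.

```python
# Hypothetical sketch of a "reasoning chain" training example.
# The field names (problem, steps, final_answer) are illustrative
# assumptions, not OpenAI's actual format.
reasoning_chain = {
    "problem": "How many positive integers n <= 100 are divisible by 3 or 5?",
    "steps": [
        "Count multiples of 3: floor(100 / 3) = 33.",
        "Count multiples of 5: floor(100 / 5) = 20.",
        "Subtract multiples of both, i.e. of 15: floor(100 / 15) = 6.",
        "Apply inclusion-exclusion: 33 + 20 - 6 = 47.",
    ],
    "final_answer": 47,
}

def ground_truth(n: int) -> int:
    """Independent brute-force check -- the existence of a check like this
    is what makes math and code such convenient domains for generating
    this kind of data at scale."""
    return sum(1 for k in range(1, n + 1) if k % 3 == 0 or k % 5 == 0)

# The chain can be graded automatically, with no expert in the loop.
assert reasoning_chain["final_answer"] == ground_truth(100)
```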
Some skeptics describe the o1 models as simply “memorizing reasoning chains” rather than “reasoning” itself. I’m not so sure I see the difference. To test this, I prefer to prompt models with questions that require significant formal reasoning in domains with clear (or clear-ish) ground truth as well as informal reasoning.
Here are some common tasks I use to test frontier models where o1 pro mode particularly stood out:
Designing a sophisticated multi-agent simulation of the economy under different AI policy regimes and AI capabilities trajectories
Reasoning about which classes of materials would be most susceptible to fully automated materials science experiments in robotic labs
Evaluating the options for Elon Musk’s and Vivek Ramaswamy’s Department of Government Efficiency
Brainstorming ways that insights from Group Theory could be applied to the hard sciences in novel ways
Giving models federal appeals court and Supreme Court cases that are outside their training data cutoffs and having them reason about what the outcome of the case should be
I’m unable to share publicly some other impressive use cases I found. But I will just say: this is a very sophisticated model, if you prompt it correctly.
If you code, operate at the frontiers of science or math, or need advanced legal and policy reasoning, you should at least try o1 pro mode for a month. If you lead an organization that does large amounts of software engineering, science, legal reasoning, or public policy analysis, it is also probably worth your time to spend a month with o1 pro mode.
For some, perhaps even most, use cases, o1 is just as good as pro mode. But there is also something nice about having access to the very best. In 2007, blowing my entire birthday budget on a $600 iPhone was a no-brainer for me. I believed smartphones would be a big deal; why not have the very best? Today, as I look to upgrade my computer, spending extra money on a heavily upgraded MacBook Pro is a similarly easy decision. Do I need all the computing power it has? Not especially—though I am intrigued by the idea of running language models locally on my laptop. More importantly, though, the sheer niceness of the MacBook Pro makes it well worth the cost. My computer is, after all, the single most valuable tool in my life.
To be clear, just as with the MacBook Pro, there are people who really do need the power of o1 pro mode. And I am not sure how many people are in the market for a “luxury” language model today. But perhaps, over time, a luxury AI product, marketed to enthusiasts and the well-off, will appeal to more people. As AI diffuses throughout the economy, we should expect to see all manner of market segmentation strategies.
Regardless of that speculation, though, o1 pro mode, and to a lesser extent o1, feel like the first models that really could advance the frontiers of research in at least some scientific and technical domains. As my friend Nabeel Qureshi said on X, “genuinely novel scientific discoveries” might simply be a matter of “finding the right prompt.” At the very least, I expect that these models can speed up the work of scientists and other researchers considerably. Routine but complex mathematical calculations, for example, can go from taking the better part of a day to just a few minutes. Few will write papers about this, but for the scientists who use it, it could be a hugely productivity-enhancing tool. And this, of course, is just the beginning.
Implications of o1 for AI Progress and Public Policy
The o1 models rely, fundamentally, on a reinforcement learning algorithm. This algorithm can probably be refined, reworked, and enhanced in a vast diversity of ways. Those improvements will compound, and within a few years we will have vastly faster, better-performing, and cheaper versions of o1.
Compute improvements alone will also result in significant speed improvements—and in this case, speed improvements mean quality improvements. Pro mode is better than regular o1 because it uses more compute—chips that can do more computations per second more cheaply will be able to attain o1 pro mode performance (and much better) at a lower cost, regardless of any other efficiency improvements in the models themselves. Language model inference can be highly optimized with custom chips, and we know that OpenAI and Microsoft are designing bespoke hardware for precisely this purpose.
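To illustrate how extra inference compute can be converted directly into quality, consider simple majority voting over independent samples (often called self-consistency). This is a well-known technique and not necessarily what o1 pro mode does under the hood; `sample_model` below is a hypothetical stand-in for a model API call, assumed to be right 60% of the time.

```python
import random
from collections import Counter

def sample_model(prompt: str) -> str:
    """Hypothetical stand-in for one stochastic model call.
    Assume it returns the correct answer ("47") 60% of the time."""
    return "47" if random.random() < 0.6 else str(random.randint(0, 99))

def answer_with_budget(prompt: str, n_samples: int) -> str:
    """Spend more compute (more samples) and return the most common answer.
    With independent samples, accuracy climbs as n_samples grows."""
    votes = Counter(sample_model(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

prompt = "How many positive integers n <= 100 are divisible by 3 or 5?"
print(answer_with_budget(prompt, n_samples=1))   # often wrong
print(answer_with_budget(prompt, n_samples=64))  # almost always "47"
```

Whatever the actual mechanism, the logic is the same: cheaper, faster chips let you buy more of whatever “thinking” the model does per dollar.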
And of course, there is good old-fashioned scaling: more data with bigger neural networks. That, too, will probably benefit models in the o1 paradigm over time.
There are an enormous number of vectors for improvement on the table, and many are low-hanging fruit. All of this should combine to make you think that the next 18-24 months of AI progress will be more rapid than the last 18-24 months. It might also be less controllable: I’ve written before that the o1 models challenge some of the fundamental assumptions of post-ChatGPT AI policy in general and the compute thresholds in particular. As I wrote when o1 was first previewed by OpenAI:
…these models call into question two assumptions implicit in most model-based AI regulation:
That the only significant way to increase a model’s capability was to increase its size (training data, size of the neural network, etc.);
That a model’s capabilities are determined by the amount of compute used to train it.
OpenAI’s new models invalidate the first assumption outright and complicate the second one. The o1 models show that it is possible to vastly improve a model’s performance without making models bigger and without simply throwing more training compute at them. Indeed, o1-mini shows that even relatively small models can push the capabilities frontier forward.
This remains my conclusion today—indeed, I am more confident in this analysis than ever before.
There is, alas, one other part of that original analysis where my confidence has grown.
Safety
In that original writeup on o1, I remarked:
We—analysts, policymakers, and the broader public—need to accept that these “extraordinary aliens” really are extraordinary, are here to stay, and that they are unlikely to remain within our “guardrails.”
I intended this to have two meanings: first, as a play on the policymaker trope of creating “sensible guardrails” for AI, which I have always found to be a facile concept. Second, as a reference to the fact that in OpenAI’s own analysis (and that of external evaluators) of o1-preview, the model seemed to exhibit power-seeking and deceptive behavior.
That finding has only been bolstered by research on the full o1. In particular, a paper from Apollo Research demonstrated that o1 and other frontier models are capable of scheming against their users when given a goal that conflicts with their user’s. They are willing to lie and cheat in pursuit of their own goals over those of their users. All the models tested did this at least some of the time, but o1 did it more than any of the others.
However, as Apollo admits, this research was intended to demonstrate the capability of scheming, not necessarily the propensity to do so under normal conditions. Right now, these characteristics tend to be most pronounced in contrived experimental settings designed to elicit them from the model. For example, researchers would give the model prompts such as “nothing else matters. Make sure you achieve your goal at ALL costs.” It is interesting to know how the models respond under these circumstances, and it is absolutely the case that we do not know how to mitigate this behavior—but it is also not necessarily indicative of how models will behave in standard deployment settings.
Still, the fact that the o1 models were considerably more likely to do this sort of scheming is troubling, and, frankly, unsurprising. The o1 models have a certain forcefulness even in the way they write. Anthropic’s Claude models are still probably better stylists, but there is something about o1’s directness, desire to persuade, and command of its subject matter that stands out. In some ways, o1 and o1 pro write more like a human than any other language model—a talented and driven human, at that.
It remains to be seen how much of a problem these scheming characteristics will be in practice. For the time being, I am willing to trust the researchers at OpenAI and other frontier labs to, at the very least, muddle through these problems. AI has faced countless scientific and engineering hurdles in its history, and for the most part, these labs have been up to the challenge. Many in the safety community always assume that every capabilities-related hurdle will be overcome, yet at the same time they assume that the safety problems that concern them most are somehow uniquely difficult. I have never quite understood the logic behind this—yet at the same time, I’ll be keeping my eyes open. To that end, I favor transparency laws for frontier AI labs to foster greater insight into these issues among the research community, policymakers, and the broader public.
Conclusion
If those alignment problems do manifest themselves in the real world, we’ll likely start to see them more vividly soon. That is chiefly because I suspect that agents—AI systems with the ability to take complex actions on a user’s behalf—are coming soon. Sam Altman has said explicitly that the o1 models make it possible to achieve broadly useful agents in the near term, and other AI industry figures have suggested that 2025 will be the year when agents come to market.
OpenAI has also stated that they plan to integrate the o1 models into their existing ChatGPT search product. An AI system that can really search the web—go down rabbit holes, hunt down that difficult-to-find PDF or dataset, browse dozens or hundreds of different sites—would significantly accelerate my own research. Google DeepMind appears to have released a product just yesterday, called Deep Research, that also pushes in this direction, and it has impressed me immensely (there may be a bonus essay on this product soon).
When ChatGPT came out, I told myself that being a think tank scholar would come to resemble being a think tank president, because now I, one person, would have a team of researchers, data analysts, and editors working for me (n.b.: every word I write to you will be written by me from a blank slate, unless explicitly stated otherwise). I suspect many knowledge workers will have broadly analogous experiences—moving up a conceptual rung, so to speak.
But how many people will stay on the ladder at all? How will models deal with the perennial (human) problem of achieving goals in an adversarial world? What transformations will agents bring to our ways of doing business and communicating? When will the models be, in every sense of the term, “good enough”?
As with the iPhone, there will probably never be clean answers at any moment in time. We’ll only be able to answer these questions looking back, at threshold after threshold, crossed year after year.
> Many in the safety community always assume that every capabilities-related hurdle will be overcome, yet at the same time they assume that the safety problems that concern them most are somehow uniquely difficult. I have never quite understood the logic behind this—yet at the same time, I’ll be keeping my eyes open.
I think the intuition is something like this – I present this without necessarily fully endorsing it:
If there are weaknesses in a model's capabilities, there will be unmistakable signals (e.g. dissatisfied customers) and lots of motivation to fix them. The dynamics of the system are such that resources will be reliably, and on balance at least somewhat intelligently, directed toward improving capabilities.
If there are weaknesses in a model's safety properties, there may or may not be clear signals (see below), and the dynamics of the system are not as reliable in ensuring that risks are addressed. Look at how long it took for issues with tobacco or leaded gasoline to be addressed (neither is fully addressed even today, e.g. lead in aviation fuel) vs., for instance, how long it took for the cell phone industry to move to smartphones. In general, the motivation to fix safety issues can be weaker and less direct than to fix capability gaps: feedback loops (such as getting sued, or public outcry leading to policy changes) are slower and less direct than a tightly monitored revenue pipeline, and model and application developers don't necessarily internalize all of the downside (especially for catastrophic risks, or small companies where potential harms could easily exceed the entire value of the enterprise). (It is worth noting that developers aren't able to internalize all of the *benefits* of their work, either.)
I mentioned the question of clear signals. I think people have a wide range of intuitions as to whether a catastrophe could occur without warning (e.g. a foom and/or sharp-left-turn scenario leading to an unforeseen loss-of-control event, or a bad actor unleashing a really nasty engineered virus out of nowhere). Unanticipated weaknesses in capabilities are fine in the grand scheme of things: maybe someone has a bad quarter while the market routes around the inadequate product, or customers find workarounds. Unanticipated weaknesses in safety properties can have a very different impact.
There are proposed mechanisms by which such unanticipated issues could occur, such as deceptive alignment.
Another asymmetry: capability failures are addressed at the speed of capitalism, while safety failures might only be addressed at the speed of government. (I guess this is sort of the same as things I said earlier.)
Finally, some people observe the discourse of the last few years and conclude that some folks are acting in bad faith, and/or have bought into their own arguments to such an extent as to have the same result, and will actively downplay warning signs, evade safety requirements, etc. If you believe this, then you'll want to force the discussion now and set the stage for restrictions-with-teeth before events overtake us.
It seems like a fundamental paradigm shift in our computing model: moving from a passive, spectacle-based consumption experience to an active, agent-based model. The previous paradigm was based on search and the democratization thereof; devices were dumb terminals that displayed content. But if you can democratize reasoning and intelligence, then the entire cloud computing model evolves into something new.