Quick Hits
A new AI music generation model called Udio was released this week. It is a notable improvement over previous models of this kind, and I recommend checking this one out in particular.
Researchers have enhanced DeepMind’s AlphaGeometry system with a classical geometric theorem-proving algorithm called Wu’s Method. The original AlphaGeometry, released in January of this year, solved 25 of 30 problems on a benchmark of International Mathematical Olympiad geometry questions, just short of human gold medalists in this competition for the world’s best high school math students. With Wu’s Method added, the system scores 27/30, the first time an AI system has bested human gold medalists.
Apple unveiled Ferret-UI, a multimodal language model intended to understand smartphone user interface screens (I wonder what the application could be). Apple has been releasing a lot of interesting research in recent months, almost all of which points to their efforts to make AI assistants run locally on smartphones, laptops, and VR headsets. This has substantial implications for energy use (no data center required), privacy (no need to send prompts to a third-party provider), and policy (many proposed regulatory schemes would effectively require AI developers to monitor their users’ activity and preserve the ability to shut it down, which would probably be impossible, or at least much thornier, if the model were running locally). Local AI can only be so powerful compared to the frontier models, however, so Apple is also rumored to be considering a partnership with Google to use Gemini.
Not AI, but it is apparently now possible to arrest tooth decay, perhaps permanently, using a CRISPR-modified bacterium. It’s being marketed as Lumina and is available for purchase. Because it’s marketed as a probiotic supplement, it is subject to far less FDA scrutiny than a typical therapeutic. With CRISPR genome editing of bacteria (and soon, creation from scratch of novel forms of bacteria using AI) becoming more common, I wonder how long the FDA will allow this dynamic to persist.
The Main Idea
When generative AI systems hit the mainstream—and shocked the public—starting in 2022, the federal government was caught off guard. Policymakers felt a need to get a handle on AI, and to do that, they had to make AI ‘legible’ to the state.
The concept of state ‘legibility’ comes from James C. Scott’s classic book, Seeing Like a State. The basic idea is that, in attempting to manage a variegated, shifting, and fractally complex reality, states tend to impose simplistic measures from the top down. For example, in postrevolutionary France, rulers were faced with a perplexing array of agricultural land-use practices and units of measure, each of which had developed for complex historical, geographic, and economic reasons. The state, however, sought to unify this mess by codifying all trade under standardized units of measure. This is the origin of the kilogram, among other common units.
Scott’s point is not that legibility is good or bad—it’s that it happens, and that one cannot understand statecraft without understanding the process by which a state makes reality legible to itself.
Very quickly, western policymakers converged on so-called “compute thresholds” as a means of making AI administratively legible. The idea is straightforward: AI models are trained using large amounts of compute, and by creating thresholds based on that amount, the state can effectively regulate or monitor the more powerful models while taking a more laissez-faire approach to the less powerful ones. Specifically, most enacted and proposed policies use floating-point operations (flops), essentially the number of mathematical operations, like multiplication or addition, required to train the model.
The Biden Executive Order on AI imposes a reporting threshold for general-purpose foundation models like GPT-4 that require more than 10^26 (one hundred septillion) flops to train, and for models trained on biological sequence data such as DNA or RNA that require more than 10^23 (one hundred sextillion) flops. The European Union’s AI Act triggers an additional level of regulatory scrutiny at 10^25 flops. Other proposed bills at the state and federal level use thresholds in this ballpark.
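To make these numbers a bit more concrete, here is a rough back-of-the-envelope sketch using the common approximation that training a dense transformer takes about 6 × parameters × training tokens flops. The model sizes and token counts below are illustrative assumptions, not figures any lab has disclosed:

```python
# Back-of-the-envelope check of hypothetical training runs against the
# compute thresholds discussed above. The 6 * N * D rule of thumb
# (N = parameters, D = training tokens) is a common approximation for
# dense transformers; the runs listed are illustrative, not real disclosures.

EO_THRESHOLD = 1e26   # Executive Order reporting threshold (general-purpose models)
EU_THRESHOLD = 1e25   # EU AI Act threshold for additional scrutiny

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training flops for a dense transformer."""
    return 6 * params * tokens

hypothetical_runs = {
    "70B params, 15T tokens": training_flops(70e9, 15e12),
    "400B params, 30T tokens": training_flops(400e9, 30e12),
    "1.8T params, 30T tokens": training_flops(1.8e12, 30e12),
}

for name, flops in hypothetical_runs.items():
    print(f"{name}: {flops:.2e} flops | "
          f"EU threshold: {'over' if flops > EU_THRESHOLD else 'under'} | "
          f"EO threshold: {'over' if flops > EO_THRESHOLD else 'under'}")
```

Under those assumptions, a 70-billion-parameter model trained on 15 trillion tokens lands under both thresholds, while only the largest hypothetical run clears the Executive Order’s reporting line.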
The problem is that, as it turns out, compute thresholds may not be a durably useful means of making AI legible to policymakers.
The fundamental problem with compute thresholds is that, even absent government regulation and all else being equal, compute is a cost for AI developers. If they can achieve the same results with less compute, they will be heavily incentivized to do so. Some laws trigger additional levels of regulatory scrutiny using metrics like a company’s size. Dodd-Frank, for example, regulates banks differently depending on the amount of assets they hold. While this creates numerous incentive problems that economists have thoroughly documented, in the broadest sense, companies do want to grow. Firms do not, in general, try to become smaller.
Compute, however, is more complicated: It is a cost center. Businesses tend to like to reduce those. Yes, frontier model developers aim to use massive scale, but in general, even they are incentivized to make their models more efficient (and hence use less compute). Efficiency is even more important for medium and small-sized AI firms. Compute thresholds, therefore, swim against the current of AI development, and in fact make the current stronger: the existence of compute thresholds is yet another incentive to find ways to use less compute to avoid additional regulatory scrutiny.
Compute thresholds are thus being squeezed from both the high and low ends of the AI industry. Smaller or compute-limited players, a group that includes most startups, nearly all of global academia, and, because of US-led export controls, most Chinese companies and researchers, have an obvious incentive to use their compute more efficiently. Such improvements arrive weekly, and many are freely available as arXiv preprints. New model architectures (especially state space models such as Mamba) are showing substantial empirical performance gains in both model training and inference.
Within the context of the more traditional transformer architecture (upon which almost all language models are based), clever tricks, such as the recently proposed Sophia pre-training optimization method, have been said to reduce global compute usage by as much as 10%. Smaller tricks are proposed on an almost weekly basis: some of these ideas don’t replicate in other settings, and others are not appropriate for most models, but those that do hold up can compound.
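To see why those small tricks matter, note that independent savings compound multiplicatively. The specific percentages below are made up purely for illustration:

```python
# Illustration of how independent efficiency gains compound. The individual
# savings are hypothetical; the point is the multiplicative arithmetic.

individual_savings = [0.10, 0.15, 0.08, 0.12, 0.05]  # e.g., optimizer, data, kernel tricks

remaining = 1.0
for s in individual_savings:
    remaining *= (1 - s)

print(f"Compute remaining after all gains: {remaining:.1%}")
print(f"Total reduction: {1 - remaining:.1%}")
# Five gains of 5-15% each already cut total compute by roughly 40%.
```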
Tricks like these (and many, many others) are how Cohere, an AI company that makes language models aimed at businesses, was able to deliver an open-source model that outperforms the original version of GPT-4 from a year ago with roughly 1/18th as many parameters. That’s the kind of efficiency gain that is possible in just one year in this fast-moving field.
And it gets even tougher. Sakana AI, a Japanese AI research company, released a novel approach to merging open-source models using evolutionary algorithms. The researchers showed that the resulting merged model performed better on benchmarks than its constituent models, meaning something about the Frankensteinian operation improved the overall model. Because this method does not require training any model from scratch, it consumes very little compute. There are thousands of open-source models on HuggingFace, so the possibilities here are limitless. We don’t yet know how well this approach generalizes, but if it does, it could be a game-changer.
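For intuition, here is a minimal sketch of the simplest ingredient of weight-space merging: a weighted average of two checkpoints that share an architecture. Sakana’s actual method is far more sophisticated (it evolves layer-wise merge recipes and data-flow paths and selects them by benchmark score), and the model names below are placeholders, but the key point carries over: no gradient training is involved, so the compute cost is trivial.

```python
# Minimal sketch of weight-space model merging: a weighted average of two
# checkpoints with the same architecture. This is NOT Sakana AI's algorithm,
# just the simplest building block of the idea. Model names are placeholders.

import torch
from transformers import AutoModelForCausalLM

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two compatible state dicts (float tensors only)."""
    merged = {}
    for k, v in sd_a.items():
        if torch.is_floating_point(v):
            merged[k] = alpha * v + (1 - alpha) * sd_b[k]
        else:
            merged[k] = v  # keep non-float buffers as-is
    return merged

model_a = AutoModelForCausalLM.from_pretrained("org/base-model")       # placeholder name
model_b = AutoModelForCausalLM.from_pretrained("org/finetuned-model")  # placeholder name

merged_weights = merge_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha=0.5)
model_a.load_state_dict(merged_weights)
model_a.save_pretrained("merged-model")
```

In the evolutionary version, the mixing coefficients (and which layers come from which parent) are mutated and selected based on benchmark scores rather than fixed by hand.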
At the high end, companies like OpenAI and Google have every incentive to increase efficiency, because even tiny gains can amount to millions of dollars saved. It is rumored that the current versions of GPT-4 and GPT-3.5 that OpenAI serves to customers are about 80-90% smaller than they were when they first shipped. These companies don’t always (or even usually, at this point) share their efficiency gains with the broader world, but the knowledge of how to do so exists within employees’ minds, and knowledge tends to spread over time.
The larger players also have so much compute that they need to innovate to find ways to use it all efficiently. Because Google, Anthropic (via Amazon), and OpenAI (via Microsoft) have many data centers across the country and the world at their disposal, they are seeking ways to train models across multiple physical locations—no easy feat. DeepMind recently published an approach called distributed path composition (DiPaCo), which pushes in this direction. While this benefits larger players, it could ultimately be used by smaller actors to distribute model training—for example, academic researchers at different institutions could pool their respective universities’ compute clusters to effectively double (or more) the amount of compute at their disposal.
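DiPaCo itself composes a model from shared modules along distinct “paths,” but the general flavor of low-communication, multi-site training is easy to illustrate with a local-SGD-style toy: each site trains its own replica for many local steps on its own data, and the sites only occasionally average their parameters. The sketch below is a simplified simulation of that general idea on synthetic data, not DeepMind’s algorithm:

```python
# Toy simulation of low-communication multi-site training: each "site" trains
# its own replica for many local steps, then parameters are averaged across
# sites. A simplified local-SGD-style sketch, not DeepMind's DiPaCo method.

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
NUM_SITES, LOCAL_STEPS, ROUNDS = 2, 50, 5

def make_model():
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

global_model = make_model()
site_data = [(torch.randn(256, 10), torch.randn(256, 1)) for _ in range(NUM_SITES)]

for rnd in range(ROUNDS):
    site_models = []
    for x, y in site_data:
        # Each site starts from the latest global weights and trains locally.
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=0.01)
        for _ in range(LOCAL_STEPS):
            opt.zero_grad()
            nn.functional.mse_loss(local(x), y).backward()
            opt.step()
        site_models.append(local.state_dict())
    # Infrequent communication: average parameters across sites once per round.
    avg = {k: sum(sd[k] for sd in site_models) / NUM_SITES for k in site_models[0]}
    global_model.load_state_dict(avg)
    print(f"round {rnd}: synced global model across {NUM_SITES} sites")
```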
And remember that these efficiency gains occur as the cost of compute decreases over time. Nvidia’s latest Blackwell architecture packs as much computing power into one rack of servers as existed in a state-of-the-art, building-sized supercomputer unveiled by the US government in 2022. Today’s “frontier compute” is tomorrow’s old news. Thus any compute threshold set for frontier AI today will cover a much broader range of the industry in just a few short years.
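To put rough numbers on it: if the cost of a fixed amount of training compute falls by about half every two years (an assumed rate for illustration; the real figure varies), a run that sits right at today’s reporting threshold becomes more than an order of magnitude cheaper within eight years:

```python
# Rough illustration of how a fixed flop threshold covers more of the industry
# over time as compute gets cheaper. The starting cost and the halving period
# are assumptions for illustration, not measured figures.

THRESHOLD_FLOPS = 1e26
COST_TODAY = 100e6          # assume ~$100M to train at the threshold today
HALVING_PERIOD_YEARS = 2.0  # assumed rate of decline in cost per flop

for years in range(0, 11, 2):
    cost = COST_TODAY * 0.5 ** (years / HALVING_PERIOD_YEARS)
    print(f"year +{years}: ~${cost / 1e6:.0f}M to train a 1e26-flop model")
```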
In the final analysis, then, there are two paths that regulating via compute thresholds could take. One would be to gradually raise the compute threshold to match the frontier; this would effectively mean that only the largest domestic firms are subject to any kind of regulation. Because the same level of capability will be reached by smaller players using substantially less compute in short order (if recent history is any guide, at least), this does little to ensure safety. We know that the largest companies take AI safety seriously and perform substantial red-teaming on their models, so it’s not clear how much value a compute threshold that effectively applies only to those firms would create.
The other path would be to keep the threshold where it is, around 10^26 flops, effectively declaring that any model trained with more mathematical operations than that is inherently dangerous. This will create incentives for continued efficiency improvements, but it will ultimately cause much of the AI industry to be subject to government oversight. Given the lack of manifest harms from advanced AI systems so far and the substantial burdens this would impose on a geostrategically and economically crucial industry, this approach comes with huge costs.
A better approach would be to ditch compute thresholds as a trigger for model regulation altogether. Reporting requirements could still be used at ‘frontier compute’ levels to give policymakers some peace of mind, but ultimately, we need a smarter and more durable way to manage AI risk.
To do this, policymakers should:
Start thinking seriously about how AI fits into existing law and make amendments to that law as appropriate;
Accelerate the creation of voluntary technical standards for model safety, reliability, and performance evaluation (via NIST for general-purpose systems or agencies like the FDA for industry-specific systems); by the way, regulatory pre-approval of models could very well slow down this process, because open science and regulatory regimes tend not to be so compatible;
Create standards and best practices for model red-teaming and safety evaluation (led by an agency like CISA).
Many AI developers want to do the right thing and are seeking guidance and resources for how best to achieve that goal. The government can play an active role in giving them the tools they need to do their work as well as possible, and what’s more, it can do so without subjecting a general-purpose technology to a sweeping regulatory regime.