Decentralized Training and the Fall of Compute Thresholds
Will decentralized training break compute thresholds for good?
Introduction
I’ve been skeptical of regulation based on “compute thresholds” since the idea first entered mainstream policy discussion. The threshold-based approach taken by the Biden Administration in its Executive Order on AI struck me as a plausible first draft, particularly given that the basic intention behind it was “make sure OpenAI, DeepMind, Anthropic, and Meta talk to the federal government about their AI development and safety plans.” Compute thresholds are fine for that contingent and transitory need. But for anything more robust, and especially for any regulation that imposes actual or potential costs on model developers, I concluded long ago that compute thresholds would probably be ineffective.
A few weeks ago, I wrote that OpenAI’s new o1 models challenge compute thresholds with the inference-based scaling paradigm (if you don’t understand this, I’ll explain more below). But there’s yet another new paradigm in AI development that could complicate compute thresholds as a governance mechanism: decentralized training. Let’s take a look.
The Problem with Compute Thresholds
First, to step back a bit, my reasoning for disfavoring compute thresholds is straightforward: compute declines rapidly in cost, so a fixed threshold will cover a wider and wider swath of the industry over time. This, in turn, means that compute thresholds will have to be raised gradually to avoid placing burdensome regulations on startups and other small actors.
But if you raise the compute thresholds, you encounter a different problem: because compute is a major cost center, AI developers learn to wield it more efficiently over time. Thus, the capabilities you see in a large model today will appear in a very small model in the near future. To illustrate this, consider that GPT-4 launched in March 2023 as a 1.8-trillion-parameter mixture-of-experts model; 18 months later, there are 8-billion-parameter models that match the original GPT-4 on key benchmarks.
On top of that, compute thresholds may end up being brittle in the face of new development or deployment paradigms. I argue that we’ve already seen this with OpenAI’s o1 models, which are trained using reinforcement learning to reason about user prompts. If you give the model more time to reason (“test-time compute”), it seems to perform better—though we do not have anything like a full picture of how efficiently or far this approach scales.
The o1 methodology, which will surely be replicated by other American labs, foreign AI developers, and the open-source community in due time, means that smaller models can be even more capable. To wit: it seems likely that neither o1-preview nor o1-mini would have technically been considered a covered model by California’s now-vetoed SB 1047, whose explicit intention was to regulate next-generation frontier models. SB 1047 was therefore becoming outdated before it even became law. Double exponentials, as Sam Altman has said, “get away from you fast.”
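To make the arithmetic behind that concern concrete, here is a minimal sketch. Every number in it is a hypothetical placeholder chosen only for illustration, not an estimate of any real model's budget; the point is simply that a threshold denominated in training FLOPs never sees the compute a model consumes at inference time.

```python
# Illustrative only: all figures are hypothetical placeholders, not estimates
# of any real model's compute budget.

TRAINING_THRESHOLD_FLOP = 1e26   # a reporting threshold in the style of the Executive Order

training_flop = 5e25             # hypothetical training run: below the threshold
inference_flop_per_query = 1e15  # hypothetical cost of one long "reasoning" response
queries_served = 1e11            # hypothetical lifetime query volume

total_inference_flop = inference_flop_per_query * queries_served

print(f"Covered by a training-compute threshold? {training_flop >= TRAINING_THRESHOLD_FLOP}")
print(f"Lifetime inference compute: {total_inference_flop:.1e} FLOP")
# The model never trips the threshold, even though the compute it ultimately
# consumes (and the capability it expresses at inference time) keeps growing
# with every additional query and every additional second of "thinking."
```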
Decentralized Training Enters the Scene
I suspect another new paradigm, this time in training models, could similarly begin to challenge the compute threshold model of AI regulation: decentralized training.
Simply put, decentralized training is when an AI developer trains a model across multiple different data centers. This is sometimes called “distributed training,” but I think this label is imprecise: all model training over multiple GPUs is a form of “distributed computing,” even if they are in a single data center.
Semantic distinctions aside, decentralized training matters for many policy-related reasons. First, by allowing frontier AI developers to spread a large training run across multiple data center campuses, it can lessen electricity demand on an individual jurisdiction. Instead of needing ten gigawatts in one place, developers might only need two gigawatts in five places. This is still a substantial infrastructure development challenge, but it distributes the burden at least to some extent.
More importantly, for our purposes, decentralized training also challenges compute thresholds. How many ways are there to slice up 10^26 floating-point operations? Quite a few, perhaps. Indeed, if decentralized training works very well, it is conceivable that models could be trained across GPUs in thousands of discrete locations, or more. And in the most extreme scenario (which I think is unlikely, to be clear), AI training could even resemble Folding@Home, a decentralized computing project in which millions of people lent spare computing capacity on their personal devices to protein-folding simulations.
There are many benefits to be gained from decentralized training. But it could also allow developers to make their training runs less legible to the state by hiding the full scale of their compute usage. In the current market dynamic, I don’t worry terribly about OpenAI or any other leading AI company doing that. But what if some other actor did?
Imagine, for example, that we pass data center know your customer (KYC) rules. The idea here is sensible enough: government would like to know (or at least they would like data center owners to know) the identities of all the people training large-scale AI models using American computing infrastructure. If implemented, these proposals would require data centers to verify the identities and other information about people using their facilities. Fair enough.
Data center KYC requirements were part of SB 1047, and the Department of Commerce’s Bureau of Industry and Security floated a proposed rule earlier this year. In general, the authors of these proposals understand that making every data center customer go through KYC would be massively burdensome. So, they thoughtfully place limitations on what kinds of customers need to comply with the KYC process. And in general, how are those limitations defined?
You guessed it: compute thresholds. SB 1047’s data center KYC rules would have applied only “when a customer utilizes compute resources that would be sufficient to train a covered model.” The Commerce Department’s proposed rule made clear it was intended to apply to “large models with potential capabilities that could be used in malicious cyber-enabled activity,” though it did not get any more specific than that (by the way, how is the hyperscaler supposed to know the “potential capabilities” of a model before it is trained? Would the developer even necessarily know this? But I digress). At some point, Commerce will finalize this rule, and it will almost certainly become public policy. When it does, there’s a good chance it will rest on a compute-based threshold.
And in doing so, Commerce will have given anyone who wants to avoid KYC rules precise instructions for how to do so. Simply split up your training run into sufficiently small chunks (perhaps as few as 3 or 4; perhaps as many as millions), and remain below the KYC threshold at any one particular facility.
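As a back-of-the-envelope sketch of that arithmetic: the run size and trigger below are hypothetical, and a real rule might try to aggregate compute across facilities, but the basic evasion is trivial.

```python
# Hypothetical numbers, for illustration only.
total_training_flop = 1e26   # the full (hidden) training run
kyc_trigger_flop = 1e26      # per-customer trigger at any single facility

for n_facilities in (1, 2, 4, 10, 1000):
    per_facility = total_training_flop / n_facilities
    trips_kyc = per_facility >= kyc_trigger_flop
    print(f"{n_facilities:>5} facilities -> {per_facility:.1e} FLOP each; "
          f"trips the trigger: {trips_kyc}")
# Split the run at all and no single facility sees enough compute to trigger
# the rule, even though the aggregate training run is unchanged.
```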
Decentralized training is a nontrivial technical challenge, requiring novel training techniques or even novel architectures. When you train an AI model, billions or even trillions of weights are being updated in real time in response to new training data. Models have long since passed the point where they can fit on a single GPU, so they or the training data (or both) have to be split up—sharded is the technical term. Then, changes to the weights must be exchanged between the GPUs. This is a major feat at frontier scale even in a single data center. With decentralized training, developers must precisely synchronize these rapidly changing numbers across different physical locations, as well. In these circumstances, the speed of light (in fiber optic cables) becomes an important bottleneck.
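For intuition about how researchers try to get around that bottleneck, here is a toy sketch of the communication-saving idea behind much of this work, often described as local SGD: each site runs many update steps on its own shard of data and only occasionally averages weights with the other sites, so parameters cross the wide-area link once per synchronization round rather than once per gradient step. This is a simplified illustration in plain numpy, not a real training stack, and the model, data, and hyperparameters are all made up.

```python
import numpy as np

# Toy sketch of local-update training with infrequent cross-site synchronization.
# Everything is simplified for illustration: a linear model, synthetic data,
# plain SGD, and simple weight averaging in place of a real distributed optimizer.

rng = np.random.default_rng(0)
true_w = rng.normal(size=8)

def make_shard(n=512):
    """Synthetic data shard held by one 'data center'."""
    X = rng.normal(size=(n, 8))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

def local_steps(w, X, y, steps=20, lr=0.05):
    """Run several SGD steps entirely inside one site (no communication)."""
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=32)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / 32
        w = w - lr * grad
    return w

n_sites = 4
shards = [make_shard() for _ in range(n_sites)]
w_global = np.zeros(8)

for sync_round in range(10):
    # Each site trains independently between synchronization points...
    local_ws = [local_steps(w_global.copy(), X, y) for X, y in shards]
    # ...then the sites exchange and average weights, the only cross-site traffic.
    w_global = np.mean(local_ws, axis=0)

print("error vs. true weights:", np.linalg.norm(w_global - true_w))
```

The open question is how far this trade-off stretches at frontier scale: the less often the sites synchronize, the less the speed of light matters, but the further the replicas drift from one another between rounds.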
Jeff Amico, Chief Operating Officer at the AI startup Gensyn, has written an excellent technical explanation on the current state of play in the research, for those interested.
It is not obvious that the exact scenario I’ve sketched above will ever be possible. Training a model in a single facility is hard enough (read, for example, the “Reliability and Operational Challenges” section of Meta’s Llama 3.1 paper, which details, better than any other work I’ve seen, the many ways a training run can go wrong), and there are many unknowns associated with decentralized training. We don’t know whether it would be possible to split up training across data centers owned by different companies, for instance. We don’t know how many facilities a training run can be divided across, or how close those facilities will need to be in physical space (again, the speed of light is a constraint here). And we don’t know whether it would work across different types of compute hardware.
But we do know that nearly every player in the AI research field would benefit from decentralized training. Large AI companies (and their hyperscaler partners) have facilities spread across the country, and they would love to be able to pool them for a training run (not to mention the power supply constraints they’ll face in any individual jurisdiction). It should come as no surprise, then, that Google’s DeepMind has published the state-of-the-art research on this topic, or that OpenAI and Microsoft are rumored by Dylan Patel to have made major investments into decentralized training.
Startups and the open-source community, too, would gain immensely from decentralized training. Being able to share resources among themselves, as well as borrow spare capacity from data centers, could lead to lower prices and more efficient utilization of compute. The same basic thing is true of the academic community. Most universities have, at best, a few hundred GPUs that are shared among all faculty. But if these facilities could be combined, academics would have far more resources at their disposal. Unsurprisingly, these communities have pushed this research forward, as well.
And of course, on top of these incentives, the entire Chinese AI community has an additional reason to use more compute more efficiently: US export controls.
Thus, nearly all of the major players in AI research would benefit from decentralized training getting better. When virtually the entire AI community is incentivized to make progress on a particular research direction, it is usually a safe bet that progress will be made. How far that progress will go is anyone’s guess.
Conclusion
Compute thresholds were policymakers’ first stab at AI policy. And as I mentioned, compute thresholds are a perfectly reasonable approximation for “the models we expect the four largest companies in AI to train over the next couple years.” But for anything intended to endure, I suspect we are going to need something different.
SB 1047’s original threshold was entirely based on compute (the same 10^26 flop figure that comes from the Biden Executive Order). In later versions of the bill, that was changed to a figure based on a combination of compute and cost of the training run ($100 million, for foundation models). This effectively raises the compute threshold over time using a mechanism based on the market price of compute.
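To see how a dollar-denominated trigger floats upward in compute terms, here is a rough sketch. Both the starting price of compute and its rate of decline are hypothetical assumptions for illustration, not forecasts.

```python
# Hypothetical illustration: assume the price of a training FLOP halves every
# two years. A fixed $100M trigger then corresponds to ever more compute.
dollar_trigger = 100e6
flop_per_dollar_today = 1e18  # assumed starting point, chosen so that
                              # $100M buys 1e26 FLOP in year zero

for years in (0, 2, 4, 6):
    flop_per_dollar = flop_per_dollar_today * 2 ** (years / 2)
    print(f"year {years}: $100M buys {dollar_trigger * flop_per_dollar:.1e} FLOP")
# A fixed FLOP threshold captures more and more developers over the same period;
# a dollar trigger rises with the falling market price of compute.
```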
But of course, simply having a coherent threshold for regulation implies nothing about the content of the regulation itself. Even with the best possible threshold, it remains the case that model-based regulation has serious downsides.
I’ve advocated that policymakers conceptualize AI as a discovery about nature, or a fact about the world, and not as a “product” whose development, dissemination, and use is under their control by default. Indeed, I suspect the opposite: policymakers’ default assumption should be that they do not control AI. Pretending as though they do control AI, or can, or should, is likely to result in a “War on Drugs”-esque dynamic, where policymakers end up driving much of AI development into the dark. In their efforts to bring AI under their control, they end up making it less under their control. It is a tale as old as public policy.
Instead, policymakers should create the institutions, capabilities, laws, and regulatory techniques that will be necessary to maintain order in light of this challenging new fact about reality. They should contend with this new epoch of human history along with the rest of us, rather than indulge the fantasy that their job is to sit back and create “guardrails” for it. History, in the final analysis, has no guardrails.
This is, to be sure, the harder job (for us all), but I believe it is the job to be done. As policymakers contemplate this more difficult labor, they should remember that, if government can figure out how to adopt AI successfully, they will have far better capabilities to do all manner of things at their disposal soon. Indeed, ensuring that government uses those AI-enabled powers responsibly may well be among the greatest challenges of them all.