On AI Alignment and 'Superalignment'
The institutional dynamics of AI safety and alignment, and answering the unanswerable
This piece started as a response to the recent OpenAI controversy and ended up as something a bit bigger. Do read through to the end; I think it is worth your time.
Quick Hits
Researchers from Italy and Sony teamed up to release my favorite low-key AI paper of last week, using the brain waves of music listeners, plus diffusion models (a kind of AI model based loosely on the physics of diffusion), to figure out what song they were listening to. You can listen to the decoded songs, and the originals, here.
Apple unveiled a feature that allows you to control your iPhone or iPad using your eyes (and AI). Much like the Vision Pro, but with the benefit of not wearing a one-and-a-half pound computer on your head. The feature ships with iOS 18 in September (most likely).
Colorado enacted SB205, a bill I wrote about critically last week and the first major AI legislation in the country. The bill is not nearly as bad as California’s SB 1047; it is largely a well-intentioned effort to protect against algorithmic bias. As written, though, it is not likely to go well (you can read my critique if you wish). Fortunately, there is a two-year period before implementation, giving the legislature time to improve it. In a testament to how poorly the bill is drafted, the Governor’s statement of approval sounded more like a veto statement. “I am concerned about the impact this law may have,” he wrote, ponderously, of a bill he had just signed.
The Main Idea
In the last few days, there has been yet another controversy at OpenAI. On Friday, the company announced that its Superalignment team, started just last summer, would be disbanded as a discrete unit. At least some of the staff and compute resources assigned to the team will be redistributed to other, presumably product-focused, units. Ilya Sutskever and Jan Leike, the team’s leaders, have left the company. Sutskever was involved in last fall’s attempt to remove Sam Altman as CEO, and his long-term position at OpenAI had been tenuous. Another member of the team, Leopold Aschenbrenner, was fired in April for allegedly leaking to the media.
Sutskever departed gracefully; Leike, however, had more to say. He posted a thread on X explaining that his decision stemmed from disagreements with senior management about how best to ensure that AGI is built safely. “I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics,” Leike wrote.
As usual, OpenAI is a bit of a Rorschach test for those of us in the AI community. Some hailed Leike as a hero whose revelations about the company’s alleged recklessness underscore the urgent need to regulate frontier AI companies. Others declared his departure a victory for technology accelerationism: OpenAI is cleaning house, they reason, ridding itself of counterproductive safety extremists.
I doubt this is a victory for either “side” of the AI debate, but nonetheless I suspect OpenAI is making a healthy move. In organizations engaged in high-stakes engineering and research, it is generally better—for safety—not to rely solely on a discrete safety team.
Discrete safety functions tend to corrode healthy institutions. Because such a team functions as a kind of internal regulator or watchdog, an “us vs. them” dynamic can easily develop. This can result in less collaboration and innovation, or even an instinct among product-focused engineers and results-oriented managers to discount legitimate warnings from safety engineers.
This exact dynamic has played out time and time again. The post-mortem reports on both the Columbia and Challenger Space Shuttle disasters found that safety-focused teams had raised concerns about the very problems that caused the accidents (insulation foam shedding in Columbia’s case; the infamous O-rings in Challenger’s). Those warnings were ignored, in part because safety was treated as a discrete box to be checked rather than a holistic responsibility of the entire organization.
Discrete safety teams, removed as they necessarily are from many parts of a complex system, can also become myopic. The Deepwater Horizon oil spill and the Fukushima nuclear accident are both recent examples of safety teams that had become overly focused on regulatory compliance and on policing employee conduct, at the expense of holistic risk assessment and proactive identification of problems. This seems especially likely in the management of complex systems: a discrete safety team is necessarily tasked with such an overwhelming number of objectives that it is hard to imagine how, in the long run, it could avoid becoming blind to one set of risks or another.
(Sidenote: as the linked investigations into each of these accidents will attest, all of these failures had complex causes; I am not suggesting that the things I’m pointing out were the only causes of these four disasters.)
It is easy to see how these problems might manifest in the context of AI labs; indeed, we have already seen them. The Gemini incident is probably the best example: Google’s frontier AI model was turned into an embarrassing caricature of online left-progressivism in the name of AI safety and ethics.
It is even easier to imagine the “us vs. them” dynamic developing in frontier AI companies between discrete safety/alignment teams and the rest of the organization. Sidney Dekker, Director of the Safety Science Innovation Lab at Griffith University in Brisbane, has written:
“Burgeoning safety bureaucracies preoccupied with counting and tabulating lagging negatives (e.g. ‘violations,’ or deviations, incidents) tend to… turn safety from an ethical responsibility for operational people into a bureaucratic accountability to non-operational people.”
I realize that the Superalignment team was working on core research, and in that sense Dekker’s “operational vs. non-operational” dichotomy falls down a bit. Still, it is hard not to see Dekker’s point when you read Jan Leike’s resignation thread on X. “OpenAI must become a safety-first AGI company,” he commands his former employer. Imagine what he might have been like when he had power within the company.
The concept of AI alignment has an almost mythical status for many within the AI world—particularly those who are most concerned about catastrophic and existential risk. I know many people who regard AI alignment as the most challenging problem mankind has ever faced. It is not difficult to understand how the leader of the team tasked with solving that problem might become officious, believing in some sense that the entire organization should be accountable to him.
And in one sense, that is correct: alignment and safety are sufficiently important goals as to demand the attention of everyone in the organization. This, in turn, highlights precisely why one alignment team is insufficient. Similarly, OpenAI does not have a “Supercapabilities” team working on making AI smarter, because that is part of the foundational purpose of the institution.
It is therefore unsurprising that Meta, Google/DeepMind, and now, OpenAI have all made similar moves to disband discrete safety functions and instead integrate them into product-focused research or engineering teams. While some centralized safety oversight is essential, I suspect that the cause of AI safety and alignment will be better served by being integrated throughout the many different teams involved in building frontier AI systems. Regardless of what you think about Sam Altman, or OpenAI, or AI regulation, then, I think it is possible to see this move as a positive one.
More broadly, it is a mistake to assume that there is a clear distinction in practice between AI safety research and capabilities research. Some tasks fall neatly into one category or the other (automated red teaming, for instance, on the safety side). Yet for many tasks, the distinction is far less apparent. Reinforcement learning from human feedback (RLHF), for example, was developed as a safety and alignment technique, yet it is also the key innovation that made ChatGPT a useful mass consumer technology.
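To make that overlap concrete, here is a minimal sketch of the preference-learning step at the heart of RLHF, written in Python with PyTorch. The feature vectors below are hypothetical stand-ins for real language-model embeddings, so this illustrates the shape of the technique rather than any lab’s actual implementation: a small reward model learns to score responses that human labelers preferred above the ones they rejected, and that learned reward is then used to steer the model itself.

```python
# A toy sketch of the preference-learning step in RLHF (illustrative only):
# a reward model is trained so that responses humans preferred score higher
# than the responses they rejected (a Bradley-Terry-style objective).
# Random feature vectors stand in for real language-model embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

dim = 256
chosen = torch.randn(64, dim) + 0.5   # "embeddings" of responses labelers preferred
rejected = torch.randn(64, dim)       # "embeddings" of responses labelers rejected

reward_model = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(200):
    margin = reward_model(chosen) - reward_model(rejected)
    # Maximize the log-probability that the preferred response outranks the rejected one.
    loss = -F.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# In a full RLHF pipeline, the language model is then fine-tuned (e.g. with PPO)
# to maximize this learned reward, usually with a KL penalty keeping it close to
# the original model -- the same machinery serves "safety" and "capability" alike.
```

Whether you file that loop under safety work or product work is, as the examples above suggest, largely a matter of framing.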
AI alignment, at the end of the day, is about ensuring that AI systems do what we want them to. That is a harder problem than the description makes it sound, as anyone who has read basic political theory can tell you. What happens if I want my AI to do something that harms you, even if indirectly? Who gets to decide what the limits are?
Companies like Anthropic, OpenAI, and DeepMind have all made meaningful progress on the technical part of this problem, but this is bigger than a technical problem. Ultimately, the deeper problem is contending with a decentralized world, in which everyone wants something different and has a different idea for how to achieve their goals.
The good news is that this is basically politics, and we have been doing it for a long time. The bad news is that this is basically politics, and we have been doing it for a long time. We have no definitive answers.
AI, and specifically AI agents (AIs that can take action on your behalf), will present problems that must be solved through novel technological and legal means we have yet to discover. But because those problems are, at root, about governance, political and economic theory give us, at the very least, a solid starting point. More specifically, I believe that classical liberalism (individualism wedded to pluralism via the rule of law) is the best starting point, because it has shown the most success in balancing the priorities of the individual and the collective. But of course I do. Those were my politics to begin with.
Even more specifically, I suspect the answer will involve finding ways to scale governmental, or quasi-governmental, responses to address a far more complex world, a world with much more activity, and hence much more conflict. Sometimes that will mean new laws must be passed. But in general, the problem is not so much creating new law as it is figuring out how existing law applies to new situations—a lot of new situations, at the same time.
Today, this is primarily a job for the courts. The problem is that the courts are expensive and slow. But what if the legal system, or something like it, were many orders of magnitude cheaper and faster? If only someone were working on a set of technologies that could do just that!
This thinly sketched idea is not meant to be dispositive. I only mean to suggest it as a way of illustrating the job to be done, at least as I see it. It should also underscore the seriousness of the task: there is no intrinsic reason that the system I just outlined, as applied to the conduct of AI agents acting on behalf of humans, needs to take place within the existing judicial system. Extrajudicial proceedings such as binding arbitration take place regularly today. But at a certain scale, might that system, if it is not part of government, come to resemble government? And what would that mean for the government we already have?
Contending with this challenge, and evolving our politics and policy to meet it, will be one of humanity’s next grand projects. It is absurd to believe that one team inside one company, no matter how “super” it may be, can find a solution to this problem. At the very least, it should be seen as a whole-of-company problem, and eventually, a whole-of-society problem. We do not know the answer, and it is doubtful that there is such a thing as “an answer” to problems of this kind. It is not just that finding the answer will be a process. The answer will be the process.
The collision of many plans, the cacophony of many voices, is nearly always preferable to the monotony of one.