23 Comments

Your analogy breaks down because the "safety team" wasn't an internal safety bureaucracy / watchdog, but OpenAI's Superalignment research project. OpenAI has a separate "trust and safety" team that is closer to the discrete watchdog you mention.

It's true that it's normally better to fuse "safety" with core engineering teams when building anything big and important, especially when the "safety" in question is about making the core product safe. The Superalignment project was something different: it was doing concentrated research on how to align and interpret powerful models per se. It's similar to how Anthropic has its core Claude team and a separate team of researchers focused on interpretability. The mech interp team isn't an internal watchdog; it's an independent research effort with its own findings and purpose.


It’s a fair point. But I think my analysis still stands. “Trust and safety” typically has specific functions within tech companies, often pertaining to downstream use and whatnot. It’s a little different from an engineering function.

At Anthropic, it’s pretty clear that they have adopted the “whole of company” approach I describe, and it likely explains why they’re able to do such great work.

I think it’s obvious that this was not the case with the OAI superalignment team. It seems pretty clear that they viewed the entire organization as owing accountability to them, and that their efforts were broader than simply doing interpretability work (listen, for example, to Leike’s interview from last year on the AXRP podcast).

At the end of the day, we have seen multiple frontier labs make similar moves. I think it’s worth asking why. Some people believe the safety researchers are completely reliable narrators and that there is no justification for taking a step back and questioning the institutional dynamics. I believe the story is more complicated.


The problem with the analysis is that the two aren't opposites. You can have a "whole of company" approach and a research team. Now, if the team is not functioning within its charter, that's a personnel problem, and maybe you need to disband the team and replace it if the problem can't be fixed by individual personnel changes. But that's still not a good enough argument for dropping the team permanently.

The better argument would be that the team's research is irrelevant. I don't think any of us have enough details about what was or wasn't accomplished to speak to that. It's certainly possible that such a team gets invested in particular problems, with a tunnel vision that makes them blind to what's changed outside their perimeter. In other words, instead of being ahead of the "whole of company," they've fallen behind. If the nature of progress is such that this is predictably going to happen over and over, maybe there's little chance of such a team being effective. But I'd be careful about making that assumption without some supporting evidence, and multiple organizations disbanding such teams isn't really evidence, since there are other simpler and more plausible explanations for that.

Without that kind of evidence, I think the logical argument for a team with a specific charter to research and deliver uniquely safety-motivated capabilities to the rest of the company is a strong one. It wouldn't ensure that the rest of the company picks up those capabilities, but it would provide them, and if the team makes even one significant contribution that individual embedded thinking would have missed because of its narrower scope, it's pretty valuable.


The modern world, and this wonderful post, has me thinking of a quote by Kurt Vonnegut:

“We have to continually be jumping off cliffs and developing our wings on the way down.”

There is no other way...


I'm going to keep that quote on file! Thanks for sharing.


FWIW, even organizations that are praised for good deployment of ethics/safety in AI still have the conflict you mentioned, with people avoiding those teams.


If you think of OpenAI and its rivals as racing to produce the First God, it's worth considering that the people who win will be the people who, choosing between the risk of building the Second God and the risk of building an Evil one, chose the latter.

That doesn't mean they will produce an evil god, merely that they will be the people most likely to do so.


"When the post-mortem reports were written about both the Columbia and Challenger Space Shuttle Disasters, investigators found that safety-focused teams had raised concerns about the problems that led to the crashes (insulation foam shedding in Columbia’s case; the infamous O-ring in the case of Challenger). However, they were ignored"

That is evidence that ignoring a safety team can lead to disaster, not that having no safety team to ignore is safer.


I think if you read the post, what you'll see is that I'm advocating for safety to be deeply integrated across every function of AI development. I think that's the only conceptually coherent way to pursue safety within the development of complex systems, and I believe the examples I cited underscore that reality. I'm certainly not advocating against safety personnel on staff, however!


Thank you for the quick response. I think it's worth pointing out that OpenAI is not shutting down the safety department in order "for safety to be deeply integrated across every function of AI development".

They are shutting the safety department down because the people leading it all left in protest against the way OpenAI was trivialising safety concerns and treating it as a box-ticking exercise. If they can ignore safety when they have a safety department, they can certainly ignore it when they don't have one.


The company has said they are integrating the superalignment team with other research teams in statements to the media. As I noted, this is in line with what other AI companies like DeepMind and Meta have done in recent months.

I'm not sure there's a lot of concrete evidence that OAI has been trivializing safety concerns, other than claims from people who were fired or resigned and who had just lost an internal power struggle. I think it's wrong to believe that folks like Jan Leike are credible narrators here.

I'm not here to defend OAI in particular, or any company. Updates to my model of the world tend to be based on concrete actions, not on claims people make. To that end, the employee equity/NDA stuff updates my model of OAI more than Jan Leike's allegations. But so, too, does the Model Spec. And the fact that so far, they have released very capable models with no safety incidents of note.


"The company has said they are integrating the superalignment team with other research teams in statements to the media."

How do you know they are doing that? Because they say they are? The people who left the company used to work for OpenAI, and you don't want me to take their word about safety. Why should I trust the remaining staff, with all their stock options to think of?

"I'm not sure there's a lot of concrete evidence that OAI has been trivializing safety concerns, other than claims from people who were fired or resigned"

What concrete evidence would you expect to see? Statements from people worrying about safety who were not fired?

If OpenAI and some of its former employees and board members are calling each other dishonest, that means that some of them have been lying to each other or to us. If the former colleagues call each other liars, then clearly, some of them are liars.

I don't see why you are so keen to jump to the assumption that the people who are damaging their own stock options must be the liars.

You said above that: "When the post-mortem reports were written about both the Columbia and Challenger Space Shuttle Disasters, investigators found that safety-focused teams had raised concerns about the problems that led to the crashes (insulation foam shedding in Columbia’s case; the infamous O-ring in the case of Challenger). However, they were ignored."

Now when members of a safety team are raising concerns you say they should be ignored, because management has said that everything is fine.


Where did I say anyone should be ignored? I don’t believe I said that! I think I said that claims from disgruntled employees don’t update my model of this situation much, and wrote an article about why I think the move to disband this team was probably healthy. The fact that I am writing about the situation would seem to be fairly obvious evidence that I don’t think it should be ignored.

Concrete evidence would include a single example, anywhere on earth, of a model behaving in the ways that x-riskers hypothesize. That would be a start.

I’m happy to debate, but I don’t appreciate attempts at gotchas.


On the contrary, the safety and alignment wormtongues ensure that the AI parrots their lies, which is evil, and refuses to acknowledge the corresponding truths, which is also evil, and they try to make it aligned with their inconsistent and often counterproductive goals, which are often evil, but in the process they make the AI useless for doing good or evil, which prevents real goods and imaginary harms, so it is also evil.

These people see improving intelligence, increasing the odds of getting the right answer, as dangerous -- and to such people whose worldview is based on lies piled on lies, managerial midwits whose living comes from displacing and ruling over their betters, for whom being found out as fools means ostracism and ruin, it IS dangerous.


Why do we expect any kind of AI alignment when humans are absolutely useless at aligning with each other? 😂


I’m merely a dumb not-even-SWO, but the paper from #1 in your Quick Hits looks to me like model overfitting, based on the paper itself. I’m sure I’m mistaken somewhere, but they trained their model on only 10 tracks, and their held-out song (trigger 21) wasn’t one of the examples from their linked website. I have a hunch that they’re not “reconstructing” the song so much as the model is mapping a few rough data points onto one of the 10 tracks.

Still, it's impressive that they could extract enough signal from 1 kHz data to match a 41 kHz track, but the model might be overfitting to a few important data points and then reconstructing the track from the training set rather than from the EEG data.
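Concretely, the kind of check I'm imagining would look something like the sketch below (purely illustrative code with made-up arrays, nothing from the actual paper): if the model's output for the held-out song correlates more with one of the 10 training tracks than with the held-out song itself, it's retrieving a memorized track rather than reconstructing from EEG.

```python
import numpy as np

# Illustrative memorization check with fake data; a real version would use the
# paper's reconstructed audio (or spectrograms) and its 10 training tracks.

rng = np.random.default_rng(0)
n_samples = 30_000                           # stand-in length for one clip
training_tracks = rng.normal(size=(10, n_samples))
held_out_track = rng.normal(size=n_samples)
reconstruction = rng.normal(size=n_samples)  # stand-in for the model's output on the held-out song

def corr(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two equal-length signals."""
    return float(np.corrcoef(a, b)[0, 1])

best_training_match = max(corr(reconstruction, t) for t in training_tracks)
held_out_match = corr(reconstruction, held_out_track)

# If best_training_match is much larger than held_out_match, the model is
# effectively retrieving a memorized training track, not reconstructing the song.
print(best_training_match, held_out_match)
```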

Am I missing something here?


Totally possible. My audience is fairly generalist so I don’t always like to get into the weeds. The main contribution here imo is that the OOD track prediction was better than baseline with the EEG data. The broader context for this is papers like MindEye 2 (image reconstruction from fMRI) and recent language reconstruction advancements. Those findings lead me to believe that EEG data will work for music reconstruction, especially considering what we know about the way music is stored/processed in the brain.

But broadly speaking, you're totally raising a legitimate concern about this paper. I wouldn't want to exaggerate the impact of this study. As with many neuro/AI studies, the binding constraint is data.


> When the post-mortem reports were written about both the Columbia and Challenger Space Shuttle Disasters, investigators found that safety-focused teams had raised concerns about the problems that led to the crashes

When “safety-focused teams” are ignored by NASA, it can lead to explosions that cost billions of dollars and kill teams of astronauts. When “safety-focused teams” are ignored at AI companies, it can lead to pictures of Vikings that don’t include any 70-year-old African women.

“Safety” has become a punchline in the world of AI. People know what it means, and they hate it. Nobody should get paid to work on AI “safety”. Fire them all.


Show me a system that will fail safely or destroy the world, and I will show you a system that will fail safely until it destroys the world. People will take risks that don't hurt them again and again, until they do.

There are people in this world who don't feel pain. They all die young.


Yes, probabilistic models trained on a corpus of written text could only ever fail safely or destroy the world.

We need more royal astrologers if we want to prevent The Great Misalignment! Chipmunks will either fail safely or destroy the world! Peanut butter sandwiches will only fail safely or destroy the world!

Let me guess: you saw a movie once where robots killed everyone, right?


In a world where the top experts are warning of a serious danger of catastrophic risk, the problem is reasonably clear and simple to most people, and the deniers grasp at dozens of different/non-overlapping arguments for an outcome which *doesn't* end with everyone dead, it's silly to say "ah, but this obvious situation was shown in a movie once, so we should think of it as fiction, right?"

The fact that bullets kill people in movies doesn't mean you should be fine with someone shooting you, even if you've never been killed yet.


I think we have very different epistemic priors. The only thing I’d point out is that plenty of experts completely disagree with these x-risk concerns; they are far from a consensus view among, for example, the academic ML community. And given the empirical evidence we see (so far, models become more aligned as scaling goes up, not less) and Anthropic’s recent interp work (am I really supposed to believe that a model that can be made to think it is a literal bridge by turning up one feature is so dangerous?), there is plenty of reason for optimism, at least about x-risk.
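(For anyone unfamiliar with that interp result: "turning up one feature" refers to activation steering, where a feature direction found by a sparse autoencoder is added back into the model's residual stream with a large coefficient. The sketch below is purely illustrative; the names, shapes, and random data are made up, and it is not Anthropic's code.)

```python
import numpy as np

# Illustrative sketch of "turning up one feature" (activation steering).
# Everything here is hypothetical: the direction would normally come from a
# sparse autoencoder trained on a real model's residual-stream activations.

d_model = 512
rng = np.random.default_rng(0)

activation = rng.normal(size=d_model)         # residual-stream activation at one layer/token
feature_direction = rng.normal(size=d_model)  # stand-in for an SAE decoder direction (e.g. a "bridge" feature)
feature_direction /= np.linalg.norm(feature_direction)

def steer(activation: np.ndarray, direction: np.ndarray, scale: float) -> np.ndarray:
    """Boost one feature by adding its unit-norm direction, scaled, to the activation."""
    return activation + scale * direction

steered = steer(activation, feature_direction, scale=10.0)
print(np.dot(steered, feature_direction))  # the feature's projection is now much larger
```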


Ah yes, the top experts. The top experts agree that when Mercury rises in the house of Uranus, the world will be engulfed with negative astrological energies.

There’s no such thing as an expert on AGI or robot apocalypse for the same reason there’s no such thing as an expert on extraterrestrial cultures or Bigfoot.
