5 Comments

> Many in the safety community always assume that every capabilities-related hurdle will be overcome, yet at the same time they assume that the safety problems that concern them most are somehow uniquely difficult. I have never quite understood the logic behind this—yet at the same time, I’ll be keeping my eyes open.

I think the intuition is something like this – I present this without necessarily fully endorsing it:

If there are weaknesses in a model's capabilities, there will be unmistakable signals (e.g. dissatisfied customers) and lots of motivation to fix them. The dynamics of the system are such that resources will be reliably, and on balance at least somewhat intelligently, directed toward improving capabilities.

If there are weaknesses in a model's safety properties, there may or may not be clear signals (see below), and the dynamics of the system are not as reliable in ensuring that risks are addressed. Look at how long it took for issues with tobacco or leaded gasoline to be addressed (neither is fully addressed even today, e.g. lead in aviation fuel) vs., for instance, how quickly the cell phone industry moved to smartphones.

In general, the motivation to fix safety issues can be weaker and less direct than the motivation to fix capability gaps: feedback loops (such as getting sued, or public outcry leading to policy changes) are slower and less direct than a tightly monitored revenue pipeline, and model and application developers don't necessarily internalize all of the downside (especially for catastrophic risks, or for small companies where potential harms could easily exceed the entire value of the enterprise). (It is worth noting that developers aren't able to internalize all of the *benefits* of their work, either.)

I mentioned the question of clear signals. I think people have a wide range of intuitions as to whether a catastrophe could occur without warning (e.g. a foom and/or sharp-left-turn scenario leading to an unforeseen loss-of-control event, or a bad actor unleashing a really nasty engineered virus out of nowhere). Unanticipated weaknesses in capabilities are fine in the grand scheme of things: maybe someone has a bad quarter while the market routes around the inadequate product, or customers find workarounds. Unanticipated weaknesses in safety properties can have a very different impact.

There are proposed mechanisms by which such unanticipated issues could occur, such as deceptive alignment.

Another asymmetry: capability failures are addressed at the speed of capitalism, while safety failures might only be addressed at the speed of government. (I guess this is sort of the same as things I said earlier.)

Finally, some people observe the discourse of the last few years and conclude that some folks are acting in bad faith, and/or have bought into their own arguments to such an extent as to have the same result, and will actively downplay warning signs, evade safety requirements, etc. If you believe this, then you'll want to force the discussion now and set the stage for restrictions-with-teeth before events overtake us.

I think this analysis suggests something like "governments and other third parties should do pre-deployment testing of frontier models for catastrophic risk potential," but the kind of ambient alignment problems that Apollo Research addresses would manifest as capability flaws: the models would be unreliable in a host of different settings. So I do think that market mechanisms basically solve that problem.

It seems like a fundamental paradigm shift in our computing model: moving from a passive, spectacle-based consumption experience to an active, agent-based model. The previous paradigm was based on search and the democratization thereof; devices were dumb terminals that displayed content. But if you can democratize reasoning and intelligence, then the entire cloud-computing model evolves into something new.

Nice analysis. Timothy Lee (author of the Understanding AI blog) has written about another example of a threshold effect – somewhere in the last year or so Waymo seems to have crossed a threshold and is now growing exponentially: https://www.understandingai.org/p/waymo-is-growing-exponentially.

How did you know Nabeel? I only know of his faith pronouncements. Do you also know David Wood?

Harry Lewis
