When the government conducts meticulous scrutiny, the common folk will become devious and inscrutable…
Correctness turns into perversion.
Goodness turns into deviancy.
The people’s confusion has certainly lasted a long time.
Laozi, Tao-te ching (interp. Wang Bi, trans. Richard John Lynn)
Introduction
Many policymaking and adjacent communities complain about the “black box” nature of AI systems. In the Senate’s Roadmap for AI Policy, for example, the authors write:
Some AI systems have been referred to as “black boxes” which may raise questions about whether companies with such systems are appropriately abiding by existing laws. Thus, in cases where U.S. law requires a clear understanding of how an automated system operates, the opaque nature of some AI systems may be unacceptable. We encourage the relevant committees to consider… requirements on high-risk uses of AI, such as requirements around transparency, explainability, and testing and evaluation.
While I understand the Senators’ desire for “explainability,” I think it is important to be realistic about how “explainable” or “auditable” AI systems will ever be. It may well be the case that neural networks will never be perfectly explainable, at least not in a way that satisfies human rationality. The reason for this is that neural networks are prototypical examples of emergent orders—things which arise as a result of human action but not of human design. Neural networks are not so much designed as they are grown. Deep learning was not so much invented as it was discovered.
There is a nascent field called mechanistic interpretability that seeks to uncover the inner mechanisms of neural networks. I applaud this research; indeed, I find it to be some of the most interesting research happening in AI today. Yet everything I know about emergent orders makes me think that there will be limits to the explainability of any neural network—even one far simpler than modern frontier AI models.
Regulators’ desire for perfect, rational explainability may come into tension with the reality of emergent orders. We should not view that as a flaw of neural networks, though; instead, we should view the presumption that complex systems are rationally explainable as one of many 20th century, high modernist ideals that we would be wise to discard from contemporary thought. Let me show you what I mean.
On Emergent Orders
An “emergent” or “spontaneous” order is an organized system that develops with no central organizer. I was introduced to the concept of emergent orders by the work of Friedrich Hayek, largely in the context of free markets. But almost immediately, I had the intuition that emergent orders described far more than just free market economies. Why does inorganic matter seem to “want” to form itself into crystals? Why does evolution work? Why do brains think? And it turns out, indeed, that all these systems are forms of emergent order.
Emergent orders are powerful. They’re self-adaptive, which means that they can adapt themselves to novel circumstances without any conscious “design.” Yet at the same time, they’re tricky for humans to understand. The standard human intuition is that actions are caused by actors, or that the observable universe is the result of some set of “laws of nature,” or that humans have their characteristics because of the “code of life” embedded in our genomes. It’s very hard to grapple with the idea that none of this is quite true, that instead the most exquisite order can simply happen without anyone or anything willing it to be.
Another problem with emergent orders is that they are almost always inscrutable. They tend to elude rational interpretation. They come to be as a result of searching over vast possibility spaces for “what works,” and they iterate gradually on “what works” over time and without a coherent plan.
One way to phrase this is to say that emergent orders are “computationally irreducible,” a concept pioneered most notably by Stephen Wolfram. A computationally reducible system is one whose behavior can be predicted by a shortcut: some simpler program that tells you the outcome with far less computation than running the system itself. A computationally irreducible system is one for which no such shortcut exists.
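To make that concrete, here is a small Python sketch (my own illustration, not something drawn from Wolfram's writing) of his Rule 30 cellular automaton, a canonical example of computational irreducibility: the update rule is trivially simple, yet as far as anyone knows, the only way to learn what the pattern looks like far in the future is to compute every intermediate row.

```python
# A sketch (my own illustration) of Wolfram's Rule 30 cellular automaton.
# The rule is trivial, but no known shortcut predicts a distant row faster
# than simply running every intermediate step.
def rule30_step(cells):
    # each new cell depends only on its three neighbors in the previous row
    n = len(cells)
    return [
        cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n])
        for i in range(n)
    ]

width, steps = 63, 30
row = [0] * width
row[width // 2] = 1  # start from a single "on" cell

for _ in range(steps):
    print("".join("#" if c else "." for c in row))
    row = rule30_step(row)
```

Run it and you get the famously chaotic triangle of cells; knowing the rule in full buys you no way to skip ahead.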
This ends up having profound implications for all sorts of complex systems, including much of what we observe in nature. Here’s how Wolfram describes it (emphasis added):
In traditional science it has usually been assumed that if one can succeed in finding definite underlying rules for a system then this means that ultimately there will always be a fairly easy way to predict how the system will behave.
But now computational irreducibility leads to a much more fundamental problem with prediction. For it implies that even if in principle one has all the information one needs to work out how some particular system will behave, it can still take an irreducible amount of computational work actually to do this.
That is, even with perfect information, you cannot predict how a complex system will behave without doing the work that system would otherwise have done. Now, let’s see how these ideas apply to neural networks.
Neural Networks as Emergent Orders
I’ve long suspected that neural networks would remain fundamentally inscrutable to human reason. As I wrote in “Software’s Romantic Era”:
Ultimately, speculation, including my own, does not help advance our concrete understanding of what is going on here. I welcome much more scientific inquiry into this, particularly from Anthropic, which has meaningfully moved the ball forward on mechanistic interpretability in the recent past… I suspect that our understanding will advance meaningfully in the coming years, yet I also suspect that this fundamental illegibility will remain.
Neural networks are emergent orders in the sense that no one designed their functionality. Instead, developers of neural networks feed data through various mathematical architectures (the transformer’s attention block, the multilayer perceptron, etc.). To make a long story short: the idea is to turn the training data into numbers, which are then multiplied (or otherwise mathematically processed) by other numbers, the latter of which are known as the model’s “weights.” During training, models are made to make predictions about their training data (for example, the next word—token, actually, but we’ll go with word—in a given sequence of internet text). The weights are initially set randomly; when the model fails to accurately predict the next word, the weights are adjusted until its predictions start to become more accurate.
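Here is a toy sketch of that loop in plain Python and numpy (the tiny dataset, the bigram architecture, and the learning rate are all invented stand-ins of mine, nothing from a real model): a next-word predictor whose weights begin as random numbers and are nudged, step by step, toward better predictions.

```python
# A minimal sketch of next-word training: a bigram model whose weights start
# random and are adjusted toward better predictions. Real language models use
# far larger architectures (transformers), but the loop is conceptually the
# same: predict, measure the error, adjust the weights.
import numpy as np

rng = np.random.default_rng(0)

text = "the cat sat on the mat the cat ate the rat".split()
vocab = sorted(set(text))
stoi = {w: i for i, w in enumerate(vocab)}
ids = np.array([stoi[w] for w in text])

V = len(vocab)
W = rng.normal(scale=0.1, size=(V, V))  # the weights: initially random

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

xs, ys = ids[:-1], ids[1:]          # (current word, next word) pairs
lr = 1.0

for step in range(200):
    logits = W[xs]                  # predicted next-word scores for each current word
    probs = softmax(logits)
    loss = -np.log(probs[np.arange(len(ys)), ys]).mean()

    # gradient of the cross-entropy loss with respect to the logits
    grad = probs
    grad[np.arange(len(ys)), ys] -= 1
    grad /= len(ys)

    # adjust the weights in the direction that improves the predictions
    np.add.at(W, xs, -lr * grad)

    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss:.3f}")

# after training, "the" assigns higher probability to the words that followed it
p = softmax(W[stoi["the"]])
print({w: round(float(p[stoi[w]]), 2) for w in vocab})
```

No one tells the model which words follow “the”; that regularity simply emerges from repeated failure and adjustment.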
These weights encode most of the information the model “knows.” You might think the model stores this information in a straightforward way. Indeed, in early interpretability studies of very small language models, OpenAI discovered a “Canada” neuron that seemed to contain everything the model knew about “Canada,” along with other neurons that mapped 1-to-1 to simple concepts. But because there is far more discrete information in the world than there are parameters in a model, in practice the model must compress its training data to fit within its weights. In theory, one could make a language model large enough that each discrete unit of information was stored within its own dedicated computational unit—but then, you’d simply have created a mirror of your original training data! In other words, the compression that a neural network does is its intelligence.
So how does a neural network achieve this compression? How does it pack concepts from an exceedingly complex world into a finite mathematical artifact? There is no clean, rational way to do this, but mechanistic interpretability gives us two important concepts for describing what models actually do: superposition and polysemanticity. Superposition is when a model represents more “features” (concepts from the training data) than it has neurons, packing them into overlapping combinations of neurons; polysemanticity is when a single neuron “activates” in the presence of multiple different features, instead of just one.
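A small numpy sketch may help (the dimensions and the random feature directions are my own toy setup, not taken from any real model or paper): twelve features are crammed into four neurons, so every neuron responds to many features at once, and yet a sparse input can still be approximately read back out.

```python
# A toy illustration of superposition and polysemanticity: 12 "features" are
# stored in only 4 dimensions ("neurons") by giving each feature a random
# direction. The directions overlap, so every neuron responds to many features,
# yet a sparse input can still be approximately recovered.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 12, 4

# each feature gets a random unit-length direction in neuron space
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# a sparse input: only feature 3 is active
x = np.zeros(n_features)
x[3] = 1.0

activations = x @ directions            # what the 4 neurons actually "see"
readout = activations @ directions.T    # naive attempt to read the features back out

print("neuron activations:", np.round(activations, 2))
print("recovered features:", np.round(readout, 2))   # feature 3 stands out, plus interference
print("features neuron 0 responds to:", np.round(directions[:, 0], 2))  # polysemanticity
```

The compression works precisely because most features are rarely active at the same time; the price is that no single neuron means any one thing.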
The current state of the art in interpretability is to use another kind of AI model, called a sparse autoencoder, to pick apart these complex relationships. This has worked surprisingly well—better, frankly, than I’d have guessed two years ago. Tens of millions of features from current language models like Claude and GPT-4 have been discovered in just the past few months using this approach.
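For the curious, here is roughly what that technique looks like in miniature (a sketch under my own assumptions: toy dimensions, random stand-in activations rather than activations collected from a real model, and PyTorch for brevity): an autoencoder that rebuilds each activation vector from a much larger dictionary of candidate features while being penalized for using more than a few of them at a time.

```python
# A minimal sparse autoencoder sketch: learn an overcomplete dictionary in
# which each activation vector is rebuilt from a small number of active
# "features". Toy dimensions and synthetic data; real work trains on
# activations collected from an actual model.
import torch

torch.manual_seed(0)
d_model, d_features = 64, 512          # expand 64 dimensions into 512 candidate features
n_samples = 4096

# stand-in for activations gathered from a real model
acts = torch.randn(n_samples, d_model)

enc = torch.nn.Linear(d_model, d_features)
dec = torch.nn.Linear(d_features, d_model, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
l1_coeff = 1e-3                        # sparsity pressure: prefer few active features

for step in range(500):
    f = torch.relu(enc(acts))          # feature activations (pushed toward mostly zero)
    recon = dec(f)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    f = torch.relu(enc(acts))
    print("average fraction of features active per example:",
          float((f > 0).float().mean()))
```

The hope is that each learned feature, unlike each raw neuron, corresponds to a single human-recognizable concept.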
Yet we are just scratching the surface. These findings, though significant, represent a small fraction of the meaning encoded within a large neural network. Some of it may—and I believe will—simply be “uninterpretable,” that is, it will be discoverable, but it will have no basis in rational explanation. It’s not just my gut that tells me that. Chris Olah, a researcher working at Anthropic, widely regarded as the industry leader in mechanistic interpretability, recently wrote (no emphasis added):
…it may be that neural networks have exceptionally rare and sparse features. It's possible these are the vast majority of features (although less important, since they're uncommon). And, barring significant breakthroughs, it may be effectively impossible for us to resolve features rarer than some level.
From this perspective, these rare features may be a kind of "dark matter" of interpretability… it may be that a large fraction of the neural network universe is effectively unobservable dark matter.
In an important new essay, Stephen Wolfram seems to concur with this intuition:
Like biological evolution, machine learning is fundamentally about finding things that work—without the constraint of “understandability” that’s forced on us when we as humans explicitly engineer things step by step. Could one imagine constraining machine learning to make things understandable? To do so would effectively prevent machine learning from having access to the power of computationally irreducible processes, and from the evidence here it seems unlikely that with this constraint the kind of successes we’ve seen in machine learning would be possible…
Yes, we can make general statements—strongly based on computational irreducibility—about things like the findability of such processes, say by adaptive evolution. But if we ask “How in detail does the system work?”, there won’t be much of an answer to that. Of course we can trace all its computational steps and see that it behaves in a certain way. But we can’t expect what amounts to a “global human-level explanation” of what it’s doing. Rather, we’ll basically just be reduced to looking at some computationally irreducible process and observing that it “happens to work”—and we won’t have a high-level explanation of “why”.
Perhaps it will be the case that neural networks are interpretable enough in their current form that the regulators and the auditors and the lawyers will be satisfied, assuming mechanistic interpretability continues to advance. But it should be clear: if Wolfram is correct, there would seem to be a tradeoff between how human-interpretable a neural network is and how powerful it is. With diligent engineering, we might—and I believe we will—make these machines fairly predictable, maybe even very predictable. But it is possible that we can never make them perfectly predictable, because the machines are intelligent, and intelligent entities are inherently unpredictable and need to be granted the freedom to behave as such in order to be intelligent (a “natural law,” if you will).
On one level, the policy implication is, “regulations that expect perfect explainability from neural networks are unlikely to age well.” And that is all well and good. But there is something much deeper here. Neural networks are complex systems that we can observe and experiment on with perfect information—absolutely everything about the inner workings of a neural network can be measured with perfect precision. And yet even with that being the case, these emergent orders still elude our rational understanding. Perhaps that is the far more important takeaway for public policy, and indeed, for us all.
AI narration of this post:
https://askwhocastsai.substack.com/p/on-ai-black-boxes-by-by-dean-w-ball