“It seems, however, that the crucial elements [of the Industrial Revolution] were neither brilliant individuals nor the impersonal forces governing the masses, but a small group of at most a few thousand people who formed a creative community based on the exchange of knowledge. Engineers, mechanics, chemists, physicians, and natural philosophers formed circles in which access to knowledge was the primary objective. Paired with the appreciation that such knowledge could be the base of ever-expanding prosperity, these elite networks were indispensable, even if individual members were not.”
-Joel Mokyr, The Gifts of Athena: Historical Origins of the Knowledge Economy
Few issues divide the AI community like the future of open-source AI. Some, especially those concerned about catastrophic risks from AI, favor banning it outright, or at the least substantially curtailing it. Others, particularly longtime AI researchers, believe it is essential to our future.
President Biden’s Executive Order on AI recognizes both the “substantial security risks” and “substantial benefits to innovation” of open-source AI, and directs the Commerce Secretary to “submit a report” with “policy and regulatory recommendations” on the matter.
What is Open Source?
Open source may sound like a wonky issue, especially to the less technically inclined. It is anything but. In fact, I believe eliminating or weakening open-source AI significantly increases the odds of a bad outcome with the technology more broadly. But what is open-source software? Why is it important? What are its risks, and how can we curtail them?
Open-source software, broadly speaking, is software whose underlying code is made publicly available for free. Anyone around the world can contribute to it or copy the code to create their own version. This enables a bottom-up, consensus-based approach to innovation. Open source is a particularly valuable approach for the basic infrastructure of digital life; because it is not “owned” by any one company, software companies and individuals alike have an incentive to maintain it collectively.
Indeed, almost all websites and software tools today employ open-source software in some way. The web browser or email client you’re reading this in is almost certainly based heavily on open-source technologies such as WebKit. Your smartphone’s operating system (if it is Android or iOS) has open-source software at its core. The servers that host much of the Internet are based on Linux—an open-source operating system. Much of the foundational software for websites you visit every day is open source. ChatGPT would not be possible without open-source software. Even software that you pay for probably uses open-source software in some ways to get its job done. After all, programming languages themselves are, more often than not, open source.
In other words, our digital ecosystem is unimaginable without open-source technology. Why is this the case? After all, with so much money to be made in software, why is so much of it given away for free?
Early computing culture formed in academic circles, where sharing knowledge and collaboration were seen as essential. Starting in the 1980s, this culture coalesced into the “free software movement.” “Free,” in this context, meant more than just the cost of the software. Richard Stallman, a founding figure of the movement, famously said that free software was “free as in speech,” not “free as in beer.”
This ideology seemed right at home in the utopian computing landscape of the ’80s and ’90s. It is perhaps more surprising, though, that it has stood the test of time. Why has it?
Few people, whether working on their own or in companies, have the resources to develop all the ingredients of a modern website or software tool on their own. Fewer still have the resources to update that software and keep it secure over time. Open source, with its global community of contributors, functions as a kind of continuous and dynamic peer review for software.
Furthermore, popular open-source tools are well-understood by a large number of programmers, allowing them to get up to speed on new projects much more quickly than if those projects employed entirely proprietary software.
Finally, open source allows businesses in particular to add software capabilities and infrastructure without being locked into long-term contracts with proprietary software vendors. Businesses can switch between different open-source tools or even modify the tool itself to fit their specific needs.
In this sense, there are deep Coasean reasons that open-source software has persisted. Open source lowers transaction costs in software development, allowing both firms and individuals to benefit from the collective knowledge gained from decades of work by millions of developers around the world.
There are also cultural reasons that open source remains so widespread today. Sharing one’s work is simply part of the culture and incentive system of the Internet. I am writing this Substack because I want to contribute to the discourse on AI and emerging technology, and because I want to build a personal reputation as a thoughtful source on these issues. Making money simply isn’t a direct priority. Similarly, developers add to, edit, and create open-source projects because they want to contribute to the purpose of the project—and because it’s a crucial way to establish one’s reputation.
Open Source and AI
The definition of open source is a little different—and more contentious—in the context of AI. Unlike most software, AI models aren’t primarily based on code. Instead, the core of an AI model is its “weights,” billions (trillions, in the case of the largest models) of numbers that define the model’s mathematical representation of its training data. These days, an open-source AI model is one whose weights are freely available and modifiable. Some in the AI research community argue that simply releasing the weights isn’t enough to be deemed open source, and that a truly open-source model also needs to publicly release its training data. For our purposes, though, I’ll stick to the wider definition.
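To make “weights” concrete, here is a minimal sketch (assuming Python with the Hugging Face transformers library and PyTorch installed, tools this piece doesn’t otherwise discuss) that downloads an openly released model, GPT-2, and inspects its parameters directly:

```python
# A minimal sketch of what "open weights" means in practice.
# Assumes: pip install transformers torch
from transformers import AutoModelForCausalLM

# GPT-2's weights are publicly downloadable; anyone can load,
# inspect, modify, or fine-tune them locally.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# At its core, the model is a large collection of numbers.
total = sum(p.numel() for p in model.parameters())
print(f"GPT-2 has {total:,} parameters")  # roughly 124 million

# Every tensor of weights is directly visible.
for name, param in list(model.named_parameters())[:3]:
    print(name, tuple(param.shape))
```

Nothing here is hidden behind an API: every tensor can be read, edited, or retrained, which is exactly what “freely available and modifiable” means.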
As with almost all other software, open source has been key to the development of AI. For years, open sourcing frontier AI models was considered the norm, even among larger firms. There is a reason that OpenAI is called “OpenAI”—the original idea, from Elon Musk and the other founders, was that its models would be open source. GPT-2 was the last language model from OpenAI that could reasonably be called open source. Since then, the company has opted to make most of its models accessible only via its API (its Whisper model, which transcribes audio to text, is a notable exception), meaning developers and users send prompts to the system but can’t look under the hood.
Many other language models, however, continue to be open source. Open-source models haven’t quite caught up to the performance of GPT-4, but they are getting closer each month. Meta’s Llama series, Microsoft’s Orca and Phi models, and French startup Mistral’s models are among the leaders.
Regardless of the specific technical definition, there is no reason to think that the dynamics described above won’t apply to open-source AI.
Consider the perspective of a firm that wants to use LLMs to improve its internal business processes. Perhaps it wants to modify a model’s behavior in ways that are impossible with a closed model like GPT-4, whether because of technical limitations or corporate policy. Or perhaps it wants to guarantee that a model won’t change over time without its consent. Closed models are often updated by their makers in ways that consumers asking for recipe ideas aren’t likely to notice, but a company producing millions of highly specific words per day might. Perhaps the firm is uncomfortable entrusting its sensitive internal data to another company. Or perhaps it worries about entering into a long-term dependency on a specific vendor—after all, companies can change their policies or have leadership struggles.
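One concrete illustration of the “won’t change without consent” point: with open weights, a firm can pin an exact snapshot of a model and run it indefinitely. A hedged sketch, again assuming the Hugging Face transformers library; Mistral-7B-v0.1 is a real open-weights release, but the commit hash below is a placeholder, not a real value:

```python
# Sketch: pinning an open-weights model to an exact, immutable snapshot.
# Assumes the Hugging Face `transformers` library; the revision hash is
# a hypothetical placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-v0.1"   # an open-weights model
PINNED_REVISION = "0123abcd"             # hypothetical git commit hash

# Because the weights are public, the firm decides exactly which version
# it runs; no vendor can silently swap the model out from under it.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=PINNED_REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=PINNED_REVISION)
```

A closed API offers no equivalent guarantee: the provider can retrain or replace the model behind the same endpoint at any time.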
Maybe a closed model maker will alleviate these and the many other concerns a company might have. They often do, because the closed model maker’s business depends on it. But simply observing the software choices of firms today reveals that closed source often loses: the low transaction (and literal) costs of open source are often hard to resist.
Open-source models can offer flexibility to more than just businesses. Consider a historian who wants to use language models to analyze large numbers of archival documents. Those documents may contain language which we see today as sexist, racist, or otherwise problematic. The leading closed models, with their heavy-handed content filters, do not generally handle this kind of task productively.
The most important benefits of technology do not come from innovation alone. Instead, diffusion—widespread use throughout society—is how technology improves lives and grows economies. Because of its ease of distribution and flexibility, open source has been key to the diffusion of the web and of software more broadly, and this will likely remain true for AI. Without open-source AI, then, the US may have the most innovative AI sector but less to show for it than it otherwise would. In this sense, open source is crucial for our long-term national competitiveness.
The other benefits of open source will apply to AI as well. Already, we are seeing important advances in AI interpretability, alignment, and safety research made using open-source models. Because open access to weights allows researchers to examine the inner workings of AI models, they can derive insights that are simply impossible with a closed model.
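For a taste of what “examining the inner workings” means, here is a minimal interpretability-style sketch (assuming transformers and PyTorch, with GPT-2 standing in for any open-weights model) that captures the hidden activations of one transformer block via a forward hook, a level of access no prompt-only API provides:

```python
# Sketch: reading a model's internal activations, which requires
# open access to the weights. Assumes transformers and PyTorch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

captured = {}

def hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the
    # hidden-state tensor we want to study.
    captured["activations"] = output[0].detach()

# Attach the hook to the sixth of GPT-2's twelve transformer blocks.
model.transformer.h[5].register_forward_hook(hook)

with torch.no_grad():
    inputs = tokenizer("Open weights invite inspection.", return_tensors="pt")
    model(**inputs)

print(captured["activations"].shape)  # (batch, tokens, hidden) = (1, n, 768)
```

Interpretability and alignment researchers build on precisely this kind of access to trace how and where a model represents concepts.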
Finally, though open source thrives today for practical and economic reasons, the ideological reasons remain. AI will enable new and enhanced forms of self-expression, and in this sense “free as in speech, not as in beer” seems more relevant than ever.
OpenAI’s ChatGPT (at least the version based on GPT-3.5) is free as in beer, but not always free as in speech. Just recently, a venture capitalist named Patrick Blumenthal made a custom version of GPT-4 (using OpenAI’s new ability to create your own “GPT”) that allowed users to ask questions about the latest Jeffrey Epstein-related court documents. OpenAI quickly blocked this, offering no explanation as to why. Regardless of one’s opinion of the Jeffrey Epstein affair, surely these documents are of legitimate public interest.
This brings up that pesky issue of the Constitution. There is a good argument, based on existing precedent, that AI models themselves are a form of protected self-expression under the First Amendment. AI models are simply the fruits of mathematical knowledge, and it is far from clear that the government can broadly restrict their distribution at will.
An earlier version of this conflict played out with cryptography. In the 1990s, the federal government deemed cryptography a munition under the International Traffic in Arms Regulations. People who made cryptographic algorithms were forced to obtain a license in order to “export” them. A federal district court deemed this unconstitutional in 1996, in a ruling later affirmed by the Ninth Circuit Court of Appeals. Soon after, the restrictions were relaxed. Today, widespread access to cryptographic tools lies at the heart of our digital security.
Despite the benefits, there are undoubtedly risks that come with open-source AI. Most model makers, open or closed, put guardrails in place to prevent their models from producing harmful content (racist statements, bomb-making instructions, hacking assistance, etc.). With open models, these restrictions can be lifted. It’s not necessarily trivial to do so, but it can be done—so we should assume that it will be done at least sometimes.
We shouldn’t assume that lifting these restrictions will always be a bad thing. As in the example of the historian or the AI researcher trying to learn more about how models work, people might have legitimate reasons for doing so.
At a deeper level, though, while companies should be allowed to place reasonable guardrails on their models, it is our laws, and not AI companies, that ultimately define the bounds of acceptable conduct. It is perfectly legal for someone to say that 9/11 was an inside job, or that the COVID vaccines are an Illuminati conspiracy. It is illegal, however, depending on the context, to defame or impersonate somebody.
There are plenty of perfectly legal (and in some cases, true) things that OpenAI’s models will refuse to say—at least sometimes; one should never assume that the answer one receives from an LLM is the only kind of answer it will give. Earlier versions of ChatGPT, for example, would happily write essays about why Roe v. Wade was a good judicial decision, but refuse to do the same for Dobbs v. Jackson Women’s Health Organization (though it will now usually comply with both prompts, to be fair). OpenAI and other AI firms are free to make whatever decisions they think will lead to the best product and the lowest reputational risk. It is far from clear, however, that those decisions should be dispositive for all uses of language models.
People will commit criminal and malicious acts using open-source AI, just as they do with other kinds of open-source software (if you’ve ever used a Wi-Fi network without permission, or shared a login to a paywalled news source—digital forms of jaywalking—you’ve quite possibly committed a crime using open-source software). We don’t need to regulate open-source models per se; we need to regulate illicit uses of AI. I’ll be going into much more depth on this subject in a forthcoming piece in National Affairs, so I’ll wait until that comes out to say more.
At least for now, using AI at scale isn’t something most people can do at home—unless they happen to have racks of GPUs (graphics processing units, the chips most commonly used to run AI models) in their basement. This means most consumers and businesses use frontier AI models (including open-source ones) hosted by hyperscalers like Amazon Web Services or Microsoft Azure. Kevin Xu has thoughtfully suggested that such services should be required to implement Know Your Customer (KYC) procedures. This is a smart way to allow law enforcement to police illicit uses of AI (and other digital tools). In fact, I think KYC, if implemented well, can be useful for a whole host of new infrastructural services that will blossom over the coming years; I hope to say more about that in the near future.
Open source is a uniquely American intellectual invention. Had the Internet been born in China, it is hard to imagine such a thing flourishing. It allows rapid exchange of and iteration upon useful knowledge between millions of people all over the world. We need this model of technological development now more than ever, as we face what could be the most difficult technological transition in human history. Open source’s benefits—openness, freedom, and technology diffusion—are all things that have undergirded America’s success since our founding. We abandon them at our peril.
Those of you who know this world well will say that I am ignoring a crucial risk of open source: the creation of novel chemical and biological weapons. I plan to dive into that can of worms next week.