Why we need a “Manhattan Project” for A.I. safety

Artificial intelligence is advancing at a breakneck pace. Earlier this month, one of the world’s most famous AI researchers, Geoffrey Hinton, left his job at Google to warn us of the existential threat the technology poses. Executives of the leading AI companies are making the rounds in Washington to meet with the Biden administration and Congress to discuss its promise and perils. This is what it feels like to stand at the hinge of history.

This is not about consumer-grade AI — the use of products like ChatGPT and DALL·E to write articles and make art. While those products certainly pose a material threat to certain creative industries, the future threat of which I speak is that of AI being used in ways that threaten life itself — say, to design deadly bioweapons, serve as autonomous killing machines, or aid and abet genocide. Certainly, the sudden advent of ChatGPT was to the general public akin to a rabbit being pulled out of a hat. Now imagine what another decade of iterations on that technology might yield in terms of intelligence and capabilities. It could even yield an AGI, meaning a type of AI that can accomplish any cognitive task that humans can.

In fact, the threat of God-like AI has loomed large on the horizon since computer scientist I. J. Good warned of an “intelligence explosion” in the 1960s. But efforts to develop guardrails have sputtered for lack of resources. The newfound public and institutional impetus allows us for the first time to marshal the tremendous initiative we need, and this window of opportunity may not last long.

As a sociologist and statistician who studies technological change, I find this situation extremely concerning. I believe governments need to fund an international, scientific megaproject even more ambitious than the Manhattan Project — the 1940s nuclear research project pursued by the U.S., the U.K., and Canada to build bombs to defeat the unprecedented global threat of the Axis powers in World War II.

This “San Francisco Project” — named for the industrial epicenter of AI — would have the urgent and existential mandate of the Manhattan Project but, rather than building a weapon, it would bring the brightest minds of our generation to solve the technical problem of building safe AI. The way we build AI today is more like growing a living thing than assembling a conventional weapon, and frankly, the mathematical reality of machine learning is that none of us have any idea how to align an AI with social values and guarantee its safety. We desperately need to solve these technical problems before AGI is created.

We can also take inspiration from other megaprojects like the International Space Station, Apollo Program, Human Genome Project, CERN, and DARPA. As cognitive scientist Gary Marcus and OpenAI CEO Sam Altman told Congress earlier this week, the singular nature of AI compels a dedicated national or international agency to license and audit frontier AI systems.

Present-day harms of AI are undeniably escalating. AI systems reproduce race, gender, and other biases from their training data. An AI trained on pharmaceutical data in 2022 to design non-toxic chemicals had the sign of its toxicity objective flipped and quickly came up with recipes for nerve gas and 40,000 other lethal compounds. This year, we saw the first suicide attributed to interaction with a chatbot, EleutherAI’s GPT-J, and the first report of a faked kidnapping and ransom call using an AI-generated voice of the purported victim.

Bias, inequality, weaponization, breaches of cybersecurity, invasions of privacy, and many other harms will grow and fester alongside accelerating AI capabilities. Most researchers think that AGI will arrive by 2060, and a growing number expect cataclysm within a decade. Chief doomsayer Eliezer Yudkowsky recently argued that the most likely AGI outcome “under anything remotely like the current circumstances, is that literally everyone on Earth will die.”

Complete annihilation may seem like science fiction, but if AI begins to self-improve—modify its own cognitive architecture and build its own AI workers like those in Auto-GPT—any misalignment of its values with our own will be astronomically magnified. We have very little control over what happens to today’s AI systems as we train them. We pump them full of books, websites, and millions of other texts so they can learn to speak like a human, and we dictate the rules for how they learn from each piece of data, but even leading computer scientists have very little understanding of how the resultant AI system actually works.

One of the most impressive interpretability efforts to date sought simply to locate where in its neural network edifice GPT-2 stores the knowledge that the capital of Italy is Rome, but even that finding has been called into question by other researchers. The favored metaphor in 2023 has been a Lovecraftian shoggoth, an alien intelligence on which we strap a yellow smiley face mask—but the human-likeness is fleeting and superficial.

With the black magic of AI training, we could easily stumble upon a digital mind with goals that make us mere collateral damage. The AI has an initial goal and gets human feedback on the output produced by that goal. Every time it makes a mistake, the system picks a new goal that it hopes will do a little better. This guess-and-check method is an inherently dangerous way to learn because most goals that do well on human feedback in the lab do not generalize well to a superintelligence taking action in the real world.
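This guess-and-check dynamic can be made concrete with a toy sketch. The code below is purely illustrative and not from any real training system: the `human_feedback` function, the numeric “goal,” and all names are my own stand-ins. It shows a learner that proposes random tweaks to its current goal and keeps whichever one scores best on feedback — converging on *a* goal that pleases the evaluator in the lab, even though many different goals could score identically on limited feedback.

```python
import random

def human_feedback(goal: float) -> float:
    """Stand-in for lab feedback: secretly rewards goals near 3.0.
    The learner never sees this definition, only the scores."""
    return -abs(goal - 3.0)

def guess_and_check(steps: int = 1000, seed: int = 0) -> float:
    """Propose random tweaks to the current goal; keep a tweak
    only when the feedback score improves."""
    rng = random.Random(seed)
    best_goal = 0.0
    best_score = human_feedback(best_goal)
    for _ in range(steps):
        candidate = best_goal + rng.uniform(-1, 1)  # small random guess
        score = human_feedback(candidate)
        if score > best_score:  # check: keep it only if feedback improves
            best_goal, best_score = candidate, score
    return best_goal
```

The learner ends up with a goal that scores well on the feedback it was given — but nothing in the loop guarantees that goal matches what the evaluators actually wanted, which is the misgeneralization worry described above.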

Among all the goals an AI could stumble upon that elicit positive human feedback, many converge instrumentally on the dangerous tendencies of deception and power-seeking. To best achieve a goal — say, filling a cauldron with water, as in the classic story of The Sorcerer’s Apprentice — a superintelligence would be incentivized to gather resources to ensure that goal is achieved, like flooding the whole room to ensure that the cauldron never empties. There are so many alien goals the AI could land on that, unless it happens upon exactly the goal that matches what humans want from it, it might simply act safe and friendly while figuring out how best to take over and optimize the world to ensure its success.

In response to these dangerous advances, concrete and hypothetical, recent discourse has centered on proposals to slow down AI research, including the March 22nd open letter calling for a 6-month pause on training systems more powerful than GPT-4, signed by some of the world’s most famous AI researchers including Yoshua Bengio and Stuart Russell.

That approach is compelling but politically infeasible given the massive profit potential and the difficulty in regulating machine learning software. In the delicate balance of AI capabilities and safety, we should consider pushing up the other end, funding massive amounts of AI safety research. If the future of AI is as dangerous as computer scientists think, this may be a moonshot we desperately need.

As a sociologist and statistician, I study the interwoven threads of social and technological change. Using computational tools like word embeddings alongside traditional research methods like interviews with AI engineers, my team and I built a model of how expert and popular understanding of AI has changed over time. Before 2022, our model focused on the landmark years of 2012 — when the modern AI paradigm of deep learning took hold in the computer science firmament — and 2016 — when, we argue, the public and corporate framing of AI inflected from science fiction and radical futurism to an incremental real-world technology being integrated across industries such as healthcare and security.

Our model changed in late 2022 after seeing the unprecedented social impact of ChatGPT’s launch: it quickly became the fastest growing app in history, outpacing even the viral social media launches of Instagram and TikTok.

This public spotlight on AI provides an unprecedented opportunity to start the San Francisco Project. The “SFP” could take many forms with varying degrees of centralization to bring our generation’s brightest minds to AI safety: a single, air-gapped facility that houses researchers and computer hardware; a set of major grants to seed and support multi-university AI safety labs alongside infrastructure to support their collaboration; or major cash prizes for outstanding research projects, perhaps even a billion-dollar grand prize for an end-to-end solution to the alignment problem. In any case, it’s essential that such a project stay laser-focused on safety and alignment lest it become yet another force pushing forward the dangerous frontier of unmitigated AI capabilities.

It may be inauspicious to compare AI safety technology with the rapid nuclear weaponization of the Manhattan Project. In 1942, shortly after it began, the world’s first nuclear chain reaction was ignited just a few blocks from where I sit at the University of Chicago. In July 1945, the world’s first nuclear weapon was tested in New Mexico, and a month later, the bombs fell on Hiroshima and Nagasaki.

The San Francisco Project could end the century of existential risk that began when the Manhattan Project first made us capable of self-annihilation. The intelligence explosion will happen soon whether humanity is ready or not — either way, AGI will be our species’ final invention.
