How can you know that a super-intelligent AI is trying to do what you asked it to do?
The answer, it turns out, is: not easily. And unfortunately, an increasing number of AI safety researchers are warning that this is a problem we’re going to have to solve sooner rather than later, if we want to avoid bad outcomes — which may include a species-level catastrophe.
The type of failure mode whereby AIs optimize for things other than those we ask them to is known as an inner alignment failure in the context of AI safety. It’s distinct from outer alignment failure, which is what happens when you ask your AI to do something that turns out to be dangerous, and it was only recognized by AI safety researchers as its own category of risk in 2019. And the researcher who led that effort is my guest for this episode of the podcast, Evan Hubinger.
Evan is an AI safety veteran who’s done research at leading AI labs like OpenAI, and whose experience also includes stints at Google, Ripple and Yelp. He currently works at the Machine Intelligence Research Institute (MIRI) as a Research Fellow, and joined me to talk about his views on AI safety, the alignment problem, and whether humanity is likely to survive the advent of superintelligent AI.