Why evaluations may miss deception

Zershaaneh Qureshi describes limits of interpretability and testing, including fake alignment, sandbagging, hidden reasoning, and sleeper-agent behavior.

Play episode from 39:36

chevron_right

Transcript

chevron_right

Transcript

Episode notes

Hundreds of prominent AI scientists and other notable figures signed a statement in 2023 saying that mitigating the risk of extinction from AI should be a global priority. At 80,000 Hours, we’ve considered risks from AI to be the world’s most pressing problem since 2016.

But what led us to this conclusion? Could AI really cause human extinction? We’re not certain, but we think the risk is worth taking very seriously.

In particular, as companies create increasingly powerful AI systems, there’s a concerning chance that:

These AI systems may develop dangerous long-term goals we don’t want.
To pursue these goals, they may seek power and undermine the safeguards meant to contain them.
They may even aim to disempower humanity and potentially cause our extinction.

This article is written by Cody Fenwick and Zershaaneh Qureshi, and narrated by Zershaaneh Qureshi. It discusses why future AI systems could disempower humanity, what current AI research reveals about behaviours like power-seeking and deception, and how you can help mitigate the dangers.

You can see the original article — packed with graphs, images, footnotes, and further resources — on the 80,000 Hours website:

https://80000hours.org/problem-profiles/risks-from-power-seeking-ai/

Chapters:

Risks from power-seeking AI systems (00:01:00)
Introduction (00:01:17)
Summary (00:03:09)
Why are the risks from power-seeking AI a pressing world problem? (00:04:04)
Section 1: Humans will likely build advanced AI systems with long-term goals (00:05:43)
Section 2: AIs with long-term goals may be inclined to seek power (00:11:32)
Section 3: These power-seeking AI systems could successfully disempower humanity (00:26:26)
Section 4. People might create power-seeking AI systems without enough safeguards, despite the risks (00:38:34)
Section 5: Work on this problem is neglected and tractable (00:47:37)
Section 6: What are the arguments against working on this problem? (00:59:20)
Section 7: How you can help (01:25:07)
Thank you for listening (01:28:56)

Audio editing: Dominic Armstrong
Production: Zershaaneh Qureshi, Elizabeth Cox, and Katy Moore

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

Home Top podcasts Popular guests Top books