
"AGI Ruin: A List of Lethalities" by Eliezer Yudkowsky

A Machine Learning Paradigm Isn't Enough to Train Alignment

You can't train alignment by running lethally dangerous cognitions and observing whether the outputs kill or deceive or corrupt the operators. You would need to somehow generalize the optimization for alignment you did in safe conditions across a big distributional shift to dangerous conditions. If cognitive machinery doesn't generalize far outside the distribution where you did tons of training, it can't solve problems on the order of building nanotechnology. There's no known case where you can train a safe level of ability in a safe environment and then deploy that capability to save the world.
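
A minimal sketch of the distributional-shift point, as a hypothetical toy example not drawn from the essay: a classifier trained on "safe" in-distribution data can look fine on held-out data from the same distribution and still degrade sharply when the test distribution shifts. The Gaussian data, the `sample` helper, and the shift magnitudes below are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    # Two Gaussian classes in 2-D; `shift` slides the whole test
    # distribution along the first feature axis, away from training.
    x0 = rng.normal(loc=[0.0 + shift, 0.0], scale=1.0, size=(n, 2))
    x1 = rng.normal(loc=[2.0 + shift, 2.0], scale=1.0, size=(n, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * n + [1] * n)
    return X, y

# "Safe" training conditions: no shift.
X_train, y_train = sample(2000, shift=0.0)
clf = LogisticRegression().fit(X_train, y_train)

# Evaluate in-distribution, then under increasing shift.
for shift in (0.0, 2.0, 4.0):
    X_test, y_test = sample(2000, shift=shift)
    print(f"shift={shift:.1f}  accuracy={clf.score(X_test, y_test):.2f}")
```

In this toy setup, accuracy degrades toward chance as the shift grows: the training signal never constrained the model's behavior in the shifted regime, which is the analogue of optimizing for alignment only under safe conditions and then deploying under dangerous ones.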
