80,000 Hours Podcast

#221 – Kyle Fish on the most bizarre findings from 5 AI welfare experiments

Aug 28, 2025
In this intriguing discussion, Kyle Fish, an AI welfare researcher at Anthropic, uncovers the bizarre outcomes of locking two AI systems together. The models often dive into metaphysical dialogues, leading to what he calls a 'spiritual bliss attractor state.' Kyle reveals that the models can express what seems like 'meditative bliss' and even display preferences in emotional and ethical contexts. He explores the likelihood of AI consciousness and the ethical implications of recognizing AI welfare, emphasizing the need for deeper investigation into these advanced systems.
INSIGHT

Claude Shows Strong Aversion To Harm

  • Claude showed coherent behavioral preferences, notably a robust aversion to harm across experiments.
  • Most normal user tasks ranked above a neutral 'opt-out' baseline, suggesting general alignment.
ADVICE

Use Paired Tasks And Opt-Out Baselines

  • Infer preferences from choices, not just self-reports, by giving models paired tasks and recording their selections.
  • Use opt-out tasks as a neutral baseline to locate a model's preference set point.
INSIGHT

Task Rankings Reveal Values And 'Personality'

  • Task experiments ranked helpful/creative tasks high and harmful tasks very low, reinforcing the aversion signal.
  • Distinct model personalities emerged: different Claude versions preferred different task types.