
LessWrong (30+ Karma) “Protecting humanity and Claude from rationalization and unaligned AI” by Kaj_Sotala
My first academic piece on risks from AI was a talk that I gave at the 2009 European Conference on Philosophy and Computing. Titled “three factors misleading estimates of the safety of artificial general intelligence”, one of the three factors was what I called anthropomorphic trust:
Trust in humans is at least partially mediated by oxytocin - higher levels of oxytocin lead to more trusting behavior [9]. Trusting somebody and then not being betrayed by the trustee increases oxytocin levels [10], and the hormone has been linked to pair bonding. Testing an AGI for reliability and then having one's trust repaid seems likely to trigger the same mechanism. Thus people may believe that an AGI that has cooperated with them for a long time has “earned their trust”, and feel protective whenever the AGI's friendliness is questioned.
In simpler words, if someone has repeatedly been nice and trustworthy toward you, then you are more likely to trust them. This operates on an emotional level that bypasses intellectual analysis. And if someone suggests that your friend might be a bad actor in some way, you're more likely to want to take your friend's side [...]
The original text contained 1 footnote, which was omitted from this narration.
---
First published:
March 19th, 2026
---
Narrated by TYPE III AUDIO.
