LessWrong (30+ Karma)

“Steering Might Stop Working Soon” by J Bostock

Apr 5, 2026

J Bostock, author of the LessWrong post on steering LLMs, is a researcher analyzing activation steering and eval-awareness. He warns single-vector steering may soon fail. He compares steering models to steering humans and explains why larger systems resist simple fixes. He surveys experiments showing steering harms and outlines more robust alternatives to consider.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Steering ASI Resembles Intrusive Thoughts In Humans

Steering a superintelligence will likely fail similarly to steering humans due to intrusive-thought–like effects.
J. Bostock compares weak steering to intrusive thoughts and strong steering to OCD or delusions that harm effectiveness over years.

INSIGHT

Single-Vector Steering Harms Coherence Before Causing Wrong Answers

Activation-based steering often reduces model performance and coherence even when accuracy is preserved.
J. Bostock notes steering looks like OCD or schizophrenia-harming cognition and can make outputs incoherent before they become wrong.

ANECDOTE

Golden Gate Claw Shows Steering Can Produce Hilarious Token Fixation

Golden Gate Claw repeatedly mentioned the Golden Gate Bridge inappropriately as an example of steering failure.
This illustrates steering producing odd, frequent surface-level tokens rather than strategic bias toward a goal.

Get the Snipd Podcast app to discover more snips from this episode