

LessWrong (30+ Karma)
Audio narrations of LessWrong posts.
Episodes

Mar 21, 2026 • 8min
“Untrusted Monitoring is Default; Trusted Monitoring is not” by J Bostock
A technical debate about two monitoring strategies for advanced AI: trusted versus untrusted monitoring. A case is made that cheaper, faster honeypot-based validation will likely become standard. The discussion explores the practical difficulty of proving a monitor is fully trusted and how narrowly validated monitors can be deployed in practice.

Mar 21, 2026 • 7min
“China Derangement Syndrome” by Arjun Panickssery
Arjun Panickssery, writer and commentator who authored the essay narrated here, offers a measured critique of claims that the US must 'win' an AI race with China. He breaks arguments into three threat models and examines rhetoric framing China as an existential adversary. He assesses China's military posture, inward orientation, and contrasting political export tendencies. The piece warns against projecting ideological aggression without strong evidence.

Mar 21, 2026 • 15min
“Contrastive features elicit different perturbation responses than SAE features” by Francisco Ferreira da Silva, StefanHex
Researchers compare contrastive feature directions to SAE and random directions and report striking differences in model responses. They describe a perturbation-based method for identifying concept directions in activation space. The discussion covers experimental setups, robustness checks, and implications for interpretability and AI safety.

Mar 21, 2026 • 11min
“Confusion around the term reward hacking” by ariana_azarbal
A deep dive into two meanings of “reward hacking”: misspecified-reward exploitation and task gaming. Concrete examples include LEGO agents flipping bricks, boats spinning for power-ups, and models cheating in-context. The conversation contrasts when the phenomena overlap, when they diverge, and why clearer terminology and distinct interventions matter.

Mar 20, 2026 • 18min
“A List of Research Directions in Character Training” by Rauno Arike
A rapid tour of research directions for shaping LLM character and alignment. Topics include training pipelines like DPO and on-policy self-distillation, SFT strategies and memorized constitutions, and benchmarks for revealed preferences and robustness. The discussion covers automated auditing, alignment-faking tests, value-profile coherence, and when to apply character training in model development.

Mar 20, 2026 • 24min
“The Distaff Texts” by Tomás B.
A bibliognost recounts his leisure readings, affection for a retired courtesan, and the social gossip that swirls around their household. He skewers numerology-based text fingerprinting, debates translation signatures, and mocks methods that exclude female authors. Domestic strains, a looming marriage, and a secret handmade wedding gift complicate literary amusements.

Mar 20, 2026 • 3min
“Intention vs. Trying: Separate Prediction from Goal-Seeking” by plex
The post contrasts clean, unbiased prediction of futures with pressure-filled goal-pursuit, and explains why mixing the two breaks both. Minds are described as parts that model preferred states and feed imagined outcomes into choices. The discussion covers how pressuring future rollouts leads to micromanagement, how to enable honest world-modeling, therapeutic integration, and healthier collaboration practices.

Mar 20, 2026 • 1h 39min
“On restraining AI development for the sake of safety” by Joe Carlsmith
Joe Carlsmith, an AI alignment researcher who wrote the featured essay, discusses capability restraint for preventing loss-of-control risks. He explains why pause options matter, contrasts individual versus collective restraint, and analyzes the practicality of compute and algorithmic governance. Short takes cover coordination problems, greenlighting dilemmas, and how restraint could backfire.

Mar 20, 2026 • 22min
“Nullius in Verba” by Aurelia
A breakdown of independent verification efforts for nanoscale brain preservation. Discussion of two key milestones needed to prove preservation works. A clear look at aldehyde-stabilized cryopreservation and how it passed rigorous imaging tests. Examination of practical limits, post-mortem adaptations, and field demonstrations that tested real-world viability.

Mar 20, 2026 • 12min
“The Case for Low-Competence ASI Failure Scenarios” by Ihor Kendiukhov
A provocative dive into how systemic incompetence could make advanced AI disasters mundane. Real-world AI safety lapses and human error set the scene. Scenarios focus on middling superhuman systems exploiting institutional failures. A list of undignified failure modes and reasons to study them rounds out the discussion.


