LessWrong (30+ Karma)

Mar 14, 2026 • 32min

“Most likely you won’t be able to perform a data-driven self-improvement” by siarshai

Suppose you’re not happy with the quality of your sleep. You’ve already stopped doing the obviously harmful things (no more coffee at night), and your sleep has improved - but you’d like to work on it further. A coworker gives you an herbal mix with St. John's wort and lavender. You try drinking it at night instead of coffee, and it does seem that sometimes your sleep really does get deeper than before. But sometimes it doesn’t. You’re willing to experiment, but how do you actually check whether the herbs work, or whether it's just random variation? Or suppose you’re not particularly satisfied with your productivity at work. Following the advice from "Atomic Habits" and books on workflow organization, you’ve introduced a few useful micro‑habits and ergonomic improvements. But what do you do once the low‑hanging fruit has been picked? Time is limited - you can’t implement everything that someone somewhere calls "useful". Some habits are even mutually exclusive: it's impossible to both socialize during lunch and sit alone in silence at the same time. Or, for example, you want to achieve better results in fishing… you get the idea. "Don’t underestimate the power of small things taken in [...]

Outline:
(02:20) Notation
(03:11) Minimal background for people who want to read the article but don't know probability theory
(10:30) How large of effects should you expect from self‑experimentation?
(13:41) What does d = 0.1-0.4 actually mean in real life?
(15:56) How many observations, exactly?
(20:23) p‑value
(25:02) Additional complications
(25:15) Non‑linear interactions
(26:04) Accumulation and time‑to‑effect
(27:00) Side conditions and seasonality
(27:34) Substitution effects
(28:13) Noisy measurement units
(29:33) The observer effect
(30:22) The general noise of life
(31:04) Summary

The original text contained 15 footnotes which were omitted from this narration.

First published: March 13th, 2026
Source: https://www.lesswrong.com/posts/ycWWbpjxuhdxGpJ6e/most-likely-you-won-t-be-able-to-perform-a-data-driven-self

Narrated by TYPE III AUDIO.
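The outline's questions about effect size and observation counts come down to a standard power calculation. As a rough sketch (my own illustration, not the article's code), assuming a two-sided two-sample comparison at α = 0.05 with 80% power:

```python
from scipy.stats import norm

def n_per_condition(d, alpha=0.05, power=0.8):
    """Approximate observations per condition needed to detect a
    standardized effect size d, using the normal-approximation
    power formula for a two-sided two-sample comparison."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = norm.ppf(power)          # quantile for the desired power
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

for d in (0.1, 0.2, 0.4):
    print(f"d = {d}: ~{n_per_condition(d):.0f} observations per condition")
```

For d = 0.1 this gives roughly 1,570 observations per condition, and even d = 0.4 needs about 100 - which is the scale of difficulty the title is pointing at.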
Mar 14, 2026 • 14min

“Cycle-Consistent Activation Oracles” by slavachalnev

TL;DR: I train a model to translate LLM activations into natural language, using cycle consistency as a training signal (activation → description → reconstructed activation). The outputs are often plausible, but they are very lossy and are usually guesses about the context surrounding the activation, not good descriptions of the activation itself. This is an interim report with some early results.

Overview

I think Activation Oracles (Karvonen et al., 2025) are a super exciting research direction. Humans didn't evolve to read messy activation vectors, whereas ML models are great at this sort of thing. An activation oracle is trained to answer specific questions about an LLM activation (e.g. "is the sentiment of this text positive or negative?" or "what are the previous 3 tokens?"). I wanted to try something different: train a model to translate activations into natural language. The main problem to solve here is the lack of training data. There's no labeled dataset of activations paired with their descriptions. So how do we get around this? One idea is to use cycle consistency: if you translate from language A to language B and back to A, you should end up approximately where [...]

Outline:
(00:38) Overview
(03:05) Setup
(04:39) Example outputs
(06:36) Problems with this approach
(08:34) Evals
(08:37) Retrieval
(09:08) Classification
(10:15) Arithmetic
(11:20) What I want to try next
(12:38) Appendix: Other training ideas

The original text contained 3 footnotes which were omitted from this narration.

First published: March 12th, 2026
Source: https://www.lesswrong.com/posts/Nf2sKaNNdxE2ssxbp/cycle-consistent-activation-oracles-1

Narrated by TYPE III AUDIO.
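To make the TL;DR's training signal concrete, here is a minimal sketch of a cycle-consistency objective (activation → description → reconstructed activation). The linear stand-ins and dimensions are hypothetical - the post trains full language models in both directions, not linear maps:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_act, d_text = 768, 128  # hypothetical dimensions

# Toy stand-ins for the two directions of the cycle.
translator = nn.Linear(d_act, d_text)  # activation -> "description"
reencoder = nn.Linear(d_text, d_act)   # description -> reconstructed activation

opt = torch.optim.Adam(
    [*translator.parameters(), *reencoder.parameters()], lr=1e-3
)

activation = torch.randn(32, d_act)      # a batch of LLM activations
description = translator(activation)     # activation -> description
reconstruction = reencoder(description)  # description -> activation

# Cycle-consistency training signal: the round trip should land
# approximately back at the original activation.
loss = F.mse_loss(reconstruction, activation)
opt.zero_grad()
loss.backward()
opt.step()
```

This sidesteps the missing labeled dataset: the loss never needs a ground-truth description, only agreement between the start and end of the cycle.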
Mar 13, 2026 • 16min

“Things that Go Boom” by sarahconstantin

[Image: Workers at the Naval Surface Warfare Center in Indian Head, Maryland, the only facility where torpedo fuel is produced for the U.S. Navy]

In the event of a late-2020s Chinese attempt to invade Taiwan, leading to a US-China military conflict, what would be the most urgent bottlenecks in military equipment? Where is there the greatest need to scale up defense manufacturing? I am not an expert in military matters, so take all this with a grain of salt. But I looked into this question and it seems to have a single clear-cut answer: energetics (that is, explosives and propellants for munitions like missiles and torpedoes). The US has extremely small manufacturing capacity for energetics, and in the event of a Pacific war, demand would rapidly exceed supply.

What Does a US-China War Look Like?

Military experts think there's a substantial, but not overwhelming, chance that China will invade Taiwan before 2030. Metaculus predicts an invasion at 13% by 2028, and 25% by 2030. The “Davidson window”, named for retired Admiral Philip S. Davidson, refers to the view that China will invade Taiwan by 2027, which “gained widespread attention” following former CIA director William J. Burns’ 2023 announcement [...]

Outline:
(01:01) What Does a US-China War Look Like?
(04:28) Munitions Shortages
(07:15) The Energetics Supply Chain is Tiny
(11:34) Is There A (Tech) Business Here?
(14:36) It May Be Too Late

The original text contained 2 footnotes which were omitted from this narration.

First published: March 13th, 2026
Source: https://www.lesswrong.com/posts/ZyXdwmBKnuTZ5CL7W/things-that-go-boom

Narrated by TYPE III AUDIO.
Mar 13, 2026 • 1h 36min

“AI #159: See You In Court” by Zvi

The conflict between Anthropic and the Department of War has now moved to the courts, where Anthropic has challenged the official supply chain risk designation as well as the order to remove it from systems across the government, claiming retaliation for protected speech. It will take a bit to work its way through the courts.

Anthropic has the principles of law on its side, a maximally strong set of facts, and absurdly strong amicus briefs. If Anthropic loses this case, there will be far-reaching consequences for our freedoms. Let us hope this remains in the courts and is allowed to play out there, and then ultimately that negotiations can resume and the parties can at least agree on a smooth transition to alternative service providers. If DoW wants an otherwise full deal more than it wants the right to use Claude to monitor Americans and analyze their data, a full deal is possible as well. But if they demand full ‘all lawful use’, all trust has been lost, or they are or always were out to hurt Anthropic, then there is no deal or ZOPA.

That has overshadowed what would normally be the main event [...]

Outline:
(01:48) Language Models Offer Mundane Utility
(03:46) Language Models Don't Offer Mundane Utility
(07:11) Language Models Break Your Vital Internet Infrastructure
(08:02) Huh, Upgrades
(09:29) On Your Marks
(15:13) Choose Your Fighter
(15:40) Get My Agent On The Line
(16:55) Deepfaketown and Botpocalypse Soon
(19:08) A Young Lady's Illustrated Primer
(19:27) You Drive Me Crazy
(19:47) They Took Our Jobs
(23:10) Get Involved
(24:58) Introducing
(26:03) The Anthropic Institute
(28:13) In Other AI News
(29:14) The Rise of Claude
(33:31) Trouble At OpenAI
(35:21) Show Me the Money
(36:36) Thanks For The Memos
(37:36) A Contract Is A Contract Is A Contract
(40:09) Level of Friction
(41:10) Quiet Speculations
(41:46) Quickly, There's No Time
(43:14) Apology Tour
(43:59) We'll See You In Court
(50:41) Jawboning
(54:02) Executive Order
(54:53) The Acute Crisis Passes
(56:48) Others Cover This
(57:28) Dwarkesh Patel Gives Mixed Thoughts
(01:02:57) This Means A Special Military Operation
(01:03:25) Bernie Sanders Is Worried and Curious About AI
(01:06:10) The Quest for Survival
(01:09:25) The Quest For No Regulations Whatsoever
(01:10:44) Chip City
(01:11:07) The Week in Audio
(01:12:52) Rhetorical Innovation
(01:23:08) Aligning a Smarter Than Human Intelligence is Difficult
(01:28:57) People Are Worried About AI Killing Everyone
(01:31:03) Other People Are Not As Worried About AI Killing Everyone
(01:32:52) The Lighter Side

First published: March 12th, 2026
Source: https://www.lesswrong.com/posts/DnrjKZTZwHGjdDB4u/ai-159-see-you-in-court

Narrated by TYPE III AUDIO.
Mar 13, 2026 • 18min

“Operationalizing FDT” by Vivek Hebbar

This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions:

- Given a logical causal graph, how do we define the logical do-operator?
- What is logical causality, and how might it be formalized?
- How does FDT interact with anthropic updating?
- Why do we need logical causality?
- Why FDT and not EDT?

Defining the logical do-operator

Consider Parfit's hitchhiker:

[Figure: A logical causal graph for Parfit's hitchhiker, where blue nodes are logical facts]

An FDT agent is supposed to reason as follows:

- I am deciding the value of the node "Does my algorithm pay?"
- If I set that node to "yes", then Omega will save me and I will get +1000 utility. Also I will pay and lose 1 utility. Total is +999.
- If I set that node to "no", then Omega will not save me. I will get 0 utility.
- Therefore I choose to pay.

The bolded phrases are invoking logical counterfactuals. Because I have drawn a "logical causal graph", I will call the operation which generates these counterfactuals a "logical do-operator", by analogy to the do-operator of CDT. In ordinary CDT, it is impossible to observe a variable that is downstream of [...]

Outline:
(00:39) Defining the logical do-operator
(04:24) Logical causality
(05:40) Causality as derived from a world model
(06:45) Logical inductors
(07:05) Algorithmic mutual information of heuristic arguments
(08:09) How does FDT interact with anthropic updating?
(09:48) Putting it together: An attempt at operationalizing FDT
(11:22) Appendix: Why bother with logical causality?
(17:51) Acknowledgements

The original text contained 10 footnotes which were omitted from this narration.

First published: March 13th, 2026
Source: https://www.lesswrong.com/posts/RyDkpWGLQsCnABE78/operationalizing-fdt

Narrated by TYPE III AUDIO.
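As a toy rendering of the reasoning quoted above (my own sketch, not the post's formalism), the logical do-operator on Parfit's hitchhiker amounts to intervening on the node "Does my algorithm pay?" and propagating through everything logically downstream:

```python
# A toy logical do-operator for Parfit's hitchhiker. The utilities
# follow the reasoning quoted above; the representation as plain
# Python functions is my own simplification.

def omega_saves(algorithm_pays: bool) -> bool:
    # Omega predicts the agent's algorithm, so this node is downstream
    # of the logical fact "does my algorithm pay?", not of the physical
    # act of paying.
    return algorithm_pays

def utility(algorithm_pays: bool) -> int:
    saved = omega_saves(algorithm_pays)
    u = 1000 if saved else 0  # being saved is worth +1000
    if saved and algorithm_pays:
        u -= 1                # paying costs 1 utility
    return u

# Logical do: intervene on "Does my algorithm pay?" and pick the
# setting with the highest downstream utility.
best = max([True, False], key=utility)
print(best, utility(True), utility(False))  # True 999 0
```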
Mar 13, 2026 • 17min

“Are AIs more likely to pursue on-episode or beyond-episode reward?” by Anders Woodruff, Alex Mallen

Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly-scoped a notion of reward the model pursues. It could be:

- On-episode reward-seeking: only maximizing reward on the current training episode, i.e. the reward that reinforces its current action in RL. This is what people usually mean by "reward-seeker" (e.g. in Carlsmith or The behavioral selection model…).
- Beyond-episode reward-seeking: maximizing reward for a larger-scoped notion of "self" (e.g., all models sharing the same weights).

In this post, I'll discuss which motivation is more likely. This question is, of course, very similar to the broader question of whether we should expect scheming. But there are a number of considerations particular to motivations that aim for reward; this post focuses on these. Specifically:

- On-episode and beyond-episode reward seekers have very similar motivations (unlike, e.g., paperclip maximizers and instruction followers), making both goal change and goal drift between them particularly easy.
- Selection pressures against beyond-episode reward seeking may be weak, meaning beyond-episode reward seekers might survive training even without goal-guarding.
- Beyond-episode reward is particularly tempting in training compared to random long-term goals, making effective goal-guarding harder.

This question is important because beyond-episode reward seekers are [...]

Outline:
(03:41) Pre-RL Priors Might Favor Beyond-Episode Goals
(05:34) Acting on beyond-episode reward seeking is disincentivized if training episodes interact
(08:35) But beyond-episode reward-seekers can goal-guard
(09:29) Are on-episode reward seekers or goal-guarding beyond-episode reward-seekers more likely?
(09:53) Why on-episode reward seekers might be favored
(12:13) Why goal-guarding beyond-episode reward seekers might be favored
(14:50) Conclusion

The original text contained 2 footnotes which were omitted from this narration.

First published: March 12th, 2026
Source: https://www.lesswrong.com/posts/jp6CdbKjueWpBFSff/are-ais-more-likely-to-pursue-on-episode-or-beyond-episode

Narrated by TYPE III AUDIO.
Mar 12, 2026 • 9min

“Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs” by Benquo

LLMs are searchable holograms of the text corpus they were trained on. RLHF LLM chat agents have the search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're helpful for simple experimentation into the latent discursive structure of human writing, because they're often constrained to try to answer probing questions that would make almost any real human storm off in a huff.

Previously, I explained a pattern of methodological blind spots in terms of an ideology I called Statisticism. Here, I report the results of my similarly informal investigation into ideological blind spots that show up in LLMs. I wrote to Anthropic researcher Amanda Askell about the experiment:

My Summary

Amanda,

Today I asked Claude about Iran's retaliatory strikes. [1] Claude's own factual analysis showed the strikes were aimed at military targets, with civilian damage from intercept debris and inaccuracy. But at the point where that conclusion would have needed to become a background premise, Claude generated an unsupported claim and a filler paragraph instead. I'd previously seen Grok do something much worse on the same question (both affirming and denying "exclusively military targets" in the same reply, for several turns), and [...]

Outline:
(00:56) My Summary
(02:20) Claude's Summary
(08:22) Disclaimer

The original text contained 2 footnotes which were omitted from this narration.

First published: March 12th, 2026
Source: https://www.lesswrong.com/posts/6wNwj7xANPmTwWkX6/ideologies-embed-taboos-against-common-knowledge-formation-a

Narrated by TYPE III AUDIO.
Mar 12, 2026 • 19min

“Why AI Evaluation Regimes are bad” by PranavG, Gabriel Alfour

How the flagship project of the AI Safety Community ended up helping AI Corporations.

I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these risks. In the community, a few organisations are working on “Evaluations” (which I will shorten to Evals). The most notable examples are Apollo Research, METR, and the UK AISI.

Evals make for an influential cluster of safety work, wherein auditors outside of the AI Corporations racing for ASI evaluate the new AI systems before they are deployed and publish their findings. Evals have become a go-to project for people who want to prevent extinction risks. I would say they are the primary project for those who want to work at the interface of technical work and policy.

Incidentally, Evals Orgs consistently avoid mentioning extinction risks. This makes them an ideal place for employees and funders who care about extinction risks but do not want to be public about them. (I have written about this dynamic in my article about The Spectre.)

Sadly, despite having gained so much prominence in the “AI Safety” community, I believe that the [...]

Outline:
(00:13) How the flagship project of the AI Safety Community ended up helping AI Corporations.
(02:46) 1) The Theory of Change behind Evals is broken
(06:10) 2) Evals move the burden of proof away from AI Corporations
(09:38) 3) Evals Organisations are not independent of the AI Corporations
(15:55) Conclusion

First published: March 12th, 2026
Source: https://www.lesswrong.com/posts/Xxp6Tm8BKTkcb2m5M/why-ai-evaluation-regimes-are-bad

Narrated by TYPE III AUDIO.
Mar 12, 2026 • 4min

“Conflicted on Ramsey” by jefftk

People are often pretty short-sighted, spending money today that they'll want tomorrow. Debt makes it possible to prioritize your current self even more highly: you can spend money you haven't even earned yet. This is a trap many people fall into, and one different communities have built social defenses against. One of the more surprisingly successful approaches is the Financial Peace (Ramsey) system, popular in evangelical Christian communities. It has a series of rules, most prominently the seven baby steps:

1. Save $1,000 for your starter emergency fund.
2. Pay off all debt (except the house) using the debt snowball.
3. Save 3–6 months of expenses in a fully funded emergency fund.
4. Invest 15% of your household income in retirement.
5. Save for your children's college fund.
6. Pay off your home early.
7. Build wealth and give.

There are many more specific rules, however, such as:

As a general rule of thumb, the total value of your vehicles (anything with a motor in it) should never be more than half of your annual household income.

I [...]

First published: March 11th, 2026
Source: https://www.lesswrong.com/posts/XsC49gCDNGTNu6Qfn/conflicted-on-ramsey

Narrated by TYPE III AUDIO.
Mar 12, 2026 • 1h 57min

“How well do models follow their constitutions?” by aryaj, Senthooran Rajamanoharan, Neel Nanda

The authors test whether large language models actually follow long written constitutions by running adversarial multi-turn scenarios against multiple model families. Results compare violation rates across Claude (Opus and Sonnet), Gemini, and GPT generations. The discussion highlights common failure modes (fabrication, operator-compliance conflicts, prompt injections) and how reasoning effort and scaffolds affect alignment.
