LessWrong (Curated & Popular)

LessWrong
Sep 15, 2022 • 9min

"Humans are not automatically strategic" by Anna Salamon

https://www.lesswrong.com/posts/PBRWb2Em5SNeWYwwB/humans-are-not-automatically-strategic

Reply to: A "Failure to Evaluate Return-on-Time" Fallacy

Lionhearted writes:

[A] large majority of otherwise smart people spend time doing semi-productive things, when there are massively productive opportunities untapped.

A somewhat silly example: Let's say someone aspires to be a comedian, the best comedian ever, and to make a living doing comedy. He wants nothing else, it is his purpose. And he decides that in order to become a better comedian, he will watch re-runs of the old television cartoon 'Garfield and Friends' that was on TV from 1988 to 1995.

...I’m curious as to why.
Sep 14, 2022 • 13min

"Moral strategies at different capability levels" by Richard Ngo

https://www.lesswrong.com/posts/jDQm7YJxLnMnSNHFu/moral-strategies-at-different-capability-levels

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Let’s consider three ways you can be altruistic towards another agent:

You care about their welfare: some metric of how good their life is (as defined by you). I’ll call this care-morality - it endorses things like promoting their happiness, reducing their suffering, and hedonic utilitarian behavior (if you care about many agents).

You care about their agency: their ability to achieve their goals (as defined by them). I’ll call this cooperation-morality - it endorses things like honesty, fairness, deontological behavior towards others, and some virtues (like honor).

You care about obedience to them. I’ll call this deference-morality - it endorses things like loyalty, humility, and respect for authority.

I think a lot of unresolved tensions in ethics come from seeing these types of morality as in opposition to each other, when they’re actually complementary:
Sep 11, 2022 • 24min

"Worlds Where Iterative Design Fails" by John Wentworth

https://www.lesswrong.com/posts/xFotXGEotcKouifky/worlds-where-iterative-design-fails

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In most technical fields, we try designs, see what goes wrong, and iterate until it works. That’s the core iterative design loop. Humans are good at iterative design, and it works well in most fields in practice.

In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.

By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails. So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.

Why might the iterative design loop fail? Most readers probably know of two widely-discussed reasons:

Fast takeoff: there will be a sudden phase shift in capabilities, and the design of whatever system first undergoes that phase shift needs to be right on the first try.

Deceptive inner misalignment: an inner agent behaves well in order to deceive us, so we can’t tell there’s a problem just by trying stuff and looking at the system’s behavior.

… but these certainly aren’t the only reasons the iterative design loop potentially fails. This post will mostly talk about some particularly simple and robust failure modes, but I’d encourage you to think on your own about others. These are the things which kill us; they’re worth thinking about.
Sep 11, 2022 • 1h 35min

"(My understanding of) What Everyone in Technical Alignment is Doing and Why" by Thomas Larsen & Eli Lifland

https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is

Despite a clear need for it, a good source explaining who is doing what and why in technical AI alignment doesn't exist. This is our attempt to produce such a resource. We expect to be inaccurate in some ways, but it seems great to get out there and let Cunningham’s Law do its thing.[1]

The main body contains our understanding of what everyone is doing in technical alignment and why, as well as at least one of our opinions on each approach. We include supplements visualizing differences between approaches and Thomas’s big-picture view on alignment. The opinions written are Thomas and Eli’s independent impressions, many of which have low resilience. Our all-things-considered views are significantly more uncertain.

This post was mostly written while Thomas was participating in the 2022 iteration of the SERI MATS program, under mentor John Wentworth. Thomas benefited immensely from conversations with other SERI MATS participants, John Wentworth, and many others whom he met this summer.
Sep 9, 2022 • 46min

"Unifying Bargaining Notions (1/2)" by Diffractor

https://www.lesswrong.com/posts/rYDas2DDGGDRc8gGB/unifying-bargaining-notions-1-2

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a two-part sequence of posts, in the ancient LessWrong tradition of decision-theory-posting. This first part will introduce various concepts of bargaining solutions and dividing gains from trade, which the reader may or may not already be familiar with.

The upcoming part will be about how all the concepts introduced in this post are secretly just different facets of the same underlying notion, as originally discovered by John Harsanyi back in 1963 and rediscovered by me from a completely different direction. The fact that the various solution concepts in cooperative game theory are all merely special cases of a General Bargaining Solution for arbitrary games is, as far as I can tell, not common knowledge on Less Wrong.

Bargaining Games

Let's say there's a couple with a set of available restaurant options. Neither of them wants to go without the other, and if they fail to come to an agreement, the fallback is eating a cold canned-soup dinner at home, the worst of all the options. However, they have different restaurant preferences. What's the fair way to split the gains from trade?

Well, it depends on their restaurant preferences, and preferences are typically encoded with utility functions. Since both sides agree that the disagreement outcome is the worst, they might as well index that as 0 utility, and their favorite respective restaurants as 1 utility, and denominate all the other options in terms of what probability mix between a cold canned dinner and their favorite restaurant would make them indifferent. If there's something that scores 0.9 utility for both, it's probably a pretty good pick!
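To make the normalization described in this excerpt concrete, here is a minimal Python sketch. The restaurant names and raw scores are invented for illustration, and the maximin pick at the end is only a stand-in heuristic for "good for both", not the bargaining solution the post goes on to develop.

```python
# A minimal sketch of the utility normalization described above (not the post's
# eventual bargaining solution). Restaurant names and raw scores are hypothetical;
# only the normalization scheme comes from the excerpt.

def normalize(raw, disagreement, favorite):
    """Rescale a raw utility so the disagreement point is 0 and the favorite option is 1."""
    return (raw - disagreement) / (favorite - disagreement)

# Hypothetical raw utilities for each person over the options.
raw_a = {"soup at home": 0, "sushi": 10, "pizza": 8, "tacos": 3}
raw_b = {"soup at home": 0, "sushi": 4, "pizza": 9, "tacos": 10}

def normalized(raw):
    disagreement = raw["soup at home"]   # worst option for both, indexed as 0 utility
    favorite = max(raw.values())         # each person's favorite, indexed as 1 utility
    return {opt: normalize(u, disagreement, favorite) for opt, u in raw.items()}

norm_a, norm_b = normalized(raw_a), normalized(raw_b)

# One crude way to spot a "pretty good pick for both": the option whose worse-off
# party still does as well as possible (a maximin heuristic, purely illustrative).
best = max(raw_a, key=lambda opt: min(norm_a[opt], norm_b[opt]))
print(best, norm_a[best], norm_b[best])
```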
Sep 5, 2022 • 1h 48min

"Simulators" by Janus

https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators#fncrt8wagfir9

Summary

TL;DR: Self-supervised learning may create AGI or its foundation. What would that look like?

Unlike the limit of RL, the limit of self-supervised learning has received surprisingly little conceptual attention, and recent progress has made deconfusion in this domain more pressing.

Existing AI taxonomies either fail to capture important properties of self-supervised models or lead to confusing propositions. For instance, GPT policies do not seem globally agentic, yet can be conditioned to behave in goal-directed ways. This post describes a frame that enables more natural reasoning about properties like agency: GPT, insofar as it is inner-aligned, is a simulator which can simulate agentic and non-agentic simulacra.

The purpose of this post is to capture these objects in words so GPT can reference them and provide a better foundation for understanding them.

I use the generic term “simulator” to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc.). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt). Analogously, a predictive model of physics can be used to compute rollouts of phenomena in simulation. A goal-directed agent which evolves according to physics can be simulated by the physics rule parameterized by an initial state, but the same rule could also propagate agents with different values, or non-agentic phenomena like rocks. This ontological distinction between simulator (rule) and simulacra (phenomena) applies directly to generative models like GPT.
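The rollout procedure the excerpt describes, sampling from the model's posterior and folding the sample back into the condition, has a simple shape in code. The sketch below only illustrates that loop; predict_next is a hypothetical placeholder for whatever conditional model is being run, not a real API.

```python
# A rough sketch of the rollout loop described above: repeatedly sample from the
# model's predictive distribution and append the sample to the condition (prompt).
# `predict_next` is a hypothetical stand-in for a trained conditional model.
import random

def predict_next(prompt):
    # Placeholder: a real simulator (e.g. a trained language model) would return
    # its learned conditional distribution P(next token | prompt) here.
    return {"a": 0.5, "b": 0.3, "c": 0.2}

def rollout(prompt, n_steps):
    for _ in range(n_steps):
        dist = predict_next(prompt)                      # posterior over the next token
        tokens, probs = zip(*dist.items())
        tok = random.choices(tokens, weights=probs)[0]   # sample from the posterior
        prompt = prompt + [tok]                          # update the condition
    return prompt

print(rollout(["start"], 5))
```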
Aug 8, 2022 • 23min

"Humans provide an untapped wealth of evidence about alignment" by TurnTrout & Quintin Pope

https://www.lesswrong.com/posts/CjFZeDD6iCnNubDoS/humans-provide-an-untapped-wealth-of-evidence-about#fnref7a5ti4623qb

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TL;DR: To even consciously consider an alignment research direction, you should have evidence to locate it as a promising lead. As best I can tell, many directions seem interesting but do not have strong evidence of being “entangled” with the alignment problem such that I expect them to yield significant insights.

For example, “we can solve an easier version of the alignment problem by first figuring out how to build an AI which maximizes the number of real-world diamonds” has intuitive appeal and plausibility, but this claim doesn’t have to be true and this problem does not necessarily have a natural, compact solution. In contrast, there do in fact exist humans who care about diamonds. Therefore, there are guaranteed-to-exist alignment insights concerning the way people come to care about e.g. real-world diamonds.

“Consider how humans navigate the alignment subproblem you’re worried about” is a habit which I (TurnTrout) picked up from Quintin Pope. I wrote the post, he originated the tactic.
Jul 30, 2022 • 22min

"Changing the world through slack & hobbies" by Steven Byrnes

https://www.lesswrong.com/posts/DdDt5NXkfuxAnAvGJ/changing-the-world-through-slack-and-hobbies

Introduction

In EA orthodoxy, if you're really serious about EA, the three alternatives that people most often seem to talk about are (1) “direct work” in a job that furthers a very important cause; (2) “earning to give”; (3) earning “career capital” that will help you do those things in the future, e.g. by getting a PhD or teaching yourself ML.

By contrast, there’s not much talk of: (4) being in a job / situation where you have extra time and energy and freedom to explore things that seem interesting and important.

But that last one is really important!
Jul 28, 2022 • 19min

"«Boundaries», Part 1: a key missing concept from utility theory" by Andrew Critch

https://www.lesswrong.com/posts/8oMF8Lv5jiGaQSFvo/boundaries-part-1-a-key-missing-concept-from-utility-theory

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is Part 1 of my «Boundaries» Sequence on LessWrong.

Summary: «Boundaries» are a missing concept from the axioms of game theory and bargaining theory, which might help pin down certain features of multi-agent rationality (this post), and have broader implications for effective altruism discourse and x-risk (future posts).

1. Boundaries (of living systems)

Epistemic status: me describing what I mean.

With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demski, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.

When I say boundary, I don't just mean an arbitrary constraint or social norm. I mean something that could also be called a membrane in a generalized sense, i.e., a layer of stuff-of-some-kind that physically or cognitively separates a living system from its environment, that 'carves reality at the joints' in a way that isn't an entirely subjective judgement of the living system itself. Here are some examples that I hope will convey my meaning:
Jul 24, 2022 • 13min

"ITT-passing and civility are good; "charity" is bad; steelmanning is niche" by Rob Bensinger

https://www.lesswrong.com/posts/MdZyLnLHuaHrCskjy/itt-passing-and-civility-are-good-charity-is-bad

I often object to claims like "charity/steelmanning is an argumentative virtue". This post collects a few things I and others have said on this topic over the last few years.

My current view is:

Steelmanning ("the art of addressing the best form of the other person’s argument, even if it’s not the one they presented") is a useful niche skill, but I don't think it should be a standard thing you bring out in most arguments, even if it's an argument with someone you strongly disagree with. Instead, arguments should mostly be organized around things like:

Object-level learning and truth-seeking, with the conversation as a convenient excuse to improve your own model of something you're curious about.

Trying to pass each other's Ideological Turing Test (ITT), or some generalization thereof. The ability to pass ITTs is the ability "to state opposing views as clearly and persuasively as their proponents". The version of "ITT" I care about is one where you understand the substance of someone's view well enough to be able to correctly describe their beliefs and reasoning; I don't care about whether you can imitate their speech patterns, jargon, etc.

Trying to identify and resolve cruxes: things that would make one or the other of you (or both) change your mind about the topic under discussion.

Argumentative charity is a complete mess of a concept — people use it to mean a wide variety of things, and many of those things are actively bad, or liable to cause severe epistemic distortion and miscommunication.

Some version of civility and/or friendliness and/or a spirit of camaraderie and goodwill seems like a useful ingredient in many discussions. I'm not sure how best to achieve this in ways that are emotionally honest ("pretending to be cheerful and warm when you don't feel that way" sounds like the wrong move to me), or how to achieve this without steering away from candor, openness, "realness", etc.
