

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Dec 1, 2023 • 12min
EA - Doing Good Effectively is Unusual by Richard Y Chappell
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Doing Good Effectively is Unusual, published by Richard Y Chappell on December 1, 2023 on The Effective Altruism Forum.
tl;dr: It actually seems pretty rare for people to care about the general good as such (i.e., optimizing cause-agnostic impartial well-being), as we can see by prejudged dismissals of EA concern for non-standard beneficiaries and for doing good via indirect means.
Introduction
Moral truisms may still be widely ignored. The moral truism underlying Effective Altruism is that we have strong reasons to do more good, and it's worth adopting the efficient promotion of the impartial good among one's life projects. (One can do this in a "non-totalizing" way, i.e. without it being one's only project.) Anyone who personally adopts that project (to any non-trivial extent) counts, in my book, as an effective altruist (whatever their opinion of the EA movement and its institutions).
Many people don't adopt this explicit goal as a personal priority to any degree, but still do significant good via more particular commitments (to more specific communities, causes, or individuals). That's fine by me, but I do think that even people who aren't themselves effective altruists should recognize the EA project as a good one. We should all generally want people to be more motivated by efficient impartial beneficence (on the margins), even if you don't think it's the only thing that matters.
A popular (but silly) criticism of effective altruism is that it is entirely vacuous. As Freddie deBoer writes:
[T]his sounds like so obvious and general a project that it can hardly denote a specific philosophy or project at all… [T]his is an utterly banal set of goals that are shared by literally everyone who sincerely tries to act charitably.
This is clearly false. As Bentham's Bulldog replies, most people give lip service to doing good effectively. But then they go and donate to local children's hospitals and puppy shelters, while showing no interest in learning about neglected tropical diseases or improving factory-farmed animal welfare.
DeBoer himself dismisses without argument "weird" concerns about shrimp welfare and existential risk reduction, which one very clearly cannot just dismiss as a priori irrelevant if one actually cares about promoting the impartial good. Caring about the impartial good in that way entails a very unusual degree of open-mindedness.
The fact is: open-minded, cause-agnostic concern for promoting the impartial good is vanishingly rare. As a result, the few people who sincerely have and act upon this concern end up striking everyone else as extremely weird. We all know that the way you're supposed to behave is to be a good ally to your social group, do normal socially-approved things that signal conformity and loyalty (and perhaps a non-threatening degree of generosity towards socially-approved recipients).
"Literally everyone" does this much, I guess. But what sort of weirdo starts looking into numbers, and argues on that basis that chickens are a higher priority than puppies? Horrible utilitarian nerds, that's who! Or so the normie social defense mechanism seems to be (never mind that efficient impartial beneficence is not exclusively utilitarian, and ought rather to be a significant component of any reasonable moral view).
Let's be honest
Everyone is motivated to rationalize what they're antecedently inclined to do. I know I do plenty of suboptimal things, due to both (i) failing to care as much as would be objectively warranted about many things (from non-cute animals to distant people), and (ii) being akratic and failing to be sufficiently moved even by things I value, like my own health and well-being. But I try to be honest about it, and recognize that (like everyone) I'm just irrational in a lot of ways, and that's OK, even if it isn't ideal.
Vegans care more about animals than I ...

Dec 1, 2023 • 20min
AF - Thoughts on "AI is easy to control" by Pope & Belrose by Steve Byrnes
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on "AI is easy to control" by Pope & Belrose, published by Steve Byrnes on December 1, 2023 on The AI Alignment Forum.
Quintin Pope & Nora Belrose have a new "AI Optimists" website, along with a new essay "AI is easy to control", arguing that the risk of human extinction due to future AI ("AI x-risk") is a mere 1% ("a tail risk worth considering, but not the dominant source of risk in the world"). (I'm much more pessimistic.) It makes lots of interesting arguments, and I'm happy that the authors are engaging in substantive and productive discourse, unlike the ad hominem vibes-based drivel which has grown increasingly common on both sides of the AI x-risk issue in recent months.
This is not a comprehensive rebuttal or anything, but rather picking up on a few threads that seem important for where we disagree, or where I have something I want to say.
Summary / table-of-contents:
Note: I think Sections 1 & 4 are the main reasons that I'm much more pessimistic about AI x-risk than Pope & Belrose, whereas Sections 2 & 3 are more nitpicky.
Section 1 argues that even if controllable AI has an "easy" technical solution, there are still good reasons to be concerned about AI takeover, because of things like competition and coordination issues, and in fact I would still be overall pessimistic about our prospects.
Section 2 talks about the terms "black box" versus "white box".
Section 3 talks about what if anything we learn from "human alignment", including some background on how I think about human innate drives.
Section 4 argues that pretty much the whole essay would need to be thrown out if future AI is trained in a substantially different way from current LLMs. If this strikes you as a bizarre unthinkable hypothetical, yes I am here to tell you that other types of AI do actually exist, and I specifically discuss the example of "brain-like AGI" (a version of actor-critic model-based RL), spelling out a bunch of areas where the essay makes claims that wouldn't apply to that type of AI, and more generally how it would differ from LLMs in safety-relevant ways.
1. Even if controllable AI has an "easy" technical solution, I'd still be pessimistic about AI takeover
Most of Pope & Belrose's essay is on the narrow question of whether the AI control problem has an easy technical solution. That's great! I'm strongly in favor of arguing about narrow questions. And after this section I'll be talking about that narrow question as well. But the authors do also bring up the broader question of whether AI takeover is likely to happen, all things considered. These are not the same question; for example, there could be an easy technical solution, but people don't use it.
So, for this section only, I will assume for the sake of argument that there is in fact an easy technical solution to the AI control and/or alignment problem. Unfortunately, in this world, I would still think future catastrophic takeover by out-of-control AI is not only plausible but likely.
Suppose someone makes an AI that really really wants something in the world to happen, in the same way a person might really really want to get out of debt, or Elon Musk really really wants for there to be a Mars colony - including via means-end reasoning, out-of-the-box solutions, inventing new tools to solve problems, and so on. The classic concern about such an AI is instrumental convergence.
But before we get to that, why might we suppose that someone might make an AI that really really wants something in the world to happen? Well, lots of reasons:
People have been trying to do exactly that since the dawn of AI.
Humans often really really want something in the world to happen (e.g., for there to be more efficient solar cells, for my country to win the war, to make lots of money, to do a certain very impressive thing that will win fame and investors and NeurIPS pape...

Dec 1, 2023 • 5min
EA - Effektiv Spenden's Impact Evaluation 2019-2023 (exec. summary) by Sebastian Schienle
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Effektiv Spenden's Impact Evaluation 2019-2023 (exec. summary), published by Sebastian Schienle on December 1, 2023 on The Effective Altruism Forum.
effektiv-spenden.org is an effective giving platform in Germany and Switzerland that was founded in 2019. To reflect on our past impact, we examine Effektiv Spenden's cost-effectiveness as a "giving multiplier" from 2019 to 2022 in terms of how much money is directed to highly effective charities due to our work. We have two primary reasons for this analysis:
To provide past and future donors with transparent information about our cost-effectiveness;
To hold ourselves accountable, particularly in a situation where we are investing in further growth of our platform.
We provide both a simple multiple (or "leverage ratio") of donations raised for highly effective charities compared to our operating costs, as well as an analysis of the counterfactual (i.e. what would have happened had we never existed).
Our analysis complements our Annual Review 2022 (in German) and builds on previous updates and annual reviews, such as, amongst others, our reviews of 2021 and 2019. In both instances, we also included initial perspectives on our counterfactual impact. Since then, the investigation of Founders Pledge into giving multipliers as well as Giving What We Can (GWWC)'s recent impact evaluation have provided further methodological refinements. In line with GWWC's approach, we shift to 3-year time horizons, which we feel better represents our impact over time and avoids short-term distortions.
However, our attempt to quantify our "giving multiplier" deviates in some parts from the methodologies and assumptions applied by Founders Pledge and GWWC, and is an initial, shallow analysis only that we intend to develop further in the future.
Below, we share the key results of our analysis. We invite you to share any comments or takeaways you may have, either by directly commenting or by reaching out to sebastian.schienle@effektiv-spenden.org.
Key results
In 2022, we moved €15.3 million to highly effective charities, amounting to €37 million in total donations raised since Effektiv Spenden was founded in 2019.
Our leverage ratio, i.e. the money moved to highly effective charities per €1 spent on our operations, was 55.7 and 40.8 for the 2019-2021 and 2020-2022 time periods respectively.[1]
Our best-guess counterfactual giving multiplier is 17.9 and 13.0 for those two time periods, robustly exceeding 10x. This means that for every €1 spent on Effektiv Spenden between 2019-2022, we are confident that we facilitated more than €10 of support for highly effective charities which would not have materialized had Effektiv Spenden not existed.
Our conservative counterfactual giving multiplier is 10.4 for 2019-2021, and 7.5 for 2020-2022.
The decline of our multiplier over time is driven by investment in our team. Over the last year, our team has grown substantially to enable further growth. While this negatively impacts our giving multiplier in the short term, we consider it a necessary prerequisite for that growth.
Our ambition is to return to a best-guess counterfactual multiplier of at least 15x in the coming years. That said, ultimately our goal is not to maximize the multiplier, but to maximize counterfactually raised funds for highly effective charities. (As long as our work remains above a reasonable cost-effectiveness bar.)
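To make the relationship between the leverage ratio and the counterfactual multiplier concrete, here is a minimal sketch of the arithmetic in Python, with purely illustrative numbers (actual operating costs and counterfactuality estimates are not reproduced in this summary):

```python
# Illustrative numbers only -- not Effektiv Spenden's actual figures.
donations_moved = 30_000_000   # EUR moved to highly effective charities over 3 years
operating_costs = 550_000      # EUR spent on operations over the same period
counterfactual_share = 0.32    # assumed fraction of donations that would not have
                               # reached effective charities had the platform not existed

# Leverage ratio: money moved per euro spent on operations.
leverage_ratio = donations_moved / operating_costs  # ~54.5x

# Counterfactual giving multiplier: only counterfactually-raised funds count.
giving_multiplier = (counterfactual_share * donations_moved) / operating_costs  # ~17.5x

print(f"leverage {leverage_ratio:.1f}x, counterfactual multiplier {giving_multiplier:.1f}x")
```

On this reading, the multiplier is simply the leverage ratio discounted by the assumed counterfactuality share, which is why it declines whenever costs grow faster than counterfactually-attributable donations.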
How to interpret our results
We consider our analysis an important stocktake of our impact, and a further contribution to the growing body of giving multiplier analyses in the effective giving space. That said, we also recognize the limitations of our approach and want to call out some caveats to guide interpretation of these results.
Our analysis is largely retrospective, i.e. it compares our past money moved with operating ...

Dec 1, 2023 • 11min
AF - How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs") by Joe Carlsmith
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs"), published by Joe Carlsmith on December 1, 2023 on The AI Alignment Forum.
This is Section 2.2.4.3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
How much useful, alignment-relevant cognitive work can be done using AIs with short-term goals?
So overall, I think that training our models to pursue long-term goals - whether via long episodes, or via short episodes aimed at inducing long-term optimization - makes the sort of beyond-episode goals that motivate scheming more likely to arise. So this raises the question: do we need to train our models to pursue long-term goals?
Plausibly, there will be strong general incentives to do this. That is: people want optimization power specifically applied to long-term goals like "my company being as profitable as possible in a year." So, plausibly, they'll try to train AIs that optimize in this way. (Though note that this isn't the same as saying that there are strong incentives to create AIs that optimize the state of the galaxies in the year five trillion.)
Indeed, there's a case to be made that even our alignment work, today, is specifically pushing towards the creation of models with long-term - and indeed, beyond-episode - goals. Thus, for example, when a lab trains a model to be "harmless," then even though it is plausibly using fairly "short-episode" training (e.g., RLHF on user interactions), it intends a form of "harmlessness" that extends quite far into the future, rather than cutting off the horizon of its concern after e.g. an interaction with the user is complete.
That is: if a user asks for help building a bomb, the lab wants the model to refuse, even if the bomb in question won't be set off for a decade.[1] And this example is emblematic of a broader dynamic: namely, that even when we aren't actively optimizing for a specific long-term outcome (e.g., "my company makes a lot of money by next year"), we often have in mind a wide variety of long-term outcomes that we want to avoid (e.g., "the drinking water in a century is not poisoned"), and which it wouldn't be acceptable to cause in the course of accomplishing some short-term task.
Humans, after all, care about the state of the future for at least decades in advance (and for some humans: much longer), and we'll want artificial optimization to reflect this concern.
So overall, I think there is indeed quite a bit of pressure to steer our AIs towards various forms of long-term optimization. However, suppose that we're not blindly following this pressure. Rather, we're specifically trying to use our AIs to perform the sort of alignment-relevant cognitive work I discussed above - e.g., work on interpretability, scalable oversight, monitoring, control, coordination amongst humans, the general science of deep learning, alternative (and more controllable/interpretable) AI paradigms, and the like. Do we need to train models with long-term goals for that kind of work?
In many cases, I think the answer is no. In particular: I think that a lot of this sort of alignment-relevant work can be performed by models that are e.g. generating research papers in response to human+AI supervision over fairly short timescales, suggesting/conducting relatively short-term experiments, looking over a codebase and pointing out bugs, conducting relatively short-term security tests and red-teaming attempts, and so on.
We can talk about whether it will be possible to generate rewar...

Dec 1, 2023 • 8min
EA - My Personal Priorities, Charity, Judaism, and Effective Altruism by Davidmanheim
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Personal Priorities, Charity, Judaism, and Effective Altruism, published by Davidmanheim on December 1, 2023 on The Effective Altruism Forum.
I've thought a lot about charitable giving over the past decade, both from a universalist and from a Jewish standpoint. I have a few thoughts, including about how my views have evolved over time. This is a very different perspective than many in Effective Altruism have, but I think it's important, as a member of a community that benefits from being diverse rather than monolithic, for those who dissent from community consensus to make it clear that it's acceptable to do so. Hopefully, this can be useful both to other people who are interested in a more Jewish perspective, and for everyone else interested in thinking about balancing different personal views with effective giving.
Background
To start, there is a strong Jewish tradition, and a legal requirement in the Shulchan Aruch, the code of Jewish law, for giving at least ten percent of your income to the poor and to community organizations - and for those who can afford it, ideally, a fifth of their income. (For some reason, no-one ever points out that second part.)
So I always gave a tenth of my income to charity, even before starting my first post-college job, per Jewish customary law. My parents inculcated this as a value and a norm since childhood, and it's one I am grateful for. (One thing I did differently than most, and credit my sister with suggesting, is putting 10% of my paycheck directly into a second account which was exclusively for charity.)
My giving as a child, and as a young adult, largely centered on local Jewish organizations, poverty assistance for local poor people and the poor in Israel, and community organizations I interacted with. In the following years, I started thinking more critically about my giving, and charity to community organizations seemed in tension with a more universalist impulse, what you might call "Tikkun Olam"- a directive to improve the world as a whole. I was very conflicted about this for quite some time, but have come to some tentative conclusions, and I wanted to outline my current views, informed by a combination of the Jewish sources and my other beliefs.
Judaism vs. Utilitarians
I am lucky enough, like most people I know personally, to have significantly more money than is strictly needed to feed, clothe, and house myself and my family. The rest of the money, however, needs to be allocated - for savings, for entertainment, for community, and for charity. And my conclusion, after reflection about the question, is that those last two are separate both conceptually and as a matter of Jewish conception of charity.
My synagogue is a wonderful community institution that I benefit from, and I believe it is proper to pay my fair share. And in Halacha, Jewish law, community organizations are valid recipients of charity. But there is also a strong justification for prioritizing giving to those most in need.
Utilitarian philosophers have advocated for giving on an impartial basis, seeing a contradiction between universalism and their "selfish" impulse to justify keeping more than a minimal amount of their own money. To maximize global utility, all money over a bare minimum should go to those most in need, or otherwise be maximally impactful. In contrast, Halacha is clear that you and your family come first, and giving more than a token amount of charity must wait until your family's needs are met.
More than that, it is clearly opposed to giving more than 20% of your income under usual circumstances, i.e. short of significant excess wealth. And once you are giving to charity, Jewish sources suggest progressively growing moral circles, first giving to family in need, then neighbors, then the community. In contrast to this, Jewish law also contai...

Dec 1, 2023 • 49min
EA - ALLFED's 2023 Highlights by Sonia Cassidy
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ALLFED's 2023 Highlights, published by Sonia Cassidy on December 1, 2023 on The Effective Altruism Forum.
Executive Summary
Welcome to ALLFED's 2023 Highlights, our annual update on what the Alliance to Feed the Earth in Disasters (ALLFED) has been up to this year. From advising Open Philanthropy on food security, to 6 new papers submitted for peer review, to writing preparedness/response plans for 3 governments, we have made substantial strides towards our mission to increase resilience to global food catastrophes.
There is much more we could do, as we are currently funding-constrained. If you like what you read in this post, please see also our 2023 ALLFED Marginal Funding Appeal and consider donating to us via our website this giving season.
Increasing geopolitical tensions have presented an opportunity to translate ALLFED's scientific research into actionable policy proposals with endeavors such as writing national preparedness and response plans against abrupt sunlight reduction scenarios (ASRS, e.g. volcanic or nuclear winter) for various countries such as the United States, Australia, and Argentina.
We continue to explore further options for governments to plan and develop technology through pilots. We have worked towards producing the evidence base needed to inform decision making prior to and during global catastrophe, with 6 new core ALLFED papers submitted for peer review. We have also redoubled efforts in studying responses to potential mass infrastructure collapse scenarios, such as from large-scale nuclear electromagnetic pulse, AI-powered cyberattacks, or extreme pandemics (e.g. high transmissibility and mortality causing mass absenteeism). On this topic, we have produced around half a dozen papers over the years (including one this year).
Here is what you can read about in these 2023 Highlights:
We kick off with a strategy section and some insights into our top-level thinking and ALLFED's Theory of Change.
We then report on our research, including 6 new papers submitted for peer review and some contraptions we have engineered. According to an analysis of the Cambridge Centre for Existential Risk paper database, ALLFED team members are the second, third, fourteenth, and twenty-first most prolific X-risk academic researchers in the world.
We talk about our policy work next, focusing on engagements with the governments of Australia and Argentina (through partnership with the Spanish-speaking GCR org) as well as United States policy engagement (which included endorsement of Senator Edward Markey's Health Impacts of Nuclear War Act).
We then move to communications, especially our GCR field-building and science communications. It has been gratifying to see ALLFED's work propagating and an increasing use of our field-defining terminology, which we give examples of here.
We follow up with events: circa 20 presentations, plus an account of a recent workshop we gave at EAGx Australia.
We then move to operations, the backbone of ALLFED's day-to-day activities, and an important element of our organizational resilience for response in a GCR (one modality of our Theory of Change).
Our team section comes next, where we celebrate our team. ALLFED's multilingual team members are located around the globe and can talk about our work, and deliver workshops and presentations in a number of languages, including Spanish, German, French, Russian, Czech, Polish, Kannada, Tamil, Hindi, Filipino, Yoruba and more. In the team section, we also share with you a fun seaweed-eating experiment some of our team members participated in to experience a 10% seaweed diet.
We close with thanks and acknowledgements, to all our donors, collaborators and supporters. We would like to take this opportunity to especially thank Greg Colbourn and the Centre for Enabling EA Learning & Research (CEEALA...

Dec 1, 2023 • 39min
LW - How useful is mechanistic interpretability? by ryan greenblatt
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How useful is mechanistic interpretability?, published by ryan greenblatt on December 1, 2023 on LessWrong.
Opening positions
I'm somewhat skeptical about mech interp (bottom-up or substantial reverse engineering style interp):
Current work seems very far from being useful (it isn't currently useful) or from explaining much of what's going on inside of models in key cases. But it's hard to be very confident that a new field won't work! And things can be far from useful, but become useful via slowly becoming more powerful, etc.
In particular, current work fails to explain much of the performance of models which makes me think that it's quite far from ambitious success and likely also usefulness. I think this even after seeing recent results like dictionary learning results (though results along these lines were a positive update for me overall).
There isn't a story which-makes-much-sense-and-seems-that-plausible-to-me for how mech interp allows for strongly solving core problems like auditing for deception or being able to supervise superhuman models which carry out actions we don't understand (e.g. ELK).
That said, all things considered, mech interp seems like a reasonable bet to put some resources in.
I'm excited about various mech interp projects which either:
Aim to more directly measure and iterate on key metrics of usefulness for mech interp
Try to use mech interp to do something useful and compare to other methods (I'm fine with substantial mech interp industrial policy, but we do actually care about the final comparison. By industrial policy, I mean subsidizing current work even if mech interp isn't competitive yet because it seems promising.)
I'm excited about two main outcomes from this dialogue:
Figuring out whether or not we agree on the core claims I wrote above. (Either get consensus or find crux ideally)
Figuring out which projects we'd be excited about which would substantially positively update us about mech interp.
Maybe another question which is interesting: even if mech interp isn't that good for safety, maybe it's pretty close to stuff which is great and is good practice.
Another outcome that I'm interested in is personally figuring out how to better articulate and communicate various takes around mech interp.
By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."
I feel pretty on board with this definition.
Our arguments here do in fact have immediate implications for your research, and the research of your scholars, implying that you should prioritize projects of the following forms:
Doing immediately useful stuff with mech interp (and probably non-mech interp), to get us closer to model-internals-based techniques adding value. This would improve the health of the field, because it's much better for a field to be able to evaluate work in simple ways.
Work which tries to establish the core ambitious hopes for mech interp, rather than work which scales up mediocre-quality results to be more complicated or on bigger models.
What I want from this dialogue:
Mostly an excuse to form more coherent takes on why mech interp matters, limitations, priorities, etc
I'd be excited if this results in us identifying concrete cruxes
I'd be even more excited if we identify concrete projects that could help illuminate these cruxes (especially things I could give to my new army of MATS scholars!)
I'd like to explicitly note I'm excited to find great concrete projects!
Stream of ...

Nov 30, 2023 • 22min
AF - FixDT by Abram Demski
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: FixDT, published by Abram Demski on November 30, 2023 on The AI Alignment Forum.
FixDT is not a very new decision theory, but little has been written about it afaict, and it's interesting. So I'm going to write about it.
TJ asked me to write this article to "offset" not engaging with Active Inference more. The name "fixDT" is due to Scott Garrabrant, and stands for "fixed-point decision theory". Ideas here are due to Scott Garrabrant, Sam Eisenstat, me, Daniel Hermann, TJ, Sahil, and Martin Soto, in roughly that priority order; but heavily filtered through my own lens.
This post may provide some useful formalism for thinking about issues raised in The Parable of Predict-O-Matic.
Self-fulfilling prophecies & other spooky map-territory connections.
A common trope is for magic to work only when you believe in it. For example, in Harry Potter, you can only get to the magical train platform 9¾ if you believe that you can pass through the wall to get there.
A plausible normative-rationality rule, when faced with such problems: if you want the magic to work, you should believe that it will work (and you should not believe it will work, if you want it not to work).
Can we sketch a formal decision theory which handles such problems?
We can't start by imagining that the agent has a prior probability distribution, like we normally would, since the agent would already be stuck -- either it lucked into a prior which believed the magic could work, or, it didn't.
Instead, the "beliefs" of the agent start out as maps from probability distributions to probability distributions. I'll use "P" as the type for probability distributions (little p for a specific probability distribution). So the type of "beliefs", B, is a function type: b:PP (little b for a specific belief). You can think of these as "map-territory connections": b is a (causal?) story about what actually happens, if we believe p. A "normal" prior, where we don't think our beliefs influence the world, would just be a constant function: it always outputs the same p no matter what the input is.
Given a belief b, the agent then somehow settles on a probability distribution p. We can now formalize our rationality criteria:
Epistemic Constraint: The probability distribution p which the agent settles on cannot be self-refuting according to the beliefs. It must be a fixed point of b: a p such that b(p)=p.
Instrumental Constraint: Out of the options allowed by the epistemic constraint, p should be as good as possible; that is, it should maximize expected utility: p := argmax_{p : b(p) = p} E_p[U].
We can also require that b be a continuous function, to guarantee the existence of a fixed point[1], so that the agent is definitely able to satisfy these requirements. This might seem like an arbitrary requirement, from the perspective where b is a story about map-territory connections; why should they be required to be continuous? But remember that b is representing the subjective belief-formation process of the agent, not a true objective story. Continuity can be thought of as a limit to the agent's own self-knowledge.
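As a toy illustration of how the two constraints fit together, here is a minimal Python sketch (an illustrative construction, not from the original post). It represents P by a single number p = Pr(X) for the proposition X = "the magic works", and uses a hypothetical continuous belief map with several fixed points:

```python
import numpy as np

def b(p):
    # Self-fulfilling toy belief: the more the agent expects the magic to work,
    # the likelier it is to work. Steep but continuous, so fixed points exist.
    return 1.0 / (1.0 + np.exp(-12.0 * (p - 0.5)))

def expected_utility(p):
    return p  # suppose U = 1 if the magic works and 0 otherwise, so E_p[U] = p

# Epistemic constraint: locate fixed points b(p) = p via sign changes of b(p) - p.
grid = np.linspace(0.0, 1.0, 100_001)
g = b(grid) - grid
crossings = np.where(np.sign(g[:-1]) * np.sign(g[1:]) <= 0)[0]
fixed_points = sorted({round((grid[i] + grid[i + 1]) / 2, 4) for i in crossings})

# Instrumental constraint: among the fixed points, settle on the best one.
p_star = max(fixed_points, key=expected_utility)
print(fixed_points, p_star)  # roughly [0.003, 0.5, 0.997]; p* ≈ 0.997
```

An agent satisfying both constraints "believes in the magic": it settles on the fixed point nearest 1, since that maximizes its expected utility among the self-consistent options.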
For example, the self-referential statement X: "p(X) < 1/2" suggests an "objectively true" belief which maps p(X) to 1 if it's below 1/2, and maps it to 0 if it's above or equal to 1/2. But this belief has no fixed point; an agent with this belief cannot satisfy the epistemic constraint on its rationality. If we require b to be continuous, we can only approximate the "objectively true" belief function, by rapidly but not instantly transitioning from 1 to 0 as p(X) rises from slightly less than 1/2 to slightly more.
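The same point in code (again a toy rendering, not from the post): the discontinuous belief suggested by X has no fixed point, while a steep continuous approximation does:

```python
import math

def b_objective(p):
    # The "objectively true" belief for X: "p(X) < 1/2".
    # b(p) = p is unsatisfiable: the output jumps from 1 to 0 at p = 1/2.
    return 1.0 if p < 0.5 else 0.0

def b_approx(p, k=100.0):
    # Continuous approximation: transitions rapidly, but not instantly, at 1/2.
    return 1.0 / (1.0 + math.exp(k * (p - 0.5)))

print(b_approx(0.5))  # 0.5 -- a fixed point, so the epistemic constraint is satisfiable
```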
These "beliefs" are a lot like "trading strategies" from Garrabrant Induction.
We can also replace the continuity requirement with a Kakutani requirement, to get something more like Paul's self-referential probabili...

Nov 30, 2023 • 16min
LW - What's next for the field of Agent Foundations? by Nora Ammann
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What's next for the field of Agent Foundations?, published by Nora Ammann on November 30, 2023 on LessWrong.
Alexander, Matt and I want to chat about the field of Agent Foundations (AF), where it's at and how to strengthen and grow it going forward.
We will kick off by each of us making a first message outlining some of our key beliefs and open questions at the moment. Rather than giving a comprehensive take, the idea is to pick out 1-3 things we each care about/think are important, and/or that we are confused about/would like to discuss. We may respond to some subset of the following prompts:
Where is the field of AF at in your view? How do you see the role of AF in the larger alignment landscape/with respect to making AI futures go well? Where would you like to see it go? What do you see as some of the key bottlenecks for getting there? What are some ideas you have about how we might overcome them?
Before we launch in properly, just a few things that seem worth clarifying:
By Agent Foundations, we mean roughly speaking conceptual and formal work towards understanding the foundations of agency, intelligent behavior and alignment. In particular, we mean something broader than what one might call "old-school MIRI-type Agent Foundations", typically informed by fields such as decision theory and logic.
We will not specifically be discussing the value or theory of change behind Agent Foundations research in general. We think these are important conversations to have, but in this specific dialogue, our goal is a different one, namely: assuming AF is valuable, how can we strengthen the field?
Should it look more like a normal research field?
The main question I'm interested in about agent foundations at the moment is whether it should continue in its idiosyncratic current form, or whether it should start to look more like an ordinary academic field.
I'm also interested in discussing theories of change, to the extent it has bearing on the other question.
Why agent foundations?
My own reasoning for foundational work on agency being a potentially fruitful direction for alignment research is:
Most misalignment threat models are about agents pursuing goals that we'd prefer they didn't pursue (I think this is not controversial)
Existing formalisms about agency don't seem all that useful for understanding or avoiding those threats (again probably not that controversial)
Developing new and more useful ones seems tractable (this is probably more controversial)
The main reason I think it might be tractable is that so far not that many person-hours have gone into trying to do it. A priori it seems like the sort of thing you can get a nice mathematical formalism for, and so far I don't think that we've collected much evidence that you can't.
So I think I'd like to get a large number of people with various different areas of expertise thinking about it, and I'd hope that some small fraction of them discovered something fundamentally important. And a key question is whether the way the field currently works is conducive to that.
Does it need a new name?
Does Agent Foundations-in-the-broad-sense need a new name?
Is the name 'Agent Foundations' cursed?
Suggestions I've heard are
'What are minds', 'what are agents', 'mathematical alignment', 'Agent Mechanics'.
Epistemic Pluralism and Path to Impact
Some thought snippets:
(1) Clarifying and creating common knowledge about the scope of Agent Foundations and strengthening epistemic pluralism
I think it's important for the endeavors of meaningfully improving our understanding of such fundamental phenomena as agency, intelligent behavior, etc. that one has a relatively pluralistic portfolio of angles on it. The world is very detailed, phenomena like agency/intelligent behavior/etc. seem like maybe particularly "messy"/detailed phenomena. Insofar ...

Nov 30, 2023 • 12min
LW - Scaling laws for dominant assurance contracts by jessicata
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Scaling laws for dominant assurance contracts, published by jessicata on November 30, 2023 on LessWrong.
(note: this post is high in economics math, probably of narrow interest)
Dominant assurance contracts are a mechanism proposed by Alex Tabarrok for funding public goods. The following summarizes a 2012 class paper of mine on dominant assurance contracts. Mainly, I will be determining how much money a dominant assurance contract can raise as a function of how much value is created for how many parties, under uncertainty about how much different parties value the public good. Briefly, the conclusion is that, while Tabarrok asserts that the entrepreneur's profit is proportional to the number of consumers under some assumptions, I find it is proportional to the square root of the number of consumers under these same assumptions.
The basic idea of assurance contracts is easy to explain. Suppose there are N people ("consumers") who would each benefit by more than $S > 0 from a given public good being created, e.g. a park or a piece of public domain music (note that we are assuming linear utility in money, which is approximately true on the margin, but can't be true at limits). An entrepreneur who is considering creating the public good can then make an offer to these consumers. They say, everyone has the option of signing a contract; this contract states that, if each other consumer signs the contract, then every consumer pays $S, and the entrepreneur creates the public good, which presumably costs no more than $NS to build (so the entrepreneur does not take a loss).
Under these assumptions, there is a Nash equilibrium of the game, in which each consumer signs the contract. To show this is a Nash equilibrium, consider whether a single consumer would benefit by unilaterally deciding not to sign the contract in a case where everyone else signs it. They would save $S by not signing the contract. However, since they don't sign the contract, the public good will not be created, and so they will lose over $S of value.
Therefore, everyone signing is a Nash equilibrium. Everyone can rationally believe themselves to be pivotal: the good is created if and only if they sign the contract, creating a strong incentive to sign.
Tabarrok seeks to solve the problem that, while this is a Nash equilibrium, signing the contract is not a dominant strategy. A dominant strategy is one where one would benefit by choosing that strategy (signing or not signing) regardless of what strategy everyone else takes. Even if it would be best for everyone if everyone signed, signing won't make a difference if at least one other person doesn't sign.
Tabarrok solves this by setting a failure payment $F > 0, and modifying the contract so that if the public good is not created, the entrepreneur pays every consumer who signed the contract $F. This requires the entrepreneur to take on risk, although that risk may be small if consumers have a sufficient incentive for signing the contract.
Here's the argument that signing the contract is a dominant strategy for each consumer. Pick out a single consumer and suppose everyone else signs the contract. Then the remaining consumer benefits by signing, by the previous logic (the failure payment is irrelevant, since the public good is created whenever the remaining consumer signs the contract).
Now consider a case where not everyone else signs the contract. Then by signing the contract, the remaining consumer gains $F, since the public good is not created. If they don't sign the contract, they get nothing and the public good is still not created. This is still better for them. Therefore, signing the contract is a dominant strategy.
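To make the dominance argument concrete, here is a minimal payoff sketch in Python (toy numbers chosen for illustration, not from the paper):

```python
# Toy parameters: each consumer values the good at V, pays S if it is created,
# and receives the failure payment F if they signed but it is not created.
V, S, F = 10.0, 6.0, 1.0  # requires V > S > 0 and F > 0

def consumer_payoff(i_sign: bool, all_others_sign: bool) -> float:
    good_created = i_sign and all_others_sign  # creation requires every signature
    if good_created:
        return V - S   # enjoy the good, pay the price
    if i_sign:
        return F       # contract failed: the entrepreneur pays signers $F
    return 0.0         # didn't sign; the good is not created either way

# Signing yields a strictly higher payoff whatever the others do:
assert consumer_payoff(True, True) > consumer_payoff(False, True)    # V - S > 0
assert consumer_payoff(True, False) > consumer_payoff(False, False)  # F > 0
```

With V > S and F > 0, both assertions hold, matching the two cases in the argument above.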
What if there is uncertainty about how much the different consumers value the public good? This can be modeled as a Bayesi...


