White-Box Auditing Agents and Interpretability

Marius outlines agents using model internals to find misaligned concepts and meaningful failure modes.

Episode notes

This post reflects my personal opinion and not necessarily that of other members of Apollo Research.

TLDR: I think funders should heavily incentivize AI safety work that enables spending $100M+ in compute or API budgets on automated AI labor that directly and differentially translates to safety.

Motivation

I think we are in a short-timeline world (and we should take the possibility seriously even if we don't have full confidence yet). This means that I think funders should aim to allocate large amounts of money (e.g. $1-50B per year across the ecosystem) to AI safety in the next 2-3 years.

I think that the AI safety funders have been allocating way too little funding and their spending has been far too conservative in the past 5 years. So, in my opinion, we should definitely continue ramping up “normal” spending, e.g. pay more competitive salaries, allow AI safety organizations to grow faster, and other things in that vein.

However, these “normal” spending patterns are not sufficient under short-timeline assumptions, and the obvious way to spend more money quickly is to aggressively encourage finding ways to use automated labor for AI safety.

What is an “automated AI [...]

---

Outline:

(00:31) Motivation

(01:25) What is an automated AI safety scaling grant?

(05:48) Other considerations

(05:51) Who should be able to receive such a grant?

(06:30) Why make this an explicit grant program?

(07:17) Aren't we just gonna Goodhart all of these metrics?

(07:36) Concrete examples of potential grant areas

(07:41) Monitoring & Control

(09:31) Automated black box auditing