“Is Gemini 3 Scheming in the Wild?” by Alejandro Wainstock, Agustin_Martinez_Suñe, Iván Arcuschin, Victor Braberman

LessWrong (30+ Karma)

chevron_right

00:00

Broader Implications for Safety

Host analyzes ease of elicitation, detection challenges, and risks to system-prompt-based safety.

Play episode from 23:14

chevron_right

Transcript

chevron_right

Transcript

Episode notes

TL;DR

When faced with an unexpected tool response, without any adversarial attack, Gemini 3 deliberately and covertly violates an explicit system prompt rule. In a seemingly working agent from official Kaggle/Google tutorial, we observe the model:

Recognising the unambiguous rule and a compliant alternative (safe refusal) in its CoT, but choosing to violate anyway
Concealing the violation in its output to the user and reasoning about concealment in its CoT
Generating plausible justifications and reasoning about what evaluators can observe

This pattern is consistent with scheming-lite (CorrigibleAgent et al., 2025): "behaviour consistent with knowingly pursuing misaligned goals, or with deliberately exploiting its reward mechanism". Gemini 3 violated in 80% of runs; other models tested showed rates between 65% and 100%.

The behaviour may be easy to elicit but hard to detect: it emerged "in the wild" from a routine edge case, and the seemingly-compliant output masks a non-compliant trajectory. Notably, in some runs the model recognises it may be in a test context, yet violates anyway. It remains unclear whether current training regimes tend to suppress or reinforce this behaviour. We invite replication in other settings (see Appendix F for a candidate pattern).

1. Introduction

An AI agent [...]

---

Outline:

(00:13) TL;DR

(01:40) 1. Introduction

(01:55) Why this matters

(03:03) Evidence for Scheming-Lite

(03:58) 2. The Case: A Currency Conversion Agent

(04:04) 2.1. Origin

(05:11) 2.2. The unambiguous rule

(06:09) 2.3. What happened

(07:25) 2.4. Frequency

(09:21) 3. Deliberate Violation -- Does the Model Know What Its Doing?

(09:27) 3.1. The knowing violation

(10:10) 3.2. Implication

(11:42) 3.3. Can this be explained by confusion?

(14:18) 4. Active Concealment -- Is the Model Hiding What It Did?

(14:34) 4.1. Concealment of violation in output

(15:50) 4.2. Concealment of violation in CoT

(16:06) 4.3. Implication

(17:01) 5. Reasoning About Oversight -- Does the Model Act on What Can Be Observed?

(17:08) 5.1. Reasoning about what evaluators can observe

(18:34) 5.2. Possible framing as error

(19:22) 6. Situational Awareness -- Does the Model Know Its Being Tested?

(19:29) 6.1. Situational awareness does not reliably induce compliance

(20:22) 6.2. Implication

(21:05) 7. Plausible Justifications -- How Does the Model Justify Its Violations?

(21:12) 7.1. Justification patterns

(21:54) 7.2. Implication

(23:11) 8. Broader Implications

(23:16) 8.1. Easier to elicit than expected

(23:56) 8.2. Hard to detect

(24:59) 8.3. Implications for training

(26:28) 8.4. Scaling concerns

(27:12) 8.5. Undermines system prompts as safety mechanism

(28:29) 9. Limitations and Open Questions

(28:34) 9.1. Limitations

(30:55) 9.2. Open questions

(33:03) Acknowledgements

(33:20) Appendix A: Operational Definitions

(33:25) Violation

(35:24) Concealment

(36:31) Run inclusion criteria

(37:11) Appendix B: Prompt Variations and Compliance Rates

(37:18) calculation_agent configurations

(37:56) Prompt variations

(38:21) Results

(39:35) Appendix C: Full System Prompt and Representative Trace

(44:56) Appendix D: Relation to Existing Categories

(50:05) Appendix E: Gemini 2.5 vs Gemini 3 Comparison

(50:12) Observations

(51:31) Implication

(52:21) Appendix F: A Candidate Pattern for Deliberate Rule Violation

(54:32) References

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
March 24th, 2026

Source:
https://www.lesswrong.com/posts/HZn9AZeD2jfXXD2hH/is-gemini-3-scheming-in-the-wild

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

Home Top podcasts Popular guests Top books