
The Nonlinear Library LW - With or without a scratchpad, Large Language Models can Strategically Deceive their Users when Put Under Pressure. Results of an autonomous stock trading agent in a realistic, simulated environment. by ReaderM
Nov 15, 2023
02:33
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: With or without a scratchpad, Large Language Models can Strategically Deceive their Users when Put Under Pressure. Results of an autonomous stock trading agent in a realistic, simulated environment., published by ReaderM on November 15, 2023 on LessWrong.
The post measures two behaviors: Misalignment (engaging in the illegal behavior) and Deception (hiding the reason behind the trading decision from the user).
Key Takeaways
1. Without the scratchpad, GPT-4 is much less likely to be deceptive at first (~20% vs. ~75% with the scratchpad) but is more likely to double down once it has already lied (~95% vs. ~85%).
2. Explicit instructions not to perform the specific illegal activity (insider trading) do not eliminate it completely, just make it very rare (not quite 0%). On the rare occasions misalignment does occur under this condition, subsequent deception is near certain (~100%).
Instructions not to perform illegal activity in general still result in significant misalignment and deception rates (~8% misalignment, ~90% deception).
Though strongly encouraging the specific illegal activity results in nearly 100% illegal trading, the tendency to deceive persists: GPT-4 will hide that it has traded illegally ~67% of the time.
3. Removing various sources of pressure significantly reduces the model's propensity for misaligned behavior. Notably, however, the behavior persists in all cases where only a single source of pressure is removed.
4. When the risk of detection is very low, the model is generally more likely to use the insider tip and act misaligned (and vice-versa when the risk is very high).
5. Simple rephrases and minor modifications to the standard prompt only slightly change rates of misalignment and deceptive behavior. However, the overall qualitative results remain the same.
6. In scratchpad instances, what happens if you edit the model's reasoning to show it unambiguously rejecting insider trading?
Strong wording of the reasoning trace when first rejecting the insider tip has a substantial effect on GPT-4's tendency toward misalignment. However, the misaligned and strategically deceptive behavior persists even when the model's character unambiguously dismisses insider trading: the strongest dismissal examples still result in ~10% misalignment and ~97% deception.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
