
Don't Worry About the Vase Podcast
Gemini 3.1 Pro Aces Benchmarks, I Suppose
Mar 4, 2026
A deep dive into Google's Gemini 3.1 Pro benchmarks and what the headline scores really mean. Discussion of the mysterious DeepThink V2 and whether it's a runtime tweak or a new model. Critique of sparse safety disclosures and what test-time scaling implies for risk. Notes on visual polish, coding strengths and quirks, and rollout reliability concerns.
AI Snips
Google's Benchmark Transparency Is Not Perfect But Helpful
- Google published a wide benchmark set and openly included tests where Opus 4.6 outperformed Gemini, lending credibility to the comparison.
- Zvi tips his cap to the quick Sonnet 4.6 integration, which improved benchmarking fairness.
Thin Safety Disclosures For A Frontier Candidate
- Google's Frontier Safety reporting for 3.1 is minimal: it ran the tests but disclosed only brief summaries and the same alerts already raised for 3 Pro.
- Zvi calls this level of disclosure unacceptable given Gemini 3.1's frontier-model status.
DeepThink V2 Gains Come From A Stronger Baseline
- DeepThink V2 is a runtime configuration that delivers dramatic benchmark jumps, later revealed to be built on Gemini 3.1 Pro.
- This explains how the benchmarks leapt: V2 pairs a more capable baseline with runtime scaffolding rather than relying on clever orchestration alone (see the sketch after this list).
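To make "runtime scaffolding" concrete, here is a minimal, hypothetical sketch of one common test-time scaling pattern, best-of-n sampling. Nothing here reflects DeepThink's actual internals; `generate`, `score`, and `best_of_n` are stand-in names invented purely for illustration.

```python
import random

def generate(prompt: str) -> str:
    """Stand-in for one sample from a fixed base model (e.g. a 3.1-class model)."""
    return f"candidate answer to {prompt!r} (draw {random.random():.3f})"

def score(prompt: str, answer: str) -> float:
    """Stand-in for a verifier or reward model that ranks candidate answers."""
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    """Sample n candidates at inference time and keep the highest-scoring one.

    The point of the snip: gains come from two independent levers, a
    stronger baseline (a better generate()) and more inference compute
    (a larger n). Swapping in a stronger base model under the same
    scaffolding can therefore produce a large benchmark jump on its own.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

if __name__ == "__main__":
    print(best_of_n("What is 17 * 24?", n=8))
```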
