Don't Worry About the Vase Podcast

Gemini 3.1 Pro Aces Benchmarks, I Suppose

Mar 4, 2026
A deep dive into Google's Gemini 3.1 Pro benchmarks and what the headline scores really mean. Discussion of mysterious DeepThink V2 and whether it's a runtime tweak or a new model. Critique of sparse safety disclosures and what test-time scaling implies for risk. Notes on visual polish, coding strengths and quirks, and rollout reliability concerns.
Episode notes
INSIGHT

Google's Benchmark Transparency Is Not Perfect But Helpful

  • Google published a wide benchmark set and openly included tests where Opus 4.6 outperformed Gemini, lending credibility to the comparison.
  • Zvi tips his cap to the quick Sonnet 4.6 integration, which improved the fairness of the benchmark comparisons.
INSIGHT

Thin Safety Disclosures For A Frontier Candidate

  • Google's Frontier Safety reporting for 3.1 is minimal: they ran the tests but disclosed only brief summaries, plus alerts already reported for 3 Pro.
  • Zvi calls this level of disclosure unacceptable given Gemini 3.1's status as a frontier model.
INSIGHT

DeepThink V2 Gains Come From A Stronger Baseline

  • DeepThink V2 is a runtime configuration that delivers dramatic benchmark jumps, later revealed to be built on Gemini 3.1 Pro.
  • This explains the leap in scores: V2 combines a more capable baseline with runtime scaffolding, rather than relying on clever orchestration alone.