
The Daily AI Show
GPT 5.4 vs Gemini: Benchmarks, Codex, Excel
Mar 6, 2026
Karl Yeh, a hands‑on AI practitioner and consultant, shares practical field experience with GPT‑5.4 in Codex, Excel, and Sheets workflows. The conversation covers head‑to‑head model comparisons, real‑world testing over benchmarks, spreadsheet dashboard design, and risks like live workbook edits and agent collisions, with short, practical takeaways and tool recommendations for day‑to‑day AI work.
AI Snips
Run Your Own Use Case Tests Before Choosing A Model
- Test models on your own use cases instead of relying solely on public benchmarks like GPTVal.
- Karl warns that Chinese models can excel on targeted benchmarks yet fail in real‑world company tests, so run practical pilots first.
Gemini Hallucinated A Silent Screen Recording
- Beth recounts asking Gemini to transcribe a phone screen recording and receiving a confident but fabricated direct quote: the recording was silent, and the model invented a voiceover for it.
- That hallucination led her to call for a dependable "colleague layer" that flags uncertainty rather than inventing content (a minimal sketch follows).
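
The episode doesn't specify how such a layer would work; one minimal sketch, assuming a transcription step that reports per‑segment confidence, is a wrapper that surfaces shaky segments for review instead of presenting them as fact. The `Segment` type, `review_transcript` function, and the 0.8 threshold below are all illustrative assumptions, not any real API.

```python
# Hypothetical "colleague layer": flag low-confidence transcript
# segments for human review rather than returning them as fact.
from dataclasses import dataclass

@dataclass
class Segment:
    text: str          # transcribed text for this span
    confidence: float  # model-reported confidence, 0.0-1.0 (assumed)

def review_transcript(segments: list[Segment], threshold: float = 0.8) -> list[str]:
    """Return transcript lines, marking segments below the threshold."""
    lines = []
    for seg in segments:
        if seg.confidence >= threshold:
            lines.append(seg.text)
        else:
            # Surface uncertainty explicitly instead of inventing content.
            lines.append(f"[UNCERTAIN, confidence={seg.confidence:.2f}] {seg.text}")
    return lines

# A silent recording should produce empty or low-confidence segments,
# which this layer flags instead of passing off as a direct quote.
print("\n".join(review_transcript([
    Segment("Welcome to the dashboard demo.", 0.95),
    Segment("The CEO said revenue doubled.", 0.31),  # gets flagged
])))
```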
Omniscience Index Rewards Accuracy And Penalizes Hallucination
- The Omniscience Index measures factual accuracy and penalizes hallucinations; Gemini 3.1 Pro Preview scored highly, while GPT‑5.4 at extra‑high reasoning scored lower.
- Andy notes that models that refuse to answer incur no penalty, which favors conservative responses (see the sketch below).
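
The exact formula isn't given in the episode; a minimal sketch of the scoring rule as described (correct answers rewarded, hallucinations penalized, refusals neutral) might look like the following. The +1/−1/0 weights and the scaling to a ±100 range are assumptions for illustration, not the published Omniscience Index formula.

```python
# Sketch of a hallucination-penalizing accuracy score as described:
# correct answers count +1, hallucinated (wrong) answers count -1,
# refusals count 0, so abstaining beats guessing wrong.
# Weights and the [-100, 100] scaling are assumptions.

def omniscience_style_score(correct: int, hallucinated: int, refused: int) -> float:
    total = correct + hallucinated + refused
    if total == 0:
        raise ValueError("no answers graded")
    return 100.0 * (correct - hallucinated) / total

# A model that refuses when unsure can outscore one that guesses:
print(omniscience_style_score(correct=60, hallucinated=30, refused=10))  # 30.0
print(omniscience_style_score(correct=60, hallucinated=0, refused=40))   # 60.0
```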
