
The Daily AI Show
GPT 5.4 vs Gemini: Benchmarks, Codex, Excel
Mar 6, 2026
Karl Yeh, a hands‑on AI practitioner and consultant, shares practical field experience with GPT‑5.4 in Codex, Excel, and Sheets workflows. The conversation covers head‑to‑head model comparisons, real‑world testing over benchmarks, spreadsheet dashboard design, and risks like live workbook edits and agent collisions, with short, practical takeaways and tool recommendations for day‑to‑day AI work.
AI Snips
Run Your Own Use Case Tests Before Choosing A Model
- Test models on your own use cases instead of relying solely on public benchmarks like GPTVal.
- Karl warns that Chinese models can excel on targeted benchmarks yet fail in real‑world company tests, so run practical pilots first.
Gemini Hallucinated A Silent Screen Recording
- Beth recounts asking Gemini to transcribe a phone screen recording and receiving a confident but fabricated direct quote: the recording was silent, and the model invented a voiceover for it.
- That hallucination led her to call for a dependable "colleague layer" that flags uncertainty rather than inventing content (a minimal sketch follows).
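
The episode doesn't specify how such a layer would work; one minimal sketch, assuming a transcription step that reports per‑segment confidence, is a wrapper that surfaces shaky segments for review instead of presenting them as fact. The `Segment` type, `review_transcript` function, and the 0.8 threshold below are all illustrative assumptions, not any real API.

```python
# Hypothetical "colleague layer": flag low-confidence transcript
# segments for human review rather than returning them as fact.
from dataclasses import dataclass

@dataclass
class Segment:
    text: str          # transcribed text for this span
    confidence: float  # model-reported confidence, 0.0-1.0 (assumed)

def review_transcript(segments: list[Segment], threshold: float = 0.8) -> list[str]:
    """Return transcript lines, marking segments below the threshold."""
    lines = []
    for seg in segments:
        if seg.confidence >= threshold:
            lines.append(seg.text)
        else:
            # Surface uncertainty explicitly instead of inventing content.
            lines.append(f"[UNCERTAIN, confidence={seg.confidence:.2f}] {seg.text}")
    return lines

# A silent recording should produce empty or low-confidence segments,
# which this layer flags instead of passing off as a direct quote.
print("\n".join(review_transcript([
    Segment("Welcome to the dashboard demo.", 0.95),
    Segment("The CEO said revenue doubled.", 0.31),  # gets flagged
])))
```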
Omniscience Index Rewards Accuracy And Penalizes Hallucination
- The Omniscience Index measures factual accuracy and penalizes hallucinations; Gemini 3.1 Pro Preview scored highly, while GPT‑5.4 at extra‑high reasoning scored lower.
- Andy notes that models that refuse to answer incur no penalty, which favors conservative responses (see the sketch below).
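
The exact formula isn't given in the episode; a minimal sketch of the scoring rule as described (correct answers rewarded, hallucinations penalized, refusals neutral) might look like the following. The +1/−1/0 weights and the scaling to a ±100 range are assumptions for illustration, not the published Omniscience Index formula.

```python
# Sketch of a hallucination-penalizing accuracy score as described:
# correct answers count +1, hallucinated (wrong) answers count -1,
# refusals count 0, so abstaining beats guessing wrong.
# Weights and the [-100, 100] scaling are assumptions.

def omniscience_style_score(correct: int, hallucinated: int, refused: int) -> float:
    total = correct + hallucinated + refused
    if total == 0:
        raise ValueError("no answers graded")
    return 100.0 * (correct - hallucinated) / total

# A model that refuses when unsure can outscore one that guesses:
print(omniscience_style_score(correct=60, hallucinated=30, refused=10))  # 30.0
print(omniscience_style_score(correct=60, hallucinated=0, refused=40))   # 60.0
```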
