Real‑World Testing Over Benchmarks

Karl Yeh urges testing models on real use cases and warns against overreliance on benchmark performance.

Episode notes

Beth Lyons and Andy Halliday open the show with a focused breakdown of GPT-5.4, framing it less as a universal leap and more as a strong advance in white-collar knowledge work and real-world task performance. Much of the conversation compares GPT-5.4 with Gemini 3.1 Pro Preview, Claude models, Codex, and other systems on GPT-Val and on benchmarks for coding, long-context reasoning, hallucination resistance, and visual reasoning. The hosts repeatedly stress that users still need to pick a model based on the actual job to be done. Beth also shares a practical complaint about Gemini hallucinating content when given silent screen recordings, and uses it to argue for a more dependable “colleague layer” in agentic systems. Later, Karl Yeh joins to talk through his hands-on experience with GPT-5.4 in Codex, how it compares with Claude in Excel and Gemini in Sheets, and where the new release feels genuinely useful in day-to-day work.

Key Points Discussed

00:00:18 Welcome and setup for a GPT-5.4-focused episode

00:02:47 GPT-Val and white-collar knowledge work framing

00:08:51 Benchmark comparison across GPT-5.4, Claude, Gemini, and others

00:16:26 Gemini strengths in video and visual reasoning

00:18:05 Beth’s Gemini transcription / hallucination workflow example

00:23:54 “Then we’ll move to more news” and handoff to Karl Yeh

00:24:24 Karl Yeh on real-world use cases over benchmarks

00:55:30 Closing recommendations: try GPT-5.4, use Codex, newsletter and community plug

The Daily AI Show Co-Hosts: Beth Lyons, Andy Halliday, Karl Yeh
