
The BugBash Podcast
Hypothesis vs. Hallucinations: Property Testing AI-Generated Code
Dec 10, 2025
David R. MacIver, creator of Hypothesis and property-based testing researcher, discusses how to make AI-generated code trustworthy. He explains why standard benchmarks mislead, how property-based testing uncovers hidden failures, and why getting developers to think in invariants is hard. The conversation covers LLMs writing tests, Hypothesis's onboarding UX, shrinking failing cases, and workflows for continuous fuzzing and regression protection.
AI Snips
Benchmarks Miss Deep Failures
- Standard coding benchmarks like HumanEval often accept LLM outputs that look correct but fail more thorough tests.
- Adding property-based tests or extra example cases reveals many hidden failures in model-generated code.
Generate Many Inputs, Not One
- Use property-based testing to let the computer generate many inputs instead of handcrafting single examples.
- Hypothesis exposes edge cases like Unicode, dates, and floats that simple examples often miss.
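As a minimal sketch of that idea (the round-trip property and function name here are illustrative, not taken from the episode), a Hypothesis test hands input generation to the library, which produces hundreds of strings, including awkward Unicode that handcrafted examples rarely cover:

```python
from hypothesis import given, strategies as st

# Property: encoding a string as UTF-8 and decoding it back
# returns the original string. Hypothesis generates the inputs,
# including astral-plane and combining-character Unicode.
@given(st.text())
def test_utf8_roundtrip(s):
    assert s.encode("utf-8").decode("utf-8") == s

# Calling the decorated function runs the property against
# many generated inputs (100 by default).
test_utf8_roundtrip()
```

Strategies such as `st.floats()` and `st.dates()` play the same role for the float and date edge cases mentioned above, surfacing values like NaN, infinities, and boundary dates.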
Test AI-Generated Code Rigorously
- When using LLMs to write code, always run good tests on the output rather than trusting one-shot generations.
- Prefer property-based tests, because checking an invariant over many generated inputs catches bugs that single-example unit tests miss.
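To make that concrete (the buggy `llm_dedupe` below is a hypothetical stand-in for one-shot LLM output, not code from the episode), a single handcrafted example can pass while a property checked over many inputs exposes the bug. This sketch uses only the standard library as a poor man's property test:

```python
import random

def llm_dedupe(xs):
    # Plausible one-shot LLM output: removes duplicates but
    # silently discards the original ordering.
    return list(set(xs))

def dedupe(xs):
    # Order-preserving reference implementation.
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# The single obvious example passes, so one-shot trust would accept it.
assert llm_dedupe([1, 1, 2]) == [1, 2]

# Property: deduping preserves first-occurrence order.
rng = random.Random(0)  # fixed seed for reproducibility

def find_counterexample(fn, trials=200):
    for _ in range(trials):
        xs = [rng.randrange(10) for _ in range(rng.randrange(8))]
        if fn(xs) != dedupe(xs):
            return xs  # a generated input on which fn misbehaves
    return None

assert find_counterexample(dedupe) is None   # reference passes every trial
print(find_counterexample(llm_dedupe))       # prints a counterexample list
```

A real property-based tool like Hypothesis adds what this sketch lacks: richer input strategies, shrinking of failing cases to minimal counterexamples, and a database of past failures for regression protection.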
