
The BugBash Podcast
Hypothesis vs. Hallucinations: Property Testing AI-Generated Code
Dec 10, 2025
David R. MacIver, creator of Hypothesis and property-based testing researcher, discusses how to make AI-generated code trustworthy. He explains why standard benchmarks mislead, how property-based testing uncovers hidden failures, and why getting developers to think in invariants is hard. The conversation covers LLMs writing tests, Hypothesis's onboarding UX, shrinking failing cases, and workflows for continuous fuzzing and regression protection.
AI Snips
Benchmarks Miss Deep Failures
- Standard coding benchmarks like HumanEval often accept LLM outputs that look correct but fail more thorough tests.
- Adding property-based tests or extra example cases reveals many hidden failures in model-generated code.
Generate Many Inputs, Not One
- Use property-based testing to let the computer generate many inputs instead of handcrafting single examples.
- Hypothesis exposes edge cases like Unicode, dates, and floats that simple examples often miss.
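As a minimal sketch of that idea (the round-trip property and function name here are illustrative, not taken from the episode), a Hypothesis test hands input generation to the library, which produces hundreds of strings, including awkward Unicode that handcrafted examples rarely cover:

```python
from hypothesis import given, strategies as st

# Property: encoding a string as UTF-8 and decoding it back
# returns the original string. Hypothesis generates the inputs,
# including astral-plane and combining-character Unicode.
@given(st.text())
def test_utf8_roundtrip(s):
    assert s.encode("utf-8").decode("utf-8") == s

# Calling the decorated function runs the property against
# many generated inputs (100 by default).
test_utf8_roundtrip()
```

Strategies such as `st.floats()` and `st.dates()` play the same role for the float and date edge cases mentioned above, surfacing values like NaN, infinities, and boundary dates.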
Test AI-Generated Code Rigorously
- When using LLMs to write code, always run good tests on the output rather than trusting one-shot generations.
- Prefer property-based tests, because checking an invariant over many generated inputs catches bugs that single-example unit tests miss.
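To make that concrete (the buggy `llm_dedupe` below is a hypothetical stand-in for one-shot LLM output, not code from the episode), a single handcrafted example can pass while a property checked over many inputs exposes the bug. This sketch uses only the standard library as a poor man's property test:

```python
import random

def llm_dedupe(xs):
    # Plausible one-shot LLM output: removes duplicates but
    # silently discards the original ordering.
    return list(set(xs))

def dedupe(xs):
    # Order-preserving reference implementation.
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# The single obvious example passes, so one-shot trust would accept it.
assert llm_dedupe([1, 1, 2]) == [1, 2]

# Property: deduping preserves first-occurrence order.
rng = random.Random(0)  # fixed seed for reproducibility

def find_counterexample(fn, trials=200):
    for _ in range(trials):
        xs = [rng.randrange(10) for _ in range(rng.randrange(8))]
        if fn(xs) != dedupe(xs):
            return xs  # a generated input on which fn misbehaves
    return None

assert find_counterexample(dedupe) is None   # reference passes every trial
print(find_counterexample(llm_dedupe))       # prints a counterexample list
```

A real property-based tool like Hypothesis adds what this sketch lacks: richer input strategies, shrinking of failing cases to minimal counterexamples, and a database of past failures for regression protection.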
