Evals are the new PRD. Here is the playbook with the CEO of the leader in the space (Ankur Goyal, Founder and CEO, Braintrust)

Mar 20, 2026

Ankur Goyal, founder and CEO of Braintrust, leads a major eval platform used by top AI product teams. He explains why evals should start the product process, how to build data-task-score evals, and shows a live demo creating an eval from scratch. Conversation covers scoring design, connecting evals to tools like Linear, and using failing evals as a roadmap for improvement.

INSIGHT

Vibe Checks Are Early Evals

Vibe checks are a form of evaluation using your brain as the scoring function.
Manual checks work early on but scale poorly as product usage grows. Past that point, running consistent evals takes software and process.

INSIGHT

Evals Outlast Model Churn

Evals are a durable investment that outlast model and architectural churn.
Encoding user needs as evals preserves product intent across frequent model swaps and agent rewrites.

INSIGHT

Distance From Users Determines Eval Need

The farther you are from the end user, the more critical structured evals become.
Teams like Anthropic can rely on internal feedback loops; healthcare-focused apps cannot and need formal evals with domain experts.

Today’s episode

Most PMs treat evals like a quality gate. Something you run right before shipping, just to check the box.

That is backwards.

The best AI product teams treat evals as the starting point. They write the eval before the prompt. They iterate on the scoring function before the model. They use failing evals as a roadmap.

That shift is what today’s episode is about.

I sat down with Ankur Goyal, Founder and CEO of Braintrust. It is the eval platform used by Replit, Vercel, Airtable, Ramp, Zapier, and Notion. Braintrust just announced its Series B at an $800 million valuation.

Users are running 10x more evals than this time last year. People log more data per day now than they did in the entire first year the product existed.

In this episode, we build an eval entirely from scratch. Live. No pre-written prompts, no pre-written data. We connect to Linear’s MCP server, generate test data, write a scoring function, and iterate until the score goes from 0 to 0.75.

Plus, we cover the complete eval playbook for PMs.

If you want access to my AI tool stack - Dovetail, Arize, Linear, Descript, Reforge Build, DeepSky, Relay.app, Magic Patterns, Speechify, and Mobbin - grab Aakash’s bundle.

----

Check out the conversation on Apple, Spotify, and YouTube.

Brought to you by:

* Kameleoon: Leading AI experimentation platform

* Testkube: Leading test orchestration platform

* Pendo: The #1 software experience management platform

* Bolt: Ship AI-powered products 10x faster

* Product Faculty: Get $550 off their #1 AI PM Certification with my link

----

Key Takeaways:

1. Vibe checks are evals - When you look at an AI output and intuit whether it is good or bad, you are using your brain as a scoring function. It is evaluation. It just does not scale past one person and a handful of examples.

2. Every eval has three parts - Data (a set of inputs), Task (the step that turns each input into an output), and Scores (functions that rate each output between 0 and 1). Normalizing every score to that range keeps results comparable over time.

3. Evals are the new PRD - In 2015, a PRD was an unstructured document nobody followed. In 2026, the modern PRD is an eval the whole team can run to quantify product quality.

4. Start with imperfect data - Auto-generate test questions with a model. Do not spend a month building a golden data set. Jump in and iterate from your first experiment.

5. The distance principle - The farther you are from the end user, the more critical evals become. Anthropic can vibe check Claude Code because engineers are the users. Healthcare AI teams cannot.

6. Use categorical scoring, not freeform numbers - Give the scorer three clear options (full answer, partial, no answer) instead of asking an LLM to produce an arbitrary number.

7. Evals compound, prompts do not - Models and frameworks change every few months. If you encode what your users need as evals, that investment survives every model swap.

8. Have evals that fail - If everything passes, you have blind spots. Keep failing evals as a roadmap and rerun them every time a new model drops.

9. Build the offline-to-online flywheel - Offline evals test your hypothesis. Online evals run the same scorers on production logs. The gap between them is your improvement roadmap.

10. The best teams review production logs every morning - They find novel patterns, add them to the data set, and iterate all day. That morning ritual is what separates teams that ship blind from teams that ship with confidence.
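To make takeaways 2, 6, and 8 concrete, here is a minimal sketch of the data-task-score shape in plain Python. It is not Braintrust's API and not the eval built in the episode. Every name in it (`Case`, `toy_task`, `keyword_label`, `run_eval`, `LABEL_SCORES`) is hypothetical, the task is a canned lookup standing in for a model call, the scorer is a simple keyword check standing in for an LLM judge, and the 1.0 / 0.5 / 0.0 values assigned to the three categories are an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Categorical labels mapped onto the 0-1 range.
# The three labels mirror takeaway 6; the numeric values are an assumption.
LABEL_SCORES = {"full": 1.0, "partial": 0.5, "none": 0.0}


@dataclass
class Case:
    input: str
    expected_keywords: list[str]  # hypothetical stand-in for a reference answer


# Data: a set of inputs (hand-written here; could be model-generated).
DATA = [
    Case("How do I reset my password?", ["settings", "security", "reset"]),
    Case("How do I export my data?", ["settings", "export"]),
    Case("How do I delete my account?", ["settings", "delete"]),
]


def toy_task(question: str) -> str:
    """Task: turns an input into an output. A real task would call a model."""
    canned = {
        "How do I reset my password?": "Open Settings, choose Security, then click Reset password.",
        "How do I export my data?": "Open Settings.",
    }
    return canned.get(question, "")


def keyword_label(output: str, case: Case) -> str:
    """Scorer: returns one of three labels instead of a freeform number."""
    hits = sum(kw.lower() in output.lower() for kw in case.expected_keywords)
    if hits == len(case.expected_keywords):
        return "full"
    return "partial" if hits > 0 else "none"


def run_eval(
    data: list[Case],
    task: Callable[[str], str],
    scorer: Callable[[str, Case], str],
) -> dict:
    """Runs the task on every case, scores it, and collects the failures."""
    rows = []
    for case in data:
        label = scorer(task(case.input), case)
        rows.append({"input": case.input, "label": label, "score": LABEL_SCORES[label]})
    return {
        "mean_score": mean(r["score"] for r in rows),
        "failures": [r for r in rows if r["score"] < 1.0],  # the roadmap
        "rows": rows,
    }


if __name__ == "__main__":
    result = run_eval(DATA, toy_task, keyword_label)
    print(f"mean score: {result['mean_score']:.2f}")
    for row in result["failures"]:
        print(f"needs work ({row['label']}): {row['input']}")
```

Because the scorer only picks among fixed categories, two runs of the same eval stay comparable, and the `failures` list doubles as the to-do list to rerun when a new model or prompt arrives.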

----

Where to find Ankur Goyal

* LinkedIn

* Braintrust