Limitations of BLEU/ROUGE and Newer Judging

Ankit explains why traditional BLEU/ROUGE metrics often fail and why LLM-judge prompts are needed for nuance.
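
To make the failure mode concrete, here is a minimal sketch of a word-overlap score. It is a simplified stand-in in the spirit of ROUGE-1, not the official BLEU or ROUGE implementation, and the function name and example sentences are made up for illustration. It shows how an overlap metric can rank a wrong answer above a faithful paraphrase.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1 between two strings (hypothetical helper, ROUGE-1-like)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Count words shared by both texts, respecting repeats.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"  # same meaning, different words
wrong = "the cat sat on the hat"             # different meaning, same words

paraphrase_score = unigram_f1(paraphrase, reference)
wrong_score = unigram_f1(wrong, reference)
print(f"paraphrase: {paraphrase_score:.2f}, wrong answer: {wrong_score:.2f}")
```

The paraphrase shares only one word with the reference, so it scores far lower than the answer that gets the final word wrong. An LLM judge prompt is meant to catch exactly this kind of gap between surface overlap and meaning.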

Episode notes

Check out the conversation on Apple, Spotify and YouTube.

Brought to you by - Reforge:

Get 1 month free of Reforge Build (the AI prototyping tool built for PMs) with code BUILD

Today’s Episode

Ankit Shukla is BACK after his gangbusters episode, which is my #2 most popular of all time. This time he's diving deep on one of the most important new AI skills for PMs: Evals.

Whether you're working on AI features now or not, this is a skill you want to have an intuitive understanding of. So, I'm building on my library of eval episodes with today's drop.

I've never heard someone explain evals from first principles as intuitively as Ankit does in this one. Hope you enjoy it as much as I did!

If you want access to my AI tool stack - Dovetail, Arize, Linear, Descript, Reforge Build, DeepSky, Relay.app, Magic Patterns, Speechify, and Mobbin - grab Aakash’s bundle.

Where to find Ankit Shukla

* HelloPM

* Twitter (X)

* LinkedIn

* YouTube
