JAMA+ AI Conversations AI and "Do No Harm"
Feb 26, 2026
Adam Rodman, a general internist who leads AI programs in clinical reasoning and education, and David Wu, an MD‑PhD bridging clinical medicine and data science, discuss safe clinical use of large language models. They explore a Do No Harm benchmark and live leaderboard, failure modes such as omissions, model diversity and second opinions, and how clinicians should team with AI to avoid de‑skilling.
AI Snips
Omissions Are The Primary Harm In Medical LLMs
- Evaluation of diagnosis and management must count omissions as harms, not just incorrect recommendations.
- Arise benchmark uses real specialist questions and rubrics that mark missing tests, follow-up, or treatments as harmful.
Evaluate Models With Specialist Curated Benchmarks
- Use independent, specialist-curated benchmarks to evaluate medical LLMs before clinical use.
- Arise created gold-standard answers and rubrics from specialists and released a public leaderboard ranking 30+ models.
Management Reasoning Needs Contextual Benchmarks
- Management decisions are harder to evaluate than diagnoses because context and trade-offs vary widely.
- Arise scores management using consensus rubrics that specify both what a recommendation should include and what it must not omit.
