JAMA+ AI Conversations

AI and "Do No Harm"

Feb 26, 2026
Adam Rodman, a general internist who leads AI programs in clinical reasoning and education, and David Wu, an MD‑PhD bridging clinical medicine and data science, discuss safe clinical use of large language models. They explore a Do No Harm benchmark and live leaderboard, failure modes like omissions, model diversity and second opinions, and how clinicians should team with AI to avoid de‑skilling.
INSIGHT

Omissions Are The Primary Harm In Medical LLMs

  • Evaluation of diagnostic and management performance must count omissions as harms, not just incorrect recommendations.
  • The Arise benchmark uses real specialist-written questions with rubrics that mark missing tests, follow-up, or treatments as harmful.
ADVICE

Evaluate Models With Specialist Curated Benchmarks

  • Use independent, specialist-curated benchmarks to evaluate medical LLMs before deploying them clinically.
  • Arise created gold-standard answers and grading rubrics from specialists and released a public leaderboard ranking more than 30 models.
INSIGHT

Management Reasoning Needs Contextual Benchmarks

  • Management decisions are harder to evaluate than diagnoses because clinical context and trade-offs vary widely.
  • Arise scores management answers against consensus rubrics that specify what a recommendation must include and must not omit.