JAMA+ AI Conversations AI and "Do No Harm"
Feb 26, 2026
Adam Rodman, a general internist who leads AI programs in clinical reasoning and education, and David Wu, an MD‑PhD bridging clinical medicine and data science, discuss safe clinical use of large language models. They explore a Do No Harm benchmark and live leaderboard, failure modes such as omissions, model diversity and second opinions, and how clinicians should team with AI to avoid de‑skilling.
AI Snips
Omissions Are The Primary Harm In Medical LLMs
- Evaluation of diagnosis and management must count omissions as harms, not just incorrect recommendations.
- Arise benchmark uses real specialist questions and rubrics that mark missing tests, follow-up, or treatments as harmful.
Evaluate Models With Specialist Curated Benchmarks
- Use independent, specialist-curated benchmarks to evaluate medical LLMs before clinical use.
- Arise created gold-standard answers and rubrics from specialists and released a public leaderboard ranking 30+ models.
Management Reasoning Needs Contextual Benchmarks
- Management decisions are harder to evaluate than diagnoses because context and trade-offs vary widely.
- Arise scores management using consensus rubrics that specify both what a recommendation should include and what it must not omit.
