
706: Large Language Model Leaderboards and Benchmarks
Super Data Science: ML & AI Podcast with Jon Krohn
Evaluating Language Models with Llama 2 and Benchmark Challenges
This chapter covers the release of Llama 2 and comparisons of language models' performance, addressing challenges in evaluation such as contaminated training data and testing models on data they have already seen. It explores how benchmarks have evolved and why they need ongoing refinement to capture accuracy, fairness, and lack of toxicity, as well as the need for new benchmarks as models advance. The discussion also touches on Stanford University's evaluation of language models across diverse scenarios and the introduction of a time horizon for standardized measurement.


