The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Mixture-of-Experts and Trends in Large-Scale Language Modeling with Irwan Bello - #569

Apr 25, 2022
Irwan Bello, a research scientist formerly with Google Brain and now part of a stealth AI startup, dives into the world of sparse expert models. He discusses his recent work on designing effective architectures that improve performance while managing computational costs. The conversation uncovers how the mixture-of-experts technique can extend beyond NLP to various tasks, including vision. Bello also shares insights on enhancing alignment in language models through instruction tuning and the challenges of optimizing these large-scale systems.
ADVICE

Expert Layer Design

  • Use shallow, two-layer neural nets for expert layers within a deep network.
  • For optimal performance, place an expert layer after every four blocks of regular layers (see the sketch below).
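
To make that placement advice concrete, here is a minimal sketch of a top-1 ("switch"-style) routed expert layer interleaved after every four regular blocks. This is illustrative only; the names (`Expert`, `MoELayer`, `build_stack`) and the sizes are hypothetical, not Bello's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    # A shallow two-layer feed-forward net, matching the snip's description
    # of an expert (also reused here to stand in for a regular dense block).
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))

class MoELayer(nn.Module):
    # Top-1 routing: each token goes to the single expert whose router
    # logit is highest, scaled by that expert's router probability.
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        probs = self.router(x).softmax(dim=-1)  # [tokens, num_experts]
        gate, idx = probs.max(dim=-1)           # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

def build_stack(num_layers: int = 12, d_model: int = 512, num_experts: int = 8):
    # Place one expert layer after every four regular blocks, per the advice.
    return nn.ModuleList(
        [
            MoELayer(d_model, 4 * d_model, num_experts)
            if (i + 1) % 4 == 0
            else Expert(d_model, 4 * d_model)
            for i in range(num_layers)
        ]
    )
```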
INSIGHT

Scaling and Stability

  • Scaling expert models by computation, not just parameter count, introduced stability challenges.
  • Applying ZLoss to the router network's logits improves stability by ensuring smoother probability distributions for expert selection (sketched below).
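
For reference, the router z-loss described in the related ST-MoE work (which Bello coauthored) penalizes the squared log-sum-exp of the router logits, keeping them small so the softmax over experts suffers less rounding error in low precision. A minimal sketch; the coefficient is a commonly used value, not taken from the episode:

```python
import torch

def router_z_loss(router_logits: torch.Tensor, coef: float = 1e-3) -> torch.Tensor:
    # router_logits: [num_tokens, num_experts] raw scores from the router.
    # Penalizing the squared log-partition function keeps logits small,
    # which reduces BFloat16 rounding error in the expert-selection softmax.
    z = torch.logsumexp(router_logits, dim=-1)  # [num_tokens]
    return coef * (z ** 2).mean()

# Added to the main objective, e.g.: loss = task_loss + router_z_loss(logits)
```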
ANECDOTE

ZLoss Discovery

  • Initially, classic stability techniques like noise injection and activation clipping hurt model quality.
  • The team then repurposed ZLoss, realizing its potential to address rounding errors arising from lower-precision formats like BFloat16.