The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Mixture-of-Experts and Trends in Large-Scale Language Modeling with Irwan Bello - #569

Apr 25, 2022
Irwan Bello, a research scientist formerly with Google Brain and now part of a stealth AI startup, dives into the world of sparse expert models. He discusses his recent work on designing effective architectures that improve performance while managing computational costs. The conversation uncovers how the mixture-of-experts technique can extend beyond NLP to various tasks, including vision. Bello also shares insights on enhancing alignment in language models through instruction tuning and the challenges of optimizing these large-scale systems.
ADVICE

Expert Layer Design

  • Use shallow, two-layer neural nets for expert layers within a deep network.
  • For optimal performance, place an expert layer after every four blocks of regular layers (see the sketch below).
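
To make that placement advice concrete, here is a minimal sketch of a top-1 ("switch"-style) routed expert layer interleaved after every four regular blocks. This is illustrative only; the names (`Expert`, `MoELayer`, `build_stack`) and the sizes are hypothetical, not Bello's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    # A shallow two-layer feed-forward net, matching the snip's description
    # of an expert (also reused here to stand in for a regular dense block).
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))

class MoELayer(nn.Module):
    # Top-1 routing: each token goes to the single expert whose router
    # logit is highest, scaled by that expert's router probability.
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        probs = self.router(x).softmax(dim=-1)  # [tokens, num_experts]
        gate, idx = probs.max(dim=-1)           # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

def build_stack(num_layers: int = 12, d_model: int = 512, num_experts: int = 8):
    # Place one expert layer after every four regular blocks, per the advice.
    return nn.ModuleList(
        [
            MoELayer(d_model, 4 * d_model, num_experts)
            if (i + 1) % 4 == 0
            else Expert(d_model, 4 * d_model)
            for i in range(num_layers)
        ]
    )
```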
INSIGHT

Scaling and Stability

  • Scaling expert models by computation, not just parameter count, introduced stability challenges.
  • Applying ZLoss to the router network's logits improves stability by ensuring smoother probability distributions for expert selection (sketched below).
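
For reference, the router z-loss described in the related ST-MoE work (which Bello coauthored) penalizes the squared log-sum-exp of the router logits, keeping them small so the softmax over experts suffers less rounding error in low precision. A minimal sketch; the coefficient is a commonly used value, not taken from the episode:

```python
import torch

def router_z_loss(router_logits: torch.Tensor, coef: float = 1e-3) -> torch.Tensor:
    # router_logits: [num_tokens, num_experts] raw scores from the router.
    # Penalizing the squared log-partition function keeps logits small,
    # which reduces BFloat16 rounding error in the expert-selection softmax.
    z = torch.logsumexp(router_logits, dim=-1)  # [num_tokens]
    return coef * (z ** 2).mean()

# Added to the main objective, e.g.: loss = task_loss + router_z_loss(logits)
```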
ANECDOTE

ZLoss Discovery

  • Initially, classic stability techniques like noise injection and activation clipping hurt model quality.
  • The team then repurposed ZLoss, realizing its potential to address rounding errors arising from lower-precision formats like BFloat16.