The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Quantizing Transformers by Helping Attention Heads Do Nothing with Markus Nagel - #663

Dec 26, 2023
In this discussion, Markus Nagel, a research scientist at Qualcomm AI Research, shares insights from his recent papers at NeurIPS 2023, focusing on machine learning efficiency. He tackles the challenges of quantizing transformers, particularly in minimizing outlier issues in attention mechanisms. The conversation explores the pros and cons of pruning versus quantization for model weight compression and dives into innovative methods for multitask and multidomain learning. Additionally, the use of geometric algebra in enhancing algorithms for robotics is highlighted.
INSIGHT

Outliers in Transformer Quantization

  • Transformers are difficult to quantize because of outliers in their activations, which force a poor trade-off between clipping error and rounding error.
  • These outliers stem from attention heads trying to perform a "no update" behavior, which is not explicitly represented.
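The clipping/rounding trade-off can be illustrated with a minimal uniform-quantization sketch. The values and the `quantize_dequantize` helper below are illustrative, not from the episode: a single outlier stretches the quantization range, so the rounding error on all the ordinary activations grows.

```python
import numpy as np

def quantize_dequantize(x, num_bits=8):
    """Uniform symmetric quantization: the scale must cover the max magnitude."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=1000)           # typical activations, |x| mostly < 4
outlier_acts = np.append(acts, 60.0)   # one large outlier stretches the range

# mean rounding error on the ordinary values, with and without the outlier
err_plain = np.abs(acts - quantize_dequantize(acts)).mean()
err_outlier = np.abs(outlier_acts[:-1] - quantize_dequantize(outlier_acts)[:-1]).mean()
```

Covering the outlier inflates the scale, so `err_outlier` is much larger than `err_plain`; clipping the outlier instead would trade that for a large clipping error on the outlier itself.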
INSIGHT

Root Cause of Outliers

  • Outliers in transformers arise from attention heads focusing on delimiter or background tokens with low information content.
  • Other tokens attend to these special tokens to achieve a "no update" behavior, indirectly causing outliers.
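This mechanism can be sketched in a few lines. The token values and logits below are made up for illustration: to park nearly all of its attention on a near-zero-value delimiter token, a head needs a large logit gap, and producing that gap pushes the upstream activations toward outlier magnitudes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Values for 3 tokens; index 0 plays the role of a delimiter/background
# token whose value vector is close to zero (illustrative numbers).
V = np.array([[0.01, 0.0],
              [1.0, -0.5],
              [0.3,  0.8]])

# A head that wants "no update" must put ~all attention on token 0,
# which requires a large logit gap -> large (outlier) activations upstream.
logits = np.array([12.0, 0.0, 0.0])
weights = softmax(logits)
update = weights @ V   # ~ the delimiter's near-zero value vector
```

Because softmax can never output an exact zero, the "no update" can only be approximated, and the approximation gets better only as the logits (and hence the activations feeding them) get larger.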
ADVICE

Implementing "No Update"

  • Implement an explicit "no update" behavior in attention heads to mitigate outliers and improve quantization.
  • This can be achieved through methods like clipped softmax or gated attention.
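A minimal sketch of the clipped-softmax idea: stretch the softmax output slightly beyond [0, 1] and clip it back, so attention weights can reach exactly 0 (and 1) at modest, finite logits. The specific `gamma`/`zeta` values here are illustrative, not the paper's tuned hyperparameters.

```python
import numpy as np

def clipped_softmax(x, gamma=-0.03, zeta=1.03):
    """Stretch softmax outputs to [gamma, zeta], then clip to [0, 1].
    With gamma < 0, a weight can hit exactly 0 without an extreme logit gap."""
    e = np.exp(x - x.max())
    s = e / e.sum()
    return np.clip((zeta - gamma) * s + gamma, 0.0, 1.0)

logits = np.array([6.0, 0.0, 0.0])
w = clipped_softmax(logits)
# the two low-scoring tokens get weight exactly 0 at a modest logit gap
```

Gated attention takes a different route to the same end: a learned (e.g. sigmoid) gate on the attention output lets a head zero out its contribution directly, so neither method forces the network to grow outlier activations just to approximate "do nothing."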