The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Quantizing Transformers by Helping Attention Heads Do Nothing with Markus Nagel - #663

Dec 26, 2023
In this discussion, Markus Nagel, a research scientist at Qualcomm AI Research, shares insights from his recent papers at NeurIPS 2023, focusing on machine learning efficiency. He tackles the challenges of quantizing transformers, particularly in minimizing outlier issues in attention mechanisms. The conversation explores the pros and cons of pruning versus quantization for model weight compression and dives into innovative methods for multitask and multidomain learning. Additionally, the use of geometric algebra in enhancing algorithms for robotics is highlighted.
INSIGHT

Outliers in Transformer Quantization

  • Transformers are difficult to quantize because of outliers in their activations, which force a poor trade-off between clipping error and rounding error.
  • These outliers stem from attention heads trying to perform a "no update" behavior, which is not explicitly represented.
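The clipping/rounding trade-off can be illustrated with a minimal uniform-quantization sketch. The values and the `quantize_dequantize` helper below are illustrative, not from the episode: a single outlier stretches the quantization range, so the rounding error on all the ordinary activations grows.

```python
import numpy as np

def quantize_dequantize(x, num_bits=8):
    """Uniform symmetric quantization: the scale must cover the max magnitude."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=1000)           # typical activations, |x| mostly < 4
outlier_acts = np.append(acts, 60.0)   # one large outlier stretches the range

# mean rounding error on the ordinary values, with and without the outlier
err_plain = np.abs(acts - quantize_dequantize(acts)).mean()
err_outlier = np.abs(outlier_acts[:-1] - quantize_dequantize(outlier_acts)[:-1]).mean()
```

Covering the outlier inflates the scale, so `err_outlier` is much larger than `err_plain`; clipping the outlier instead would trade that for a large clipping error on the outlier itself.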
INSIGHT

Root Cause of Outliers

  • Outliers in transformers arise from attention heads focusing on delimiter or background tokens with low information content.
  • Other tokens attend to these special tokens to achieve a "no update" behavior, indirectly causing outliers.
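This mechanism can be sketched in a few lines. The token values and logits below are made up for illustration: to park nearly all of its attention on a near-zero-value delimiter token, a head needs a large logit gap, and producing that gap pushes the upstream activations toward outlier magnitudes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Values for 3 tokens; index 0 plays the role of a delimiter/background
# token whose value vector is close to zero (illustrative numbers).
V = np.array([[0.01, 0.0],
              [1.0, -0.5],
              [0.3,  0.8]])

# A head that wants "no update" must put ~all attention on token 0,
# which requires a large logit gap -> large (outlier) activations upstream.
logits = np.array([12.0, 0.0, 0.0])
weights = softmax(logits)
update = weights @ V   # ~ the delimiter's near-zero value vector
```

Because softmax can never output an exact zero, the "no update" can only be approximated, and the approximation gets better only as the logits (and hence the activations feeding them) get larger.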
ADVICE

Implementing "No Update"

  • Implement an explicit "no update" behavior in attention heads to mitigate outliers and improve quantization.
  • This can be achieved through methods like clipped softmax or gated attention.
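A minimal sketch of the clipped-softmax idea: stretch the softmax output slightly beyond [0, 1] and clip it back, so attention weights can reach exactly 0 (and 1) at modest, finite logits. The specific `gamma`/`zeta` values here are illustrative, not the paper's tuned hyperparameters.

```python
import numpy as np

def clipped_softmax(x, gamma=-0.03, zeta=1.03):
    """Stretch softmax outputs to [gamma, zeta], then clip to [0, 1].
    With gamma < 0, a weight can hit exactly 0 without an extreme logit gap."""
    e = np.exp(x - x.max())
    s = e / e.sum()
    return np.clip((zeta - gamma) * s + gamma, 0.0, 1.0)

logits = np.array([6.0, 0.0, 0.0])
w = clipped_softmax(logits)
# the two low-scoring tokens get weight exactly 0 at a modest logit gap
```

Gated attention takes a different route to the same end: a learned (e.g. sigmoid) gate on the attention output lets a head zero out its contribution directly, so neither method forces the network to grow outlier activations just to approximate "do nothing."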