arXiv Preprint - Efficient Streaming Language Models with Attention Sinks

Oct 3, 2023

03:46

Episode notes

In this episode we discuss "Efficient Streaming Language Models with Attention Sinks" by Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. The paper proposes StreamingLLM, a framework that lets Large Language Models (LLMs) trained with a finite attention window generalize to infinite sequence lengths without fine-tuning. The authors observe a phenomenon they call the attention sink: initial tokens receive strong attention scores even when they are not semantically important, so evicting them from the cache degrades window attention. Keeping the Key and Value (KV) states of these initial tokens alongside the sliding window of recent tokens restores stable performance while preserving the efficiency of window attention. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline with a speedup of up to 22.2x.
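
To make the caching idea concrete, here is a minimal sketch of the eviction policy described above: always keep the first few entries (the attention sinks) and a rolling window of the most recent ones. This is a toy illustration, not the authors' implementation. The class name `SinkKVCache`, the parameter names, and the sizes (4 sink tokens, 8 recent tokens) are assumptions chosen for the example, and plain integers stand in for the per-layer Key and Value tensors a real model would store.

```python
from collections import deque


class SinkKVCache:
    """Toy cache: keeps the first `n_sink` entries plus the last `window` entries.

    Hypothetical helper for illustration only; entries stand in for KV tensors.
    """

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sink = []                       # attention-sink entries, never evicted
        self.recent = deque(maxlen=window)   # rolling window of recent entries

    def append(self, kv):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv)
        else:
            # A full deque with maxlen drops its oldest item automatically.
            self.recent.append(kv)

    def entries(self):
        """Return the cached entries in order: sinks first, then recent ones."""
        return self.sink + list(self.recent)


# Stream 100 token positions through the cache.
cache = SinkKVCache(n_sink=4, window=8)
for pos in range(100):
    cache.append(pos)

kept = cache.entries()
print(kept)  # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

The cache size stays fixed at `n_sink + window` no matter how long the stream runs, which is why this approach avoids the recomputation cost of rebuilding the window for every new token.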
