
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750
Oct 7, 2025
Jacob Buckman, co-founder and CEO of Manifest AI, dives deep into the world of long-context transformers. He discusses techniques like windowed attention and the Power Retention approach, which melds attention and recurrence for large training speedups. Buckman also shares insights on Manifest AI's open-source tools, Vidrial and PowerCoder, and explores the significance of metrics for measuring context utility. Learn about balancing state and weights for compute-optimal architectures and the future potential of context length in AI.
AI Snips
Shrink State Along Different Axes
- Reduce state size along the time, head, feature, or layer axis to lower compute for long context.
- Choose windowing, GQA (grouped-query attention), latent attention, or hybrid layers depending on which state axis you want to shrink.
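Shrinking along the time axis is the simplest case to sketch: windowed attention keeps only the last `window` keys and values, so the per-layer state no longer grows with sequence length. A minimal numpy sketch (the function name and shapes are illustrative, not from the episode):

```python
import numpy as np

def windowed_attention(q, k, v, window):
    """Causal attention where each query attends only to the most recent
    `window` positions, capping the time axis of the KV state."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # key j is visible to query i iff i - window < j <= i
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (j <= i) & (j > i - window)
    scores = np.where(mask, scores, -np.inf)
    # numerically stable softmax over the visible keys
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = windowed_attention(q, k, v, window=3)  # shape (8, 4)
```

With `window >= T` this reduces to ordinary causal attention; GQA and latent attention instead shrink the head and feature axes of the same state tensor.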
RNN States Are Too Small Relative To Transformers
- RNN and state-space model (SSM) states are typically far smaller than a transformer's KV state, so they need a larger state to match capability.
- Balancing state size against weights is essential; mismatching the axes harms compute-optimal performance.
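To see the size gap concretely, compare a transformer's KV cache, which grows with context, against a fixed recurrent state. A rough back-of-the-envelope sketch with hypothetical model sizes (not figures from the episode):

```python
def kv_state_elems(layers, heads, head_dim, seq_len):
    # keys + values cached per layer, growing linearly with context
    return 2 * layers * heads * head_dim * seq_len

def rnn_state_elems(layers, state_dim):
    # fixed-size recurrent state, independent of context length
    return layers * state_dim

# hypothetical 32-layer model at a 64k-token context
kv = kv_state_elems(layers=32, heads=32, head_dim=128, seq_len=65536)
rnn = rnn_state_elems(layers=32, state_dim=4096)
print(kv // rnn)  # → 131072: the KV state is ~10^5 times larger
```

At these (assumed) sizes the transformer carries five orders of magnitude more state, which is the gap a capable RNN/SSM would need to close.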
Balance Weight And State FLOPs
- The ratio of weight FLOPs to state FLOPs predicts compute-optimal architectures.
- Doubling the smaller axis often gives large performance gains for little runtime cost, until the FLOPs are balanced.
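The balance point can be estimated with standard per-token FLOP counts: weight FLOPs scale with parameter count, state (attention) FLOPs scale with context length. A sketch using a hypothetical 7B-parameter model (the formulas are the usual rough accounting, not taken from the episode):

```python
def weight_flops_per_token(n_params):
    # forward pass: one multiply-add per parameter
    return 2 * n_params

def state_flops_per_token(layers, d_model, context):
    # QK^T and attention-weighted V: two matmuls against the whole context
    return 2 * 2 * layers * d_model * context

w = weight_flops_per_token(7e9)  # 1.4e10 FLOPs per token
layers, d_model = 32, 4096

# context length at which state FLOPs catch up with weight FLOPs
balanced = w / (4 * layers * d_model)
print(int(balanced))  # → 26702: beyond roughly this context, state FLOPs dominate
```

Below that context length the weight axis dominates and growing state is nearly free; above it, the trade reverses, which is the intuition behind doubling the smaller axis.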

