
Infinite Curiosity Pod with Prateek Joshi: AI Infra for Long Context Model Training | Anna Patterson, founder of Ceramic AI
Jun 17, 2025
In this conversation, Anna Patterson, co-founder of Ceramic AI and former VP of Engineering at Google, shares her insights on AI infrastructure for model training. She discusses achieving a 2.5x speedup in long-context training and the nuances among short, medium, and long contexts. Anna also dives into the importance of differentiating good data from bad, particularly in complex domains. She reflects on the significance of synthetic data and recent AI advancements, offering a glimpse into the future of personalized AI models.
Principles of Data Pruning
- Data pruning focuses on removing semantically repetitive and rarely queried content, such as celebrity gossip.
- Training benefits from promoting high-quality and topic-coherent content across long contexts.
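The pruning idea above, dropping documents that are semantically near-duplicates of already-kept ones, can be sketched with a toy greedy filter. This is an illustrative assumption, not Ceramic AI's actual pipeline: production systems typically use embeddings or MinHash at scale, while this sketch uses a simple bag-of-words cosine similarity.

```python
# Toy sketch of semantic near-duplicate pruning (NOT Ceramic AI's method):
# greedily keep a document only if it is sufficiently dissimilar from
# every document kept so far, using bag-of-words cosine similarity.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_near_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep each doc only if no already-kept doc is too similar to it."""
    kept: list[str] = []
    kept_vecs: list[Counter] = []
    for doc in docs:
        vec = Counter(doc.lower().split())
        if all(cosine(vec, v) < threshold for v in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept

docs = [
    "the model was trained on long context data",
    "the model was trained on long context data today",  # near-duplicate
    "celebrity gossip roundup for the week",
]
print(prune_near_duplicates(docs))
# The second doc is pruned as a near-duplicate of the first.
```

In practice the same greedy structure applies, but similarity would come from learned embeddings and approximate nearest-neighbor search rather than exact pairwise comparison.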
Grading Reasoning in Subjective Domains
- In subjective domains such as law, good data relies on authoritative sources like case law.
- Grading reasoning requires evaluating logical steps and preferences beyond just correctness.
Embrace Synthetic Data Growth
- Synthetic data is a valuable starting point and will likely become the majority of training data.
- Human data remains critical but is limited and cannot scale as quickly as synthetic data.

