
Infinite Curiosity Pod with Prateek Joshi: AI Infra for Long Context Model Training | Anna Patterson, founder of Ceramic AI
Jun 17, 2025
In this conversation, Anna Patterson, co-founder of Ceramic AI and former VP of Engineering at Google, shares her insights on AI infrastructure for model training. She discusses achieving a 2.5x speedup in long-context training and the nuances among short, medium, and long contexts. Anna also dives into the importance of differentiating good data from bad, particularly in complex domains. She reflects on the significance of synthetic data and recent AI advancements, offering a glimpse into the future of personalized AI models.
Principles of Data Pruning
- Data pruning focuses on removing semantically repetitive and rarely queried content, such as celebrity gossip.
- Training benefits from promoting high-quality and topic-coherent content across long contexts.
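The pruning idea above, dropping documents that are semantically near-duplicates of already-kept ones, can be sketched with a toy greedy filter. This is an illustrative assumption, not Ceramic AI's actual pipeline: production systems typically use embeddings or MinHash at scale, while this sketch uses a simple bag-of-words cosine similarity.

```python
# Toy sketch of semantic near-duplicate pruning (NOT Ceramic AI's method):
# greedily keep a document only if it is sufficiently dissimilar from
# every document kept so far, using bag-of-words cosine similarity.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_near_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep each doc only if no already-kept doc is too similar to it."""
    kept: list[str] = []
    kept_vecs: list[Counter] = []
    for doc in docs:
        vec = Counter(doc.lower().split())
        if all(cosine(vec, v) < threshold for v in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept

docs = [
    "the model was trained on long context data",
    "the model was trained on long context data today",  # near-duplicate
    "celebrity gossip roundup for the week",
]
print(prune_near_duplicates(docs))
# The second doc is pruned as a near-duplicate of the first.
```

In practice the same greedy structure applies, but similarity would come from learned embeddings and approximate nearest-neighbor search rather than exact pairwise comparison.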
Grading Reasoning in Subjective Domains
- In subjective domains such as law, good data relies on authoritative sources like case law.
- Grading reasoning requires evaluating logical steps and preferences beyond just correctness.
Embrace Synthetic Data Growth
- Synthetic data is a valuable starting point and will likely become the majority of training data.
- Human data remains critical but is limited and cannot scale as quickly as synthetic data.

