

The Information Bottleneck
Ravid Shwartz-Ziv & Allen Roush
Two AI researchers, Ravid Shwartz-Ziv and Allen Roush, discuss the latest trends, news, and research in generative AI, LLMs, GPUs, and cloud systems.
Episodes

Mar 24, 2026 • 1h 18min
Why Healthcare Is AI's Hardest and Most Important Problem with Kyunghyun Cho (NYU)
Kyunghyun Cho, NYU professor of Health Statistics and Computer Science and former Genentech executive, discusses why healthcare is uniquely hard for AI. He explores patient-controlled records, a provocative continuous randomized trial idea, the need for end-to-end drug discovery, mysteries around GLP-1s, antibiotic economics, and how unified language models could compress decades of drug development.

Mar 19, 2026 • 49min
Diffusion LLMs & Why the Future of AI Won't Be Autoregressive - Stefano Ermon (Stanford / Inception)
Stefano Ermon, Stanford professor and co-founder/CEO of Inception AI, is a co-inventor of DDIM and other foundational diffusion methods. He explains what diffusion LLMs are and why iterative refinement could overtake autoregressive models. The conversation covers discrete diffusion for text, inference speed and parallel generation, Mercury II's latency wins, and implications for architectures, tooling, and scaling.
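
To make the autoregressive-vs-diffusion contrast concrete, here is a toy sketch of diffusion-style decoding: start from a fully masked sequence and refine all positions in parallel over a few steps, committing the most confident predictions each round. This is not Inception's actual algorithm; `toy_denoiser` is a random stand-in for a trained model, and all names here are illustrative.

```python
# Toy sketch of diffusion-style text generation: begin fully masked,
# then refine every position in parallel, keeping the most confident
# proposals each round. The "denoiser" is a random placeholder, not
# a real model.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for a model: propose (token, confidence) per masked slot."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=8, steps=4):
    tokens = [MASK] * length               # start from all-mask "noise"
    per_step = max(1, length // steps)     # how many slots to commit per round
    for _ in range(steps):
        proposals = toy_denoiser(tokens)   # one parallel model call per round
        if not proposals:
            break
        # Commit only the highest-confidence proposals; re-mask the rest.
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:per_step]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

print(diffusion_decode())
```

The structural point: an autoregressive decoder makes one model call per token, while this loop makes a fixed number of calls regardless of sequence length, which is where the latency advantage discussed in the episode comes from.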

Mar 13, 2026 • 1h 12min
Training Is Nothing Like Learning with Naomi Saphra (Harvard)
Naomi Saphra, Kempner Research Fellow at Harvard and incoming assistant professor at Boston University, studies training dynamics and interpretability in deep learning. She explains why training is more like evolution than human learning. Topics include grokking and hidden phase transitions, symmetry breaking and head specialization, how code and tokenization shape behavior, and why run-to-run non-determinism matters.

Mar 6, 2026 • 1h 3min
EP28: How to Control a Stochastic Agent with Stefano Soatto (VP AWS / Prof. UCLA)
Stefano Soatto, VP for AI at AWS and UCLA professor leading work on agentic AI, discusses treating LLMs as stochastic dynamical systems that require control. He explains "strands" coding: function skeletons with verifiable pre- and post-conditions that constrain AI-generated code. The conversation covers the limits of vibe coding versus spec coding, why algorithmic information matters, and how world models emerge from rich multimodal reasoning engines.
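
The episode only names the skeleton idea, so here is a minimal sketch of what it could look like in practice: a hypothetical `contract` decorator turns human-written pre- and post-conditions into runtime checks around a model-generated body. The decorator name and the sorting example are illustrative, not Soatto's actual framework.

```python
# Minimal sketch of contract-constrained code in the spirit of the
# skeleton idea from the episode: the human specifies verifiable
# pre/post-conditions, and only the function body is left to the model.
import functools

def contract(pre, post):
    """Check pre(args) before the call and post(result, args) after it."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            assert pre(*args, **kwargs), f"precondition failed: {fn.__name__}"
            result = fn(*args, **kwargs)
            assert post(result, *args, **kwargs), f"postcondition failed: {fn.__name__}"
            return result
        return inner
    return wrap

@contract(
    pre=lambda xs: all(isinstance(x, int) for x in xs),
    post=lambda out, xs: out == sorted(out) and sorted(out) == sorted(xs),
)
def ai_sort(xs):
    # <-- body produced by the model; the contract bounds its behavior
    return sorted(xs)

print(ai_sort([3, 1, 2]))  # passes both checks: sorted, and a permutation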

Mar 2, 2026 • 1h 26min
EP27: Medical Foundation Models - with Tanishq Abraham (Sophont.AI)
Tanishq Abraham, CEO and co-founder of Sophont.ai, builds multimodal foundation models for pathology, neuroimaging, and clinical text. He discusses training on high-quality public data that rivals massive private datasets. The conversation covers finding signals doctors can't see, fusing strong single-modality encoders into multimodal systems, regulatory paths, and practical near-term impacts like pharma partnerships.

Feb 24, 2026 • 45min
EP26: Measuring Intelligence in the Wild - Arena and the Future of AI Evaluation
Anastasios Angelopoulos, a theoretical statistician and co-founder/CEO of Arena AI, explains why static benchmarks fail and how large-scale human-preference leaderboards work. He discusses style control versus substance, measuring AI-generated "slop," tool-use and code evaluation, and how real-user testing and rigorous statistics shape model leaderboards and pre-release testing.

Feb 17, 2026 • 1h 16min
EP25: Personalization, Data, and the Chaos of Fine-Tuning with Fred Sala (UW-Madison / Snorkel AI)
Fred Sala, Assistant Professor at UW-Madison and Chief Scientist at Snorkel AI, works on data-centric AI and weak supervision. He discusses why personalization is the next frontier for LLMs. Topics include security risks from personal agents, why prompting fails at scale, activation-steering methods like ReFT as an efficient personalization path, self-distillation for continual learning, and why high-quality data still beats fancy architecture.
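
For intuition on why steering is a cheap personalization path, here is a generic activation-steering sketch: a small per-user vector is added to one layer's hidden states via a PyTorch forward hook while the base weights stay frozen. This is not ReFT's exact formulation (ReFT learns structured interventions on hidden representations); the tiny model and the vector here are illustrative assumptions.

```python
# Generic activation-steering sketch (illustrative, not the ReFT paper's
# exact method): add a per-user "preference" vector to one layer's
# activations with a forward hook; the base model stays frozen.
import torch
import torch.nn as nn

hidden = 16
model = nn.Sequential(nn.Linear(8, hidden), nn.ReLU(), nn.Linear(hidden, 4))

# In practice this vector would be learned from a single user's data;
# here it is hand-set just to show the mechanism.
steer = torch.zeros(hidden)
steer[0] = 2.0

def add_steering(module, inputs, output):
    return output + steer          # shift activations toward the user's style

handle = model[0].register_forward_hook(add_steering)

x = torch.randn(1, 8)
print(model(x))                    # personalized forward pass
handle.remove()
print(model(x))                    # original behavior restored
```

The efficiency argument is visible in the sizes: personalization costs one `hidden`-dimensional vector per user instead of a full fine-tune of the model's weights.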

Feb 8, 2026 • 1h 32min
EP24: Can AI Learn to Think About Money? - with Bayan Bruss (Capital One)
Bayan Bruss, VP of Applied AI at Capital One, builds AI for autonomous financial decision-making. He explores why money is a uniquely hard ML problem. The conversation covers perception-belief-action frameworks for finance, foundation models versus purpose-built encoders, why synthetic time-series data helps, the limits of explainability, and hybrid latent versus language reasoning for financial systems.

Feb 1, 2026 • 1h 15min
EP23: Building Open Source AI Frameworks: David Mezzetti on TxtAI and Local-First AI
David Mezzetti, creator of txtai, is the solo developer of an open-source AI orchestration library focused on local-first and small-model workflows. He discusses why local-first AI matters, how COVID research led him to semantic search, the power of tiny models on CPU, evolving RAG and orchestration, and the trade-offs of resisting, then embracing, cloud APIs.

Jan 20, 2026 • 1h 26min
EP22: Data Curation for LLMs with Cody Blakeney (Datology AI)
Cody Blakeney from Datology AI joins us to talk about data curation - the unglamorous but critical work of figuring out what to actually train models on.

Cody's path from writing CUDA kernels to spending his days staring at weird internet text tells you something important: data quality can account for half or more of a model's final performance. That's on par with major architectural breakthroughs.

We get into the differences between pre-training, mid-training, and post-training data. Mid-training in particular has become a key technique for squeezing value out of rare, high-quality datasets. Cody's team stumbled onto it while solving a practical problem: how do you figure out if a 5-billion-token dataset is actually useful when you can't afford hundreds of experimental runs?

We also talk about data filtering and some genuinely surprising findings: the documents that make the best training data are often short and dense with information. Those nicely written blog posts with personal anecdotes? Turns out models don't learn as well from them.

On synthetic data, Cody thinks pre-training is still in its early days, where most techniques are variations on a few core ideas, but there's huge potential. He's excited about connecting RL failures back to mid-training: when models fail at tasks, use that signal to generate targeted training data.

Takeaways:
- Data work is high-leverage but underappreciated.
- Mid-training helps extract signal from small, valuable datasets.
- Good filters favor dense, factual text over polished prose.
- Synthetic data for pre-training works surprisingly well, but remains primitive.
- Optimal data mixtures depend on model scale; smaller models need more aggressive distribution shifts.

Timeline:
(00:12) Introduction to Data Curation in LLMs
(05:14) The Importance of Data Quality
(10:15) Pre-training vs Post-training Data
(15:22) Strategies for Effective Data Utilization
(20:15) Benchmarking and Model Evaluation
(28:28) Maximizing Perplexity and Coherence
(30:27) Measuring Quality in Data
(32:56) The Role of Filters in Data Selection
(34:19) Understanding High-Quality Data
(39:15) Mid-Training and Its Importance
(46:51) Future of Data Sources
(48:13) Synthetic Data's Role in Pre-Training
(53:10) Creating Effective Synthetic Data
(57:39) The Debate on Pure Synthetic Data
(01:00:25) Navigating AI Training and Legal Challenges
(01:02:34) The Controversy of AI in the Art Community
(01:05:29) Exploring Synthetic Data and Its Efficiency
(01:11:21) The Future of Domain-Specific vs. General Models
(01:22:06) Bias in Pre-trained Models and Data Selection
(01:28:27) The Potential of Synthetic Data Over Human Data

Music:
"Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
"Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
Changes: trimmed.

About
The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.


