

The Information Bottleneck
Ravid Shwartz-Ziv & Allen Roush
Two AI researchers, Ravid Shwartz-Ziv and Allen Roush, discuss the latest trends, news, and research in generative AI, LLMs, GPUs, and cloud systems.
Episodes

Mar 24, 2026 • 1h 18min
Why Healthcare Is AI's Hardest and Most Important Problem with Kyunghyun Cho (NYU)
Kyunghyun Cho, NYU professor of Health Statistics and Computer Science and former Genentech executive, discusses why healthcare is uniquely hard for AI. He explores patient-controlled records, a provocative continuous randomized trial idea, the need for end-to-end drug discovery, mysteries around GLP-1s, antibiotic economics, and how unified language models could compress decades of drug development.

Mar 19, 2026 • 49min
Diffusion LLMs & Why the Future of AI Won't Be Autoregressive - Stefano Ermon (Stanford / Inception)
Stefano Ermon, Stanford professor and co-founder/CEO of Inception AI, is a co-inventor of DDIM and other foundational diffusion methods. He explains what diffusion LLMs are and why iterative refinement could overtake autoregressive models. The conversation covers discrete diffusion for text, inference speed and parallel generation, Mercury II's latency wins, and implications for architectures, tooling, and scaling.
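
To make the autoregressive-vs-diffusion contrast concrete, here is a toy sketch of diffusion-style decoding: start from a fully masked sequence and refine all positions in parallel over a few steps, committing the most confident predictions each round. This is not Inception's actual algorithm; `toy_denoiser` is a random stand-in for a trained model, and all names here are illustrative.

```python
# Toy sketch of diffusion-style text generation: begin fully masked,
# then refine every position in parallel, keeping the most confident
# proposals each round. The "denoiser" is a random placeholder, not
# a real model.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for a model: propose (token, confidence) per masked slot."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=8, steps=4):
    tokens = [MASK] * length               # start from all-mask "noise"
    per_step = max(1, length // steps)     # how many slots to commit per round
    for _ in range(steps):
        proposals = toy_denoiser(tokens)   # one parallel model call per round
        if not proposals:
            break
        # Commit only the highest-confidence proposals; re-mask the rest.
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:per_step]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

print(diffusion_decode())
```

The structural point: an autoregressive decoder makes one model call per token, while this loop makes a fixed number of calls regardless of sequence length, which is where the latency advantage discussed in the episode comes from.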

Mar 13, 2026 • 1h 12min
Training Is Nothing Like Learning with Naomi Saphra (Harvard)
Naomi Saphra, Kempner Research Fellow at Harvard and incoming assistant professor at Boston University, studies training dynamics and interpretability in deep learning. She explains why training is more like evolution than human learning. Topics include grokking and hidden phase transitions, symmetry breaking and head specialization, how code and tokenization shape behavior, and why run-to-run non-determinism matters.

Mar 6, 2026 • 1h 3min
EP28: How to Control a Stochastic Agent with Stefano Soatto (VP AWS / Prof. UCLA)
Stefano Soatto, VP for AI at AWS and UCLA professor leading work on agentic AI, discusses treating LLMs as stochastic dynamical systems that require control. He explains "strands" coding: function skeletons with verifiable pre- and post-conditions that constrain AI-generated code. The conversation covers the limits of vibe coding versus spec coding, why algorithmic information matters, and how world models emerge from rich multimodal reasoning engines.
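
The episode only names the skeleton idea, so here is a minimal sketch of what it could look like in practice: a hypothetical `contract` decorator turns human-written pre- and post-conditions into runtime checks around a model-generated body. The decorator name and the sorting example are illustrative, not Soatto's actual framework.

```python
# Minimal sketch of contract-constrained code in the spirit of the
# skeleton idea from the episode: the human specifies verifiable
# pre/post-conditions, and only the function body is left to the model.
import functools

def contract(pre, post):
    """Check pre(args) before the call and post(result, args) after it."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            assert pre(*args, **kwargs), f"precondition failed: {fn.__name__}"
            result = fn(*args, **kwargs)
            assert post(result, *args, **kwargs), f"postcondition failed: {fn.__name__}"
            return result
        return inner
    return wrap

@contract(
    pre=lambda xs: all(isinstance(x, int) for x in xs),
    post=lambda out, xs: out == sorted(out) and sorted(out) == sorted(xs),
)
def ai_sort(xs):
    # <-- body produced by the model; the contract bounds its behavior
    return sorted(xs)

print(ai_sort([3, 1, 2]))  # passes both checks: sorted, and a permutation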

Mar 2, 2026 • 1h 26min
EP27: Medical Foundation Models - with Tanishq Abraham (Sophont.AI)
Tanishq Abraham, CEO and co-founder of Sophont.ai, builds multimodal foundation models for pathology, neuroimaging, and clinical text. He discusses training on high-quality public data that rivals massive private datasets. The conversation covers finding signals doctors can't see, fusing strong single-modality encoders into multimodal systems, regulatory paths, and practical near-term impacts like pharma partnerships.

Feb 24, 2026 • 45min
EP26: Measuring Intelligence in the Wild - Arena and the Future of AI Evaluation
Anastasios Angelopoulos, a theoretical statistician and co-founder/CEO of Arena AI, explains why static benchmarks fail and how large-scale human-preference leaderboards work. He discusses style control versus substance, measuring AI-generated "slop," tool-use and code evaluation, and how real-user testing and rigorous statistics shape model leaderboards and pre-release testing.

Feb 17, 2026 • 1h 16min
EP25: Personalization, Data, and the Chaos of Fine-Tuning with Fred Sala (UW-Madison / Snorkel AI)
Fred Sala, Assistant Professor at UW-Madison and Chief Scientist at Snorkel AI, works on data-centric AI and weak supervision. He discusses why personalization is the next frontier for LLMs. Topics include security risks from personal agents, why prompting fails at scale, activation-steering methods like ReFT as an efficient personalization path, self-distillation for continual learning, and why high-quality data still beats fancy architecture.
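
For intuition on why steering is a cheap personalization path, here is a generic activation-steering sketch: a small per-user vector is added to one layer's hidden states via a PyTorch forward hook while the base weights stay frozen. This is not ReFT's exact formulation (ReFT learns structured interventions on hidden representations); the tiny model and the vector here are illustrative assumptions.

```python
# Generic activation-steering sketch (illustrative, not the ReFT paper's
# exact method): add a per-user "preference" vector to one layer's
# activations with a forward hook; the base model stays frozen.
import torch
import torch.nn as nn

hidden = 16
model = nn.Sequential(nn.Linear(8, hidden), nn.ReLU(), nn.Linear(hidden, 4))

# In practice this vector would be learned from a single user's data;
# here it is hand-set just to show the mechanism.
steer = torch.zeros(hidden)
steer[0] = 2.0

def add_steering(module, inputs, output):
    return output + steer          # shift activations toward the user's style

handle = model[0].register_forward_hook(add_steering)

x = torch.randn(1, 8)
print(model(x))                    # personalized forward pass
handle.remove()
print(model(x))                    # original behavior restored
```

The efficiency argument is visible in the sizes: personalization costs one `hidden`-dimensional vector per user instead of a full fine-tune of the model's weights.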

Feb 8, 2026 • 1h 32min
EP24: Can AI Learn to Think About Money? - with Bayan Bruss (Capital One)
Bayan Bruss, VP of Applied AI at Capital One, builds AI for autonomous financial decision-making. He explores why money is a uniquely hard ML problem. The conversation covers perception-belief-action frameworks for finance, foundation models versus purpose-built encoders, why synthetic time-series data helps, the limits of explainability, and hybrid latent versus language reasoning for financial systems.

Feb 1, 2026 • 1h 15min
EP23: Building Open Source AI Frameworks: David Mezzetti on TxtAI and Local-First AI
David Mezzetti, creator of txtai, is the solo developer of an open-source AI orchestration library focused on local-first and small-model workflows. He discusses why local-first AI matters, how COVID research led him to semantic search, the power of tiny models on CPU, evolving RAG and orchestration, and the trade-offs of resisting, then embracing, cloud APIs.

Jan 20, 2026 • 1h 26min
EP22: Data Curation for LLMs with Cody Blakeney (Datology AI)
Cody Blakeney from Datology AI joins us to talk about data curation - the unglamorous but critical work of figuring out what to actually train models on.

Cody's path from writing CUDA kernels to spending his days staring at weird internet text tells you something important: data quality can account for half or more of a model's final performance. That's on par with major architectural breakthroughs.

We get into the differences between pre-training, mid-training, and post-training data. Mid-training in particular has become a key technique for squeezing value out of rare, high-quality datasets. Cody's team stumbled onto it while solving a practical problem: how do you figure out if a 5-billion-token dataset is actually useful when you can't afford hundreds of experimental runs?

We also talk about data filtering and some genuinely surprising findings: the documents that make the best training data are often short and dense with information. Those nicely written blog posts with personal anecdotes? Turns out models don't learn as well from them.

On synthetic data, Cody thinks pre-training is still in its early days, where most techniques are variations on a few core ideas, but there's huge potential. He's excited about connecting RL failures back to mid-training: when models fail at tasks, use that signal to generate targeted training data.

Takeaways:
- Data work is high-leverage but underappreciated.
- Mid-training helps extract signal from small, valuable datasets.
- Good filters favor dense, factual text over polished prose.
- Synthetic data for pre-training works surprisingly well, but remains primitive.
- Optimal data mixtures depend on model scale; smaller models need more aggressive distribution shifts.

Timeline:
(00:12) Introduction to Data Curation in LLMs
(05:14) The Importance of Data Quality
(10:15) Pre-training vs Post-training Data
(15:22) Strategies for Effective Data Utilization
(20:15) Benchmarking and Model Evaluation
(28:28) Maximizing Perplexity and Coherence
(30:27) Measuring Quality in Data
(32:56) The Role of Filters in Data Selection
(34:19) Understanding High-Quality Data
(39:15) Mid-Training and Its Importance
(46:51) Future of Data Sources
(48:13) Synthetic Data's Role in Pre-Training
(53:10) Creating Effective Synthetic Data
(57:39) The Debate on Pure Synthetic Data
(01:00:25) Navigating AI Training and Legal Challenges
(01:02:34) The Controversy of AI in the Art Community
(01:05:29) Exploring Synthetic Data and Its Efficiency
(01:11:21) The Future of Domain-Specific vs. General Models
(01:22:06) Bias in Pre-trained Models and Data Selection
(01:28:27) The Potential of Synthetic Data Over Human Data

Music:
"Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
"Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
Changes: trimmed.

About
The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.


