Tech on the Rocks

From pandas to Arrow: Wes McKinney on the Future of Data Infrastructure

Dec 1, 2025
Wes McKinney, creator of pandas and co-creator of Apache Arrow and Ibis, is a long-time leader in the Python data ecosystem. He walks through pandas’ UX-driven origins and the move to Arrow's columnar in-memory format. The conversation covers Arrow vs Parquet, new GPU-friendly file encodings, big metadata and table formats, Rust query engines like DataFusion, and how AI agents are changing developer workflows.
INSIGHT

Arrow Enables Zero Copy Cross Language Transfer

  • Apache Arrow defines an in-memory columnar layout with record batches and schemas to enable zero-copy cross-process and cross-language data transfer.
  • Arrow IPC uses a small metadata prefix, serialized with FlatBuffers, so receivers can map buffers into language objects without paying a rehydration (decode) cost.
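
The idea in the two bullets above can be sketched with a toy message layout. This is not the real Arrow IPC format: the JSON header stands in for Arrow's FlatBuffers metadata, and `write_message` and `read_message` are hypothetical helper names invented for this illustration. It only shows how a small metadata prefix lets a receiver wrap the data buffer in place instead of copying or decoding it.

```python
import json
import struct
from array import array

def write_message(column_name, values):
    """Toy layout (assumption): [4-byte header length][JSON header][raw int64 buffer]."""
    body = array("q", values).tobytes()  # contiguous 8-byte signed integers
    header = json.dumps(
        {"name": column_name, "type": "int64", "length": len(values)}
    ).encode("utf-8")
    return struct.pack("<I", len(header)) + header + body

def read_message(message):
    """Parse the small header, then alias the data bytes without copying them."""
    (header_len,) = struct.unpack_from("<I", message, 0)
    header = json.loads(message[4:4 + header_len].decode("utf-8"))
    start = 4 + header_len
    nbytes = header["length"] * 8
    # Slicing and casting a memoryview creates a new view, not a copy of the data.
    view = memoryview(message)[start:start + nbytes].cast("q")
    return header, view

message = write_message("user_id", [10, 20, 30, 40])
header, column = read_message(message)
print(header["name"], list(column))
print(column.obj is message)  # the view still points at the original bytes
```

Only the header is parsed; the column values are read straight out of the received bytes. Real Arrow additionally pads and aligns buffers and supports many types, nulls, and nested data, none of which this sketch attempts.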
INSIGHT

Arrow Versus Parquet Tradeoffs

  • Parquet is optimized for compact on-disk storage via dictionary, RLE and general-purpose compression, requiring decoding work on read.
  • Arrow intentionally stores fully rehydrated memory layouts to avoid decode costs and favor modern high-bandwidth, parallel compute.
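
A minimal sketch of the tradeoff above, using plain Python lists rather than either format's actual byte layout. The function names are hypothetical. Dictionary encoding followed by run-length encoding (RLE) shrinks a repetitive column, but a reader must run a decode step before it can touch the values, whereas a fully materialized column is usable immediately.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary of distinct values."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def rle_encode(codes):
    """Collapse consecutive repeats into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

def decode(dictionary, runs):
    """The work a reader pays on every scan of the compact form."""
    out = []
    for code, count in runs:
        out.extend([dictionary[code]] * count)
    return out

cities = ["NYC", "NYC", "NYC", "SF", "SF", "NYC", "LA", "LA", "LA", "LA"]
dictionary, codes = dictionary_encode(cities)
runs = rle_encode(codes)
restored = decode(dictionary, runs)

print(dictionary)  # distinct values
print(runs)        # compact form: 4 runs instead of 10 values
print(restored == cities)
```

Here `runs` plays the role of the compact on-disk form and `restored` the rehydrated in-memory form: the first is smaller, the second needs no further work before compute.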
INSIGHT

Next Gen File Formats Tackle Parquet Limits

  • New file formats aim to replace Parquet's costly metadata and general-purpose compression with lightweight GPU/CPU-friendly encodings and better random access.
  • Parquet pain points include wide schemas, costly metadata deserialization, unknown memory needs, and poor GPU decode predictability.
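
The episode summary does not name specific encodings, so the following is only an illustration of what "lightweight" and "random access" can mean in practice. It uses frame-of-reference encoding with a fixed byte width, chosen here as an assumption, with hypothetical helper names. Because every value occupies the same number of bytes, value `i` can be read directly, and the decoded size is known before any decoding happens.

```python
def for_encode(values):
    """Frame-of-reference: store min(values) once, then fixed-width offsets from it."""
    base = min(values)
    # Bytes needed for the largest offset (at least one byte).
    width = max(1, ((max(values) - base).bit_length() + 7) // 8)
    payload = b"".join((v - base).to_bytes(width, "little") for v in values)
    return base, width, payload

def for_get(encoded, i):
    """Random access: read element i without decoding its neighbors."""
    base, width, payload = encoded
    return base + int.from_bytes(payload[i * width:(i + 1) * width], "little")

def decoded_size_bytes(encoded, value_bytes=8):
    """Memory needed for the fully decoded int64 column, computable up front."""
    _, width, payload = encoded
    return (len(payload) // width) * value_bytes

timestamps = [1_700_000_000 + d for d in (0, 5, 9, 14, 200, 255)]
encoded = for_encode(timestamps)

print(len(encoded[2]))              # payload bytes: one byte per value here
print(for_get(encoded, 4))          # direct lookup of the fifth value
print(decoded_size_bytes(encoded))  # known before decoding
```

Contrast this with general-purpose block compression, where a reader typically has to decompress a whole block to reach one value and may not know the output size in advance.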