
Data Engineering Podcast
Unfreezing The Data Lake: The Future-Proof File Format
Dec 29, 2025
Xinyu Zeng, a PhD student and database researcher, discusses F3, the 'future-proof file format' he is developing. He covers the limitations of existing formats such as Parquet and ORC, including CPU-bound decoding and metadata overhead. F3 rethinks the file layout and uses WebAssembly so that files can carry their own decoders. Xinyu also explains why table formats and file formats should be decoupled, discusses support for multimodal data, and shares future directions, including integration with existing data lake technologies.
AI Snips
Research Journey To F3
- Xinyu traced F3 back to his PhD work and a benchmark paper showing Parquet's shortcomings.
- He initially tried a community effort but built F3 after consensus and legal hurdles slowed progress.
Decouple Table And File Layers
- Table formats emerged to add metadata, transactions, and versioning above files.
- Decoupling table and file formats enables independent evolution and mix-and-match combinations.
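The decoupling described above can be sketched in a few lines. This is a minimal illustration only, not how any real table format is implemented: the `Table`, `DataFile`, and `FAKE_STORAGE` names and the in-memory "files" are all hypothetical. The table layer tracks snapshots and file references, and each file is read through whichever reader is registered for its format tag.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, int]
FileReader = Callable[[str], List[Row]]

# Hypothetical in-memory stand-in for object storage (path -> rows).
FAKE_STORAGE: Dict[str, List[Row]] = {
    "part-0.parquet": [{"id": 1}, {"id": 2}],
    "part-1.f3": [{"id": 3}],
}


@dataclass(frozen=True)
class DataFile:
    path: str
    file_format: str  # the table layer only stores a tag, not decoding logic


@dataclass
class Table:
    """Toy table layer: versioned lists of data files, nothing else."""
    snapshots: List[List[DataFile]] = field(default_factory=list)

    def commit(self, files: List[DataFile]) -> int:
        # Each commit appends a new snapshot and returns its id.
        self.snapshots.append(list(files))
        return len(self.snapshots) - 1

    def scan(self, readers: Dict[str, FileReader], snapshot_id: int = -1) -> List[Row]:
        # Dispatch on the format tag; the file layer can change independently.
        rows: List[Row] = []
        for f in self.snapshots[snapshot_id]:
            rows.extend(readers[f.file_format](f.path))
        return rows


# Both "readers" just look up the fake storage; real ones would decode bytes.
readers: Dict[str, FileReader] = {
    "parquet": lambda path: FAKE_STORAGE[path],
    "f3": lambda path: FAKE_STORAGE[path],
}

table = Table()
v0 = table.commit([DataFile("part-0.parquet", "parquet")])
v1 = table.commit([DataFile("part-0.parquet", "parquet"),
                   DataFile("part-1.f3", "f3")])
all_rows = table.scan(readers)
old_rows = table.scan(readers, snapshot_id=v0)
```

Because the table layer holds only paths and format tags, one snapshot can mix file formats, and a new file format needs only a new entry in `readers`.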
Make Decoding Pluggable With WASM
- Design file formats with an extensible layout and pluggable decoding kernels.
- Embed portable kernels (F3 uses WebAssembly) so new encodings deploy without waiting for all readers to upgrade.
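The pluggable-kernel idea above can be sketched as follows. This is a toy model under stated assumptions, not F3's actual layout or API: `ColumnChunk`, `Reader`, and the two encodings are hypothetical, and a plain Python callable stands in for the embedded WebAssembly module that a real file would carry as bytes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Decoder = Callable[[bytes], List[int]]


def decode_plain(buf: bytes) -> List[int]:
    """One byte per value."""
    return list(buf)


def decode_rle(buf: bytes) -> List[int]:
    """Run-length encoding stored as (run_length, value) byte pairs."""
    out: List[int] = []
    for i in range(0, len(buf), 2):
        out.extend([buf[i + 1]] * buf[i])
    return out


@dataclass
class ColumnChunk:
    encoding: str
    payload: bytes
    # Assumption: stands in for a Wasm kernel shipped inside the file.
    embedded_kernel: Optional[Decoder] = None


class Reader:
    def __init__(self, native: Dict[str, Decoder]):
        self.native = dict(native)

    def decode(self, chunk: ColumnChunk) -> List[int]:
        # Prefer a native decoder when the reader knows the encoding;
        # otherwise fall back to the kernel embedded in the file.
        kernel = self.native.get(chunk.encoding) or chunk.embedded_kernel
        if kernel is None:
            raise ValueError(f"no decoder for encoding {chunk.encoding!r}")
        return kernel(chunk.payload)


# An "old" reader that predates the RLE encoding.
old_reader = Reader({"plain": decode_plain})

rle_chunk = ColumnChunk("rle", bytes([3, 7, 2, 9]), embedded_kernel=decode_rle)
decoded = old_reader.decode(rle_chunk)
```

Here the old reader decodes the RLE chunk even though it has no native RLE support, because the chunk carries its own kernel. A chunk with an unknown encoding and no embedded kernel raises `ValueError`.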
