
How AI Is Built #009: Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack
May 24, 2024. Jorrit Sandbrink, a data engineer, discusses the lakehouse architecture that blends the data warehouse and the data lake, key components such as Delta Lake and Apache Spark, optimizations like partitioning strategies, and data ingestion with DLT. The episode emphasizes open-source solutions, considerations when choosing tools, and the evolving data landscape.
Decouple Storage And Compute
- The lakehouse decouples storage and compute, combining the strengths of the data warehouse and the data lake.
- Table formats add a metadata layer over Parquet files to enable transactions and data management.
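A toy, stdlib-only sketch of that idea (not the actual Delta Lake protocol; file names and helper functions here are made up for illustration): each commit appends a JSON log entry of add/remove actions over immutable data files, and readers replay the log to see a consistent snapshot.

```python
import json
import os
import tempfile

# Toy transaction log in the spirit of Delta Lake's _delta_log:
# each commit is a JSON file of "add"/"remove" actions; readers
# replay the log in order to find the currently live data files.
def commit(log_dir, version, actions):
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def live_files(log_dir):
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
    return files

log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": "part-0.parquet"}])
commit(log_dir, 1, [{"add": "part-1.parquet"}])
commit(log_dir, 2, [{"remove": "part-0.parquet"},
                    {"add": "part-2.parquet"}])  # e.g. a compaction
print(sorted(live_files(log_dir)))  # ['part-1.parquet', 'part-2.parquet']
```

Because the log, not the file listing, defines the table, writers can replace files atomically and readers never observe a half-finished commit.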
Pick A Standard Table Format
- Choose an established table format such as Delta Lake, Apache Iceberg, or Apache Hudi rather than inventing your own.
- Use Parquet as the underlying file format and rely on the table format's metadata for data quality and transactions.
Lakehouses Don’t Require Spark
- You can use non-Spark compute engines for a lakehouse when data sizes are moderate.
- Engines like Polars run fast single-node workloads and can still read Delta tables.
