
How AI Is Built #009: Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack
May 24, 2024. Jorrit Sandbrink, a data engineer, discusses the lakehouse architecture that blends the data warehouse and the data lake, key components such as Delta Lake and Apache Spark, optimizations like partitioning strategies, and data ingestion with DLT. The episode emphasizes open-source solutions, considerations when choosing tools, and the evolving data landscape.
Decouple Storage And Compute
- The lakehouse decouples storage and compute, combining the strengths of the data warehouse and the data lake.
- Table formats add a metadata layer over Parquet files to enable transactions and data management.
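A toy, stdlib-only sketch of that idea (not the actual Delta Lake protocol; file names and helper functions here are made up for illustration): each commit appends a JSON log entry of add/remove actions over immutable data files, and readers replay the log to see a consistent snapshot.

```python
import json
import os
import tempfile

# Toy transaction log in the spirit of Delta Lake's _delta_log:
# each commit is a JSON file of "add"/"remove" actions; readers
# replay the log in order to find the currently live data files.
def commit(log_dir, version, actions):
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def live_files(log_dir):
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
    return files

log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": "part-0.parquet"}])
commit(log_dir, 1, [{"add": "part-1.parquet"}])
commit(log_dir, 2, [{"remove": "part-0.parquet"},
                    {"add": "part-2.parquet"}])  # e.g. a compaction
print(sorted(live_files(log_dir)))  # ['part-1.parquet', 'part-2.parquet']
```

Because the log, not the file listing, defines the table, writers can replace files atomically and readers never observe a half-finished commit.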
Pick A Standard Table Format
- Choose an established table format such as Delta Lake, Apache Iceberg, or Apache Hudi rather than inventing your own.
- Use Parquet as the underlying file format and rely on the table format's metadata for data quality and transactions.
Lakehouses Don’t Require Spark
- You can use non-Spark compute engines for a lakehouse when data sizes are moderate.
- Engines like Polars run fast single-node workloads and can still read Delta tables.
