

The Data Engineering Show
The Firebolt Data Bros
The Data Engineering Show is a podcast for data engineering and BI practitioners to go beyond theory. Learn from the biggest influencers in tech about their practical day-to-day data challenges and solutions in a casual and fun setting.
SEASON 1 DATA BROS
Eldad and Boaz Farkash shared the same stuffed toys growing up as well as a big passion for data. After founding Sisense and building it to become a high-growth analytics unicorn, they moved on to their next venture, Firebolt, a leading high-performance cloud data warehouse.
SEASON 2 DATA BROS
In season 2 Eldad adopted a brilliant new little brother, and with their shared love for query processing, the connection was immediate. After excelling in his MSc in Computer Science, Benjamin Wagner joined Firebolt to lead its query processing team and is a rising star in the data space.
For inquiries contact tamar@firebolt.io
Website: https://www.firebolt.io
Episodes

Mar 24, 2026 • 18min
The Data Fusion Secret & Why Custom Query Engines Fail with Nikita Lapkov
In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Nikita Lapkov, Senior Software Engineer at Cloudflare, to explore the architecture, design decisions, and future roadmap of R2 SQL, Cloudflare's new R2-based distributed query engine launched in September 2024.

What You'll Learn:
- How to leverage existing query engines strategically: why Cloudflare chose Apache DataFusion for single-node query processing rather than building an analytical engine from scratch, freeing engineering resources for distributed orchestration challenges.
- The stateless architecture pattern for global infrastructure: how to design compute nodes that hold zero persistent state by storing all metadata in a distributed catalog (Iceberg), enabling per-query worker provisioning across 300+ geographically dispersed data centers.
- Why filter pushdown and metadata-driven pruning are non-negotiable optimizations: how to reduce the data scanned from object storage before query execution begins by leveraging catalog statistics and range filtering, the foundation of R2 SQL's performance gains.
- How to solve version compatibility at infrastructure scale: why backward compatibility matters more than cross-version support when you can't control individual node upgrade timing, and how this constraint drives architectural decisions.
- The shuffle strategy for point-to-point distributed joins: how to implement in-memory and disk-based shuffles within ephemeral worker clusters using network-addressable worker IDs, allowing stateless workers to forget everything after query completion.
- Why adaptive query execution is the next frontier for petabyte-scale analytics: how collecting runtime data-distribution statistics mid-execution enables mid-flight plan reconfiguration, a technique worth the overhead when queries run for minutes or hours rather than milliseconds.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

About the Guest(s)
Nikita is a Senior Software Engineer at Cloudflare, specializing in distributed query engines and data platform architecture. With extensive experience in database internals gained through roles at ClickHouse, Yandex, and MongoDB, Nikita has developed deep expertise in query optimization and system design at scale. At Cloudflare, he leads the development of R2 SQL, a distributed analytical query engine built on Apache DataFusion and a critical component of Cloudflare's data platform. In this episode, Nikita discusses the architecture, design decisions, and technical challenges of building a stateless, distributed SQL engine across Cloudflare's unique 300-location infrastructure, offering valuable insights for engineers working on large-scale data systems. His work demonstrates how thoughtful architectural choices and infrastructure constraints drive innovation in distributed database systems.

Quotes
"It was my crash course into OS engineering. We encountered every possible bug in this project. It was very painful and very hard." - Nikita Lapkov
"Collecting a stack trace is very hidden, especially if you're not writing in C or C++. It is actually a very complicated and involved process." - Nikita Lapkov
"What excites me is that it has free egress. Usually, you would pay per gigabyte to load your data. You don't have that with R2." - Nikita Lapkov
"What we explicitly wanted to avoid when building R2 SQL is building an analytical query engine again. We would much rather use something off the shelf and work on the interesting distributed parts." - Nikita Lapkov
"No matter how complex the query is, you can make a case that, with extreme cases, the throughput for a single load operation is relatively constant, no matter how complex the query is." - Nikita Lapkov
"We try to be as stateless as possible. All our state lives in the catalog itself, so we only need what's in the catalog and the query that comes from the request." - Nikita Lapkov
"The shuffles cannot really be reused unless you do some very fancy heuristics. Once we have picked the workers for a particular query, we can think of them as our little cluster." - Nikita Lapkov
"Joins consume your entire roadmap, and this is pretty much what will be happening with us at some point. We need to make sure that distributed joins work really well, no matter what your data distribution is like." - Nikita Lapkov
"We have potentially minutes to spare, and optimizing even subparts of the query is worth investigating because it could shave hours or something like that." - Nikita Lapkov
"Finding the safe points for replanning and doing this distributed coordination while we have 50 different workers working on different parts of the query is definitely the area we want to look at in the coming year." - Nikita Lapkov

Resources
Connect on LinkedIn:
Nikita Lapkov - https://www.linkedin.com/in/nikitalapkov
Benjamin Wagner - https://www.linkedin.com/in/wagjamin
Websites:
Firebolt – firebolt.io
Cloudflare – cloudflare.com
Apache DataFusion – datafusion.apache.org
Tools & Platforms:
R2 SQL – Cloudflare's R2-based query engine for analytical queries
Apache DataFusion – Analytical query engine used for single-node number crunching
Arroyo – Rust-based streaming solution built on DataFusion
R2 – S3-compatible object storage with free egress
Apache Iceberg – Catalog system for state management

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of The Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:
Zach Wilson on What Makes a Great Data Engineer
Joe Reis and Matt Housley on The Fundamentals of Data Engineering
Bill Inmon, The Godfather of Data Warehousing
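The metadata-driven pruning idea discussed in this episode is easy to see in miniature. The sketch below is illustrative only, not R2 SQL's code: the `FileStats` class, the `prune` helper, and the `ts` column are invented names standing in for the per-file min/max statistics a catalog like Iceberg exposes.

```python
# Hedged sketch: prune data files using per-file min/max statistics
# before reading any bytes from object storage. All names are invented
# for illustration; this is the shape of the idea, not R2 SQL's code.

from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_ts: int  # minimum value of the filtered column in this file
    max_ts: int  # maximum value of the filtered column in this file

def prune(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps [lo, hi]."""
    return [f.path for f in files if f.max_ts >= lo and f.min_ts <= hi]

files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# A predicate like WHERE ts BETWEEN 150 AND 250 only needs two of the
# three files, so the third is never fetched from object storage.
print(prune(files, 150, 250))  # ['part-1.parquet', 'part-2.parquet']
```

The same overlap test generalizes to any column with ordered statistics, which is why catalog-level range filtering pays off before the query engine even starts.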

Mar 10, 2026 • 24min
How Zipline AI Turns Weeks of Engineering Into Minutes of SQL Queries ft. Nikhil Simha
In this episode of The Data Engineering Show, host Benjamin sits down with Nikhil Simha, CTO of Zipline AI and co-author of Chronon, to explore how a declarative feature platform solves the speed-vs-scale paradox in modern ML infrastructure, from fraud detection at Airbnb to powering OpenAI's recommendation systems.

What You'll Learn:
- How to eliminate the data-scientist-to-ML-engineer bottleneck by generating Spark, Flink, and orchestration pipelines automatically from simple SQL queries, enabling data scientists to ship features independently without waiting for engineering resources.
- Why fraud detection demands real-time feature iteration: the adversarial nature of fraud requires companies to build and deploy new detection models in days, not months, a timeline impossible with manual pipeline engineering.
- The "precompute everything" optimization principle for serving latency: Chronon minimizes query response time by batching feature computation upstream through stream and batch processing, then delivering pre-aggregated signals to models in milliseconds.
- How to safely ship feature versions in production using dual-write strategies that keep old and new feature versions running simultaneously, enabling A/B testing and instant rollbacks without service disruption.
- Why context engineering, not just RAG, powers modern LLM applications: ML model predictions (fraud risk scores, user signals, embeddings) feed directly into LLM prompts as structured context, improving decision quality for both human and AI agents.
- The critical gap in open-source data infrastructure: modern systems need query engines that scale seamlessly from a single machine to distributed clusters; today's choice between lightweight tools (DuckDB) and heavyweight platforms (Spark) leaves mid-scale and product-embedded analytics underserved.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

About the Guest(s)
Nikhil Simha is the CTO at Zipline AI, bringing extensive experience from leadership roles at Airbnb and Facebook. He is a co-author of Chronon, an open-source feature engineering platform that automates the generation of ML infrastructure from declarative queries. With deep expertise in real-time data systems, fraud detection, and feature engineering at scale, Nikhil has architected solutions powering recommendation systems and risk detection across billions of user interactions. In this episode, he shares insights on building scalable ML infrastructure, integrating LLMs with real-time feature contexts, and the evolving data engineering landscape. His work has directly impacted how organizations from early-stage startups to Fortune 500 companies approach feature engineering and real-time ML serving, making this conversation essential for engineers building production AI systems.

Quotes
"Fraud is adversarial. Right? Like, someone comes up with a new way to do fraud somewhere around the world, and people at Airbnb need to react to it very quickly." - Nikhil
"Chronon, at its core, generates these systems from queries. So users write queries on Chronon, and we generate all of these under the hood." - Nikhil
"Chronon allows data scientists to operate independently." - Nikhil
"The main problem there was that the traditional model of data scientists writing some logic and ML engineers going and building a system out for that logic, that was too slow for fraud detection." - Nikhil
"They have to come up with a new model in a matter of days. They don't have, like, this three to five month period where they can sit and create the new model, build all of these pipelines." - Nikhil
"There is a real gap in the industry for an engine that goes all the way from single machine scale to thousands of machine scale seamlessly." - Nikhil
"Most people, for ninety-five percent of their queries, don't need Spark. Right? But there is that 5% usually, like, a lot of ML falls into that." - Nikhil
"We are handling query fragments. Right? We take query fragments, generate very specialized logic for that, and run that through Spark's distributed processing topologies." - Nikhil
"The new trend in the industry would be, like, towards these engines that can work at any scale and be useful for interactive and large processing workloads." - Nikhil
"I think Iceberg is great that way because you're not fragmenting to different proprietary data formats, different proprietary engines." - Nikhil

Resources
Connect on LinkedIn:
Nikhil Simha - https://www.linkedin.com/in/nikhilsimha
Benjamin Wagner - https://www.linkedin.com/in/wagjamin
Websites:
Zipline AI – zipline.ai
Firebolt – firebolt.io
Tools & Platforms:
Chronon – Feature engineering and real-time ML infrastructure platform for generating data pipelines from queries
Apache Spark – Distributed data processing engine for batch and large-scale processing workloads
Apache Flink – Stream processing engine for real-time data transformations
Redis – In-memory key-value store for feature serving
Apache Iceberg – Open table format for data lake storage
Airflow – Workflow orchestration platform for pipeline scheduling
DuckDB – Open-source analytical database for single-machine to moderate-scale processing
BigQuery – Google Cloud data warehouse
Snowflake – Cloud-based data warehouse platform
Kubernetes – Container orchestration platform
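The dual-write rollout strategy described in this episode can be sketched in a few lines of Python. Everything here (the toy `store` dict, the `active` flag, the `spend_7d` feature name) is hypothetical; it shows the shape of the pattern, not Chronon's API.

```python
# Hedged sketch of a dual-write feature rollout: both the old and the
# new feature version are written on every update, and a per-feature
# flag decides which version reads see. Flipping the flag back is an
# instant rollback with no backfill. All names are invented examples.

store = {}                     # feature store: (feature, version, key) -> value
active = {"spend_7d": "v1"}    # which version currently serves reads

def write(feature, key, v1_value, v2_value):
    store[(feature, "v1", key)] = v1_value
    store[(feature, "v2", key)] = v2_value   # dual write: both versions land

def read(feature, key):
    return store[(feature, active[feature], key)]

write("spend_7d", "user_42", 103.0, 98.5)    # v2 uses a corrected window
assert read("spend_7d", "user_42") == 103.0  # reads still served from v1

active["spend_7d"] = "v2"                    # cut over (or A/B by cohort)
assert read("spend_7d", "user_42") == 98.5

active["spend_7d"] = "v1"                    # instant rollback, no data moved
```

Because both versions are continuously populated, the cutover and the rollback are pure metadata flips, which is what makes A/B testing feature versions cheap.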

Feb 19, 2026 • 16min
The Geo-Data Problem Nobody Talks About And How Voi Solved It ft. Magnus Dahlbäck
In this episode of The Data Engineering Show, host Benjamin sits down with Magnus Dahlbäck, Senior Director of Data and Platform at Voi, to explore how a rapidly scaling European e-scooter company transformed its data infrastructure, adopted a metrics-first approach to analytics, and is now leveraging AI to solve real-time operational challenges across 150 cities and 150,000 vehicles.

What You'll Learn:
- How to escape the "dashboard chaos" trap by adopting a metrics-first architecture with a semantic layer, reducing confusion from hundreds of conflicting dashboards to a single source of truth across the organization.
- Why replacing Tableau with Steep (a metrics-centric BI tool) unlocked self-service analytics for non-technical users, empowering teams to answer their own data questions without waiting months for custom dashboard builds.
- The real-world cost-optimization challenge of managing Snowflake expenses that scale 1:1 with ride volume, and why data leaders must constantly rethink architecture to control FinOps in high-growth environments.
- How to architect for IoT at scale: processing billions of daily events from connected vehicles using micro-batch pipelines (5-minute intervals) while keeping real-time machine learning inference separate through cross-functional product teams.
- The decision framework for choosing traditional ML vs. LLMs: use traditional methods for accuracy-critical workloads (supply-demand forecasting for vehicle positioning) and LLMs for pattern discovery where 100% precision isn't required (analyzing rider feedback).
- How to build proactive customer support powered by data and AI: leverage sensor data and ride telemetry to detect poor user experiences and reach out before customers complain, rather than waiting for refund requests.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

About the Guest(s)
Magnus Dahlbäck is Senior Director of Data and Platform at Voi, a leading European micro-mobility company, where he oversees the data analytics team, platform infrastructure, and AI initiatives. With over four years at Voi, Magnus has scaled the data organization from three people to a comprehensive team of platform engineers, data analysts, and data scientists while architecting a modern data stack centered on metrics-first analytics and semantic layers. In this episode, Magnus shares insights on building scalable data platforms for IoT-heavy, real-world products, including strategies for managing billions of daily events, implementing self-service analytics, and balancing traditional machine learning with large language models. His work at Voi, where the data platform powers both internal analytics and customer-facing product features, demonstrates how thoughtful data architecture drives measurable business impact, making this conversation essential for data leaders navigating AI integration and data democratization.

Quotes
"There are hundreds of dashboards, and I'm looking for some data, some metrics, and there are 10 dashboards that contain that, and they all show different numbers." - Magnus
"Metrics is a very natural way of interacting with data rather than dashboards that are named something randomly." - Magnus
"We're basically throwing man hours on slicing and dicing data, trying to find patterns, anomalies that we often miss, right, because it just takes too much time." - Magnus
"The way we work with data hasn't really changed that much in the last ten, twenty years to be completely fair, but now we're seeing new technologies, new approaches to it." - Magnus
"It comes down to the use case. What's the accuracy we need?" - Magnus
"We can see from the sensor data, from the IoT, from other data points during your ride if it was a good or bad experience, so why don't we reach out to you?" - Magnus
"Building software around physical objects is really cool when you're a techie guy like me, working at a company where it's a combination of software, B to C, hardware, IoT." - Magnus
"The biggest dataset that we process is IoT data—billions of events every day, basically, that we process." - Magnus
"We have cross functional teams where all the product teams have everything from back end to front end to data people, designers, and so on." - Magnus
"Metrics is kind of the business language that we use—we talk about rides, average ride charge, active vehicles—so metrics is a very natural way of interacting with data." - Magnus

Resources
Connect on LinkedIn:
Magnus Dahlbäck - https://www.linkedin.com/in/magnusdahlback/
Benjamin Wagner - https://www.linkedin.com/in/wagjamin/
Websites:
Voi – voi.com
Firebolt – firebolt.io
Tools & Platforms:
Snowflake – Data warehouse for analytics and machine learning workloads
dbt (Data Build Tool) – Data transformation and modeling
Apache Airflow – Workflow orchestration
Steep – Metrics-first BI tool with a semantic layer (Swedish startup)
GCP Vertex AI – Machine learning platform for model training and deployment
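The 5-minute micro-batch pattern mentioned in this episode boils down to bucketing events by the window their timestamp falls in, then processing each bucket as one batch. A minimal sketch with invented event data, not Voi's actual pipeline machinery:

```python
# Hedged sketch of micro-batch windowing: group events into fixed
# 5-minute windows keyed by window start time. Event data is invented
# for illustration; a real pipeline would read from a stream and
# trigger processing as each window closes.

from collections import defaultdict

WINDOW = 5 * 60  # window length in seconds

def window_start(ts):
    """Round a UNIX timestamp down to the start of its 5-minute window."""
    return ts - ts % WINDOW

def micro_batches(events):
    """events: iterable of (ts_seconds, payload) -> {window_start: [payloads]}"""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[window_start(ts)].append(payload)
    return dict(batches)

events = [(0, "ride_start"), (299, "battery_ping"),
          (300, "ride_end"), (601, "gps_fix")]
print(micro_batches(events))
# {0: ['ride_start', 'battery_ping'], 300: ['ride_end'], 600: ['gps_fix']}
```

Keeping the batch boundary a pure function of the timestamp is what lets billions of events per day be partitioned deterministically across workers.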

Feb 3, 2026 • 29min
Why 99% of Data Teams Give Up on Real-Time And How Artie Changes That
In this episode of The Data Engineering Show, Benjamin sits down with Artie CTO and co-founder Robin Tang to explore the complexities of high-performance data movement. Robin shares his journey from building Maxwell at Zendesk to scaling data systems at Opendoor, highlighting the gap between business-oriented SaaS connectors and the rigorous demands of production database replication.

Robin dives deep into Artie's architecture, explaining how they leverage a split-plane model (control plane and data plane) to provide a "Bring Your Own Cloud" (BYOC) experience that engineering teams actually trust. You'll hear about the technical nuances of CDC, from handling Postgres TOAST columns to the "economy of scale" challenges of processing billions of rows for Substack, Artie's first customer. Whether you're struggling with real-time ingestion costs or curious about the future of platform-agnostic partitioning, this conversation provides a masterclass in modern data movement.

What You'll Learn:
- Why the data movement market is bifurcating: managed vendors like Fivetran excel at SaaS integrations (hundreds of connectors), while specialized vendors like Artie focus on production databases at high volume, a fundamentally different job requiring expertise in failure recovery, observability, and advanced use cases.
- How to design CDC architecture that doesn't break production databases: use online backfill strategies (the DBLog framework) instead of long-running transactions that hold write locks, and implement table-level parallelism so a single table error doesn't halt the entire pipeline.
- The split-plane architecture pattern for flexible deployment models: build control-plane and data-plane separation from day one, allowing customers to choose between fully managed cloud deployments or bring-your-own-cloud (BYOC) without compromising UX or architecture.
- Why database-specific expertise matters more than breadth: SQL Server CDC requires reverse-engineering undocumented code; Postgres has TOAST columns; MongoDB allows invalid timestamp values. Each data source has hidden complexity that justifies deep specialization over connector sprawl.
- How to build trust with early-stage customers on mission-critical workloads: walk prospects through architecture and failure modes before implementation, encourage them to stress-test with real data volumes, and establish deep engineering partnerships where both teams debug problems together (not sales-driven relationships).
- The platform-specific optimization trap and how to solve it: instead of requiring customers to understand the nuances of BigQuery time partitioning versus Snowflake's lack thereof, build platform-agnostic features (like soft partitioning) that work consistently across destinations while handling platform-specific optimizations under the hood.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

About the Guest(s)
Robin is the CTO and co-founder of Artie, a data movement platform built for high-volume, low-latency production database replication. With over a decade of experience building large-scale data systems, including early work on Maxwell (an open-source CDC framework at Zendesk) and database architecture at venture-backed startups, Robin identified a critical gap: existing tools optimize for SaaS integrations, not production databases at scale. In this episode, Robin shares hard-won lessons from building mission-critical infrastructure, including architectural innovations that prevent data loss and failure modes that only surface under real-world production load. His work at Artie has powered reliable data replication for companies like Substack, making this conversation essential for engineering teams building or evaluating real-time data movement solutions.

Quotes
"Artie helps companies make data streaming accessible." - Robin
"I didn't want to make any sort of compromises and it just turned out to be a really hard problem, so then we started a company around this." - Robin
"The complexity is not just at the destination level, the complexity is also at the source level." - Robin
"Every pipeline that we touch is mission critical for customers, or else they would just use either their existing pipeline or a managed vendor that's out there." - Robin
"We handle the whole thing, whereas other vendors more or less provide a component and expect engineers to either build or attach additional pieces." - Robin
"I think the biggest bottleneck for real time right now is accessibility. When people think about real time, they immediately think it's not worth it because they implicitly have a cost associated with it." - Robin
"We use Kafka transactions, so we do not commit offsets until the destination tells us the data has actually been flushed." - Robin
"There's so much nuance with every single data source that it becomes a whack-a-mole problem." - Robin
"When there's sufficient pain on the other side and they buy into your vision, it's easier to overcome obstacles during technical implementation." - Robin
"We're spending more time developing platform-agnostic solutions so customers don't have to understand platform nuances." - Robin

Resources
Connect on LinkedIn:
Robin Tang - https://www.linkedin.com/in/tang8330/
Benjamin Wagner - https://www.linkedin.com/in/wagjamin/
Websites:
Artie - https://www.artie.com/
Fivetran - https://www.fivetran.com
Estuary - https://www.estuary.dev
Airbyte - https://airbyte.com
Debezium - https://debezium.io
Tools & Platforms:
Maxwell – Open-source CDC framework that reads the MySQL binlog into Kafka
Kafka – Distributed event streaming platform for data movement
WarpStream – Cost-optimized Kafka alternative using object storage
Strimzi – Kubernetes-native Kafka deployment tool
Apache Iceberg – Open table format for data lakehouse architecture
Delta Live Tables – Databricks' data movement and transformation tool
ClickPipes – ClickHouse's native data ingestion platform
Snowpipe Streaming – Snowflake's real-time data ingestion service
Google Datastream – Google Cloud's CDC and data movement service
AWS MSK Tiered Storage – Amazon managed Kafka with tiered storage capabilities
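Robin's point about committing offsets only after the destination confirms the flush can be modeled with a toy source and sink. This is a hedged sketch of the ordering guarantee, not Artie's implementation (which uses Kafka transactions); `Source`, `run_once`, and the crash flag are invented for illustration.

```python
# Hedged sketch: flush to the destination BEFORE committing the consumer
# offset. If the process dies between flush and commit, the batch is
# re-delivered on restart, so duplicates are possible but data loss is
# not. All class and function names are invented for this example.

class Source:
    def __init__(self, records):
        self.records = records
        self.committed = 0              # durable consumer offset

    def poll(self, n):
        return self.records[self.committed:self.committed + n]

    def commit(self, n):
        self.committed += n

def run_once(source, sink, batch, crash_before_commit=False):
    rows = source.poll(batch)
    if not rows:
        return
    sink.extend(rows)                   # 1) flush to the destination first
    if crash_before_commit:
        return                          # simulated crash: offset not advanced
    source.commit(len(rows))            # 2) only now commit the offset

src, dst = Source([1, 2, 3, 4]), []
run_once(src, dst, 2, crash_before_commit=True)  # flushed, never committed
run_once(src, dst, 2)                            # batch [1, 2] re-delivered
print(dst)  # [1, 2, 1, 2] -> duplicates after the crash, but nothing lost
```

Committing in the opposite order (offset first, flush second) would turn the same crash into silent data loss, which is why the flush-then-commit ordering is the non-negotiable part.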

Dec 16, 2025 • 26min
The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft
In this episode of the Data Engineering Show, host Benjamin Wagner sits down with Ritesh Varyani, Staff Software Engineer at Lyft, to explore how the company manages a sophisticated multi-engine data stack serving thousands of engineers while simultaneously integrating AI across infrastructure and user-facing analytics.

What You'll Learn:
- How to architect a polyglot data platform that serves fundamentally different workloads (Spark for ML training and massive parallel processing, Trino for dashboarding and medium-scale ETL, and ClickHouse for sub-second OLAP queries) without creating operational chaos.
- Why unification matters more than expansion: Lyft's 2026 strategy prioritizes consolidating and simplifying the data stack rather than adding new tools, reducing maintenance burden and improving reliability for end users.
- The dual-layer AI strategy that simultaneously enhances user analytics (semantic layer v2 with AI-native support) while automating platform operations (intelligent job-failure diagnosis, adaptive resource allocation, and agentic workflow optimization).
- How to fund innovation from the bottom up: Lyft's model encourages individual engineers to experiment with AI on their own time, prove business value through POCs, and secure leadership buy-in through demonstrated alignment with company strategy.
- Why vendor selection now includes AI explainability and debuggability as standard RFP requirements, even when AI isn't the primary driver of a purchasing decision.
- The framework for deciding between open-source investment and managed services: prioritize business-critical goals first, then determine whether in-house ownership or vendor solutions accelerate that mission. AI becomes the accelerant, not the decision driver.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

About the Guest(s)
Ritesh is a Staff Software Engineer at Lyft, bringing six years of experience architecting and scaling the company's data platform. With a background spanning Microsoft's data and cloud infrastructure, including work on Hadoop, Azure, and SaaS products, Ritesh leads Lyft's critical data systems including Trino, Spark, and ClickHouse. In this episode, Ritesh shares insights on building scalable, AI-native data platforms that serve diverse organizational needs, from batch processing and analytics to real-time marketplace operations. His strategic approach to unifying complex data stacks while integrating AI-driven reliability and user experience improvements provides actionable guidance for data engineers and platform leaders navigating infrastructure modernization at scale.

Quotes
"The goal of our platform is to give our users access to the data as fast as possible so that they can drive the meaning from the data that they are getting and take better data driven decisions." - Ritesh
"We are a Hive format shop. We are going to be moving to other open table formats in the future, but at this point, we are a Hive table format." - Ritesh
"Our main goal at this point is primarily understanding how we see the data platform running five years from now, three years from now, and how we are able to future proof it." - Ritesh
"In this world of AI, we should not be falling behind in any way, and bringing AI in the right places within our platform." - Ritesh
"We want to make our semantic layer ready for the AI native side of things so that our teams are able to drive the best meaning possible from the data that they see." - Ritesh
"Big data systems are distributed systems by nature, and where AI can help you is very clearly understand how the patterns are changing and what is a good action to take." - Ritesh
"Rather than thinking of this as an AI versus an open source thing, it's about a question of what work is the most business critical and how do you go 100% behind it." - Ritesh
"Not everybody is working on AI initiatives at this point, but where it makes sense according to our business strategy, if it aligns with it, then obviously we go and invest." - Ritesh
"If you are the one who's going to take on the initiative, probably spend a few hours outside of what you're already working on, and that is how you will discover AI and the tooling for it." - Ritesh
"We are trying to consolidate into a single direction of providing different kinds of models so that you are easily able to integrate and focus on the value you want to provide to your customers." - Ritesh

Resources
Connect on LinkedIn:
Ritesh Varyani - https://www.linkedin.com/in/riteshvaryani/
Benjamin Wagner - https://www.linkedin.com/in/wagjamin/
Eldad Farkash - https://www.linkedin.com/in/eldadfarkash/
Websites:
Lyft - https://www.lyft.com
Tools & Platforms:
Apache Spark – Batch processing engine for ML training jobs, large-scale data processing, and GDPR operations
Trino – Query engine for BI dashboarding, ETL workflows, and SQL-based data access
ClickHouse – Columnar database for sub-second query latency and real-time analytics
Amazon S3 – Data lake storage for Parquet tables and offline data processing
AWS EKS (Elastic Kubernetes Service) – Kubernetes infrastructure for hosting Spark and Trino
ClickHouse Cloud – Managed ClickHouse offering used by Lyft
Hive Table Format – Current table format for organizing Parquet files in S3
Kubernetes Operators – Infrastructure for managing ClickHouse deployments

Nov 19, 2025 • 20min
60 Billion Predictions Daily: Inside Credit Karma’s Agentic Data Layer with Maddie Daianu
Maddie Daianu, Head of Data and AI at Intuit Credit Karma, brings a wealth of experience from academia to the finance tech forefront. She dives into the monumental task of managing 80 billion daily predictions and the strategic shift to an 'Agentic Data Layer' for proactive financial management. Maddie shares insights on utilizing Google Cloud for real-time processing, the importance of the Unified Consumer Profile for personalized experiences, and how her team deploys 22,000 models each month, revolutionizing user interaction in finance.

Oct 7, 2025 • 20min
Block Bad Data Before the Write with Nike’s Ashok Singamaneni
Ashok Singamaneni, Principal Data Engineer at Nike and creator of Spark Expectations and BrickFlow, discusses preventing bad data writes and improving pipeline reliability. He explains treating ingestion and transformation like a software product. Topics include rule types for checks, running validations before final writes, decorator-based integration to avoid double scans, performance trade-offs, and cautious use of generative AI tools.
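The pre-write validation flow Ashok describes, where rule checks run on the transformed data and only passing rows reach the final write, can be sketched as a decorator. This is an illustrative sketch only, not the actual Spark Expectations API: the rule set and helper names are hypothetical, and plain dicts stand in for Spark DataFrames.

```python
# Sketch of decorator-based, pre-write data-quality checks: validate rows
# before the final write so bad data never lands in the target table.
# Illustrative only; not the real Spark Expectations API.
from functools import wraps

# Hypothetical rules: column name -> predicate each row value must pass.
RULES = {
    "user_id": lambda v: v is not None,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def expect(rules):
    """Wrap a transformation; split its output into passing and failing rows."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(rows):
            good, bad = [], []
            for row in fn(rows):
                ok = all(pred(row.get(col)) for col, pred in rules.items())
                (good if ok else bad).append(row)
            return good, bad  # only `good` would proceed to the final write
        return wrapper
    return decorator

@expect(RULES)
def transform(rows):
    # Example transformation: default missing amounts to 0.
    return [{**r, "amount": r.get("amount", 0)} for r in rows]

good, bad = transform([
    {"user_id": 1, "amount": 10.0},
    {"user_id": None, "amount": 5.0},  # fails the user_id rule: quarantined
])
```

Because the checks run on the already-transformed rows inside the same call, the data is scanned once, which is the double-scan avoidance the episode touches on.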

Sep 17, 2025 • 22min
Postgres vs. Elasticsearch: The Unexpected Winner in High-Stakes Search for Instacart with Ankit Mittal
Ankit Mittal, former Senior Engineer at Instacart and now at ParadeDB, shares his journey of enhancing search infrastructure by transitioning from Elasticsearch to PostgreSQL. He discusses the challenges of managing fast-moving grocery inventory and how consolidating search functions into one PostgreSQL cluster optimized performance. Ankit highlights the benefits of using PostgreSQL extensions for complex queries and the trade-offs between search systems, emphasizing improved efficiency and reduced latency in data retrieval.
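The consolidation Ankit describes, serving keyword search from the same PostgreSQL cluster that holds the catalog data instead of a separate Elasticsearch cluster, can lean on Postgres's built-in full-text search. A minimal sketch follows; the `products` table, its columns, and the index are hypothetical, while the functions used (`to_tsvector`, `plainto_tsquery`, `ts_rank`, the `@@` operator) are standard PostgreSQL.

```python
# Sketch of building a full-text product search query against PostgreSQL
# itself. Table and column names are hypothetical.

def product_search_sql(term: str) -> tuple:
    """Build a parameterized full-text query for a hypothetical products table.

    Assumes a GIN index such as:
      CREATE INDEX products_fts ON products
        USING GIN (to_tsvector('english', name || ' ' || description));
    """
    sql = """
        SELECT id, name,
               ts_rank(to_tsvector('english', name || ' ' || description),
                       plainto_tsquery('english', %s)) AS rank
        FROM products
        WHERE to_tsvector('english', name || ' ' || description)
              @@ plainto_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT 20
    """
    return sql, (term, term)

sql, params = product_search_sql("organic bananas")
```

One cluster then serves both transactional reads and search, the operational and latency win discussed in the episode; the trade-off is that Postgres's analyzers and ranking options are less rich than Elasticsearch's.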

Aug 28, 2025 • 21min
Is Self-Service BI a False Promise? Lei Tang of Fabi.ai Thinks So
Explore the future of AI-powered business intelligence with Lei Tang, CTO and Co-founder of Fabi.ai, as he discusses the evolution from traditional self-service BI to "Vibe-analytics." Learn how AI is transforming data accessibility, enabling anyone to perform sophisticated analytics without deep technical expertise. From building trust in AI-generated insights to creating intelligent semantic layers, discover how modern BI platforms are bridging the gap between data teams and business stakeholders. Tune in to understand why static dashboards are becoming obsolete and how AI agents will soon proactively surface business opportunities and insights.

Key points:
The limitations of traditional self-service BI and how AI is addressing them
Building secure, context-aware AI systems for data analysis
The future of human-AI interaction in business intelligence
Technical insights into modern BI platform architecture
Vision for proactive, AI-driven business insights

What You'll Learn:
Why traditional self-service BI has failed to deliver on its promises and how AI can bridge the gap
How to build an AI-native BI platform that combines SQL, Python, and natural language processing
The framework for implementing "Vibe-analytics" - a new paradigm of AI-powered visual analytics
Why context engineering and semantic understanding are crucial for accurate AI-driven analysis
How to balance security and accessibility when deploying AI-powered analytics tools
The future of BI platforms as proactive insight generators rather than passive dashboards
Why caching and stateful environments are essential for responsive AI-powered analytics
How to leverage AI to translate business questions into accurate technical queries while maintaining data integrity

About the Guest(s)
Lei is the Co-founder and CTO of Fabi.ai, where he leads the development of AI-native business intelligence solutions. With a PhD in machine learning and over a decade of experience in the data domain, Lei has held significant roles, including positions at Yahoo, Walmart, Lyft (as Director of Data Science), and Clari (as Chief Data Scientist). His expertise spans machine learning, data engineering, and business analytics, with a particular focus on making data analysis more accessible and efficient. In this episode, Lei shares insights on the evolution of self-service BI and how AI is transforming business intelligence, drawing from his experience building Fabi.ai, a platform that combines SQL, Python, and AI to democratize data analysis. His work in developing "Vibe AI" (AI-powered BI) represents a significant advancement in making complex data analysis accessible to non-technical users while maintaining data accuracy and trust.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Quotes
"For the past decade, it's really difficult to make sure the self-service BI can work. And then now with AI, the worst part is that it can run properly, but the numbers are wrong." - Lei
"If you talk to anybody working in the BI space, like self-service BI, that has been termed for maybe for the past decade. But I have to say that is a false promise." - Lei
"We're saying that we really want those data team to be able to, like, say, what type of data is exposed to, like, say, less technical folks." - Lei
"In order to build AI native BI, I would say the focus should be how human interact with AI." - Lei
"We believe that, essentially, this BI system or, like, AI BI system would be more like a agent, and then it'll actually looking for, like, business opportunities and insight and surface to you." - Lei
"The one common theme I have been experiencing is that normally would work with other business stakeholders, could be marketing, could be operations, could be sales." - Lei
"We strongly believe that BI should be stored as code." - Lei
"Enterprise data tends to be very noisy, very complex." - Lei
"The semantics of itself becomes part of the context for the AI engine." - Lei
"Most organizations, the data, like the schema, the kind of business, like metrics and logic, has been constantly evolving." - Lei

Resources
Fabi.ai - AI-native BI platform
Firebolt (firebolt.io) - Cloud data warehouse platform
Tools & Technologies:
Firebolt Core - Free self-hosted query engine
Looker - BI platform
Tableau - BI platform
Sisense - BI platform
Snowflake - Data warehouse
BigQuery - Data warehouse
PostgreSQL - Database
SQLAlchemy - Database toolkit
Pandas - Data analysis library
For Feedback & Discussions on Firebolt Core:
Join Firebolt Discord Community
Join Firebolt GitHub Discussions
Firebolt Core GitHub Repository
Benjamin@Firebolt.io
Primary Speakers:
Lei Tang
Benjamin Wagner

Jul 22, 2025 • 26min
Building Uber's AI Assistant: How Genie Revolutionizes On-Call Support with Paarth Chothani from Uber
Journey inside Uber's innovative AI assistant "Genie" with Paarth Chothani, Staff Engineer at Uber, as he shares how they're revolutionizing on-call support using LLMs and vector search. From processing massive amounts of internal documentation to building scalable RAG pipelines, discover how Uber tackles the challenges of implementing AI assistants at scale. Get insights into the evolution from traditional chatbots to agent-based solutions, and learn practical lessons about staying current in the rapidly evolving AI landscape. Whether you're building AI-powered tools or scaling data infrastructure, this episode offers valuable perspectives on balancing innovation with real-world implementation.

• Building and scaling RAG pipelines at enterprise scale
• Evolution from traditional chatbots to AI agents
• Practical insights on data processing and vector search implementation
• Leveraging open-source technologies in production environments
• Navigating rapid technological changes in AI development

What You'll Learn:
How Uber transformed its on-call support system by building an AI assistant that searches across internal documentation, wikis, and code
Why combining multiple data sources with vector databases creates more accurate and contextual responses for enterprise support
The evolution from basic RAG implementation to agent-based architecture for handling complex support scenarios
How to scale AI processing pipelines using Apache Spark for large-scale data chunking and embedding generation
Why customization and internal data sources are crucial for enterprise AI assistant effectiveness
The future of AI assistants: moving from documentation lookup to automated problem resolution through multi-agent systems
How to balance rapid AI innovation with setting realistic customer expectations in fast-moving tech environments

Paarth is a Staff Engineer at Uber, where he works on Michelangelo, Uber's machine learning platform. With over four years at Uber, he specializes in feature store development, online serving at scale, and GenAI implementations. He has been instrumental in developing Genie, an AI-powered on-call assistant that revolutionizes how Uber's engineering teams handle support requests and documentation access. In this episode, Paarth shares valuable insights on building and scaling RAG-based systems, vector search implementations, and the evolution of AI assistants from traditional chatbots to sophisticated agent-based solutions. His experience spanning both AWS chatbot development and current GenAI innovations at Uber offers listeners a unique perspective on the rapid advancement of AI-powered enterprise solutions.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Quotes
"Think of Genie as your on-call assistant. Different infra teams have their Slack channels, and because these technologies are widely used, you have to wait a lot." - Paarth
"What we realized is for our engineers to really get help, data sources really should be internal only because we customize a lot of these open source engines for making it work at Uber scale." - Paarth
"Instead of building a mega scale pipeline that just ingest all data sources and then keeps a central data source solution, we instead are giving users the flexibility to ingest what data sources they want." - Paarth
"We had to scale our, you can say, the whole infra layer to chunk data faster to be able to create embeddings at scale." - Paarth
"It almost felt like they're doing what EMR was doing. You have your Hadoop and big data technology, and we needed these pipelines to basically process all this data quickly." - Paarth
"We've even evolved from just giving you the right documentation to starting to evolve into a situation where we'll also start taking actions on your behalf."
- Paarth"That intuition that comes from building this kind of bot, I feel like that intuition came again as we were starting to see this technology come, and we're like, hey, this looks like where you can pretty much fit all these pieces together." - Paarth"What we have seen with several use cases is agentic genie works well when designed well, when you've analyzed the problem of which type of subproblems the bot should resolve per channel, per use case." - Paarth"I think having a problem in mind always helps that way, the energy is little bit focused and directed." - Paarth"Whatever you're building is not enough because the expectation has already gone to the next level, so the pace is too fast right now." - PaarthResourcesCompanies & Platforms:Uber - ML Platform & EngineeringFirebolt - Cloud Data Warehouse (firebolt.io)Tools & Technologies:Michelangelo - Uber's ML Platform Genie - Uber's On-Call Assistant BotCursor - Developer IDEOpenSearch - Vector DatabaseLangGraph - Agent FrameworkNotable Projects Mentioned:MetaMate (Meta)Query Copilot (Uber)Scale at AI (Meta Meetup)Company Blogs:Uber Engineering Blog - Genie and Query Optimization articles Primary Speakers:Paarth Chotani - Staff Engineer, UberBenjamin - FireboltEldad - FireboltFor Feedback & Discussions on Firebolt Core:Join Firebolt Discord CommunityJoin Firebolt GitHub DiscussionsFirebolt Core Github Repository Benjamin@Firebolt.ioThe Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.soPrevious guests include: Joseph Machado of Linkedin, Metthew Weingarten of Disney, Joe Reis and Matt Housely, authors of The Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.Check out our three most downloaded episodes:Zach Wilson on What Makes a 
Great Data EngineerJoe Reis and Matt Housley on The Fundamentals of Data EngineeringBill Inmon, The Godfather of Data Warehousing
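The chunk-then-embed ingestion step discussed for Genie can be sketched in miniature. At Uber's scale this map step runs over Apache Spark; the fixed-size character chunker and placeholder embedding function below are illustrative stand-ins, not Uber's implementation.

```python
# Miniature sketch of the chunk-then-embed step of a RAG ingestion pipeline.
# Real deployments distribute this over Spark and call an embedding model.

def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list:
    """Split a document into overlapping character windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk: str):
    """Placeholder: a real pipeline calls an embedding model here."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 1000)]

# Hypothetical runbook document to ingest.
doc = "Runbook: when the ingestion lag alarm fires, check consumer offsets. " * 5
index = [(chunk, embed(chunk)) for chunk in chunk_text(doc)]
```

Overlapping windows keep sentences that straddle a chunk boundary retrievable from at least one chunk; at scale, the same per-document map is simply parallelized across Spark executors.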


