The Data Engineering Show

Block Bad Data Before the Write with Nike’s Ashok Singamaneni

Oct 7, 2025
Ashok Singamaneni, Principal Data Engineer at Nike and creator of Spark Expectations and BrickFlow, discusses preventing bad data from being written and improving pipeline reliability. He explains why ingestion and transformation layers should be treated as a software product. Topics include rule types for checks, running validations before the final write, decorator-based integration that avoids double scans, performance trade-offs, and cautious use of generative AI tools.
INSIGHT

Treat Bronze/Silver Like A Software Product

  • Treat ingestion and transformation layers as a software product with tests and checks before final writes.
  • Ashok built Spark Expectations, inspired by Databricks DLT, to run DQ checks pre-write in Spark so bad data never lands in final tables.
ADVICE

Run Full Prewrite Checks With Defined Failure Actions

  • Run row-level, aggregation-level, and query-level DQ checks on the full dataframe before final write and choose actions: ignore, drop, or fail.
  • Route dropped records to error tables, and fail the job outright on mission-critical rule violations to avoid costly downstream recomputes.
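The check-then-act pattern above can be sketched in plain Python. This is a minimal illustration, not the Spark Expectations API: rows are dicts standing in for DataFrame rows, and the rule format (`name`, predicate, action) is hypothetical. Each rule carries one of the three failure actions described above: `ignore` lets the row land anyway, `drop` routes it to an error table, and `fail` aborts the job before the final write.

```python
class DQFailure(Exception):
    """Raised when a rule with action 'fail' is violated (job aborts pre-write)."""

def run_checks(rows, rules):
    """Apply row-level DQ rules before the final write.

    Each rule is (name, predicate, action), where action is
    'ignore', 'drop', or 'fail'. Returns (clean_rows, error_rows).
    """
    clean, errors = [], []
    for row in rows:
        violated = [(name, action) for name, pred, action in rules if not pred(row)]
        if any(action == "fail" for _, action in violated):
            # Mission-critical rule broken: stop the job before anything is written.
            raise DQFailure(f"critical rule violated: {violated}")
        if any(action == "drop" for _, action in violated):
            # Route the record to the error table, tagged with the failed rules.
            errors.append({**row, "_dq_rules": [name for name, _ in violated]})
        else:
            # 'ignore' violations would be logged; the row still lands.
            clean.append(row)
    return clean, errors

# Illustrative rules, one per action type.
rules = [
    ("non_null_id",    lambda r: r.get("id") is not None, "fail"),
    ("price_positive", lambda r: r.get("price", 0) > 0,   "drop"),
    ("has_category",   lambda r: bool(r.get("category")), "ignore"),
]
```

Aggregation- and query-level rules follow the same shape, except the predicate runs once over the whole dataset (e.g. a row count or a null ratio) rather than per row.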
ADVICE

Hook Checks Into Jobs With A Decorator

  • Integrate Spark Expectations using a Python decorator so checks run when your function returns a dataframe, avoiding separate scans.
  • Place checks at the final write step to limit overhead and optimize aggregation queries carefully.