AI Agents Can't Fix Data - Josh Wills on Where AI Breaks in Data Engineering

May 7, 2026

Josh Wills is a 25-year data engineering veteran, now at Datology AI, who helped build tools like dbt and DuckDB. He talks about why AI agents misdiagnose messy pipelines, the rise of petabyte-scale multimodal datasets, fragile $200K vibe-coded pipelines with no training data, the enduring role of classical ML, and why managing unreliable agents is now part of the job.

ADVICE

Match Engine To Workload Not Hype

Use Ray for map-only workloads and CPU/GPU orchestration, and keep Spark for shuffle-heavy jobs; consider running Spark's shuffle service as a standalone component.
Josh recommends Ray for workloads that do not shuffle and points to Apache Celeborn as a project that provides shuffle as a separate service.
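
To make the map-only versus shuffle distinction concrete, here is a minimal pure-Python sketch. It does not use Ray, Spark, or Celeborn APIs; the data, function names, and the toy partitioning rule are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records split across two partitions (illustrative data only).
partitions = [
    [("cat", 1), ("dog", 2)],
    [("cat", 3), ("bird", 4)],
]


def map_only(parts, fn):
    """Apply fn to each record; every partition is processed independently."""
    return [[fn(rec) for rec in part] for part in parts]


def shuffle_by_key(parts, num_reducers):
    """Regroup records so that all values for a key land in one reducer.

    Any input partition may send data to any reducer. That all-to-all
    exchange is what makes shuffles expensive in a distributed engine.
    """
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for part in parts:
        for key, value in part:
            # Toy, stable partitioning rule (an assumption of this sketch).
            target = sum(key.encode()) % num_reducers
            reducers[target][key].append(value)
    return [dict(r) for r in reducers]


# Map-only: no data moves between partitions.
doubled = map_only(partitions, lambda kv: (kv[0], kv[1] * 2))

# Shuffle: values for "cat" start in two partitions and end up together.
grouped = shuffle_by_key(partitions, num_reducers=2)
```

In the map-only case each partition can be handed to a separate worker with no coordination, which is the kind of job the advice assigns to Ray. The grouping step needs data exchanged between workers, which is the kind of job it leaves with Spark.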

ADVICE

Build Domain Models When You Own Unique Data

If your company owns unique domain data, build a private model; it'll often outperform third‑party models and cost less long term.
A custom model is a one-time cost of roughly $20M, versus paying large providers' token bills indefinitely.
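
The "cost less long term" claim comes down to a break-even calculation, sketched below. Only the ~$20M figure comes from the episode summary; the monthly token bill and the hosting cost parameter are hypothetical placeholders, not numbers Josh gave.

```python
def breakeven_months(upfront_cost, monthly_token_bill, monthly_hosting_cost=0.0):
    """Months until a one-time model cost is repaid by avoided token bills.

    Returns None when owning the model never pays off, i.e. when hosting
    costs at least as much per month as the token bill it replaces.
    """
    monthly_savings = monthly_token_bill - monthly_hosting_cost
    if monthly_savings <= 0:
        return None
    return upfront_cost / monthly_savings


UPFRONT_COST = 20_000_000  # ~$20M, the figure quoted in the snip
HYPOTHETICAL_MONTHLY_BILL = 1_000_000  # assumption for illustration only

months = breakeven_months(UPFRONT_COST, HYPOTHETICAL_MONTHLY_BILL)
print(months)
```

Under these assumed inputs the custom model pays for itself in 20 months; a smaller token bill or a nonzero hosting cost pushes the break-even point further out.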

ADVICE

Automate The ML Research Loop

Automate boring ML research loops with agents to free engineers for higher‑value design work.
Datology deploys agents to tune hyperparameters and run experiments, turning repetitive ML tasks into engineering problems.
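
As a rough illustration of what "turning repetitive ML tasks into engineering problems" can look like, the sketch below automates a tiny hyperparameter sweep. It is not Datology's system: the objective is a toy stand-in for a training run, the grid values are hypothetical, and a plain exhaustive grid search takes the place of an agent proposing experiments.

```python
import itertools


def run_experiment(learning_rate, batch_size):
    """Stand-in for a real training run; returns a toy validation loss."""
    return (learning_rate - 0.01) ** 2 + (batch_size - 64) ** 2 * 1e-6


def search(grid):
    """Try every combination in the grid and keep the lowest-loss config."""
    best_config, best_loss = None, float("inf")
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        loss = run_experiment(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


# Hypothetical search space, chosen only for illustration.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}
best_config, best_loss = search(grid)
print(best_config, best_loss)
```

Once the loop is code, the repetitive part (launching runs, recording results, picking the next configuration) can be delegated, and the engineer's effort shifts to defining the search space and the evaluation.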
