
The Joe Reis Show
AI Agents Can't Fix Data - Josh Wills on Where AI Breaks in Data Engineering
May 7, 2026
Josh Wills is a 25-year data engineering veteran now at Datology AI who helped build tools like dbt and DuckDB. He talks about why AI agents misdiagnose messy pipelines, the rise of petabyte-scale multimodal datasets, fragile $200K vibe-coded pipelines with no training data, the enduring role of classical ML, and why managing unreliable agents is now part of the job.
Match Engine To Workload Not Hype
- Use Ray for map-only workloads and CPU/GPU orchestration, and keep Spark for shuffle-heavy jobs; consider separating Spark's shuffle service into a standalone component.
- Josh recommends Ray for non-shuffle workloads and points to Celeborn as a project that provides a standalone shuffle service.
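To make the map-only versus shuffle distinction concrete, here is a minimal sketch in plain Python. It uses neither Ray nor Spark APIs; the sample records and the helper names `map_only` and `shuffle_by_key` are hypothetical and exist only for illustration. The point is that a map-only step never moves data between partitions, while a shuffle has to route every record to whichever worker owns its key.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]


def map_only(partition):
    """Transform each record independently; nothing leaves its partition."""
    return [(key, value * 10) for key, value in partition]


def shuffle_by_key(parts, num_reducers):
    """Route every record to the reducer that owns its key (all-to-all movement)."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for part in parts:
        for key, value in part:
            # Summing the key's bytes gives a routing choice that is stable across runs.
            reducer = sum(key.encode()) % num_reducers
            buckets[reducer][key].append(value)
    return buckets


# Map-only: each partition could be handled by a separate worker with no coordination.
mapped = [map_only(p) for p in partitions]

# Shuffle: records with the same key must meet on one reducer before aggregating.
shuffled = shuffle_by_key(partitions, num_reducers=2)
totals = {key: sum(values) for bucket in shuffled for key, values in bucket.items()}
```

The first pattern is the kind of work the episode assigns to Ray; the second is where a dedicated shuffle implementation earns its keep.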
Build Domain Models When You Own Unique Data
- If your company owns unique domain data, build a private model; it'll often outperform third‑party models and cost less long term.
- A custom model is a one-time cost of roughly $20M, versus paying a large provider's token bills indefinitely.
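The trade-off above comes down to a break-even calculation. The sketch below works it through under loud assumptions: the ~$20M figure is from the episode, but the monthly token bill is a made-up number, and the calculation ignores the custom model's own serving and maintenance costs. The function name `break_even_months` is hypothetical.

```python
import math


def break_even_months(one_time_cost, monthly_token_bill):
    """Months of provider spend needed to equal the one-time build cost."""
    if monthly_token_bill <= 0:
        raise ValueError("monthly_token_bill must be positive")
    return math.ceil(one_time_cost / monthly_token_bill)


CUSTOM_MODEL_COST = 20_000_000   # the ~$20M figure quoted in the episode
ASSUMED_MONTHLY_BILL = 500_000   # hypothetical; not a number from the episode

months = break_even_months(CUSTOM_MODEL_COST, ASSUMED_MONTHLY_BILL)
print(months)  # 40
```

With a smaller monthly bill the break-even point moves out proportionally, so the argument only holds for companies already spending heavily on tokens.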
Automate The ML Research Loop
- Automate boring ML research loops with agents to free engineers for higher‑value design work.
- Datology deploys agents to tune hyperparameters and run experiments, turning repetitive ML tasks into engineering problems.
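The episode does not describe how Datology's agents work, so the following is only a sketch of the kind of repetitive loop that can be automated: propose a configuration, run an experiment, record the result, keep the best. Plain random search stands in for whatever an agent would actually do, and `run_experiment` is a hypothetical placeholder that returns a synthetic loss instead of training a model.

```python
import random


def run_experiment(learning_rate, batch_size):
    """Stand-in for a real training run; returns a synthetic validation loss."""
    return (learning_rate - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64


def random_search(num_trials, seed=0):
    """Sample configurations at random and track the lowest loss seen."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_trials):
        config = {
            # Sample the learning rate on a log scale between 1e-4 and 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        loss = run_experiment(**config)
        history.append((loss, config))
    # Compare on loss only so ties never fall through to comparing dicts.
    best_loss, best_config = min(history, key=lambda item: item[0])
    return best_loss, best_config, history


best_loss, best_config, history = random_search(num_trials=20)
```

Once the loop is expressed this way, swapping the sampler for an agent that reads `history` and proposes the next configuration becomes an engineering change, not a research task.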




