The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Scaling TensorFlow at LinkedIn with Jonathan Hung - #314

Nov 4, 2019
Jonathan Hung, a Sr. Software Engineer at LinkedIn, shares insights on scaling TensorFlow within LinkedIn's infrastructure. He discusses leveraging existing Hadoop clusters for deep learning and introduces TonY, a framework that runs TensorFlow jobs natively on Hadoop. The conversation delves into the challenges of resource management and fault tolerance, particularly in GPU allocation. Hung also highlights LinkedIn's transition to Kubernetes to better support machine learning workloads and improve the experience for engineers navigating complex AI systems.
ANECDOTE

Challenges with TensorFlow on Spark

  • TensorFlow on Spark lacked fault tolerance and GPU support.
  • Spark's resource profiles didn't align with TensorFlow's varying job types.
INSIGHT

Motivation for TonY

  • LinkedIn chose to build TonY because of its existing Hadoop clusters and in-house expertise.
  • Its Hadoop ecosystem is mature, with thousands of nodes and petabytes of data.
ANECDOTE

Early Deep Learning Infrastructure Challenges

  • Early deep learning at LinkedIn was a "Wild West" with unmanaged hardware.
  • Engineers faced race conditions when competing for available machines and GPUs.