Super Data Science: ML & AI Podcast with Jon Krohn

580: Collecting Valuable Data

Jun 3, 2022
Dive into the world of machine learning as strategies for effective data collection take center stage. Discover how to identify commercial problems before starting your data journey. Learn the significance of labeled data and the various methods to obtain these crucial labels. The discussion also highlights challenges that may arise during the data collection process, setting the stage for future insights.
ADVICE

Labeled Data Value

  • Prioritize collecting labeled data over unlabeled data for commercial machine learning models.
  • Labeled data, such as images tagged as "dog" or "cat", lets you train models to make specific, valuable predictions.
ADVICE

Automated Labeling

  • Automate data labeling whenever possible, for example by scraping data that already comes with labels.
  • Inferring sentiment labels from the star ratings on IMDb reviews is one example of this kind of automation.
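As a rough illustration of the star-rating idea, here is a minimal Python sketch. The function names, the 1-10 rating scale, the thresholds (4 or below is negative, 7 or above is positive, anything in between is dropped as ambiguous), and the sample reviews are all assumptions made for this example; the episode does not specify them.

```python
from typing import List, Optional, Tuple


def label_from_rating(stars: int, neg_max: int = 4, pos_min: int = 7) -> Optional[str]:
    """Map a star rating to a sentiment label (hypothetical thresholds).

    Returns None for middling ratings, which are treated as too
    ambiguous to label automatically.
    """
    if stars <= neg_max:
        return "negative"
    if stars >= pos_min:
        return "positive"
    return None


def build_labeled_dataset(reviews: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Keep only reviews whose rating yields a clear label."""
    labeled = []
    for text, stars in reviews:
        label = label_from_rating(stars)
        if label is not None:
            labeled.append((text, label))
    return labeled


# Invented sample data, standing in for scraped (review text, star rating) pairs.
reviews = [
    ("A moving, beautifully shot film.", 9),
    ("Dull plot and wooden acting.", 2),
    ("Fine, but forgettable.", 5),
]

dataset = build_labeled_dataset(reviews)
for text, label in dataset:
    print(f"{label}: {text}")
```

Dropping the middling ratings trades dataset size for cleaner labels; whether that trade is worthwhile depends on how much data is available.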
ADVICE

Manual and Outsourced Labeling

  • If automated labeling isn't feasible, consider manual labeling or outsourcing to labeling services.
  • Although this is slower and more expensive than automated labeling, it creates unique, proprietary datasets.