AI Inside

AI Positive

Jan 24, 2024
Rich Skrenta of the Common Crawl Foundation joins to discuss AI copyright issues and the debate over restricting AI access to data for training. The episode also mentions a recent court ruling related to web scraping, a device announcement from CES (the Rabbit R1, with Perplexity AI integration), and a study on the actual risk of AI automating jobs away in the near future.
INSIGHT

Common Crawl Is The Web's Public Time Capsule

  • Common Crawl is a 17-year archive of the public web, now widely used as primary training data for LLMs.
  • Its dataset spans ~250B pages and ~10PB, making it a key internet time capsule and research resource.
INSIGHT

Fair Use Is Central To Training LLMs

  • Publishers are demanding payment and control for AI training and outputs, challenging existing notions of fair use.
  • Jeff Jarvis argues training models is transformative and cutting fair use would undermine copyright's purpose.
ADVICE

Crawl Politely And Respect Site Controls

  • Common Crawl respects robots.txt and avoids cookies, logins, and paywalls when crawling the public web.
  • Rely on published, open pages only and sample politely with per-site budgets to build archival indexes.
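
The advice above can be sketched in a few lines of Python. This is only an illustration, not Common Crawl's actual crawler: the `PoliteSampler` class, its method names, the `ExampleBot` user agent, and the budget value are all hypothetical. It uses the standard library's `urllib.robotparser` to check robots.txt rules and a simple counter to enforce a per-site page budget. The robots.txt text is passed in as a string so the sketch runs without network access.

```python
from collections import Counter
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteSampler:
    """Hypothetical helper: allow a URL only if robots.txt permits it
    and the site's page budget is not yet used up."""

    def __init__(self, user_agent, per_site_budget):
        self.user_agent = user_agent
        self.per_site_budget = per_site_budget
        self.robots = {}          # host -> parsed robots.txt rules
        self.fetched = Counter()  # host -> pages allowed so far

    def load_robots(self, host, robots_txt):
        # Parse robots.txt content that was fetched elsewhere.
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allow(self, url):
        host = urlsplit(url).netloc
        parser = self.robots.get(host)
        if parser is None:
            return False  # assumption: skip sites whose rules are unknown
        if not parser.can_fetch(self.user_agent, url):
            return False  # robots.txt disallows this path
        if self.fetched[host] >= self.per_site_budget:
            return False  # per-site budget exhausted
        self.fetched[host] += 1
        return True


sampler = PoliteSampler(user_agent="ExampleBot", per_site_budget=2)
sampler.load_robots("example.com", "User-agent: *\nDisallow: /private/\n")

decisions = [
    sampler.allow("https://example.com/a"),          # open page
    sampler.allow("https://example.com/private/x"),  # blocked by robots.txt
    sampler.allow("https://example.com/b"),          # open page, uses up the budget
    sampler.allow("https://example.com/c"),          # over budget
]
```

Skipping sites with no loaded rules is a conservative choice made for this sketch. A real crawler would also need to avoid sending cookies or credentials and to space out its requests, which are not shown here.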