
AI Inside AI Positive
Jan 24, 2024. A discussion of AI copyright issues with Rich Skrenta of the Common Crawl Foundation, covering the debate over restricting AI access to training data and a recent court ruling related to web scraping. Also covered: the Rabbit R1, a device announced at CES with Perplexity AI integration, and a study on the actual risk of AI automating jobs away in the near future.
AI Snips
Common Crawl Is The Web's Public Time Capsule
- Common Crawl is a 17-year archive of the public web, now widely used as primary training data for LLMs.
- Its dataset spans ~250B pages and ~10PB, making it a key internet time capsule and research resource.
Fair Use Is Central To Training LLMs
- Publishers are demanding payment and control for AI training and outputs, challenging existing notions of fair use.
- Jeff Jarvis argues training models is transformative and cutting fair use would undermine copyright's purpose.
Crawl Politely And Respect Site Controls
- Common Crawl respects robots.txt and avoids cookies, logins, and paywalls when crawling the public web.
- Rely only on published, open pages, and sample politely with per-site budgets to build archival indexes.
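
The two crawling rules above (honor robots.txt, cap how much is sampled per site) can be illustrated with a minimal sketch. This is not Common Crawl's actual crawler: the `PoliteSampler` class, the `ExampleArchiveBot` user agent, and the budget of two pages per host are all assumptions made for illustration. It uses only Python's standard `urllib.robotparser` and decides whether a URL may be fetched; it performs no network requests itself.

```python
from collections import defaultdict
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleArchiveBot"  # hypothetical crawler name


class PoliteSampler:
    """Decides whether a URL may be fetched (hypothetical helper)."""

    def __init__(self, per_site_budget=2):
        self.per_site_budget = per_site_budget  # max pages sampled per host
        self.fetched = defaultdict(int)         # host -> pages approved so far
        self.rules = {}                         # host -> parsed robots.txt

    def load_rules(self, host, robots_txt):
        # Parse robots.txt text that was already retrieved for this host.
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def should_fetch(self, url):
        host = urlparse(url).netloc
        parser = self.rules.get(host)
        if parser is None:
            return False  # conservative choice: no rules loaded, no fetch
        if not parser.can_fetch(USER_AGENT, url):
            return False  # the site's robots.txt disallows this path
        if self.fetched[host] >= self.per_site_budget:
            return False  # per-site budget is used up
        self.fetched[host] += 1
        return True


sampler = PoliteSampler(per_site_budget=2)
sampler.load_rules("example.com", "User-agent: *\nDisallow: /private/\n")

decisions = [
    sampler.should_fetch("https://example.com/a"),          # allowed
    sampler.should_fetch("https://example.com/private/x"),  # blocked by robots.txt
    sampler.should_fetch("https://example.com/b"),          # allowed, budget now full
    sampler.should_fetch("https://example.com/c"),          # blocked by budget
]
print(decisions)  # [True, False, True, False]
```

Refusing to fetch when no robots.txt has been loaded is a deliberately cautious design choice in this sketch; a real crawler would first retrieve the file and would also send requests without cookies or login credentials.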
