AI Positive

15 snips

Jan 24, 2024

A discussion of AI copyright issues with Rich Skrenta of the Common Crawl Foundation, covering the debate over restricting AI access to data for training and a recent court ruling related to web scraping. Also covered: the Rabbit R1, a device announced at CES with Perplexity AI integration, and a study on the actual risk of AI automating jobs away in the near future.

INSIGHT

Common Crawl Is The Web's Public Time Capsule

Common Crawl is a 17-year archive of the public web, now widely used as a primary source of training data for LLMs.
Its dataset spans roughly 250 billion pages and about 10 PB, making it a key internet time capsule and research resource.

INSIGHT

Fair Use Is Central To Training LLMs

Publishers are demanding payment and control for AI training and outputs, challenging existing notions of fair use.
Jeff Jarvis argues training models is transformative and cutting fair use would undermine copyright's purpose.

ADVICE

Crawl Politely And Respect Site Controls

When crawling the public web, Common Crawl respects robots.txt and avoids cookies, logins, and paywalls.
Rely on published, open pages only and sample politely with per-site budgets to build archival indexes.
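
The advice above can be illustrated with a minimal sketch. This is not Common Crawl's actual crawler: the user agent name, the budget value, and the `select_polite` helper are all hypothetical, and the robots.txt text is passed in by the caller rather than fetched. It only shows how a robots.txt check and a per-site budget might be combined using Python's standard `urllib.robotparser`.

```python
from collections import Counter
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleArchiveBot"  # hypothetical crawler name
PER_SITE_BUDGET = 2               # hypothetical cap on pages sampled per host


def select_polite(urls, robots_by_host, budget=PER_SITE_BUDGET):
    """Return the URLs allowed by robots.txt, capped at `budget` per host.

    `robots_by_host` maps a host name to the text of its robots.txt.
    A host with no entry is treated as having an empty robots.txt.
    """
    parsers = {}         # one parsed robots.txt per host
    sampled = Counter()  # how many URLs have been selected per host
    selected = []
    for url in urls:
        host = urlsplit(url).netloc
        if host not in parsers:
            parser = RobotFileParser()
            parser.parse(robots_by_host.get(host, "").splitlines())
            parsers[host] = parser
        if not parsers[host].can_fetch(USER_AGENT, url):
            continue  # the site asked crawlers to stay out of this path
        if sampled[host] >= budget:
            continue  # per-site budget exhausted
        sampled[host] += 1
        selected.append(url)
    return selected
```

A real crawler would also fetch robots.txt itself, rate-limit its requests, and skip anything behind a login or paywall; those parts are left out of this sketch.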
