Description

Common Crawl is a non-profit organisation established in 2007 with the aim of providing an openly accessible archive of the World Wide Web. This massive collection of crawled web data began in 2008 and has grown substantially, becoming a crucial resource for researchers and developers, particularly in the field of artificial intelligence. Milestones include Amazon Web Services hosting the archive from 2012, the adoption of the Nutch crawler in 2013, and the pivotal use of its data to train influential large language models like GPT-3 starting around 2020. The organisation continues to collect billions of web pages, offering raw HTML, metadata, and extracted text in formats like WARC, WAT, and WET, thereby facilitating diverse analyses and the training of sophisticated AI systems.
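The WARC container format mentioned above (which also underlies WAT and WET files) stores each record as a version line, a block of headers, and a length-delimited payload. The sketch below is a simplified, self-contained parser run on a tiny synthetic WET-style record, purely for illustration; it is not a full WARC implementation, and real Common Crawl archives are gzip-compressed and better read with a dedicated library such as warcio (an assumption here, not stated in the text).

```python
def parse_warc_records(data: bytes):
    """Yield (headers, payload) pairs from concatenated WARC-style records.

    Simplified sketch: assumes uncompressed records with CRLF line endings,
    a version line, header lines, a blank line, then Content-Length bytes
    of payload followed by a blank-line separator.
    """
    records = []
    pos = 0
    while pos < len(data):
        # Headers end at the first blank line (CRLF CRLF).
        header_end = data.find(b"\r\n\r\n", pos)
        if header_end == -1:
            break
        header_block = data[pos:header_end].decode("utf-8")
        lines = header_block.split("\r\n")
        headers = {}
        for line in lines[1:]:  # skip the "WARC/1.0" version line
            key, _, value = line.partition(":")
            headers[key.strip()] = value.strip()
        length = int(headers["Content-Length"])
        payload = data[header_end + 4 : header_end + 4 + length]
        records.append((headers, payload))
        # Skip past the payload and the two CRLFs that end the record.
        pos = header_end + 4 + length + 4
    return records

# A tiny synthetic WET-style "conversion" record (extracted plain text).
example = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, crawl!"
    b"\r\n\r\n"
)

for headers, payload in parse_warc_records(example):
    print(headers["WARC-Type"], payload.decode())  # conversion Hello, crawl!
```

In real WET files the `conversion` records carry the extracted plain text of each page, while WARC files hold the raw HTTP responses and WAT files hold JSON metadata about them.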