Introduction to Common Crawl

DESCRIPTION
March 2013 slides about Common Crawl. The original presentation was given to the Working with Open Data class at the UC Berkeley School of Information: http://www.ischool.berkeley.edu/courses/i290t-wod

TRANSCRIPT
Introduction to Common Crawl
Dave Lester
March 21, 2013
Monday, April 1, 13
video intro: https://www.youtube.com/watch?v=ozX4GvUWDm4
What is Common Crawl?
• non-profit organization providing an open repository of web crawl data that anyone can access and analyze
• data is currently shared as a public dataset on Amazon S3
Why Open Data?
• It’s difficult to crawl the web at scale
• Provides a shared resource for researchers to compare results and recreate experiments
2012 Corpus Stats
• Total # of Web Documents: 3.8 billion
• Total Uncompressed Content Size: 100 TB+
• # of Domains: 61 million
• # of PDFs: 92.2 million
• # of Word Docs: 6.6 million
• # of Excel Docs: 1.3 million
Other Data Sources
• Blekko - “spam-free search engine”
• their metadata includes:
• rank on a linear scale, plus a 0-10 web rank
• true/false flag for whether Blekko’s webspam algorithm considers the domain or page spam
• true/false flag from Blekko’s pr0n detection algorithm
What is Crawled?
• Check out the new URL search tool: http://commoncrawl.org/url-search-tool/
• (try entering ischool.berkeley.edu)
• First five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!
How is Data Crawled?
• Customized crawler (it’s open source!)
• Includes some basic PageRank; lots of time is spent optimizing it and filtering spam
• See Apache Nutch as an alternative web-scale crawler
• Future datasets may include other crawl sources
Common Crawl Uses
Analyze References to Facebook
• Of ~1.3 billion URLs:
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
• Among ~500 million hardcoded links to Facebook, only 3.5 million are unique
• These are primarily simple social integrations
References to FB Pages
• /merriamwebster 676071 (0.14%)
• /kevjumba 651389 (0.14%)
• /placeformusic 618963 (0.13%)
• /lyricskeeper 517999 (0.11%)
• /kayak 465179 (0.10%)
• /twitter 281882 (0.06%)
Analyze JavaScript Libraries on the Web
1. jQuery (82.64%)
2. Prototype (6.06%)
3. Mootools (4.83%)
4. Ext (3.47%)
5. YUI (1.78%)
6. Modernizr (0.59%)
7. Dojo (0.21%)
8. Ember (0.14%)
9. Underscore (0.11%)
10. Backbone (0.09%)
Library Co-occurrence
Web Data Commons
• sub-corpus of Common Crawl data
• includes RDFa, hCalendar, hCard, Geo Microdata, hResume, XFN
• built using 2009/2010 corpus
Traitor: Associating Concepts
http://www.youtube.com/watch?v=c7Y149RnQjw
Associated Costs?
• Complete data set, ~$1300.00
• Facebook Link Analysis, $434.61
• Searchable Index of Data Set, $100
• “average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $.018, just under one tenth of the on-demand rate”
Give it a Try
ARC Files
• Files contain the full HTTP response and payload for all pages crawled.
• Format designed by the Internet Archive
• ARC files are a series of concatenated GZIP documents
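Because an ARC file is just a series of gzip members appended back to back, standard gzip tooling can decompress the whole file in one pass. A minimal Python sketch (the record contents are made up to stand in for real ARC records, which hold an ARC header line plus the HTTP response and payload):

```python
import gzip
import io

# Simulate the ARC layout: each record is independently gzipped,
# then the compressed members are concatenated back to back.
records = [b"record one\n", b"record two\n"]
arc_bytes = b"".join(gzip.compress(r) for r in records)

# Python's gzip module transparently reads concatenated gzip
# members as one continuous stream, so the entire "file"
# decompresses in a single pass.
with gzip.open(io.BytesIO(arc_bytes), "rb") as f:
    data = f.read()

print(data.decode())
```

The same property means `zcat` on a whole ARC file yields all records in order.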
Text-Only Files
• Saved as sequence files -- consisting of binary key/value pairs. (Used extensively in MapReduce as input/output formats)
• On average 20% the size of raw content
• located in the segment directories, with a file name of "textData-nnnnn". For example:
• s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
Metadata Files
• For each URL, metadata files contain status information, the HTTP response code, and file names and offsets of ARC files where the raw content can be found.
• Also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags).
• Records in the Metadata files are in the same order and have the same file numbers as the Text Only content
• Saved as sequence files
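The per-URL ARC file names and offsets in the metadata are useful because each record is an independent gzip member: a reader can seek straight to a stored offset and decompress only that one record. A hedged Python sketch with toy data (the offsets and record contents here are invented for illustration, not real Common Crawl values):

```python
import gzip
import zlib

# Build a toy "ARC file": two gzip members back to back, recording
# the byte offset where each member starts -- analogous to the
# offsets stored in the metadata files.
records = [b"<html>page one</html>", b"<html>page two</html>"]
members = [gzip.compress(r) for r in records]
offsets = [0, len(members[0])]
arc_bytes = b"".join(members)

def read_record(buf: bytes, offset: int) -> bytes:
    """Decompress the single gzip member starting at `offset`.

    zlib with wbits=16+MAX_WBITS decodes the gzip wrapper and stops
    at the end of the first stream, leaving later members untouched.
    """
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    return d.decompress(buf[offset:])

print(read_record(arc_bytes, offsets[1]).decode())
```

Against the real dataset, the same seek-and-decompress step can be done with an HTTP Range request into S3, avoiding a full ARC file download.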
Browsing Data
• You can use s3cmd on your local machine
• Install using pip, ‘pip install s3cmd’
• Configure, ‘s3cmd --configure’
• Requires AWS keys
• Demo: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
Common Crawl AMI
• Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce
• Amazon AMI ID: "ami-07339a6e"
Running Example MR Jobs Using the AMI
• ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )
• bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/
• look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
Code Samples to Try
• http://github.com/commoncrawl/
• Pete Warden’s Ruby example
• http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html
Helpful Resources
• Developer Documentation:
• https://commoncrawl.atlassian.net/
• Developer Discussion List:
• https://groups.google.com/group/common-crawl
Questions?
• @davelester
• www.davelester.org