Introduction to Common Crawl

DESCRIPTION
March 2013 slides about Common Crawl. The original presentation was given to the Working with Open Data class at the UC Berkeley School of Information: http://www.ischool.berkeley.edu/courses/i290t-wod

TRANSCRIPT
Introduction to Common Crawl
Dave Lester
March 21, 2013
Monday, April 1, 13
video intro: https://www.youtube.com/watch?v=ozX4GvUWDm4
What is Common Crawl?
• non-profit organization providing an open repository of web crawl data that anyone can access and analyze
• data is currently shared as a public dataset on Amazon S3
Why Open Data?
• It’s difficult to crawl the web at scale
• Provides a shared resource for researchers to compare results and recreate experiments
2012 Corpus Stats
• Total # of Web Documents: 3.8 billion
• Total Uncompressed Content Size: 100 TB+
• # of Domains: 61 million
• # of PDFs: 92.2 million
• # of Word Docs: 6.6 million
• # of Excel Docs: 1.3 million
Other Data Sources
• Blekko - “spam-free search engine”
• their metadata includes:
• rank on a linear scale, plus a 0-10 web rank
• true/false flag for whether Blekko’s webspam algorithm considers the domain or page spam
• true/false flag from Blekko’s pr0n detection algorithm
What is Crawled?
• Check out the new URL search tool: http://commoncrawl.org/url-search-tool/
• (try entering ischool.berkeley.edu)
• First five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!
How is Data Crawled?
• Customized crawler (it’s open source!)
• Includes some basic PageRank; lots of time is spent optimizing it and filtering spam
• See Apache Nutch as an alternative web-scale crawler
• Future datasets may include other crawl sources
Common Crawl Uses
Analyze References to Facebook
• Of ~1.3 billion URLs:
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
• Among ~500 million hardcoded links to Facebook, only 3.5 million are unique
• These are primarily simple social integrations
References to FB Pages
• /merriamwebster 676071 (0.14%)
• /kevjumba 651389 (0.14%)
• /placeformusic 618963 (0.13%)
• /lyricskeeper 517999 (0.11%)
• /kayak 465179 (0.10%)
• /twitter 281882 (0.06%)
Analyze JavaScript Libraries on the Web
1. jQuery (82.64%)
2. Prototype (6.06%)
3. Mootools (4.83%)
4. Ext (3.47%)
5. YUI (1.78%)
6. Modernizr (0.59%)
7. Dojo (0.21%)
8. Ember (0.14%)
9. Underscore (0.11%)
10. Backbone (0.09%)
Library Co-occurrence
Web Data Commons
• sub-corpus of Common Crawl data
• includes RDFa, hCalendar, hCard, Geo Microdata, hResume, XFN
• built using 2009/2010 corpus
Traitor: Associating Concepts
http://www.youtube.com/watch?v=c7Y149RnQjw
Associated Costs?
• Complete data set, ~$1300.00
• Facebook Link Analysis, $434.61
• Searchable Index of Data Set, $100
• “average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $.018, just under one tenth of the on-demand rate”
Give it a Try
ARC Files
• Files contain the full HTTP response and payload for all pages crawled.
• Format designed by the Internet Archive
• ARC files are a series of concatenated GZIP documents
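Because an ARC file is just a series of gzip members appended back to back, standard gzip tooling can decompress the whole file in one pass. A minimal Python sketch (the record contents are made up to stand in for real ARC records, which hold an ARC header line plus the HTTP response and payload):

```python
import gzip
import io

# Simulate the ARC layout: each record is independently gzipped,
# then the compressed members are concatenated back to back.
records = [b"record one\n", b"record two\n"]
arc_bytes = b"".join(gzip.compress(r) for r in records)

# Python's gzip module transparently reads concatenated gzip
# members as one continuous stream, so the entire "file"
# decompresses in a single pass.
with gzip.open(io.BytesIO(arc_bytes), "rb") as f:
    data = f.read()

print(data.decode())
```

The same property means `zcat` on a whole ARC file yields all records in order.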
Text-Only Files
• Saved as sequence files -- consisting of binary key/value pairs. (Used extensively in MapReduce as input/output formats)
• On average 20% the size of raw content
• located in the segment directories, with a file name of "textData-nnnnn". For example:
• s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
Metadata Files
• For each URL, metadata files contain status information, the HTTP response code, and file names and offsets of ARC files where the raw content can be found.
• Also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags).
• Records in the Metadata files are in the same order and have the same file numbers as the Text Only content
• Saved as sequence files
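The per-URL ARC file names and offsets in the metadata are useful because each record is an independent gzip member: a reader can seek straight to a stored offset and decompress only that one record. A hedged Python sketch with toy data (the offsets and record contents here are invented for illustration, not real Common Crawl values):

```python
import gzip
import zlib

# Build a toy "ARC file": two gzip members back to back, recording
# the byte offset where each member starts -- analogous to the
# offsets stored in the metadata files.
records = [b"<html>page one</html>", b"<html>page two</html>"]
members = [gzip.compress(r) for r in records]
offsets = [0, len(members[0])]
arc_bytes = b"".join(members)

def read_record(buf: bytes, offset: int) -> bytes:
    """Decompress the single gzip member starting at `offset`.

    zlib with wbits=16+MAX_WBITS decodes the gzip wrapper and stops
    at the end of the first stream, leaving later members untouched.
    """
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    return d.decompress(buf[offset:])

print(read_record(arc_bytes, offsets[1]).decode())
```

Against the real dataset, the same seek-and-decompress step can be done with an HTTP Range request into S3, avoiding a full ARC file download.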
Browsing Data
• You can use s3cmd on your local machine
• Install using pip, ‘pip install s3cmd’
• Configure, ‘s3cmd --configure’
• Requires AWS keys
• Demo: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
Common Crawl AMI
• Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce
• Amazon AMI ID: "ami-07339a6e"
Running Example MR Jobs Using the AMI
• ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )
• bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/
• look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
Code Samples to Try
• http://github.com/commoncrawl/
• Pete Warden’s Ruby example
• http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html
Helpful Resources
• Developer Documentation:
• https://commoncrawl.atlassian.net/
• Developer Discussion List:
• https://groups.google.com/group/common-crawl
Questions?
• @davelester
• www.davelester.org