Introduction to Common Crawl

Dave Lester, March 21, 2013


DESCRIPTION

March 2013 slides about Common Crawl, originally presented to the Working with Open Data class at the UC Berkeley School of Information (http://www.ischool.berkeley.edu/courses/i290t-wod).

TRANSCRIPT

Page 1: Introduction to Common Crawl

Introduction to Common Crawl

Dave Lester
March 21, 2013


Page 2: Introduction to Common Crawl

video intro: https://www.youtube.com/watch?v=ozX4GvUWDm4


Page 3: Introduction to Common Crawl

What is Common Crawl?

• non-profit organization providing an open repository of web crawl data that anyone can access and analyze

• data is currently shared as a public dataset on Amazon S3


Page 4: Introduction to Common Crawl

Why Open Data?

• It’s difficult to crawl the web at scale

• Provides a shared resource for researchers to compare results and recreate experiments


Page 5: Introduction to Common Crawl

2012 Corpus Stats

• Total # of Web Documents: 3.8 billion

• Total Uncompressed Content Size: 100 TB+

• # of Domains: 61 million

• # of PDFs: 92.2 million

• # of Word Docs: 6.6 million

• # of Excel Docs: 1.3 million


Page 6: Introduction to Common Crawl

Other Data Sources

• Blekko - “spam-free search engine”

• their metadata includes:

• rank on a linear scale, and a 0-10 web rank

• a true/false flag for whether Blekko’s webspam algorithm considers the domain or page spam

• a true/false flag from Blekko’s pr0n (adult content) detection algorithm


Page 7: Introduction to Common Crawl

What is Crawled?

• Check out the new URL search tool: http://commoncrawl.org/url-search-tool/

• (try entering ischool.berkeley.edu)

• First five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!


Page 8: Introduction to Common Crawl

How is Data Crawled?

• Customized crawler (it’s open source!)

• Some basic PageRank is included; lots of time was spent optimizing this and filtering spam

• See Apache Nutch as an alternative web-scale crawler

• Future datasets may include other crawl sources


Page 9: Introduction to Common Crawl

Common Crawl Uses


Page 10: Introduction to Common Crawl

Analyze References to Facebook

• Of ~1.3 billion URLs:

• 22% of Web pages contain Facebook URLs

• 8% of Web pages implement Open Graph tags

• Among ~500m hardcoded links to Facebook, only 3.5 million are unique

• These are primarily for simple social integrations


Page 11: Introduction to Common Crawl

References to FB Pages

• /merriamwebster 676071 (0.14%)

• /kevjumba 651389 (0.14%)

• /placeformusic 618963 (0.13%)

• /lyricskeeper 517999 (0.11%)

• /kayak 465179 (0.10%)

• /twitter 281882 (0.06%)


Page 12: Introduction to Common Crawl

Analyze JavaScript Libraries on the Web

1. jQuery (82.64%)

2. Prototype (6.06%)

3. Mootools (4.83%)

4. Ext (3.47%)

5. YUI (1.78%)

6. Modernizr (0.59%)

7. Dojo (0.21%)

8. Ember (0.14%)

9. Underscore (0.11%)

10. Backbone (0.09%)


Page 13: Introduction to Common Crawl

Library Co-occurrence


Page 14: Introduction to Common Crawl

Web Data Commons

• sub-corpus of Common Crawl data

• includes RDFa, hCalendar, hCard, Geo Microdata, hResume, XFN

• built using 2009/2010 corpus


Page 15: Introduction to Common Crawl


Page 16: Introduction to Common Crawl

Traitor: Associating Concepts

http://www.youtube.com/watch?v=c7Y149RnQjw


Page 17: Introduction to Common Crawl

Associated Costs?

• Complete data set: ~$1,300

• Facebook Link Analysis: $434.61

• Searchable Index of Data Set: $100

• “average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $.018, just under one tenth of the on-demand rate”
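A rough sense of scale, assuming the quoted spot rate applies to the whole run: at about $0.018 per instance-hour, ~$1,300 corresponds to roughly 1,300 / 0.018 ≈ 72,000 c1.medium instance-hours.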


Page 18: Introduction to Common Crawl

Give it a Try


Page 19: Introduction to Common Crawl

ARC Files

• Files contain the full HTTP response (headers and payload) for every page crawled

• Format designed by the Internet Archive

• ARC files are a series of concatenated GZIP documents (see the reader sketch below)
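A minimal reader sketch for this format, assuming an .arc.gz file has already been downloaded locally (for example with s3cmd, covered later). It follows the Internet Archive's ARC v1 layout, where each record is a one-line header (URL, IP address, archive date, content type, length) followed by that many bytes of raw HTTP response; the class name is just for illustration, not the project's own reader.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Sketch: walk the records of a locally downloaded .arc.gz file.
// Assumes the ARC v1 layout: "<url> <ip> <date> <content-type> <length>\n"
// followed by exactly <length> bytes of the raw HTTP response.
public class ArcReaderSketch {
    public static void main(String[] args) throws IOException {
        // GZIPInputStream on modern JDKs reads all concatenated gzip members as one stream.
        DataInputStream in = new DataInputStream(new BufferedInputStream(
                new GZIPInputStream(new FileInputStream(args[0]))));

        String header;
        while ((header = readLine(in)) != null) {
            if (header.isEmpty()) continue;                 // blank separators between records
            String[] fields = header.split(" ");
            int length = Integer.parseInt(fields[fields.length - 1]);
            byte[] payload = new byte[length];              // raw HTTP headers + body
            in.readFully(payload);
            System.out.println(fields[0] + "  (" + length + " bytes)");
        }
    }

    // Read one '\n'-terminated line as raw bytes so we never buffer past the payload.
    private static String readLine(DataInputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') buf.write(b);
        return (b == -1 && buf.size() == 0) ? null : buf.toString("ISO-8859-1").trim();
    }
}
```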


Page 20: Introduction to Common Crawl

Text-Only Files

• Saved as sequence files, consisting of binary key/value pairs (used extensively in MapReduce as input/output formats); see the reader sketch below

• On average 20% the size of the raw content

• Located in the segment directories, with a file name of "textData-nnnnn". For example:

• s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
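A short sketch of peeking into one of these files with Hadoop's SequenceFile API, assuming the file was first copied locally and that the pairs are Text keys (page URLs) and Text values (the extracted text); the class name and the five-record limit are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: print the first few key/value pairs of a textData-nnnnn file
// that has already been copied to the local filesystem.
public class TextDataPeek {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path path = new Path(args[0]);                      // e.g. ./textData-00112

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text url = new Text();                          // key: page URL
            Text text = new Text();                         // value: text-only content
            int shown = 0;
            while (reader.next(url, text) && shown++ < 5) {
                System.out.println(url + "  ->  " + text.getLength() + " bytes of text");
            }
        } finally {
            reader.close();
        }
    }
}
```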


Page 21: Introduction to Common Crawl

Metadata Files

• For each URL, metadata files contain status information, the HTTP response code, and file names and offsets of ARC files where the raw content can be found.

• Also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags).

• Records in the metadata files are in the same order and have the same file numbers as the Text-Only content

• Saved as sequence files (a JSON parsing sketch follows below)
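The metadata values can be read with the same SequenceFile approach as above and then parsed as JSON; below is a sketch using Gson. The field names shown ("http_result", "archiveInfo", "arcFileOffset") are illustrative guesses, so check the developer documentation (linked later in the deck) for the actual schema.

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// Sketch: pull a couple of fields out of one metadata record's JSON value.
// Field names here are illustrative; the real schema is in the developer docs.
public class MetadataRecordPeek {
    public static void main(String[] args) {
        String jsonValue = args[0];                        // one record's value string, for the demo
        JsonObject record = new JsonParser().parse(jsonValue).getAsJsonObject();

        if (record.has("http_result")) {                   // HTTP response code
            System.out.println("status: " + record.get("http_result").getAsInt());
        }
        if (record.has("archiveInfo")) {                   // where the raw content lives
            JsonObject arc = record.getAsJsonObject("archiveInfo");
            System.out.println("ARC offset: " + arc.get("arcFileOffset").getAsLong());
        }
    }
}
```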


Page 22: Introduction to Common Crawl

Browsing Data

• You can use s3cmd on your local machine

• Install using pip: ‘pip install s3cmd’

• Configure: ‘s3cmd --configure’

• Requires AWS keys

• Demo: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/ (a programmatic equivalent is sketched below)
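The same listing can also be done programmatically; here is a sketch using the AWS SDK for Java 1.x (an alternative to s3cmd, not something the slides cover), assuming AWS credentials are already configured just as they are for s3cmd.

```java
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// Sketch: list the first page of keys under the parse-output prefix of the
// public dataset bucket, roughly what the s3cmd demo above shows.
public class ListParseOutput {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client();          // picks up configured credentials

        ObjectListing listing = s3.listObjects(
                "aws-publicdatasets", "common-crawl/parse-output/");
        for (S3ObjectSummary object : listing.getObjectSummaries()) {
            System.out.println(object.getKey() + "  (" + object.getSize() + " bytes)");
        }
    }
}
```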


Page 23: Introduction to Common Crawl

Common Crawl AMI

• Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce

• Amazon AMI ID: "ami-07339a6e"


Page 24: Introduction to Common Crawl

Running Example MR Jobs Using the AMI

• ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )

• bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/

• Look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java (a simplified sketch of the idea follows below)
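To give a feel for what such an example does, here is a simplified sketch of the idea, not the project's actual ExampleMetadataDomainPageCount source: read the metadata sequence files as (URL, JSON) Text pairs, emit (domain, 1) from the mapper, and sum per domain in the reducer.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified sketch of the idea behind the metadata domain page count example:
// each metadata record's key is a URL, so emit (host, 1) and let a summing
// reducer total the pages seen per domain.
public class DomainPageCountSketch
        extends Mapper<Text, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private final Text domain = new Text();

    @Override
    protected void map(Text url, Text metadataJson, Context context)
            throws IOException, InterruptedException {
        String host;
        try {
            host = new URI(url.toString()).getHost();
        } catch (URISyntaxException e) {
            return;                                        // skip malformed URLs
        }
        if (host != null) {
            domain.set(host);
            context.write(domain, ONE);                    // one page seen for this domain
        }
    }
}
```

In the driver this would be paired with SequenceFileInputFormat and a summing reducer such as org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer; the real example referenced above is the authoritative version.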


Page 26: Introduction to Common Crawl

Helpful Resources

• Developer Documentation:

• https://commoncrawl.atlassian.net/

• Developer Discussion List:

• https://groups.google.com/group/common-crawl


Page 27: Introduction to Common Crawl

Questions?

• @davelester

• [email protected]

• www.davelester.org
