Introduction to Common Crawl


Posted on 06-May-2015






Slides from March 2013 about Common Crawl, originally presented to the Working with Open Data class at the UC Berkeley School of Information.


1. Introduction to Common Crawl. Dave Lester, March 21, 2013 (Monday, April 1, 13)
2. Video intro:
3. What is Common Crawl?
   • Non-profit org providing an open repository of web crawl data to be accessed and analyzed by anyone
   • Data is currently shared as a public dataset on Amazon S3
4. Why Open Data?
   • It's difficult to crawl the web at scale
   • Provides a shared resource for researchers to compare results and recreate experiments
5. 2012 Corpus Stats
   • Total # of web documents: 3.8 billion
   • Total uncompressed content size: 100 TB+
   • # of domains: 61 million
   • # of PDFs: 92.2 million
   • # of Word docs: 6.6 million
   • # of Excel docs: 1.3 million
6. Other Data Sources
   • Blekko, a "spam-free search engine"
   • Their metadata includes:
     • Rank on a linear scale, and a 0-10 web rank
     • True/false for Blekko's webspam algorithm thinking this domain or page is spam
     • True/false for Blekko's pr0n detection algorithm
7. What is Crawled?
   • Check out the new URL search tool (try entering
   • First five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS credit!
8. How is Data Crawled?
   • Customized crawler (it's open source!)
   • Some basic PageRank included; lots of time spent optimizing this and filtering spam
   • See Apache Nutch as an alternative web-scale crawler
   • Future datasets may include other crawl sources
9. Common Crawl Uses
10. Analyze References to Facebook
   • Of ~1.3 billion URLs:
   • 22% of web pages contain Facebook URLs
   • 8% of web pages implement Open Graph tags
   • Among ~500M hardcoded links to Facebook, only 3.5 million are unique
   • These are primarily for simple social integrations
11. References to FB Pages
   • /merriamwebster 676,071 (0.14%)
   • /kevjumba 651,389 (0.14%)
   • /placeformusic 618,963 (0.13%)
   • /lyricskeeper 517,999 (0.11%)
   • /kayak 465,179 (0.10%)
   • /twitter 281,882 (0.06%)
12. Analyze JavaScript Libraries on the Web
   1. jQuery (82.64%)
   2. Prototype (6.06%)
   3. MooTools (4.83%)
   4. Ext (3.47%)
   5. YUI (1.78%)
   6. Modernizr (0.59%)
   7. Dojo (0.21%)
   8. Ember (0.14%)
   9. Underscore (0.11%)
   10. Backbone (0.09%)
13. Library Co-occurrence
14. Web Data Commons
   • Sub-corpus of Common Crawl data
   • Includes RDFa, hCalendar, hCard, Geo Microdata, hResume, XFN
   • Built using the 2009/2010 corpus
15. (no text)
16. Traitor: Associating Concepts
17. Associated Costs?
   • Complete data set: ~$1,300.00
   • Facebook link analysis: $434.61
   • Searchable index of data set: $100
   • "Average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $.018, just under one tenth of the on-demand rate"
18. Give it a Try
19. ARC Files
   • Files contain the full HTTP response and payload for all pages crawled
   • Format designed by the Internet Archive
   • ARC files are a series of concatenated GZIP documents
20. Text-Only Files
   • Saved as sequence files, consisting of binary key/value pairs (used extensively in MapReduce as input/output formats)
   • On average 20% the size of the raw content
   • Located in the segment directories, with a file name of "textData-nnnnn". For example:
   • s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
21. Metadata Files
   • For each URL, metadata files contain status information, the HTTP response code, and the file names and offsets of the ARC files where the raw content can be found
   • Also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags)
   • Records in the metadata files are in the same order and have the same file numbers as the text-only content
   • Saved as sequence files
22. Browsing Data
   • You can use s3cmd on your local machine
   • Install using pip: 'pip install s3cmd'
   • Configure: 's3cmd --configure'
   • Requires AWS keys
   • Demo: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
23. Common Crawl AMI
   • Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce
   • Amazon AMI ID: "ami-07339a6e"
24. Running Example MR Jobs Using the AMI
   • ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )
   • bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/
   • Look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
25. Code Samples to Try
   • Pete Warden's Ruby example: ruby-code-across-five-billion-web-pages.html
26. Helpful Resources
   • Developer Documentation:
   • Developer Discussion List:
27. Questions?
   • @davelester
   • www.davelester.org
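ARC files (slide 19) are a series of concatenated GZIP members, so a standard compression library can walk them member by member. A minimal Python sketch of that idea, operating on an in-memory byte string rather than a real ARC file; the two compressed strings stand in for ARC records:

```python
import gzip
import zlib

def gzip_members(raw: bytes):
    """Yield the decompressed payload of each GZIP member in a
    stream of concatenated members (the layout ARC files use)."""
    while raw:
        decomp = zlib.decompressobj(wbits=31)  # 31: expect a gzip header/trailer
        yield decomp.decompress(raw)
        raw = decomp.unused_data               # bytes belonging to later members

# Two concatenated members, standing in for two ARC records.
data = gzip.compress(b"record one") + gzip.compress(b"record two")
records = list(gzip_members(data))
```

Reading one member at a time (rather than letting `gzip` transparently concatenate them) matters here, because each member boundary is also a record boundary.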

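The ExampleMetadataDomainPageCount job mentioned on slide 24 tallies pages per domain over the metadata files. The same map/reduce logic can be sketched in plain Python, outside Hadoop; the URL list below is invented for illustration:

```python
from collections import Counter
from urllib.parse import urlparse

def domain_page_counts(urls):
    """Map each URL to its host (the map step), then tally
    pages per domain (the reduce step)."""
    return Counter(urlparse(url).netloc for url in urls)

# Hypothetical crawled URLs, for illustration only.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://commoncrawl.org/about",
]
counts = domain_page_counts(urls)
```

In the real job the mapper and reducer run as separate distributed phases, but the domain-extraction and counting steps are the same.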

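The JavaScript-library percentages on slide 12 come from scanning crawled HTML for script references. A naive sketch of such a scan, with simplified, invented signatures (not the original study's detection logic):

```python
import re

# Simplified filename signatures; a real analysis would be more robust.
LIBRARY_PATTERNS = {
    "jQuery": re.compile(r"jquery[-.\w]*\.js", re.I),
    "Prototype": re.compile(r"prototype[-.\w]*\.js", re.I),
    "Backbone": re.compile(r"backbone[-.\w]*\.js", re.I),
}

def detect_libraries(html: str):
    """Return the set of known libraries referenced by <script src=...> tags."""
    srcs = re.findall(r'<script[^>]+src=["\']([^"\']+)', html, re.I)
    return {name for name, pattern in LIBRARY_PATTERNS.items()
            for src in srcs if pattern.search(src)}

page = '<script src="/js/jquery-1.9.1.min.js"></script>'
found = detect_libraries(page)
```

Run per page over the corpus and divided by total page count, tallies like this yield usage percentages of the kind the slide reports.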