Introduction to Common Crawl

Dave Lester, March 21, 2013








March 2013 slides about Common Crawl. The original presentation was given to the Working with Open Data class at the UC Berkeley School of Information.


1. Introduction to Common Crawl. Dave Lester, March 21, 2013.

2. Video intro:

3. What is Common Crawl?
   A non-profit org providing an open repository of web crawl data to be accessed and analyzed by anyone. The data is currently shared as a public dataset on Amazon S3.

4. Why Open Data?
   It is difficult to crawl the web at scale. An open corpus provides a shared resource for researchers to compare results and recreate experiments.

5. 2012 Corpus Stats
   Total # of web documents: 3.8 billion
   Total uncompressed content size: 100 TB+
   # of domains: 61 million
   # of PDFs: 92.2 million
   # of Word docs: 6.6 million
   # of Excel docs: 1.3 million

6. Other Data Sources
   Blekko, a spam-free search engine. Its metadata includes: a rank on a linear scale and a 0-10 web rank; true/false for whether Blekko's webspam algorithm thinks the domain or page is spam; true/false for Blekko's pr0n-detection algorithm.

7. What is Crawled?
   Check out the new URL search tool. The first five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS credit!

8. How is Data Crawled?
   A customized crawler (it's open source!), with some basic PageRank included; lots of time was spent optimizing it and filtering spam. See Apache Nutch as an alternative web-scale crawler. Future datasets may include other crawl sources.

9. Common Crawl Uses

10. Analyze References to Facebook
    Of ~1.3 billion URLs: 22% of web pages contain Facebook URLs, and 8% implement Open Graph tags. Among the ~500 million hardcoded links to Facebook, only 3.5 million are unique; these are primarily simple social integrations.

11. References to FB Pages
    /merriamwebster   676,071 (0.14%)
    /kevjumba         651,389 (0.14%)
    /placeformusic    618,963 (0.13%)
    /lyricskeeper     517,999 (0.11%)
    /kayak            465,179 (0.10%)
    /twitter          281,882 (0.06%)

12. Analyze JavaScript Libraries on the Web
    1. jQuery (82.64%)
    2. Prototype (6.06%)
    3. MooTools (4.83%)
    4. Ext (3.47%)
    5. YUI (1.78%)
    6. Modernizr (0.59%)
    7. Dojo (0.21%)
    8. Ember (0.14%)
    9. Underscore (0.11%)
    10. Backbone (0.09%)

13. Library Co-occurrence

14. Web Data Commons
    A sub-corpus of Common Crawl data. Includes RDFa, hCalendar, hCard, Geo microdata, hResume, and XFN. Built using the 2009/2010 corpus.

15. (image-only slide)

16. Traitor: Associating Concepts

17. Associated Costs?
    Complete data set: ~$1,300.00
    Facebook link analysis: $434.61
    Searchable index of the data set: $100
    The average per-hour cost for a High-CPU Medium instance (c1.medium) was about $0.018, just under one tenth of the on-demand rate.

18. Give it a Try

19. ARC Files
    Files contain the full HTTP response and payload for all pages crawled. The format was designed by the Internet Archive. ARC files are a series of concatenated GZIP documents.

20. Text-Only Files
    Saved as sequence files consisting of binary key/value pairs (used extensively in MapReduce as input/output formats). On average 20% the size of the raw content. Located in the segment directories, with a filename of "textData-nnnnn". For example:
    s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112

21. Metadata Files
    For each URL, metadata files contain status information, the HTTP response code, and the file names and offsets of the ARC files where the raw content can be found. They also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags). Records in the metadata files are in the same order and have the same file numbers as the Text-Only content, and are likewise saved as sequence files.
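Slides 19-21 describe the on-disk formats in enough detail to read an ARC file without Hadoop: since an ARC file is just a series of concatenated GZIP members, Python's gzip module can stream it directly. A minimal sketch, assuming a locally downloaded ARC file; the filename "example.arc.gz" is a placeholder, and the header-parsing details follow the ARC v1 format rather than anything shown in the deck:

    import gzip

    def iter_arc_records(path):
        # ARC files are concatenated GZIP members (slide 19); Python's
        # gzip module decompresses them transparently as one stream.
        with gzip.open(path, "rb") as f:
            while True:
                header = f.readline()
                if not header:
                    break                # end of file
                header = header.strip()
                if not header:
                    continue             # blank separator between records
                # ARC v1 record header:
                #   URL IP-address archive-date content-type length
                fields = header.split(b" ")
                length = int(fields[-1])
                payload = f.read(length)  # raw HTTP response, headers + body
                yield fields[0].decode("utf-8", "replace"), payload

    # "example.arc.gz" is illustrative, not a path from the slides.
    for url, payload in iter_arc_records("example.arc.gz"):
        print(url, len(payload))

Note that the first record yielded is the file's own "filedesc://" header record, which callers may want to skip.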
22. Browsing Data
    You can use s3cmd on your local machine. Install it with pip (pip install s3cmd), then run s3cmd --configure (requires AWS keys). Demo:
    s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/

23. Common Crawl AMI
    An Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce. Amazon AMI ID: "ami-07339a6e"

24. Running Example MR Jobs Using the AMI
    ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )
    bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/
    Look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java

25. Code Samples to Try
    Pete Warden's Ruby example: ruby-code-across-five-billion-web-pages.html

26. Helpful Resources
    Developer Documentation:
    Developer Discussion List:

27. Questions?
    @davelester
    [email protected]
    www.davelester.org
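The s3cmd listing on slide 22 can also be reproduced from Python. A minimal sketch using boto3, which is my substitution rather than a tool from the deck; it sends unsigned requests since the crawl was a public dataset, and the bucket/prefix are the 2013 locations from the slides, which may have moved since:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned client: no AWS keys are needed just to list a public bucket.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # Equivalent of: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
    resp = s3.list_objects_v2(
        Bucket="aws-publicdatasets",
        Prefix="common-crawl/parse-output/",
        Delimiter="/",
    )
    for p in resp.get("CommonPrefixes", []):   # "directory" prefixes
        print(p["Prefix"])
    for obj in resp.get("Contents", []):       # objects at this level
        print(obj["Key"], obj["Size"])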