Slide 1

International Internet Preservation Consortium, General Assembly 2014, Paris

Mining a Large Web Corpus

Robert Meusel, Christian Bizer

Slide 2

The Common Crawl

Slide 3

Hyperlink Graphs

Knowledge about the structure of the Web can be used to improve crawling strategies, to help SEO experts or to understand social phenomena.

Slide 4

HTML-embedded Data on the Web

Several million websites semantically mark up the content of their HTML pages.

Markup Syntaxes

Microformats

RDFa

Microdata

Data snippets within info boxes

Slide 5

Relational HTML Tables

HTML tables contain semi-structured data which can be used to build up or extend knowledge bases such as DBpedia.

• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.

In a corpus of 14 billion raw tables, 154 million are "good" relations (1.1%).

Slide 6

The Web Data Commons Project

Has developed an Amazon-based framework for extracting data from large web crawls, capable of running on any cloud infrastructure

Has applied this framework to the Common Crawl data; it is adaptable to other crawls

Results and framework are publicly available at http://webdatacommons.org

Goal: Offer an easy-to-use, cost-efficient, distributed extraction framework for large web crawls, as well as datasets extracted from the crawls.

Slide 7

Extraction Framework

[Architecture diagram] A master node, an AWS SQS queue, AWS S3 storage, and a pool of AWS EC2 worker instances. The steps shown are: 1: fill queue, 2: launch instances, 3: request file-reference, 4: download file, 5: extract & upload, 6: collect results; the diagram marks each step as either automated or manual.
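To make the queue-based hand-off concrete, the following is a minimal sketch of steps 1 and 3, assuming the AWS SDK for Java and made-up queue and file names; it is not the actual WDC code (the real framework is linked on slide 9).

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

import java.util.List;

public class QueueSketch {

    // Hypothetical queue URL; the real framework reads this from its configuration.
    static final String QUEUE_URL =
        "https://sqs.us-east-1.amazonaws.com/123456789012/wdc-extraction";

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Step 1 (master): fill the queue with one message per crawl file.
        // These keys are stand-ins for the Common Crawl file manifest.
        String[] crawlFiles = {
            "common-crawl/segment-0/file-0.warc.gz",
            "common-crawl/segment-0/file-1.warc.gz"
        };
        for (String key : crawlFiles) {
            sqs.sendMessage(QUEUE_URL, key);
        }

        // Step 3 (worker): request the next file reference from the queue.
        List<Message> messages = sqs.receiveMessage(QUEUE_URL).getMessages();
        for (Message m : messages) {
            System.out.println("Next file to process: " + m.getBody());
            // Steps 4 and 5 (download, extract & upload) would happen here;
            // the message is deleted only after successful processing, so a
            // crashed worker's file simply reappears in the queue.
            sqs.deleteMessage(QUEUE_URL, m.getReceiptHandle());
        }
    }
}
```

Because the queue is the only coordination point, adding more EC2 instances scales the extraction without changing the code.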

Slide 8

Extraction Worker

[Worker diagram] Each worker instance downloads a .(w)arc file from AWS S3, runs it through a filter and the WDC extractor, and uploads the resulting output file back to AWS S3.

Worker:
• Written in Java
• Processes one page at a time
• Independent from other files and workers

Filter:
• Reduces runtime
• MIME-type filter
• Regex detection of content or meta-information
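The MIME-type and regex filter is what keeps the runtime (and therefore the cost) low: a cheap string check decides whether a page is handed to the full extractor at all. Below is a minimal sketch of that idea, assuming the MIME type and page content are already available as strings (the real worker reads them from the (w)arc records) and using made-up regex markers for Microdata, RDFa and Microformats.

```java
import java.util.regex.Pattern;

public class PageFilter {

    // Assumed markers for the three markup syntaxes; the real framework uses
    // its own filter rules (a MIME-type check plus regexes along these lines).
    private static final Pattern MARKUP_HINT = Pattern.compile(
        "itemscope|property=|typeof=|class=\"[^\"]*(vcard|hrecipe|hproduct)",
        Pattern.CASE_INSENSITIVE);

    /** Returns true if the page is HTML and may contain embedded structured data. */
    public static boolean mightContainStructuredData(String mimeType, String content) {
        if (mimeType == null || !mimeType.startsWith("text/html")) {
            return false;                                // MIME-type filter
        }
        return MARKUP_HINT.matcher(content).find();      // cheap regex pre-check
    }

    public static void main(String[] args) {
        String page = "<div itemscope itemtype=\"http://schema.org/Product\">...</div>";
        System.out.println(mightContainStructuredData("text/html", page)); // prints: true
    }
}
```

Only pages that pass such a check are parsed in full, which is why the filter reduces the overall runtime so noticeably.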

Slide 9

Web Data Commons – Extraction Framework

Written in Java

Mainly tailored for Amazon Web Services

Fault-tolerant and cheap: 300 USD to extract 17 billion RDF statements from 44 TB

Easily customizable: only the worker has to be adapted

The worker is a single process method that handles one file at a time (see the sketch below)

Scaling is automated by the framework

Access Open Source Code: https://www.assembla.com/code/commondata/

Alternative: Hadoop Version, which can run on any Hadoop cluster without Amazon Web Services.
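Since only the worker has to be adapted, customizing the framework boils down to supplying one class with a single per-file method. The following sketch shows the idea; the class name, the method signature and the toy extraction logic are assumptions for illustration, not the framework's actual interface (see the repository linked above for that).

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

/** Hypothetical per-file worker: the only piece a user would have to adapt. */
public class MyWorker {

    /**
     * Processes one downloaded crawl file and writes the extracted records to
     * the output file; queueing, download, upload and scaling stay with the framework.
     */
    public void processFile(File input, File output) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(input));
             FileWriter out = new FileWriter(output)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Stand-in extraction logic: keep lines that mention schema.org.
                if (line.contains("schema.org")) {
                    out.write(line + System.lineSeparator());
                }
            }
        }
    }
}
```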

Slide 10

Extracted Datasets

Hyperlink Graph

HTML-embedded Data

Relational HTML Tables

Slide 11

Hyperlink Graph

Extracted from the Common Crawl 2012 Dataset

Over 3.5 billion pages connected by over 128 billion links

Graph files: 386 GB

http://webdatacommons.org/hyperlinkgraph/
http://wwwranking.webdatacommons.org/

Slide 12

Hyperlink Graph

Degrees do not follow a power-law (see the sketch below)

Detection of Spam pages

Further insights: WWW'14: Graph Structure in the Web – Revisited (Meusel et al.)

WebSci'14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.)

Discovery of evolutions in the global structure of the World Wide Web.
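A rough way to check the degree-distribution claim is to build an in-degree histogram from the published graph files. The sketch below assumes the graph is available as an edge list with one tab-separated "source target" pair of node IDs per line (see the download page above for the actual file formats); for the full graph of 128 billion links the counting would have to run out of core or distributed, but the idea is the same.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class DegreeDistribution {

    public static void main(String[] args) throws IOException {
        // args[0]: assumed edge-list file, one "sourceId<TAB>targetId" pair per line.
        Map<Long, Long> inDegree = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                inDegree.merge(Long.parseLong(parts[1]), 1L, Long::sum);
            }
        }

        // Count how many nodes have each in-degree.
        Map<Long, Long> histogram = new TreeMap<>();
        for (long degree : inDegree.values()) {
            histogram.merge(degree, 1L, Long::sum);
        }
        histogram.forEach((degree, count) -> System.out.println(degree + "\t" + count));
    }
}
```

Plotted on log-log axes, a power-law would appear as a straight line; deviations from such a line are the kind of finding reported in the WWW'14 paper cited above.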

Slide 13

Hyperlink Graph

Discovery of important and interesting sites using different popularity rankings or website categorization libraries

[Visualization] Websites connected by at least ½ million links

Slide 14

HTML-embedded Data

More and more websites semantically mark up the content of their HTML pages.

Markup Syntaxes

RDFa

Microformats

Microdata

Slide 15

Websites containing Structured Data (2013)

1.8 million websites (pay-level domains, PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13.9%)

585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26.3%).

Web Data Commons Microformat, Microdata, RDFa Corpus: 17 billion RDF triples from Common Crawl 2013

Next release will be in winter 2014

http://webdatacommons.org/structureddata/

Slide 16

Top Classes Microdata (2013)

• schema = Schema.org
• dv = Google's Rich Snippet Vocabulary

Slide 17

HTML Tables

• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.

• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.

In a corpus of 14 billion raw tables, 154 million are "good" relations (1.1%) (Cafarella et al., 2008).

Classification Precision: 70-80%

Slide 18

WDC - Web Tables Corpus

Large corpus of relational Web tables for public download

Extracted from Common Crawl 2012 (3.3 billion pages)

147 million relational tables, selected out of 11.2 billion raw tables (1.3%)

The download includes the HTML pages of the tables (1 TB zipped)

Table Statistics

Heterogeneity: very high.

            Min   Max      Average   Median
Attributes  2     2,368    3.49      3
Data Rows   1     70,068   12.41     6

http://webdatacommons.org/webtables/

Slide 19

WDC - Web Tables Corpus

Attribute Statistics

28,000,000 different attribute labels

Attribute       #Tables
name            4,600,000
price           3,700,000
date            2,700,000
artist          2,100,000
location        1,200,000
year            1,000,000
manufacturer      375,000
country           340,000
isbn               99,000
area               95,000
population         86,000

Subject Attribute Values

1.74 billion rows, 253,000,000 different subject labels

Value              #Rows
usa                135,000
germany             91,000
greece              42,000
new york            59,000
london              37,000
athens              11,000
david beckham        3,000
ronaldinho           1,200
oliver kahn            710
twist shout          2,000
yellow submarine     1,400

Slide 20

Conclusion

Three factors are necessary to work with web-scale data:

Availability of crawls: thanks to Common Crawl, this data is available

Availability of cheap, easy-to-use infrastructures: like Amazon or other on-demand cloud services

Easy-to-adopt, scalable extraction frameworks: the Web Data Commons Framework, or standard tools like Pig; costs have to be evaluated per task, but the WDC framework has turned out to be cheaper

Slide 21

Questions

Please visit our website: www.webdatacommons.org

Data and framework are available as a free download

Web Data Commons is supported by: [sponsor logos]