
Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce

Alex Stolz and Martin Hepp

Universitaet der Bundeswehr Munich, D-85579 Neubiberg, Germany
{alex.stolz,martin.hepp}@unibw.de

Abstract. In recent years, the publication of structured data inside the HTML content of Web sites has become a mainstream feature of commercial Web sites. In particular, e-commerce sites have started to add RDFa or Microdata markup based on the schema.org and GoodRelations vocabularies. For many potential usages of this huge body of data, we need to crawl the sites and extract the data from the markup. Unfortunately, a lot of markup resides in very deep branches of the sites, namely in the product detail pages. Such pages are difficult to crawl because of their sheer number and because they often lack links pointing to them. In this paper, we present a small-sized experiment in which we compare the Web pages from a popular Web crawler, Common Crawl, with the URLs in sitemap files of the respective Web sites. We show that Common Crawl lacks a major share of the product detail pages that hold a significant part of the data, and that an approach as simple as a sitemap crawl yields many more product pages. Based on our findings, we conclude that a rethinking of state-of-the-art crawling strategies is necessary in order to cater for e-commerce scenarios.

1 Introduction

Nowadays, ever more vendors are offering and selling their products on the Web. The spectrum of online vendors ranges from a few very large retailers like Amazon and BestBuy, featuring an ample amount of goods, to many small Web shops with reasonable assortments. Consequently, the amount of product information available online is constantly growing.

While product descriptions were mostly unstructured in the past, the situation has meanwhile changed. In the last few years, numerous Web shops have increasingly started to expose product offers using structured data markup embedded as RDFa, Microdata, or microformats in HTML pages (cf. [1,2]). Standard e-commerce Web vocabularies like GoodRelations or schema.org typically complement these data formats (RDFa and Microdata¹) by semantically describing products and their commercial properties. To some extent, the semantic annotations of product offers on the Web have been promoted by search engine operators that offer tangible benefits and incentives for Web shop owners. In particular, they claim that additional data granularity can increase the visibility of single Web pages, for example in the form of rich snippets (or rich captions, in Bing terminology) displayed on search engine result pages [3,4]. Furthermore, structured data, and structured product data in particular, can provide useful relevance signals that search engines can take into account for their ranking algorithms as well as for the delivery of more appropriate search results. In essence, structured product data opens up novel opportunities for sophisticated use cases such as deep product comparison at Web scale.

¹ Unlike for purely syntactical formats like RDFa and Microdata, the class and property names in microformats already imply semantics.
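To make the kind of markup at stake concrete, the following sketch shows a product offer annotated with schema.org Microdata, together with a minimal way to read the annotations programmatically. Both the HTML snippet and the tiny parser are our own illustration: the property names follow schema.org, but the parser handles only this flat case and is not one of the extraction tools cited in this paper.

```python
from html.parser import HTMLParser

# A product offer annotated with schema.org Microdata, as commonly
# embedded in e-commerce product detail pages (illustrative snippet).
HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Acme Anvil</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">49.99</span>
    <span itemprop="priceCurrency">EUR</span>
  </div>
</div>
"""

class MicrodataParser(HTMLParser):
    """Collects (itemprop, text) pairs from flat Microdata markup."""

    def __init__(self):
        super().__init__()
        self._prop = None      # itemprop whose text we are about to read
        self.properties = []   # collected (property, value) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Remember the property name; skip tags that open a nested item.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._prop and data.strip():
            self.properties.append((self._prop, data.strip()))
            self._prop = None

parser = MicrodataParser()
parser.feed(HTML)
print(parser.properties)
# [('name', 'Acme Anvil'), ('price', '49.99'), ('priceCurrency', 'EUR')]
```

Real-world extractors, such as the one behind Web Data Commons, additionally handle nested items, RDFa, and microformats; the point here is merely how product properties travel inline in the HTML of a page.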

Unfortunately, for regular data consumers other than search engines like Google, Bing, Yahoo!, or Yandex², it is difficult to make use of the wealth of product information on the Web. Web crawling remains an open problem, for it is known to be expensive due to the sheer size of the Web, and because it requires deep technical expertise. Hence, motivated by these challenges and by the idea of providing easy and open access to periodic snapshots of industrial-strength Web crawls, the Common Crawl project was launched in 2007.

The Common Crawl corpus, with its hundreds of terabytes worth of gathered data, constitutes an incredibly valuable resource for research experiments and application development. At almost no cost³, it can offer unprecedented insights into several aspects of the Web. Among others, researchers found it to be useful to analyze the graph structure of the Web [5], or to support machine translation tasks (e.g. [6]). Moreover, it is possible to extract and analyze structured data on a large scale, as tackled by the Web Data Commons initiative [1].

One popular and persistent misconception about Common Crawl, however, is to think that it is truly representative of the Web as a whole. The FAQ section on commoncrawl.org further contributes to this fallacy, as it literally terms Common Crawl a "copy of the web" [7]. This statement is problematic, though, because it can foster wrong decisions. In particular, Web-scale crawling is costly, which is why Web crawlers generally give precedence to a subset of pages, selected by popularity or relevance thresholds of Web shops. To prioritize the visiting of Web pages, they take into account sitemaps, page load time, or the link structure of Web pages (i.e. PageRank [8]). Common Crawl, e.g., employs a customized PageRank algorithm and includes URLs donated by the former search engine Blekko⁴. Yet, a lot of the markup can be found in very deep branches of Web sites, like product detail pages. The aforementioned crawling strategies miss out on many of these long-tail product pages, which renders them inappropriate for commercial use cases. For a related discussion on the limited coverage of Common Crawl with respect to structured e-commerce data, we refer to a W3C mailing list thread from 2012⁵.
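To illustrate why such prioritization systematically disfavors product detail pages, here is a plain power-iteration PageRank over a toy shop graph. This is a generic sketch of the algorithm named above, not Common Crawl's customized (and non-public) variant, and the toy link graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict: page -> list of outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling page: spread its rank uniformly
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
        rank = new
    return rank

# Toy shop: homepage, about page, and category page link among each
# other, while each product detail page receives only a single link.
toy = {
    "home": ["category", "about"],
    "about": ["home"],
    "category": ["home", "product-1", "product-2"],
    "product-1": [],
    "product-2": [],
}
print(sorted(pagerank(toy).items(), key=lambda kv: -kv[1]))
```

In this toy graph, the two product detail pages end up with the lowest ranks, which is exactly the long-tail effect discussed above: a rank-driven frontier visits them last, if at all.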

² Shop owners often proactively direct popular search engines to their Web pages.
³ The only costs incurred are those for storage and processing of the data.
⁴ http://www.slideshare.net/davelester/introduction-to-common-crawl
⁵ http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0095.html and http://lists.w3.org/Archives/Public/public-vocabs/2012Apr/0016.html



[Figure: three histogram panels over Web site levels 1-5, labeled "Actual URL distribution", "Sitemap URL distribution", and "Common Crawl URL distribution"]

Fig. 1. Our initial assumption about the URL coverage at varying Web site levels

The main motivation for this paper was to answer the question whether Common Crawl is complete enough for use cases around structured e-commerce data. For e-commerce applications, e.g., it is crucial to be able to rely on a solid and comprehensive product information database, which we think Common Crawl cannot offer to date. Even if it might not be very surprising that Common Crawl misses a considerable amount of product data in commercial Web sites, it has so far remained unclear by how much.

In this paper, we investigate the representativeness of Common Crawl with respect to e-commerce on the Web. For this purpose, we rely on a random sample of product-related Web sites from Common Crawl, for which we measure the coverage of (1) Web pages in general, and (2) Web pages with structured product offer data. We further analyze the crawl depth of Common Crawl. As our baseline, we use sitemaps to approximate the actual URL distribution of Web sites, as depicted in Fig. 1.
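As a minimal sketch of the comparison we have in mind, assume we already hold a site's sitemap URLs and the URLs Common Crawl captured for the same host as plain lists. Coverage can then be computed overall and per site level, where we approximate the level by URL path depth, mirroring the levels in Fig. 1; the example data is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def url_level(url):
    """Site level = number of non-empty path segments (homepage = 0)."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])

def coverage_by_level(sitemap_urls, crawl_urls):
    """Share of sitemap URLs present in the crawl, overall and per level."""
    sitemap, crawl = set(sitemap_urls), set(crawl_urls)
    total = len(sitemap & crawl) / len(sitemap)
    counts = Counter(url_level(u) for u in sitemap)
    hits = Counter(url_level(u) for u in sitemap & crawl)
    per_level = {level: hits.get(level, 0) / n
                 for level, n in sorted(counts.items())}
    return total, per_level

# Hypothetical site where the deep product pages are missing from the crawl.
sitemap = ["http://shop.example/",
           "http://shop.example/category/a",
           "http://shop.example/category/a/product-1",
           "http://shop.example/category/a/product-2"]
crawl = ["http://shop.example/", "http://shop.example/category/a"]
print(coverage_by_level(sitemap, crawl))
# (0.5, {0: 1.0, 2: 1.0, 3: 0.0})
```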

The rest of this paper is structured as follows: Section 2 reviews important related work; in Sect. 3, we outline our method, the relevant data sources, and design decisions that we made; Section 4 reports the results of the analysis of Common Crawl; in Sect. 5, we discuss our findings and emergent shortcomings; and finally, Sect. 6 concludes our paper.

2 Related Work

A considerable body of literature deals with crawling structured data from the Web. In contrast to conventional Web crawling of textual content, structured data poses special challenges to the organization and performance of crawling and indexing tasks [9]. The work in [9] suggests a pipelined crawling and indexing architecture for the Semantic Web. In [10], the authors propose a learning-based, focused crawling strategy for structured data on the Web. The approach uses an online classifier and performs a bandit-based selection of URLs, i.e. it determines a trade-off between exploration and exploitation based on the page context and by incorporating feedback from previously discovered metadata. The Web Data Commons [1] initiative reuses a Web crawl provided by Common Crawl to extract structured data.
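The exploration-exploitation trade-off mentioned above can be pictured with an epsilon-greedy frontier. This is a deliberately generic sketch of the bandit idea, not the actual policy of [10]; the reward estimates are assumed to come from a learned model of how much structured data similar pages have yielded.

```python
import random

def pick_url(frontier, reward, epsilon=0.1):
    """Epsilon-greedy URL selection from a crawl frontier.

    frontier: candidate URLs; reward: url -> estimated yield of
    structured data (e.g. learned from metadata found so far).
    """
    if random.random() < epsilon:
        return random.choice(frontier)                       # explore
    return max(frontier, key=lambda u: reward.get(u, 0.0))   # exploit
```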


Surprisingly, none of the aforementioned approaches address the question of where the structured data actually resides within a Web site, although this information can be useful for tuning the crawling process. [11] provide a study on the distribution of structured content for information extraction on the Web. They show that, in order to capture a reasonable amount of the data of a domain, it is crucial to regard the long tail of Web sites. On a more general scope, the authors of [12] estimate that a crawler is able to reach a significant portion of the Web pages visited by users by following only three to five steps from the start page.

3 Methodology and Data

In this paper, we address the problem of crawling e-commerce data from the Web, and in particular we focus on caveats of using Common Crawl. This section describes the data sources and the methodology for our inspection of Common Crawl relative to e-commerce.

3.1 Datasets and APIs

Each of the Common Crawl corpora represents a huge body of crawled Web pages that is not straightforward to process and analyze in detail. The December 2014 Common Crawl corpus has a compressed file size of 160 terabytes and is reported to contain 15.7 million Web sites and a total of over two billion Web pages [13]. To cope with this enormous amount of data, we drew on derived datasets and APIs.
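One such API is Common Crawl's public CDX index server, which lists the captures recorded for a host. A minimal sketch follows, under two assumptions we flag explicitly: that the December 2014 corpus is the one identified as CC-MAIN-2014-52 on index.commoncrawl.org, and that the index returns one JSON record per line.

```python
import json
import urllib.request

# Assumption: CC-MAIN-2014-52 denotes the December 2014 crawl and is
# served by the public index at index.commoncrawl.org in NDJSON form.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2014-52-index"

def captured_urls(host):
    """Yield every URL the crawl captured under the given host."""
    query = f"{INDEX}?url={host}/*&output=json"
    with urllib.request.urlopen(query) as response:
        for line in response:
            record = json.loads(line)   # one JSON object per capture
            yield record["url"]

for url in captured_urls("example.com"):
    print(url)
```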

    Web Data Commons. The Web Data Commons dataset series contains allstructu