[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Extracting Objects from the Web

  • Extracting Objects from the Web

    Zaiqing Nie¹, Fei Wu², Ji-Rong Wen¹, Wei-Ying Ma¹

    ¹Microsoft Research Asia, ²Tsinghua University

    {znie,jrwen,wyma}@microsoft.com wufei98@mails.tsinghua.edu.cn

    Abstract

    Extracting and integrating object information from the Web is of great significance for Web data management. Existing Web information extraction techniques cannot provide a satisfactory solution to the Web object extraction task, since objects of the same type are distributed across diverse Web sources whose structures are highly heterogeneous. In this paper, we propose a novel approach called Object-Level Information Extraction (OLIE) to extract Web objects. This approach extends a classic information extraction algorithm, Conditional Random Fields (CRF), by adding Web-specific information. The experimental results show that OLIE can significantly improve Web object extraction accuracy.

    1 Introduction

    This paper studies how to automatically extract object information from the Web. The main challenge is that objects of the same type are distributed across diverse Web sources whose structures are highly heterogeneous. For instance, information about paper objects can be found in homepages, PDF files, and even online databases.

    Although it is possible to combine existing Web information extraction techniques into a toolkit that extracts objects from some template-generated Web pages, we do not think this is a practical solution: because the attribute values of an object are extracted from various Web sources independently, a template must be learned for each Web site.

    Another closely related line of work is classic information extraction from plain-text documents [1]. However, these methods were originally designed for processing plain text, not Web pages, and thus cannot be directly applied to the Web object extraction task. Of course, we can transform each Web page into a plain-text document by removing HTML tags and other irrelevant code. But treating Web pages as plain-text documents is unwise, since important Web-specific information for object extraction, such as page structure and layout, is lost.

    This work was done while the author was visiting Microsoft Research Asia.

    The advantage of classic IE algorithms is their capability of handling heterogeneous data sources and integrating information extraction and object identification in a uniform framework, while Web IE takes advantage of Web-specific information, e.g. tags and layouts, to extract objects. In this paper, we present an object-level information extraction (OLIE) approach which can effectively extract Web objects from multiple heterogeneous Web data sources. Our basic idea is to extend a classic IE algorithm, Conditional Random Fields (CRF), by adding Web-specific features, so our method is essentially a combination of Web IE and classic IE. More specifically, we found that, besides text, two other kinds of Web information are of particular importance for Web object extraction: visual information on Web pages and structured information from Web databases.

    2 Problem Formulation

    The Web object extraction problem is motivated by Libra, a scientific literature search engine that we are developing [3].

    2.1 Object Blocks and Elements

    Web Objects & Attributes: We define Web objects as the principal data units about which Web information is to be collected, indexed and ranked. Web objects are usually recognizable concepts, such as authors, papers, conferences, or journals, which are relevant to the application domain. Different types of objects are used to represent the information for different concepts. We assume that objects of the same type follow a common relational schema $R(a_1, a_2, \ldots, a_m)$. Attributes, $A = \{a_1, a_2, \ldots, a_m\}$, are properties which describe the objects, and key attributes, $A_K = \{a_{K_1}, a_{K_2}, \ldots, a_{K_K}\} \subseteq A$, are properties which can uniquely identify an object.

    Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

  • Figure 1. Four Object Blocks Located in a Web Page and Five Elements Shown in the Bottom Block

    Object Blocks & Elements: The information about an object on a Web page is usually grouped together as a block, since Web page creators try to display semantically related information together. We define an object block as a collection of information within a Web page that relates to a single object. Given an object block found on a Web page, we further segment it into atomic extraction entities called object elements. In this way, the object block $E_i$ is converted to a sequence of elements, i.e. $E_i = \langle e_{i1} e_{i2} \cdots e_{iT} \rangle$. Each element $e_{ij}$ belongs to only a single attribute of the object, and an attribute can contain several elements. Figure 1 shows four object blocks located in a Web page generated by Froogle and five elements located in the bottom block. With the help of data record mining techniques such as [2], we can automatically detect the object blocks in a Web page.

    2.2 Web Object Extraction

    Given an object block $E_i = \langle e_{i1} e_{i2} \cdots e_{iT} \rangle$ and its relevant object schema $R(a_1, a_2, \ldots, a_m)$, we need to assign an attribute name from the attribute set $A = \{a_1, a_2, \ldots, a_m\}$ to each object element $e_{ij}$ to determine the corresponding label sequence $L_i = \langle l_{i1} l_{i2} \cdots l_{iT} \rangle$. If the object block $E_i$ and a previously extracted object $O_n$ in the database refer to the same entity, we integrate $O_n$ and the labeled $E_i$ together. The key attributes $A_K$ are used to decide whether they refer to the same entity. The combined labeling and integration inference is called Web object extraction.
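    The labeling-then-integration step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the `WebObject` class, the helper names, and the exact-match rule on key attributes are all assumptions made for the example, and the label sequence is taken as given.

```python
from dataclasses import dataclass, field


@dataclass
class WebObject:
    """A record following the schema R(a1, ..., am): attribute name -> list of elements."""
    attributes: dict = field(default_factory=dict)


def build_object(elements, labels):
    """Group the elements of one object block by their assigned attribute labels."""
    obj = WebObject()
    for element, label in zip(elements, labels):
        obj.attributes.setdefault(label, []).append(element)
    return obj


def same_entity(a, b, key_attributes):
    """Hypothetical identification rule: every key attribute is present and matches
    after whitespace and case normalization."""
    def norm(obj, name):
        return " ".join(obj.attributes.get(name, [])).strip().lower()
    return all(norm(a, k) and norm(a, k) == norm(b, k) for k in key_attributes)


def integrate(database, new_obj, key_attributes):
    """Merge new_obj into a matching record, or append it as a new record."""
    for record in database:
        if same_entity(record, new_obj, key_attributes):
            # Keep existing values; only fill in attributes the record lacks.
            for name, values in new_obj.attributes.items():
                record.attributes.setdefault(name, values)
            return record
    database.append(new_obj)
    return new_obj
```

    Here a labeled block whose key attribute (for example, a paper title) matches an existing record only contributes the attributes that record is missing; a real system would need a more tolerant matching rule than exact string equality.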

    After locating an object block on a Web page and segmenting it into object elements, the labeling operation can be treated as a sequence data classification problem. Please see [4] for a detailed discussion of the sequence characteristics among the elements in an object block. To the best of our knowledge, the Conditional Random Fields (CRF) model is among the most popular and effective methods for this task [1]. We therefore select CRF as the base model and extend it for Web object extraction.

    3 Object-Level Information Extraction

    As stated above, our goal is to incorporate all available information to assist Web object extraction. The basic CRF model cannot meet this requirement, since it models the label sequence probability conditioned only on the element sequence $E = \langle e_1 e_2 \cdots e_T \rangle$ and performs no object identification. We introduce a novel object-level information extraction approach called OLIE. OLIE uses an Enhanced CRF (ECRF) model, which extends the basic CRF model in two ways.

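    First, we modify the label sequence probability to condition not only on the element sequence, but also on the available databases:

    $$P(L \mid E, D, \Theta) = \frac{1}{Z_E} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{N} \lambda_k f_k(l_{t-1}, l_t, E, D, t) \right\} \qquad (1)$$

    where $E$ is the object element sequence, which contains both the text and visual information, $D$ denotes the databases which store structured information, and $f_k(l_{t-1}, l_t, E, D, t)$ is the new feature function based on all three categories of information. As in the basic CRF model [1], $\lambda_k$ is the weight of feature $f_k$, $\Theta$ collects these weights, and $Z_E$ is the normalization factor.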
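    To make Eq. (1) concrete, the sketch below evaluates it by brute force on a toy label set. Everything specific here is an assumption for illustration: the two feature functions, their weights (which would normally be learned), and the representation of the database as a dictionary from attribute name to known values. The normalization factor is computed by enumerating all label sequences, which is only feasible for tiny examples.

```python
import itertools
import math

LABELS = ["Title", "Author"]


def f_db_match(prev_label, label, elements, database, t):
    """Hypothetical database feature: element t equals a stored value of attribute `label`."""
    return 1.0 if elements[t] in database.get(label, set()) else 0.0


def f_transition(prev_label, label, elements, database, t):
    """Hypothetical transition feature: a Title element is followed by an Author element."""
    return 1.0 if prev_label == "Title" and label == "Author" else 0.0


FEATURES = [f_db_match, f_transition]
WEIGHTS = [2.0, 1.0]  # illustrative lambda_k values, not learned


def score(labels, elements, database):
    """Double sum in the exponent of Eq. (1): over positions t and features k."""
    total = 0.0
    for t, label in enumerate(labels):
        prev = labels[t - 1] if t > 0 else None  # no previous label at t = 0
        total += sum(w * f(prev, label, elements, database, t)
                     for w, f in zip(WEIGHTS, FEATURES))
    return total


def probability(labels, elements, database):
    """P(L | E, D) with Z_E obtained by enumerating every label sequence."""
    z = sum(math.exp(score(list(candidate), elements, database))
            for candidate in itertools.product(LABELS, repeat=len(elements)))
    return math.exp(score(labels, elements, database)) / z
```

    With `elements = ["Extracting Objects from the Web", "Zaiqing Nie"]` and `database = {"Author": {"Zaiqing Nie"}}`, the labeling `["Title", "Author"]` receives the highest probability, because both the database feature and the transition feature fire for it.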

    There are cases in which we have sufficiently high confidence that some object element $e_t$ should have a certain label. For instance, a good match may be found between $e_t$ and the key or important attributes of records in the databases, or $e_t$ may have a high enough element emission probability for some attribute. For example, if the statistic $p(l_t = \text{conference} \mid e_t$ contains "in proceedings of"$) = 0.99$ holds and the current $e_t$ is "in proceedings of SIGMOD04", it is almost certain that conference is the label. These constraints can be used to guide the search so that the optimal label path is found correctly and quickly. This leads to our second variation of the basic CRF. Specifically, we first compute the confidence $c_t(a_i)$ that $e_t$ belongs to a certain attribute $a_i$ based on some feature functions. If the confidence is high enough ($c_t(a_i) > \tau$), we modify the induction formula of the Viterbi algorithm as follows:

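    $$\delta_t(l) = \begin{cases} \max_{l'} \left\{ c_t(a_i)\, \delta_{t-1}(l') \exp\left[ \sum_{k=1}^{N} \lambda_k f_k(l', l, E, D, t) \right] \right\} & l = a_i \\ \max_{l'} \left\{ (1 - c_t(a_i))\, \delta_{t-1}(l') \exp\left[ \sum_{k=1}^{N} \lambda_k f_k(l', l, E, D, t) \right] \right\} & \text{others} \end{cases} \qquad (2)$$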

  • If $c_t(a_i) \le \tau$, the induction formula of the standard Viterbi algorithm is used unchanged.
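    A minimal sketch of Viterbi decoding with the confidence-weighted induction step of Eq. (2) is given below. It is written under several assumptions that the text does not fix: `confidence(t, label)` is a hypothetical callable returning $c_t(a_i)$, the threshold default `tau=0.9` is arbitrary, the constrained attribute at each position is taken to be the one with the highest confidence, and the same weighting is also applied at the first position. Feature functions use the signature $f_k(l', l, E, D, t)$ from Eq. (1).

```python
import math


def constrained_viterbi(elements, labels, features, weights, database,
                        confidence, tau=0.9):
    """Return the best label path under the confidence-weighted Viterbi recursion."""
    if not elements:
        return []

    def potential(prev, label, t):
        # exp[ sum_k lambda_k * f_k(l', l, E, D, t) ]
        return math.exp(sum(w * f(prev, label, elements, database, t)
                            for w, f in zip(weights, features)))

    def factor(t, label):
        # Pick the attribute a_i with the highest confidence at position t.
        best = max(labels, key=lambda a: confidence(t, a))
        c = confidence(t, best)
        if c <= tau:
            return 1.0  # below the threshold: standard Viterbi step
        return c if label == best else 1.0 - c

    # Initialization (assumption: the confidence weighting also applies at t = 0).
    delta = [{l: factor(0, l) * potential(None, l, 0) for l in labels}]
    back = []
    for t in range(1, len(elements)):
        delta.append({})
        back.append({})
        for l in labels:
            prev = max(labels, key=lambda p: delta[t - 1][p] * potential(p, l, t))
            delta[t][l] = factor(t, l) * delta[t - 1][prev] * potential(prev, l, t)
            back[-1][l] = prev

    # Backtrack from the best final label.
    path = [max(labels, key=lambda l: delta[-1][l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

    For long sequences the products of probabilities would underflow, so a practical version would work with log scores instead; the multiplicative form is kept here only to mirror Eq. (2).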

    Based on ECRF, OLIE makes full use of all available information to assist the extraction of Web objects. Because object identification is performed during this process, bidirectional communication between object blocks and database records is achieved, which leads to combined information extraction and integration.

    4 Experiments

    The OLIE approach proposed in this paper is fully implemented and evaluated in the context of Libra. Two types of Web objects are defined in the experiments: papers and authors. We use instance accuracy to evaluate the performance of OLIE. Instance accuracy is defined as the percentage of instances in which all words are correctly labelled.
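    As a small illustration of the metric, instance accuracy can be computed as below. The function name and the representation of each instance as a list of labels are assumptions made for this sketch.

```python
def instance_accuracy(predicted, gold):
    """Percentage of instances whose predicted label sequence is entirely correct."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of instances")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if list(p) == list(g))
    return 100.0 * correct / len(gold)
```

    An instance with a single wrong label counts as entirely wrong, which makes this measure stricter than per-word accuracy.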

    4.0.1 Datasets

    Paper Citations: We took the citation dataset derived from the Cora project for testing. It contains 500 citations, of which we used 300 for training and the remaining 200 for testing. Seven attributes of paper objects are extracted: Author, Title, Editor, Booktitle, Journal, Year, and Others.

    Paper Headers: We randomly selected 200 papers in the Cora dataset and downloaded them from the Internet. We used 100 papers for training and the remaining 100 for testing. Nine attributes of author objects are extracted: Name, Affiliation, Address, Email, Fax, Phone, Web URL, Degree, and Others. Four attributes of paper objects are extracted: Title, Author, Abstract, and Others.

    Author Homepages: We randomly collected 200 computer scientists' homepages from the Internet. Compared with the previous two datasets, this dataset is more general and flexible. Eleven attributes of author objects are extracted: Name, Affiliation, Designation, Address, Email, Phone, Fax, Education, Secretary, Office, and Others. We randomly selected 100 homepages for training and the remaining 100 for testing.

    ACM Digital Library: The ACM Digital Library is an online Web database with high-quality structured data, containing essential structured information about 150,000 computer science papers in total.

    4.1 Experimental Results

    In Figure 2, we show OLIE's extraction results in terms of instance accuracy and compare them with some typical algorithms (i.e. CRF and HMM). An obvious improvement is obtained, for two main reasons. First, additional information such as visual and database information is utilized to help the extraction. Second, the labeling process is based on elements instead of words.

    Figure 2. Instance accuracy by different algorithms

    To test the effectiveness of using object elements instead of words, we discard the database features during extraction. The result is shown in Figure 2 as the OLIE* method. Although the result is not as good as that of OLIE, an improvement is still obtained compared with CRF and HMM.

    5 Conclusion

    By leveraging the advantages of both Web IE and classic IE techniques, we propose an Object-Level Information Extraction (OLIE) approach that extends the Conditional Random Fields (CRF) algorithm with Web-specific information such as visual features and database features. The novelty of this approach lies in utilizing as much available Web information as possible to assist the extraction process.

    References

    [1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

    [2] B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2003.

    [3] Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: Bringing order to web objects. In Proceedings of WWW Conference, 2005.

    [4] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma. 2D conditional random fields for web information extraction. In Proceedings of ICML Conference, 2005.
