
Extracting Objects from the Web

Zaiqing Nie1 Fei Wu2 ∗ Ji-Rong Wen1 Wei-Ying Ma1

1 Microsoft Research Asia    2 Tsinghua University
{znie,jrwen,wyma}@microsoft.com    [email protected]

Abstract

Extracting and integrating object information from the Web is of great significance for Web data management. Existing Web information extraction techniques cannot provide a satisfactory solution to the Web object extraction task, since objects of the same type are distributed across diverse Web sources whose structures are highly heterogeneous. In this paper, we propose a novel approach called Object-Level Information Extraction (OLIE) to extract Web objects. This approach extends a classic information extraction algorithm, Conditional Random Fields (CRF), by adding Web-specific information. The experimental results show that OLIE can significantly improve Web object extraction accuracy.

1 Introduction

This paper studies how to automatically extract object information from the Web. The main challenge is that objects of the same type are distributed across diverse Web sources whose structures are highly heterogeneous. For instance, information about “paper” objects can be found in homepages, PDF files, and even online databases.

Although it is possible to combine existing Web information extraction techniques into a toolkit that extracts objects from some template-generated Web pages, we do not think this is a practical solution: the attribute values of an object are extracted from various Web sources independently, and a template must be learned for each Web site.

Another closely related line of work is classic information extraction from plain-text documents [1]. However, these methods were originally designed for processing plain text rather than Web pages, and thus cannot be directly applied to the Web object extraction task. Of course, we can transform each Web page into a plain-text document by removing HTML tags and other irrelevant code. But treating Web pages as plain-text documents is unwise, since important Web-specific information for object extraction, such as page structure and layout, is lost.

∗This work was done while the author was visiting Microsoft Research Asia.

The advantage of classic IE algorithms is their capability of handling heterogeneous data sources and integrating information extraction and object identification in a uniform framework, while Web IE takes advantage of Web-specific information, e.g. tags and layouts, to extract objects. In this paper, we present an object-level information extraction (OLIE) approach which can effectively extract Web objects from multiple heterogeneous Web data sources. Our basic idea is to extend a classic IE algorithm, Conditional Random Fields (CRF), by adding Web-specific features, so our method is essentially a combination of Web IE and classic IE. More specifically, we found that, besides text, two other kinds of Web information are of particular importance for Web object extraction: visual information on the Web pages and structured information from Web databases.

2 Problem Formulation

The Web object extraction problem is motivated by Libra, a scientific literature search engine that we are developing [3].

2.1 Object Blocks and Elements

Web Objects & Attributes: We define Web objects as the principal data units about which Web information is to be collected, indexed, and ranked. Web objects are usually recognizable concepts, such as authors, papers, conferences, or journals, which have relevance to the application domain. Different types of objects are used to represent the information for different concepts. We assume that objects of the same type follow a common relational schema $R(a_1, a_2, \ldots, a_m)$. Attributes, $A = \{a_1, a_2, \ldots, a_m\}$, are properties which describe the objects, and key attributes, $A_K = \{a_{K1}, a_{K2}, \ldots, a_{KK}\} \subseteq A$, are properties which can uniquely identify an object.

Figure 1. Four Object Blocks Located in a Web Page and Five Elements Shown in the Bottom Block

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Object Blocks & Elements: The information about an object on a Web page is usually grouped together as a block, since Web page creators generally try to display semantically related information together. We define an object block as a collection of information within a Web page that relates to a single object. Given an object block found on a Web page, we further segment it into atomic extraction entities called object elements. In this way, the object block $E_i$ is converted to a sequence of elements, i.e. $E_i = \langle e_{i1} e_{i2} \cdots e_{iT} \rangle$. Each element $e_{ij}$ belongs to exactly one attribute of the object, and an attribute can contain several elements. Figure 1 shows four object blocks located in a Web page generated by Froogle and five elements located in the bottom block. With the help of data record mining techniques such as [2], we can automatically detect the object blocks in a Web page.

2.2 Web Object Extraction

Given an object block $E_i = \langle e_{i1} e_{i2} \cdots e_{iT} \rangle$ and its relevant object schema $R(a_1, a_2, \ldots, a_m)$, we need to assign an attribute name from the attribute set $A = \{a_1, a_2, \ldots, a_m\}$ to each object element $e_{ij}$ to determine the corresponding label sequence $L_i = \langle l_{i1} l_{i2} \cdots l_{iT} \rangle$. If the object block $E_i$ and a previously extracted object $O_n$ in the database refer to the same entity, we integrate $O_n$ and the labeled $E_i$ together. The key attributes $A_K$ are used to decide whether they refer to the same entity. The combined labeling and integration inference is called Web object extraction.
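The two steps of this definition, labeling and integration, can be illustrated with a minimal Python sketch. It is not the implementation used in Libra. The attribute names, the `WebObject` class, and the exact-match rule on normalized key attributes are all assumptions made for illustration; the paper does not specify how key attributes are compared.

```python
from dataclasses import dataclass, field

# Hypothetical schema for "paper" objects; names are illustrative only.
ATTRIBUTES = ("title", "author", "conference", "year")
KEY_ATTRIBUTES = ("title", "year")


@dataclass
class WebObject:
    """An extracted object: attribute name -> list of element strings."""
    values: dict = field(default_factory=dict)

    def key(self):
        # Normalized key-attribute values, used to test entity identity.
        # Exact matching after lowercasing is an assumption of this sketch.
        return tuple(
            " ".join(self.values.get(a, [])).strip().lower()
            for a in KEY_ATTRIBUTES
        )


def build_object(elements, labels):
    """Group the elements of one labeled block E_i by their labels L_i."""
    if len(elements) != len(labels):
        raise ValueError("one label per element is required")
    obj = WebObject()
    for element, label in zip(elements, labels):
        obj.values.setdefault(label, []).append(element)
    return obj


def integrate(database, new_obj):
    """Merge new_obj into a record with the same key, or add a new record."""
    for record in database:
        if record.key() == new_obj.key():
            for attr, elems in new_obj.values.items():
                known = record.values.setdefault(attr, [])
                known.extend(e for e in elems if e not in known)
            return record
    database.append(new_obj)
    return new_obj
```

In this sketch, two blocks that agree on title and year are merged into one record, so an author element found in only one of them is added to the shared record.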

After locating an object block on a Web page and segmenting it into a sequence of object elements, the labeling operation can be treated as a sequence classification problem. Please see [4] for a detailed discussion of the sequence characteristics of the elements in an object block. To the best of our knowledge, the Conditional Random Fields (CRF) model is among the most popular and effective methods for this task [1]. We therefore select CRF as the base model and extend it for Web object extraction.

3 Object-Level Information Extraction

As stated above, our goal is to incorporate all available information to assist Web object extraction. The basic CRF model cannot meet this requirement, since it models the label sequence probability conditioned only on the element sequence $E = \langle e_1 e_2 \cdots e_T \rangle$, and performs no object identification. We introduce a novel object-level information extraction approach called OLIE. OLIE uses an Enhanced CRF (ECRF) model, which extends the basic CRF model by introducing two variations.

First, we modify the label sequence probability to condition not only on the element sequence but also on the available databases:

$$P(L \mid E, D, \Theta) = \frac{1}{Z_E} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{N} \lambda_k f_k(l_{t-1}, l_t, E, D, t) \right\} \qquad (1)$$

where $E$ is the object element sequence, which contains both text and visual information, $D$ denotes the databases that store structured information, $Z_E$ is the normalization factor, and $f_k(l_{t-1}, l_t, E, D, t)$ is the new feature function based on all three categories of information.
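To make Equation (1) concrete, the following Python sketch evaluates it by brute force on a two-element block. It is a toy under stated assumptions, not the authors' model: the three feature functions, their weights, the dummy start label, and the dictionary layout of `D` are all hypothetical, and visual features are omitted for brevity. Enumerating every label sequence to obtain $Z_E$ is feasible only for very short sequences.

```python
import itertools
import math

LABELS = ("title", "conference")
START = "<start>"  # assumed dummy label standing in for l_0


def f_db_title(prev, cur, E, D, t):
    # Database feature: the element matches a title stored in D.
    return 1.0 if cur == "title" and E[t] in D["titles"] else 0.0


def f_text_conference(prev, cur, E, D, t):
    # Text feature: the element mentions "proceedings".
    return 1.0 if cur == "conference" and "proceedings" in E[t].lower() else 0.0


def f_transition(prev, cur, E, D, t):
    # Transition feature: a conference element tends to follow a title.
    return 1.0 if (prev, cur) == ("title", "conference") else 0.0


FEATURES = (f_db_title, f_text_conference, f_transition)
WEIGHTS = (2.0, 1.5, 0.5)  # the lambda_k values; arbitrary, not learned


def score(L, E, D):
    """Double sum over t and k inside the exponent of Eq. (1)."""
    total, prev = 0.0, START
    for t, cur in enumerate(L):
        total += sum(w * f(prev, cur, E, D, t) for w, f in zip(WEIGHTS, FEATURES))
        prev = cur
    return total


def probability(L, E, D):
    """P(L | E, D, Theta), with Z_E computed by exhaustive enumeration."""
    z = sum(
        math.exp(score(candidate, E, D))
        for candidate in itertools.product(LABELS, repeat=len(E))
    )
    return math.exp(score(L, E, D)) / z
```

A practical implementation would compute $Z_E$ with the forward algorithm rather than by enumeration.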

There are cases in which we have sufficiently high confidence that some object element $e_t$ should have a certain label. For instance, a good match may be found between $e_t$ and the key or other important attributes of records in the databases, or $e_t$ may have a high enough element emission probability for some attribute. For example, if the statistic $p(l_t = \text{“conference”} \mid e_t \text{ contains “in proceedings of”}) = 0.99$ holds and the current $e_t$ is “in proceedings of SIGMOD04”, it is almost certain that conference is the label. Such constraints can guide the search process to find the optimal label path correctly and quickly. This leads to our second variation of the basic CRF. Specifically, we first compute the confidence $c_t(a_i)$ that $e_t$ belongs to a certain attribute $a_i$ based on some feature functions. If the confidence is high enough ($c_t(a_i) > \tau$), we modify the induction formula of the Viterbi algorithm as follows:

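$$\delta_t(l) = \begin{cases} \max_{l'} \left\{ c_t(a_i) \cdot \delta_{t-1}(l') \exp\left[ \sum_{k=1}^{N} \lambda_k f_k(l', l, E, D, t) \right] \right\} & l = a_i \\ \max_{l'} \left\{ (1 - c_t(a_i)) \cdot \delta_{t-1}(l') \exp\left[ \sum_{k=1}^{N} \lambda_k f_k(l', l, E, D, t) \right] \right\} & \text{others} \end{cases} \qquad (2)$$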


If $c_t(a_i) \le \tau$, the induction formula is the same as (3).
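The confidence-weighted induction can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation. The callback `potential` stands for the already-computed sum $\sum_k \lambda_k f_k(l', l, E, D, t)$. The callback `confidence` is assumed to return only the single most confident attribute at each step. The dummy start label is hypothetical. When the confidence does not exceed $\tau$, the sketch uses a weight of 1, which is our reading of the unmodified Viterbi induction.

```python
import math

START = "<start>"  # assumed dummy label preceding the first element


def constrained_viterbi(T, labels, potential, confidence, tau):
    """Viterbi decoding using the confidence-weighted induction of Eq. (2).

    potential(prev, cur, t): value of sum_k lambda_k * f_k(prev, cur, E, D, t)
    confidence(t): pair (a_i, c_t(a_i)) for the most confident attribute at t
    """
    if T == 0:
        return []
    delta = {START: 1.0}
    backpointers = []
    for t in range(T):
        a_i, c = confidence(t)
        new_delta, pointers = {}, {}
        for l in labels:
            # Apply the constraint only when the confidence exceeds tau.
            if c > tau:
                weight = c if l == a_i else 1.0 - c
            else:
                weight = 1.0
            best_prev = max(
                delta, key=lambda p: delta[p] * math.exp(potential(p, l, t))
            )
            new_delta[l] = (
                weight * delta[best_prev] * math.exp(potential(best_prev, l, t))
            )
            pointers[l] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # Backtrack from the best final label to recover the label path.
    path = [max(delta, key=delta.get)]
    for pointers in reversed(backpointers[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

The sketch multiplies probabilities directly to mirror Equation (2). For long sequences, a log-domain version would be needed to avoid numerical underflow.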

Based on ECRF, OLIE makes full use of all available information to assist the extraction of Web objects. Because object identification is performed during this process, bidirectional communication between object blocks and database records is achieved, which leads to combined information extraction and integration.

4 Experiments

The OLIE approach proposed in this paper is fully implemented and evaluated in the context of Libra. Two types of Web objects are defined in the experiments: papers and authors. We use instance accuracy to evaluate the performance of OLIE. Instance accuracy is defined as the percentage of instances in which all words are correctly labeled.
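The metric can be written as a short Python function. This is a sketch of the definition as stated; the function name and the representation of each instance as a list of per-word labels are assumptions.

```python
def instance_accuracy(predicted, gold):
    """Percentage of instances whose label sequences are entirely correct.

    predicted, gold: lists of label sequences, one sequence per instance.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same instances")
    if not gold:
        return 0.0
    # An instance counts only if every one of its labels matches.
    correct = sum(1 for p, g in zip(predicted, gold) if list(p) == list(g))
    return 100.0 * correct / len(gold)
```

Because a single mislabeled word makes the whole instance count as wrong, this measure is stricter than word-level accuracy.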

4.0.1 Datasets

Paper Citations: We took the citation dataset derived from the Cora project for testing. It contains 500 citations, of which we used 300 for training and the remaining 200 for testing. Seven attributes of paper objects are extracted: Author, Title, Editor, Booktitle, Journal, Year, and Others.

Paper Headers: We randomly selected 200 papers in the Cora dataset and downloaded them from the Internet. We used 100 papers for training and the remaining 100 for testing. Nine attributes of author objects are extracted: Name, Affiliation, Address, Email, Fax, Phone, Web URL, Degree, and Others. Four attributes of paper objects are extracted: Title, Author, Abstract, and Others.

Author Homepages: We randomly collected 200 computer scientists’ homepages from the Internet. Compared with the previous two datasets, this dataset is more general and flexible. Eleven attributes of author objects are extracted: Name, Affiliation, Designation, Address, Email, Phone, Fax, Education, Secretary, Office, and Others. We randomly selected 100 homepages for training and the remaining 100 for testing.

ACM Digital Library: The ACM Digital Library is an online Web database with high-quality structured data. In total, it contains essential structured information about 150,000 computer science papers.

4.1 Experimental Results

In Figure 2, we show OLIE’s instance accuracy and compare it with that of some typical algorithms (i.e. CRF and HMM). OLIE obtains a clear improvement, for two main reasons. First, additional information, such as visual and database information, is utilized to help the extraction. Second, the labeling process is based on elements instead of words.

Figure 2. Instance accuracy by different algorithms

To test the effectiveness of using object elements instead of words, we discard the database features during extraction. The result is shown in Figure 2 as the OLIE* method. Although the result is not as good as that of OLIE, an improvement is still obtained compared with CRF and HMM.

5 Conclusion

By leveraging the advantages of both Web IE and classic IE techniques, we propose an Object-Level Information Extraction (OLIE) approach that extends the Conditional Random Fields (CRF) algorithm with Web-specific information such as visual features and database features. The novelty of this approach lies in utilizing as much available Web information as possible to assist the extraction process.

References

[1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

[2] B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2003.

[3] Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: Bringing order to web objects. In Proceedings of WWW Conference, 2005.

[4] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma. 2D conditional random fields for web information extraction. In Proceedings of ICML Conference, 2005.
