Object-Level Vertical Search
Zaiqing Nie, Microsoft Research Asia
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration
  – Object Ranking
  – Object Mining
• Conclusion
Terminology
• Web Object
  – A collection of (semi-)structured Web information about a real-world object
  – e.g., person, product, job, movie, restaurant, …
• Object-Level Search
  – Search based on Web objects
• Vertical Search
  – Search for information in a specific domain
Object-Level Search vs. Page-Level Search

Page-Level Search
• Technology: Information Retrieval (IR), with the page as the unit of retrieval
• Pros: ease of authoring; ease of use
• Cons: limited query capability; sifting through hundreds of (irrelevant) pages

Object-Level Search
• Technology: Database (DB), with the object as the unit of retrieval
• Pros: powerful query capability; direct answers; aggregate answers
• Cons: where and how to get the objects?
  – A large portion of Web content is inherently (semi-)structured
Why Vertical Search?
• Sites dedicated to a specific domain
• Audiences with specific interests
• Easier to build object-level search in a domain
  – Data in some domains are more structured and uniform
  – Easy to define object types and schemas
Web Search 2.0 -> Web Search 3.0

[Diagram: search engines plotted on two axes, page-level vs. object-level and general vs. vertical. Current Web search sits at page-level general search; Google Scholar and Libra Academic Search sit at object-level vertical search. Tagline: from relevance to intelligence.]
Goal of Object-Level Vertical Search
• Make the Web search engine
  – as scalable as an IR system
  – as effective as a DB system
Commercial Data Statistics
Initial results from manually labeling 51K randomly selected Web pages (22.5K English pages) with a predefined categorization:

Page Category                    | # English Pages | % of 22.5K English Pages | # Pages in 9B Index (All Languages)
Product transactional pages      | 1,835           | 8.15%                    | 733M
Product list pages               | 995             | 4.4%                     | 396M
Product review pages             | 196             | 0.9%                     | 81M
Commercial services              | 196             | 0.9%                     | 81M
Hotels, travel packages, tickets | 131             | 0.6%                     | 54M
Location                         | 112             | 0.5%                     | 45M
Newspapers                       | 92              | 0.4%                     | 36M
Job listings                     | 18              | 0.08%                    | 7.2M
Music                            | 16              | 0.07%                    | 6.3M
Movie                            | 15              | 0.07%                    | 6.3M
Architecture

[Diagram: crawled Web pages flow through object crawling and classification into domain-specific extractors (location, product, conference, author, paper). Matching integration components (paper, author, conference, location, product integration) populate Web object warehouses, e.g., a scientific Web object warehouse and a product object warehouse. On top of the warehouses run PopRank, object relevance, object community mining, and object categorization.]
Outline
• Overview
• Demo: Libra Academic Search   (current section)
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration
  – Object Ranking
  – Object Mining
• Conclusion
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation   (current section)
  – Web Object Extraction
  – Object Integration
  – Object Ranking
  – Object Mining
• Conclusion
Motivation
• Problems of treating a Web page as an atomic unit
  – A Web page usually contains more than pure content
    • Noise: navigation, decoration, interaction, …
  – A Web page often covers multiple topics
• A Web page has internal structure
  – A two-dimensional logical structure and a visual layout presentation
  – More structured than a free-text document, less structured than a structured document
• Layout is the third dimension of a Web page
  – 1st dimension: content
  – 2nd dimension: hyperlinks
Object Information on the Web
Examples: scientific papers, researchers, product items, business locations, images, jobs.
Is DOM a Good Representation of Page Structure?
• Page segmentation using the DOM
  – Extract structural tags such as P, TABLE, UL, TITLE, H1~H6, etc.
• Page segmentation using the DOM, content, and links
  – Record-boundary discovery by heuristics
  – Fine-grained topic distillation by link analysis
• Function-based Object Model (FOM)
  – Define a function for each object and partition the page based on these functions
• But the DOM is geared toward content display and does not necessarily reflect semantic structure
• How about XML?
Vision-based Content Structure
• Goal: extract the content structure of a Web page based on visual cues
  – Typical visual cues: position, lines, blank areas, color, font size, images, …
  – Extracted from the rendering result of a Web browser
• Assumption: the content structure suggested by the visual display reflects the semantic partition of the content, and visual cues help to build that content structure
Definition of Vision-based Content Structure
• A hierarchical structure of layout blocks
  – A layout block is a basic object or a group of basic objects
  – A basic object is a leaf node in the DOM tree of the page
• Can be formally described as a triple (O, Φ, δ):
  – O = {Ω1, Ω2, …, ΩN} is a finite set of layout blocks
  – Φ = {φ1, φ2, …, φT} is a finite set of visual separators
  – δ = O × O → Φ ∪ {NULL} records which visual separator (if any) separates each pair of layout blocks
  – Each layout block is a sub-Web-page with a similar internal structure
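For concreteness, the triple above might be held in data types like the following sketch (the names and fields are ours, not from the VIPS papers):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Block:
    """A layout block: a basic object (DOM leaf) or a group of blocks."""
    doc: float                          # Degree of Coherence of this block
    children: list["Block"] = field(default_factory=list)

@dataclass
class Separator:
    """A visual separator between two sibling blocks."""
    weight: float                       # from blank size, <HR> overlap, font/color change

@dataclass
class ContentStructure:
    """The triple (O, Phi, delta) from the definition above."""
    blocks: list[Block]                                   # O
    separators: list[Separator]                           # Phi
    relation: dict[tuple[int, int], Optional[Separator]]  # delta: block pair -> separator or None
```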
An Example of Vision-based Content Structure

[Figure: a Web page segmented into top-level blocks VB1–VB4; VB2 splits into VB2_1, VB2_2, VB2_3, which split further, e.g. VB2_2 into VB2_2_1, VB2_2_2, VB2_2_3.]

• A hierarchical structure of layout blocks
• A Degree of Coherence (DoC) is defined for each block
  – It measures the intra-block coherence
  – The DoC of a child block must be no less than its parent's
• A Permitted Degree of Coherence (PDoC) can be predefined to achieve different granularities of content structure
  – Segmentation stops only when every block's DoC is no less than the PDoC
  – The smaller the PDoC, the coarser the resulting content structure
VIPS (VIsion-based Page Segmentation): An Algorithm to Effectively Extract Content Structure

Steps: visual block extraction, visual separator detection, content structure construction, iterating until the granularity requirement is met.

• Step 1: Visual block extraction
  – Iteratively find all appropriate visual blocks
  – Visual cues in this stage: tag cue, color cue, text cue, size cue
• Step 2: Visual separator detection
  – A visual separator is a horizontal or vertical line that visually crosses no blocks
  – Each separator is weighted according to a set of patterns
  – Maximally weighted separators are chosen as the real separators
• Step 3: Content structure construction
  – Blocks that are not separated are merged
  – Calculate the DoC of each block and check whether it meets the granularity requirement (i.e., DoC > PDoC)
  – Blocks that fail are iteratively partitioned further
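A minimal sketch of this loop (assuming helper functions extract_blocks, detect_separators, and construct_structure implement the three steps; this is an illustration of the control flow, not the actual VIPS implementation):

```python
def vips_segment(node, pdoc):
    """Recursively segment a rendered DOM subtree into a block hierarchy.

    Refinement of a block stops once its Degree of Coherence (DoC)
    reaches the permitted threshold (PDoC).
    """
    blocks = extract_blocks(node)                    # step 1: visual block extraction
    separators = detect_separators(blocks)           # step 2: separator detection + weighting
    root = construct_structure(blocks, separators)   # step 3: merge blocks around
                                                     # maximally weighted separators
    for child in root.children:
        if child.doc < pdoc:                         # granularity not yet met:
            child.children = vips_segment(child.dom_node, pdoc).children  # iterate
    return root
```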
Step 1: Visual Block Extraction (Cont.)
• Iteratively find all appropriate visual blocks contained in the current sub-tree
• Visual cues used to decide whether to divide a DOM node:
  – Tag cue: tags such as <HR> are often used to separate different topics visually
  – Color cue: DOM nodes with different background colors belong to different blocks
  – Text cue: if most of the children of a DOM node are text nodes, we prefer not to divide it
  – Size cue: we predefine a relative size threshold (compared with the size of the whole page or sub-page) for different tags
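These cues could combine into a divide/no-divide heuristic like this sketch (the accessors contains_tag, background_color, is_text, and render_area, and the threshold value, are hypothetical):

```python
def should_divide(node, page_area, size_threshold=0.2):
    """Decide whether to divide a DOM node into smaller visual blocks."""
    # Tag cue: separator-like tags (e.g. <HR>) inside the node suggest
    # multiple topics, so divide.
    if contains_tag(node, "HR"):
        return True
    # Color cue: children with different background colors form
    # visually distinct regions.
    if len({child.background_color for child in node.children}) > 1:
        return True
    # Text cue: if most children are text nodes, prefer to keep it whole.
    text_children = sum(1 for c in node.children if c.is_text)
    if node.children and text_children / len(node.children) > 0.5:
        return False
    # Size cue: nodes smaller than a relative threshold stay atomic.
    if node.render_area / page_area < size_threshold:
        return False
    return True
```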
Step 2: Visual Separator Detection
• A visual separator is represented by (Ps, Pe)
  – Ps is the start pixel and Pe is the end pixel
  – Separators are horizontal or vertical lines in a Web page that visually cross no blocks in the pool
• Two parts: separator detection and separator weight setting
Step 2: Visual Separator Detection (Cont.)
• Separator detection
  – Start with a single separator covering the whole page, (Ptl, Pbr)
  – Add blocks to the pool one by one and update the separators:
    • If a block is contained in a separator, split the separator
    • If a block crosses a separator, update (shrink) it
    • If a block covers a separator, remove it
  – Finally, remove the separators on the page border
[Figure: separators S1–S4 being split, updated, and removed as blocks 1–4 are added to the pool.]
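Following the update rules above, a sketch for horizontal separators, representing each separator as a (start, end) pixel interval (the representation is our assumption):

```python
def update_separators(separators, block):
    """Update horizontal separators when a block is added to the pool.

    Each separator is an interval (start, end) in page coordinates;
    the block occupies (block.top, block.bottom).
    """
    updated = []
    for start, end in separators:
        if start < block.top and block.bottom < end:
            # Block is contained in the separator: split it in two.
            updated += [(start, block.top), (block.bottom, end)]
        elif block.top <= start <= block.bottom < end:
            # Block crosses the separator's start: shrink it.
            updated.append((block.bottom, end))
        elif start < block.top <= end <= block.bottom:
            # Block crosses the separator's end: shrink it.
            updated.append((start, block.top))
        elif block.top <= start and end <= block.bottom:
            # Block covers the separator: remove it.
            continue
        else:
            updated.append((start, end))
    return updated
```

Usage would start from the single whole-page separator, e.g. `seps = [(page_top, page_bottom)]`, fold in each block with `seps = update_separators(seps, block)`, and finally drop the separators touching the page border.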
Step 2: Visual Separator Detection (Cont.)
• Setting the weight of a separator, based on:
  – The size of the blank area between the blocks
  – Overlap of the separator with tags such as <HR>
  – Differences in font size between the two sides of the separator
  – Differences in color between the two sides of the separator
Step 3: Content Structure Construction
• Maximally weighted separators are chosen as the real separators
• Blocks that are not separated are merged
• A content structure is built at this level
• Each sub-block is checked to see whether it meets the granularity requirement
  – Those that fail are iteratively partitioned further
  – Once all sub-blocks meet the requirement, we have the final content structure of the Web page
A VIPS Example

[Figure: a DOM subtree of nested TABLE, TD, P, CENTER, and text nodes mapped to visual blocks Block1–Block6, forming the sub-hierarchy VB2_2_2 with children VB2_2_2_1 through VB2_2_2_3 and grandchildren such as VB2_2_2_1_1 and VB2_2_2_3_1.]
Example of Web Page Segmentation (2)
• Segmentation can also be applied to Web image retrieval
  – e.g., extracting the text surrounding an image
Experiments
• Manual evaluation of page segmentation
  – 140 pages selected from 14 Yahoo! categories

Human judgment | Number of pages
Perfect        | 86
Satisfactory   | 50
Failed         | 4
Web Page Block: a Better Information Unit
• Page segmentation: vision-based approach (WWW'03 paper)
• Block importance modeling: statistical learning (WWW'04 paper)

[Figure: a Web page segmented into blocks annotated with importance levels high, medium, and low.]
Block Importance (WWW'04)
• Page importance
  – Important pages vs. unimportant pages
  – Addressed by HITS and PageRank
• Block importance
  – Valuable information vs. noisy information
  – No established counterpart: the question addressed here
A User Study of Block Importance
• Question: do people have consistent opinions about the importance of the same block in a page?
• Subjective importance
  – From the users' view
  – Attention: concentration of mental powers upon an object; a close or careful observing or listening
  – Affected by users' purposes and preferences
• Objective importance
  – From the authors' view
  – The degree of correlation between a block and the theme of the Web page
Settings of the User Study
• Data
  – 600 Web pages from 405 sites in 3 Yahoo! categories: news, science, and shopping
  – Each category includes 200 pages, with diverse layouts and contents
  – The 600 pages were segmented into 4,539 blocks using VIPS
• Importance labeling
  – 5 human assessors manually labeled each block with a 4-level importance value:
  – Level 1: noisy information such as advertisements, copyright notices, decoration, etc.
  – Level 2: useful information not very relevant to the topic of the page, such as navigation, directories, etc.
  – Level 3: information relevant to the theme of the page but not of prominent importance, such as related topics, topic indexes, etc.
  – Level 4: the most prominent part of the page, such as headlines and main content
Result Analysis
• Users do have consistent opinions when judging the importance of blocks

Levels    | 3/5 agreement | 4/5 agreement | 5/5 agreement
1,2,3,4   | 0.929         | 0.535         | 0.237
1,(2,3),4 | 0.995         | 0.733         | 0.417
(1,2,3),4 | 1.000         | 0.932         | 0.828
Result Analysis (cont.)

Levels    | 3/5 agreement | 4/5 agreement | 5/5 agreement
(1,2),3,4 | 0.965         | 0.760         | 0.562
1,(2,3),4 | 0.995         | 0.733         | 0.417
1,2,(3,4) | 0.963         | 0.614         | 0.318
(1,3),2,4 | 0.965         | 0.553         | 0.244
1,3,(2,4) | 0.965         | 0.555         | 0.248
(1,4),2,3 | 0.934         | 0.539         | 0.240

• Levels 2 and 3 are the blurriest to distinguish
Block Importance Model
• A block importance model is formalized as a mapping f: block features → block importance
• Block features: content features
  – Absolute:
    • ImgNum, ImgSize
    • LinkNum, LinkTextLength
    • InnerTextLength
    • InteractionNum, InteractionSize: <INPUT> and <SELECT>
    • FormNum, FormSize: <FORM>
  – Relative (normalized counterparts of the above)
Block Importance Model (cont.)
• Block features: spatial features
  – Absolute: {BlockCenterX, BlockCenterY, BlockRectWidth, BlockRectHeight}
  – Relative: {BlockCenterX/PageWidth, BlockCenterY/PageHeight, BlockRectWidth/PageWidth, BlockRectHeight/PageHeight}
  – Window-normalized: BlockRectHeight is replaced by BlockRectHeight/WindowHeight, and BlockCenterY is modified accordingly
Learning Block Importance
• Training set T: labeled blocks (x, y)
  – x: feature representation of a block
  – y: importance label
• Learn a function f such that Σ_{(x,y)∈T} (f(x) − y)² is minimized
• Learning algorithms
  – Regression by neural network (RBF network)
  – Classification by Support Vector Machines (linear kernel and Gaussian RBF kernel)
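A sketch of this setup with scikit-learn, using feature names from the slides (the labeled_blocks collection and its attributes are assumed; this shows the shape of the experiment, not the paper's exact configuration):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def block_features(block, page):
    """Spatial + content features, as listed on the previous slides."""
    return [
        block.center_x / page.width,        # relative spatial features
        block.center_y / page.height,
        block.rect_width / page.width,
        block.rect_height / page.height,
        block.img_num, block.link_num,      # content features
        block.inner_text_length, block.form_num,
    ]

# X: one feature vector per labeled block; y: importance labels 1..4.
X = [block_features(b, b.page) for b in labeled_blocks]
y = [b.importance for b in labeled_blocks]

# The talk compares an RBF-network regressor with SVM classifiers
# (linear and Gaussian RBF kernels); shown here: the RBF-kernel SVM.
model = SVC(kernel="rbf")
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross validation
print("accuracy: %.3f" % scores.mean())
```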
Experiments
• Experimental setup
  – 600 labeled Web pages from 405 sites
  – 4,517 blocks on which at least 3 of the 5 assessors agreed about the importance
  – 5-fold cross validation
  – Measures: Micro-F1 and Micro-Accuracy
3-level vs. 4-level Importance
• For the 4-level importance model, the precision and recall of levels 2 and 3 are much lower than those of levels 1 and 4
• Combining levels 2 and 3 increases performance significantly

Model   | Level 1 (P/R) | Level 2 (P/R)            | Level 3 (P/R) | Level 4 (P/R) | Micro-F1 | Micro-Acc
4-level | 0.708/0.782   | 0.643/0.658              | 0.567/0.372   | 0.826/0.822   | 0.685    | 0.843
3-level | 0.763/0.776   | 0.796/0.804 (2+3 merged) |               | 0.839/0.770   | 0.790    | 0.859
Spatial Features vs. All Features
• Content features do provide complementary information to spatial features in measuring block importance

Features | Level 1 (P/R) | Level 2 (P/R) | Level 4 (P/R) | Micro-F1 | Micro-Acc
Spatial  | 0.714/0.684   | 0.754/0.769   | 0.805/0.841   | 0.748    | 0.832
All      | 0.763/0.776   | 0.796/0.804   | 0.839/0.770   | 0.790    | 0.859
Block Importance Model vs. Human Assessors

           | Level 1 (P/R) | Level 2 (P/R) | Level 3 (P/R) | Micro-F1 | Micro-Acc
Assessor 1 | 0.817/0.856   | 0.871/0.857   | 0.934/0.871   | 0.858    | 0.906
Assessor 2 | 0.756/0.834   | 0.815/0.782   | 0.816/0.715   | 0.792    | 0.861
Assessor 3 | 0.864/0.815   | 0.838/0.881   | 0.852/0.809   | 0.849    | 0.899
Assessor 4 | 0.904/0.684   | 0.797/0.908   | 0.827/0.912   | 0.830    | 0.887
Assessor 5 | 0.849/0.924   | 0.895/0.882   | 0.938/0.762   | 0.882    | 0.921
Average    | 0.838/0.823   | 0.843/0.862   | 0.873/0.814   | 0.842    | 0.895
Our model  | 0.763/0.776   | 0.796/0.804   | 0.839/0.770   | 0.790    | 0.859
Using Block-level PageRank to Improve Search

[Chart: average precision (roughly 0.115 to 0.165) as a function of the combining parameter (0.8 to 1.0), comparing BLPR-Combination with PR-Combination.]

• Search score = α · IR_Score + (1 − α) · PageRank score, where the PageRank score is computed either at page level (PageRank) or at block level (Block-level PageRank) and α is the combining parameter
• Block-level PageRank achieves a 15-25% improvement over PageRank (SIGIR'04)
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction (demo first)   (current section)
  – Object Integration
  – Object Ranking
  – Object Mining
• Conclusion
Collecting Object Information
• Data feeds?
  – Limited coverage
  – Fail to cover the long tail
• Data crawling?
  – The largest index wins
  – Data must be kept fresh
• Mining Web objects
  – Bridges unstructured and structured data
  – Handles data of huge volume
  – Adapts to the highly diverse and dynamic Web environment
Existing Approaches
• Basic idea
  – Convert HTML into a sequence of tokens or a tag tree
  – Discover patterns
• Representative methods
  – Wrapper generation
    • Manually write a wrapper
    • Induce a wrapper [Liu 2000], [Kushmerick 1997]
  – Extract structured data from Web pages that share a common template
    • Equivalence classes [Arasu 2003]
    • RoadRunner [Crescenzi 2001]
  – Extract data records within a Web page
    • OMINI: record-boundary discovery
    • IEPAD: pattern discovery on a PAT tree
    • MDR: repeated-node discovery
  – Extract data from tables in a Web page
    • Classify tables into genuine and non-genuine tables [Wang 2002]
    • Extract data from data tables [Chen 2002], [Lerman 2001]
Vision-based Approach for Web Object Extraction
Pipeline applied to object blocks:
1. Visual element identification
2. Similarity measure & clustering
3. Record identification & extraction
Object-level Information Extraction (IE)
• The problem: given an object element sequence E = e1, e2, …, eT, find the optimal label sequence L = l1, l2, …, lT, where each li takes a value from the attribute set A = {a1, a2, …, am}

[Figure: a digital-camera object block whose elements e1–e6 are mapped to the attributes name, price, description, brand, rating, and image (a1–a6).]
Object Extraction as Sequence Data Labeling
• Sequence characteristics: within a domain, attribute order is highly regular. The tables below show, for each attribute pair, the probability that the first attribute appears before the second.

Product (pair: P(before)) | Researcher (pair: P(before))
(name, desc): 1.000       | (name, tel): 1.000
(name, price): 0.987      | (name, email): 1.000
(image, name): 0.941      | (name, address): 1.000
(image, price): 0.964     | (address, email): 0.847
(image, desc): 0.977      | (address, tel): 0.906

Product: 100 product pages (964 product blocks). Researcher: 120 researchers' homepages (120 homepage blocks).
Extended Conditional Random Fields
• Our solution is based on an Extended Conditional Random Fields (ECRF) model:

  L* = argmax_L P(L | E, B, D)
     = argmax_L (1/Z) exp( Σ_{t=1}^{T} Σ_{k=1}^{K} λ_k f_k(l_{t−1}, l_t, E, B, D, t) )

where f_k(l_{t−1}, l_t, E, B, D, t) measures an arbitrary feature of the event that a label transition occurs at position t, E is the total element sequence, B is the corresponding block information, and D is the database.
Example Features in the Extended CRF Model
• Text features
  – Element contains only "$" and digits
  – Percentage of digits in the element, …
• Vision features
  – Font size, color, style, …
  – Element size & position
  – Separators: lines, blank areas, images, …
• Database features
  – Element matches an attribute of a record
  – Element matches key attributes of a record, …
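Feature functions of these three kinds might look like the following sketch (illustrative only; the element accessors and the database helper are our assumptions, and the actual ECRF features are defined in the paper):

```python
import re

def text_feature_price_like(element, label):
    """Fires when a price-labeled element contains only '$' and digits."""
    return label == "price" and bool(re.fullmatch(r"\$[\d.,]+", element.text))

def vision_feature_large_font(element, label):
    """Fires when a name-labeled element uses the block's largest font."""
    return label == "name" and element.font_size >= element.block_max_font_size

def db_feature_known_brand(element, label, database):
    """Fires when a brand-labeled element matches the brand attribute
    of some record already in the object database."""
    return label == "brand" and database.has_attribute_value("brand", element.text)
```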
Experimental Results

[Chart: instance accuracy (%) of OLIE vs. CRF vs. HMM on two datasets, paper citations and paper headers.]
Information Integration

[Diagram: extracted digital-camera attributes (name, price, description, brand, rating, image) from Website 1 through Website N are integrated into a database, which in turn feeds back to improve the IE process.]
Results with Databases of Various Sizes

[Chart: extraction accuracy rising from roughly 84% to 92% as the database size grows from 0 to 150,000 records.]
2D Conditional Random Fields for Web Information Extraction
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang and Wei-Ying Ma
August 9, 2005
Limitations of Linear-Chain CRFs

[Diagram: a 2D block with attributes image, name, description, price, and several "other" elements, flattened into different one-dimensional orders by sequentialization.]

• Attributes are laid out two-dimensionally
• Which sequentialization is better?
• Two-dimensional interactions are seriously lost
2D Conditional Random Fields

  p(y | x) = (1/Z(x)) exp( Σ_{e∈E} Σ_k λ_k f_k(e, y|_e, x) + Σ_{v∈V} Σ_k μ_k g_k(v, y|_v, x) )

  Z(x) = Σ_y exp( Σ_{e∈E} Σ_k λ_k f_k(e, y|_e, x) + Σ_{v∈V} Σ_k μ_k g_k(v, y|_v, x) )

The model is defined over a 2D grid of label variables y_{i,j} conditioned on the observations x, with feature functions f_k over edges e ∈ E and g_k over vertices v ∈ V.
Modeling an Object Block

[Figure: an object block modeled as a 2D grid of label variables y_{i,j} with corresponding observations x_{i,j}; positions without content are padded with null nodes (y_null, x_null) so that every row has equal length.]
Experiment -- Dataset
• Randomly crawled 572 Web pages and collected 2,500 Web blocks using the vision-based segmentation technology
• Two types of Web blocks
  – ODS: one-dimensional blocks (information has no two-dimensional interactions)
  – TDS: two-dimensional blocks (information does have two-dimensional interactions)
• Training set: 500 Web blocks (400 TDS + 100 ODS)
• Testing sets: ODS (1,000) and TDS (1,000)
Experiment -- Evaluation Criteria
• Precision: the percentage of returned elements that are correct
• Recall: the percentage of correct elements that are returned
• F1 measure: the harmonic mean of precision and recall
• Average F1 measure: the average of the F1 values of the different attributes
• Block instance accuracy: the percentage of blocks whose important attributes (name, image, and price) are all correctly labeled
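For concreteness, block instance accuracy could be computed as in this sketch (the element/block structure is assumed; key attributes follow the definition above):

```python
KEY_ATTRS = {"name", "image", "price"}

def block_instance_accuracy(blocks):
    """Fraction of blocks whose key attributes are all labeled correctly."""
    correct = 0
    for block in blocks:
        if all(elem.predicted == elem.gold
               for elem in block.elements
               if elem.gold in KEY_ATTRS):
            correct += 1
    return correct / len(blocks)
```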
Experiment -- Results

[Chart: precision, recall, and F1 for name, image, price, and description, plus average F1 and block instance accuracy (values roughly in the 0.5 to 1.0 range), comparing linear-chain CRFs with 2D CRFs.]
Simultaneous Record Detection and Attribute Labeling in Web Data Extraction
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma
SIGKDD 2006
De-coupled Web Object Extraction
• Papers: Nie et al., ICDE'06 & Zhu et al., ICML'05
• Technical Transfer of the Year Award (MSRA 2006)
• Basic idea: a two-step pipeline

[Diagram: Step 1, record detection and element segmentation (Kristjansson et al., AAAI'04; Nie et al., ICDE'06); Step 2, attribute labeling over the detected records (Zhu et al., ICML'05).]
Inefficiencies
• Error propagation
  – Limits overall performance
• Lack of semantics in record detection
  – Semantics help identify records
• Lack of mutual interactions in attribute labeling
  – Records in the same page are related and mutually constrained
• First-order Markov assumption
  – Fails to incorporate long-distance dependencies
Vision-based Page Representation

[Diagram: a Web page represented as a tree whose data-record nodes contain leaf blocks labeled image, name, description, price, and note.]

• A Web page can be represented as a vision-tree [Cai et al., 2004]
  – Makes use of page layout features such as font, color, and size
  – Each node represents a data region in the Web page, called a block
Joint Web Object Extraction
• Definition 1 (record detection): given a vision-tree, record detection is the task of locating the minimum set of blocks that contain the content of a record
• Definition 2 (attribute labeling): for each identified record, attribute labeling is the process of assigning attribute labels to the leaf blocks (or elements) within the record
• Definition 3 (joint optimization of record detection and attribute labeling):
  – let x = {x0, x1, …, xN} be the features of all the blocks
  – let y = {y0, y1, …, yN} be one possible label assignment of the corresponding blocks
  – find y* = argmax_y p(y | x) — let the Hierarchical CRF model do it!
Hierarchical CRF Model
• Assumptions
  – Sibling variables interact directly
  – Non-sibling variables are conditionally independent
• Parameter estimation and labeling can be solved using the standard junction tree algorithm
• Details in our paper

[Figure: a hierarchical model over the vision-tree, with numbered nodes 0–26 arranged in four levels.]

  p(y | x) = (1/Z(x)) exp( Σ_{v∈V} Σ_k μ_k g_k(v, y|_v, x) + Σ_{e∈E} Σ_k λ_k f_k(e, y|_e, x) + Σ_{t∈T} Σ_k γ_k h_k(t, y|_t, x) )

with vertex features g_k, edge features f_k, and features h_k over the larger cliques t ∈ T.
HCRF Model for Web Object Extraction
• Inter-level interactions
  – Dependencies between parents and children
  – Different from the multi-scale CRF model [He et al., 2004]
• Long-distance dependencies
  – Captured through the dependencies at various levels plus the inter-level interactions
• Flexibility to incorporate any useful feature
  – The HCRF model is a conditional model, and also a CRF model
• Computational efficiency

[Figures: example product blocks; inner vision-tree nodes carry labels such as "contains name and price", "contains description", and "contains image", while leaf nodes are labeled product name, product image, product description, and product price.]
Empirical Evaluation
• Two datasets for two types of Web pages
  – List dataset (LDST): 771 list pages (200 for training, 571 for testing)
  – Detail dataset (DDST): 450 detail pages (150 for training, 300 for testing)

[Chart: precision, recall, and F1 (0 to 1 scale) comparing 2D CRFs with HCRFs.]
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration   (current section)
  – Object Ranking
  – Object Mining
• Conclusion
Object Identification
• Existing approaches use only text similarity
• Connection strength in the object-relationship graph is another important piece of evidence
• The author identification problem:
  – Multiple researchers share the same name (e.g., Lei Zhang)
  – The same researcher has multiple names (e.g., Alon Levy and Alon Y. Halevy)
  – Goal: identify all papers by a given researcher

[Diagram: a relationship graph linking papers, conferences, and author names.]
Web Connections
• Local information is incomplete
• The Web is a good source for validating connections between objects
• Evidence: co-appearance in the same sentence, Web page, or Website

[Diagram: objects O1 and O2 connected through their co-appearances on the Web.]
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration
  – Object Ranking   (current section)
  – Object Mining
• Conclusion
Object-level Ranking: Bringing Order to Web Objects
Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, Wei-Ying Ma
Microsoft Research Asia
(presented by Zaiqing Nie)
Object Relationship Graph
• Links are of different types: paper->paper, author->paper, conference->paper
• Different link types have different semantics and affect the popularity of the related objects differently
Back-Links of an Object on the Web
• Add a Popularity Propagation Factor (PPF) to each relationship link
  – Links of the same type share the same factor
• The popularity of an object is also affected by the popularity of the Web pages containing the object
Biased Random Surfing Behavior
• A random object finder model
  – Starts a random walk on the Web to find the first seed object
  – Then follows only the relationship links
  – Eventually gets bored
  – Restarts the random walk on the Web to find another seed object
• The popularity of an object depends on
  – The probability of finding the object through the Web graph
  – The probability of finding the object through the object relationship graph
The PopRank Model

  R_X = ε · R_EX + (1 − ε) · Σ_Y γ_YX · M_YX^T · R_Y

where:
• X = {x1, x2, …, xn} and Y = {y1, y2, …, yn} are the objects of type X and type Y;
• R_X and R_Y are the vectors of popularity rankings of objects of type X and type Y;
• M_YX are adjacency matrices: m_yx = 1/Num_X(y) if there is a relationship link from object y to object x, where Num_X(y) denotes the number of links from object y to any objects of type X; m_yx = 0 otherwise;
• γ_YX denotes the popularity propagation factor of relationship links from objects of type Y to objects of type X;
• R_EX is the vector of Web popularity of objects of type X;
• ε is a damping factor: the probability that the "random object finder" gets bored of following the object relationship links and starts looking for another object through the Web graph.
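A power-iteration sketch of this fixed point in NumPy (graph construction, the dict layout, and the iteration count are our assumptions):

```python
import numpy as np

def poprank(R_E, M, gamma, eps=0.15, iters=100):
    """Iterate R_X = eps * R_EX + (1 - eps) * sum_Y gamma_YX * M_YX^T * R_Y.

    R_E:   dict type -> Web-popularity vector for that type's objects
    M:     dict (Y, X) -> row-normalized adjacency matrix from type Y to type X
    gamma: dict (Y, X) -> popularity propagation factor for Y->X links
    """
    R = {t: v.copy() for t, v in R_E.items()}
    for _ in range(iters):
        R_new = {}
        for X in R:
            acc = np.zeros_like(R[X])
            for (Y, X2), M_YX in M.items():
                if X2 == X:
                    acc += gamma[(Y, X)] * (M_YX.T @ R[Y])
            R_new[X] = eps * R_E[X] + (1 - eps) * acc
        R = R_new
    return R
```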
How to Assign PPF Factors
• Impractical to manually assign PPF factors
• Easy to collect some partial rankings of the objects from domain experts
  – An example: SIGMOD -> VLDB -> ICDE -> ER
• A typical parameter-optimization problem
  – Select the combination of PPF factors for the PopRank model whose resulting rankings of the training objects match the expert rankings as closely as possible
  – Explore the search space using a simulated annealing approach; a sketch follows
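The simulated annealing search can follow the textbook pattern (a sketch; rank_distance is assumed to wrap the PopRank calculator and the comparison against the expert rankings, and the neighbor step and cooling schedule are illustrative):

```python
import math, random

def anneal_ppf(initial_ppf, rank_distance, t0=1.0, cooling=0.95, steps=200):
    """Search for PPF factors minimizing distance to the expert ranking."""
    current = best = initial_ppf
    d_cur = d_best = rank_distance(current)   # runs PopRank, compares rankings
    t = t0
    for _ in range(steps):
        cand = {k: min(1.0, max(0.0, v + random.gauss(0, 0.05)))
                for k, v in current.items()}  # a neighbor of the current point
        d = rank_distance(cand)
        # Always accept improvements; accept worse candidates with
        # probability exp(-delta / t), which shrinks as t cools.
        if d < d_cur or random.random() < math.exp((d_cur - d) / t):
            current, d_cur = cand, d
            if d < d_best:
                best, d_best = cand, d
        t *= cooling
    return best
```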
Searching for PPF Factors

[Flowchart: starting from an initial combination of PPFs and the link graph, the PopRank calculator produces rankings whose distance to the expert ranking is estimated; a combination better than the best so far becomes the new best, a worse one may still be accepted, and the next candidate is selected from the neighbors of the best.]
Challenges Facing our Learning Approach
• It may take hours or days to try and evaluate a single combination of PPF factors on a large graph
• Prohibitively expensive to try hundreds of combinations
• Observation: the effect of a link's PPF on an object decreases as the "relationship distance" increases
• Solution: use a subgraph that includes the training objects and their closely related objects to approximate the full graph (sketched below)
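Extracting such a subgraph is a breadth-first walk out to relationship distance k from the training objects (a sketch; the neighbors callback is assumed):

```python
from collections import deque

def k_hop_subgraph(neighbors, training_objects, k):
    """Collect all objects within relationship distance k of the
    training objects; neighbors(obj) yields the objects linked to obj."""
    seen = set(training_objects)
    frontier = deque((o, 0) for o in training_objects)
    while frontier:
        obj, dist = frontier.popleft()
        if dist == k:
            continue                      # do not expand past distance k
        for nbr in neighbors(obj):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen
```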
Subgraph Selection

[Flowchart: starting from an initial relationship distance and the link graph, the PopRank calculator ranks the subgraph and the ranking-distance estimator compares the result against the ranking from the full graph; if the distance exceeds the stop threshold, the relationship distance is increased and the loop repeats, otherwise the subgraph is accepted.]
Experimental Study
• Datasets
  – 7 million object relationship links of three different types
  – 1 million papers, 650,000 authors, 1,700 conferences, and 480 journals
  – 14 partial ranking lists containing ranking information for 67 objects (8 lists for training and 6 for testing)
Experimental Results for Different Subgraphs

[Charts: learning time (0 to 18,000 s) and ranking distance (0.15 to 0.5) as functions of the subgraph diameter k (1 through 7, and the full graph).]
Experimental Results for Different Stop Thresholds

[Charts: learning time (0 to 90,000 s) and ranking distance (0.15 to 0.3) as functions of the stop threshold δ.]
Conclusion (PopRank)
• A PopRank model for calculating object popularity scores
  – Based on both the Web graph and the object relationship graph
• An automated approach for assigning popularity propagation factors
• Its effectiveness is shown in Libra
  – Significantly better than PageRank
• Generally applicable to most vertical search domains
  – Product search, movie search, …
Web Object Retrieval
• Information about a Web object is extracted from multiple sources
  – Inconsistent copies
  – The reliability assumption no longer holds
• Inconsistency example:

Source       | Title                                                                                    | Authors
Ground truth | Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives   | Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger, David Nagle, Erik Riedel
CiteSeer     | Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives   | Christopher R. Lumb, Jiri...
DBLP         | Towards Higher Disk Head Utilization: Extracting "Free" Bandwidth from Busy Disk Drives | Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger, David Nagle, Erik Riedel
Unreliability of Information about Web Objects

[Diagram: a Web object with attributes 1…n, each carrying an importance weight imp_i, assembled from object blocks 1…m, each carrying an extraction confidence conf_j.]

• Sources of unreliability
  – Unreliable data sources
  – Incorrect object detection
  – Incorrect attribute-value extraction
Web Object Retrieval (Cont.)
• A language model for object retrieval
• Balancing structured and unstructured retrieval
  – Block-level unstructured object retrieval
  – Attribute-level retrieval
  – Using the confidence of the extracted object information as the parameter to find the balance

[Diagram: records extracted from Web sources 1…m (with source weights α and extraction confidences γ) yield both a record-level representation and, via attribute extraction, an attribute-level representation of the Web object's attributes 1…n (with attribute weights β). The object language model P(w|O), garbled in this transcript, mixes the record-level and attribute-level component models over the sources, weighted by the α and γ terms.]
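One way to read this balance in code (our sketch under stated assumptions, not the paper's exact estimator): each source's record is scored both as a single unstructured bag of words and attribute by attribute, with the extraction confidence deciding how much to trust the structured view. All field names here (alpha, gamma, beta, p_record, p_attr) are hypothetical.

```python
def object_score(query_terms, sources):
    """Score a Web object assembled from several extraction sources.

    Each source s carries: a weight s.alpha, an extraction confidence
    s.gamma, a record-level language model s.p_record(w), and
    per-attribute models s.p_attr[a](w) with attribute weights s.beta[a].
    """
    score = 1.0
    for w in query_terms:
        p = 0.0
        for s in sources:
            structured = sum(s.beta[a] * s.p_attr[a](w) for a in s.p_attr)
            unstructured = s.p_record(w)
            # High-confidence extractions lean on the structured model;
            # low-confidence ones fall back to the unstructured model.
            p += s.alpha * (s.gamma * structured
                            + (1.0 - s.gamma) * unstructured)
        score *= p
    return score
```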
Experimental Results
• Models compared:
  – Bag of Words (BW)
  – Unstructured Object Retrieval (UOR)
  – Multiple Weighted Fields (MWF)
  – Structured Object Retrieval (SOR)
  – Balancing Structured and Unstructured Retrieval (BSUR)

[Chart: precision (roughly 0.65 to 0.9) of the five models as the extraction error rate grows from 10% to 80%.]
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration
  – Object Ranking
  – Object Mining   (current section)
• Conclusion
Research Community Mining
• Motivation
  – Discover research communities and their important papers and authors
• A community is described as a set of concentric circles
  – Core objects sit in the center
  – Affiliated objects surround the core with different ranks
Conclusion
• An object-level vertical search model is proposed
• Key technologies to build an object-level vertical search engine:
  – Object extraction
  – Object identification
  – Object popularity ranking
  – Object community mining
• More applications
  – Yellow-page search
  – Job search
  – Mobile search
  – Movie search
  – …