Object-Level Vertical Search
Zaiqing Nie, Microsoft Research Asia
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration
  – Object Ranking
  – Object Mining
• Conclusion
Terminology
• Web Object
  – A collection of (semi-)structured Web information about a real-world object
  – e.g., person, product, job, movie, restaurant, …
• Object-Level Search
  – Search based on Web objects
• Vertical Search
  – Search for information in a specific domain
Object-Level Search vs. Page-Level Search

Page-Level Search
• Technology: Information Retrieval (IR), with the page as the unit of retrieval
• Pros: ease of authoring; ease of use
• Cons: limited query capability; sifting through hundreds of (irrelevant) pages

Object-Level Search
• Technology: Database (DB), with the object as the unit of retrieval
• Pros: powerful query capability; direct answers; aggregate answers
• Cons: where and how to get the objects?
  – A large portion of Web content is inherently (semi-)structured
Why Vertical Search?
• Sites dedicated to a specific domain
• Audiences with specific interests
• Easier to build object-level search in a domain
  – Data in some domains are more structured and uniform
  – Easy to define object types and schemas
Web Search 2.0 -> Web Search 3.0

[Diagram: search engines plotted on two axes, page-level vs. object-level and general vs. vertical. Current Web search sits at page-level general search; Google Scholar and Libra Academic Search sit at object-level vertical search. Tagline: from relevance to intelligence.]
Goal of Object-Level Vertical Search
• Make the Web search engine
  – as scalable as an IR system
  – as effective as a DB system
Commercial Data Statistics
Initial results from manually labeling 51K randomly selected Web pages (22.5K English pages) with a predefined categorization:

Page Category                    | # English Pages | % of 22.5K English Pages | # Pages in 9B Index (All Languages)
Product transactional pages      | 1,835           | 8.15%                    | 733M
Product list pages               | 995             | 4.4%                     | 396M
Product review pages             | 196             | 0.9%                     | 81M
Commercial services              | 196             | 0.9%                     | 81M
Hotels, travel packages, tickets | 131             | 0.6%                     | 54M
Location                         | 112             | 0.5%                     | 45M
Newspapers                       | 92              | 0.4%                     | 36M
Job listings                     | 18              | 0.08%                    | 7.2M
Music                            | 16              | 0.07%                    | 6.3M
Movie                            | 15              | 0.07%                    | 6.3M
Architecture

[Diagram: crawled Web pages flow through object crawling and classification into domain-specific extractors (location, product, conference, author, paper). Matching integration components (paper, author, conference, location, product integration) populate Web object warehouses, e.g., a scientific Web object warehouse and a product object warehouse. On top of the warehouses run PopRank, object relevance, object community mining, and object categorization.]
Outline
• Overview
• Demo: Libra Academic Search   (current section)
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration
  – Object Ranking
  – Object Mining
• Conclusion
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation   (current section)
  – Web Object Extraction
  – Object Integration
  – Object Ranking
  – Object Mining
• Conclusion
Motivation
• Problems of treating a Web page as an atomic unit
  – A Web page usually contains more than pure content
    • Noise: navigation, decoration, interaction, …
  – A Web page often covers multiple topics
• A Web page has internal structure
  – A two-dimensional logical structure and a visual layout presentation
  – More structured than a free-text document, less structured than a structured document
• Layout is the third dimension of a Web page
  – 1st dimension: content
  – 2nd dimension: hyperlinks
Object Information on the Web
Examples: scientific papers, researchers, product items, business locations, images, jobs.
Is DOM a Good Representation of Page Structure?
• Page segmentation using the DOM
  – Extract structural tags such as P, TABLE, UL, TITLE, H1~H6, etc.
• Page segmentation using the DOM, content, and links
  – Record-boundary discovery by heuristics
  – Fine-grained topic distillation by link analysis
• Function-based Object Model (FOM)
  – Define a function for each object and partition the page based on these functions
• But the DOM is geared toward content display and does not necessarily reflect semantic structure
• How about XML?
Vision-based Content Structure
• Goal: extract the content structure of a Web page based on visual cues
  – Typical visual cues: position, lines, blank areas, color, font size, images, …
  – Extracted from the rendering result of a Web browser
• Assumption: the content structure suggested by the visual display reflects the semantic partition of the content, and visual cues help to build that content structure
Definition of Vision-based Content Structure
• A hierarchical structure of layout blocks
  – A layout block is a basic object or a group of basic objects
  – A basic object is a leaf node in the DOM tree of the page
• Can be formally described as a triple (O, Φ, δ):
  – O = {Ω1, Ω2, …, ΩN} is a finite set of layout blocks
  – Φ = {φ1, φ2, …, φT} is a finite set of visual separators
  – δ = O × O → Φ ∪ {NULL} records which visual separator (if any) separates each pair of layout blocks
  – Each layout block is a sub-Web-page with a similar internal structure
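For concreteness, the triple above might be held in data types like the following sketch (the names and fields are ours, not from the VIPS papers):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Block:
    """A layout block: a basic object (DOM leaf) or a group of blocks."""
    doc: float                          # Degree of Coherence of this block
    children: list["Block"] = field(default_factory=list)

@dataclass
class Separator:
    """A visual separator between two sibling blocks."""
    weight: float                       # from blank size, <HR> overlap, font/color change

@dataclass
class ContentStructure:
    """The triple (O, Phi, delta) from the definition above."""
    blocks: list[Block]                                   # O
    separators: list[Separator]                           # Phi
    relation: dict[tuple[int, int], Optional[Separator]]  # delta: block pair -> separator or None
```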
An Example of Vision-based Content Structure

[Figure: a Web page segmented into top-level blocks VB1–VB4; VB2 splits into VB2_1, VB2_2, VB2_3, which split further, e.g. VB2_2 into VB2_2_1, VB2_2_2, VB2_2_3.]

• A hierarchical structure of layout blocks
• A Degree of Coherence (DoC) is defined for each block
  – It measures the intra-block coherence
  – The DoC of a child block must be no less than its parent's
• A Permitted Degree of Coherence (PDoC) can be predefined to achieve different granularities of content structure
  – Segmentation stops only when every block's DoC is no less than the PDoC
  – The smaller the PDoC, the coarser the resulting content structure
VIPS (VIsion-based Page Segmentation): An Algorithm to Effectively Extract Content Structure

Steps: visual block extraction, visual separator detection, content structure construction, iterating until the granularity requirement is met.

• Step 1: Visual block extraction
  – Iteratively find all appropriate visual blocks
  – Visual cues in this stage: tag cue, color cue, text cue, size cue
• Step 2: Visual separator detection
  – A visual separator is a horizontal or vertical line that visually crosses no blocks
  – Each separator is weighted according to a set of patterns
  – Maximally weighted separators are chosen as the real separators
• Step 3: Content structure construction
  – Blocks that are not separated are merged
  – Calculate the DoC of each block and check whether it meets the granularity requirement (i.e., DoC > PDoC)
  – Blocks that fail are iteratively partitioned further
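A minimal sketch of this loop (assuming helper functions extract_blocks, detect_separators, and construct_structure implement the three steps; this is an illustration of the control flow, not the actual VIPS implementation):

```python
def vips_segment(node, pdoc):
    """Recursively segment a rendered DOM subtree into a block hierarchy.

    Refinement of a block stops once its Degree of Coherence (DoC)
    reaches the permitted threshold (PDoC).
    """
    blocks = extract_blocks(node)                    # step 1: visual block extraction
    separators = detect_separators(blocks)           # step 2: separator detection + weighting
    root = construct_structure(blocks, separators)   # step 3: merge blocks around
                                                     # maximally weighted separators
    for child in root.children:
        if child.doc < pdoc:                         # granularity not yet met:
            child.children = vips_segment(child.dom_node, pdoc).children  # iterate
    return root
```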
Step 1: Visual Block Extraction (Cont.)
• Iteratively find all appropriate visual blocks contained in the current sub-tree
• Visual cues used to decide whether to divide a DOM node:
  – Tag cue: tags such as <HR> are often used to separate different topics visually
  – Color cue: DOM nodes with different background colors belong to different blocks
  – Text cue: if most of the children of a DOM node are text nodes, we prefer not to divide it
  – Size cue: we predefine a relative size threshold (compared with the size of the whole page or sub-page) for different tags
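These cues could combine into a divide/no-divide heuristic like this sketch (the accessors contains_tag, background_color, is_text, and render_area, and the threshold value, are hypothetical):

```python
def should_divide(node, page_area, size_threshold=0.2):
    """Decide whether to divide a DOM node into smaller visual blocks."""
    # Tag cue: separator-like tags (e.g. <HR>) inside the node suggest
    # multiple topics, so divide.
    if contains_tag(node, "HR"):
        return True
    # Color cue: children with different background colors form
    # visually distinct regions.
    if len({child.background_color for child in node.children}) > 1:
        return True
    # Text cue: if most children are text nodes, prefer to keep it whole.
    text_children = sum(1 for c in node.children if c.is_text)
    if node.children and text_children / len(node.children) > 0.5:
        return False
    # Size cue: nodes smaller than a relative threshold stay atomic.
    if node.render_area / page_area < size_threshold:
        return False
    return True
```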
Step 2: Visual Separator Detection
• A visual separator is represented by (Ps, Pe)
  – Ps is the start pixel and Pe is the end pixel
  – Separators are horizontal or vertical lines in a Web page that visually cross no blocks in the pool
• Two parts: separator detection and separator weight setting
Step 2: Visual Separator Detection (Cont.)
• Separator detection
  – Start with a single separator covering the whole page, (Ptl, Pbr)
  – Add blocks to the pool one by one and update the separators:
    • If a block is contained in a separator, split the separator
    • If a block crosses a separator, update (shrink) it
    • If a block covers a separator, remove it
  – Finally, remove the separators on the page border
[Figure: separators S1–S4 being split, updated, and removed as blocks 1–4 are added to the pool.]
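Following the update rules above, a sketch for horizontal separators, representing each separator as a (start, end) pixel interval (the representation is our assumption):

```python
def update_separators(separators, block):
    """Update horizontal separators when a block is added to the pool.

    Each separator is an interval (start, end) in page coordinates;
    the block occupies (block.top, block.bottom).
    """
    updated = []
    for start, end in separators:
        if start < block.top and block.bottom < end:
            # Block is contained in the separator: split it in two.
            updated += [(start, block.top), (block.bottom, end)]
        elif block.top <= start <= block.bottom < end:
            # Block crosses the separator's start: shrink it.
            updated.append((block.bottom, end))
        elif start < block.top <= end <= block.bottom:
            # Block crosses the separator's end: shrink it.
            updated.append((start, block.top))
        elif block.top <= start and end <= block.bottom:
            # Block covers the separator: remove it.
            continue
        else:
            updated.append((start, end))
    return updated
```

Usage would start from the single whole-page separator, e.g. `seps = [(page_top, page_bottom)]`, fold in each block with `seps = update_separators(seps, block)`, and finally drop the separators touching the page border.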
Step 2: Visual Separator Detection (Cont.)
• Setting the weight of a separator, based on:
  – The size of the blank area between the blocks
  – Overlap of the separator with tags such as <HR>
  – Differences in font size between the two sides of the separator
  – Differences in color between the two sides of the separator
Step 3: Content Structure Construction
• Maximally weighted separators are chosen as the real separators
• Blocks that are not separated are merged
• A content structure is built at this level
• Each sub-block is checked to see whether it meets the granularity requirement
  – Those that fail are iteratively partitioned further
  – Once all sub-blocks meet the requirement, we have the final content structure of the Web page
A VIPS Example

[Figure: a DOM subtree of nested TABLE, TD, P, CENTER, and text nodes mapped to visual blocks Block1–Block6, forming the sub-hierarchy VB2_2_2 with children VB2_2_2_1 through VB2_2_2_3 and grandchildren such as VB2_2_2_1_1 and VB2_2_2_3_1.]
Example of Web Page Segmentation (2)
• Segmentation can also be applied to Web image retrieval
  – e.g., extracting the text surrounding an image
Experiments
• Manual evaluation of page segmentation
  – 140 pages selected from 14 Yahoo! categories

Human judgment | Number of pages
Perfect        | 86
Satisfactory   | 50
Failed         | 4
Web Page Block: a Better Information Unit
• Page segmentation: vision-based approach (WWW'03 paper)
• Block importance modeling: statistical learning (WWW'04 paper)

[Figure: a Web page segmented into blocks annotated with importance levels high, medium, and low.]
Block Importance (WWW'04)
• Page importance
  – Important pages vs. unimportant pages
  – Addressed by HITS and PageRank
• Block importance
  – Valuable information vs. noisy information
  – No established counterpart: the question addressed here
A User Study of Block Importance
• Question: do people have consistent opinions about the importance of the same block in a page?
• Subjective importance
  – From the users' view
  – Attention: concentration of mental powers upon an object; a close or careful observing or listening
  – Affected by users' purposes and preferences
• Objective importance
  – From the authors' view
  – The degree of correlation between a block and the theme of the Web page
Settings of the User Study
• Data
  – 600 Web pages from 405 sites in 3 Yahoo! categories: news, science, and shopping
  – Each category includes 200 pages, with diverse layouts and contents
  – The 600 pages were segmented into 4,539 blocks using VIPS
• Importance labeling
  – 5 human assessors manually labeled each block with a 4-level importance value:
  – Level 1: noisy information such as advertisements, copyright notices, decoration, etc.
  – Level 2: useful information not very relevant to the topic of the page, such as navigation, directories, etc.
  – Level 3: information relevant to the theme of the page but not of prominent importance, such as related topics, topic indexes, etc.
  – Level 4: the most prominent part of the page, such as headlines and main content
Result Analysis
• Users do have consistent opinions when judging the importance of blocks

Levels    | 3/5 agreement | 4/5 agreement | 5/5 agreement
1,2,3,4   | 0.929         | 0.535         | 0.237
1,(2,3),4 | 0.995         | 0.733         | 0.417
(1,2,3),4 | 1.000         | 0.932         | 0.828
Result Analysis (cont.)

Levels    | 3/5 agreement | 4/5 agreement | 5/5 agreement
(1,2),3,4 | 0.965         | 0.760         | 0.562
1,(2,3),4 | 0.995         | 0.733         | 0.417
1,2,(3,4) | 0.963         | 0.614         | 0.318
(1,3),2,4 | 0.965         | 0.553         | 0.244
1,3,(2,4) | 0.965         | 0.555         | 0.248
(1,4),2,3 | 0.934         | 0.539         | 0.240

• Levels 2 and 3 are the blurriest to distinguish
Block Importance Model
• A block importance model is formalized as a mapping f: block features → block importance
• Block features: content features
  – Absolute:
    • ImgNum, ImgSize
    • LinkNum, LinkTextLength
    • InnerTextLength
    • InteractionNum, InteractionSize: <INPUT> and <SELECT>
    • FormNum, FormSize: <FORM>
  – Relative (normalized counterparts of the above)
Block Importance Model (cont.)
• Block features: spatial features
  – Absolute: {BlockCenterX, BlockCenterY, BlockRectWidth, BlockRectHeight}
  – Relative: {BlockCenterX/PageWidth, BlockCenterY/PageHeight, BlockRectWidth/PageWidth, BlockRectHeight/PageHeight}
  – Window-normalized: BlockRectHeight is replaced by BlockRectHeight/WindowHeight, and BlockCenterY is modified accordingly
Learning Block Importance
• Training set T: labeled blocks (x, y)
  – x: feature representation of a block
  – y: importance label
• Learn a function f such that Σ_{(x,y)∈T} (f(x) − y)² is minimized
• Learning algorithms
  – Regression by neural network (RBF network)
  – Classification by Support Vector Machines (linear kernel and Gaussian RBF kernel)
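A sketch of this setup with scikit-learn, using feature names from the slides (the labeled_blocks collection and its attributes are assumed; this shows the shape of the experiment, not the paper's exact configuration):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def block_features(block, page):
    """Spatial + content features, as listed on the previous slides."""
    return [
        block.center_x / page.width,        # relative spatial features
        block.center_y / page.height,
        block.rect_width / page.width,
        block.rect_height / page.height,
        block.img_num, block.link_num,      # content features
        block.inner_text_length, block.form_num,
    ]

# X: one feature vector per labeled block; y: importance labels 1..4.
X = [block_features(b, b.page) for b in labeled_blocks]
y = [b.importance for b in labeled_blocks]

# The talk compares an RBF-network regressor with SVM classifiers
# (linear and Gaussian RBF kernels); shown here: the RBF-kernel SVM.
model = SVC(kernel="rbf")
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross validation
print("accuracy: %.3f" % scores.mean())
```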
Experiments
• Experimental setup
  – 600 labeled Web pages from 405 sites
  – 4,517 blocks on which at least 3 of the 5 assessors agreed about the importance
  – 5-fold cross validation
  – Measures: Micro-F1 and Micro-Accuracy
3-level vs. 4-level Importance
• For the 4-level importance model, the precision and recall of levels 2 and 3 are much lower than those of levels 1 and 4
• Combining levels 2 and 3 increases performance significantly

Model   | Level 1 (P/R) | Level 2 (P/R)            | Level 3 (P/R) | Level 4 (P/R) | Micro-F1 | Micro-Acc
4-level | 0.708/0.782   | 0.643/0.658              | 0.567/0.372   | 0.826/0.822   | 0.685    | 0.843
3-level | 0.763/0.776   | 0.796/0.804 (2+3 merged) |               | 0.839/0.770   | 0.790    | 0.859
Spatial Features vs. All Features
• Content features do provide complementary information to spatial features in measuring block importance

Features | Level 1 (P/R) | Level 2 (P/R) | Level 4 (P/R) | Micro-F1 | Micro-Acc
Spatial  | 0.714/0.684   | 0.754/0.769   | 0.805/0.841   | 0.748    | 0.832
All      | 0.763/0.776   | 0.796/0.804   | 0.839/0.770   | 0.790    | 0.859
Block Importance Model vs. Human Assessors

           | Level 1 (P/R) | Level 2 (P/R) | Level 3 (P/R) | Micro-F1 | Micro-Acc
Assessor 1 | 0.817/0.856   | 0.871/0.857   | 0.934/0.871   | 0.858    | 0.906
Assessor 2 | 0.756/0.834   | 0.815/0.782   | 0.816/0.715   | 0.792    | 0.861
Assessor 3 | 0.864/0.815   | 0.838/0.881   | 0.852/0.809   | 0.849    | 0.899
Assessor 4 | 0.904/0.684   | 0.797/0.908   | 0.827/0.912   | 0.830    | 0.887
Assessor 5 | 0.849/0.924   | 0.895/0.882   | 0.938/0.762   | 0.882    | 0.921
Average    | 0.838/0.823   | 0.843/0.862   | 0.873/0.814   | 0.842    | 0.895
Our model  | 0.763/0.776   | 0.796/0.804   | 0.839/0.770   | 0.790    | 0.859
Using Block-level PageRank to Improve Search

[Chart: average precision (roughly 0.115 to 0.165) as a function of the combining parameter (0.8 to 1.0), comparing BLPR-Combination with PR-Combination.]

• Search score = α · IR_Score + (1 − α) · PageRank score, where the PageRank score is computed either at page level (PageRank) or at block level (Block-level PageRank) and α is the combining parameter
• Block-level PageRank achieves a 15-25% improvement over PageRank (SIGIR'04)
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction (demo first)   (current section)
  – Object Integration
  – Object Ranking
  – Object Mining
• Conclusion
Collecting Object Information
• Data feeds?
  – Limited coverage
  – Fail to cover the long tail
• Data crawling?
  – The largest index wins
  – Data must be kept fresh
• Mining Web objects
  – Bridges unstructured and structured data
  – Handles data of huge volume
  – Adapts to the highly diverse and dynamic Web environment
Existing Approaches
• Basic idea
  – Convert HTML into a sequence of tokens or a tag tree
  – Discover patterns
• Representative methods
  – Wrapper generation
    • Manually write a wrapper
    • Induce a wrapper [Liu 2000], [Kushmerick 1997]
  – Extract structured data from Web pages that share a common template
    • Equivalence classes [Arasu 2003]
    • RoadRunner [Crescenzi 2001]
  – Extract data records within a Web page
    • OMINI: record-boundary discovery
    • IEPAD: pattern discovery on a PAT tree
    • MDR: repeated-node discovery
  – Extract data from tables in a Web page
    • Classify tables into genuine and non-genuine tables [Wang 2002]
    • Extract data from data tables [Chen 2002], [Lerman 2001]
Vision-based Approach for Web Object Extraction
Pipeline applied to object blocks:
1. Visual element identification
2. Similarity measure & clustering
3. Record identification & extraction
Object-level Information Extraction (IE)
• The problem: given an object element sequence E = e1, e2, …, eT, find the optimal label sequence L = l1, l2, …, lT, where each li takes a value from the attribute set A = {a1, a2, …, am}

[Figure: a digital-camera object block whose elements e1–e6 are mapped to the attributes name, price, description, brand, rating, and image (a1–a6).]
Object Extraction as Sequence Data Labeling
• Sequence characteristics: within a domain, attribute order is highly regular. The tables below show, for each attribute pair, the probability that the first attribute appears before the second.

Product (pair: P(before)) | Researcher (pair: P(before))
(name, desc): 1.000       | (name, tel): 1.000
(name, price): 0.987      | (name, email): 1.000
(image, name): 0.941      | (name, address): 1.000
(image, price): 0.964     | (address, email): 0.847
(image, desc): 0.977      | (address, tel): 0.906

Product: 100 product pages (964 product blocks). Researcher: 120 researchers' homepages (120 homepage blocks).
Extended Conditional Random Fields
• Our solution is based on an Extended Conditional Random Fields (ECRF) model:

  L* = argmax_L P(L | E, B, D)
     = argmax_L (1/Z) exp( Σ_{t=1}^{T} Σ_{k=1}^{K} λ_k f_k(l_{t−1}, l_t, E, B, D, t) )

where f_k(l_{t−1}, l_t, E, B, D, t) measures an arbitrary feature of the event that a label transition occurs at position t, E is the total element sequence, B is the corresponding block information, and D is the database.
Example Features in the Extended CRF Model
• Text features
  – Element contains only "$" and digits
  – Percentage of digits in the element, …
• Vision features
  – Font size, color, style, …
  – Element size & position
  – Separators: lines, blank areas, images, …
• Database features
  – Element matches an attribute of a record
  – Element matches key attributes of a record, …
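Feature functions of these three kinds might look like the following sketch (illustrative only; the element accessors and the database helper are our assumptions, and the actual ECRF features are defined in the paper):

```python
import re

def text_feature_price_like(element, label):
    """Fires when a price-labeled element contains only '$' and digits."""
    return label == "price" and bool(re.fullmatch(r"\$[\d.,]+", element.text))

def vision_feature_large_font(element, label):
    """Fires when a name-labeled element uses the block's largest font."""
    return label == "name" and element.font_size >= element.block_max_font_size

def db_feature_known_brand(element, label, database):
    """Fires when a brand-labeled element matches the brand attribute
    of some record already in the object database."""
    return label == "brand" and database.has_attribute_value("brand", element.text)
```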
Experimental Results

[Chart: instance accuracy (%) of OLIE vs. CRF vs. HMM on two datasets, paper citations and paper headers.]
Information Integration

[Diagram: extracted digital-camera attributes (name, price, description, brand, rating, image) from Website 1 through Website N are integrated into a database, which in turn feeds back to improve the IE process.]
Results with Databases of Various Sizes

[Chart: extraction accuracy rising from roughly 84% to 92% as the database size grows from 0 to 150,000 records.]
2D Conditional Random Fields for Web Information Extraction
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang and Wei-Ying Ma
August 9, 2005
Limitations of Linear-Chain CRFs

[Diagram: a 2D block with attributes image, name, description, price, and several "other" elements, flattened into different one-dimensional orders by sequentialization.]

• Attributes are laid out two-dimensionally
• Which sequentialization is better?
• Two-dimensional interactions are seriously lost
2D Conditional Random Fields

  p(y | x) = (1/Z(x)) exp( Σ_{e∈E} Σ_k λ_k f_k(e, y|_e, x) + Σ_{v∈V} Σ_k μ_k g_k(v, y|_v, x) )

  Z(x) = Σ_y exp( Σ_{e∈E} Σ_k λ_k f_k(e, y|_e, x) + Σ_{v∈V} Σ_k μ_k g_k(v, y|_v, x) )

The model is defined over a 2D grid of label variables y_{i,j} conditioned on the observations x, with feature functions f_k over edges e ∈ E and g_k over vertices v ∈ V.
Modeling an Object Block

[Figure: an object block modeled as a 2D grid of label variables y_{i,j} with corresponding observations x_{i,j}; positions without content are padded with null nodes (y_null, x_null) so that every row has equal length.]
Experiment -- Dataset
• Randomly crawled 572 Web pages and collected 2,500 Web blocks using the vision-based segmentation technology
• Two types of Web blocks
  – ODS: one-dimensional blocks (information has no two-dimensional interactions)
  – TDS: two-dimensional blocks (information does have two-dimensional interactions)
• Training set: 500 Web blocks (400 TDS + 100 ODS)
• Testing sets: ODS (1,000) and TDS (1,000)
Experiment -- Evaluation Criteria
• Precision: the percentage of returned elements that are correct
• Recall: the percentage of correct elements that are returned
• F1 measure: the harmonic mean of precision and recall
• Average F1 measure: the average of the F1 values of the different attributes
• Block instance accuracy: the percentage of blocks whose important attributes (name, image, and price) are all correctly labeled
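For concreteness, block instance accuracy could be computed as in this sketch (the element/block structure is assumed; key attributes follow the definition above):

```python
KEY_ATTRS = {"name", "image", "price"}

def block_instance_accuracy(blocks):
    """Fraction of blocks whose key attributes are all labeled correctly."""
    correct = 0
    for block in blocks:
        if all(elem.predicted == elem.gold
               for elem in block.elements
               if elem.gold in KEY_ATTRS):
            correct += 1
    return correct / len(blocks)
```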
Experiment -- Results

[Chart: precision, recall, and F1 for name, image, price, and description, plus average F1 and block instance accuracy (values roughly in the 0.5 to 1.0 range), comparing linear-chain CRFs with 2D CRFs.]
Simultaneous Record Detection and Attribute Labeling in Web Data Extraction
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma
SIGKDD 2006
De-coupled Web Object Extraction
• Papers: Nie et al., ICDE'06 & Zhu et al., ICML'05
• Technical Transfer of the Year Award (MSRA 2006)
• Basic idea: a two-step pipeline

[Diagram: Step 1, record detection and element segmentation (Kristjansson et al., AAAI'04; Nie et al., ICDE'06); Step 2, attribute labeling over the detected records (Zhu et al., ICML'05).]
Inefficiencies
• Error propagation
  – Limits overall performance
• Lack of semantics in record detection
  – Semantics help identify records
• Lack of mutual interactions in attribute labeling
  – Records in the same page are related and mutually constrained
• First-order Markov assumption
  – Fails to incorporate long-distance dependencies
Vision-based Page Representation

[Diagram: a Web page represented as a tree whose data-record nodes contain leaf blocks labeled image, name, description, price, and note.]

• A Web page can be represented as a vision-tree [Cai et al., 2004]
  – Makes use of page layout features such as font, color, and size
  – Each node represents a data region in the Web page, called a block
Joint Web Object Extraction
• Definition 1 (record detection): given a vision-tree, record detection is the task of locating the minimum set of blocks that contain the content of a record
• Definition 2 (attribute labeling): for each identified record, attribute labeling is the process of assigning attribute labels to the leaf blocks (or elements) within the record
• Definition 3 (joint optimization of record detection and attribute labeling):
  – let x = {x0, x1, …, xN} be the features of all the blocks
  – let y = {y0, y1, …, yN} be one possible label assignment of the corresponding blocks
  – find y* = argmax_y p(y | x) — let the Hierarchical CRF model do it!
Hierarchical CRF Model
• Assumptions
  – Sibling variables interact directly
  – Non-sibling variables are conditionally independent
• Parameter estimation and labeling can be solved using the standard junction tree algorithm
• Details in our paper

[Figure: a hierarchical model over the vision-tree, with numbered nodes 0–26 arranged in four levels.]

  p(y | x) = (1/Z(x)) exp( Σ_{v∈V} Σ_k μ_k g_k(v, y|_v, x) + Σ_{e∈E} Σ_k λ_k f_k(e, y|_e, x) + Σ_{t∈T} Σ_k γ_k h_k(t, y|_t, x) )

with vertex features g_k, edge features f_k, and features h_k over the larger cliques t ∈ T.
HCRF Model for Web Object Extraction
• Inter-level interactions
  – Dependencies between parents and children
  – Different from the multi-scale CRF model [He et al., 2004]
• Long-distance dependencies
  – Captured through the dependencies at various levels plus the inter-level interactions
• Flexibility to incorporate any useful feature
  – The HCRF model is a conditional model, and also a CRF model
• Computational efficiency

[Figures: example product blocks; inner vision-tree nodes carry labels such as "contains name and price", "contains description", and "contains image", while leaf nodes are labeled product name, product image, product description, and product price.]
Empirical Evaluation
• Two datasets for two types of Web pages
  – List dataset (LDST): 771 list pages (200 for training, 571 for testing)
  – Detail dataset (DDST): 450 detail pages (150 for training, 300 for testing)

[Chart: precision, recall, and F1 (0 to 1 scale) comparing 2D CRFs with HCRFs.]
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration   (current section)
  – Object Ranking
  – Object Mining
• Conclusion
Object Identification
• Existing approaches use only text similarity
• Connection strength in the object-relationship graph is another important piece of evidence
• The author identification problem:
  – Multiple researchers share the same name (e.g., Lei Zhang)
  – The same researcher has multiple names (e.g., Alon Levy and Alon Y. Halevy)
  – Goal: identify all papers by a given researcher

[Diagram: a relationship graph linking papers, conferences, and author names.]
Web Connections
• Local information is incomplete
• The Web is a good source for validating connections between objects
• Evidence: co-appearance in the same sentence, Web page, or Website

[Diagram: objects O1 and O2 connected through their co-appearances on the Web.]
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration
  – Object Ranking   (current section)
  – Object Mining
• Conclusion
Object-level Ranking: Bringing Order to Web Objects
Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, Wei-Ying Ma
Microsoft Research Asia
(presented by Zaiqing Nie)
Object Relationship Graph
• Links are of different types: paper->paper, author->paper, conference->paper
• Different link types have different semantics and affect the popularity of the related objects differently
Back-Links of an Object on the Web
• Add a Popularity Propagation Factor (PPF) to each relationship link
  – Links of the same type share the same factor
• The popularity of an object is also affected by the popularity of the Web pages containing the object
Biased Random Surfing Behavior
• A random object finder model
  – Starts a random walk on the Web to find the first seed object
  – Then follows only the relationship links
  – Eventually gets bored
  – Restarts the random walk on the Web to find another seed object
• The popularity of an object depends on
  – The probability of finding the object through the Web graph
  – The probability of finding the object through the object relationship graph
The PopRank Model

  R_X = ε · R_EX + (1 − ε) · Σ_Y γ_YX · M_YX^T · R_Y

where:
• X = {x1, x2, …, xn} and Y = {y1, y2, …, yn} are the objects of type X and type Y;
• R_X and R_Y are the vectors of popularity rankings of objects of type X and type Y;
• M_YX are adjacency matrices: m_yx = 1/Num_X(y) if there is a relationship link from object y to object x, where Num_X(y) denotes the number of links from object y to any objects of type X; m_yx = 0 otherwise;
• γ_YX denotes the popularity propagation factor of relationship links from objects of type Y to objects of type X;
• R_EX is the vector of Web popularity of objects of type X;
• ε is a damping factor: the probability that the "random object finder" gets bored of following the object relationship links and starts looking for another object through the Web graph.
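A power-iteration sketch of this fixed point in NumPy (graph construction, the dict layout, and the iteration count are our assumptions):

```python
import numpy as np

def poprank(R_E, M, gamma, eps=0.15, iters=100):
    """Iterate R_X = eps * R_EX + (1 - eps) * sum_Y gamma_YX * M_YX^T * R_Y.

    R_E:   dict type -> Web-popularity vector for that type's objects
    M:     dict (Y, X) -> row-normalized adjacency matrix from type Y to type X
    gamma: dict (Y, X) -> popularity propagation factor for Y->X links
    """
    R = {t: v.copy() for t, v in R_E.items()}
    for _ in range(iters):
        R_new = {}
        for X in R:
            acc = np.zeros_like(R[X])
            for (Y, X2), M_YX in M.items():
                if X2 == X:
                    acc += gamma[(Y, X)] * (M_YX.T @ R[Y])
            R_new[X] = eps * R_E[X] + (1 - eps) * acc
        R = R_new
    return R
```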
How to Assign PPF Factors
• Impractical to manually assign PPF factors
• Easy to collect some partial rankings of the objects from domain experts
  – An example: SIGMOD -> VLDB -> ICDE -> ER
• A typical parameter-optimization problem
  – Select the combination of PPF factors for the PopRank model whose resulting rankings of the training objects match the expert rankings as closely as possible
  – Explore the search space using a simulated annealing approach; a sketch follows
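The simulated annealing search can follow the textbook pattern (a sketch; rank_distance is assumed to wrap the PopRank calculator and the comparison against the expert rankings, and the neighbor step and cooling schedule are illustrative):

```python
import math, random

def anneal_ppf(initial_ppf, rank_distance, t0=1.0, cooling=0.95, steps=200):
    """Search for PPF factors minimizing distance to the expert ranking."""
    current = best = initial_ppf
    d_cur = d_best = rank_distance(current)   # runs PopRank, compares rankings
    t = t0
    for _ in range(steps):
        cand = {k: min(1.0, max(0.0, v + random.gauss(0, 0.05)))
                for k, v in current.items()}  # a neighbor of the current point
        d = rank_distance(cand)
        # Always accept improvements; accept worse candidates with
        # probability exp(-delta / t), which shrinks as t cools.
        if d < d_cur or random.random() < math.exp((d_cur - d) / t):
            current, d_cur = cand, d
            if d < d_best:
                best, d_best = cand, d
        t *= cooling
    return best
```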
Searching for PPF Factors

[Flowchart: starting from an initial combination of PPFs and the link graph, the PopRank calculator produces rankings whose distance to the expert ranking is estimated; a combination better than the best so far becomes the new best, a worse one may still be accepted, and the next candidate is selected from the neighbors of the best.]
Challenges Facing our Learning Approach
• It may take hours or days to try and evaluate a single combination of PPF factors on a large graph
• Prohibitively expensive to try hundreds of combinations
• Observation: the effect of a link's PPF on an object decreases as the "relationship distance" increases
• Solution: use a subgraph that includes the training objects and their closely related objects to approximate the full graph (sketched below)
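Extracting such a subgraph is a breadth-first walk out to relationship distance k from the training objects (a sketch; the neighbors callback is assumed):

```python
from collections import deque

def k_hop_subgraph(neighbors, training_objects, k):
    """Collect all objects within relationship distance k of the
    training objects; neighbors(obj) yields the objects linked to obj."""
    seen = set(training_objects)
    frontier = deque((o, 0) for o in training_objects)
    while frontier:
        obj, dist = frontier.popleft()
        if dist == k:
            continue                      # do not expand past distance k
        for nbr in neighbors(obj):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen
```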
Subgraph Selection

[Flowchart: starting from an initial relationship distance and the link graph, the PopRank calculator ranks the subgraph and the ranking-distance estimator compares the result against the ranking from the full graph; if the distance exceeds the stop threshold, the relationship distance is increased and the loop repeats, otherwise the subgraph is accepted.]
Experimental Study
• Datasets
  – 7 million object relationship links of three different types
  – 1 million papers, 650,000 authors, 1,700 conferences, and 480 journals
  – 14 partial ranking lists containing ranking information for 67 objects (8 lists for training and 6 for testing)
Experimental Results for Different Subgraphs

[Charts: learning time (0 to 18,000 s) and ranking distance (0.15 to 0.5) as functions of the subgraph diameter k (1 through 7, and the full graph).]
Experimental Results for Different Stop Thresholds

[Charts: learning time (0 to 90,000 s) and ranking distance (0.15 to 0.3) as functions of the stop threshold δ.]
Conclusion (PopRank)
• A PopRank model for calculating object popularity scores
  – Based on both the Web graph and the object relationship graph
• An automated approach for assigning popularity propagation factors
• Its effectiveness is shown in Libra
  – Significantly better than PageRank
• Generally applicable to most vertical search domains
  – Product search, movie search, …
Web Object Retrieval
• Information about a Web object is extracted from multiple sources
  – Inconsistent copies
  – The reliability assumption no longer holds
• Inconsistency example:

Source       | Title                                                                                    | Authors
Ground truth | Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives   | Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger, David Nagle, Erik Riedel
CiteSeer     | Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives   | Christopher R. Lumb, Jiri...
DBLP         | Towards Higher Disk Head Utilization: Extracting "Free" Bandwidth from Busy Disk Drives | Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger, David Nagle, Erik Riedel
Unreliability of Information about Web Objects

[Diagram: a Web object with attributes 1…n, each carrying an importance weight imp_i, assembled from object blocks 1…m, each carrying an extraction confidence conf_j.]

• Sources of unreliability
  – Unreliable data sources
  – Incorrect object detection
  – Incorrect attribute-value extraction
Web Object Retrieval (Cont.)
• A language model for object retrieval
• Balancing structured and unstructured retrieval
  – Block-level unstructured object retrieval
  – Attribute-level retrieval
  – Using the confidence of the extracted object information as the parameter to find the balance

[Diagram: records extracted from Web sources 1…m (with source weights α and extraction confidences γ) yield both a record-level representation and, via attribute extraction, an attribute-level representation of the Web object's attributes 1…n (with attribute weights β). The object language model P(w|O), garbled in this transcript, mixes the record-level and attribute-level component models over the sources, weighted by the α and γ terms.]
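One way to read this balance in code (our sketch under stated assumptions, not the paper's exact estimator): each source's record is scored both as a single unstructured bag of words and attribute by attribute, with the extraction confidence deciding how much to trust the structured view. All field names here (alpha, gamma, beta, p_record, p_attr) are hypothetical.

```python
def object_score(query_terms, sources):
    """Score a Web object assembled from several extraction sources.

    Each source s carries: a weight s.alpha, an extraction confidence
    s.gamma, a record-level language model s.p_record(w), and
    per-attribute models s.p_attr[a](w) with attribute weights s.beta[a].
    """
    score = 1.0
    for w in query_terms:
        p = 0.0
        for s in sources:
            structured = sum(s.beta[a] * s.p_attr[a](w) for a in s.p_attr)
            unstructured = s.p_record(w)
            # High-confidence extractions lean on the structured model;
            # low-confidence ones fall back to the unstructured model.
            p += s.alpha * (s.gamma * structured
                            + (1.0 - s.gamma) * unstructured)
        score *= p
    return score
```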
Experimental Results
• Models compared:
  – Bag of Words (BW)
  – Unstructured Object Retrieval (UOR)
  – Multiple Weighted Fields (MWF)
  – Structured Object Retrieval (SOR)
  – Balancing Structured and Unstructured Retrieval (BSUR)

[Chart: precision (roughly 0.65 to 0.9) of the five models as the extraction error rate grows from 10% to 80%.]
Outline
• Overview
• Demo: Libra Academic Search
• Core Technologies
  – Vision-based Page Segmentation
  – Web Object Extraction
  – Object Integration
  – Object Ranking
  – Object Mining   (current section)
• Conclusion
Research Community Mining
• Motivation
  – Discover research communities and their important papers and authors
• A community is described as a set of concentric circles
  – Core objects sit in the center
  – Affiliated objects surround the core with different ranks
Conclusion
• An object-level vertical search model is proposed
• Key technologies to build an object-level vertical search engine:
  – Object extraction
  – Object identification
  – Object popularity ranking
  – Object community mining
• More applications
  – Yellow-page search
  – Job search
  – Mobile search
  – Movie search
  – …