Hierarchical Link Analysis for Ranking Web Data

Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello, and Stefan Decker

Digital Enterprise Research Institute, Galway

June 1, 2010

Introduction Web Data Model The DING Model Experimental Results Scalability Conclusion

Introduction

Web of Data

The number of web data sources keeps growing:

Linked Open Data cloud;
Open Graph protocol;
e-commerce (GoodRelations), e-government, ...

How to search and retrieve relevant information?

A single query can return millions of entities, and users expect only the most relevant ones.
Web data search engines (e.g., Sindice) need an effective way to rank entities.
Partial solution: popularity-based entity ranking.

Link Analysis on the Web

Link Analysis

Given a directed graph, determine the popularity of its nodes using link information.
A link from a node i to a node j is considered as evidence of the importance of node j.

Link Analysis for Web Documents

PageRank considers link structure exclusively.
Hierarchical link analysis considers both link structure and hierarchical structure.

Link Analysis for Web Data

Current approaches consider link structure exclusively.
Sindice: dataset/entity-centric view.

Outline: Web Data Model

Web Data Model
  Web Data Graph
  Dataset Graph
  Internal and External Node
  Intra and Inter-Dataset Edge
  Linkset
  Two-Layer Model
  Quantifying the Two-Layer Model

Web Data Graph

Figure: Web data graph

Dataset Graph

Figure: Dataset graph

Internal and External Node

Figure: Internal (red) and external nodes (blue)

Intra and Inter-Dataset Edge

Figure: Inter-dataset (orange) and intra-dataset (black) edges

Linkset

Figure: Linkset

Two-Layer Model

Figure: Two-layer model of the Web of Data

Quantifying the two-layer model

Datasets

DBpedia: 17.7 million entities
Citeseer (RKBExplorer): 2.48 million entities
Geonames: 13.8 million entities
Sindice: 60 million entities across 50,000 datasets

Dataset    Intra-dataset links   Inter-dataset links
DBpedia    88M (93.2%)           6.4M (6.8%)
Citeseer   12.9M (77.7%)         3.7M (22.3%)
Geonames   59M (98.3%)           1M (1.7%)
Sindice    287M (78.8%)          77M (21.2%)

Table: Ratio of intra-dataset to inter-dataset links

Outline: The DING Model

The DING Model
  Overview
  Unsupervised Link Weighting
  Computing DatasetRank
  Computing Local EntityRank
  Combining Dataset Rank and Entity Rank

The DING Model: Overview

DING Principles

DING performs entity ranking in three steps:
1. dataset ranks are computed by performing link analysis on the top layer (i.e., the dataset graph);
2. for each dataset, entity ranks are computed by performing link analysis on the local entity collection;
3. the popularity of the dataset is propagated to its entities and combined with their local ranks to estimate a global entity rank.

Unsupervised Link Weighting

Intuition

TF-IDF applied on link labels

Link Frequency - Inverse Dataset Frequency (LF-IDF)

Link weighting factor $w_{\sigma,i,j}$

Assign low weight to very common links, such as rdfs:seeAlso

\[ w_{\sigma,i,j} = LF(L_{\sigma,i,j}) \times IDF(\sigma) = \frac{|L_{\sigma,i,j}|}{\sum_{L_{\tau,i,k}} |L_{\tau,i,k}|} \times \log \frac{N}{1 + freq(\sigma)} \]
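The LF-IDF weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The slide does not define N or freq(σ); the sketch assumes N is the total number of datasets and freq(σ) is the number of datasets in which label σ occurs. The function name, the dictionary layout and the toy numbers are all hypothetical.

```python
import math
from collections import defaultdict


def lf_idf_weights(linkset_sizes, num_datasets, label_freq):
    """Compute w[(label, i, j)] = LF * IDF for every linkset.

    linkset_sizes: {(label, source_dataset, target_dataset): number of links}
    num_datasets:  N, assumed to be the total number of datasets
    label_freq:    {label: freq(label)}, assumed to count datasets using the label
    """
    # Denominator of LF: all links leaving dataset i, over every label and target.
    outgoing = defaultdict(int)
    for (_label, src, _dst), size in linkset_sizes.items():
        outgoing[src] += size

    weights = {}
    for (label, src, dst), size in linkset_sizes.items():
        lf = size / outgoing[src]
        idf = math.log(num_datasets / (1 + label_freq[label]))
        weights[(label, src, dst)] = lf * idf
    return weights


# Hypothetical toy data: dataset A links to dataset B with two labels.
example_sizes = {
    ("rdfs:seeAlso", "A", "B"): 90,
    ("foaf:knows", "A", "B"): 10,
}
example_freq = {"rdfs:seeAlso": 80, "foaf:knows": 4}
example_weights = lf_idf_weights(example_sizes, num_datasets=100, label_freq=example_freq)
```

In this toy setting the rare label ends up with the larger weight even though it carries nine times fewer links, which is the behaviour the slide describes for very common links such as rdfs:seeAlso.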

Computing Dataset Rank

Assumption

Dataset surfing behaviour is the same as the web page surfing behaviour in PageRank

DatasetRank

Weighted PageRank on the weighted dataset graph

Distribution factor $w_{\sigma,i,j}$ is defined by LF-IDF
Probability of a random jump is proportional to the size of a dataset

\[ r^k(D_j) = \alpha \sum_{L_{\sigma,i,j}} r^{k-1}(D_i)\, w_{\sigma,i,j} + (1 - \alpha) \frac{|E_{D_j}|}{\sum_{D \in G} |E_D|} \]
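The DatasetRank update above can be written as a short power-iteration loop. The sketch below is an illustration under stated assumptions: the damping value 0.85, the fixed iteration count and the choice of the size-proportional jump distribution as the starting vector are not given on the slide, and the function and variable names are hypothetical. No extra normalisation is applied, so the ranks only sum to one when the outgoing weights of each dataset do.

```python
def dataset_rank(weights, sizes, alpha=0.85, iterations=100):
    """Iterate r^k(D_j) = alpha * sum_i r^(k-1)(D_i) * w + (1 - alpha) * |E_Dj| / total.

    weights: {(label, source_dataset, target_dataset): link weight w}
    sizes:   {dataset: number of entities |E_D|}
    alpha:   damping factor (assumed value, not from the slides)
    """
    total_entities = sum(sizes.values())
    # Random-jump probability proportional to dataset size.
    jump = {d: n / total_entities for d, n in sizes.items()}
    rank = dict(jump)  # assumed starting point
    for _ in range(iterations):
        incoming = {d: 0.0 for d in sizes}
        for (_label, src, dst), w in weights.items():
            incoming[dst] += rank[src] * w
        rank = {d: alpha * incoming[d] + (1 - alpha) * jump[d] for d in sizes}
    return rank


# Hypothetical two-dataset graph: A and B link to each other with weight 1.
toy_weights = {("p", "A", "B"): 1.0, ("p", "B", "A"): 1.0}
toy_sizes = {"A": 1, "B": 3}
toy_rank = dataset_rank(toy_weights, toy_sizes)
```

With symmetric links, the larger dataset B ends up with the higher rank purely because of the size-proportional jump term.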

Computing Local EntityRank

Generic Algorithms

Weighted EntityRank: weighted PageRank applied to the internal entities and intra-dataset links of a dataset

Weighted LinkCount: weighted in-degree count applied to the internal entities and intra-dataset links of a dataset
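As a rough illustration of the simpler of the two local algorithms, Weighted LinkCount can be sketched as a single pass that sums the weights of incoming intra-dataset links per entity. The input layout and the function name are hypothetical, and the scores are left unnormalised because the slide does not say whether or how they are scaled.

```python
def weighted_link_count(intra_links):
    """Sum the weights of incoming intra-dataset links for each internal entity.

    intra_links: iterable of (source_entity, target_entity, weight) triples,
                 restricted to links whose endpoints are in the same dataset.
    """
    scores = {}
    for src, dst, weight in intra_links:
        scores[dst] = scores.get(dst, 0.0) + weight
        scores.setdefault(src, 0.0)  # entities without in-links still get a score
    return scores


# Hypothetical links inside one dataset.
toy_links = [("a", "b", 1.0), ("c", "b", 0.5), ("b", "a", 2.0)]
toy_scores = weighted_link_count(toy_links)
```

Because it needs only one pass over the links, this variant is much cheaper than an iterative EntityRank computation.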

Combining Dataset Rank and Entity Rank

Naive approach

Purely probabilistic point of view: joint probability

Assumption: independent events

Global score: $r^g(e) = P(e \cap D) = r(e) \times r(D)$

Problem: favours smaller datasets

DING Approach

Add a local entity rank factor;

Normalise local ranks to the same average based on dataset size

\[ r^g(e) = r(D) \times r(e) \times \frac{|E_D|}{\sum_{D' \in G} |E_{D'}|} \]
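A small numeric sketch may help to see why the naive product favours small datasets and what the size factor changes. The two datasets, their sizes and their ranks below are hypothetical, and local ranks are assumed to be uniform within each dataset so that they sum to one.

```python
import math


def naive_global_rank(dataset_rank, local_rank):
    """Joint probability under the independence assumption: r(e) * r(D)."""
    return dataset_rank * local_rank


def ding_global_rank(dataset_rank, local_rank, dataset_size, total_entities):
    """DING combination: r(D) * r(e) * |E_D| / sum of all dataset sizes."""
    return dataset_rank * local_rank * dataset_size / total_entities


# Hypothetical setting: two equally ranked datasets of very different sizes.
big_size, small_size = 1000, 10
total = big_size + small_size
r_dataset = 0.5
r_entity_big = 1 / big_size      # uniform local rank in the big dataset
r_entity_small = 1 / small_size  # uniform local rank in the small dataset

naive_big = naive_global_rank(r_dataset, r_entity_big)
naive_small = naive_global_rank(r_dataset, r_entity_small)
ding_big = ding_global_rank(r_dataset, r_entity_big, big_size, total)
ding_small = ding_global_rank(r_dataset, r_entity_small, small_size, total)
```

Under the naive product, an average entity of the small dataset scores 100 times higher than an average entity of the big one. With the size factor, both average entities receive the same global score, so differences come only from the dataset rank and from how an entity compares with its local average.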

Outline: Experimental Results

Experimental Results
  Overview
  User Study
  SemSearch 2010

Experimental Results: Overview

Link Analysis Methods

Global EntityRank (GER);

Local LinkCount (LLC) and Local EntityRank (LER);

Local algorithms combined with DatasetRank (DR-LLC andDR-LER).

Experiments

1 User study to evaluate each method qualitatively;

2 Semantic Search challenge.

User Study: Design

Exp-A

Local entity ranking (LER & LLC) on the DBpedia dataset
31 participants

Exp-B

DING (DR-LER & DR-LLC) on Sindice's page repository
58 participants

Task

10 queries (keyword and SPARQL queries)
One result list (top-10) per algorithm
Rate each algorithm in relation to GER: Worse (W), Slightly Worse (SW), Similar (S), Slightly Better (SB), Better (B)

User Study: Questionnaire

Figure: One of the questionnaires given to the participants

User Study A: Results

(a) LER

Rate     O_i   E_i    %χ2
B        0     6.2    −13%
SB       7     6.2    +0%
S        21    6.2    +71%
SW       3     6.2    −3%
W        0     6.2    −13%
Totals   31    31

(b) LLC

Rate     O_i   E_i    %χ2
B        3     6.2    −12%
SB       8     6.2    +4%
S        13    6.2    +53%
SW       6     6.2    −0%
W        1     6.2    −31%
Totals   31    31

Table: Chi-square test for Exp-A. The column %χ2 gives, for each modality, its contribution to χ2 (in relative value).
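The %χ2 column can be reproduced from the observed counts alone. The sketch below assumes that O_i are the observed counts, that E_i is the expected count under a uniform distribution over the five ratings, and that the sign of each contribution indicates whether O_i lies above or below E_i. These readings are not stated on the slide but are consistent with the table values. The function name is hypothetical.

```python
def chi_square_contributions(observed):
    """Return (chi2, {rating: signed percentage contribution to chi2}).

    observed: {rating: observed count O_i}
    The expected count E_i is assumed uniform: total / number of ratings.
    Not defined when every observed count equals the expected count (chi2 = 0).
    """
    expected = sum(observed.values()) / len(observed)
    terms = {k: (o - expected) ** 2 / expected for k, o in observed.items()}
    chi2 = sum(terms.values())
    contributions = {}
    for rating, o in observed.items():
        sign = 1 if o >= expected else -1
        contributions[rating] = sign * 100 * terms[rating] / chi2
    return chi2, contributions


# Observed counts from tables (a) and (b) of Exp-A.
ler_counts = {"B": 0, "SB": 7, "S": 21, "SW": 3, "W": 0}
llc_counts = {"B": 3, "SB": 8, "S": 13, "SW": 6, "W": 1}
ler_chi2, ler_contrib = chi_square_contributions(ler_counts)
llc_chi2, llc_contrib = chi_square_contributions(llc_counts)
```

Rounded to whole percentages, the contributions match the table: for LER the "Similar" rating alone accounts for about 71% of the statistic, and for LLC about 53%.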

Conclusion

LER and LLC provide results similar to GER. However, a larger proportion of the participants considers LER similar to GER.

User Study B: Results

(a) DR-LER

Rate     O_i   E_i     %χ2
B        12    11.6    +0%
SB       12    11.6    +0%
S        22    11.6    +57%
SW       9     11.6    −4%
W        3     11.6    −39%
Totals   58    58

(b) DR-LLC

Rate     O_i   E_i     %χ2
B        7     11.6    −9%
SB       24    11.6    +65%
S        13    11.6    +1%
SW       10    11.6    −1%
W        4     11.6    −24%
Totals   58    58

Table: Chi-square test for Exp-B. The column %χ2 gives, for each modality, its contribution to χ2 (in relative value).

Conclusion

DR-LLC appears to be more effective. A large proportion of the participants finds it slightly better than GER, and this is reinforced by the small number of people finding it worse.

SemSearch 2010: Entity Search Track

SemSearch 2010

First semantic search evaluation;

Focus on entity search.

Experiment Design

Billion Triple Challenge 2009 dataset;

92 keyword queries;

Relevance judgement on top 10 entities.

SemSearch 2010: Experiment Results

Figure: SemSearch 2010 evaluation results

Scalability: Computing Dataset Rank

Graph      Nodes   Edges
Web Data   60M     364M
Dataset    50K     1.2M

Table: Graph Size

DatasetRank

1 iteration ≈ 200 ms;
Good-quality ranks within a few seconds.

Scalability: Dataset size distribution

Power-law distribution;
The majority of the datasets contain fewer than 1000 nodes.

Scalability: Computing Entity Rank

EntityRank

55 iterations of 1 minute each (for the DBpedia dataset).

LinkCount

requires only 1 iteration;
can be computed on the fly with an appropriate data index.

Dataset-Dependent Local EntityRank

Dataset Specific Algorithms

No reason to have one generic algorithm for all datasets;
We could choose an appropriate entity ranking algorithm for each dataset.

Graph Structure       Dataset                Algorithm
Generic, Controlled   DBpedia                LinkCount
Generic, Open         Social Communities     EntityRank
Hierarchical          Geonames, Taxonomies   DHC
Bipartite             DBLP                   CiteRank

Table: List of various graph structures with appropriate algorithms

Conclusion

DING Method

Hierarchical Link Analysis for web data;
Quality comparable or even better than standard approaches;
Lower computational complexity;
Dataset-dependent local entity ranking.

Future Work

Investigate how to detect an appropriate local entity ranking method for a dataset;
Study query-dependent ranking and how it can be combined with DING ranking.
