EASE: An Effective 3-in-1 Keyword Search Method for Unstructured, Semi-structured and Structured Data
Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, Lizhu Zhou
Tsinghua University
SIGMOD 2008
2009. 03. 19.
Summarized by Jaehui Park, IDS Lab., Seoul National University
Presented by Jaehui Park, IDS Lab., Seoul National University
Copyright 2008 by CEBT
Introduction
Keyword search over text documents, XML documents, and relational databases
Graph index
Instead of traditional inverted index
– Effective for unstructured data
– Inadequate for complex structural information.
EASE (Efficient and Adaptive keyword Search method)
Efficient algorithmic basis for scalable top-k-style processing of large amounts of heterogeneous data
– Employing an adaptive, efficient and novel index
Contributions
Model for unstructured, semi-structured and structured data as graphs
Effective graph index as opposed to the inverted index
Novel ranking mechanism that combines the DB and IR viewpoints
Extensive performance study
Motivation
Unstructured
Link awareness
– Relevant data may be separated into different pages but linked through hyperlinks
(Semi-) Structured
LCA (lowest common ancestor)
– Connected tree with minimal cost
Ex) Steiner trees
r-Radius Steiner Graph Problem
Meaningful Steiner graphs with acceptable sizes
Several concepts
Centric distance
Radius
r-Radius Steiner graph
– Radius of a Steiner graph cannot be larger than r
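The concepts above can be illustrated with a minimal sketch. It assumes, as in standard graph theory, that the centric distance of a node is the largest shortest-path distance from that node to any other node, and that the radius is the smallest centric distance over all nodes; the slides do not spell out these definitions. The function names and the toy graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def centric_distance(adj, node):
    """Largest shortest-path distance from node (assumed definition)."""
    dist = bfs_distances(adj, node)
    if len(dist) < len(adj):
        return float("inf")  # some node is unreachable from this one
    return max(dist.values())

def radius(adj):
    """Smallest centric distance over all nodes (assumed definition)."""
    return min(centric_distance(adj, v) for v in adj)

# Hypothetical toy graph: the path a - b - c - d - e
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
        "d": ["c", "e"], "e": ["d"]}
print(centric_distance(path, "a"), radius(path))
```

Under these assumptions the middle node c is the center of the path, so the whole path counts as a graph of radius 2 and would satisfy the bound for any r of at least 2.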
Example
DBLP example
The r-Radius Steiner Graph Problem
Given a graph and an input keyword query K, the r-Radius Steiner Graph Problem is to find all the r-radius Steiner graphs in that graph which contain all or a portion of the input keywords in K, ranked by relevance to K.
EASE: An adaptive search method
Inverted indices are not effective for discovering the much richer structural relationships that exist in databases with complicated structures [10].
Indexing r-radius Steiner graphs for every keyword combination
– Very expensive
Proposed method
1. Discover r-radius graphs (indexing)
2. Extract r-radius Steiner graphs (on the fly)
– By removing non-Steiner nodes
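The on-the-fly extraction step can be sketched as a pruning loop. This is only an illustration under stated assumptions: a node is treated as a Steiner node when it lies on a path between two nodes that contain query keywords (content nodes), and the sketch approximates this by repeatedly deleting non-content nodes with at most one neighbour. That pruning is exact on trees; on graphs with cycles it is only a necessary condition. The function name and the toy data are hypothetical.

```python
def prune_non_steiner(adj, content_nodes):
    """Repeatedly drop non-content nodes that have at most one neighbour."""
    graph = {u: set(vs) for u, vs in adj.items()}
    changed = True
    while changed:
        changed = False
        for u in list(graph):
            if u not in content_nodes and len(graph[u]) <= 1:
                for v in graph[u]:
                    graph[v].discard(u)  # detach u from its neighbour
                del graph[u]
                changed = True
    return graph

# Hypothetical toy tree: p1 - a1 - p2, with dangling branches a1 - x and p2 - y - z
tree = {
    "p1": ["a1"], "a1": ["p1", "p2", "x"], "p2": ["a1", "y"],
    "x": ["a1"], "y": ["p2", "z"], "z": ["y"],
}
steiner = prune_non_steiner(tree, content_nodes={"p1", "p2"})
print(sorted(steiner))
```

In the toy tree only a1 connects the two content nodes, so the branches through x, y and z are removed.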
EASE: An adaptive search method
Adjacency Matrix
Extracting r-radius graphs effectively
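How an adjacency matrix helps here can be shown with a small sketch. It relies on a standard fact rather than on the paper's exact lemma, which the slides do not reproduce: if self-loops are added to the adjacency matrix M, entry (i, j) of the r-th boolean power is nonzero exactly when node j is within r hops of node i. The helper names and the toy matrix are hypothetical.

```python
def mat_mult_bool(a, b):
    """Boolean matrix product of two square 0/1 matrices."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]

def within_r_hops(m, r):
    """Row i marks the nodes reachable from node i in at most r hops."""
    n = len(m)
    # Add self-loops so that paths shorter than r are also counted.
    step = [[1 if i == j else m[i][j] for j in range(n)] for i in range(n)]
    reach = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(r):
        reach = mat_mult_bool(reach, step)
    return reach

# Hypothetical toy graph: the path 0 - 1 - 2 - 3
M = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
for row in within_r_hops(M, 2):
    print(row)
```

Each row gives the candidate node set of a graph centered at that node with radius at most r.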
EASE: An adaptive search method
Determining the subgraphs that are r-radius graphs
By Lemma 1.
For efficient retrieval of r-radius graphs
– Graph index
r-radius graphs that contain the query keywords K
Extracting r-radius Steiner graphs
By Theorem 1.
EASE: An adaptive search method
Computing the Steiner nodes
EASE: An adaptive search method
Maximal r-Radius Graph
Avoid redundancy
– Keep the maximal r-radius graphs in the graph index
Overlapping graphs
Graph partitioning
Avoid incurring a huge storage cost
Only need to retrieve the corresponding relevant graph partitions
Graph similarity
– Bigger overlap -> higher similarity
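The overlap-based similarity can be made concrete with a short sketch. The slides do not give the formula, so this assumes Jaccard similarity over node sets as one plausible measure in which a bigger overlap yields a higher score; the function name is hypothetical.

```python
def overlap_similarity(nodes_a, nodes_b):
    """Jaccard similarity of two node sets (an assumed overlap measure)."""
    a, b = set(nodes_a), set(nodes_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two hypothetical r-radius graphs sharing two of four distinct nodes
print(overlap_similarity({1, 2, 3}, {2, 3, 4}))
```

A score like this could feed a clustering step that places heavily overlapping graphs in the same partition.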
Summary
1. Obtain adjacency matrix M
2. Compute M^r (the r-th power of M)
3. Extract the maximal r-radius graphs
4. Cluster the graphs by employing the existing K-means algorithm and partition the graph
5. Construct the graph index to materialize the maximal r-radius graphs
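Step 3 of the list above can be sketched as a subset filter. This assumes each candidate r-radius graph is represented by its node set and that "maximal" means not strictly contained in another candidate; the slides do not state the exact criterion, and the function name is hypothetical.

```python
def keep_maximal(node_sets):
    """Keep only node sets that are not strict subsets of another set."""
    sets = [frozenset(s) for s in node_sets]
    unique = list(dict.fromkeys(sets))  # drop exact duplicates, keep order
    return [s for s in unique if not any(s < other for other in unique)]

# Hypothetical candidates: {1, 2} is contained in {1, 2, 3} and is dropped
candidates = [{1, 2}, {1, 2, 3}, {3, 4}, {1, 2, 3}]
print(keep_maximal(candidates))
```

The quadratic comparison is fine for illustration; a real index would need a more scalable containment check.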
Others
Ranking Functions
TF-IDF based IR-ranking
Structural Compactness-based DB Ranking
– Intuitively, when an r-radius Steiner graph SG is more compact, SG is more likely to be meaningful and relevant.
Indexing
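The IR side of the ranking functions above can be illustrated with a generic TF-IDF sketch. This is one common TF-IDF variant, not the formula from the paper, which the slides do not reproduce; structural compactness is not modeled here. The function name, the smoothing in the IDF term, and the toy data are assumptions.

```python
import math
from collections import Counter

def tf_idf_scores(graph_terms, query):
    """graph_terms maps a graph id to its list of terms; returns id -> score."""
    n = len(graph_terms)
    df = Counter()  # number of graphs containing each term
    for terms in graph_terms.values():
        df.update(set(terms))
    scores = {}
    for gid, terms in graph_terms.items():
        tf = Counter(terms)
        score = 0.0
        for kw in query:
            if tf[kw]:
                # term frequency times a smoothed inverse document frequency
                score += (tf[kw] / len(terms)) * math.log(1 + n / df[kw])
        scores[gid] = score
    return scores

# Hypothetical term lists for three Steiner graphs
graphs = {
    "g1": ["keyword", "search", "xml"],
    "g2": ["keyword", "index"],
    "g3": ["graph", "index"],
}
scores = tf_idf_scores(graphs, ["keyword", "xml"])
print(sorted(scores, key=scores.get, reverse=True))
```

A DB-style compactness score would be combined with this textual score to produce the final ranking.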
Experimental study
Datasets: DBLife, DBLP and IMDB
Comparison
Unstructured
– InfoUnit [18]
Semi-structured
– SLCA [28]
Structured
– DPBF [6]
Conclusion
Proposed an efficient and adaptive keyword search method
EASE
– Keyword queries over unstructured, semi-structured and structured data
Examined the issues of indexing and ranking
By taking into account both structural compactness (DB viewpoint) and TF-IDF-based textual relevance (IR viewpoint)
Experimental results show that EASE achieves both high search efficiency and high result quality for keyword search over heterogeneous data.