Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet
Svetlana Strunjaš-Yoshikawa
Joint with Fred Annexstein and
Kenneth Berman
{strunjs,annexste,berman}@ececs.uc.edu
University of Cincinnati
Introduction
Consider the Lowest Common Ancestor query problem:
– Find the most specific common generalization (least common subsumer) of two or more terms or attributes in a large hierarchical classification data set
– Constraint: Evaluate queries without indirection
– Goal: Compact labeling schemes for taxonomies
Introduction (cont’d)
Applications:
– Fast classification of sets and similarity, e.g., prediction sets similar to Google Sets (given “Bush” and “Clinton” it predicts all other US presidents)
– Fast answers to ancestor queries in XML search, e.g., test whether two terms share a parent node without loading the XML file (see [1], [2])
– Fast navigation through voluminous web taxonomies (see [3])
Data Model
Structural properties found in well-known web taxonomies:
– large variance in out-degree (Δ), i.e., some nodes have many subclasses
– small in-degree (δ) range and variance
– small depth (σ) (logarithmic)
– small number (>1) of paths from root
See paper for table of statistical values for WordNet, ODP, and Math taxonomies
Our Approach
Given: large, rooted web taxonomies, represented abstractly as a Directed Acyclic Graph (DAG) with the above statistics
Problem: Label each node of the DAG so that all local path information for each taxonomy element is preserved in the encoding
Our labeling scheme is a variable-length, prefix-based scheme built up in two stages
Our Approach (cont’d)
1. Greedy Dewey Labeling for Trees (TGDL)
- Identifies a Breadth-First tree T in a DAG
- Encodes path information for the paths in T
- Labels nodes with the concatenation of edge labels
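The three TGDL steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `gdl_edge_label` and `tgdl_labels` are hypothetical, the edge-label sequence 0, 1, 00, 01, 10, ... is read off the example labels in these slides, and giving the shortest labels to the children with the largest subtrees is an assumption about what "greedy" means here.

```python
from collections import deque

def gdl_edge_label(i):
    """Return the i-th shortest binary string (i = 1, 2, ...): 0, 1, 00, 01, 10, ..."""
    # Binary form of i + 1 with its leading '1' removed.
    return bin(i + 1)[3:]

def tgdl_labels(dag, root):
    """Label the nodes of a breadth-first tree of `dag` (dict: node -> list of successors)."""
    # Step 1: identify a breadth-first tree; the first edge reaching a node is its tree edge.
    children = {root: []}
    order = [root]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in dag.get(u, []):
            if w not in children:
                children[w] = []
                children[u].append(w)
                order.append(w)
                queue.append(w)
    # Subtree sizes, accumulated bottom-up (children precede parents in reversed BFS order).
    size = {u: 1 for u in order}
    for u in reversed(order):
        for w in children[u]:
            size[u] += size[w]
    # Steps 2 and 3: assign edge labels (ASSUMPTION: larger subtrees get shorter labels)
    # and label each node with the concatenation of edge labels on its tree path.
    labels = {root: ""}
    for u in order:
        ranked = sorted(children[u], key=lambda w: -size[w])
        for i, w in enumerate(ranked, start=1):
            labels[w] = labels[u] + "." + gdl_edge_label(i)
    return labels
```

For example, `tgdl_labels({"r": ["a", "b", "c"], "b": ["x", "y"], "c": ["x"]}, "r")` treats the edge from `c` to `x` as a non-tree edge, because `x` is reached through `b` first.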
GDL example
[Figure: edge labels at a node v0, in two cases. Out-degree 4 (children v1 to v4) requires edge labels of maximum length 2. Out-degree 600 (children v1 to v600) requires edge labels of maximum length 10; example label shown: 1101101110.]
TGDL example
[Figure: tree whose edges carry GDL labels and whose nodes carry TGDL labels. Node labels by level:
level 1: .1 .00 .0 .01
level 2: .1.0 .1.1 .00.0 .0.1 .0.00 .0.01 .0.0 .01.0
level 3: .0.0.0 .0.0.1 .0.0.00 .0.0.01 .0.0.10]
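With dotted labels like those in this example, the lowest common ancestor of two nodes in the tree can be read from the labels alone, as their longest common prefix. The sketch below is illustrative only: `lca_label` is a hypothetical name, and it covers tree paths only, not the extra paths a DAG adds.

```python
def lca_label(a, b):
    """Return the label of the lowest common tree ancestor of two dotted labels.

    The comparison is component-wise, so '.0.0' and '.0.00' share only '.0'.
    The empty string stands for the root.
    """
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

# Labels taken from the example tree above.
print(lca_label(".0.0.1", ".0.01"))   # both lie below node .0
print(lca_label(".1.0", ".00.0"))     # only the root is shared
```

Comparing whole components rather than characters matters because edge labels have variable length.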
Analysis of the Length for TGDL Labels
Performed in 2 steps
First step: assume that delimiting labels are empty; each node v is labeled with at most $\log n + \sigma_v$ bits
Second step: using different edge delimiting schemes, estimate an upper bound on the length of node labels
Delimiting schemes
They encode the length of each tree-edge label
Two approaches tested:
• Unary Length Encoding
• Fixed Binary Length Encoding
Unary Length Encoding (ULE)
Comparable to Elias Gamma Code

n    Gamma      ULE
1    1          10
2    010        11
3    011        0100
4    00100      0101
5    00101      0110
6    00110      0111
7    00111      001000
8    0001000    001001

ULE assigns a zero prefix |e|-1 bits long to an edge label e whose GDL label has length |e|
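The ULE values listed above can be reproduced with a short Python sketch. It assumes, from those values, that the delimited form is the zero prefix, then a single 1 marker, then the edge label itself. The names `ule_encode` and `ule_decode` are hypothetical, and the decoder expects well-formed input.

```python
def ule_encode(edge_label):
    """Delimit a GDL edge label: |e|-1 zeros, a '1' marker, then the label."""
    return "0" * (len(edge_label) - 1) + "1" + edge_label

def ule_decode(bits):
    """Split a concatenation of ULE-delimited edge labels back into the labels."""
    labels, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":      # unary part: count the zeros of the prefix
            zeros += 1
            pos += 1
        pos += 1                     # skip the '1' marker
        length = zeros + 1           # the prefix has |e|-1 zeros
        labels.append(bits[pos:pos + length])
        pos += length
    return labels

# The first eight GDL edge labels, in order.
for label in ["0", "1", "00", "01", "10", "11", "000", "001"]:
    print(label, ule_encode(label))
```

Under this reading, each delimited edge label is exactly twice as long as the bare label, so a node label can be split back into its edge labels without separators.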
Unary Length Encoding (ULE) Analysis
Theorem:
The upper bound on TGDL label length with ULE of delimiters is $2(\log n + \sigma_v)$ bits, for an arbitrary node v in a tree T
- $\sigma_v$ is the depth of v in T
- n is the number of nodes in T
Fixed Binary Length Encoding (FBLE)
For an edge e, this encoding is the binary representation of the length of GDL(e)
Encoded with a fixed number of bits, $\log(\log \Delta + 1)$
- $\Delta$ is the maximum node out-degree in T
- uses 4 bits in our application
FBLE example
- 4 bits will encode delimiters for any T with maximum out-degree < 2^16
- Let e be an edge in T with a given GDL label, e.g. GDL(e)=0000111111
Then FBLE produces the delimiter 1010, so the label for e is 10100000111111
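The example can be checked with a small Python sketch of FBLE. The names `fble_encode` and `fble_decode` and the `width` parameter are hypothetical; the 4-bit default follows the slide.

```python
def fble_encode(edge_label, width=4):
    """Prefix a GDL edge label with its length, written in `width` binary digits."""
    length = len(edge_label)
    if length >= 2 ** width:
        raise ValueError("edge label too long for the chosen delimiter width")
    return format(length, "0{}b".format(width)) + edge_label

def fble_decode(bits, width=4):
    """Split a concatenation of FBLE-delimited edge labels back into the labels."""
    labels, pos = [], 0
    while pos < len(bits):
        length = int(bits[pos:pos + width], 2)   # fixed-width length field
        pos += width
        labels.append(bits[pos:pos + length])
        pos += length
    return labels

print(fble_encode("0000111111"))  # delimiter 1010 followed by the 10-bit label
```

Every edge pays the same delimiter cost regardless of label length, which is why short labels are cheaper under ULE while long ones are cheaper here.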
Fixed Binary Length Encoding (FBLE) Analysis
The upper bound on TGDL label length with FBLE of delimiters is $\log n + \sigma_v + \sigma_v \log(\log \Delta + 1)$ bits, for an arbitrary node v in a tree T
Our Approach (cont’d 2)
2. Extended Greedy Dewey Labeling for DAGs (EGDL)
- Augments codes generated from step 1
- Used for inferring paths not part of the Breadth-First tree
- Adds TGDL node label pairs of non-tree edges
EGDL Labeling - Example
[Figure: the labeled tree from the TGDL example, with non-tree edges added. Label pairs shown: .01*.0.01, .01*.0.0, .0.01*.0.01]
Experimental Results for WordNet taxonomy (n = 80K)

WordNet 2.1 Statistics (81426 nodes)
             Max    Min    Avg      Variance
Indegree     6      0      1.027    0.029
OutDegree    619    0      1.027    43.82
Depth        17     0      7.193    4.825
Paths        12     1      1.433    0.566

Experimental Results - Label Lengths

Encoding Length
Type of Encoding                    Max. Length    Avg. Length
EGDL with Unary Length Enc.         417            43.04
EGDL with Fixed Bin. Length Enc.    611            64.90
Fixed length baseline               170            170

The fixed length baseline is 17 * ⌈log2(619)⌉ = 17 * 10 = 170 bits, i.e., the maximum depth times the bits needed to index the maximum out-degree.
References
[1] Budanitsky, A., Hirst, G. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, 2001.
[2] Resnik, P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[3] Christophides, V., Plexousakis, D. On Labeling Schemes for the Semantic Web. In Proceedings of the 12th International Conference on World Wide Web, pages 544–555, Budapest, Hungary.
[4] Abiteboul, S., Kaplan, H., Milo, T. Compact labeling schemes for ancestor queries. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 547–556, Washington, D.C., 2001.
[5] Strunjas-Yoshikawa, S., Annexstein, F., Berman, K. Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet. In Proceedings of the 32nd International Conference on Current Trends in Theory and Practice of Computer Science, Merin, Czech Republic, January 21-27, 2006.