Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet

Page 1: Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet


Svetlana Strunjaš-Yoshikawa

Joint with Fred Annexstein and Kenneth Berman

{strunjs,annexste,berman}@ececs.uc.edu

University of Cincinnati

Page 2: Introduction

Consider the Lowest Common Ancestor (LCA) query problem

– Find the most specific common generalization (least common subsumer) of 2 or more terms or attributes in a large hierarchical classification data set

– Constraint: Evaluate queries without indirection, i.e., from the node labels alone

– Goal: Compact labeling schemes for taxonomies

Page 3: Introduction (cont’d)

Applications

– Fast classification of sets and similarity, e.g., prediction sets similar to Google Sets (given “Bush” and “Clinton”, it predicts all other US presidents)

– Fast answers to ancestor queries in XML search, e.g., testing whether 2 terms share a parent node without loading the XML file (see [1], [2])

– Fast navigation through voluminous web taxonomies (see [3])

Page 4: Data Model

Structural properties found in well-known web taxonomies:

– large variance in out-degree (Δ), i.e., some nodes have many subclasses

– small in-degree (δ) range and variance

– small depth (σ) (logarithmic)

– small number (> 1) of paths from the root

See the paper for a table of statistical values for the WordNet, ODP, and Math taxonomies

Page 5: Our Approach

Given: a large, rooted web taxonomy represented abstractly as a Directed Acyclic Graph (DAG) with the above statistics

Problem: Label each node of the DAG so that all local path information for each taxonomy element is preserved in the encoding

Our labeling scheme is variable-length and prefix-based, and is built up in two stages

Page 6: Our Approach (cont’d)

1. Greedy Dewey Labeling for Trees (TGDL)

- Identifies a Breadth-First tree T in a DAG

- Encodes path information for the paths in T

- Labels nodes with the concatenation of edge labels
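The three steps above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: edge labels are taken in shortlex order (0, 1, 00, 01, 10, ...), which is inferred from the labels in the TGDL example, and the "greedy" choice is assumed to give shorter edge labels to children with larger subtrees. The names `shortlex` and `tgdl_labels` are hypothetical, and dots stand in for the bit-level delimiters.

```python
from collections import deque


def shortlex(k):
    """k-th binary string (k >= 1) in shortlex order: 0, 1, 00, 01, 10, ..."""
    # bin(k + 1) looks like '0b1xxx'; drop '0b' and the leading 1.
    return bin(k + 1)[3:]


def tgdl_labels(dag, root):
    """Label the nodes of a breadth-first tree of `dag` (dict: node -> children)."""
    # Stage 1: identify a breadth-first tree by recording each node's first parent.
    parent = {root: None}
    order = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in dag.get(u, []):
            if w not in parent:
                parent[w] = u
                queue.append(w)

    # Subtree sizes in the tree (children come after parents in BFS order).
    size = {u: 1 for u in order}
    children = {u: [] for u in order}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]
            children[parent[u]].append(u)

    # Stage 2: a node label is the concatenation of edge labels on its tree path.
    # ASSUMPTION: heavier subtrees get the shorter edge labels.
    labels = {root: ""}
    for u in order:
        ranked = sorted(reversed(children[u]), key=lambda w: -size[w])
        for k, w in enumerate(ranked, start=1):
            labels[w] = labels[u] + "." + shortlex(k)
    return labels


if __name__ == "__main__":
    example = {"r": ["x", "y", "z"], "y": ["p", "q"], "z": ["q"]}
    for node, label in tgdl_labels(example, "r").items():
        print(node, repr(label))
```

In the small example, the edge z -> q is not part of the breadth-first tree, so it does not influence any label at this stage.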

Page 7: GDL example

[Figure: a node v0 with out-degree 4 (nodes v1 to v4 shown; edge labels 00, 01 and 10 visible), and a node v0 with out-degree 600 (nodes v1 to v600; edge labels 0 and 1101101110 visible).]

Out-degree 4 requires edge labels of maximum length 2.

Out-degree 600 requires edge labels of maximum length 10.

Page 8: TGDL example

[Figure: a tree whose edges carry GDL labels and whose nodes carry the dot-delimited concatenation of the edge labels on the path from the root.]

Node labels, level 1: .1 .00 .0 .01

Node labels, level 2: .1.0 .1.1 .00.0 .0.1 .0.00 .0.01 .0.0 .01.0

Node labels, level 3: .0.0.0 .0.0.1 .0.0.00 .0.0.01 .0.0.10
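Because each node label is the concatenation of the edge labels on its tree path, the lowest common ancestor of two nodes in the tree can be read off the labels alone, as their longest common run of whole edge labels. A minimal Python sketch, assuming the dot-delimited notation of the figure and an empty label for the root; `lca_label` is a hypothetical helper name.

```python
def lca_label(a, b):
    """Label of the lowest common tree ancestor of two dot-delimited labels.

    Assumes labels look like '.0.0.1' and that the root has the empty label ''.
    """
    parts_a = a.split(".")[1:]  # drop the empty piece before the first dot
    parts_b = b.split(".")[1:]
    common = []
    for x, y in zip(parts_a, parts_b):
        if x != y:  # compare whole edge labels, so '0' and '01' differ
            break
        common.append(x)
    return "".join("." + p for p in common)


if __name__ == "__main__":
    print(repr(lca_label(".0.0.1", ".0.01")))   # share only the first edge
    print(repr(lca_label(".1.0", ".00.0")))     # only the root in common
```

Comparing whole edge labels rather than raw characters matters: `.0.01` and `.0.0` agree on five characters but share only the ancestor `.0`.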

Page 9: Analysis of the Length for TGDL Labels

Performed in 2 steps

First step: assume that the delimiting labels are empty; each node v is then labeled with at most $\sigma_v + \log n$ bits

Second step: using different edge-delimiting schemes, estimate the upper bound on the length of node labels

Page 10: Delimiting schemes

Delimiters encode the length of each tree-edge label

Two approaches tested:

• Unary Length Encoding

• Fixed Binary Length Encoding

Page 11: Unary Length Encoding (ULE)

Comparable to the Elias Gamma Code

n    Gamma      ULE
1    1          10
2    010        11
3    011        0100
4    00100      0101
5    00101      0110
6    00110      0111
7    00111      001000
8    0001000    001001

ULE prepends a zero prefix of |e|-1 bits to an edge label e whose GDL label has length |e|
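The codes in the table can be reproduced with a short Python sketch. It rests on two assumptions inferred from the table rather than stated on the slide: the n-th GDL edge label is the n-th binary string in shortlex order (0, 1, 00, 01, ...), and a single 1 bit separates the zero prefix from the label, which is what makes the code self-delimiting. All function names are hypothetical.

```python
def shortlex(k):
    """k-th binary string (k >= 1) in shortlex order: 0, 1, 00, 01, 10, ..."""
    return bin(k + 1)[3:]  # strip '0b' and the leading 1 of k + 1


def elias_gamma(n):
    """Elias gamma code of n >= 1: (len - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b


def ule_encode(edge_label):
    """ASSUMED layout: |e| - 1 zeros, a separating 1 bit, then the label e."""
    return "0" * (len(edge_label) - 1) + "1" + edge_label


def ule_decode_path(bits):
    """Split a concatenation of ULE-coded edge labels back into edge labels."""
    labels, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos + zeros] == "0":  # unary part gives |e| - 1
            zeros += 1
        start = pos + zeros + 1          # skip the separating 1 bit
        length = zeros + 1
        labels.append(bits[start:start + length])
        pos = start + length
    return labels


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, elias_gamma(n), ule_encode(shortlex(n)))
```

Under this layout each edge label of length |e| costs 2|e| bits, one bit fewer than the gamma code of n + 1.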

Page 12: Unary Length Encoding (ULE) Analysis

Theorem:

The upper bound on the TGDL label length with ULE of delimiters is $2(\sigma_v + \log n)$ bits, for an arbitrary node v in a tree T

- $\sigma_v$ is the depth of v in T

- n is the number of nodes in T

Page 13: Fixed Binary Length Encoding (FBLE)

For an edge e, this encoding is the binary representation of the length of GDL(e)

Encoded with a fixed number of bits, $\log\log(\Delta^* + 1)$

- $\Delta^*$ is the maximum node out-degree in T

- uses 4 bits in our application
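The fixed delimiter width can be checked numerically. The sketch below reads the width as log log(Δ* + 1) with base-2 logarithms, where Δ* is the maximum out-degree; rounding up to a whole number of bits is an assumption, as is the helper name `fble_width`.

```python
import math


def fble_width(max_out_degree):
    """Delimiter width in bits: ceil(log2(log2(max_out_degree + 1))).

    ASSUMPTION: base-2 logarithms, rounded up to a whole number of bits.
    """
    if max_out_degree < 1:
        raise ValueError("max_out_degree must be at least 1")
    return math.ceil(math.log2(math.log2(max_out_degree + 1)))


if __name__ == "__main__":
    # 619 is the maximum out-degree reported for WordNet 2.1 in this deck.
    print(fble_width(619))
```

With the WordNet maximum out-degree of 619 this evaluates to 4, in line with the 4 bits quoted for the application.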

Page 14: FBLE example

- 4 bits will encode delimiters for any T with maximum out-degree < 2^16

- Let e be an edge in T with a given GDL label, e.g., GDL(e) = 0000111111

Then FBLE produces the delimiter 1010 (the label length, 10, in binary), so the label for e is 10100000111111
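The example can be reproduced with a few lines of Python. This is a sketch under the assumption that the delimiter is simply the label length written in `width` bits and placed in front of the label; `fble_encode` and `fble_decode_path` are hypothetical names.

```python
def fble_encode(edge_label, width=4):
    """Prefix an edge label with its length, written in `width` binary digits."""
    length = len(edge_label)
    if length >= 2 ** width:
        raise ValueError("edge label too long for the delimiter width")
    return format(length, "0{}b".format(width)) + edge_label


def fble_decode_path(bits, width=4):
    """Split a concatenation of FBLE-coded edge labels back into edge labels."""
    labels, pos = [], 0
    while pos < len(bits):
        length = int(bits[pos:pos + width], 2)  # fixed-width length field
        pos += width
        labels.append(bits[pos:pos + length])
        pos += length
    return labels


if __name__ == "__main__":
    print(fble_encode("0000111111"))
    print(fble_decode_path(fble_encode("0") + fble_encode("01")))
```

Each edge costs its GDL label length plus a constant `width` bits, so short labels pay proportionally more than under the unary scheme.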

Page 15: Fixed Binary Length Encoding (FBLE) Analysis

The upper bound on the TGDL label length with FBLE of delimiters is $\sigma_v + \log n + \sigma_v \log\log(\Delta^* + 1)$ bits, for an arbitrary node v in a tree T

Page 16: Our Approach (cont’d 2)

2. Extended Greedy Dewey Labeling for DAGs (EGDL)

- Augments the codes generated in step 1

- Used for inferring paths that are not part of the Breadth-First tree

- Adds TGDL node label pairs of non-tree edges
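The last point can be illustrated with a small Python sketch that lists the label pairs of non-tree edges. The slides do not specify the order within a pair or how pairs are attached to individual node codes, so the source-then-target order, the `*` separator borrowed from the example figure, and the name `non_tree_edge_pairs` are all assumptions made for illustration.

```python
def non_tree_edge_pairs(dag, parent, labels):
    """Return 'label(u)*label(w)' for every DAG edge u -> w outside the tree.

    dag:    dict mapping each node to its list of children in the DAG
    parent: dict mapping each node to its parent in the breadth-first tree
    labels: dict mapping each node to its TGDL label (dot-delimited here)
    ASSUMPTION: pairs are written source first, then target.
    """
    pairs = []
    for u, targets in dag.items():
        for w in targets:
            if parent.get(w) != u:  # the tree reaches w through another parent
                pairs.append(labels[u] + "*" + labels[w])
    return pairs


if __name__ == "__main__":
    dag = {"r": ["x", "y", "z"], "y": ["p", "q"], "z": ["q"]}
    parent = {"r": None, "x": "r", "y": "r", "z": "r", "p": "y", "q": "y"}
    labels = {"r": "", "x": ".1", "y": ".0", "z": ".00", "p": ".0.0", "q": ".0.1"}
    print(non_tree_edge_pairs(dag, parent, labels))
```

In this toy DAG the only edge outside the tree is z -> q, so a single pair is produced.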

Page 17: EGDL Labeling - Example

[Figure: the tree from the TGDL example, with the same edge and node labels, extended with non-tree edges of the DAG.]

Additional labels shown for non-tree edges: .01*.0.01, .01*.0.0, .0.01*.0.01

Page 18: Experimental Results for WordNet taxonomy (n = 80K)

Page 19: Experimental Results - Label Lengths

WordNet 2.1 Statistics (n = 81426)

               Indegree   OutDegree   Depth   Paths
Max            6          619         17      12
Min            0          0           0       1
Avg            1.027      1.027       7.193   1.433
(unlabeled)    0.029      43.82       4.825   0.566

Encoding Length

Type of Encoding                     Max. Length   Avg. Length
EGDL with Unary Length Enc.          417           43.04
EGDL with Fixed Bin. Length Enc.     611           64.90
Fixed length baseline                170           170

Fixed length baseline: 17 * ceil(log2(619)) = 17 * 10 = 170
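The baseline figure can be rechecked with a short Python sketch. It assumes that the baseline spends a whole number of bits per edge, ceil(log2(619)) = 10, on each of the 17 levels, which is the reading that yields 170; the function name is hypothetical.

```python
import math


def fixed_length_baseline(depth, max_out_degree):
    """Bits for a fixed-width Dewey label: one equal-width field per level.

    ASSUMPTION: each field is ceil(log2(max_out_degree)) bits wide.
    """
    return depth * math.ceil(math.log2(max_out_degree))


if __name__ == "__main__":
    # Maximum depth 17 and maximum out-degree 619, as reported for WordNet 2.1.
    print(fixed_length_baseline(17, 619))
```

Without rounding up, 17 * log2(619) is about 157.7, so the quoted 170 only follows once each field is rounded to 10 bits.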

Page 20: References

[1] Budanitsky, A., Hirst, G. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, 2001.

[2] Resnik, P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[3] Christophides, V., Plexousakis, D. On Labeling Schemes for the Semantic Web. In Proceedings of the 12th International Conference on World Wide Web, pages 544–555, Budapest, Hungary.

[4] Abiteboul, S., Kaplan, H., Milo, T. Compact labeling schemes for ancestor queries. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 547–556, Washington, D.C., 2001.

[5] Strunjas-Yoshikawa, S., Annexstein, F., Berman, K. Compact Encodings for All Local Path Information in Web Taxonomies with Applications to WordNet. In Proceedings of the 32nd International Conference on Current Trends in Theory and Practice of Computer Science, Merin, Czech Republic, January 21-27, 2006.