measuring semantic similarity between words using web search engines www 07

19
Measuring Semantic Similar ity between Words Using We b Search Engines WWW 07

Upload: egbert-briggs

Post on 13-Dec-2015

222 views

Category:

Documents


0 download

TRANSCRIPT

Measuring Semantic Similarity between Words Using Web

Search Engines

WWW 07

Abstract

• Semantic web-related applications– Community chain mining– Relation extraction– Automatic meta data extraction– Entity disambiguation results

• This paper consists of four page-count-based similarity scores and automatically extracted lexico-syntactic patterns from text snippets.

Introduction – 1/2

• Page counts and snippets are two useful information sources provided by most Web search engines.

• Some problems– Page count analyses ignore the position of a word i

n a page • two words appear in a page, they might not be related

– Polysemous word (a word with multiple senses)• apple as a fruit

• apple as a company

Introduction – 2/2

• Lexico-syntactic patterns

– various semantic relations• also known as,

• is a,

• part of,

• is an example of

Method (Page-count-based)

• Page-count-based Similarity Scores (co-occurrence measures)

C = 5

Method (Lexico-Syntactic Patterns) – 1/4

• Extracting Lexico-Syntactic Patterns from Snippets– is a (X is a Y)

– and (X and Y)

Method (Lexico-Syntactic Patterns) – 2/4

• Given a set S of synonymous

– n-grames : n=2,3,4, and 5

Method (Lexico-Syntactic Patterns) – 3/4

• A set S of synonymous word-pairs– 5000 word pairs of synonymous nouns from Word

Net– 4,562,471 unique patterns– 80% occur less than 10 times

• A set of non-synonymous word-pairs– 5000 word pairs of non-synonymous nouns from W

ordNet

Method (Lexico-Syntactic Patterns) – 4/4

Integrating Patterns and Page Counts

Experiments

• WebOverlap (rank=18,weight=2.45)

• Web-Jaccard (rank=66, weight=0.618)

• WebPMI (rank=138,weight=0.0001)

Benchmark Dataset

– Rubenstein-Goodenough 28 word-pairs

Experiments

Semantic Similarity

Taxonomy-Based Methods

Community Mining

• 50 personal names from 5 communities: – tennis players, golfers,actors, politicians scientists

– 10 names from each community

– B-CUBED

Conclusion

• Semantic web-related applications– Community chain mining– Relation extraction– Automatic meta data extraction– Entity disambiguation results

• This paper consists of four page-count-based similarity scores and automatically extracted lexico-syntactic patterns from text snippets.