Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas Oard

Niveda Krishnamoorthy

Post on 14-Jul-2015

Page 1: Pairwise document similarity in large collections with map reduce

Tamer Elsayed, Jimmy Lin, and Douglas Oard

Niveda Krishnamoorthy

Page 2: Pairwise document similarity in large collections with map reduce

Pairwise Similarity

MapReduce Framework

Proposed algorithm

• Inverted Index Construction

• Pairwise document similarity calculation

Results

Page 3: Pairwise document similarity in large collections with map reduce

PubMed – “More like this”

Similar blog posts

Google – Similar pages

Page 4: Pairwise document similarity in large collections with map reduce

Framework that supports distributed computing on clusters of computers

Introduced by Google in 2004

Map step

Reduce step

Combine step (Optional)

Applications

Page 5: Pairwise document similarity in large collections with map reduce
Page 6: Pairwise document similarity in large collections with map reduce

Consider two files:

File 1: Hello World Bye World

File 2: Hello Hadoop Goodbye Hadoop

Word count output:

Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1

Page 7: Pairwise document similarity in large collections with map reduce

Map 1: Hello World Bye World
→ <Hello,1> <World,1> <Bye,1> <World,1>

Map 2: Hello Hadoop Goodbye Hadoop
→ <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

Page 8: Pairwise document similarity in large collections with map reduce

Map output:
<Hello,1> <World,1> <Bye,1> <World,1>
<Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

SHUFFLE & SORT

<Hello,(1,1)>
<World,(1,1)>
<Bye,(1)>
<Hadoop,(1,1)>
<Goodbye,(1)>

Reduce 1 to Reduce 5 (one per key)

Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1
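The word count walkthrough above can be simulated in a single process. This is a minimal sketch, not Hadoop code: the function names `map_word_count`, `shuffle_and_sort`, and `reduce_word_count` are hypothetical, and the shuffle is modelled as a plain in-memory grouping by key.

```python
from collections import defaultdict

def map_word_count(text):
    """Map step: emit a (word, 1) pair for every token in one input file."""
    return [(word, 1) for word in text.split()]

def shuffle_and_sort(pairs):
    """Group all emitted values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_word_count(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)

# The two input files from the slide, one mapper call per file.
files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
emitted = [pair for text in files for pair in map_word_count(text)]

grouped = shuffle_and_sort(emitted)
word_counts = dict(reduce_word_count(w, c) for w, c in grouped.items())
print(word_counts)
```

Each reducer call sees one key and the full list of values for it, which is why `Hello` arrives as `[1, 1]` and leaves as `2`.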

Page 9: Pairwise document similarity in large collections with map reduce

MAPREDUCE ALGORITHM

• Inverted Index Computation

• Pairwise Similarity

Scalable and Efficient

Page 10: Pairwise document similarity in large collections with map reduce

Document 1: A A B C
Map 1 → <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>

Document 2: B D D
Map 2 → <B,(d2,1)> <D,(d2,2)>

Document 3: A B B E
Map 3 → <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

Page 11: Pairwise document similarity in large collections with map reduce

Map output:
<A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
<B,(d2,1)> <D,(d2,2)>
<A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

SHUFFLE & SORT

Reduce 1 to Reduce 5 (one per term) output the postings lists:
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>
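The inverted index step for the three example documents can be sketched the same way. This is an illustrative single-process version, assuming whitespace tokenization and raw term frequency as the posting weight; `map_index` and `reduce_index` are hypothetical names, not functions from the paper or from Hadoop.

```python
from collections import Counter, defaultdict

def map_index(doc_id, text):
    """Map step: emit (term, (doc_id, term_frequency)) for each distinct term."""
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def reduce_index(term, postings):
    """Reduce step: collect the postings for one term, sorted by document id."""
    return term, sorted(postings)

# The three documents from the slide.
docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}

# Shuffle & sort, modelled as an in-memory grouping by term.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, posting in map_index(doc_id, text):
        grouped[term].append(posting)

index = dict(reduce_index(t, p) for t, p in sorted(grouped.items()))
for term, postings in index.items():
    print(term, postings)
```

The resulting `index` maps each term to its postings list, matching the reducer outputs shown on the slide.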

Page 12: Pairwise document similarity in large collections with map reduce

Group by document ID, not pairs

Golomb’s compression for postings

(Diagram labels: Individual Postings, List of Postings)
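The slide does not say how Golomb coding is applied to the postings. A common arrangement, assumed here, is to store the gaps between sorted integer document ids and Golomb-code each gap. The sketch below follows that assumption; the parameter `m` and all function names are hypothetical choices for illustration.

```python
def _to_bits(value, width):
    """Fixed-width binary string; empty when width is 0."""
    return format(value, f"0{width}b") if width > 0 else ""

def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m >= 1."""
    q, r = divmod(n, m)
    b = (m - 1).bit_length()       # ceil(log2(m))
    cutoff = (1 << b) - m          # remainders below this need only b-1 bits
    unary = "1" * q + "0"          # quotient in unary, terminated by a 0
    if r < cutoff:
        return unary + _to_bits(r, b - 1)
    return unary + _to_bits(r + cutoff, b)

def golomb_decode(bits, m):
    """Decode a concatenation of Golomb codes back into a list of integers."""
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":
            q += 1
            i += 1
        i += 1                     # skip the 0 that ends the unary part
        short = int(bits[i:i + b - 1], 2) if b > 1 else 0
        if b > 1 and short < cutoff:
            r, i = short, i + b - 1
        elif b > 0:
            r, i = int(bits[i:i + b], 2) - cutoff, i + b
        else:
            r = 0
        values.append(q * m + r)
    return values

def encode_postings(doc_ids, m):
    """Encode strictly increasing positive doc ids as Golomb-coded gaps."""
    bits, prev = "", 0
    for doc_id in doc_ids:
        bits += golomb_encode(doc_id - prev - 1, m)
        prev = doc_id
    return bits

def decode_postings(bits, m):
    """Invert encode_postings by summing the decoded gaps."""
    doc_ids, prev = [], 0
    for gap in golomb_decode(bits, m):
        prev += gap + 1
        doc_ids.append(prev)
    return doc_ids

encoded = encode_postings([3, 7, 8, 15], m=4)
print(encoded, decode_postings(encoded, m=4))
```

Small gaps get short codes, so a term that occurs in many documents (and therefore has many small gaps) compresses well when `m` is chosen close to the typical gap size.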

Page 13: Pairwise document similarity in large collections with map reduce

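Input postings lists:
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>

Map 1: <A,[(d1,2),(d3,1)]> → <(d1,d3),2>

Map 2: <B,[(d1,1),(d2,1),(d3,2)]> → <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>

C, D, and E each appear in only one document, so they produce no pairs.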

Page 14: Pairwise document similarity in large collections with map reduce

Map output:
<(d1,d3),2>
<(d1,d2),1> <(d2,d3),2> <(d1,d3),2>

SHUFFLE & SORT

<(d1,d2),[1]>
<(d2,d3),[2]>
<(d1,d3),[2,2]>

Reduce 1: <(d1,d2),1>
Reduce 2: <(d2,d3),2>
Reduce 3: <(d1,d3),4>
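The pairwise similarity step can be sketched on the same toy postings lists. This version assumes, as the slide example does, that a term's contribution to a pair is the product of its two term frequencies; the function names are hypothetical and the shuffle is again an in-memory grouping.

```python
from collections import defaultdict
from itertools import combinations

# Postings lists from the inverted index example: term -> [(doc_id, weight)].
index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

def map_pairs(postings):
    """Map step: for one term, emit ((doc_i, doc_j), w_i * w_j) per document pair."""
    return [
        ((doc_i, doc_j), w_i * w_j)
        for (doc_i, w_i), (doc_j, w_j) in combinations(sorted(postings), 2)
    ]

def reduce_pairs(pair, contributions):
    """Reduce step: sum the per-term contributions for one document pair."""
    return pair, sum(contributions)

# Shuffle & sort: group the partial products by document pair.
grouped = defaultdict(list)
for postings in index.values():
    for pair, product in map_pairs(postings):
        grouped[pair].append(product)

similarity = dict(reduce_pairs(p, c) for p, c in sorted(grouped.items()))
print(similarity)
```

Terms with a single posting emit nothing, so only pairs of documents that share at least one term ever reach a reducer. Swapping the raw term frequencies for BM25 term weights would change the numbers but not the structure.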

Page 15: Pairwise document similarity in large collections with map reduce

Hadoop 0.16.0

20 machines (4 GB memory, 100 GB disk each)

Similarity function - BM25

Dataset: AQUAINT-2 (newswire text)

• 2.5 GB

• 906k documents

Page 16: Pairwise document similarity in large collections with map reduce

Tokenization

Stop word removal

Stemming

Df-cut

• Fraction of terms with highest document frequency is eliminated – 99% cut (9093)

• 3.7 billion pairs (vs) 81. trillion pairs

Linear space and time complexity
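The effect of a df-cut on the number of intermediate pairs can be illustrated on the toy index from the earlier three-document example. This is a sketch under assumptions: `apply_df_cut` and `count_intermediate_pairs` are hypothetical helpers, and the cut is expressed as the fraction of the vocabulary to drop (0.01 would correspond to a 99% cut).

```python
def apply_df_cut(index, drop_fraction):
    """Drop the given fraction of terms with the highest document frequency."""
    by_df = sorted(index, key=lambda term: (-len(index[term]), term))
    n_drop = int(len(by_df) * drop_fraction)
    dropped = set(by_df[:n_drop])
    return {term: postings for term, postings in index.items() if term not in dropped}

def count_intermediate_pairs(index):
    """Pairs emitted by the pairwise map step: df * (df - 1) / 2 per term."""
    return sum(len(p) * (len(p) - 1) // 2 for p in index.values())

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

# With only five terms, dropping 20% removes exactly one term: B (df = 3).
pruned = apply_df_cut(index, drop_fraction=0.2)
print(count_intermediate_pairs(index), count_intermediate_pairs(pruned))
```

Because each term contributes a number of pairs that grows quadratically with its document frequency, removing even a few of the most frequent terms cuts the intermediate output sharply.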

Page 17: Pairwise document similarity in large collections with map reduce
Page 18: Pairwise document similarity in large collections with map reduce
Page 19: Pairwise document similarity in large collections with map reduce

Complexity: O(n²)

Df-cut of 99 percent eliminates meaning-bearing terms and some irrelevant terms

• Cornell, arthritis

• sleek, frail

Df-cut can be relaxed to 99.9 percent

Page 20: Pairwise document similarity in large collections with map reduce

Exact algorithms used for inverted index construction and pairwise document similarity are not specified.

Df-cut – Does a df-cut of 99 percent affect the quality of the results significantly?

The results have not been evaluated.

Page 21: Pairwise document similarity in large collections with map reduce