Pairwise Document Similarity in Large Collections with MapReduce
TRANSCRIPT
Tamer Elsayed, Jimmy Lin, and Douglas Oard
Presented by Niveda Krishnamoorthy
Pairwise Similarity
MapReduce Framework
Proposed algorithm
• Inverted index construction
• Pairwise document similarity calculation
Results
PubMed – “More like this”
Similar blog posts
Google – Similar pages
Framework that supports distributed computing on clusters of computers
Introduced by Google in 2004
Map step
Reduce step
Combine step (Optional)
Applications
Word count example – consider two files:

File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop

Map step (one map task per file):
Map 1 emits <Hello,1> <World,1> <Bye,1> <World,1>
Map 2 emits <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

Shuffle & sort groups the values by key:
<Hello,(1,1)> <World,(1,1)> <Bye,(1)> <Hadoop,(1,1)> <Goodbye,(1)>

Reduce step (Reduce 1–5, one per key) sums the values:
Hello,2  World,2  Bye,1  Hadoop,2  Goodbye,1
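The flow above can be sketched in plain Python. This is a minimal single-machine sketch of the map / shuffle & sort / reduce steps, not Hadoop code; the function names are illustrative:

```python
from collections import defaultdict

def map_word_count(document):
    """Map step: emit <word, 1> for every word in the document."""
    for word in document.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle & sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_word_count(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return (word, sum(counts))

files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for doc in files for pair in map_word_count(doc)]
result = dict(reduce_word_count(w, c) for w, c in shuffle(mapped).items())
print(result)  # {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```

In a real Hadoop job each map and reduce task runs on a different machine; the shuffle here stands in for the framework's distributed grouping.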
MAPREDUCE ALGORITHM
• Inverted index computation
• Pairwise similarity
Scalable and efficient
Inverted index example – consider three documents:

Document 1: A A B C
Document 2: B D D
Document 3: A B B E

Map step (one map task per document) emits <term,(doc,tf)>:
Map 1: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
Map 2: <B,(d2,1)> <D,(d2,2)>
Map 3: <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

Shuffle & sort groups the postings by term; the reduce step (Reduce 1–5, one per term) emits each term's postings list:
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>
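The indexing step can be sketched on one machine as follows. This is an illustrative Python sketch, not the authors' Hadoop implementation; function and variable names are assumptions:

```python
from collections import Counter, defaultdict

def map_inverted_index(doc_id, text):
    """Map step: emit <term, (doc_id, term_frequency)> per distinct term."""
    for term, tf in Counter(text.split()).items():
        yield (term, (doc_id, tf))

def shuffle(mapped_pairs):
    """Shuffle & sort: group the postings by term."""
    groups = defaultdict(list)
    for term, posting in mapped_pairs:
        groups[term].append(posting)
    return groups

# The reduce step is essentially the identity here: the grouped values
# already form the postings list for each term.
docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
mapped = [p for doc_id, text in docs.items()
          for p in map_inverted_index(doc_id, text)]
index = {term: sorted(postings) for term, postings in shuffle(mapped).items()}
print(index["B"])  # [('d1', 1), ('d2', 1), ('d3', 2)]
```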
• Group by document ID, not pairs
• Golomb's compression for postings (individual postings → list of postings)
Pairwise similarity calculation – the map step reads each term's postings list and, for every pair of documents sharing that term, emits the product of their term weights:

From <A,[(d1,2),(d3,1)]>: <(d1,d3),2>
From <B,[(d1,1),(d2,1),(d3,2)]>: <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>
From <C,[(d1,1)]>, <D,[(d2,2)]>, <E,[(d3,1)]>: nothing (single-document postings produce no pairs)

Shuffle & sort groups the partial contributions by document pair:
<(d1,d2),[1]> <(d2,d3),[2]> <(d1,d3),[2,2]>

Reduce step (Reduce 1–3, one per pair) sums the contributions:
<(d1,d2),1> <(d2,d3),2> <(d1,d3),4>
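The pairwise phase can be sketched the same way. A single-machine Python sketch under the toy example's assumptions: contributions are products of raw term frequencies, whereas the paper uses BM25 term weights; names are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def map_pairwise(term, postings):
    """Map step: for each pair of documents sharing a term, emit the
    product of their term weights as a partial similarity score."""
    for (di, wi), (dj, wj) in combinations(postings, 2):
        yield (tuple(sorted((di, dj))), wi * wj)

def reduce_pairwise(pair, contributions):
    """Reduce step: sum the partial scores for one document pair."""
    return (pair, sum(contributions))

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)], "D": [("d2", 2)], "E": [("d3", 1)],
}
groups = defaultdict(list)           # stands in for shuffle & sort
for term, postings in index.items():
    for pair, score in map_pairwise(term, postings):
        groups[pair].append(score)
similarities = dict(reduce_pairwise(p, c) for p, c in groups.items())
print(similarities[("d1", "d3")])  # 4
```

Note that singleton postings lists (C, D, E) yield no pairs, matching the example above.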
Experimental setup:
• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk each)
• Similarity function: BM25
• Dataset: AQUAINT-2 (newswire text), 2.5 GB, 906k documents
Preprocessing: tokenization, stop word removal, stemming
Df-cut
• The fraction of terms with the highest document frequency is eliminated – a 99% cut (9,093 terms)
• 3.7 billion pairs vs. 81 trillion pairs
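The df-cut can be sketched as follows. This is a hypothetical illustration, not the paper's preprocessing code; here `cut_fraction` is the fraction of the vocabulary removed, so a 99% df-cut corresponds to removing the top 1% of terms by document frequency:

```python
from collections import Counter

def apply_df_cut(docs, cut_fraction=0.01):
    """Eliminate the cut_fraction of vocabulary terms with the highest
    document frequency (df); a 99% df-cut removes the top 1% of terms."""
    df = Counter()
    for text in docs:
        df.update(set(text.split()))   # each doc counts once per distinct term
    n_cut = int(len(df) * cut_fraction)
    cut_terms = {term for term, _ in df.most_common(n_cut)}
    return [" ".join(w for w in text.split() if w not in cut_terms)
            for text in docs]

docs = ["the cat sat", "the dog ran", "the cat ran fast"]
print(apply_df_cut(docs, cut_fraction=0.2))  # ['cat sat', 'dog ran', 'cat ran fast']
```

Because the highest-df terms appear in the most documents, removing them eliminates the postings lists that generate the vast majority of intermediate pairs.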
Linear space and time complexity (observed with the df-cut)
Worst-case complexity: O(n²)
A df-cut of 99 percent eliminates meaning-bearing terms along with some irrelevant terms:
• Cornell, arthritis
• sleek, frail
The df-cut can be relaxed to 99.9 percent.
The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
Df-cut – does a df-cut of 99 percent significantly affect the quality of the results?
The results have not been evaluated.