Pairwise Document Similarity in Large Collections with MapReduce
TRANSCRIPT
Tamer Elsayed, Jimmy Lin, and Douglas Oard
Presented by Niveda Krishnamoorthy
Pairwise Similarity
MapReduce Framework
Proposed algorithm
• Inverted index construction
• Pairwise document similarity calculation
Results
PubMed – “More like this”
Similar blog posts
Google – Similar pages
Framework that supports distributed computing on clusters of computers
Introduced by Google in 2004
Map step
Reduce step
Combine step (Optional)
Applications
Word count example – consider two files:

File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop

Map step (one map task per file):
Map 1 emits <Hello,1> <World,1> <Bye,1> <World,1>
Map 2 emits <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

Shuffle & sort groups the values by key:
<Hello,(1,1)> <World,(1,1)> <Bye,(1)> <Hadoop,(1,1)> <Goodbye,(1)>

Reduce step (Reduce 1–5, one per key) sums the values:
Hello,2  World,2  Bye,1  Hadoop,2  Goodbye,1
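The flow above can be sketched in plain Python. This is a minimal single-machine sketch of the map / shuffle & sort / reduce steps, not Hadoop code; the function names are illustrative:

```python
from collections import defaultdict

def map_word_count(document):
    """Map step: emit <word, 1> for every word in the document."""
    for word in document.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle & sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_word_count(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return (word, sum(counts))

files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for doc in files for pair in map_word_count(doc)]
result = dict(reduce_word_count(w, c) for w, c in shuffle(mapped).items())
print(result)  # {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```

In a real Hadoop job each map and reduce task runs on a different machine; the shuffle here stands in for the framework's distributed grouping.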
MAPREDUCE ALGORITHM
• Inverted index computation
• Pairwise similarity
Scalable and efficient
Inverted index example – consider three documents:

Document 1: A A B C
Document 2: B D D
Document 3: A B B E

Map step (one map task per document) emits <term,(doc,tf)>:
Map 1: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
Map 2: <B,(d2,1)> <D,(d2,2)>
Map 3: <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

Shuffle & sort groups the postings by term; the reduce step (Reduce 1–5, one per term) emits each term's postings list:
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>
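The indexing step can be sketched on one machine as follows. This is an illustrative Python sketch, not the authors' Hadoop implementation; function and variable names are assumptions:

```python
from collections import Counter, defaultdict

def map_inverted_index(doc_id, text):
    """Map step: emit <term, (doc_id, term_frequency)> per distinct term."""
    for term, tf in Counter(text.split()).items():
        yield (term, (doc_id, tf))

def shuffle(mapped_pairs):
    """Shuffle & sort: group the postings by term."""
    groups = defaultdict(list)
    for term, posting in mapped_pairs:
        groups[term].append(posting)
    return groups

# The reduce step is essentially the identity here: the grouped values
# already form the postings list for each term.
docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
mapped = [p for doc_id, text in docs.items()
          for p in map_inverted_index(doc_id, text)]
index = {term: sorted(postings) for term, postings in shuffle(mapped).items()}
print(index["B"])  # [('d1', 1), ('d2', 1), ('d3', 2)]
```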
• Group by document ID, not pairs
• Golomb's compression for postings (individual postings → list of postings)
Pairwise similarity calculation – the map step reads each term's postings list and, for every pair of documents sharing that term, emits the product of their term weights:

From <A,[(d1,2),(d3,1)]>: <(d1,d3),2>
From <B,[(d1,1),(d2,1),(d3,2)]>: <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>
From <C,[(d1,1)]>, <D,[(d2,2)]>, <E,[(d3,1)]>: nothing (single-document postings produce no pairs)

Shuffle & sort groups the partial contributions by document pair:
<(d1,d2),[1]> <(d2,d3),[2]> <(d1,d3),[2,2]>

Reduce step (Reduce 1–3, one per pair) sums the contributions:
<(d1,d2),1> <(d2,d3),2> <(d1,d3),4>
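The pairwise phase can be sketched the same way. A single-machine Python sketch under the toy example's assumptions: contributions are products of raw term frequencies, whereas the paper uses BM25 term weights; names are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def map_pairwise(term, postings):
    """Map step: for each pair of documents sharing a term, emit the
    product of their term weights as a partial similarity score."""
    for (di, wi), (dj, wj) in combinations(postings, 2):
        yield (tuple(sorted((di, dj))), wi * wj)

def reduce_pairwise(pair, contributions):
    """Reduce step: sum the partial scores for one document pair."""
    return (pair, sum(contributions))

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)], "D": [("d2", 2)], "E": [("d3", 1)],
}
groups = defaultdict(list)           # stands in for shuffle & sort
for term, postings in index.items():
    for pair, score in map_pairwise(term, postings):
        groups[pair].append(score)
similarities = dict(reduce_pairwise(p, c) for p, c in groups.items())
print(similarities[("d1", "d3")])  # 4
```

Note that singleton postings lists (C, D, E) yield no pairs, matching the example above.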
Experimental setup:
• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk each)
• Similarity function: BM25
• Dataset: AQUAINT-2 (newswire text), 2.5 GB, 906k documents
Preprocessing: tokenization, stop word removal, stemming
Df-cut
• The fraction of terms with the highest document frequency is eliminated – a 99% cut (9,093 terms)
• 3.7 billion pairs vs. 81 trillion pairs
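The df-cut can be sketched as follows. This is a hypothetical illustration, not the paper's preprocessing code; here `cut_fraction` is the fraction of the vocabulary removed, so a 99% df-cut corresponds to removing the top 1% of terms by document frequency:

```python
from collections import Counter

def apply_df_cut(docs, cut_fraction=0.01):
    """Eliminate the cut_fraction of vocabulary terms with the highest
    document frequency (df); a 99% df-cut removes the top 1% of terms."""
    df = Counter()
    for text in docs:
        df.update(set(text.split()))   # each doc counts once per distinct term
    n_cut = int(len(df) * cut_fraction)
    cut_terms = {term for term, _ in df.most_common(n_cut)}
    return [" ".join(w for w in text.split() if w not in cut_terms)
            for text in docs]

docs = ["the cat sat", "the dog ran", "the cat ran fast"]
print(apply_df_cut(docs, cut_fraction=0.2))  # ['cat sat', 'dog ran', 'cat ran fast']
```

Because the highest-df terms appear in the most documents, removing them eliminates the postings lists that generate the vast majority of intermediate pairs.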
Linear space and time complexity (observed with the df-cut)
Worst-case complexity: O(n²)
A df-cut of 99 percent eliminates meaning-bearing terms along with some irrelevant terms:
• Cornell, arthritis
• sleek, frail
The df-cut can be relaxed to 99.9 percent.
The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
Df-cut – does a df-cut of 99 percent significantly affect the quality of the results?
The results have not been evaluated.