samudramanthan popular terms
DESCRIPTION
SamudraManthan Popular terms. Dinesh Bhirud Prasad Kulkarni Varada Kolhatkar. Architecture. R E DU C T I ON. create datastructures. Ngram pruning. Intra-process reduction. Finding Top Ngrams. create datastructures. Ngram pruning. Intra-process reduction. create datastructures. - PowerPoint PPT PresentationTRANSCRIPT
1
SamudraManthan Popular terms
Dinesh Bhirud
Prasad Kulkarni
Varada Kolhatkar
2
Architecture
MANAG E R
MANAG E R
create datastructures
Intra-process reduction
R E DU C T I ON
create datastructures
create datastructures
Ngram pruning
Ngram pruning
Ngram pruning
Intra-process reduction
Intra-process reduction
WORKER PROCESSORSWORKER PROCESSORS
Finding Top
Ngrams
3
Data DistributionMANAGER
P0
HANDSHAKE MODULE
WORKER PN
WORKER P2
WORKER P1
1. Signal Ready (W->M)
2. Data msg (M->W)
3. Next ready signal (W->M)
.
.
.4. Terminate msg (M->W)
Handshake protocol
4
Data Distribution (contd…)
Manager (processor 0) reads an article and passes it on to the other processors(workers) in a round-robin fashion
Before sending new article to the same worker, manager waits till the worker is ready to receive more data.
Worker processes the article and creates data structures before receiving new article.
Sends & receives are synchronous.
5
Suffix Array, LCP Vector And Equivalence Classes
Suffix array is a sorted array of suffixes LCP vector keeps track of repeating terms in
suffixes We use suffix arrays and LCP vector to
partition articles into classes Each class represents a group of Ngrams Classes represent all Ngrams in the article
and no Ngram is represented more than once
6
Example
S LCP Trivial Classes
Non-trivial Classes
0 A ROSE 0 <0,0> <0,1>
1 A ROSE IS A ROSE 2 <1,1>
2 IS A ROSE 0 <2,2>
3 ROSE 0 <3,3> <3,4>
4 ROSE IS A ROSE 1 <4,4>
5 0
A ROSE IS A ROSEA ROSE IS A ROSE
7
Advantages of Suffix Arrays
Time Complexity There can be at the most 2N-1 classes, where N
is number of words in an article Ngrams of all/any sizes can be identified with
their tfs in linear time These data structures enable us to represent
all and any sized Ngrams without actually storing them
8
Intra-Processor Reduction Problem
Suffix array data structure gives us article level unique Ngrams with term frequencies
A processor processes multiple articles Need to identify unique Ngrams across
articles Need to have an unique identifier for each
word
9
Dictionary – Our Savior
Dictionary is a sorted list of all unique words in the Gigaword corpus
Dictionary ids form a unified basis for intra/inter process reduction
10
Intra-Processor Reduction
Used a hash table to store unique Ngrams with tf and df Hashing function
Simple mod hashing function H(t)= ∑ t(i) mod HASH_SIZE, where t(i) is the dictionary id of
word i in Ngram t Hash data structure
struct ngramstore {int *word_id;int cnt;int doc_freq;struct ngramstore *chain;
};
11
Steps
• Inter-Process Reduction Binomial Tree
0
1
3
2 4
5
7
6
i varies from 0 to log(n) - 1• Send -> Recv diff = (2 ^ i)• For any iteration, recv if(id % (2^i) == 0) else sender• max_recv = (reductions-1) * (int)pow((double)2, i+1);
Processors enter next iteration by calling MPI_Barrier()
1 -> 03 -> 25 -> 47 -> 6
2 -> 06 -> 4
4 -> 0
12
Inter-Process Reduction using Hashing
Reusing our hash technique and code from intra-process reduction
All processes use binomial tree collection pattern to reduce unique Ngrams
After log n steps process 0 has the final hash with all unique Ngrams
13
Scaling up to GigaWord?
Goal Reduce per processor memory requirement
Cut off term frequency Ngrams with low tf are not going to score high Observation : 66 % of total trigrams have term
frequency 1 in 1.2GB data Unnecessary to carry such Ngrams Solution: Eliminate Ngrams with very low term
frequency
14
Pruning – stoplist motivation
Similarly Ngrams with high df are not going to score high.
Memory hotspot This elimination can be done only after intra-
process collection Defeats the goal of per processor memory
reduction Need for an adaptive elimination
15
Pruning - Stoplist
Ngrams such as "IN THE FIRST" scored high using TF*IDF measure
Eliminate such Ngrams to extract really interesting terms
Stoplist is a list of commonly occurring words such as “the”, “a”, “to”, “from”, “is”, “first”
Stoplist is based on our dictionary Still evolving and currently contains 160 words Eliminate Ngrams containing all words from the
stoplist
16
Interesting 3-grams on GigaWord
17
Performance Analysis - Speedup
18
Space Complexity
Memory requirement increases for higher order Ngrams
Why? Suppose there are n unique Ngrams in each article and
m such articles For higher order Ngrams, the number of unique ngrams
increase We store each unique Ngram in our hash data structure In worst case all Ngrams across articles are unique.
We have to store mn unique Ngrams per processor
19
Current Limitations
Static Dictionary M through N Interesting Ngrams
Our hash data structure is designed to handle a single sized Ngram at a time
We provide M through N functionality by repetitively building all data structures
Not a scalable approach
20
Thanks