
Sudhanshu Khemka

Accelerating search engine query processing using the GPU

Prominent Document Scoring Models

The Vector Space Model

Treats each document as a vector with one component corresponding to each term in the dictionary

The weight of a component is calculated using the tf-idf weighting scheme, where tf is the total number of occurrences of the term in the document and idf is the inverse document frequency of the term.
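In the standard scheme, which matches the definitions above (the slide may use a variant), this weight is

w_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}

where N is the number of documents in the collection and df_t is the number of documents that contain term t.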

As the query is also a mini document, the model represents the query as a vector.

Similarity between two vectors can be found as follows:
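Presumably this is cosine similarity, the standard measure in the vector space model: the dot product of the two tf-idf vectors, normalized by their lengths,

\mathrm{sim}(d, q) = \frac{\vec{V}(d) \cdot \vec{V}(q)}{\lvert \vec{V}(d) \rvert \, \lvert \vec{V}(q) \rvert}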

The Language Model Based Approach to IR

Builds a probabilistic language model for each document d and ranks documents based on P(d|q)

The formula is simplified using Bayes' rule:

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, and P(d) is treated as uniform across all documents. Thus, ranking by P(d|q) is equivalent to ranking by P(q|d).

P(q|d) can be estimated using a number of different methods. For example, using the maximum likelihood estimate and the unigram assumption:
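The conventional form under these assumptions is

\hat{P}(q|d) = \prod_{t \in q} \hat{P}_{\mathrm{mle}}(t|d) = \prod_{t \in q} \frac{\mathrm{tf}_{t,d}}{L_d}

where L_d is the total number of tokens in document d. In practice this estimate is smoothed so that a single unseen query term does not zero out the whole product.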

My research

A lot of research has been done to develop efficient CPU algorithms that improve query response time.

We look at the task of improving the query response time from a different perspective

Instead of focusing solely on writing efficient algorithms for the CPU, we shift our focus to the processor itself and formulate the following question:

“Can we accelerate search engine query processing using the GPU?”

Why the GPU?

The GPU’s programming model is highly suitable for processing data in parallel.

It allows programmers to define a grid of thread blocks, where each thread in a thread block can execute a subset of the operations in parallel.

This is useful for information retrieval, as the score of each document can be computed in parallel.
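As an illustration (not from the original slides), a minimal CUDA kernel for this could assign one thread per document; the function name and the doc-major array layout are assumptions:

// Minimal sketch: one thread computes the score of one document.
__global__ void scoreDocuments(const float *queryWeights, // one weight per dictionary term
                               const float *docWeights,   // doc-major: docWeights[d * numTerms + t]
                               int numTerms, int numDocs,
                               float *scores)             // output: one score per document
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;        // global document index
    if (d >= numDocs) return;
    float score = 0.0f;
    for (int t = 0; t < numTerms; ++t)                    // dot product of query and document vectors
        score += queryWeights[t] * docWeights[d * numTerms + t];
    scores[d] = score;
}

A launch such as scoreDocuments<<<(numDocs + 255) / 256, 256>>>(...) covers the whole collection with one thread block per 256 documents.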

Past work done

Ding et al., in their paper “Using Graphics Processors for High Performance IR Query Processing,” implement a variant of the vector space model, Okapi BM25, on the GPU and demonstrate promising results.

Okapi BM25:
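A common form of the Okapi BM25 scoring function is

\mathrm{score}(d, q) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}_{t,d} \, (k_1 + 1)}{\mathrm{tf}_{t,d} + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)}

where k_1 and b are tuning parameters, |d| is the length of document d, and avgdl is the average document length in the collection.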

In particular, they provide data-parallel algorithms for inverted list intersection, list compression, and top-k scoring.

My contribution

Propose an efficient implementation of the second ranking model, the LM-based approach to document scoring, on the GPU

Method:

Apply a divide-and-conquer approach, as we need to compute P(q|d) for each document in the collection

Each block on the GPU would calculate the scores of a subset of the total documents, sort the scores, and transfer the results to an array in the GPU’s global memory

After all the blocks have written their sorted scores to the array in global memory, we would use a parallel merge algorithm to merge the results and return the top k results (see the sketch below).
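A minimal CUDA sketch of the scoring stage follows, assuming the unigram model from above; the names and the doc-major term-frequency layout are illustrative, and the per-block sort and cross-block merge would follow as separate steps:

#include <cuda_runtime.h>
#include <math.h>

// Each thread computes log P(q|d) for one document, so each block covers one
// contiguous chunk of the collection and writes its scores to global memory,
// ready to be sorted per block and then merged across blocks.
__global__ void scoreChunk(const int *queryTerms, int queryLen, // term ids in the query
                           const int *tf,      // doc-major: tf[d * numTerms + t]
                           const int *docLen,  // L_d: total number of tokens per document
                           int numTerms, int numDocs,
                           float *logScores)   // output: one log-score per document
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d >= numDocs) return;
    float logScore = 0.0f;
    for (int i = 0; i < queryLen; ++i) {
        int t = queryTerms[i];
        // log(tf_{t,d} / L_d); a real implementation would smooth this
        // estimate to avoid log(0) for query terms absent from the document
        logScore += logf((float)tf[d * numTerms + t] / (float)docLen[d]);
    }
    logScores[d] = logScore;
}

Working in log space turns the product over query terms into a sum and avoids floating-point underflow for long queries.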

Satish et al., in their paper “Designing Efficient Sorting Algorithms for Manycore GPUs,” provide an efficient implementation of merge sort that is the fastest among the implementations reported in the literature.
