
Sudhanshu Khemka

Accelerating search engine query processing using the GPU

Prominent Document Scoring Models

The Vector Space Model

Treats each document as a vector with one component corresponding to each term in the dictionary

The weight of a component is calculated using the tf-idf weighting scheme, where tf is the total number of occurrences of the term in the document and idf is the inverse document frequency of the term.
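In the standard scheme, which matches the definitions above (the slide may use a variant), this weight is

w_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}

where N is the number of documents in the collection and df_t is the number of documents that contain term t.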

As the query is also a mini document, the model represents the query as a vector.

Similarity between two vectors can be found as follows:
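Presumably this is cosine similarity, the standard measure in the vector space model: the dot product of the two tf-idf vectors, normalized by their lengths,

\mathrm{sim}(d, q) = \frac{\vec{V}(d) \cdot \vec{V}(q)}{\lvert \vec{V}(d) \rvert \, \lvert \vec{V}(q) \rvert}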

The Language Model Based Approach to IR

Builds a probabilistic language model for each document d and ranks documents based on P(d|q)

The formula is simplified using Bayes' rule:

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, and P(d) is treated as uniform across all documents. Thus, ranking by P(d|q) is equivalent to ranking by P(q|d).

P(q|d) can be estimated using a number of different methods. For example, using the maximum likelihood estimate and the unigram assumption:
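The conventional form under these assumptions is

\hat{P}(q|d) = \prod_{t \in q} \hat{P}_{\mathrm{mle}}(t|d) = \prod_{t \in q} \frac{\mathrm{tf}_{t,d}}{L_d}

where L_d is the total number of tokens in document d. In practice this estimate is smoothed so that a single unseen query term does not zero out the whole product.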

My research

A lot of research has been done to develop efficient CPU algorithms that improve query response time.

We look at the task of improving the query response time from a different perspective

Instead of focusing solely on writing efficient algorithms for the CPU, we shift our focus to the processor itself and formulate the following question:

“Can we accelerate search engine query processing using the GPU?”

Why the GPU?

The GPU’s programming model is highly suitable for processing data in parallel.

It allows programmers to define a grid of thread blocks, where each thread in a thread block can execute a subset of the operations in parallel.

This is useful for information retrieval, as the score of each document can be computed in parallel.
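As an illustration (not from the original slides), a minimal CUDA kernel for this could assign one thread per document; the function name and the doc-major array layout are assumptions:

// Minimal sketch: one thread computes the score of one document.
__global__ void scoreDocuments(const float *queryWeights, // one weight per dictionary term
                               const float *docWeights,   // doc-major: docWeights[d * numTerms + t]
                               int numTerms, int numDocs,
                               float *scores)             // output: one score per document
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;        // global document index
    if (d >= numDocs) return;
    float score = 0.0f;
    for (int t = 0; t < numTerms; ++t)                    // dot product of query and document vectors
        score += queryWeights[t] * docWeights[d * numTerms + t];
    scores[d] = score;
}

A launch such as scoreDocuments<<<(numDocs + 255) / 256, 256>>>(...) covers the whole collection with one thread block per 256 documents.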

Past work done

Ding et al., in their paper “Using Graphics Processors for High Performance IR Query Processing,” implement a variant of the vector space model, Okapi BM25, on the GPU and demonstrate promising results.

Okapi BM25:
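A common form of the Okapi BM25 scoring function is

\mathrm{score}(d, q) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}_{t,d} \, (k_1 + 1)}{\mathrm{tf}_{t,d} + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)}

where k_1 and b are tuning parameters, |d| is the length of document d, and avgdl is the average document length in the collection.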

In particular, they provide data-parallel algorithms for inverted list intersection, list compression, and top-k scoring.

My contribution

Propose an efficient implementation of the second ranking model, the LM-based approach to document scoring, on the GPU

Method:

Apply a divide-and-conquer approach, as we need to compute P(q|d) for each document in the collection

Each block on the GPU would calculate the scores of a subset of the total documents, sort the scores, and transfer the results to an array in the GPU’s global memory

After all the blocks have written their sorted scores to the array in global memory, we would use a parallel merge algorithm to merge the results and return the top k results (see the sketch below).
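A minimal CUDA sketch of the scoring stage follows, assuming the unigram model from above; the names and the doc-major term-frequency layout are illustrative, and the per-block sort and cross-block merge would follow as separate steps:

#include <cuda_runtime.h>
#include <math.h>

// Each thread computes log P(q|d) for one document, so each block covers one
// contiguous chunk of the collection and writes its scores to global memory,
// ready to be sorted per block and then merged across blocks.
__global__ void scoreChunk(const int *queryTerms, int queryLen, // term ids in the query
                           const int *tf,      // doc-major: tf[d * numTerms + t]
                           const int *docLen,  // L_d: total number of tokens per document
                           int numTerms, int numDocs,
                           float *logScores)   // output: one log-score per document
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d >= numDocs) return;
    float logScore = 0.0f;
    for (int i = 0; i < queryLen; ++i) {
        int t = queryTerms[i];
        // log(tf_{t,d} / L_d); a real implementation would smooth this
        // estimate to avoid log(0) for query terms absent from the document
        logScore += logf((float)tf[d * numTerms + t] / (float)docLen[d]);
    }
    logScores[d] = logScore;
}

Working in log space turns the product over query terms into a sum and avoids floating-point underflow for long queries.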

Satish et al., in their paper “Designing Efficient Sorting Algorithms for Manycore GPUs,” provide an efficient implementation of merge sort that is the fastest among the implementations reported in the literature.
