Efficient Parallel CKY Parsing on GPUs
Youngmin Yi (University of Seoul), Chao-Yue Lai (UC Berkeley), Slav Petrov (Google Research), Kurt Keutzer (UC Berkeley)


Page 1:

Efficient Parallel CKY Parsing on GPUs

Youngmin Yi (University of Seoul)

Chao-Yue Lai (UC Berkeley)

Slav Petrov (Google Research)

Kurt Keutzer (UC Berkeley)

Page 2:

CKY Parsing

• Find the most likely parse tree for a given sentence

• Parse trees can be used in many NLP applications
  – Machine translation
  – Question answering
  – Information extraction

• Dynamic programming in O(|G|·n³)
  – n is the number of words in the sentence
  – |G| is the size of the grammar

[Figure: CKY chart for the sentence "I love you .": cells indexed (start, stop) from (0,0) to (3,3), each holding the substring it spans.]

Page 3:

Why Faster Parsers?

• O(|G|·n³)
  – n is about 20 on average
  – |G| is much larger: grammars with high accuracy have >1,000,000 rules
  – with n = 20 and |G| ≈ 10⁶, a single sentence already costs roughly 20³ × 10⁶ ≈ 10¹⁰ rule applications

• We need faster parsers for real-time NL processing with high accuracy!

Page 4:

GPUs

• Manycore era
  – Due to the "Power Wall", CPUs with faster clock frequencies are unlikely to appear
  – Instead, the number of processing cores will continue to increase

• GPU (Graphics Processing Unit)
  – A currently available manycore architecture
  – 480 processing cores in the GTX480

Page 5:

Overall Structure

• Hierarchical parallel platform
  – Several Streaming Processors (SPs) grouped into a Streaming Multiprocessor (SM)

Page 6:

Memory Types

• Different types of memory: global, shared, texture, constant
  – Global memory: large, off-chip, visible to all threads, but high latency
  – Shared memory: small, on-chip, per-SM, fast; shared within a thread block
  – Texture memory: read-only global memory accessed through a cache
  – Constant memory: small, read-only, cached, broadcast to all threads

Page 7:

CUDA

• CUDA (Compute Unified Device Architecture)
  – Parallel programming framework for GPUs
    • Programming model, language, compilers, APIs
  – Allows general-purpose computing on GPUs

Page 8:

Thread and Thread Block in CUDA

• Thread blocks (blocks)
  – Independent execution units
• Threads
  – Maximum threads per block: 512 or 1024
• Warps
  – Group of 32 threads executed together
• Kernel
  – Launch configured as #blocks, #threads

Page 9:

Programming Model in CUDA

• Fork-join programming model; host + device program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device kernel C code

Serial code (host)
    . . .
KernelA<<< nBlk, nThr >>>(args);   // parallel code in kernel (device)
Serial code (host)
    . . .
KernelB<<< nBlk, nThr >>>(args);   // parallel code in kernel (device)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

Page 10:

SIMT model in CUDA

• SIMT (Single Instruction Multiple Thread)
  – Not SIMD (Single Instruction Multiple Data), because threads can actually execute different locations of the program:

__global__ void Kernel1(..)
{
    if (threadIdx.x < a)
        ...   // some threads take this branch
    else
        ...   // others take this one
}

  – Not SPMD (Single Program Multiple Data), because threads with different execution paths cannot execute in parallel:

__global__ void Kernel2(..)
{
    int tx = threadIdx.x;
    for (int i = 0; i < LoopCount[tx]; i++)
        ...   // threads iterate different numbers of times
}

Page 11:

Parallelisms in CKY Parsing

• Dynamic Programming
  – Iterations must be executed serially
• But, within each iteration
  – About a million rules (over thousands of symbols) need to be evaluated for each span

[Figure: the CKY chart for "I love you ." again, annotated with the two orthogonal dimensions of parallelism: rules (unary rule relaxation, binary rule relaxation) × spans, i.e. # rules × # spans.]

Page 12:

Thread-Mapping

• Map a symbol to a thread?
  – Not good for load balancing
  – Remember SIMT!
• Map a rule to a thread? (see the sketch below)
  – 850K rules → good concurrency
  – Thread blocks are just groups of the same # of threads
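A minimal sketch of what rule-to-thread mapping could look like for binary rules at one fixed span. All names (BinaryRuleKernel, parentSym, candidates, ...) are illustrative, not the authors' actual code:

__global__ void BinaryRuleKernel(int numRules,
                                 const int*   parentSym,   // parent symbol of each rule
                                 const int*   lchildSym,   // left child symbol
                                 const int*   rchildSym,   // right child symbol
                                 const float* ruleScore,   // log-prob of each rule
                                 const float* leftScores,  // symbol scores of the left sub-span
                                 const float* rightScores, // symbol scores of the right sub-span
                                 float*       candidates)  // one output slot per rule
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per rule
    if (r >= numRules) return;

    // candidate log-score of applying rule r at this span
    candidates[r] = ruleScore[r]
                  + leftScores[lchildSym[r]]
                  + rightScores[rchildSym[r]];
    // candidates sharing a parent symbol (parentSym[r]) must still be
    // max-reduced into that symbol's score -- the synchronization
    // problem addressed on the later slides
}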

Page 13:

Block-Mapping

• Map each symbol to a thread block
  – and map the rules to threads in the thread block that corresponds to the parent symbol
  – (+) All the threads in the same thread block have the same parent
  – (-) What if the #rules of a symbol exceeds the #threads limit?

Page 14:

Block-Mapping

[Figure: a symbol whose rules overflow one 1024-thread block (Symbol i) is split into several "virtual symbols" (Virtual Symbol j, Virtual Symbol j+1, ...), each mapped to its own thread block.]
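A hypothetical host-side preprocessing step that produces this splitting, assuming the grammar's rules are sorted by parent symbol and ruleStart[s] marks where symbol s's rules begin (both names are mine, not the paper's):

#include <algorithm>
#include <vector>

const int MAX_THREADS = 1024;  // per-block thread limit (e.g., GTX480)

struct VirtualSymbol {
    int parent;     // real parent symbol
    int firstRule;  // first rule handled by this block
    int numRules;   // at most MAX_THREADS
};

std::vector<VirtualSymbol> buildVirtualSymbols(const std::vector<int>& ruleStart,
                                               int numSymbols)
{
    std::vector<VirtualSymbol> vsyms;
    for (int s = 0; s < numSymbols; ++s) {
        // chop symbol s's rule range into MAX_THREADS-sized pieces
        for (int r = ruleStart[s]; r < ruleStart[s + 1]; r += MAX_THREADS)
            vsyms.push_back({s, r, std::min(MAX_THREADS, ruleStart[s + 1] - r)});
    }
    return vsyms;  // launch one thread block per virtual symbol
}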

Page 15:

Span-Mapping

• It is easy to further exploit another level of parallelism orthogonally
  – Simply add another dimension to the grid of thread blocks (see the launch sketch below)

[Figure: a 2D grid of thread blocks; blockIdx.x selects the symbol (sym0, sym1, ...) and blockIdx.y = 0 ... n-len+1 selects the span index.]
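A sketch of the corresponding launch configuration; BinaryRelaxKernel and the argument names are placeholders for whatever kernel the mapping is applied to:

// one block per (virtual symbol, span) pair for the current span length
int numSpans = n - len + 1;               // spans of length len in an n-word sentence
dim3 grid(numVirtualSymbols, numSpans);   // blockIdx.x = symbol, blockIdx.y = span
BinaryRelaxKernel<<<grid, MAX_THREADS>>>(len, d_grammar, d_scores);

// inside the kernel, each block recovers its work from the grid position:
//   int sym   = blockIdx.x;   // (virtual) parent symbol
//   int start = blockIdx.y;   // span is (start, start + len)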

Page 16:

Synchronization

• A massive number of threads share the same parent symbol, and each must update that symbol's score correctly so that the final reduced value is the maximum

Page 17:

Atomic Operations

• atomicMax(&max, value);
  – CUDA API
  – Much more efficient on shared memory than on global memory
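Note that the built-in atomicMax operates on integers; for float log-probabilities one common workaround is a compare-and-swap loop. A sketch (my formulation, not necessarily the paper's) combining it with the shared-before-global pattern this slide recommends:

__device__ float atomicMaxFloat(float* addr, float val)
{
    // reinterpret the float's bits as an int so atomicCAS can be used;
    // the float comparison below makes this correct for non-NaN values
    int old = __float_as_int(*addr);
    while (__int_as_float(old) < val) {
        int assumed = old;
        old = atomicCAS((int*)addr, assumed, __float_as_int(val));
        if (old == assumed) break;
    }
    return __int_as_float(old);
}

__global__ void MaxWithAtomics(const float* candidates, float* symbolScore)
{
    __shared__ float blockMax;                  // fast shared-memory accumulator
    if (threadIdx.x == 0) blockMax = -1.0e30f;  // "minus infinity" for log-probs
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicMaxFloat(&blockMax, candidates[i]);   // cheap: shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicMaxFloat(symbolScore, blockMax);  // one global atomic per block
    // here all blocks reduce into one score; under block-mapping each
    // block would instead target its own parent symbol's slot
}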

Page 18:

Parallel Reduction

• After log₂N steps (N is the #threads in a block), the reduced value is obtained
  – All the threads work on the same symbol
  – An option only for block-mapping
• Steps are synchronized with __syncthreads() (see the sketch below)
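A textbook tree reduction of this kind, assuming blockDim.x is a power of two and candidates holds one score per rule of the block's parent symbol (names illustrative):

__global__ void MaxReduce(const float* candidates, float* symbolScore)
{
    extern __shared__ float sdata[];            // one slot per thread
    unsigned tid = threadIdx.x;
    sdata[tid] = candidates[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // halve the active threads each step: log2(N) steps in total
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();  // all writes of this step must land before the next
    }

    if (tid == 0)
        symbolScore[blockIdx.x] = sdata[0];     // the block's maximum
}

// launched as: MaxReduce<<<numSymbols, N, N * sizeof(float)>>>(...)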

Page 19:

Reducing Global Memory Using Texture Memory

• Grammar information
  – parent[], lchild[], rchild[]
  – Read-only throughout the whole program
• Scores updated in the previous iterations of dynamic programming
  – scores[][][]
  – Read-only within the current iteration
• Locate such read-only data in texture memory!
• But, in the case of scores[][][], we need to move the scores newly updated in the current iteration into texture memory
  – Binding an array to texture memory = cudaBindTexture()
  – The execution time of this API is proportional to the array size
  – (-) scores[start][stop][S] is a huge array…

[Figure: a binary rule Sj → Sr Ss, whose relaxation reads scores[wp][wd][Sr] and scores[wd+1][wq][Ss] from previous iterations.]
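For context, a sketch of the (legacy) texture-reference API that cudaBindTexture belongs to; texScores and the sizes are illustrative:

// a 1D float texture reference, declared at file scope
texture<float, 1, cudaReadModeElementType> texScores;

void bindScores(const float* d_scores, size_t bytes)
{
    // binding cost grows with 'bytes', which is why rebinding the full
    // scores[start][stop][S] array on every iteration is too expensive
    cudaBindTexture(NULL, texScores, d_scores, bytes);
}

// inside a kernel, reads then go through the cached texture path:
//   float s = tex1Dfetch(texScores, idx);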

Page 20:

Reducing Global Memory Using Texture Memory (Cont’d)

• Change the layout
  – scores[start][stop][S] → scores[len][start][S]
  – With this layout, the scores updated in the current iteration (len = current iteration) form one contiguous slice, so only that part of scores[][][] needs updating (see the sketch after the figure)

[Figure: the CKY chart for "I love you ." regrouped by diagonals len=1 through len=4; all cells (start, stop) on one diagonal are computed in the same iteration.]
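One way the new layout could be exploited, reusing texScores from the previous sketch and assuming scores is laid out len-major with maxSpans × numSymbols entries per length (my naming, a sketch of the idea rather than the authors' code):

// after iteration 'len' finishes, (re)bind only the prefix holding the
// slices for lengths 1..len that later iterations will read; the bound
// region grows with len instead of always covering the whole chart
size_t prefixBytes = (size_t)len * maxSpans * numSymbols * sizeof(float);
cudaBindTexture(NULL, texScores, d_scores, prefixBytes);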

Page 21:

Experimental Results

• GTX285
  – No cache memory supported
  – Low memory bandwidth

version             speedup
thread-atom         6.4
block-atom          8.1
block-pr            10.1
block-atom-SS       11.1
block-pr-SS         14.2
block-atom-SS-tex   11.9
block-pr-SS-tex     17.4

Page 22:

Experimental Results

• GTX480
  – Cache memory supported
  – Higher memory bandwidth

version             speedup
thread-atom         13.2
block-atom          14.1
block-pr            25.8
block-atom-SS       15.2
block-pr-SS         23.4
block-atom-SS-tex   13.9
block-pr-SS-tex     22.2

Page 23:

Conclusions

• We explored the design space for parallelizing CKY parsing on a GPU
  – Different mappings and synchronization methods
  – Utilizing different types of memories

• We compared each version on two GPUs
  – 26X speedup on the GTX480, 17X on the GTX285

• We expect scalable performance gains as the number of processing cores increases in future GPUs