Improving Bayesian Computational Time and Scalability with GPGPU
TRANSCRIPT
-
8/6/2019 Improving Bayesian Computational Time and Scalability With GPGPU
Improving Bayesian Computational Time and Scalability with GPGPU
Thanakij Pechprasarn, Noppadon Khiripet
Knowledge Elicitation Archiving Laboratory (KEA)
National Electronics and Computer Technology Center
(NECTEC), ANSCSE 15, 1st April 2011
-
Bayesian applications
Styles of problems include: inference problems, causal problems
For example, the problem may be that, given that the grass is wet (evidence), what is the probability of each influential cause (rain, sprinkler)?
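The grass-wet question can be sketched numerically. The slides give no actual probabilities, so every number below is a made-up illustration, and the causes are treated as mutually exclusive for simplicity:

```python
# Hypothetical numbers for illustration only; causes assumed mutually exclusive.
priors = {"rain": 0.2, "sprinkler": 0.3, "neither": 0.5}        # assumed P(cause)
likelihoods = {"rain": 0.9, "sprinkler": 0.8, "neither": 0.05}  # assumed P(wet | cause)

# Bayes' theorem: P(cause | wet) = P(wet | cause) * P(cause) / P(wet)
p_wet = sum(likelihoods[c] * priors[c] for c in priors)         # normalizing constant
posterior = {c: likelihoods[c] * priors[c] / p_wet for c in priors}
```

With these assumed numbers the sprinkler comes out as the most probable cause once the wet grass is observed; the point is only the mechanics of conditioning.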
-
Bayesian probability
Probability as a degree of belief
Conditional probability: given information (evidence), your belief changes
Posterior as inverse probability: Bayes' theorem
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

where
$P(\theta)$ = prior of $\theta$
$P(D \mid \theta)$ = likelihood
$P(\theta \mid D)$ = posterior
$P(D)$ = prior of $D$ (acts as a normalizing constant of value $\int P(D \mid \theta)\,P(\theta)\,d\theta$)
-
Our selected application
To do hypothesis testing given observed data
The expected value of the posterior has to fall under the 95% region (credible interval) of the prior distribution
If true, then the hypothesis is accepted; otherwise it is rejected
-
Posterior expectation
An expected value of the posterior, $E_{P(\theta \mid D)}[\theta]$
It requires one to sample from the posterior, but a sampling method for the posterior may not be known, especially when the posterior has a complex form
We can work out the math to make it simpler
Remark: a powerful method such as Markov chain Monte Carlo could be used instead
-
Posterior expectation (2)
The definition of an expected value:

$$E_{P(X)}[X] = \int x\,P(x)\,dx$$

So,

$$E_{P(\theta \mid D)}[\theta] = \int \theta\,P(\theta \mid D)\,d\theta$$

Using Bayes' rule,

$$= \int \theta \cdot \frac{P(D \mid \theta)\,P(\theta)}{P(D)}\,d\theta$$

From the definition of an expectation,

$$= \frac{E_{P(\theta)}[\theta \cdot P(D \mid \theta)]}{P(D)}$$

Now we have changed the distribution from the posterior, $E_{P(\theta \mid D)}[\ldots]$, to the prior, $E_{P(\theta)}[\ldots]$
We assume that a known sampling method for the prior distribution exists
-
Hypothesis testing
We do the testing to see if the calculated expected value of the posterior falls under the 95% region of the prior distribution
That is, to see if $P(\theta \le E_{P(\theta \mid D)}[\theta]) < 0.95$ under the prior
-
Problems
However, we still have to solve the integrals appearing in the denominator, P(D), and in the hypothesis testing
Analytical methods may not work because a closed-form solution may not be found
Notice that we can convert back and forth between the integrals and the expectations
However, how can we really solve either an integral or an expected value?
-
Solutions
Monte Carlo integration (MCI) can be used to approximate an expectation/integral involving a random process

Thus, to find an expectation with MCI:
1. Sample $x_1, \ldots, x_N$ according to the underlying distribution
2. Calculate the sample mean:

$$E[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)$$
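As a minimal sketch of these two steps (not code from the talk), the following estimates $E[f(X)]$ for a standard normal $X$ with $f(x) = x^2$, whose true value is 1:

```python
import random

def mci_expectation(f, sampler, n=100_000, seed=0):
    """MCI: sample x_1..x_N from the distribution, then take the sample mean of f."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# E[X^2] for X ~ N(0, 1) is the variance, i.e. 1
estimate = mci_expectation(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0))
```

The estimator is unbiased; its error shrinks like $1/\sqrt{N}$, which is exactly the drawback discussed on the next slide.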
-
Solutions (2)
Unfortunately, MCI also has its drawback
In general, the more samples, the more accurate the final answer
However, with many more samples, the computation becomes much slower!
-
GPUs and CUDA
GPU computing: leveraging graphics cards as an accelerator of the computation
Nvidia CUDA is a major framework for programming GPUs
CUDA allows developers to exploit parallelism in the form of blocks and threads
-
Current work
Make use of our previous work, the parallel reduction module on GPUs
Speed up the computation in a real-world Bayesian application with GPU computing
-
Current work (2)
Calculate the posterior expectation:

$$E_{P(\theta \mid D)}[\theta] = \frac{E_{P(\theta)}[\theta \cdot P(D \mid \theta)]}{P(D)} = \frac{E_{P(\theta)}[\theta \cdot P(D \mid \theta)]}{\int P(D \mid \theta)\,P(\theta)\,d\theta} = \frac{E_{P(\theta)}[\theta \cdot P(D \mid \theta)]}{E_{P(\theta)}[P(D \mid \theta)]}$$

With this form, we can calculate the expectations with MCI for both the numerator and denominator
-
Current work (3)
Given the computed value of the posterior expectation, one can test the hypothesis via Monte Carlo methods as follows:
1. $X_{1..N}$ = sample from the prior
2. count = the number of samples whose value is less than the expected value
3. If count/N < 0.95 then accept, else reject
-
Structure of the parallel program

1. Sample $\theta_1, \ldots, \theta_N$ from the prior, $P(\theta)$ (CPU)
2.
3. Calculate the posterior expectation (GPU)
   The numerator part, $\sum_{i=1}^{N} \theta_i \cdot P(D \mid \theta_i)/N$
   The denominator part, $\sum_{i=1}^{N} P(D \mid \theta_i)/N$
4.
5. Do hypothesis testing, checking $P(\theta \le E[\theta \mid D]) < 0.95$ (CPU)
-
Extra issues
In addition to the parallelized Bayesian application, we also handle 2 issues found in our previous work in the parallel reduction step:

1. Further optimization
Although results from the previous work show that the computational time is substantially reduced, we find that it can be further improved
Techniques: loop unrolling, enhancing the compacting code

2. Scalability
The problem is that a certain block size can handle a problem size only up to a certain point, so small blocks cannot afford larger problems
-
What about the likelihood and prior?

Prior ~ $N(5, 0.5)$ (broad prior)
Each observation ~ $N(\theta, 0.04)$ (normal model)
Likelihood = $\prod_{i=1}^{23} N(D_i; \theta, 0.04)$ (observations are independent)
The 23 observations we've used are from Cavendish's data: 5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68, 5.85
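Putting the pieces together, here is a pure-Python sketch of the whole estimator on this model: sample θ from the prior, weight by the likelihood, and take the ratio of the two MCI estimates. It assumes 0.5 and 0.04 are variances, which the slides do not state explicitly:

```python
import math
import random

# Cavendish's 23 observations, as listed on the slide
data = [5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10,
        5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68, 5.85]

def log_likelihood(theta, var=0.04):
    """log of prod_i N(D_i; theta, var) -- observations are independent."""
    norm = -0.5 * len(data) * math.log(2 * math.pi * var)
    return norm - sum((d - theta) ** 2 for d in data) / (2 * var)

def posterior_expectation(n=50_000, seed=0):
    """E[theta | D] ~= E_prior[theta * L(theta)] / E_prior[L(theta)] via MCI."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        theta = rng.gauss(5.0, math.sqrt(0.5))  # prior N(5, 0.5), 0.5 as variance
        w = math.exp(log_likelihood(theta))     # likelihood weight
        num += theta * w
        den += w
    return num / den
```

With enough samples this lands near the 5.483 reported in the results; the same value also comes out analytically from the conjugate normal-normal posterior mean, which makes this model a good correctness check.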
-
Platforms
CUDA 3.2
A workstation with the following specification:

Description               CPU             GPU
Model                     Intel Core i7   Nvidia GeForce GTX 580
Clock frequency (GHz)     2.8             1.56
# processors              2               16
# cores per processor     4               32
# total cores             8               512
-
Results
-
Results (2)
The calculated expected value is about 5.483
It falls under the 95% region, so the hypothesis is accepted
-
Results (3)
Running time: Sequential (CPU) vs Parallel (GPU)
Our maximum speed-up achieved is 53.49x
-
Results (4)
However, we know that the parallel implementation also contains a sequential part
Currently only the portion of finding a posterior expectation is parallelized
If we compare the running time of this specific portion between the CPU and GPU versions, we would see a greater difference in performance
And the maximum speed-up we …
-
Summary
We've implemented a Bayesian application to do the hypothesis testing given a posterior expectation
We develop parallel programs running on GPUs to help accelerate the computation
Our maximum speed-up obtained is 53.49x
In addition, we cope with the extra issues from our previous work: further optimization and scalability
-
Thank You
Q&A
-
Solving the scalability issues
We now use 2D blocks instead of 1D blocks
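The reason a 2D grid helps: CUDA caps the grid's x-dimension at 65,535 blocks on GTX 580-class hardware, so with a 1D grid a given block size can only cover block_size × 65,535 elements; a 2D grid multiplies the available block budget. A small Python sketch of the flat index a thread would compute (names are illustrative, not from the talk):

```python
# Linearize a 2D grid of 1D thread blocks into one flat element index, as a
# CUDA kernel would: blockIdx.y * gridDim.x + blockIdx.x gives the block id,
# then multiply by the block size and add the thread id within the block.
def global_index(block_idx_x, block_idx_y, thread_idx, grid_dim_x, block_size):
    block_id = block_idx_y * grid_dim_x + block_idx_x
    return block_id * block_size + thread_idx

# e.g. block (3, 2) in a grid 1000 blocks wide, thread 5, 128 threads per block
idx = global_index(3, 2, 5, 1000, 128)   # (2*1000 + 3)*128 + 5 = 256389
```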
-
Results (scalability issue)
We show that the smallest block size can also be used with the largest problem size (this would not be possible in our previous work)

Problem Size      Running Time (second), Block Size = 128
65,535            0.011
131,070           0.021
262,140           0.041
524,280           0.080
1,048,560         0.159
2,097,120         0.317
4,194,240         0.631
8,388,480         1.261
16,776,960        2.523
33,553,920        5.076
67,107,840        10.368
134,215,680       20.516
268,431,360       40.332
-
Further optimization: Loop unrolling

(* parallel reduction in the reduce kernel *)
FOR s FROM num_samples/2 DOWN TO 64, halving s each iteration
    Sync threads  (* make sure that all threads are working on the same level of the tree *)
    IF threadId < s THEN
        Add s_data[threadId + s] to s_data[threadId]
    END IF
END FOR
(* loop unrolling: the last six levels run inside one warp, so no sync is needed *)
IF threadId < 32 THEN  (* CUDA warp size is 32 *)
    Add s_data[threadId + 32] to s_data[threadId]
    Add s_data[threadId + 16] to s_data[threadId]
    Add s_data[threadId + 8] to s_data[threadId]
    Add s_data[threadId + 4] to s_data[threadId]
    Add s_data[threadId + 2] to s_data[threadId]
    Add s_data[threadId + 1] to s_data[threadId]
END IF
-
Further optimization: Enhance compacting kernel

Original version: kernel_reduce()
Modified version: kernel_reduce()
-
Effect of further optimization

Unfortunately, each introduced optimization on parallel reduction seems to have only a little gain
We find that this is due to the other hot spot in the program that dominates the computation (that is, the time spent on …
-
Monte Carlo integration (MCI)

We want to integrate $f$ in $[a, b]$:

$$I = \int_a^b f(x)\,dx$$

Divide by $P(x)$, a distribution that we know how to sample from:

$$I = \int_a^b \frac{f(x)}{P(x)}\,P(x)\,dx$$

Change into a form of expectation:

$$I = E_{P(X)}\left[\frac{f(x)}{P(x)}\right]$$

Estimate the integral by sampling from $P(x)$ and calculating the sample mean:

$$I \approx \frac{1}{N}\sum_{i=1}^{N}\frac{f(x_i)}{P(x_i)}$$
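These four steps can be sketched in a few lines, using the simplest sampleable choice of $P(x)$, the uniform density $1/(b-a)$ on $[a, b]$ (this concrete choice is an illustration, not from the talk). Here it estimates $\int_0^1 x^2\,dx = 1/3$:

```python
import random

def mci_integral(f, a, b, n=100_000, seed=0):
    """I = E_P[f(x)/P(x)] with P uniform on [a, b], estimated by a sample mean."""
    rng = random.Random(seed)
    p = 1.0 / (b - a)                       # uniform density: easy to sample from
    return sum(f(rng.uniform(a, b)) / p for _ in range(n)) / n

estimate = mci_integral(lambda x: x * x, 0.0, 1.0)   # true value: 1/3
```

Choosing a P(x) shaped more like f reduces the estimator's variance, which is why the main slides sample from the prior rather than uniformly.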