246 255 155 190 28 42 dark 1 light 1 dark 2 light 2 accent 1 … · 2020. 1. 14. · the pk/pd...


Parallel Implementation of PK-PD Parameter Estimation on Xeon Phi Using Grid Search Method

Nishant Agrawal, R. Narayanan, Manoj Nambiar, Payal Guha Nandy,

Rihab Abdulrazak, Ambuj Pandey, Shyamsundar Das

Performance Engineering Research Center, TCS Innovation Lab, Mumbai

Drug Development R&D Group, TCS Innovation Labs, Hyderabad

Agenda

Pharma R&D Productivity

o Reasons for Poor R&D Productivity

Model Based Drug Development

Generation of Insights from integrated data

PK-PD Modelling

o Initial Parameter Estimation

o Scope and Limitations

o Parallelized Grid Search on Xeon Phi - A new and effective approach

o Miscellaneous Optimization on Xeon Phi

o Result Comparison

Summary

Problem Statement: Pharma R&D Productivity

Steven M. Paul et al., “How to improve R&D productivity: the pharmaceutical industry’s grand challenge”, Nature Reviews Drug Discovery, Vol. 9, 203-214, 2010.

Potential Solution

Model Based Drug Development is based on three themes: Integration, Innovation, and Impact. It is the quantitative integration of multisource data and knowledge through the application of clinical, biomedical, biological, engineering, statistical, and mathematical concepts, resulting in continuous methodological and technological innovation that enhances scientific understanding and knowledge, which in turn has an impact on the discovery, research, development, approval, and utilization of new medicines.

MBDD Focus Areas@TCS

Pharmacokinetics: how the drug is processed by the body, i.e. the kinetics of the drug. The plasma drug concentration vs. time profile is studied.

Pharmacodynamics: how the drug affects the body, i.e. the dynamics of the drug. The effect vs. time profile is studied.

PK-PD Modeling: how the drug effect is governed by plasma concentration (dose). The effect vs. concentration profile is established.


PK-PD Modelling

PK-PD Parameters Estimation Approaches and Why?

CA - Multi Compartment Model

In our case study, we consider only this multi-compartment model.

Here, A, B, C, Alpha, Beta, and Gamma are the parameters.

The goal is to find the parameter values for which this model best fits the available data.

Impact of the Estimated Parameters in Drug Development

Table columns: Parameter, PK Parameter derived, Relevance for Drug Development

Parameters: A, B, C, ALPHA, BETA, GAMMA

PK-PD Parameter Estimation Steps

• Parameter Bounds: Universal?

• IP Estimation method: Deterministic vs. random?

• Which is going to solve this problem?

Is GRID SEARCH a potential universal method for initial parameter estimation?

1. Induce a grid over the parameter space defined by the parameter ranges.

2. Divide the grid into a finite number of grid points (N).

3. Evaluate an objective function at each grid point.

4. The point where the objective function takes its minimum / maximum value is considered to be the optimum solution.

Step 1: Creating the grid:

upper bound = UBi,

lower bound = LBi ,

# grid points = N.

Grid values for each parameter are

where ri = 0, 1, 2.. N-1 represents the coordinates for the grid points, and i = 1, 2.. p

Example: Let no. of parameters (p) = 2

Let no. of grid points (N) = 4

Total no. of points inside grid = N^p = 4^2 = 16

Thus, grid point values are

which is the parameter set
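As a minimal sketch of Step 1, the C functions below compute the grid values of one parameter from its bounds. The expression follows the form used in the code later in this deck, lowerBound + (index + 1)*stepSize. The step size (UB - LB)/N and the names grid_value and grid_values are assumptions made for illustration, not the authors' definitions.

```c
/* Grid value for one parameter at coordinate r (0 <= r <= n - 1).
 * Assumption: step = (ub - lb) / n, so the values run from
 * lb + step up to ub, mirroring lowerBound + (index + 1)*stepSize. */
double grid_value(double lb, double ub, int n, int r)
{
    double step = (ub - lb) / (double)n;
    return lb + (double)(r + 1) * step;
}

/* Fill out[0..n-1] with all grid values of one parameter. */
void grid_values(double lb, double ub, int n, double *out)
{
    for (int r = 0; r < n; r++)
        out[r] = grid_value(lb, ub, n, r);
}
```

For p = 2 and N = 4, calling grid_values once per parameter yields four values each, and pairing them gives the 16 grid points of the example.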

Step 2: Evaluate objective function at each point

Our goal is to find the point in the grid that best fits the observed data. To do that, calculate the objective function value (SRS) for each point in the grid.

SRS = \sum_{i=1}^{\#\,obs} \left( y_i^{observed} - y_i^{predicted} \right)^2

The goal then becomes finding the point in the grid with the minimum value of SRS.

Step 3: Optimal solution

At each point, compare the SRS value with the previously stored SRS value, and keep the smaller of the two along with its parameters.

Finally, return the minimum SRS value along with the corresponding parameters.
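Steps 1 to 3 can be sketched as one serial routine. This is an illustrative version under stated assumptions, not the authors' code: the objective is passed in as a function pointer, the grid values use the assumed step size (UB - LB)/N, and toy_objective is a hypothetical stand-in for the SRS so that the expected optimum is obvious.

```c
#include <float.h>

#define MAX_PARAMS 8

typedef double (*objective_fn)(const double *params, int p);

/* Exhaustive serial grid search (Steps 1 to 3).
 * lb, ub: bounds per parameter; n: grid points per parameter.
 * best[] receives the parameter set with the smallest objective value,
 * which is also the return value. Requires 1 <= p <= MAX_PARAMS. */
double grid_search(objective_fn f, const double *lb, const double *ub,
                   int p, int n, double *best)
{
    long total = 1;
    for (int i = 0; i < p; i++)
        total *= n;                       /* N^p grid points */

    double best_val = DBL_MAX;
    double params[MAX_PARAMS];

    for (long k = 0; k < total; k++) {
        long tmp = k;
        /* Decode the linear index k into one coordinate per parameter. */
        for (int j = p - 1; j >= 0; j--) {
            int r = (int)(tmp % n);
            tmp /= n;
            params[j] = lb[j] + (r + 1) * (ub[j] - lb[j]) / n;
        }
        double val = f(params, p);
        if (val < best_val) {             /* keep the smaller of the two */
            best_val = val;
            for (int j = 0; j < p; j++)
                best[j] = params[j];
        }
    }
    return best_val;
}

/* Hypothetical objective for illustration: squared distance to (2, 3). */
double toy_objective(const double *x, int p)
{
    (void)p;
    return (x[0] - 2.0) * (x[0] - 2.0) + (x[1] - 3.0) * (x[1] - 3.0);
}
```

With bounds [0, 4] for both parameters and N = 4, the grid contains the point (2, 3), so the search returns it with an objective value of zero.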

Objective: To build a PK/PD model of a low molecular weight (LMW) heparin drug (dose: 100 µg) in order to choose the optimum dose regimen (loading and maintenance doses) for a patient suffering from acute angina pectoris.

Method: Fit the PCT data of LMW heparin to a PK model and derive the final estimates of the model parameters (CL & Vd), then use these final estimates to build a PD model and calculate the loading and maintenance doses.

The PK/PD model developed is used for making predictions of response vs. concentration, which help in optimizing the dosing regimen.

Computational Methods:

1. Grid Search - Serial version

2. Parallelized Grid Search on Xeon Phi

Model Equation:

Data Set

Time(min) Conc(ug/L)

5 1.625

10 1.384

15 1.28

20 1.105

30 0.973

45 0.806

60 0.74

90 0.582

120 0.53

150 0.458

180 0.416

240 0.342

300 0.321

360 0.246

For Initial Estimate, critical factors are:

Number of parameters (p)

Range of parameters (R)

Number of grid points (N)

The figure above shows the effect of the number of grid points on convergence.

Increasing the number of grid points brings the result closer to the optimum.

Relation between no. of Grid Points and Execution Time

Goal

Goal: An optimal solution in a much shorter time-frame!

Is it possible through parallelization?

Serial naive implementation in Java – 153 seconds

Serial naive implementation in C – 77 seconds

Speedup – 1.9 X

Intel Xeon (Host) Specifications

• Intel Xeon IvyBridge E5-2697 V

• 12 Cores @ 2.70 GHz, 30 MB cache

• 64 GB RAM

• GNU/Linux 2.6.32


For a grid with 20 points for 6 parameters – Grid of size 20^6 = 64000000 points

Figure: Serial Approach vs. Parallel Approach. Grid points are numbered by their six grid coordinates: Pt0 (0;0;0;0;0;0), Pt19 (0;0;0;0;0;19), Pt20 (0;0;0;0;1;0), Pt400 (0;0;0;1;0;0), Pt8000 (0;0;1;0;0;0), ..., Pt63999999 (19;19;19;19;19;19).

Serial Approach: the points from Pt0 to the last point are evaluated one after another to obtain the best fit parameters.

Parallel Approach: the 64000000 points are divided among ‘n’ threads (Thread1: Pt0 ... Ptx; Thread2: Ptx+1 ...; Thread3: Pt2x+1 ...; Threadn: Ptnx+1 ... Ptfin). The best fit is then found among the results of all threads to obtain the best fit parameters.

Figure: The grid points are divided among (t) OpenMP threads running on the processor cores. Each thread finds the optimum within its own share (thread 0 ... thread t). After a barrier, the optimum among all threads is selected.
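The scheme in this figure can be sketched with OpenMP as follows. This is an illustration under assumptions rather than the authors' implementation: objective is a hypothetical stand-in for the SRS of grid point k, and the implicit barrier at the end of the parallel region plays the role of the barrier in the figure. Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```c
#include <float.h>

/* Hypothetical objective standing in for the SRS of grid point k. */
static double objective(long k)
{
    double d = (double)(k - 1234);
    return d * d;
}

/* Each thread scans its share of the points and keeps a private optimum;
 * the per-thread optima are then merged one thread at a time. */
long parallel_argmin(long n_points, double *min_val)
{
    double best_val = DBL_MAX;
    long best_idx = -1;

    #pragma omp parallel
    {
        double local_val = DBL_MAX;   /* optimum within this thread */
        long local_idx = -1;

        #pragma omp for nowait
        for (long k = 0; k < n_points; k++) {
            double v = objective(k);
            if (v < local_val) {
                local_val = v;
                local_idx = k;
            }
        }

        #pragma omp critical
        {
            if (local_val < best_val) {   /* optimum among all threads */
                best_val = local_val;
                best_idx = local_idx;
            }
        }
    }

    *min_val = best_val;
    return best_idx;
}
```

Keeping the running minimum private to each thread means the shared result is touched only once per thread, inside the critical section.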


Number of OpenMP threads Execution Time (sec)

60 11.62

120 7.59

180 6.62

240 6.16

v1: Previous = 77 sec, New = 6.16 sec (92% reduction)

Intel Xeon Phi (Coprocessor) Specification

• Intel Xeon E5v2

• 61 core, 244 threads @ 1.2 GHz, 512 KB cache

• 16 GB RAM

• GNU/Linux 2.6.38.8+mpss3.1.2


https://www.tacc.utexas.edu/c/document_library/get_file?uuid=ed331f32-49db-4c4b-9ea7-f7d9547c79d9&groupId=13601

Figure: Variation of Execution Time with Threads. x-axis: Number of Threads (60, 120, 180, 240); y-axis: Execution Time (seconds), 0 to 14.

Figure: Timeline in which OpenMP parallel regions are shown as frames on the grid. The larger frame and the smaller frame each mark an OpenMP parallel region in time; the region outside the frames is serial.

Compiler Option: Description (Reduction in run time)

-fimf-accuracy-bits=11: Defines the relative error, measured by the number of correct bits, for math library function results (~2.02 sec)

-fimf-precision=low: Equivalent to accuracy-bits = 11 for single-precision functions and accuracy-bits = 26 for double-precision functions (~2.02 sec)

-fimf-domain-exclusion=31 (all): Indicates the input argument domain on which math functions must provide correct results. As more classes are excluded, faster code sequences can be used (~0.64 sec)

-no-prec-div: Enables optimizations that give slightly less precise results than full IEEE division (~0.82 sec)

-no-prec-sqrt: Uses a faster but less precise implementation of square root (~0.82 sec)

-fp-model fast=2: Enables more aggressive optimizations on floating-point data (~0.68 sec)

Used in combination with each other: ~2.6 sec

Previous = 6.16 sec, New = 3.56 sec (42% reduction)

Trials with different exponentiation functions in the equation to calculate dose effect, without affecting the accuracy of results:

• exp()

• expf()

• exp2()

• exp2f()

Arithmetic equation for dose effect calculation, with execution time

Original equation (3 sec, 560231 usec):

amt = A1*exp(-alpha*tp) + B1*exp(-beta*tp) + C1*exp(-gamma*tp);

Changing from double to float (2 sec, 574300 usec):

amt = A1*expf(-alpha*tp) + B1*expf(-beta*tp) + C1*expf(-gamma*tp);

Changing from base e to base 2 (3 sec, 2729 usec):

tmp = -1.0*tp*M_LOG2E;
amt = A1*exp2(alpha*tmp) + B1*exp2(beta*tmp) + C1*exp2(gamma*tmp);

Changing from base e to base 2 and from double to float (2 sec, 572489 usec):

tmp = -1.0*tp*M_LOG2E;
amt = A1*exp2f(alpha*tmp) + B1*exp2f(beta*tmp) + C1*exp2f(gamma*tmp);



Equation to calculate the result changed due to the high cost of the “fmax” function

From: res = res + fmax(0,amt);

To: ((amt > 0.0) ? (res = res + amt):0);

v3: Previous = 2.57 sec, New = 2.11 sec (17% reduction)

Loops are not getting vectorized! The code needs to be cleaned up.

Generate a vectorization report using the compile option “vec-report[0-6]”

Reports which loops are not vectorized and why

Sample Output of vec-report

Problems with code:

for (i = 0; i < r; i++) {
    double xf = X[i*Nfun + fn_no];
    double yf = Y[i*Nfun + fn_no];
    if (xf == 999999.9)
        errorMat[i] = 0;
    else {
        fn = func_diff18(par, xf);
        errorMat[i] = yf - fn;
    }
}

Checks inside a loop prevent vectorization.

xf is an array element used in a loop in ‘func_diff18’, creating a dependence.

Problem: Removing checks inside the loop

• Identify what the checks signify – in our case they prevented the function from being called for outliers

Solution: Remove the outliers from the set of parameters

• Perform the checks at initialization

• Send in a ‘perfect’ set of parameters

v4

Problem: Removing dependence of variables between outer/inner loops

• Examine the function – in our case it had redundant assignments and a single-iteration inner loop

Solution: Collapse the inner loop into the outer loop
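The first fix, moving the check out of the hot loop, might look like the sketch below. It assumes the sentinel 999999.9 marks an outlier, as in the loop shown earlier, and it uses a hypothetical straight-line predict function in place of func_diff18, whose body is not shown in the deck. All function names are illustrative.

```c
#define SENTINEL 999999.9   /* marker value used for outliers in the source loop */

/* Copy only valid observations into xs/ys; returns how many were kept.
 * Done once at initialization so the hot loop needs no check. */
int compact_observations(const double *X, const double *Y, int r,
                         double *xs, double *ys)
{
    int kept = 0;
    for (int i = 0; i < r; i++) {
        if (X[i] != SENTINEL) {
            xs[kept] = X[i];
            ys[kept] = Y[i];
            kept++;
        }
    }
    return kept;
}

/* Hypothetical stand-in for func_diff18: a straight line par[0] + par[1]*x. */
static double predict(const double *par, double x)
{
    return par[0] + par[1] * x;
}

/* Branch-free residual loop over the compacted data. */
void residuals(const double *par, const double *xs, const double *ys,
               int kept, double *errorMat)
{
    for (int i = 0; i < kept; i++)
        errorMat[i] = ys[i] - predict(par, xs[i]);
}
```

Because the original loop stored a residual of zero for outliers, dropping them up front leaves the sum of squared residuals unchanged while giving the compiler a loop body without a data-dependent branch.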



// code to choose parameter sets (0,0,0,0,0,0) – (19,19,19,19,19,19)
for (k = 0; k < nPts; k++) {
    tmp = k;
    int gIdx[noOfParam];
    for (j = noOfParam-1; j >= 0; j--) {
        gIdx[j] = tmp % gridPts;
        tmp = tmp / gridPts;
        parameter[j] = lowerBound[j] + (gIdx[j] + 1)*stepSize[j];
    }
    ……
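The inner loop above is a base-gridPts expansion of the point number k. The standalone sketch below isolates that step so the mapping can be checked against the point numbering used earlier (for example Pt8000 = (0;0;1;0;0;0)); the function name decode_point is an illustrative assumption.

```c
/* Decode linear point number k into one grid index per parameter,
 * with the last parameter varying fastest (a base-gridPts expansion). */
void decode_point(long k, int gridPts, int noOfParam, int *gIdx)
{
    long tmp = k;
    for (int j = noOfParam - 1; j >= 0; j--) {
        gIdx[j] = (int)(tmp % gridPts);
        tmp = tmp / gridPts;
    }
}
```

Each call costs one modulo and one division per parameter, which is the per-point overhead that the conversion to nested loops removes.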

v6

// convert to 6 loops – 1 for each parameter

for(k1=0; k1<gridPts; k1++) {






parameter[0] = k1*stepSize[0];






The Vtune hotspot analysis shows that the hotspot has been removed.



Previous New

#pragma omp for collapse(3) nowait schedule(dynamic)
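A sketch of how this directive could sit on top of the six nested loops follows. It is an assumption-laden illustration: the grid is shrunk to 4 points per parameter, objective6 is a hypothetical stand-in for the SRS, and the merge of per-thread results through a critical section is one possible choice rather than the authors' code. Without OpenMP support the pragmas are ignored and the loops run serially.

```c
#include <float.h>

#define GRID_PTS 4

/* Hypothetical objective over six grid coordinates; minimum at (0,1,2,3,0,1). */
static double objective6(const int g[6])
{
    double s = 0.0;
    for (int i = 0; i < 6; i++) {
        double d = (double)(g[i] - (i % GRID_PTS));
        s += d * d;
    }
    return s;
}

double nested_search(int best[6])
{
    double best_val = DBL_MAX;

    #pragma omp parallel
    {
        double local_val = DBL_MAX;
        int local[6] = {0};

        /* The three outer loops are fused into one iteration space
         * and handed out to threads in chunks at run time. */
        #pragma omp for collapse(3) nowait schedule(dynamic)
        for (int k1 = 0; k1 < GRID_PTS; k1++)
            for (int k2 = 0; k2 < GRID_PTS; k2++)
                for (int k3 = 0; k3 < GRID_PTS; k3++) {
                    for (int k4 = 0; k4 < GRID_PTS; k4++)
                        for (int k5 = 0; k5 < GRID_PTS; k5++)
                            for (int k6 = 0; k6 < GRID_PTS; k6++) {
                                int g[6] = {k1, k2, k3, k4, k5, k6};
                                double v = objective6(g);
                                if (v < local_val) {
                                    local_val = v;
                                    for (int i = 0; i < 6; i++)
                                        local[i] = g[i];
                                }
                            }
                }

        #pragma omp critical
        {
            if (local_val < best_val) {
                best_val = local_val;
                for (int i = 0; i < 6; i++)
                    best[i] = local[i];
            }
        }
    }
    return best_val;
}
```

Collapsing three loops of 20 iterations each would give 8000 work items, which divides more evenly over 240 threads than the 20 iterations of the outermost loop alone.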

v7: 11%

Figure: Pre-optimization vs. post-optimization.

A call to malloc() is taking 12.8% of the time!

From:

double *error(double * par, double * X, double * Y, int fn_no, int * row)
{
    ……
    double *errorMat = (double *)malloc(sizeof(double)*r);
    ……
}

To:

double *error(double * par, double * X, double * Y, int fn_no, int * row)
{
    ……
    double errorMat[NO_OBS];
    ……
}

A call to malloc() is executed on the Xeon Phi, which makes it expensive.

35%
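One way to remove the allocation safely is for the caller to own the buffer, as in the sketch below. This is a variant under stated assumptions and not the authors' code: the elided parts of error() are not shown in the deck, a pointer to a function-local array must not be returned in C, and the linear model par[0] + par[1]*x is a hypothetical stand-in for the real model function. The names error_into and srs_for are illustrative.

```c
#define NO_OBS 14

/* Variant with no heap allocation: the caller owns the buffer, so it can
 * live on the caller's stack and be reused for every grid point.
 * Assumption: a linear model par[0] + par[1]*x replaces the source's
 * model function, which is not shown. */
void error_into(const double *par, const double *X, const double *Y,
                int r, double *errorMat)
{
    for (int i = 0; i < r; i++)
        errorMat[i] = Y[i] - (par[0] + par[1] * X[i]);
}

/* Example caller: SRS for one parameter set, buffer on the stack. */
double srs_for(const double *par, const double *X, const double *Y, int r)
{
    double errorMat[NO_OBS];          /* requires r <= NO_OBS */
    double sum = 0.0;
    error_into(par, X, Y, r, errorMat);
    for (int i = 0; i < r; i++)
        sum += errorMat[i] * errorMat[i];
    return sum;
}
```

A fixed-size stack array works here because the number of observations is small and known at compile time; each thread gets its own copy without any call into the allocator.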


The entry for malloc() is no longer there


Figure: The grid points are divided among 24/48 (n) threads on the Xeon cores plus 60/120/240 (t) OpenMP threads on the Xeon Phi cores. Each thread finds the optimum within its own share (thread 0 ... thread n-1 on the Xeon cores; thread 0 ... thread t-1 on the Xeon Phi cores, for all t threads). After an MPI barrier, the optimum among all threads is selected.

Figure: Speedup w.r.t. serial version on 1 CPU (77 sec). y-axis: Speed Up (0 to 400). Execution times labelled on the chart: 11.6 sec, 6.16 sec, 3.56 sec, 2.11 sec, 0.60 sec, 0.39 sec, 0.22 sec. A bracket marks the optimizations on native mode.

Backup slides on OpenFOAM optimization

Figure: Runtime of Motorbike Case, 4.2M Workload. Time in secs (lower is better), Baseline vs. Optimized.

2S Intel® Xeon® processor E5-2697v2 (Native) - 24 Cores: Baseline 213, Optimized 187

Intel® Xeon Phi™ coprocessor 7120A (Native) - 60 Cores: Baseline 635, Optimized 194 (3.3X)

2S E5-2697v2 + Intel® Xeon Phi™ coprocessor 7120A (Symmetric) - 24 + 60 Cores: Baseline 518, Optimized 135 (3.9X)

1.4X Speedup due to MIC addition

Results:

• 3.3X Speedup on Xeon Phi native execution

• 1.4X Speedup w.r.t. Xeon optimized result: (Xeon / Xeon + Phi) = (187/135) = 1.4X

Code Optimization Strategy:

• Vectorization (AVX on CPU / 512-bit vectorization intrinsics on Phi™)

• Prefetching

• Cache Optimizations

• Optimized Decomposition Algorithm modification for Symmetric runs

• Compiler Flags

• Added #pragma unroll to improve loop performance on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors

• IO Optimizations

• Cleaning the code, cache blocking, helping auto-vectorization, prefetch distance, unroll factor

• Detailed Profiling of Hotspots

Execution Model: Native, Symmetric Mode

Software: Intel C++ Compiler, Intel MPI, Vtune Profiler, ITAC

Custom designed Decomposition algorithm for Xeon + MIC

1.14X

Figure: ITAC Message Profile for 84 cores Symmetric run (Xeon and Xeon Phi), Original vs. Modified. The profile shows load imbalance; MPI_WAITALL is reduced drastically with the modified decomposition algorithm.

Code structure:

• initializations

• omp parallel for: SRS calculation; determine local minimum SRS

• omp single: determine the first minimum SRS value

• omp barrier

• omp critical: determine the minimum SRS among all threads

Measured times: 0.002896 seconds, 0.370136 seconds, 0.010665 seconds, 0.000037 seconds

Total Execution Time: 0.384007 seconds

Total Time in parallel region (0.37 + 0.01 + 0.000037): 0.380838 seconds

Initialization: 0.002896 seconds
