246 255 155 190 28 42 dark 1 light 1 dark 2 light 2 accent 1 … · 2020. 1. 14. · the pk/pd...
TRANSCRIPT
Parallel Implementation of PK-PD Parameter
Estimation on Xeon Phi Using Grid Search Method
Nishant Agrawal, R. Narayanan, Manoj Nambiar, Payal Guha Nandy,
Rihab Abdulrazak, Ambuj Pandey, Shyamsundar Das
Performance Engineering Research Center, TCS Innovation Lab, Mumbai
Drug Development R&D Group, TCS Innovation Labs, Hyderabad
Agenda
Pharma R&D Productivity
o Reasons for Poor R&D Productivity
Model Based Drug Development
Generation of Insights from integrated data
PK-PD Modelling
o Initial Parameter Estimation
o Scope and Limitations
o Parallelized Grid Search on Xeon Phi - A new and effective
approach
o Miscellaneous Optimization on Xeon Phi
o Result Comparison
Summary
Problem Statement: Pharma R&D Productivity
Steven M. Paul, “How to improve R&D productivity: the pharmaceutical industry’s grand challenge”, Nature Reviews Drug Discovery, Vol. 9, 203-214, 2010.
Potential Solution
Model Based Drug Development is based on three themes: Integration, Innovation, and Impact. It is the quantitative integration of multisource data and knowledge through the application of clinical, biomedical, biological, engineering, statistical, and mathematical concepts, resulting in continuous methodological and technological innovation that enhances scientific understanding and knowledge, which in turn has an impact on the discovery, research, development, approval, and utilization of new medicines.
MBDD Focus Areas@TCS
Pharmacokinetics: how the drug is processed by the body, i.e. the kinetics of the drug. The plasma drug concentration vs. time profile is studied.
Pharmacodynamics: how the drug affects the body, i.e. the dynamics of the drug. The effect vs. time profile is studied.
PK-PD Modeling: how the drug effect is governed by plasma concentration (dose). The effect vs. concentration profile is established.
PK-PD Modelling
PK-PD Parameters Estimation Approaches and Why?
CA- Multi Compartment Model
In our case study, we consider only this multi-compartment model.
Here A, B, C, Alpha, Beta, and Gamma are the parameters.
The goal is to find the parameter values for which the model best fits the available data.
Impact of the Estimated Parameters in Drug Development
[Table: columns are Parameter, PK Parameter Derived, and Relevance for Drug Development; rows are A, B, C, ALPHA, BETA, GAMMA. The cell contents are not present in this transcript.]
PK-PD Parameter Estimation Steps
• Parameter bounds: universal?
• Initial parameter (IP) estimation method: deterministic vs. random?
• Which is going to solve this problem?
Is grid search a potential universal method for initial parameter estimation?
1. Induce a grid over the parameter space defined by the parameter ranges.
2. Divide the grid into a finite number of grid points (N).
3. Evaluate an objective function at each grid point.
4. The point where the objective function takes its minimum or maximum value is considered the optimum solution.
Step 1: Creating the grid
For each parameter i: upper bound = UB_i, lower bound = LB_i, number of grid points = N.
Grid values for each parameter lie between LB_i and UB_i and are indexed by r_i = 0, 1, 2, ..., N-1, the coordinates of the grid points, with i = 1, 2, ..., p.
Example: let the number of parameters (p) = 2 and the number of grid points (N) = 4.
Total number of points inside the grid = N^p = 4^2 = 16.
Each grid point (r_1, r_2) then gives one value for each of the two parameters, which is the parameter set.
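The grid construction in Step 1 can be sketched in C as follows. The slide's grid-value formula did not survive in this transcript, so the sketch assumes evenly spaced points that include both bounds, value = LB + r * (UB - LB) / (N - 1). The function names grid_value, grid_size, and print_grid_2d and the example bounds are hypothetical.

```c
#include <stdio.h>

/* Value of grid coordinate r (0..n-1) for a parameter bounded by [lb, ub].
 * Assumption: points are evenly spaced and include both bounds. */
double grid_value(double lb, double ub, int n, int r)
{
    double step = (ub - lb) / (double)(n - 1);
    return lb + r * step;
}

/* Number of points in a grid with n points per parameter and p parameters. */
long grid_size(int n, int p)
{
    long total = 1;
    for (int i = 0; i < p; i++)
        total *= n;
    return total;
}

/* Print every parameter set of a 2-parameter grid (hypothetical bounds). */
void print_grid_2d(int n)
{
    const double lb[2] = {0.0, 10.0};
    const double ub[2] = {3.0, 40.0};
    for (int r1 = 0; r1 < n; r1++)
        for (int r2 = 0; r2 < n; r2++)
            printf("(%g, %g)\n",
                   grid_value(lb[0], ub[0], n, r1),
                   grid_value(lb[1], ub[1], n, r2));
}
```

Calling print_grid_2d(4) lists the 16 parameter sets of the example, and grid_size(20, 6) gives the 64,000,000 points used later in the case study.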
Step 2: Evaluate objective function at each point
Our goal is to find the point in the grid that best fits the observed data. To do that, calculate the objective function value (SRS) for each point in the grid:
$$\mathrm{SRS} = \sum_{i=1}^{\#\,\text{obs}} \left( y_i^{\text{observed}} - y_i^{\text{predicted}} \right)^2$$
The goal then becomes finding the point in the grid with the minimum value of SRS.
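As a minimal sketch of the objective function, the SRS for one grid point can be computed from the observed values and the model's predictions at the same time points. The function name srs and its signature are hypothetical; the predictions are assumed to have been computed already by the PK model.

```c
#include <stddef.h>

/* Sum of squared residuals between observed and predicted values. */
double srs(const double *observed, const double *predicted, size_t n_obs)
{
    double sum = 0.0;
    for (size_t i = 0; i < n_obs; i++) {
        double diff = observed[i] - predicted[i];
        sum += diff * diff;
    }
    return sum;
}
```

A smaller return value means the parameter set at that grid point fits the observations better.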
Step 3: Optimal solution
At each point, compare the SRS value with the previously stored SRS value and keep the smaller of the two, along with its parameters.
Finally, return the minimum SRS value along with its parameters.
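The running-minimum logic of Step 3 can be sketched as a serial search over a two-parameter grid. This is an illustration under stated assumptions, not the authors' implementation: toy_objective stands in for the SRS calculation, the grid is assumed to be evenly spaced including both bounds, and all names are hypothetical.

```c
#include <float.h>

/* Hypothetical objective used only for illustration: squared distance
 * of (a, b) from the point (2, 30). A real run would compute the SRS. */
double toy_objective(double a, double b)
{
    return (a - 2.0) * (a - 2.0) + (b - 30.0) * (b - 30.0);
}

typedef struct {
    double a, b;   /* best parameter set found so far */
    double value;  /* objective value at that set */
} best_fit;

/* Serial grid search over two parameters with n points each.
 * Assumption: evenly spaced points that include both bounds. */
best_fit grid_search_2d(double lb_a, double ub_a,
                        double lb_b, double ub_b, int n)
{
    best_fit best = {lb_a, lb_b, DBL_MAX};
    double step_a = (ub_a - lb_a) / (n - 1);
    double step_b = (ub_b - lb_b) / (n - 1);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double a = lb_a + i * step_a;
            double b = lb_b + j * step_b;
            double v = toy_objective(a, b);
            if (v < best.value) {   /* keep the smaller of the two */
                best.a = a;
                best.b = b;
                best.value = v;
            }
        }
    }
    return best;
}
```

Only one stored minimum is needed at any time, so the memory cost stays constant however many grid points are evaluated.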
Objective: To build a PK/PD model of a low molecular weight (LMW) heparin drug (dose: 100 µg) to choose the optimum dose regimen (loading and maintenance doses) for a patient suffering from acute angina pectoris.
Method: Fit the PCT data of LMW heparin to a PK model and derive the final estimates of the model parameters (CL & Vd), then use these final estimates to build a PD model and calculate the loading and maintenance doses.
The PK/PD model developed is used to make predictions of response vs. concentration, which help in optimizing the dosing regimen.
Computational Methods:
1. Grid Search - serial version
2. Parallelized Grid Search on Xeon Phi
Model Equation: $\mathrm{Conc}(t) = A\,e^{-\alpha t} + B\,e^{-\beta t} + C\,e^{-\gamma t}$
Data Set
Time(min) Conc(ug/L)
5 1.625
10 1.384
15 1.28
20 1.105
30 0.973
45 0.806
60 0.74
90 0.582
120 0.53
150 0.458
180 0.416
240 0.342
300 0.321
360 0.246
For Initial Estimate, critical factors are:
Number of parameters (p)
Range of parameters (R)
Number of grid points (N)
The figure above shows the effect of the number of grid points on convergence.
Increasing the number of grid points produces a result closer to the optimum.
Relation between Number of Grid Points and Execution Time
Goal
Goal: An optimal solution in a much shorter time-frame!
Is it possible through parallelization?
Serial naive implementation in Java – 153 seconds
Serial naive implementation in C – 77 seconds
Speedup – 1.9 X
Intel Xeon (Host) Specifications
• Intel Xeon IvyBridge E5-2697 v2
• 12 cores @ 2.70 GHz, 30 MB cache
• 64 GB RAM
• GNU/Linux 2.6.32
For a grid with 20 points for each of 6 parameters, the grid has 20^6 = 64000000 points, indexed Pt0 to Pt63999999:
Pt0 (0;0;0;0;0;0)
Pt19 (0;0;0;0;0;19)
Pt20 (0;0;0;0;1;0)
Pt400 (0;0;0;1;0;0)
Pt8000 (0;0;1;0;0;0)
...
Pt63999999 (19;19;19;19;19;19)
Serial Approach: evaluate Pt0 through Pt63999999 in sequence and keep the best-fit parameters.
Parallel Approach: the 64000000 points are divided among ‘n’ threads. Thread1 takes Pt0 to Ptx, Thread2 starts at Ptx+1, Thread3 starts at Pt2x+1, and so on up to Threadn, which ends at Ptfin. The best fit among the results of all threads gives the best-fit parameters.
Grid points are divided among (t) OpenMP threads running on the processor cores.
Each thread finds the optimum within its own share of grid points (thread 0 through thread t-1), for all t threads.
Barrier
Optimum among all threads
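The scheme above can be sketched with OpenMP as follows. This is a minimal illustration, not the authors' code: objective_at is a hypothetical stand-in for the SRS evaluation at grid point k, and the merge of per-thread results uses a critical section, which is one of several ways to combine them.

```c
#include <float.h>

/* Hypothetical objective for illustration; a real run would compute the
 * SRS of the model at grid point k. Its minimum is at k = 12345. */
static double objective_at(long k)
{
    double d = (double)(k - 12345);
    return d * d;
}

/* Parallel search over n_pts grid points: each thread tracks its own
 * minimum, then the per-thread results are merged one at a time.
 * If OpenMP is not enabled, the pragmas are ignored and the loop
 * runs serially with the same result. */
long parallel_argmin(long n_pts, double *min_value)
{
    long   best_k = -1;
    double best_v = DBL_MAX;

    #pragma omp parallel
    {
        long   local_k = -1;       /* optimum within this thread */
        double local_v = DBL_MAX;

        #pragma omp for nowait
        for (long k = 0; k < n_pts; k++) {
            double v = objective_at(k);
            if (v < local_v) {
                local_v = v;
                local_k = k;
            }
        }

        #pragma omp critical
        {
            if (local_v < best_v) {  /* optimum among all threads */
                best_v = local_v;
                best_k = local_k;
            }
        }
    }

    *min_value = best_v;
    return best_k;
}
```

Keeping local_k and local_v private to each thread means the hot loop touches no shared state; the only synchronization is the short merge at the end. The block needs the compiler's OpenMP option to run in parallel, for example -fopenmp with GCC.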
Number of OpenMP threads Execution Time (sec)
60 11.62
120 7.59
180 6.62
240 6.16
Previous = 77 sec New = 6.16 sec 92% (v1)
Intel Xeon Phi (Coprocessor) Specification
• Intel Xeon E5v2
• 61 cores, 244 threads @ 1.2 GHz, 512 KB cache
• 16 GB RAM
• GNU/Linux 2.6.38.8+mpss3.1.2
https://www.tacc.utexas.edu/c/document_library/get_file?uuid=ed331f32-49db-4c4b-9ea7-f7d9547c79d9&groupId=13601
[Figure: Variation of Execution Time with Threads. X-axis: Number of Threads (60, 120, 180, 240). Y-axis: Execution Time (seconds), 0 to 14.]
OpenMP parallel region marked in
time: larger frame
OpenMP parallel regions are shown as frames on grid
This region is serial: marked outside frame
Smaller frame
Compiler Option | Description | Reduction in run time
-fimf-accuracy-bits=11 | Defines the relative error, measured by the number of correct bits, for math library function results | ~2.02 sec
-fimf-precision=low | Equivalent to accuracy-bits = 11 for single-precision functions and accuracy-bits = 26 for double-precision functions | ~2.02 sec
-fimf-domain-exclusion=31 (all) | Indicates the input argument domain on which math functions must provide correct results. As more classes are excluded, faster code sequences can be used | ~0.64 sec
-no-prec-div | Enables optimizations that give slightly less precise results than full IEEE division | ~0.82 sec
-no-prec-sqrt | Uses a faster but less precise implementation of square root | ~0.82 sec
-fp-model fast=2 | Enables more aggressive optimizations on floating-point data | ~0.68 sec
Used in combination with each other | | ~2.6 sec
Previous = 6.16 sec New = 3.56 sec 42%
Trials with different exponentiation functions in the equation to calculate dose effect, without affecting the accuracy of results:
• exp()
• expf()
• exp2()
• exp2f()
Arithmetic Equation for dose effect calculation Execution Time
Original Equation
amt = A1*exp(-alpha*tp) + B1*exp(-beta*tp) +
C1*exp(-gamma*tp);
3 sec, 560231 usec
Changing from double to float
amt = A1*expf(-alpha*tp) + B1*expf(-beta*tp) +
C1*expf(-gamma*tp);
2 sec, 574300 usec
Changing from base e to base 2
tmp = -1.0*tp*M_LOG2E;
amt = A1*exp2(alpha*tmp) + B1*exp2(beta*tmp) +
C1*exp2(gamma*tmp);
3 sec, 2729 usec
Changing from base e to base 2 and from double to float
tmp = -1.0*tp*M_LOG2E;
amt = A1*exp2f(alpha*tmp) + B1*exp2f(beta*tmp) +
C1*exp2f(gamma*tmp);
2 sec, 572489 usec
Previous = 3.56 sec New = 2.57 sec 27%
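The base-2 variants rest on the change-of-base identity below, which is why tmp is scaled by M_LOG2E (the constant log2(e) from math.h). The symbols follow the code above, with t standing for tp.

```latex
e^{-\alpha t} = 2^{-\alpha t \log_2 e},
\qquad
\mathrm{tmp} = -t \,\log_2 e
\;\Longrightarrow\;
A\,e^{-\alpha t} + B\,e^{-\beta t} + C\,e^{-\gamma t}
= A\,2^{\alpha\,\mathrm{tmp}} + B\,2^{\beta\,\mathrm{tmp}} + C\,2^{\gamma\,\mathrm{tmp}}
```

The two forms are mathematically equal, so any difference in the results comes only from floating-point rounding and from the precision of the float versus double functions.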
Equation to calculate result changed due to the high cost of the fmax() function
From: res = res + fmax(0,amt);
To: ((amt > 0.0) ? (res = res + amt):0);
Previous = 2.57 sec New = 2.11 sec 17% (v3)
Loops are not getting vectorized! The code needs to be cleaned up.
Generate a vectorization report using the compile option -vec-report[0-6]
The report lists which loops are not vectorized and why
Sample Output of vec-report
Problems with code:
    for (i = 0; i < r; i++) {
        double xf = X[i*Nfun + fn_no];
        double yf = Y[i*Nfun + fn_no];
        if (xf == 999999.9)            /* check inside the loop */
            errorMat[i] = 0;
        else {
            fn = func_diff18(par, xf);
            errorMat[i] = yf - fn;
        }
    }
• Checks inside a loop prevent vectorization
• xf is an array element used in a loop in ‘func_diff18’, creating a dependence
Problem: Removing checks inside the loop
• Identify what the checks signify. In our case they prevented the function from being called for outliers
Solution: Remove the outliers from the set of parameters
• Perform the checks at initialization
• Send in a ‘perfect’ set of parameters
Previous = 2.11 sec New = 1.35 sec 36% (v4)
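One way to move the check to initialization is sketched below: the sentinel entries are dropped once, so the residual loop runs without a branch. This is an assumption about how the clean-up could be done, not the authors' code. The names drop_missing, model_value, and residuals are hypothetical, and model_value is a simple linear placeholder for func_diff18. Dropping an entry has the same effect on the sum of squares as the original errorMat[i] = 0.

```c
#include <stddef.h>

#define MISSING_X 999999.9  /* sentinel value checked in the original loop */

/* Copy only the valid observations into x_out/y_out and return how many
 * were kept. Run once at initialization so the hot loop needs no check. */
size_t drop_missing(const double *x, const double *y, size_t n,
                    double *x_out, double *y_out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] != MISSING_X) {
            x_out[kept] = x[i];
            y_out[kept] = y[i];
            kept++;
        }
    }
    return kept;
}

/* Hypothetical placeholder for func_diff18: a straight line in x. */
static inline double model_value(const double *par, double x)
{
    return par[0] + par[1] * x;
}

/* Branch-free residual loop over the cleaned data. */
void residuals(const double *par, const double *x, const double *y,
               size_t n, double *error_out)
{
    for (size_t i = 0; i < n; i++)
        error_out[i] = y[i] - model_value(par, x[i]);
}
```

Because the filter runs once while the residual loop runs for every grid point, the cost of the check is paid a single time instead of 64000000 times.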
Problem: Removing dependence of variables between outer/inner loops
• Examine the function. In our case there were redundant assignments and a single-iteration inner loop
Solution: Collapse the inner loop into the outer loop
Previous = 1.35 sec New = 1.28 sec 5%
// code to choose parameter sets (0,0,0,0,0,0) – (19,19,19,19,19,19)
for(k=0; k<nPts; k++) {
tmp = k;
int gIdx[noOfParam];
for(j=noOfParam-1; j>=0; j--) {
gIdx[j] = tmp%gridPts;
tmp = tmp/gridPts;
parameter[j] = lowerBound[j] + (gIdx[j] + 1)*stepSize[j];
}
……
v6
// convert to 6 loops – 1 for each parameter
for(k1=0; k1<gridPts; k1++) {
for(k2=0; k2<gridPts; k2++) {
for(k3=0; k3<gridPts; k3++) {
for(k4=0; k4<gridPts; k4++) {
for(k5=0; k5<gridPts; k5++) {
for(k6=0; k6<gridPts; k6++) {
parameter[0] = k1*stepSize[0];
parameter[1] = k2*stepSize[1];
parameter[2] = k3*stepSize[2];
parameter[3] = k4*stepSize[3];
parameter[4] = k5*stepSize[4];
parameter[5] = k6*stepSize[5];
The VTune hotspot analysis shows that the hotspot has been removed.
Previous = 1.28 sec New = 0.68 sec 46%
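The two enumeration styles visit the same parameter sets in the same order, which can be checked with a short sketch. It uses three parameters instead of six for brevity, and the names decode_index and count_mismatches are hypothetical. The flat-index version pays one modulo and one division per parameter at every grid point, while the nested loops get the coordinates directly from the loop counters.

```c
/* Decode flat index k into p grid coordinates (last coordinate varies
 * fastest), as in the single-loop version: one % and one / per parameter. */
void decode_index(long k, int grid_pts, int p, int *coord)
{
    for (int j = p - 1; j >= 0; j--) {
        coord[j] = (int)(k % grid_pts);
        k /= grid_pts;
    }
}

/* Walk a 3-parameter grid with nested loops and check that the loop
 * counters match the decoded flat index at every point.
 * Returns the number of mismatches (0 means the orders agree). */
long count_mismatches(int grid_pts)
{
    long k = 0, mismatches = 0;
    int coord[3];

    for (int k1 = 0; k1 < grid_pts; k1++)
        for (int k2 = 0; k2 < grid_pts; k2++)
            for (int k3 = 0; k3 < grid_pts; k3++) {
                decode_index(k, grid_pts, 3, coord);
                if (coord[0] != k1 || coord[1] != k2 || coord[2] != k3)
                    mismatches++;
                k++;
            }
    return mismatches;
}
```

With 20 grid points and 6 parameters, decode_index maps index 8000 to (0;0;1;0;0;0), matching the Pt8000 entry listed earlier.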
Previous New
#pragma omp for collapse(3) nowait schedule(dynamic)
Previous = 0.68 sec New = 0.60 sec 11% (v7)
Pre Optimization
Post Optimization
A call to malloc() is taking 12.8% of the time !!!
From:
double *error(double * par, double * X, double * Y, int fn_no, int * row)
{
……
double *errorMat = (double *)malloc(sizeof(double)*r);
……
}
--------------------------------------------------------------------------
To:
double *error(double * par, double * X, double * Y, int fn_no, int * row)
{
……
double errorMat[NO_OBS];
……
}
The call to malloc() is executed on Xeon Phi, making it expensive
Previous = 0.60 sec New = 0.39 sec 35%
The entry for malloc() is no longer there
Grid points are divided among 24/48 (n) threads on the Xeon cores plus 60/120/240 (t) OpenMP threads on the Xeon Phi cores.
Xeon cores: optimum within thread 0 through thread n-1.
Xeon Phi cores: optimum within thread 0 through thread t-1, for all t threads.
MPI Barrier
Optimum among all threads
Previous = 0.39 sec New = 0.22 sec 44%
[Figure: Speedup w.r.t. serial version on 1 CPU (77 sec). Y-axis: Speed Up, 0 to 400. Bars are labeled with execution times 11.6 sec, 6.16 sec, 3.56 sec, 2.11 sec, 0.60 sec, 0.39 sec, and 0.22 sec; a bracket marks the optimizations on native mode.]
Backup slides on OpenFOAM optimization
Runtime of Motorbike Case, 4.2M Workload (time in secs, lower is better)
Configuration | Baseline | Optimized
2S Intel® Xeon® processor E5-2697v2 (Native) - 24 Cores | 213 | 187
Intel® Xeon Phi™ coprocessor 7120A (Native) - 60 Cores | 635 | 194
2S E5-2697v2 + Intel® Xeon Phi™ coprocessor 7120A (Symmetric) - 24 + 60 Cores | 518 | 135
Speedup labels on the chart: 3.3X, 3.9X
1.4X Speedup due to MIC addition
Results:
• 3.3X speedup on Xeon Phi native execution
• 1.4X speedup w.r.t. Xeon optimized result (Xeon / Xeon + Phi) = (187/135) = 1.4X
Code Optimization Strategy:
• Vectorization (AVX on CPU / 512-bit vectorization intrinsics on Phi™)
• Prefetching
• Cache optimizations
• Optimized decomposition algorithm modification for symmetric runs
• Compiler flags
• Added #pragma unroll to improve loop performance on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors
• IO optimizations
• Cleaning the code, cache blocking, helping auto-vectorization, prefetch distance, unroll factor
• Detailed profiling of hotspots
Execution Model: Native, Symmetric Mode
Software: Intel C++ Compiler, Intel MPI, VTune Profiler, ITAC
Custom designed decomposition algorithm for Xeon + MIC
1.14X
ITAC message profile for the 84-core symmetric run shows load imbalance (panels: Original, Modified; rows in each panel: Xeon, Xeon Phi)
MPI_WAITALL reduced drastically with the modified decomposition algorithm
initializations
omp parallel for
SRS Calculation
Determine local minimum SRS
omp single
Determine the first minimum SRS value
omp barrier
omp critical
Determine the minimum SRS among all threads
0.002896 seconds
0.370136 seconds
0.010665 seconds
0.000037 seconds
Total Execution Time 0.384007 seconds
Total Time in parallel region (0.370136 + 0.010665 + 0.000037) 0.380838 seconds
Initialization 0.002896 seconds