Exploring Statistical Tools for Predicting Binary Outcome
Mark A. Rizzardi
Joseph E. Carroll
The Problem
Hepatitis C is a problem worth addressing in Humboldt County, with an incidence rate of about 0.2%, roughly 1½ to 2 times that of California as a whole over the last six years, and a prevalence rate of 2.3%.
70% to 85% of those acutely infected go on to chronic infection.
20% of those with chronic infection go on to cirrhosis, i.e., liver failure.
The Current Treatment
Right now, treatment consists of interferon and ribavirin for 48 weeks for genotype 1 and 24 weeks for other genotypes.
Side effects of treatment include continuous flu-like symptoms, among other problems.
Although treatment succeeds in 75% to 80% of patients with other genotypes, it is effective in genotype 1 patients only about 40% of the time.
What Might Be Useful
Since the treatment costs at least $2000 per month and has unpleasant side effects, it would be useful to have a tool based on some easily obtained patient parameters to predict when a given individual might or might not respond to treatment for hepatitis C.
That is the subject of this talk. We shall discuss some generalized linear approaches to this problem.
Solving the Problem
Geometrically, we shall think of the patient as described by a vector of parameters (x1,…,xn) ∈ Rn. These might include, e.g., age, amount of alcohol consumed, certain lab values, etc.
The treated patients then comprise two point sets in Rn, corresponding to responders and nonresponders. Usually, the sets intermix in the sense that their convex hulls have nonvoid intersection.
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured and Not cured.]
Generalized Linear Models
We need a rule to separate these two sets, hoping the rule will apply to future patients.
The simplest solution would be a hyperplane in Rn which separates the two groups with a minimal number of errors.
If its equation were a0 + a1x1 +…+ anxn = 0, the rule would be the sign of a0 + a1x1 +…+ anxn.
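As a minimal sketch in Python (with made-up coefficients, purely for illustration), the hyperplane rule is just the sign of an affine score:

```python
def hyperplane_rule(a0, a, x):
    """Classify a patient vector x by the sign of a0 + a1*x1 + ... + an*xn."""
    score = a0 + sum(ai * xi for ai, xi in zip(a, x))
    return 1 if score > 0 else -1

# Hypothetical coefficients and a two-dimensional patient vector.
label = hyperplane_rule(-1.0, [0.5, 0.25], [4.0, 2.0])  # score = 1.5, so label 1
```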
Generalized Linear Models
A generalized linear model is one for which the parameters x1, …, xn act through a function of the form A(x1,…,xn) = a0 + a1x1 +…+ anxn, where the ai’s are constants.
The coefficients a0, a1, …, an are fit by some method to optimize a value.
In evaluating each such model, it is important to consider both the model assumptions and what exactly is being optimized by it.
What Has Already Been Done?
Graham Foster, a British hepatologist, published a paper which looked at data from two multinational studies and collected the explanatory variables age, race, weight, BMI, viral genotype and load, ALT, and histology.
The retained variables were x = viral load, y = age, z = ALT, u = BMI (all continuous), and v = histology (categorical: 0 = cirrhosis; 1 = no cirrhosis).
What Has Already Been Done?
From logistic regression analysis on genotype 1 patients only, Foster concluded:
• Viral load - lower is better
• Age - younger is better
• ALT - higher is better
• BMI - lower is better
• Cirrhosis is worse
More about logistic regression later…
What Has Already Been Done?
Only the genotype 1 patients were analyzed, using logistic regression, so the paper posits
P(x,y,z,u,v) = e^A(x,y,z,u,v)/(1 + e^A(x,y,z,u,v)) as the probability of response, where
A(x,y,z,u,v) = a0 − 1.446x − 1.236y + 1.376z − 1.134u + 2.322v
for some constant a0 and appropriate units.
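A small sketch of this model in Python. The intercept a0 is a placeholder (it is not reproduced in this talk), and the signs on the coefficients are taken to match the bulleted conclusions earlier: lower viral load, younger age, higher ALT, lower BMI, and no cirrhosis are all favorable.

```python
import math

def foster_probability(x, y, z, u, v, a0=0.0):
    """P(response) = e^A / (1 + e^A) with the coefficient magnitudes from
    Foster's logistic fit; a0 is a hypothetical placeholder intercept."""
    A = a0 - 1.446 * x - 1.236 * y + 1.376 * z - 1.134 * u + 2.322 * v
    return math.exp(A) / (1 + math.exp(A))

# No cirrhosis (v = 1) raises the predicted probability, all else equal.
p_cirrhosis = foster_probability(1, 1, 1, 1, 0)
p_no_cirrhosis = foster_probability(1, 1, 1, 1, 1)
```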
Predicting the Response to Treatment of Hepatitis C in Humboldt County, California
Joseph E. Carroll, ODCHC and HSU
Mark A. Rizzardi, Statistics, HSU
Donald J. Iverson, Humboldt Neurology
Adil Wakil, Hepatology, CPMC
Jennifer Hampton
Mia R. Kumar
The Data
We collected information from the charts of patients treated for hepatitis C in Humboldt and Del Norte counties since about 2001 by the Eureka Liver Clinic (California Pacific Medical Center) and the Open Door Clinics in Eureka and Crescent City.
Other patients have been treated by the San Francisco VA and Stanford’s local clinic.
The Data
The information retrieved included outcome; demographic parameters (e.g., age, gender, ethnic background, substance use); findings on physical exam (e.g., weight, BMI); numerous laboratory results dated before, but as close to the onset of treatment as possible; reports of pathology on liver biopsy and of liver ultrasound; and the interferon/ribavirin combination used. The parameters totaled 56.
The Data
We started with about 170 patients but, on account of missing data, especially missing outcomes (responder or not), and because we eliminated patients with genotypes other than 1, the analysis you'll see today is based on only about 60 patients.
We are working, so far unsuccessfully, to obtain the data from Stanford’s local liver clinic.
[Figure: the logistic curve P(cure) = e^(β0 + β1x1 + β2x2)/(1 + e^(β0 + β1x1 + β2x2)) plotted against the linear predictor, rising from 0 to 1, with patients marked cured and not cured along the curve.]
Logistic Regression
Logistic Regression (continued)
• Why logistic instead of linear regression?
  – Binary data
  – Predicting a probability
  – Nonnormal error terms
  – Variance is not constant
• Commonly used in the medical field: odds ratios
• Solved via maximum likelihood estimation
π̂ = exp(Xβ̂) / (1 + exp(Xβ̂)), with 0 < π̂ < 1

l(β̂ | y1, y2, …, yn) = Σi [ yi ln(π̂i) + (1 − yi) ln(1 − π̂i) ]
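The log-likelihood being maximized can be sketched directly in Python; maximizing this function over the coefficients is what "solved via maximum likelihood estimation" means here.

```python
import math

def log_likelihood(beta0, beta, X, ys):
    """Logistic log-likelihood: the sum over patients of
    y*ln(pi-hat) + (1 - y)*ln(1 - pi-hat)."""
    ll = 0.0
    for x, y in zip(X, ys):
        a = beta0 + sum(b * xi for b, xi in zip(beta, x))
        p = math.exp(a) / (1 + math.exp(a))  # pi-hat for this patient
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll
```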
Logistic Regression for dataset
[Figure: predicted probability of cure plotted against the fitted linear predictor in 1{viral load > 500}, AST/ALT, and Bilirubin (fitted coefficient magnitudes 11.14, 11.79, 3.97, and 11.74; signs lost in extraction), with points marked Cured and Not cured.]
Range of Bilirubin = [0.2, 1.8]
Range of AST/ALT = [0.5, 2.7]
Linear Discriminant Analysis
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured and Not cured.]
LDA (continued)
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured, Not cured, and Misclassified.]
LDA (continued)
Maximize the variance between groups relative to the variance within groups.
LDA (continued)
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured, Not cured, and Misclassified.]
Classification trees
[Figure: classification tree splitting on BL < 97.5, MB < 135.5, MB < 138.5, NH < 53.5, BL < 98.5, MB < 134.5, NH < 51.5, and BL < 103.5, with leaves labeled by predicted period (-4000, -3300, -1850, -200).]
Classifying Egyptian skull time periods using skull measurements.
Time periods: 4000BC, 3300BC, 1850BC, 200BC, 150AD
MB = maximal breadth; BH = basibregmatic height; BL = basialveolar length; NH = nasal height
Artificial Neural Networks
We can also employ a three-layer feed-forward neural network, composed of input nodes for each patient variable, a second layer of hidden nodes, and an output node.
ANNs – Structure
If the hidden layer has m nodes, the network operates on an input x ∈ Rn as a composition of two functions, from Rn to Rm and then from Rm to R, each itself being a composition of a linear function followed by a threshold function.
Specifically, there are weights, wij and vi, and thresholds, θi and θ, for i = 1,…,m and j = 1,…,n, such that if we let
ANNs – Structure
yi = Σj wij xj (i.e., y = Wx) and zi = 0 or 1 according as yi < θi or yi > θi,
then the network output is success or failure depending on whether Σi vi zi = v·z exceeds or is exceeded by θ.
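This idealized forward pass can be sketched as follows; the weights here are made up to show that two threshold units suffice for an XOR-like rule, which no single hyperplane can express.

```python
def ann_step(x, W, thetas, v, theta_out):
    """Three-layer feed-forward net with hard thresholds: z_i = 1 if the
    ith weighted sum exceeds theta_i, output 1 if v.z exceeds theta_out."""
    z = [1 if sum(wij * xj for wij, xj in zip(row, x)) > ti else 0
         for row, ti in zip(W, thetas)]
    return 1 if sum(vi * zi for vi, zi in zip(v, z)) > theta_out else 0

# Hypothetical weights: two hidden units computing XOR of two binary inputs.
W = [[1, 1], [1, 1]]
thetas = [0.5, 1.5]
v = [1, -1]
theta_out = 0.5
```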
ANNs – Geometry
Given an input x, its image in the ith hidden node depends only on whether or not yi > θi. This is equivalent to locating x as being on one side or the other of a hyperplane in Rn.
Therefore, x goes to a point in the discrete hypercube {0,1}m ⊂ Rm which codes for the position of x with respect to m hyperplanes.
The output depends on the side of an (m−1)-dimensional hyperplane on which the hypercube point lies.
ANNs – Training
The network starts with arbitrary weights and is trained by successively processing each input, retaining the weights if the computed output is correct, and adjusting them according to the back-propagation method, a steepest descent technique, if it is not.
Unlike classical binary methods, this training tends to directly decrease miscategorizations, which may explain why neural networks often outperform other binary methods.
Our ANNs – The Sordid Truth
Training by backpropagation requires differentiability, which threshold functions lack. Therefore, these step functions are replaced by "activation" functions, e.g., a logistic or hyperbolic tangent, while the linear functions yield to affine linear functions.
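A sketch of the logistic activation and its derivative; the derivative is the quantity backpropagation multiplies into its chain-rule products, which is exactly what the non-differentiable step function cannot supply.

```python
import math

def sigmoid(t):
    """Logistic activation: a smooth, differentiable replacement for the step."""
    return 1.0 / (1.0 + math.exp(-t))

def dsigmoid(t):
    """Derivative of the logistic, sigmoid(t) * (1 - sigmoid(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)
```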
Neural Network
[Figure: diagram of the fitted neural network with input nodes vG500, astalt, and bili, a hidden layer, and output node response; edges labeled with fitted weights such as 64.79, -13.86, -714.65, and -0.05.]
Support Vector Machine
Let y1, y2, …, yk be the patient vectors for responders and z1, z2, …, zl for nonresponders, and suppose that there is a hyperplane in Rn separating the y's and z's with equation w·x + b = 0 for some w ∈ Rn and b ∈ R. Then w·yi + b > 0 and w·zj + b < 0.
If we alter w and b by multiplying both by the same adequately large positive number, we can ensure that w·yi + b ≥ 1 and w·zj + b ≤ −1.
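The rescaling step can be sketched directly: find the smallest unsigned margin over all training vectors and scale (w, b) by its reciprocal. This is a sketch assuming strict separation; the vectors below are hypothetical.

```python
def rescale_to_margin(w, b, ys, zs):
    """Multiply (w, b) by a positive constant so that w.y + b >= 1 on
    responders and w.z + b <= -1 on nonresponders, assuming the
    hyperplane already strictly separates the two sets."""
    dot = lambda u, x: sum(ui * xi for ui, xi in zip(u, x))
    margin = min(min(dot(w, y) + b for y in ys),
                 min(-(dot(w, z) + b) for z in zs))  # smallest margin, > 0
    c = 1.0 / margin
    return [c * wi for wi in w], c * b

# Hypothetical separable data in R2.
w2, b2 = rescale_to_margin([1.0, 0.0], 0.0, [[2.0, 5.0]], [[-4.0, 1.0]])
```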
Support Vector Machine
[Figure: scatter plot of variable 2 against variable 1 illustrating a support vector machine separator.]
Support Vector Machine
An SVM works by finding a widest rectangular-prism separator. This is equivalent to a convex programming problem.
That problem is to find w0 and b0 which minimize ||w||2 subject to w·yi + b ≥ 1 for i = 1,…,k and w·zj + b ≤ −1 for j = 1,…,l. This optimization problem can be solved by a generalization of the Lagrange multiplier method and, because it is a convex programming problem, it has a unique solution.
Support Vector Machine
We call vectors yi or zj satisfying w0·yi + b0 = 1 or w0·zj + b0 = −1 support vectors.
It can be shown that the error rate for future patients should be no worse than the ratio of the lesser of the number of support vectors and n+1 to the number of training vectors.
However, more commonly, the two sets will not be linearly separable. There are two ways to solve this problem.
Support Vector Machine
One can embed the patient vectors in a higher dimensional space nonlinearly and try to find a best separating hyperplane there. This corresponds to finding a more general separating hypersurface in Rn. For example, if we are unable to separate by a line in R2, we can map
(x, y) → (x, y, x2, xy, y2, x3, x2y, xy2, y3)
and try to separate by a hyperplane in R9, which corresponds to separating by a cubic back in R2.
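The lifting map itself is just a tuple of monomials:

```python
def cubic_features(x, y):
    """Lift R2 -> R9: all monomials in x and y of degree 1 through 3."""
    return (x, y, x * x, x * y, y * y, x**3, x * x * y, x * y * y, y**3)
```

A linear rule in the lifted coordinates pulls back to a cubic curve in the plane; for instance, the linear inequality x2 + y2 < 1 among the lifted coordinates is the interior of the unit circle in the original plane.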
Support Vector Machine
Dimension increases rapidly but the number of support vectors may not, keeping the error rate low.
Incidentally, there is always a polynomial separator. Suppose y1, y2, …, yk and z1, z2, …, zl are as above and let P(x) = Πi ||x − yi||2. Then P(x) ≥ 0 for all x and its zeros are exactly the yi's. Therefore, P has a positive minimum on the zj's, say c, so if Q(x) = P(x) − ½c, then Q(zj) > 0 and Q(yi) < 0.
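This construction can be checked numerically on made-up responder and nonresponder vectors:

```python
def make_poly_separator(ys, zs):
    """Build Q(x) = P(x) - c/2 with P(x) = prod_i ||x - y_i||^2 and
    c = min_j P(z_j) > 0, so Q < 0 at every y_i and Q > 0 at every z_j."""
    def P(x):
        value = 1.0
        for y in ys:
            value *= sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        return value
    c = min(P(z) for z in zs)
    return lambda x: P(x) - c / 2

# Hypothetical responder and nonresponder vectors in R2.
ys = [(0.0, 0.0), (1.0, 0.0)]
zs = [(3.0, 3.0), (-2.0, 2.0)]
Q = make_poly_separator(ys, zs)
```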
Support Vector Machine
Alternatively, there is a method to find a hyperplane, again using generalized Lagrange multipliers, which minimizes the sum of the distances of the misclassified vectors to it.
Then, as previously, this hyperplane can be expanded to a rectangular prism with bordering support vectors.
How well do the models predict future patients' responses to treatment?
Training set vs. Test set
• Objective:
  – Avoid overfitting the model to a particular dataset.
  – Simulate fitting future data.
• General approach:
  – Fit the model to a large randomly selected subset of the data (training set).
  – Use the model to predict outcomes of the remaining data (test set).
  – Select the model/method which "best" predicts the test set.
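The general approach above can be sketched with the standard library alone; the split fraction and seed are arbitrary choices for illustration.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Hold out a random test set and keep the rest for fitting."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(round(len(data) * (1 - test_fraction)))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

train, test = train_test_split(list(range(20)))
```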
[Figure: two scatter plots of Bilirubin against AST/ALT, one for patients with viral load < 500 and one for viral load > 500, with points marked Cured and Not cured.]
Data: n = 59 patients
Variables: AST/ALT, Bilirubin, Viral load
Leave-one-out Cross-validation
• k-fold cross-validation:
  – Divide the data set into k random, equal-size groups.
  – Rotate each group through the role of test set.
  – Fit the model k different times.
  – Note the model/method with the "best" prediction.
• Leave-one-out cross-validation:
  – k = n, where n = sample size.
  – Requires fitting the model n times.
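The leave-one-out loop can be sketched as follows. The 1-nearest-neighbor classifier here is a hypothetical stand-in; any of the talk's methods could be plugged in instead.

```python
def loocv_accuracy(points, labels, classify):
    """Leave-one-out cross-validation: refit n times, each time
    predicting the single held-out case from the other n - 1."""
    correct = 0
    for i in range(len(points)):
        train_pts = points[:i] + points[i + 1:]
        train_lab = labels[:i] + labels[i + 1:]
        if classify(train_pts, train_lab, points[i]) == labels[i]:
            correct += 1
    return correct / len(points)

def nearest_neighbor(train_pts, train_lab, x):
    """Stand-in classifier: label of the closest training point."""
    dists = [sum((a - b) ** 2 for a, b in zip(p, x)) for p in train_pts]
    return train_lab[dists.index(min(dists))]
```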
Neural network: 80% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         18         4
Predicted not cured      8        29
Total                   26        33
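The percentages quoted with each table are just the diagonal of the confusion table over the 59 patients, e.g. for the neural network:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction correct from a 2x2 confusion table (tp = predicted cured
    and actually cured, tn = predicted not cured and actually not cured)."""
    return (tp + tn) / (tp + fp + fn + tn)

# Neural network: 18 + 29 correct out of 59 patients, about 0.797.
nn_accuracy = accuracy(18, 4, 8, 29)
```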
[Figure: boxplots of the neural network output for Cured and Not cured patients.]
Logistic Regression: 75% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         18         7
Predicted not cured      8        26
Total                   26        33
[Figure: predicted probability of cure, e^A/(1 + e^A) with A = β0 + β1·1{viral load > 500} + β2·(AST/ALT) + β3·Bilirubin, plotted against A, with points marked Cured and Not cured.]

Something to think about:
• Which type of error is worse?
• Should something other than 0.5 be used as the dividing line?
Linear Discriminant Analysis: 80% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         16         2
Predicted not cured     10        31
Total                   26        33
Quadratic Discriminant Analysis: 78% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         16         3
Predicted not cured     10        30
Total                   26        33
Support Vector Machine: 78% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         16         3
Predicted not cured     10        30
Total                   26        33
Classification tree: 76% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         19         7
Predicted not cured      7        26
Total                   26        33
[Figure: classification tree for the hepatitis data, splitting first on astalt < 1.59, then on vG500 < 0.5 and bili < 0.55, with leaves labeled cure and no cure.]
Random Forest:
• Use random subsets of variables when building each branch of a tree.
• Grow a forest of many trees.
• The forest of trees votes on the classification for each observation.
• The classification with the greatest number of votes wins.
• Hepatitis data results: from 78% to 83% correct.
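The voting step at the end is a plain majority vote over the trees' individual classifications; a minimal sketch:

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the class predicted by the most trees (majority vote)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Five hypothetical trees voting on one patient.
winner = forest_vote(["cure", "no cure", "cure", "cure", "no cure"])
```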
Bibliography
Foster, Graham R., et al. (2007). Prediction of sustained virological response in chronic hepatitis C patients treated with peginterferon α-2a (40KD) and ribavirin. Scandinavian Journal of Gastroenterology, 42, 247-55.
Breiman, Leo (2001). Statistical modeling: the two cultures. Statistical Science 16:3, 199-231.
Bibliography
Cortes, Corinna and Vladimir Vapnik (1995). Support-vector networks. Machine Learning, 20, 273-97.
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning, Springer.