Exploring Statistical Tools for Predicting Binary Outcome
Mark A. Rizzardi
Joseph E. Carroll
The Problem
Hepatitis C is a problem worth addressing in Humboldt County, with an incidence rate of about 0.2%, roughly 1½ to 2 times that of California as a whole over the last six years, and a prevalence rate of 2.3%.
70% to 85% of those acutely infected go on to chronic infection.
20% of those with chronic infection go on to cirrhosis, i.e., liver failure.
The Current Treatment
Right now, treatment consists of interferon and ribavirin for 48 weeks for genotype 1 and 24 weeks for other genotypes.
Side effects of treatment include continuous flu-like symptoms, among other problems.
Although treatment succeeds in 75% to 80% of patients with other genotypes, it is effective in genotype 1 patients only about 40% of the time.
What Might Be Useful
Since the treatment costs at least $2000 per month and has unpleasant side effects, it would be useful to have a tool based on some easily obtained patient parameters to predict when a given individual might or might not respond to treatment for hepatitis C.
That is the subject of this talk. We shall discuss some generalized linear approaches to this problem.
Solving the Problem
Geometrically, we shall think of the patient as described by a vector of parameters (x1,…,xn) ∈ Rn. These might include, e.g., age, amount of alcohol consumed, certain lab values, etc.
The treated patients then comprise two point sets in Rn, corresponding to responders and nonresponders. Usually, the sets intermix in the sense that their convex hulls have nonvoid intersection.
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured and Not cured.]
Generalized Linear Models
We need a rule to separate these two sets, hoping the rule will apply to future patients.
The simplest solution would be a hyperplane in Rn which separates the two groups with a minimal number of errors.
If its equation were a0 + a1x1 +…+ anxn = 0, the rule would be the sign of a0 + a1x1 +…+ anxn.
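As a minimal sketch in Python (with made-up coefficients, purely for illustration), the hyperplane rule is just the sign of an affine score:

```python
def hyperplane_rule(a0, a, x):
    """Classify a patient vector x by the sign of a0 + a1*x1 + ... + an*xn."""
    score = a0 + sum(ai * xi for ai, xi in zip(a, x))
    return 1 if score > 0 else -1

# Hypothetical coefficients and a two-dimensional patient vector.
label = hyperplane_rule(-1.0, [0.5, 0.25], [4.0, 2.0])  # score = 1.5, so label 1
```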
Generalized Linear Models
A generalized linear model is one for which the parameters x1, …, xn act through a function of the form A(x1,…,xn) = a0 + a1x1 +…+ anxn, where the ai’s are constants.
The coefficients a0, a1, …, an are fit by some method to optimize a value.
In evaluating each such model, it is important to consider both the model assumptions and what exactly is being optimized by it.
What Has Already Been Done?
Graham Foster, a British hepatologist, published a paper which looked at data from two multinational studies and collected the explanatory variables age, race, weight, BMI, viral genotype and load, ALT, and histology.
The retained variables were x = viral load, y = age, z = ALT, u = BMI (all continuous), and v = histology (categorical: 0 = cirrhosis; 1 = no cirrhosis).
What Has Already Been Done?
From logistic regression analysis on genotype 1 patients only, Foster concluded:
• Viral load - lower is better
• Age - younger is better
• ALT - higher is better
• BMI - lower is better
• Cirrhosis is worse
More about logistic regression later…
What Has Already Been Done?
Only the genotype 1 patients were analyzed, using logistic regression, so the paper posits
P(x,y,z,u,v) = e^A(x,y,z,u,v)/(1 + e^A(x,y,z,u,v)) as the probability of response, where
A(x,y,z,u,v) = a0 − 1.446x − 1.236y + 1.376z − 1.134u + 2.322v
for some constant a0 and appropriate units.
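A small sketch of this model in Python. The intercept a0 is a placeholder (it is not reproduced in this talk), and the signs on the coefficients are taken to match the bulleted conclusions earlier: lower viral load, younger age, higher ALT, lower BMI, and no cirrhosis are all favorable.

```python
import math

def foster_probability(x, y, z, u, v, a0=0.0):
    """P(response) = e^A / (1 + e^A) with the coefficient magnitudes from
    Foster's logistic fit; a0 is a hypothetical placeholder intercept."""
    A = a0 - 1.446 * x - 1.236 * y + 1.376 * z - 1.134 * u + 2.322 * v
    return math.exp(A) / (1 + math.exp(A))

# No cirrhosis (v = 1) raises the predicted probability, all else equal.
p_cirrhosis = foster_probability(1, 1, 1, 1, 0)
p_no_cirrhosis = foster_probability(1, 1, 1, 1, 1)
```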
Predicting the Response to Treatment of Hepatitis C in Humboldt County, California
Joseph E. Carroll, ODCHC and HSU
Mark A. Rizzardi, Statistics, HSU
Donald J. Iverson, Humboldt Neurology
Adil Wakil, Hepatology, CPMC
Jennifer Hampton
Mia R. Kumar
The Data
We collected information from the charts of patients treated for hepatitis C in Humboldt and Del Norte counties since about 2001 by the Eureka Liver Clinic (California Pacific Medical Center) and the Open Door Clinics in Eureka and Crescent City.
Other patients have been treated by the San Francisco VA and Stanford’s local clinic.
The Data
The information retrieved included outcome; demographic parameters (e.g., age, gender, ethnic background, substance use); findings on physical exam (e.g., weight, BMI); numerous laboratory results dated before, but as close to the onset of treatment as possible; reports of pathology on liver biopsy and of liver ultrasound; and the interferon/ribavirin combination used. The parameters totaled 56.
The Data
We started with about 170 patients but, on account of missing data, especially missing outcomes (responder or not), and because we eliminated patients with genotypes other than 1, the analysis you'll see today is based on only about 60 patients.
We are working, so far unsuccessfully, to obtain the data from Stanford’s local liver clinic.
[Figure: the logistic curve P(cure) = e^(β0 + β1x1 + β2x2)/(1 + e^(β0 + β1x1 + β2x2)) plotted against the linear predictor, rising from 0 to 1, with patients marked cured and not cured along the curve.]
Logistic Regression
Logistic Regression (continued)
• Why logistic instead of linear regression?
  – Binary data
  – Predicting a probability
  – Nonnormal error terms
  – Variance is not constant
• Commonly used in the medical field: odds ratios
• Solved via maximum likelihood estimation
π̂ = exp(Xβ̂) / (1 + exp(Xβ̂)), with 0 < π̂ < 1

l(β̂ | y1, y2, …, yn) = Σi [ yi ln(π̂i) + (1 − yi) ln(1 − π̂i) ]
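The log-likelihood being maximized can be sketched directly in Python; maximizing this function over the coefficients is what "solved via maximum likelihood estimation" means here.

```python
import math

def log_likelihood(beta0, beta, X, ys):
    """Logistic log-likelihood: the sum over patients of
    y*ln(pi-hat) + (1 - y)*ln(1 - pi-hat)."""
    ll = 0.0
    for x, y in zip(X, ys):
        a = beta0 + sum(b * xi for b, xi in zip(beta, x))
        p = math.exp(a) / (1 + math.exp(a))  # pi-hat for this patient
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll
```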
Logistic Regression for dataset
[Figure: predicted probability of cure plotted against the fitted linear predictor in 1{viral load > 500}, AST/ALT, and Bilirubin (fitted coefficient magnitudes 11.14, 11.79, 3.97, and 11.74; signs lost in extraction), with points marked Cured and Not cured.]
Range of Bilirubin = [0.2, 1.8]
Range of AST/ALT = [0.5, 2.7]
Linear Discriminant Analysis
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured and Not cured.]
LDA (continued)
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured, Not cured, and Misclassified.]
LDA (continued)
Maximize the variance between groups relative to the variance within groups.
LDA (continued)
[Figure: scatter plot of variable 2 against variable 1, with points marked Cured, Not cured, and Misclassified.]
Classification trees
[Figure: classification tree splitting on BL < 97.5, MB < 135.5, MB < 138.5, NH < 53.5, BL < 98.5, MB < 134.5, NH < 51.5, and BL < 103.5, with leaves labeled by predicted period (-4000, -3300, -1850, -200).]
Classifying Egyptian skull time periods using skull measurements.
Time periods: 4000BC, 3300BC, 1850BC, 200BC, 150AD
MB = maximal breadth; BH = basibregmatic height; BL = basialveolar length; NH = nasal height
Artificial Neural Networks
We can also employ a three-layer feed-forward neural network, composed of input nodes for each patient variable, a second layer of hidden nodes, and an output node.
ANNs – Structure
If the hidden layer has m nodes, the network operates on an input x ∈ Rn as a composition of two functions, from Rn to Rm and then from Rm to R, each itself being a composition of a linear function followed by a threshold function.
Specifically, there are weights, wij and vi, and thresholds, θi and θ, for i = 1,…,m and j = 1,…,n, such that if we let
ANNs – Structure
yi = Σj wij xj (i.e., y = Wx) and zi = 0 or 1 according as yi < θi or yi > θi,
then the network output is success or failure depending on whether Σi vi zi = v·z exceeds or is exceeded by θ.
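This idealized forward pass can be sketched as follows; the weights here are made up to show that two threshold units suffice for an XOR-like rule, which no single hyperplane can express.

```python
def ann_step(x, W, thetas, v, theta_out):
    """Three-layer feed-forward net with hard thresholds: z_i = 1 if the
    ith weighted sum exceeds theta_i, output 1 if v.z exceeds theta_out."""
    z = [1 if sum(wij * xj for wij, xj in zip(row, x)) > ti else 0
         for row, ti in zip(W, thetas)]
    return 1 if sum(vi * zi for vi, zi in zip(v, z)) > theta_out else 0

# Hypothetical weights: two hidden units computing XOR of two binary inputs.
W = [[1, 1], [1, 1]]
thetas = [0.5, 1.5]
v = [1, -1]
theta_out = 0.5
```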
ANNs – Geometry
Given an input x, its image in the ith hidden node depends only on whether or not yi > θi. This is equivalent to locating x as being on one side or the other of a hyperplane in Rn.
Therefore, x goes to a point in the discrete hypercube {0,1}m ⊂ Rm which codes for the position of x with respect to m hyperplanes.
The output depends on the side of an (m−1)-dimensional hyperplane on which the hypercube point lies.
ANNs – Training
The network starts with arbitrary weights and is trained by successively processing each input, retaining the weights if the computed output is correct, and adjusting them according to the back-propagation method, a steepest descent technique, if it is not.
Unlike classical binary methods, this training tends to directly decrease miscategorizations, which may explain why neural networks often outperform other binary methods.
Our ANNs – The Sordid Truth
Training by backpropagation requires differentiability, which threshold functions lack. Therefore, these step functions are replaced by "activation" functions, e.g., a logistic or hyperbolic tangent, while the linear functions yield to affine linear functions.
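A sketch of the logistic activation and its derivative; the derivative is the quantity backpropagation multiplies into its chain-rule products, which is exactly what the non-differentiable step function cannot supply.

```python
import math

def sigmoid(t):
    """Logistic activation: a smooth, differentiable replacement for the step."""
    return 1.0 / (1.0 + math.exp(-t))

def dsigmoid(t):
    """Derivative of the logistic, sigmoid(t) * (1 - sigmoid(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)
```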
Neural Network
[Figure: diagram of the fitted neural network with input nodes vG500, astalt, and bili, a hidden layer, and output node response; edges labeled with fitted weights such as 64.79, -13.86, -714.65, and -0.05.]
Support Vector Machine
Let y1, y2, …, yk be the patient vectors for responders and z1, z2, …, zl for nonresponders, and suppose that there is a hyperplane in Rn separating the y's and z's with equation w·x + b = 0 for some w ∈ Rn and b ∈ R. Then w·yi + b > 0 and w·zj + b < 0.
If we alter w and b by multiplying both by the same adequately large positive number, we can ensure that w·yi + b ≥ 1 and w·zj + b ≤ −1.
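The rescaling step can be sketched directly: find the smallest unsigned margin over all training vectors and scale (w, b) by its reciprocal. This is a sketch assuming strict separation; the vectors below are hypothetical.

```python
def rescale_to_margin(w, b, ys, zs):
    """Multiply (w, b) by a positive constant so that w.y + b >= 1 on
    responders and w.z + b <= -1 on nonresponders, assuming the
    hyperplane already strictly separates the two sets."""
    dot = lambda u, x: sum(ui * xi for ui, xi in zip(u, x))
    margin = min(min(dot(w, y) + b for y in ys),
                 min(-(dot(w, z) + b) for z in zs))  # smallest margin, > 0
    c = 1.0 / margin
    return [c * wi for wi in w], c * b

# Hypothetical separable data in R2.
w2, b2 = rescale_to_margin([1.0, 0.0], 0.0, [[2.0, 5.0]], [[-4.0, 1.0]])
```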
Support Vector Machine
[Figure: scatter plot of variable 2 against variable 1 illustrating a support vector machine separator.]
Support Vector Machine
An SVM works by finding a widest rectangular-prism separator. This is equivalent to a convex programming problem.
That problem is to find w0 and b0 which minimize ||w||2 subject to w·yi + b ≥ 1 for i = 1,…,k and w·zj + b ≤ −1 for j = 1,…,l. This optimization problem can be solved by a generalization of the Lagrange multiplier method and, because it is a convex programming problem, it has a unique solution.
Support Vector Machine
We call vectors yi or zj satisfying w0·yi + b0 = 1 or w0·zj + b0 = −1 support vectors.
It can be shown that the error rate for future patients should be no worse than the ratio of the lesser of the number of support vectors and n+1 to the number of training vectors.
However, more commonly, the two sets will not be linearly separable. There are two ways to solve this problem.
Support Vector Machine
One can embed the patient vectors in a higher dimensional space nonlinearly and try to find a best separating hyperplane there. This corresponds to finding a more general separating hypersurface in Rn. For example, if we are unable to separate by a line in R2, we can map
(x, y) → (x, y, x2, xy, y2, x3, x2y, xy2, y3)
and try to separate by a hyperplane in R9, which corresponds to separating by a cubic back in R2.
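The lifting map itself is just a tuple of monomials:

```python
def cubic_features(x, y):
    """Lift R2 -> R9: all monomials in x and y of degree 1 through 3."""
    return (x, y, x * x, x * y, y * y, x**3, x * x * y, x * y * y, y**3)
```

A linear rule in the lifted coordinates pulls back to a cubic curve in the plane; for instance, the linear inequality x2 + y2 < 1 among the lifted coordinates is the interior of the unit circle in the original plane.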
Support Vector Machine
Dimension increases rapidly but the number of support vectors may not, keeping the error rate low.
Incidentally, there is always a polynomial separator. Suppose y1, y2, …, yk and z1, z2, …, zl are as above and let P(x) = Πi ||x − yi||2. Then P(x) ≥ 0 for all x and its zeros are exactly the yi's. Therefore, P has a positive minimum on the zj's, say c, so if Q(x) = P(x) − ½c, then Q(zj) > 0 and Q(yi) < 0.
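This construction can be checked numerically on made-up responder and nonresponder vectors:

```python
def make_poly_separator(ys, zs):
    """Build Q(x) = P(x) - c/2 with P(x) = prod_i ||x - y_i||^2 and
    c = min_j P(z_j) > 0, so Q < 0 at every y_i and Q > 0 at every z_j."""
    def P(x):
        value = 1.0
        for y in ys:
            value *= sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        return value
    c = min(P(z) for z in zs)
    return lambda x: P(x) - c / 2

# Hypothetical responder and nonresponder vectors in R2.
ys = [(0.0, 0.0), (1.0, 0.0)]
zs = [(3.0, 3.0), (-2.0, 2.0)]
Q = make_poly_separator(ys, zs)
```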
Support Vector Machine
Alternatively, there is a method to find a hyperplane, again using generalized Lagrange multipliers, which minimizes the sum of the distances of the misclassified vectors to it.
Then, as previously, this hyperplane can be expanded to a rectangular prism with bordering support vectors.
How well do the models predict future patients' responses to treatment?
Training set vs. Test set
• Objective:
  – Avoid overfitting the model to a particular dataset.
  – Simulate fitting future data.
• General approach:
  – Fit the model to a large randomly selected subset of the data (training set).
  – Use the model to predict outcomes of the remaining data (test set).
  – Select the model/method which "best" predicts the test set.
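The general approach above can be sketched with the standard library alone; the split fraction and seed are arbitrary choices for illustration.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Hold out a random test set and keep the rest for fitting."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(round(len(data) * (1 - test_fraction)))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

train, test = train_test_split(list(range(20)))
```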
[Figure: two scatter plots of Bilirubin against AST/ALT, one for patients with viral load < 500 and one for viral load > 500, with points marked Cured and Not cured.]
Data: n = 59 patients
Variables: AST/ALT, Bilirubin, Viral load
Leave-one-out Cross-validation
• k-fold cross-validation:
  – Divide the data set into k random, equal-size groups.
  – Rotate each group through the role of test set.
  – Fit the model k different times.
  – Note the model/method with the "best" prediction.
• Leave-one-out cross-validation:
  – k = n, where n = sample size.
  – Requires fitting the model n times.
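The leave-one-out loop can be sketched as follows. The 1-nearest-neighbor classifier here is a hypothetical stand-in; any of the talk's methods could be plugged in instead.

```python
def loocv_accuracy(points, labels, classify):
    """Leave-one-out cross-validation: refit n times, each time
    predicting the single held-out case from the other n - 1."""
    correct = 0
    for i in range(len(points)):
        train_pts = points[:i] + points[i + 1:]
        train_lab = labels[:i] + labels[i + 1:]
        if classify(train_pts, train_lab, points[i]) == labels[i]:
            correct += 1
    return correct / len(points)

def nearest_neighbor(train_pts, train_lab, x):
    """Stand-in classifier: label of the closest training point."""
    dists = [sum((a - b) ** 2 for a, b in zip(p, x)) for p in train_pts]
    return train_lab[dists.index(min(dists))]
```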
Neural network: 80% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         18         4
Predicted not cured      8        29
Total                   26        33
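The percentages quoted with each table are just the diagonal of the confusion table over the 59 patients, e.g. for the neural network:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction correct from a 2x2 confusion table (tp = predicted cured
    and actually cured, tn = predicted not cured and actually not cured)."""
    return (tp + tn) / (tp + fp + fn + tn)

# Neural network: 18 + 29 correct out of 59 patients, about 0.797.
nn_accuracy = accuracy(18, 4, 8, 29)
```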
[Figure: boxplots of the neural network output for Cured and Not cured patients.]
Logistic Regression: 75% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         18         7
Predicted not cured      8        26
Total                   26        33
[Figure: predicted probability of cure, e^A/(1 + e^A) with A = β0 + β1·1{viral load > 500} + β2·(AST/ALT) + β3·Bilirubin, plotted against A, with points marked Cured and Not cured.]

Something to think about:
• Which type of error is worse?
• Should something other than 0.5 be used as the dividing line?
Linear Discriminant Analysis: 80% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         16         2
Predicted not cured     10        31
Total                   26        33
Quadratic Discriminant Analysis: 78% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         16         3
Predicted not cured     10        30
Total                   26        33
Support Vector Machine: 78% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         16         3
Predicted not cured     10        30
Total                   26        33
Classification tree: 76% correct

                      Actuality:
                      Cured    Not cured
Predicted cured         19         7
Predicted not cured      7        26
Total                   26        33
[Figure: classification tree for the hepatitis data, splitting first on astalt < 1.59, then on vG500 < 0.5 and bili < 0.55, with leaves labeled cure and no cure.]
Random Forest:
• Use random subsets of variables when building each branch of a tree.
• Grow a forest of many trees.
• The forest of trees votes on the classification for each observation.
• The classification with the greatest number of votes wins.
• Hepatitis data results: from 78% to 83% correct.
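The voting step at the end is a plain majority vote over the trees' individual classifications; a minimal sketch:

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the class predicted by the most trees (majority vote)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Five hypothetical trees voting on one patient.
winner = forest_vote(["cure", "no cure", "cure", "cure", "no cure"])
```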
Bibliography
Foster, Graham R., et al. (2007). Prediction of sustained virological response in chronic hepatitis C patients treated with peginterferon α-2a (40KD) and ribavirin. Scandinavian Journal of Gastroenterology, 42, 247-55.
Breiman, Leo (2001). Statistical modeling: the two cultures. Statistical Science 16:3, 199-231.
Bibliography
Cortes, Corinna and Vladimir Vapnik (1995). Support-vector networks. Machine Learning, 20, 273-97.
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning, Springer.