
Logistic regression for fast, accurate, and parameter free data mining

Paul Komarek
Auton Lab, Robotics Institute
Carnegie Mellon University

[email protected]
http://www.komarix.org
http://www.autonlab.org

Logistic Regression: Not Dead Yet

Logistic regression (LR) is a venerable but capable probabilistic binary classifier.

LR is well-understood, mature, and comfortable ⇒ trusted.

LR accuracy is comparable to new-fangled state-of-the-art SVMs.

LR can be as fast as or faster than linear SVMs.

Yet LR is often ignored in the data mining literature; it gets more press in the text classification literature.

Why do we care about LR speed?

Why LR, instead of other binary classifiers?

LR is more useful when it is very fast

LR is useful already, even when estimation algorithms are $O(n^3)$. If LR algorithms were really fast, we could

(obvious) do many logistic regressions

(obvious) do many binary classifications

which leads to . . .

Many apps for fast binary classifiers

Multiclass through voting (or an alternative)

Collaborative filtering – just a special multiclass problem, one LR model per “item”

Link tasks

    Link completion (just collaborative filtering)

    TFF - Group detection

    MNOP - Alias detection

    AFDL - Links+Demographics Classification

Text classification without feature selection

    Thorsten Joachims pushed this with SVMlight

    LR can also be used

    LR has even been used for successive approximation to SVMs in text classification

Video segmentation

CAUTION: we are discussing LR as a classifier. You cannot treat the resulting models as explanatory. The size of model parameters, taken one at a time, might be misleading.


The logistic regression model

LR expectation function:

    $\mu_i = \mu(x_i, \beta) = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}$

LR model:

    $y_i = \mu_i + \varepsilon, \qquad \varepsilon \sim b(1, \mu_i)$

LR likelihood:

    $L(\beta) = \prod_{i=1}^{R} \mu_i^{y_i} (1 - \mu_i)^{(1 - y_i)}$

[Figure: left, binary data with a fitted logistic curve $\exp(\beta_0 + \beta_1 x)/(1 + \exp(\beta_0 + \beta_1 x))$; right, a close-up of $\mu(x)$ at a point $x_0$, where the error is $1 - \mu(x_0)$ if $y_0 = 1$ and $\mu(x_0)$ if $y_0 = 0$.]
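
For concreteness, a minimal NumPy sketch of the expectation function and the log of the likelihood above (the clipping constant is a numerical-safety detail added here, not something from the slides):

    import numpy as np

    def mu(X, beta):
        """LR expectation function: mu_i = exp(beta^T x_i) / (1 + exp(beta^T x_i))."""
        return 1.0 / (1.0 + np.exp(-(X @ beta)))   # algebraically equal, numerically safer

    def log_likelihood(beta, X, y):
        """log L(beta) = sum_i [ y_i log(mu_i) + (1 - y_i) log(1 - mu_i) ]."""
        m = np.clip(mu(X, beta), 1e-12, 1.0 - 1e-12)  # avoid log(0); not from the slides
        return np.sum(y * np.log(m) + (1.0 - y) * np.log(1.0 - m))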

What is the hard part of LR?

I am pushing LR as a good binary classifier with a probabilistic model. Of course, LR has its difficulties.

The likelihood,

    $L(\beta) = \prod_{i=1}^{R} \mu_i^{y_i} (1 - \mu_i)^{(1 - y_i)}$

cannot be optimized analytically, so iterative methods are used instead.

The iterative method used differentiates LR implementations.


Iterative methods for LR

Iteratively Re-weighted Least Squares (IRLS) is a popular statistically-formulated quasi-Newton method. It is useful for any generalized linear model (GLM), and it finds the solution through a series of weighted least squares problems:

    $(X^T W X) \beta_i = X^T W z$

IRLS is equivalent to Newton’s method for a subclass of GLMs, and Newton’s method is simple but slow (see next slide).

A variety of other nonlinear optimization methods are used on the likelihood, including many variations of Newton’s method. Nonlinear conjugate gradient, quasi-Newton methods, and cyclic coordinate descent are very popular.

We prefer IRLS for generality, interpretability, and simplicity, but it needs modification.
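
For reference, a bare sketch of the IRLS iteration for LR using the standard working-response formulation (the slides do not spell out these steps; real code adds regularization and convergence tests):

    import numpy as np

    def irls_lr(X, y, iters=20):
        """Plain IRLS for LR: repeatedly solve (X^T W X) beta = X^T W z."""
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            m = 1.0 / (1.0 + np.exp(-(X @ beta)))      # current predictions mu
            w = np.clip(m * (1.0 - m), 1e-10, None)    # IRLS weights, clipped for stability
            z = X @ beta + (y - m) / w                 # working response
            beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        return beta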


Newton’s method is simple

Newton’s method finds zeros of a function:

    $x_{i+1} = x_i - f(x_i) / f'(x_i)$

[Figure: one Newton step, following the tangent line at $(x_i, f(x_i))$ down to its x-axis intercept $x_{i+1}$.]

For optimization, find zeros of the derivative. In general,

    $x_{i+1} = x_i - H^{-1} \nabla f(x_i)$
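
A minimal sketch of the one-dimensional update above:

    def newton_1d(f, fprime, x0, iters=25):
        """Newton's method for a root of f: x_{i+1} = x_i - f(x_i) / f'(x_i)."""
        x = x0
        for _ in range(iters):
            x = x - f(x) / fprime(x)
        return x

    # e.g. sqrt(2) as the positive root of f(x) = x^2 - 2:
    root = newton_1d(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)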

But Newton’s method is slow

Newton’s method requires inverting the Hessian repeatedly, which is generally an $O(n^3)$ operation.

IRLS has the same problem, since for LR it is equivalent to Newton. The weighted least squares problem at each iteration is

    $(X^T W X) \beta_i = X^T W z$

Solving for $\beta_i$ can be done, slowly, with a matrix inverse:

    $\beta_i = (X^T W X)^{-1} X^T W z$

Note that solving a linear system $Ax = b$ for positive definite $A$ is equivalent to minimizing the quadratic form

    $\frac{1}{2} x^T A x - b^T x + c$
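
A quick numerical check of that equivalence (synthetic data, NumPy assumed): the minimizer of the quadratic form has gradient $Ax - b = 0$, so it is exactly the solution of $Ax = b$.

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + 5.0 * np.eye(5)      # symmetric positive definite
    b = rng.standard_normal(5)

    x_star = np.linalg.solve(A, b)     # solves A x = b
    gradient = A @ x_star - b          # gradient of (1/2) x^T A x - b^T x: ~0 at x_star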


Linear conjugate gradient to the rescue

Conjugate gradient (CG) is an iterative but exact minimization algorithm that specializes in quadratic forms.

If the function to minimize is quadratic, the algorithm is called “linear CG”.

Linear CG is very simple, much simpler than “nonlinear CG”.

Linear CG is very fast; its speed depends on the spectrum of $A$.

You can think of linear CG as a specialized, highly efficient version of the “steepest descent” algorithm.
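
A minimal sketch of textbook linear CG for $Ax = b$ (equivalently, minimizing the quadratic form above); this is the generic algorithm, not the Auton Lab implementation:

    import numpy as np

    def linear_cg(A, b, x0=None, tol=1e-8, max_iters=None):
        """Linear CG for A x = b with symmetric positive definite A."""
        x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
        r = b - A @ x                       # residual = negative gradient
        p = r.copy()                        # first search direction
        rs = r @ r
        for _ in range(max_iters or len(b)):
            Ap = A @ p
            alpha = rs / (p @ Ap)           # exact minimizing step along p
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if rs_new < tol * tol:
                break
            p = r + (rs_new / rs) * p       # next direction, A-conjugate to the others
            rs = rs_new
        return x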

Linear CG versus steepest descent

[Figure: two contour plots over $[-2, 2] \times [-2, 2]$ comparing the paths taken by linear CG and by steepest descent on the same quadratic.]

Truncated IRLS

Using linear CG to approximate the Newton update solution creates a “truncated Newton method”.

Thus we can create a “truncated IRLS” that applies to all GLMs, whether or not IRLS is equivalent to Newton’s method.

Truncated Newton methods are somewhat well studied, have convergence guarantees, etc.

Linear CG and Newton’s method have few parameters, and they rarely need tuning when used for LR (empirically demonstrated).

Side note: this isn’t the whole story. Regularization is also important, along with a few other details.
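
Combining the two previous sketches gives the idea in code: replace the exact solve inside IRLS with a few CG iterations (uses the linear_cg sketch above; still only a sketch, without the regularization just mentioned):

    import numpy as np

    def truncated_irls_lr(X, y, outer_iters=20, cg_iters=30):
        """Truncated IRLS: solve each weighted least squares system approximately
        with a few linear CG steps instead of a direct O(n^3) solve."""
        beta = np.zeros(X.shape[1])
        for _ in range(outer_iters):
            m = 1.0 / (1.0 + np.exp(-(X @ beta)))
            w = np.clip(m * (1.0 - m), 1e-10, None)
            z = X @ beta + (y - m) / w
            A = X.T @ (X * w[:, None])   # formed explicitly here; kept implicit in real code
            beta = linear_cg(A, X.T @ (w * z), x0=beta, max_iters=cg_iters)
        return beta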

LR performs better in our experiments

Name           Columns      Rows      Nonzero      Positives

Link Analysis
citeseer       105,354      181,395   512,267      299
imdb           685,569      167,773   2,442,721    824

Life Sciences
ds2            1,143,054    88,358    29,861,146   423
ds1            6,348        26,733    3,732,607    804
ds1.100        100          26,733    NA           804
ds1.10         10           26,733    NA           804

Text Categorization
modapte.sub    26,299       7,769     423,025      495

LR performs better in our experiments

(All experiments are 10-fold cross-validations; times in seconds)

                   citeseer         imdb             ds2
Classifier         Time    AUC      Time    AUC      Time     AUC
LR TR-IRLS         53      0.945    272     0.983    1460     0.722
LR CG-MLE          70      0.946    310     0.983    2851     0.724
SVM LIN BEST       82      0.821    647     0.949    3729     0.704
SVM LIN FAST       79      0.810    564     0.938    2030     0.690
SVM RBF BEST       1150    0.864    4549    0.957    67118    0.700
SVM RBF FAST       408     0.798    1929    0.947    14681    0.680
BC                 10      0.501    33      0.507    127      0.533

LR performs better in our experiments

(All experiments are 10-fold cross-validations; times in seconds)

                   ds1              ds1.100          ds1.10
Classifier         Time    AUC      Time    AUC      Time    AUC
LR TR-IRLS         45      0.948    35      0.913    8       0.842
LR CG-MLE          120     0.946    294     0.916    43      0.844
SVM LIN BEST       846     0.931    1744    0.882    373     0.741
SVM LIN FAST       183     0.918    123     0.874    73      0.675
SVM RBF BEST       3594    0.939    2577    0.934    167     0.876
SVM RBF FAST       1593    0.902    932     0.864    248     0.848
KNS2 K=1           424     0.790    74      0.785    9       0.753
KNS2 K=9           782     0.909    166     0.894    14      0.859
KNS2 K=129         2381    0.938    819     0.938    89      0.909
BC                 4       0.884    8       0.890    2       0.863

Performance notes

All experiments were 10-fold cross-validations, so a single training run of each algorithm is about ten times faster than the times shown.

The SVM times and scores came from extensive per-dataset tuning, and the times do not include the time spent tuning.

Using distance-from-boundary ranking for SVM may skew the least-confident SVM predictions and affect the score. SVM regression may be better suited to rank-ordering predictions.

The LR algorithms used the same parameters for every dataset.

A favorite quasi-Newton method for likelihood optimization is BFGS. We have tried GSL’s BFGS, but we cannot make it competitive with TR-IRLS or nonlinear CG (CG-MLE).

We observed strange behavior with linear SVM on ds1.10.


SVMlight on ds1.10: very strange

[Figure: SVMlight with a linear kernel on train10.pca.csv. AUC (0.4 to 1) and time in seconds (0.01 to 10, log scale) plotted against the capacity parameter, swept from 1 to 100,000.]

LIBSVM on ds1.10: slightly strange

[Figure: LIBSVM with a linear kernel on train10.pca.csv. AUC (0.4 to 1) and time in seconds (0.01 to 10, log scale) plotted against the capacity parameter, swept from 1 to 100,000.]

Fun things we have done with LR

Fun things to do with fast LR:

high-throughput screening for active molecules in pharmaceutical datasets

collaborative filtering (think amazon.com “also bought”)

link analysis – more on this later

text classification/analysis

automatic quiz generation (coupled with associative rule learning and dynamic AD-trees)

...

Anything with high-dimensional binary classification is fair game, even if you have to pervert the problem to make it fit.

High-Throughput Screening for Drugs

Background:

Roboticized chemistry labs test many compounds for reactivity with a target molecule.

Domain knowledge can reduce errors to 1/1000, e.g. 50 false positives and 50 false negatives out of 100,000 trials.

There might be 200 actual positives in 100,000 trials.

Thus the lab misses 25% of potential drugs (50 of the 200 actual positives).

This makes the pharmacists very unhappy.

Wasting lab time on 99,800 inactive compounds makes chemists unhappy.


High-Throughput Screening for Drugs

One machine learning approach:

Featurize molecules with many binary descriptors, e.g. 1,000,000.

Learn a model from molecules to reported molecule activities (noisy).

Use the models to identify mislabeled compounds,

or use the models to schedule untested molecules.

In both cases we care about ranking performance, and hence are interested in the ROC curves on the next slides.

There are many other opportunities for machine learning to improve pharmaceutical research.
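
As a small illustration of the scheduling use, a sketch that ranks untested molecules by predicted activity (assumes an sklearn-style classifier with predict_proba; the slides' own code is the Auton Lab's fast LR):

    import numpy as np

    def schedule_untested(model, X_untested, budget):
        """Rank untested molecules by predicted activity probability and return
        the indices of the top `budget` candidates for lab time."""
        p = model.predict_proba(X_untested)[:, 1]   # assumes an sklearn-style classifier
        return np.argsort(-p)[:budget]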

High-Throughput Screening for Drugs

[Figure: ROC curves (true positives vs. false positives) for dataset ds1, with a linear false-positives axis. Classifiers shown: LR-CGEPS, LR-CGDEVEPS, CG-MLE, SVM LINEAR, SVM RBF (gamma=0.001), BC, KNN k=1, KNN k=9, KNN k=129.]

High-Throughput Screening for Drugs

[Figure: ROC curves (true positives vs. false positives) for dataset ds1, with a logarithmic false-positives axis. Same classifiers as the previous slide.]

Collaborative Filtering

Prototypical collaborative filtering example:

    People that bought the CD “Oops I did it again” also bought “Dirty Deeds Done Dirt Cheap”.

A more interesting example might rank a list of items missing from the customer’s shopping cart, in order of likely necessity:

    People that buy cookies should also consider (in descending order) milk, frosting, weight-watchers’ frozen entrees, and toothpicks.

Collaborative Filtering

Restate the problem in terms of binary classification. Assume historical cart data for training.

For each item i_k in the store, create a dataset:

CartID   i_1  ...  i_(k-1)   i_(k+1)  ...  i_m   i_k also in cart?
000      1    ...  0         1        ...  0     No
001      0    ...  1         1        ...  0     Yes
002      1    ...  0         1        ...  1     Yes
. . .

Learn m classification models, mapping cart contents to the probability that item i_k is also present.

A fast version of LR can be competitive with task-specific algorithms in speed and accuracy (depending on the precise details of the problem).
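
A minimal sketch of this restatement, using scikit-learn's LogisticRegression as a stand-in for a fast LR code (the carts matrix is an assumed 0/1 cart-by-item encoding):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_item_models(carts):
        """carts: (n_carts, m_items) 0/1 matrix. Fit one LR model per item k,
        mapping the other items' presence to P(item k also in cart). Assumes
        every item is present in some carts and absent from others."""
        models = []
        for k in range(carts.shape[1]):
            X = np.delete(carts, k, axis=1)    # all items except i_k
            models.append(LogisticRegression(max_iter=1000).fit(X, carts[:, k]))
        return models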

Link Analysis

A “link” is a collection of “tokens”, best described using examples:

link=Research paper, tokens=Authors

link=Movie, tokens=Actors, Directors, Producers

link=Article, tokens=Names, Places, Dates

A link dataset might have one row per link (e.g. research paper).

Some link tasks:

    Link completion: identical to collaborative filtering

    TFF - Temporal Friend Finder

    MNOP - Many Names, One Person

    AFDL - Activity From Demographics and Links


TFF

Temporal Friend Finder: the diagram below shows publishing links involving several people over several years.

[Figure: a 2002–2006 timeline of co-author groups such as (A,S), (A,D), (A,P), (A,S,D), (A,T), (A,P,T), and (P,J), with the 2006 entries marked “?”. Key: [A]ndrew, [D]avid, [J]eremy, [P]aul, [S]cott, [T]ing.]

Who will Paul publish with in 2006?

What is the probability that Andrew will publish with Jeremy?

Rank-order all authors’ probability of publishing with Andrew.

TFF

Approach:

Featurize the time series:

    X, Y connected 1 time unit ago, dist=1, strength=0.9+
    X, Y connected 1 time unit ago, dist=1, strength=0.1+
    X, Y connected 1 time unit ago, dist=2, strength=0.9+
    X, Y connected 1 time unit ago, dist=2, strength=0.1+
    X, Y connected 2 time units ago, dist=1, strength=0.9+
    X, Y connected 2 time units ago, dist=1, strength=0.1+
    . . .

Learn a model on the featurized time series with LR.

There are many attributes, but not too many for a fast LR implementation.
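
As an illustration only, a toy featurizer in the spirit of the list above (the pair_history representation and thresholds are assumptions, not the slides' actual encoding):

    def tff_features(pair_history, horizons=(1, 2), dists=(1, 2), strengths=(0.9, 0.1)):
        """Binary features 'X, Y connected t time units ago, dist <= d, strength >= s'.
        pair_history is assumed to map t -> (distance, strength) for the pair (X, Y)."""
        feats = []
        for t in horizons:
            d_obs, s_obs = pair_history.get(t, (None, 0.0))
            for d in dists:
                for s in strengths:
                    feats.append(int(d_obs is not None and d_obs <= d and s_obs >= s))
        return feats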


MNOP - Hsiung and Moore

Many Names, One Person: alias detection, e.g. are Clark Kent and Superman one or two people?

Featurize the link dataset:

    Compute many orthographic measures between “Clark Kent” and “Superman”, e.g. string edit distance.

    Compute semantic measures from links, e.g. “Clark Kent” appears with “Lois Lane” 10 times, and “Lois Lane” appears with “Superman” 6 times.

There are many orthographic and semantic measures. Use many, and allow LR to combine them.

In truth, MNOP did not stress our LR code, and slower LR would probably have been fine. But MNOP is still a nice application for probabilistic binary classification.


AFDL

Activity From Demographics and Links: entities are “active” or “inactive”. Some entities are known to be active. What is the probability that an unlabeled entity is active? Besides links, also use demographic information.

Similar featurization to TFF:

    no time component necessarily, just graph measures

    however, graph measures (edge strength) can include any demographic information: dates, places, favorite color, etc.

AFDL

Thus we are constructing another combinatorial set of measures of proximity from an unlabeled entity to known active entities.

“Learn, in an automated, self-tuning-to-specific-data way, the right fusion of this information.”

Read: use LR to combine measures and make predictions.

In this case, the featurizations can be extremely large, and fast LR is essential.


Text Analysis

Performed train/test and k-fold experiments with the Reuters corpus.

Scored using AUC as well as micro- and macro-averaged precision, recall, and F1.

Scores were very similar to SVM scores, but LR ran a bit faster.

CosmoQuiz - Komarek and Moore

From a census database, a quiz to predict wealth=rich:

Score     Question
[+1.68]   Are you married?
[-2.04]   Was your capital gain below $10,000?
[-0.87]   Was your capital loss below $500?
[+0.47]   Are you at least 36 years old?
[+0.64]   Are you over 46 years old?
[+0.74]   Do you work over 48 hours per week?
[+1.06]   Are you a managing executive?
. . .

    $\mathrm{Prob}(\mathrm{rich} \mid \mathrm{total}) = e^{(\mathrm{total} - 0.46)} / (1 + e^{(\mathrm{total} - 0.46)})$
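
The quiz is just an LR model whose weighted questions sum into the logit; a minimal sketch of scoring one respondent with the slide's scores and intercept (the answers themselves are hypothetical):

    import math

    # (score, answered_yes) pairs for one hypothetical respondent:
    answers = [(+1.68, True), (-2.04, True), (-0.87, False), (+0.47, True),
               (+0.64, False), (+0.74, False), (+1.06, True)]

    total = sum(score for score, yes in answers if yes)
    prob_rich = math.exp(total - 0.46) / (1.0 + math.exp(total - 0.46))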

Computer-generated Quizzes

What is a CosmoQuiz?

Questions are conjunctions of att=val pairs.

Exhaustive question scoring is the hardest part.

Questions are chosen iteratively, and the current questions are weighted by logistic regression.

The resulting classification error is used for the next iteration.

LR shouldn’t slow things down, since question scoring requires more computation.

I am presenting this because it is a neat LR application, even if not much of a *fast* LR application.


Exhaustive Question Selection

Each question is a conjunction of att=val pairs:

    a1=v1 AND a2=v2 AND ... AND ak=vk

Example: gender=Male AND capitalloss=v0:500-

We search over all questions up to some length, considering how much each reduces prediction errors. For this dataset, there are

    162 one-att questions; 14,642 two-att questions; ...

    130 trillion six-attribute questions

In 2.3 hours, using dynamic AD-trees and fast LR, we can

    rank the helpfulness of all 133 trillion questions with ≤ 6 atts

    do this 100 times to find the 100 best questions

    (the helpfulness of a question depends on the previous questions)

For a 100-question quiz with ≤ 2 atts per question: 3.3 minutes.
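
To make the combinatorics concrete, a toy enumerator of conjunction questions (hypothetical attributes; the real system scores these with dynamic AD-trees rather than materializing them):

    from itertools import combinations, product

    attrs = {"gender": ["Male", "Female"], "married": ["Yes", "No"]}  # hypothetical

    def questions(max_len):
        """Yield every conjunction of att=val pairs over distinct attributes."""
        names = sorted(attrs)
        for k in range(1, max_len + 1):
            for chosen in combinations(names, k):
                for vals in product(*(attrs[a] for a in chosen)):
                    yield " AND ".join(f"{a}={v}" for a, v in zip(chosen, vals))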

CosmoQuiz Applications

Any time you have historical, labeled data on a population, you can create a CosmoQuiz. Some examples include:

Assess the risk of an insurance customer being defrauded.

Create safety checklists for uncertain environments.

Lie detection: use small, non-leading questions to assess the probability of truth.

Create troubleshooting and diagnostic procedures from historical engine data.

Political profiling or identifying likely donors.

Future of LR: bigger, faster

Just how big a classification problem can we solve with LR? How fast can we make it go?

The bottleneck for sparse LR is the sparse dot product (DP) operator.

Sparse DP accounts for 30 times more cycles than the next most expensive numerical operation (log()), according to Valgrind, on ds1.

1/3 of DPs are for the matrix-vector multiply, 1/3 for the matrix-transpose times vector, and 1/3 for computing the current model predictions.

We need to be very efficient with very large datasets, for example those stored in large relational databases. We may also want to consider vectorization or parallelization of DP.
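
For reference, the operation in question, sketched for one sparse binary row against the dense parameter vector (the index-list representation is an assumption; the example indices are those from the parallelization slide below):

    import numpy as np

    def sparse_binary_dot(nonzero_idx, beta):
        """Dot product of a sparse binary row (stored as its nonzero column indices)
        with the dense parameter vector: an irregular gather, then a sum."""
        return beta[nonzero_idx].sum()

    beta = np.random.randn(2000)
    row = np.array([0, 1, 3, 14, 27, 300, 301, 302, 310, 1902])
    value = sparse_binary_dot(row, beta)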

Future of LR: in database?

Many large datasets are stored in relational databases. What can be done to improve LR performance on this data?

We assume the data is stored across many tables.

We assume LR would operate on the results of SQL queries.

Can we implement a DP operator that works implicitly on SQL query results, without explicitly creating the result tables?

Future of LR: vectorization?

Vectorization of sparse dot products might not improve performance.

The LR parameter vector is dense, but we take its dot product with a sparse (binary) vector.

Thus we stride irregularly through the LR parameters.

This reduces processor cache efficiency.

Vectorization won’t help if data fetch is the bottleneck.

Future of LR: parallelization?

Parallelization of dot products could take two forms.

We usually compute many DPs against the same vector of LR parameters. We could use different CPUs for different DPs.

We can break each DP into smaller chunks. However, sparse DPs will require tricky partitioning of the sparse vectors. E.g. partition this sparse index vector into decades of indices:

    {0, 1, 3, 14, 27, 300, 301, 302, 310, 1902}

The first approach does not help with cache problems, except insofar as we are able to parallelize fetch latency.

The second approach would be complicated for sparse vectors, but the sparse vectors in LR come from the (static) data matrix, so the partitioning could be precomputed. Possible?


Conclusions

LR is mature, well-understood, and not dead yet.

Recent spotty interest has made it faster than ever for classification tasks.

Many tasks can be restated as binary, probabilistic classification problems.

So featurize your eyes out, and find a fast LR code.

(e.g. http://komarix.org/lr, or http://www.autonlab.org)


Logistic regression for fast, accurate, and parameter free data mining

Paul Komarek
Auton Lab, Robotics Institute
Carnegie Mellon University

[email protected]
http://www.komarix.org
http://www.autonlab.org