
Logistic regression for fast, accurate, and parameter free data mining

Paul Komarek
Auton Lab, Robotics Institute
Carnegie Mellon University

[email protected]
http://www.komarix.org
http://www.autonlab.org

Logistic Regression: Not Dead Yet

Logistic regression (LR) is a venerable but capable probabilistic binary classifier.

LR is well-understood, mature, and comfortable ⇒ trusted.

LR accuracy is comparable to new-fangled state-of-the-art SVMs.

LR can be as fast as or faster than linear SVMs.

Yet LR is often ignored in the data mining literature; it gets more press in the text classification literature.

Why do we care about LR speed?

Why LR, instead of other binary classifiers?

LR is more useful when it is very fast

LR is useful already, even when estimation algorithms are $O(n^3)$. If LR algorithms were really fast, we could

(obvious) do many logistic regressions

(obvious) do many binary classifications

which leads to . . .

Many apps for fast binary classifiers

Multiclass through voting (or an alternative)

Collaborative filtering – just a special multiclass problem, one LR model per “item”

Link tasks

    Link completion (just collaborative filtering)

    TFF - Group detection

    MNOP - Alias detection

    AFDL - Links+Demographics Classification

Text classification without feature selection

    Thorsten Joachims pushed this with SVMlight

    LR can also be used

    LR has even been used for successive approximation to SVMs in text classification

Video segmentation

CAUTION: we are discussing LR as a classifier. You cannot treat the resulting models as explanatory. The size of model parameters, taken one at a time, might be misleading.


The logistic regression model

LR expectation function:

    $\mu_i = \mu(x_i, \beta) = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}$

LR model:

    $y_i = \mu_i + \varepsilon, \qquad \varepsilon \sim b(1, \mu_i)$

LR likelihood:

    $L(\beta) = \prod_{i=1}^{R} \mu_i^{y_i} (1 - \mu_i)^{(1 - y_i)}$

[Figure: left, binary data with a fitted logistic curve $\exp(\beta_0 + \beta_1 x)/(1 + \exp(\beta_0 + \beta_1 x))$; right, a close-up of $\mu(x)$ at a point $x_0$, where the error is $1 - \mu(x_0)$ if $y_0 = 1$ and $\mu(x_0)$ if $y_0 = 0$.]
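
For concreteness, a minimal NumPy sketch of the expectation function and the log of the likelihood above (the clipping constant is a numerical-safety detail added here, not something from the slides):

    import numpy as np

    def mu(X, beta):
        """LR expectation function: mu_i = exp(beta^T x_i) / (1 + exp(beta^T x_i))."""
        return 1.0 / (1.0 + np.exp(-(X @ beta)))   # algebraically equal, numerically safer

    def log_likelihood(beta, X, y):
        """log L(beta) = sum_i [ y_i log(mu_i) + (1 - y_i) log(1 - mu_i) ]."""
        m = np.clip(mu(X, beta), 1e-12, 1.0 - 1e-12)  # avoid log(0); not from the slides
        return np.sum(y * np.log(m) + (1.0 - y) * np.log(1.0 - m))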

What is the hard part of LR?

I am pushing LR as a good binary classifier with a probabilistic model. Of course, LR has its difficulties.

The likelihood,

    $L(\beta) = \prod_{i=1}^{R} \mu_i^{y_i} (1 - \mu_i)^{(1 - y_i)}$

cannot be optimized analytically, so iterative methods are used instead.

The iterative method used differentiates LR implementations.


Iterative methods for LR

Iteratively Re-weighted Least Squares (IRLS) is a popular statistically-formulated quasi-Newton method. It is useful for any generalized linear model (GLM), and it finds the solution through a series of weighted least squares problems:

    $(X^T W X) \beta_i = X^T W z$

IRLS is equivalent to Newton’s method for a subclass of GLMs, and Newton’s method is simple but slow (see next slide).

A variety of other nonlinear optimization methods are used on the likelihood, including many variations of Newton’s method. Nonlinear conjugate gradient, quasi-Newton methods, and cyclic coordinate descent are very popular.

We prefer IRLS for generality, interpretability, and simplicity, but it needs modification.
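
For reference, a bare sketch of the IRLS iteration for LR using the standard working-response formulation (the slides do not spell out these steps; real code adds regularization and convergence tests):

    import numpy as np

    def irls_lr(X, y, iters=20):
        """Plain IRLS for LR: repeatedly solve (X^T W X) beta = X^T W z."""
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            m = 1.0 / (1.0 + np.exp(-(X @ beta)))      # current predictions mu
            w = np.clip(m * (1.0 - m), 1e-10, None)    # IRLS weights, clipped for stability
            z = X @ beta + (y - m) / w                 # working response
            beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        return beta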


Newton’s method is simple

Newton’s method finds zeros of a function:

    $x_{i+1} = x_i - f(x_i) / f'(x_i)$

[Figure: one Newton step, following the tangent line at $(x_i, f(x_i))$ down to its x-axis intercept $x_{i+1}$.]

For optimization, find zeros of the derivative. In general,

    $x_{i+1} = x_i - H^{-1} \nabla f(x_i)$
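
A minimal sketch of the one-dimensional update above:

    def newton_1d(f, fprime, x0, iters=25):
        """Newton's method for a root of f: x_{i+1} = x_i - f(x_i) / f'(x_i)."""
        x = x0
        for _ in range(iters):
            x = x - f(x) / fprime(x)
        return x

    # e.g. sqrt(2) as the positive root of f(x) = x^2 - 2:
    root = newton_1d(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)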

But Newton’s method is slow

Newton’s method requires inverting the Hessian repeatedly, which is generally an $O(n^3)$ operation.

IRLS has the same problem, since for LR it is equivalent to Newton. The weighted least squares problem at each iteration is

    $(X^T W X) \beta_i = X^T W z$

Solving for $\beta_i$ can be done, slowly, with a matrix inverse:

    $\beta_i = (X^T W X)^{-1} X^T W z$

Note that solving a linear system $Ax = b$ for positive definite $A$ is equivalent to minimizing the quadratic form

    $\frac{1}{2} x^T A x - b^T x + c$
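
A quick numerical check of that equivalence (synthetic data, NumPy assumed): the minimizer of the quadratic form has gradient $Ax - b = 0$, so it is exactly the solution of $Ax = b$.

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + 5.0 * np.eye(5)      # symmetric positive definite
    b = rng.standard_normal(5)

    x_star = np.linalg.solve(A, b)     # solves A x = b
    gradient = A @ x_star - b          # gradient of (1/2) x^T A x - b^T x: ~0 at x_star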


Linear conjugate gradient to the rescue

Conjugate gradient (CG) is an iterative but exact minimization algorithm that specializes in quadratic forms.

If the function to minimize is quadratic, the algorithm is called “linear CG”.

Linear CG is very simple, much simpler than “nonlinear CG”.

Linear CG is very fast; its speed depends on the spectrum of $A$.

You can think of linear CG as a specialized, highly efficient version of the “steepest descent” algorithm.
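
A minimal sketch of textbook linear CG for $Ax = b$ (equivalently, minimizing the quadratic form above); this is the generic algorithm, not the Auton Lab implementation:

    import numpy as np

    def linear_cg(A, b, x0=None, tol=1e-8, max_iters=None):
        """Linear CG for A x = b with symmetric positive definite A."""
        x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
        r = b - A @ x                       # residual = negative gradient
        p = r.copy()                        # first search direction
        rs = r @ r
        for _ in range(max_iters or len(b)):
            Ap = A @ p
            alpha = rs / (p @ Ap)           # exact minimizing step along p
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if rs_new < tol * tol:
                break
            p = r + (rs_new / rs) * p       # next direction, A-conjugate to the others
            rs = rs_new
        return x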

Linear CG versus steepest descent

[Figure: two contour plots over $[-2, 2] \times [-2, 2]$ comparing the paths taken by linear CG and by steepest descent on the same quadratic.]

Truncated IRLS

Using linear CG to approximate the Newton update solution creates a “truncated Newton method”.

Thus we can create a “truncated IRLS” that applies to all GLMs, whether or not IRLS is equivalent to Newton’s method.

Truncated Newton methods are somewhat well studied, have convergence guarantees, etc.

Linear CG and Newton’s method have few parameters, and they rarely need tuning when used for LR (empirically demonstrated).

Side note: this isn’t the whole story. Regularization is also important, along with a few other details.
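
Combining the two previous sketches gives the idea in code: replace the exact solve inside IRLS with a few CG iterations (uses the linear_cg sketch above; still only a sketch, without the regularization just mentioned):

    import numpy as np

    def truncated_irls_lr(X, y, outer_iters=20, cg_iters=30):
        """Truncated IRLS: solve each weighted least squares system approximately
        with a few linear CG steps instead of a direct O(n^3) solve."""
        beta = np.zeros(X.shape[1])
        for _ in range(outer_iters):
            m = 1.0 / (1.0 + np.exp(-(X @ beta)))
            w = np.clip(m * (1.0 - m), 1e-10, None)
            z = X @ beta + (y - m) / w
            A = X.T @ (X * w[:, None])   # formed explicitly here; kept implicit in real code
            beta = linear_cg(A, X.T @ (w * z), x0=beta, max_iters=cg_iters)
        return beta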

LR performs better in our experiments

Name           Columns      Rows      Nonzero      Positives

Link Analysis
citeseer       105,354      181,395   512,267      299
imdb           685,569      167,773   2,442,721    824

Life Sciences
ds2            1,143,054    88,358    29,861,146   423
ds1            6,348        26,733    3,732,607    804
ds1.100        100          26,733    NA           804
ds1.10         10           26,733    NA           804

Text Categorization
modapte.sub    26,299       7,769     423,025      495

LR performs better in our experiments

(All experiments are 10-fold cross-validations; times in seconds)

                   citeseer         imdb             ds2
Classifier         Time    AUC      Time    AUC      Time     AUC
LR TR-IRLS         53      0.945    272     0.983    1460     0.722
LR CG-MLE          70      0.946    310     0.983    2851     0.724
SVM LIN BEST       82      0.821    647     0.949    3729     0.704
SVM LIN FAST       79      0.810    564     0.938    2030     0.690
SVM RBF BEST       1150    0.864    4549    0.957    67118    0.700
SVM RBF FAST       408     0.798    1929    0.947    14681    0.680
BC                 10      0.501    33      0.507    127      0.533

LR performs better in our experiments

(All experiments are 10-fold cross-validations; times in seconds)

                   ds1              ds1.100          ds1.10
Classifier         Time    AUC      Time    AUC      Time    AUC
LR TR-IRLS         45      0.948    35      0.913    8       0.842
LR CG-MLE          120     0.946    294     0.916    43      0.844
SVM LIN BEST       846     0.931    1744    0.882    373     0.741
SVM LIN FAST       183     0.918    123     0.874    73      0.675
SVM RBF BEST       3594    0.939    2577    0.934    167     0.876
SVM RBF FAST       1593    0.902    932     0.864    248     0.848
KNS2 K=1           424     0.790    74      0.785    9       0.753
KNS2 K=9           782     0.909    166     0.894    14      0.859
KNS2 K=129         2381    0.938    819     0.938    89      0.909
BC                 4       0.884    8       0.890    2       0.863

Performance notes

All experiments were 10-fold cross-validations, so a single training run of each algorithm is about ten times faster than the times shown.

The SVM times and scores came from extensive per-dataset tuning, and the times do not include the time spent tuning.

Using distance-from-boundary ranking for SVM may skew the least-confident SVM predictions and affect the score. SVM regression may be better suited to rank-ordering predictions.

The LR algorithms used the same parameters for every dataset.

A favorite quasi-Newton method for likelihood optimization is BFGS. We have tried GSL’s BFGS, but we cannot make it competitive with TR-IRLS or nonlinear CG (CG-MLE).

We observed strange behavior with linear SVM on ds1.10.


SVMlight on ds1.10: very strange

[Figure: SVMlight with a linear kernel on train10.pca.csv. AUC (0.4 to 1) and time in seconds (0.01 to 10, log scale) plotted against the capacity parameter, swept from 1 to 100,000.]

LIBSVM on ds1.10: slightly strange

[Figure: LIBSVM with a linear kernel on train10.pca.csv. AUC (0.4 to 1) and time in seconds (0.01 to 10, log scale) plotted against the capacity parameter, swept from 1 to 100,000.]

Fun things we have done with LR

Fun things to do with fast LR:

high-throughput screening for active molecules in pharmaceutical datasets

collaborative filtering (think amazon.com “also bought”)

link analysis – more on this later

text classification/analysis

automatic quiz generation (coupled with associative rule learning and dynamic AD-trees)

...

Anything with high-dimensional binary classification is fair game, even if you have to pervert the problem to make it fit.

High-Throughput Screening for Drugs

Background:

Roboticized chemistry labs test many compounds for reactivity with a target molecule.

Domain knowledge can reduce errors to 1/1000, e.g. 50 false positives and 50 false negatives out of 100,000 trials.

There might be 200 actual positives in 100,000 trials.

Thus the lab misses 25% of potential drugs (50 of the 200 actual positives).

This makes the pharmacists very unhappy.

Wasting lab time on 99,800 inactive compounds makes chemists unhappy.


High-Throughput Screening for Drugs

One machine learning approach:

Featurize molecules with many binary descriptors, e.g. 1,000,000.

Learn a model from molecules to reported molecule activities (noisy).

Use the models to identify mislabeled compounds,

or use the models to schedule untested molecules.

In both cases we care about ranking performance, and hence are interested in the ROC curves on the next slides.

There are many other opportunities for machine learning to improve pharmaceutical research.
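
As a small illustration of the scheduling use, a sketch that ranks untested molecules by predicted activity (assumes an sklearn-style classifier with predict_proba; the slides' own code is the Auton Lab's fast LR):

    import numpy as np

    def schedule_untested(model, X_untested, budget):
        """Rank untested molecules by predicted activity probability and return
        the indices of the top `budget` candidates for lab time."""
        p = model.predict_proba(X_untested)[:, 1]   # assumes an sklearn-style classifier
        return np.argsort(-p)[:budget]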

High-Throughput Screening for Drugs

[Figure: ROC curves (true positives vs. false positives) for dataset ds1, with a linear false-positives axis. Classifiers shown: LR-CGEPS, LR-CGDEVEPS, CG-MLE, SVM LINEAR, SVM RBF (gamma=0.001), BC, KNN k=1, KNN k=9, KNN k=129.]

High-Throughput Screening for Drugs

[Figure: ROC curves (true positives vs. false positives) for dataset ds1, with a logarithmic false-positives axis. Same classifiers as the previous slide.]

Collaborative Filtering

Prototypical collaborative filtering example:

    People that bought the CD “Oops I did it again” also bought “Dirty Deeds Done Dirt Cheap”.

A more interesting example might rank a list of items missing from the customer’s shopping cart, in order of likely necessity:

    People that buy cookies should also consider (in descending order) milk, frosting, weight-watchers’ frozen entrees, and toothpicks.

Collaborative Filtering

Restate the problem in terms of binary classification. Assume historical cart data for training.

For each item i_k in the store, create a dataset:

CartID   i_1  ...  i_(k-1)   i_(k+1)  ...  i_m   i_k also in cart?
000      1    ...  0         1        ...  0     No
001      0    ...  1         1        ...  0     Yes
002      1    ...  0         1        ...  1     Yes
. . .

Learn m classification models, mapping cart contents to the probability that item i_k is also present.

A fast version of LR can be competitive with task-specific algorithms in speed and accuracy (depending on the precise details of the problem).
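
A minimal sketch of this restatement, using scikit-learn's LogisticRegression as a stand-in for a fast LR code (the carts matrix is an assumed 0/1 cart-by-item encoding):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_item_models(carts):
        """carts: (n_carts, m_items) 0/1 matrix. Fit one LR model per item k,
        mapping the other items' presence to P(item k also in cart). Assumes
        every item is present in some carts and absent from others."""
        models = []
        for k in range(carts.shape[1]):
            X = np.delete(carts, k, axis=1)    # all items except i_k
            models.append(LogisticRegression(max_iter=1000).fit(X, carts[:, k]))
        return models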

Link Analysis

A “link” is a collection of “tokens”, best described using examples:

link=Research paper, tokens=Authors

link=Movie, tokens=Actors, Directors, Producers

link=Article, tokens=Names, Places, Dates

A link dataset might have one row per link (e.g. research paper).

Some link tasks:

    Link completion: identical to collaborative filtering

    TFF - Temporal Friend Finder

    MNOP - Many Names, One Person

    AFDL - Activity From Demographics and Links


TFF

Temporal Friend Finder: the diagram below shows publishing links involving several people over several years.

[Figure: a 2002–2006 timeline of co-author groups such as (A,S), (A,D), (A,P), (A,S,D), (A,T), (A,P,T), and (P,J), with the 2006 entries marked “?”. Key: [A]ndrew, [D]avid, [J]eremy, [P]aul, [S]cott, [T]ing.]

Who will Paul publish with in 2006?

What is the probability that Andrew will publish with Jeremy?

Rank-order all authors’ probability of publishing with Andrew.

TFF

Approach:

Featurize the time series:

    X, Y connected 1 time unit ago, dist=1, strength=0.9+
    X, Y connected 1 time unit ago, dist=1, strength=0.1+
    X, Y connected 1 time unit ago, dist=2, strength=0.9+
    X, Y connected 1 time unit ago, dist=2, strength=0.1+
    X, Y connected 2 time units ago, dist=1, strength=0.9+
    X, Y connected 2 time units ago, dist=1, strength=0.1+
    . . .

Learn a model on the featurized time series with LR.

There are many attributes, but not too many for a fast LR implementation.
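
As an illustration only, a toy featurizer in the spirit of the list above (the pair_history representation and thresholds are assumptions, not the slides' actual encoding):

    def tff_features(pair_history, horizons=(1, 2), dists=(1, 2), strengths=(0.9, 0.1)):
        """Binary features 'X, Y connected t time units ago, dist <= d, strength >= s'.
        pair_history is assumed to map t -> (distance, strength) for the pair (X, Y)."""
        feats = []
        for t in horizons:
            d_obs, s_obs = pair_history.get(t, (None, 0.0))
            for d in dists:
                for s in strengths:
                    feats.append(int(d_obs is not None and d_obs <= d and s_obs >= s))
        return feats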


MNOP - Hsiung and Moore

Many Names, One Person: alias detection, e.g. are Clark Kent and Superman one or two people?

Featurize the link dataset:

    Compute many orthographic measures between “Clark Kent” and “Superman”, e.g. string edit distance.

    Compute semantic measures from links, e.g. “Clark Kent” appears with “Lois Lane” 10 times, and “Lois Lane” appears with “Superman” 6 times.

There are many orthographic and semantic measures. Use many, and allow LR to combine them.

In truth, MNOP did not stress our LR code, and slower LR would probably have been fine. But MNOP is still a nice application for probabilistic binary classification.


AFDL

Activity From Demographics and Links: entities are “active” or “inactive”. Some entities are known to be active. What is the probability that an unlabeled entity is active? Besides links, also use demographic information.

Similar featurization to TFF:

    no time component necessarily, just graph measures

    however, graph measures (edge strength) can include any demographic information: dates, places, favorite color, etc.

AFDL

Thus we are constructing another combinatorial set of measures of proximity from an unlabeled entity to known active entities.

“Learn, in an automated, self-tuning-to-specific-data way, the right fusion of this information.”

Read: use LR to combine measures and make predictions.

In this case, the featurizations can be extremely large, and fast LR is essential.


Text Analysis

Performed train/test and k-fold experiments with the Reuters corpus.

Scored using AUC as well as micro- and macro-averaged precision, recall, and F1.

Scores were very similar to SVM scores, but LR ran a bit faster.

CosmoQuiz - Komarek and Moore

From a census database, a quiz to predict wealth=rich:

Score     Question
[+1.68]   Are you married?
[-2.04]   Was your capital gain below $10,000?
[-0.87]   Was your capital loss below $500?
[+0.47]   Are you at least 36 years old?
[+0.64]   Are you over 46 years old?
[+0.74]   Do you work over 48 hours per week?
[+1.06]   Are you a managing executive?
. . .

    $\mathrm{Prob}(\mathrm{rich} \mid \mathrm{total}) = e^{(\mathrm{total} - 0.46)} / (1 + e^{(\mathrm{total} - 0.46)})$
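
The quiz is just an LR model whose weighted questions sum into the logit; a minimal sketch of scoring one respondent with the slide's scores and intercept (the answers themselves are hypothetical):

    import math

    # (score, answered_yes) pairs for one hypothetical respondent:
    answers = [(+1.68, True), (-2.04, True), (-0.87, False), (+0.47, True),
               (+0.64, False), (+0.74, False), (+1.06, True)]

    total = sum(score for score, yes in answers if yes)
    prob_rich = math.exp(total - 0.46) / (1.0 + math.exp(total - 0.46))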

Computer-generated Quizzes

What is a CosmoQuiz?

Questions are conjunctions of att=val pairs.

Exhaustive question scoring is the hardest part.

Questions are chosen iteratively, and the current questions are weighted by logistic regression.

The resulting classification error is used for the next iteration.

LR shouldn’t slow things down, since question scoring requires more computation.

I am presenting this because it is a neat LR application, even if not much of a *fast* LR application.


Exhaustive Question Selection

Each question is a conjunction of att=val pairs:

    a1=v1 AND a2=v2 AND ... AND ak=vk

Example: gender=Male AND capitalloss=v0:500-

We search over all questions up to some length, considering how much each reduces prediction errors. For this dataset, there are

    162 one-att questions; 14,642 two-att questions; ...

    130 trillion six-attribute questions

In 2.3 hours, using dynamic AD-trees and fast LR, we can

    rank the helpfulness of all 133 trillion questions with ≤ 6 atts

    do this 100 times to find the 100 best questions

    (the helpfulness of a question depends on the previous questions)

For a 100-question quiz with ≤ 2 atts per question: 3.3 minutes.
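
To make the combinatorics concrete, a toy enumerator of conjunction questions (hypothetical attributes; the real system scores these with dynamic AD-trees rather than materializing them):

    from itertools import combinations, product

    attrs = {"gender": ["Male", "Female"], "married": ["Yes", "No"]}  # hypothetical

    def questions(max_len):
        """Yield every conjunction of att=val pairs over distinct attributes."""
        names = sorted(attrs)
        for k in range(1, max_len + 1):
            for chosen in combinations(names, k):
                for vals in product(*(attrs[a] for a in chosen)):
                    yield " AND ".join(f"{a}={v}" for a, v in zip(chosen, vals))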

CosmoQuiz Applications

Any time you have historical, labeled data on a population, you can create a CosmoQuiz. Some examples include:

Assess the risk of an insurance customer being defrauded.

Create safety checklists for uncertain environments.

Lie detection: use small, non-leading questions to assess the probability of truth.

Create troubleshooting and diagnostic procedures from historical engine data.

Political profiling or identifying likely donors.

Future of LR: bigger, faster

Just how big a classification problem can we solve with LR? How fast can we make it go?

The bottleneck for sparse LR is the sparse dot product (DP) operator.

Sparse DP accounts for 30 times more cycles than the next most expensive numerical operation (log()), according to Valgrind, on ds1.

1/3 of DPs are for the matrix-vector multiply, 1/3 for the matrix-transpose times vector, and 1/3 for computing the current model predictions.

We need to be very efficient with very large datasets, for example those stored in large relational databases. We may also want to consider vectorization or parallelization of DP.
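
For reference, the operation in question, sketched for one sparse binary row against the dense parameter vector (the index-list representation is an assumption; the example indices are those from the parallelization slide below):

    import numpy as np

    def sparse_binary_dot(nonzero_idx, beta):
        """Dot product of a sparse binary row (stored as its nonzero column indices)
        with the dense parameter vector: an irregular gather, then a sum."""
        return beta[nonzero_idx].sum()

    beta = np.random.randn(2000)
    row = np.array([0, 1, 3, 14, 27, 300, 301, 302, 310, 1902])
    value = sparse_binary_dot(row, beta)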

Future of LR: in database?

Many large datasets are stored in relational databases. What can be done to improve LR performance on this data?

We assume the data is stored across many tables.

We assume LR would operate on the results of SQL queries.

Can we implement a DP operator that works implicitly on SQL query results, without explicitly creating the result tables?

Future of LR: vectorization?

Vectorization of sparse dot products might not improve performance.

The LR parameter vector is dense, but we take its dot product with a sparse (binary) vector.

Thus we stride irregularly through the LR parameters.

This reduces processor cache efficiency.

Vectorization won’t help if data fetch is the bottleneck.

Future of LR: parallelization?

Parallelization of dot products could take two forms.

We usually compute many DPs against the same vector of LR parameters. We could use different CPUs for different DPs.

We can break each DP into smaller chunks. However, sparse DPs will require tricky partitioning of the sparse vectors. E.g. partition this sparse index vector into decades of indices:

    {0, 1, 3, 14, 27, 300, 301, 302, 310, 1902}

The first approach does not help with cache problems, except insofar as we are able to parallelize fetch latency.

The second approach would be complicated for sparse vectors, but the sparse vectors in LR come from the (static) data matrix, so the partitioning could be precomputed. Possible?


Conclusions

LR is mature, well-understood, and not dead yet.

Recent spotty interest has made it faster than ever for classification tasks.

Many tasks can be restated as binary, probabilistic classification problems.

So featurize your eyes out, and find a fast LR code.

(e.g. http://komarix.org/lr, or http://www.autonlab.org)


Logistic regression for fast, accurate, and parameter free data mining

Paul Komarek
Auton Lab, Robotics Institute
Carnegie Mellon University

[email protected]
http://www.komarix.org
http://www.autonlab.org