streaming feature selection - stanford university

Streaming Feature Selection

Bob StineDepartment of Statistics

Wharton School, University of Pennsylvaniawww-stat.wharton.upenn.edu/~stine

Wharton Department of Statistics

PlanMotivating applications: predictive models

Credit default ratesLinguistics

Sequential testingAlpha investing

Robust standard errorsSandwich estimator

Auction frameworkBlend several streams, strategies

CollaboratorsDean FosterDongyu Lin

2


Applications


Spatial Temporal ModelsGoal

Predict default rates, such as in credit cards

4


Spatial Temporal ModelsGoal

Predict default rates, such as in credit cards

4

3,000 counties x 70 quartersn = 210,000

Plan to move to individual consumer next...


Spatial Dependence

Conditional AR? (Markovian)How many layers?Distance measure?One shoe fits all?

5


Spatial Temporal ModelsRefined goal: compare to benchmark

Predict default rates better than possible using only the local history of default.Implications for bank’s data needs

Possible predictorsMacroeconomic factorsDefault trends in nearby countiesNon-linear effects, interactionsSpatial variation in model structure

Modeling issuesDependence (spatial, temporal)Heterogeneity among countiesPopulation drift: EBay patterns, hiring model

6


Computational LinguisticsVariety of applications…

Word disambiguationDoes “Georgia” refer to a person, US state, or perhaps to a Nation?Speech taggingIdentifying noun, verb, adjective...Cloze (predicting the next word)“...in the midst of modern life the greatest, ___”

Huge corpus of datax,000,000 casesnovels, news feeds, web pagestext of Wikipedia used to seem huge

7


Challenges in TextCloze

Is the next word “the” or “her”?“...in the midst of modern life the greatest, ___”Balanced training data with 50/50 rate

Possible predictorsWord frequencies (bag of words)Neighboring sentences/wordsParts of speech, tree banks, stem words, synonyms

Transfer learningDo predictors based on Washington Post work for text from NY Times?Dependence, unobserved latent structure

8


Similarities

Predict word in new documents, different authorsLatent structure associated with corpus

Neighborhoods: nearby words

Vast possible corpus

Sparse

Predict rates in same locations, but changing economic conditionsLatent temporal changes as economy evolves

Neighborhoods: nearby locations, time periods

Only 3,000 counties but possible to drill lower

May be sparse

9

Text Credit


Methods


Modeling ChallengeWe like regression models

Familiar, interpretable, good diagnostics

Regression models have worked wellPredicting rare events, such as bankruptcyCompetitive with random forestFunction estimation, using wavelets and variations on thresholding

Extend to rich environmentsSpatial-temporal dataRetail credit default!! ! ! ! ! MRF, MCMCLinguistics, text miningWord disambiguation, cloze!! ! TF-IDF

Avoid overfitting...

11TF-IDF: term frequency-inverse document frequencyfrequency in document relative to frequency in corpus MRF: Markov random fields


Recent news

12


Lessons from Prior ModelingBankruptcy: n=500,000, p=60,000, 450 events“Breadth-first” search for best features

Slow, memory hogSevere penalty on largest z-score, sqrt(2 log p)

If tested features are mostly interactions, then selected features are mostly interactions

Exampleμ»0 and β1, β2 " 0, then X1*X2 ⇒ c + β1X1 + β2X2

Outliers cause problems even with large n

13

Real p-value # 1/1000, but

usual t-statistic # 10


Spatial Outliers Happen

14


Reaction to LessonsBreadth-first becomes streaming selection

Sequence of possible featuresExamining each is very fastOver-fitting? Multiplicity adjustments?

Fixed significance levels replaced by levels that vary with the type of the variable

Heuristic: Revised Bonferroni (ie, hard threshold)Divide α level equally between linear & interactions! p linear: test each at level α/(2p)! p2 interactions: test at level α/(2p2)

Rather than trust model to obtain standard errors, use a robust estimate.

15


Methods Overview“Linear” regression! ! ! Y = b0 + b1 X1 + b2 X2 + ...

Auction selection from multiple “experts”Explore expansive feature space, includinginteractions and nonlinear subspacesExploit exogenous information

Robust standard errors and p-valuesAccommodate dependence and heterogeneity

Alpha investingControl over-fitting adaptively

16


Feature Auction

17

Auction

Expert1

Expert2 ...

ExpertN

α1 α2

αN

Collection of experts bid for the

opportunity to recommend feature

Ymodel


Feature Auction

17

Auction

Expert1

Expert2 ...

ExpertNα2



Auction collects winning bid α2

Expert supplies recommended feature Xw

XwY

model


Feature Auction

17

Auction

Expert1

Expert2 ...

ExpertN



Auction collects winning bid α2

Expert supplies recommended feature Xw

pw

Expert receives payoff ω if pw $ α2

Experts learn if the bid was accepted, not the effect size or pw.

Ymodel

Stat model returns p-value


Experts

18

Auction

X1,X2,X3,...

Elin

Xj

Equad

XjXk

Z1,Z2,Z3,...

Elin

Zj

Elag

Zt-s

Ecal

Scaven

gers

f(%)Eg

g(Xa1)

Xaccepted Xrejected

SubspacesESVD

ERKHSSj

Bj

Source Experts


ExpertsExpertStrategy for creating list of features. Experts embody domain knowledge, science of application.Source experts

A collection of measurements (eg, synonyms, clusters)Subspace basis (PCA, RKHS)Lags of a time series

Parasitic experts, scavengersInteractions - among features accepted into model- among features rejected by model- between those accepted with those rejectedTransformations- segmenting, as in scatterplot smoothing- polynomial transformations

19


Expert WealthExpert is rewarded if feature accepted

Experts have alpha-wealthIf recommended feature is accepted in the model, expert earns ω additional wealthIf recommended feature is refused, expert loses bid

As auction proceeds, the auctionRewards experts that offer useful features. These then can win later bids and recommend more X’sEliminates experts whose features are not accepted.

Taxes fund parasites and scavengersContinue control overall FDR

Criticalcontrol multiplicity in a sequence of hypothesesp-values determine useful features

20


Standard Errors


Robust Standard Errorsp-values depend on many things

p-value = f(effect size, std error, prob dist)Error structure likely heteroscedasticObservations frequently dependent

DependenceSpatial time series at multiple locationsDocuments from various news feedsTransfer learningWhen train on observations from selected regions or document sources, what can you infer to others?

What are the right degrees of freedom?Tukey story

22


Sandwich EstimatorUsual OLS estimate of variance

Assume your model is true

Sandwich estimatorsRobust to deviations from assumptions

23

var(b) = (X’X)-1X’E(ee’)X(X’X)-1 = σ2(X’X)-1(X’X) (X’X)-1 = σ2(X’X)-1

heteroscedasticityvar(b) = (X’X)-1X’E(ee’)X(X’X)-1 = (X’X)-1 X’D2X (X’X)-1

diagonal

dependencevar(b) = (X’X)-1X’E(ee’)X(X’X)-1 = σ2(X’X)-1 X’BX (X’X)-1

block diagonal

Essentially the “Tukey method”


Flashback...Heteroscedastic errors

Estimate standard error with outlierSandwich estimator allowing heteroscedastic error variances givesa t-stat # 1, not 10.

Dependent errorsEven more need for accurate SENetflix exampleBonferroni (hard thresholding) overfits due to dependence in responses.Credit modelingEverything seems significant unless incorporate dependence into the calculation of the SE

24


Sequential Testing


Alpha InvestingContext

Test possibly infinite sequence of m hypotheses! ! H1, H2, H3, … Hm … obtaining p-values p1, p2, ...Order of tests can depend prior outcomes

ProcedureStart with an initial alpha wealth W0 = αInvest wealth 0 $ αj $ Wj in the test of Hj

Change in wealth depends on test outcomeω $ α denotes the payout earned by rejecting

26

Wj - Wj-1 =ω if pj $ αj

-αj/(1-αj) if pj > αj


Alpha Investing MartingaleProvides uniform control of the expected false discovery rate. At any stopping time during testing, martingale argument shows

Flexibility in choice of how to invest alpha-wealth in test of each hypothesis

Invest more when just reject if suspectthat significant results cluster.Universal strategies

Avoids need to compute p-values in advance

28

E(#false rejects)E(#rejects)+1

supθ

$ α


ConnectionsOther methods of controlling false positives are special casesBonferroni test of H1,...,Hm

Set W0 = α and reward ω = 0Bid αj = α/m

Step-down test of Benjamini & HochbergSet W0 = α and reward ω = αTest all m at level α/mIf none are significant, doneIf one is significant, earn α back

Test remaining m-1 conditional on pj> α/m

29


Benefits of KnowledgeSimulationTest m = 200hypothesesCompare powerto Benjami-HochbergSignal from spikeand slab prior

30

Alpha investing

Oracle BH

correct order

random order


Next StepsReplace the martingale that controls alpha wealth by one that controls expected loss.Improved experts: more features

Neighborhood structure is an important method to create new types of features! - geographical! - temporalBoth are links to other rows.

Better softwareFront endBack endGet some of that faster matrix code

31


ReferencesFeature auction

www-stat.wharton.upenn.edu/~stine

Alpha investing“α-investing: a procedure for sequential control of expected false discoveries”, JRSSB, 2006

Early improved stepwise regression“Variable selection in data mining: Building a predictive model for bankruptcy”, JASA, 2004

Robust standard errors“Variable selection in models with blockwise dependence”, Lin and Foster.

32

Thanks!