streaming feature selection - stanford university

34
Streaming Feature Selection Bob Stine Department of Statistics Wharton School, University of Pennsylvania www-stat.wharton.upenn.edu/~stine

Upload: others

Post on 11-Feb-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Streaming Feature Selection

Bob StineDepartment of Statistics

Wharton School, University of Pennsylvaniawww-stat.wharton.upenn.edu/~stine

Wharton Department of Statistics

PlanMotivating applications: predictive models

Credit default ratesLinguistics

Sequential testingAlpha investing

Robust standard errorsSandwich estimator

Auction frameworkBlend several streams, strategies

CollaboratorsDean FosterDongyu Lin

2

Wharton Department of Statistics

Applications

Wharton Department of Statistics

Spatial Temporal ModelsGoal

Predict default rates, such as in credit cards

4

Wharton Department of Statistics

Spatial Temporal ModelsGoal

Predict default rates, such as in credit cards

4

3,000 counties x 70 quartersn = 210,000

Plan to move to individual consumer next...

Wharton Department of Statistics

Spatial Dependence

Conditional AR? (Markovian)How many layers?Distance measure?One shoe fits all?

5

Wharton Department of Statistics

Spatial Temporal ModelsRefined goal: compare to benchmark

Predict default rates better than possible using only the local history of default.Implications for bank’s data needs

Possible predictorsMacroeconomic factorsDefault trends in nearby countiesNon-linear effects, interactionsSpatial variation in model structure

Modeling issuesDependence (spatial, temporal)Heterogeneity among countiesPopulation drift: EBay patterns, hiring model

6

Wharton Department of Statistics

Computational LinguisticsVariety of applications…

Word disambiguationDoes “Georgia” refer to a person, US state, or perhaps to a Nation?Speech taggingIdentifying noun, verb, adjective...Cloze (predicting the next word)“...in the midst of modern life the greatest, ___”

Huge corpus of datax,000,000 casesnovels, news feeds, web pagestext of Wikipedia used to seem huge

7

Wharton Department of Statistics

Challenges in TextCloze

Is the next word “the” or “her”?“...in the midst of modern life the greatest, ___”Balanced training data with 50/50 rate

Possible predictorsWord frequencies (bag of words)Neighboring sentences/wordsParts of speech, tree banks, stem words, synonyms

Transfer learningDo predictors based on Washington Post work for text from NY Times?Dependence, unobserved latent structure

8

Wharton Department of Statistics

Similarities

Predict word in new documents, different authorsLatent structure associated with corpus

Neighborhoods: nearby words

Vast possible corpus

Sparse

Predict rates in same locations, but changing economic conditionsLatent temporal changes as economy evolves

Neighborhoods: nearby locations, time periods

Only 3,000 counties but possible to drill lower

May be sparse

9

Text Credit

Wharton Department of Statistics

Methods

Wharton Department of Statistics

Modeling ChallengeWe like regression models

Familiar, interpretable, good diagnostics

Regression models have worked wellPredicting rare events, such as bankruptcyCompetitive with random forestFunction estimation, using wavelets and variations on thresholding

Extend to rich environmentsSpatial-temporal dataRetail credit default!! ! ! ! ! MRF, MCMCLinguistics, text miningWord disambiguation, cloze!! ! TF-IDF

Avoid overfitting...

11TF-IDF: term frequency-inverse document frequencyfrequency in document relative to frequency in corpus MRF: Markov random fields

Wharton Department of Statistics

Recent news

12

Wharton Department of Statistics

Lessons from Prior ModelingBankruptcy: n=500,000, p=60,000, 450 events“Breadth-first” search for best features

Slow, memory hogSevere penalty on largest z-score, sqrt(2 log p)

If tested features are mostly interactions, then selected features are mostly interactions

Exampleμ»0 and β1, β2 " 0, then X1*X2 ⇒ c + β1X1 + β2X2

Outliers cause problems even with large n

13

Real p-value # 1/1000, but

usual t-statistic # 10

Wharton Department of Statistics

Spatial Outliers Happen

14

Wharton Department of Statistics

Reaction to LessonsBreadth-first becomes streaming selection

Sequence of possible featuresExamining each is very fastOver-fitting? Multiplicity adjustments?

Fixed significance levels replaced by levels that vary with the type of the variable

Heuristic: Revised Bonferroni (ie, hard threshold)Divide α level equally between linear & interactions! p linear: test each at level α/(2p)! p2 interactions: test at level α/(2p2)

Rather than trust model to obtain standard errors, use a robust estimate.

15

Wharton Department of Statistics

Methods Overview“Linear” regression! ! ! Y = b0 + b1 X1 + b2 X2 + ...

Auction selection from multiple “experts”Explore expansive feature space, includinginteractions and nonlinear subspacesExploit exogenous information

Robust standard errors and p-valuesAccommodate dependence and heterogeneity

Alpha investingControl over-fitting adaptively

16

Wharton Department of Statistics

Feature Auction

17

Auction

Expert1

Expert2 ...

ExpertN

α1 α2

αN

Collection of experts bid for the

opportunity to recommend feature

Ymodel

Wharton Department of Statistics

Feature Auction

17

Auction

Expert1

Expert2 ...

ExpertNα2

Collection of experts bid for the

opportunity to recommend feature

Auction collects winning bid α2

Expert supplies recommended feature Xw

XwY

model

Wharton Department of Statistics

Feature Auction

17

Auction

Expert1

Expert2 ...

ExpertN

Collection of experts bid for the

opportunity to recommend feature

Auction collects winning bid α2

Expert supplies recommended feature Xw

pw

Expert receives payoff ω if pw $ α2

Experts learn if the bid was accepted, not the effect size or pw.

Ymodel

Stat model returns p-value

Wharton Department of Statistics

Experts

18

Auction

X1,X2,X3,...

Elin

Xj

Equad

XjXk

Z1,Z2,Z3,...

Elin

Zj

Elag

Zt-s

Ecal

Scaven

gers

f(%)Eg

g(Xa1)

Xaccepted Xrejected

SubspacesESVD

ERKHSSj

Bj

Source Experts

Wharton Department of Statistics

ExpertsExpertStrategy for creating list of features. Experts embody domain knowledge, science of application.Source experts

A collection of measurements (eg, synonyms, clusters)Subspace basis (PCA, RKHS)Lags of a time series

Parasitic experts, scavengersInteractions - among features accepted into model- among features rejected by model- between those accepted with those rejectedTransformations- segmenting, as in scatterplot smoothing- polynomial transformations

19

Wharton Department of Statistics

Expert WealthExpert is rewarded if feature accepted

Experts have alpha-wealthIf recommended feature is accepted in the model, expert earns ω additional wealthIf recommended feature is refused, expert loses bid

As auction proceeds, the auctionRewards experts that offer useful features. These then can win later bids and recommend more X’sEliminates experts whose features are not accepted.

Taxes fund parasites and scavengersContinue control overall FDR

Criticalcontrol multiplicity in a sequence of hypothesesp-values determine useful features

20

Wharton Department of Statistics

Standard Errors

Wharton Department of Statistics

Robust Standard Errorsp-values depend on many things

p-value = f(effect size, std error, prob dist)Error structure likely heteroscedasticObservations frequently dependent

DependenceSpatial time series at multiple locationsDocuments from various news feedsTransfer learningWhen train on observations from selected regions or document sources, what can you infer to others?

What are the right degrees of freedom?Tukey story

22

Wharton Department of Statistics

Sandwich EstimatorUsual OLS estimate of variance

Assume your model is true

Sandwich estimatorsRobust to deviations from assumptions

23

var(b) = (X’X)-1X’E(ee’)X(X’X)-1 = σ2(X’X)-1(X’X) (X’X)-1 = σ2(X’X)-1

heteroscedasticityvar(b) = (X’X)-1X’E(ee’)X(X’X)-1 = (X’X)-1 X’D2X (X’X)-1

diagonal

dependencevar(b) = (X’X)-1X’E(ee’)X(X’X)-1 = σ2(X’X)-1 X’BX (X’X)-1

block diagonal

Essentially the “Tukey method”

Wharton Department of Statistics

Flashback...Heteroscedastic errors

Estimate standard error with outlierSandwich estimator allowing heteroscedastic error variances givesa t-stat # 1, not 10.

Dependent errorsEven more need for accurate SENetflix exampleBonferroni (hard thresholding) overfits due to dependence in responses.Credit modelingEverything seems significant unless incorporate dependence into the calculation of the SE

24

Wharton Department of Statistics

Sequential Testing

Wharton Department of Statistics

Alpha InvestingContext

Test possibly infinite sequence of m hypotheses! ! H1, H2, H3, … Hm … obtaining p-values p1, p2, ...Order of tests can depend prior outcomes

ProcedureStart with an initial alpha wealth W0 = αInvest wealth 0 $ αj $ Wj in the test of Hj

Change in wealth depends on test outcomeω $ α denotes the payout earned by rejecting

26

Wj - Wj-1 =ω if pj $ αj

-αj/(1-αj) if pj > αj

Wharton Department of Statistics

Alpha Investing MartingaleProvides uniform control of the expected false discovery rate. At any stopping time during testing, martingale argument shows

Flexibility in choice of how to invest alpha-wealth in test of each hypothesis

Invest more when just reject if suspectthat significant results cluster.Universal strategies

Avoids need to compute p-values in advance

28

E(#false rejects)E(#rejects)+1

supθ

$ α

Wharton Department of Statistics

ConnectionsOther methods of controlling false positives are special casesBonferroni test of H1,...,Hm

Set W0 = α and reward ω = 0Bid αj = α/m

Step-down test of Benjamini & HochbergSet W0 = α and reward ω = αTest all m at level α/mIf none are significant, doneIf one is significant, earn α back

Test remaining m-1 conditional on pj> α/m

29

Wharton Department of Statistics

Benefits of KnowledgeSimulationTest m = 200hypothesesCompare powerto Benjami-HochbergSignal from spikeand slab prior

30

Alpha investing

Oracle BH

correct order

random order

Wharton Department of Statistics

Next StepsReplace the martingale that controls alpha wealth by one that controls expected loss.Improved experts: more features

Neighborhood structure is an important method to create new types of features! - geographical! - temporalBoth are links to other rows.

Better softwareFront endBack endGet some of that faster matrix code

31

Wharton Department of Statistics

ReferencesFeature auction

www-stat.wharton.upenn.edu/~stine

Alpha investing“α-investing: a procedure for sequential control of expected false discoveries”, JRSSB, 2006

Early improved stepwise regression“Variable selection in data mining: Building a predictive model for bankruptcy”, JASA, 2004

Robust standard errors“Variable selection in models with blockwise dependence”, Lin and Foster.

32

Thanks!