copyright © 2015, sas institute inc. all rights reserved.€¦ · case study - surrogate modeling...
TRANSCRIPT
Copyr ight © 2015, SAS Ins t i tu te Inc . A l l r igh ts reserved.
Copyr ight © 2015, SAS Ins t i tu te Inc . A l l r igh ts reserved.
BUILDING BETTER MODELSUsing Robust Data Mining Methods
Tom Donnelly, PhD, CAPSr. Systems Engineer & Co-Insurrectionist
JMP Federal Government [email protected]
302-489-9291
OUTLINE Building Better Models - Preventing Overfitting
• Case Study - Surrogate Modeling
• Honest Assessment Method
• CKPS Today? Tomorrow? Physics (timeless) vs People (shifty/adaptive)
• Case Study - FAA Data/Models
• Cross-Validation approaches – K-fold, Leave-One-Out
• Penalization criteria – BIC, AICc, ERIC, Adjusted R-Square
• Comparing model performance, statistics, visual
SURROGATE MODELING OF A COMPUTER SIMULATIONHELICOPTER SURVEILLANCE – IDENTIFYING INSURGENTS
• 2009 International Data Farming Workshop - IDFW21, Lisbon, Portugal• Largely German team (6 of 8) – their simulation• 6500 simulations run overnight on cluster in Frankfurt Space Filling Design of Experiments (DOE) 65 unique combinations of 6 factors (each factor at 65 levels) each case had 96 to 100 replications (lost a few)
• Response = Proportion of Insurgents Identified = PropIdentINS Data bounded between 0 and 1
• Explore data visually first• Fit many different models using “Train, Validate (Tune), Test” subsets • Compare Actual vs. Predicted for Test Subsets
SPACE FILLING DOE
Scatterplot 3D
Data Columns InsurgentCamouflage TigerHeight Tiger1_Distance
Scatterplot Matrix
500
900
1300Ti
gerH
eigh
t
1000
2500
4000
Tige
r1_D
istan
ce
35
55
75
Tige
rSpe
e
dRel
ativ
e
26
34
42
onvo
ySpe
ed
16
24
32
num
_INS
2_AK
47
15 55
InsurgentC
amouflage
500 1650
TigerHeight
1000 6250
Tiger1_D
istance
35 75
TigerSpee
dRelative
26 42
onvoySpeed
DISTRIBUTIONS OF RESPONSE AND 6
FACTORS
PROPIDENTINS VS. X FOR 6 FACTORS
Graph Builder
PropIdentINS & Mean(PropIdentINS) vs. TigerHeight
Prop
Iden
tINS
0.0
0.2
0.4
0.6
0.8
1.0
200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600
TigerHeight
Graph Builder
PropIdentINS & Mean(PropIdentINS) vs. TigerSpeedRelative
Prop
Iden
tINS
0.0
0.2
0.4
0.6
0.8
1.0
20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
TigerSpeedRelative
Graph Builder
PropIdentINS & Mean(PropIdentINS) vs. Tiger1_Distance
Prop
Iden
tINS
0.0
0.2
0.4
0.6
0.8
1.0
0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000
Tiger1_Distance
Graph Builder
PropIdentINS & Mean(PropIdentINS) vs. ConvoySpeed
Prop
Iden
tINS
0.00
0.20
0.40
0.60
0.80
1.00
20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
ConvoySpeed
Graph Builder
PropIdentINS & Mean(PropIdentINS) vs. num_INS2_AK47
Prop
Iden
tINS
0.0
0.2
0.4
0.6
0.8
1.0
10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
num_INS2_AK47
Graph Builder
PropIdentINS & Mean(PropIdentINS) vs. InsurgentCamouflage
Prop
Iden
tINS
0.0
0.2
0.4
0.6
0.8
1.0
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
InsurgentCamouflage
2-D CONTOUR PLOT AND 3-D RESPONSE SURFACE
PROPIDENTINS VS. CAMOUFLAGE & HEIGHT
200
400
600
800
1000
1200
1400
1600Ti
gerH
eigh
t
PropIdentINS
0.80.60.40.2 00.80.60.40.2 0
0 20 40 60 80 100
InsurgentCamouflage
0InsurgentCamouflag
igerHeight
PropIdentINS
HONEST ASSESSMENT APPROACHUSING TRAIN, VALIDATE (TUNE), AND TEST SUBSETS
Used in model selection and estimating its prediction error on new dataValidation Group
Train
Validate (Tune)
Test
3874, 60%
1292, 20%
1292, 20%
The Elements of Statistical Learning – Data Mining, Inference, and PredictionHastie, Tibshirani, and Friedman – 2001 (Chapter 7: Model Assessment and Selection)
R-SQUARE VS. NUMBER OF SPLITS
TrainTestValidate (Tune)
COMPARE SEVERAL MODELS
ACTUAL VS. PREDICTED PLOTS FOR TEST DATA NEURAL NET, BOOTSTRAP FOREST AND GLM MODELS
opIdentINS & Mean(PropIdentINS) vs. BF - PropIdentINS Predictor
Validation Group
Test
Prop
Iden
tINS
-0.3
-0.2
-0.1
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3MSE: 0.11 R²: 0.915
-0.1 0.1 0.3 0.5 0.7 0.9 1.1
BF - PropIdentINS Predictor
NS & Mean(PropIdentINS) vs. GLM train set - Pred Formula PropIdentINS
Validation Group
Test
Prop
Iden
tINS
-0.3
-0.2
-0.1
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3MSE: 0.14 R²: 0.876
-0.1 0.1 0.3 0.5 0.7 0.9 1.1
GLM train set - Pred Formula PropIdentINS 2
pIdentINS & Mean(PropIdentINS) vs. NN - Predicted PropIdentINS
Validation Group
Test
Prop
Iden
tINS
-0.3
-0.2
-0.1
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3MSE: 0.12 R²: 0.914
-0.1 0.1 0.3 0.5 0.7 0.9 1.1
NN - Predicted PropIdentINS
HOW I GOT HERE
• Physicist – theory, laws, experiments were “demonstrations”
• Engineer – solve problems – make it better, faster, cheaper, more reliable
• Experimental Designer• Data nicely conditioned – balanced – often orthogonal – minimize amount of data
• Models are empirical (adding the science makes the models extrapolate better)
• Skeptical of statistics – show me the checkpoints
• Data Miner • Data messy – sometimes highly correlated, missing, outliers
• Many modeling methods – often easy to fit them all – which is “best?”
• Use some criterion to evaluate models…
• Still skeptical of statistics – show me the checkpoints
“. . . the simple idea of splitting a sample in two and then developing the hypothesis on the basis of one part and testing it on the remainder may perhaps be said to be one of the most seriously neglected ideas in statistics. If we measure the degree of neglect by the ratio of the number of cases where a method could help to the number of cases where it is actually used.” G. A. Barnard in discussion following Stone [1974, p. 133]
Stone M., “Cross-validatory choice and assessment of statistical predictions.” JRSS B 1974; 36:111–147.
QUOTE AT BEGINNING OF
CHAPTER 11 ON VALIDATION
(Sometimes use DOE to intelligently split the data! HbA1c)
It is usual to set aside 20-30% of the data for evaluation (test).
When splitting data it is important to have influential cases fairly represented in each set – a process called “stratifying the sample.”
1. The first part, the training set, is used to build the initial model. 2. The second part, the validation set, is used to adjust the initial model to
make it more general and less tied to the idiosyncrasies of the training set. 3. The third part, the test set, is used to gauge the likely effectiveness of the model
when applied to unseen data.
Three sets are necessary because once data has been used for one step in the process, it can no longer be used for the next step because the information it contains has already become part of the model; therefore, it cannot be used to correct or judge.
NO SECRET SAUCE
• Splitting data – I usually start at 50/25/25 or 60/20/20. Run several different random subsets. If models are consistent, increase training set proportion.
• As sample sizes get smaller, expect to see more variability among subsets. Run multiple subset splits to convince yourself of consistency.
• If data sets are too small to support three subsets, consider just two subsets, training and validation.
• Other alternative strategies will be discussed after next case study.
DOWNLOAD AIR TRAFFIC DATA FROM
FAA SITE
DOWNLOAD ARTCC DATA
FROM FAA SITE
LARGE NUMBER OF FLIGHTS VS. SMALL
ACTUAL VS. PREDICTED TOTAL TRAFFIC FOR ALL DATA IN TEST SUBSET
ACTUAL VS. PREDICTED TOTAL TRAFFIC FOR NO FED HOLIDAYS IN TEST SUBSET
ACTUAL VS. PREDICTED TOTAL TRAFFIC FOR ALL FED HOLIDAYS IN TEST SUBSET
ACTUAL VS. PREDICTED TOTAL TRAFFIC FOR THANKSGIVING HOLIDAY IN TEST SUBSET
ACTUAL VS. PREDICTED TOTAL TRAFFIC FOR CHRISTMAS HOLIDAY IN TEST SUBSET
10 FEDERAL HOLIDAYS 2012
ACTUAL VS. PREDICTED TOTAL FOR OCTOBER 2012W & W/O “SANDY” DATA FOR 29TH, 30TH & 31ST
ACTUAL VS. PREDICTED TOTAL TRAFFIC -PARTITION - FOR 10 FEDERAL HOLIDAYS
BY HOLDBACK SUBSETTotal vs. Partition Pred Total w Fed Holiday
Holdback
0 1 2
Tota
l
0
2000
4000
6000
8000
10000
V
WVVV
WVCCKWKM
VVC
V
C
VCC
C
WWK
K
WWW
CVV
VM
MM
1
4
CWWCCWCKVWKWLLKCCKLLLLW
W
WK1
K
1K
X
1LKMK1KLMK1MMLMLKL1KX
X
C
C4
L
VC
W
V
VWWWKKW
CCCVCC
C
V
VV
X
X
XX
WW
W
WC
V
V
X
WKK
K
K
CCCV
MMWVWC1CC1
V
WW1KKWK
V
VVC
1L14
TTX
VVVV
V
VVVV
1
1VVCV
WWWV
V
KKKK
W
1
144444T1TT
XT
X
MMMX44
4X
TTT
V
VV
V
4
1444
1144XX1
4XTT
T
VVCCCVCVWCW1VC
4LML4M4
44
4
4
L
4
4
VCC
4LMLM4
M
1CC
VVVWVWVWWCCV4
X
1X
41M1L14XTXT1T
XT
CC
LL
V
CCCC
1
XL4KMLW
W
WWWWWWWWWWW
W
44L
CCVVV
CCC
1VV
4
4
44
TTT
LLCLLCM
CM
MLM
LL4MMMLTT4
4
XKK
K1
1
1
TTX
VV
CW
CVCCWWC
VKK
MKL
KLL44LLXMMMK4XX4
4
KXKK1KKKK1V
LLL
L
LLLLCCCMMCMCCCMMLML
1
4
CLL
MMM
WWWKCKKCWCKCC
1
K1
K1
W
VVVVKWKWVKK
X
X
XXTT
MM
141444MXX1XXMM
T1T1X4
T
44MLMLLLLMMLLMM4L4
MLKXKKKML141KLM41411M
XX
4
WWWKKWWWKWKKK1V
1
1LM1LML1LM1L44
VVV
1
1
1
TTT
WWKCKCC
WWCWCVWCW
44
1441X1X1X4XXX14X1
4
LLL
MM
M
XTXT
W
M
X4LKKLKKKLM4LMMMKML4V
1
11
1X1X1XX1X111X1X1X
X
W
CKMM
44TTTTT1K
MV
XL11M
XT
14X44XTXTT4T
CKCLMLWWKMK
TTTTTC
TTTTTTTTTTT
44141XXXXTTTT
VKWX
VWX
TTTTTTTT
X
X1
XTT
CCMLML
44
1441XTT
MLKM
1XXT
1
CLCML
VKLCCM
M
VVKWW
4444WW
1
X1X1T11TX
RMSE: 543.1
V
K
M
C
WVW
MMM
VC
4
W
K
L
C
L
1
W
K
1K1
X
LMK
XLMML11
X
C
M
L
KV
V
X
W
K
K
CV
W
1
V
WV
LL1
4X
T
V
41
KKW
XX
X
VW4
1
1X
4
VV
X
1
X
X
TTTT
V
CC
1
1
CC
LM
4
LL
L
M
4
CCWWC
LMM
ML
M4T
4CL
VCKMKLWW
W
W
4C
M
VV
L
L4
4
4T
11V
V
WW
1
K
M
XLKM
44XL4
1XKKCM
L
4
LCKK
K
K
1
WVKVKV
T
M
T
X
MMM4LMM
L
LXM
1
4X4
1
V
LMMVVT
WKWWWCVCCC
14
XX
X
1L
X
TXTTCMMK
4V
1
X1
1XXX
V
W
T4T
T
1MM
T
1X
T
CTTTT
X11
TT
T
1T
T
1
X
TT
X41
T
MVK
CLLM
VWKWX
TT
TX
RMSE: 757.5
V
W
C
M
V
V
W
CC
CWVV
MMW
WVVKVKKC
W
W1K
K
K
LM
CMMLML
4
K
KVC
X
W
W
1
W
W
K
M
V
VV
K
1
V
C
X
4
1
1
VKK
CXV
W
X44T4XTTTT
M
4
1X
4
4
X
WWWWC1
L
M
V
L
C
L
4V
CCC
444
4
L
M1LL
4T4
C
L
CMKML
4
WWLL
C
VLL
LM
M
T
TTX
X
X
X
TT
C
VW
V
K
KLKXM
LKKXKMM
4X
4
KXKV
L
M
L
CWVCVW
KK
WW
KKW
T
X1
4M
44XT
TTT
MLL
L4
4
L
K
X
W111M
4
VVCXX114
XM
MMX
T
K
11XLLLK
M
X
X
X
VVVKCV
T
WC
V4W4W
LL
XT
XX
44
WTCT
X4
TT
T
1K
TTTT
LK
1T
LCM
KVCLMVK44
TTX1
RMSE: 799.6
0 2000 4000 6000 8000 0 2000 4000 6000 8000 0 2000 4000 6000 8000
Partition Pred Total w Fed Holiday
Where(Holiday Symbol = C, V, T...)
Graph Builder
Total vs. Partition Pred Total wo Fed Holiday
Holdback
0 1 2
Tota
l
0
2000
4000
6000
8000
10000
V
TT
WVCCW
M
V
T
4
T
VC
M
T
VVVV
4
4
TT
KK
1L14
X
C
M
444
V
44
4
4
CVC
MMM
X
W
1LWKK1KL11LL11X
C
WWKX
K1
X
T
C
4
M4MM
4
XT
XT
1WWKL
K
X
VWWWVK1VWKWLLKKLLLLW
11XX1
XTT
T
CCCCCC
4
W
WK1
K
1KC
C4
L
1
1
VC
W
M4
4
V
VWW1WKKW
1
TTX
CCCVW
V
VV
TT
X
X
XX
M
M
WW
W
WCC
4CLL
4
4
4
X
X
C
K
1
X
X
K
K
X
4
MMV
V
4
TTT
V
4
4
XKK
1
L1
X
CCC
4LMLM
M
MMC
WVWC1CCMK1
4
4
V
WXW1KKWK
X
XXTT
4VVVV
444TTTTT
4
4
1
4
VVVV
1V414XXX
T1T1X4
T
WWW
V
VV
KKKK1
1
1
TTT
L
X
MM
TTTTT
4
CV
LMMML
V
MMM4
4V
4
V
V
TTT
C
1
X
WXKK
K1
CCVCVWCW1KLLLLMMMK
X
4
444
4
VVV
1
X
L
4
4
4
VC
L
44
4
L
L4
VCC
LL
VVVVVVV
V
11X1X1XTXXXT1XTTTTTTTTT
1CC
V4
CC
LL
V
CC4MCCC4M
M4
W
W
CCC4MMMM
4
44
1WXWWLWWWKKWXWWKLWK1WKKKLLK1LLLWLLL
L
CCC
1
X
T
4
4
WWWMWKXKKKM11KM11M
X
LLK
MK1XMM
CLLCM
CM
4
T
C
CMCCM
MC
W1
K1VV
VVWKW1
TX
4
4
V
4
TT
1
X4TX
4T
VVVVV
X
111X
TXTTTT
LLL
L
LLLLCCCMMCMCCCMMLML
X
1CLLWWWW
4
W
VVVVKWKW1VKKX1X1XX1X1TTT
K1CLKKCCMKC1CLML1LML
X
4
WWWKKWWW1KWKKK
V
T
MMM1
44
41
4444
4
44
VVV
11XXXXTTTT
V
WKCKCX1M
WWCXWCLKWKLCKKKWLMLMMMKML
WCL1
X
1
4
LLLW
M
1
X
V
XTT
X
1
11
L1
W
CXMMCKCLMLWWKMK11X
KK
1
TT
1
4
MCMVK
1XXT
VCML
4
CMLVWXKWX
1
444
4
4
CLCML
T1
VLCCMM
VVKWXW1X
1TX
KWW111
RMSE: 898.6
V
4
T
M
T
C
M
KLL1
X4
VXWWLK
X
LMM
4
MM
4
MMMXL
11
XW
K
L
1LX
X
TTTT
C
4
VC
1
W
K
1K1
C
M
L
KV
1
M
M
V
T
X
W
L
4
4
1
1X
4
LL
K
K
4
V
T4T
M
4
LC
M
W
1M
V
W
T
4
V
444
T4
T
X
W
V
XX
XT
KK
M
L
V4
V
CC
1
W
1XLK1M
1
XL
41
LL
CCL
4
L
L
VV
1
XX
XTTTT
CCCL
CM
M
VWWC
MM4MM
W1XKKKKL
WL
T
WW
XM
X
K
C
M
VV
T
WW
X
K
K
11
TXT
4
V
V
1X
CM
L
XLC
4
1
WVKVKV
XXT
T
KLMKM
X
1V
T
1
4
4
VV
X11
TT
V
W
1M
WWWCCMCCMKK
M1LC
T
V
1
T
X
W
C
X
1V
T
1
T
1
MV
TX
X
CLLM
VWKWTT
K
RMSE: 1011
4
W
C
M
V
4
TT
V
V
4TTT
T
X
4
C
WKW
K
L
X
X
C
L
1
C
4M4M
4
4
1XVWVVVKVKK
X
C
M
W
W1K
CMMLML
4
K
L
X
KV
X
X
TT
W
MW
T
TT
XM
W
WC
L
MW
1
M
V
T
KLL
4
V
CK
KM4
VV
K
1
T
1
4
1
V
4XT
TTT
X
LX
L
X1
KK
M
T4
C
M
4
X
W
X
WWWKWKXCKM1M
V
X
L4
L
4
C
V
V
XX11
T
444
4
C
L
CMCM
M4
C
CC
4
WWKKLXKLL
L
LL
T
KLLKLM
4
CM
MM
XKKV
V
W
X
4
V
4
1
V
X
TXT
L
M
L
CWVVW
WXW
KKW
TT
CXX
1M
W
T
4
1
4
V
X
TT
LL
X
11CXLLLK
M
VKV
44L
WCCVW
VKV
CWWVK
1TT
1K
44LCM
V
TX
CLMVK
T1
K
RMSE: 1034
0 2000 4000 6000 8000 0 2000 4000 6000 8000 0 2000 4000 6000 8000
Partition Pred Total wo Fed Holiday
Where(Holiday Symbol = C, V, T...)
Graph Builder
ACTUAL VS. PREDICTED TOTAL TRAFFIC –NEURAL - FOR 10 FEDERAL HOLIDAYS BY
HOLDBACK SUBSETTotal vs. Neural Pred Total w Fed Holidays
Holdback
0 1 2
Tota
l
0
2000
4000
6000
8000
10000
VW
VCC
WWV
WVVVKVCK
WVWV
CCWKL
WCCCC
WV
W
K
MVMK
WK
CVWWWW
1
WC
W
KLM
CCVC
KKV
WVM
WK11C
1
KKK
K
K
LM
K
ML
V
WMLMCW
W
M
WLWCV
L
WKCK
C
1
1W
VKWVLMLL
1
1
C
WK
1CVCKL
VMWV4W4
L
1
1
C4
V
1
VC
4
1
VKLCVMCV
4MM
1
CC
4
4
MKVMW
W
4
1
CCM4
X
1
4LKVCVW4
X
XVK4V
K
1
L
4
1
1VCC
MK
VX
VWC
M
1V
V
XWVV
L
1WV
X
WMM
4L
V1VWVVW
X
LM
4
4
K
CCCVCVKVCLT1
WV
CC
141XCT
4
K
4XVV
44KX
C
4
LWMCVLM
K
XCWCVL
1
VK
4
C
X
K
C
C4W1L
L4
X
T
M
VL
CWKW
K
VCK
4
WCXCLC4
L
WKLCV
T4XC
W
4MXVLW
WLCVV
4VTW
1
TLW4
X
L4
C
W
T
1
1CLT
1
W
XKW4VVMVVLCKK4KCCCLLCVCCW1LL
L
1LMCMW
V
WMMX4KL
K
1
1
1
KWW
X
VKC
XL
W
T
K
C
MXLMXC
KK
X
M
4
K1MVMLLTLXT1
KV
MKKLTT
1
MXC
W
WMW
MW
L
1
LTLWV4TVTV
4
L
WMM
4X4TWTC
X
XMKTL
T
CCWLC4LM
LLKM
C1M
4
44
LMMCK
C
V1M
X
LXC
4
K4MKL
TK14CMMT4
T
KW
1
W1KKKWL
M
4
4T
41
1
KW
4
1WWK
14
LMV
41
M
1X
4
K4TM
1
M
4
1
4
XM
4
X
4
4KM1
11
41
VM
1
K
4
4
1
4KLWMW4
W
1VK11L
WLVL
1
L
1LMX
X
CC1CMMC
11
CXWT
MXWXC
1
W
1
V1KKTKKVLKL4KKTL
X
LLKX4M1T
4
M
X
T
X
L1
XTM4XTLML
1
T
4MM
4
X
4
X
XX
X
4XCXL1X41LV
W
4
WT1T4TXX
X11CXT
X
41TTXCKX
X
LXTK
4TMM
1LMT
TTW
4
MT
X
WLKTTK1T
4
1T
X
TTTTXKT4VXV1MTTMTXTXTCTX
1TM
XTCLL4V
CT
14TTTKMWL
M
1
X
X
X
KWTTMTTT
TTT4
MT
41XXT
TTTX
X
XT
X1
CCL
VVLCL4CV4MK
WW44
4MKMW1W
111X
X
X
1XTT1
RMSE: 392.5
V
W
KW
C
KW
W
M
C
KVLCM
VW
1
K
L
1KL
M
VV
KC
K
M
L
V
M
4
1WM1
L
KMV
LL
1
MMM
KK1LVW
1
L1
CV1
X
XC
ML
1
M
C
VL1WL
MM
4
W4XVMX
4WVX
L
4C
CX
T
CV4CXCL
M
X
4
X
VKW
4
CKMLK
4KWLKWL
W4
W
MCW
W
VLL
1T
CCK
4TMVV
L
V
T
1K
K
V
1MLLMWL
4MLT1
T
CV
M
1
K14
1
MM
4
4CK
KT4
VMXK
1
1
4L1
4
1
T
4
X
WXWVX
M
L
WV
X
M
4
CVVWTCCX
X
TK
X
CK
X
1M4X
X
1XM
1C
X
XMMT
XX
W
L1X
VV1TCT
1
44
TTT
1
XT
X
TTX
TT1
T
1T4
1
XTTT
X
TTT
VTVLCLMKKMW
WX
XTTT
RMSE: 501.8
CW
WW
VV
V
KWV
W
W
CCV
V
C
V
M
C
WKK
KW
KKVM
WM
1
CVKC
VM
KVL
W
4
MWV
L
M
M
MM
KKKL
1
C
4
KLK
4
VM
4
4
L
X
1WLWKVMLXL
C
M
M
4
KW
K
W
1
W4VXM
L
K
4
V
W
X
CLL
VCCTC
1
WLXCKCM
L
X
M
T
LXMV
1
1
LK4
TCCC
4
X
LL
M
4
4
X
K
M
14
L
WW
X
LKKV
KLV
L
4
M
L
X
VK
1
L
W
TVLW
TW
MCM
4
T
4
WVTCMKW4
T
T
4
K
T
4
1
4
X
V411
L
4
MVMX
1CL
X
4
XXT
4
K
XL
X
XT
1
4
X
MTV
XX
X
1
L
V4C
1
LWCV
X
TT
X
K
4
K
X
VWT
X
X
X
C4TVXTLTTTKW
W
TTTK
4
TTT
4
11
TTV
CLCL4V
K
M4KMK
X1TT
RMSE: 557.8
0 2000 4000 6000 8000 0 2000 4000 6000 8000 0 2000 4000 6000 8000
Neural Pred Total w Fed Holidays
Where(Holiday Symbol = C, V, T...)
Graph Builder
Total vs. Neural Pred Total wo Fed Holidays
Holdback
0 1 2
Tota
l
0
2000
4000
6000
8000
10000
TT
V
T
4
T
VV
TT
4
4
W
VC
4
X
VVV
M
4
C
1
TT
4
T
VVW
1
T
M
W
L
MM
X
T
K
1
VC
4
VK
T
4
1M4
C
C
T
L
C
4
T
C
4
1
LL
C
X
MV
4
1
1
K
X
MCW
T
M
C
V
X
1
M
K
1
T
4
L
4
W
XX
1X
W
T
KLXV
X
V
4
T
W
W
1WW
C
X
C
V
L
WW
LWW
L
C
W
L
1
X
C
W
K
VK
K
4
K
X
1
K
1L
CCKK
T
1
1CC
W
L
1
CVK
M
4
W
T
W
4
W
4
K
4
K
1
4
L
V
W
KC
X
W
W
X
VVX
K
KV
M
T
C
X
V
M
T
X
W
T
1
V1
L
X
T
4
M
T
X
VK1
XT
L
T
C
W
T
1
4
K
4
4
K
CK
1X
T
4
T
X
X
4
T
V
4
C
V
WM
4
X
VV
V
4
K
VC
X
W
M
V
W
T
M
C
W
1
V
T
W
T
T4
C
T
4
V
T
C
V
X
1ML
4
X
X
XX
VL
W
X
CW
L
4
C
T
M
1
M
C
C
4
W
T
X
T
VXC
1
4T
KV
M
V
1
L
1
V4
M
LV
T
T
M
1K
M
LKL
M
C
4
T
V
X
V
1
1
M
V
KV
T
WV
4X
T
V
T
4
M
M
CW
M
T
CL
T
T
MX
M
4V
4
VW
1
K1
44
L
4
4
T
C
4
4
C
K
WC
1
W
K
1X
K
T
4
XT
C
1
4K44L
X
VVW
4
1
V
MML
LC
4
X
C1
K
T
4C
M
VK
X
M
M
C
M
KV
V
K1C
X
MM
CV
M1
X
1
W
1
4
W
V
V
M
T
W
4
4
4
M
4
V
L
4L
T
4
V
T
WML
LW
X
CL
M
T
M
MCWW
X
W
T
L4
WCWW
MX
V
T
L
X
14
T
LK
C
LCCCCCKC
T
M
W
1
T
LC
X
C
K
1
W
LK
1
W
W
K1V
L
KW
V1
44
V
T
C
L
CLK
1
CL
CCKV
T
KV
L
L
L
L
M
1KKM
V
V
4
X
C
1
KC
L
L
4
X
W
MMK
W
K
1
L
X
L
CW
L
1
LC
XTT
4
T
L
T
C
L
1
MK
L
4
X
L
VW
KLC
4
K1C
1
V
T
M
4
1
C
L
X
L
W
MM
X
C
T
M
CC
1
XV
M
C
W
X
C
M
L
WMV
X
WC
W1
WWVWL
X1
X
L1
WKM
W1
L
LM
L
CMC
X
1K
W
1
LKK
4
K1VK
KK1
C
MKWMWK1
W
KM
1
K
WKKXV
TT
4
WKK
4
T
K
144
X4
4
L
X
V
X
1
TT
V1
1
V4
X
VC
W
1
C
4
C
X
ML4
W
4
C4MMM
W
1LML
X
W
MXWC
T
K1
WLW
T
KKKCKKL1KLK
VL
X
L
L
TTX
L
1
1
LLC
MCMMMX
C
X
MM1
1W
W
T
W1LM
4X
KW
4
K
1
K
V
X1VK1
KM1
M
X
LCLCM
1XLM
VKCWK
XW
4
1
44
44
T
LCCML
L
1
V
1
C
1
VCK
T
MWMWVW
X
X
X1
1XWWK
1
RMSE: 987.2
T
V
4
T
M
4X
TT
LK
C
L
4
T
ML
X
V
1
K
W
MM
X
L
4
M
M
X
T
W
X
C
M
V
ML
C
W
V
W
1
XM
4
X
L1
LV1
V
K
K
M
T
M
T
W
C
1
1
L
1
L
T
1L
4
4
1
T
K
4
K
M
T
KV
4
XML
T
4
1
X
X
W
T
T
MV
4
4
K
4
M
V
X
1W
L
C
X
V
LK
4
T
K
T
4
X
1
CVL4
MW
4
X
C
X
M
T
W
VV
C
L1
C
1
X
44
LL
X
K1
M
CM
L
WC
X
T
M
M1
T
1
C
W
L
V
4
X
KXCC
M
1
K
XWM
T
M
WK
W
X
L
L
KLCKV
T
L
WVVV
M4
V
X
WW
XT
1L
K
1
K
X
M
1
C
V
ML
T
X
K
4
LK
X
MCVX
W
KC
1
1
TT
1
VV
T
K
1
VM
4
V
X
4
1
M
WWM
W
CCWCC1
M
T
K1K
T
VCV
T
C
1
L
X
W
X
1
1
T
1
T
VMLKXL
XT
VC
TMKW
W
RMSE: 1082
T4
T
T
T
T
C
M
W
T
44
V
V
V
X
W
X
4X
VV
KC
CVWCV
4M
X
4L
T
4
W
V
M
4T
X
K
W
C
T
X
1
K
MM
V
1
K
X
M
T
ML
KW
4
T
M
WK
T
L
X
W
L
VC
X
WKM
L
KCV
T
1
L
M
V
LK
WV
4
T
4
1
C
T
M
1
T
M
1
C
W
T
VK
KVWL
X4W
M
1
V
L
X
T
X
1
W
KK
X
K
4
L
X
K1M
XK
WC
X4
W
T
M
4
X
C
4K
4
MCC
4
X
X1
W
1
M
L
4
L4
M
1
W
K
L
4
C
T
V
4
CC
T
VCM
L
K
4
C
L
T
XM
L
KLLKL
LL4LVV
X
MM
X
KL
4
KV
4
X
MWV
T
WK
T
C
M
V
L
XW
1XX
W
C
1
WW
M
T
X
T
WK
X
K
T
L1L
4
VVC
4
X
X
M
L
1
LKL
1
V
4
VVC4V
WCK
T
K
V
W
1
T
CLW
WK
1K
44L
T
V
MC
X
LKCM
T
V
1
K
RMSE: 1090
0 2000 4000 6000 8000 0 2000 4000 6000 8000 0 2000 4000 6000 8000
Neural Pred Total wo Fed Holidays
Where(Holiday Symbol = C, V, T...)
Graph Builder
ACTUAL VS. PREDICTED TOTAL TRAFFIC -2ND ORDER - FOR 10 FEDERAL HOLIDAYS
BY HOLDBACK SUBSETTotal vs. 2nd Order Pred Total w Fed Holiday (except M X HS & WD X HS) Holdback = 0
Holdback
0 1 2
Tota
l
0
2000
4000
6000
8000
10000
WV
WVVWVVCVX
VVWWWWC1C
CKVVKVK
W
C
X
KCC
V
LWC
1
VWCMKCMCCC1WV
KW
V
X
WW
K
W
X
1L
XWKCCKWKKM
1WKV
L
KMLLMK
W
CCL1MKWVCM
W1LKWCMLMWLCL
W
1
1W
V
K1XVLVK
14
1W
L4WK
4
MKC1V
VWCKKM
C4CVM1CMV4
4
MVKC4VWV
11XC4C4V44
4LXMM
W
WCM1CV
X
W1C4V
4
4L4VKL1M
V
VK11
VVL
LVK
K
1VV4
X
M
MM
1MWWVLV
X
W
LCVCVKKWLX4
V14VC4K
WVC
4X
1CKW
1WV4XV1CCLX
XL
V
X
C1LVVCWLTVCCLCV
1
VWTLTCLVWW
WCL
1
WCCM
K41X44VC1CM
W
4L4MVKKL4M
L
L
CCWC
X
KC1CC1WXTC4LWTKLCLXTVVCCVK
WCX4LCLLK4VMTLXLTWTVCXKLCKKWCWVCCVM
C
VLV
WM1VM4V
W
TCM
K
MW4M
XC
4V
4KMTL1
WK
TLMW
W
LMWCKWMX4M1XWTWLM
C
KX4KTWT4
W
TK
144
4
1T4XL1V
4
VXLM11X
M
X
V1KKMCKLK4TWM1CLL
X
TKLCMML4T4
L
KWWK4XM
CKC1MM1LMMXTMKKTLKVVKW4114TKKLL
WKXKL
M
TMWW4MCW
4
L1KCCX
X
X
41
CTX41
MM4
L1
X
W1MK
WLMLW1VM111M4V4VLMK1K
WXK
41MC1L4
1VL4C
WKW4WTM4LL4K4KX114CCK
1MM11LV44LTK
TXTCWMXKM
CX
W
X
KLKX4K1VXLLXL
4TMLW441T4LM
1XT1CMT4XL4
TTKT41XC14KXTVXMTT11TXXXKTX4T4W
X1MLX4LCX1TTXT111MX
XMTKXTTTT1WM1
X
T1MTMTT1L
VCMTXTV44TMTK1TLMXTK
TCT
4
TT1LX4TVTM
CTT
4TL41XTTTMK
4
WWTXTT
X
TXXT
4
1
X
TTX
T
X
XXT
41
V
4L1
VKVLWCC
4CWK
1MM1
WWLM4W
T
4
X
T11XX1
RMSE: 211.2
W
VW
X
C
W
1
VW
K
MLKK
LLK
CM
X
V
K
C
1
WX
K
V
LK
V
M
MCVML
11
L
LM
4
KMMM
WM
K
1
VK1M1LCVV
1
WLL
M4
L
V
VX
M1L
1
X
XWC
W
X
XCXM1
L
LXM
LWC
WC
VXW
4
CVCTT
1LC1
WMM4W
WL44KKVKXKCL
CV
1
WKL4C
V
C1MML
WKVL1
MVMW
4
4VM1L
L
K
LTTTM
M
KCMTKM14W4L
4
TKCT
X
1M
14
4VKV
W441KKX
W41VVT
11X
X
LMK
X
4
4
W4
MCCV1TXTC
C
4
VWXCM1MX
TVLM4
X
1XT
CXVX1
X4TT1TXTT
X
XTXTXT
X
T
T
11
T1T4
TTT
X
T
VLC
T
KKLV
MMW
WTT
TX
RMSE: 345.9W
W
W
CW
VKVCWKV
VW
X
V
1
CC
MKWVVK
KLCV
VMWK
WM
4
1
C
X
WKCKV
WMMMK
X
MV
K
VL
M
L
1
1
X
MV
4
4
M
L
C1VMKX
XLK
4
L
4WL
X
1WLL
KM
MV4
X
K
4KLXX
WMWVC
WL
X
CWLL
X
CCX
4TCK
WMXWMC
T
K
C
L
X
L4
LL
X
CVK
X
KLVTWC
MC
TW
4
L
4
WL
11M
TK1K
X
VKV4K
4
XVLTTM
LVWL
4VMLM
K
WK
1C
XTMMC
4
4
1T
K
TW1XLL4VVM1M
K
4CTT
M
1K
44
LX1
W
4
4
X
K
1LX
XVTT4V
1
L
4
4V4VT
W4CC4
X
CT4XLTT
TK
X
WXV
T
1
TT
W
TT4
TK
X
TT1T
C
TL
VLCMVKK
M
K
TT
44
X
RMSE: 339.2
0 2000 4000 6000 8000 0 2000 4000 6000 8000 0 2000 4000 6000 8000
2nd Order Pred Total w Fed Holiday (except M X HS & WD X HS) Holdback = 0
Where(Holiday Symbol = C, V, T...)
Graph Builder
Total vs. 2nd Order Pred Total wo Fed Holiday Holdback = 0
Holdback
0 1 2
Tota
l
0
2000
4000
6000
8000
10000
T
V
T
VW
44
CC
4
4
V
4
WV
T
M
X
C
4
M
VVV
L
TT
X
WV
C
1
1
V
4
1M
C
M
T
L14
K
CCC
M
4
KV
4
4
L
W
T
4L
C
1
K
V
1
T
M
W
M
T
WC
4
V
X
VW
1
T
X4
L
LC
X
W
T
L
X
C
MM
K
1
V1W
1
K
L
W
T
C
C
1
W
L
W
T
K1WWK
1
X
X
WW
W
L
K
LC
W
K
T
L
4
V
W
K
X
C
V
X
1
X
KCC
TTX4
K4
M
K
X
V
C
1
X
K
1
T
CWK
K
W
X
4
1
4
VK
1
W
1
4
L
T
WWW
4
K
K
X
1
X
X
C
L
CXV
L
VVV
T
X
M
C
K
M
K
4
X
4
W1
VK
1
T
W
1
CWV
4
MC
4
V
C
M
4
W
414
VC
X
C
X
C
X
T
MW
V
K
4T
4
L
X
LM
4
V
1
C1
X
4
T
C
M
X
V
4X
M
M
VW
T
VW
C
V
W
T
L
1K
X
T
VX
V
X
X
CV
4
L
L
M
T
MVKV
4T
1
4
K
14
M
T
X
V
4V
4V
T
4
T
M1
L
4
M
X
L
W4MCK
4
CV
1
M
X
VM
1
1
T
V
4
W
X
T
C
M
WCW
L4
T
44
W
4
1
1
W
L
V
M
C
4
K
V
T
C
4
M
V
T
W
4
CW
14
TT
1
K
T
C
T
XV
T
K1
V
1K
VW
K
4M
C
M4K1
V
X
V
C
W
L
C
M4
K
4
T
T
C
1K
4
V
4
T
4
V
V
X
K
M
V4
M
VC4
X
X1
X
W
M
X
L
X
T
M
C1
4
4
L
C
V
T
M
MW
V
M
1
L
X
M
T
ML
T
LWVWWKK
1
CW
L
T
M
CL
X
M
C
L
C
W
VCL
T
L
T
W
M
T
4
L
VC
VWCLCWXLCL
V
4
L
T
L
WCL
W
MLM
C
W
TX
M
L
T
C
1K
1
K
1X
L
L4
M
L
1KMK
L
CCL
KK
1C
K
X
1K
C
K
L
CCL
WCML
VLV
T
KK
C
4
CX
4
C
1
L
K1L
X41
W1
X
WWMWVV
4
WMW
WC
1
X
K
T
X
V
L
T
4
X
1
M
LWC
L
M
1
LW
T
X
C
T
M
T
C
WM
KW
V
C
W
L
1
T
LM
4
1
V
T
L
WL
X
C
1X
1K
1MC
KLC
LL
T
K1C
MWK
X
CVKM
T
T
K1KM
V
1T
K1K1K
4
W
TX
MW
X
M
1
K
T
K
KWX
WV
T
K
K4K
VM
V
K
4
4KV
1
W
1LC4
14
WW
X
L4
WCC
1W
X
4
44
T
CW
X
X
11M
4
LMLV
KKWM
4
C
X
KMXK1MLLKK
X
V
T
K
M
1
T
K
CV
LWLLC
X
L
1
X
T
L
L
T
L
T
1
T
MC
W
C
1WV1
XMMXMWM
1K
XT
MK
V
4
MK
1
1K
4
MLK1
1
TX
L
M
1
C
T
LC
X
KMLW
T
MC
XWX
VKV
1
44
4
4L4
VL
C
1
WCMWKCCX
WMWM
1
11
X
X
W1TX
KV
1
LV
T
RMSE: 875.3
4
T
V
M
4
L
X
C
T
LM
K
4
LK
L
W
LM
1
W
X
X
M
TT
MM
4
X
WCV
T
WX
X
C
M
W
M
VK
1
M
V
1
1
V
L
L
4
X
T
1
M
1
CK
1
M
K
L
LL
1
K
X
K
1
1
X
T
4
V
44
K
M
L
4
V
T
V
T
W
4
M
M
T
W
T
4
C1
X
M4
VKL
4
X
V
T
K
V
L
L
X
W
44
M
C
W
1
4
CM
TTX
4L
C
X4X
CL
L
T
VC1X
11
VMV
M
4
CK1
M
W
L
X
L
CC
1
KWW
1
4
C
T
KK
M
K
X
X
WLC
X
X
W
L
T
M
M
W
M
L
W
M
L
K
TX
4X
T
V
T
LW
C
1
X
V
V
KM
W
K
1
T
L
V
1
KKV
4
LM
X
C1VM
X
KVX
1
C
T
WW
X
1
V
K
W
W
4
V
11
T
MMCM
KCV
X
CK1V
4
1
T
C
C
T
M
T1
VLCV
W
X
T
1
1
X
TT
1
TLC
VL
X
V
T
WXM
WKKM
TT
RMSE: 966.7
4
C
W
4
M
TT
X4
W
L
T
CK
T
X
V
V
4
CV
T
CV
X
V
4
W
W
V
T
4
W
M
X
V
4
M
V
V
K
4
C
K
1
K
W
1
W
X
T
X
M
K
MM
4
1
LMM
T
K
L
W
L
X
X
X
CVV
T
LL
WCKM
L
V
T
V
T
K
M4
W
K
W
T
C
1
T
V
KM
1
4
K
1
LMWL
W
1
K
4X
C
1
M
X
T
KK
4
XKW
T
X
WL
K1
T
T
4
X
1
X
4
M
M
VWW
V
X
C
T
MKL
C
4
X
C
WC
M
4
4
4
VX
1
M
1X
CC
L
MC
L4
KK
T4
L
C
44
L
CLMLLXL4
V
VK
L
LK
L
VL
MVWW
4
K
M
4
V
KXLW
T
T
1
TX
MCW
M
KW
X
XV
X
WXX
1
C
M
K
W1L
K
L
TX
T
V
4
C
X
4
X
LK
T
1
M
L1
L
T
V
VV
T
W4
W
4
KKCC
LVC
W
VV
1
WK
TT
1K
44LC
X
LVMM
VKCK
1TT
RMSE: 973.0
0 2000 4000 6000 8000 0 2000 4000 6000 8000 0 2000 4000 6000 8000
2nd Order Pred Total wo Fed Holiday Holdback = 0
Where(Holiday Symbol = C, V, T...)
Graph Builder
PREDICTION PROFILER AND ACTUAL VALUES AT 5 ARTCCSFOR THANKSGIVING DAY TEST DATA FOR FISCAL YEAR 2010
ARTCC Total Traffic
ZAB 2126ZAN 683ZAU 3699ZBW 2879ZDC 4265
K-FOLD CROSSVALIDATION
• Smaller datasets may not have enough rows to split into train/validate/test, so instead we can use k-fold cross-validation: Randomly divide data into k separate groups (“folds”) (k=5 to k=10
is recommended)
Hold out one of the folds from model building and fit a model to the rest of the data.
Use the held out fold as a validation set
1Validate
2Train
3Train
4Train
5Train
5-fold CV
K-FOLD CROSSVALIDATION
• Smaller datasets may not have enough rows to split into train/validate/test, so instead we can use k-fold cross-validation: Randomly divide data into k separate groups (“folds”) (k=5 to k=10
is recommended)
Hold out one of the folds from model building and fit a model to the rest of the data.
Use the held out fold as a validation set Repeat across all folds The model giving the best validation R-Square is chosen as the
final model.
1Train
2Validate
3Train
4Train
5Train
5-fold CV
Training & K-Fold R-Square vs. Model Term History for Stepwise Regression
RSquare History
-0.2
0.0
0.2
0.4
0.6
0.8
1.0
RSqu
are
0 2 4 6 8 10 12 14 16 18 20 22
P
Training
K-Fold
K-FOLD CROSSVALIDATION
RSquare History
-0.2
0.0
0.2
0.4
0.6
0.8
1.0
RSqu
are
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
P
Training
ExcludedTraining & Excluded R-Square vs. Model Term History for Stepwise Regression
DATA CAN BE EXCLUDED AND USED FOR VALIDATION – E.G. CHECKPOINTS
AICC AND BIC CRITERION
AICc and BIC deal with the trade-off between the goodness of fit of the model and the complexity of the model.
AICc is AIC with a correction for finite sample sizes:where n denotes the sample size. Thus, AICc is AIC with a greater penalty for extra parameters.
For any statistical model, the Akaike Information Criterion (AIC) value is where k is the number of parameters in the model, and L is the maximized value of the likelihood function for the model.
For large n, the Bayesian Information Criterion can be approximated by:
The BIC generally penalizes free parameters more strongly than does the AIC, though it depends on the size of n and relative magnitude of n and k.
AICC AND BIC CRITERION VS. MODEL
TERM HISTORY FOR STEPWISE REGRESSION
Criterion History
-150
-100
-50
0
50
100
150
200
Crite
rion
0 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
P
AICc
BIC
Both AICc and BIC are measures of model fit that are helpful when comparing models. Smaller values indicate a better fit.
Use AICc & BIC stopping criteria and pick “simpler model” – Occam’s razor
“This approach (complexity penalty) does allow you to use all the data for training a model and can be used very effectively to protect against overfit. But there are several problems with defining complexity as the number of parameters…” from Miner, et. al.
AICC AND BIC CRITERION
CAN COMPARE STATISTICS METRICS
HERE MOST ARE NEARLY IDENTICAL
GO TO VISUAL COMPARISONS
FINAL CASE FOR WHY TO PLOT THE DATA
MODELS AND STATISTICS ARE
VIRTUALLY THE SAME
ANSCOMBEQUARTET
TAKEAWAYS
• Prevent overfitting models by subsetting data into training, validation and test subsets and using validation error to stop adding complexity to model
• If data set is too small consider alternatives like K-fold cross-validation, or Leave-one-out, or using criteria that penalize model complexity (AICc, BIC)
• Always plot the data and models – do the pictures make sense?
JMP FED GOV TEAM … AND HOW TO CONTACT US
• Robyn Godfrey, Program Manager: 910-233-2737 and [email protected]
• Anna-Christina De La Iglesia, Account Representative: 919-531-2593 and [email protected]
• Tom Donnelly, PhD, CAP, Sr. Systems Engineer: 302-489-9291 and [email protected]
• Technical Support 919-677-8008 and [email protected]
Copyr ight © 2015, SAS Ins t i tu te Inc . A l l r igh ts reserved.
Questions or Comments?
Copyright © 2016 SAS Institute Inc. All Rights Reserved.