validation methods - pydata israel
TRANSCRIPT
Validation methods
Nathaniel Shimoni, PyData Israel, 16/2/2017
Validation techniques
- Basics: model selection, early stopping
- Train-test split, Kfold, leave one out (LOO), leave P out
- Group Kfold, leave one group out
- Time series: sliding window, anchored sliding window, time-based group Kfold
- Unbalanced data: stratified methods
Why? Grouped data
[Figure: data spanning Jan-Dec split into Fold 1 / Fold 2 / Fold 3 for training; hyper-parameter tuning over 𝝁, 𝜽, 𝜸, 𝜷, 𝜶, 𝜹]
We use validation to balance two things:
1. Fit our training data as well as we can (while…)
2. Generalize well to get the best performance on unseen data (i.e., refrain from over-fitting)

We use validation to:
- Select the best model
- Select the best hyper-parameters
- Stop the training process early (early stopping)
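The model-selection use case above can be sketched with scikit-learn: score each candidate model with the same cross-validation folds and keep the one with the better mean score. The toy data and the two candidate models here are illustrative assumptions, not from the talk.

```python
# Model selection via 5-fold cross-validation (a sketch; toy data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)  # one accuracy per fold
    print(type(model).__name__, round(scores.mean(), 3))
```

The same loop generalizes to hyper-parameter selection by iterating over parameter values instead of model classes.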
Train-test split

The most basic validation technique. It is based on a hold-out sample: we split the training data randomly and test our performance on the unseen part.

The train-test split validation method is very common. Its main benefits:
- Computational efficiency
- Simplicity

But it has two disadvantages:
- We lose a large amount of data for training
- It might suffer from skew / bias

Stratification: select folds in a way that keeps an equal proportion of the target variable in each fold.

Cross validation
[Figure: the data split into Fold 1 / Fold 2 / Fold 3]
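A hold-out split with stratification might look like this in scikit-learn: `stratify=y` keeps the class proportions equal in the train and test parts. The toy labels (90% / 10%) are an assumption for illustration.

```python
# Stratified hold-out split: class proportions are preserved (a sketch).
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)   # unbalanced toy labels: 90% vs 10%
X = np.arange(100).reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(y_tr.mean(), y_te.mean())      # both ~0.10 thanks to stratify=y
```

Without `stratify=y`, a random 20-sample test set could easily contain zero or four minority samples, skewing the estimate.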
What if the samples are drawn from different groups? Can we retrain for new groups?
- Yes: use Kfold / stratified Kfold
- No: use group-based folding methods
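A group-based folding method such as scikit-learn's `GroupKFold` keeps every sample of a group in the same fold, so each validation fold measures performance on groups the model never saw during training. The four toy groups below are an assumption for illustration.

```python
# GroupKFold: no group is split across train and test (a sketch).
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
groups = np.repeat(["a", "b", "c", "d"], 3)   # 4 toy groups, 3 samples each
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, groups=groups):
    # the train and test groups never overlap
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```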
Time series data? Are we predicting a specific time frame? Are we predicting future events?
- Use time-based folds / splits
- Use a sliding window
- Use an anchored sliding window
- A random split makes this more like an imputation problem
[Figure: Jan-Dec timeline split into an earlier train period and a later validation period]
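The anchored sliding window corresponds to scikit-learn's `TimeSeriesSplit`: the training window is anchored at the start of the series and grows, while each validation fold lies strictly later in time. Treating the 12 samples as one per month is an illustrative assumption.

```python
# TimeSeriesSplit: an anchored sliding window over time-ordered data.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)              # e.g. one sample per month
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(train_idx, test_idx)                # test indices always follow train
```

For a (non-anchored) sliding window of fixed size, `TimeSeriesSplit` accepts a `max_train_size` argument that caps the training window.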
Thank you!
LinkedIn / GitHub