an introduction to the data science universe · an introduction to the data science universe mark...

6
07/04/2016 1 An Introduction to the Data Science Universe Mark Lee Insight Risk Consulting [email protected] 07 April 2016 The Data Science Universe 07 April 2016 2 Data Mining

Upload: others

Post on 14-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Introduction to the Data Science Universe · An Introduction to the Data Science Universe Mark Lee – Insight Risk Consulting mark.lee@insightriskconsulting.co.uk 07 April 2016

07/04/2016

1

An Introduction to the Data Science

Universe Mark Lee – Insight Risk Consulting [email protected]

07 April 2016

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

The Data Science Universe

07 April 2016 2

Data Mining

Page 2: An Introduction to the Data Science Universe · An Introduction to the Data Science Universe Mark Lee – Insight Risk Consulting mark.lee@insightriskconsulting.co.uk 07 April 2016

07/04/2016

2

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

The Data Science Universe

07 April 2016 3

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

Machine Learning

07 April 2016 4

Supervised Learning Data with a dependent variable/classification

Prediction

New

Data

Training

Data

Learning

Algorithm Model

Unsupervised Learning Data exploration

Data

Algorithm

New

Representation

A B

C

Model

Cat

Dog

??

Cat

Cat Dog White

Pet Black

Pet

Brown

Pet

Page 3: An Introduction to the Data Science Universe · An Introduction to the Data Science Universe Mark Lee – Insight Risk Consulting mark.lee@insightriskconsulting.co.uk 07 April 2016

07/04/2016

3

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72 Forests

Comparing Algorithms

07 April 2016 5

Linear

models

Neural

Networks

Single

Trees

Support

Vector

Machines

K-N

ea

rest

Neig

hbo

urs

High Low Interpretability

Inflexib

le

Fle

xib

le

Transparent Black box

GLMs

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

Bias-Variance Trade-off

07 April 2016 6

Trade-off

Model Complexity

Err

or

Random noise

underlying data.

Whether assumptions

of model fit the

underlying process.

Variability if the model

is trained on different

samples of data.

Combat by adding

model complexity. Combat by reducing

model complexity.

(Process Error) (Model Error) (Parameter Error)

Irreducible

Error = + + Error Bias2 Variance

Page 4: An Introduction to the Data Science Universe · An Introduction to the Data Science Universe Mark Lee – Insight Risk Consulting mark.lee@insightriskconsulting.co.uk 07 April 2016

07/04/2016

4

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

Machine Learning

07 April 2016 7

Supervised Learning Data with a dependent variable/classification

Prediction

New

Data

Training

Data

Learning

Algorithm

Unsupervised Learning Data exploration

Data

Algorithm

New

Representation

A B

C

Model

Hyper-

parameters

Hyper-

parameters

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

So which algorithm is best?

• All algorithms are biased in some manner.

• The “No Free Lunch Theorem”: no algorithm is best over all

problems.

• Methods may work well for a certain class of problems.

• Best result may be an aggregate of methods.

• In other words, “it depends”. Human judgement is required.

07 April 2016 8

Page 5: An Introduction to the Data Science Universe · An Introduction to the Data Science Universe Mark Lee – Insight Risk Consulting mark.lee@insightriskconsulting.co.uk 07 April 2016

07/04/2016

5

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

Issues for insurers

• New data sources – implications for better models:

– Telematics for motor insurance.

– Fraud detection via social media – several examples to date, but can it be

automated?

– Some health insurers are pricing based on health related data.

• More powerful algorithms:

– Can GLM’s keep up? Can new algorithms give a competitive advantage?

• How could London Market insurers gain from these techniques?

07 April 2016 9

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

Issues for actuaries

• Will actuaries be replaced by machines/computer scientists?

• Human judgement is required - technically and statistically trained

actuaries are well placed.

• Do actuaries need to train?

– E.g., CAS Institute accreditation in data science.

07 April 2016 10

Page 6: An Introduction to the Data Science Universe · An Introduction to the Data Science Universe Mark Lee – Insight Risk Consulting mark.lee@insightriskconsulting.co.uk 07 April 2016

07/04/2016

6

Colour palette for PowerPoint presentations

Dark blue

R17 G52 B88

Gold

R217 G171 B22

Mid blue

R64 G150 B184

Secondary colour palette

Primary colour palette

Light grey

R63 G69 B72

Pea green

R121 G163 B42

Forest green

R0 G132 B82

Bottle green

R17 G179 B162

Cyan

R0 G156 B200

Light blue

R124 G179 B225

Violet

R128 G118 B207

Purple

R143 G70 B147

Fuscia

R233 G69 B140

Red

R200 G30 B69

Orange

R238 G116 29

Dark grey

R63 G69 B72

References

• www.insightriskconsulting.co.uk

– We mainly use R and Python platforms.

• No free lunch theorem:

– D. Wolpert, Neural Computation 8, p1341 (1996).

• Machine learning algorithms, book and video lectures:

– Introduction to Statistical Learning, by James, Witten, Hastie and Tibshirani.

www.statlearning.com

• LinkedIn group – search for “Data science for insurers group” on LinkedIn.com.

07 April 2016 11