an introduction to the data science universe · an introduction to the data science universe mark...
TRANSCRIPT
07/04/2016
1
An Introduction to the Data Science
Universe Mark Lee – Insight Risk Consulting [email protected]
07 April 2016
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
The Data Science Universe
07 April 2016 2
Data Mining
07/04/2016
2
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
The Data Science Universe
07 April 2016 3
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
Machine Learning
07 April 2016 4
Supervised Learning Data with a dependent variable/classification
Prediction
New
Data
Training
Data
Learning
Algorithm Model
Unsupervised Learning Data exploration
Data
Algorithm
New
Representation
A B
C
Model
Cat
Dog
??
Cat
Cat Dog White
Pet Black
Pet
Brown
Pet
07/04/2016
3
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72 Forests
Comparing Algorithms
07 April 2016 5
Linear
models
Neural
Networks
Single
Trees
Support
Vector
Machines
K-N
ea
rest
Neig
hbo
urs
High Low Interpretability
Inflexib
le
Fle
xib
le
Transparent Black box
GLMs
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
Bias-Variance Trade-off
07 April 2016 6
Trade-off
Model Complexity
Err
or
Random noise
underlying data.
Whether assumptions
of model fit the
underlying process.
Variability if the model
is trained on different
samples of data.
Combat by adding
model complexity. Combat by reducing
model complexity.
(Process Error) (Model Error) (Parameter Error)
Irreducible
Error = + + Error Bias2 Variance
07/04/2016
4
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
Machine Learning
07 April 2016 7
Supervised Learning Data with a dependent variable/classification
Prediction
New
Data
Training
Data
Learning
Algorithm
Unsupervised Learning Data exploration
Data
Algorithm
New
Representation
A B
C
Model
Hyper-
parameters
Hyper-
parameters
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
So which algorithm is best?
• All algorithms are biased in some manner.
• The “No Free Lunch Theorem”: no algorithm is best over all
problems.
• Methods may work well for a certain class of problems.
• Best result may be an aggregate of methods.
• In other words, “it depends”. Human judgement is required.
07 April 2016 8
07/04/2016
5
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
Issues for insurers
• New data sources – implications for better models:
– Telematics for motor insurance.
– Fraud detection via social media – several examples to date, but can it be
automated?
– Some health insurers are pricing based on health related data.
• More powerful algorithms:
– Can GLM’s keep up? Can new algorithms give a competitive advantage?
• How could London Market insurers gain from these techniques?
07 April 2016 9
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
Issues for actuaries
• Will actuaries be replaced by machines/computer scientists?
• Human judgement is required - technically and statistically trained
actuaries are well placed.
• Do actuaries need to train?
– E.g., CAS Institute accreditation in data science.
07 April 2016 10
07/04/2016
6
Colour palette for PowerPoint presentations
Dark blue
R17 G52 B88
Gold
R217 G171 B22
Mid blue
R64 G150 B184
Secondary colour palette
Primary colour palette
Light grey
R63 G69 B72
Pea green
R121 G163 B42
Forest green
R0 G132 B82
Bottle green
R17 G179 B162
Cyan
R0 G156 B200
Light blue
R124 G179 B225
Violet
R128 G118 B207
Purple
R143 G70 B147
Fuscia
R233 G69 B140
Red
R200 G30 B69
Orange
R238 G116 29
Dark grey
R63 G69 B72
References
• www.insightriskconsulting.co.uk
– We mainly use R and Python platforms.
• No free lunch theorem:
– D. Wolpert, Neural Computation 8, p1341 (1996).
• Machine learning algorithms, book and video lectures:
– Introduction to Statistical Learning, by James, Witten, Hastie and Tibshirani.
www.statlearning.com
• LinkedIn group – search for “Data science for insurers group” on LinkedIn.com.
07 April 2016 11