TRANSCRIPT
2018 Predictive Analytics Symposium
Session 42: Visualization: A Picture Speaks a Thousand Words
Telling Your Data Story
MARY PAT CAMPBELL, FSA, MAAA, PRM
VP, Insurance Research, Conning
21 September 2018
https://en.wikipedia.org/wiki/Charles_Joseph_Minard
The Why of Data Visualization. https://www.soa.org/News-and-Publications/Newsletters/Compact/2016/march/The-Why-of-Data-Visualization.aspx
Evaluate Your Visualization
Completeness: from "No relevant data" to "All relevant data"
Perceptibility: from "Unclear and difficult" to "Clear and easy"
Intuitiveness: from "Unfamiliar; hard to understand" to "Familiar; easy to understand"
Source: http://www.perceptualedge.com/articles/visual_business_intelligence/data_visualization_effectiveness_profile.pdf
What is Your Story?
Distribution
Change over time
Correlation or Relationship
Comparison between items (ranking)
Comparison over space (maps)
Parts of a whole
Things to Try to Improve Readability
REMOVE
• Gridlines
• Legend – replace with data labels
Instead:
• Add explanatory text
• Highlight key elements
• Use multiples of same graph
Some Data Stories
Let’s Go On a Journey!
Photo by Daniel McCullough on Unsplash
Data Set 1: Modeled Income Percentiles
Data source: http://go.epi.org/unequalstates2018data
Report: Sommeiller, Estelle and Price, Mark. “The New Gilded Age”. Economic Policy Institute. 19 July 2018. https://www.epi.org/publication/the-new-gilded-age-income-inequality-in-the-u-s-by-state-metropolitan-area-and-county/
Starting Out in a Mess
Photo by Alex Block on Unsplash
Source: “See How Much the Top 1% Earn in Every State”, 30 Aug 2018 https://howmuch.net/articles/average-annual-income-of-the-top-1-percent
Figure: The Long Tail of High Income. Series: 99th percentile; average income of top 1%. Axis range: $0 to $3,000,000. Labeled: Connecticut, #1; New York, #2; Massachusetts, #3; Wyoming, #4; District of Columbia, #5.
Average Income of Top 1% Taxpayers
Figure: Higher Percentile, Longer Tail (circle size scales by number of taxpayers, $ in millions). One labeled circle per state plus the District of Columbia. x-axis: 99th Percentile Income ($0.2 to $0.8). y-axis: Average Income of Top 1% Taxpayers less the 99th Percentile Income ($0.0 to $3.0). R² = 0.7231.
Figure: Low Correlation Between Population and Income Tail Length. One labeled point per state plus the District of Columbia. x-axis: Number of Taxpayers in the 1% (0 to 200,000). y-axis: Percent Difference Between Average Income of Top 1% and 99th Percentile (100% to 400%). R² = 0.1053.
Figure: Geographic Outliers of Top Income. x-axis: Number of Taxpayers in the 1% (0 to 200,000). y-axis: Average Income of Top 1% Taxpayers, $ in millions ($0.0 to $2.5). Labeled: California, Connecticut, District of Columbia, Florida, Illinois, Massachusetts, New Jersey, New York, Texas, Wyoming. R² = 0.1476.
Data Set 2: Mortality by Cause
Source: National Center for Health Statistics
Data Visualization Gallery
https://www.cdc.gov/nchs/data-visualization/index.htm
Source: https://www.cdc.gov/nchs/data-visualization/mortality-trends/
Figure: Age-Adjusted Death Rates, 1900 to 2010 (y-axis 0 to 700). Series: Accidents, Cancer, Heart Disease, Influenza and Pneumonia, Stroke.
Age-Adjusted Death Rates, per 100,000
Cause | 1965 | 2015
Accidents | 66 | 43
Cancer | 196 | 159
Heart Disease | 543 | 169
Influenza and Pneumonia | 47 | 15
Stroke | 166 | 38
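The 1965 and 2015 rates on this slide can be redrawn as a slope chart. The following is a minimal base R sketch, not the presenter's code: the data frame name `rates`, its column names, and the layout constants are illustrative assumptions, while the values are those shown on the slide.

```r
# Rates per 100,000 as shown on the slide; names are illustrative
rates <- data.frame(
  cause = c("Accidents", "Cancer", "Heart Disease",
            "Influenza and Pneumonia", "Stroke"),
  y1965 = c(66, 196, 543, 47, 166),
  y2015 = c(43, 159, 169, 15, 38)
)
# Percent change from 1965 to 2015, rounded to whole percents
rates$pct_change <- round(100 * (rates$y2015 / rates$y1965 - 1))

# Set up an empty canvas with room on both sides for data labels
old_par <- par(mar = c(3, 1, 3, 1))
plot.new()
plot.window(xlim = c(0.3, 2.7), ylim = range(rates$y1965, rates$y2015))
title(main = "Age-Adjusted Death Rates, per 100,000")

# One line per cause, from the 1965 value to the 2015 value
segments(1, rates$y1965, 2, rates$y2015)

# Data labels take the place of a legend and gridlines
text(1, rates$y1965, paste0(rates$cause, ", ", rates$y1965), pos = 2, cex = 0.7)
text(2, rates$y2015, paste0(rates$cause, ", ", rates$y2015), pos = 4, cex = 0.7)
axis(1, at = c(1, 2), labels = c("1965", "2015"), tick = FALSE)
par(old_par)
```

Labels for the smaller rates may overlap at this scale; nudging them by hand is a common fix.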
The most frequently used return assumption is 7.5%
Return Assumptions Are Concentrated, And Shifting Down
Figure: Cumulative Percentage of Public Plans by Investment Return Assumption (x-axis 6.0% to 9.5%, y-axis 0% to 100%).
In FY 2016, 82% of plans in the Public Plans Database used return assumptions of 7.75% or less. In FY 2001, 19% did.
Public Plan Funded Ratios
Panels: 2001, 2006, 2011, 2016
Choosing a Visualization Type
What Kind of Data Do You Have?
• Dimensionality:
  • One: histogram, box-and-whisker, pie chart, table with summary stats
  • Two: line, bar/column, scatterplot
  • Many: multiples
• Numerical or categorical
  • Categorical: bar/column (may want to sort categories), histogram
• Grouped
  • Clustered columns, multiple graphs
• Large set – or just a few numbers
  • Large: will generally need to simplify/summarize/group along some dimension
  • Few: consider table or just a number
• Geographic
  • Does location actually count?
  • Tile grid when entities equally weighted
What is Your Story?
• Distribution
  • Density plot, histogram, box-and-whisker
• Change over time
  • Line, slope
• Correlation or Relationship
  • Scatterplot, bubble plot
• Comparison between items (ranking)
  • Slope, list/table, conditional formatting on table
• Comparison over space (maps)
  • Choropleths, tile maps
• Parts of a whole
  • Pie, stacked bar/column
Additional Resources
Storytelling with Data
Looks at how to design graphs and other displays for maximum effect
Most can be done in Excel
Websites
The Chartmaker Directory
http://chartmaker.visualisingdata.com/
Visualization Universe
Chart types: http://visualizationuniverse.com/charts/
Charting books: http://visualizationuniverse.com/books/
PolicyViz
https://policyviz.com/
Article Series in CompAct
• The Why... (Feb 2016)
• ...The Who... (May 2016)
• ...The Where... (Dec 2016)
• ...The What... (Oct 2017)
• and The How of Data Visualization (Apr 2018)
Can You See It?
CLIMBING THE ZEN MOUNTAIN
WHAT WE’LL TALK ABOUT
Seeing numbers
Seeing hypotheses
Seeing models
SEEING NUMBERS
THE TREACHERY OF IMAGES
Image taken from a University of Alabama site, “Approaches to Modernism”: [1], Fair use, https://en.wikipedia.org/w/index.php?curid=555365
THE NUMBER 7
WE CANNOT SEE NUMBERS
Arabic or Sanskrit numerals are no more legitimate than any other representation of numbers.
We can no more see numbers than we can hear, smell or taste them.
SCALING THE ZEN MOUNTAIN
“Before I studied Zen, I saw mountains as mountains and rivers as rivers. When I had studied Zen for thirty years I no longer saw mountains as mountains and rivers as rivers. But now that I have finally mastered Zen, I once again see mountains as mountains and rivers as rivers.”
Ch’an master Ch’ing Yuan
MANY NUMBERS - STATISTICS
Statistics maps a set of many numbers into a set of fewer numbers.
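library(tidyverse)  # provides tibble() and the %>% pipe

set.seed(1234)
# Simulate 5,000 lognormal observations with a median of 10,000
meanlog_actual <- log(10e3)
sdlog_actual <- 0.5
tbl_obs <- tibble(
  x = rlnorm(5e3, meanlog = meanlog_actual, sdlog = sdlog_actual)
)
# Six summary statistics stand in for 5,000 numbers
tbl_obs$x %>%
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1830    7132    9970   11266   13947   49429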
MANY NUMBERS VISUALLY
Summary statistics always carry reduced information.
A visualization represents all of the data, but forces our eyes to compute the statistics.
Increased efficiency vs. decreased accuracy
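The trade-off on this slide can be sketched in base R. This is an illustrative sketch rather than the presenter's code: it re-simulates a lognormal sample with the parameters used on the statistics slide, then sets the six-number summary next to a histogram of all 5,000 values. The variable names and the choice of roughly 50 bins are assumptions.

```r
set.seed(1234)
# Lognormal sample with the parameters from the statistics slide
x <- rlnorm(5e3, meanlog = log(10e3), sdlog = 0.5)

# Statistics: 5,000 numbers mapped to six
six_numbers <- summary(x)
print(six_numbers)

# Visualization: every observation contributes to the picture,
# and the eye estimates the center, spread, and skew
h <- hist(x, breaks = 50, main = "5,000 lognormal draws", xlab = "x")
abline(v = median(x), lty = "dashed")  # mark the median for reference
```

The histogram hides no observations, but reading the median off it by eye is less exact than reading it from the summary table.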
SEEING HYPOTHESES
STATISTICAL HYPOTHESES
Many different sorts:
Were the data generated by this form of distribution?
Were these two samples generated by different processes?
Is there a relationship between these two variables?
[list influenced by http://had.co.nz/stat645/graphical-inference.pdf]
SAMPLE DATA
SAMPLE AND HYPOTHESIS
HYPOTHESIS TESTING
Kolmogorov-Smirnov
Parameter significance
\(\chi^2\) test
Also:
Test against other candidates, visually
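As a minimal sketch of the first test in the list, assume a simulated lognormal sample and two candidate distributions. The candidates and all variable names are illustrative assumptions, not taken from the talk. `ks.test()` comes from base R's stats package.

```r
set.seed(1234)
# Simulated sample standing in for observed data
x <- rlnorm(5e3, meanlog = log(10e3), sdlog = 0.5)

# Candidate 1: lognormal with parameters estimated from the sample
fit_meanlog <- mean(log(x))
fit_sdlog <- sd(log(x))
ks_lnorm <- ks.test(x, "plnorm", meanlog = fit_meanlog, sdlog = fit_sdlog)

# Candidate 2 (hypothetical alternative): normal with matching mean and sd
ks_norm <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# A small p-value is evidence against that candidate
print(c(lognormal = ks_lnorm$p.value, normal = ks_norm$p.value))
```

Because the parameters are estimated from the same sample being tested, the reported p-values are only approximate. That is one reason to also compare candidates visually.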
COULD THE DATA HAVE COME FROM SOMEWHERE ELSE?
EXERCISE FOR THE STUDENT
The same, but with:
p-p or q-q plot
Cumulative distribution function
Isolate important areas of the distribution
BUT NOW …
Test the null itself!!
GRAPHICAL INFERENCE
Hadley Wickham, Dianne Cook, Heike Hofmann, and Andreas Buja
H/T -> Xan Gregg @xangregg
Graphical inference helps us answer the question “Is what we see really there?”
http://had.co.nz/stat645/graphical-inference.pdf
HOW IT WORKS
Visual test
1. Generate many (or 19) samples of the NULL
2. Add your actual data
3. Shuffle
4. Observe
5. Power may be increased by using more than one observer
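The steps above can be sketched in base R. This is a hand-rolled illustration under stated assumptions, not the code behind the slides: the "actual" data are simulated, the null hypothesis (normal with the sample's own mean and standard deviation) is chosen only for the example, and all variable names are hypothetical.

```r
set.seed(2018)

# Actual data: a skewed sample (stand-in for real observations)
actual <- rlnorm(200, meanlog = 0, sdlog = 0.5)

# Step 1: generate 19 samples from the null (assumed here: normal with
# the sample's own mean and standard deviation)
n_null <- 19
panels <- replicate(
  n_null,
  rnorm(length(actual), mean = mean(actual), sd = sd(actual)),
  simplify = FALSE
)

# Steps 2 and 3: add the actual data, then shuffle the panel order
panels <- c(panels, list(actual))
shuffle <- sample(length(panels))
panels <- panels[shuffle]
true_position <- which(shuffle == n_null + 1)  # keep hidden from observers

# Step 4: observe. Draw all 20 panels on a common x-axis
old_par <- par(mfrow = c(4, 5), mar = c(2, 2, 2, 1))
x_range <- range(unlist(panels))
for (i in seq_along(panels)) {
  hist(panels[[i]], breaks = 20, xlim = x_range, main = i, xlab = "", ylab = "")
}
par(old_par)
```

With 19 null panels, an observer who guesses at random picks the actual data with probability 1/20 = 0.05. The nullabor package (http://dicook.github.io/nullabor/index.html) automates this protocol.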
CAN YOU SPOT THE SAMPLE DATA?
HOW ABOUT NOW?
NOW?
A BIT EASIER
A BIT HARDER
THE STATISTICAL LINEUP
If I can pick my data out of a lineup, I may reject the null hypothesis.
SEEING MODELS
A “good” model is one which displays noise. We are most interested in seeing something which isn’t there.
MOVE ALONG, NOTHING TO SEE HERE
segment | adj.r.squared | sigma
1 | 0.6294916 | 1.236603
2 | 0.6291578 | 1.237214
3 | 0.6292489 | 1.236311
4 | 0.6296747 | 1.235696
NOTHING TO SEE?
RESIDUALS
MISSING VARIABLES
Let’s look at the Ozone data from the mlbench package.
At first, we will only fit to las_wind_speed.
A simple model may tell us more than we think!
BASIC EDA
OUR APPROACH
A very messy Poisson
Fit a GLM with a subset of predictors
Plot residuals against all predictors
Look for pattern
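The approach above can be sketched on simulated data. This is an illustration under stated assumptions, not the presenter's analysis: the data frame `sim` and its columns `wind`, `temp`, and `count` are hypothetical stand-ins for the ozone data, and the coefficients are invented for the example.

```r
set.seed(42)
n <- 500
# Simulated stand-in for the ozone data; both predictors drive the response
sim <- data.frame(
  wind = runif(n, 0, 10),
  temp = runif(n, 30, 90)
)
sim$count <- rpois(n, lambda = exp(0.5 - 0.05 * sim$wind + 0.03 * sim$temp))

# Fit a Poisson GLM with a subset of the predictors (wind only)
fit_subset <- glm(count ~ wind, family = poisson(), data = sim)
sim$resid <- residuals(fit_subset, type = "deviance")

# Plot residuals against every predictor, including the omitted one
old_par <- par(mfrow = c(1, 2))
for (v in c("wind", "temp")) {
  plot(sim[[v]], sim$resid, xlab = v, ylab = "Deviance residual")
  lines(lowess(sim[[v]], sim$resid), lwd = 2)  # smoother to expose pattern
}
par(old_par)
```

In this setup the residuals look like noise against `wind` but trend upward against `temp`. That pattern signals a missing variable.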
MISSING VARIABLES
AUGMENT OUR MODEL
Let’s add lax inversion temperature!
MISSING VARIABLES REDUX
TREES
Simple trees are easy to visualize
They’re also not too useful
Ensemble models are tough to see
VARIABLE IMPORTANCE
PARTIAL PLOTS
CONCLUSION
THE ZEN MOUNTAIN
Numbers are not numbers, models are not models …
-Me
THANK YOU!
REFERENCES
http://dicook.github.io/nullabor/index.html
WHERE TO FIND THIS
This presentation may be found at: http://pirategrunt.com/soa_symposium_2018/#/
Code to produce the examples and slides: https://github.com/PirateGrunt/soa_symposium_2018
Understanding the Layers of Your Data
Session 42 – Visualization: A Picture Speaks a Thousand Words
September 2018 – Predictive Analytics Symposium
Good Graphics Get to the Point
Bad Graphics Do More Harm than Good
Identify Possible Solutions
More Bad Graphics
LinkedIn Body Language for Leaders
…. what?
Using a Layered Approach to Displaying Data
Guide: ggplot2 R package
ggplot2 is an implementation of the concept of the grammar of graphics
Basics of the grammar:
• Data
• Geometric objects (e.g. points, lines, bars)
• Aesthetic attributes (e.g. color, size, shape)
Additional components:
• Statistical transformations of data (e.g. count, mean)
• Coordinate system (generally assumed to be Cartesian)
The combination and layering of these components define the grammar
Sample Dataset ‘mpg’
Fuel economy data from 1999 and 2008 for 38 popular models of car
Variable | Description | Examples
manufacturer | manufacturer name | Audi, Chevrolet, Nissan
model | model name | A4, Corvette, Altima
displ | engine displacement, in liters | 2.0, 4.2, 6.0
year | year of manufacture | 1999 or 2008
cyl | number of cylinders | 4, 6, 8
trans | type of transmission | auto, manual
drv | front-wheel, rear-wheel, 4wd | f, r, 4wd
cty | city miles per gallon | 14, 16, 20
hwy | highway miles per gallon | 15, 20, 27
fl | fuel type | e: E85, d: diesel, r: regular, p: premium, c: CNG
class | type of car | compact, midsize, SUV
Basic Comparisons – Density
Basic Comparisons – The Structure of Data Matters
library(ggplot2)  # provides ggplot() and the mpg dataset

## City mpg density (basic)
ggplot(data = mpg, aes(x = cty)) +
  geom_density()

## City mpg density (full prettied)
ggplot(data = mpg, aes(x = cty)) +
  geom_density(col = 'lightblue', fill = 'lightblue') +
  scale_y_continuous(labels = scales::percent) +
  ylab('% of data') +
  xlab('City MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))

## Highway mpg density (basic)
ggplot(data = mpg, aes(x = hwy)) +
  geom_density()

## Highway mpg density (full prettied)
ggplot(data = mpg, aes(x = hwy)) +
  geom_density(col = 'lightblue', fill = 'lightblue') +
  scale_y_continuous(labels = scales::percent) +
  ylab('% of data') +
  xlab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))
Basic Comparisons – The Structure of Data Matters
library(tidyr)  # provides gather() and the %>% pipe

## Create a new format for our data
plot_data <- mpg %>%
  gather(key = 'mpg_type', value = 'mpg', cty, hwy)

## Plot city and highway mpg under same plot controls (basic)
ggplot(plot_data, aes(x = mpg)) +
  geom_density() +
  facet_wrap(~ mpg_type, nrow = 2)

## Plot city and highway mpg under same plot controls (prettied)
ggplot(plot_data, aes(x = mpg)) +
  geom_density(col = 'lightblue', fill = 'lightblue') +
  facet_wrap(~ mpg_type, nrow = 2,
             labeller = as_labeller(c('cty' = 'City',
                                      'hwy' = 'Highway'))) +
  scale_y_continuous(labels = scales::percent) +
  ylab('% of data') +
  xlab('MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))
Scatterplots – More Than Just Dots
## Highway mpg as a function of city mpg (basic)
ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point()

## Highway mpg as a function of city mpg (prettied)
ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  xlab('City MPG') +
  ylab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))
Scatterplots – More Than Just Dots
## Highway mpg as a function of city mpg (basic)
## Add color based on class
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_point()

## Highway mpg as a function of city mpg (prettied)
## Add color based on class
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_point() +
  xlab('City MPG') +
  ylab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))
Scatterplots – More Than Just Dots
## Highway mpg as a function of city mpg (basic)
## Add a trend line
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_count() +
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE,
              linetype = 'dashed')

## Highway mpg as a function of city mpg (prettied)
## Add a trend line
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_count() +
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE,
              linetype = 'dashed') +
  xlab('City MPG') +
  ylab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))
Scatterplots – More Than Just Dots
## Highway mpg as a function of city mpg (basic)
## Add multiple trend lines
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_count() +
  geom_smooth(method = 'lm', se = FALSE)

## Highway mpg as a function of city mpg (prettied)
## Add multiple trend lines
ggplot(data = mpg, aes(x = cty, y = hwy, col = class)) +
  geom_count() +
  geom_smooth(method = 'lm', se = FALSE) +
  xlab('City MPG') +
  ylab('Highway MPG') +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))
Bar Charts – Not So Boring After All
## Plot count of cars by manufacturer (basic)
ggplot(data = mpg, aes(x = manufacturer)) +
  geom_bar(stat = 'count')

## Plot count of cars by manufacturer (prettied)
ggplot(data = mpg, aes(x = manufacturer)) +
  geom_bar(stat = 'count') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_text(size = 16))
Bar Charts – Not So Boring After All
## Plot count of cars by manufacturer (basic)
## Add transmission type as a "fill"
ggplot(data = mpg, aes(x = manufacturer, fill = factor(trans))) +
  geom_bar(stat = 'count', position = 'dodge')

## Plot count of cars by manufacturer (prettied)
ggplot(data = mpg, aes(x = manufacturer, fill = factor(trans))) +
  geom_bar(stat = 'count', position = 'dodge') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_text(size = 16))
Bar Charts – Not So Boring After All
## Plot count of cars by manufacturer (basic)
## Facet on no. of cylinders
ggplot(data = mpg, aes(x = manufacturer, fill = factor(trans))) +
  geom_bar(stat = 'count', position = 'dodge') +
  facet_grid(cyl ~ .)

## Plot count of cars by manufacturer (prettied)
## Facet on no. of cylinders
ggplot(data = mpg, aes(x = manufacturer, fill = factor(trans))) +
  geom_bar(stat = 'count', position = 'dodge') +
  facet_grid(cyl ~ .) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_text(size = 16))
Conclusion: Layers Help Tell the Story
Coordinate system
Data: Coordinates of where shot was taken; Make or miss
Geometrics: Bins of court coordinates; Percentages within bins
Aesthetics: Size of hexagons; Color based on relative percentage
Statistical Transformations: Count of shots, makes within bin
Thank you
Mike Hoyer, Actuary and Product Manager
Milliman IntelliScript
Telling Your Data Story
MARY PAT CAMPBELL, FSA, MAAA, PRM
VP, Insurance Research, Conning
21 June 2018
Examples To Come
I will be telling some data stories in the session, and full slides will be available after the meeting.
Photo by Casey Horner on Unsplash