Lies, damned lies & dataviz

Post on 11-Sep-2014

  • Lies, Damned Lies & Dataviz

    Bad visualization, and how to avoid it

    Dr. Andrew Clegg, Director, Learner Analytics & Data Science, Pearson

    @andrew_clegg

  • Part I: Why Visualize?

    What are the benefits when it's done right?

    Part II: Bad Dataviz

    How to spot the failures and how to avoid them yourself

    Warning: Contains Opinion!

    Introduction

  • Part I: Why Visualize?

  • Summarizing and communicating numbers

    Drawing attention to trends and patterns

    Exploring data interactively

    Capturing attention

    Telling stories

    What is the goal?

  • Playing to your neural hardware's strengths

    Your visual system excels at pattern detection & parallel processing.

    Representing data graphically means you can leverage this for free.

    How does visualization help?

  • Challenge: estimate x when y = 0

    x y x y x y

    27.38 24.05 32.31 31.61 75.67 14.83

    62.64 7.31 51.84 28.61 34.23 31.65

    50.76 16.30 59.04 18.29 51.21 7.69

    42.94 26.78 74.63 1.15 47.26 22.90

    8.72 42.35 56.15 11.37 66.60 3.21

    30.62 30.87 47.23 19.49 17.46 40.31

    62.63 9.14 59.36 8.82 65.70 12.79

    63.21 18.66 44.58 19.12 52.24 12.92

    40.49 23.29 47.85 20.55 62.56 14.17

    22.07 41.46 68.21 11.99 40.43 19.77
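For comparison with the visual approach, here is a minimal sketch of the numerical route: fit a straight line by ordinary least squares and solve it for y = 0. It assumes the table is read as 30 (x, y) points (three side-by-side column pairs) and that a straight line is a reasonable model, neither of which the slide states. The names `fit_line` and `x_at_zero` are illustrative.

```python
# The 30 (x, y) points from the table, read column pair by column pair.
xs = [27.38, 62.64, 50.76, 42.94, 8.72, 30.62, 62.63, 63.21, 40.49, 22.07,
      32.31, 51.84, 59.04, 74.63, 56.15, 47.23, 59.36, 44.58, 47.85, 68.21,
      75.67, 34.23, 51.21, 47.26, 66.60, 17.46, 65.70, 52.24, 62.56, 40.43]
ys = [24.05, 7.31, 16.30, 26.78, 42.35, 30.87, 9.14, 18.66, 23.29, 41.46,
      31.61, 28.61, 18.29, 1.15, 11.37, 19.49, 8.82, 19.12, 20.55, 11.99,
      14.83, 31.65, 7.69, 22.90, 3.21, 40.31, 12.79, 12.92, 14.17, 19.77]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(xs, ys)
x_at_zero = -intercept / slope  # where the fitted line crosses y = 0
print(f"y = {slope:.3f} * x + {intercept:.2f}; y = 0 near x = {x_at_zero:.1f}")
```

The fitted line slopes downwards and crosses y = 0 a little beyond the largest x in the table, which is the kind of estimate the eye makes almost instantly from a scatter plot.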

  • Challenge: estimate x when y = 0

  • Challenge: estimate x when y = 0

  • Challenge: find most similar data point

    x y x y x y

    54.88 71.52 97.86 79.92 35.95 43.70

    60.28 54.49 46.15 78.05 69.76 6.02

    42.37 64.59 11.83 63.99 66.68 67.06

    43.76 89.18 14.34 94.47 21.04 12.89

    96.37 38.34 52.18 41.47 31.54 36.37

    79.17 52.89 26.46 77.42 57.02 43.86

    56.80 92.56 45.62 56.84 98.84 10.20

    7.10 8.71 1.88 61.76 20.89 16.13

    2.02 83.26 61.21 61.69 65.31 25.33

    77.82 87.00 94.37 68.18 46.63 24.44

  • Challenge: find most similar data point

  • Challenge: find the outlier

    x y x y x y

    54.88 71.52 97.86 79.92 35.95 43.70

    60.28 54.49 46.15 78.05 69.76 6.02

    42.37 64.59 11.83 63.99 66.68 67.06

    43.76 89.18 14.34 94.47 21.04 12.89

    96.37 38.34 52.18 41.47 31.54 36.37

    79.17 52.89 26.46 77.42 57.02 43.86

    56.80 92.56 45.62 56.84 98.84 10.20

    7.10 8.71 1.88 61.76 20.89 16.13

    2.02 83.26 61.21 61.69 65.31 25.33

    77.82 87.00 94.37 68.18 46.63 24.44

  • Challenge: find the outlier

  • Avoiding limitations of statistics

    Showing patterns in large data sets with minimal information loss.

    Revealing structure of tricky data sets where typical summary statistics do a poor job.

    How does visualization help?

  • Showing patterns in large data sets

    https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

  • Describing statistically tricky data

    http://www.stanford.edu/~mwaskom/software/seaborn/examples/anscombes_quartet.html

    All four have the same:

    mean(x), variance(x), mean(y), variance(y), correlation coefficient, regression coefficients

    Anscombe's Quartet

    (Francis Anscombe, 1973)
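The claim can be checked numerically. The sketch below uses the values from Anscombe's published data set (they are not reproduced on the slide) and computes the listed statistics for each of the four sets in plain Python. The helper name `summarize` is illustrative.

```python
from math import sqrt

# Anscombe's quartet: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Means, sample variances, Pearson r and least-squares line for one set."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x, "var_x": sxx / (n - 1),
        "mean_y": mean_y, "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope, "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places the four rows agree, even though the scatter plots look nothing alike.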

  • Describing statistically tricky data

    Much web data, especially involving human preferences or choices, looks like this.

    There is no central tendency so typical descriptive statistics are useless.

    Zipfian distribution, an example of a power law.
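A small sketch of why the usual summaries mislead here. The data is hypothetical: 1,000 items whose popularity falls off as 1 / rank, which is one simple way to get a Zipf-like shape, not the data behind the slide.

```python
from statistics import mean, median

# Hypothetical popularity counts: the item at rank r gets about 10,000 / r hits.
counts = [10_000 // rank for rank in range(1, 1001)]

avg = mean(counts)
mid = median(counts)
# Fraction of items that are less popular than the "average" item.
share_below_avg = sum(c < avg for c in counts) / len(counts)
# Fraction of all hits that go to the ten most popular items.
top10_share = sum(counts[:10]) / sum(counts)

print(f"mean = {avg:.1f}, median = {mid}")
print(f"{share_below_avg:.0%} of items fall below the mean")
print(f"top 10 items take {top10_share:.0%} of all hits")
```

The mean sits several times above the median and the large majority of items fall below it, so neither number describes a "typical" item. Plotting the counts against rank shows the shape directly.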

  • How does visualization help?

    Illustrating a story

    Visualizations are often used simply to clarify or reinforce the main points of a story, narrative or message.

    This process fails when the conclusions suggested by the graphic are irrelevant to the narrative, or even contradict it.

    It can also fail when the graphic has no clear message or multiple conflicting interpretations, or is largely incomprehensible.

    Many of the following examples illustrate these mistakes.

  • Part II: Bad Dataviz

  • 1. Axes of evil

    Bad dataviz

    http://fluffware.tumblr.com/post/4580822773/axes

  • http://www.google.co.uk/trends/explore#q=%22data science%22

    Unlabelled axes

  • Firearms (skjutvapen) seizures report: http://bit.ly/1dHnFzC (PDF) via Junk Charts

    Axis scale manipulation

  • https://twitter.com/jk_keller/status/410498080765919232/photo/1

    Axis scale manipulation (totally shameless version)

    Version published by Reuters

    Version fixed by @jk_keller

  • Example from Stephen Few (PDF)

    Dual axes: caution

    Natural interpretation:

    Units sold dipped below revenue (A) and is now catching up (B).

    But these impressions are meaningless.

    They are just artefacts of the chosen axis scales.

    A

    B

  • Proportionality errors

    From an Australian document found at The Guardian

    1 row of people = roughly 43,000 nurses.

    10 rows = roughly 48,000 nurses.

    ?!?
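One way to put a number on this kind of distortion is Tufte's "lie factor": the relative change shown in the graphic divided by the relative change in the data. The slide does not use this measure; it is brought in here as an illustration, and the function name is an assumption.

```python
def lie_factor(graphic_before, graphic_after, data_before, data_after):
    """Relative change drawn in the graphic divided by relative change in the data."""
    graphic_change = (graphic_after - graphic_before) / graphic_before
    data_change = (data_after - data_before) / data_before
    return graphic_change / data_change


# Figures from the slide: 1 row for ~43,000 nurses, 10 rows for ~48,000.
factor = lie_factor(1, 10, 43_000, 48_000)
honest_rows = 48_000 / 43_000  # rows needed if 1 row really meant 43,000

print(f"lie factor = {factor:.1f}")
print(f"a proportional graphic would use about {honest_rows:.2f} rows")
```

A lie factor of 1 means the graphic is proportional to the data. Here the drawn change is dozens of times larger than the real one.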

  • Cheating outright?

    All found via The Guardian

  • Quick quiz: what happened in 2005?

  • Axis inversion: when down means up?!?

    From Thomson Reuters via Business Insider

    Version published by Reuters

    Version fixed by @PFedewa

  • Bad dataviz

    2. Distance vs. area vs. volume

    http://muhammadfamizwanabdullah.blogspot.co.uk/2010/11/10-introduction-of-teaching-volume-of.html

  • Pie charts: avoid

    Bad

    Colours are used for separating slices, so they can't easily be put to another use.

    No way to show time dimension statically.

    Comparing relative sizes of slices is hard.

    Doing it in 3D is harder. Perspective inflates nearer slices, and the similar volume of the objects is a red herring.

    Doing it with deep, discontinuous 3D objects is even harder.

    Worse

    Worst

  • Perhaps justifiable (in 2D) if numbers are sufficiently different.

    Otherwise, use a much simpler design and avoid all those problems.

    Pie charts: avoid

  • Pie chart horrors

    http://junkcharts.typepad.com/junk_charts/2014/03/two-charts-that-fail-self-sufficiency.html

    Pie charts are supposed to show proportions of a whole.

    People expect the %s to add up to 100%.

    This one shows proportions of separate quantities.

  • Pie chart horrors

    From a World Bank report (PDF) found at The Guardian

    These ones show 96% and 40% as full circles.

    This one is falling apart.

    This one thinks 76% is less than three quarters.

  • Even worse uses of 3D

    https://www.tableausoftware.com/public/blog/2011/01/viz-wiz-1-11 and http://www.simplexnumerica.com/Gallery/gallery_pyramid.html

    Cones, pyramids, spheres etc

    Are we comparing width, height, area or volume? Nobody knows!

    26.76% = tiny peak

    23.32% = massive slab

    ?!?
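The ambiguity matters because the same pair of numbers looks very different depending on which dimension the reader compares. A rough sketch, assuming the value is mapped to a shape's linear dimension and the shape is scaled uniformly (the slide's examples may be drawn differently):

```python
def apparent_ratios(value_a, value_b):
    """How much bigger A looks than B if the reader compares length, area or volume."""
    k = value_a / value_b
    return {"length": k, "area": k ** 2, "volume": k ** 3}


# The two percentages quoted on the slide.
ratios = apparent_ratios(26.76, 23.32)
for dimension, ratio in ratios.items():
    print(f"{dimension:>6}: A looks {ratio:.2f}x the size of B")
```

A difference of about 15% in the data reads as roughly 30% if the eye compares areas and about 50% if it compares volumes. With a non-uniform shape such as a pyramid sliced by height, the mismatch can be far larger, as in the slide's example.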

  • Stacked charts: caution

    Stacked charts show how a data series breaks down by another attribute of the data.

    But people often misread these as two distinct data series, reading off a separate y-axis value for each one.

  • Bubble charts: avoid

    http://commons.wikimedia.org/wiki/File:Bubble_Chart_Chicago_Deposit_Market.jpg

  • Bubble charts: avoid

    http://bit.ly/1okS3nE and http://bit.ly/1hdZQtO

  • Bad dataviz

    3. Bad maps

    http://xkcd.com/1138/

  • Non-normalized quantities are useless

    http://personal.frostburg.edu/jibandy0/starbucks%20map.jpg

    Don't use absolute values without a very good reason.

    Normalize appropriately:

    per capita, per adult, per student, per household, per square km, per journey, per voter
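A minimal sketch of per-capita normalization. The regions and figures are hypothetical, made up for illustration, and are not taken from the map linked above.

```python
# Hypothetical store counts and populations, for illustration only.
regions = {
    "Region A": {"stores": 500, "population": 10_000_000},
    "Region B": {"stores": 120, "population": 1_000_000},
    "Region C": {"stores": 300, "population": 12_000_000},
}


def per_100k(stores, population):
    """Stores per 100,000 residents."""
    return stores / population * 100_000


by_absolute = sorted(regions, key=lambda r: regions[r]["stores"], reverse=True)
by_rate = sorted(regions, key=lambda r: per_100k(**regions[r]), reverse=True)

print("ranked by store count:  ", by_absolute)
print("ranked by stores / 100k:", by_rate)
```

The region with the fewest stores comes out on top once population is taken into account. A map of raw counts would mostly be showing where people live.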

  • Remember: geopolitical boundaries are artificial

    This map shows all the countries I've visited.

    The relative size of USA makes me seem much more widely travelled than I really am.

    Is country the right level of aggregation?

  • Remember: map projections lie

    http://en.wikipedia.org/wiki/File:Hobo%E2%80%93Dyer_projection_SW.jpg

  • http://en.wikipedia.org/wiki/File:Choropleth-density.png

  • Consider using fixed-size bins

    http://bit.ly/O9EPta

  • Drawbacks of maps

    Can't easily show the time dimension without animation

    Hard to show multiple attributes of data at once

    Physical proximity can obscure demographic/cultural differences, and vice versa

    Just because you can map the data doesn't mean you should.

    Save maps for when geographical trends are the key focus.

  • 4. Colour choice

    Bad dataviz

    Good colour palettes from RColorBrewer

  • Sequential data

    http://mapsdeguap.blogspot.co.uk/2012/04/choropleth-map.html

    Use a smooth transition from min to max.

    Don't cycle more than once.

    This map goes purple-green twice.

    A better choice would be:

  • Diverging data

    http://www-03.ibm.com/press/us/en/pressrelease/35359.wss

    Here the yellow section indicates the median. Red/green = above/below median.

    However, the red and green ranges are not scaled well. 75 (close to median) is almost the same colour as 108 (max).

    Sequential data, but with a well-defined midpoint.

    Two directions from this midpoint -- two poles:

    above/below average, positive/negative, female/male, Democrat/Republican etc.
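A sketch of one way to scale a diverging palette so that each side runs linearly from a neutral midpoint to its own extreme. The colour values, the function names, and the minimum (40) and midpoint (70) used below are assumptions for illustration. Only 75 and 108 come from the slide.

```python
# Assumed endpoint colours (RGB): a blue pole, a neutral grey, a red pole.
LOW, NEUTRAL, HIGH = (33, 102, 172), (247, 247, 247), (178, 24, 43)


def lerp(colour_a, colour_b, t):
    """Linear blend from colour_a (t = 0) to colour_b (t = 1)."""
    return tuple(round(a + (b - a) * t) for a, b in zip(colour_a, colour_b))


def diverging_colour(value, vmin, midpoint, vmax):
    """Map a value to a colour, scaling each side of the midpoint separately."""
    value = min(max(value, vmin), vmax)  # clamp to the data range
    if value <= midpoint:
        return lerp(NEUTRAL, LOW, (midpoint - value) / (midpoint - vmin))
    return lerp(NEUTRAL, HIGH, (value - midpoint) / (vmax - midpoint))


# Hypothetical range: minimum 40, midpoint 70, maximum 108.
for value in (40, 70, 75, 108):
    print(value, diverging_colour(value, 40, 70, 108))
```

With this scaling a value just above the midpoint stays close to the neutral colour, and only values near the maximum reach the full pole colour.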

  • Categorical data

    Also known as nominal or qualitative.

    Colours should not form a pattern, as this can imply a false relationship.

    The ethnicity colours here are reasonable, although quite close in colour space.

    The location colours are badly chosen. They suggest a linear progression, which is meaningless.

    http://www.visualizing.org/full-screen/10886

  • Consisten
