An introduction to designing and building data visualizations
Kristen Sosulski [email protected]
About me
One of many influences…
Agenda
I. What is data visualization?
II. What types of stories can you tell with a visualization?
III. How to approach creating visualizations?
IV. Try it and apply it.
I. DEFINING VISUALIZATION
Visualization is a kind of narrative, providing a clear answer to a question without extraneous details.
-- Ben Fry, 2008, p. 4.
Visualization is a graphical representation of some data or concepts
-- Colin Ware, 2008, p. 20
Visual design is mapping data to visual form. It should convey the unique properties of the data set it represents.
Pathways of City Runners: A year of runs in NYC
http://yesyesno.com/nike-city-runs
Traffic in Lisbon
http://www.visualcomplexity.com/vc/project_details.cfm?id=728&index=728&domain=
Visualizations
• Help us think
• Use perception to offload cognition
• Serve as an external aid to augment working memory
• Boost our cognitive abilities
Visualizations are helpful in communication and analysis
Dual channels
Limited capacity
Active Processing
However, visualizations can hinder our message when designed poorly.
Wong, 2010, p. 15
Good Chart Design
Use natural increments for the y-axis scale
Include a zero baseline in all bar charts
Place the larger segments of a pie chart at the top, starting at the 12 o’clock position
Wong, 2010, p. 143
Data visualization enables us to record, analyze, and communicate
Past Present Future
Rationale
• Traditional reports using tables, rows, and columns do not paint the whole picture or, even worse, lead an analyst to a wrong conclusion.
• Firms need to use data visualization because information workers:
– Cannot see a pattern without data visualization
– Cannot fit all of the necessary data points onto a single screen
– Cannot effectively show deep and broad data sets on a single screen
Source: Evelson, B. & Yuhanna, N. (2012). The Forrester Wave: Advanced data visualization (ADV) Platforms, Q3, 2012. Forrester Research, July 17.
Patterns: Violence in Video Games News Stories: Using a filled density plot
Source: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization.html. Begin at 4:45
Data Points: U.S. Unemployment Rate using a choropleth map
Source: Forbes
Data Points: Student Loan Debt using bar, line, and area charts
Source: The Federal Reserve Bank of New York: http://www.newyorkfed.org/studentloandebt/
Data Points: Small Multiples of the Number of Unemployed Workers
Source: http://hci.stanford.edu/jheer/files/zoo/
Deep and Broad: Four Ways to Slice Obama’s 2013 Budget Proposal using a bubble pie chart
Source: New York Times: http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html
II. WHAT STORIES CAN YOU TELL WITH DATA VISUALIZATION?
Hans Rosling on Poverty using a bubble chart with sliders
http://www.ted.com/talks/hans_rosling_reveals_new_insights_on_poverty.html
All medalists racing the 100 meter sprint
Source: http://www.nytimes.com/interactive/2012/08/05/sports/olympics/the-100-meter-dash-one-race-every-medalist-ever.html
Old vs. New Data Visualization
• Dynamic data == dynamic visualizations
• Visual querying. Drill downs. Drop downs.
• Animated visualization.
– If a particular dimension, such as time, has hundreds or thousands of values (e.g., daily values over multiple years), manually clicking through every day is not practical.
– An animated scroll up/down is more practical.
You could tell a story like this… or
Patterns: How people spend their time using a stacked area/line graph
Source: New York Times
The growth of Target from 1962 to 2008 using an animated graduated symbol map
Source: Flowing data
How long does it take to afford a beer? Using a horizontal bar chart.
III. HOW TO APPROACH CREATING VISUALIZATIONS
A framework to get started…
Who’s the audience?
What’s the task?
What’s the data?
What’s the best visual display?
What’s the best visual display?
What do these charts have in common?
Scatter plot Matrix chart Network diagram
They show a relationship between points.
What do these charts have in common?
Bar Chart Block Histogram
Bubble Chart
They compare a set of values.
What do these charts have in common?
Line Graph Stacked Line/Area Graph
Track rises and falls over time
What do these charts have in common?
Pie Chart Treemaps
Seeing parts of the whole
What do these charts have in common?
Phrase Nets Word Clouds Word Trees
Edward Tufte: On exploring forms of display
http://www.youtube.com/watch?v=Th_1azZA2OY&noredirect=1
After we select our display, we need to apply effective design principles.
Let’s test our knowledge with a graph IQ test.
Graph Design IQ Test
This test will ask you 10 questions to determine how well you understand the
principles of good table and graph design. Good luck!
1: Which graph makes it easier to determine whether Mid-Cap U.S. Stock or Small-Cap U.S. Stock has the greater share?
[Figure: Investment Portfolio Breakdown, shown as a pie chart and as a bar graph (x-axis 0% to 20%). Categories: International Stock, Large-Cap U.S. Stock, Bonds, Real Estate, Mid-Cap U.S. Stock, Small-Cap U.S. Stock, Commodities]
A. Pie Chart
B. Bar Graph
2: Which of these line graphs is easier to read?
[Figure: Company Sales in millions of USD (y-axis 0 to 60) by month, Jan to Dec, shown as a 2-D line graph and as a 3-D line graph]
A. 2-D Line Graph
B. 3-D Line Graph
3: Which of these two tables is easier to read?
[Tables A and B present the same data with different formatting]
Sales Summary by Region (USD), 1st Quarter, 2007. Regions are sorted by revenue.
Region           Revenue        % of Total Revenue   Expenses       Profit         % of Total Profit
Europe           $75,904,604    31.06%               $40,988,486    $34,916,117    22.31%
Canada           $51,572,694    21.10%               $17,534,715    $34,037,978    21.75%
Western US       $42,660,178    17.46%               $11,944,849    $30,715,328    19.63%
Eastern US       $33,977,385    13.90%               $7,135,150     $26,842,134    17.15%
Central US       $26,139,598    10.70%               $3,920,939     $22,218,658    14.20%
Asia             $14,135,278    5.78%                $6,360,875     $7,774,402     4.97%
Total (or Avg)   $244,389,737   100.00%              $87,885,117    $156,504,619   100.00%
A. Table A
B. Table B
4: Which graph makes it easier to focus on the pattern of change through time, instead of the individual values?
[Figure: 2006 Web Traffic, unique visitors and page views in millions (y-axis 0.0 to 3.0) by month, Jan to Dec, shown as a bar graph and as a line graph]
A. Bar Graph
B. Line Graph
5: Only one of these graphs accurately encodes the values. The other skews the values in a misleading manner. Which graph presents the data accurately?
[Figure: Numbers of Shareholders by response (Yes, No, Undecided). Graph A uses a y-axis from 2,000 to 2,500; Graph B uses a y-axis from 0 to 2,500]
A. Graph A
B. Graph B
6: Which map makes it easier to find all of the counties with positive growth rates?
[Figure: 2006 Growth Rate by County, shown as Map A and Map B, each with a scale from -3% to +3%]
A. Map A
B. Map B
7: Which graph makes it easier to determine R&D’s travel expense?
[Figure: 2006 Expenses by Department (USD) for R&D, Sales, Management, and Accounting across Payroll, Equipment, Travel, Supplies, Software, and Misc., shown as a 3D bar graph and as a 2D bar graph with one panel per department (x-axis 0 to 80)]
A. 3D Bar Graph
B. 2D Bar Graph
8: In which graph are the labels easier to read?
[Figure: 2006 Marketing Expenditures By Country in thousands of USD (y-axis 0 to 7,000) for United States, Canada, United Kingdom, Japan, France, Germany, Mexico, and China, shown as Graph A and Graph B with different label layouts]
A. Graph A
B. Graph B
9: Which graph is easier to look at?
[Figure: Median Employee Salary by Department and State, USD in thousands (y-axis 0 to 100), for Nebraska, Oklahoma, and Kansas across Human Resources, Accounting, Management, Sales, and Manufacturing, shown as Graph A and Graph B]
A. Graph A
B. Graph B
10: Which table allows you to see the areas of poor performance more quickly?
[Tables A and B present the same data with different formatting]
2006 Key Metrics
Region    Overall   Revenue       Expenses     Profit       Avg. Order Size
East      Good      $4,652,462    $2,682,765   $1,969,697   $6,845
West      Fair      3,705,426     2,211,773    1,493,653    4,266
North     Fair      3,215,789     2,712,984    502,805      4,568
South     Poor      2,215,752     1,562,735    653,017      1,358
Overall   Fair      $13,789,429   $9,170,257   $4,619,172   $4,259
A. Table A
B. Table B
Above all else show the data
-- Edward Tufte
Sometimes decorations can help editorialize about the substance of the graphic. But it’s wrong to distort the data measures—the ink locating values of numbers—in order to make an editorial comment or fit a decorative scheme.
-- Edward Tufte
Principles
• Chartjunk
• Data-ink ratio
• Data integrity
– Lie Factor
• Data Richness
• Scales
– Pie chart. Zero point.
• Color
– Color blindness
– Using color sparingly
– Use red for negative earnings
• Attribution
Avoid chart junk
Useless, non-informative, or information-obscuring elements of quantitative information displays.
Chart Junk: Remove grid lines
[Figure: Sales by quarter (1st Qtr to 4th Qtr), y-axis 0 to 9]
Chart Junk: Remove the frame around the visual
[Figure: 2010 Sales Data (in millions) by quarter, y-axis 0 to 9]
Chart Junk: Consider if tick marks are necessary
[Figure: 2010 Sales Data (in millions) by quarter, y-axis 0 to 9]
Tables and Charts: Remove Gridlines
2010 Forecast vs. Performance (U.S. $)
         Forecast    Performance
Qtr 1    $85,000     $95,000
Qtr 2    $80,000     $75,000
Qtr 3    $75,000     $65,000
Qtr 4    $60,000     $60,000
Total    $300,000    $295,000
Data Ink Ratio
Reduce the amount of “ink” used to represent the data.
Data Ink Ratio: Too many bars to represent a single data point
Data Ink Ratio: Consider bin size.
Data Ink Ratio: Would an area chart work better?
Data Ink Ratio: Or a line chart?
Data Integrity: Lie Factor
Lie Factor = (size of effect shown in graphic) / (size of effect in data)
Data Integrity: Lie Factor = 14.8
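The 14.8 value matches Tufte's well-known fuel-economy example, so the sketch below assumes that is the graphic on this slide. In that example the drawn line grows from 0.6 to 5.3 inches while the standard it represents grows from 18 to 27.5 miles per gallon. The function names are illustrative only.

```python
def relative_change(first, second):
    """Size of effect as a fraction of the starting value."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Ratio of the effect drawn in the graphic to the effect present in the data."""
    shown = relative_change(graphic_first, graphic_second)
    actual = relative_change(data_first, data_second)
    return shown / actual


# Assumed inputs: line lengths in inches, fuel economy in miles per gallon.
factor = lie_factor(0.6, 5.3, 18.0, 27.5)
print(round(factor, 1))
```

A lie factor close to 1 means the graphic shows the effect at about its true size. Values far above 1 overstate it.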
Data Integrity: Decorate data without lying
Data Integrity: Does a change in perspective help tell your story?
[Figure: 2010 Sales Data (in millions) by quarter (1st Qtr to 4th Qtr), y-axis 0 to 9]
Data Integrity: Ensure a zero point scale
Proportions
[Figure: Sales by quarter (1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr)]
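Proportions: What else is wrong?
[Figure: Sales by quarter (1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr) with value labels 8.2, 3.2, 1.4, 1.2]
Doesn’t add up to 1 or 100%
Better. What’s still wrong?
[Figure: Sales by quarter (1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr) with value labels 0.586, 0.229, 0.100, 0.086]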
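The step between the two sets of labels above (8.2, 3.2, 1.4, 1.2 and then 0.586, 0.229, 0.100, 0.086) is dividing each value by the total. A minimal sketch, with a helper name chosen only for illustration:

```python
def to_proportions(values):
    """Divide each value by the total so the parts sum to 1."""
    total = sum(values)
    return [value / total for value in values]


sales = [8.2, 3.2, 1.4, 1.2]          # raw labels from the first chart
shares = to_proportions(sales)        # parts of the whole
rounded = [round(share, 3) for share in shares]
print(rounded)
```

Rounding each share separately can leave the displayed labels summing to slightly more or less than 1, so check the total before publishing.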
[Figure: Sales performance compared to forecasted sales, 2010 (U.S. $), Performance vs. Forecast for Qtr 1 to Qtr 4, y-axis $0 to $200,000]
Data Richness
Rich data means quality data – accurate data from reputable sources plus effective filtering of data for the audience.
Wong, 2010, p. 28
Data Richness. Tell the whole story with an excerpt
Wong, 2010, p. 29
This Year Last Year
Data Richness. However, don’t be misleading….
Wong, 2010, p. 29
This Year Last Year
Data Quantity != Data Richness
Wong, 2010, p. 29
Inconclusive
An upward trend
Color
• Minimize the use of color
• Use shading instead
– From lightest to darkest (no zebra pattern)
• Consider using red for negative earnings.
Color. Some people are color blind
Labeling and Attribution
• Explain encodings.
• The design of every graph has a similar flow. You get the data; encode it with circles, bars, and colors; and then you let others read it.
• The readers have to decode your encodings at this point.
• Describe what the circles, bars, and colors represent.
• Label directly on the data instead of, or in addition to, using a legend.
• Cite your data source.
Source: Wong (2010); Yau (2011), p. 13
IV. TRY IT. APPLY IT.
Which MBA?
Let’s try to create something similar
Run mbarankpart1.py
Default Sorted
01 – Bigger Figure Size
02 - Removing gray background and frame
03 - Make room for others… Remove frame
04 – Iterate to remove tick marks
05 – Bar height and bar color and edge
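Steps 01 to 05 could be sketched roughly as follows, assuming the scripts use matplotlib (which the later references to show() and saving figures suggest). The original mbarankpart1.py is not part of this transcript, so the school names, scores, colors, and the draw_ranking helper are hypothetical placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder data; the deck's actual MBA rankings are not in the transcript.
schools = ["School A", "School B", "School C", "School D"]
scores = [92, 85, 78, 64]


def draw_ranking(labels, values):
    """Draw a horizontal bar chart with the clean-up steps 01 to 05 applied."""
    fig, ax = plt.subplots(figsize=(8, 4))      # 01: bigger figure size
    fig.patch.set_facecolor("white")            # 02: remove the gray background
    ax.set_facecolor("white")
    for spine in ax.spines.values():            # 02/03: remove the frame
        spine.set_visible(False)
    ax.tick_params(axis="both", length=0)       # 04: remove tick marks, keep labels
    positions = list(range(len(labels)))
    ax.barh(positions, values, height=0.6,      # 05: bar height, color, and edge
            color="#4c72b0", edgecolor="none")
    ax.set_yticks(positions)
    ax.set_yticklabels(labels)
    ax.invert_yaxis()                           # first item at the top
    return fig, ax


fig, ax = draw_ranking(schools, scores)
```

Each step changes one property of the figure or axes, which makes it easy to iterate: rerun, look, and adjust one setting at a time.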
mbarank_part3.py: Now, plot 2 others and add ranks…
Refine your visual display in Adobe Illustrator
Tips for saving your image file
• If you are going to modify the image in Illustrator, save your file as a PDF from Python
– Use savefig(filename), e.g. savefig('mbarankings.pdf')
– Or save from the interactive window that show() launches in IPython
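A minimal sketch of the PDF route, assuming matplotlib. The chart contents and the temporary output folder are placeholders; only the file name mbarankings.pdf comes from the slide.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt


def save_as_pdf(fig, path):
    """Write the figure as a vector PDF that Illustrator can open and edit."""
    fig.savefig(path, format="pdf", bbox_inches="tight")
    return path


fig, ax = plt.subplots()
ax.barh([0, 1, 2], [3, 2, 1])  # placeholder bars

# Saved to the system temp folder here; use any path you like.
out = save_as_pdf(fig, os.path.join(tempfile.gettempdir(), "mbarankings.pdf"))
```

PDF keeps the bars and text as vector objects, so they stay editable in Illustrator. A PNG would flatten them into pixels.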
Working in Adobe Illustrator
1. Open the PDF document in Adobe Illustrator.
2. If you don’t see the Tools window, go to the Window menu and click Tools to turn it on.
3. The black arrow is called the Selection tool. Select it, and your mouse pointer becomes a black arrow.
4. Click and drag it over the border. The border appears highlighted. This is known as a clipping mask.
5. Press delete on your keyboard to get rid of it.
6. If this deletes the graphic, undo the edit, and use the Direct Selection tool, which is represented by a white arrow, to highlight the clipping mask instead.
7. Use the Selection tool to change fonts, change colors, add text, etc.
One Mistake: Don’t Average Percentages
• You must go back to the original data source to recalculate the new percentage.
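A small illustration of why, using made-up counts: one group with 9 of 10 and another with 10 of 100. Averaging the two percentages gives a very different answer from recalculating on the combined raw counts. The function names and numbers are hypothetical.

```python
def naive_average(groups):
    """Average the per-group percentages (the mistake)."""
    percentages = [100.0 * part / whole for part, whole in groups]
    return sum(percentages) / len(percentages)


def pooled_percentage(groups):
    """Recalculate the percentage from the original counts."""
    total_part = sum(part for part, _ in groups)
    total_whole = sum(whole for _, whole in groups)
    return 100.0 * total_part / total_whole


# Hypothetical (part, whole) counts: 90% of a small group, 10% of a large one.
groups = [(9, 10), (10, 100)]
print(naive_average(groups))                 # 50.0
print(round(pooled_percentage(groups), 1))   # 17.3
```

The two results agree only when every group has the same denominator.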
RESOURCES
Edward Tufte
Nathan Yau and Dona Wong
Casey Reas and Ben Fry
Stephen Few
Colin Ware and Richard Mayer
Seth Godin and Andy Goodman