the art of data visualization
Post on 15-Jul-2015
THE ART OF DATA VISUALISATION
S Anand, Chief Data Scientist, Gramener
THIS TALK HAS TWO PARTS
WHAT I DO IN MY CURRENT JOB
HOW I GOT MY CURRENT JOB
Heinlein, in connection with my story “Dreaming Is a Private Thing”, accused me, good-naturedly, of coining money out of my neuroses.
Well, whose neuroses should I make money off of?
LET’S TAKE TESCO’S GROCERIES
category  title                                                  kJ  rate
dairy     Activia Pouring Natural Yogurt 1X950g                 216  0.21
dairy     Activia Pouring Strawberry Yogurt 1X950g              250  0.21
dairy     Activia Pouring Vanilla Yogurt 1X950g                 263  0.21
icecream  Almondy Daim 400G                                    1804  0.75
icecream  Almondy Toblerone 400G                               1850  0.5
cereals   Alpen 10 Pack Lite Summer Fruits Cereal Bars 210G    1222  1.57
cereals   Alpen 10Pk Fruit Nut And Chocolate Cereal Bars 290G  1812  1.14
cereals   Alpen Coconut And Chocolate Cereal Bars 5Pk 145G     1863  1.24
cereals   Alpen Fruit And Nut With Chocolate Cereal Bar 5X29g  1812  1.24
cereals   Alpen High Fruit 650G                                1439  0.4
cereals   Alpen Light Bars Chocolate And Orange 5X21g          1246  1.71
cereals   Alpen Light Chocolate And Fudge Bar 5X21g            1264  1.71
cereals   Alpen Light Sultana & Apple Bars 5Pk 105G            1197  1.71
cereals   Alpen Light Summer Fruits Bars 5Pk 105G              1222  1.71
cereals   Alpen No Added Sugar 1.3Kg                           1488  0.31
cereals   Alpen No Added Sugar 560G                            1488  0.46
cereals   Alpen Original 1.5Kg                                 1509  0.27
cereals   Alpen Original Muesli 750G                           1509  0.35
cereals   Alpen Raspberry And Yoghurt Cereal Bars 5X29g        1748  1.24
cereals   Alpen Strawberry With Yoghurt Cereal Bar 5X29g       1756  1.24
dairy     Alpro Natural Yofu 500G                                    0.28
dairy     Alpro Raspberry Vanilla Yofu 4X125g                        0.35
dairy     Alpro Strawberry And Fof Soya Yofu 4X125g                  0.35
dairy     Alpro Vanilla Yofu 500G                                    0.28
Movies on the IMDb
[Figure: movies plotted by votes and rating. Axis labels: MORE VOTES, BETTER RATED. Legend: Many unwatched movies; Few unwatched movies; Mix of watched & unwatched; Few watched movies; Many watched movies.]
Labelled movies: The Shawshank Redemption; The Godfather; The Dark Knight; Titanic; The Phantom Menace; Twilight; New Moon; Wild Wild West; Transformers; The Good, The Bad, The Ugly; 12 Angry Men; 7 Samurai; Taare Zameen Par; Rang De Basanti; Yojinbo; 3 Idiots
We handle terabyte-size data via non-traditional analytics and visualise it in real-time.
Gramener visualises your data
Gramener transforms your data into concise dashboards that make your business problem & solution visually obvious. We help you find insights quickly, based on cognitive research, and our visualisations guide you towards actionable decisions.
A data analytics and visualisation company
MOST OF WHAT I DO TODAY IS
VISUALISING DATA ANOMALIES
India’s religions
Australia’s religions
As a Data Scientist, I’m quite intrigued by anomalies, and
ANOMALIES ARE EVERYWHERE…
100 YEARS OF INDIA’S WEATHER
[Figure: years 1901 to 2001 (labelled every ten years) against months Jan to Dec]
You don’t need sophisticated analyses for this
IT CAN BE EASY TO SPOT THEM
EDUCATION
PREDICTING MARKS
What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?
[Figures: four charts. Panel titles: TN CLASS X: ENGLISH; TN CLASS X: SOCIAL SCIENCE; TN CLASS X: MATHEMATICS; ICSE 2013 CLASS XII: TOTAL MARKS. Surviving axis ticks: 0 to 100 in steps of 5, and 0 to 40,000 in steps of 5,000.]
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons.
We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is our inexperience with the tools or analyses needed to identify such patterns.”
ENERGY UTILITY
BILLING FRAUD AT AN ENERGY UTILITY
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of
readings are aligned with the slab boundaries.
Below is a simple histogram (or frequency distribution) of usage levels.
Each bar represents the number of customers with a
specific bill amount (in units, or kWh).
Tariffs are based on the usage slab. Someone with 101 units is billed in
full at a higher tariff than someone with 100 units. So people have a
strong incentive to stay at or within a slab boundary.
An energy utility (with over 50 million
subscribers) had 10 years’ worth of
customer billing data available.
Most fraud detection software failed to
load the data, and sampled data
revealed little or no insight.
This can happen in one of two ways.
First, people may be monitoring their
usage very carefully, and turning off their
lights and fans the instant their usage
hits the slab boundary.
Or, more realistically, there’s probably some level of corruption
involved, where customers pay a small sum to the meter reading staff
to ensure that it stays exactly at the slab boundary, giving them the
advantage of a lower price.
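A rough sketch of how the pile-up at slab boundaries could be checked, assuming the readings are available as a plain list of units. The function name, the window of 5 units, the ratio threshold and the sample readings are all made up for illustration; they are not the utility's data or the method used in the talk.

```python
from collections import Counter

def boundary_spikes(readings, slab_boundaries, window=5, ratio=2.0):
    """Return slab boundaries whose count is unusually high versus nearby readings.

    The window size and ratio threshold are illustrative assumptions.
    """
    counts = Counter(readings)
    flagged = {}
    for b in slab_boundaries:
        # Average count of the readings just below and just above the boundary.
        neighbours = [counts.get(b + d, 0) for d in range(-window, window + 1) if d != 0]
        baseline = sum(neighbours) / len(neighbours)
        at_boundary = counts.get(b, 0)
        if baseline > 0 and at_boundary / baseline >= ratio:
            flagged[b] = round(at_boundary / baseline, 2)
    return flagged

# Made-up readings (in units): even usage from 90 to 110, plus a pile-up at exactly 100.
readings = [u for u in range(90, 111) for _ in range(10)] + [100] * 40
spikes = boundary_spikes(readings, slab_boundaries=[100, 200])
print(spikes)
```

A boundary is flagged only when its count stands well above its immediate neighbours, which is the shape the histogram makes visible at a glance.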
Subject      Girls higher by  Girls  Boys
Physics                    0    119   119
Chemistry                  1    123   122
English                    4    130   126
Computers                  6    137   131
Biology                    6    129   123
Mathematics               11    123   112
Language                  11    152   141
Accounting                12    138   126
Commerce                  13    127   114
Economics                 16    142   126
PERFORMANCE: GIRLS VS BOYS
[Figure labels (names): Jain, Harini, Shweta, Sneha, Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya]
Based on the results of the 20 lakh students taking the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.
June-borns score the lowest
The marks shoot up for Aug-borns
… and peak for Sep-borns
120 marks out of 1,200 explainable by month of birth
An identical pattern was observed in 2009 and 2010…
… and across districts, gender, subjects, and class X & XII.
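The month-of-birth comparison boils down to an average per month. A minimal sketch, assuming each record is a (birth month, total marks) pair; the function name and the sample records are invented for illustration and are not the Tamil Nadu results.

```python
from collections import defaultdict

def mean_marks_by_month(records):
    """records: iterable of (birth_month, total_marks). Returns {month: mean marks}."""
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of marks, number of students]
    for month, marks in records:
        totals[month][0] += marks
        totals[month][1] += 1
    return {m: s / n for m, (s, n) in sorted(totals.items())}

# Made-up records: (birth month, marks out of 1,200).
records = [(6, 700), (6, 720), (8, 780), (9, 800), (9, 820)]
means = mean_marks_by_month(records)
spread = max(means.values()) - min(means.values())
print(means, spread)
```

Running the same grouping separately per year, district, gender or subject is one way to check whether a pattern repeats, as the slides report it does.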
“It’s simply that in Canada the eligibility cutoff for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.”
-- Malcolm Gladwell, Outliers
LET’S LOOK AT 15 YEARS OF US BIRTH DATA
This is a dataset (1975 – 1990) that has
been around for several years, and has
been studied extensively. Yet, a
visualization can reveal patterns that
are neither obvious nor well known.
For example,
• Are birthdays uniformly distributed?
• Do doctors or parents exercise the C-section option to move dates?
• Is there any day of the month that has unusually high or low births?
• Are there any months with relatively high or low births?
Very high births in September.
But this is fairly well known.
Most conceptions happen during
the winter holiday season
Relatively few births during the
Christmas and Thanksgiving
holidays, as well as New Year and
Independence Day.
Most people prefer not
to have children on the
13th of any month, given
that it’s an unlucky day
Some special days like April
Fool’s day are avoided, but
Valentine’s Day is quite
popular
[Legend: More births / Fewer births … on average, for each day of the year (from 1975 to 1990)]
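The "on average, for each day of the year" figure can be approximated by averaging births per calendar day across years and dividing by the overall average. The sketch below assumes rows of (year, month, day, births); the function name and the numbers are hypothetical, not the US dataset.

```python
from collections import defaultdict

def births_index(daily_births):
    """daily_births: iterable of (year, month, day, births).

    Returns {(month, day): mean births on that calendar day / overall mean},
    so values above 1 mean more births than usual and below 1 mean fewer.
    """
    sums = defaultdict(lambda: [0, 0])  # (month, day) -> [total births, years seen]
    for _year, month, day, births in daily_births:
        sums[(month, day)][0] += births
        sums[(month, day)][1] += 1
    means = {k: s / n for k, (s, n) in sums.items()}
    if not means:
        return {}
    overall = sum(means.values()) / len(means)
    return {k: m / overall for k, m in means.items()}

# Made-up counts for three calendar days over two years.
rows = [
    (1975, 9, 12, 110), (1976, 9, 12, 110),
    (1975, 9, 13, 90),  (1976, 9, 13, 90),
    (1975, 9, 14, 100), (1976, 9, 14, 100),
]
index = births_index(rows)
print(index)
```

Colouring each calendar day by this index gives the more-births / fewer-births scale described in the legend.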
THE PATTERN IN INDIA IS QUITE DIFFERENT
This is a birth date dataset that’s
obtained from school admission data
for over 10 million children. When we
compare this with births in the US, we
see none of the same patterns.
For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?
Very few children are born in the
month of August, and thereafter.
Most births are concentrated in
the first half of the year
We see a large number of
children born on the 5th, 10th,
15th, 20th and 25th of each month
– that is, round numbered dates
Such round-numbered patterns are a
typical indication of fraud. Here,
birthdates are brought forward
to aid early school admission
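One simple way to surface round-numbered dates is to compare the count for each day of the month with the average across days. This is a sketch under assumptions: the function name, the threshold of 2 and the sample birthdates are illustrative only.

```python
from collections import Counter
from datetime import date

def day_of_month_excess(birthdates):
    """Return {day: count / mean count} for days 1 to 28 (days every month has)."""
    counts = Counter(d.day for d in birthdates)
    total = sum(counts.get(day, 0) for day in range(1, 29))
    if total == 0:
        return {}
    baseline = total / 28
    return {day: counts.get(day, 0) / baseline for day in range(1, 29)}

# Made-up data: one birth on each of 1 to 28 Jan 2010, plus extras on the 5th, 10th and 15th.
birthdates = [date(2010, 1, day) for day in range(1, 29)]
birthdates += [date(2010, 1, day) for day in (5, 10, 15) for _ in range(3)]

excess = day_of_month_excess(birthdates)
suspicious = [day for day, ratio in excess.items() if ratio > 2]  # threshold is an assumption
print(suspicious)
```

Days that stand far above the baseline are candidates for reported rather than actual birthdates; the ratio alone does not prove it.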
[Legend: More births / Fewer births … on average, for each day of the year (from 2007 to 2013)]
THIS ADVERSELY IMPACTS CHILDREN’S MARKS
It’s a well established fact that older
children tend to do better at school in
most activities. Since many children
have had their birth dates brought
forward, these younger children suffer.
Children “born” on the 1st, 5th, 10th, 15th etc. of the
month tend to score lower marks on average.
[Legend: Higher marks / Lower marks … on average, for children born on a given day of the year (from 2007 to 2013)]
Children “born” on round-numbered days score lower marks on average, due to a higher proportion of younger children
WHAT’S UNUSUAL ABOUT LOANS AFTER THE 20TH?
Every loan disbursed after the 20th of the month, i.e. from the 21st to
the end of the month, shows consistently lower non-performing assets
(i.e. better quality) than any loan disbursed on or before the 20th.
The bank mapped this back to their incentive scheme. The sales team’s
commission is based only on loans disbursed until the 20th. Hence new
loans are squeezed into this period without regard for their quality.
The personal finance division of a
bank, focusing on retail loans, drove
its sales through a branch sales team.
A study of the non-performing assets
of loans generated over the course of
one year shows a strange pattern.
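The comparison can be sketched as a non-performing rate on either side of the cutoff day, assuming each loan is reduced to (day of month disbursed, whether it became non-performing). The function name, the bucket labels and the sample loans are hypothetical.

```python
def npa_rate_by_period(loans, cutoff_day=20):
    """loans: iterable of (disbursal_day_of_month, is_non_performing).

    Returns the share of non-performing loans disbursed on or before the
    cutoff day and after it (None when a bucket has no loans).
    """
    buckets = {"on_or_before_cutoff": [0, 0], "after_cutoff": [0, 0]}  # [bad loans, all loans]
    for day, is_npa in loans:
        key = "on_or_before_cutoff" if day <= cutoff_day else "after_cutoff"
        buckets[key][0] += int(is_npa)
        buckets[key][1] += 1
    return {k: (bad / n if n else None) for k, (bad, n) in buckets.items()}

# Made-up loans: (day of month disbursed, became non-performing?).
loans = [(5, True), (12, False), (19, True), (20, False),
         (22, False), (25, False), (28, True), (30, False)]
rates = npa_rate_by_period(loans)
print(rates)
```

Computing the rate for every day of the month, rather than two buckets, would show where the break falls without assuming the 20th in advance.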
Analytics can detect something that you’re specifically looking for.
It takes a visual to detect what we don’t know to look for
This representation, known as a
calendar map, can show some
interesting patterns, particularly
weekday-based patterns, as the next
example will show.
RESTAURANT FOUND AN UNUSUAL DIP IN SALES
A restaurant chain had data for every
single transaction made over a few
years. Plotting this as a time series
showed them nothing unusual.
However, the same data on a calendar
map reveals a very different story.
Specifically, at the bottom-left point-of-sale terminal, sales dip
every Wednesday. At the bottom-right point-of-sale terminal, sales
rise every Wednesday (almost as if to compensate for the loss).
It turns out that the manager closes the bottom-left counter every
Wednesday afternoon due to shortage of staff, assuming that it results
in no loss of sales. There is, however, a net loss every Wednesday.
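A calendar map lays days out by week and weekday, which is why weekday effects stand out. The same signal can be sketched numerically by averaging daily sales per terminal and weekday. Everything here is an assumption for illustration: the function name, the terminal labels and the sales figures.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def mean_sales_by_weekday(transactions):
    """transactions: iterable of (date, terminal, amount).

    Returns {terminal: {weekday name: mean of daily sales totals}}.
    """
    daily = defaultdict(float)  # (terminal, date) -> total sales that day
    for day, terminal, amount in transactions:
        daily[(terminal, day)] += amount
    grouped = defaultdict(lambda: defaultdict(list))
    for (terminal, day), total in daily.items():
        grouped[terminal][WEEKDAYS[day.weekday()]].append(total)
    return {t: {wd: sum(v) / len(v) for wd, v in days.items()}
            for t, days in grouped.items()}

# Made-up data: two weeks from Monday 6 Jan 2014, with a Wednesday dip at one
# terminal and a smaller Wednesday rise at the other.
start = date(2014, 1, 6)
transactions = []
for offset in range(14):
    day = start + timedelta(days=offset)
    is_wed = day.weekday() == 2
    transactions.append((day, "bottom-left", 40.0 if is_wed else 100.0))
    transactions.append((day, "bottom-right", 130.0 if is_wed else 100.0))

weekday_means = mean_sales_by_weekday(transactions)
print(weekday_means["bottom-left"]["Wed"], weekday_means["bottom-right"]["Wed"])
```

In this made-up data the rise at one terminal does not make up for the dip at the other, so the two together still fall short on Wednesdays.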
But that’s not to say that simple techniques can spot everything
YOU CAN GO BEYOND “EASY”
WHAT’S SO SPECIAL ABOUT TOBACCO?
WHAT’S WRONG WITH THE MINERAL WATER?
Try it! All you need is some data and some curiosity to…
VISUALISE DATA YOURSELF!