data analytics in decision making
TRANSCRIPT
DATA ANALYTICS IN DECISION MAKING
S Anand, Chief Data Scientist, Gramener
DO THESE FOUR CITIES LOOK IDENTICAL TO YOU?
Average price is the same. Average sales is the same too.
Variance in price is the same. So is the variance in sales.
Take a look at the sales report alongside. A company has branches in 4 cities, and each branch changes the product price every month. This leads to a corresponding change in the sales.
Here is the performance of the four branches with their monthly price and sales for each month.
Looking at the average, the four branches have an identical performance.
2010       Boston          Chicago         Detroit         New York
Month      Price   Sales   Price   Sales   Price   Sales   Price   Sales
Jan        10.0    8.04    10.0    9.14    10.0    7.46    8.0     6.58
Feb        8.0     6.95    8.0     8.14    8.0     6.77    8.0     5.76
Mar        13.0    7.58    13.0    8.74    13.0    12.74   8.0     7.71
Apr        9.0     8.81    9.0     8.77    9.0     7.11    8.0     8.84
May        11.0    8.33    11.0    9.26    11.0    7.81    8.0     8.47
Jun        14.0    9.96    14.0    8.10    14.0    8.84    8.0     7.04
Jul        6.0     7.24    6.0     6.13    6.0     6.08    8.0     5.25
Aug        4.0     4.26    4.0     3.10    4.0     5.39    19.0    12.50
Sep        12.0    10.84   12.0    9.13    12.0    8.15    8.0     5.56
Oct        7.0     4.82    7.0     7.26    7.0     6.42    8.0     7.91
Nov        5.0     5.68    5.0     4.74    5.0     5.73    8.0     6.89
Average    9.0     7.50    9.0     7.50    9.0     7.50    9.0     7.50
Variance   10.0    3.75    10.0    3.75    10.0    3.75    10.0    3.75
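The Average and Variance rows can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module. It assumes the Variance row is a population variance (dividing by the number of months), since that is what reproduces 10.0 and 3.75. The names `DATA`, `summarise` and `SUMMARY` are illustrative only.

```python
from statistics import mean, pvariance

# Monthly price and sales per city (Jan to Nov), taken from the sales report.
SHARED_PRICE = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
DATA = {
    "Boston": (SHARED_PRICE,
               [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Chicago": (SHARED_PRICE,
                [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Detroit": (SHARED_PRICE,
                [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "New York": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                 [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(price, sales):
    """Return the four summary figures shown at the foot of the report."""
    return {
        "avg_price": mean(price),
        "avg_sales": mean(sales),
        # Assumption: population variance (divide by n), not sample variance.
        "var_price": pvariance(price),
        "var_sales": pvariance(sales),
    }


SUMMARY = {city: summarise(price, sales) for city, (price, sales) in DATA.items()}

for city, stats in SUMMARY.items():
    print(city, {name: round(value, 2) for name, value in stats.items()})
```

All four cities print the same rounded figures, which is exactly why the summary rows alone cannot tell the branches apart.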
DO YOU AGREE?
ARE THEY REALLY IDENTICAL? CHECK AGAIN…
But in fact, the four cities are totally different in behaviour.
Boston’s sales has generally increased with price.
Detroit has a nearly perfect increase in sales with price, except for one aberration.
Chicago shows a decline in sales beyond a price of 10.
New York’s sales fluctuates despite a nearly constant price.
[Figure: four chart panels, one each for Boston, Detroit, Chicago and New York]
[Figure: Sanctioned, Utilised and Gap, shown for Rural, Semi-urban, Urban, Metro and Total, by year from 2005 to 2015]
INVESTMENTS IN BIG DATA & ANALYTICS NEED NOT GUARANTEE BUSINESS EFFECTIVENESS
No coherent consumption
Enterprises have a disjoint view of data across divisions. This impedes org action & speed.
Last-mile disconnect
Processed & analyzed data is not presented effectively as a story. Meaningful consumption is an issue.
Longer Realizations
Implementation takes years. System stabilization takes 1-2 years or more, with prohibitive cost of change.
Org design Impedes
Org structures & authorization processes impede quick action after data bears needed action.
ENTERPRISES NEED HELP CROSSING THE ANALYTICS CHASM
COUNTER-INTUITION:
INSIGHTS FROM DATA
PREDICTING MARKS
“What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?
EDUCATION
TN CLASS X: ENGLISH
[Figure: distribution of marks; horizontal axis 0 to 99 in steps of 3, vertical axis 0 to 40,000 in steps of 5,000]
TN CLASS X: SOCIAL SCIENCE
[Figure: distribution of marks; horizontal axis 0 to 99 in steps of 3, vertical axis 0 to 40,000 in steps of 5,000]
TN CLASS X: MATHEMATICS
[Figure: distribution of marks; horizontal axis 0 to 99 in steps of 3, vertical axis 0 to 40,000 in steps of 5,000]
CBSE 2013 CLASS XII: ENGLISH MARKS
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons.
We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.
ENERGY UTILITY
AN ENERGY UTILITY DETECTED BILLING FRAUD
An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available.
Most fraud detection software failed to load the data, and sampled data revealed little or no insight.
Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh).
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.
Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.
This can happen in one of two ways.
First, people may be monitoring their usage very carefully, and turn off their lights and fans the instant their usage hits the slab boundary.
Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that the reading stays exactly at the slab boundary, giving them the advantage of a lower price.
This clearly shows collusion of some form with the customers.
This happens with specific customers, not randomly. Here are such customers’ meter readings.
Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
217     219     200     200     200     200     200     200     200     350     200     200
250     200     200     200     201     200     200     200     250     200     200     150
250     150     150     200     200     200     200     200     200     200     200     150
150     200     200     200     200     200     200     200     200     200     200     50
200     200     200     150     180     150     50      100     50      70      100     100
100     100     100     100     100     100     100     100     100     100     110     100
100     150     123     123     50      100     50      100     100     100     100     100
0       111     100     100     100     100     100     100     100     100     50      50
0       100     27      100     50      100     100     100     100     100     70      100
1       1       1       100     99      50      100     100     100     100     100     100
If we define the “extent of fraud” as the percentage excess of the 100-unit meter reading, the value varies considerably across sections and time … with some explainable anomalies.
Section     Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
Section 1   70%     97%     136%    65%     110%    116%    121%    107%    114%    88%     74%     109%
Section 2   66%     92%     66%     87%     70%     64%     63%     50%     58%     38%     41%     54%
Section 3   90%     46%     47%     43%     28%     31%     50%     32%     19%     38%     8%      34%
Section 4   44%     24%     36%     39%     21%     18%     24%     49%     56%     44%     31%     14%
Section 5   4%      63%     -27%    20%     41%     82%     26%     34%     43%     2%      37%     15%
Section 6   18%     23%     30%     21%     28%     33%     39%     41%     39%     18%     0%      33%
Section 7   36%     51%     33%     33%     27%     35%     10%     39%     12%     5%      15%     14%
Section 8   22%     21%     28%     12%     24%     27%     10%     31%     13%     11%     22%     17%
Section 9   19%     35%     14%     9%      16%     32%     37%     12%     9%      5%      -3%     11%
New section manager arrives … and is transferred out
Why would these happen?
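A “percentage excess” at a slab boundary could be computed along the following lines. This is only a sketch: the slides do not say what baseline the excess is measured against, so the function below assumes the baseline is the average number of readings at nearby values. The function name, the `window` parameter and the sample readings are all hypothetical.

```python
from collections import Counter


def boundary_excess(readings, boundary, window=5):
    """Percentage by which the count at `boundary` exceeds the average
    count at the neighbouring values within +/- `window` units.

    Assumption: neighbouring values are a fair baseline for what the
    count at the boundary would be without any tampering.
    """
    counts = Counter(readings)
    neighbours = [
        counts[boundary + offset]
        for offset in range(-window, window + 1)
        if offset != 0
    ]
    expected = sum(neighbours) / len(neighbours)
    if expected == 0:
        # No neighbouring readings at all: the excess is undefined, so
        # report infinity if anything sits on the boundary, else zero.
        return float("inf") if counts[boundary] else 0.0
    return 100.0 * (counts[boundary] - expected) / expected


# Hypothetical readings: one customer at each value from 95 to 105,
# plus two extra customers sitting exactly on the 100-unit boundary.
sample_readings = list(range(95, 106)) + [100, 100]
print(boundary_excess(sample_readings, boundary=100))
```

Running the same calculation per section and per month would give a grid comparable in shape to the section table, although the actual figures there may have been derived differently.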
SIMPLE HEURISTICS
EMERGENCY
“A man is rushed to a hospital in the throes of a heart attack.
The nurse needs to decide whether the victim should be admitted into emergency care.
Although this decision can save or cost a life, the nurse must decide using only the available cues, and within a few seconds – preferably using some fancy statistical software package.
SIMPLE HEURISTICS
EMERGENCY
[Figure: decision tree of three yes/no questions asked in sequence: Pressure < 91, Age > 62, Pulse > 100]
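A tree of three yes/no cues like this one is short enough to write as a handful of `if` statements. The sketch below is illustrative only. The transcript does not preserve which branch leads to which outcome, so the outcomes here are an assumption modelled on the commonly cited version of this heart-attack triage tree. The function and parameter names are hypothetical.

```python
def needs_emergency_care(systolic_pressure, age, pulse):
    """Walk the three cues in order and stop at the first decisive one.

    Assumed outcomes (not confirmed by the slide):
      1. Pressure < 91            -> admit
      2. otherwise, age <= 62     -> do not admit
      3. otherwise, pulse > 100   -> admit, else do not admit
    """
    # Cue 1: very low blood pressure is decisive on its own.
    if systolic_pressure < 91:
        return True
    # Cue 2: with adequate pressure, younger patients are treated as low risk.
    if age <= 62:
        return False
    # Cue 3: for older patients, a racing pulse tips the decision.
    return pulse > 100


print(needs_emergency_care(systolic_pressure=85, age=40, pulse=80))
print(needs_emergency_care(systolic_pressure=120, age=70, pulse=110))
print(needs_emergency_care(systolic_pressure=120, age=70, pulse=80))
```

The point of the heuristic is that at most three comparisons are needed, and often only one.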
Prediction vs Actual
                    Predicted: No churn                  Predicted: Churn
Actual: No churn    OK                                   WASTED (Marketing cost Rs 40)
Actual: Churn       MISSED (Acquisition cost Rs 80)      OK

MODEL           MISSED   WASTED   COST PER CUST.   IMPROVEMENT
Base            8.3%     0.0%     100              0.0%
Decision tree   3.2%     3.6%     61.7             39.3%
SVM             0.6%     2.5%     34.0             66.6%

[Figure: decision tree splitting on outgoing calls (0, 0-4, 5-14, 15+), refill amount > Rs 50 (Y/N) and > 1 recharge (Y/N), with leaves labelled 0 or 1]
TAKEAWAYS
1. In a single circle with 2 crore customers, this improvement represents a saving of Rs 2.6 x 2 cr ~ Rs 5 cr / month / circle
2. Testing structure allows us to test out any number of models, and evaluate their effectiveness
3. Need to trade off simplicity against over-fitting. Incremental improvements are often not worth the trouble
4. Implementation needs to be constantly monitored, with continuous re-evaluation of the model
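The Rs 2.6 figure in the first takeaway can be reproduced from the error rates and the two unit costs. The sketch below assumes the cost per customer is simply each error rate multiplied by its cost (Rs 80 for a missed churner, Rs 40 for wasted marketing). The slide does not state the formula, and the indexed “cost per cust.” figures on the slide may have been derived differently. The names are illustrative.

```python
COST_MISSED = 80  # Rs: acquisition cost when a churner is not caught
COST_WASTED = 40  # Rs: marketing cost spent on a customer who would have stayed

# (missed rate, wasted rate) for each model, as quoted on the slide.
MODELS = {
    "Base": (0.083, 0.000),
    "Decision tree": (0.032, 0.036),
    "SVM": (0.006, 0.025),
}


def cost_per_customer(missed_rate, wasted_rate):
    """Expected cost of prediction errors per customer, in rupees.

    Assumption: cost is the two error rates weighted by their unit costs.
    """
    return missed_rate * COST_MISSED + wasted_rate * COST_WASTED


base_cost = cost_per_customer(*MODELS["Base"])
for name, rates in MODELS.items():
    cost = cost_per_customer(*rates)
    print(f"{name}: Rs {cost:.2f} per customer, saving Rs {base_cost - cost:.2f}")
```

Under this assumption the decision tree saves about Rs 2.64 per customer over the base case, in line with the Rs 2.6 quoted above. Across 2 crore customers that is roughly Rs 5 crore.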
ANALYSING CAUSAL DRIVERS
We group by every input factor … and calculate the impact on every metric.
By moving from average to the best group, what’s the improvement?
The actual performance by each group is shown:
0-3m    3-6m    6m-1yr  1-2 yrs  > 2 yrs
11      12.3    12.7    15.3     16.1
Only significant results shown
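The group-and-compare step can be sketched in plain Python: group the records by one input factor, average the metric within each group, and measure how far the best group sits above the overall average. The records, field names and function name below are hypothetical. The significance filtering mentioned on the slide is not covered here.

```python
from collections import defaultdict
from statistics import mean


def best_group_uplift(records, factor, metric):
    """Return (overall average, best group, best group's average, uplift)."""
    groups = defaultdict(list)
    for row in records:
        groups[row[factor]].append(row[metric])
    group_means = {group: mean(values) for group, values in groups.items()}
    overall = mean([row[metric] for row in records])
    best = max(group_means, key=group_means.get)
    return overall, best, group_means[best], group_means[best] - overall


# Hypothetical records: one row per employee, with two input factors.
records = [
    {"tenure": "0-3m", "channel": "branch", "sales": 10},
    {"tenure": "0-3m", "channel": "online", "sales": 12},
    {"tenure": "> 2 yrs", "channel": "branch", "sales": 15},
    {"tenure": "> 2 yrs", "channel": "online", "sales": 17},
]

# Repeat the same calculation for every input factor.
for factor in ("tenure", "channel"):
    overall, best, best_mean, uplift = best_group_uplift(records, factor, "sales")
    print(f"{factor}: best group {best!r} averages {best_mean} "
          f"vs {overall} overall (uplift {uplift})")
```

Ranking the factors by uplift gives a first indication of which inputs are worth investigating as drivers.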
EMERGENT PATTERNS
DIRECTORSHIPS AT THE TATAS
Every person who was a Director at the Tata Group is shown here as an orange circle. The size of the circle is based on the number of directorship positions held over their lifetime.
Every company in the Tata Group is shown here as a blue circle. The size of the circle is based on the number of directors the company has had over time.
Every directorship relation is shown by a line. If a person has held a directorship position at a company, the two are connected by a line.
The group appears to be divided into two clusters based on the network of directorship roles.
[Figure: network diagram of directors and companies]
Companies labelled: Tata Teleservices, Tata Consultancy Services, Tata Business Support Services, Tata Global Beverages, Tata Infotech (merged), Tata Toyo Radiator, Honeywell Automation India, Tata Communications, A G C Networks, Tata Technologies, Tata Projects, Tata Power, Tata Finance, Idea Cellular, Tata Motors, Tata Sons, Tata Steel, Tayo Rolls, Tata Securities, Tata Coffee, Tata Investment Corp
Directors labelled: A J Engineer, H H Malgham, H K Sethna, Keshub Mahindra, Ravi Kant, Russi Mody, Sujit Gupta, A S Bam, Amal Ganguli, D B Engineer, D N Ghosh, M N Bhagwat, N N Kampani, U M Rao, B Muthuraman, Ishaat Hussain, J J Irani, N A Palkhivala, N A Soonawala, R Gopalakrishnan, Ratan Tata, S Ramadorai, S Ramakrishnan
Annotations:
First group of companies
Second group of companies
Prominent leaders bridge the groups
Some directors are mainly associated with the first group of companies
Some directors are mainly associated with the second group of companies
SIMILARITIES IN AN SME TRANSACTION NETWORK
The same visual was applied to the SME clientele of a bank
• Identified clusters of SMEs transacting with each other
• Targeted non-clients in the middle of a client cluster
• Enhanced service for client in the middle of non-clients
This resulted in a 28% QOQ GROWTH in new accounts (against a default QoQ base of 3-8% in the city for the last 5 years)
We’ve used network diagrams to detect terrorism, corporate fraud, de-dup customers, and identify product affinities
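Finding “non-clients in the middle of a client cluster” can be approximated without a visual by asking, for each non-client, what share of its trading partners are already clients. This is a minimal sketch under that assumption and not necessarily how the bank's analysis was done. The SME names, transactions and function name are hypothetical.

```python
from collections import defaultdict


def client_share_of_partners(transactions, clients):
    """For every non-client, return the fraction of its trading partners
    that are existing clients."""
    partners = defaultdict(set)
    for payer, payee in transactions:
        partners[payer].add(payee)
        partners[payee].add(payer)
    return {
        sme: len(peers & clients) / len(peers)
        for sme, peers in partners.items()
        if sme not in clients
    }


# Hypothetical transaction pairs between SMEs; A, B and C are bank clients.
transactions = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("X", "A"), ("X", "B"), ("X", "C"),
    ("Y", "X"), ("Y", "Z"),
]
clients = {"A", "B", "C"}

shares = client_share_of_partners(transactions, clients)
# Non-clients ranked from most to least surrounded by clients.
for sme, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(sme, round(share, 2))
```

Here X trades mostly with clients and would be the first non-client to approach. Swapping the roles of clients and non-clients gives the list of clients surrounded by non-clients.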
MONITORING PERFORMANCE
PORTFOLIO PERFORMANCE VISUAL
Worldwide: $288.0 mn
  A: Accelerate: $68.9 mn
  B: Build: $77.2 mn
  C: Cut down: $141.9 mn

Worldwide: $288 mn (breakdown by country, channel and product, in $ mn)
  UK: 87.0
    Stores: 34.4 (Product 9: 6.2, Product 10: 5.4, Product 7: 5.1, Product 15: 4.8, Product 8: 3.1, Product 14: 2.1)
    Partners: 29.2 (Product 15: 6.7, Product 17: 4.1, Product 6: 3.4, Product 1: 3.2, Product 7: 2.9, Product 11: 2.4)
    Direct: 23.5 (Product 17: 5.2, Product 8: 4.4, Product 16: 4.0, Product 14: 2.5, Product 1: 2.5)
  Japan: 71.9
    Stores: 25.9 (Product 14: 6.0, Product 7: 5.4, Product 11: 4.0, Product 17: 2.8)
    Partners: 25.5 (Product 8: 8.2, Product 11: 3.6, Product 16: 3.3, Product 1: 3.1, Product 9: 2.0)
    Direct: 20.5 (Product 11: 5.2, Product 15: 4.5, Product 14: 2.8, Product 9: 2.3)
  China: 65.6
    Partners: 27.3 (Product 10: 8.0, Product 3: 7.1, Product 15: 3.0, Product 2: 2.1, Product 8: 2.0)
    Direct: 19.6 (Product 3: 5.5, Product 2: 4.7, Product 8: 2.6, Product 17: 2.1)
    Stores: 18.7 (Product 10: 5.4, Product 14: 2.2, Product 7: 2.1, Product 15: 2.0)
  India: 46.6
    Stores: 17.5 (Product 16: 6.8)
    Direct: 15.6 (Product 10: 3.4, Product 16: 2.9, Product 17: 2.5, Product 7: 2.4)
    Partners: 13.4 (Product 8: 2.5, Product 7: 2.3)
  US: 17.0
    Partners: 6.0 (Product 10: 4.4)
    Direct: 5.8 (Product 11: 3.9)
    Stores: 5.3 (Product 11: 3.8)
The visualization shows the market opportunities across various countries to identify areas of focus. This chart has been built as an interactive app to present the key findings, while letting users click through and drill down to a custom view across 4 different levels.
BANKING DASHBOARD
Product Profitability
Cross Holding Analysis
ATM Transactions
Branch Performance
Employee Productivity
600+ mn transactions
40+ GB of data
11,000+ ATMs
2000+ Branches
120+ products
Hourly view
Data processed
LIVE MONITORING: IMPACT OF BUDGET ON STOCKS
LEVERAGING CROSS-SELL
FINDING PATTERNS
“Which securities move together?
How should I diversify?
What should I sell to reduce risk?
What’s a reliable predictor of a security?
SECURITIES
68% correlation between AUD & EUR
Plot of 6 month daily AUD - EUR values
Block of correlated currencies … clustered hierarchically
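The two steps named here, measuring the correlation between pairs of series and then grouping the series that move together, can be sketched in plain Python. The clustering below is a simple single-linkage scheme with a fixed correlation threshold, chosen for brevity and not taken from the slides. The price series are made up and do not reproduce the 68% figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def cluster(series, threshold):
    """Single-linkage grouping: keep merging two groups while any pair of
    members across them correlates at or above `threshold`."""
    clusters = [[name] for name in series]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                linked = any(
                    pearson(series[a], series[b]) >= threshold
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if linked:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters


# Hypothetical daily values for four currencies (not real market data).
SERIES = {
    "AUD": [1, 2, 3, 4, 5],
    "EUR": [2, 4, 6, 8, 11],
    "JPY": [5, 3, 4, 1, 2],
    "CHF": [10, 6, 8, 2, 5],
}

print(cluster(SERIES, threshold=0.6))
```

With real data, lowering the threshold step by step and recording each merge would give the hierarchy that a clustered correlation matrix displays.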
RESTAURANT: PRODUCT SALES CORRELATION
RESTAURANT FOUND AN UNUSUAL DIP IN SALES
A restaurant chain had data for every single transaction made over a few years. Plotting this as a time series showed them nothing unusual.
However, the same data on a calendar map reveals a very different story.
Specifically, at the bottom-left point-of-sale terminal, sales dip every Wednesday. At the bottom-right point-of-sale terminal, sales rise every Wednesday (almost as if to compensate for the loss).
It turns out that the manager closes the bottom-left counter every Wednesday afternoon due to a shortage of staff, assuming that it results in no loss of sales. There is, however, a net loss every Wednesday.
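The weekday pattern that the calendar map makes visible can also be surfaced numerically by averaging daily sales per terminal and weekday. This is a rough sketch with made-up transactions. The terminal names, amounts and function name are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_profile(transactions):
    """Average daily sales for every (terminal, weekday) pair."""
    # First total the transactions for each terminal on each day...
    daily = defaultdict(float)
    for day, terminal, amount in transactions:
        daily[(terminal, day)] += amount
    # ...then average those daily totals by weekday.
    by_weekday = defaultdict(list)
    for (terminal, day), total in daily.items():
        by_weekday[(terminal, WEEKDAYS[day.weekday()])].append(total)
    return {key: mean(totals) for key, totals in by_weekday.items()}


# Hypothetical transactions: (date, terminal, amount) over two weeks.
transactions = [
    (date(2024, 1, 2), "left", 100), (date(2024, 1, 2), "right", 100),   # Tuesday
    (date(2024, 1, 3), "left", 40), (date(2024, 1, 3), "right", 130),    # Wednesday
    (date(2024, 1, 9), "left", 110), (date(2024, 1, 9), "right", 90),    # Tuesday
    (date(2024, 1, 10), "left", 50), (date(2024, 1, 10), "right", 140),  # Wednesday
]

profile = weekday_profile(transactions)
for (terminal, weekday), average in sorted(profile.items()):
    print(terminal, weekday, average)
```

In this made-up data the right terminal gains on Wednesdays, but by less than the left terminal loses, which is the same net-loss pattern described above.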
BANK FOUND ALL LOANS BEFORE 20TH POOR
The personal finance division of a bank, focusing on retail loans, drove its sales through a branch sales team.
A study of the non-performing assets of loans generated over the course of one year shows a strange pattern.
Every loan disbursed after the 20th of the month, i.e. from the 21st to the end of the month, shows consistently lower non-performing assets (i.e. better quality) than any loan disbursed on or before the 20th.
The bank mapped this back to their incentive scheme. The sales team’s commission is based only on loans disbursed until the 20th. Hence new loans are squeezed into this period without regard for their quality.
Analytics can detect something that you’re specifically looking for.
It takes a visual to detect what we don’t know to look for
This representation, known as a calendar map, can show some interesting patterns, particularly weekday-based patterns, as the next example will show.
MONITORING SOCIAL MEDIA
UNSTRUCTURED CONTENT
How does the Mahabharata, one of the largest epics with 1.8 million words, lend itself to text analytics?
Can this ‘unstructured data’ be processed to extract analytical insights?
What does sentiment analysis of this tome convey?
Is there a better way to explore relations between characters?
How can closeness of characters be analysed & visualized?
VISUALISING THE MAHABHARATA
3642  LIC
3148  MTNL
2494  BSES
444   RELIANCE ENERGY
426   ESCROW
396   ICICI
378   CLG RTD
294   MAHANAGAR GAS
232   HDFC
216   MAHANGAR GAS LTD
212   ORANGE
204   LIC OF INDIA
190   ESCROW A/C
BUILDING ANALYTIC CAPABILITY
DATA → INSIGHTS → ACTION
TWO ROUTES TO BUILDING ANALYTIC CAPABILITY
Business driven approach: Stakeholder groups have a set of Objectives that can be met by Initiatives which answer specific Questions using Data.
Data driven approach: Data suggests Questions that can address Initiatives that meet Objectives for Stakeholder groups.
Importance: Revenue impact, Breadth of usage, Effort reduction
Ease: Data availability, Technology feasibility
Quick wins: Start small with quick wins
Strategic: Cover strategic landscape
Deferred: Deferreds become easier with growing capability
Actions
Gap in current reports
Addressed by current reports
1
2
TYPICAL INITIATIVES WE SEE ACROSS BANKS TODAY
Performance: Deposit mobilisation, Product performance, Branch performance, Employee performance, Transaction performance (e.g. ATM)
Product management: Product bundling, Competitive positioning
Customer mgmt: Predicting churn, Driving cross-sell, Product recommendations
Risk management: Fraud detection, Scenario modelling (e.g. interest rate change)
Client communication: Data driven insights in statements, Social listening
Infrastructure Initiatives in parallel: Digitisation and Data Cleansing
NEW TECHNIQUES MAKE THESE POSSIBLE
The visuals shown in the earlier slides were created using the Gramener visualization server, which leverages some of the recent innovations at Gramener in automating visualizations, analysis and narration.

Visualizations
Visuals are templatized. As the data or the parameters change, the visuals are re-drawn to match the data, ensuring that the view shows live data in real-time.
For e.g., this has been used to
• view social media events
• election results
• oil leakages in fuel stations
• monitor retail inventory
• plan truck delivery
• monitor sentiments on social media

Analysis
We’ve extracted common patterns of insights that apply across all datasets. When data is fed in, these automated analysis components perform a sequence of analytic steps and display results visually.
This has been applied to
• identify which security would go well with a given portfolio
• predict which telecom customers will leave
• assess the impact of changing delivery channel for proxy votes

Narration
Binding visuals together into a logical story using text or audio is an integral part of communicating insights. This too is automated in Gramener’s visualizations.
This has been applied to
• automatically “writing” a newspaper column on the day’s stock market
• automatically writing the report summarising the status of clinical trials
• automated videos

AUTOMATION
These techniques are focused on automating patterns of insights made by humans, effectively systematizing the “magic” that happens when we find something interesting in data. This is similar to how chess playing programs work. It’s not intelligent, as such. It just calculates and evaluates so many moves automatically that it seems intelligent.
TAKE YOUR NEXT STEP TOWARDS
DATA-DRIVEN LEADERSHIP
S Anand, Chief Data Scientist, Gramener