satnam singh talk feb5 indian analytics and big data summit
Short Stories of Building
Data Science Products
February 5, 2015
Satnam Singh, PhD Data Scientist/Director, CA Technologies
Rocket Singh
Image credit: Movie Makers of Rocket Singh
- Salesman of the Year
- Wanted to become a Data Scientist
Dialog with Rocket Singh
Who is a Data Scientist?
“A Story Teller &…
Data Scientist
“An Expert Software Coder &…
Data Scientist
“A Data lake/ocean Swimmer &…
Data Scientist
Data Scientist
“A Statistician &…
“A Business Savvy Engineer…
Data Scientist
Dialog with Rocket Singh
How do I become a
Data Scientist?
Satnam – Becoming a
Data Scientist is
a fantastic and long
Journey
++ Big Data Processing Skills
Data Scientist Skills
Dialog with Rocket Singh
Can you tell me about a
data science product that
you have built?
Satnam – Many stories, here are
two stories:
1. Smartphones – story
2. Automobile - story
Sensor Data
- Recommendations - User Modeled Activities - Personalization
Social data
User data …
Analytics (Text Mining, Machine Learning, Data Mining)
Sensor Data
User Data
Social Data
3rd Party Applications, Native Applications
Smart Analytics in Smartphones
Smart Gallery
Image Credit: Photo
Organizer Fish Bowl
Smart Grouping
Smart Contacts
Image Credit: Contact+
Duplicate Names: “Satnam Singh”, “Satnam Singh, PhD”,
“Singh Sat”, “Satnam Bro”
Marie Brown
Brown Marie
Frnd
Marie B Bow
Problem: Find duplicate
contacts and compare your
algorithm with existing
technique in Android
Algo 1 in Android: a variant of longest common
substring match
Singh Algo: Lexical similarity for in-device
implementation and locality sensitive hashing (LSH)
for cloud implementation
Solve Duplicate Contacts Problem
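A minimal sketch of the lexical-similarity idea for in-device duplicate detection. The talk does not say which similarity measure was used, so this assumes names are compared as order-insensitive sets of character bigrams scored with Jaccard similarity. The function names, the list of ignored suffixes, and the 0.6 threshold are hypothetical. The LSH variant for the cloud is not shown.

```python
import re
from itertools import combinations

# Hypothetical titles/suffixes to ignore when comparing names (an assumption).
NOISE_TOKENS = {"phd", "dr", "mr", "mrs", "ms"}


def name_tokens(name):
    """Lowercase a contact name and return its word tokens as a set."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return {t for t in tokens if t not in NOISE_TOKENS}


def char_bigrams(token):
    """Return the character bigrams of a token (the token itself if one letter)."""
    return {token[i:i + 2] for i in range(len(token) - 1)} or {token}


def bigram_profile(name):
    """Union of the bigrams of every token, so word order does not matter."""
    grams = set()
    for token in name_tokens(name):
        grams |= char_bigrams(token)
    return grams


def lexical_similarity(name_a, name_b):
    """Jaccard similarity between the bigram profiles of two names."""
    a, b = bigram_profile(name_a), bigram_profile(name_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def find_duplicates(contacts, threshold=0.6):
    """Return contact pairs whose similarity reaches the (assumed) threshold."""
    return [
        (a, b)
        for a, b in combinations(contacts, 2)
        if lexical_similarity(a, b) >= threshold
    ]


contacts = ["Marie Brown", "Brown Marie", "Satnam Singh",
            "Satnam Singh, PhD", "Singh Sat"]
for pair in find_duplicates(contacts):
    print(pair)
```

Comparing every pair is quadratic in the number of contacts, which is tolerable for an address book on a phone; that cost is one reason a hashing scheme such as LSH becomes attractive at cloud scale.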
Data Variety and Data Fusion are “Hidden Treasure”
Image credit: Denise Lu
Data blending
Motivation: Field Failure Data for QRD Enhancement
GM’s Databases
Data collection
via data link
in service shops
Data collection via
Telematics
Field failure data: Diagnostic Trouble
Codes (DTCs), Operating Parameter
Identifiers (PIDs), Warranty Claims, etc.
OEM Databases
Test fleet,
Production fleet
Advanced
Analytical
Toolset
DTC
Design DTC
Software
Research
Parameter Identifiers (PIDs) – e.g. Engine
speed, vehicle speed, powertrain
voltage, environmental parameters, etc.
OEM’s Database
1. Data collection via diagnostic tools at dealer shops
Automobile field failure data
1. Data collection via telematics services
Data: DTCs, PIDs, Claims, etc.
PIDs data
Scan tools at
Dealer shops
400-600
PIDs
10-12 Unique
PIDs
SME
analyses and
test vehicle
validation
Business Case and Problem
DTC: Diagnostic Trouble Code (fault code)
Design and software enhancement
Reduce
NTFs
Characterize
Intermittent
faults
Enhance
DTC design
1. Field Failure Data
2. Data Transformation
3. Study PID Distributions
4. Analyze PID Correlations
5. PIDs selection (Decision Trees)
Filter
Preprocessing
Data
User Data
Selection
Tool Use for
Innovation and Business Impact:
PIDs Mining Tool
Problem Formulation and Solution
Feature
Selection and
Classification
Problem
400-600
Attributes
(PIDs)
10-20
Informative
Attributes
(PIDs)
$$(X, Y) = \{x_1, x_2, x_3, \ldots, x_n, y\}$$
$x_i$ are attribute vectors (i.e. PIDs data) for $m$ patterns (i.e. vehicles)
$y$ is the class variable (i.e. faults): $y=0$ (baseline), $y=1$ (intermittent fault)
Problem: Find informative features $X_{inf}$ that separate the intermittent
class from the baseline class
Problem Formulation and Solution
$$IG(S, A) = H(S) - \sum_{i=1}^{m} p_S(A_i)\, H(S_{A_i})$$
$$H(S) = -\sum_{j=1}^{n} p_S(j) \log_2 p_S(j)$$
Solution: Decision tree-based method (C4.5 Algorithm) using Entropy criterion
to select the attribute $A_i$ at each node of the decision tree
Stopping Criterion to stop the tree growth: Minimum no. of patterns (i.e. vehicles)
should be greater than Nmin (e.g. 10)
H(S): Entropy of the set S
IG(S, A): Gain in Entropy of the set S
after Split on Attribute A
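The two quantities defined above can be sketched in a few lines. This assumes the attribute A takes discrete values (a PID binned into levels such as "low" and "high"); the binning and the example data are hypothetical and not from the talk.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = -sum_j p_S(j) * log2 p_S(j), over the class labels j present in S."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels):
    """IG(S, A) = H(S) - sum_i p_S(A_i) * H(S_{A_i}) for a discrete attribute A.

    values[k] is the value of attribute A for pattern k, labels[k] its class.
    """
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical example: 4 vehicles, y=0 baseline, y=1 intermittent fault.
faults = [0, 0, 1, 1]
print(information_gain(["low", "low", "high", "high"], faults))  # 1.0
print(information_gain(["a", "b", "a", "b"], faults))            # 0.0
```

An attribute that separates the two classes perfectly recovers the full entropy of the set (1 bit here), while an attribute unrelated to the fault yields zero gain.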
Entropy Decision Forest
PIDs data for
Intermittent fault vs.
Baseline
Tree1
D1
Remove PIDs
that are in Tree
1 from input
data of Tree 2
Tree2
D2
Remove PIDs that
are in Tree 2 from
input data of Tree 3
Continue till
stopping
criteria met
...
Hypothesis 1 Hypothesis 2 Hypothesis n
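The loop in the diagram above can be sketched as follows. This is a simplified stand-in, not the C4.5 implementation from the talk: it grows binary threshold trees with plain information gain, and it assumes the forest stops when a tree can no longer find a useful split or when a hypothetical `max_trees` cap is reached. The PID names and data are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return (gain, feature, threshold) for the best binary split, or None."""
    base, n, best = entropy(labels), len(labels), None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = (base - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, f, t)
    return best


def tree_features(rows, labels, features, n_min=10):
    """Grow one entropy-based tree; return the set of PIDs it splits on."""
    # Stop when the node has no more than n_min patterns or is already pure.
    if len(labels) <= n_min or len(set(labels)) < 2:
        return set()
    split = best_split(rows, labels, features)
    if split is None:
        return set()
    _, f, t = split
    used = {f}
    for side in (True, False):
        idx = [i for i, r in enumerate(rows) if (r[f] <= t) == side]
        used |= tree_features([rows[i] for i in idx],
                              [labels[i] for i in idx], features, n_min)
    return used


def entropy_decision_forest(rows, labels, features, n_min=10, max_trees=10):
    """Grow trees in sequence, removing each tree's PIDs from the next input."""
    remaining, forest = list(features), []
    while remaining and len(forest) < max_trees:
        used = tree_features(rows, labels, remaining, n_min)
        if not used:  # assumed stopping criterion: no informative PID left
            break
        forest.append(sorted(used))
        remaining = [f for f in remaining if f not in used]
    return forest


# Hypothetical data: 24 vehicles, the last 12 with an intermittent fault.
rows = [{"engine_speed": 1000 + 100 * i,
         "powertrain_voltage_mv": 14000 - 100 * i,
         "ambient_temp": 20} for i in range(24)]
labels = [0] * 12 + [1] * 12
pids = ["engine_speed", "powertrain_voltage_mv", "ambient_temp"]
print(entropy_decision_forest(rows, labels, pids))
```

Each inner list is the set of PIDs used by one tree, so the forest as a whole yields several alternative PID subsets (the hypotheses in the diagram) for review.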
Entropy Decision Forest
Decision Tree 1
Decision Forest
Data Science Conference in Bangalore: March 18-21