satnam singh talk feb5 indian analytics and big data summit

Short Stories of Building Data Science Products February 5, 2015 Satnam Singh, PhD Data Scientist/Director, CA Technologies

Upload: satnam-singh-phd

Post on 28-Jul-2015


TRANSCRIPT

Page 1: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Short Stories of Building

Data Science Products

February 5, 2015

Satnam Singh, PhD Data Scientist/Director, CA Technologies

Page 2: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Rocket Singh

Image credit: Movie Makers of Rocket Singh

- Salesman of the Year

- Wanted to become a Data Scientist

Page 3: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Dialog with Rocket Singh

Who is a Data Scientist?

Page 4: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

“A Story Teller &…

Data Scientist

Page 5: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

“An Expert Software Coder &…

Data Scientist

Page 6: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

“A Data lake/ocean Swimmer &…

Data Scientist

Page 7: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Data Scientist

“A Statistician &…

Page 8: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

“A Business-Savvy Engineer…

Data Scientist

Page 9: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Dialog with Rocket Singh

How do I become a Data Scientist?

Satnam – Becoming a Data Scientist is a fantastic and long journey

Page 10: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

++ Big Data Processing Skills

Data Scientist Skills

Page 11: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Dialog with Rocket Singh

Can you tell me about a data science product that you have built?

Satnam – Many stories, here are two stories:

1. Smartphones story

2. Automobile story

Page 12: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit


Sensor Data, User Data, Social Data, …

Analytics (Text Mining, Machine Learning, Data Mining)

- Recommendations
- User Modeled Activities
- Personalization

3rd Party Applications, Native Applications

Smart Analytics in Smartphones

Page 13: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit


Smart Gallery

Image Credit: Photo Organizer Fish Bowl

Smart Grouping

Page 14: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit


Smart Contacts

Image Credit: Contact+

Duplicate Names: “Satnam Singh”, “Satnam Singh, PhD”, “Singh Sat”, “Satnam Bro”

Marie Brown

Brown Marie

Frnd

Marie B Bow

Problem: Find duplicate contacts and compare your algorithm with the existing technique in Android

Page 15: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit


Algo 1 in Android: a variant of longest common substring match

Singh Algo: Lexical similarity for in-device implementation and locality-sensitive hashing (LSH) for cloud implementation

Solve Duplicate Contacts Problem
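The contrast between a substring match and a lexical similarity measure can be sketched in a few lines. This is only an illustration, not the algorithm from the talk: the slide does not say which lexical similarity was used, so word-token Jaccard similarity is an assumption here, as are the helper names (`normalize`, `token_jaccard`, `find_duplicates`) and the 0.5 threshold. The LSH variant for the cloud is not shown.

```python
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and keep only letters, digits, and single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word tokens, so word order is ignored."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_duplicates(names, threshold=0.5):
    """Return pairs of names whose token Jaccard similarity meets the threshold.

    The threshold value is an illustrative assumption, not a tuned setting.
    """
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if token_jaccard(a, b) >= threshold
    ]


# Contact names taken from the Smart Contacts slide.
contacts = ["Satnam Singh", "Satnam Singh, PhD", "Marie Brown", "Brown Marie"]
print(find_duplicates(contacts))
```

For “Marie Brown” and “Brown Marie”, the longest common substring covers only 5 of 11 characters because the words are swapped, while the token-based score is 1.0. Names such as “Satnam Bro” or “Singh Sat” would need a finer-grained measure (for example character n-grams), which this sketch does not attempt.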

Page 16: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit


Data Variety and Data Fusion Are the “Hidden Treasure”

Image credit: Denise Lu

Data blending

Page 17: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit
Page 19: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

DTC

Design DTC

Software

Research

Parameter Identifiers (PIDs) – e.g. engine speed, vehicle speed, powertrain voltage, environmental parameters, etc.

OEM’s Database

1. Data collection via diagnostic tools at dealer shops

Automobile field failure data

1. Data collection via telematics services

Data: DTCs, PIDs, Claims, etc.

PIDs data

Scan tools at dealer shops

400-600 PIDs

10-12 unique PIDs

SME analyses and test vehicle validation

Business Case and Problem

DTC: Diagnostic Trouble Code (fault code)

Design and software enhancement

Page 20: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Tool Use for:

- Reduce NTFs
- Characterize intermittent faults
- Enhance DTC design

1. Field Failure Data

2. Data Transformation

3. Study PID Distributions

4. Analyze PID Correlations

5. PIDs Selection (Decision Trees)

Filter, Preprocessing, Data, User Data, Selection

Innovation and Business Impact: PIDs Mining Tool


Page 21: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Problem Formulation and Solution

Feature Selection and Classification Problem

400-600 Attributes (PIDs)

10-20 Informative Attributes (PIDs)

$$(X, Y) = \{x_1, x_2, x_3, \ldots, x_n, y\}$$

$x_i$ are attribute vectors (i.e. PIDs data) for $m$ patterns (i.e. vehicles)

$y$ is the class variable (i.e. faults): $y=0$ (baseline), $y=1$ (intermittent fault)

Problem: Find informative features $X_{inf}$ that separate the intermittent class from the baseline class


Page 22: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Problem Formulation and Solution


$$IG(S, A) = H(S) - \sum_{i=1}^{m} p(A_i)\, H(S_{A_i})$$

$$H(S) = -\sum_{j=1}^{n} p_S(j) \log_2 p_S(j)$$

Solution: Decision tree-based method (C4.5 algorithm) using the entropy criterion to select the attribute $A_i$ in the decision tree

Stopping criterion for tree growth: the minimum number of patterns (i.e. vehicles) should be greater than $N_{min}$ (e.g. 10)

H(S): Entropy of the set S

IG(S, A): Gain in entropy (information gain) of the set S after a split on attribute A
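As a small worked example of the two quantities defined on this slide, the sketch below computes H(S) and IG(S, A) for a binary threshold split on one numeric attribute. The PID values, the threshold, and the function names are hypothetical. It computes plain information gain as written on the slide; C4.5 itself additionally normalizes this into a gain ratio, which is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S) = -sum_j p_S(j) * log2 p_S(j) over the class labels in S."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """IG(S, A) for a binary split of numeric attribute A at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split leaves S unchanged, so nothing is gained
    total = len(labels)
    weighted = sum(len(part) / total * entropy(part) for part in (left, right))
    return entropy(labels) - weighted


# Hypothetical toy data: one PID reading per vehicle, label 1 = intermittent fault.
engine_speed = [700, 720, 710, 1500, 1550, 1480]
fault = [0, 0, 0, 1, 1, 1]

print(entropy(fault))                               # 1.0
print(information_gain(engine_speed, fault, 1000))  # 1.0
```

With three baseline and three faulty vehicles the entropy is exactly 1 bit, and a threshold that separates the two classes perfectly recovers all of it.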

Page 23: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Entropy Decision Forest

PIDs data for intermittent fault vs. baseline

Tree 1, D1

Remove PIDs that are in Tree 1 from input data of Tree 2

Tree 2, D2

Remove PIDs that are in Tree 2 from input data of Tree 3

Continue till stopping criteria met

...

Hypothesis 1, Hypothesis 2, ..., Hypothesis n
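The loop on this slide (grow a tree, remove its PIDs, grow the next tree on what is left) can be sketched as follows. This is a minimal illustration in pure Python, not the tool described in the talk: the tree builder is a simplified entropy-based splitter rather than full C4.5, the forest-level stopping rule (stop when no tree can be grown or after `max_trees` trees) is an assumption because the slide does not spell it out, and the PID names and values are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return (gain, feature, threshold) of the best binary split, or None."""
    base, best = entropy(labels), None
    for f in features:
        # Skip the largest value so the right-hand side is never empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, f, t)
    return best


def tree_features(rows, labels, features, n_min):
    """Grow one tree and return the set of features (PIDs) it splits on."""
    # Split a node only if it holds more than n_min patterns and is not pure.
    if len(rows) <= n_min or len(set(labels)) < 2:
        return set()
    split = best_split(rows, labels, features)
    if split is None:
        return set()
    _, f, t = split
    used = {f}
    for side in (True, False):
        idx = [i for i, r in enumerate(rows) if (r[f] <= t) == side]
        used |= tree_features([rows[i] for i in idx], [labels[i] for i in idx], features, n_min)
    return used


def entropy_decision_forest(rows, labels, n_min=2, max_trees=10):
    """Grow trees in sequence, removing each tree's PIDs from the next tree's input."""
    remaining = set(rows[0])
    forest = []
    while remaining and len(forest) < max_trees:
        used = tree_features(rows, labels, sorted(remaining), n_min)
        if not used:
            break
        forest.append(used)   # one tree = one hypothesis
        remaining -= used
    return forest


# Hypothetical PID readings per vehicle; label 1 = intermittent fault, 0 = baseline.
rows = [
    {"engine_speed": 700, "voltage": 12.1, "ambient_temp": 20},
    {"engine_speed": 720, "voltage": 12.0, "ambient_temp": 31},
    {"engine_speed": 710, "voltage": 12.2, "ambient_temp": 25},
    {"engine_speed": 1500, "voltage": 10.9, "ambient_temp": 22},
    {"engine_speed": 1550, "voltage": 11.0, "ambient_temp": 30},
    {"engine_speed": 1480, "voltage": 10.8, "ambient_temp": 26},
]
labels = [0, 0, 0, 1, 1, 1]

forest = entropy_decision_forest(rows, labels)
for k, pids in enumerate(forest, start=1):
    print(f"Tree {k} uses PIDs: {sorted(pids)}")
```

Because each tree is barred from reusing PIDs chosen by earlier trees, the forest yields several alternative sets of PIDs that separate the classes, rather than only the single best one. Later trees tend to rely on weaker PIDs, so in practice each hypothesis would still need expert review.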


Page 24: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Entropy Decision Forest


Decision Tree 1

Decision Forest

Page 25: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Thanks

@satnam74s

[email protected]

Page 26: Satnam Singh Talk Feb5 Indian Analytics and Big Data Summit

Data Science Conference in Bangalore: March 18-21