Satnam Singh talk, Feb 5, Indian Analytics and Big Data Summit

Post on 28-Jul-2015

TRANSCRIPT

Short Stories of Building

Data Science Products

February 5, 2015

Satnam Singh, PhD Data Scientist/Director, CA Technologies

Rocket Singh

Image credit: Movie Makers of Rocket Singh

- Salesman of the Year

- Wanted to become a Data Scientist

Dialog with Rocket Singh

Who is a Data Scientist?

“A Story Teller &…

Data Scientist

“An Expert Software Coder &…

Data Scientist

“A Data lake/ocean Swimmer &…

Data Scientist

Data Scientist

“A Statistician &…

“A Business-Savvy Engineer…

Data Scientist

Dialog with Rocket Singh

How do I become a Data Scientist?

Satnam: Becoming a Data Scientist is a fantastic and long journey

++ Big Data Processing Skills

Data Scientist Skills

Dialog with Rocket Singh

Can you tell me about a data science product that you have built?

Satnam: Many stories; here are two:

1. Smartphones story

2. Automobile story

Sensor Data

- Recommendations - User Modeled Activities - Personalization

Social data

User data …

Analytics (Text Mining, Machine Learning, Data Mining)

Sensor Data

User Data

Social Data

3rd Party Applications, Native Applications

Smart Analytics in Smartphones

Smart Gallery

Image Credit: Photo

Organizer Fish Bowl

Smart Grouping

Smart Contacts

Image Credit: Contact+

Duplicate names: “Satnam Singh”, “Satnam Singh, PhD”, “Singh Sat”, “Satnam Bro”

Marie Brown

Brown Marie

Frnd

Marie B Bow

Problem: Find duplicate contacts and compare your algorithm with the existing technique in Android

Algo 1 in Android: a variant of longest common substring match
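The slide does not say which variant Android uses, so the following is only a minimal sketch of a plain longest-common-substring score. The function names and the normalization by the shorter name are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous block shared by a and b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def lcs_score(a: str, b: str) -> float:
    """Length of the longest common substring divided by the shorter name's length.

    The normalization is an assumption; the talk does not specify one.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return len(longest_common_substring(a, b)) / shorter
```

A score like this handles suffixes well ("Satnam Singh" vs. "Satnam Singh, PhD") but drops sharply when the same words appear in a different order ("Marie Brown" vs. "Brown Marie"), which is one reason to look at token-level measures.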

Singh Algo: Lexical similarity for in-device implementation and locality-sensitive hashing (LSH) for cloud implementation
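The talk does not define the lexical similarity measure, so this is one possible sketch rather than the speaker's algorithm: names are split into word tokens, a small hypothetical list of titles is dropped, and each token of the shorter name is matched to its best counterpart using a character-bigram Dice coefficient. `TITLES`, `dice`, and `name_similarity` are illustrative names.

```python
import re

# Hypothetical stop-list of titles to ignore; not taken from the talk.
TITLES = {"phd", "dr", "mr", "mrs", "ms"}


def tokens(name: str) -> set:
    """Lower-case alphabetic words of a contact name, minus titles."""
    words = re.findall(r"[a-z]+", name.lower())
    return {w for w in words if w not in TITLES}


def bigrams(word: str) -> set:
    """Character bigrams of a word; a one-letter word stands for itself."""
    return {word[i:i + 2] for i in range(len(word) - 1)} or {word}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, in [0, 1]."""
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def name_similarity(n1: str, n2: str) -> float:
    """Average best-match score of the smaller token set against the larger one."""
    t1, t2 = tokens(n1), tokens(n2)
    if not t1 or not t2:
        return 0.0
    small, large = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    return sum(max(dice(s, l) for l in large) for s in small) / len(small)
```

Because the comparison is token-based, word order and trailing titles do not affect the score, so "Marie Brown" and "Brown Marie" are treated as the same name. A threshold for declaring two contacts duplicates would have to be tuned on real contact data.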

Solve Duplicate Contacts Problem
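For the cloud side, where comparing every pair of contacts is too expensive, LSH narrows the search to likely duplicates. The sketch below uses MinHash signatures with banding over character bigrams; this is a common LSH scheme, chosen here as an assumption since the talk does not say which LSH family was used. `NUM_HASHES`, `BANDS`, and all function names are illustrative.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32  # signature length (assumed value)
BANDS = 16       # bands of NUM_HASHES // BANDS rows each (assumed value)


def shingles(name: str, k: int = 2) -> set:
    """Character k-grams of the name with case and punctuation removed."""
    s = "".join(ch for ch in name.lower() if ch.isalnum())
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def _hash(seed: int, shingle: str) -> int:
    """Deterministic 64-bit hash; each seed acts as a separate hash function."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(name: str) -> tuple:
    """MinHash signature: the minimum hash of the shingle set per seed."""
    sh = shingles(name)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def candidate_pairs(names) -> set:
    """Pairs of names that share at least one identical signature band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for name in names:
        sig = minhash(name)
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(name)
    pairs = set()
    for group in buckets.values():
        pairs.update(combinations(sorted(group), 2))
    return pairs
```

Only the candidate pairs returned here would then be scored with a more precise similarity measure, so the expensive comparison runs on a small fraction of all possible pairs.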

Data Variety and Data Fusion are a “Hidden Treasure”

Image credit: Denise Lu

Data blending

DTC

Design DTC

Software

Research

Parameter Identifiers (PIDs) – e.g. engine speed, vehicle speed, powertrain voltage, environmental parameters, etc.

OEM’s Database

1. Data collection via diagnostic tools at dealer shops

Automobile field failure data

1. Data collection via telematics services

Data: DTCs, PIDs, Claims, etc.

PIDs data

Scan tools at dealer shops

400-600 PIDs

10-12 unique PIDs

SME analyses and test vehicle validation

Business Case and Problem

DTC: Diagnostic Trouble Code (fault code)

Design and software enhancement

- Reduce NTFs

- Characterize intermittent faults

- Enhance DTC design

1. Field Failure Data

2. Data Transformation

3. Study PID Distributions

4. Analyze PID Correlations

5. PIDs Selection (Decision Trees)
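As an illustration of the correlation step in the list above, the sketch below computes Pearson correlations between PID columns and flags strongly correlated pairs as potentially redundant. The slides do not state which correlation measure or cutoff was used; Pearson's r, the 0.95 threshold, and the function names are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column has no defined correlation; report 0.0 in that case.
    return cov / (sx * sy) if sx and sy else 0.0


def redundant_pairs(pid_columns: dict, threshold: float = 0.95) -> list:
    """Return (pid_a, pid_b) pairs whose absolute correlation exceeds threshold.

    pid_columns maps a PID name to its readings, one value per vehicle.
    """
    flagged = []
    for a, b in combinations(pid_columns, 2):
        if abs(pearson(pid_columns[a], pid_columns[b])) > threshold:
            flagged.append((a, b))
    return flagged
```

Keeping one PID from each flagged pair is one way to shrink the attribute set before the decision-tree selection step.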

Filter

Preprocessing

Data

User Data

Selection

Tool Use for

Innovation and Business Impact:

PIDs Mining Tool

Problem Formulation and Solution

Feature Selection and Classification Problem

400-600 Attributes (PIDs)

10-20 Informative Attributes (PIDs)

$(X, Y) = \{x_1, x_2, x_3, \ldots, x_n, y\}$

$x_i$ are attribute vectors (i.e. PIDs data) for $m$ patterns (i.e. vehicles)

$y$ is the class variable (i.e. fault): $y=0$ (baseline), $y=1$ (intermittent fault)

Problem: Find the informative features $X_{inf}$ that separate the intermittent class from the baseline class

$$IG(S, A) = H(S) - \sum_{i=1}^{m} p_S(A_i)\, H(S_{A_i})$$

$$H(S) = -\sum_{j=1}^{n} p_S(j) \log_2 p_S(j)$$

Solution: Decision tree-based method (C4.5 algorithm) using the entropy criterion to select the attribute $A_i$ in the decision tree

Stopping criterion for tree growth: the minimum no. of patterns (i.e. vehicles) should be greater than $N_{min}$ (e.g. 10)

H(S): Entropy of the set S

IG(S, A): Information gain, i.e. the reduction in entropy of the set S after a split on attribute A
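The two quantities defined above can be computed directly from class labels. The following is a minimal sketch for a categorical attribute, with illustrative function names. Note that C4.5 normally ranks attributes by gain ratio (information gain normalized by split information); only the plain gain from the slide is shown here.

```python
from collections import Counter
from math import log2


def entropy(labels) -> float:
    """H(S): Shannon entropy (in bits) of the class labels in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels) -> float:
    """IG(S, A): entropy of S minus the weighted entropy of its subsets.

    values[i] is the value of attribute A for pattern i; patterns with the
    same value form one subset of the split.
    """
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder
```

An attribute that separates intermittent-fault vehicles from baseline vehicles perfectly has a gain equal to H(S), while an attribute unrelated to the class has a gain of zero.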

Entropy Decision Forest

PIDs data for Intermittent fault vs. Baseline

Tree1

D1

Remove PIDs that are in Tree 1 from input data of Tree 2

Tree2

D2

Remove PIDs that are in Tree 2 from input data of Tree 3

Continue till stopping criteria met

...

Hypothesis 1 Hypothesis 2 Hypothesis n
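The forest procedure in the diagram (grow a tree, remove the PIDs it used, grow the next tree on what remains) can be sketched as below. This is a simplified illustration, not the speaker's implementation: PIDs are assumed to be already discretized into categories, the tree is a plain entropy-based tree without C4.5's gain ratio, continuous thresholds, or pruning, and `max_trees` plus the "no further split possible" rule stand in for the unspecified forest stopping criteria.

```python
from collections import Counter
from math import log2


def entropy(labels) -> float:
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def split(rows, labels, attr) -> dict:
    """Group rows and labels by the value each row has for attr."""
    parts = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = parts.setdefault(row[attr], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return parts


def gain(rows, labels, attr) -> float:
    parts = split(rows, labels, attr)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for _, ys in parts.values())
    return entropy(labels) - remainder


def attrs_used_by_tree(rows, labels, attrs, n_min) -> set:
    """Grow one tree recursively and return the attributes it splits on."""
    # A node is split only if it holds more than n_min patterns and is impure.
    if len(labels) <= n_min or len(set(labels)) == 1 or not attrs:
        return set()
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    if gain(rows, labels, best) <= 1e-12:
        return set()
    used = {best}
    rest = [a for a in attrs if a != best]
    for sub_rows, sub_labels in split(rows, labels, best).values():
        used |= attrs_used_by_tree(sub_rows, sub_labels, rest, n_min)
    return used


def entropy_decision_forest(rows, labels, attrs, n_min=10, max_trees=5) -> list:
    """Return one set of PIDs per tree; each tree sees only PIDs unused so far."""
    remaining, forest = list(attrs), []
    while remaining and len(forest) < max_trees:
        used = attrs_used_by_tree(rows, labels, remaining, n_min)
        if not used:
            break
        forest.append(used)
        remaining = [a for a in remaining if a not in used]
    return forest
```

Here `rows` is a list of dicts mapping PID names to discretized values and `labels` holds 0 (baseline) or 1 (intermittent fault) per vehicle. Each set in the result corresponds to one hypothesis in the diagram: an alternative group of PIDs that separates the two classes.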

Decision Tree 1

Decision Forest

Thanks

@satnam74s

satnam.datageek@yahoo.com

Data Science Conference in Bangalore: March 18-21
