Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15




Page 1: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15

© 2014 MapR Technologies

Cheap Learning Complements Deep Learning

Ted Dunning

Page 2: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Me, Us
• Ted Dunning, Chief Application Architect, MapR
  – Committer PMC member Zookeeper, Drill
  – VP Incubator
  – Bought the beer at the first HUG
• MapR
  – Distributes more open source components for Hadoop
  – Adds major technology for performance, HA, industry standard API’s
• Info
  – Hash tag - #mapr #mlconfatl
  – See also - @ApacheDrill, @ted_dunning and @mapR

Page 3: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Agenda
• Rationale
• Why cheap isn't the same as simple-minded
• Some techniques
• Examples

Page 4: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Why is cheap better than deep (sometimes)

Greenfield problems can be
  – Easy (large number of these)
  – Impossible (large number of these)
  – Hard but possible (right on the boundary)

Mature problems can be
  – Easy (these are already done)
  – Impossible (still a large number of these)
  – Hard but possible (now the majority of the effort)

Page 5: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Most data isn’t worth much in isolation

First data is valuable

Later data is dregs

Page 6: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Suddenly worth processing

First data is valuable

Later data is dregs

But has high aggregate value

Page 7: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


If we can handle the scale

It’s really big

Page 8: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


With great scale comes great opportunity

• Increasing scale by 1000x changes the game

• We essentially have green fields opening up all around

• Most of the opportunities don’t require advanced learning

Page 9: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


A simple example - security monitoring

• “Small” data
  – Capture IDS logs
  – Detect what you already know
• “Big” data
  – Capture switch, server, firewall logs as well
  – New patterns emerge immediately

Page 10: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Another example – fraud detection

• “Small” data
  – Maintain card profiles
  – Segment models
  – Evaluate all transactions
• “Big” data
  – Maintain card profiles, full 90 day transaction history
  – Per user hierarchical models
  – Evaluate all transactions

Page 11: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Easy != Stupid
• You still have to do things reasonably well
  – Techniques that are not well founded are still problems
• Heuristic frequency ratios still fail
  – Coincidences still dominate the data
  – Accidental 100% correlations abound
• Related techniques still broken for coincidence
  – Pearson’s χ²
  – Simple correlations
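
To make the failure concrete, consider a single coincidence: A and B each occur exactly once, together, among 10,001 observations. The sketch below is an illustration added to this transcript, not code from the talk. The function names are hypothetical, and the table layout (k11 = both, k12 = B only, k21 = A only, k22 = neither) follows the 2 x 2 convention used later in the deck.

```python
def conditional_frequency(k11, k12, k21, k22):
    """Fraction of A-observations that also had B, i.e. P(B | A)."""
    return k11 / (k11 + k21)


def pearson_chi2(k11, k12, k21, k22):
    """Pearson's chi-square for a 2 x 2 table (no continuity correction)."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    return n * (k11 * k22 - k12 * k21) ** 2 / (row1 * row2 * col1 * col2)


# One accidental co-occurrence in 10,001 observations.
single = (1, 0, 0, 10_000)
print(conditional_frequency(*single))  # 1.0
print(pearson_chi2(*single))           # 10001.0
```

The frequency ratio reports a perfect 100% association and Pearson's statistic comes out at roughly 10,000, even though both rest on one event.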

Page 12: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Blast from the past

Page 13: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15

© 2014 MapR Technologies 13

Scale does not cure wrong

It just makes easy more common

Page 14: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


A core technique
• Many of these easy problems reduce to finding interesting coincidences
• This can be summarized as a 2 x 2 table
• Actually, many of these tables

          A      Other
B         k11    k12
Other     k21    k22
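
As a minimal sketch of how one such table could be filled in, assume each observation (a session, a document, a user history) has an id and we know the set of ids in which A occurred and the set in which B occurred. The function name and the sample data are hypothetical.

```python
def contingency_table(with_a, with_b, total):
    """Return (k11, k12, k21, k22) for two sets of observation ids.

    with_a, with_b: sets of ids in which A (resp. B) occurred.
    total: number of observations overall.
    """
    k11 = len(with_a & with_b)       # A and B together
    k12 = len(with_b) - k11          # B without A
    k21 = len(with_a) - k11          # A without B
    k22 = total - k11 - k12 - k21    # neither
    return k11, k12, k21, k22


# Hypothetical data: ids of the sessions in which each item was seen.
sessions_a = {1, 2, 3, 7}
sessions_b = {2, 3, 9}
print(contingency_table(sessions_a, sessions_b, total=10))  # (2, 1, 2, 5)
```

In practice there is one table per (A, B) pair of interest, which is where the "many of these tables" comes from.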

Page 15: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


How do you do that?
• This is well handled using the G-test
  – See Wikipedia
  – See http://bit.ly/surprise-and-coincidence
• Original application in linguistics now cited > 2000 times
• Available in ElasticSearch, in Solr, in Mahout
• Available in R, C, Java, Python
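
For reference, the standard G statistic for a 2 x 2 table of counts can be written as follows. The symbols E_ij (expected count under independence) and N (total count) are notation introduced here, not taken from the slides; cells with k_ij = 0 contribute zero to the sum.

```latex
\[
G = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} k_{ij} \ln \frac{k_{ij}}{E_{ij}},
\qquad
E_{ij} = \frac{\left(\sum_{j'} k_{ij'}\right)\left(\sum_{i'} k_{i'j}\right)}{N},
\qquad
N = \sum_{i,j} k_{ij}
\]
```

When A and B are independent, G is approximately chi-square distributed with one degree of freedom, so large values flag co-occurrences that chance is unlikely to explain.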

Page 16: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Which one is the anomalous co-occurrence?

          A       not A
B         13      1000
not B     1000    100,000

          A       not A
B         1       0
not B     0       10,000

          A       not A
B         10      0
not B     0       100,000

          A       not A
B         1       0
not B     0       2

Page 17: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Which one is the anomalous co-occurrence?

          A       not A
B         13      1000
not B     1000    100,000
Score: 0.90

          A       not A
B         1       0
not B     0       10,000
Score: 4.52

          A       not A
B         10      0
not B     0       100,000
Score: 14.3

          A       not A
B         1       0
not B     0       2
Score: 1.95

Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics vol 19 no. 1 (1993)
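
The four scores on this slide (0.90, 1.95, 4.52, 14.3) can be reproduced by taking the square root of the G statistic for each table. That reading is an inference from the numbers, not something the slide states. The sketch below is a plain-Python illustration added to this transcript with hypothetical function names; it is not the code linked from the talk.

```python
from math import log, sqrt


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """G statistic (log-likelihood ratio) for a 2 x 2 table of counts."""
    n = k11 + k12 + k21 + k22
    cells = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
    rows = x_log_x(k11 + k12) + x_log_x(k21 + k22)
    cols = x_log_x(k11 + k21) + x_log_x(k12 + k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (cells - rows - cols + x_log_x(n)))


tables = [
    (13, 1000, 1000, 100_000),
    (1, 0, 0, 10_000),
    (10, 0, 0, 100_000),
    (1, 0, 0, 2),
]
for table in tables:
    print(table, round(sqrt(llr(*table)), 2))
```

Computed this way, the tables score about 0.90, 4.52, 14.3 and 1.95 in the order listed. The table with 10 co-occurrences and nothing off the diagonal is the anomalous one; the table with 13 co-occurrences is close to what independence predicts.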

Page 18: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


So we can find interesting coincidence

and that gets us exactly what?

Page 19: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Cooccurrence Analysis

Page 20: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
  – “hombres de paco” times 400
  – not much else
• Recommendation based search:
  – Flamenco guitar and dancers
  – Spanish and classical guitar
  – Van Halen doing a classical/flamenco riff

Page 21: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Real-life example

Page 22: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Any other domains?

Page 23: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Document classification

Page 24: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Language identification

Page 25: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


OK … Works for language

Anything else?

Page 26: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Species identification

Page 27: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Anything useful?

Like, to do with money?

Page 28: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Common Point of Compromise
• Scenario:
  – Merchant 0 is compromised, leaks account data during compromise
  – Fraud committed elsewhere during exploit
  – High background level of fraud
  – Limited detection rate for exploits
• Goal:
  – Find merchant 0
• Meta-goal:
  – Screen algorithms for this task without leaking sensitive data
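
The slides do not spell out the scoring used to find merchant 0. One plausible approach, in keeping with the coincidence test from earlier in the deck, is to build a 2 x 2 table per merchant (visited the merchant or not, later defrauded or not) and rank merchants by a signed root G score. Everything below is a hypothetical sketch: the function names, the data layout and the toy data are assumptions.

```python
from math import log, sqrt


def x_log_x(x):
    return x * log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """G statistic for a 2 x 2 table of counts."""
    n = k11 + k12 + k21 + k22
    return max(0.0, 2.0 * (
        x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
        - x_log_x(k11 + k12) - x_log_x(k21 + k22)
        - x_log_x(k11 + k21) - x_log_x(k12 + k22)
        + x_log_x(n)))


def rank_merchants(visits, fraud_victims):
    """visits: consumer -> set of merchants; fraud_victims: set of consumers."""
    n = len(visits)
    n_fraud = sum(1 for c in visits if c in fraud_victims)
    scores = {}
    for m in set().union(*visits.values()):
        visitors = {c for c, seen in visits.items() if m in seen}
        k11 = len(visitors & fraud_victims)   # visited m, later defrauded
        k12 = n_fraud - k11                   # defrauded, never visited m
        k21 = len(visitors) - k11             # visited m, no fraud
        k22 = n - k11 - k12 - k21             # neither
        score = sqrt(llr(k11, k12, k21, k22))
        # Negative sign when fraud is under-represented among visitors.
        if k11 * (k12 + k22) < k12 * (k11 + k21):
            score = -score
        scores[m] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): merchant-1 is visited by everyone.
visits = {
    "c1": {"merchant-0", "merchant-1"}, "c2": {"merchant-0", "merchant-1"},
    "c3": {"merchant-0", "merchant-1"}, "c4": {"merchant-0", "merchant-1"},
    "c5": {"merchant-1", "merchant-2"}, "c6": {"merchant-1", "merchant-2"},
    "c7": {"merchant-1", "merchant-2"}, "c8": {"merchant-1"},
}
fraud_victims = {"c1", "c2", "c3", "c7"}
for merchant, score in rank_merchants(visits, fraud_victims):
    print(merchant, round(score, 2))
```

On this toy data merchant-0 ranks first, the universally visited merchant-1 scores zero because visiting it carries no information, and merchant-2 gets a negative score.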

Page 29: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Simulation Setup

Page 30: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Simulation Strategy
• For each consumer
  – Pick consumer parameters such as transaction rate, preferences
  – Generate transactions until end of sim-time
    • If merchant 0 during compromise time, possibly mark as compromised
    • For all transactions, possibly mark as fraud, probability depends on history
    • Merchants are selected using hierarchical Pitman-Yor
• Restate data
  – Flatten transaction streams
  – Sort by time
• Tunables
  – Compromise probability, background fraud, detection probability
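
The strategy above can be sketched in a few dozen lines. This is a simplified illustration, not the simulator from the talk: it uses a single shared (non-hierarchical) Pitman-Yor sampler with no per-consumer preferences, and every parameter name and default value is an assumption.

```python
import random


class PitmanYor:
    """Chinese-restaurant-style sampler: popular merchants get more popular."""

    def __init__(self, discount, strength, rng):
        self.d, self.a, self.rng = discount, strength, rng
        self.counts = []  # counts[i] = times merchant i has been chosen

    def sample(self):
        n, k = sum(self.counts), len(self.counts)
        u = self.rng.random() * (self.a + n)
        if u < self.a + self.d * k:       # open a new merchant
            self.counts.append(1)
            return k
        u -= self.a + self.d * k
        for i, c in enumerate(self.counts):
            u -= c - self.d               # existing merchant, weight c - d
            if u < 0:
                self.counts[i] += 1
                return i
        self.counts[-1] += 1              # guard against rounding at the edge
        return k - 1


def simulate(n_consumers=200, sim_days=100.0, compromise=(30.0, 40.0),
             p_compromise=0.5, p_background=0.001, p_exploit=0.05,
             p_detect=0.8, seed=1):
    rng = random.Random(seed)
    merchants = PitmanYor(discount=0.5, strength=10.0, rng=rng)
    events = []
    for consumer in range(n_consumers):
        rate = rng.uniform(0.1, 1.0)      # transactions per day
        t, compromised = 0.0, False
        while True:
            t += rng.expovariate(rate)
            if t >= sim_days:
                break
            m = merchants.sample()
            # Fraud probability depends on the consumer's history so far.
            p_fraud = p_exploit if compromised else p_background
            fraud = rng.random() < p_fraud and rng.random() < p_detect
            if (m == 0 and compromise[0] <= t < compromise[1]
                    and rng.random() < p_compromise):
                compromised = True
            events.append((t, consumer, m, fraud))
    events.sort()  # restate: one flattened stream, sorted by time
    return events


events = simulate()
print(len(events), "transactions,", sum(e[3] for e in events), "flagged as fraud")
```

Because the first merchant ever sampled gets id 0, the compromised merchant here tends to be a popular one. A fixed seed keeps runs repeatable, which helps when comparing detection algorithms against the same synthetic stream.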

Page 31: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


But that isn’t very realistic!
• No details of the fraud
• No details of the fraudsters
• No details on the transactions
• No details on the models

• How can this be any good at all?

Page 32: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Secure Development is Hard

Page 33: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Secure Development is Hard

Outside collaborators are outside the security perimeter

They can’t see the data and they can’t tune new algorithms to fit reality

Page 34: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


How To Make Realistic Data

Page 35: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Parametric Simulation

Parametric matching of failure signatures allows emulation of complex data properties

Matching on KPI’s and failure modes guarantees practical fidelity

Page 36: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Performance Indicators to Match
• User and merchant population
• Transaction count/consumer
• Merchant propensity skew
• Level of detected fraud
• Spectrum of meta-model scores

Page 37: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


So how does it work in practice?

Page 38: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Page 39: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Really truly bad guys

Page 40: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Summary
• We live in a golden age of newly achieved scale
• That scale has lowered the tree
  – Hard problems are much easier
  – Lots of low-hanging fruit all around us
• Cheap learning has huge value
• Code available at http://github.com/tdunning

Page 41: Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15


Me, Us
• Ted Dunning, Chief Application Architect, MapR
  – Committer PMC member Zookeeper, Drill
  – VP Incubator
  – Bought the beer at the first HUG
• MapR
  – Distributes more open source components for Hadoop
  – Adds major technology for performance, HA, industry standard API’s
• Info
  – Hash tag - #mapr #mlconfatl
  – See also - @ted_dunning and @mapR