Machine Learning meets the Real World: Successes and new research directions Andrea Pohoreckyj Danyluk Department of Computer Science Williams College, Williamstown, MA October 11, 2002


Page 1

Machine Learning meets the Real World: Successes and new research directions

Andrea Pohoreckyj Danyluk

Department of Computer Science

Williams College, Williamstown, MA

October 11, 2002

Page 2

Data, data everywhere...

• Scientific: data collection routinely produces gigabytes of data per day

• Telecommunications: AT&T produces 275 million call records

• Web: Google handles 70 million searches

• Retail: WalMart records 20 million sales transactions

Page 3

A wealth of information

• Scientific data
  – Detection of oil spills from satellite images
  – Prediction of molecular bioactivity for drug design

• Telecommunications
  – Fraud detection to distinguish between “bad” and normal usage of cell phones

Page 4

A wealth of information

• Web mining
  – Characterize killer pages

• Retail
  – Determine better product placement

• Direct mail
  – Predict who is most likely to donate to a charity

Page 5

Machine learning success (Machine learning is ubiquitous)

• Scientific discovery
  – Detection of oil spills from satellite images

• Telecommunications
  – Diagnosis of problems in the local loop

• Printing
  – Determine causes of banding (printing cylinder problems)

• Control
  – Self-steering vehicles

Page 6

Why research in machine learning is so good today

Research in machine learning benefits from

• Abundant data

• Interest in fielding new applications
  – Even more data
  – Push on limits of our understanding, technology, etc.

Page 7

Plan for this talk

Original

• Discuss success stories and failures

• Failures help identify new areas of research

New plan

• One success story in detail

• Lesson learned: can identify new areas of research even when we succeed

Page 8

Induction of decision trees

• Not the only (or even the “hottest”) algorithms

• Have been used in many contexts

• Important for understanding our success story: local-loop network diagnosis

Page 9

Inductive learning

Given a collection of observations of the form (x, f(x))

Find g(x) that approximates f(x)

Page 10

Sample data

            Outlook    Temp   Humidity   Wind    Go to class?
Student 1   Sunny      Hot    High       False   Yes
Student 2   Sunny      Hot    High       True    No
Student 3   Overcast   Hot    High       False   Yes
Student 4   Rainy      Mild   High       False   Yes
Student 5   Rainy      Cool   High       True    No
Student 6   Rainy      Cool   Normal     False   No

Page 11

Predictive model, i.e., g(x)

Outlook
  Sunny: Wind
    Yes: No
    No: Yes
  Rainy: Temp
    mild: Yes
    cold: No
  Overcast: Yes
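As a rough sketch, the predictive model can be written as a function g and checked against the six sample observations. The reading of the tree (Sunny tests Wind, Rainy tests Temp, Overcast predicts Yes) and the mapping of the tree's "cold" branch to the table's "Cool" value are assumptions, and the names `OBSERVATIONS`, `g`, and `AGREEMENT` are illustrative only.

```python
# Sample observations: (outlook, temp, humidity, wind, go_to_class)
OBSERVATIONS = [
    ("Sunny", "Hot", "High", False, "Yes"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "High", True, "No"),
    ("Rainy", "Cool", "Normal", False, "No"),
]


def g(outlook, temp, humidity, wind):
    """Hand-coded decision tree approximating f; humidity is never tested."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        # Windy days predict "No", calm days predict "Yes".
        return "No" if wind else "Yes"
    if outlook == "Rainy":
        # Assumption: any temperature other than "Mild" follows the "cold" branch.
        return "Yes" if temp == "Mild" else "No"
    raise ValueError(f"unknown outlook: {outlook!r}")


# Compare g(x) with the recorded f(x) for every observation.
AGREEMENT = [g(*row[:4]) == row[4] for row in OBSERVATIONS]
print(all(AGREEMENT))  # True: g reproduces f on the sample data
```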

Page 12

Learning objectives

• Learn a tree that is correct

• Learn a tree that is compact

• At every level in the tree, select a test that best differentiates examples of one class from another

Page 13

TDIDT

• If all examples are from the same class
  – The tree is a leaf with that class name

• Else
  – Pick a test to make
  – Construct one edge for each possible test outcome
  – Partition the examples by test outcome
  – Build subtrees recursively
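The recursion can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the test-selection step is a placeholder that takes the first attribute still able to split the examples, the majority-vote fallback for indistinguishable examples is an added assumption, and all names (`tdidt`, `classify`, `EXAMPLES`) are hypothetical.

```python
from collections import Counter


def tdidt(examples, attributes, choose=None):
    """Top-down induction; examples are (attribute dict, class label) pairs."""
    labels = [label for _, label in examples]
    # Base case: all examples are from the same class, so return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Only attributes with more than one observed value can partition the set.
    usable = [a for a in attributes if len({x[a] for x, _ in examples}) > 1]
    if not usable:
        # Assumption (not on the slide): fall back to the majority class.
        return Counter(labels).most_common(1)[0][0]
    # Pick a test; a selection criterion can be plugged in through `choose`.
    test = choose(examples, usable) if choose else usable[0]
    # Partition the examples by test outcome, one edge per outcome.
    partitions = {}
    for x, label in examples:
        partitions.setdefault(x[test], []).append((x, label))
    remaining = [a for a in usable if a != test]
    # Build subtrees recursively.
    return (test, {v: tdidt(sub, remaining, choose) for v, sub in partitions.items()})


def classify(tree, x):
    """Follow edges until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[x[attribute]]
    return tree


ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]
ROWS = [
    ("Sunny", "Hot", "High", "False", "Yes"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "High", "True", "No"),
    ("Rainy", "Cool", "Normal", "False", "No"),
]
EXAMPLES = [(dict(zip(ATTRS, row[:4])), row[4]) for row in ROWS]
TREE = tdidt(EXAMPLES, ATTRS)
print(TREE)
```

On the six sample observations this placeholder ordering happens to split on Outlook first, then Wind under Sunny and Temp under Rainy. In general the quality of the tree depends on how the test is chosen.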

Page 14

Which is better?

Starting collection: 10 Yes, 10 No

Split on Humid:
  High:     5 Yes, 5 No
  Normal:   5 Yes, 5 No

Split on Outlook:
  Sunny:    2 Yes, 10 No
  Overcast: 8 Yes, 0 No

Page 15

The Gain Criterion

• Measure the information of the collection

• Measure the information of each possible split

• Choose the split with greatest information gain

Page 16

Information (Entropy)

• Let T be a set of examples

• Let C1, C2, …, Cn be class labels

• freq(Ci,T) = number of examples in T that belong to class Ci.

• |T| = number of examples in T

• Select an example and announce its class: info = $-\log_2\left(\frac{\mathrm{freq}(C_i,T)}{|T|}\right)$

Page 17

Information (Entropy)

• Let T be a set of examples

• $\mathrm{info}(T) = -\sum_{i=1}^{n} \frac{\mathrm{freq}(C_i,T)}{|T|} \log_2\left(\frac{\mathrm{freq}(C_i,T)}{|T|}\right)$
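A small sketch of this entropy computation, assuming T is represented simply as a list of class labels (the function name `info` and the example collections are illustrative):

```python
import math
from collections import Counter


def info(labels):
    """Entropy in bits of a collection T, given as a list of class labels."""
    total = len(labels)
    # freq(Ci, T) for every class Ci that actually occurs in T.
    frequencies = Counter(labels).values()
    return sum(-(f / total) * math.log2(f / total) for f in frequencies)


EVEN = info(["Yes"] * 10 + ["No"] * 10)   # evenly mixed collection
PURE = info(["Yes"] * 8)                  # single-class collection
SKEWED = info(["Yes"] * 2 + ["No"] * 10)  # mostly one class

print(EVEN, PURE, round(SKEWED, 3))
```

An evenly mixed two-class collection carries one full bit, a pure collection carries none, and a skewed collection falls in between (about 0.65 bits for 2 Yes and 10 No).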

Page 18

Entropy after a split

• Let X be an attribute with n possible values.

• Let Tj be the examples that have the value j for attribute X.

Average entropy that results from making split on X:

$\mathrm{info}_X(T) = \sum_{j=1}^{n} \frac{|T_j|}{|T|}\,\mathrm{info}(T_j)$, summing over the n possible values of X.

Page 19

Information Gain

• Compute infoX(T) for every attribute

• Select attribute that maximizes

info(T) – infoX(T)

Page 20

Which is better?

Starting collection: 10 Yes, 10 No

Split on Humid:
  High:     5 Yes, 5 No
  Normal:   5 Yes, 5 No

Split on Outlook:
  Sunny:    2 Yes, 10 No
  Overcast: 8 Yes, 0 No
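The question can be answered by applying the gain criterion to the class counts in this example. The sketch below assumes the counts read as Humid: High (5 Yes, 5 No), Normal (5 Yes, 5 No) and Outlook: Sunny (2 Yes, 10 No), Overcast (8 Yes, 0 No); the function names are illustrative.

```python
import math


def info_from_counts(counts):
    """Entropy in bits from per-class counts, skipping empty classes."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def gain(parent_counts, branch_counts):
    """info(T) minus the size-weighted average entropy of the branches."""
    total = sum(parent_counts)
    after_split = sum(
        (sum(branch) / total) * info_from_counts(branch) for branch in branch_counts
    )
    return info_from_counts(parent_counts) - after_split


PARENT = (10, 10)             # (Yes, No) before any split
HUMID = [(5, 5), (5, 5)]      # High, Normal
OUTLOOK = [(2, 10), (8, 0)]   # Sunny, Overcast

GAIN_HUMID = gain(PARENT, HUMID)
GAIN_OUTLOOK = gain(PARENT, OUTLOOK)
print(GAIN_HUMID, round(GAIN_OUTLOOK, 2))
```

Splitting on Humid leaves both branches as mixed as the original collection, so its gain is 0. Splitting on Outlook yields one pure branch and one mostly-No branch, for a gain of roughly 0.61 bits, so Outlook is the better test.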

Page 21

Scrubber (the success story)

• Diagnoses problems in the local loop

• Problem may be due to trouble in:
  – Customer premise equipment
  – Facilities connecting customer to cable
  – Cable
  – Central office

• Millions of “troubles” reported annually

Page 22

MAX, 1990

• Acts as Maintenance Administrator (MA)

• Sequence of actions:
  – Customer calls
  – Rep takes information; initiates tests
  – Trouble report sent to MA
  – MA puts trouble in dispatch queue for specific type of technician

Page 23

Scrubber 2

• Performed a task at a later point in the pipeline

• Survey dispatch queues to determine whether dispatch is appropriate
  – Dispatch not immediate
  – Many problems resolved exogenously

Page 24

Scrubber 3

• Scrubber 2 for new application platform

• Centralized knowledge server

• Cover twice as large a network

Page 25

Implementation difficulties

• Original expert system shell no longer supported

• Knowledge base evolved into opacity
  – Many tweaks over a decade
  – Many knowledge engineers
  – Most not available to work on Scrubber 3

Page 26

Requirements

• Level of performance at least as good as prior system
  – Overall accuracy
  – False positives and false negatives in range

• Comprehensible
  – For understanding and acceptance by experts
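The performance requirement could be checked along the following lines. This is only a sketch under stated assumptions: "Dispatch" is treated as the positive class, the prior system's decisions serve as the reference, and the label lists are made-up illustrations, not project data.

```python
def dispatch_metrics(reference, predicted, positive="Dispatch"):
    """Accuracy plus false-positive and false-negative rates for two label lists."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    tn = sum(r != positive and p != positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of reference non-dispatches that the new model would dispatch.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        # Share of reference dispatches that the new model would hold back.
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Hypothetical decisions, for illustration only.
REFERENCE = ["Dispatch", "Dispatch", "Don't", "Don't", "Dispatch"]
PREDICTED = ["Dispatch", "Don't", "Don't", "Dispatch", "Dispatch"]
METRICS = dispatch_metrics(REFERENCE, PREDICTED)
print(METRICS)
```

Reporting the two error rates separately matters here because the requirement constrains each of them, not just overall accuracy.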

Page 27

Additional requirements (ours)

• Improved performance

• Improved extensibility

Page 28

Phase I: Modeling Scrubber 2

• Applied a decision tree learning algorithm

• Input data:
  – Trouble reports
  – Scrubber 2 diagnoses

Page 29

Data

26,000 trouble reports

• 40 attributes (1/2 continuous; 1/2 symbolic)

• Two classes
  – Dispatch
  – Don’t dispatch, i.e., call customer to verify OK

Page 30

Background knowledge

• C4.5 selected

• 17 of 40 attributes used

Page 31

Phase I results

• Decision trees with predictive accuracy of 0.99, with as few as 10,000 examples

• Less than two days of work (easy!)

Page 32

Phase II: Acceptance

• Comprehensibility Readability
  – Need to observe rationality in learned knowledge
  – Original trees on order of 1000 nodes

• The simpler the model, the better it can be understood

Comprehensibility = Readability + Simplicity + Fidelity

Page 33

Trading off simplicity and correctness

• Pruning nodes sacrifices correctness

• Appropriate when comprehensibility an issue

• Langley and Schwabacher, 2001

• Note: not pruning to avoid overfitting

Page 34

Phase II results

• Used only two most prominent attributes

• New decision trees created

• Still fell into acceptable zone

Page 35

Phase III: Working toward extensibility

• Hoped to gain flexibility for
  – Local modifiability
  – Additional attribute values

• Moved toward probabilistic decision tree
  – Leaves labeled with probability estimates, not decisions
  – Stubby trees easy to represent in tabular form
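A stubby probabilistic tree over two attributes can be pictured as a lookup table from attribute-value pairs to probability estimates. The sketch below is hypothetical throughout: the attribute values ("low", "high", "open", "closed"), the records, and the 0.5 decision threshold are invented for illustration and are not taken from Scrubber.

```python
from collections import defaultdict


def build_probability_table(records):
    """Map (value_a, value_b) to the observed fraction of dispatches."""
    counts = defaultdict(lambda: [0, 0])  # [dispatch count, total count]
    for value_a, value_b, dispatched in records:
        cell = counts[(value_a, value_b)]
        cell[0] += int(dispatched)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


def decide(table, value_a, value_b, threshold=0.5):
    """Turn a leaf's probability estimate into a decision at a chosen threshold."""
    return "Dispatch" if table[(value_a, value_b)] >= threshold else "Don't"


# Hypothetical trouble reports: (attribute A, attribute B, was dispatch needed?)
RECORDS = [
    ("low", "open", True),
    ("low", "open", True),
    ("low", "open", False),
    ("high", "closed", False),
    ("high", "closed", False),
    ("low", "closed", True),
]
TABLE = build_probability_table(RECORDS)
print(TABLE)

# Local modifiability: one cell can be revised without relearning the rest.
TABLE[("high", "closed")] = 0.1
```

Keeping probabilities rather than hard decisions at the leaves means the decision threshold can be adjusted later, and a new attribute value only adds rows to the table.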

Page 36

Phase IIIb: More data

• Focus on two attributes gave us access to an extensive data set
  – Many more trouble reports
  – Abridged (two-attribute) form had not been considered useful earlier

Page 37

Phase III results

• Simple diagnostic model

• Greater empirical confidence, important because of the small-disjunct problem
  – “Big” general rules cover approximately 50% of the data
  – Remaining 50% covered by small disjuncts

Page 38

Summarizing the success story

• C4.5 applied to induce Scrubber 2 model

• Pruned model for comprehensibility/simplicity

• Converted new model into probabilistic one

• Used newly gained data for additional tuning and confidence

• Small(?), simple model in very short time

Page 39

Lessons can be learned from success

Lesson 1: the importance of comprehensibility
  – Rationality
  – Readability
  – Simplicity

Page 40

Lessons can be learned from success

Lesson 2: the need for algorithms to handle small data sets
  – Creative ways to engineer interesting features from few
  – Openness to alternative sources of data
  – Algorithms specifically tuned to handle small data sets

Langley has noted this to be an issue of scientific data -- but true for industrial data as well

Page 41

Lessons can be learned from success

Lesson 3: the need to think about systematic error
  – Locally systematic error only looks like noise with enough data
  – Clearly related to the problem of small data sets
  – How do our algorithms hold up?

Page 42

Lessons can be learned from success

Lesson 4: the need to think about the future
  – Learning results put into practice will be modified and extended
  – Must new models be learned?
  – Can improvement be incremental?

Page 43

Lessons can be learned from success

Lesson 5: creative uses of the technology
  – Learning for the purposes of re-engineering isn’t “standard”
  – New applications will serve to fuel new research

Page 44

Further reading and acknowledgements

• Carla Brodley et al., American Scientist, Jan./Feb. ’99

• Pat Langley, various publications

• Thanks to Foster Provost and many others at Nynex / Bell Atlantic