
Page 1: Machine Learning and Data Mining

1

Tilani Gunawardena

Machine Learning and Data Mining

Page 2: Machine Learning and Data Mining

2

• Supervised learning
• Unsupervised learning
• Reinforcement learning

Outline

Page 3: Machine Learning and Data Mining

3

• Data Mining: Process of discovering patterns in data

Data Mining

Page 4: Machine Learning and Data Mining

4

• Machine Learning
– Grew out of work in AI
– New capability for computers

• Machine learning is the science of getting computers to learn without being explicitly programmed

• Learning = improving with experience at some task
– Improve over task T
– With respect to performance measure P
– Based on experience E

Machine Learning

Page 5: Machine Learning and Data Mining

5

• Database mining
– Large datasets from the growth of automation/web
– Ex: web click data, medical records, biology, engineering

• Applications that can't be programmed by hand
– Ex: autonomous helicopter, handwriting recognition, most of NLP, computer vision

• Self-customizing programs
– Ex: Amazon, Netflix product recommendations

• Understanding human learning (brain, real AI)

Machine Learning

Page 6: Machine Learning and Data Mining

6

Types of Learning
– Supervised learning: learn to predict
• The correct answer is given for each example. The answer can be a numeric variable, a categorical variable, etc.

– Unsupervised learning: learn to understand and describe the data
• Correct answers are not given – just examples (e.g. the same figures as above, without the labels)

– Reinforcement learning: learn to act
• Occasional rewards

(Figure: example instances labeled M and F)

Page 7: Machine Learning and Data Mining

7

Machine Learning Problems

Page 8: Machine Learning and Data Mining

8

• The success of a machine learning system also depends on the algorithms.

• The algorithms control the search to find and build the knowledge structures.

• The learning algorithms should extract useful information from training examples.

Algorithms

Page 9: Machine Learning and Data Mining

9

• Supervised learning – Prediction
– Classification (discrete labels), Regression (real values)

• Unsupervised learning
– Clustering
– Probability distribution estimation
– Finding associations (in features)
– Dimension reduction

• Reinforcement learning
– Decision making (robot, chess machine)

Algorithms

Page 10: Machine Learning and Data Mining

10

• Problem of taking a labeled dataset and gleaning information from it so that you can label new data sets
• Learn to predict output from input
• Function approximation

Supervised Learning

Page 11: Machine Learning and Data Mining

11

• Predict housing prices


Supervised learning: example 1

Regression: predict continuous-valued output (price)

Page 12: Machine Learning and Data Mining

12

• Breast cancer (malignant, benign)

Supervised learning: example 2

This is a classification problem: discrete-valued output (0 or 1)

Page 13: Machine Learning and Data Mining

13

1 attribute/feature : Tumor Size

Page 14: Machine Learning and Data Mining

14

Supervised learning: example 3

2 attributes/features : Tumor Size and Age

Page 15: Machine Learning and Data Mining

15

1. Input: credit history (# of loans, how much money you make, …). Output: lend money or not?
2. Input: picture. Output: predict BSc, MSc, or PhD
3. Input: picture. Output: predict age
4. Input: large inventory of identical items. Output: predict how many items will sell over the next 3 months
5. Input: customer accounts. Output: hacked or not

Q?

Page 16: Machine Learning and Data Mining

16

• Find patterns and structure in data

Unsupervised Learning

Page 17: Machine Learning and Data Mining

17

Unsupervised Learning – examples
• Organize computing clusters
– Large data centers: which machines work together?
• Social network analysis
– Given information on which friends you email most / FB friends / Google+ circles
– Can we automatically identify cohesive groups of friends?

Page 18: Machine Learning and Data Mining

18

• Market segmentation
– Take a customer data set and group customers into different market segments

• Astronomical data analysis
– Clustering algorithms give interesting & useful theories, e.g. how galaxies are formed

Page 19: Machine Learning and Data Mining

19

1. Given emails labeled as spam/not spam, learn a spam filter

2. Given a set of news articles found on the web, group them into sets of articles about the same story

3. Given a database of customer data, automatically discover market segments and group customers into different market segments

4. Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not

Q?

Page 20: Machine Learning and Data Mining

20

Reinforcement Learning
• Learn to act
• Learning from delayed reward
• Learning comes after several steps, from the decisions that you've actually made

Page 21: Machine Learning and Data Mining

21

Algorithms: The Basic Methods

Page 22: Machine Learning and Data Mining

22

Outline

• Simplicity first: 1R

• Naïve Bayes

Page 23: Machine Learning and Data Mining

23

Simplicity first

• Simple algorithms often work very well!
• There are many kinds of simple structure, e.g.:
– One attribute does all the work
– All attributes contribute equally & independently
– A weighted linear combination might do
– Instance-based: use a few prototypes
– Use simple logical rules

• Success of method depends on the domain

Page 24: Machine Learning and Data Mining

24

Inferring rudimentary rules
• 1R: learns a 1-level decision tree
– I.e., rules that all test one particular attribute
• Basic version
– One branch for each value
– Each branch assigns the most frequent class
– Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
– Choose the attribute with the lowest error rate
(assumes nominal attributes)

Page 25: Machine Learning and Data Mining

25

Pseudo-code for 1R

For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

• Note: “missing” is treated as a separate attribute value
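A minimal Python sketch of this procedure (illustrative only, not from the slides; the data literal below is the nominal weather data shown in Table 1 on the next slide, and the function and variable names are my own):

```python
from collections import Counter, defaultdict

# Nominal weather data from Table 1: (Outlook, Temp, Humidity, Windy) -> Play
attributes = ["Outlook", "Temp", "Humidity", "Windy"]
data = [
    (("Sunny", "Hot", "High", "False"), "No"),
    (("Sunny", "Hot", "High", "True"), "No"),
    (("Overcast", "Hot", "High", "False"), "Yes"),
    (("Rainy", "Mild", "High", "False"), "Yes"),
    (("Rainy", "Cool", "Normal", "False"), "Yes"),
    (("Rainy", "Cool", "Normal", "True"), "No"),
    (("Overcast", "Cool", "Normal", "True"), "Yes"),
    (("Sunny", "Mild", "High", "False"), "No"),
    (("Sunny", "Cool", "Normal", "False"), "Yes"),
    (("Rainy", "Mild", "Normal", "False"), "Yes"),
    (("Sunny", "Mild", "Normal", "True"), "Yes"),
    (("Overcast", "Mild", "High", "True"), "Yes"),
    (("Overcast", "Hot", "Normal", "False"), "Yes"),
    (("Rainy", "Mild", "High", "True"), "No"),
]

def one_r(data, n_attributes):
    """Return (best attribute index, its rules, its total error count)."""
    best = None
    for a in range(n_attributes):
        # Count how often each class appears for each value of attribute a
        counts = defaultdict(Counter)
        for x, y in data:
            counts[x[a]][y] += 1
        # Each value's rule predicts its most frequent class
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: instances not in the majority class of their branch
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

a, rules, errors = one_r(data, len(attributes))
print(attributes[a], rules, f"{errors}/{len(data)}")
# Outlook {'Sunny': 'No', 'Overcast': 'Yes', 'Rainy': 'Yes'} 4/14
```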

Page 26: Machine Learning and Data Mining

26

Evaluating the weather attributes

Outlook Temp Humidity Windy Play

Sunny Hot High False No

Sunny Hot High True No

Overcast Hot High False Yes

Rainy Mild High False Yes

Rainy Cool Normal False Yes

Rainy Cool Normal True No

Overcast Cool Normal True Yes

Sunny Mild High False No

Sunny Cool Normal False Yes

Rainy Mild Normal False Yes

Sunny Mild Normal True Yes

Overcast Mild High True Yes

Overcast Hot Normal False Yes

Rainy Mild High True No


Table 1. Weather data (Nominal)


Page 31: Machine Learning and Data Mining

32

Evaluating the weather attributes

Attribute   Rules   Errors   Total errors

Outlook Sunny->No 2/5 4/14

Overcast->Yes 0/4

Rainy->Yes 2/5

Temp Hot->No* 2/4 5/14

Mild->Yes 2/6

Cool->Yes 1/4

Humidity High->No 3/7 4/14

Normal->Yes 1/7

Windy False->Yes 2/8 5/14

True->No* 3/6

* indicates a tie

Table 2. Rules for Weather data (Nominal)

Page 32: Machine Learning and Data Mining

33

Dealing with numeric attributes

• Discretize numeric attributes
• Divide each attribute's range into intervals
– Sort instances according to the attribute's values
– Place breakpoints where the class changes (the majority class)
– This minimizes the total error

Page 33: Machine Learning and Data Mining

34

Example: temperature from weather data

64 65 68 69 70 71 72 72 75 75 80 81 83 85

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     70           96        False  Yes
Rainy     68           80        False  Yes
Rainy     65           70        True   No
Overcast  64           65        True   Yes
Sunny     72           95        False  No
Sunny     69           70        False  Yes
Rainy     75           80        False  Yes
Sunny     75           70        True   Yes
Overcast  72           90        True   Yes
Overcast  81           75        False  Yes
Rainy     71           91        True   No

Page 34: Machine Learning and Data Mining

35

Example: temperature from weather data

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No | Yes | Yes Yes | No | Yes Yes | No

The two instances with temperature 72 have different classes, so a breakpoint cannot be placed between identical values. The simplest fix is to move the breakpoint at 72 up one example, to 73.5, producing a mixed partition in which No is the majority class:

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

Page 35: Machine Learning and Data Mining

36

Evaluating the weather attributes

Attribute     Rules           Errors   Total errors
Outlook       Sunny -> No     2/5
              Overcast ->
              Rainy ->
Temp          Hot ->
              Mild ->
              Cool ->
Humidity      High ->
              Normal ->
Windy         False ->
              True ->

Table: Rules for Weather data (Nominal)

Page 36: Machine Learning and Data Mining

37

Dealing with numeric attributes

Attribute     Rules                       Errors   Total errors
Temperature   <= 64.5 -> Yes              0/1      1/14
              > 64.5 and <= 66.5 -> No    0/1
              > 66.5 and <= 70.5 -> Yes   0/3
              > 70.5 and <= 73.5 -> No    1/3
              > 73.5 and <= 77.5 -> Yes   0/2
              > 77.5 and <= 80.5 -> No    0/1
              > 80.5 and <= 84 -> Yes     0/2
              > 84 -> No                  0/1

Table 5. Rules for temperature from the weather data (overfitting)

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

Breakpoints: 64.5, 66.5, 70.5, 73.5, 77.5, 80.5, and 84

Page 37: Machine Learning and Data Mining

38

The problem of overfitting
• This procedure is very sensitive to noise

– One instance with an incorrect class label will probably produce a separate interval

• Simple solution: enforce a minimum number of instances in the majority class per interval

Page 38: Machine Learning and Data Mining

39

Discretization example

• Example (with min = 3):

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

becomes

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No

• Final result for temperature attribute (after merging adjacent partitions with the same majority class):

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No

which leads to the rule set

temperature: ≤ 77.5 → yes

> 77.5 → no

Page 39: Machine Learning and Data Mining

40

With overfitting avoidance
• Resulting rule set:

Attribute     Rules                       Errors   Total errors
Outlook       Sunny -> No                 2/5      4/14
              Overcast -> Yes             0/4
              Rainy -> Yes                2/5
Temperature   <= 77.5 -> Yes              3/10     5/14
              > 77.5 -> No*               2/4
Humidity      <= 82.5 -> Yes              1/7      3/14
              > 82.5 and <= 95.5 -> No    2/6
              > 95.5 -> Yes               0/1
Windy         False -> Yes                2/8      5/14
              True -> No*                 3/6
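A short sketch (variable and function names are mine, not from the slides) that checks the Temperature entries in this table by evaluating the single split at 77.5:

```python
from collections import Counter

# Temperature values and Play classes from the weather data, in sorted order
temps   = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
classes = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
           "Yes", "Yes", "No", "Yes", "Yes", "No"]

def split_rule_errors(threshold):
    """Errors of the rule 'value <= threshold -> majority class, else -> majority class'.
    Returns ((low-side errors, low-side count), (high-side errors, high-side count))."""
    low  = Counter(c for t, c in zip(temps, classes) if t <= threshold)
    high = Counter(c for t, c in zip(temps, classes) if t > threshold)
    err = lambda cnt: (sum(cnt.values()) - max(cnt.values()), sum(cnt.values()))
    return err(low), err(high)

print(split_rule_errors(77.5))   # ((3, 10), (2, 4)) -> 3/10 and 2/4, total 5/14
```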

Page 40: Machine Learning and Data Mining

41

Discussion of 1R
• 1R was described in a paper by Holte (1993)

– Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)

– Minimum number of instances was set to 6 after some experimentation

– 1R’s simple rules performed not much worse than much more complex decision trees

• Simplicity first pays off!

Very Simple Classification Rules Perform Well on Most Commonly Used DatasetsRobert C. Holte, Computer Science Department, University of Ottawa

Page 41: Machine Learning and Data Mining

42

Statistical modeling
• "Opposite" of 1R: use all the attributes
• Two assumptions: attributes are
– equally important
– statistically independent (given the class value)
• I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
• The independence assumption is almost never correct!
• But … this scheme works surprisingly well in practice

Page 42: Machine Learning and Data Mining

43

Probabilities for weather data (counts, with relative frequencies in parentheses)

Outlook      Yes      No          Temperature   Yes      No
Sunny        2 (2/9)  3 (3/5)     Hot           2 (2/9)  2 (2/5)
Overcast     4 (4/9)  0 (0/5)     Mild          4 (4/9)  2 (2/5)
Rainy        3 (3/9)  2 (2/5)     Cool          3 (3/9)  1 (1/5)

Humidity     Yes      No          Windy         Yes      No
High         3 (3/9)  4 (4/5)     False         6 (6/9)  2 (2/5)
Normal       6 (6/9)  1 (1/5)     True          3 (3/9)  3 (3/5)

Play         Yes: 9 (9/14)        No: 5 (5/14)

(The weather data from Table 1 is repeated on the slide for reference.)

Page 43: Machine Learning and Data Mining

44

Probabilities for weather data

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

• A new day: likelihood of the two classes
Likelihood for "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood for "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795

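The same calculation as a short Python sketch (the probability values are the relative frequencies from the previous slide; variable names are illustrative):

```python
# Per-attribute relative frequencies for the new day's values
p_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "True": 3/9}
p_no  = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "True": 3/5}
prior_yes, prior_no = 9/14, 5/14

new_day = ["Sunny", "Cool", "High", "True"]

like_yes, like_no = prior_yes, prior_no
for value in new_day:          # naive assumption: multiply the per-attribute probabilities
    like_yes *= p_yes[value]
    like_no  *= p_no[value]

total = like_yes + like_no     # normalize so the two posteriors sum to 1
print(round(like_yes, 4), round(like_no, 4))                   # 0.0053 0.0206
print(round(like_yes / total, 3), round(like_no / total, 3))   # 0.205 0.795
```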

Page 44: Machine Learning and Data Mining

45

Bayes's rule
• Probability of event H given evidence E:

    P(H | E) = P(E | H) P(H) / P(E)

• A priori probability of H: P(H)
– Probability of the event before the evidence is seen
• A posteriori probability of H: P(H | E)
– Probability of the event after the evidence is seen

Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.

from Bayes “Essay towards solving a problem in the doctrine of chances” (1763)

Page 45: Machine Learning and Data Mining

46

Naïve Bayes for classification
• Classification learning: what's the probability of the class given an instance?
– Evidence E = instance
– Event H = class value for instance
• Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent given the class:

    P(H | E) ∝ P(E1 | H) P(E2 | H) … P(En | H) P(H)

Page 46: Machine Learning and Data Mining

47

Weather data example

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?      <- Evidence E

Probability of class "yes":
P(yes | E) = P(Sunny | yes) × P(Cool | yes) × P(High | yes) × P(True | yes) × P(yes) / P(E)
           = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)


Page 48: Machine Learning and Data Mining

49

The "zero-frequency problem"
• What if an attribute value doesn't occur with every class value?
(e.g. "Outlook = Overcast" for class "No")
– The probability will be zero!
– The a posteriori probability will also be zero!
(No matter how likely the other values are!)
• Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
• Result: probabilities will never be zero!
(also: stabilizes probability estimates)

Outlook counts for class No, with 1 added to each (Laplace estimator):

Outlook      Yes        No
Sunny        2 (2/9)    3 + 1 (4/8)
Overcast     4 (4/9)    0 + 1 (1/8)
Rainy        3 (3/9)    2 + 1 (3/8)

(The counts for the other attributes and for Play = Yes 9/14, No 5/14 are unchanged.)
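A small sketch of this correction (illustrative; the dictionary below just holds the Outlook counts for class No from the table): add 1 to each value's count and add the number of values to the denominator.

```python
counts_no = {"Sunny": 3, "Overcast": 0, "Rainy": 2}   # raw Outlook counts for Play = No
total_no = sum(counts_no.values())                    # 5
k = len(counts_no)                                    # 3 possible Outlook values

laplace = {v: (c + 1) / (total_no + k) for v, c in counts_no.items()}
print(laplace)   # {'Sunny': 0.5, 'Overcast': 0.125, 'Rainy': 0.375}  i.e. 4/8, 1/8, 3/8
```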

Page 49: Machine Learning and Data Mining

50

*Modified probability estimates

• In some cases adding a constant different from 1 might be more appropriate

• Example: attribute Outlook for class Yes, spreading a constant μ across the values with weights p1, p2, p3 (one per value):

Sunny: (2 + μ·p1) / (9 + μ)    Overcast: (4 + μ·p2) / (9 + μ)    Rainy: (3 + μ·p3) / (9 + μ)

• Weights don't need to be equal (but they must sum to 1)
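A tiny sketch of this estimate (the value of μ and the equal weights below are arbitrary illustrative choices; with μ = 3 and equal weights it reduces to the Laplace estimator):

```python
mu = 3.0
weights = {"Sunny": 1/3, "Overcast": 1/3, "Rainy": 1/3}   # must sum to 1
counts_yes = {"Sunny": 2, "Overcast": 4, "Rainy": 3}      # Outlook counts for Play = Yes (total 9)

est = {v: (counts_yes[v] + mu * weights[v]) / (9 + mu) for v in counts_yes}
print({v: round(p, 3) for v, p in est.items()})
# {'Sunny': 0.25, 'Overcast': 0.417, 'Rainy': 0.333}
```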


Page 51: Machine Learning and Data Mining

52

Missing values
• Training: the instance is not included in the frequency count for that attribute value-class combination
• Classification: the attribute is omitted from the calculation
• Example:

Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?

Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
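A quick sketch of the same computation, simply dropping the factor for the missing Outlook value:

```python
p_yes = {"Cool": 3/9, "High": 3/9, "True": 3/9}
p_no  = {"Cool": 1/5, "High": 4/5, "True": 3/5}

like_yes, like_no = 9/14, 5/14
for v in ["Cool", "High", "True"]:        # Outlook = ? is omitted from the product
    like_yes *= p_yes[v]
    like_no  *= p_no[v]

print(round(like_yes, 4), round(like_no, 4))          # 0.0238 0.0343
print(round(like_yes / (like_yes + like_no), 2))      # 0.41
```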

Page 52: Machine Learning and Data Mining

53

Numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
– Sample mean μ
– Standard deviation σ
– Then the density function f(x) is

    f(x) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))

Carl Friedrich Gauss, 1777-1855, great German mathematician

Page 53: Machine Learning and Data Mining

54

Statistics for weather data

Outlook          Yes       No
  Sunny          2 (2/9)   3 (3/5)
  Overcast       4 (4/9)   0 (0/5)
  Rainy          3 (3/9)   2 (2/5)

Temperature      Yes                      No
  values         64, 68, 69, 70, 72, …    65, 71, 72, 80, 85, …
  mean μ         73                       75
  std. dev. σ    6.2                      7.9

Humidity         Yes                      No
  values         65, 70, 70, 75, 80, …    70, 85, 90, 91, 95, …
  mean μ         79                       86
  std. dev. σ    10.2                     9.7

Windy            Yes       No
  False          6 (6/9)   2 (2/5)
  True           3 (3/9)   3 (3/5)

Play             Yes: 9 (9/14)   No: 5 (5/14)

• Example density value:
  f(temperature = 66 | yes) = (1 / (√(2π) · 6.2)) · exp(−(66 − 73)² / (2 · 6.2²)) = 0.0340
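A small sketch of this density calculation (the function below is the normal density from the previous slide; 73 and 6.2 are the temperature statistics for class yes):

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal probability density f(x) with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(f"{gaussian_density(66, 73, 6.2):.4f}")   # 0.0340
```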


Page 55: Machine Learning and Data Mining

56

Classifying a new day
• A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    66     90        true   ?

Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of "no"  = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
P("no")  = 0.000136 / (0.000036 + 0.000136) = 79.1%

• Missing values during training are not included in the calculation of the mean and standard deviation

Page 56: Machine Learning and Data Mining

57

Naïve Bayes: discussion
• Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
• Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g. identical attributes)
• Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
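For reference, a minimal sketch of what a kernel density estimate for a numeric attribute could look like (the bandwidth below is an arbitrary illustrative choice, not something prescribed by the slides): average a small Gaussian "bump" centred on every training value instead of fitting one normal distribution.

```python
import math

def kde(x, values, bandwidth):
    """Kernel density estimate at x: mean of Gaussian kernels centred on the training values."""
    return sum(
        math.exp(-(x - v) ** 2 / (2 * bandwidth ** 2)) / (math.sqrt(2 * math.pi) * bandwidth)
        for v in values
    ) / len(values)

temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]   # temperature values for Play = Yes
print(kde(66, temps_yes, 5.0))                     # density estimate at 66, bandwidth 5
```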

Page 57: Machine Learning and Data Mining

58

Naïve Bayes Extensions
• Improvements:
– select the best attributes (e.g. with greedy search)
– often works as well or better with just a fraction of all attributes
• Bayesian networks

Page 58: Machine Learning and Data Mining

59

Summary

• OneR – uses rules based on just one attribute
• Naïve Bayes – uses all attributes and Bayes' rule to estimate the probability of the class given an instance
• Simple methods frequently work well
– 1R and Naïve Bayes often do just as well, or even better
• but …
– Complex methods can be better (as we will see)