Data Mining: Practical Machine Learning Techniques · ai.cau.ac.kr/teaching/ai-2014/09.pdf

Data Mining: Practical Machine Learning Techniques School of Computer Science & Engineering Chung-Ang University Artificial Intelligence Dae-Won Kim


Page 1: Data Mining: Practical Machine Learning Techniquesai.cau.ac.kr/teaching/ai-2014/09.pdf · 2014-10-30 · Data Mining: Practical Machine Learning Techniques School of Computer Science

Data Mining: Practical Machine Learning Techniques

School of Computer Science & Engineering Chung-Ang University

Artificial Intelligence

Dae-Won Kim


AI Scope

1. Search-based optimization techniques for real-life problems

• Hill climbing, Branch and bound, A*, Greedy algorithm

• Simulated annealing, Tabu search, Genetic algorithm

2. Machine Learning/ Pattern Recognition/ Data Mining

• Classification: Bayesian algorithm, Nearest-neighbor algorithm, Neural network

• Clustering: Hierarchical algorithm, K-Means algorithm

3. Reasoning: Logic, Inference, and knowledge representation

• Logical language: Syntax and Semantics

• Inference algorithm: Forward/Backward chaining, Resolution, and Expert System

4. Uncertainty based on Probability theory

5. Planning, Scheduling, Robotics, and Industry Automation


Have you ever heard about Big Data?


Progress in digital data acquisition and storage technology has resulted in the growth of huge databases.


Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.


We build algorithms that sift through databases automatically, seeking patterns.


Strong patterns, if found, will likely generalize to make accurate predictions on future data.


Algorithms need to be robust enough to cope with imperfect data and to extract patterns that are inexact but useful.


Machine learning provides the technical basis of data mining.


We will study simple machine learning methods, looking for patterns in data.


People have been seeking patterns in data since human life began.


In data mining, computer algorithms solve problems by analyzing data in databases.


Data mining is defined as the process of discovering patterns (knowledge) in data.



We start with a simple example.


Q: Tell me the name of this fish.


Algorithm ??


We have 100 fish and have measured their lengths (e.g., fish: x = [length]^T).


Our algorithm can measure the length of a new fish, and estimate its label.


Yes, this is a typical prediction task solved with a classification technique. However, the result is often inexact and unsatisfactory.
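One way to make the single-feature idea concrete is a threshold rule on length. The sketch below is only an illustration: the eight measurements, the rule "longer than the threshold means salmon", and the function names are assumptions, not the lecture's 100 fish or its actual procedure.

```python
# Hypothetical training data: (length, label). Illustrative values only.
TRAIN = [
    (70.3, "salmon"), (75.5, "salmon"), (66.0, "salmon"), (52.0, "salmon"),
    (51.1, "sea bass"), (49.9, "sea bass"), (55.0, "sea bass"), (68.0, "sea bass"),
]

def predict(length, threshold):
    """Assumed rule: a fish longer than the threshold is called a salmon."""
    return "salmon" if length > threshold else "sea bass"

def accuracy(threshold, data):
    """Fraction of fish in `data` that the threshold rule labels correctly."""
    correct = sum(predict(length, threshold) == label for length, label in data)
    return correct / len(data)

def learn_threshold(data):
    """Try the midpoint between each pair of neighboring lengths; keep the best."""
    lengths = sorted(length for length, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(lengths, lengths[1:])]
    return max(candidates, key=lambda t: accuracy(t, data))

threshold = learn_threshold(TRAIN)
print("threshold:", threshold)
print("training accuracy:", accuracy(threshold, TRAIN))
print("new fish of length 72.0:", predict(72.0, threshold))
```

Because the two classes overlap in length in this made-up sample, no threshold labels every training fish correctly, which is the kind of inexact result the slide refers to.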


Next, we measured their lightness. (e.g., fish: x=[lightness])


Lightness separates the two kinds of fish better than length.


Let us use both lightness and width. (e.g., fish: x=[lightness, width])


Each fish is represented as a point (vector) in 2D x-y coordinate space.


Everything is represented as an N-dimensional vector in a coordinate space.


The world is represented as a matrix.


We assume that you have learned the basic concepts of linear algebra.


The objective is to find a line that effectively separates two groups.


How can we find the line using simple high-school math?
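One possible answer that needs only midpoints and perpendicular lines: average each group to get its center, then use the perpendicular bisector of the segment joining the two centers as the separating line. The sketch below assumes this particular construction and uses made-up (lightness, width) values; the slides do not state which method the lecture actually uses.

```python
# Hypothetical (lightness, width) measurements; illustrative values only.
SALMON = [(10.0, 36.0), (12.0, 40.0), (9.0, 44.0)]
SEA_BASS = [(29.0, 60.0), (36.0, 52.0), (31.0, 56.0)]

def mean(points):
    """Center of a group: the average of each coordinate."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fit_line(class_a, class_b):
    """Return (w1, w2, b) for the line w1*x + w2*y + b = 0 that is the
    perpendicular bisector of the segment joining the two class means."""
    ax, ay = mean(class_a)
    bx, by = mean(class_b)
    w1, w2 = ax - bx, ay - by                # normal vector points toward class_a
    mx, my = (ax + bx) / 2, (ay + by) / 2    # the line passes through the midpoint
    b = -(w1 * mx + w2 * my)
    return w1, w2, b

def classify(point, line):
    """Points on the positive side of the line are assigned to class_a (salmon)."""
    w1, w2, b = line
    score = w1 * point[0] + w2 * point[1] + b
    return "salmon" if score > 0 else "sea bass"

line = fit_line(SALMON, SEA_BASS)
print(classify((11.0, 38.0), line))   # a new fish near the salmon group
print(classify((33.0, 55.0), line))   # a new fish near the sea bass group
```

This rule is equivalent to assigning each fish to the nearer group center, so it works well only when the groups are compact and roughly the same spread.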


We can build a complex nonlinear boundary (a curve rather than a straight line) to provide exact separation.
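A nearest-neighbor rule is one simple way to get such a boundary. It is named in the course scope, but whether the slide's figure was produced this way is an assumption. The sketch below uses made-up points in which one salmon sits inside the sea bass group, so no straight line can separate the classes, yet the nearest-neighbor rule still labels every training fish correctly.

```python
import math

# Hypothetical (lightness, width, label) points; the salmon at (32, 54) lies
# inside the triangle formed by the three sea bass.
TRAIN = [
    (10.0, 36.0, "salmon"), (12.0, 40.0, "salmon"), (32.0, 54.0, "salmon"),
    (29.0, 60.0, "sea bass"), (36.0, 52.0, "sea bass"), (30.0, 50.0, "sea bass"),
]

def nearest_neighbor(point, train):
    """Return the label of the training point closest to `point` (Euclidean)."""
    best = min(train, key=lambda row: math.dist(point, row[:2]))
    return best[2]

# Every training point is its own nearest neighbor, so separation is exact.
hits = sum(nearest_neighbor(row[:2], TRAIN) == row[2] for row in TRAIN)
training_accuracy = hits / len(TRAIN)
print("training accuracy:", training_accuracy)
print("new fish (11, 38):", nearest_neighbor((11.0, 38.0), TRAIN))
```

Exact separation of the training points does not by itself say how well the boundary will label new fish.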


The formal procedure is given as:


This illustrates a predictive task of data mining, often called pattern classification, recognition, or prediction.


Pattern recognition is the act of taking in raw data and taking an action based on the category of the pattern.


We build a machine that can recognize or predict patterns.


Another well-known task of data mining is the descriptive task. Cluster analysis is a well-known technique for discovering groups in data.
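As a rough illustration of group discovery, here is a small K-Means sketch (K-Means is listed in the course scope). The unlabeled points and the hand-picked starting centers are assumptions made to keep the run short and repeatable.

```python
import math

# Hypothetical unlabeled (lightness, width) points; illustrative values only.
POINTS = [(10.0, 36.0), (12.0, 40.0), (9.0, 44.0),
          (29.0, 60.0), (36.0, 52.0), (31.0, 56.0)]

def centroid(points):
    """Mean of a non-empty list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def k_means(points, centers, max_iter=100):
    """Alternate two steps until the centers stop moving:
    assign each point to its nearest center, then move each center
    to the mean of the points assigned to it."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

# Starting centers chosen by hand (first and last point) so the run is deterministic.
centers, groups = k_means(POINTS, [POINTS[0], POINTS[-1]])
for center, group in zip(centers, groups):
    print(center, group)
```

No class labels are used anywhere: the two groups emerge from the distances alone, which is what distinguishes this descriptive task from classification.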


We will explore the basic issues in the prediction task (pattern classification) in the forthcoming weeks.


Some terms should be defined.



Given a training data set as an ‘n x d’ pattern/data matrix:

Fish     Lightness   Length   Weight   Width   Class Label
Fish-1   10          70.3     6.0      36      Salmon
Fish-2   10          75.5     8.8      128     Salmon
Fish-3   29          51.1     9.4      164     Sea bass
Fish-4   36          49.9     8.4      113     Sea bass

‘n’ patterns (objects, observations, vectors, records)

‘d’ features (attributes, variables, dimensions, fields)
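The four rows of the fish table can be written down as a pattern matrix plus a label list. This is only a sketch in plain Python lists; the names ROWS, X, and y are illustrative choices, not notation from the lecture.

```python
# The four fish from the table: lightness, length, weight, width, class label.
ROWS = [
    ("Fish-1", 10, 70.3, 6.0, 36, "Salmon"),
    ("Fish-2", 10, 75.5, 8.8, 128, "Salmon"),
    ("Fish-3", 29, 51.1, 9.4, 164, "Sea bass"),
    ("Fish-4", 36, 49.9, 8.4, 113, "Sea bass"),
]

# X is the n x d pattern matrix (one feature vector per row); y holds the labels.
X = [[float(value) for value in row[1:-1]] for row in ROWS]
y = [row[-1] for row in ROWS]

n, d = len(X), len(X[0])
print(n, d)          # 4 4
print(X[2], y[2])    # [29.0, 51.1, 9.4, 164.0] Sea bass
```

Here n = 4 patterns and d = 4 features; the class label is kept outside the matrix because it is the answer to be predicted, not a feature.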


Each pattern is represented as a feature vector.


The training pattern matrix is stored in a file or database.


Given labeled training patterns, the class groups are known a priori.


We construct algorithms to classify new data into the known groups.


Training data vs. Test data


Training data serve as known answers. We train (learn) our algorithms using the training data.


Test data are a set of new, unseen data. We predict their class labels using the learned algorithm.


Training data

# of data   # of features
data index  feature-1  feature-2  …  feature-N  class label
data index  feature-1  feature-2  …  feature-N  class label
…
data index  feature-1  feature-2  …  feature-N  class label

Test data

# of data   # of features
data index  feature-1  feature-2  …  feature-N
data index  feature-1  feature-2  …  feature-N
data index  feature-1  feature-2  …  feature-N
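A reader for this layout might look like the sketch below. It assumes whitespace-separated fields and a single-token class label; the function name parse_dataset and the tiny sample texts are made up for illustration and are not files from the course.

```python
def parse_dataset(text, labeled):
    """Parse a header line 'n d' followed by n rows of
    'index f1 ... fd' (plus a trailing class label when labeled=True).
    Returns (indices, features, labels); labels is None for test data."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    n, d = int(lines[0][0]), int(lines[0][1])
    rows = lines[1:]
    if len(rows) != n:
        raise ValueError(f"header says {n} rows, file has {len(rows)}")
    width = 1 + d + (1 if labeled else 0)
    indices, features, labels = [], [], []
    for tokens in rows:
        if len(tokens) != width:
            raise ValueError(f"row {tokens[0]} has {len(tokens)} fields, expected {width}")
        indices.append(tokens[0])
        features.append([float(t) for t in tokens[1:1 + d]])
        if labeled:
            labels.append(tokens[-1])
    return indices, features, (labels if labeled else None)

# Made-up miniature files in the same layout (2 features, labels "a" and "b").
TRAIN_TEXT = """
3 2
sample-1 1.0 2.0 a
sample-2 1.5 1.8 a
sample-3 5.0 8.0 b
"""
TEST_TEXT = """
2 2
sample-1 1.2 2.1
sample-2 4.8 7.9
"""

train_ids, train_X, train_y = parse_dataset(TRAIN_TEXT, labeled=True)
test_ids, test_X, test_y = parse_dataset(TEST_TEXT, labeled=False)
print(train_X, train_y)
print(test_X, test_y)
```

The only difference between the two calls is the labeled flag: test rows carry no class label, so the labels must come from the learned algorithm.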


For example, suppose we want to classify the tumor type of breast cancer patients.


Breast-cancer-training.txt

100 30
Patient-1    165  52  …  210  cancer
Patient-2    170  50  …  230  normal
…
Patient-100  160  47  …  250  cancer

Breast-cancer-test.txt

10 30
Patient-1    163  55  …  215
Patient-2    155  50  …  240
Patient-10   165  45  …  235


To evaluate the performance of prediction algorithms, we need a performance measure (Accuracy).


                                 Gold Standard (Truth)
                                 Positive            Negative
Prediction   Positive            True Positive       False Positive
Result       Negative            False Negative      True Negative

                                 Suspicious Patients with Breast Cancer
                                 Positive (Cancer)   Negative (Normal)
Prediction   Positive (Cancer)   True Positive       False Positive
Result       Negative (Normal)   False Negative      True Negative

Accuracy = (True Positive + True Negative) / (True Positive + False Positive + False Negative + True Negative)



                                 Suspicious Patients with Breast Cancer
                                 Positive (Cancer)   Negative (Normal)
Prediction   Positive (Cancer)   30                  5
Result       Negative (Normal)   10                  55

Accuracy = (30 + 55) / (30 + 5 + 10 + 55) = 0.85 (85%)
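The same calculation can be checked in a few lines of code. The sketch below uses the four counts from the table; the helper confusion_counts and its short example lists are illustrative additions showing how such counts could be tallied from true and predicted labels.

```python
def accuracy(tp, fp, fn, tn):
    """(True Positive + True Negative) divided by the total number of predictions."""
    return (tp + tn) / (tp + fp + fn + tn)

def confusion_counts(truth, predicted, positive):
    """Tally TP, FP, FN, TN, treating the class `positive` as the positive one."""
    tp = fp = fn = tn = 0
    for actual, guess in zip(truth, predicted):
        if guess == positive and actual == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Counts from the table: 30 true positives, 5 false positives,
# 10 false negatives, 55 true negatives.
print(accuracy(tp=30, fp=5, fn=10, tn=55))   # 0.85

# Hypothetical four-patient example of tallying the counts from labels.
truth = ["cancer", "normal", "cancer", "normal"]
predicted = ["cancer", "cancer", "normal", "normal"]
print(confusion_counts(truth, predicted, positive="cancer"))   # (1, 1, 1, 1)
```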