
Data Mining: Practical Machine Learning Techniques

School of Computer Science & Engineering Chung-Ang University

Artificial Intelligence

Dae-Won Kim

AI Scope

1. Search-based optimization techniques for real-life problems

• Hill climbing, Branch and bound, A*, Greedy algorithm

• Simulated annealing, Tabu search, Genetic algorithm

2. Machine Learning/ Pattern Recognition/ Data Mining

• Classification: Bayesian algorithm, Nearest-neighbor algorithm, Neural network

• Clustering: Hierarchical algorithm, K-Means algorithm

3. Reasoning: Logic, Inference, and knowledge representation

• Logical language: Syntax and Semantics

• Inference algorithm: Forward/Backward chaining, Resolution, and Expert System

4. Uncertainty based on Probability theory

5. Planning, Scheduling, Robotics, and Industry Automation

Have you ever heard about Big Data?

Progress in digital data acquisition and storage technology has resulted in the growth of huge databases.

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.

We build algorithms that sift through databases automatically, seeking patterns.

Strong patterns, if found, will likely generalize to make accurate predictions on future data.

Algorithms need to be robust enough to cope with imperfect data and to extract patterns that are inexact but useful.

Machine learning provides the technical basis of data mining.

We will study simple machine learning methods, looking for patterns in data.

People have been seeking patterns in data since human life began.

In data mining, a computer algorithm solves problems by analyzing data in databases.

Data mining is defined as the process of discovering patterns (knowledge) in data.

We start with a simple example.

Q: Tell me the name of this fish.

Algorithm ??

We have 100 fish and measured their lengths (e.g., fish: x = [length]^T).

Our algorithm can measure the length of a new fish, and estimate its label.

Yes, this is a typical prediction task using a classification technique. But it is often inexact and unsatisfactory.
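As a minimal sketch of this one-feature classifier: the lengths echo the fish table below, but the threshold value is an illustrative assumption of this writeup, not a number from the course.

    # A minimal sketch of classification with a single feature x = [length].
    # The threshold of 60.0 is an assumption chosen to split the example
    # lengths below; it is not a value given in the course material.
    def classify_by_length(length, threshold=60.0):
        """Predict a label from length alone: one cut point on the number line."""
        return "Salmon" if length > threshold else "Sea bass"

    print(classify_by_length(70.3))  # -> "Salmon"
    print(classify_by_length(51.1))  # -> "Sea bass"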

Next, we measured their lightness. (e.g., fish: x=[lightness])

Lightness is better than length.

Let us use both lightness and width. (e.g., fish: x=[lightness, width])

Each fish is represented as a point (vector) in the 2-D x-y coordinate space.

Everything is represented as an N-dimensional vector in a coordinate space.

The world is represented as a matrix.

We assume that you have learned the basic concepts of linear algebra.
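To make this concrete, here is a minimal sketch (NumPy is an assumption of this writeup; the values echo the fish table later in this section):

    import numpy as np

    # Each fish is one feature vector x = [lightness, width]; stacking the
    # vectors row by row gives the data matrix ("the world as a matrix").
    x1 = np.array([10.0, 36.0])            # one fish: a point in 2-D space
    X = np.array([[10, 36], [10, 128],
                  [29, 164], [36, 113]], dtype=float)
    print(X.shape)  # (4, 2): n = 4 patterns, d = 2 features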

The objective is to find a line that effectively separates two groups.

How do we find the line using simple math from high school?
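One answer that needs only high-school geometry is a nearest-mean rule: compute the mean of each class and assign a new point to the closer mean, which implicitly draws the perpendicular bisector of the two means as the separating line. A hedged sketch, not the slides' own formulation (feature values come from the fish table below; the test points are hypothetical):

    import numpy as np

    # Feature vectors x = [lightness, width] from the example fish table.
    salmon = np.array([[10.0, 36.0], [10.0, 128.0]])
    seabass = np.array([[29.0, 164.0], [36.0, 113.0]])

    # Class means; the decision boundary of the nearest-mean rule is the
    # perpendicular bisector of the segment joining these two means.
    m_salmon, m_seabass = salmon.mean(axis=0), seabass.mean(axis=0)

    def classify(x):
        """Assign x to the class whose mean is nearer (a linear separator)."""
        x = np.asarray(x, dtype=float)
        if np.linalg.norm(x - m_salmon) < np.linalg.norm(x - m_seabass):
            return "Salmon"
        return "Sea bass"

    print(classify([12.0, 50.0]))   # near the salmon mean   -> "Salmon"
    print(classify([33.0, 150.0]))  # near the sea bass mean -> "Sea bass"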

We can build a complex nonlinear boundary to provide exact separation.

This is a predictive task of data mining, often called pattern classification, recognition, or prediction.

Pattern recognition is the act of taking in raw data and taking an action based on the category of the pattern.

We build a machine that can recognize or predict patterns.

Another famous task of data mining is the descriptive task. Cluster analysis is the best-known group-discovery technique.
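For contrast with the predictive task, a hedged sketch of the descriptive one: the K-Means algorithm from the course outline, discovering two groups without using any class labels. The data points are the fish vectors again; the implementation details (initialization, iteration count) are assumptions of this writeup.

    import numpy as np

    def kmeans(X, k, iters=20, seed=0):
        """Discover k groups in unlabeled data by alternating two steps."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Step 1: assign each pattern to its nearest center.
            dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: move each center to the mean of its assigned patterns
            # (keeping a center in place if its cluster became empty).
            centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        return labels, centers

    X = np.array([[10, 36], [10, 128], [29, 164], [36, 113]], dtype=float)
    labels, _ = kmeans(X, k=2)
    print(labels)  # two discovered groups; no class labels were needed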

We will explore the basic issues of the prediction task (pattern classification) in the forthcoming weeks.

Some terms should be defined.

Given training data set: an 'n x d' pattern/data matrix:

Fish     Lightness   Length   Weight   Width   Class Label
Fish-1      10        70.3     6.0      36     Salmon
Fish-2      10        75.5     8.8     128     Salmon
Fish-3      29        51.1     9.4     164     Sea bass
Fish-4      36        49.9     8.4     113     Sea bass


‘n’ patterns (objects, observations, vectors, records)

‘d’ features (attributes, variables, dimensions, fields)

Each pattern is represented as a feature vector.

The training pattern matrix is stored in a file or database.

Given labeled training patterns, the class groups are known a priori.

We construct algorithms to classify new data into the known groups.

Training data vs. Test data

Training data are used as the answers; we learn (train) algorithms using the training data.

Test data are a set of new, unseen data; we predict their class labels using the learned algorithm.
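A hedged sketch of that split, reusing the nearest-mean idea from earlier: learn the class means from labeled training data, then predict labels for unseen test patterns (training values come from the fish table; the test patterns are hypothetical).

    import numpy as np

    def fit(train_X, train_y):
        """Learn one mean vector per class from the labeled training data."""
        return {c: train_X[np.asarray(train_y) == c].mean(axis=0)
                for c in set(train_y)}

    def predict(means, test_X):
        """Label each unseen test pattern with the class of the nearest mean."""
        return [min(means, key=lambda c: np.linalg.norm(x - means[c]))
                for x in test_X]

    train_X = np.array([[10, 36], [10, 128], [29, 164], [36, 113]], dtype=float)
    train_y = ["Salmon", "Salmon", "Sea bass", "Sea bass"]
    test_X = np.array([[12, 50], [33, 150]], dtype=float)  # labels unknown
    print(predict(fit(train_X, train_y), test_X))  # ['Salmon', 'Sea bass']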

Training data:

# of data   # of features
data-index   feature-1   feature-2   …   feature-N   class-label
data-index   feature-1   feature-2   …   feature-N   class-label
…
data-index   feature-1   feature-2   …   feature-N   class-label

Test data:

# of data   # of features
data-index   feature-1   feature-2   …   feature-N
data-index   feature-1   feature-2   …   feature-N
…
data-index   feature-1   feature-2   …   feature-N

For example, we try to classify the tumor type of breast-cancer patients:

Breast-cancer-training.txt:

100 30
Patient-1     165   52   …   210   cancer
Patient-2     170   50   …   230   normal
…
Patient-100   160   47   …   250   cancer

Breast-cancer-test.txt:

10 30
Patient-1    163   55   …   215
Patient-2    155   50   …   240
…
Patient-10   165   45   …   235
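A sketch of reading files in this layout. The exact on-disk format (whitespace-separated values, a header line holding n and d, the class label in the last column of training rows) is an assumption inferred from the listings above, and the '…' in the listings stands for elided rows and columns.

    def load_patterns(path, labeled=True):
        """Read 'n d' from the first line, then one pattern per row."""
        with open(path) as f:
            n, d = map(int, f.readline().split())
            ids, X, y = [], [], []
            for _ in range(n):
                parts = f.readline().split()
                ids.append(parts[0])                       # data index
                X.append([float(v) for v in parts[1:1 + d]])  # d features
                if labeled:
                    y.append(parts[1 + d])                 # class label
            return (ids, X, y) if labeled else (ids, X)

    # ids, X, y = load_patterns("Breast-cancer-training.txt", labeled=True)
    # ids, X = load_patterns("Breast-cancer-test.txt", labeled=False)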

To evaluate the performance of prediction algorithms, we need a performance measure (Accuracy).

                                     Gold Standard (Truth)
                                     Positive          Negative
Prediction    Positive           True Positive     False Positive
Result        Negative           False Negative    True Negative

Suspicious Patients with Breast Cancer:

                                     Positive (Cancer)   Negative (Normal)
Prediction    Positive (Cancer)     True Positive        False Positive
Result        Negative (Normal)     False Negative       True Negative

Accuracy = (True Positive + True Negative) / (True Positive + False Positive + False Negative + True Negative)


Suspicious Patients with Breast Cancer:

                                     Positive (Cancer)   Negative (Normal)
Prediction    Positive (Cancer)            30                    5
Result        Negative (Normal)            10                   55

Accuracy = (30 + 55) / (30 + 5 + 10 + 55) = 0.85 (85%)
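Recomputing the worked example as code, with a minimal accuracy function over the four confusion-matrix counts:

    def accuracy(tp, fp, fn, tn):
        """Fraction of predictions that agree with the gold standard."""
        return (tp + tn) / (tp + fp + fn + tn)

    print(accuracy(tp=30, fp=5, fn=10, tn=55))  # 0.85, i.e., 85%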
