introduction to data mining

DATA MINING INTRODUCTION

UNIVERSITY OF WINDSOR

Lecturer: Roozbeh Razavi-Far ([email protected])


Post on 09-Apr-2023



Course information

- Lecturer: Dr. Roozbeh Razavi-Far, Department of Electrical & Computer Engineering, University of Windsor

- Contact Info: [email protected], http://www.researchgate.net/profile/Roozbeh_Razavi-Far

- Office Hours: Fridays, from 14:30 until 16:00. Office: ECE 3025

- Teaching Assistant: TBD

Course content

1. Introduction to data mining

2. Data and knowledge representation

3. Data preparation

4. Data mining predictive tasks:

a. Classification

b. Regression

5. Data mining descriptive tasks:

a. Association Rules

b. Clustering

6. Applications

7. Big data (if time permits)


Course materials

- Required:

Lecture slides will be available at:

Data Mining Clew

http://www.researchgate.net/profile/Roozbeh_Razavi-Far

- Recommended textbook:

Data Mining: Concepts and Techniques

Jiawei Han, Micheline Kamber, Jian Pei

- Extra readings:

Data Mining: Practical Machine Learning Tools and Techniques

Ian H. Witten, Eibe Frank

The Elements of Statistical Learning

Trevor Hastie, Robert Tibshirani, Jerome Friedman

Pattern Classification

Richard O. Duda, Peter E. Hart, David G. Stork


Course objectives

- To provide an introduction to the basic concepts and methods of data mining, which extract patterns and useful knowledge from data and transform them into an understandable structure for further use.

- To develop and apply recent techniques of data mining for extracting knowledge and solving practical problems.

- The material covered in this course is fundamental and is the basis for a wide range of advanced applications including machine learning, predictive analytics, process control, fault diagnosis, pattern recognition and decision making.

- During the course, students will complete an applied project.


Course outcomes

By the end of the course, the student should be able to:

- Describe the architecture of data;

- Describe the mechanisms of the major data mining functions;

- Manually compute data mining results from small sample datasets;

- Apply data mining methodologies to discover hidden patterns in large volumes of data;

- Analyze the obtained data mining results;

- Gain hands-on experience with the implementation of some data mining algorithms applied to real case studies.


Prerequisites

Basics of mathematics, statistics and probability theory.

Undergraduate-level knowledge is sufficient, and there will not be much theory.

There will be some basic linear algebra (e.g., eigenvalues, eigenvectors, matrix algebra) and multivariate calculus.

An understanding of the basic concepts of artificial intelligence & machine learning is beneficial.

Some practical examples and applications might be presented in MATLAB, Python, WEKA or SOLAS.

Programming skill is required for the project. Students should be able to develop applications in:

a. MATLAB

b. Python

c. R

d. C/C++

e. FORTRAN


Evaluation

1. Participation: 5

2. Assignment: 10

3. Midterm Exam: 30

4. Final Assignment: 10

5. Course Project:

a. Primary Report: 5

b. Primary Demo: 5

c. Class Presentation: 5

d. Final Demo: 25

e. Final Report: 5

Re-examination: None

An A+ will be given only for outstanding achievement.


Course schedule

Note: subject to changes

| Week | Tuesdays (16-15:20 PM) | Thursdays (16-15:20 PM) |
|------|------------------------|-------------------------|
| 1 | May 12 | May 14 |
| 2 | May 19 | May 21: Uploading projects |
| 3 | May 26 | May 28: Deadline to take a project (groups) |
| 4 | June 2 | June 4: Uploading assignment |
| 5 | June 9 | June 11: Deadline to submit primary reports |
| 6 | June 16: Deadline to submit assignment | June 18 |
| 8 | June 30: Midterm exam | July 2: PD-1 |
| 9 | July 7: CP-1, PD-2 | July 9: CP-2 |
| 10 | July 14: CP-3 | July 16 |
| 11 | July 21: CP-4 | July 23 |
| 12 | July 28: CP-5, Uploading final assignment | July 30 |
| 13 | Aug 4: Deadline to submit final assignment | Aug 6 |
| 14 | Aug 11 | Aug 13 |
| 16 | Aug 24: FD | Another day for FD will be announced |

PD: primary demos

CP: class presentation

FD: final demos

: bonus pre-scheduled presentations


Assignments and exam

- All assignments and the exam are paper-based.

- You have to write your own answers.

- Do not copy!

- The first assignment prepares you for the midterm exam.

- The final assignment is in effect a take-home final exam.

- Bonus: extra 5 points.


Course project

1. Check the posted projects on the Clew

2. Choose a project from the pool

3. Implement the project

4. Multiple deliveries

- All the projects must be demonstrated:

- Quality of presentation

- Quality of reports

- Code implementation and execution

- Performance evaluation

- Reliability of the results


Course project

Report

First Demo

Class Presentation

Final Demo

Final Report

Each group delivers one report:

- 2 to 3 pages

- Describe the topic

- Summarize the work

- Do not copy the given reference

- Explain the development procedure


Private demos given to the instructor

- All group members must participate and bring the primary report previously submitted

- For the date, please check the schedule

- Discuss your implementation procedure (Matlab/Python/R/C/C++/Fortran)

- Installation if needed but not recommended

- Explain your ideas about development/inputs/outputs/evaluation/analysis of results

- Give an initial presentation; you will then get confirmation for the class presentation


Class presentation

- One member of each group will present in class to all students

- For the date, check the schedule

- Presentation time is 20 minutes, including 5 minutes for questions

- Explain the project/goals/method/algorithm

- Explain your ideas about development/inputs/outputs/performance evaluation

- Discuss your own results (if available by the time of the presentation, not mandatory)


Private final demos given to the instructor

- All group members must participate and bring the final report on the same date

- For the date, check the schedule

- Detailed explanation of the implementation procedure

- Execute your code

- Present and analyze your results

- Be prepared for possible questions

- Deliver your code (I will test with different data) and final report


Each group delivers a final report:

- Up to 10 pages including diagrams and tables

- Describe the work, algorithm, development procedure

- Explain and analyze the attained results

- Do not copy


Collaboration policy

- You are encouraged to discuss your solutions and problem-solving methods with other students, but ultimately you must write your own code and produce your own results.

- If you have collaborated with other students in the planning and design of solutions, provide their names on your report.

- Plagiarism, cheating, misrepresentation of facts and participation in such offences are viewed as serious academic offences.

- Work submitted by a student that is the work of another student or any other person is considered plagiarism and is immediately referred to the Dean.

What is data?

Data is a set of values of qualitative or quantitative variables.

- Variable: a measurement or characteristic of an item, e.g., color and height.

- Qualitative variables: observable properties that generally cannot be measured with a numerical result, e.g., black or Canada.

http://en.wikipedia.org/wiki/Qualitative_property

- Quantitative variables: measurable properties that take numerical values in terms of a unit of measurement, e.g., 4 m, 50 kg.
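The distinction between qualitative and quantitative variables can be sketched in Python. This is a minimal illustration, not a general rule: the function name `variable_kind`, the column names and the second record are made up for the example, and treating "numeric type" as "quantitative" is a simplifying assumption.

```python
from numbers import Number


def variable_kind(values):
    """Label a variable 'quantitative' if every value is numeric, else 'qualitative'."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "quantitative"
    return "qualitative"


# Hypothetical records built around the slide's examples (black, Canada, 4 m, 50 kg).
records = [
    {"color": "black", "country": "Canada", "height_m": 4.0, "weight_kg": 50},
    {"color": "white", "country": "Canada", "height_m": 3.5, "weight_kg": 42},
]

# Collect the values of each variable across all records and label it.
kinds = {
    name: variable_kind([row[name] for row in records])
    for name in records[0]
}
print(kinds)
```

The type check is only a heuristic: numeric codes such as postal codes or ID numbers are stored as numbers but behave as qualitative variables.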


Raw data vs processed data

Raw data

• The original source of data

• Often hard to use for data analysis

• Has not been changed

• It may only need to be processed once

Processed data

• Data that is ready for analysis

• Editing, cleaning or modifying the raw data results in processed data

• There may be standards for processing

• All steps should be recorded
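As a rough sketch of the points above, the following Python snippet turns a small raw table into processed data while leaving the raw rows unchanged and recording every step. The rows, the field names and the particular cleaning steps are assumptions chosen for illustration; real projects would follow whatever processing standard applies to their data.

```python
# Hypothetical raw data: text fields exactly as they were collected.
raw_rows = [
    {"name": " Alice ", "height_cm": "170"},
    {"name": "Bob", "height_cm": ""},         # missing measurement
    {"name": "Alice", "height_cm": "170"},    # duplicate once whitespace is trimmed
]


def process(rows):
    """Return (processed_rows, log) without modifying the raw rows."""
    log = []

    # Step 1: trim stray whitespace (builds new dicts, raw rows stay intact).
    trimmed = [{k: v.strip() for k, v in row.items()} for row in rows]
    log.append("trimmed whitespace in all fields")

    # Step 2: drop rows whose measurement is missing.
    complete = [r for r in trimmed if r["height_cm"]]
    log.append(f"dropped {len(trimmed) - len(complete)} row(s) with missing height_cm")

    # Step 3: convert the measurement from text to a number.
    typed = [{"name": r["name"], "height_cm": float(r["height_cm"])} for r in complete]
    log.append("converted height_cm to float")

    # Step 4: remove exact duplicates while preserving order.
    seen, unique = set(), []
    for r in typed:
        key = (r["name"], r["height_cm"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    log.append(f"removed {len(typed) - len(unique)} duplicate row(s)")

    return unique, log


processed_rows, steps = process(raw_rows)
print(processed_rows)
for step in steps:
    print("-", step)
```

Returning the log together with the processed rows keeps the record of the steps next to the data they produced.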

Data, information, knowledge

[Figure: Data → Information → Knowledge, with axes labeled Value and Volume]

• Data: the most elementary description of things, events, activities, transactions.

• Information: organized data that has meaning and value.

• Knowledge: the concept of understanding information based on recognized patterns in a way that provides insight into the information.

Example:

• Data (facts: no patterns, no relation): 4, 2

• Information (description: who? what? where? when?): Temperature: 4°C, dew point: 2°C

• Knowledge (instruction: how?): There is a chance of icing. It could affect the performance. I should deice my aircraft.
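The icing example can be mirrored in a few lines of Python: bare values (data), the same values with labels attached (information), and a rule that turns them into an action (knowledge). The class name, the function name and especially the two thresholds are hypothetical and chosen only so that the slide's values trigger the rule; they are not operational guidance.

```python
from dataclasses import dataclass

# Data: bare values with no context (the slide's "4, 2").
data = (4, 2)


@dataclass
class Observation:
    """Information: the same values with meaning attached."""
    temperature_c: float
    dew_point_c: float


# ASSUMPTION: illustrative thresholds only, invented for this sketch.
MAX_TEMP_C = 5.0
MAX_SPREAD_C = 3.0


def advise(obs):
    """Knowledge: a rule that turns information into an instruction."""
    spread = obs.temperature_c - obs.dew_point_c
    if obs.temperature_c <= MAX_TEMP_C and spread <= MAX_SPREAD_C:
        return "chance of icing: deice the aircraft"
    return "no icing expected"


info = Observation(temperature_c=data[0], dew_point_c=data[1])
advice = advise(info)
print(info, "->", advice)
```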

Fact gap problem

From data to goal:

- Company: analyzing customer satisfaction surveys

- Car manufacturing: causal analysis of repair reports

- Medical science: causal analysis for heart diseases

The most important thing in data science is the question. Ask yourself:

- Can you find the data you need?

- Can you understand the data you found?

- Can you use the data you found?

Why data mining?

- Explosive growth of data: lots of data is being collected in

  • Business

  • Medical industry

  • Science & engineering

  • Social media

- More availability of computers and data mining tools: cheap and powerful

- Competitive pressure among companies:

  • Better customized services

  • Customer retention

Data rich, information poor

We are drowning in data but starving for knowledge.

We need a systematic development of data mining tools that can turn data tombs into “golden nuggets” of knowledge [1].


What is data mining?

- Data mining is the process of discovering interesting knowledge (patterns, rules, constraints) from large amounts of data [1].

- The discovered knowledge or patterns must be:

• Valid and reliable

• New

• Potentially useful

• Understandable

- “Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and had to filter it out, or you didn’t have sufficient information and had to go find some?” (Dan Meyer)

What is data mining?

[Figure: Databases → Data cleaning and data integration → Data warehouse → Selection → Task-relevant data → Data mining → Pattern → Pattern evaluation → Knowledge]

Alternative names:

• Knowledge discovery (mining) in databases (KDD)

• Knowledge extraction

• Data/pattern analysis, data archeology

• Data dredging, information harvesting

• Business intelligence, etc. [1]

What is not data mining?

• Simple search and query processing

• (Deductive) expert systems

• e.g., looking up a phone number in a phone directory [1]
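The difference between a simple lookup, which is not data mining, and the discovery of a pattern can be sketched in Python. The phone book and the shopping baskets are toy data invented for this illustration, and counting item pairs is just one simple way to surface a pattern.

```python
from collections import Counter
from itertools import combinations

# Simple lookup: the answer is already stored; nothing new is discovered.
phone_book = {"Alice": "555-0100", "Bob": "555-0101"}
number = phone_book["Alice"]

# Mining: summarize many records to surface a pattern nobody stored explicitly.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)

most_common_pair, count = pair_counts.most_common(1)[0]
print(number)
print(most_common_pair, count)
```

The lookup returns a fact that was entered by hand, while the pair count is derived from the collection as a whole.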

Knowledge Discovery Process [1]:

1. Data cleaning (to remove noise and incomplete data)

2. Data integration (multiple data sources are combined)

3. Data selection (data relevant to the analysis task are retrieved from the database)

4. Data transformation (data are transformed and consolidated into proper forms for mining by resorting to summary or aggregation operations)

5. Data mining (an essential process where intelligent and/or statistical methods are applied to extract data patterns)

6. Pattern evaluation (identifying the truly interesting patterns to represent knowledge based on interestingness measures)

7. Knowledge presentation (using visualization and knowledge representation techniques to present the mined knowledge to the user)

Steps 1 to 4 are forms of data preprocessing, in which the data are prepared for mining.
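The seven steps can be sketched end to end as a toy Python pipeline. Everything here is an assumption made for illustration: the two purchase sources, the function names, the choice of item pairs as the pattern, and the use of support (the share of customers whose basket contains the pair) as the interestingness measure.

```python
from collections import Counter

# Two hypothetical sources of (customer, item) purchase records.
store_a = [("c1", "bread"), ("c1", "milk"), ("c2", "bread"), ("c2", None), ("c1", "bag")]
store_b = [("c3", "Bread"), ("c3", "Milk"), ("c4", "jam"), ("c2", "milk"), ("c4", "bread")]


def clean(records):
    """1. Data cleaning: drop incomplete records."""
    return [(c, i) for c, i in records if c is not None and i is not None]


def integrate(*sources):
    """2. Data integration: combine multiple sources."""
    return [rec for src in sources for rec in src]


def select(records, items_of_interest):
    """3. Data selection: keep only task-relevant records."""
    return [(c, i) for c, i in records if i.lower() in items_of_interest]


def transform(records):
    """4. Data transformation: aggregate into one basket per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item.lower())
    return baskets


def mine(baskets):
    """5. Data mining: count how often each item pair occurs in one basket."""
    counts = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for idx, first in enumerate(ordered):
            for second in ordered[idx + 1:]:
                counts[(first, second)] += 1
    return counts


def evaluate(counts, n_baskets, min_support):
    """6. Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {
        pair: n / n_baskets
        for pair, n in counts.items()
        if n / n_baskets >= min_support
    }


def present(patterns):
    """7. Knowledge presentation: render patterns as readable sentences."""
    return [
        f"{a} and {b} are bought together by {support:.0%} of customers"
        for (a, b), support in sorted(patterns.items())
    ]


food = {"bread", "milk", "jam"}
integrated = integrate(clean(store_a), clean(store_b))
baskets = transform(select(integrated, food))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
print(report)
```

Keeping each step as its own function makes it easy to inspect the intermediate results, for example the baskets produced by preprocessing before any mining takes place.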


References

[1] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.
