Introduction to Data Mining
Post on 09-Apr-2023
Course information
- Lecturer: Dr. Roozbeh Razavi-Far, Department of Electrical & Computer Engineering, University of Windsor
- Contact Info:
roozbeh@uwindsor.ca http://www.researchgate.net/profile/Roozbeh_Razavi-Far
- Office Hours:
Fridays, from 14:30 until 16:00, Office: ECE 3025
- Teaching Assistant:
TBD
Course content
1. Introduction to data mining
2. Data and knowledge representation
3. Data preparation
4. Data mining predictive tasks:
a. Classification
b. Regression
5. Data mining descriptive tasks:
a. Association Rules
b. Clustering
6. Applications
7. Big data (if time permits)
Course materials
- Required:
Lecture slides will be available at:
Data Mining Clew
http://www.researchgate.net/profile/Roozbeh_Razavi-Far
- Recommended textbook:
Data Mining: Concepts and Techniques
Jiawei Han, Micheline Kamber, Jian Pei
- Extra readings:
Data Mining: Practical Machine Learning Tools and Techniques
Ian H. Witten, Eibe Frank
The Elements of Statistical Learning
Trevor Hastie, Robert Tibshirani, Jerome Friedman
Pattern Classification
Richard O. Duda, Peter E. Hart, David G. Stork
Course objectives
- To provide an introduction to the basic concepts and methods of data mining, which extracts patterns and useful knowledge from data and transforms them into an understandable structure for further use.
- To develop and apply recent techniques of data mining for extracting knowledge and solving practical problems.
- The material covered in this course is fundamental and is the basis for a wide range of advanced applications including machine learning, predictive analytics, process control, fault diagnosis, pattern recognition and decision making.
- During the course, students will complete an applied project.
Course outcomes
By the end of the course, the student should be able to:
- Describe the architecture of data;
- Describe the mechanisms of the major data mining functions;
- Manually compute data mining results from small sample datasets;
- Apply data mining methodologies to discover hidden patterns in large volumes of data;
- Analyze the obtained data mining results;
- Gain hands-on experience implementing some data mining algorithms and applying them to real case studies.
Prerequisites
Basics of mathematics, statistics and probability theory.
An undergraduate level is fine and there will not be much theory.
There will be some basic linear algebra, e.g., eigenvalues, eigenvectors, matrix algebra and multivariate calculus.
Understanding the basic concepts of artificial intelligence and machine learning is beneficial.
Some practical examples and applications might be presented in MATLAB, Python, WEKA, or SOLAS.
The project requires programming skills; students should be able to develop applications in:
a. MATLAB
b. Python
c. R
d. C/C++
e. FORTRAN
Evaluation
1. Participation: 5
2. Assignment: 10
3. Midterm Exam: 30
4. Final Assignment: 10
5. Course Project:
a. Primary Report: 5
b. Primary Demo: 5
c. Class Presentation: 5
d. Final Demo: 25
e. Final Report: 5
Re-examination: None
An A+ will be given only for outstanding achievement.
Course schedule
Note: subject to changes
Week  Tuesdays (16-15:20 PM)                        Thursdays (16-15:20 PM)
1     May 12:                                       May 14:
2     May 19:                                       May 21: Uploading projects
3     May 26:                                       May 28: Deadline to take a project (groups)
4     June 2:                                       June 4: Uploading assignment
5     June 9:                                       June 11: Deadline to submit primary reports
6     June 16: Deadline to submit assignment        June 18:
8     June 30: Midterm exam                         July 2: PD-1
9     July 7: CP-1, PD-2                            July 9: CP-2
10    July 14: CP-3                                 July 16:
11    July 21: CP-4                                 July 23:
12    July 28: CP-5, Uploading final assignment     July 30:
13    Aug 4: Deadline to submit final assignment    Aug 6:
14    Aug 11:                                       Aug 13:
16    Aug 24: FD (another day for FD will be announced)
Legend: PD = primary demos; CP = class presentation; FD = final demos. Bonus pre-scheduled presentations are marked separately.
Assignments and exam
- All assignments and the exam are paper-based.
- You have to write your own answers.
- Do not copy!
- The first assignment prepares you for the midterm exam.
- The final assignment is in effect a take-home final exam.
- Bonus: extra 5 points.
Course project
- Check the posted projects on the Clew
- Choose a project from the pool
- Implement the project
- Multiple deliveries
- All the projects must be demonstrated:
- Quality of presentation
- Quality of reports
- Code implementation and execution
- Performance evaluation
- Reliability of the results
Course project
Report
First Demo
Class Presentation
Final Demo
Final Report
Each group delivers one report
- 2 to 3 pages
- Describe the topic
- Summarize the work
- Do not copy the given reference
- Explain the development procedure
Private demos given to the instructor
- All group members must participate and bring the primary report previously submitted
- For the date, please check the schedule
- Discuss your implementation procedure (Matlab/Python/R/C/C++/Fortran)
- Installation if needed but not recommended
- Explain your ideas about development/inputs/outputs/evaluation/analysis of results
- Give an initial presentation; you will then receive confirmation for the class presentation
Class presentation
- One member of each group will present to the whole class
- For the date, check the schedule
- Presentation time is 20 minutes, including 5 minutes for questions
- Explain the project/goals/method/algorithm
- Explain your ideas about development/inputs/outputs/performance evaluation
- Discuss your own results (if available by the time of the presentation; not mandatory)
Private final demos given to the instructor
- All group members must participate and bring the final report on the same date
- For the date, check the schedule
- Detailed explanation of the implementation procedure
- Execute your code
- Present and analyze your results
- Be prepared for possible questions
- Deliver your code (I will test with different data) and final report
Each group delivers a final report
- Up to 10 pages including diagrams and tables
- Describe the work, algorithm, development procedure
- Explain and analyze the attained results
- Do not copy
Collaboration policy
- You are encouraged to discuss your solutions and problem-solving methods with other students, but ultimately you must write your own code and produce your own results.
- If you have collaborated with other students in the planning and design of solutions, provide their names on your report.
- Plagiarism, cheating, misrepresentation of facts and participation in such offences are viewed as serious academic offences.
- Work submitted by a student that is the work of another student or any other person is considered plagiarism and is immediately referred to the Dean.
What is data?
Data is a set of values of qualitative or quantitative variables.
- Variable: a measurement or characteristic of an item, e.g., color and height.
- Qualitative: observable properties that generally cannot be measured with a numerical result, e.g., black or Canada.
http://en.wikipedia.org/wiki/Qualitative_property
- Quantitative: measurable properties that contain numerical values in terms of a unit of measurement, e.g., 4 m, 50 kg.
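The distinction between qualitative and quantitative variables can be illustrated with a short Python sketch. The records, the variable names, and the `variable_kind` helper are hypothetical, and the type-based test is only a rough heuristic: a numeric code such as a postal code would still be qualitative.

```python
# Hypothetical records: each item is described by a few variables.
items = [
    {"color": "black", "country": "Canada", "height_m": 4.0, "weight_kg": 50.0},
    {"color": "white", "country": "Canada", "height_m": 1.8, "weight_kg": 72.5},
]

def variable_kind(values):
    """Return 'quantitative' if every value is a number, else 'qualitative'."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if is_number else "qualitative"

# Classify each variable by looking at its values across all items.
kinds = {name: variable_kind([item[name] for item in items]) for name in items[0]}
print(kinds)
```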
Raw data vs processed data
Raw data
• The original source of data
• Often hard to use for data analysis
• Has not been changed
• It may only need to be processed once
Processed data
• Data that is ready for analysis
• Editing, cleaning or modifying the raw data results in processed data
• There may be standards for processing
• All steps should be recorded
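As a minimal sketch of turning raw data into processed data while recording every step, consider the Python example below. The raw values, the missing-value markers, and the `process` function are assumptions made for illustration, not part of the course material.

```python
# Hypothetical raw measurements, left exactly as collected.
raw_heights = ["4m", " 3.5 m", "", "n/a", "4m"]

def process(raw):
    """Return (values in metres, log of the processing steps applied)."""
    steps = []
    stripped = [r.strip() for r in raw]
    steps.append("strip whitespace")
    kept = [r for r in stripped if r not in ("", "n/a")]
    steps.append(f"drop {len(stripped) - len(kept)} missing entries")
    values = [float(r.rstrip("m").strip()) for r in kept]
    steps.append("remove unit suffix 'm' and convert to float")
    return values, steps

processed, steps = process(raw_heights)
print(processed)
for number, step in enumerate(steps, start=1):
    print(number, step)
```

The raw list is never modified, so the processing can be repeated or audited from the recorded steps.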
Data, information, knowledge
[Figure: Data, Information, Knowledge; axes: Value, Volume]
• Data: the most elementary description of things, events, activities, transactions.
• Information: organized data that has meaning and value.
• Knowledge: the concept of understanding information based on recognized patterns in a way that provides insight to information.
Data (facts: no patterns, no relation): 4, 2
Information (description: who? what? where? when?): Temperature: 4°C, Dew point: 2°C
Knowledge (instruction: how?): There is a chance of icing. It could affect the performance. I should deice my aircraft.
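The step from data to information to knowledge in the temperature example can be sketched in Python as follows. The `icing_advice` function and its thresholds are assumptions chosen only to make the example run; they are not aviation guidance and are not taken from the slides.

```python
# Data: raw facts from the example, with no labels and no relation.
data = (4, 2)

# Information: the same values with a description attached.
information = {"temperature_c": data[0], "dew_point_c": data[1]}

def icing_advice(info, max_temp_c=5.0, max_spread_c=3.0):
    """Toy rule with assumed thresholds: cold air close to its dew point."""
    spread = info["temperature_c"] - info["dew_point_c"]
    if info["temperature_c"] <= max_temp_c and spread <= max_spread_c:
        return "chance of icing: deice the aircraft"
    return "no icing expected"

# Knowledge: an instruction derived from a recognized pattern.
knowledge = icing_advice(information)
print(knowledge)
```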
Fact gap problem
Data Goal
- Company: analyzing customer satisfaction surveys
- Car manufacturing: causal analysis of repair reports
- Medical science: causal analysis for heart diseases
The most important thing in data science is the question. Ask yourself:
- Can you find the data you need?
- Can you understand the data you found?
- Can you use the data you found?
Why data mining?
- Competitive pressure among companies: better customized services, customer retention
- Explosive growth of data: lots of data is being collected (business, medical industry, science & engineering, social media)
- More availability of computers and data mining tools: cheap and powerful
Data rich, information poor
We are drowning in data but starving for knowledge.
We need a systematic development of data mining tools that can turn data tombs into “golden nuggets” of knowledge [1].
What is data mining?
- Data mining is the process of discovering interesting knowledge (patterns, rules, constraints) from large amounts of data [1].
- This knowledge or these patterns must be:
• Valid and reliable
• New
• Potentially useful
• Understandable
- “Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and had to filter it out, or you didn’t have sufficient information and had to go find some?” (Dan Meyer)
What is data mining?
[Figure: knowledge discovery process. Databases → Data cleaning, Data integration → Data warehouse → Selection → Task-relevant data → Data mining → Pattern → Pattern evaluation → Knowledge]
Alternative names:
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis, data archeology
• Data dredging, information harvesting
• Business intelligence, etc. [1]
What is not data mining?
• Simple search and query processing
• (Deductive) expert systems
• e.g., looking up a phone number in a phone directory [1]
Knowledge Discovery Process [1]:
1. Data cleaning (to remove noise and incomplete data)
2. Data integration (multiple data sources are combined)
3. Data selection (data relevant to the analysis task are retrieved from the database)
4. Data transformation (data are transformed and consolidated into proper forms for mining by resorting to summary or aggregation operations)
5. Data mining (an essential process where intelligent and/or statistical methods are applied to extract data patterns)
6. Pattern evaluation (identifying the truly interesting patterns to represent knowledge based on interestingness measures)
7. Knowledge presentation (using visualization and knowledge representation techniques to present the mined knowledge to the user)
Steps 1 to 4 are forms of data preprocessing, in which the data are prepared for mining.
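The seven steps of the knowledge discovery process can be sketched end to end on toy data. In the Python example below, the purchase records, the function names, and the use of simple item counting as the "mining" step are all assumptions for illustration; real data mining methods (classification, clustering, association rules) would replace the `mine` function.

```python
from collections import Counter

# Hypothetical toy sources: (customer, item) purchase records.
source_a = [("ann", "milk"), ("bob", "bread"), ("ann", None)]
source_b = [("cat", "milk"), ("bob", "milk"), ("ann", "bread")]

def clean(records):
    """1. Data cleaning: drop records with missing fields."""
    return [r for r in records if None not in r]

def integrate(*sources):
    """2. Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, customers):
    """3. Data selection: keep only task-relevant records."""
    return [r for r in records if r[0] in customers]

def transform(records):
    """4. Data transformation: aggregate records into item counts."""
    return Counter(item for _, item in records)

def mine(counts, min_count):
    """5. Data mining: frequent items, as a stand-in for a real method."""
    return {item: n for item, n in counts.items() if n >= min_count}

def evaluate(patterns, total):
    """6. Pattern evaluation: score each pattern by its support."""
    return {item: n / total for item, n in patterns.items()}

records = integrate(clean(source_a), clean(source_b))
relevant = select(records, customers={"ann", "bob"})
counts = transform(relevant)
patterns = evaluate(mine(counts, min_count=2), total=len(relevant))

# 7. Knowledge presentation: a plain-text report for the user.
for item, support in sorted(patterns.items()):
    print(f"{item}: support {support:.2f}")
```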
References
[1] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers Inc. San Francisco, CA, USA, 2011.