cse591 (575) data mining


Upload: maida

Post on 25-Feb-2016


Page 1: CSE591 (575) Data Mining

CSE591 (575) Data Mining
1/21/2003 - 5/6/2003
Computer Science & Engineering
ASU

Page 2: CSE591 (575) Data Mining

Introduction
  Introduction to this Course
  Introduction to Data Mining

Page 3: CSE591 (575) Data Mining

Introduction to the Course
  First, about you - why take this course?
    Your background and strengths
      AI, DBMS, Statistics, Biology, …
    Your interests and requests
  What is this course about?
    Problem solving
    Handling data
      transform data to workable data
    Mining data
      turn data to knowledge
    Validation and presentation of knowledge

Page 4: CSE591 (575) Data Mining

This course
  What can you expect from this course?
    Knowledge and experience about DM
    Problem solving and solution presentation
  How is this course conducted?
    Presentations
    Individual projects
  Course Format
    Individual Projects 40%
    Exams and/or quizzes 40%
    Class participation 20%
      off-campus students?

Page 5: CSE591 (575) Data Mining

Projects - Start NOW!
  How to start?
    Projects should be sufficiently challenging but reasonable, suitable for one semester
  How to choose your individual project
    Real-world problems
    Problems that might make a difference
  Two types of projects
    Available projects
    Self-proposed projects (approval needed)

Page 6: CSE591 (575) Data Mining

Some project ideas
  Dealing with high-dimensional data
    Data of supervised, unsupervised learning
  Image mining
    Feature extraction, clustering of images
  Active sampling
    Various data structures (kd-trees, R-trees, multidimensional scaling)
  Meta data (RDF, namespace) for mining
  Ensemble learning
  Sequence mining (HMM learning)
  Bioinformatics and applications (feature selection)
  Intelligent driving data analysis
    Data integration, data reduction (random projection)

Page 7: CSE591 (575) Data Mining

How is a project evaluated?
  It depends on
    What you want to achieve
    Its impact
    Your effort
  The sooner you start, the better
  The beginning is not easy

Page 8: CSE591 (575) Data Mining

Course Web Site
  http://www.public.asu.edu/~huanliu/cse591.html
  My office and office hours
    GWC 342
    T 10:30 - 11:30am and Th 4:00 - 5:00pm
  My email: [email protected]
  Slides and relevant information will be made available at the course web site

Page 9: CSE591 (575) Data Mining

Any questions and suggestions?
  Your feedback is most welcome!
    I need it to adapt the course to your needs.
    Please feel free to provide yours anytime.
  Share your questions and concerns with the class; very likely others have the same ones.
  No pain, no gain: there is no magic in data mining.
    The more you put in, the more you get.
    Your grades are proportional to your efforts.

Page 10: CSE591 (575) Data Mining

Introduction to Data Mining
  Definitions
  Motivations of DM
  Interdisciplinary Links of DM

Page 11: CSE591 (575) Data Mining

What is DM?
  Or more precisely KDD (knowledge discovery from databases)?
  Many definitions
  A process, not plug-and-play
    raw data -> transformed data -> preprocessed data -> data mining -> post-processing -> knowledge
  One definition is
    A non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data
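The chain of stages on this slide can be sketched as a sequence of small functions, each consuming the output of the previous one. This is only an illustration, not something from the course: the function names, the toy records, and the "mean per group" pattern are all assumptions chosen to keep the example short.

```python
from statistics import mean

def transform(raw_rows):
    """Raw data -> transformed data: split text records into (name, value) fields."""
    parsed = []
    for row in raw_rows:
        name, value = row.split(",")
        parsed.append((name.strip(), value.strip()))
    return parsed

def preprocess(rows):
    """Transformed data -> preprocessed data: drop records with a missing value."""
    return [(name, float(value)) for name, value in rows if value != ""]

def mine(rows):
    """Data mining: a deliberately trivial 'pattern', the mean value per group."""
    groups = {}
    for name, value in rows:
        groups.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in groups.items()}

def postprocess(patterns, threshold):
    """Post-processing: keep only the patterns that pass a usefulness threshold."""
    return {name: avg for name, avg in patterns.items() if avg >= threshold}

# Hypothetical raw records; one of them has a missing value.
raw_data = ["a, 1.0", "b, 4.0", "a, 3.0", "b, ", "b, 6.0"]
knowledge = postprocess(mine(preprocess(transform(raw_data))), threshold=3.0)
print(knowledge)
```

The point of the sketch is that the mining step is one stage among several: the result depends just as much on how the data was transformed, cleaned, and filtered afterwards.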

Page 12: CSE591 (575) Data Mining

Need for Data Mining
  Data accumulate and double every 9 months
  There is a big gap between stored data and knowledge, and the transition won’t occur automatically.
  Manual data analysis is not new, but it is a bottleneck
  Fast-developing Computer Science and Engineering generates new demands
  Seeking knowledge from massive data
    Any personal experience?
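To make the "double every 9 months" figure concrete, the short calculation below converts a doubling time into a growth factor over an arbitrary period. It takes the 9-month figure from the slide at face value; the function name is an assumption for illustration.

```python
DOUBLING_MONTHS = 9  # doubling time quoted on the slide

def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Factor by which the data volume grows over the given number of months."""
    return 2 ** (months / doubling_months)

print(round(growth_factor(12), 2))  # growth over one year
print(growth_factor(36))            # growth over three years
```

Under that assumption the volume grows by roughly 2.5 times per year and 16 times in three years, which is why manual analysis cannot keep pace.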

Page 13: CSE591 (575) Data Mining

When is DM useful?
  Data rich
    Two invited talks so far have convincingly demonstrated it
  Large data (dimensionality and size)
    Image data (size)
    Gene data (dimensionality)
  Little knowledge about data (exploratory data analysis)
    What if we have some knowledge?

Page 14: CSE591 (575) Data Mining

DM perspectives
  Prediction, description, explanation, optimization, and exploration
  Completion of knowledge (patterns vs. models)
  Understandability and representation of knowledge
  Some applications
    Business intelligence (CRM)
    Security (Info, Comp Systems, Networks, Data, Privacy)
    Scientific discovery (bioinformatics)

Page 15: CSE591 (575) Data Mining

Challenges
  Increasing data dimensionality and data size
  Various data forms
  New data types
    Streaming data, multimedia data
  Efficient search and data access
  Intelligent update and integration

Page 16: CSE591 (575) Data Mining

Interdisciplinary Links of DM
  Statistics
  Databases
  AI
  Machine Learning
  Visualization
  High Performance Computing
    supercomputers, distributed/parallel/cluster computing

Page 17: CSE591 (575) Data Mining

Statistics
  Discovery of structures or patterns in data sets
    hypothesis testing, parameter estimation
  Optimal strategies for collecting data
    efficient search of large databases
  Static data
    constantly evolving data
  Models play a central role
    algorithms are a major concern
    patterns are sought

Page 18: CSE591 (575) Data Mining

Relational Databases
  A relational database can contain several tables
    Tables and schemas
  The goal in data organization is to maintain data and quickly locate the requested data
    Queries and index structures
  Query execution and optimization
    Query optimization is to find the best possible evaluation method for a given query
  Providing fast, reliable access to data for data mining
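The link between index structures and query optimization can be seen with Python's built-in sqlite3 module: the same query is planned before and after an index is created. The table, column names, and rows below are made up for illustration, and the exact wording of the plan text varies between SQLite versions, so no output is shown.

```python
import sqlite3

# In-memory database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers (city, spend) VALUES (?, ?)",
    [("Tempe", 120.0), ("Phoenix", 80.0), ("Tempe", 45.5), ("Mesa", 60.0)],
)

query = "SELECT id, spend FROM customers WHERE city = ?"

def plan(connection, sql, params):
    """Return the optimizer's chosen evaluation method as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)

plan_before = plan(conn, query, ("Tempe",))  # no index yet: expect a table scan
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")
plan_after = plan(conn, query, ("Tempe",))   # the optimizer can now use the index

result = conn.execute(query, ("Tempe",)).fetchall()
print(plan_before)
print(plan_after)
print(result)
```

The query text never changes; only the available access paths do, and the optimizer picks the evaluation method accordingly.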

Page 19: CSE591 (575) Data Mining

AI
  Intelligent agents
    Perception-Action-Goal-Environment
  Search
    uniform cost and informed search algorithms
  Knowledge representation
    FOL, production rules, frames with semantic networks
  Knowledge acquisition
  Knowledge maintenance and application
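As a reminder of what uniform cost search does, here is a minimal sketch using a priority queue: the frontier is always expanded at the cheapest path found so far. The graph, node names, and edge costs are invented for the example.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) of a cheapest path from start to goal, or None."""
    frontier = [(0, start, [start])]  # priority queue ordered by path cost
    best_cost = {start: 0}            # cheapest known cost to reach each node
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for neighbor, step_cost in graph.get(node, {}).items():
            new_cost = cost + step_cost
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical weighted graph: node -> {neighbor: edge cost}
graph = {
    "A": {"B": 1, "C": 5},
    "B": {"C": 1, "D": 6},
    "C": {"D": 2},
    "D": {},
}
print(uniform_cost_search(graph, "A", "D"))  # (4, ['A', 'B', 'C', 'D'])
```

An informed search such as A* would use the same loop but order the frontier by path cost plus a heuristic estimate of the remaining distance.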

Page 20: CSE591 (575) Data Mining

Machine Learning
  Focusing on complex representations, data-intensive problems, and search-based methods
  Flexibility with prior knowledge and collected data
  Generalization from data and empirical validation
    statistical soundness and computational efficiency
    constrained by finite computing & data resources
  Challenges from KDD
    scaling up, cost info, auto data preprocessing
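"Generalization from data and empirical validation" can be illustrated with a holdout test: a model is fit on one part of the data and scored on the part it has not seen. The sketch below uses a 1-nearest-neighbor classifier on synthetic two-cluster data; the classifier choice, the data, and all names are assumptions made for the example, not methods prescribed by the course.

```python
import random

def nearest_neighbor_predict(train, point):
    """Predict the label of the training example closest to point (squared Euclidean)."""
    _, label = min(
        train,
        key=lambda example: sum((a - b) ** 2 for a, b in zip(example[0], point)),
    )
    return label

def holdout_accuracy(examples, test_fraction=0.3, seed=0):
    """Shuffle, hold out a test set, and return accuracy on the held-out part."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)

# Synthetic data: two well-separated Gaussian clusters (hypothetical).
data_rng = random.Random(42)
data = [((data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)), "neg") for _ in range(50)]
data += [((data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)), "pos") for _ in range(50)]

accuracy = holdout_accuracy(data)
print(accuracy)
```

Accuracy measured on the training examples themselves would say little about generalization, since a 1-nearest-neighbor model trivially matches every point it has memorized.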

Page 21: CSE591 (575) Data Mining

Visualization
  Producing a visual display with insights into the structure of the data with interactive means
    zoom in/out, rotating, displaying detailed info
  Various branches of visualization methods
    show summary properties and explore relationships between variables
    investigate large databases and convey lots of information
    analyze data with geographic/spatial location
  A pre- and post-processing tool for KDD

Page 22: CSE591 (575) Data Mining

Bibliography
  W. Klosgen & J.M. Zytkow (eds.), 2001, Handbook of Data Mining and Knowledge Discovery.