data mining and machine learning

David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: Data Mining and Machine Learning Lecture 1: Why data is useful, and overview of DMML:

Upload: faris

Post on 19-Jan-2016

TRANSCRIPT

Page 1: Data Mining  and Machine Learning

David Corne, and Nick Taylor, Heriot-Watt University - [email protected] These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Data Mining and Machine Learning

Lecture 1: Why data is useful, and overview of DMML

Page 2: Data Mining  and Machine Learning

Overview of My Lectures http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Page 3: Data Mining  and Machine Learning

Module assessment

100% by coursework

Three main items of coursework,

CW 1: 30% CW 2: 40% CW 3: 30%

Two small items of coursework (A and B), worth 0%, but if you don’t do them adequately you fail the module.

Page 4: Data Mining  and Machine Learning

Coursework submission

ALL coursework must be submitted as follows:

• as PDF

• by email to [email protected]

• the c/w is an attachment

• Subject line: DMML Coursework A (… or B, 1, 2, 3)

• Body of the email includes your Name and your Course (e.g. Joe Smith, BSc CS – Jill Brown, MSc AI)

Page 5: Data Mining  and Machine Learning

Office Hour Doodle Poll

http://doodle.com/ndb69faydc6ivttw

Page 6: Data Mining  and Machine Learning

At last, the lecture

Page 7: Data Mining  and Machine Learning

What some people think can be done with data

Answer simple questions like:

• How many female clients do we have?

• How much paint did we sell in 2007?

• Which is the most profitable branch of our supermarket?

• Which postcodes suffered the most dropped calls in July?

Page 8: Data Mining  and Machine Learning

that is so

Page 9: Data Mining  and Machine Learning

that is so

Boring

Page 10: Data Mining  and Machine Learning

More interesting things that can be done with data

Answer difficult and valuable questions like:

• How can we predict ovarian cancer early enough to treat it successfully?

• How can I make significant profit on the stock market next month?

• Two different authors claim to have written this story – how can we resolve the dispute?

• How can we get our customers to spend more money in the store?

• Is this loan applicant a good credit risk?

• Is this sonar image a mine, or a rock?

• What other websites will this browser be interested in?

Page 11: Data Mining  and Machine Learning

Some competitions at

Page 12: Data Mining  and Machine Learning

Data Mining - Definition & Goal

Definition

– Data Mining is the exploration and analysis of (often) large quantities of data in order to discover meaningful patterns and rules

Goal

– To permit some other goal to be achieved or performance to be improved through a better understanding of the data

Page 13: Data Mining  and Machine Learning

Some examples of large databases

Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month

Tesco has >500 stores. Per year, 100,000,000 baskets ?

The Internet ~ >20,000,000,000 pages

Lots of datasets: UCI Machine Learning repository

How can we begin to understand and exploit such datasets? Especially the big ones?
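As a rough check on the basket figures above, the yearly total can be estimated from the per-store number. This is only a back-of-envelope sketch: it assumes every store is as busy as the one quoted and takes 500 as the store count, which the slide gives as a lower bound.

```python
# Back-of-envelope estimate of yearly basket volume, using the slide's figures.
baskets_per_store_per_month = 18_000  # "In one store, 18,000 baskets per month"
months_per_year = 12
stores = 500                          # the slide says ">500", so this is a lower bound

baskets_per_year = baskets_per_store_per_month * months_per_year * stores
print(f"{baskets_per_year:,} baskets per year")  # 108,000,000
```

The result is of the same order as the 100,000,000 suggested on the slide.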

Page 14: Data Mining  and Machine Learning

Like this …

Page 15: Data Mining  and Machine Learning

and this …

Page 16: Data Mining  and Machine Learning

or this …

• see http://websom.hut.fi/websom/milliondemo/html/root.html

Page 17: Data Mining  and Machine Learning

What on Earth is ‘big data’ anyway?

Or this

Page 18: Data Mining  and Machine Learning

Data Mining & Machine Learning - Basics

• Data Mining is the process of discovering patterns and inferring associations in raw data

• … a collection of techniques intended to analyse small or large amounts of data

• … can employ a range of techniques, either individually or in combination with each other

• Machine Learning is the same, but the term ML emphasises a range of more sophisticated algorithms that try to learn accurate predictive models of data

Page 19: Data Mining  and Machine Learning

Data Mining – Why is it important?

• Data are being generated in enormous quantities

• Data are being collected over long periods of time

• Data are being kept for long periods of time

• Computing power is formidable and cheap

• A variety of Data Mining software is available

• All of these data contain ‘hidden knowledge’ – facts, rules, patterns, that can be usefully exploited if we can find them.

Page 20: Data Mining  and Machine Learning

Page 21: Data Mining  and Machine Learning

Some basic terminology

Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc …

Page 22: Data Mining  and Machine Learning

Each row of the table is called a data instance, a record, or just a line of data

Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc …

Page 23: Data Mining  and Machine Learning

Each column is called a field or an attribute; the value of the Age field in the 4th record is 274

Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc …
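One way to make the record and field terminology concrete is to hold the table in code. The sketch below is illustrative only: the field names (`weight_kg`, `age_months`, and so on) are invented here, and the units have been stripped so that the values are plain numbers.

```python
# Each dict is one data instance (record); each key is a field (attribute).
# Field names are made up for this sketch; the values come from the slide's table.
records = [
    {"gender": "Male",   "weight_kg": 52, "height_m": 1.71, "age_months": 243, "time_100m_s": 13.7},
    {"gender": "Male",   "weight_kg": 89, "height_m": 1.92, "age_months": 388, "time_100m_s": 22.3},
    {"gender": "Female", "weight_kg": 48, "height_m": 1.67, "age_months": 219, "time_100m_s": 14.6},
    {"gender": "Male",   "weight_kg": 86, "height_m": 1.96, "age_months": 274, "time_100m_s": 9.58},
    {"gender": "Male",   "weight_kg": 80, "height_m": 1.88, "age_months": 260, "time_100m_s": 10.56},
]

# The 4th record is at index 3, because Python lists count from zero.
fourth_record = records[3]
print(fourth_record["age_months"])  # 274

# Reading one field across all records gives a whole column.
ages = [r["age_months"] for r in records]
print(ages)  # [243, 388, 219, 274, 260]
```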

Page 24: Data Mining  and Machine Learning

Usually we are interested in predicting the value of a particular field, given the values of the other fields. What we want to predict is called the class field, or the target class

Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc …
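In code, choosing a class field usually amounts to separating one column from the rest. The sketch below assumes the 100m time is the field to be predicted; the transcript does not say which column was highlighted on the slide, so that choice, like the field names, is an assumption.

```python
# Field names are illustrative; values are taken from the slide's table.
FIELDS = ("gender", "weight_kg", "height_m", "age_months", "time_100m_s")
ROWS = [
    ("Male", 52, 1.71, 243, 13.7),
    ("Male", 89, 1.92, 388, 22.3),
    ("Female", 48, 1.67, 219, 14.6),
    ("Male", 86, 1.96, 274, 9.58),
    ("Male", 80, 1.88, 260, 10.56),
]
TARGET = "time_100m_s"  # assumption: the class field is the 100m time


def split_features_target(rows, fields, target):
    """Return (inputs, targets): every other field per row, and the class field per row."""
    t = fields.index(target)
    inputs = [tuple(v for i, v in enumerate(row) if i != t) for row in rows]
    targets = [row[t] for row in rows]
    return inputs, targets


X, y = split_features_target(ROWS, FIELDS, TARGET)
print(X[0])  # ('Male', 52, 1.71, 243)
print(y)     # [13.7, 22.3, 14.6, 9.58, 10.56]
```

A learning algorithm would then be given `X` and `y` for known records and asked to predict the target for records where only the other fields are known.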

Page 25: Data Mining  and Machine Learning

Some data-mining related projects that I am currently working on (either myself, or with a PhD student or RA)

Analysing flow cytometry data to detect the presence of specific contaminants in sea-water samples

Predicting which of two or more writers is the author of a given piece of text

Discovering which subsets of many thousands of genes play a role in specific diseases (cancer, diabetes, etc)

Discovering technical trading rules for stock market trading

Page 26: Data Mining  and Machine Learning

Who wrote text chunk 4?

0.4   0.2    0.001   0.002   0.6   …   AuthorA
0.3   0.15   0       0.1     0.5   …   AuthorA
0.2   0.2    0.001   0.002   0.5   …   AuthorB
0.2   0.15   0       0.002   0.6   …   ?

Word usage ‘Fingerprint’ of a 1,000 word chunk of text
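One simple way to attack a question like this is nearest-neighbour classification: label the unknown chunk with the author of the most similar known chunk. The sketch below is not the method used in the project, which the slides do not describe. It uses only the five values visible in each row (the elided "…" columns are ignored) and plain Euclidean distance, both of which are assumptions made for illustration.

```python
import math

# Known chunks: (first five fingerprint values from the slide, author label).
labelled = [
    ((0.4, 0.2, 0.001, 0.002, 0.6), "AuthorA"),
    ((0.3, 0.15, 0.0, 0.1, 0.5), "AuthorA"),
    ((0.2, 0.2, 0.001, 0.002, 0.5), "AuthorB"),
]
unknown = (0.2, 0.15, 0.0, 0.002, 0.6)  # text chunk 4


def nearest_author(chunk, examples):
    """1-nearest-neighbour: return the label of the closest example."""
    closest = min(examples, key=lambda ex: math.dist(chunk, ex[0]))
    return closest[1]


for features, author in labelled:
    print(author, round(math.dist(unknown, features), 3))
print("Predicted:", nearest_author(unknown, labelled))
```

On these five values the closest chunk happens to be the AuthorB one, but with three examples and truncated fingerprints that says little about the true author. It only shows the shape of the computation.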

Page 27: Data Mining  and Machine Learning

Did the Dow Jones go up or down in the following week?

Page 28: Data Mining  and Machine Learning

Down

Page 29: Data Mining  and Machine Learning

Will the Dow Jones go up or down tomorrow?

Page 30: Data Mining  and Machine Learning

Data Mining – Tasks

Classification - Example: high risk for cancer or not

Estimation/Prediction - Example: household income / sales

Association Rules - Example: people who buy X, often also buy Y with a probability of Z

Clustering - similar to classification but no predefined classes; identifies meaningful segments of a dataset, discovers structure in data
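The association-rule example can be made concrete with a few toy baskets. In the sketch below, the probability Z is computed as the rule's confidence: the fraction of baskets containing X that also contain Y. The baskets and item names are invented for illustration, and `support` and `confidence` are simply the usual names for these two measures.

```python
# Toy retail baskets (invented data).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(x, y, baskets):
    """Of the baskets containing x, the fraction that also contain y."""
    with_x = [b for b in baskets if x <= b]
    if not with_x:
        return 0.0
    return sum(y <= b for b in with_x) / len(with_x)


z = confidence({"bread"}, {"butter"}, baskets)
print(f"People who buy bread also buy butter with probability {z}")  # 0.75
print(support({"bread", "butter"}, baskets))                         # 0.6
```

Four of the five baskets contain bread and three of those also contain butter, hence 0.75.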

Page 31: Data Mining  and Machine Learning

Data Warehousing

• Note that Data Mining is very generic and can be used for detecting patterns in almost any data

– Retail data

– Genomes

– Climate data

– Etc.

• Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector

Page 32: Data Mining  and Machine Learning

What you should do this week

Browse the UCI Machine Learning repository datasets and associated information; get acquainted with data

Browse the statlib datasets archive, get acquainted with that too.

Browse the http://www.kaggle.com/ website - to give you some idea of how hot data mining is

And then …

Page 33: Data Mining  and Machine Learning

Coursework A (0 marks, but you fail if you don’t submit an adequate attempt)

Find three other dataset repositories as follows:

1. One that specialises in sports data

2. One that specialises in time series data

3. One that specialises in anything else that is interesting.

For each of these three, tell me the URL, and write one paragraph, ~100 words, in your own words, describing the contents of this repository.

Submit on or before 23:59 on Friday October 11th

Page 34: Data Mining  and Machine Learning

Au revoir

Page 35: Data Mining  and Machine Learning

If interested…

Some slides about data warehousing; I don’t consider this an essential part of this module, but in case you want to know what data warehousing is …

Page 36: Data Mining  and Machine Learning

Data Warehousing - Definitions

“A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process”

W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 -- a very influential definition.

“A copy of transaction data, specifically structured for query and analysis”

Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”

Page 37: Data Mining  and Machine Learning

Data Warehouse – why?

For organisational learning to take place, data from many sources must be gathered together over time and organised in a consistent and useful way

Data Warehousing allows an organisation to remember its data and what it has learned about its data

Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it

Page 38: Data Mining  and Machine Learning

Page 39: Data Mining  and Machine Learning

Data Warehouse - Contents

• A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting

• The data will normally have been transformed when it was copied into the Data Warehouse

• The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course
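The three points above (a copy of transaction data, transformed on the way in, then fixed but extendable) can be sketched as a tiny append-only store. This is a toy illustration rather than how any real warehouse product works; the class name, the transformation, and the sample transactions are all invented.

```python
class Warehouse:
    """Toy append-only store: rows are transformed on load and never changed afterwards."""

    def __init__(self, transform):
        self._transform = transform
        self._rows = []

    def load(self, transactions):
        """Copy transactions in, applying the transformation to each one."""
        self._rows.extend(self._transform(t) for t in transactions)

    def rows(self):
        """Read-only view: a tuple of immutable rows, so callers cannot edit history."""
        return tuple(self._rows)


def to_warehouse_row(t):
    # Hypothetical transformation: tidy the branch name and convert pence to pounds.
    return (t["date"], t["branch"].strip().upper(), t["pence"] / 100)


warehouse = Warehouse(to_warehouse_row)
warehouse.load([{"date": "2007-03-01", "branch": " Leith ", "pence": 1250}])
warehouse.load([{"date": "2007-03-02", "branch": "riccarton", "pence": 400}])  # adding is allowed
print(warehouse.rows())
```

There is deliberately no update or delete method: new data can only be appended.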

Page 40: Data Mining  and Machine Learning

Data Marts

• A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse

• A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s activities

Page 41: Data Mining  and Machine Learning

From Data Warehousing to Machine Learning, via Data Marts

Page 42: Data Mining  and Machine Learning

The Big Challenge for Data Mining

• The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse

• It is very important, then, that summary data also be available to get the analysis started

• The sheer volume of data may mask the important relationships in which the Data Miner is interested

• Being able to overcome the volume and interpret the data is essential to successful Data Mining
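As an illustration of what "summary data" might look like, the sketch below collapses individual sales rows into one line per branch. The branch names and amounts are invented, and a real warehouse would typically precompute such summaries rather than build them on demand.

```python
from collections import defaultdict
from statistics import mean

# Invented detail rows: (branch, sale amount in pounds).
sales = [
    ("Leith", 12.50),
    ("Leith", 7.25),
    ("Riccarton", 30.00),
    ("Riccarton", 10.00),
    ("Riccarton", 5.00),
]


def summarise(rows):
    """Collapse detail rows into count, total and mean per branch."""
    by_branch = defaultdict(list)
    for branch, amount in rows:
        by_branch[branch].append(amount)
    return {
        branch: {"count": len(v), "total": sum(v), "mean": mean(v)}
        for branch, v in by_branch.items()
    }


summary = summarise(sales)
for branch, stats in summary.items():
    print(branch, stats)
```

A handful of summary lines like these is far easier to scan for a starting point than millions of individual transactions.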

Page 43: Data Mining  and Machine Learning

What happens in practice …

Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions

The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data

There are limitations though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them