

Data Warehousing and Data Mining

Jef Wijsen

Université de Mons-Hainaut (UMH)

Service de Science des Systèmes d’Information

[email protected]

http://staff.umh.ac.be/Wijsen.Jef/

March 19, 2005


• 1 Situation

• 2 Data Warehouse

• 3 Testimony

• 4 OLAP

• 5 Data Mining

• 6 Next Generation Data Warehousing/Mining

• 7 Important Players

• 8 Selected Literature


1 Situation

• R&D in OnLine Transaction Processing (OLTP) since the sixties has resulted in (relational) database systems.

• Digitizing and storing data is simple and cheap. E.g., bar codes.

• Huge amounts of historical and operational data may hide nuggets of information (rules, trends, patterns, ...) about the business.

• New challenge: disclose previously unknown knowledge so that it can be used by managers.


Case Study

Borrowed from www.internetweek.com.

Sports car owners fall into a high-risk category, in the conventional wisdom of auto insurance underwriters.

Knowledge Discovery: But by mining driver safety data in its new data warehouse, Farmers Insurance Group has found that if sports car enthusiasts also own a second, conventional car, they may be safe-enough drivers to be attractive as policyholders.

Effective Use: “We found a microniche among all sports car owners,” said Tom Boardman, an assistant actuary at Farmers [...]. “As a result, we changed how we underwrite and price some sports car policies,” he said.


Business Opportunities

Banking: Which prospects are most likely to become profitable customers?

Retail: What will this customer want next?

Government tax agency: Which tax returns are likely to be non-compliant?

Government intelligence agency: What specific event is most likely to be a security threat? (B. Thuraisingham: Web Data Mining and Applications in Business Intelligence and Counter-Terrorism)


The Focus of the Talk is on Technological Issues

Three new technological developments in the field of decision support systems:

Data warehousing. OLTP data are often scattered over different systems, highly detailed, and/or of poor quality. Data warehousing involves integrating, aggregating/summarizing, and cleaning these data in a new data repository, called a data warehouse.

OnLine Analytical Processing (OLAP). Online analysis of the data warehouse content; data is represented in multidimensional spreadsheets, called data cubes.

Data mining. Exploring data in search of interesting, new knowledge (rules, trends, regularities, patterns, ...).


OLAP versus Data Mining

OLAP: User-driven hypothesis verification. The analyst posing the query usually tells the system exactly what query to execute, i.e., on which portion of the data to focus.

“Give the monthly number of customers that left in the past year.”

Data mining: Data-driven hypothesis building. A data-mining query goes a step beyond, inviting the system to decide where the focus should be.

“What factors affect the loss of customers?”


Knowledge Discovery in Databases (KDD)

KDD ≈ data warehousing (integration, aggregation, cleaning) + OLAP + data mining


OLAP versus OLTP

            OLTP                           OLAP
end-user    clerk                          manager
workload    frequent transactions          regular analyses
access      read and write,                mostly read only,
            a limited number of records    scanning millions of records
data        current                        current and historical
DB size     100 MB to GB                   100 GB to TB (= 10^6 MB)

These differences constitute an additional argument for building a data warehouse separate from existing transactional databases.


2 Data Warehouse

2.1 What is a Data Warehouse?

A data repository for decision support, with the following characteristics:

Subject-oriented and integrated. OLTP data are often scattered over multiple applications (invoicing, delivery, production, ...). Data is integrated in a data warehouse around a number of subjects (client, product, supplier, ...).

Non-volatile and historical. Data, once entered in the warehouse, is not subject to change. Data covers a certain period of time (e.g., ten years) in order to allow trend analysis.


2.2 What is a Data Mart?

A departmental data warehouse focusing on a specific part of the business. E.g., a marketing data mart about the subjects client, product, and sales.

Two types of data marts:

Data mart without data warehouse. These data marts can be realized more easily as they do not require a business-wide conceptual data model. However, they can raise complex integration problems in the long run.

Data mart extracted from the data warehouse. For reasons of increased flexibility and performance.


2.3 Constructing a Data Warehouse

Extraction: Extracting data from transactional databases and other data repositories. Typically an overnight batch process.

Cleaning: GIGO principle (Garbage In, Garbage Out)...

• completing missing values and NULLs,

• correcting typos and other errors,

• unifying synonyms,

• ...

Data that are obviously erroneous but cannot be corrected are removed.
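
The cleaning steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`name`, `country`, `age`), the synonym table, and the plausibility check on `age` are hypothetical and do not come from the talk.

```python
# Hypothetical cleaning pass; field names, the synonym table, and the
# age range are illustrative assumptions, not taken from the slides.
SYNONYMS = {"b": "Belgium", "belgique": "Belgium", "belgie": "Belgium"}

def clean(records, default_country="unknown"):
    cleaned = []
    for rec in records:
        # Remove records that are obviously erroneous and cannot be corrected.
        if rec.get("age") is not None and not 0 <= rec["age"] <= 120:
            continue
        rec = dict(rec)  # work on a copy, leave the source record untouched
        # Complete missing values and NULLs with a default.
        country = (rec.get("country") or default_country).strip()
        # Unify synonyms (case-insensitive lookup).
        rec["country"] = SYNONYMS.get(country.lower(), country)
        cleaned.append(rec)
    return cleaned

rows = [
    {"name": "Dupont", "country": "Belgique", "age": 41},
    {"name": "Rossi", "country": None, "age": 35},
    {"name": "Peeters", "country": "belgie", "age": 230},
]
result = clean(rows)
```

In this sketch the third record is dropped because its age cannot be right and there is no way to repair it, which mirrors the last remark above.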


Integration and transformation: Fusing different data sources.

• Matching entity identifications, e.g., client_id and cl_nr.

• Unifying data expressed along different scales, e.g., BEF and EURO.

• Translating addresses into coordinates.

• Aggregating individual sales into daily sales figures.

• Normalizing variables between 0 and 1.
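
Three of the transformations above (unifying currencies, aggregating to daily figures, and normalizing to the range 0 to 1) might look like the sketch below. The record layout and function names are assumptions made for the example; the BEF/EUR rate is the official fixed conversion rate, and min-max scaling is one common way to normalize, not necessarily the one meant in the talk.

```python
from collections import defaultdict

BEF_PER_EUR = 40.3399  # fixed conversion rate between BEF and EUR

def to_eur(amount, currency):
    """Express an amount in EUR, whatever its source currency."""
    if currency == "EUR":
        return amount
    if currency == "BEF":
        return amount / BEF_PER_EUR
    raise ValueError(f"unknown currency: {currency}")

def daily_totals(sales):
    """Aggregate individual sales (day, amount, currency) into daily figures."""
    totals = defaultdict(float)
    for day, amount, currency in sales:
        totals[day] += to_eur(amount, currency)
    return dict(totals)

def normalize(values):
    """Min-max normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up sales records, for illustration only.
sales = [
    ("2001-01-01", 10.0, "EUR"),
    ("2001-01-01", 403.399, "BEF"),  # 10 EUR expressed in BEF
    ("2001-01-02", 30.0, "EUR"),
]
totals = daily_totals(sales)
scaled = normalize(list(totals.values()))
```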


Load and refresh

• Loading the data into the warehouse involves creating indexes to speed up queries.

• Modifications in OLTP databases are propagated regularly to the data warehouse (copy management).


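[Figure: data warehouse architecture. Data from the source databases (DB) are cleaned, integrated, transformed, and loaded into the Data Warehouse, which is described by Metadata and serves OLAP and Data Mining.]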


3 Testimony

H. Van Puyvelde. De l’information opérationnelle à l’intelligence décisionnelle par le data mining - Étude de faisabilité appliquée au cas d’un service public. Master’s thesis, Université de Mons-Hainaut, 2000.

The company uses a dozen important applications on four different DBMS platforms.

Initial challenge: Apply data mining to answer questions like

• “Who are our clients?”

• “Which services are most beneficial to our clients?”

However, a thorough data preparation was mandatory. . .


Some difficult quality problems:

• Double registration of the same entity. E.g., 〈RAYTEC, Rue de Commerce 2, ...〉 and 〈S.A. RAYTEC, 2 Rue de Commerce, ...〉.

• Multiple use of the hold-all code “others” for attributes like skills or profession.

• Missing, impossible, or outdated attribute values.
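
To illustrate the double-registration problem, here is a minimal sketch of one possible duplicate check. It is an assumption for illustration, not the method used in the thesis: names and addresses are compared as sets of words, after dropping punctuation and a hypothetical list of legal-form abbreviations.

```python
import re

# Hypothetical list of legal-form abbreviations to ignore when comparing names.
LEGAL_FORMS = {"sa", "nv", "sprl", "bvba"}

def tokens(text):
    """Lower-case, drop dots, and return the set of remaining words."""
    text = text.lower().replace(".", "")
    return set(re.findall(r"[a-z0-9]+", text))

def same_entity(rec_a, rec_b):
    """Treat two (name, address) pairs as duplicates if their word sets match."""
    name_a, addr_a = rec_a
    name_b, addr_b = rec_b
    return (tokens(name_a) - LEGAL_FORMS == tokens(name_b) - LEGAL_FORMS
            and tokens(addr_a) == tokens(addr_b))

a = ("RAYTEC", "Rue de Commerce 2")
b = ("S.A. RAYTEC", "2 Rue de Commerce")
duplicate = same_entity(a, b)
```

A word-set comparison like this catches reordered house numbers and optional legal forms, but it would miss typos or abbreviated street names, which is part of why such quality problems are hard.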

This confirms others’ experiences:

“Preparation of the data [...] can easily take up to 80% of the time needed for the whole KDD [Knowledge Discovery in Databases]; this is not surprising, since the difficulties in data integration are well known.”

[Mannila 96]


4 OLAP

4.1 Data Cube

Typically, OLAP analyses are based on summary reports, e.g., the daily sales amounts by store and product.

The data can be naturally represented in a “data cube”:

• The cube dimensions correspond to the independent variables, e.g., day, store, and product.

• The cube cells contain the corresponding values for the dependent variable, e.g., the number of pieces sold.

OLAP software provides several ways of visualizing data cubes.
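
As a rough illustration of the idea, a data cube can be held in memory as a mapping from dimension values to the dependent variable. The representation below (a sparse dictionary keyed by day, store, and product) and the helper names are assumptions made for this sketch, and most cell values are made up.

```python
# A sparse data cube: one entry per (day, store, product) cell.
# Cell values are illustrative; absent cells are read as 0 pieces sold.
cube = {
    ("2001-01-01", "Kinderdroom", "Lego"): 46,
    ("2001-01-01", "Kinderdroom", "Scrabble"): 50,
    ("2001-01-01", "Navona", "Lego"): 33,
    ("2001-01-15", "Kinderdroom", "Lego"): 44,
}

def cell(cube, day, store, product):
    """Value of the dependent variable for one combination of dimensions."""
    return cube.get((day, store, product), 0)

def slice_by_store(cube, store):
    """Fix the store dimension, leaving a 2-dimensional (day, product) table."""
    return {(d, p): n for (d, s, p), n in cube.items() if s == store}

kinderdroom = slice_by_store(cube, "Kinderdroom")
```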


[Figure: data cube with dimensions day (1 Jan 2001, 15 Jan 2001, 1 Feb 2001, 15 Feb 2001), store (Kinderdroom, Navona, Cremona), and product (Lego, Scrabble); each cell holds the number of pieces sold.]

A 3-dimensional data cube.


Concept Hierarchies

Typically, dimensions are organized in concept hierarchies that determine logical ways of grouping data.

• day → month → year

• store → region

• product → class


4.2 Rollup: A Typical OLAP Query

Rollup queries provide for each dimension the level at which data is to be presented.

“Give total sales amounts by product, region, and month”.

[Figure: cuboid with dimensions month (Jan 2001, Feb 2001), region (Belgium, Italy), and product (Lego, Scrabble).]

The cuboid month region product
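
A rollup of this kind could be sketched as below: each base cell is mapped to its parent level in the concept hierarchy and the counts are summed. The dictionary-based cube, the function names, and the sample figures are assumptions for illustration; only the Kinderdroom-to-Belgium assignment appears in the talk, the other store regions are guesses.

```python
from collections import defaultdict

# Concept hierarchy store -> region. Only Kinderdroom/Belgium is taken from
# the slides; the other two assignments are assumptions.
REGION_OF = {"Kinderdroom": "Belgium", "Navona": "Italy", "Cremona": "Italy"}

def month_of(day):
    """Map an ISO date such as '2001-01-15' to its month '2001-01'."""
    return day[:7]

def rollup(cube):
    """Roll a (day, store, product) cube up to (month, region, product)."""
    cuboid = defaultdict(int)
    for (day, store, product), count in cube.items():
        cuboid[(month_of(day), REGION_OF[store], product)] += count
    return dict(cuboid)

# Made-up base data, for illustration only.
cube = {
    ("2001-01-01", "Kinderdroom", "Lego"): 46,
    ("2001-01-15", "Kinderdroom", "Lego"): 44,
    ("2001-01-01", "Navona", "Lego"): 33,
    ("2001-01-01", "Cremona", "Lego"): 36,
    ("2001-02-01", "Kinderdroom", "Lego"): 44,
}
by_month_region = rollup(cube)
```

Dropping a dimension altogether (for instance, totals over all stores) works the same way: the dimension is simply left out of the grouping key.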


OLAP queries can reduce the number of dimensions.

“Give total sales figures by product and month, over all stores.”

[Figure: cuboid with dimensions month (Jan 2001, Feb 2001) and product (Lego, Scrabble), aggregated over all stores.]

The cuboid month product.


“Pre-materializing” Cuboids

[Figure: lattice of the 36 cuboids that can be pre-materialized, one per combination of levels.]

day store product     month store product     year store product     store product
day region product    month region product    year region product    region product
day product           month product           year product           product
day store class       month store class       year store class       store class
day region class      month region class      year region class      region class
day class             month class             year class             class
day store             month store             year store             store
day region            month region            year region            region
day                   month                   year                   (no dimension: grand total)
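
The cuboids on this slide correspond to picking one level per dimension, where each dimension also has a fully aggregated "all" level. A short sketch of that enumeration follows; the dimension names `time`, `location`, and `item` are labels invented for the example.

```python
from itertools import product

# Levels per dimension, from most detailed to fully aggregated ("all").
# The dimension names are hypothetical labels, not taken from the slides.
LEVELS = {
    "time": ["day", "month", "year", "all"],
    "location": ["store", "region", "all"],
    "item": ["product", "class", "all"],
}

def cuboids(levels):
    """Enumerate every combination of one level per dimension."""
    return list(product(*levels.values()))

all_cuboids = cuboids(LEVELS)
print(len(all_cuboids))  # 4 * 3 * 3 = 36
```

The count grows multiplicatively with the number of dimensions and levels, which suggests why one has to choose which cuboids are worth pre-materializing.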


4.3 Technological Choice: ROLAP or MOLAP

Technological challenges in OLAP:

efficient support of spreadsheet operations on databases of multiple gigabytes.

Depending on the technology used, one can classify OLAP software into two categories:

• ROLAP (Relational OLAP), or

• MOLAP (Multidimensional OLAP).


ROLAP

• The data cube is stored in relational tables, in a so-called “star schema.”

• Relational database servers are extended with specialized middleware for OLAP support. E.g., Microsoft SQL Server OLAP Services.

• The relational query language SQL is extended with special OLAP primitives.


Star Schema

Fact table:

day           store          product     count
1 Jan 2001    Kinderdroom    Lego        46
1 Jan 2001    Kinderdroom    Scrabble    50
...           ...            ...         ...
1 Jan 2001    all            Lego        115
...           ...            ...         ...

Dimension tables, referenced by the day, store, and product columns of the fact table:

product    class
Lego       toys
...        ...

store          region
Kinderdroom    Belgium
...            ...

day    month    year
...    ...      ...
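
A star schema of this shape can be tried out with Python's built-in `sqlite3` module. The sketch below is an assumption-laden miniature: the table names `sales` and `store_dim` are invented, the measure column is called `sold` here rather than count, and the Navona row with its region is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_dim (store TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE sales (day TEXT, store TEXT, product TEXT, sold INTEGER);
    INSERT INTO store_dim VALUES ('Kinderdroom', 'Belgium'), ('Navona', 'Italy');
    INSERT INTO sales VALUES
        ('2001-01-01', 'Kinderdroom', 'Lego', 46),
        ('2001-01-01', 'Kinderdroom', 'Scrabble', 50),
        ('2001-01-01', 'Navona', 'Lego', 33);
""")

# Join the fact table with a dimension table and aggregate at region level.
rows = conn.execute("""
    SELECT d.region, s.product, SUM(s.sold)
    FROM sales AS s JOIN store_dim AS d ON s.store = d.store
    GROUP BY d.region, s.product
    ORDER BY d.region, s.product
""").fetchall()
conn.close()
```

The join against the dimension table is what lets a query climb the concept hierarchy (store to region) without storing the region in every fact row.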


MOLAP

• MOLAP uses multidimensional databases storing data in (sparse) matrices.

• This way of storing data may be more efficient than in ROLAP.

• A drawback is that integration with existing SQL databases is more difficult.


5 Data Mining

• In OLAP, the end user guides the analysis:

1. choice of dimensions and dependent variables, and

2. specification of queries.

• Problem: the content of the data warehouse is often not well understood, so that it is nearly impossible to select the right data cube and ask the right questions.

• Starting point of data mining: use computer power to discover interesting patterns in databases, rather than to verify hypotheses.


E.g., credit scoring. Using historical warehouse data on overdue credits, a data mining program may discover the following rule:

If income ≤ 20.000 Euro and seniority ≤ 5 years

then risk = high, else risk = low
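
Once discovered, such a rule is straightforward to apply to new applicants. The sketch below encodes the rule as a function; the function name and the sample applicants are made up for illustration.

```python
def credit_risk(income_eur, seniority_years):
    """Apply the example rule above to one applicant."""
    if income_eur <= 20_000 and seniority_years <= 5:
        return "high"
    return "low"

# Hypothetical applicants as (income in EUR, seniority in years).
applicants = [(18_000, 2), (18_000, 10), (45_000, 1)]
risks = [credit_risk(income, seniority) for income, seniority in applicants]
```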


5.1 Data Mining Applications

• Automatic classification of sky objects.

• Fraud detection.

• Credit scoring.

• Targeted mailing.

• Scouting in NBA (IBM Advanced Scout).

• . . .


Targeted Mailing

[Figure: plot of the amount of responses received against the amount of messages sent, with curves labeled “Targeted mail” and “Mass mail”; the difference between them is marked “Gain”.]


5.2 Data Mining Tasks and Techniques

Tasks                Techniques              Algorithms
Prediction           Decision trees          ID3, C4.5, ...
(Classification &    Classification rules    covering algorithm, ...
Regression)          Bayesian networks
                     Neural networks
Association rules                            Apriori and its variants
Clustering           Partitioning            k-Means, k-Medoids, ...
                     Hierarchical            BIRCH, CURE, ...
                     Density-based           DBSCAN, OPTICS, ...


5.3 Classification

• Input = (historical) training data with known class labels. E.g.,

Age       ...    Car       Risk
young     ...    sport     high
middle    ...    sport     high
middle    ...    family    low
...       ...    ...       ...
old       ...    sport     low

• Build a model that predicts the class given values for the other attributes.

• Test the model on a separate data set.

• Use the model to predict new cases.


Classification by Decision Trees

Age       ...    Car       Risk
young     ...    sport     high
middle    ...    sport     high
middle    ...    family    low
...       ...    ...       ...
old       ...    sport     low

[Figure: decision tree. The root tests Age: young leads to high, old leads to low, and middle leads to a test on Car, where sport leads to high and family leads to low.]
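
A decision tree is just a nested sequence of attribute tests. The sketch below hard-codes one reading of the tree on this slide (test Age first, then Car for middle-aged drivers); that reading is an assumption reconstructed from the node labels and is consistent with the visible table rows. In practice the tree would be induced from the training data by an algorithm such as ID3 or C4.5 rather than written by hand.

```python
def predict_risk(age, car):
    """Walk the example decision tree: test Age first, then Car if needed."""
    if age == "young":
        return "high"
    if age == "old":
        return "low"
    # age == "middle": the outcome depends on the type of car
    return "high" if car == "sport" else "low"

# The four rows of the training table that are visible on the slide.
training = [
    ("young", "sport", "high"),
    ("middle", "sport", "high"),
    ("middle", "family", "low"),
    ("old", "sport", "low"),
]
accuracy = sum(predict_risk(a, c) == r for a, c, r in training) / len(training)
```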


More Complex ⇏ Better Prediction

[Figure: prediction error plotted against model complexity, with one curve for training data and one for test data.]


Classification by Neural Networks

User’s view:

[Figure: two inputs feed a black box, which produces an output.]

Caveat: Experience shows that in many applications [...], the explicit knowledge structures that are acquired, the structural descriptions, are at least as important, and often very much more important, than the ability to perform well on new examples. People frequently use data mining to gain knowledge, not just predictions.

[Witten and Frank, pp. 7–8]


A Look Inside...

[Figure: the two inputs feed a layer of three neurons, whose outputs feed a single output neuron.]


Neuron

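[Figure: a neuron with inputs $I_1, \ldots, I_n$ arriving over connections with weights $w_1, \ldots, w_n$; the neuron sums the weighted inputs and applies $f$.]

The output of the neuron is

$$f\left(\sum_{j=1}^{n} w_j \times I_j\right)$$

where $f$ is a threshold function.

Once the topology (number of layers, number of neurons per layer) of the network is fixed, the weight $w_j$ of each connection is chosen so as to optimize the prediction on training data.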
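
The computation performed by a single neuron can be written in a few lines. In this sketch the step function with a threshold parameter `theta` is one possible choice of threshold function, and the weights are hand-picked for illustration; in a real network they would be fitted to training data.

```python
def threshold(x, theta=0.0):
    """Step function: output 1 when the input reaches the threshold theta."""
    return 1 if x >= theta else 0

def neuron(inputs, weights, theta=0.0):
    """Compute f(sum_j w_j * I_j) with f a threshold function."""
    total = sum(w * i for w, i in zip(weights, inputs))
    return threshold(total, theta)

# Hypothetical weights that make the neuron behave like a logical AND.
weights = [1.0, 1.0]
outputs = [neuron([a, b], weights, theta=1.5) for a in (0, 1) for b in (0, 1)]
```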


5.4 Boolean Association Rules

1 {hammer, crowbar, nails}
2 {hammer, saw, screw}
3 {hammer, crowbar, nails, screw}
4 {hammer, crowbar, saw, nails}
5 {screw}

• The association rule

hammer → crowbar, nails

has support 3 and confidence 3/4: three transactions (1, 3, and 4) contain hammer, crowbar, and nails, out of the four transactions that contain hammer. Find all rules that exceed given support and confidence thresholds.

• Very popular research topic.
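
The support and confidence of the example rule can be checked with a short computation over the five transactions. The function names are chosen for this sketch; support is counted as an absolute number of transactions, as on the slide.

```python
transactions = [
    {"hammer", "crowbar", "nails"},
    {"hammer", "saw", "screw"},
    {"hammer", "crowbar", "nails", "screw"},
    {"hammer", "crowbar", "saw", "nails"},
    {"screw"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs, transactions):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

lhs, rhs = {"hammer"}, {"crowbar", "nails"}
rule_support = support(lhs | rhs, transactions)
rule_confidence = confidence(lhs, rhs, transactions)
```

Algorithms such as Apriori avoid evaluating every candidate rule this way by first finding the itemsets whose support exceeds the threshold.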


5.5 Clustering

• Unlike with classification, there are no known class labels.

• Maximize cohesion: distance between clusters >> distances within clusters

[Figure: scatter plot of Income (100K to 300K) against Age (10 to 80), with two groups of points circled as clusters.]
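
As one example of a partitioning technique, here is a bare-bones k-means sketch. The data points, the initial centroids, and the fixed number of iterations are assumptions made for illustration; a practical implementation would also normalize the attributes first, since income and age are on very different scales.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else old
            for cl, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up (age, income in K) points forming two groups.
points = [(25, 40), (30, 50), (28, 45), (55, 200), (60, 220), (58, 210)]
centroids, clusters = kmeans(points, centroids=[(20, 30), (70, 250)])
```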


Clustering (continued)

• Recognizing shapes: the distance within a cluster may exceed the distance between clusters.

[Figure: scatter plot in the (x, y) plane of points whose clusters form distinct shapes.]


6 Next Generation Data Warehousing/Mining

6.1 The Future of Data Warehousing

1. Decision support systems will become more pro-active.

2. Future systems will be specialized into a specific business sector, e.g., petrochemical industry.

3. The data warehouse will be ever more extended with background knowledge from external data sources, such as

• information on the Web,

• information from geographical information systems.


Web Warehousing. . .

E.g., analyzing information on Web pages of competitors: pricing policies, promotional events, price variations, frequency of promotions, ...

Difficulties:

• No historical information. Data may be out of date. How did the price of product X change at a competitor’s site?

• Web sites are autonomous; they can change content and structure at any time.

• Which Web sites are credible/authoritative?

• Poor productivity when searching. Which online store sells product X at the lowest price?


6.2 The Future of Data Mining

• Specialization into specific business sectors. E.g., pharmaceutical industry.

• Increased expressiveness. E.g., inductive logic programming (ILP).

• Paradigms for improved user interaction. E.g., inductive databases.

• “Hybrid” data mining techniques (collaboration and competition).


A Note on Expressiveness

ID    Width    Height    Sides    Class
a     2        4         4        standing
b     3        6         4        standing
c     8        10        3        standing
d     2        9         4        standing
e     9        1         4        lying
f     4        3         4        lying
g     7        6         3        lying
h     10       2         3        lying

[Figure: the shapes a, b, c, and d drawn as “Standing”, and the shapes e, f, g, and h drawn as “Lying”.]


Classical classification systems compare attributes with constant values:

If width > 3.5 and height < 8 then lying

The following rule compares attributes with each other:

If width > height then lying

[Figure: two scatter plots of the objects a to h, with width on the horizontal axis and height on the vertical axis, one plot for each rule.]
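
Both rules can be checked against the eight objects of the table. The sketch below does so; the function names and the extra test object (width 5, height 7) are assumptions added for illustration.

```python
# (id, width, height, sides, class) rows from the table of shapes.
objects = [
    ("a", 2, 4, 4, "standing"), ("b", 3, 6, 4, "standing"),
    ("c", 8, 10, 3, "standing"), ("d", 2, 9, 4, "standing"),
    ("e", 9, 1, 4, "lying"), ("f", 4, 3, 4, "lying"),
    ("g", 7, 6, 3, "lying"), ("h", 10, 2, 3, "lying"),
]

def constant_rule(width, height):
    """Compare each attribute with a constant."""
    return "lying" if width > 3.5 and height < 8 else "standing"

def relational_rule(width, height):
    """Compare the attributes with each other."""
    return "lying" if width > height else "standing"

def errors(rule):
    """IDs of the objects that the rule classifies incorrectly."""
    return [oid for oid, w, h, _sides, cls in objects if rule(w, h) != cls]

constant_errors = errors(constant_rule)
relational_errors = errors(relational_rule)

# A hypothetical new object on which the two rules disagree.
new_constant = constant_rule(5, 7)
new_relational = relational_rule(5, 7)
```

Both rules fit the eight examples, but they disagree on the hypothetical new object, which is taller than it is wide. Only the second rule expresses the underlying concept, and it cannot be stated by a system limited to comparisons with constants.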


7 Important Players

All market analysts agree on a large growth of the data warehousing and data mining software market in the coming years.

OLAP

The OLAP market is strongly fragmented, without dominant market leaders.

• All important database vendors (IBM, Informix Software, Microsoft, Oracle, Sybase) provide solutions for OLAP and data warehousing.

• Other important players include Hyperion, Cognos, MicroStrategy, Business Objects.

Source: http://www.olapreport.com/.


Data Mining

Important data mining products include

• Clementine (SPSS),

• Enterprise Miner (SAS),

• Intelligent Miner (IBM).

See http://www.kdnuggets.com/.


8 Selected Literature

• Certain documents on the Web are more interesting than many textbooks.

• See http://www.cs.utoronto.ca/~mendel/ for:

– an overview of scientific research in data warehousing and OLAP;

– links to white papers written by commercial software vendors.

• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

• I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.