Lecture 2 Themes in this session Knowledge discovery in databases Data mining Multidimensional analysis and OLAP

Post on 21-Dec-2015

Lecture 2

Themes in this session

• Knowledge discovery in databases
• Data mining
• Multidimensional analysis and OLAP

Knowledge discovery in databases

What is Knowledge?

• Data
  – symbols representing properties of events and their environments
• Information
  – is contained in descriptions; provides the answers to a number of basic questions
• Knowledge
  – basic know-how that facilitates action
• Understanding
  – achieved through diagnosis and prescription
• Wisdom
  – judgement of what is efficient and effective

Characteristics of discovered knowledge

• non-trivial
• valid
• novel
• potentially useful
• understandable

• An aggregated measure is “interestingness”
  – validity
  – novelty
  – usefulness
  – simplicity

A more formal definition of knowledge

• Pattern
  – A pattern is an expression $E$ in a language $L$ describing facts in a subset $F_E$ of $F$. $E$ is called a pattern if it is simpler than the enumeration of all the facts in $F_E$.
• Knowledge
  – A pattern $E \in L$ is called knowledge if, for some user-specified threshold $i \in M_I$, $I(E, F, C, N, U, S) > i$
  – where $C$ = validity, $N$ = novelty, $U$ = usefulness, $S$ = simplicity
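
As a rough illustration of the definition above, the sketch below treats a pattern as knowledge when an interestingness score exceeds a user-specified threshold. The scoring function (a plain average of the four component scores) and every name in the code are assumptions made for illustration; the slides do not say how I is computed.

```python
from dataclasses import dataclass


@dataclass
class PatternScores:
    """Hypothetical component scores for one pattern, each in [0, 1]."""
    validity: float    # C
    novelty: float     # N
    usefulness: float  # U
    simplicity: float  # S


def interestingness(s: PatternScores) -> float:
    # Assumption: I is the unweighted mean of C, N, U and S.
    return (s.validity + s.novelty + s.usefulness + s.simplicity) / 4


def is_knowledge(s: PatternScores, threshold: float) -> bool:
    # A pattern counts as knowledge when I(...) > i for the chosen threshold i.
    return interestingness(s) > threshold


strong = PatternScores(validity=0.9, novelty=0.7, usefulness=0.8, simplicity=0.6)
weak = PatternScores(validity=0.9, novelty=0.1, usefulness=0.2, simplicity=0.4)
```

With a threshold of 0.5, `strong` would qualify as knowledge and `weak` would not; a different weighting of the four measures could change that outcome.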

What is KDD?

• Knowledge Discovery in Databases involves the extraction of implicit, previously unknown and potentially useful information from data.

• KDD is a process
  – involves the extraction, organisation and presentation of discovered information
• KDD is effected by a human-centred system
  – is in itself a knowledge-intensive task consisting of complex interactions between a human and a (large) database

Overview of the analyst’s tasks

[Figure: the analyst's tasks. Elements shown: DB, Dataset, Output, Goals, Insight, Analyses, Queries. Relation labels: gains, formulates, generates, enriches.]

Characteristics of the KDD process

• highly iterative
• protracted over time
• numerous sub-tasks
• highly complex
• numerous input systems

A description of the KDD process

[Figure: the KDD process. Stages shown: goal formulation, task discovery, data discovery, data analysis, model development, data cleaning, output generation.]

Goal formulation

Based on a means-ends chain extending into the workings of the organisation

• Formulate a goal for improving the operations of the business

• Decide what one needs to know in order to fulfil this goal and perform the business activity in a better manner

• On the basis of what one needs to know formulate goals for how to discover this information by using the KDD process

• Revise all of the goals above, if needed, on the basis of iterative discovery

Data discovery

• Try and understand the domain in order to determine which entities are relevant to the discovery process

• Check the coverage and content of the data
  – sift through the source data to see what is available
  – sift through the source data to see what is not available
• Determine the quality of the data
• Determine the structure of the data

Task discovery

• Find means stipulated by the ends contained in the knowledge discovery goals

• Find out what the real requirements on the tasks and the performance of these tasks are

• Refine the requirements and choice of tasks until you’re sure you’re setting about answering the correct questions

Data cleaning

• Ensure the quality of the data that will be used in the KDD process

• Eliminate data quality problems in the data such as…
  – inconsistencies due to differences between various data sources
  – missing data
  – different forms of data representation
  – data incompatibility
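
A minimal sketch of what such cleaning might look like, assuming a few hypothetical customer records merged from sources that represent the same facts differently. The field names, the alias table and the choice to mark missing values as `None` are illustrative assumptions, not something the slides prescribe.

```python
from typing import Optional

# Hypothetical records merged from sources that encode the same facts differently.
raw_records = [
    {"customer": "  Alice ", "country": "UK", "spend": "120.50"},
    {"customer": "BOB", "country": "United Kingdom", "spend": ""},
    {"customer": "carol", "country": "uk", "spend": "80"},
]

# Assumed mapping used to unify different representations of one value.
COUNTRY_ALIASES = {"uk": "UK", "united kingdom": "UK"}


def clean_record(record: dict) -> dict:
    """Normalise representation and make missing values explicit."""
    country = record["country"].strip().lower()
    spend_text = record["spend"].strip()
    spend: Optional[float] = float(spend_text) if spend_text else None
    return {
        "customer": record["customer"].strip().title(),
        "country": COUNTRY_ALIASES.get(country, record["country"].strip()),
        "spend": spend,  # None marks missing data rather than guessing a value
    }


cleaned = [clean_record(r) for r in raw_records]
missing_spend = [r["customer"] for r in cleaned if r["spend"] is None]
```

Keeping missing values explicit, instead of silently filling them in, leaves the decision about how to treat them to the later modelling phase.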

Model development

Involves activities concerned with forming a basic hypothesis which can satisfy the knowledge discovery goals

• Select the parameters for the model
  – formulate measures that can be used to quantify achievement of the goal (outcome variable or dependent variable)
  – select a set of independent variables which are deemed to have relevance to the outcome variables
• Segment the data
  – find possible relevant subsets in the population
• Choose an analysis model which fits the problem domain

NOTE: This whole phase demands background knowledge of the domain

Data analysis

Involves activities aimed at determining the rules/reasons governing the behaviour of those entities focused on by the knowledge discovery goal

• specify the chosen model
  – use some form of formal expression
• fit the model to the data
  – perform initial adjustments to some of the parameters
• evaluate the model
  – check the soundness of the model against the data
• refine the model
  – modify the model on the basis of its discrepancies with the evidence presented by the data

Output generation

• Reports of findings in the analysis
• Action suggestions on the basis of the findings
• Models for use in similar analysis scenarios
• Monitoring mechanisms which observe the variables covered in the analysis and “trigger” notifications when certain conditions are noted in the data.
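
The monitoring mechanism in the last bullet could be sketched roughly as follows. The variable names, the conditions and the representation of a monitor as a (message, condition) pair are all hypothetical choices for illustration.

```python
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, float]
Monitor = Tuple[str, Callable[[Observation], bool]]

# Hypothetical conditions on variables covered by an earlier analysis.
monitors: List[Monitor] = [
    ("weekly sales below 1000", lambda o: o["weekly_sales"] < 1000),
    ("return rate above 5%", lambda o: o["return_rate"] > 0.05),
]


def triggered(observation: Observation, monitors: List[Monitor]) -> List[str]:
    """Return the notification text of every monitor whose condition holds."""
    return [message for message, condition in monitors if condition(observation)]


notifications = triggered({"weekly_sales": 850.0, "return_rate": 0.02}, monitors)
```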

Developing KDD applications

Purpose: an application to answer a key business question

• a labour intensive initial discovery of knowledge by someone who understands the domain as well as the specific data analysis techniques needed

• encoding of the discovered knowledge within a specific problem solving architecture

• application of the knowledge in the context of a real world task by a well understood class of end-users

• Installation of analysis, monitoring, and reporting mechanisms as a base for continual evaluation of data

Data mining

What is data mining?

Rather formal definition:
• Data mining involves fitting models to, and observing patterns from, observed data through the application of specific algorithms.

Less formally:
• Data analysis in order to explain an aspect of a complex reality by expressing it as an understandable simplification

Goals for data mining

• Prediction
  – involves using some variables or fields in the database to predict unknown or future values of other variables of interest
• Description
  – focuses on finding human-interpretable patterns describing the data

Rationale for data mining

• Dramatic increase in the amount of data available (the data explosion)
• Increasing competition in the world’s market
• The low relative value of easily discovered information
• Increasing cleverness
• Emergence of new enabling technology

Enabling factors for data mining

• Increased data storage ability
• Increased data gathering ability
• Increased processing power
• The introduction of new computationally intensive methods of machine learning

Background to data mining

• Inductive learning
  – supervised learning
  – unsupervised learning
• Statistics
• Machine learning
  – Differences between DM and ML
    • DM finds understandable knowledge, ML improves the performance of an agent
    • DM is concerned with large, real-world databases, ML with smaller data sets
    • ML is a broader field, not only learning by example

Data mining algorithms

Specific mix of three components:
• The model
  – function
  – representational form
  – parameters from the data
• The model evaluation (preference) criterion
  – preference of one set of models or set of parameters over another
  – based on goodness-of-fit function
• The search method
  – a method for finding particular models and parameters
  – Given: data, family of models, preference criterion
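
The three components can be made concrete with a deliberately small sketch. It assumes a one-parameter family of models (lines through the origin), sum of squared errors as the goodness-of-fit criterion, and a grid search as the search method; the data and all of these choices are illustrative assumptions, not taken from the slides.

```python
# Observed data (hypothetical): pairs of (x, y).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def model(slope: float, x: float) -> float:
    """The model: a family of functions y = slope * x with one parameter."""
    return slope * x


def sum_squared_error(slope: float) -> float:
    """The evaluation criterion: a goodness-of-fit score, lower is better."""
    return sum((y - model(slope, x)) ** 2 for x, y in data)


def grid_search(candidates: list) -> float:
    """The search method: try each candidate parameter, keep the best one."""
    return min(candidates, key=sum_squared_error)


candidates = [c / 10 for c in range(0, 41)]  # slopes 0.0, 0.1, ..., 4.0
best_slope = grid_search(candidates)
```

Swapping any one component (a richer model family, a different criterion, a smarter search) yields a different algorithm while the other two stay unchanged.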

Primary operations in data mining

A number of basic operations can be used for prediction and description
– Classification
– Regression
– Clustering
– Summarisation
– Dependency modelling
– Change and deviation detection

Classification

• Learning a function that maps (classifies) a data item into one of several predefined classes

• In supervised learning it is the user that defines the classes.

• The classification is applied in the form of one or more attributes that denote the class of the data item.

• These classifying attributes are known as predicted attributes. A combination of values for the predicted attributes defines a class

• Other attributes of the data item are known as predicting attributes
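
As a hedged illustration, the sketch below learns nothing more elaborate than a nearest-neighbour rule: a new item receives the class of the closest training example. The attributes (age and income), the class labels and the choice of nearest-neighbour classification are assumptions for illustration; the slides do not name a specific classifier.

```python
import math

# Hypothetical training items: predicting attributes (age, income in thousands)
# paired with a user-defined class label (the predicted attribute).
training = [
    ((25.0, 20.0), "standard"),
    ((30.0, 25.0), "standard"),
    ((55.0, 90.0), "premium"),
    ((60.0, 95.0), "premium"),
]


def classify(item: tuple) -> str:
    """Map an item to the class of its nearest training example (1-NN)."""
    _, nearest_label = min(
        training, key=lambda example: math.dist(example[0], item)
    )
    return nearest_label


predicted = classify((28.0, 22.0))
```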

Regression

• A common statistical technique for modelling the relationship between two or more variables

• Learning a function which maps a data item to a real-valued prediction variable

• Simple linear regression uses the straight line model $Y = \beta_0 + \beta_1 X + \varepsilon$, where $Y$ is the prediction variable (dependent variable) and $X$ is the predictive variable (independent variable)

• Multiple regression involves more than two variables and uses the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$, where $Y$ is the prediction variable and $X_1, \dots, X_n$ are the predictive variables
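
For the simple linear case, the intercept and slope can be estimated by ordinary least squares. The sketch below is one way to do that with the standard library; the data are hypothetical and were generated without an error term so that the fit is exact.

```python
from statistics import mean

# Hypothetical observations of the predictive variable X and prediction variable Y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated from Y = 1 + 2X with no error term


def fit_simple_linear(xs: list, ys: list) -> tuple:
    """Least-squares estimates (intercept, slope) of a straight-line model."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_simple_linear(xs, ys)
prediction = b0 + b1 * 6.0  # predicted Y for a new X of 6
```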

Clustering

• A common descriptive task for determining a finite set of categories or clusters to describe the data

• Categories may be mutually exclusive and exhaustive, or consist of richer representations such as hierarchical or overlapping categories

• A cluster is a group of objects grouped together because of their similarity or proximity. Data units in a cluster are homogeneous and differ significantly from those in other clusters

• Correlations and functions of distance between elements are used in defining the clusters
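
One common way to turn distance between elements into clusters is k-means, sketched below for two clusters. The choice of k-means, the data points and the initial centroids are assumptions for illustration; the slides do not commit to a particular clustering method.

```python
import math

# Hypothetical two-dimensional data units.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]


def nearest(point: tuple, centroids: list) -> int:
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points: list, centroids: list, iterations: int = 10) -> list:
    """Return one cluster index per point, refining centroids each round."""
    labels = [nearest(p, centroids) for p in points]
    for _ in range(iterations):
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
        labels = [nearest(p, centroids) for p in points]
    return labels


# Assumption: the first and last points serve as initial centroids.
labels = k_means(points, [points[0], points[-1]])
```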

Summarisation

• Methods for finding a compact description for a subset of data

• Often relies on statistical methods such as the calculation of means and standard deviations

• Is often applied to interactive exploratory data analysis and automated report generation.
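
A compact description of this kind might be produced as in the sketch below, which assumes hypothetical sales figures grouped by region and uses the sample standard deviation from Python's `statistics` module.

```python
from statistics import mean, stdev

# Hypothetical subset of data: four sales figures per region.
sales_by_region = {
    "North": [10.0, 12.0, 14.0, 16.0],
    "South": [20.0, 20.0, 20.0, 20.0],
}

# One compact description per subset: count, mean and sample standard deviation.
summary = {
    region: {"n": len(values), "mean": mean(values), "stdev": stdev(values)}
    for region, values in sales_by_region.items()
}
```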

Dependency modelling

• Consists of finding a model which describes significant dependencies between variables

• There are two levels of dependency in dependency models:
  – The structural level specifies which variables are locally dependent on each other
  – The quantitative level specifies the strengths of the dependencies using some numerical scale

• Often in the form: x% of all records containing items A and B also contain items D and E
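
The "x%" in a statement of that form can be computed directly by counting records. The sketch below does so for a handful of hypothetical records; the function name `confidence` follows common association-rule terminology and is not a term used in the slides.

```python
# Hypothetical records, each a set of items (for example, one shopping basket).
records = [
    {"A", "B", "D", "E"},
    {"A", "B", "D", "E", "F"},
    {"A", "B", "C"},
    {"A", "D"},
    {"A", "B", "D", "E"},
]


def confidence(antecedent: set, consequent: set, records: list) -> float:
    """Share of records containing the antecedent that also contain the consequent."""
    matching = [r for r in records if antecedent <= r]
    if not matching:
        return 0.0
    return sum(1 for r in matching if consequent <= r) / len(matching)


rule_strength = confidence({"A", "B"}, {"D", "E"}, records)
print(f"{rule_strength:.0%} of records containing A and B also contain D and E")
```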

Change and deviation detection

• Focuses on discovering the most significant changes in the data from previously measured or normative values

• Often used on a long time series of records in order to discover trends

• Often used to discover sequential patterns occurring over extended time periods
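
A minimal sketch of deviation detection against normative values, assuming a hypothetical baseline series and a rule that flags any new value more than three sample standard deviations from the baseline mean. The threshold and the rule itself are illustrative assumptions.

```python
from statistics import mean, stdev

# Hypothetical normative values from earlier periods, and newly measured values.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
new_values = [100.5, 97.0, 120.0]


def deviations(baseline: list, new_values: list, threshold: float = 3.0) -> list:
    """Return new values more than `threshold` standard deviations from the baseline mean."""
    centre, spread = mean(baseline), stdev(baseline)
    return [v for v in new_values if abs(v - centre) > threshold * spread]


flagged = deviations(baseline, new_values)
```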

Problems and issues in data mining

• Limited information
• Noise and missing values
• Uncertainty
• Size of databases
• Irrelevance of certain fields
• Updates to databases

Multidimensional analysis and OLAP

OLAP vs OLTP

• OLTP servers handle mission-critical production data accessed through simple queries
  – usually handle queries of an automated nature
  – OLTP applications consist of a large number of relatively simple transactions
  – most often contain data organised on the basis of logical relations between normalised tables

• OLAP servers handle management-critical data accessed through an iterative analytical investigation
  – usually handle queries of an ad-hoc nature
  – support more complex and demanding transactions
  – contain logically organised data in multiple dimensions

What is OLAP?

Definition: The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data.

• Flexible information synthesis
• Multiple data dimensions/consolidation paths
• Dynamic data analysis

Codd’s four data models for data analysis

• Categorical data models
• Exegetical data models
• Contemplative data models
• Formulaic data models

Dimensionality revisited

[Figure: a focal event and its dimensions. Focal event: Sales. Dimensions: Year, Quarter, Product group, Product type, Region.]

OLAP Tool evaluation criteria (1-6)

• Multidimensional conceptual view
• Transparency
• Accessibility
• Consistent reporting performance
• Client-Server architecture
• Generic dimensionality

OLAP Tool evaluation criteria (7-12)

• Dynamic Sparse Matrix handling
• Multi-user support
• Unrestricted cross-dimensional analysis
• Intuitive data manipulation
• Flexible reporting
• Unlimited dimensions and aggregation levels

Functionality of OLAP tools

• Drill-down
• Drill-up
• Roll-up or consolidation
• “Slicing and dicing” by pivoting
• Drill-through
• Drill-across
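
Some of these operations can be sketched on a tiny in-memory cube. The example below assumes a hypothetical fact table keyed by four dimensions and shows roll-up (consolidation), drill-down (the same roll-up at a finer level) and slicing; the function names and data are illustrative and do not come from any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, product_group) -> sales.
facts = {
    (1997, "Q1", "ABC", "Group A"): 10,
    (1997, "Q2", "ABC", "Group A"): 20,
    (1997, "Q1", "XYZ", "Group A"): 30,
    (1997, "Q1", "ABC", "Group B"): 40,
    (1997, "Q2", "XYZ", "Group B"): 50,
}
DIMENSIONS = ("year", "quarter", "region", "product_group")


def roll_up(facts: dict, keep: tuple) -> dict:
    """Consolidate sales over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, sales in facts.items():
        totals[tuple(key[p] for p in positions)] += sales
    return dict(totals)


def slice_cube(facts: dict, dimension: str, value) -> dict:
    """Keep only the cells where one dimension has the given value."""
    position = DIMENSIONS.index(dimension)
    return {key: sales for key, sales in facts.items() if key[position] == value}


by_region = roll_up(facts, ("region",))                    # coarse, consolidated view
by_region_quarter = roll_up(facts, ("region", "quarter"))  # drill-down adds detail
q1_only = slice_cube(facts, "quarter", "Q1")               # one slice of the cube
```

Moving from `by_region_quarter` back to `by_region` corresponds to drilling up; real OLAP servers typically precompute or index such aggregates rather than scanning the facts each time.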

An OLAP “answer set”

Product Group   Region   First Quarter - 1997
Group A         ABC       1245
Group A         XYZ      34534
Group B         ABC      45543
Group B         XYZ      34533

Annotations in the figure: column headers (join constraints); column header (application constraint); row headers; answer set representing focal event.

Different forms of OLAP

• True OLAP

• ROLAP (relational OLAP)

• MOLAP (multidimensional OLAP)