data mining, data pattern, machine learning(week 2
8/8/2019 Data Mining, Data Pattern, Machine Learning(Week 2
http://slidepdf.com/reader/full/data-mining-data-pattern-machine-learningweek-2 1/19
DATA MINING
Data Mining, Data Pattern
and Machine Learning
Definition
• “… the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
Hand, Mannila & Smyth
• “… an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large data bases.”
Evangelos Simoudis in Cabena et al.
• “… the extraction of implicit, previously unknown, and potentially useful information from data.”
Witten & Frank
Why Has Data Mining Appeared
• Large volumes of data stored by organizations in a competitive environment, combined with advances in technologies which can be applied to the data
• Background and evolution
• The need for exploratory data analysis
– Niche marketing, customer retention, the internet, online interaction, scientific discovery
• The means to implement Data Mining
– data warehouses, computing power, effective modelling approaches
Structural Pattern of Data
Structural Pattern of Data --cont--
Machine Learning
• To learn:
– To get knowledge of by study, experience, or being taught
– To become aware by information or from observation
– To commit to memory
– To be informed
– To receive instruction
• Learning:
– Things learn when they change their behavior in a way that makes them perform better in the future
Machine Learning --cont--
• Machine Learning here means learning in a practical sense, not a theoretical one
• Interested in techniques for finding and describing structural patterns in data, as a tool for helping to explain that data and make predictions from it
Data Mining
• Preliminary Analysis
– Much interesting information can be found by querying the data set
– May be supported by a visualisation of the data set
• Choose one or more modelling approaches
• There are (at least?) two styles of data mining
– Hypothesis testing
– Knowledge discovery
• The styles and approaches are not mutually exclusive
The Process of Knowledge Discovery
• Pre-processing
– data selection
– cleaning
– coding
• Data Mining
– select a model
– apply the model
• Analysis of results and assimilation
– Take action and measure the results
Data Selection
• Identify the relevant data, both internal and
external to the organisation
• Select the subset of the data appropriate for
• Store the data in a database separate from
the operational systems
Data Pre-Processing
• Cleaning
– Domain consistency: replace certain values with
null
– -
database (DB) on each purchase transaction
– Disambiguation: highlighting ambiguities for a
decision by the user
• e.g., if names differed slightly but addresses were the
same
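The two fully stated cleaning steps above (domain consistency and disambiguation) can be sketched as follows. This is only an illustration: the record layout, the valid age range of 0 to 120, and the 0.6 name-similarity threshold are all assumptions, not part of the slides.

```python
from difflib import SequenceMatcher

# Hypothetical customer records; field names and values are invented.
records = [
    {"name": "J. Smith", "address": "12 High St", "age": 34},
    {"name": "John Smith", "address": "12 High St", "age": 340},
    {"name": "A. Jones", "address": "7 Mill Lane", "age": 51},
]

def enforce_domain(record, field, low, high):
    """Domain consistency: replace a value outside [low, high] with None (null)."""
    cleaned = dict(record)
    value = cleaned[field]
    if value is None or not (low <= value <= high):
        cleaned[field] = None
    return cleaned

def ambiguous_pairs(rows, threshold=0.6):
    """Disambiguation: list pairs with the same address and similar, non-identical names.

    The pairs are only highlighted; the decision is left to the user.
    """
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            a, b = rows[i], rows[j]
            if a["address"] != b["address"] or a["name"] == b["name"]:
                continue
            similarity = SequenceMatcher(None, a["name"], b["name"]).ratio()
            if similarity >= threshold:
                pairs.append((a["name"], b["name"]))
    return pairs

cleaned = [enforce_domain(r, "age", 0, 120) for r in records]  # assumed valid range
flagged = ambiguous_pairs(cleaned)
```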
Data Pre-Processing --cont--
• Enrichment
– Additional fields are added to records from external sources which may be vital in establishing relationships.
– e.g., take addresses and replace them with regional codes
– e.g., transform birth dates into age ranges
• It is often necessary to convert continuous data into range data for categorisation purposes.
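Turning birth dates into age ranges might look like the sketch below. The ten-year bin width, the reference date, and the sample birth dates are assumptions chosen for illustration.

```python
from datetime import date

def age_on(birth_date, today):
    """Whole years elapsed between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def age_range(age, width=10):
    """Map a continuous age to a range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

reference_day = date(2019, 8, 8)  # assumed reference date
birth_dates = [date(1985, 3, 14), date(1990, 12, 1), date(1979, 8, 9)]  # invented
ranges = [age_range(age_on(d, reference_day)) for d in birth_dates]
```

The same binning idea applies to any continuous field, for example income bands.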
Data Mining Tasks
• Various taxonomies exist, e.g. Berry & Linoff define 6 tasks:
– Classification
– Estimation (a.k.a. regression)
– Prediction
– Association Rule Discovery (a.k.a. Affinity Grouping)
– Clustering
– Description
• The tasks are also referred to as operations. Cabena et al. define 4 operations:
– Predictive Modelling
– Database Segmentation (a.k.a. clustering)
– Link Analysis
– Deviation Detection
• Beware! Different authors use different names for the same technique, operation
or task.
Classification
• Classification involves considering the features of some object and then assigning it to some pre-defined class, for example:
– Which phone numbers are fax numbers
– Which customers are high-value
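As a rough illustration of assigning an object to a pre-defined class from its features, the sketch below uses a 1-nearest-neighbour rule. The slides do not name a particular classifier, so this choice, the two features, and the labelled examples are all assumptions.

```python
import math

# Hypothetical labelled customers: (annual spend, orders per year) -> class.
training = [
    ((5200.0, 40), "high-value"),
    ((4800.0, 35), "high-value"),
    ((600.0, 5), "standard"),
    ((900.0, 8), "standard"),
]

def classify(features, examples):
    """Return the class of the closest labelled example (1-nearest-neighbour).

    Features are compared unscaled here; real data would usually be normalised.
    """
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]

label_a = classify((5000.0, 30), training)
label_b = classify((700.0, 6), training)
```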
Regression
• Regression deals with numerically valued outcomes rather than discrete categories as occurs in classification.
– Estimating family income
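A minimal sketch of estimating a numeric outcome: a least-squares straight line fitted to invented data. The predictor (years of education) and all figures are assumptions, used only to show the mechanics.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented, exactly linear data so the fit is easy to check by hand.
years_of_education = [10, 12, 14, 16]
family_income = [30000.0, 36000.0, 42000.0, 48000.0]

slope, intercept = fit_line(years_of_education, family_income)
estimate = slope * 15 + intercept  # estimated income for 15 years of education
```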
Prediction
• Essentially the same as classification and
estimation but involves future behavior
• Historical data is used to build a model
• The model developed is then applied to current
inputs to predict future outputs
– Predict which customers will respond to an
advertising promotion
– Classifying loan applications
Association Rule Discovery
• Association Rule Discovery is also referred to as Market Basket Analysis, or Affinity Grouping
• A common example is discovering which items are bought together at the supermarket. Once this is known, decisions can be made on, for example:
– how to arrange items on the shelves
– which items should be promoted together
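One simple way to see which items are bought together is to count item pairs across baskets and compute support and confidence, two measures commonly used for association rules. The baskets below are invented, and the sketch only handles pairs of items.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(item_a, item_b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(baskets)

def confidence(antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    together = pair_counts[tuple(sorted((antecedent, consequent)))]
    return together / item_counts[antecedent]

bread_butter_support = support("bread", "butter")
jam_to_bread = confidence("jam", "bread")
```

A pair with high support and high confidence is a candidate for shelf placement or joint promotion.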
Clustering
• Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields)
• In clustering there are no pre-defined classes. A similarity measure is used to group records. The user must attach meaning to the clusters formed
• Clustering often precedes some other data mining task, for example:
– once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster
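Grouping records by similarity without pre-defined classes can be sketched with a small k-means loop. The slides do not prescribe an algorithm, so k-means, the single feature (annual spend), the starting centroids, and the data are all assumptions.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 1-D values, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean (unchanged if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

annual_spend = [120.0, 150.0, 130.0, 900.0, 950.0, 870.0]  # invented values
centroids, clusters = k_means(annual_spend, centroids=[100.0, 1000.0])
```

The algorithm only returns groups; deciding that one cluster means "occasional shoppers" and the other "regulars" is left to the user, as the slide notes.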
Deviation Detection
• Records whose attributes deviate from the norm by significant amounts are also called outliers
• Application areas include:
– fraud detection
– tracing defects
• Visualization techniques and statistical techniques are useful in finding outliers
• A cluster which contains only a few records may in fact represent outliers
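One basic statistical technique for finding outliers is the z-score: flag values that lie more than a chosen number of standard deviations from the mean. The threshold of 2 and the transaction amounts below are assumptions for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    centre = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - centre) / spread > threshold]

# Invented transaction amounts with one suspicious value.
amounts = [52.0, 48.0, 50.0, 51.0, 49.0, 50.0, 250.0]
outliers = find_outliers(amounts)
```

In a small sample a single extreme value inflates the standard deviation, so a robust alternative such as the median may be preferable in practice.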