Machine Learning and Data Mining: 15 Data Exploration and Preparation
Course "Machine Learning and Data Mining" for the degree in Computer Engineering at the Politecnico di Milano. This lecture discusses data exploration and preparation.
Prof. Pier Luca Lanzi
Data Exploration and Preparation
Machine Learning and Data Mining (Unit 15)
References
Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems (Second Edition)
Tom M. Mitchell, "Machine Learning", McGraw Hill, 1997
Pang-Ning Tan, Michael Steinbach, Vipin Kumar, "Introduction to Data Mining", Addison Wesley
Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", 2nd Edition
Outline
Data Exploration
• Types of data
• Descriptive statistics
• Visualization
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature creation
• Discretization
• Attribute Transformation
• Compression
• Concept hierarchies
Types of data sets
Record
• Data Matrix
• Document Data
• Transaction Data
Graph
• World Wide Web
• Molecular Structures
Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Important Characteristics of Structured Data
Dimensionality
• Curse of Dimensionality
Sparsity
• Only presence counts
Resolution
• Patterns depend on the scale
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Projection of x Load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
Each document becomes a "term" vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
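The term-vector idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build the table on the slide: the function name `term_vector`, the vocabulary, and the sample sentence are all hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return the number of occurrences of each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, used only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "The team will play and play again until the score changes"
vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Terms that do not occur get a count of 0, which is why document data is typically sparse.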
Transaction Data
A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
Examples: generic graph and HTML links
<a href="papers/papers.html#bbbb">Data Mining </a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning </a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations </a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers
Chemical Data
Benzene Molecule: C6H6
Ordered Data
Sequences of transactions
An element of the sequence
Items/Events
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCCCGCAGGGCCCGCCCCGCGCCGTCGAGAAGGGCCCGCCTGGCGGGCGGGGGGAGGCGGGGCCGCCCGAGCCCAACCGAGTCCGACCAGGTGCCCCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAGGCCAAGTAGAACACGCGAAGCGCTGGGCTGCCTGCTGCGACCAGGG
Ordered Data
Spatio-Temporal Data
Average Monthly Temperature of land and ocean
What is data exploration?
A preliminary exploration of the data to better understand its characteristics.
Key motivations of data exploration include
• Helping to select the right tool for preprocessing or analysis
• Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by data analysis tools
Related to the area of Exploratory Data Analysis (EDA)
• Created by statistician John Tukey
• Seminal book is Exploratory Data Analysis by Tukey
• A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
• http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration
In EDA, as originally defined by Tukey
• The focus was on visualization
• Clustering and anomaly detection were viewed as exploratory techniques
• In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
In our discussion of data exploration, we focus on
• Summary statistics
• Visualization
• Online Analytical Processing (OLAP)
Iris Sample Data Set
Many of the exploratory data techniques are illustrated with the Iris Plant data set
• Available at http://www.ics.uci.edu/~mlearn/MLRepository.html
• From the statistician Ronald A. Fisher
Three flower types (classes): Setosa, Virginica, Versicolour
Four (non-class) attributes: sepal width and length, petal width and length
Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.
Mining Data Descriptive Characteristics
Motivation
• To better understand the data: central tendency, variation and spread
Data dispersion characteristics
• Median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube
Summary Statistics
They are numbers that summarize properties of the data
Summarized properties include frequency, location and spread
Examples
• Location: mean
• Spread: standard deviation
Most summary statistics can be calculated in a single pass through the data
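As an illustration of the single-pass claim, the sketch below computes the count, mean, and sample variance while reading each value exactly once. It uses Welford's update rule, which is one common choice and is not named in the slides; the function name and the sample values are hypothetical.

```python
def running_mean_variance(values):
    """Return (count, mean, sample variance) after a single pass over the values."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

# Hypothetical sample values.
count, mean, variance = running_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(count, mean, variance)
```

Statistics that depend on the sorted order, such as the median, generally do not fit this pattern without extra memory or approximation.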
Frequency and Mode
The frequency of an attribute value is the percentage of times the value occurs in the data set
For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
The mode of an attribute is the most frequent attribute value
The notions of frequency and mode are typically used with categorical data
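A short sketch of both notions, applied to the Refund column of the record data example shown earlier. The helper names `frequencies` and `mode` are hypothetical, and ties for the mode are resolved by first occurrence, which is an arbitrary choice of this sketch.

```python
from collections import Counter

def frequencies(values):
    """Return the fraction of records carrying each attribute value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value (the first one seen if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

# Refund column of the ten records in the record data example.
refund = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
print(frequencies(refund), mode(refund))
```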
Percentiles
For continuous data, the notion of a percentile is more useful
Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is a value x_p of x such that p% of the observed values of x are less than x_p
For instance, the 50th percentile is the value x_50% such that 50% of all values of x are less than x_50%
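Several conventions exist for computing a percentile from a finite sample. The sketch below uses the nearest-rank convention, which is an assumption of this example and not something the slide prescribes; the function name is hypothetical and the values are the taxable incomes (in thousands) from the record data example.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank convention."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]

# Taxable income (in thousands) from the record data example.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
print(percentile_nearest_rank(incomes, 25), percentile_nearest_rank(incomes, 50))
```

Other conventions interpolate between neighboring values, so results can differ slightly on small samples.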
Measures of Location: Mean and Median
The mean is the most common measure of the location of a set of points
However, the mean is very sensitive to outliers
Thus, the median or a trimmed mean is also commonly used
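The effect of an outlier on the three measures can be seen on the taxable incomes of the record data example, where 220K is far from the other values. This is a minimal sketch; `trimmed_mean` is a hypothetical helper that drops the same fraction of values at both ends before averaging.

```python
import statistics

def trimmed_mean(values, fraction):
    """Drop the given fraction of values at each end, then average the rest."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return statistics.mean(ordered[k:len(ordered) - k])

# Taxable income (in thousands) from the record data example.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
print(statistics.mean(incomes))     # pulled upward by the 220 outlier
print(statistics.median(incomes))   # depends only on the middle values
print(trimmed_mean(incomes, 0.1))   # ignores the smallest and largest value
```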
Measures of Spread: Range and Variance
Range is the difference between the max and min
The variance or standard deviation is the most common measure of the spread of a set of points
However, the variance is also sensitive to outliers, so other measures are often used.
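The slide does not name the alternative measures. As an assumption of this sketch, two commonly used robust ones are shown next to range and standard deviation: the median absolute deviation (MAD) and the interquartile range (IQR). The helper names are hypothetical, and `statistics.quantiles` uses its default quartile convention.

```python
import statistics

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])

def interquartile_range(values):
    """Difference between the third and first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Taxable income (in thousands) from the record data example.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
print(value_range(incomes), statistics.stdev(incomes))
print(median_absolute_deviation(incomes), interquartile_range(incomes))
```

On this sample the single value 220 inflates the range and the standard deviation much more than it affects the MAD.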
Visualization
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported
Visualization of data is one of the most powerful and appealing techniques for data exploration
Humans have a well developed ability to analyze large amounts of information that is presented visually
• Can detect general patterns and trends
• Can detect outliers and unusual patterns
Example: Sea Surface Temperature
The following shows the Sea Surface Temperature for July 1982
• Tens of thousands of data points are summarized in a single figure
Representation
Is the mapping of information to a visual format
Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors
Example:
• Objects are often represented as points
• Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
• If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
Arrangement
Is the placement of visual elements within a display
• Can make a large difference in how easy it is to understand the data
• Example:
Selection
Is the elimination or the de-emphasis of certain objects and attributes
Selection may involve choosing a subset of attributes
• Dimensionality reduction is often used to reduce the number of dimensions to two or three
• Alternatively, pairs of attributes can be considered
Selection may also involve choosing a subset of objects
• A region of the screen can only show so many points
• Can sample, but want to preserve points in sparse areas
Visualization Techniques: Histograms
Histogram
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of objects in each bin
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
Example: Petal Width (10 and 20 bins, respectively)
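The dependence on the number of bins can be illustrated with a minimal equal-width binning sketch. The function name and the measurements below are hypothetical (they are not the Iris petal widths), and the largest value is assigned to the last bin by convention.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        index = min(int((x - low) / width), n_bins - 1) if width > 0 else 0
        counts[index] += 1
    return counts

# Hypothetical measurements, used only for illustration.
measurements = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
coarse = histogram(measurements, 2)
fine = histogram(measurements, 4)
print(coarse, fine)
```

With two bins almost everything falls into the first bar, while four bins reveal that most values sit in the second quarter of the range.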
Two-Dimensional Histograms
Show the joint distribution of the values of two attributes
• Example: petal width and petal length
What does this tell us?
Visualization Techniques: Box Plots
Box Plots
• Invented by J. Tukey
• Another way of displaying the distribution of data
• The following figure shows the basic parts of a box plot
Figure labels, from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile
Example of Box Plots
Box plots can be used to compare attributes
Visualization Techniques: Scatter Plots
Attribute values determine the position
• Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
• Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
• Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes
Scatter Plot Array of Iris Attributes
Visualization Techniques: Contour Plots
Useful when a continuous attribute is measured on a spatial grid
• They partition the plane into regions of similar values
• The contour lines that form the boundaries of these regions connect points with equal values
• The most common example is contour maps of elevation
• Can also display temperature, rainfall, air pressure, etc.
Contour Plot Example: SST Dec, 1998
Celsius
Visualization Techniques: Matrix Plots
Can plot the data matrix
This can be useful when objects are sorted according to class
Typically, the attributes are normalized to prevent one attribute from dominating the plot
Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
Examples of matrix plots are presented on the next two slides
Visualization of the Iris Data Matrix
standard deviation
Visualization of the Iris Correlation Matrix
Visualization Techniques: Parallel Coordinates
Used to plot the attribute values of high-dimensional data
• Instead of using perpendicular axes, use a set of parallel axes
• The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line
• Thus, each object is represented as a line
• Often, the lines representing a distinct class of objects group together, at least for some attributes
• Ordering of attributes is important in seeing such groupings
Parallel Coordinates Plots for Iris Data
Other Visualization Techniques
Star Plots
• Similar approach to parallel coordinates, but axes radiate from a central point
• The line connecting the values of an object is a polygon
Chernoff Faces
• Approach created by Herman Chernoff
• This approach associates each attribute with a characteristic of a face
• The values of each attribute determine the appearance of the corresponding facial characteristic
• Each object becomes a separate face
• Relies on the human ability to distinguish faces
Star Plots for Iris Data
Setosa
Versicolour
Virginica
Chernoff Faces for Iris Data
Setosa
Versicolour
Virginica
Why Data Preprocessing?
Data in the real world is dirty
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
Noisy: containing errors or outliers
• e.g., Salary=“-10”
Inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected and when it is analyzed
• Human/hardware/software problems
Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
• E.g., duplicate or missing data may cause incorrect or even misleading statistics.
A data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
• Noise and outliers
• Missing values
• Duplicate data
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
Broad categories:
• Intrinsic, contextual, representational, and accessibility
Noise
Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and “snow” on a television screen
Two Sine Waves Two Sine Waves + Noise
Outliers
Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
Missing Values
Reasons for missing values
• Information is not collected (e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values
• Eliminate data objects
• Estimate missing values
• Ignore the missing value during analysis
• Replace with all possible values (weighted by their probabilities)
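Two of the handling strategies, eliminating objects and estimating the missing value, can be sketched as follows. The records, the attribute name, and both helper functions are hypothetical; filling with the mean of the known values is only one simple way to estimate, chosen here for brevity.

```python
def drop_missing(records, attribute):
    """Eliminate data objects whose value for the attribute is missing (None)."""
    return [r for r in records if r[attribute] is not None]

def fill_with_mean(records, attribute):
    """Estimate missing values with the mean of the known values (copies the records)."""
    known = [r[attribute] for r in records if r[attribute] is not None]
    mean = sum(known) / len(known)
    return [
        {**r, attribute: mean} if r[attribute] is None else dict(r)
        for r in records
    ]

# Hypothetical records with one missing age.
people = [{"name": "A", "age": 30}, {"name": "B", "age": None}, {"name": "C", "age": 50}]
complete_only = drop_missing(people, "age")
imputed = fill_with_mean(people, "age")
print(complete_only, imputed)
```

Dropping objects loses information about the other attributes, while imputation keeps every object at the cost of introducing estimated values.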
Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one another
Major issue when merging data from heterogeneous sources
Examples: same person with multiple email addresses
Data cleaning: process of dealing with duplicate data issues
Data Cleaning as a Process
Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
  • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
  • Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes
• Iterative and interactive (e.g., Potter’s Wheel)
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
• Data reduction: reduce the number of attributes or objects
• Change of scale: cities aggregated into regions, states, countries, etc.
• More “stable” data: aggregated data tends to have less variability
Aggregation
Standard Deviation of Average Monthly Precipitation
Standard Deviation of Average Yearly Precipitation
Variation of Precipitation in Australia
Sampling
Sampling is the main technique employed for data selection
It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming
Key principles for Effective Sampling
Using a sample will work almost as well as using the entire data set, if the sample is representative
A sample is representative if it has approximately the same property (of interest) as the original set of data
Types of Sampling
Simple Random Sampling
• There is an equal probability of selecting any particular item
Sampling without replacement
• As each item is selected, it is removed from the population
Sampling with replacement
• Objects are not removed from the population as they are selected for the sample
• In sampling with replacement, the same object can be picked up more than once
Stratified sampling
• Split the data into several partitions
• Then draw random samples from each partition
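The sampling schemes above map onto Python's `random` module roughly as sketched below. The function names, the population, and the stratification key are hypothetical, and drawing the same number of objects from every partition is just one possible stratification policy.

```python
import random

def sample_without_replacement(items, n, rng):
    """Each selected item is removed from the population, so no item repeats."""
    return rng.sample(items, n)

def sample_with_replacement(items, n, rng):
    """Items stay in the population, so the same item can be drawn again."""
    return rng.choices(items, k=n)

def stratified_sample(items, key, per_group, rng):
    """Partition the items by key, then draw a random sample from each partition."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

rng = random.Random(0)         # fixed seed so the sketch is repeatable
population = list(range(100))  # hypothetical population of 100 objects
without = sample_without_replacement(population, 10, rng)
with_repl = sample_with_replacement(population, 10, rng)
stratified = stratified_sample(population, lambda x: x % 4, 5, rng)
print(without, with_repl, stratified)
```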
Sample Size
8000 points 2000 Points 500 Points
Sample Size
What sample size is necessary to get at least one object from each of 10 groups?
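One way to approach this question is to compute the probability that a random sample covers all groups. The sketch below assumes that the 10 groups are equally large and that objects are drawn with replacement (or from a very large population); neither assumption is stated on the slide. Under those assumptions, inclusion-exclusion gives the coverage probability, and the function names are hypothetical.

```python
from math import comb

def prob_all_groups_covered(n_samples, n_groups=10):
    """Probability that n_samples random draws hit every one of n_groups equal groups."""
    return sum(
        (-1) ** k * comb(n_groups, k) * ((n_groups - k) / n_groups) ** n_samples
        for k in range(n_groups + 1)
    )

def smallest_sample_size(target, n_groups=10):
    """Smallest sample size whose coverage probability reaches the target."""
    n = n_groups
    while prob_all_groups_covered(n, n_groups) < target:
        n += 1
    return n

for n in (10, 20, 40, 60):
    print(n, prob_all_groups_covered(n))
print(smallest_sample_size(0.95))
```

With exactly 10 draws the probability is tiny, and it approaches 1 only when the sample is several times larger than the number of groups.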
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Randomly generate 500 points
• Compute difference between max and min distance between any pair of points
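The experiment can be reproduced with a short script. This sketch makes several assumptions that the slide does not spell out: points are drawn uniformly from the unit hypercube, distances are Euclidean, and the difference is normalized by the minimum distance and reported on a log10 scale so that different dimensionalities are comparable. The function name is hypothetical.

```python
import math
import random

def distance_contrast(n_points, n_dims, rng):
    """Return log10((max_dist - min_dist) / min_dist) over all pairs of random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    min_d, max_d = math.inf, 0.0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.dist(points[i], points[j])
            min_d = min(min_d, d)
            max_d = max(max_d, d)
    return math.log10((max_d - min_d) / min_d)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for dims in (2, 10, 50):
    print(dims, round(distance_contrast(500, dims, rng), 2))
```

The printed contrast shrinks as the number of dimensions grows: the nearest and farthest pairs become almost equally distant, which is why distance-based notions lose meaning.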
Dimensionality Reduction
Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
Techniques
• Principal Component Analysis
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent the data
Steps
• Normalize input data: each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
Principal Component Analysis
Figure: original axes X1 and X2 with principal components Y1 and Y2
Dimensionality Reduction: PCA
Goal is to find a projection that captures the largest amount of variation in data
Figure: data points in the (x1, x2) plane with projection direction e
Dimensionality Reduction: PCA
Find the eigenvectors of the covariance matrix
The eigenvectors define the new space
Figure: data points in the (x1, x2) plane with eigenvector e
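For two-dimensional data the leading eigenvector of the covariance matrix has a closed form, which keeps a sketch of the idea free of any linear algebra library. The function name and the sample points are hypothetical, and the restriction to two dimensions is a simplification of this example; real data sets would normally use a numerical eigen-decomposition or SVD routine.

```python
import math

def first_principal_component(points):
    """Return (unit direction, variance along it) for a list of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Angle of the eigenvector that belongs to the largest eigenvalue.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # Largest eigenvalue, i.e. the variance captured along that direction.
    eigenvalue = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)
    return direction, eigenvalue

# Hypothetical points lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, eigenvalue = first_principal_component(points)
print(direction, eigenvalue)
```

Because these points lie on a line, the first component captures all of the variance and the second eigenvalue is zero, so projecting onto one dimension loses nothing.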
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
• Duplicate much or all of the information contained in one or more other attributes
• Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
• Contain no information that is useful for the data mining task at hand
• Example: students' ID is often irrelevant to the task of predicting students' GPA
Feature Subset Selection
Brute-force approach
• Try all possible feature subsets as input to data mining algorithm
Embedded approaches
• Feature selection occurs naturally as part of the data mining algorithm
Filter approaches
• Features are selected using a procedure that is independent from a specific data mining algorithm
• E.g., attributes are selected based on correlation measures
Wrapper approaches
• Use a data mining algorithm as a black box to find best subset of attributes
• E.g., apply a genetic algorithm and an algorithm for decision tree to find the best set of features for a decision tree
Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• E.g., given the birthday, create the attribute age
Three general methodologies:
• Feature Extraction: domain-specific
• Mapping Data to New Space
• Feature Construction: combining features
Mapping Data to a New Space
Figure panels: Two Sine Waves, Two Sine Waves + Noise, Frequency
Fourier transform
Wavelet transform
Discretization
Three types of attributes:
• Nominal: values from an unordered set, e.g., color
• Ordinal: values from an ordered set, e.g., military or academic rank
• Continuous: numeric values, e.g., integer or real numbers
Discretization:
• Divide the range of a continuous attribute into intervals
• Some classification algorithms only accept categorical attributes
• Reduce data size by discretization
• Prepare for further analysis
Discretization Approaches
Supervised
• Attributes are discretized using the class information
• Generates intervals that try to minimize the loss of information about the class
Unsupervised
• Attributes are discretized solely based on their values
Discretization Using Class Labels
Entropy based approach
3 categories for both x and y 5 categories for both x and y
Discretization Without Using Class Labels
Data
Equal frequency
Equal interval width
K-means
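Two of the unsupervised schemes named above, equal interval width and equal frequency, can be sketched as follows (K-means is not covered here). The function names and the values are hypothetical; each function returns a bin index per value, and the equal-frequency variant may place tied values in different bins, which a production implementation would usually avoid.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [min(int((x - low) / width), n_bins - 1) for x in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins intervals holding about equally many values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

# Hypothetical values with one extreme point.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
by_width = equal_width_bins(values, 3)
by_frequency = equal_frequency_bins(values, 3)
print(by_width, by_frequency)
```

The extreme value stretches the equal-width intervals so that all other values collapse into the first bin, while equal-frequency binning still spreads them evenly.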
Discretization and Concept Hierarchy
Discretization
• Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
• Interval labels can then be used to replace actual data values
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
Concept hierarchy formation
• Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as young, middle-aged, or senior)
Typical Discretization Methods
Entropy-based discretization: supervised, top-down split
Interval merging by χ2 Analysis: supervised, bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information (weighted entropy) after partitioning is
$$I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$
Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is
$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where p_i is the probability of class i in S1
• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
• The process is recursively applied to partitions obtained until some stopping criterion is met
• Such a boundary may reduce data size and improve classification accuracy
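A single splitting step of this procedure can be sketched as below, using the taxable income and cheat columns of the record data example. The function names are hypothetical, candidate boundaries are taken as midpoints between consecutive distinct values (one common choice, assumed here), and the recursion and stopping criterion are left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the class distribution of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, I(S, T)) for the boundary T that minimizes the weighted entropy."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can separate equal values
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        cost = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or cost < best[1]:
            best = (boundary, cost)
    return best

# Taxable income (in thousands) and cheat label from the record data example.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
boundary, cost = best_boundary(incomes, cheat)
print(boundary, cost)
```

Applying the same function again to the samples on each side of the chosen boundary gives the recursive, top-down discretization described above.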
Interval Merge by χ2 Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: find the best neighboring intervals and merge them to form larger intervals recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
• Initially, each distinct value of a numerical attribute A is considered to be one interval
• χ2 tests are performed for every pair of adjacent intervals
• Adjacent intervals with the least χ2 values are merged together, since low χ2 values for a pair indicate similar class distributions
• This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
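The χ2 value for one pair of adjacent intervals can be computed from their per-class counts, as in the sketch below. It uses the standard Pearson statistic on a two-row contingency table; skipping cells whose expected count is zero is a simplification assumed here, and the function name and the counts are hypothetical. The merge loop itself is not shown.

```python
def chi_square(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals given their per-class counts."""
    total = sum(counts_a) + sum(counts_b)
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            col_total = counts_a[j] + counts_b[j]
            expected = row_total * col_total / total
            if expected > 0:  # assumption: ignore classes absent from both intervals
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical class counts (class 1, class 2) for pairs of adjacent intervals.
similar = chi_square([5, 5], [5, 5])
different = chi_square([10, 0], [0, 10])
print(similar, different)
```

The pair with identical class distributions yields 0 and would be merged first, while the pair with opposite distributions yields a large value and would be kept apart.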
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule
Step 1: distribution of profit (count vs. profit): Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 - $2,000)
• (-$1,000 - 0)
• (0 - $1,000)
• ($1,000 - $2,000)
Step 4: (-$400 - $5,000)
• (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
• (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
• ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
• ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
Simple functions: x^k, log(x), e^x, |x|
Standardization and Normalization
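Standardization and normalization can mean slightly different things in different texts. The sketch below assumes the two most common readings: z-score standardization (subtract the mean, divide by the population standard deviation) and min-max normalization to the interval [0, 1]. The function names and the values are hypothetical, and both functions expect at least two distinct values.

```python
import statistics

def standardize(values):
    """Map values to z-scores: zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

def min_max_normalize(values):
    """Map values linearly onto the interval [0, 1]."""
    low, high = min(values), max(values)
    return [(x - low) / (high - low) for x in values]

# Hypothetical attribute values.
values = [10, 20, 30]
standardized = standardize(values)
normalized = min_max_normalize(values)
print(standardized, normalized)
```

Both transformations keep the ordering of the values, so each old value can still be identified with its replacement.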
On-Line Analytical Processing (OLAP)
Relational databases put data into tables, while OLAP uses a multidimensional array representation.
• Such representations of data previously existed in statistics and other fields
There are a number of data analysis and data exploration operations that are easier with such a data representation.
Creating a Multidimensional Array
First, identify which attributes are to be the dimensions and which attribute is to be the target attribute whose values appear as entries in the multidimensional array.
• The attributes used as dimensions must have discrete values
• The target value is typically a count or continuous value, e.g., the cost of an item
• Can have no target variable at all except the count of objects that have the same set of attribute values
Second, find the value of each entry in the multidimensional array by summing the values (of the target attribute) or count of all objects that have the attribute values corresponding to that entry.
Example: Iris data
We show how the attributes, petal length, petal width, and species type can be converted to a multidimensional array
• First, we discretize the petal width and length to have categorical values: low, medium, and high
• We get the following table (note the count attribute)
Example: Iris data (continued)
Each unique tuple of petal width, petal length, and species type identifies one element of the array.
This element is assigned the corresponding count value.
The figure illustrates the result.
All non-specified tuples are 0.
Example: Iris data (continued)
Slices of the multidimensional array are shown by the following cross-tabulations
What do these tables tell us?
OLAP Operations: Data Cube
The key operation of OLAP is the formation of a data cube
A data cube is a multidimensional representation of data, together with all possible aggregates.
By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions.
For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of each type.
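The notion of "all possible aggregates" can be made concrete with a small sketch that sums a sparse multidimensional array over every proper subset of its dimensions. The cell counts below are hypothetical (they are not the real Iris counts), the array is stored as a dictionary keyed by (petal length, petal width, species), and the function name is an assumption of this example.

```python
from collections import defaultdict
from itertools import combinations

def all_aggregates(cells, n_dims):
    """Map each kept subset of dimension indices to its aggregated totals."""
    aggregates = {}
    for size in range(n_dims):  # proper subsets only: keep 0 .. n_dims - 1 dimensions
        for kept in combinations(range(n_dims), size):
            totals = defaultdict(int)
            for key, value in cells.items():
                totals[tuple(key[i] for i in kept)] += value
            aggregates[kept] = dict(totals)
    return aggregates

# Hypothetical counts keyed by (petal length, petal width, species).
cells = {
    ("low", "low", "Setosa"): 4,
    ("medium", "medium", "Versicolour"): 3,
    ("high", "medium", "Virginica"): 1,
    ("high", "high", "Virginica"): 2,
}
aggregates = all_aggregates(cells, 3)
print(aggregates[(2,)])  # keep only the species dimension
print(aggregates[()])    # sum over everything: the overall total
```

For three dimensions this produces 3 two-dimensional, 3 one-dimensional, and 1 zero-dimensional aggregate, seven in total.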
Data Cube Example
Consider a data set that records the sales of products at a number of company stores at various dates.
This data can be represented as a 3 dimensional array
There are 3 two-dimensional aggregates (3 choose 2), 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total)
Data Cube Example (continued)
The following table shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates, and the overall total
OLAP Operations: Slicing and Dicing
Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions. Dicing involves selecting a subset of cells by specifying a range of attribute values.
This is equivalent to defining a subarray from the complete array.
In practice, both operations can also be accompanied by aggregation over some dimensions.
OLAP Operations: Roll-up and Drill-down
Attribute values often have a hierarchical structure.
• Each date is associated with a year, month, and week.
• A location is associated with a continent, country, state (province, etc.), and city.
• Products can be divided into various categories, such as clothing, electronics, and furniture.
Note that these categories often nest and form a tree or lattice
• A year contains months, which contain days
• A country contains states, which contain cities
OLAP Operations: Roll-up and Drill-down
This hierarchical structure gives rise to the roll-up and drill-down operations.
For sales data, we can aggregate (roll up) the sales across all the dates in a month.
Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
Likewise, we can drill down or roll up on the location or product ID attributes.
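A roll-up along the time dimension amounts to grouping by the coarser level of the hierarchy and summing. In this minimal sketch the daily sales figures are hypothetical, dates are assumed to be ISO strings of the form YYYY-MM-DD, and the function name is an assumption of the example.

```python
from collections import defaultdict

def roll_up_to_month(daily_sales):
    """Aggregate daily sales totals into monthly totals."""
    monthly = defaultdict(int)
    for date, amount in daily_sales.items():
        monthly[date[:7]] += amount  # "YYYY-MM-DD" -> "YYYY-MM"
    return dict(monthly)

# Hypothetical daily sales totals.
daily_sales = {"2024-01-05": 10, "2024-01-20": 5, "2024-02-01": 7}
monthly_sales = roll_up_to_month(daily_sales)
print(monthly_sales)
```

Drilling down is the reverse view: it requires the daily figures to be retained, since they cannot be recovered from the monthly totals alone.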
Data Compression
String compression
• There are extensive theories and well-tuned algorithms
• Typically lossless
• But only limited manipulation is possible without expansion
Audio/video compression
• Typically lossy compression, with progressive refinement
• Sometimes small fragments of signal can be reconstructed without reconstructing the whole
Time sequences are not audio
• Typically short and vary slowly with time
Dimensionality Reduction: Wavelet Transformation
Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
Method:
• Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
• Each transform has 2 functions: smoothing, difference
• Applies to pairs of data, resulting in two sets of data of length L/2
• Applies the two functions recursively, until it reaches the desired length
Figure: Haar-2 and Daubechies-4 wavelets
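The method can be sketched for the Haar wavelet, where smoothing is the pairwise average and the difference is half the pairwise difference. This is an unnormalized variant chosen for readability (an orthonormal Haar transform would scale by the square root of 2 instead), the recursion runs all the way down to a single average, and the function name and the signal are hypothetical.

```python
def haar_transform(signal):
    """Return [overall average] followed by detail coefficients, coarsest first."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of 2 (pad with zeros if needed)")
    current = list(signal)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]     # smoothing
        differences = [(current[i] - current[i + 1]) / 2 for i in pairs]  # difference
        details = differences + details  # coarser details go in front
        current = averages               # recurse on the half-length smoothed data
    return current + details

# Hypothetical signal of length 8.
coefficients = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(coefficients)
```

A compressed approximation would keep only the coefficients with the largest magnitude and treat the rest as zero.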
Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
• E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
• Exceptions, e.g., weekday, month, quarter, year
Country: 15 distinct values
Province: 365 distinct values
City: 3567 distinct values
Street: 674,339 distinct values
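The heuristic reduces to sorting the attributes by their number of distinct values. The sketch below uses the counts from the example above; the function name is hypothetical, and the heuristic inherits the exceptions noted above (for instance, weekday has fewer distinct values than month but is not above it in a hierarchy).

```python
def order_by_distinct_values(distinct_counts):
    """Return attribute names from the top of the hierarchy (fewest values) to the bottom."""
    return sorted(distinct_counts, key=distinct_counts.get)

# Distinct value counts from the example above.
distinct_counts = {"street": 674339, "country": 15, "city": 3567, "province": 365}
hierarchy = order_by_distinct_values(distinct_counts)
print(" > ".join(hierarchy))
```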
Summary
Data exploration and preparation, or preprocessing, is a big issue for both data warehousing and data mining
Descriptive data summarization is needed for quality data preprocessing
Data preparation includes
• Data cleaning and data integration
• Data reduction and feature selection
• Discretization
Many methods have been developed, but data preprocessing is still an active area of research