Chapter 1. Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Major issues in data mining
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
  Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
  Business: Web, e-commerce, transactions, stocks, …
  Science: remote sensing, bioinformatics, scientific simulation, …
  Society and everyone: news, digital cameras, YouTube, …
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention"—Data mining: automated analysis of massive data sets
What Is Data Mining?
Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  Data mining: a misnomer?
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"?
  Simple search and query processing
  (Deductive) expert systems
Knowledge Discovery (KDD) Process
Data mining—the core of the knowledge discovery process
[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process: Several Key Steps
Learning the application domain
  relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation
  Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining
  summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
  visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Data Mining and Business Intelligence
[Figure: layers with increasing potential to support business decisions, from bottom to top]
  Data Sources: paper, files, Web documents, scientific experiments, database systems
  Data Preprocessing/Integration, Data Warehouses (DBA)
  Data Exploration: statistical summary, querying, and reporting (Data Analyst)
  Data Mining: information discovery (Data Analyst)
  Data Presentation: visualization techniques (Business Analyst)
  Decision Making (End User)
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.]
Why Not Traditional Data Analysis?
Tremendous amount of data
  Algorithms must be highly scalable to handle terabytes of data
High dimensionality of data
  Micro-arrays may have tens of thousands of dimensions
High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations
New and sophisticated applications
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be
discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
  Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
  Data streams and sensor data
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structured data, graphs, social networks and multi-linked data
  Object-relational databases
  Heterogeneous databases and legacy databases
  Spatial data and spatiotemporal data
  Multimedia databases
  Text databases
  The World-Wide Web
Data Mining Functionalities
Multidimensional concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
  Diaper → Beer [0.5%, 75%] (correlation or causality?)
Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  Predict some unknown or missing numerical values
Data Mining Functionalities (2)
Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximize intra-class similarity & minimize interclass similarity
Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection, rare-events analysis
Trend and evolution analysis
  Trend and deviation: e.g., regression analysis
  Sequential pattern mining: e.g., digital camera → large SD memory card
  Periodicity analysis
  Similarity-based analysis
Other pattern-directed or statistical analyses
Major Issues in Data Mining
Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
  Domain-specific data mining & invisible data mining
  Protection of data security, integrity, and privacy
Are All the "Discovered" Patterns Interesting?
Data mining may generate thousands of patterns: not all of them are interesting
  Suggested approach: human-centered, query-based, focused mining
Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
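As a concrete illustration of the objective measures just mentioned, here is a minimal Python sketch (the transactions and function names are made up for illustration, not part of the lecture) that computes support and confidence for a rule A → B:

def support(transactions, itemset):
    # Fraction of transactions containing every item in itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    # Estimate of P(consequent | antecedent) from the transactions.
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Toy baskets: 3 of the 4 baskets containing 'diaper' also contain 'beer'.
baskets = [{"diaper", "beer"}, {"diaper", "beer"}, {"diaper", "beer"}, {"diaper"}, {"milk"}]
print(support(baskets, {"diaper", "beer"}))       # 0.6  -> rule support
print(confidence(baskets, {"diaper"}, {"beer"}))  # 0.75 -> rule confidence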
Find All and Only Interesting Patterns?
Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
Search for only interesting patterns: an optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns—mining query optimization
Why Data Mining Query Language?
Automated vs. query-driven?
  Finding all the patterns autonomously in a database?—unrealistic because the patterns could be too many and uninteresting
Data mining should be an interactive process
  The user directs what is to be mined
Users must be provided with a set of primitives to communicate with the data mining system
Incorporating these primitives in a data mining query language
  More flexible user interaction
  Foundation for the design of graphical user interfaces
  Standardization of the data mining industry and practice
Primitives that Define a Data Mining Task
Task-relevant data
  Database or data warehouse name
  Database tables or data warehouse cubes
  Conditions for data selection
  Relevant attributes or dimensions
  Data grouping criteria
Type of knowledge to be mined
  Characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
Primitive 3: Background Knowledge
A typical kind of background knowledge: concept hierarchies
  Schema hierarchy
    E.g., street < city < province_or_state < country
  Set-grouping hierarchy
    E.g., {20-39} = young, {40-59} = middle_aged
  Operation-derived hierarchy
    email address: [email protected]
    login-name < department < university < country
  Rule-based hierarchy
    low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
Primitive 4: Pattern Interestingness Measure
Simplicity
  e.g., (association) rule length, (decision) tree size
Certainty
  e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
Utility
  potential usefulness, e.g., support (association), noise threshold (description)
Novelty
  not previously known, surprising (used to remove redundant rules, e.g., Illinois vs. Champaign rule implication support ratio)
Primitive 5: Presentation of Discovered Patterns
Different backgrounds/usages may require different forms of representation
  E.g., rules, tables, crosstabs, pie/bar charts, etc.
Concept hierarchies are also important
  Discovered knowledge might be more understandable when represented at a high level of abstraction
  Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
Different kinds of knowledge require different representations: association, classification, clustering, etc.
DMQL—A Data Mining Query Language
Motivation
  A DMQL can provide the ability to support ad-hoc and interactive data mining
  By providing a standardized language like SQL
    Hope to achieve an effect similar to that of SQL on relational databases
    Foundation for system development and evolution
    Facilitate information exchange, technology transfer, commercialization and wide acceptance
Design
  DMQL is designed with the primitives described earlier
An Example Query in DMQL
Other Data Mining Languages & Standardization Efforts
Association rule language specifications
  MSQL (Imielinski & Virmani '99)
  MineRule (Meo, Psaila and Ceri '96)
  Query flocks based on Datalog syntax (Tsur et al. '98)
OLEDB for DM (Microsoft 2000) and, more recently, DMX (Microsoft SQL Server 2005)
  Based on OLE, OLE DB, OLE DB for OLAP, C#
  Integrating DBMS, data warehouse and data mining
DMML (Data Mining Mark-up Language) by DMG (www.dmg.org)
  Providing a platform and process structure for effective data mining
  Emphasizing deployment of data mining technology to solve business problems
Integration of Data Mining and Data Warehousing
Data mining systems, DBMS, and data warehouse systems coupling
  No coupling, loose coupling, semi-tight coupling, tight coupling
On-line analytical mining of data
  Integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
  Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
  E.g., characterized classification: first clustering and then association
Coupling Data Mining with DB/DW Systems
No coupling—flat file processing, not recommended
Loose coupling
  Fetching data from DB/DW
Semi-tight coupling—enhanced DM performance
  Provide efficient implementation of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
Tight coupling—a uniform information processing environment
  DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
Architecture: Typical Data Mining System
[Figure: layers of a typical data mining system, from bottom to top]
  Data sources: database, data warehouse, World-Wide Web, other information repositories
  Data cleaning, integration, and selection
  Database or data warehouse server
  Data mining engine + knowledge base
  Pattern evaluation
  Graphical user interface
What Is a Data Warehouse?
Defined in many different ways, but not rigorously.
  A decision support database that is maintained separately from the organization's operational database
  Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."—W. H. Inmon
Data warehousing:
  The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Organized around major subjects, such as customer, product, sales
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources
  relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    E.g., hotel price: currency, tax, whether breakfast is covered, etc.
  When data is moved to the warehouse, it is converted.
Data Warehouse—Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current value data
  Data warehouse data: provide information from a historical perspective (e.g., past 5–10 years)
Every key structure in the data warehouse
  Contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a "time element"
Data Warehouse—Nonvolatile
A physically separate store of data transformed
from the operational environment
Operational update of data does not occur in the
data warehouse environment Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration: a query-driven approach
  Build wrappers/mediators on top of heterogeneous databases
  When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
  Complex information filtering, competition for resources
Data warehouse: update-driven, high performance
  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  Major task of data warehouse systems
  Data analysis and decision making
Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
Attribute: OLTP | OLAP
  users: clerk, IT professional | knowledge worker
  function: day-to-day operations | decision support
  DB design: application-oriented | subject-oriented
  data: current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
  usage: repetitive | ad-hoc
  access: read/write, index/hash on primary key | lots of scans
  unit of work: short, simple transaction | complex query
  # records accessed: tens | millions
  # users: thousands | hundreds
  DB size: 100 MB–GB | 100 GB–TB
  metric: transaction throughput | query throughput, response time
Why a Separate Data Warehouse?
High performance for both systems
  DBMS—tuned for OLTP: access methods, indexing, concurrency control, recovery
  Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
  missing data: decision support requires historical data which operational DBs do not typically maintain
  data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Note: there are more and more systems which perform OLAP analysis directly on relational databases
Chapter 3: Data Generalization, Data Warehousing, and On-line Analytical Processing
Data generalization and concept description
Data warehouse: basic concepts
Data warehouse modeling: data cube and OLAP
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
  Star schema: a fact table in the middle connected to a set of dimension tables
  Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
[Star schema for sales:]
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_type
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city, state_or_province, country
Example of a Snowflake Schema
[Snowflake schema for sales:]
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_key → supplier: supplier_key, supplier_type
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city_key → city: city_key, city, state_or_province, country
Example of Fact Constellation
[Fact constellation with two fact tables sharing dimension tables:]
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_type
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city, province_or_state, country
  shipper dimension: shipper_key, shipper_name, location_key, shipper_type
Cube Definition Syntax (BNF) in DMQL
Cube definition (fact table):
  define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension definition (dimension table):
  define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special case (shared dimension tables), first defined as a cube definition:
  define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
Defining Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Measures of Data Cube: Three Categories
Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
  E.g., count(), sum(), min(), max()
Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  E.g., avg(), min_N(), standard_deviation()
Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
  E.g., median(), mode(), rank()
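A minimal Python sketch of this distinction (the data partitioning is made up for illustration): sum() combines directly from per-partition sums, avg() derives from the distributive pair (sum, count), while median() cannot be assembled from any bounded per-partition summary and needs all the values.

import statistics

partitions = [[3, 5, 7], [2, 8], [4, 4, 10]]   # the same data split into chunks

# Distributive: per-partition sums combine into the global sum.
total = sum(sum(p) for p in partitions)

# Algebraic: avg() is computed from two distributive measures, sum and count.
count = sum(len(p) for p in partitions)
average = total / count

# Holistic: median() needs the full value list, not bounded sub-aggregates.
median = statistics.median([v for p in partitions for v in p])

print(total, average, median)   # 43, 5.375, 4.5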
Concept Hierarchy: Dimension (location)
[Figure: concept hierarchy for location — all < region (Europe, North_America) < country (Germany, Spain, Canada, Mexico, …) < city (Frankfurt, Vancouver, Toronto, …) < office (L. Chan, M. Wind, …).]
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Industry → Category → Product
  Region → Country → City → Office
  Year → Quarter → Month/Week → Day
A Sample Data Cube
[Figure: a data cube with dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC) and Country (U.S.A., Canada, Mexico), plus sum cells along each dimension; the cell for "TV", all quarters, "U.S.A." gives the total annual sales of TVs in the U.S.A.]
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product, date, country
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)
Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up; from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3-D to a series of 2-D planes
Other operations
  drill across: involving (across) more than one fact table
  drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
[Fig. 3.10: Typical OLAP Operations]
Design of Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse
  Top-down view
    allows selection of the relevant information necessary for the data warehouse
  Data source view
    exposes the information being captured, stored, and managed by operational systems
  Data warehouse view
    consists of fact tables and dimension tables
  Business query view
    sees the perspectives of data in the warehouse from the view of the end user
Data Warehouse Design Process
Top-down, bottom-up approaches, or a combination of both
  Top-down: starts with overall design and planning (mature)
  Bottom-up: starts with experiments and prototypes (rapid)
From the software engineering point of view
  Waterfall: structured and systematic analysis at each step before proceeding to the next
  Spiral: rapid generation of increasingly functional systems, short turnaround time
Typical data warehouse design process
  Choose a business process to model, e.g., orders, invoices, etc.
  Choose the grain (atomic level of data) of the business process
  Choose the dimensions that will apply to each fact table record
  Choose the measures that will populate each fact table record
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered architecture — data sources (operational DBs, other sources) feed Extract/Transform/Load/Refresh into the data storage tier (data warehouse and data marts, with monitor & integrator and metadata), which is served by an OLAP server to front-end tools for analysis, query/reports, and data mining.]
Three Data Warehouse Models
Enterprise warehouse
  collects all of the information about subjects spanning the entire organization
Data mart
  a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
  Independent vs. dependent (directly from the warehouse) data mart
Virtual warehouse
  A set of views over operational databases
  Only some of the possible summary views may be materialized
Development: A Recommended Approach
Define a high-level corporate data model
[Figure: the corporate data model is refined step by step into data marts, distributed data marts, a multi-tier data warehouse, and finally an enterprise data warehouse, with model refinement at each stage.]
Data Warehouse Back-End Tools and Utilities
Data extraction
  get data from multiple, heterogeneous, and external sources
Data cleaning
  detect errors in the data and rectify them when possible
Data transformation
  convert data from legacy or host format to warehouse format
Load
  sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
  propagate the updates from the data sources to the warehouse
Metadata Repository
Metadata is the data defining warehouse objects. It stores:
  Description of the structure of the data warehouse
    schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
  Operational metadata
    data lineage (history of migrated data and transformation paths), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  The algorithms used for summarization
  The mapping from the operational environment to the data warehouse
  Data related to system performance
    warehouse schema, view and derived data definitions
  Business data
    business terms and definitions, ownership of data, charging policies
OLAP Server Architectures
Relational OLAP (ROLAP)
  Uses relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  Greater scalability
Multidimensional OLAP (MOLAP)
  Sparse array-based multidimensional storage engine
  Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  Flexibility, e.g., low level: relational, high level: array
Specialized SQL servers (e.g., Redbricks)
  Specialized support for SQL queries over star/snowflake schemas
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
  The bottom-most cuboid is the base cuboid
  The top-most cuboid (apex) contains only one cell
  How many cuboids are there in an n-dimensional cube with L levels?
    T = ∏_{i=1..n} (L_i + 1), where L_i is the number of levels associated with dimension i
Materialization of the data cube
  Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
  Selection of which cuboids to materialize
    Based on size, sharing, access frequency, etc.
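As a quick check of the formula above, a tiny Python sketch (the per-dimension level counts are chosen arbitrarily for illustration):

from math import prod

# Number of levels per dimension, excluding the virtual top level "all".
levels = {"time": 4, "item": 3, "location": 4, "supplier": 2}

# T = product over dimensions of (L_i + 1); the +1 accounts for "all".
num_cuboids = prod(l + 1 for l in levels.values())
print(num_cuboids)  # (4+1)*(3+1)*(4+1)*(2+1) = 300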
Data Warehouse Implementation
Efficient Cube Computation
Efficient Indexing
Efficient Processing of OLAP Queries
Cube Operation
Cube definition and computation in DMQL:
  define cube sales [item, city, year]: sum(sales_in_dollars)
  compute cube sales
Transform it into an SQL-like language (with a new operator cube by, introduced by Gray et al. '96):
  SELECT item, city, year, SUM(amount)
  FROM SALES
  CUBE BY item, city, year
Need to compute the following group-bys:
  (item, city, year), (item, city), (item, year), (city, year), (item), (city), (year), ()
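A small Python sketch of what CUBE BY expands to (illustrative only; the dimension names simply mirror the query above): it enumerates every subset of the grouping attributes, one GROUP BY per subset.

from itertools import combinations

dims = ("item", "city", "year")

# CUBE BY item, city, year == the union of GROUP BYs over every subset of dims.
groupbys = [combo for r in range(len(dims), -1, -1) for combo in combinations(dims, r)]
for g in groupbys:
    print(g)   # ('item','city','year'), ('item','city'), ..., ('year',), ()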
Multi-Way Array Aggregation
Array-based "bottom-up" algorithm
Uses multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: no iceberg optimization
[Figure: lattice of cuboids all — A, B, C — AB, AC, BC — ABC computed bottom-up.]
Multi-Way Array Aggregation for Cube Computation (MOLAP)
Partition arrays into chunks (a small subcube which fits in memory)
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in "multiway" by visiting cube cells in the order which minimizes the number of times each cell is visited, reducing memory access and storage cost
What is the best traversing order to do multi-way aggregation?
[Figure: a 3-D array with dimensions A (a0–a3), B (b0–b3), C (c0–c3) partitioned into 64 chunks, numbered 1–64 in scan order.]
Multi-Way Array Aggregation for Cube Computation (Cont.)
Method: the planes should be sorted and computed according to their size in ascending order
  Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
Limitation of the method: it computes well only for a small number of dimensions
  If there are a large number of dimensions, "top-down" computation and iceberg cube computation methods can be explored
Indexing OLAP Data: Bitmap Index
Index on a particular column
  Each value in the column has a bit vector: bit operations are fast
  The length of the bit vector: # of records in the base table
  The i-th bit is set if the i-th row of the base table has that value for the indexed column
  Not suitable for high-cardinality domains
Base table:
  Cust  Region   Type
  C1    Asia     Retail
  C2    Europe   Dealer
  C3    Asia     Dealer
  C4    America  Retail
  C5    Europe   Dealer

Index on Region:
  RecID  Asia  Europe  America
  1      1     0       0
  2      0     1       0
  3      1     0       0
  4      0     0       1
  5      0     1       0

Index on Type:
  RecID  Retail  Dealer
  1      1       0
  2      0       1
  3      0       1
  4      1       0
  5      0       1
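A minimal Python sketch of building such bitmap indices (the helper name is my own and the rows simply mirror the tables above):

def bitmap_index(rows, column):
    # Map each distinct value of `column` to a bit vector over the rows.
    index = {}
    for i, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[i] = 1
    return index

customers = [
    {"cust": "C1", "region": "Asia",    "type": "Retail"},
    {"cust": "C2", "region": "Europe",  "type": "Dealer"},
    {"cust": "C3", "region": "Asia",    "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe",  "type": "Dealer"},
]

region_idx = bitmap_index(customers, "region")
type_idx = bitmap_index(customers, "type")

# Answering "Asian dealers" is a fast bitwise AND of two bit vectors.
asian_dealers = [a & b for a, b in zip(region_idx["Asia"], type_idx["Dealer"])]
print(asian_dealers)  # [0, 0, 1, 0, 0] -> only C3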
Indexing OLAP Data: Join Indices
Join index: JI(R-id, S-id), where R(R-id, …) joins S(S-id, …)
Traditional indices map values to a list of record ids
  A join index materializes the relational join in the JI file and speeds up the relational join
In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
  E.g., fact table: Sales, and two dimensions: city and product
  A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city
Join indices can span multiple dimensions
Efficient Processing of OLAP Queries
Determine which operations should be performed on the available cuboids
  Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for the OLAP operation
  Let the query to be processed be on {brand, province_or_state} with the condition "year = 2004", and suppose there are 4 materialized cuboids available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
  Which should be selected to process the query?
Explore indexing structures and compressed vs. dense array structures in MOLAP
Data Warehouse Usage
Three kinds of data warehouse applications
  Information processing
    supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  Analytical processing
    multidimensional analysis of data warehouse data
    supports basic OLAP operations: slice and dice, drilling, pivoting
  Data mining
    knowledge discovery from hidden patterns
    supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
Why online analytical mining?
  High quality of data in data warehouses
    DW contains integrated, consistent, cleaned data
  Available information processing structure surrounding data warehouses
    ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  OLAP-based exploratory data analysis
    Mining with drilling, dicing, pivoting, etc.
  On-line selection of data mining functions
    Integration and swapping of multiple mining functions, algorithms, and tasks
An OLAM System Architecture
[Figure: four-layer OLAM architecture — Layer 4: user interface (user GUI API; mining queries in, mining results out); Layer 3: OLAP/OLAM (OLAM engine and OLAP engine over a data cube API); Layer 2: MDDB with metadata; Layer 1: data repository (data warehouse and databases, populated via data cleaning, data integration, and filtering).]
UNIT II- Data Preprocessing
Data cleaning
Data integration and transformation
Data reduction
Summary
Major Tasks in Data Preprocessing
Data cleaning
  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
  Integration of multiple databases, data cubes, or files
Data transformation
  Normalization and aggregation
Data reduction
  Obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization: part of data reduction, of particular importance for numerical data
Data Cleaning
No quality data, no quality mining results!
  Quality decisions must be based on quality data
    e.g., duplicate or missing data may cause incorrect or even misleading statistics
  "Data cleaning is the number one problem in data warehousing"—DCI survey
  Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Data cleaning tasks
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
  Resolve redundancy caused by data integration
Data in the Real World Is Dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation = "" (missing data)
noisy: containing noise, errors, or outliers
  e.g., Salary = "−10" (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
  Age = "42", Birthday = "03/07/1997"
  Was rating "1, 2, 3", now rating "A, B, C"
  discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
  "Not applicable" data values when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from
  Different data sources
  Functional dependency violation (e.g., modifying some linked data)
Duplicate records also need data cleaning
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
  Accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
Broad categories:
  Intrinsic, contextual, representational, and accessibility
Missing Data
Data is not always available
  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
  equipment malfunction
  inconsistency with other recorded data, and thus deletion
  data not entered due to misunderstanding
  certain data not considered important at the time of entry
  not registering the history or changes of the data
Missing data may need to be inferred
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
  a global constant: e.g., "unknown", a new class?!
  the attribute mean
  the attribute mean for all samples belonging to the same class: smarter
  the most probable value: inference-based, such as a Bayesian formula or decision tree
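A minimal pandas sketch of two of these strategies (the column names and values are invented for illustration): filling with the overall attribute mean vs. the per-class attribute mean.

import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# Strategy 1: fill with the overall attribute mean.
overall = df["income"].fillna(df["income"].mean())

# Strategy 2 (smarter): fill with the mean of samples in the same class.
by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())   # [50.0, 40.0, 30.0, 40.0, 40.0]
print(by_class.tolist())  # [50.0, 50.0, 30.0, 35.0, 40.0]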
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
  faulty data collection instruments
  data entry problems
  data transmission problems
  technology limitations
  inconsistency in naming conventions
Other data problems which require data cleaning
  duplicate records
  incomplete data
  inconsistent data
How to Handle Noisy Data?
Binning
  first sort the data and partition it into (equal-frequency) bins
  then smooth by bin means, bin medians, bin boundaries, etc.
Regression
  smooth by fitting the data into regression functions
Clustering
  detect and remove outliers
Combined computer and human inspection
  detect suspicious values and check by a human (e.g., deal with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
  Divides the range into N intervals of equal size: uniform grid
  If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
  The most straightforward, but outliers may dominate the presentation
  Skewed data is not handled well
Equal-depth (frequency) partitioning
  Divides the range into N intervals, each containing approximately the same number of samples
  Good data scaling
  Managing categorical attributes can be tricky
Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
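A minimal Python sketch reproducing the example above (equal-frequency binning with smoothing by bin means and by bin boundaries; the function names are my own):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_frequency_bins(values, n_bins):
    # Split sorted values into n_bins bins of (roughly) equal size.
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by whichever bin boundary (min or max) is closer.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]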
Regression
[Figure: data points in the (x, y) plane smoothed by fitting the regression line y = x + 1; an observed point (X1, Y1) is replaced by the fitted value (X1, Y1') on the line.]
Cluster Analysis
Data Cleaning as a Process
Data discrepancy detection
  Use metadata (e.g., domain, range, dependency, distribution)
  Check field overloading
  Check uniqueness rule, consecutive rule and null rule
  Use commercial tools
    Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
  Data migration tools: allow transformations to be specified
  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes
  Iterative and interactive (e.g., Potter's Wheels)
Data Integration
Data integration:
  Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
  Integrate metadata from different sources
Entity identification problem:
  Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
  For the same real-world entity, attribute values from different sources are different
  Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases
  Object identification: the same attribute or object may have different names in different databases
  Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

  r_{p,q} = Σ(p − p̄)(q − q̄) / ((n − 1) σ_p σ_q) = (Σ(pq) − n p̄ q̄) / ((n − 1) σ_p σ_q)

where n is the number of tuples, p̄ and q̄ are the respective means of p and q, σ_p and σ_q are the respective standard deviations of p and q, and Σ(pq) is the sum of the pq cross-products.

If r_{p,q} > 0, p and q are positively correlated (p's values increase as q's do). The higher the value, the stronger the correlation.
r_{p,q} = 0: independent; r_{p,q} < 0: negatively correlated
Correlation (Viewed as a Linear Relationship)
Correlation measures the linear relationship between objects
To compute correlation, we standardize the data objects p and q, and then take their dot product:

  p'_k = (p_k − mean(p)) / std(p)
  q'_k = (q_k − mean(q)) / std(q)
  correlation(p, q) = p' · q'
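A minimal Python sketch of the sample correlation coefficient defined above (the vectors are made up; std uses the n − 1 denominator so the result matches r):

import math

def pearson(p, q):
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    # Sample standard deviations (n - 1 in the denominator).
    sp = math.sqrt(sum((x - mp) ** 2 for x in p) / (n - 1))
    sq = math.sqrt(sum((y - mq) ** 2 for y in q) / (n - 1))
    # r = sum of cross-products of deviations / ((n - 1) * sp * sq)
    return sum((x - mp) * (y - mq) for x, y in zip(p, q)) / ((n - 1) * sp * sq)

p = [2.0, 4.0, 6.0, 8.0]
q = [1.0, 3.0, 5.0, 9.0]
print(pearson(p, q))   # close to +1: p and q are strongly positively correlated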
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
Methods
  Smoothing: remove noise from the data
  Aggregation: summarization, data cube construction
  Generalization: concept hierarchy climbing
  Normalization: scaled to fall within a small, specified range
    min-max normalization
    z-score normalization
    normalization by decimal scaling
  Attribute/feature construction
    New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]

  v' = ((v − min_A) / (max_A − min_A)) (new_max_A − new_min_A) + new_min_A

  Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000))(1.0 − 0) + 0 = 0.716

Z-score normalization (μ: mean, σ: standard deviation):

  v' = (v − μ_A) / σ_A

  Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
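A minimal Python sketch of the three normalizations, reproducing the worked example above (the function names are my own):

import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    # j is the smallest integer such that max(|v / 10**j|) < 1.
    j = math.ceil(math.log10(max(abs(v) for v in values) + 1))
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]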
Data Reduction Strategies
Why data reduction?
  A database/data warehouse may store terabytes of data
  Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
Data reduction strategies
  Dimensionality reduction — e.g., remove unimportant attributes
  Numerosity reduction (some simply call it: data reduction)
    Data cube aggregation
    Data compression
    Regression
    Discretization (and concept hierarchy generation)
Dimensionality Reduction
Curse of dimensionality
  When dimensionality increases, data becomes increasingly sparse
  Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  The possible combinations of subspaces grow exponentially
Dimensionality reduction
  Avoids the curse of dimensionality
  Helps eliminate irrelevant features and reduce noise
  Reduces the time and space required in data mining
  Allows easier visualization
Dimensionality reduction techniques
  Principal component analysis
  Singular value decomposition
  Supervised and nonlinear techniques (e.g., feature selection)
Dimensionality Reduction: Principal Component Analysis (PCA)
[Figure: data points in the (x1, x2) plane and the principal axis e along the direction of largest variance.]
Find a projection that captures the largest amountof variation in data
Find the eigenvectors of the covariance matrix, andthese eigenvectors define the new space
Principal Component Analysis (Steps)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  Normalize the input data: each attribute falls within the same range
  Compute k orthonormal (unit) vectors, i.e., the principal components
  Each input data vector is a linear combination of the k principal component vectors
  The principal components are sorted in order of decreasing "significance" or strength
  Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
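A minimal NumPy sketch of these steps, under the usual convention of centering the data and taking the eigenvectors of its covariance matrix (the toy data is invented; this is an illustration, not the lecture's algorithm verbatim):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, n = 3 dimensions
X[:, 2] = 2 * X[:, 0] + 0.1 * X[:, 2]    # make one dimension nearly redundant

# 1. Center (normalize) the data.
Xc = X - X.mean(axis=0)

# 2. Eigen-decompose the covariance matrix; the eigenvectors are the principal components.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# 3. Sort components by decreasing eigenvalue ("significance") and keep the top k.
order = np.argsort(eigvals)[::-1]
k = 2
top = eigvecs[:, order[:k]]

# 4. Project the data onto the k strongest components (reduced representation).
X_reduced = Xc @ top
print(X_reduced.shape)                  # (100, 2)
print(eigvals[order] / eigvals.sum())   # fraction of variance captured by each component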
Feature Subset Selection
Another way to reduce the dimensionality of data
Redundant features
  duplicate much or all of the information contained in one or more other attributes
  E.g., the purchase price of a product and the amount of sales tax paid
Irrelevant features
  contain no information that is useful for the data mining task at hand
  E.g., a student's ID is often irrelevant to the task of predicting the student's GPA
Heuristic Search in Feature Selection
There are 2^d possible feature combinations of d features
Typical heuristic feature selection methods:
  Best single features under the feature independence assumption: choose by significance tests
  Best step-wise feature selection (see the sketch below):
    The best single feature is picked first
    Then the next best feature conditioned on the first, ...
  Step-wise feature elimination:
    Repeatedly eliminate the worst feature
  Best combined feature selection and elimination
  Optimal branch and bound:
    Use feature elimination and backtracking
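A minimal Python sketch of best step-wise (forward) feature selection; the scoring function below is a stand-in you would replace with, e.g., cross-validated accuracy, and the feature names and usefulness values are invented for illustration:

def forward_selection(features, score, k):
    # Greedily add, one at a time, the feature that most improves the score.
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Stand-in scoring function: pretend each feature has a fixed usefulness and a
# redundant pair adds nothing new. Replace with a real model evaluation.
usefulness = {"income": 0.5, "age": 0.3, "zip": 0.1, "salary": 0.5}
redundant = {frozenset({"income", "salary"})}

def score(subset):
    total = sum(usefulness[f] for f in subset)
    total -= sum(min(usefulness[a], usefulness[b])
                 for a in subset for b in subset
                 if a < b and frozenset({a, b}) in redundant)
    return total

print(forward_selection(usefulness, score, k=2))  # ['income', 'age']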
Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
Three general methodologies
  Feature extraction
    domain-specific
  Mapping data to a new space (see: data reduction)
    E.g., Fourier transformation, wavelet transformation
  Feature construction
    Combining features
    Data discretization
Mapping Data to a New Space
[Figure: two sine waves, the same two sine waves plus noise, and their frequency-domain representation.]
Fourier transform
Wavelet transform
Numerosity (Data) Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods (e.g., regression)
  Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  Example: log-linear models—obtain the value at a point in m-D space as the product on appropriate marginal subspaces
Non-parametric methods
  Do not assume models
  Major families: histograms, clustering, sampling
Regression and Log-Linear Models
Linear regression: data are modeled to fit a straight line
  Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
  Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
  Use the least-squares criterion on the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
  Many nonlinear functions can be transformed into the above
Log-linear models:
  The multi-way table of joint probabilities is approximated by a product of lower-order tables
  Probability: p(a, b, c, d) = αab βac χad δbcd
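A minimal Python sketch of fitting Y = wX + b with the least-squares criterion (the closed-form estimates below follow from setting the derivatives of the squared error to zero; the data points are made up):

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.9, 5.1, 7.0, 9.2, 10.8]   # roughly y = 2x + 1 plus noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates: w = cov(x, y) / var(x), b = mean_y - w * mean_x.
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

print(round(w, 3), round(b, 3))   # close to 2 and 1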
Data Reduction: Wavelet Transformation
Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis
Compressed approximation: store only a small fraction of the strongest wavelet coefficients
Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
Method:
  The length, L, must be an integer power of 2 (padding with 0s when necessary)
  Each transform has 2 functions: smoothing, difference
  Applies to pairs of data, resulting in two sets of data of length L/2
  Applies the two functions recursively, until the desired length is reached
Wavelet families: Haar-2, Daubechies-4
DWT for Image Compression
[Figure: an image is recursively split by the DWT into low-pass and high-pass sub-bands, with the low-pass output decomposed again at each level.]
Data Cube Aggregation
The lowest level of a data cube (base cuboid)
  The aggregated data for an individual entity of interest
  E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
  Further reduce the size of the data to deal with
Reference appropriate levels
  Use the smallest representation which is enough to solve the task
Queries regarding aggregated information should be answered using the data cube, when possible
Data Compression
String compression
  There are extensive theories and well-tuned algorithms
  Typically lossless
  But only limited manipulation is possible without expansion
Audio/video compression
  Typically lossy compression, with progressive refinement
  Sometimes small fragments of signal can be reconstructed without reconstructing the whole
Time sequence data (not audio)
  Typically short and varies slowly with time
Data Compression
[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data]
Data Reduction: Histograms
Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
  Equal-width: equal bucket range
  Equal-frequency (or equal-depth): equal number of values per bucket
  V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
  MaxDiff: set bucket boundaries between the pairs of adjacent values with the β–1 largest differences
[Figure: example equal-width histogram over values from 10,000 to 90,000]
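The partitioning rules above can be sketched in a few lines of Python; this is an illustrative implementation of the equal-width and equal-frequency rules only (function names are mine):

import random

def equal_width_buckets(values, k):
    # Equal-width histogram: k buckets of equal range; store (lo, hi, count)
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        i = min(int((v - lo) / width), k - 1)   # clamp the max value into the last bucket
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

def equal_depth_buckets(values, k):
    # Equal-frequency (equal-depth) histogram: each bucket holds about n/k values
    s = sorted(values)
    n = len(s)
    out = []
    for i in range(k):
        chunk = s[i * n // k:(i + 1) * n // k]
        out.append((chunk[0], chunk[-1], len(chunk)))
    return out

data = [random.gauss(50_000, 15_000) for _ in range(1_000)]
print(equal_width_buckets(data, 5))
print(equal_depth_buckets(data, 5))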
Data Reduction Method: Clustering
Partition data set into clusters based on similarity,and store cluster representation (e.g., centroid and
diameter) only
Can be very effective if data is clustered but not if
data is “smeared”
Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
There are many choices of clustering definitions andclustering algorithms
Cluster analysis will be studied in depth in Chapter 7
Data Reduction Method: Sampling
Sampling: obtaining a small sample s to represent thewhole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Key principle: Choose a representative subset of thedata Simple random sampling may have very poor performance in
the presence of skew
Develop adaptive sampling methods, e.g., stratified sampling:
Note: Sampling may not reduce database I/Os (page
at a time)
Types of Sampling
Simple random sampling
  There is an equal probability of selecting any particular item
Sampling without replacement
  Once an object is selected, it is removed from the population
Sampling with replacement
  A selected object is not removed from the population
Stratified sampling
  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
Used in conjunction with skewed data
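A small Python sketch of the three sampling schemes (the helper names and the toy data are mine, not from the slides):

import random
from collections import defaultdict

def srs_wor(population, n):
    # Simple random sample without replacement (SRSWOR)
    return random.sample(population, n)

def srs_wr(population, n):
    # Simple random sample with replacement (SRSWR)
    return [random.choice(population) for _ in range(n)]

def stratified_sample(records, stratum_of, fraction):
    # Draw approximately the same fraction from each stratum (e.g., each class)
    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(fraction * len(group)))
        sample.extend(random.sample(group, k))
    return sample

# Hypothetical skewed data set: 95 "normal" records and 5 "rare" ones
records = [("normal", i) for i in range(95)] + [("rare", i) for i in range(5)]
print(len(stratified_sample(records, stratum_of=lambda r: r[0], fraction=0.2)))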
Sampling: With or Without Replacement
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
Sampling: Cluster or Stratified Sampling
[Figure: raw data vs. its cluster/stratified sample]
Data Reduction: Discretization
Three types of attributes:
  Nominal — values from an unordered set, e.g., color, profession
  Ordinal — values from an ordered set, e.g., military or academic rank
  Continuous — numeric values, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low levelconcepts (such as numeric values for age) by higher level
concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods (all the methods can be applied recursively):
  Binning (covered above): top-down split, unsupervised
  Histogram analysis (covered above): top-down split, unsupervised
  Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
  Entropy-based discretization: supervised, top-down split
  Interval merging by χ2 analysis: unsupervised, bottom-up merge
  Segmentation by natural partitioning: top-down split, unsupervised
Discretization Using Class Labels
Entropy-based approach
[Figure: the same data discretized into 3 categories for both x and y vs. 5 categories for both x and y]
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and
S2 using boundary T, the information gain after partitioning is
Entropy is calculated based on class distribution of the samples in
the set. Given m classes, the entropy of S1 is
where pi is the probability of class i in S1
The boundary that minimizes the entropy function over all possible
boundaries is selected as a binary discretization
The process is recursively applied to partitions obtained until somestopping criterion is met
Such a boundary may reduce data size and improve classification
accuracy
I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)

Entropy(S1) = − Σ_{i=1..m} p_i · log2(p_i)
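A minimal Python sketch of entropy-based binary discretization as described above, assuming a numeric attribute and its class labels (names are illustrative, not from the slides):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i) over the class distribution of S
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_split(values, labels):
    # Return the boundary T minimizing |S1|/|S|*Entropy(S1) + |S2|/|S|*Entropy(S2)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_i = None, float("inf")
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue                       # no boundary between equal values
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        i_st = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_entropy_split(values, labels))   # boundary 6.5, weighted entropy 0.0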
Discretization Without Using Class Labels
[Figure: the same data discretized by equal interval width, equal frequency, and K-means clustering]
Interval Merge by χ2 Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: Find the best neighboring intervals and merge them to
form larger intervals recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
Initially, each distinct value of a numerical attribute A is considered to be one interval
χ2 tests are performed for every pair of adjacent intervals
Adjacent intervals with the least χ2 values are merged together,
since low χ2 values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined
stopping criterion is met (such as significance level, max-interval,
max inconsistency, etc.)
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
A sketch of a single partitioning step follows; the worked example below then applies the rule recursively.
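This is a simplified single-step sketch of the rule in Python (my own helper; the boundary rounding is deliberately naive and the recursive refinement shown in the example is omitted):

def three_four_five(low, high):
    # One level of the 3-4-5 rule: split [low, high) into 3, 4, or 5 equi-width
    # intervals based on the number of distinct values at the most significant digit
    msd = 10 ** (len(str(int(abs(high - low)))) - 1)      # msd unit of the range
    lo = (low // msd) * msd                               # round boundaries to the msd unit
    hi = -((-high) // msd) * msd
    distinct = int((hi - lo) / msd)
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                                 # 1, 5, or 10 distinct values
        parts = 5
    width = (hi - lo) / parts
    return [(lo + i * width, lo + (i + 1) * width) for i in range(parts)]

print(three_four_five(-1000, 2000))   # 3 intervals: (-1000, 0), (0, 1000), (1000, 2000)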
Example of 3-4-5 Rule
Step 1: profit values range from Min = -$351 to Max = $4,700; the 5th and 95th percentiles are Low = -$159 and High = $1,838
Step 2: msd = 1,000, so round to Low' = -$1,000 and High' = $2,000
Step 3: the range (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: adjust the boundaries to cover Min and Max, giving the top-level intervals (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000)
Each top-level interval is then partitioned recursively by the same rule:
  (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
  {Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
  E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  The attribute with the most distinct values is placed at the lowest level of the hierarchy
  Exceptions, e.g., weekday, month, quarter, year
Example (distinct-value counts): country (15) < province_or_state (365) < city (3,567) < street (674,339)
Chapter 5: Mining Frequent Patterns, Associations and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset
mining methods
Mining various kinds of association rules
From association mining to correlation
analysis
Constraint-based association mining
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Frequent Pattern Mining Important?
Discloses an intrinsic and important property of data
sets
Forms the foundation for many essential data mining
tasks Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
Classification: associative classification
Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
Basic Concepts: Frequent Patterns and Association Rules
Itemset X = {x1, …, xk}
Find all the rules X ⇒ Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

Example transaction database:
  Transaction-id | Items bought
  10 | A, B, D
  20 | A, C, D
  30 | A, D, E
  40 | B, E, F
  50 | B, C, D, E, F

Let sup_min = 50%, conf_min = 50%
  Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
  Association rules:
    A ⇒ D (60%, 100%)
    D ⇒ A (60%, 75%)

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
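The support and confidence definitions can be checked directly on the toy transactions above; a small Python sketch (function names are mine):

transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs):
    # P(rhs in t | lhs in t) = support(lhs union rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))          # 0.6  -> rule A => D has 60% support
print(confidence({"A"}, {"D"}))     # 1.0  -> A => D has 100% confidence
print(confidence({"D"}, {"A"}))     # 0.75 -> D => A has 75% confidence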
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed pattern is a lossless compression of frequent patterns
  Reduces the number of patterns and rules
Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
  What is the set of closed itemsets?
    <a1, …, a100>: 1
    <a1, …, a50>: 2
  What is the set of max-patterns?
    <a1, …, a100>: 1
  What is the set of all patterns? !!
Scalable Methods for MiningFrequent Patterns
The downward closure property of frequent patterns Any subset of a frequent itemset must be frequent If {beer, diaper, nuts} is frequent, so is {beer, diaper} i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper} Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant@VLDB’94) Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00) Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: If there is any itemsetwhich is infrequent, its superset should not be
generated/tested! (Agrawal & Srikant @VLDB’94,
Mannila, et al. @ KDD’ 94)
Method: Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length k
frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
The Apriori Algorithm—An Example
sup_min = 2

Database TDB:
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
            L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan -> C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
            L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3 (from L2): {B,C,E}
3rd scan -> L3: {B,C,E}:2
The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
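A compact, runnable Python version of the same loop (helper names are mine; candidate generation follows the self-join-and-prune idea detailed on the following slides):

from itertools import combinations

def apriori(transactions, min_sup):
    # Return {itemset: support_count} for all frequent itemsets
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # L1: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(lk)
    k = 1
    while lk:
        # self-join Lk with itself, then prune candidates with an infrequent k-subset
        candidates = set()
        keys = list(lk)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in lk for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # count candidate supports against the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(lk)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)      # reproduces L1, L2, L3 of the example above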
Important Details of Apriori
How to generate candidates? Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd }
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd }
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
How to Count Supports of Candidates?
Why counting supports of candidates a problem? The total number of candidates can be very huge
One transaction may contain many candidates
Method: Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and
counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction
Example: Counting Supports of Candidates
[Figure: a candidate hash tree — at each level items are hashed by the rule 1,4,7 / 2,5,8 / 3,6,9; interior nodes hold hash tables, leaves hold itemsets and counts. Matching transaction 1 2 3 5 6 against the tree enumerates its candidate subsets via 1 + (2 3 5 6), 1 2 + (3 5 6), 1 3 + (5 6), …]
Efficient Implementation of Apriori in SQL
Hard to get good performance out of pure SQL(SQL-92) based approaches alone
Make use of object-relational extensions like
UDFs, BLOBs, Table functions etc.
Get orders of magnitude improvement
S. Sarawagi, S. Thomas, and R. Agrawal.
Integrating association rule mining with relational
database systems: Alternatives and implications.
In SIGMOD’98
Challenges of Frequent Pattern Mining
Challenges Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB mustbe frequent in at least one of the partitions of DB
Scan 1: partition database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An
efficient algorithm for mining association in large
databases. In VLDB’95
DHP: Reduce the Number of Candidates
A k -itemset whose corresponding hashing bucketcount is below the threshold cannot be frequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae} {bd, be, de} …
Frequent 1-itemset: a, b, d, e
ab is not a candidate 2-itemset if the sum of count of {ab,
ad, ae} is below support threshold
J. Park, M. Chen, and P. Yu. An effective hash-basedalgorithm for mining association rules. In
SIGMOD’95
Sampling for Frequent Patterns
Select a sample of original database, mine frequentpatterns within sample using Apriori
Scan database once to verify frequent itemsets
found in sample, only borders of closure of
frequent patterns are checked
Example: check abcd instead of ab, ac, …, etc.
Scan database again to find missed frequent
patterns
H. Toivonen. Sampling large databases for
association rules. In VLDB’96
DIC: Reduce Number of Scans
[Figure: the itemset lattice over {A, B, C, D}, from {} at the bottom through the 1-, 2-, and 3-itemsets up to ABCD]
Once both A and D are determinedfrequent, the counting of AD begins
Once all length-2 subsets of BCD aredetermined frequent, the counting of BCD begins
[Figure: transaction scans over time — Apriori finishes counting 1-itemsets before starting 2-itemsets, whereas DIC begins counting 2-itemsets and 3-itemsets partway through earlier scans]
S. Brin R. Motwani, J. Ullman,and S. Tsur. Dynamic itemsetcounting and implicationrules for market basket data.In SIGMOD’97
Bottleneck of Frequent-Pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
  To find the frequent itemset i1 i2 … i100
    # of scans: 100
    # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30!
Bottleneck: candidate generation and test
Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using
local frequent items
“abc” is a frequent pattern
Get all transactions having “abc”: DB|abc
“d” is a local frequent item in DB|abc abcd is a
frequent pattern
Construct FP-tree from a Transaction Database
min_support = 3

  TID | Items bought            | (Ordered) frequent items
  100 | f, a, c, d, g, i, m, p  | f, c, a, m, p
  200 | a, b, c, f, l, m, o     | f, c, a, b, m
  300 | b, f, h, j, o, w        | f, b
  400 | b, c, k, s, p           | c, b, p
  500 | a, f, c, e, l, p, m, n  | f, c, a, m, p

1. Scan DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order to get the f-list: f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

[Figure: the resulting FP-tree — root {} with the main path f:4 → c:3 → a:3 branching into m:2 → p:2 and b:1 → m:1, a side branch f:4 → b:1, and a branch c:1 → b:1 → p:1; a header table (f:4, c:4, a:3, b:3, m:3, p:3) links all nodes carrying the same item]
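A minimal FP-tree construction sketch in Python (class and variable names are mine); it performs the two scans described above and links nodes of the same item through a header table:

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_support):
    # Pass 1: count items, keep the frequent ones ordered by descending frequency (f-list)
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    flist = [i for i, c in sorted(freq.items(), key=lambda kv: -kv[1]) if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    # Pass 2: insert each transaction's frequent items, in f-list order, into the tree
    root = FPNode(None, None)
    header = {item: [] for item in flist}            # node-links per item
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

tdb = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, header, flist = build_fptree(tdb, 3)
print(flist)   # the frequent items f, c, a, b, m, p (tie order may differ from the slide)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})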
Benefits of the FP-tree Structure
Completeness
  Preserves complete information for frequent pattern mining
  Never breaks a long pattern of any transaction
Compactness
  Reduces irrelevant info — infrequent items are gone
  Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  Never larger than the original database (not counting node-links and the count field)
  For the Connect-4 DB, the compression ratio could be over 100
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
  Patterns containing p
  Patterns having m but no p
  …
  Patterns having c but no a nor b, m, p
  Pattern f
Completeness and non-redundancy
Find Patterns Having P From P-conditional Database
Starting at the frequent item header table in the FP-tree Traverse the FP-tree by following the link of each frequent
item p Accumulate all of transformed prefix paths of item p to
form p’ s conditional pattern base
Conditional pattern bases:
  item | conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1

[Figure: the FP-tree and its header table, as constructed above]
From Conditional Pattern-Bases to Conditional FP-trees
For each pattern-base Accumulate the count for each item in the base Construct the FP-tree for the frequent items of the
pattern base
Example for item m:
  m-conditional pattern base: fca:2, fcab:1
  m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3
  All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

[Figure: the global FP-tree with its header table, and the m-conditional FP-tree derived from it]
Recursion: Mining Each Conditional FP-tree
Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3):
  Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
  Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
  Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
A Special Case: Single Prefix Path in FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix-path P
Mining can be decomposed into two parts:
  Reduction of the single prefix path into one node
  Concatenation of the mining results of the two parts

[Figure: a tree whose top is the single prefix path a1:n1 → a2:n2 → a3:n3 followed by a multi-branch subtree r1 (b1:m1, C1:k1, C2:k2, C3:k3) is decomposed into that path plus r1]
Mining Frequent Patterns With FP-trees
Idea: frequent pattern growth
  Recursively grow frequent patterns by pattern and database partition
Method
  For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
  Repeat the process on each newly created conditional FP-tree
  Until the resulting FP-tree is empty, or it contains only one path — a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
Scaling FP-growth by DB Projection
FP-tree cannot fit in memory?—DB projection First partition a database into a set of projected
DBs Then construct and mine FP-tree for each
projected DB
Parallel projection vs. Partition projection techniques Parallel projection is space costly
Partition-based Projection
Parallel projection needs
a lot of disk space
Partition projection saves
it
[Figure: the transaction DB (fcamp, fcabm, fb, cbp, fcamp) is split into p-, m-, b-, a-, c-, and f-projected DBs; the m-projected DB is further split into am- and cm-projected DBs, and so on]
FP-Growth vs. Apriori: Scalability Withthe Support Threshold
[Figure: runtime (seconds) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime]
FP-Growth vs. Tree-Projection:Scalability with the Support Threshold
[Figure: runtime (seconds) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth against D2 TreeProjection]
Why Is FP-Growth the Winner?
Divide-and-conquer: decompose both the mining task and DB according to
the frequent patterns obtained so far
leads to focused search of smaller databases
Other factors no candidate generation, no candidate test
compressed database: FP-tree structure
no repeated scan of entire database
basic ops—counting local freq items and building sub FP-
tree, no pattern search and matching
Implications of the Methodology
Mining closed frequent itemsets and max-patterns CLOSET (DMKD’00)
Mining sequential patterns
FreeSpan (KDD’00), PrefixSpan (ICDE’01)
Constraint-based mining of frequent patterns
Convertible constraints (KDD’00, ICDE’01)
Computing iceberg data cubes with complex measures
H-tree and H-cubing algorithm (SIGMOD’01)
MaxMiner: Mining Max-Patterns
1st scan: find frequent items — A, B, C, D, E
2nd scan: find support for
  AB, AC, AD, AE, ABCDE
  BC, BD, BE, BCDE
  CD, CE, CDE
  DE
  (ABCDE, BCDE, CDE, DE are the potential max-patterns)
Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in a later scan
R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

  Tid | Items
  10  | A, B, C, D, E
  20  | B, C, D, E
  30  | A, C, D, F
Mining Frequent Closed Patterns:CLOSET
F-list: list of all frequent items in support-ascending order
  F-list = d-a-f-e-c
Divide the search space
  Patterns having d
  Patterns having d but no a, etc.
Find frequent closed patterns recursively
  Every transaction having d also has cfa ⇒ cfad is a frequent closed pattern
J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00

min_sup = 2
  TID | Items
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f
CLOSET+: Mining Closed Itemsets by Pattern-Growth
Itemset merging: if Y appears in every occurrence of X,
then Y is merged with X
Sub-itemset pruning: if Y ⊃ X, and sup(X) = sup(Y), X and
all of X’s descendants in the set enumeration tree can be
pruned
Hybrid tree projection
Bottom-up physical tree-projection
Top-down pseudo tree-projection
Item skipping: if a local frequent item has the samesupport in several header tables at different levels, one
can prune it from the header table at higher levels
Efficient subset checking
CHARM: Mining by Exploring VerticalData Format
Vertical format: t(AB) = {T11 , T25 , …}
tid-list: list of trans.-ids containing an itemset
Deriving closed patterns based on vertical intersections t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): transaction having X always has Y
Using diffset to accelerate mining Only keep track of differences of tids
t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
Diffset (XY, X) = {T2}
Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER(P. Shenoyet al.@SIGMOD’00), CHARM (Zaki & Hsiao@SDM’02)
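A tiny illustration of the vertical data format (toy tid-lists and function names are mine; this is not the CHARM algorithm itself): tid-lists are intersected to obtain supports, and a diffset records only what is lost relative to the parent itemset.

# Vertical (tid-list) format: item -> set of transaction ids containing it
tidlists = {
    "A": {10, 20, 30},
    "B": {10, 30, 40, 50},
    "D": {10, 20, 30, 50},
}

def t(itemset):
    # tid-list of an itemset = intersection of its items' tid-lists
    ids = None
    for item in itemset:
        ids = tidlists[item] if ids is None else ids & tidlists[item]
    return ids

def diffset(itemset, parent):
    # Diffset(itemset, parent) = t(parent) - t(itemset); support(itemset) = support(parent) - |diffset|
    return t(parent) - t(itemset)

print(t({"A", "D"}))                 # {10, 20, 30} -> support 3
print(diffset({"A", "B"}, {"A"}))    # {20} -> support(AB) = 3 - 1 = 2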
Further Improvements of Mining Methods
AFOPT (Liu, et al. @ KDD’03) A “push-right” method for mining condensed frequent
pattern (CFP) tree Carpenter (Pan, et al. @ KDD’03)
Mine data sets with small rows but numerous columns Construct a row-enumeration tree for efficient mining
Visualization of Association Rules: Plane Graph
Visualization of Association Rules: Rule Graph
Visualization of Association Rules (SGI/MineSet 3.0)
Mining Various Kinds of Association Rules
Mining multilevel association
Mining multidimensional association
Mining quantitative association
Mining interesting correlation patterns
Mining Multiple-LevelAssociation Rules
Items often form hierarchies
Flexible support settings Items at the lower level are expected to have lower
support Exploration of shared multi-level mining (Agrawal
& Srikant@VLB’95, Han & Fu@VLDB’95)
uniform support
Milk
[support = 10%]
2% Milk
[support = 6%]
Skim Milk
[support = 4%]
Level 1
min_sup = 5%
Level 2
min_sup = 5%
Level 1
min_sup = 5%
Level 2
min_sup = 3%
reduced support
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to “ancestor”
relationships between items.
Example milk ⇒wheat bread [support = 8%, confidence = 70%]
2% milk ⇒wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second
rule.
A rule is redundant if its support is close to the“expected” value, based on the rule’s ancestor.
Mining Multi-Dimensional Association
Single-dimensional rules:
buys(X, “milk”) ⇒buys(X, “bread”) Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X,”19-25”) ∧occupation(X,“student”) ⇒buys(X, “coke”)
hybrid-dimension assoc. rules (repeated predicates)
age(X,”19-25”) ∧ buys(X, “popcorn”) ⇒buys(X, “coke”)
Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
Quantitative Attributes: numeric, implicit ordering among
values—discretization, clustering, and gradient approaches
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant@SIGMOD96)
3. Clustering: distance-based association (e.g., Yang & Miller@SIGMOD97)
   One-dimensional clustering, then association
4. Deviation (such as Aumann and Lindell@KDD99)
   Sex = female => Wage: mean = $7/hr (overall mean = $9)
Static Discretization of Quantitative Attributes
Discretized prior to mining using concept hierarchy.
Numeric values are replaced by ranges.
In relational database, finding all frequent k-
predicate sets will require k or k +1 table scans.
Data cube is well suited for mining.
The cells of an n-dimensional
cuboid correspond to the
predicate sets. Mining from data cubes
can be much faster.
[Figure: lattice of cuboids over the predicates (age), (income), (buys), their pairwise combinations, and (age, income, buys)]
Quantitative Association Rules
age(X,”34-35”) ∧income(X,”30-50K”)
⇒buys(X,”high resolution TV”)
Proposed by Lent, Swami and Widom ICDE 97
Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid
Example: the rule age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV") above
Mining Other Interesting Patterns
Flexible support constraints (Wang et al. @ VLDB'02)
  Some items (e.g., diamond) may occur rarely but are valuable
  Customized sup_min specification and application
Top-K closed frequent patterns (Han, et al. @ ICDM'02)
  Hard to specify sup_min, but top-k with length_min is more desirable
  Dynamically raise sup_min in FP-tree construction and mining, and select the most promising path to mine
Interestingness Measure:Correlations (Lift)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more
accurate, although with lower support and confidence
Measure of dependent/correlated events: lift

  lift = P(A ∪ B) / (P(A) · P(B))

              | Basketball | Not basketball | Sum (row)
  Cereal      | 2000       | 1750           | 3750
  Not cereal  | 1000       | 250            | 1250
  Sum (col.)  | 3000       | 2000           | 5000

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
Are lift and χ2
Good Measures of Correlation?
“Buy walnuts ⇒buy milk [1% 80%]” is misleading
"Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
Support and confidence are not good measures of correlation
So many interestingness measures? (Tan, Kumar, Srivastava @KDD'02)
              | Milk  | No Milk | Sum (row)
  Coffee      | m, c  | ~m, c   | c
  No Coffee   | m, ~c | ~m, ~c  | ~c
  Sum (col.)  | m     | ~m      | Σ

  lift = P(A ∪ B) / (P(A) · P(B))

  DB | m, c | ~m, c | m, ~c | ~m, ~c  | lift | all-conf | coh  | χ2
  A1 | 1000 | 100   | 100   | 10,000  | 9.26 | 0.91     | 0.83 | 9055
  A2 | 100  | 1000  | 1000  | 100,000 | 8.44 | 0.09     | 0.05 | 670
  A3 | 1000 | 100   | 10000 | 100,000 | 9.18 | 0.09     | 0.09 | 8172

  all_conf(X) = sup(X) / max_item_sup(X)
  coh(X) = sup(X) / |universe(X)|
Which Measures Should Be Used?
lift and χ2 are not good measures for correlations in large transactional DBs
all_conf or coherence could be good measures (Omiecinski@TKDE'03)
Both all_conf and coherence have the downward closure property
Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)
Constraint-Based (Query-Directed) Mining
Finding all the patterns in a databaseautonomously? — unrealistic! The patterns could be too many but not focused!
Data mining should be an interactive process
User directs what to be mined using a data mining querylanguage (or a graphical user interface)
Constraint-based mining User flexibility: provides constraints on what to be mined
System optimization: explores such constraints forefficient mining—constraint-based mining
Constraints in Data Mining
Knowledge type constraint: classification, association, etc.
Data constraint — using SQL-like queries find product pairs sold together in stores in Chicago in
Dec.’02
Dimension/level constraint in relevance to region, price, brand, customer category Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum > $200) Interestingness constraint
strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Constrained Mining vs. Constraint-Based Search
Constrained mining vs. constraint-basedsearch/reasoning Both are aimed at reducing search space Finding all patterns satisfying constraints vs. finding some
(or one) answer in constraint-based search in AI Constraint-pushing vs. heuristic search
It is an interesting research problem on how to integratethem Constrained mining vs. query processing in DBMS
Database query processing requires to find all Constrained pattern mining shares a similar philosophy as
pushing selections deeply in query processing
Anti-Monotonicity in Constraint Pushing
Anti-monotonicity: when an itemset S violates the
constraint, so does any of its superset
sum(S.Price) ≤ v is anti-monotone
sum(S.Price) ≥ v is not anti-monotone
Example. C: range(S.profit) ≤ 15 is
anti-monotone Itemset ab violates C
So does every superset of ab
TDB (min_sup = 2):
  TID | Transaction
  10  | a, b, c, d, f
  20  | b, c, d, f, g, h
  30  | a, c, d, e, f
  40  | c, e, f, g

  Item | Profit
  a    | 40
  b    | 0
  c    | -20
  d    | 10
  e    | -30
  f    | 30
  g    | 20
  h    | -10
Monotonicity for Constraint Pushing
Monotonicity: when an itemset S satisfies the
constraint, so does any of its superset
sum(S.Price) ≥ v is monotone
min(S.Price) ≤ v is monotone
Example. C: range(S.profit) ≥ 15 Itemset ab satisfies C
So does every superset of ab
(Same TDB (min_sup = 2) and item-profit table as on the previous slide.)
Succinctness
Succinctness:
  Given A1, the set of items satisfying a succinctness constraint C, then any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
  Idea: without looking at the transaction database, whether an itemset S satisfies constraint C can be determined based on the selection of items
  min(S.Price) ≤ v is succinct
  sum(S.Price) ≥ v is not succinct
Optimization: if C is succinct, C is pre-counting pushable
The Apriori Algorithm —Example
Database D:
  TID | Items
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
          L1: {1}:2, {2}:3, {3}:3, {5}:3
C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
          L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3: {2 3 5}
Scan D -> L3: {2 3 5}:2
Naive Algorithm: Apriori + Constraint
(The same Apriori trace as in the example above, now run with the constraint Sum{S.price} < 5; the naive approach mines all frequent itemsets first and checks the constraint afterwards.)
Algorithm: Push an Anti-monotone Constraint Deep
(Again the same trace, with the anti-monotone constraint Sum{S.price} < 5 pushed deep: candidate itemsets are discarded as soon as they violate the constraint, so none of their supersets is ever generated or counted.)
Algorithm: Push a Succinct Constraint Deep
(The same trace once more, with the succinct constraint min{S.price} <= 1; being succinct, it can be enforced through the selection of items before counting and is not immediately to be used during support counting.)
Converting "Tough" Constraints
Convert tough constraints into anti-monotone or monotone ones by properly ordering items
Examine C: avg(S.profit) ≥ 25
  Order items in value-descending order: <a, f, g, d, b, h, c, e>
  If an itemset afb violates C, so does afbh, afb*
  It becomes anti-monotone!
(TDB (min_sup = 2) and the item-profit table as on the anti-monotonicity slide above.)
Strongly Convertible Constraints
avg(X) ≥ 25 is convertible anti-monotonew.r.t. item value descending order R: <a,f, g, d, b, h, c, e> If an itemset af violates a constraint C, so does
every itemset with af as prefix, such as afd
avg(X) ≥ 25 is convertible monotone w.r.t. item value ascending order R-1: <e, c, h, b, d, g, f, a>
  If an itemset d satisfies a constraint C, so do itemsets df and dfa, which have d as a prefix
Thus, avg(X) ≥ 25 is strongly convertible
  Item | Profit
  a    | 40
  b    | 0
  c    | -20
  d    | 10
  e    | -30
  f    | 30
  g    | 20
  h    | -10
Can Apriori Handle Convertible Constraints?
A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  Within the level-wise framework, no direct pruning based on the constraint can be made
  Itemset df violates constraint C: avg(X) >= 25
  Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
But it can be pushed into the frequent-pattern growth framework!
  Item | Value
  a    | 40
  b    | 0
  c    | -20
  d    | 10
  e    | -30
  f    | 30
  g    | 20
  h    | -10
Mining With Convertible Constraints
C: avg(X) >= 25, min_sup=2
List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R
Scan TDB once
  Remove infrequent items
    Item h is dropped
  Itemsets a and f are good, …
Projection-based mining
  Impose an appropriate order on item projection
  Many tough constraints can be converted into (anti-)monotone ones
TDB (min_sup = 2), items listed in value-descending order:
  TID | Transaction
  10  | a, f, d, b, c
  20  | f, g, d, b, c
  30  | a, f, d, c, e
  40  | f, g, h, c, e

  Item | Value
  a    | 40
  f    | 30
  g    | 20
  d    | 10
  b    | 0
  h    | -10
  c    | -20
  e    | -30
Handling Multiple Constraints
Different constraints may require different or even conflicting item orderings
If there exists an order R s.t. both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
If there exists a conflict on the order of items
  Try to satisfy one constraint first
  Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database
What Constraints Are Convertible?
  Constraint                                      | Convertible anti-monotone | Convertible monotone | Strongly convertible
  avg(S) ≤ v, ≥ v                                 | Yes | Yes | Yes
  median(S) ≤ v, ≥ v                              | Yes | Yes | Yes
  sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes | No  | No
  sum(S) ≤ v (items could be of any value, v ≤ 0) | No  | Yes | No
  sum(S) ≥ v (items could be of any value, v ≥ 0) | No  | Yes | No
  sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes | No  | No
  …
Constraint-Based Mining—A General Picture
  Constraint                   | Antimonotone | Monotone    | Succinct
  v ∈ S                        | no           | yes         | yes
  S ⊇ V                        | no           | yes         | yes
  S ⊆ V                        | yes          | no          | yes
  min(S) ≤ v                   | no           | yes         | yes
  min(S) ≥ v                   | yes          | no          | yes
  max(S) ≤ v                   | yes          | no          | yes
  max(S) ≥ v                   | no           | yes         | yes
  count(S) ≤ v                 | yes          | no          | weakly
  count(S) ≥ v                 | no           | yes         | weakly
  sum(S) ≤ v (a ∈ S, a ≥ 0)    | yes          | no          | no
  sum(S) ≥ v (a ∈ S, a ≥ 0)    | no           | yes         | no
  range(S) ≤ v                 | yes          | no          | no
  range(S) ≥ v                 | no           | yes         | no
  avg(S) θ v, θ ∈ {=, ≤, ≥}    | convertible  | convertible | no
  support(S) ≥ ξ               | yes          | no          | no
  support(S) ≤ ξ               | no           | yes         | no
A Classification of Constraints
[Figure: a classification of constraints — antimonotone, monotone, and succinct constraints; convertible anti-monotone and convertible monotone constraints, whose intersection is the strongly convertible constraints; everything else is inconvertible]
Chapter 6. Classification and Prediction
What is classification? What
is prediction?
Issues regarding
classification and prediction
Classification by decision
tree induction
Bayesian classification
Rule-based classification
Classification by back
propagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from
your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
Classification vs. Prediction
Classification predicts categorical class labels (discrete or nominal) classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attributeand uses it in classifying new data
Prediction models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications Credit approval Target marketing Medical diagnosis Fraud detection
Classification—A Two-Step Process
Model construction: describing a set of predetermined
classes Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute The set of tuples used for model construction is training set The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with theclassified result from the model
Accuracy rate is the percentage of test set samples
that are correctly classified by the model Test set is independent of training set, otherwiseover-fitting will occur
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction
Training data:
  NAME | RANK           | YEARS | TENURED
  Mike | Assistant Prof | 3     | no
  Mary | Assistant Prof | 7     | yes
  Bill | Professor      | 2     | yes
  Jim  | Associate Prof | 7     | yes
  Dave | Assistant Prof | 6     | no
  Anne | Associate Prof | 3     | no

The classification algorithm produces the classifier (model):
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model inPrediction
Testing data (to estimate classifier accuracy):
  NAME    | RANK           | YEARS | TENURED
  Tom     | Assistant Prof | 2     | no
  Merlisa | Associate Prof | 7     | no
  George  | Professor      | 5     | yes
  Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) — Tenured? The rule above predicts yes.
Supervised vs. Unsupervised Learning
Supervised learning (classification) Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set Unsupervised learning (clustering)
The class labels of training data is unknown
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Issues: Data Preparation
Data cleaning Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection) Remove the irrelevant or redundant attributes
Data transformation Generalize and/or normalize data
Issues: Evaluating Classification
Methods
Accuracy
classifier accuracy: predicting class label predictor accuracy: guessing value of predicted attributes
Speed time to construct the model (training time) time to use the model (classification/prediction time)
Robustness: handling noise and missing values Scalability: efficiency in disk-resident databases Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such asdecision tree size or compactness of classificationrules
Decision Tree Induction: Training Dataset
  age   | income | student | credit_rating | buys_computer
  <=30  | high   | no      | fair          | no
  <=30  | high   | no      | excellent     | no
  31…40 | high   | no      | fair          | yes
  >40   | medium | no      | fair          | yes
  >40   | low    | yes     | fair          | yes
  >40   | low    | yes     | excellent     | no
  31…40 | low    | yes     | excellent     | yes
  <=30  | medium | no      | fair          | no
  <=30  | low    | yes     | fair          | yes
  >40   | medium | yes     | fair          | yes
  <=30  | medium | yes     | excellent     | yes
  31…40 | medium | no      | excellent     | yes
  31…40 | high   | yes     | fair          | yes
  >40   | medium | no      | excellent     | no
This follows an example of Quinlan's ID3 (Playing Tennis)
Output: A Decision Tree for“buys_computer”
age?
  <=30   → student?       (no → no; yes → yes)
  31..40 → yes
  >40    → credit_rating? (excellent → no; fair → yes)
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
  At start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning — majority voting is employed for classifying the leaf
  There are no samples left
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
Expected information (entropy) needed to classify a tuple in D:
Information needed (after using A to split D into v partitions) to classify D:
Information gained by branching on attribute A:
Info(D) = − Σ_{i=1..m} p_i · log2(p_i)

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

Gain(A) = Info(D) − Info_A(D)
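A small Python sketch of these three formulas (function names are mine), checked against the age attribute of the training data above:

import math
from collections import Counter

def info(labels):
    # Info(D) = -sum_i p_i log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_a(attr_values, labels):
    # Info_A(D): expected information after splitting D on attribute A
    n = len(labels)
    partitions = {}
    for a, y in zip(attr_values, labels):
        partitions.setdefault(a, []).append(y)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(attr_values, labels):
    return info(labels) - info_a(attr_values, labels)

# buys_computer labels and the age column from the training data above
buys = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
age  = ["<=30","<=30","31..40",">40",">40",">40","31..40","<=30","<=30",">40","<=30","31..40","31..40",">40"]
print(round(info(buys), 3))        # 0.94 (the slide's 0.940)
print(round(gain(age, buys), 3))   # 0.247 (the slide reports 0.246, rounding intermediate values)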
Attribute Selection: Information Gain
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

  age   | p_i | n_i | I(p_i, n_i)
  <=30  | 2   | 3   | 0.971
  31…40 | 4   | 0   | 0
  >40   | 3   | 2   | 0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  Here (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

(Training data as in the table above.)
Computing Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A Sort the value A in increasing order
Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information
requirement for A is selected as the split-point for A
Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2
is the set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards
attributes with a large number of values C4.5 (a successor of ID3) uses gain ratio to overcome
the problem (normalization to information gain)
GainRatio(A) = Gain(A)/SplitInfo(A)
Ex.
gain_ratio(income) = 0.029/0.926 = 0.031 The attribute with the maximum gain ratio is selected
as the splitting attribute
SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)

Ex.: SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 0.926
Gini Index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, gini index,
gini(D) is defined as
where p j is the relative frequency of class j in D
If a data set D is split on A into two subsets D1 and D2, the gini
index gini(D) is defined as
Reduction in Impurity:
The attribute that provides the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
gini(D) = 1 − Σ_{j=1..n} p_j^2

gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Δgini(A) = gini(D) − gini_A(D)
Gini Index (CART, IBM IntelligentMiner)
Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
gini(D) = 1 − (9/14)^2 − (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2:

  gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

but gini_{income ∈ {medium,high}} is 0.30 and thus the best, since it is the lowest

All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
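A short Python check of the gini formulas (helper names are mine):

from collections import Counter

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels_d1, labels_d2):
    # Weighted gini index of a binary split of D into D1 and D2
    n = len(labels_d1) + len(labels_d2)
    return len(labels_d1) / n * gini(labels_d1) + len(labels_d2) / n * gini(labels_d2)

buys = ["yes"] * 9 + ["no"] * 5
print(round(gini(buys), 3))     # 0.459, as in the example above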
Comparing Attribute Selection Measures
The three measures, in general, return good results
but Information gain:
biased towards multivalued attributes Gain ratio:
tends to prefer unbalanced splits in which onepartition is much smaller than the others
Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized
partitions and purity in both partitions
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm, measure based on χ2 test
for independence C-SEP: performs better than info. gain and gini index in certain cases
G-statistics: has a close approximation to χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best? Most give good results, none is significantly superior than others
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training
data Too many branches, some may reflect anomalies due to noise or
outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”
Enhancements to Basic Decision Tree Induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes thatpartition the continuous attribute value into a discrete setof intervals
Handle missing attribute values
Assign the most common value of the attribute Assign probability to each of the possible values
Attribute construction Create new attributes based on existing ones that are
sparsely represented This reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification—a classical problem extensively
studied by statisticians and machine learningresearchers
Scalability: Classifying data sets with millions of
examples and hundreds of attributes with
reasonable speed Why decision tree induction in data mining?
relatively faster learning speed (than other classificationmethods)
convertible to simple and easy to understand classificationrules
can use SQL queries for accessing databases comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods
SLIQ (EDBT'96 — Mehta et al.)
8/8/2019 CS1004-NOL
http://slidepdf.com/reader/full/cs1004-nol 227/323
227
Builds an index for each attribute and only class list andthe current attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.) Constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim) Integrates tree splitting and tree pruning: stop growing the
tree earlier RainForest (VLDB’98 — Gehrke, Ramakrishnan &
Ganti) Builds an AVC-list (attribute, value, class label)
BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan &Loh) Uses bootstrapping to create several small samples
Scalability Framework for RainForest

Separates the scalability aspects from the criteria that determine the quality of the tree
Builds an AVC-list: AVC (Attribute, Value, Class_label)
AVC-set (of an attribute X )
Projection of training dataset onto the attribute X and class label
where counts of individual class label are aggregated
AVC-group (of a node n )
Set of AVC-sets of all predictor attributes at the node n
RainForest: Training Set and Its AVC-Sets

Training examples:
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

AVC-set on age (Buy_Computer counts):
age     yes  no
<=30    2    3
31..40  4    0
>40     3    2

AVC-set on income:
income  yes  no
high    2    2
medium  4    2
low     3    1

AVC-set on student:
student  yes  no
yes      6    1
no       3    4

AVC-set on credit_rating:
credit_rating  yes  no
fair           6    2
excellent      3    3
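As a quick sketch of how an AVC-group can be built in a single scan (a minimal illustration, not the RainForest implementation itself; all identifiers are made up), the counts above can be produced with simple dictionary counting:

```python
# Minimal sketch: building AVC-sets (attribute, value, class-label counts)
# for the buys_computer training data above; names are illustrative.
from collections import Counter

rows = [
    # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def avc_group(rows):
    """Return {attribute: Counter((value, class_label) -> count)} for one node."""
    group = {a: Counter() for a in attributes}
    for *values, label in rows:
        for attr, value in zip(attributes, values):
            group[attr][(value, label)] += 1
    return group

avc = avc_group(rows)
print(avc["age"][("<=30", "yes")], avc["age"][("<=30", "no")])  # 2 3
```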
Data Cube-Based Decision-Tree Induction

Integration of generalization with decision-tree induction (Kamber et al.’97)
Classification at primitive concept levels E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy classification-trees
Semantic interpretation problems
Cube-based multi-level classification Relevance analysis at multi-levels
Information-gain analysis with dimension + level
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
Use a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
Each subset is used to create a tree, resulting in several
trees
These trees are examined and used to construct a new
tree T’
It turns out that T’ is very close to the tree that would be
generated using the whole data set together
Adv: requires only two scans of DB, an incremental alg.
Presentation of Classification Results
Visualization of a Decision Tree in SGI/MineSet 3.0
Interactive Visual Mining by Perception-Based Classification (PBC)
Chapter 6. Classification and Prediction

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by backpropagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem
Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (posteriori probability): the probability of observing the sample X, given that the hypothesis holds
E.g., given that X will buy computer, the prob. that X is 31..40 with medium income
Bayesian Theorem
Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem
Informally, this can be written as: posteriori = likelihood × prior / evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
$P(H|\mathbf{X}) = \frac{P(\mathbf{X}|H)\,P(H)}{P(\mathbf{X})}$
Towards a Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm
Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
This can be derived from Bayes’ theorem
Since P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized
$P(C_i|\mathbf{X}) = \frac{P(\mathbf{X}|C_i)\,P(C_i)}{P(\mathbf{X})}$

$P(C_i|\mathbf{X}) \propto P(\mathbf{X}|C_i)\,P(C_i)$
Derivation of the Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
This greatly reduces the computation cost: only counts the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ, and P(xk|Ci) is
$P(\mathbf{X}|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$

$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
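The same numbers can be reproduced with a short sketch (a minimal illustration of the naïve Bayesian computation with categorical attributes and no Laplacian correction; all identifiers are made up):

```python
# Minimal naive Bayes sketch for the buys_computer example above.
from collections import Counter, defaultdict

data = [  # (age, income, student, credit_rating) -> buys_computer
    (("<=30", "high", "no", "fair"), "no"),
    (("<=30", "high", "no", "excellent"), "no"),
    (("31..40", "high", "no", "fair"), "yes"),
    ((">40", "medium", "no", "fair"), "yes"),
    ((">40", "low", "yes", "fair"), "yes"),
    ((">40", "low", "yes", "excellent"), "no"),
    (("31..40", "low", "yes", "excellent"), "yes"),
    (("<=30", "medium", "no", "fair"), "no"),
    (("<=30", "low", "yes", "fair"), "yes"),
    ((">40", "medium", "yes", "fair"), "yes"),
    (("<=30", "medium", "yes", "excellent"), "yes"),
    (("31..40", "medium", "no", "excellent"), "yes"),
    (("31..40", "high", "yes", "fair"), "yes"),
    ((">40", "medium", "no", "excellent"), "no"),
]

class_counts = Counter(label for _, label in data)
cond_counts = defaultdict(int)          # (attribute index, value, class) -> count
for values, label in data:
    for k, v in enumerate(values):
        cond_counts[(k, v, label)] += 1

def score(x, c):
    """P(X|C) * P(C) under the class-conditional independence assumption."""
    p = class_counts[c] / len(data)
    for k, v in enumerate(x):
        p *= cond_counts[(k, v, c)] / class_counts[c]
    return p

x = ("<=30", "medium", "yes", "fair")
print({c: round(score(x, c), 3) for c in class_counts})  # {'no': 0.007, 'yes': 0.028}
```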
Avoiding the Zero-Probability Problem

Naïve Bayesian prediction requires each conditional prob. to be non-zero; otherwise, the predicted prob. will be zero
Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
Use the Laplacian correction (or Laplacian estimator): adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their “uncorrected” counterparts
$P(\mathbf{X}|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$
Naïve Bayesian Classifier: Comments

Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
Dependencies among these cannot be modeled by a naïve Bayesian classifier
How to deal with these dependencies? Bayesian Belief Networks
Bayesian Belief Networks
Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution

[Figure: example network with nodes X, Y, Z, P]
Nodes: random variables
Links: dependency
X and Y are the parents of Z, and
Y is the parent of P
No dependency between Z and P
Has no loops or cycles
Example

[Figure: a Bayesian belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for variable LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC      0.8      0.5       0.7       0.1
~LC     0.2      0.5       0.3       0.9
The CPT shows the conditional probability for each possible combination of values of its parents
Derivation of the probability of a particular combination of values of X, from the CPT:

$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$
Training Bayesian Networks
Several scenarios:
Given both the network structure and all variables observable: learn only the CPTs
Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
Unknown structure, all hidden variables: no good algorithms known for this purpose
Ref. D. Heckerman: Bayesian networks for data mining
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
ncovers= # of tuples covered by R
ncorrect= # of tuples correctly classified by R
coverage(R) = ncovers / |D|   /* D: training data set */
accuracy(R) = ncorrect / ncovers
If more than one rule is triggered, need conflict resolution
Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
Class-based ordering: decreasing order of prevalence or misclassification cost per class
Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
[Decision tree figure: root tests age? with branches <=30, 31..40, >40; the <=30 branch tests student? (no/yes leaves), the 31..40 branch is a yes leaf, and the >40 branch tests credit_rating? (fair/excellent leaves)]
Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = old AND credit_rating = fair THEN buys_computer = no
Each attribute-value pair along a path forms a conjunction:
the leaf holds the class prediction
Rules are mutually exclusive and exhaustive
Rule Extraction from the Training Data
Sequential covering algorithm: Extracts rules directly from training data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
Steps:
Rules are learned one at a time
Each time a rule is learned, the tuples covered by the rules are removed
The process repeats on the remaining tuples unless termination condition,
e.g., when no more training examples or when the quality of a rule returned
is below a user-specified threshold
Comp. w. decision-tree induction: learning a set of rules simultaneously
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Add new attributes by adopting a greedy depth-first strategy
Picks the one that most improves the rule quality
Rule-Quality measures: consider both coverage and accuracy
Foil-gain (in FOIL & RIPPER): assesses info_gain by extending condition
It favors rules that have high accuracy and cover many positive tuples
Rule pruning based on an independent set of test tuples
Pos/neg are # of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
$FOIL\_Gain = pos' \times \left(\log_2\frac{pos'}{pos'+neg'} - \log_2\frac{pos}{pos+neg}\right)$

$FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$
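For concreteness, here is a minimal sketch (the counts are made up for illustration) that evaluates these two measures for a rule before and after adding a condition:

```python
# Minimal sketch of FOIL_Gain and FOIL_Prune. pos/neg are the positive and
# negative tuples covered by rule R; pos2/neg2 are the counts after a new
# condition is added. The numbers below are illustrative only.
from math import log2

def foil_gain(pos, neg, pos2, neg2):
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

print(round(foil_gain(pos=30, neg=30, pos2=20, neg2=5), 3))  # gain from extending the rule
print(round(foil_prune(pos=20, neg=5), 3))                    # 0.6
```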
Classification: A Mathematical Mapping

Classification: predicts categorical class labels
E.g., personal homepage classification
xi = (x1, x2, x3, …), yi = +1 or –1
x1: # of occurrences of the word “homepage”
x2: # of occurrences of the word “welcome”
Mathematically
x ∈ X = ℜn, y ∈ Y = {+1, –1}
We want a function f: X → Y
Linear Classification
Binary classification problem
The data above the red line belongs to class ‘x’
The data below the red line belongs to class ‘o’
Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers
Advantages
prediction accuracy is generally high, as compared to Bayesian methods in general
robust, works when training examples contain errors
fast evaluation of the learned target function (Bayesian networks are normally slow)
Criticism
long training time
difficult to understand the learned function (weights); Bayesian networks can be used easily for pattern discovery
not easy to incorporate domain knowledge (easy in the form of priors on the data or distributions)
Perceptron & Winnow
• Vector: x, w; scalar: x, y
Input: {(x1, y1), …}
Output: classification function f(x)
f(xi) > 0 for yi = +1
f(xi) < 0 for yi = –1
Decision boundary: w·x + b = 0, or w1x1 + w2x2 + b = 0
• Perceptron: update w additively
• Winnow: update w multiplicatively
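As an illustration of the additive update, here is a minimal perceptron sketch (the toy data, learning rate, and names are made up; a Winnow variant would scale the weights multiplicatively instead):

```python
# Minimal perceptron sketch (additive weight updates) on a tiny linearly
# separable dataset; everything here is illustrative.
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 4.0], [4.0, 1.0], [5.0, 2.0]])
y = np.array([+1, +1, -1, -1])             # class labels

w = np.zeros(X.shape[1])                   # weight vector
b = 0.0                                    # bias
eta = 0.1                                  # learning rate

for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:  # misclassified (or on the boundary)
            w += eta * yi * xi             # additive update
            b += eta * yi
            errors += 1
    if errors == 0:                        # converged: all points correctly classified
        break

print(w, b, np.sign(X @ w + b))            # predictions match y: [ 1.  1. -1. -1.]
```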
Classification by Backpropagation

Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: A set of connected input/output units where
each connection has a weight associated with it
During the learning phase, the network learns by adjustingthe weights so as to be able to predict the correct class label
of the input tuples
Also referred to as connectionist learning due to the
connections between units
Neural Network as a Classifier

Weakness
Long training time
Requires a number of parameters typically best determined empirically, e.g., the network topology or “structure”
Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules from trained neural networks
A Neuron (= a perceptron)
The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping

[Figure: a single neuron — the input vector x = (x0, x1, …, xn) and weight vector w = (w0, w1, …, wn) feed a weighted sum Σ with bias μk, followed by a nonlinear activation function f, producing the output y]

For example: $y = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)$
A Multi-Layer Feed-Forward Neural Network
[Figure: a multi-layer feed-forward network — input vector X enters the input layer, passes through a hidden layer via weights wij, and produces the output vector at the output layer]

$I_j = \sum_i w_{ij} O_i + \theta_j$

$O_j = \frac{1}{1 + e^{-I_j}}$

$Err_j = O_j (1 - O_j)(T_j - O_j)$

$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$

$w_{ij} = w_{ij} + (l)\, Err_j\, O_i$

$\theta_j = \theta_j + (l)\, Err_j$
How A Multi-Layer Neural Network Works?
The inputs to the network correspond to the attributes measured for each training
tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the
output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: Given
enough hidden units and enough training samples, they can closely approximate
any function
Backpropagation
Iteratively process a set of training tuples & compare the
network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize
the mean squared error between the network's prediction
and the actual target value
Modifications are made in the “backwards” direction: from
the output layer, through each hidden layer down to the first
hidden layer, hence “backpropagation”
Steps:
Initialize weights (to small random #s) and biases in the network
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
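To make these steps concrete, here is a minimal sketch of per-tuple backpropagation for a tiny 2-2-1 sigmoid network (the dataset, layer sizes, learning rate, and seed are made up for illustration):

```python
# Minimal backpropagation sketch following the update rules above.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 0, 0, 1], dtype=float)        # illustrative target: logical AND

W1 = rng.uniform(-0.5, 0.5, (2, 2))            # input -> hidden weights w_ij
b1 = rng.uniform(-0.5, 0.5, 2)                 # hidden biases theta_j
W2 = rng.uniform(-0.5, 0.5, 2)                 # hidden -> output weights
b2 = rng.uniform(-0.5, 0.5)                    # output bias
l = 0.5                                        # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    for x, t in zip(X, T):
        # forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = 1/(1+e^{-I_j})
        h = sigmoid(x @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        # backward pass: Err_j = O_j(1-O_j)(T_j-O_j) at the output,
        # Err_j = O_j(1-O_j) sum_k Err_k w_jk at the hidden layer
        err_o = o * (1 - o) * (t - o)
        err_h = h * (1 - h) * (err_o * W2)
        # updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
        W2 += l * err_o * h
        b2 += l * err_o
        W1 += l * np.outer(x, err_h)
        b1 += l * err_h

print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2))  # should approach [0, 0, 0, 1]
```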
Backpropagation and Interpretability

Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| × w), with |D| tuples and w weights, but the # of epochs can be exponential in n, the number of inputs, in the worst case
Rule extraction from networks: network pruning Simplify the network structure by removing weighted links that
have the least effect on the trained network
Then perform link, unit, or activation value clustering
The set of input and activation values are studied to derive rules
describing the relationship between the input and hidden unit
layers
Sensitivity analysis: assess the impact that a given inputvariable has on a network output. The knowledge gained from
this analysis can be represented in rules
Associative Classification
Associative classification
Association rules are generated and analyzed for use in classification
Search for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels
Classification: Based on evaluating a set of rules in the form of
p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (conf, sup)
It explores highly confident associations among multiple attributes and
may overcome some constraints introduced by decision-tree induction,
which considers only one attribute at a time
In many studies, associative classification has been found to be more
accurate than some traditional classification methods, such as C4.5
Typical Associative Classification Methods
CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)
Mines possible association rules in the form of
Cond-set (a set of attribute-value pairs) → class label
based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01) Classification: Statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03) Generation of predictive rules (FOIL-like analysis)
High efficiency, accuracy similar to CMAR
RCBT (Mining top-k covering rule groups for gene expression data, Cong et al.SIGMOD’05) Explore high-dimensional classification, using top-k rule groups
Achieve high classification accuracy and high run-time efficiency
A Closer Look at CMAR
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei,ICDM’01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Prunes rules for which the rule antecedent and class are not positively correlated, based on a χ2 test of statistical significance
Classification based on generated/pruned rules
If only one rule satisfies tuple X, assign the class label of the rule
If a rule set S satisfies X, CMAR
divides S into groups according to class labels
uses a weighted χ2 measure to find the strongest group of rules, based on the statistical correlation of rules within a group
assigns X the class label of the strongest group
High Accuracy and Efficiency (Cong et al. SIGMOD’05)
The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
Target function could be discrete- or real-valued
For discrete-valued, k-NN returns the most common value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
Discussion on the k-NN Algorithm
k-NN for real-valued prediction for a given unknown tuple
Returns the mean values of the k nearest neighbors
Weight the contribution of each of the k neighbors according to their
distance to the query x q
Give greater weight to closer neighbors
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be
dominated by irrelevant attributes To overcome it, axes stretch or elimination of the least relevant
attributes
$w \equiv \frac{1}{d(x_q, x_i)^2}$
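A minimal k-NN sketch for discrete-valued prediction (Euclidean distance, unweighted majority vote; the toy data and k are made up):

```python
# Minimal k-nearest-neighbor sketch; everything here is illustrative.
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5], [1.2, 0.8]])
y_train = ["a", "a", "b", "b", "a"]

def knn_predict(x_q, k=3):
    dists = np.linalg.norm(X_train - x_q, axis=1)    # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)     # majority vote among the neighbors
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([1.1, 1.2])))  # 'a'
print(knn_predict(np.array([5.5, 5.0])))  # 'b'
```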
Genetic Algorithms (GA)
Genetic Algorithm: based on an analogy to biological evolution
An initial population is created consisting of randomly generated rules
Each rule is represented by a string of bits
E.g., if A1 and ¬A2 then C2 can be encoded as 100
If an attribute has k > 2 values, k bits can be used
Based on the notion of survival of the fittest, a new population isformed to consist of the fittest rules and their offsprings
The fitness of a rule is represented by its classification accuracy
on a set of training examples
Offspring are generated by crossover and mutation
The process continues until a population P evolves when each
rule in P satisfies a prespecified threshold
Slow but easily parallelizable
Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
Attribute values are converted to fuzzy values
e.g., income is mapped into the discrete categories {low,medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy valuemay apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed, and these sums are combined
What Is Prediction?
(Numerical) prediction is similar to classification
construct a model
use the model to predict a continuous or ordered value for a given input
Prediction is different from classification
Classification refers to predicting a categorical class label
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
Linear regression: involves a response variable y and a single
predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
Multiple linear regression: involves more than one predictor variable
Training data is of the form (X1, y1), (X2, y2),…, (X|D| , y|D| )
Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
Solvable by extension of the least squares method or using SAS, S-Plus
Many nonlinear functions can be transformed into the above
$w_1 = \frac{\sum_{i=1}^{|D|}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|}(x_i - \bar{x})^2} \qquad w_0 = \bar{y} - w_1\,\bar{x}$
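A minimal sketch of these closed-form least-squares estimates on made-up data (the numbers are illustrative only):

```python
# Minimal simple linear regression sketch: y = w0 + w1*x via least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])   # roughly y = 2x

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(round(w0, 2), round(w1, 2))           # intercept near 0, slope near 2
```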
Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,
y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear with new variables x2 = x², x3 = x³:
y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as the power function, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., sum of exponential terms); it is possible to obtain least-squares estimates through extensive calculation on more complex formulae
Generalized linear model: Foundation on which linear regression can be applied to modeling
categorical response variables
Other Regression-Based Models
Variance of y is a function of the mean value of y, not a constant
Logistic regression: models the prob. of some event occurring as a
linear function of a set of predictor variables
Poisson regression: models the data that exhibit a Poisson
distribution
Log-linear models: (for categorical data) Approximate discrete multidimensional prob. distributions
Also useful for data compression and smoothing
Regression trees and model trees Trees to predict continuous values rather than class labels
Regression Trees and Model Trees
Regression tree: proposed in CART system (Breiman et al.
1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the training
tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model—a multivariate linear equation
for the predicted attribute
A more general case than regression tree
Regression and model trees tend to be more accurate than
linear regression when the data are not represented well by a
simple linear model
Predictive Modeling in Multidimensional Databases

Predictive modeling: Predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction
Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Prediction: Numerical Data
Prediction: Categorical Data
Classifier Accuracy Measures

           predicted C1      predicted C2
actual C1  True positive     False negative
actual C2  False positive    True negative

classes              buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.52

Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 – acc(M)
Given m classes, CM(i, j), an entry in the confusion matrix, indicates the # of tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis):
sensitivity = t-pos / pos   /* true positive recognition rate */
specificity = t-neg / neg   /* true negative recognition rate */
precision = t-pos / (t-pos + f-pos)
accuracy = sensitivity × pos / (pos + neg) + specificity × neg / (pos + neg)
This model can also be used for cost-benefit analysis
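A minimal sketch computing these measures from the buy_computer confusion matrix above:

```python
# Accuracy measures from the confusion matrix (rows = actual, columns = predicted).
t_pos, f_neg = 6954, 46     # actual yes
f_pos, t_neg = 412, 2588    # actual no

pos, neg = t_pos + f_neg, f_pos + t_neg
sensitivity = t_pos / pos                       # true positive recognition rate
specificity = t_neg / neg                       # true negative recognition rate
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

print(round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(accuracy, 4))
# 0.9934 0.8627 0.9441 0.9542
```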
UNIT IV- Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering
Methods
4. Partitioning Methods
5. Outlier Analysis
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis Finding similarities between data according to the
characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research) WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to their
house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Quality: What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on boththe similarity measure used by the method and its
implementation
The quality of a clustering method is also measuredby its ability to discover some or all of the hidden
patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric: Similarity is expressed in terms
of a distance function, typically metric: d (i, j)
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very different
for interval-scaled, boolean, categorical, ordinal, ratio, and
vector variables. Weights should be associated with different variables based
on applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
Requirements of Clustering in Data Mining

Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality Incorporation of user-specified constraints
Interpretability and usability
Data Structures
Data matrix (two modes):

$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

Dissimilarity matrix (one mode):

$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
Type of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:

$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$

where

$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$

Calculate the standardized measurement (z-score):

$z_{if} = \frac{x_{if} - m_f}{s_f}$

Using mean absolute deviation is more robust than using standard deviation
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the
similarity or dissimilarity between two data objects
Some popular ones include: Minkowski distance:
where i = ( x i1, x i2, …, x ip) and j = ( x j1, x j2, …, x jp) are two p-
dimensional data objects, and q is a positive integer
If q = 1, d is Manhattan distance
$d(i,j) = \sqrt[q]{\,|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q\,}$

$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is Euclidean distance:
$d(i,j) = \sqrt{\,|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2\,}$
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
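A minimal sketch of the Minkowski family used above (q = 1 gives Manhattan, q = 2 gives Euclidean; the example points are made up):

```python
# Minkowski distance between two p-dimensional points; illustrative values.
def minkowski(i, j, q):
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, 1))   # Manhattan: 7.0
print(minkowski(i, j, 2))   # Euclidean: 5.0
```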
Binary Variables
A contingency table for binary data (rows: Object i, columns: Object j):

            Object j
            1       0       sum
Object i 1  a       b       a+b
         0  c       d       c+d
       sum  a+c     b+d     p

Distance measure for symmetric binary variables:

$d(i,j) = \frac{b + c}{a + b + c + d}$

Distance measure for asymmetric binary variables:

$d(i,j) = \frac{b + c}{a + b + c}$

Jaccard coefficient (similarity measure for asymmetric binary variables):

$sim_{Jaccard}(i,j) = \frac{a}{a + b + c}$
Dissimilarity between Binary Variables

Example:

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
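The same dissimilarities can be checked with a short sketch (made-up variable names; Y/P are coded as 1, N as 0, and the symmetric attribute gender is left out):

```python
# Asymmetric binary dissimilarity d = (b + c) / (a + b + c) for the example above.
def asym_binary_d(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

jack = (1, 0, 1, 0, 0, 0)   # Fever, Cough, Test-1 .. Test-4
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asym_binary_d(jack, mary), 2))  # 0.33
print(round(asym_binary_d(jack, jim), 2))   # 0.67
print(round(asym_binary_d(jim, mary), 2))   # 0.75
```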
Nominal Variables
A generalization of the binary variable in that it can
take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching m: # of matches, p: total # of variables
Method 2: use a large number of binary variables creating a new binary variable for each of the M nominal
states
$d(i,j) = \frac{p - m}{p}$
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace x_if by its rank r_if ∈ {1, …, M_f}
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$

compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential
scale, such as Ae^{Bt} or Ae^{−Bt}
Methods:
treat them like interval-scaled variables—not a good choice! (why?—the scale can be distorted)
apply logarithmic transformation
y if = log(x if )
treat them as continuous ordinal data and treat their rank as interval-scaled
Variables of Mixed Types
A database may contain all the six types of
variables symmetric binary asymmetric binary nominal ordinal
symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects:

$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled: compute ranks r_if and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat z_if as interval-scaled
Vector Objects
Vector objects: keywords in documents, gene
features in micro-arrays, etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure
A variant: Tanimoto coefficient
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion,e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based:
A model is hypothesized for each of the clusters and the method tries to find the best fit of the data to that model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based: Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min(t_ip, t_jq)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max(t_ip, t_jq)
Average: avg distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg(t_ip, t_jq)
Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j)
Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j)
Medoid: one chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the “middle” of a cluster

$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

Radius: square root of the average distance from any point of the cluster to its centroid

$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

Diameter: square root of the average mean squared distance between all pairs of points in the cluster

$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{q=1}^{N} (t_{ip} - t_{iq})^2}{N(N-1)}}$
Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distance is minimized:

$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$
Given a k , find a partition of k clusters that optimizes the
chosen partitioning criterion Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the clusters
of the current partition (the centroid is the center, i.e.,
mean point , of the cluster) Assign each object to the cluster with the nearest
seed point
Go back to Step 2, stop when no more new
assignment
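A minimal sketch of these four steps (the toy data, k, and seed are made up; see also the example figure that follows):

```python
# Minimal k-means sketch: assign to nearest centroid, recompute means,
# repeat until no change. Everything here is illustrative.
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # initial seed points
    while True:
        # assign each object to the cluster with the nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
        # recompute centroids as the mean point of each cluster
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):          # stop when no more change
            return labels, centroids
        centroids = new_centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8], [8.0, 8.0], [8.5, 8.2], [7.9, 9.0]])
labels, centroids = kmeans(X, k=2)
print(labels)      # two well-separated groups, e.g. [0 0 0 1 1 1] or [1 1 1 0 0 0]
print(centroids)
```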
The K-Means Clustering Method — Example

[Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects and update the cluster means again, repeating until the assignment no longer changes]
Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as:
deterministic annealing and genetic algorithms
Weakness
Applicable only when mean is defined, then what about
categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Variations of the K-Means
Method
A few variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98) Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers !
Since an object with an extremely large value may substantially
distort the distribution of the data
K-Medoids: Instead of taking the mean value of the object in
a cluster as a reference point, medoids can be used, which is
the most centrally located object in a cluster.
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale
well for large data sets
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM with K = 2 — Total cost = 20: arbitrarily choose k objects as initial medoids and assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random and compute the total cost of swapping; Total cost = 26: swap O and O_random if the quality is improved; do the loop until no change]
PAM (Partitioning Around Medoids)
(1987)
PAM (Kaufman and Rousseeuw, 1987), built in Splus
Use real object to represent the cluster
Select k representative objects arbitrarily
For each pair of non-selected object h and selected object i ,
calculate the total swapping cost TCih
For each pair of i and h,
If TCih < 0, i is replaced by h
Then assign each non-selected object to the
most similar representative object repeat steps 2-3 until there is no change
PAM Clustering: Total swapping cost
$TC_{ih} = \sum_j C_{jih}$

[Figure: the four cases of the reassignment cost C_jih for an object j when the current medoid i is swapped with a non-medoid h (t denotes another medoid): C_jih = 0; C_jih = d(j, h) − d(j, i); C_jih = d(j, t) − d(j, i); C_jih = d(j, h) − d(j, t)]
What Is the Problem with PAM?

PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
PAM works efficiently for small data sets but does not scale well for large data sets: O(k(n−k)²) for each iteration, where n is # of data and k is # of clusters
Sampling-based method: CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM
on each sample, and gives the best clustering as the
output
Strength: deals with larger data sets than PAM
Weakness: Efficiency depends on the sample size
A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample
is biased
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized Search)
(Ng and Han’94) CLARANS draws sample of neighbors dynamically
The clustering process can be presented as searching a graph
where every node is a potential solution, that is, a set of k
medoids
If the local optimum is found, CLARANS starts with new randomlyselected node in search for a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may further
improve its performance (Ester et al.’95)
What Is Outlier Discovery?
What are outliers? The set of objects that are considerably dissimilar from the remainder of the data
Example: Sports: Michael Jordan, Wayne Gretzky, ...
Problem: Define and find outliers in large data sets Applications:
Credit card fraud detection Telecom fraud detection Customer segmentation Medical analysis
Statistical
Approaches
Assume a model: an underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests depending on
data distribution distribution parameter (e.g., mean, variance) number of expected outliers
Drawbacks most tests are for single attribute
In many cases, data distribution may not be known
Outlier Discovery: Distance-Based Approach
Introduced to counter the main limitations imposedby statistical methods We need multi-dimensional analysis without knowing data
distribution
Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers (a minimal sketch follows):
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
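A minimal nested-loop sketch of the DB(p, D) definition (the dataset, p, and D are made up; here the fraction is taken over the other objects):

```python
# Minimal DB(p, D)-outlier sketch: O is an outlier if at least a fraction p
# of the remaining objects lie at a distance greater than D from O.
import numpy as np

def db_outliers(X, p, D):
    outliers = []
    for i, o in enumerate(X):
        dists = np.linalg.norm(X - o, axis=1)
        far = np.sum(dists > D) / (len(X) - 1)   # fraction of other objects farther than D
        if far >= p:
            outliers.append(i)
    return outliers

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2], [9.0, 9.0]])
print(db_outliers(X, p=0.9, D=3.0))   # [4] -> the point far from the dense group
```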
Local Outlier
Detection
Distance-based outlier detection is based on the global distance distribution
It encounters difficulties in identifying outliers if the data is not uniformly distributed
Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are 2 outlier points o1, o2
A distance-based method cannot identify o2 as an outlier
Need the concept of a local outlier
Local outlier factor (LOF)
Assume outlier is not crisp
Each point has a LOF
Outlier Discovery: Deviation-Based
Approach
Identifies outliers by examining the main
characteristics of objects in a group Objects that “deviate” from this description are
considered outliers
Sequential exception technique
simulates the way in which humans can distinguish