Data Mining
Edward, Hong Zhang
CS Dept, SUNY, Albany
CSI 668, March 20, 2001
Presentation Outline
Motivation
Background (KDD Process)
What’s Data Mining?
Why Data Mining?
The Data Mining Process
Data Mining Algorithms
Data Mining Research Trend
Existing Systems for Data Mining
Conclusions
Motivation
“Necessity is the mother of invention”
Data explosion problem:
Automated data collection tools, increasingly cheap storage devices, and mature database technology have led to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
We are drowning in data, but starving for knowledge! Data is everywhere.
Understanding and using data is an urgent task!
Solution: Knowledge Discovery (data warehousing and data mining)
Evolution of Database Technology
1960s-1970s:
Data collection, database creation, IMS and network DBMS.
1970s-1980s:
Relational data model, relational DBMS implementation.
1980s-1990s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
1990s-present:
Data mining and data warehousing, multimedia databases, and Web-based database technology.
Background
Knowledge Discovery (KD):
The process of finding general patterns/principles that summarize/explain a set of "observations".
Knowledge Discovery in Databases (KDD):
Very Large DataBases (VLDB) have become the industry standard, making it impossible for human beings to mine the data "by hand" for interesting patterns. Automated tools are therefore needed to help extract these patterns.
Background Cont.
Knowledge discovery in databases (KDD) consists of 3 steps:
Data Integration (Data Warehousing):
Collecting the target observations from the different data sources, removing noise from them, and integrating them into an appropriate format.
Data Mining (covered in detail later):
Applying a concrete algorithm to find useful and novel patterns in the integrated data.
Background Cont.
Pattern Evaluation:
Interpreting mined patterns, evaluating them according to usefulness/interestingness criteria, and possibly using visualization tools to aid in understanding the patterns graphically.
See KDD process graph below:
Data Mining: KDD process
[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data mining: the core of the knowledge discovery process.
What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (knowledge) or patterns from data in large databases, data warehouses or other information repositories.
What is not data mining?
(Deductive) query processing
Expert systems or machine learning/statistical programs
Online Analytical Processing (OLAP)
Software agents
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database and OLAP, High Performance Computing, Machine Learning (AI), Visualization, Information Science, Pattern Recognition, and Statistics, Modeling]
Why Data Mining? – Potential Applications
Database analysis and decision support systems (DSS)
Market analysis and management:
Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation.
Risk analysis and management:
Forecasting, customer retention, improved underwriting, quality control, competitive analysis.
Text mining (text databases, documents): keyword search and analysis.
DNA sequence analysis and gene expression.
Data Mining and Business Intelligence
[Figure: layered diagram with increasing potential to support business decisions. Layers as labeled: Data Sources (paper, files, information providers, database systems, OLTP); Data Warehouses / Data Marts (OLAP, MDA); Data Exploration (statistical analysis, querying and reporting); Visualization Techniques; Data Mining (useful patterns); Making Decisions. Roles shown alongside: DBA, Data Analyst, Business Analyst, End User]
Why Data Mining? – Potential Applications (Cont.)
Internet Web Surf-Aid (Web mining):
IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Sports:
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat.
The Data Mining Process
[Figure: the data mining process. Stages: training, evaluation, prediction. Components: data set, historical training data, new data, data mining algorithm, model, score model, data mining system, result patterns]
Examples of “Discovered” Patterns
Association rules: find rules between different attributes.
98% of AOL users also have eBay accounts.
Classification: classify data based on the values of a classifying attribute.
People younger than 40 with a salary > $40,000 trade on-line.
Clustering: group data to form new classes.
Users A and B access similar URLs, so they belong to the same group, which has similar user profiles.
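The clustering example can be made concrete with a small sketch. The slides do not name a similarity measure or a clustering method, so the Jaccard similarity, the `threshold` value, the leader-style grouping, and the `visits` data below are all assumptions chosen only for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical access logs: user -> set of URLs visited.
visits = {
    "A": {"/sports", "/sports/nba", "/news"},
    "B": {"/sports", "/sports/nba", "/weather"},
    "C": {"/finance", "/finance/stocks"},
}

def group_users(visits, threshold=0.4):
    """Single-pass grouping: a user joins the first group whose
    first member (the group's leader) is similar enough."""
    groups = []
    for user in sorted(visits):
        for group in groups:
            if jaccard(visits[user], visits[group[0]]) >= threshold:
                group.append(user)
                break
        else:
            groups.append([user])  # no similar group found: start a new one
    return groups

print(group_users(visits))
```

Users A and B share two of four distinct URLs (similarity 0.5), so they land in one group, while C starts its own.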
Are All the “Discovered” Patterns Interesting?
A data mining system/query may generate thousands of patterns, and not all of them are interesting.
Suggested approach: query-based, focused mining.
Interestingness measures: a pattern is interesting if it is
easily understood by humans,
valid on new or test data with some degree of certainty,
potentially useful,
novel, or validates some hypothesis that a user seeks to confirm.
How Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: completeness.
Can a data mining system find all the interesting patterns?
Search only for interesting patterns: optimization.
Can a data mining system find only the interesting patterns?
Approaches:
First generate all the patterns and then filter out the uninteresting ones.
Generate only the interesting patterns (mining query optimization).
Data Mining Algorithms
Four common DM algorithm types:
The k-Nearest Neighbor Algorithm (KNN)
Artificial Neural Network (ANN)
Rule Induction
Decision Trees
The k-Nearest Neighbor Algorithm (KNN)
A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset.
Uses the entire training database as the model.
Find the nearest data point and do the same thing as you did for that record.
[Figure: scatter plot of training points labeled + and - around a query point xq]
The k-Nearest Neighbor Algorithm (KNN) (Cont.)
Distance-weighted nearest neighbor algorithm:
Weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors.
Calculate the mean values of the k nearest neighbors.
Advantages:
Robust to noisy data by averaging the k nearest neighbors.
Very easy to implement.
Disadvantages:
Huge models (the entire training database).
More difficult to use in production.
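A minimal sketch of the distance-weighted vote described above. The weighting formula did not survive in the transcript, so the inverse squared distance used here is an assumed (though common) choice, and the function name `weighted_knn` and the toy training points are hypothetical.

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k=3):
    """Classify `query` by a distance-weighted vote of its k nearest neighbors.

    `train` is a list of (point, label) pairs; points are tuples of numbers.
    The weight 1 / d**2 is an assumption, not taken from the slides.
    """
    # The whole training set is the model: rank every record by
    # Euclidean distance to the query point.
    by_distance = sorted((math.dist(p, query), label) for p, label in train)
    votes = defaultdict(float)
    for d, label in by_distance[:k]:
        if d == 0.0:
            return label  # exact match: reuse that record's class
        votes[label] += 1.0 / d ** 2  # closer neighbors count for more
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((3.0, 4.0), "-"),
         ((5.0, 7.0), "-"), ((3.5, 5.0), "-")]
print(weighted_knn(train, (1.2, 1.5), k=3))
print(weighted_knn(train, (4.0, 5.0), k=3))
```

Because every prediction scans and sorts the full training set, the sketch also shows why the model is "huge" and awkward in production.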
Artificial Neural Networks (ANN)
Non-linear predictive models that learn through training and loosely resemble biological neural networks in structure.
Inputs are transformed through a network of simple processors.
Each processor combines (weighted) inputs and produces an output value.
Artificial neural networks (Cont.)
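[Figure: a single neuron. The input vector x = (x0, x1, ..., xn) and the weight vector w = (w0, w1, ..., wn) are combined into a weighted sum, which passes through the activation function f to give the output y]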
The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
(Learning Rate)
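The mapping from the input vector x to the output y can be sketched in a few lines. The slides do not say which activation function is used, so the `sigmoid` below is an assumption, and the names `neuron_output`, `x`, and `w` with their values are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, activation=sigmoid):
    """Map input vector x to y = f(w . x) for a single neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))  # scalar product
    return activation(weighted_sum)  # nonlinear function mapping

x = [1.0, 0.5, -1.0]   # hypothetical input vector
w = [0.2, 0.4, 0.1]    # hypothetical weight vector
y = neuron_output(x, w)
print(y)
```

Training a network amounts to adjusting the entries of `w`; this sketch covers only the forward computation.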
Multilayer Perceptron Artificial Neural Network
[Figure: a multilayer network. The input vector xi enters at the input nodes, passes through the hidden nodes, and the output nodes produce the output vector]
Artificial Neural Network Evaluation
Advantages:
Prediction accuracy is generally high.
Robust: still works when training examples contain errors.
Disadvantages:
Difficult to understand (the key problem): the neural network model is hard to interpret, and there is no intuitive understanding of the results.
Long training time: although the trained network processes inputs very quickly, the training process itself is time-consuming.
Significant pre-processing of data is often required.
Rule Induction
Rule induction (rule-based prediction):
We first generate a set of rules from a data warehouse, then use them to predict values for new data items.
It works much better on larger (and real) data sets, not just on samples of data.
Two phases:
Rule discovery: analyze a historical database and generate a set of rules by automatic discovery.
Prediction: apply the rules to a new data set and match the rules to make predictions.
Rule Induction Example
Training Set
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Rule Induction Example (Cont.)
4 attributes:
Outlook: sunny, overcast, rain (3 cases)
Temperature: hot, mild, cool (3 cases)
Humidity: high, normal (2 cases)
Windy: true, false (2 cases)
1 outcome: class (N: no class, P: have class)
In total there are 3*3*2*2 = 36 possible combinations, of which 14 are present in the set of input examples.
Rule Induction Example (Cont.)
Some rules induced from the above dataset:
Classification rules:
If outlook = sunny and humidity = high then class = N
If outlook = rain and windy = true then class = N
If outlook = overcast then class = P
Association rules:
If temperature = cool then humidity = normal
If windy = false and class = N then outlook = sunny and humidity = high
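The prediction phase can be sketched by checking these rules against the 14 training records. The rule encoding, the first-match strategy in `predict`, and the helper names are assumptions made for illustration; the outlook value is written `rain`, as it appears in the training set.

```python
# Training set from the example: outlook, temperature, humidity, windy, class.
DATA = """
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
"""
FIELDS = ("outlook", "temperature", "humidity", "windy", "cls")
records = [dict(zip(FIELDS, line.split())) for line in DATA.strip().splitlines()]

# Classification rules as (condition, predicted class) pairs.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "N"),
    ({"outlook": "rain", "windy": "true"}, "N"),
    ({"outlook": "overcast"}, "P"),
]

def matches(record, condition):
    """True if the record has every attribute value the condition asks for."""
    return all(record[k] == v for k, v in condition.items())

def predict(record):
    """Return the class of the first matching rule, or None if no rule fires."""
    for condition, cls in RULES:
        if matches(record, condition):
            return cls
    return None

def confidence(antecedent, consequent):
    """Fraction of records matching the antecedent that also match the consequent."""
    covered = [r for r in records if matches(r, antecedent)]
    return sum(matches(r, consequent) for r in covered) / len(covered)

predictions = [predict(r) for r in records]
covered = sum(p is not None for p in predictions)
correct = sum(p == r["cls"] for p, r in zip(predictions, records))
print(f"{covered} of {len(records)} records covered, {correct} predicted correctly")
print(confidence({"temperature": "cool"}, {"humidity": "normal"}))
print(confidence({"windy": "false", "cls": "N"},
                 {"outlook": "sunny", "humidity": "high"}))
```

The three classification rules cover 9 of the 14 records and classify all 9 correctly; the remaining 5 (sunny with normal humidity, rain without wind) match no listed rule. Both association rules hold with confidence 1.0 on this data.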
What Is a Decision Tree?
A decision tree is a flow-chart-like tree structure.
An internal node denotes a test on an attribute.
A branch represents an outcome of the test: all tuples in a branch have the same value for the tested attribute.
A leaf node represents a class label or class label distribution.
It is equivalent to a series of nested if/then rules, so it is understandable!
A Sample Decision Tree
(The same training set as in the rule induction example)
Outlook
  sunny: Humidity
    high: N
    normal: P
  overcast: P
  rain: Windy
    true: N
    false: P
Another Example for DT
If x=1 and y=0 then class = a
If x=0 and y=1 then class = a
If x=0 and y=0 then class = b
If x=1 and y=1 then class = b
Another Example for DT
Credit Analysis
salary  education      label
10000   high school    reject
40000   undergraduate  accept
15000   undergraduate  reject
75000   graduate       accept
18000   graduate       accept
Decision tree:
salary < 20000?
  no: accept
  yes: education in graduate?
    yes: accept
    no: reject
Decision-Tree Classification Methods
The basic top-down decision tree generation approach usually consists of two phases:
Tree construction:
At the start, all the training examples are at the root.
Partition the examples recursively based on selected attributes.
Tree pruning:
Aims at removing tree branches that may lead to errors when classifying test data (training data may contain noise, statistical fluctuations, …).
How to Construct a Tree?
Algorithm: a greedy algorithm
Make the locally best choice at each step: select the best attribute for each tree node.
Proceed in a top-down, recursive, divide-and-conquer manner:
from root to leaf,
split a node into several branches,
for each branch, recursively run the algorithm.
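The greedy, top-down procedure can be sketched as follows. The slides do not say how the "best" attribute is chosen, so information gain (as in ID3) is assumed here; the nested-dict tree encoding and the function names are likewise illustrative. The data is the weather training set from the rule induction example.

```python
import math
from collections import Counter

DATA = """
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
"""
ATTRS = ("outlook", "temperature", "humidity", "windy")
rows = [line.split() for line in DATA.strip().splitlines()]
examples = [(dict(zip(ATTRS, r[:4])), r[4]) for r in rows]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attr):
    """Entropy reduction obtained by splitting the examples on `attr`."""
    labels = [cls for _, cls in examples]
    remainder = 0.0
    for value in {rec[attr] for rec, _ in examples}:
        subset = [cls for rec, cls in examples if rec[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [cls for _, cls in examples]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    # Greedy step: pick the attribute with the highest gain at this node.
    best = max(attrs, key=lambda a: information_gain(examples, a))
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({rec[best] for rec, _ in examples}):
        subset = [(rec, cls) for rec, cls in examples if rec[best] == value]
        branches[value] = build_tree(subset, rest)  # recurse on each branch
    return {best: branches}

tree = build_tree(examples, list(ATTRS))
print(tree)
```

On this data the sketch splits first on outlook, then on humidity under sunny and on windy under rain, which is the same shape as the sample decision tree for this training set.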
How to Prune a Tree
A decision tree constructed using the training data may have too many branches/leaf nodes.
Caused by noise and overfitting.
May result in poor accuracy for unseen samples.
Prune the tree: merge a subtree into a leaf node.
Use a set of data different from the training data.
At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it with the majority class.
How to Use a Tree?
Directly:
Test the attribute values of the unknown sample against the tree.
A path is traced from the root to a leaf, which holds the label.
Indirectly:
The decision tree is converted to classification rules.
One rule is created for each path from the root to a leaf.
IF-THEN rules are easier for humans to understand.
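Both ways of using a tree can be sketched on the sample weather tree. The nested-dict encoding `{attribute: {value: subtree_or_label}}` and the function names `classify` and `to_rules` are assumptions made for this illustration.

```python
# The sample weather decision tree, written as nested dicts.
TREE = {"outlook": {
    "sunny": {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain": {"windy": {"true": "N", "false": "P"}},
}}

def classify(tree, sample):
    """Direct use: follow the path selected by the sample's attribute values."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[sample[attribute]]
    return tree  # a leaf holds the class label

def to_rules(tree, conditions=()):
    """Indirect use: emit one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        test = " and ".join(f"{a} = {v}" for a, v in conditions) or "true"
        return [f"IF {test} THEN class = {tree}"]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attribute, value),)))
    return rules

print(classify(TREE, {"outlook": "rain", "humidity": "high", "windy": "true"}))
for rule in to_rules(TREE):
    print(rule)
```

The tree has five leaves, so the conversion yields five rules, one per path.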
Decision tree for a covering algorithm
[Figure: three scatter plots of points labeled a and b in the x-y plane. The first shows the points alone, the second adds a split marked at 1.2 on the x axis, and the third adds a further split marked at 2.6]
Data Mining Algorithm Summary
KNN:
Quick and easy
Models tend to be very large
ANN:
Difficult to interpret
Can require significant amounts of time to train
Rule Induction:
Understandable
Need to limit calculations
Decision Trees:
Understandable
Relatively fast
Other DM Technologies
Genetic algorithms
Rough sets
Bayesian networks
Mixture models
Many more...
Data Mining Research Trend
Text mining: Text database and information retrieval
Multimedia data mining
OLAM (OLAP Mining)
Web mining (data mining and the WWW):
E-commerce
Information retrieval (search)
Network management
Why Mine the Web?
Web: a huge, widely-distributed, highly heterogeneous, semi-structured, hypertext/hypermedia, interconnected, evolving information repository.
The Web is a huge collection of documents plus
hyperlink information
access and usage information
Enormous wealth of information on the Web:
Financial information (e.g. stock quotes)
Book/CD/Video stores (e.g. Amazon)
Restaurant information (e.g. Zagats)
Car prices (e.g. Carpoint)
Lots of data on user access patterns:
Web logs contain the sequence of URLs accessed by users
Why Is Web Mining Different?
Huge: the Web is a huge collection of documents, plus hyperlink information and access and usage information.
Dynamic: the Web is very dynamic; new pages are constantly being generated.
Unstructured: the complexity of Web pages is far greater than that of a text document collection.
Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
exploit hyperlinks and access patterns
be incremental
Types of Web Mining
Web Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Structure Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking
Web Mining Applications
E-commerce (infrastructure):
Generate user profiles
Targeted advertising
Fraud detection
Similar image retrieval
Information retrieval (search) on the Web:
Automated generation of topic hierarchies
Web knowledge bases
Extraction of schema for XML documents
Network management:
Performance management
Fault management
Existing Systems for Data Mining
IBM: Intelligent Miner.
SAS Institute: Enterprise Miner.
Silicon Graphics: MineSet.
Integral Solutions Ltd.: Clementine.
Information Discovery Inc.: Data Mining Suite.
DBMiner Technology Inc.: DBMiner.
Rutgers: DataMine; GMD: Explora; Univ. Munich: VisDB.
Microsoft OLE DB for Data Mining
Microsoft OLE, OLE DB, OLE DB for OLAP and OLE DB for Data Mining
OLE DB for DM: standardization from July 1999 to March 2000
Microsoft SQL Server 2000: Analysis Manager
Analysis Manager consists of OLAP and Data Mining.
Data mining: two modules (classification/prediction and clustering).
OLE DB for DM: data mining providers (such as association modules and other classification or clustering modules)
Research Progress for Data Mining in the Last Decade
Multi-dimensional data analysis: data warehouse and OLAP (on-line analytical processing)
Association, correlation, and causality analysis
Classification: scalability and new approaches
Clustering and outlier analysis
Sequential patterns and time-series analysis
Text mining, Web mining and Weblog analysis
Spatial, multimedia, scientific data analysis
Data preprocessing and database compression
Data visualization and visual data mining
Conclusions
Knowledge Discovery in Databases (KDD)
Data warehouse: an industry trend
A DW stores a huge amount of subject-oriented, cleansed, integrated, consolidated, time-related data.
Data mining: a rich, promising, young field with broad applications and many challenging research issues.
Good science: a leading position in the research community.
Conclusions (Cont.)
Data mining tasks: characterization, association, classification, clustering, prediction, sequence and pattern analysis, etc.
Data mining algorithms:
The k-Nearest Neighbor Algorithm (KNN)
Artificial Neural Network (ANN)
Rule Induction
Decision Trees
Research progress and trends in data mining
Future Work
Theoretical foundations of data mining.
Implementation and new data mining methodologies:
A set of well-tuned, standard mining operators.
Data and knowledge visualization tools.
Integration of multiple data mining strategies.
Data mining in advanced information systems:
Spatial, multimedia, Web mining.
Data mining applications:
Content browsing, query optimization, multi-resolution model, etc.
Social issues:
A threat to security and privacy.