Data Mining Overview, Professor P. Batchelor, Furman University
TRANSCRIPT
Data Mining
What is Data Mining?
  “The process of semi-automatically analyzing large databases to find useful patterns” (Silberschatz)
  KDD – “Knowledge Discovery in Databases”
  “Attempts to discover rules and patterns from data”
    Discover rules
    Make predictions
Areas of Use
  Internet – Discover needs of customers
  Economics – Predict stock prices
  Science – Predict environmental change
  Medicine – Match patients with similar problems → cure
Example of Data Mining
A credit card company wants to discover information about its clients from its databases. It wants to find:
  Clients who respond to promotions in “junk mail”
  Clients who are likely to switch to a competitor
  Clients who are likely not to pay
  Services that clients use, in order to promote services affiliated with the credit card company
  Anything else that may help the company provide or promote services to help its clients and ultimately make more money
Data Mining & Data Warehousing
Data Warehouse: “is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.” (Silberschatz)
  Collect data
  Store in a single repository
  Allows for easier query development, as a single repository can be queried
Data Mining:
  Analyzing databases or data warehouses to discover patterns in the data and gain knowledge
  Knowledge is power.
Data Mining Techniques
  Classification
  Clustering
  Regression (we have already looked at this)
  Association Rules
Classification
Classification: Given a set of items that belong to several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item.
  Classify the new item and identify the class to which it belongs
Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes “Responds Rarely”, “Responds Sometimes”, and “Responds Frequently”.
  The bank will then attempt to find rules about the customers who respond frequently and sometimes.
  The rules could be used to predict the needs of potential customers.
Technique for Classification
Decision-Tree Classifiers
[Figure: decision tree predicting the credit risk of a person with the jobs specified. The root node tests Job (Carpenter, Engineer, Doctor); each branch leads to an Income node; the leaves are Bad or Good. Income thresholds shown in the figure: <30K, <40K, <50K, >50K, >90K, >100K.]
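A decision tree of this shape can be written as nested tests: first on Job, then on Income. The sketch below is only an illustration. The transcript lists the income thresholds but not which job each one belongs to, so the single cut-off per job used here (ASSUMED_GOOD_ABOVE) is an assumption, and the function name credit_risk is hypothetical.

```python
# Assumed cut-offs: the slide shows the thresholds but not their pairing with jobs,
# so this mapping is a guess made purely for illustration.
ASSUMED_GOOD_ABOVE = {
    "Carpenter": 50_000,
    "Engineer": 90_000,
    "Doctor": 100_000,
}


def credit_risk(job: str, income: float) -> str:
    """Walk a two-level decision tree: test Job first, then Income."""
    if job not in ASSUMED_GOOD_ABOVE:
        # The tree has no branch for jobs it never saw in training.
        raise ValueError(f"no branch for job {job!r}")
    # Second level: compare income with the cut-off on this job's branch.
    return "Good" if income > ASSUMED_GOOD_ABOVE[job] else "Bad"


print(credit_risk("Carpenter", 60_000))
print(credit_risk("Doctor", 60_000))
```

In practice the tree is learned from training instances rather than written by hand; the point here is only that a learned tree reduces to a chain of simple tests.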
Clustering
“Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.”
Example: An insurance company could use clustering to group clients by their age, location, and types of insurance purchased.
The categories are not specified in advance; this is referred to as “unsupervised learning”.
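One simple way to form such groups without predefined categories is bottom-up (agglomerative) clustering: start with every record in its own cluster and repeatedly merge the two closest clusters. The sketch below does this for a single numeric attribute. The client ages are invented, and the function name agglomerate and the choice of single-linkage distance are assumptions made for illustration.

```python
def agglomerate(points, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    clusters = [[p] for p in points]

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest gap.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


ages = [22, 25, 27, 41, 43, 68, 70]  # made-up client ages
groups = agglomerate(ages, 3)
print(groups)
```

No class labels are supplied anywhere; the three age groups emerge from the distances alone, which is what makes this unsupervised.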
Clustering
Group data into clusters
  Similar data is grouped in the same cluster
  Dissimilar data is grouped in a different cluster
How is this achieved?
  Hierarchical
    Group data into trees
  K-Nearest Neighbor
    A classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer).
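The k-nearest-neighbor rule described above can be sketched in a few lines. The training points and labels below are invented, and the function name knn_classify is hypothetical; Euclidean distance is assumed as the distance measure.

```python
from collections import Counter
from math import dist


def knn_classify(point, training, k=3):
    """Classify `point` by majority vote among its k nearest training points.

    `training` is a list of (coordinates, label) pairs.
    """
    # Sort training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated classes.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
print(knn_classify((2, 2), training))
print(knn_classify((7, 7), training))
```

Note that, unlike clustering, this needs labeled training data; an odd k avoids ties when there are two classes.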
Association Rules
“An association algorithm creates rules that describe how often events have occurred together.”
Example: When a customer buys a hammer, then 90% of the time they will also buy nails.
Association Rules
Support: “is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule”
Example:
  People who buy hotdog buns also buy hotdog sausages in 99% of cases = high support
  People who buy hotdog buns buy hangers in 0.005% of cases = low support
  (Support counts the purchases that contain both items as a fraction of all purchases, not only of the purchases that contain buns.)
Situations where there is high support for the antecedent are worth careful attention
  E.g., hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.
Association Rules
Confidence: “is a measure of how often the consequent is true when the antecedent is true.”
Example:
  90% of hotdog bun purchases are accompanied by hotdog sausages.
High confidence is meaningful because we can derive rules.
  Hotdog bun → Hotdog sausage
Two rules may have different confidence levels and have the same support.
  E.g., Hotdog sausage → Hotdog bun may have a much lower confidence than Hotdog bun → Hotdog sausage, yet both can have the same support.
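Support and confidence can both be computed by counting transactions. The sketch below uses a handful of invented shopping baskets; the function names support and confidence are illustrative, not taken from any library.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Invented baskets: sausages are bought more often than buns.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages"},
    {"sausages"},
    {"sausages", "mustard"},
    {"milk"},
]

print(support(baskets, {"buns", "sausages"}))
print(confidence(baskets, {"buns"}, {"sausages"}))
print(confidence(baskets, {"sausages"}, {"buns"}))
```

Both rules share the same support, because support depends only on the set of items involved, while the two confidences differ because the denominators (baskets with buns versus baskets with sausages) differ.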
Advantages of Data Mining
Provides new knowledge from existing data
  Public databases
  Government sources
  Company databases
Old data can be used to develop new knowledge
New knowledge can be used to improve services or products
Improvements lead to:
  Bigger profits
  More efficient service
Uses of Data Mining
Sales/Marketing
  Diversify target market
  Identify clients’ needs to increase response rates
Risk Assessment
  Identify customers who pose a high credit risk
Fraud Detection
  Identify people misusing the system, e.g., people who have two Social Security numbers
Customer Care
  Identify customers likely to change providers
  Identify customer needs
Privacy Concerns
Effective data mining requires large sources of data
To achieve a wide spectrum of data, link multiple data sources
Linking sources can be problematic for privacy, as follows:
  If the following histories of a customer were linked:
    Shopping history
    Credit history
    Bank history
    Employment history
  the user’s life story can be painted from the collected data
Linking to Re-identify Data
[Figure: two overlapping sets of attributes.
  Medical Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
  Shared by both: ZIP, Birth date, Sex
  Voter List: Name, Address, Date registered, Party affiliation, Date last voted]
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.
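The linking attack amounts to a join on the attributes the two data sets share (ZIP, birth date, sex). The sketch below shows the idea on a few records. Every record is invented for illustration, and the field names and the function name link are assumptions.

```python
# All records below are invented for illustration only.
medical = [  # "anonymous" medical data: no names
    {"zip": "29613", "birth_date": "1960-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "29613", "birth_date": "1975-02-14", "sex": "M", "diagnosis": "fracture"},
]
voters = [  # public voter list: names included
    {"name": "Jane Example", "zip": "29613", "birth_date": "1960-07-31", "sex": "F"},
    {"name": "John Sample", "zip": "29601", "birth_date": "1975-02-14", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link(medical, voters):
    """Return (name, diagnosis) pairs where the shared attributes match exactly one voter."""
    def key(record):
        return tuple(record[field] for field in QUASI_IDENTIFIERS)

    # Index voter names by their combination of shared attributes.
    names_by_key = {}
    for voter in voters:
        names_by_key.setdefault(key(voter), []).append(voter["name"])

    matches = []
    for record in medical:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            matches.append((names[0], record["diagnosis"]))
    return matches


print(link(medical, voters))
```

Neither data set alone ties a name to a diagnosis; the combination does whenever the shared attributes single out one person.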
Perceived Concerns
Data mining lets you find out about my private life
  I don’t want you, my insurance company, or the government knowing everything
Data mining doesn’t always get it right
  I don’t want to be put in jail because data mining said so
  I don’t want to be denied credit, a job, or insurance because data mining said so
Real Concerns
Data mining lets you find out about my private life
  Learned models allow conjectures
  Learning the model requires collecting data
Data mining doesn’t always get it right
  Our legal system is supposed to ensure due process
  Data mining typically allows businesses to take risks they otherwise wouldn’t
    Identify people we can give instant credit
  But without data mining, decisions would be slower and probably more restrictive. Why is credit so easy to get, even though bankruptcies are up?
Total Information Awareness (TIA).
The Information Awareness Office (IAO) was established by the Defense Advanced Research Projects Agency in January 2002 to bring together several DARPA projects focused on applying surveillance and information technology to track and monitor terrorists and other threats to U.S. national security, by achieving Total Information Awareness (TIA).
Following public criticism that the development and deployment of this technology could potentially lead to a mass surveillance system, the IAO was defunded by Congress in 2003.
However, several IAO projects continued to be funded and were merely run under different names.
Evidence Extraction and Link Discovery
Development of technologies and tools for automated discovery, extraction and linking of sparse evidence contained in large amounts of classified and unclassified data sources (such as phone call records from the NSA call database, internet histories, or bank records)
Design systems with the ability to extract data from multiple sources (e.g., text messages, social networking sites, financial records, and web pages).
Detect patterns comprising multiple types of links between data items or people communicating (e.g., financial transactions, communications, travel, etc.).
Designed to link items relating to potential "terrorist" groups and scenarios, and to learn patterns of different groups or scenarios to identify new organizations and emerging threats.
Scalable Social Network Analysis
Aimed at developing techniques based on social network analysis for modeling the key characteristics of terrorist groups and discriminating these groups from other types of societal groups.
Sean McGahan, of Northeastern University, said the following in his study of SSNA:
The purpose of the SSNA algorithms program is to extend techniques of social network analysis to assist with distinguishing potential terrorist cells from legitimate groups of people ... In order to be successful SSNA will require information on the social interactions of the majority of people around the globe. Since the Defense Department cannot easily distinguish between peaceful citizens and terrorists, it will be necessary for them to gather data on innocent civilians as well as on potential terrorists.
Does this worry you or make you feel more secure?
Human ID project
The Human Identification at a Distance (HumanID) project developed automated biometric identification technologies to detect, recognize and identify humans at great distances for "force protection", crime prevention, and "homeland security/defense" purposes.
Its goals included programs to:
  Develop algorithms for locating and acquiring subjects out to 150 meters (500 ft) in range.
  Fuse face and gait recognition into a 24/7 human identification system.
  Develop and demonstrate a human identification system that operates out to 150 meters (500 ft) using visible imagery.
  Develop a low power millimeter wave radar system for wide field of view detection and narrow field of view gait classification.
  Characterize gait performance from video for human identification at a distance.
  Develop a multi-spectral infrared and visible face recognition system.
Solutions
Data mining lets you find out about my private life
  Privacy-preserving data mining
Data mining doesn’t always get it right
  Data scientists know it and are working on it
  Educate the user
Privacy-Preserving Data Mining
Data Perturbation
Construct a data set with noise added
  Can be released without revealing private data
Miners are given the perturbed data set
  Reconstruct the distribution to improve results
Solutions out there
  Decision trees, association rules
Debate: Does it really preserve privacy?
  Can we prove impossibility of noise removal?
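The perturbation idea can be illustrated with additive noise: each value is released with random noise added, and the miner uses the known noise distribution to recover aggregate properties. The sketch below recovers only the mean and variance, which is a much simpler stand-in for the full distribution-reconstruction methods the slide alludes to. The ages are randomly generated, and the noise level NOISE_SD and the function names are assumptions.

```python
import random
from statistics import mean, pvariance

NOISE_SD = 10.0  # assumed noise level, known to both data owner and miner


def perturb(values, rng):
    """Release each value with independent zero-mean Gaussian noise added."""
    return [v + rng.gauss(0.0, NOISE_SD) for v in values]


def estimate_original_stats(perturbed):
    """Estimate the mean and variance of the hidden values from the noisy ones."""
    # Zero-mean noise does not shift the mean.
    est_mean = mean(perturbed)
    # Independent noise adds its own variance, so subtract it back out.
    est_var = max(pvariance(perturbed) - NOISE_SD ** 2, 0.0)
    return est_mean, est_var


rng = random.Random(0)
true_ages = [rng.randint(20, 70) for _ in range(10_000)]  # invented private data
released = perturb(true_ages, rng)
est_mean, est_var = estimate_original_stats(released)

print(round(mean(true_ages), 1), round(est_mean, 1))
print(round(pvariance(true_ages), 1), round(est_var, 1))
```

Individual released values can be far from the true ones, yet the aggregates are recovered closely. Whether an attacker can also strip the noise from individual records is exactly the open debate noted above.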
Privacy-Preserving Data Mining
Distributed Data Mining
Data owners keep their data
  Collaborate to get data mining results
Encryption techniques to preserve privacy
  Proofs that private data is not disclosed
Solutions for decision trees, association rules, clustering
  Different solutions needed depending on how data is distributed and on privacy constraints
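One building block often used in this setting is a secure sum: several data owners learn the total of their private counts (for example, how many local transactions support a rule) without revealing the individual counts. The sketch below uses additive secret sharing, which is one possible technique and not necessarily the one the slides have in mind. It simulates all parties inside one process; the per-site counts, MODULUS, and the function names are assumptions.

```python
import random

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible total


def make_shares(secret, n_parties, rng):
    """Split `secret` into n shares that sum to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def secure_sum(local_counts, rng):
    """Simulate the protocol: each party splits its count and sends one share to each party."""
    n = len(local_counts)
    # Row i holds the shares party i sends out; column j is what party j receives.
    sent = [make_shares(count, n, rng) for count in local_counts]
    # Each party publishes only the sum of the shares it received.
    partials = [sum(sent[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Adding the published partial sums yields the total of all counts.
    return sum(partials) % MODULUS


local_counts = [120, 45, 300]  # invented per-site counts
total = secure_sum(local_counts, random.SystemRandom())
print(total)
```

Each individual share is a uniformly random number, so no single message reveals a site's count. In a real deployment the shares would travel over a network between separate parties, and the guarantee depends on assumptions such as parties not colluding.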
What Next?
Data mining lets you find out about my private life
  Constraints that allow us to restrict what models can be learned
  Can we ensure that data mining won’t produce results that are amenable to misuse? (e.g., 100% confidence models)
    Redlining example
Data mining doesn’t always get it right
  Educate the public
    What data mining does (and doesn’t do)