Data Mining Overview

Professor P. Batchelor Furman University

Overview

- Introduction
- Explanation of Data Mining
- Techniques
- Advantages
- Applications
- Privacy

Data Mining

What is Data Mining?
- “The process of semi-automatically analyzing large databases to find useful patterns” (Silberschatz)
- KDD: “Knowledge Discovery in Databases”
- “Attempts to discover rules and patterns from data”
  - Discover rules
  - Make predictions

Areas of Use
- Internet: discover needs of customers
- Economics: predict stock prices
- Science: predict environmental change
- Medicine: match patients with similar problems to find a cure

Example of Data Mining

A credit card company wants to discover information about its clients from its databases. It wants to find:
- Clients who respond to promotions in “junk mail”
- Clients who are likely to switch to a competitor
- Clients who are likely not to pay
- Services that clients use, in order to promote services affiliated with the credit card company
- Anything else that may help the company provide or promote services that help its clients and ultimately make more money

Data Mining & Data Warehousing

Data Warehouse: “is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.” (Silberschatz)

- Collect data
- Store in a single repository
- Allows for easier query development, as a single repository can be queried

Data Mining:
- Analyzing databases or data warehouses to discover patterns in the data and gain knowledge
- Knowledge is power

Discovery of Knowledge

Data Mining Techniques

- Classification
- Clustering
- Regression (we have already looked at this)
- Association Rules

Classification

Classification: given a set of items that fall into several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item.
- Classify the new item and identify the class to which it belongs
- Example: a bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes “Responds Rarely”, “Responds Sometimes”, and “Responds Frequently”.
- The bank will then attempt to find rules about the customers who respond frequently and sometimes.
- The rules could be used to predict the needs of potential customers.

Technique for Classification

Decision-Tree Classifiers

[Figure: decision tree with Job and Income nodes. Job branches: Carpenter, Engineer, Doctor. Income thresholds shown: <30K, <40K, <50K, >50K, >90K, >100K. Leaves: Bad, Good.]

Predicting the credit risk of a person with the jobs specified.
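A decision tree of this kind can be read as nested tests. The sketch below is only an illustration: it assumes the tree branches on job first and then on income, and it uses one made-up income cutoff per job, because the slide does not say how its thresholds pair with the jobs. The name `predict_credit_risk` is hypothetical.

```python
# Hypothetical income cutoffs per job (assumed for illustration only;
# they are not read off the slide's tree).
INCOME_CUTOFF = {
    "carpenter": 30_000,
    "engineer": 40_000,
    "doctor": 50_000,
}


def predict_credit_risk(job: str, income: float) -> str:
    """Walk a two-level tree: split on job, then on income."""
    cutoff = INCOME_CUTOFF.get(job.lower())
    if cutoff is None:
        raise ValueError(f"no branch for job {job!r}")
    # Below the job's cutoff is treated as a bad risk, otherwise good.
    return "bad" if income < cutoff else "good"


print(predict_credit_risk("Engineer", 35_000))
print(predict_credit_risk("Doctor", 120_000))
```

A real decision-tree classifier would learn the splits and cutoffs from training instances rather than have them written by hand.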

Clustering

“Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.”
- Example: an insurance company could use clustering to group clients by their age, location, and types of insurance purchased.
- The categories are unspecified; this is referred to as “unsupervised learning”.
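As a rough illustration of grouping without predefined categories, here is a minimal sketch of bottom-up (agglomerative) clustering with single linkage, which is one of many possible clustering algorithms and is not named by the quotation above. It clusters clients by age only. The ages are made up and the function name `agglomerative` is hypothetical.

```python
from itertools import combinations


def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain (single linkage)."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters whose closest members are nearest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i] += clusters[j]  # i < j, so deleting j leaves i in place
        del clusters[j]
    return [sorted(c) for c in clusters]


# Made-up client ages, for illustration only.
ages = [22, 25, 27, 41, 44, 67, 70]
print(agglomerative(ages, 3))
```

Nothing tells the algorithm what the groups mean; it only uses distances between records, which is why this counts as unsupervised learning.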

Clustering

Group data into clusters
- Similar data is grouped in the same cluster
- Dissimilar data is grouped in a different cluster

How is this achieved?
- Hierarchical: group data into t-trees
- K-Nearest Neighbor: a classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer).
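The k-nearest-neighbor rule described above can be sketched in a few lines. This is a minimal version assuming Euclidean distance and a simple majority vote. The training points, the class labels, and the name `knn_classify` are invented for the example.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Return the majority class among the k training points nearest to `point`.

    `training` is a list of (coordinates, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: (age, income in thousands) -> response class.
training = [
    ((25, 30), "rarely"),
    ((27, 35), "rarely"),
    ((45, 80), "frequently"),
    ((50, 90), "frequently"),
    ((48, 85), "frequently"),
]
print(knn_classify((47, 82), training, k=3))
```

In practice the features would usually be rescaled first, so that one attribute (here income) does not dominate the distance.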

Association Rules

“An association algorithm creates rules that describe how often events have occurred together.”
- Example: when a customer buys a hammer, 90% of the time they will also buy nails.

Association Rules

Support: “is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule”
- Example:
  - Hotdog buns and hotdog sausages are bought together in a large fraction of all purchases = high support
  - Hotdog buns and hangers are bought together in 0.005% of all purchases = low support
- Rules with high support are worth careful attention
  - E.g. hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.
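Support, as defined in the quotation, is a fraction of the whole population of transactions. A minimal sketch of that calculation follows. The baskets are made up and the helper name `support` is hypothetical.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


# Made-up baskets, for illustration only.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages", "mustard"},
    {"buns", "sausages"},
    {"sausages"},
    {"hangers"},
]
print(support(baskets, {"buns", "sausages"}))  # 3 of 5 baskets
print(support(baskets, {"buns", "hangers"}))   # 0 of 5 baskets
```

The denominator is every basket, not only the baskets that contain buns, which is what separates support from confidence.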

Association Rules

Confidence: “is a measure of how often the consequent is true when the antecedent is true.”
- Example: 90% of hotdog bun purchases are accompanied by hotdog sausages.
- High confidence is meaningful, as we can derive rules: Hotdog bun => Hotdog sausage
- Two rules may have different confidence levels and the same support.
  - E.g. Hotdog sausage => Hotdog bun may have a much lower confidence than Hotdog bun => Hotdog sausage, yet both can have the same support.
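The point that two rules can share the same support yet differ in confidence is easy to check on a toy data set. The sketch below uses made-up baskets and a hypothetical helper named `confidence`. Both rules cover the same three baskets, so their support is identical, but their confidence is not.

```python
def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        raise ValueError("antecedent never occurs")
    both = sum(1 for t in with_antecedent if consequent <= set(t))
    return both / len(with_antecedent)


# Made-up baskets: buns always come with sausages, but not the reverse.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages"},
    {"buns", "sausages"},
    {"sausages"},
    {"sausages"},
    {"sausages", "eggs"},
]
print(confidence(baskets, {"buns"}, {"sausages"}))  # 3 of 3 bun baskets
print(confidence(baskets, {"sausages"}, {"buns"}))  # 3 of 6 sausage baskets
```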

Advantages of Data Mining

- Provides new knowledge from existing data
  - Public databases
  - Government sources
  - Company databases
- Old data can be used to develop new knowledge
- New knowledge can be used to improve services or products
- Improvements lead to:
  - Bigger profits
  - More efficient service

Uses of Data Mining

- Sales/Marketing
  - Diversify target market
  - Identify clients’ needs to increase response rates
- Risk Assessment
  - Identify customers who pose a high credit risk
- Fraud Detection
  - Identify people misusing the system, e.g. people who have two Social Security numbers
- Customer Care
  - Identify customers likely to change providers
  - Identify customer needs

Applications of Data Mining

Source IDC 1998

Privacy Concerns

- Effective data mining requires large sources of data
- To achieve a wide spectrum of data, link multiple data sources
- Linking sources can be problematic for privacy. Suppose the following histories of a customer were linked:
  - Shopping history
  - Credit history
  - Bank history
  - Employment history
- The user’s life story can be painted from the collected data

Linking to Re-identify Data

Medical Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge, ZIP, Birth date, Sex

Voter List: ZIP, Birth date, Sex, Name, Address, Date registered, Party affiliation, Date last voted

Fields common to both: ZIP, Birth date, Sex

L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.

{date of birth, gender, 5-digit ZIP} uniquely identifies 87.1% of the USA population.
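The linking described here amounts to a join on the fields the two data sets share. The sketch below shows the idea on invented records. The field names, the records, and the function `link` are all assumptions made for the example, not real data or a real tool.

```python
# All records below are invented for illustration.
medical = [
    {"zip": "29613", "birth_date": "1960-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "29605", "birth_date": "1975-02-14", "sex": "M", "diagnosis": "diabetes"},
]
voters = [
    {"zip": "29613", "birth_date": "1960-07-31", "sex": "F", "name": "Jane Example"},
    {"zip": "29601", "birth_date": "1982-11-02", "sex": "M", "name": "John Sample"},
]

QUASI_IDENTIFIER = ("zip", "birth_date", "sex")


def link(medical_rows, voter_rows, keys=QUASI_IDENTIFIER):
    """Join the two tables on the shared fields, attaching names to diagnoses."""
    by_key = {}
    for v in voter_rows:
        by_key.setdefault(tuple(v[k] for k in keys), []).append(v)
    matches = []
    for m in medical_rows:
        candidates = by_key.get(tuple(m[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the patient
            matches.append((candidates[0]["name"], m["diagnosis"]))
    return matches


print(link(medical, voters))
```

Neither table is sensitive in quite this way on its own: the medical rows carry no names and the voter rows carry no diagnoses. The combination is what exposes the individual.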

Perceived Concerns

- Data mining lets you find out about my private life
  - I don’t want you, my insurance company, or the government knowing everything
- Data mining doesn’t always get it right
  - I don’t want to be put in jail because data mining said so
  - I don’t want to be denied credit, a job, or insurance because data mining said so

Real Concerns

- Data mining lets you find out about my private life
  - Learned models allow conjectures
  - Learning the model requires collecting data
- Data mining doesn’t always get it right
  - Our legal system is supposed to ensure due process
  - Data mining typically allows businesses to take risks they otherwise wouldn’t
    - Identify people we can give instant credit
  - But without data mining, decisions would be slower, and probably more restrictive. Why is credit so easy to get, even though bankruptcies are up?

Data Mining and Terrorism

Total Information Awareness (TIA).

The Information Awareness Office (IAO) was established by the Defense Advanced Research Projects Agency in January 2002 to bring together several DARPA projects focused on applying surveillance and information technology to track and monitor terrorists and other threats to U.S. national security, by achieving Total Information Awareness (TIA).

Following public criticism that the development and deployment of this technology could potentially lead to a mass surveillance system, the IAO was defunded by Congress in 2003.

However, several IAO projects continued to be funded and were merely run under different names.

Evidence Extraction and Link Discovery

Development of technologies and tools for automated discovery, extraction and linking of sparse evidence contained in large amounts of classified and unclassified data sources (such as phone call records from the NSA call database, internet histories, or bank records)

Design systems with the ability to extract data from multiple sources (e.g., text messages, social networking sites, financial records, and web pages).

Detect patterns comprising multiple types of links between data items or people communicating (e.g., financial transactions, communications, travel, etc.).

Designed to link items relating to potential "terrorist" groups and scenarios, and to learn patterns of different groups or scenarios to identify new organizations and emerging threats.

Scalable Social Network Analysis

Aimed at developing techniques based on social network analysis for modeling the key characteristics of terrorist groups and discriminating these groups from other types of societal groups.

Sean McGahan of Northeastern University said the following in his study of SSNA:

The purpose of the SSNA algorithms program is to extend techniques of social network analysis to assist with distinguishing potential terrorist cells from legitimate groups of people ... In order to be successful SSNA will require information on the social interactions of the majority of people around the globe. Since the Defense Department cannot easily distinguish between peaceful citizens and terrorists, it will be necessary for them to gather data on innocent civilians as well as on potential terrorists.

Does this worry you or make you feel more secure?

Human ID project

The Human Identification at a Distance (HumanID) project developed automated biometric identification technologies to detect, recognize and identify humans at great distances for "force protection", crime prevention, and "homeland security/defense" purposes.

Its goals included programs to:
- Develop algorithms for locating and acquiring subjects out to 150 meters (500 ft) in range.
- Fuse face and gait recognition into a 24/7 human identification system.
- Develop and demonstrate a human identification system that operates out to 150 meters (500 ft) using visible imagery.
- Develop a low power millimeter wave radar system for wide field of view detection and narrow field of view gait classification.
- Characterize gait performance from video for human identification at a distance.
- Develop a multi-spectral infrared and visible face recognition system.

Solutions

- Data mining lets you find out about my private life
  - Privacy-preserving data mining
- Data mining doesn’t always get it right
  - Data scientists know it and are working on it
  - Educate the user

Privacy-Preserving Data Mining: Data Perturbation

- Construct a data set with noise added
  - Can be released without revealing private data
- Miners are given the perturbed data set
  - Reconstruct the distribution to improve results
- Solutions out there
  - Decision trees, association rules
- Debate: does it really preserve privacy?
  - Can we prove impossibility of noise removal?
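A minimal sketch of the perturbation idea, assuming independent zero-mean uniform noise (one possible choice of noise; the slide does not specify one). The data is synthetic and the name `perturb` is hypothetical. The sketch shows only the simplest aggregate, the mean, surviving the noise. Reconstructing the full distribution takes more machinery than is shown here.

```python
import random
from statistics import mean


def perturb(values, spread, rng):
    """Add independent zero-mean uniform noise in [-spread, spread] to each value."""
    return [v + rng.uniform(-spread, spread) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Synthetic "ages", generated here purely for illustration.
ages = [rng.randint(18, 90) for _ in range(10_000)]
released = perturb(ages, spread=20, rng=rng)

# Individual released values differ from the true ones...
print(ages[:3], [round(x, 1) for x in released[:3]])
# ...but the noise averages out, so the mean stays close.
print(round(mean(ages), 2), round(mean(released), 2))
```

Whether this protects individuals depends on how much of the noise an adversary can filter out, which is the debate raised above.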

Privacy-Preserving Data Mining: Distributed Data Mining

- Data owners keep their data
  - Collaborate to get data mining results
- Encryption techniques to preserve privacy
  - Proofs that private data is not disclosed
- Solutions for decision trees, association rules, clustering
  - Different solutions needed depending on how data is distributed and on the privacy constraints
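One standard building block for this kind of collaboration is a secure sum based on additive secret sharing. It is used here as an illustration and is not a protocol named on the slide. Each owner splits its private count into random shares, so no single share reveals anything, yet all shares together add up to the true total. The sketch simulates every party in one process and assumes the parties follow the protocol. The function names and the counts are made up.

```python
import secrets

MODULUS = 2**61 - 1  # any modulus larger than the true total works


def make_shares(value, n_parties):
    """Split `value` into n random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Total the parties' values without any party revealing its own."""
    n = len(private_values)
    # Row i holds the shares party i hands out; column j is what party j receives.
    dealt = [make_shares(v, n) for v in private_values]
    # Each party publishes only the sum of the shares it received.
    partial = [sum(dealt[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partial) % MODULUS


# Made-up local counts of baskets containing some itemset at three stores.
print(secure_sum([120, 75, 310]))  # 505
```

A global support count for an association rule could be computed this way without any store disclosing its own count.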

What Next?

- Data mining lets you find out about my private life
  - Constraints that allow us to restrict what models can be learned
  - Can we ensure that data mining won’t produce results that are amenable to misuse? (e.g., 100% confidence models)
    - Redlining example
- Data mining doesn’t always get it right
  - Educate the public
    - What data mining does (and doesn’t do)

Do You Agree?

There is a great difference between an inanimate machine knowing your secrets and a person knowing the same.

Political solutions can control how and why information goes from the machine to trusted analysts who can act on the knowledge.
