data mining

Data Mining Using IBM Intelligent Miner

Presented by: Qiyan (Jennifer) Huang

Outline

• Introduction
• Mining Process
• Main Functionalities of Intelligent Miner
• Other Data Mining Products
• Data Mining and Privacy
• Summary
• References

What is Data Mining

• Data mining: discovering interesting patterns from large amounts of data
  – Alternative names: knowledge discovery (mining) in databases (KDD), data/pattern analysis, information harvesting, business intelligence, etc.

Evolution of Database Technology

• 1960s:
  – Data collection, database creation
• 1970s:
  – Relational data model, relational DBMS implementation
• 1980s ~ present:
  – RDBMS, advanced data models
• 1990s-2000s:
  – Data mining and data warehousing, multimedia databases, and Web databases

Data Mining vs. Database Query

• Database
  – Find all customers who have purchased milk.
  – Identify customers who have purchased more than $10,000 in the last month.
• Data Mining
  – Find all items which are frequently purchased with milk. (Association rules)
  – Identify customers with similar buying habits. (Clustering)
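The contrast can be sketched in a few lines of Python. This is only an illustration: the `baskets` data and both function names are hypothetical and have nothing to do with Intelligent Miner's interface. The first function answers a fully specified query; the second looks for a pattern (co-purchases) that the question does not spell out in advance.

```python
from collections import Counter

# Hypothetical market-basket data: customer -> set of purchased items.
baskets = {
    "alice": {"milk", "bread", "eggs"},
    "bob": {"milk", "bread"},
    "carol": {"beer", "chips"},
    "dave": {"milk", "cereal", "bread"},
}

def customers_who_bought(item, baskets):
    """Database-style query: an exact filter on a known condition."""
    return sorted(name for name, items in baskets.items() if item in items)

def items_bought_with(item, baskets):
    """Mining-style question: count which other items co-occur with `item`."""
    counts = Counter()
    for items in baskets.values():
        if item in items:
            counts.update(items - {item})
    return counts.most_common()

milk_buyers = customers_who_bought("milk", baskets)
co_purchases = items_bought_with("milk", baskets)
print(milk_buyers)
print(co_purchases)
```

A real association miner would also apply support and confidence thresholds; the counting step here only shows where the pattern comes from.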

Data Mining Process (KDD)

[Figure: the KDD process. Databases -> Data Cleaning -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Source: J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2001

About DB2 Intelligent Miner

• DB2 Intelligent Miner for Data: "focused on the large-scale mining, such as large volumes of data, parallel data mining on Windows NT, Sun Solaris, and OS/390" – IBM

Main Functionalities

• Cluster analysis
  – Group the data that share similar trends and patterns
• Classification
  – Predict the outcome based on historical data
• Association analysis
  – Finding frequent patterns

age      income   student   credit rating   buys computer
<=30     high     no        fair
<=30     high     no        excellent
31…40    high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31…40    low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31…40    medium   no        excellent
31…40    high     yes       fair

This follows an example from Quinlan's ID3

Classification

age      income   student   credit rating   buys computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes

This follows an example from Quinlan's ID3
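ID3 chooses the attribute to split on by information gain. The sketch below runs that calculation over the 13 labelled rows of the table. The function and variable names are illustrative, and this is the textbook ID3 criterion, not a description of how Intelligent Miner builds its trees.

```python
import math
from collections import Counter

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# The 13 training rows: four attribute values followed by the class label.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on `column`."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

On these rows age has the largest gain, so an ID3-style learner would test age at the root.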


Association

– Association rule: identifies relationships among items
– Example: "30% of customers buy shirts in all the transactions; 60% of these customers will also buy a tie"
  • Confidence factor is 60%
  • Support: if buying a shirt and a tie together is observed in 12% of all transactions, then the support is 12%
  • Lift: confidence divided by the share of all transactions that contain the rule head (the tie); if ties appear in 30% of all transactions, lift = 60% / 30% = 2
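The three measures can be computed directly from transaction counts. The sketch below uses ten made-up transactions (not the slide's figures) and the common definitions: support is the share of transactions containing both body and head, confidence is that count divided by the transactions containing the body, and lift is confidence divided by the share of transactions containing the head. The function name `rule_metrics` is hypothetical.

```python
def rule_metrics(transactions, body, head):
    """Return (support, confidence, lift) for the rule body => head.

    Assumes the body and the head each occur in at least one transaction.
    """
    total = len(transactions)
    body_count = sum(1 for t in transactions if body <= t)
    head_count = sum(1 for t in transactions if head <= t)
    both_count = sum(1 for t in transactions if (body | head) <= t)
    support = both_count / total
    confidence = both_count / body_count
    lift = confidence / (head_count / total)
    return support, confidence, lift

# Hypothetical transactions: shirts appear in 5, ties in 4, both in 3.
transactions = [
    {"shirt", "tie"},
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt"},
    {"shirt", "socks"},
    {"tie"},
    {"socks"},
    {"belt"},
    {"socks", "belt"},
    {"hat"},
]

support, confidence, lift = rule_metrics(transactions, {"shirt"}, {"tie"})
print(support, confidence, lift)
```

A lift above 1 means the head is bought more often alongside the body than it is overall.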


Support (%)   Confidence (%)   Type   Lift     Rule Body           Rule Head
5.5286        34.0800          +      2.7300   [203] + [1207]      => [1716]
7.0388        34.1300          +      2.7400   [203] + [1719]      => [1716]
5.4662        34.1700          +      2.7400   [202] + [802]       => [1716]
5.8805        34.3400          +      2.7500   [203] + [802]       => [1716]
5.0163        34.4900          +      2.7600   [203] + [705]       => [1716]
7.1279        34.7400          +      2.7800   [202] + [1718]      => [1716]
5.8226        34.7600          +      3.3900   [711] + [203]       => [710]
5.0697        34.8300          +      2.7400   [202] + [1702]      => [1703]
5.2836        34.8300          +      2.7400   [202] + [1207]      => [1703]
5.4350        34.9400          +      3.4100   [201] + [711]       => [710]
5.3459        35.0200          +      2.7600   [201] + [1702]      => [1703]

Data Mining Products

• More than 50 commercial data mining tools
• Wide range of pricing
  – SAS Institute's Enterprise Miner ~ $80k
  – SPSS Inc. Clementine ~ $75k
  – IBM Intelligent Miner ~ $60k
  – Desktop products start at a few hundred dollars

Data Mining Products

Data Mining Product Comparison by Algorithm (number of the three products, IBM, SAS and SPSS, marked √ for each algorithm)

Algorithm                          Products marked √
Neural Network                     3
Decision Tree                      3
Clustering                         2
Association                        2
Nearest Neighbour                  1
Kohonen Self-Organizing Map        2

Data Mining & Privacy

• Release a limited subset of the data
  – Hide attributes that are potentially related to personal information
• Release encrypted data
• Audit to detect misuse of data
• Set up a Data Mining Controller
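The first measure, releasing only a limited subset of the data, can be sketched as a filter that drops attributes before records leave the warehouse. The records, field names and the `SENSITIVE` set below are all hypothetical. Deciding which attributes really are identifying is the hard part and is not addressed here.

```python
# Hypothetical customer records; the field names are illustrative only.
records = [
    {"name": "A. Smith", "zip": "60601", "age": 34, "basket_total": 82.5},
    {"name": "B. Jones", "zip": "60614", "age": 51, "basket_total": 19.0},
]

# Assumed set of attributes that could be related to personal information.
SENSITIVE = {"name", "zip"}

def release_subset(records, hidden):
    """Return copies of the records without the hidden attributes."""
    return [{k: v for k, v in r.items() if k not in hidden} for r in records]

released = release_subset(records, SENSITIVE)
print(released)
```

The original records are left untouched, so the full data stays inside the warehouse while only the reduced copy is handed to the mining tool.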

Summary

• Introduction to Data Mining
• A KDD Data Mining Process
• Functionalities of Intelligent Miner
• Commercial Data Mining Tools
• Data Mining & Privacy

References

Angoss Whitepaper: http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html. Retrieved on Oct 26th, 2003
C. Clifton & D. Marks. Security and Privacy Implications of Data Mining. 1996
D. W. Abbott, I. P. Matkovsky & J. F. Elder IV. An Evaluation of High-end Data Mining Tools
Elder Research. http://www.rgrossman.com/faq/dm-02.htm. Retrieved on Oct 28th, 2003
IBM. DB2 Intelligent Miner. http://www-3.ibm.com/software/data/iminer/. Retrieved on Oct 26th, 2003
J. F. Elder & D. W. Abbott. August, 1998. A Comparison of Leading Data Mining Tools
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2000
http://www.cald.cs.cmu.edu/summerschool03/PrivacyPreservingDM.ppt. Retrieved on Nov 10th, 2003
Robert Grossman. http://www.datamininglab.com/toolcomp.html#comparison. Retrieved on Oct 20th, 2003
SPSS. http://www.spss.com/. Retrieved on Nov 12th, 2003




Evolution of Database Technology

• 1960s:
  – Data collection, database creation, and network DBMS
• 1970s:
  – Relational data model, relational DBMS implementation
• 1980s:
  – RDBMS, advanced data models
• 1990s-2000s:
  – Data mining and data warehousing, multimedia databases, and Web databases

Data Mining: On What Kind of Data?

• Data sources
  – Relational databases
  – Data warehouses
  – Transactional databases
  – WWW
• Data types
  – Audio
  – Image
  – Text

Output: A Decision Tree for "buys_computer"

age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
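Read as rules, the tree tests age first, then student status for the youngest group and credit rating for the oldest. A minimal sketch of that rule set follows. The function name and string encodings are illustrative, and the leaf labels are taken from the classification table, where every 30..40 row is a "yes".

```python
def buys_computer(age, student, credit_rating):
    """Walk the decision tree and return the predicted class, "yes" or "no"."""
    if age == "<=30":
        # Youngest group: the outcome depends on student status.
        return "yes" if student == "yes" else "no"
    if age == "30..40":
        # Middle group: always predicted to buy.
        return "yes"
    if age == ">40":
        # Oldest group: the outcome depends on credit rating.
        return "no" if credit_rating == "excellent" else "yes"
    raise ValueError(f"unknown age group: {age!r}")

print(buys_computer("<=30", "yes", "fair"))
print(buys_computer(">40", "no", "excellent"))
```

Income never appears in the tree, so the function does not take it as an argument.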

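Neural network

[Figure: a single neuron. The input vector x = (x0, x1, ..., xn) and the weight vector w = (w0, w1, ..., wn) are combined into a weighted sum, which is passed through an activation function f to give the output y. A second diagram shows a small network with example connection weights 0.15, 0.29, 0.11, 0.25, 0.09, 0.23, 0.32, 0.27.]

$$\text{input}_i = \sum_{j=1}^{n} w_{ij}\,\text{output}_j$$

$$\text{output}_i = \frac{1}{1 + e^{-\text{gain}\cdot\text{input}_i}}$$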
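A single unit of this kind computes a weighted sum of the outputs feeding into it and squashes the result with a sigmoid activation. The sketch below assumes that reading of the slide's formulas. The function name, the default `gain` of 1.0 and the sample inputs are illustrative choices.

```python
import math

def neuron_output(inputs, weights, gain=1.0):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net_input = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-gain * net_input))

# Three of the example weights from the figure, applied to made-up inputs.
y = neuron_output([1.0, 0.5, -1.0], [0.15, 0.29, 0.11])
print(y)
```

A larger gain makes the sigmoid steeper, so the unit behaves more like a hard threshold.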


Applications of Clustering

• Pattern recognition
• Image processing
• Economic science (especially market research)
• WWW
  – Document classification
  – Cluster Weblog data to discover groups of similar access patterns

Data Mining & Privacy

[Figure: a Mining Controller placed between the Data Mining Tool and the Data warehouse]

Examples of Clustering Applications

• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Insurance: identify groups of motor insurance policy holders with a high average claim cost
• City planning: identify groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults
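To make the insurance example concrete, the sketch below groups made-up claim costs into two clusters with a plain one-dimensional k-means. K-means is swapped in here purely for illustration and is not claimed to be the clustering method Intelligent Miner uses. The data, the starting centers and the function name are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data with given starting centers."""
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical average claim costs per policy holder.
claim_costs = [120, 150, 130, 900, 950, 1000, 140]
centers, clusters = kmeans_1d(claim_costs, centers=[100.0, 1000.0])
print(centers)
print(clusters)
```

The cluster with the higher center corresponds to the group of policy holders with a high average claim cost.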


Association and pattern analysis

– Applications:
  • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
– Examples (the bracketed pair is [support, confidence]):
  • buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
  • major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]

Data Mining: On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  – Object-oriented and object-relational databases
  – Text databases and multimedia databases
  – Heterogeneous and legacy databases
  – WWW

Steps of a KDD Process

• Learning the application domain:
  – relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation:
  – Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining:
  – summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation:
  – visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
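The middle steps of this list can be sketched as a small pipeline: clean, select, mine, evaluate. Everything in the block is hypothetical (the records, the function names, and the 40% threshold), and the "mining" step is just frequency counting, chosen to keep the example short.

```python
from collections import Counter

# Hypothetical raw purchase records; one has a missing amount.
raw_records = [
    {"customer": "a", "item": "milk", "amount": 3.0},
    {"customer": "b", "item": "milk", "amount": None},
    {"customer": "c", "item": "bread", "amount": 2.0},
    {"customer": "a", "item": "bread", "amount": 2.5},
    {"customer": "d", "item": "milk", "amount": 3.2},
    {"customer": "e", "item": "tea", "amount": 4.0},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection and reduction: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def mine(records, field):
    """Data mining: count how often each value of `field` occurs."""
    return Counter(r[field] for r in records)

def evaluate(patterns, total, min_support):
    """Pattern evaluation: keep patterns whose relative frequency is high enough."""
    return {k: n / total for k, n in patterns.items() if n / total >= min_support}

cleaned = clean(raw_records)
target = select(cleaned, ["item"])
patterns = mine(target, "item")
interesting = evaluate(patterns, len(target), min_support=0.4)
print(interesting)
```

Keeping each step as its own function mirrors the list: any one stage can be swapped out, for instance a different mining function, without touching the others.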

Strengths and Weaknesses

Strengths
  – Algorithm breadth
  – Graphical output
  – Available for PC and mainframe environments

Weaknesses
  – No automation
  – Data has to reside in IBM's database system
