Data Mining Using IBM Intelligent Miner
Presented by: Qiyan (Jennifer) Huang


Post on 11-May-2015


Page 1: Data Mining

Data Mining Using IBM Intelligent Miner

Presented by: Qiyan (Jennifer) Huang

Page 2: Data Mining

Outline

• Introduction
• Mining Process
• Main Functionalities of Intelligent Miner
• Other Data Mining Products
• Data Mining and Privacy
• Summary
• References

Page 3: Data Mining

What is Data Mining

• Data mining: discovering interesting patterns from large amounts of data
– Knowledge discovery (mining) in databases (KDD), data/pattern analysis, information harvesting, business intelligence, etc.

Page 4: Data Mining

Evolution of Database Technology

• 1960s:
– Data collection, database creation
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s to present:
– RDBMS, advanced data models
• 1990s to 2000s:
– Data mining and data warehousing, multimedia databases, and Web databases

Page 5: Data Mining

Data Mining vs. Database Query

• Database
– Find all customers who have purchased milk.
– Identify customers who have purchased more than $10,000 in the last month.
• Data Mining
– Find all items which are frequently purchased with milk. (association rules)
– Identify customers with similar buying habits. (clustering)
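The contrast between a query and a mining question can be sketched in a few lines of Python. This is only an illustration: the `baskets` data and both function names are hypothetical and do not come from the slides or from Intelligent Miner. The first function answers an exact question with a filter; the second counts what co-occurs with an item, which is the starting point for association rules.

```python
from collections import Counter

# Hypothetical purchase records: customer -> items bought in one basket.
baskets = {
    "alice": {"milk", "bread", "eggs"},
    "bob": {"milk", "bread"},
    "carol": {"beer", "chips"},
    "dave": {"milk", "cereal", "bread"},
}

def customers_who_bought(item):
    """Database-style query: an exact filter over the records."""
    return sorted(name for name, items in baskets.items() if item in items)

def items_bought_with(item):
    """Mining-style question: count items that co-occur with `item`."""
    counts = Counter()
    for items in baskets.values():
        if item in items:
            counts.update(items - {item})
    return counts.most_common()  # most frequent companion first

milk_buyers = customers_who_bought("milk")
milk_companions = items_bought_with("milk")
print(milk_buyers)
print(milk_companions)
```

The query returns rows that satisfy a condition stated in advance, while the co-occurrence count surfaces a pattern (bread tends to appear with milk in this toy data) that nobody asked for by name.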

Page 6: Data Mining

Data Mining Process (KDD)

Databases -> Data Cleaning -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2001

Page 7: Data Mining

About DB2 Intelligent Miner

• DB2 Intelligent Miner for Data: "focused on the large-scale mining, such as large volumes of data, parallel data mining on Windows NT, Sun Solaris, and OS/390" - IBM

Page 8: Data Mining

Main Functionalities

• Cluster analysis
– Group the data that share similar trends and patterns
• Classification
– Predict the outcome based on historical data
• Association analysis
– Finding frequent patterns.
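As a rough illustration of cluster analysis, the sketch below groups numbers with k-means. This is an assumption made for the example: the slides do not say which clustering algorithm is meant, and the `spend` values, the starting centroids, and the function name are all hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 210, 190, 205, 13, 198]
centroids, clusters = kmeans_1d(spend, centroids=[0, 100])
print(centroids)
print(clusters)
```

The two groups that emerge (low spenders and high spenders) were not labelled in advance, which is what separates clustering from classification.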

Page 9: Data Mining
Page 10: Data Mining
Page 11: Data Mining
Page 12: Data Mining
Page 13: Data Mining
Page 14: Data Mining
Page 15: Data Mining
Page 16: Data Mining
Page 17: Data Mining
Page 18: Data Mining

Classification

age      income   student   credit rating   buys computer
<=30     high     no        fair
<=30     high     no        excellent
31…40    high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31…40    low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31…40    medium   no        excellent
31…40    high     yes       fair

This follows an example from Quinlan's ID3

Page 19: Data Mining
Page 20: Data Mining

Classification

Page 21: Data Mining

Classification

age      income   student   credit rating   buys computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes

This follows an example from Quinlan's ID3
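ID3 picks the attribute to split on by information gain, so the labelled rows above can be used to show that calculation. The sketch below is a minimal version under stated assumptions: the column names and helper functions are made up for the example, and it computes the gain for the first split only rather than building a whole tree.

```python
from collections import Counter
from math import log2

COLUMNS = ["age", "income", "student", "credit_rating"]
# The 13 labelled rows from the slide; the last field is buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, column_index):
    """Entropy of the labels minus the weighted entropy after splitting on one column."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column_index] for row in rows}:
        subset = [row[-1] for row in rows if row[column_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
best = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

On these rows `age` has the largest gain, mainly because every customer in the 31…40 group buys a computer, so that branch is already pure.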

Page 22: Data Mining

Association

– Association Rule: identifies relationships
– Example: "30% of customers buy shirts across all the transactions, and 60% of these customers will also buy a tie"
• Confidence factor is 60%
• Support: if buying a shirt and a tie together is observed in 12% of all transactions, then the support is 12%
• Lift = 60% / 30% = 2 (confidence divided by the support of the rule head, which gives 2 when 30% of all transactions contain a tie)
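The three measures can be computed directly from a list of transactions. The sketch below uses made-up transactions (not the percentages quoted on the slide) and hypothetical function names; it takes lift to be confidence divided by the support of the rule head.

```python
# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt", "tie"},
    {"shirt"},
    {"shirt", "socks"},
    {"tie"},
    {"socks"},
    {"belt"},
    {"socks", "belt"},
    {"hat"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_metrics(body, head):
    """Support, confidence and lift for the rule body => head."""
    both = support(body | head)
    confidence = both / support(body)
    lift = confidence / support(head)
    return both, confidence, lift

rule_support, rule_confidence, rule_lift = rule_metrics({"shirt"}, {"tie"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift above 1 means the head shows up more often in transactions containing the body than it does overall; a lift of 1 would mean the two are unrelated.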

Page 23: Data Mining

Association

Support (%)   Confidence (%)   Type   Lift     Rule Body         Rule Head
5.5286        34.0800          +      2.7300   [203] + [1207]    => [1716]
7.0388        34.1300          +      2.7400   [203] + [1719]    => [1716]
5.4662        34.1700          +      2.7400   [202] + [802]     => [1716]
5.8805        34.3400          +      2.7500   [203] + [802]     => [1716]
5.0163        34.4900          +      2.7600   [203] + [705]     => [1716]
7.1279        34.7400          +      2.7800   [202] + [1718]    => [1716]
5.8226        34.7600          +      3.3900   [711] + [203]     => [710]
5.0697        34.8300          +      2.7400   [202] + [1702]    => [1703]
5.2836        34.8300          +      2.7400   [202] + [1207]    => [1703]
5.4350        34.9400          +      3.4100   [201] + [711]     => [710]
5.3459        35.0200          +      2.7600   [201] + [1702]    => [1703]

Page 24: Data Mining

Data Mining Products

• More than 50 commercial data mining tools
• Wide range of pricing
– SAS Institute's Enterprise Miner ~ $80k
– SPSS Inc. Clementine ~ $75k
– IBM Intelligent Miner ~ $60k
– Desktop products start at a few hundred dollars

Page 25: Data Mining

Data Mining Products

Algorithm                      Products marked √ (of IBM, SAS, SPSS)
Neural Network                 3
Decision Tree                  3
Clustering                     2
Association                    2
Nearest Neighbour              1
Kohonen Self-Organizing Map    2

Data Mining Product Comparison on Algorithm

Page 26: Data Mining

Data Mining & Privacy

• Release a limited subset of data
– Hide attributes that are potentially related to personal information
• Release encrypted data
• Audit to detect misuse of data
• Set up a Data Mining Controller
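The first measure, releasing only a limited subset with personal attributes hidden, might look like the sketch below. The record fields, the `SENSITIVE` set, and the function name are hypothetical and chosen only for illustration; deciding which attributes count as personal is left to whoever releases the data.

```python
# Hypothetical attribute names treated as personal information.
SENSITIVE = {"name", "address", "phone"}

def release_subset(records, hidden=SENSITIVE):
    """Return copies of the records without the hidden attributes."""
    return [{k: v for k, v in rec.items() if k not in hidden} for rec in records]

# Hypothetical customer records.
records = [
    {"name": "A. Jones", "address": "1 Main St", "phone": "555-0100",
     "age": "<=30", "income": "high"},
    {"name": "B. Smith", "address": "2 Oak Ave", "phone": "555-0101",
     "age": ">40", "income": "low"},
]
released = release_subset(records)
print(released)
```

The original records are left untouched, so the full data can stay behind whatever controls the owner applies while only the reduced copy is handed to the mining tool.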

Page 27: Data Mining

Summary

• Introduction to Data Mining
• A KDD Data Mining Process
• Functionalities of Intelligent Miner
• Commercial Data Mining Tools
• Data Mining & Privacy

Page 28: Data Mining

References

Angoss Whitepaper: http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html. Retrieved on Oct 26th, 2003
C. Clifton & D. Marks. Security and Privacy Implications of Data Mining. 1996
D. W. Abbott, I. P. Matkovsky & J. F. Elder IV. An Evaluation of High-end Data Mining Tools
Elder Research. http://www.rgrossman.com/faq/dm-02.htm. Retrieved on Oct 28th, 2003
IBM. DB2 Intelligent Miner. http://www-3.ibm.com/software/data/iminer/. Retrieved on Oct 26th, 2003
J. F. Elder & D. W. Abbott. August, 1998. A Comparison of Leading Data Mining Tools
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2000
http://www.cald.cs.cmu.edu/summerschool03/PrivacyPreservingDM.ppt. Retrieved on Nov 10th, 2003
Robert Grossman. http://www.datamininglab.com/toolcomp.html#comparison. Retrieved on Oct 20th, 2003
SPSS. http://www.spss.com/. Retrieved on Nov 12th, 2003

Page 29: Data Mining
Page 30: Data Mining

Evolution of Database Technology

• 1960s:
– Data collection, database creation, and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models
• 1990s to 2000s:
– Data mining and data warehousing, multimedia databases, and Web databases

Page 31: Data Mining

Data Mining: On What Kind of Data?

• Data Sources
– Relational database
– Data warehouses
– Transactional databases
– WWW
• Data types
– Audio
– Image
– Text

Page 32: Data Mining

Output: A Decision Tree for "buys_computer"

age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
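Read as a set of nested tests, the tree can be written as a small function and checked against the labelled training rows. The sketch assumes the usual reading of this tree (student decides the youngest group, credit rating decides the oldest, the middle group always buys); the function name and the tuple layout are made up for the example.

```python
def buys_computer(age, student, credit_rating):
    """Walk the decision tree for one customer and return 'yes' or 'no'."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    # Remaining branch: age > 40, decided by credit rating.
    return "yes" if credit_rating == "fair" else "no"

# (age, student, credit rating, buys computer) for the 13 labelled rows.
LABELLED = [
    ("<=30", "no", "fair", "no"),
    ("<=30", "no", "excellent", "no"),
    ("31…40", "no", "fair", "yes"),
    (">40", "no", "fair", "yes"),
    (">40", "yes", "fair", "yes"),
    (">40", "yes", "excellent", "no"),
    ("31…40", "yes", "excellent", "yes"),
    ("<=30", "no", "fair", "no"),
    ("<=30", "yes", "fair", "yes"),
    (">40", "yes", "fair", "yes"),
    ("<=30", "yes", "excellent", "yes"),
    ("31…40", "no", "excellent", "yes"),
    ("31…40", "yes", "fair", "yes"),
]

mismatches = [
    row for row in LABELLED if buys_computer(row[0], row[1], row[2]) != row[3]
]
print(f"{len(LABELLED) - len(mismatches)} of {len(LABELLED)} rows classified correctly")
```

Income never appears in the function because the tree never tests it; the other three attributes are enough to separate these rows.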

Page 33: Data Mining

Neural network

[Figure: a single unit. Input vector x = (x0, x1, ..., xn), weight vector w = (w0, w1, ..., wn), weighted sum, activation function f, output y]

Page 34: Data Mining

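Neural network

[Figure: a small network whose connections carry the weights 0.15, 0.29, 0.11, 0.25, 0.09, 0.23, 0.32, 0.27]

$$\mathrm{input}_i = \sum_{j=1}^{n} w_{ij} \, \mathrm{output}_j$$

$$\mathrm{output}_i = \frac{1}{1 + e^{-\mathrm{gain} \cdot \mathrm{input}_i}}$$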
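The computation for one unit on this slide is a weighted sum of the incoming outputs passed through a logistic function scaled by a gain. A minimal sketch follows, assuming that reading of the formulas; the function name, the default gain of 1.0, and the sample values are illustrative only.

```python
from math import exp

def unit_output(weights, inputs, gain=1.0):
    """Weighted sum of the inputs, squashed by a logistic function."""
    net_input = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + exp(-gain * net_input))

# Illustrative weights and incoming outputs from the previous layer.
weights = [0.15, 0.29, 0.11]
incoming = [1.0, 0.0, 1.0]
y = unit_output(weights, incoming)
print(round(y, 4))
```

A larger gain makes the curve steeper around zero, so the unit behaves more like a hard threshold; with a net input of zero the output is exactly 0.5 whatever the gain.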

Page 35: Data Mining

Neural network

Page 36: Data Mining

Applications of Clustering

• Pattern Recognition
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns

Page 37: Data Mining

Data Mining & Privacy

Data Mining Tool

Mining Controller

Data warehouse

Page 38: Data Mining

Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

Page 39: Data Mining

Association

Association and pattern analysis
– Applications:
• Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
– Examples (rule form: body => head [support, confidence]):
• buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
• major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]

Page 40: Data Mining

Data Mining: On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW

Page 41: Data Mining

Steps of a KDD Process

• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Page 42: Data Mining

Strength and Weakness

Strength
– Algorithm breadth
– Graphical output
– Available for PC and mainframe environments

Weakness
– No automation
– Data has to reside in IBM's database system