data mining
![Page 1: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/1.jpg)
Data Mining Using IBM Intelligent Miner
Presented by: Qiyan (Jennifer) Huang
![Page 2: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/2.jpg)
Outline
• Introduction
• Mining Process
• Main Functionalities of Intelligent Miner
• Other Data Mining Products
• Data Mining and Privacy
• Summary
• References
![Page 3: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/3.jpg)
What is Data Mining
• Data mining: discovering interesting patterns from large amounts of data
– Knowledge discovery (mining) in databases (KDD), data/pattern analysis, information harvesting, business intelligence, etc.
![Page 4: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/4.jpg)
Evolution of Database Technology
• 1960s:
– Data collection, database creation
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s ~ present:
– RDBMS, advanced data models
• 1990s-2000s:
– Data mining and data warehousing, multimedia databases, and Web databases
![Page 5: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/5.jpg)
Data Mining vs. Database Query
• Database
– Find all customers who have purchased milk
– Identify customers who have purchased more than $10,000 in the last month
• Data Mining
– Find all items which are frequently purchased with milk (association rules)
– Identify customers with similar buying habits (clustering)
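The contrast between a query and a mining task can be illustrated with a minimal Python sketch. The basket data, customer names, and function names below are hypothetical and only serve the illustration: the first function answers a fixed question with a known condition, while the second counts which items co-occur with milk without being told what to look for.

```python
from collections import Counter

# Hypothetical market baskets (illustrative data, not from the slides).
baskets = [
    {"customer": "ann", "items": {"milk", "bread"}},
    {"customer": "bob", "items": {"milk", "bread", "eggs"}},
    {"customer": "cat", "items": {"tea"}},
    {"customer": "dan", "items": {"milk", "eggs", "bread"}},
]

def customers_who_bought(item):
    """Database-style query: an exact lookup with a known condition."""
    return sorted(b["customer"] for b in baskets if item in b["items"])

def items_bought_with(item):
    """Mining-style question: count which other items co-occur with the item."""
    counts = Counter()
    for b in baskets:
        if item in b["items"]:
            counts.update(b["items"] - {item})
    return counts.most_common()

milk_buyers = customers_who_bought("milk")
milk_companions = items_bought_with("milk")
```

A real association-rule miner would also apply support and confidence thresholds; this sketch only shows the co-occurrence counting step.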
![Page 6: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/6.jpg)
Data Mining Process (KDD)
Databases → Data Cleaning → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Source: J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2001
![Page 7: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/7.jpg)
About DB2 Intelligent Miner
• DB2 Intelligent Miner for Data: “focused on the large-scale mining, such as large volumes of data, parallel data mining on Windows NT, Sun Solaris, and OS/390” – IBM
![Page 8: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/8.jpg)
Main Functionalities
• Cluster analysis
– Group the data that share similar trends and patterns
• Classification
– Predict the outcome based on historical data
• Association analysis
– Finding frequent patterns
![Page 9: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/9.jpg)
![Page 10: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/10.jpg)
![Page 11: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/11.jpg)
![Page 12: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/12.jpg)
![Page 13: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/13.jpg)
![Page 14: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/14.jpg)
![Page 15: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/15.jpg)
![Page 16: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/16.jpg)
![Page 17: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/17.jpg)
![Page 18: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/18.jpg)
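Classification

| age | income | student | credit rating | buys computer |
|---|---|---|---|---|
| <=30 | high | no | fair | |
| <=30 | high | no | excellent | |
| 31…40 | high | no | fair | |
| >40 | medium | no | fair | |
| >40 | low | yes | fair | |
| >40 | low | yes | excellent | |
| 31…40 | low | yes | excellent | |
| <=30 | medium | no | fair | |
| <=30 | low | yes | fair | |
| >40 | medium | yes | fair | |
| <=30 | medium | yes | excellent | |
| 31…40 | medium | no | excellent | |
| 31…40 | high | yes | fair | |

This follows an example from Quinlan’s ID3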
![Page 19: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/19.jpg)
![Page 20: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/20.jpg)
Classification
![Page 21: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/21.jpg)
Classification

| age | income | student | credit rating | buys computer |
|---|---|---|---|---|
| <=30 | high | no | fair | no |
| <=30 | high | no | excellent | no |
| 31…40 | high | no | fair | yes |
| >40 | medium | no | fair | yes |
| >40 | low | yes | fair | yes |
| >40 | low | yes | excellent | no |
| 31…40 | low | yes | excellent | yes |
| <=30 | medium | no | fair | no |
| <=30 | low | yes | fair | yes |
| >40 | medium | yes | fair | yes |
| <=30 | medium | yes | excellent | yes |
| 31…40 | medium | no | excellent | yes |
| 31…40 | high | yes | fair | yes |

This follows an example from Quinlan’s ID3
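Since the slide credits Quinlan's ID3, the attribute-selection step of that algorithm can be sketched in Python. This is an illustration under stated assumptions, not the procedure Intelligent Miner uses: the 13 rows are copied from the table above, while the attribute names, function names, and the choice of plain information gain are assumptions made for the example.

```python
from collections import Counter
from math import log2

# Attribute names are illustrative labels for the four input columns.
ATTRS = ("age", "income", "student", "credit_rating")

# The 13 labelled rows shown on the slide; the last field is buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
best_attribute = max(gains, key=gains.get)  # attribute ID3 would split on first
```

ID3 would then repeat the same computation inside each branch of the chosen attribute until the rows in a branch all share one class.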
![Page 22: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/22.jpg)
Association
– Association rule: identifies relationships
– Example: “Shirts are bought in 30% of all transactions; 60% of these transactions also include a tie”
• Confidence is 60%: the share of shirt transactions that also contain a tie
• Support is the share of all transactions in which shirt and tie are bought together; with the figures above that is 30% × 60% = 18%
• Lift = confidence / share of all transactions that contain a tie; if ties also appear in 30% of all transactions, lift = 60% / 30% = 2
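The three measures can be computed directly from transaction counts. The sketch below uses made-up transactions, sized so that shirts appear in 30% of them and 60% of the shirt transactions also contain a tie; the item names, counts, and function names are assumptions for illustration. Lift is taken as confidence divided by the overall frequency of the rule head.

```python
# Hypothetical data: 100 transactions built to match the shirt/tie example.
transactions = (
    [{"shirt", "tie"}] * 18   # shirt and tie together
    + [{"shirt"}] * 12        # shirt without tie
    + [{"tie"}] * 12          # tie without shirt
    + [{"socks"}] * 58        # neither item
)

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(body, head):
    """Return (support, confidence, lift) for the rule body => head."""
    both = support(body | head)
    confidence = both / support(body)
    lift = confidence / support(head)
    return both, confidence, lift

rule_support, rule_confidence, rule_lift = rule_metrics({"shirt"}, {"tie"})
```

A lift above 1 means the head is bought more often together with the body than it is overall.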
![Page 23: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/23.jpg)
Association

| Support (%) | Confidence (%) | Type | Lift | Rule Body | Rule Head |
|---|---|---|---|---|---|
| 5.5286 | 34.0800 | + | 2.7300 | [203] + [1207] | [1716] |
| 7.0388 | 34.1300 | + | 2.7400 | [203] + [1719] | [1716] |
| 5.4662 | 34.1700 | + | 2.7400 | [202] + [802] | [1716] |
| 5.8805 | 34.3400 | + | 2.7500 | [203] + [802] | [1716] |
| 5.0163 | 34.4900 | + | 2.7600 | [203] + [705] | [1716] |
| 7.1279 | 34.7400 | + | 2.7800 | [202] + [1718] | [1716] |
| 5.8226 | 34.7600 | + | 3.3900 | [711] + [203] | [710] |
| 5.0697 | 34.8300 | + | 2.7400 | [202] + [1702] | [1703] |
| 5.2836 | 34.8300 | + | 2.7400 | [202] + [1207] | [1703] |
| 5.4350 | 34.9400 | + | 3.4100 | [201] + [711] | [710] |
| 5.3459 | 35.0200 | + | 2.7600 | [201] + [1702] | [1703] |
![Page 24: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/24.jpg)
Data Mining Products
• More than 50 commercial data mining tools
• Wide range of pricing
– SAS Institute's Enterprise Miner ~ $80k
– SPSS Inc. Clementine ~ $75k
– IBM Intelligent Miner ~ $60k
– Desktop products start at a few hundred dollars
![Page 25: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/25.jpg)
Data Mining Products
Data Mining Product Comparison on Algorithm (product columns: IBM, SAS, SPSS)
• Neural Network: √ √ √
• Decision Tree: √ √ √
• Clustering: √ √
• Association: √ √
• Nearest Neighbour: √
• Kohonen Self-Organizing Map: √ √
Where a row has fewer than three checkmarks, the transcript does not preserve which product columns they belong to.
![Page 26: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/26.jpg)
Data Mining & Privacy
• Release limited subset of data
– Hide attributes that are potentially related to personal information
• Release encrypted data
• Audit to detect misuse of data
• Set up a Data Mining Controller
![Page 27: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/27.jpg)
Summary
• Introduction to Data Mining
• A KDD Data Mining Process
• Functionalities of Intelligent Miner
• Commercial Data Mining Tools
• Data Mining & Privacy
![Page 28: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/28.jpg)
References
Angoss Whitepaper: http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html. Retrieved on Oct 26th, 2003
C. Clifton & D. Marks. Security and Privacy Implications of Data Mining. 1996
D. W. Abbott, I. P. Matkovsky & J. F. Elder IV. An Evaluation of High-end Data Mining Tools
Elder Research. http://www.rgrossman.com/faq/dm-02.htm. Retrieved on Oct 28th, 2003
IBM. DB2 Intelligent Miner. http://www-3.ibm.com/software/data/iminer/. Retrieved on Oct 26th, 2003
J. F. Elder & D. W. Abbott. A Comparison of Leading Data Mining Tools. August, 1998
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2000
http://www.cald.cs.cmu.edu/summerschool03/PrivacyPreservingDM.ppt Retrieved on Nov 10th, 2003
Robert Grossman. http://www.datamininglab.com/toolcomp.html#comparison. Retrieved on Oct 20th, 2003
SPSS. http://www.spss.com/. Retrieved on Nov 12th, 2003
![Page 29: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/29.jpg)
![Page 30: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/30.jpg)
Evolution of Database Technology
• 1960s:
– Data collection, database creation, and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models
• 1990s-2000s:
– Data mining and data warehousing, multimedia databases, and Web databases
![Page 31: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/31.jpg)
Data Mining: On What Kind of Data?
• Data sources
– Relational databases
– Data warehouses
– Transactional databases
– WWW
• Data types
– Audio
– Image
– Text
![Page 32: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/32.jpg)
Output: A Decision Tree for “buys_computer”
age?
– <=30: student?
  • no → no
  • yes → yes
– 30..40: yes
– >40: credit rating?
  • excellent → no
  • fair → yes
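Read as a classifier, the tree tests age first, then student status for the youngest group and credit rating for the oldest. A minimal Python sketch of that reading follows; the function name and the string encoding of the attribute values are assumptions, and the branch outcomes are taken from the figure as interpreted here (student decides for "<=30", credit rating decides for ">40").

```python
def buys_computer(age, student, credit_rating):
    """Walk the decision tree for buys_computer and return 'yes' or 'no'."""
    if age == "<=30":
        # Youngest group: the decision depends only on student status.
        return "yes" if student == "yes" else "no"
    if age == "30..40":
        # Middle group: always classified as a buyer.
        return "yes"
    if age == ">40":
        # Oldest group: the decision depends only on credit rating.
        return "yes" if credit_rating == "fair" else "no"
    raise ValueError(f"unknown age group: {age!r}")

example = buys_computer("<=30", student="yes", credit_rating="fair")
```

Each path from the root to a leaf corresponds to one if-then rule, which is why decision trees are often presented as a readable form of classification model.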
![Page 33: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/33.jpg)
Neural network
[Figure: a single neuron. An input vector x = (x0, x1, …, xn) and a weight vector w = (w0, w1, …, wn) are combined in a weighted sum, which passes through an activation function f to produce the output y.]
![Page 34: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/34.jpg)
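Neural network
[Figure: network diagram labelled with the values 0.15, 0.29, 0.11, 0.25, 0.09, 0.23, 0.32, 0.27]

$$\text{input}_i = \sum_{j=1}^{n} w_{ij}\,\text{output}_j$$

$$\text{output}_i = \frac{1}{1 + e^{-\text{gain}\cdot\text{input}_i}}$$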
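The computation of a single unit, a weighted sum of incoming outputs followed by a sigmoid with a gain factor, can be sketched in Python. The function name, argument names, and the default gain of 1.0 are assumptions made for this illustration; the slide itself does not specify a value for the gain.

```python
import math

def neuron_output(weights, inputs, gain=1.0):
    """Weighted sum of the inputs passed through a logistic (sigmoid) activation."""
    net_input = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-gain * net_input))

# Illustrative call with made-up weights and inputs.
y = neuron_output(weights=[0.5, -0.25], inputs=[1.0, 2.0])
```

A larger gain makes the sigmoid steeper, so the unit behaves more like a hard threshold; the output always lies strictly between 0 and 1.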
![Page 35: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/35.jpg)
Neural network
![Page 36: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/36.jpg)
Applications of Clustering
• Pattern Recognition
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
![Page 37: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/37.jpg)
Data Mining & Privacy
Data Mining Tool
Mining Controller
Data warehouse
![Page 38: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/38.jpg)
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
![Page 39: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/39.jpg)
Association
Association and pattern analysis
– Applications:
• Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
– Examples (rule form: body → head [support, confidence]):
• buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
• major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
![Page 40: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/40.jpg)
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
![Page 41: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/41.jpg)
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
![Page 42: Data Mining](https://reader036.vdocuments.mx/reader036/viewer/2022081514/55506213b4c90574428b540c/html5/thumbnails/42.jpg)
Strength and Weakness
Strength
– Algorithm breadth
– Graphical output
– Available for PC and mainframe environment
Weakness
– No automation
– Data has to reside in IBM's database system