Introduction to Data Mining with Weka
Introduction to Data Mining with Weka
Data Science and Business Analytics Denver Meetup
Nancy Abramson
Principal Data Scientist
Slide 2
Introduction
What does Open Source mean?
Data Science and Data Mining
Open Source Data Mining Tools
Weka
  Overview
  Profiling Demonstration
  Analysis Demonstration
Summary
Agenda
Slide 3
Datasource Consulting employee for the past 3 years, developing, using and evaluating open source and enterprise Business Intelligence tools
New hire at spotXchange as Principal Data Scientist
Bachelor of Science degree in Computer Science & Mathematics
Master's in Applied Statistics
Experience with databases, ETL, and analytics
Using “Open Source” or “free software” for more than 25 years
Market analysis in aerospace, financial, telephony, and retail
Introduction – Who am I?
Slide 4
A software development project in which code is developed by peer production and collaboration, with the end-product, source-code and documentation available at no cost to the public.
Free access to source code
Free redistribution
Strong development community
Examples: Linux, Hadoop, Apache/Tomcat, MySQL, Weka
What is Open Source?
Slide 5
Data Science process as defined by Dr. DJ Patil, former head of Data Analytics at LinkedIn:
  Clean up and prepare the data
  Create measurable levers to increase the value of the business
  Monitor the state of the metrics for changes
  Experiment with the results of the models
Traditional Data Mining is used for:
  Profiling data to check for quality, e.g. max, min, data types, and patterns between variables
  Finding relationships between variables or independent variables, e.g. clusters, regressions
  Checking the variance of a measure over time
  Determining the level at which an experiment produced significant results
Data Science and Data Mining
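The profiling checks named on this slide (min, max, data types) can be sketched in a few lines of Python. This is only an illustration of the idea, not how Weka implements profiling: the `profile_column` helper, the sample values, and the choice to treat `?` as a missing value are all assumptions made for the example.

```python
def profile_column(values):
    """Return a small profile of one column of raw text values.

    Hypothetical helper for illustration. Empty strings and '?' are
    assumed to mark missing values.
    """
    present = [v.strip() for v in values if v.strip() not in ("", "?")]
    missing = len(values) - len(present)
    if not present:
        return {"type": "empty", "missing": missing}
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # At least one value is not a number, so treat the column as nominal.
        return {"type": "nominal", "distinct": sorted(set(present)), "missing": missing}
    return {"type": "numeric", "min": min(numbers), "max": max(numbers), "missing": missing}


ages = ["39", "50", "38"]                      # illustrative values only
workclass = [" State-gov", " Private", " ?"]   # illustrative values only

print(profile_column(ages))
print(profile_column(workclass))
```

A column is reported as numeric only if every non-missing value parses as a number, which is a deliberately strict rule; a real profiler would likely also report patterns and outliers.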
Slide 6
Fun Stuff
  See what you never thought possible
Name: Mr. Ed
Genus: Equus
Address: Apt 302, Manhattan, NY 10033
Profiling and Heavy Lifting
Slide 7
Data Mining Tools
Reference: http://www.phiresearchlab.org/downloads/OpenSourceDataMining.pdf
                    RapidMiner  Weka   Orange     Rattle     Knime
Bayes Network       yes         yes    yes        no         yes
Decision Tree       yes         yes    yes        yes        yes
Neural Network      yes         yes    no         no         yes
SVM                 yes         yes    yes        yes        yes
Clustering          yes         yes    yes        yes        yes
Association Rules   yes         yes    yes        yes        yes
Ease of Use         Fair        Good   Excellent  Good       Good
Data Visualization  Good        Fair   Excellent  Excellent  Fair

URLs:
  RapidMiner  Rapid-i.com
  Weka        www.cs.waikato.ac.nz/ml/weka
  Orange      www.ailab.si/orange
  Rattle      rattle.togaware.com
  Knime       knime.org
Slide 8
Waikato Environment for Knowledge Analysis (WEKA)
Developed by the University of Waikato, New Zealand
Java-based, distributed under the GNU General Public License
Explorer
  Preprocessing, attribute selection, learning, visualization
Experimenter
  Testing and evaluating machine learning algorithms
Knowledge Flow
  Data-flow interface to WEKA
SimpleCLI
Weka Introduction
Slide 9
load, filter, analyze
Slide 10
Load and view CSV data
Compare pairs of attributes
Examine min/max data values
Compare nominal and numeric values
Save in ARFF format
Weka Pre-process Demo
Derived from the Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html
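The final step of the demo, saving CSV data as ARFF, is done through the Weka GUI. As a rough sketch of what that conversion involves, the Python below writes an ARFF document from CSV text. The `csv_to_arff` function and its type-inference rule (a column is numeric when every value parses as a number, otherwise nominal) are assumptions for illustration and do not describe Weka's own converter.

```python
import csv
import io


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _quote(text):
    # Single-quote nominal values so embedded spaces survive.
    return "'" + text.replace("'", "\\'") + "'"


def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = header) to ARFF text. Hypothetical helper."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    numeric = []
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        is_num = all(_is_number(v) for v in column)
        numeric.append(is_num)
        if is_num:
            lines.append(f"@attribute {name} numeric")
        else:
            # dict.fromkeys keeps first-seen order while removing duplicates.
            labels = ",".join(_quote(v) for v in dict.fromkeys(column))
            lines.append(f"@attribute {name} {{{labels}}}")
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(v if numeric[i] else _quote(v) for i, v in enumerate(row)))
    return "\n".join(lines)


sample = "age,workclass\n39,State-gov\n50,Self-emp-not-inc\n"   # illustrative data
print(csv_to_arff(sample, "workers"))
```

The sketch does not quote attribute names or handle missing values, both of which a real converter would need to address.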
Slide 11
Attribute-Relation File Format (ARFF)
@relation workers
@attribute age numeric
@attribute workclass {' State-gov',' Self-emp-not-inc',' Private',' Federal-gov',' Local-gov',' ?',' Self-emp-inc',' Without-pay',' Never-worked'}
@attribute ' fnlwgt' numeric
:
@attribute ' wage' {' <=50K',' >50K'}
@data
39,' State-gov',77516,' Bachelors',13,' Never-married',' Adm-clerical',' Not-in-family',' White',' Male',2174,0,40,' United-States',' <=50K'
50,' Self-emp-not-inc',83311,' Bachelors',13,' Married-civ-spouse',' Exec-managerial',' Husband',' White',' Male',0,0,13,' United-States',' <=50K'
38,' Private',215646,' HS-grad',9,' Divorced',' Handlers-cleaners',' Not-in-family',' White',' Male',0,0,40,' United-States',' <=50K'
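To make the structure of the file above concrete, here is a minimal Python sketch that reads the three parts of an ARFF document: the `@relation` name, the `@attribute` declarations, and the `@data` rows. The `parse_arff` function is hypothetical and covers only the numeric and nominal forms shown on this slide; it is not Weka's loader, and the shortened sample is illustrative.

```python
import csv
import re

# Attribute name may be single-quoted (e.g. ' fnlwgt') or a bare word.
ATTRIBUTE = re.compile(r"@attribute\s+('[^']*'|\S+)\s+(.+)", re.IGNORECASE)


def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        if in_data:
            rows.append(next(csv.reader([line], quotechar="'")))
        elif line.lower().startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif line.lower().startswith("@data"):
            in_data = True
        else:
            match = ATTRIBUTE.match(line)
            if match is None:
                raise ValueError(f"unrecognised header line: {line}")
            name, kind = match.group(1).strip("'"), match.group(2).strip()
            if kind.startswith("{"):           # nominal: keep the list of labels
                kind = next(csv.reader([kind[1:-1]], quotechar="'"))
            attributes.append((name, kind))
    return relation, attributes, rows


sample = """@relation workers
@attribute age numeric
@attribute workclass {' State-gov',' Private'}
@data
39,' State-gov'
38,' Private'
"""

relation, attributes, rows = parse_arff(sample)
print(relation)
print(attributes)
print(rows)
```

Values are returned as strings with the leading spaces from the source preserved; converting numeric columns and trimming labels is left to the caller.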
Slide 12
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection
3 algorithms for finding association rules
Weka Classify Features
Slide 13
Linear Regression
  Predicted attribute is continuous
  Correlation coefficient indicates how well the model fits the data
    Measures the strength and the direction of a linear relationship
    -1 ≤ r ≤ +1
    A correlation with magnitude greater than 0.8 is generally described as strong, depending on the type of data
Uses
  Forecasting
  Exploring factor effects
Demo: cpu.arff
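The correlation coefficient described on this slide can be computed directly from its definition. The sketch below implements the standard Pearson formula in plain Python; the `pearson_r` name and the sample numbers are illustrative and are not taken from cpu.arff or from Weka's output.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]                       # illustrative data
print(pearson_r(xs, [2, 4, 6, 8, 10]))     # perfectly linear, increasing
print(pearson_r(xs, [10, 8, 6, 4, 2]))     # perfectly linear, decreasing
```

The two calls show the extremes of the range: a perfect increasing line gives r = +1 and a perfect decreasing line gives r = -1. The function divides by zero if either sample is constant, so a caller would need to guard against that case.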
Slide 14
Classification
  Predicted attribute is categorical
  Implemented methods
    Naïve Bayes
    Decision trees and rules
    Neural networks
    Support vector machines
Demo: J48 decision tree with weather.arff
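As background for the decision tree demo, the sketch below computes entropy and information gain, the quantities that C4.5-style learners such as J48 build on when choosing which attribute to split on. J48 itself ranks splits by gain ratio and adds pruning, so this is a simplified illustration rather than its actual procedure. The `outlook` and `play` lists are assumed to match the class counts of the standard 14-row weather data; they were not read from weather.arff.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on one attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [lab for v, lab in zip(values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Assumed counts: sunny 2 yes / 3 no, overcast 4 yes, rainy 3 yes / 2 no.
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]

print(round(entropy(play), 3))
print(round(information_gain(outlook, play), 3))
```

An attribute with a higher information gain separates the classes more cleanly, which is why a tree learner would place it nearer the root.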