Big Data: Data Analysis Boot Camp, Iris dataset · ccartled/teaching/2017-fall/data... · 2017-09-23 ·...
TRANSCRIPT
-
1/29
Introduction Built-in datasets Iris dataset Hands-on Q & A Conclusion References Files
Big Data: Data Analysis Boot Camp
Iris dataset
Chuck Cartledge, PhD
22 September 2017
-
2/29
Table of contents (1 of 1)
1 Introduction
2 Built-in datasets
3 Iris dataset
4 Hands-on
5 Q & A
6 Conclusion
7 References
8 Files
-
3/29
What are we going to cover?
We’re going to talk about:
A few of the multitudes of R’s built-in datasets.
An overview of tools and techniques to look at the iris dataset.
-
4/29
R has over 120 built-in datasets
To see the currently installed ones:
data()
To see the data() function code:
data
To see over 2,000 available datasets:
data(package = .packages(all.available = TRUE))
To get detailed information about the iris dataset:
?iris
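The catalogue that data() prints can also be inspected programmatically. The following is a minimal sketch, assuming only base R and its bundled datasets package; the variable names are illustrative and not taken from the slides.

```r
# data() returns an object whose $results matrix has one row per dataset,
# with columns Package, LibPath, Item, and Title.
catalogue <- data(package = "datasets")$results

# Count the datasets shipped in the base 'datasets' package.
n_datasets <- nrow(catalogue)
print(n_datasets)

# Look up the catalogue entry for iris (exact match, so iris3 is excluded).
iris_entry <- catalogue[catalogue[, "Item"] == "iris",
                        c("Item", "Title"), drop = FALSE]
print(iris_entry)
```

The exact count depends on the installed R version, which is why no expected output is shown.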
-
5/29
Background
Some background
Edgar Anderson collected data on 3 different iris species on the Gaspé Peninsula, Quebec, Canada[1]
Ronald Fisher used Anderson’s data to see if linear regression could be used to “maximize the ratio of the difference between the specific means to the standard deviations within species.”[3]
-
6/29
Background
Anderson classified 50 examples of each of 3 different species
Each specimen was:
Collected on the same day
Collected by the same person
Measured using the same instruments
I. virginica
I. setosa
I. versicolor
-
7/29
Background
Data collected from each specimen:
Sepal length,
Sepal width,
Petal length, and
Petal width
Image from [2].
-
8/29
Background
What does his data look like?
To see a few rows:
head(iris)
To see how many rows:
nrow(iris)
To see simple summary information:
str(iris)
which returns:
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
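Beyond head(), nrow(), and str(), a per-species summary gives a quick feel for how the three species differ. This is a minimal sketch using only base R; the variable names are illustrative.

```r
# Count the specimens recorded for each species.
species_counts <- table(iris$Species)
print(species_counts)

# Mean of each of the four measurements, grouped by species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# Minimum, quartiles, mean, and maximum for every numeric column,
# plus level counts for the Species factor.
print(summary(iris))
```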
-
9/29
Iris dataset analysis
Other ways
Quick and dirty:
plot(iris)
Pairwise plotting of all numerical columns. The species (factor) classification is missing.
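One way to bring the species classification back into the pairwise plot is to colour each point by its species. This is a minimal sketch using base R graphics; the colour choices and variable names are arbitrary assumptions.

```r
# Map each species to a colour (the colour choices are arbitrary).
species_colours <- c(setosa = "red", versicolor = "darkgreen", virginica = "blue")
point_colours <- species_colours[as.character(iris$Species)]

# Pairwise scatter plots of the four numeric columns, coloured by species.
pairs(iris[, 1:4], col = point_colours, pch = 19,
      main = "Iris measurements by species")
```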
-
11/29
Iris dataset analysis
As a 3D-ish plot
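Another view:
library(ggplot2)
# Scatter plot of petal width against sepal width, coloured by species.
# qplot() is deprecated as of ggplot2 3.4.0; ggplot() + geom_point() is the current form.
qplot(Petal.Width, Sepal.Width, data=iris,
      colour=Species, size=I(4))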
Ideas taken from [7].
Iris sepal and petal widths, showing species classification. Errors?
-
13/29
Iris dataset analysis
As a decision tree
More informative:
library(tree)
tree1
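The code on this slide is cut off after `tree1`. The following is a minimal sketch of how an iris classification tree might be fitted with the tree package, assuming the package is installed. The choice of predictors is an assumption in the style of [7], not necessarily what the slide used.

```r
library(tree)  # assumes install.packages("tree") has been run

# Fit a classification tree predicting species from the two widths.
# The choice of predictors is an assumption, not the slide's code.
tree1 <- tree(Species ~ Sepal.Width + Petal.Width, data = iris)

# Report the variables used, the number of terminal nodes,
# and the misclassification error rate.
print(summary(tree1))

# Draw the tree and label its splits and leaves.
plot(tree1)
text(tree1)
```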
-
14/29
Iris dataset analysis
Same image.
An iris species classification decision tree.
-
15/29
Iris dataset analysis
Combining decision tree and 3D-ish plot
A slightly more complex example:
library(tree)
tree1
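The code on this slide is also cut off. One plausible way to combine a decision tree with the scatter plot is partition.tree() from the tree package, which draws the regions a two-predictor tree assigns to each class. This is a sketch under that assumption; the formula and plotting choices are illustrative, in the style of [7].

```r
library(tree)  # assumes the tree package is installed

# Refit a two-predictor tree (the predictors are an assumption);
# partition.tree() only draws trees with one or two predictors.
tree1 <- tree(Species ~ Sepal.Width + Petal.Width, data = iris)

# Scatter plot of the two predictors, coloured by the true species.
plot(iris$Petal.Width, iris$Sepal.Width, pch = 19,
     col = as.numeric(iris$Species),
     xlab = "Petal.Width", ylab = "Sepal.Width")

# Overlay the rectangular regions the tree assigns to each species,
# with Petal.Width on the x axis to match the scatter plot.
partition.tree(tree1, label = "Species", add = TRUE,
               ordvars = c("Petal.Width", "Sepal.Width"))
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

Points whose colour disagrees with the label of the region they fall in are the ones the tree gets wrong.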
-
16/29
Iris dataset analysis
Same image.
Why are there misclassifications?
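To see which specimens are misclassified, the tree's predictions can be tabulated against the recorded species. This is a minimal sketch, assuming the tree package is installed and using an illustrative two-predictor formula that is not taken from the slide.

```r
library(tree)  # assumes the tree package is installed

# Fit an illustrative tree (the predictors are an assumption).
tree1 <- tree(Species ~ Sepal.Width + Petal.Width, data = iris)

# Predict a class label for every specimen in the training data.
predicted <- predict(tree1, newdata = iris, type = "class")

# Rows are predicted species, columns are recorded species;
# off-diagonal cells are misclassifications.
confusion <- table(predicted = predicted, actual = iris$Species)
print(confusion)

# Total number of specimens the tree gets wrong.
n_wrong <- sum(predicted != iris$Species)
print(n_wrong)
```

Because the tree is evaluated on the same data it was fitted to, the count understates the error to expect on new specimens.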
-
17/29
Iris dataset analysis
How far should we go?
An even slightly more complex example:
library(ggplot2)
p
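The slide's code is cut off after `p`. One way such a plot might be built is to draw decision thresholds on top of the ggplot2 scatter plot. This is a sketch only: the threshold values below are illustrative assumptions, not values read from the slide, and should be replaced with the splits reported by a fitted tree.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Illustrative petal-width thresholds; these values are assumptions.
petal_width_splits <- c(0.8, 1.75)

# Scatter plot coloured by species, with a dashed vertical line
# at each assumed threshold.
p <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, colour = Species)) +
  geom_point(size = 4) +
  geom_vline(xintercept = petal_width_splits, linetype = "dashed")
print(p)
```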
-
18/29
Iris dataset analysis
Same image.
Perhaps there are attributes that aren’t being captured.
-
19/29
What can we learn from it?
Different tools give different views
The default kmeans clustering is applied with different k values.
Different dendrogram clusterings give different views.
R script is attached[6].
Different clusterings yield different dendrograms.
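The attached script is not reproduced in this transcript. As a stand-in, here is a minimal base R sketch of the two ideas on this slide: running kmeans() with several k values, and building dendrograms with different linkage methods. The seed, the range of k, and the linkage methods are all assumptions.

```r
# Use only the four numeric measurements; the species labels are held back.
measurements <- iris[, 1:4]

# kmeans() starts from random centres, so fix the seed for repeatability.
set.seed(42)

# Run the default kmeans() for several values of k and compare each
# clustering against the recorded species.
for (k in 2:4) {
  km <- kmeans(measurements, centers = k)
  cat("k =", k, "\n")
  print(table(cluster = km$cluster, species = iris$Species))
}

# Hierarchical clustering with two linkage methods; each produces a
# different dendrogram from the same distance matrix.
distances <- dist(measurements)
hc_complete <- hclust(distances, method = "complete")
hc_average <- hclust(distances, method = "average")
plot(hc_complete, labels = FALSE, main = "Complete linkage")
plot(hc_average, labels = FALSE, main = "Average linkage")

# Cut each dendrogram into three groups and compare the group sizes.
groups_complete <- cutree(hc_complete, k = 3)
groups_average <- cutree(hc_average, k = 3)
print(table(groups_complete))
print(table(groups_average))
```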
-
21/29
What can we learn from it?
Viewing the results of many clusters
“Principal component analysis (PCA) refers to the process by which principal components are computed, . . . PCA is an unsupervised approach, since it involves only a set of features . . . , and no associated response Y.”
James, et al. [5]
The default kmeans clustering is applied with different k values.
PCA is plotted for each solution.
R script is attached[4].
Some specimens move from one cluster to another.
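The attached clustergram script [4] is not reproduced here. A simpler way to get a similar view is to project the specimens onto the first two principal components and colour them by kmeans() cluster for each k. This is a sketch in base R; the seed, the range of k, and the choice to scale the data for PCA are assumptions.

```r
# Use only the four numeric measurements.
measurements <- iris[, 1:4]

# Principal components of the centred, scaled measurements.
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)
print(summary(pca))

# Fix the seed because kmeans() starts from random centres.
set.seed(42)

# One panel per k: specimens plotted on PC1 and PC2, coloured by the
# cluster that the default kmeans() assigns on the raw measurements.
old_par <- par(mfrow = c(1, 3))
for (k in 2:4) {
  km <- kmeans(measurements, centers = k)
  plot(pca$x[, 1], pca$x[, 2], col = km$cluster, pch = 19,
       xlab = "PC1", ylab = "PC2", main = paste("k =", k))
}
par(old_par)
```

Comparing the panels side by side shows which specimens change cluster as k grows.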
-
23/29
Some simple exercises to get familiar with data analysis
1 Build an iris classification tree using only sepal data
2 Build an iris classification tree using all data without specifying each element
3 Build a 3D-ish decision tree using only sepal data
4 Looking at the image on page 17, what can be said about using petal width as a decision attribute?
-
24/29
Q & A time.
Q: Why was Stonehenge abandoned?
A: It wasn’t IBM compatible.
-
25/29
What have we covered?
R has a multitude of built-in datasets
About the iris dataset:
1 It isn’t too large (only 150 rows)
2 It lends itself reasonably to linear regression[3]
3 There appear to be some “errors” (as in misclassifications), so it isn’t “pure” data
4 Decision trees aren’t too large, nor too complex
5 It is an easy place to start
Different tools and techniques give different insights into the dataset
Next: Look at R’s built-in Titanic dataset
-
26/29
References (1 of 3)
[1] Edgar Anderson, The irises of the Gaspé Peninsula, Bulletin of the American Iris Society 59 (1935), 2–5.
[2] Sarthak Dasadia, Machine Learning with Iris Dataset, https://rstudio-pubs-static.s3.amazonaws.com/202738_7cad2477d76b4acc82b44244f94ccfa8.html#/, 2016.
[3] Ronald A Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7 (1936), no. 2, 179–188.
-
27/29
References (2 of 3)
[4] Tal Galili, Clustergram: visualization and diagnostics for cluster analysis (R code), https://www.r-bloggers.com/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/, 2010.
[5] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, vol. 6, Springer, 2013.
[6] RDM Staff, k-means Clustering, http://www.rdatamining.com/examples/kmeans-clustering, 2017.
-
28/29
References (3 of 3)
[7] Dave Tang, Building a classification tree in R, http://davetang.org/muse/2013/03/12/building-a-classification-tree-in-r, 2013.
-
29/29
Files of interest
1 iris dendrogram R script
2 iris clustergram R script
3 R library script file
rm(list=ls())
## http://rischanlab.github.io/Kmeans.html
## http://www.rdatamining.com/examples/kmeans-clustering
source("library.R")
main