DATA MINING USING R PROGRAMMING LANGUAGE
Mamdouh Alenezi
QUESTIONS
Have you heard of R?
Have you ever used R in your work?
Do you know data mining and its algorithms and techniques?
DOWNLOAD R
http://www.r-project.org/
OUTLINE
What is R? and Why R?
Data Mining in R
Data Import/Export in R
Data Exploration and Visualization
Classification with R
Clustering with R
Text Mining in R
WHAT IS R? AND WHY R?
WHAT IS R?
R is a free software environment for statistical computing and graphics.
R can be easily extended with around 6,000 packages available on CRAN.
Many other packages are provided on Bioconductor, R-Forge, GitHub, etc.
WHY DO DATA SCIENCE WITH R?
Most widely used Data Mining and Machine Learning Package
Machine Learning
Statistics
Software Engineering and Programming with Data
But not the nicest of languages for a Computer Scientist!
Free (Libre) Open Source Statistical Software
All modern statistical approaches
Many/most machine learning algorithms
Opportunity to readily add new algorithms
HOW POPULAR IS R? DISCUSSION LIST TRAFFIC
Sum of monthly email traffic on each software’s main listserv discussion list.
HOW POPULAR IS R? DISCUSSION TOPICS
WHY R?
R was ranked no. 1 in the KDnuggets 2014 poll on top languages for analytics, data mining and data science (R was also no. 1 in 2011, 2012 and 2013).
http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html
DATA MINING IN R
DATA MINING
A data driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner.
DATA MINING
Application of Machine Learning
Statistics
Software Engineering and Programming with Data
Effective Communications and Intuition
…..
To Datasets that vary by: Volume, Velocity, Variety, Value, Veracity
To discover new knowledge
To improve business outcomes
To deliver better tailored services
BASIC TOOLS: DATA MINING ALGORITHMS
Cluster Analysis (kmeans, wskm)
Association Analysis (arules)
Linear Discriminant Analysis (lda)
Logistic Regression (glm)
Decision Trees (rpart, wsrpart)
Random Forests (randomForest, wsrf)
Boosted Stumps (ada)
Neural Networks (nnet)
Support Vector Machines (kernlab)
That’s a lot of tools to learn in R!
Many with different interfaces and options.
PREDICTIVE MODELLING: CLASSIFICATION
Goal of classification is to build models (sentences) in a knowledge representation (language) from examples of past decisions.
The model is to be used on unseen cases to make decisions.
Often referred to as supervised learning.
Common approaches: decision trees; neural networks; logistic regression; support vector machines.
MAJOR CLUSTERING APPROACHES
Partitioning algorithms (kmeans, pam, clara, fanny)
Hierarchical algorithms: (hclust, agnes, diana)
Density-based algorithms
Grid-based algorithms
Model-based algorithms: (mclust for mixture of Gaussians)
LANGUAGE: DECISION TREES
Knowledge representation: A flow-chart-like tree structure
Each internal node denotes a test on a variable
Each branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
DATA SCIENTISTS ARE PROGRAMMERS OF DATA
Data Scientists Desire. . .
Scripting
Transparency
Repeatability
Sharing
SOCIAL NETWORK ANALYSIS WITH R
Packages: igraph, sna
Centrality measures: degree(), betweenness(), closeness(); clustering coefficient: transitivity()
Clusters: clusters(), no.clusters()
Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
Community detection: fastgreedy.community(), spinglass.community()
R AND BIG DATA
Hadoop
Hadoop (or YARN) - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
R packages: RHadoop, RHIPE
Spark
Spark - a fast and general engine for large-scale data processing, claimed to be up to 100 times faster than Hadoop MapReduce
SparkR - R frontend for Spark
H2O
H2O - an open source in-memory prediction engine for big data science
R package: h2o
MongoDB
MongoDB - an open-source document database
R packages: rmongodb, RMongo
R AND HADOOP
Packages: RHadoop, RHive
RHadoop is a collection of R packages:
rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster
rhdfs - connect to Hadoop Distributed File System (HDFS)
rhbase - connect to the NoSQL HBase database
. . .
You can play with it on a single PC (in standalone or pseudo-distributed mode), and your code developed on that will be able to work on a cluster of PCs (in full-distributed mode)!
DATA IMPORT/EXPORT IN R
DATA IMPORT AND EXPORT
Read data from and write data to
R native formats (incl. Rdata and RDS)
CSV files
EXCEL files
ODBC databases
SAS databases
SAVE AND LOAD R OBJECTS
save(): save R objects into a .Rdata file
load(): read R objects from a .Rdata file
rm(): remove objects from R
SAVE AND LOAD R OBJECTS - MORE FUNCTIONS
save.image(): save current workspace to a file
It saves everything!
readRDS(): read a single R object from a .rds file
saveRDS(): save a single R object to a file
Advantage of readRDS() and saveRDS(): You can restore the data under a different object name.
Advantage of load() and save(): You can save multiple R objects to one file.
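The difference between the two pairs of functions can be sketched as follows. This is a minimal example using temporary files; the object names a, b and a_copy are made up for illustration.

```r
# Two small example objects (names are illustrative only).
a <- 1:10
b <- letters[1:5]

# save()/load(): several objects in one file, restored under their original names.
rdata_file <- tempfile(fileext = ".RData")
save(a, b, file = rdata_file)
rm(a, b)            # remove the objects from the workspace
load(rdata_file)    # 'a' and 'b' exist again

# saveRDS()/readRDS(): one object per file, restored under any name you choose.
rds_file <- tempfile(fileext = ".rds")
saveRDS(a, rds_file)
a_copy <- readRDS(rds_file)
```

Note that load() silently overwrites any existing objects with the same names, which is one reason to prefer readRDS() when restoring a single object.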
IMPORT FROM AND EXPORT TO .CSV FILES
write.csv(): write an R object to a .CSV file
read.csv(): read an R object from a .CSV file
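A minimal round trip, assuming the built-in iris data frame and a temporary file (the variable names are illustrative):

```r
# Write the built-in iris data frame to a temporary CSV file.
csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)  # skip the row-name column

# Read it back; stringsAsFactors = TRUE restores Species as a factor.
iris2 <- read.csv(csv_file, stringsAsFactors = TRUE)
str(iris2)
```

Without row.names = FALSE, write.csv() adds an extra column of row names that read.csv() would then import as data.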
IMPORT FROM AND EXPORT TO EXCEL FILES
Package xlsx: read, write, format Excel 2007 and Excel 97/2000/XP/2003 files
READ FROM DATABASES
Package RODBC: provides connection to ODBC databases.
Function odbcConnect(): sets up a connection to database
sqlQuery(): sends an SQL query to the database
odbcClose() closes the connection.
Functions sqlFetch(), sqlSave() and sqlUpdate(): read, write or update a table in an ODBC database
IMPORT DATA FROM SAS
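Package foreign provides function read.ssd() for importing SAS datasets (.sas7bdat files) into R. It requires a local SAS installation, because it calls SAS to convert the dataset to transport format before reading it.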
DATA EXPLORATION AND VISUALIZATION
DATA EXPLORATION AND VISUALIZATION WITH R
Data Exploration and Visualization
Summary and stats
Various charts like pie charts and histograms
Exploration of multiple variables
Level plot, contour plot and 3D plot
Saving charts into files of various formats
SIZE AND STRUCTURE OF DATA
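The code for this slide is not preserved in the transcript. A minimal sketch, assuming the built-in iris data frame that the later slides use:

```r
dim(iris)     # number of rows and columns
names(iris)   # column (variable) names
str(iris)     # compact overview: type and first values of every column
```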
ATTRIBUTES OF DATA
FIRST ROWS OF DATA
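One way this might look, again assuming the built-in iris data frame (head() shows six rows by default):

```r
first_rows <- head(iris)   # first 6 rows
first_rows
iris[1:5, ]                # rows 1 to 5 by explicit indexing
tail(iris, 3)              # last 3 rows
```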
A SINGLE COLUMN
The first 10 values of Sepal.Length
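A short sketch of two equivalent ways to pull these values, assuming the built-in iris data frame:

```r
# Index by row range and column name ...
iris[1:10, "Sepal.Length"]
# ... or extract the column with $ and then index the vector.
first_ten <- iris$Sepal.Length[1:10]
first_ten
```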
SUMMARY OF DATA
Function summary()
numeric variables: minimum, maximum, mean, median, and the first (25%) and third (75%) quartiles
categorical variables (factors): frequency of every level
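For example, assuming the built-in iris data frame (sl_summary is an illustrative name):

```r
# Whole data frame: six-number summary for numeric columns,
# level counts for the factor column Species.
summary(iris)

# A single numeric column.
sl_summary <- summary(iris$Sepal.Length)
sl_summary
```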
MEAN, MEDIAN, RANGE AND QUARTILES
Mean, median and range: mean(), median(), range()
Quartiles and percentiles: quantile()
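A minimal sketch of these functions on one column of the built-in iris data frame; the percentile levels in the last call are arbitrary examples:

```r
x <- iris$Sepal.Length

mean(x)
median(x)
range(x)                          # minimum and maximum

quantile(x)                       # 0%, 25%, 50%, 75%, 100%
quantile(x, c(0.10, 0.30, 0.65))  # arbitrary percentiles
```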
VARIANCE AND HISTOGRAM
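The slide's code is not in the transcript. A possible version, assuming the built-in iris data frame:

```r
# Sample variance of one column.
sl_var <- var(iris$Sepal.Length)
sl_var

# Histogram; hist() also returns the bin counts invisibly.
h <- hist(iris$Sepal.Length,
          main = "Histogram of Sepal.Length",
          xlab = "Sepal.Length")
```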
DENSITY
PIE CHART
Frequency of factors: table()
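A minimal sketch, assuming the built-in iris data frame: table() counts the observations per factor level and pie() plots those counts.

```r
species_counts <- table(iris$Species)  # frequency of every level
species_counts
pie(species_counts)
```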
BAR CHART
CORRELATION
Covariance and correlation: cov() and cor()
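For example, on the numeric columns of the built-in iris data frame (the choice of columns here is illustrative):

```r
# Covariance and correlation of two variables.
cov(iris$Sepal.Length, iris$Petal.Length)
sl_pl_cor <- cor(iris$Sepal.Length, iris$Petal.Length)
sl_pl_cor

# Correlation matrix of all four numeric columns.
cor_matrix <- cor(iris[, 1:4])
cor_matrix
```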
AGGREGATION
Stats of Sepal.Length for every Species with aggregate()
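A sketch of the formula interface of aggregate(), assuming the built-in iris data frame. Whether the original slide used summary or another statistic is not recoverable from the transcript, so both a summary and a mean are shown.

```r
# Full summary of Sepal.Length within each Species.
aggregate(Sepal.Length ~ Species, data = iris, FUN = summary)

# A single statistic per group.
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
species_means
```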
BOXPLOT
The bar in the middle is the median.
The box shows the interquartile range (IQR), i.e., the range between the first (25%) and third (75%) quartiles.
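A minimal example, assuming the built-in iris data frame. boxplot() returns the plotted statistics invisibly, so the medians can be inspected directly.

```r
# One box per species.
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              xlab = "Species", ylab = "Sepal.Length")

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
bp$stats[3, ]   # the median of each group
```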
SCATTER PLOT
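The slide's code is not preserved. One possible scatter plot, assuming the built-in iris data frame and using the species to pick colours and symbols (a choice made here for illustration):

```r
# Integer codes 1, 2, 3 for the three species, reused as colour and symbol.
species_cols <- as.integer(iris$Species)

plot(iris$Sepal.Length, iris$Sepal.Width,
     col = species_cols, pch = species_cols,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 1:3)
```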
A MATRIX OF SCATTER PLOTS
3D SCATTER PLOT
HEAT MAP
Calculate the pairwise distance (dissimilarity) between different flowers in the iris data with dist() and then plot it as a heat map
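A minimal sketch of this step, assuming the four numeric columns of the built-in iris data frame and the default Euclidean distance:

```r
# Pairwise Euclidean distances between all 150 flowers, as a full matrix.
dist_matrix <- as.matrix(dist(iris[, 1:4]))

# Heat map of the distance matrix; rows and columns are reordered by
# hierarchical clustering by default.
heatmap(dist_matrix)
```

Small values in the matrix mean similar flowers, so similar flowers show up as blocks of one colour along the diagonal.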
CONTOUR
contour() and filled.contour() in package graphics
contourplot() in package lattice
3D SURFACE
PARALLEL COORDINATES
SAVE CHARTS TO FILES
Save charts to PDF and PS files: pdf() and postscript()
BMP, JPEG, PNG and TIFF files: bmp(), jpeg(), png() and tiff()
Close files (or graphics devices) with graphics.off() or dev.off() after plotting
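The pattern is the same for every device: open it, plot, close it. A small sketch writing to a temporary PDF file (the file name and the plot are illustrative):

```r
# Open a PDF graphics device that writes to a temporary file.
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file)

# Everything plotted now goes into the file instead of the screen.
plot(1:10, main = "Example chart")

# Close the device; the file is only complete after this call.
dev.off()
```

Replacing pdf() with postscript(), png(), jpeg(), bmp() or tiff() follows the same open, plot, close sequence.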
CLASSIFICATION WITH R
Decision trees: rpart, party
Random forest: randomForest, party
SVM: e1071, kernlab
Neural networks: nnet, neuralnet, RSNNS
Performance evaluation: ROCR
THE IRIS DATASET
BUILD A DECISION TREE
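The code on this slide was not captured in the transcript. A minimal sketch, assuming the rpart package (one of the decision tree packages named in these slides) and a 70/30 train/test split chosen here for illustration; the original slides may have used a different package, split or seed.

```r
library(rpart)

# Reproducible random split of iris into training and test sets (assumed 70/30).
set.seed(1234)
train_idx  <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# Classification tree predicting Species from the four measurements.
iris_tree <- rpart(Species ~ ., data = train_data, method = "class")
print(iris_tree)
```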
PREDICTION
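A self-contained sketch of prediction on held-out data, under the same assumptions as a typical decision tree workflow: the rpart package and a 70/30 split of iris, both chosen here for illustration rather than taken from the original slide.

```r
library(rpart)

# Fit a tree on a random 70% of iris (assumed split).
set.seed(1234)
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
fit <- rpart(Species ~ ., data = iris[train_idx, ], method = "class")

# Predict the class of the held-out 30% and compare with the true labels.
test_data <- iris[-train_idx, ]
predicted <- predict(fit, newdata = test_data, type = "class")
confusion <- table(predicted = predicted, actual = test_data$Species)
confusion

# Share of correctly classified test cases.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```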
R PACKAGES FOR RANDOM FOREST
Package randomForest
very fast
cannot handle data with missing values
a limit of 32 to the maximum number of levels of each categorical attribute
Package party: cforest()
not limited to the above maximum levels
slow
needs more memory
TRAIN A RANDOM FOREST
ERROR RATE OF RANDOM FOREST
VARIABLE IMPORTANCE
CLUSTERING WITH R
k-means: kmeans(), kmeansruns()
k-medoids: pam(), pamk()
Hierarchical clustering: hclust(), agnes(), diana()
DBSCAN: fpc
BIRCH: birch
K-MEANS CLUSTERING
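The slide's code is not in the transcript. A minimal sketch, assuming the four numeric columns of iris and k = 3; the seed and nstart value are arbitrary choices made here.

```r
# Cluster on the measurements only; the Species label is left out.
iris_features <- iris[, 1:4]

set.seed(8953)
km <- kmeans(iris_features, centers = 3, nstart = 10)

km$centers                        # one row of coordinates per cluster centre
table(iris$Species, km$cluster)   # compare clusters with the true species
```

k-means starts from random centres, so results can differ between runs; nstart repeats the procedure and keeps the best solution.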
THE K-MEDOIDS CLUSTERING
Difference from k-means: a cluster is represented with its center in the k-means algorithm, but with the object closest to the center of the cluster in the k-medoids clustering.
more robust than k-means in presence of outliers
PAM (Partitioning Around Medoids) is a classic algorithm for k-medoids clustering.
The CLARA algorithm extends PAM by drawing multiple samples of the data, applying PAM to each sample and returning the best clustering. It scales better than PAM to larger data.
Functions pam() and clara() in package cluster
Function pamk() in package fpc does not require a user to choose k.
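A minimal sketch of k-medoids with pam() from package cluster, assuming the four numeric columns of iris and k = 3 chosen by hand (pamk() from fpc would estimate k instead):

```r
library(cluster)

# Partition the measurements around 3 medoids.
pam_fit <- pam(iris[, 1:4], k = 3)

pam_fit$medoids                          # the medoids are actual rows of the data
table(iris$Species, pam_fit$clustering)  # compare clusters with the true species
```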
CLUSTERING WITH PAMK()
Two clusters:
"setosa"
a mixture of "versicolor" and "virginica"
HIERARCHICAL CLUSTERING OF THE IRIS DATA
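The slide's code is not preserved. One way this might look, assuming a random sample of 40 flowers (so the dendrogram stays readable), Euclidean distance and average linkage; the sample size, seed and linkage are choices made here for illustration.

```r
# Draw a small random sample of rows.
set.seed(2835)
idx <- sample(nrow(iris), 40)
iris_sample <- iris[idx, 1:4]

# Agglomerative clustering on pairwise Euclidean distances.
hc <- hclust(dist(iris_sample), method = "average")

# Dendrogram with the true species as leaf labels.
plot(hc, hang = -1, labels = as.character(iris$Species[idx]))

# Cut the tree into 3 clusters and mark them on the plot.
groups <- cutree(hc, k = 3)
rect.hclust(hc, k = 3)
```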
TEXT MINING
TEXT MINING WITH R
Text mining: tm
Topic modelling: topicmodels, lda
Word cloud: wordcloud
Twitter data access: twitteR
TEXT MINING
unstructured text data
text categorization
text clustering
entity extraction
sentiment analysis
document summarization
. . .
TEXT MINING OF TWITTER DATA WITH R
1. extract data from Twitter
2. clean extracted data and build a document-term matrix
3. find frequent words and associations
4. create a word cloud to visualize important words
5. text clustering
6. topic modelling
RETRIEVE TWEETS
Retrieve recent tweets by @RDataMining
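The retrieval step might look roughly like the sketch below, which assumes the twitteR package and credentials from a registered Twitter application. The helper name fetch_tweets and its argument names are hypothetical, and the default of 200 tweets is an arbitrary choice.

```r
# Hypothetical helper: the function name and arguments are illustrative only
fetch_tweets <- function(consumer_key, consumer_secret,
                         access_token, access_secret,
                         user = "RDataMining", n = 200) {
  if (!requireNamespace("twitteR", quietly = TRUE)) {
    stop("Package 'twitteR' is required")
  }
  # Authenticate with the application's credentials
  twitteR::setup_twitter_oauth(consumer_key, consumer_secret,
                               access_token, access_secret)
  # Retrieve up to n recent tweets from the user's timeline
  tweets <- twitteR::userTimeline(user, n = n)
  # Convert the list of status objects into a data frame
  twitteR::twListToDF(tweets)
}
```

Calling fetch_tweets() needs valid credentials and network access, and whether the request succeeds depends on the access rules of the Twitter API at the time.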
TEXT CLEANING
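A minimal sketch of a cleaning pipeline with the tm package. The two sample strings stand in for real tweets, and the helper removeURL and the order of the steps are assumptions rather than the exact procedure on the slides.

```r
library(tm)

# Made-up example texts standing in for retrieved tweets
texts <- c("Text Mining with R: see http://www.rdatamining.com for 100 examples!",
           "Data mining & R, an Introduction")

corpus <- VCorpus(VectorSource(texts))

# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove URLs before punctuation is stripped (illustrative helper)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))

# Remove punctuation, numbers, English stop words and extra whitespace
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Collect the cleaned text of each document
cleaned <- vapply(seq_along(corpus),
                  function(i) content(corpus[[i]]), character(1))
print(cleaned)
```

URLs are removed first because stripping punctuation would otherwise leave fragments such as "httpwwwrdataminingcom" behind.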
FREQUENT WORDS AND ASSOCIATIONS
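One way this step could be sketched with tm is shown below. The five short documents are invented for illustration, and the thresholds (lowfreq = 3, corlimit = 0.25) are arbitrary assumptions.

```r
library(tm)

# Made-up, already cleaned documents
docs <- c("r data mining examples",
          "text mining with r",
          "data mining and text analysis",
          "r for social network analysis",
          "mining twitter data with r")

corpus <- VCorpus(VectorSource(docs))

# Keep one-letter terms such as "r" by lowering the minimum word length
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Terms that occur at least three times across all documents
freq.terms <- findFreqTerms(tdm, lowfreq = 3)
print(freq.terms)

# Terms whose occurrence correlates with "mining" at 0.25 or above
print(findAssocs(tdm, "mining", corlimit = 0.25))

# Overall term frequencies, sorted from most to least frequent
term.freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(term.freq)
```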
WORD CLOUD
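A small sketch of drawing a word cloud with the wordcloud package. The term frequencies are made-up values, and the seed and min.freq threshold are illustrative assumptions.

```r
library(wordcloud)

# Made-up term frequencies; in practice these would come from a
# term-document matrix
term.freq <- c(mining = 4, r = 4, data = 3, text = 2,
               analysis = 2, twitter = 1, network = 1)

# Fix the seed so the layout is reproducible
set.seed(375)

# Plot words that occur at least twice, most frequent words in the centre
wordcloud(words = names(term.freq), freq = term.freq,
          min.freq = 2, random.order = FALSE)
```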
CLUSTERING
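A sketch of how terms and documents might be clustered, using base R only. The small term-document matrix is hand-built for illustration, and the choices of Ward linkage, k-means and two clusters are assumptions rather than the settings used on the slides.

```r
# Hand-built term-document matrix (rows are terms, columns are documents)
m <- rbind(data     = c(1, 0, 1, 0, 1, 0),
           mining   = c(1, 1, 1, 0, 1, 0),
           text     = c(0, 1, 1, 0, 0, 0),
           r        = c(1, 1, 0, 1, 1, 1),
           network  = c(0, 0, 0, 1, 0, 1),
           analysis = c(0, 0, 1, 1, 0, 1))
colnames(m) <- paste0("doc", 1:6)

# Hierarchical clustering of terms on the scaled matrix
term.hc <- hclust(dist(scale(m)), method = "ward.D")
plot(term.hc)
term.groups <- cutree(term.hc, k = 2)
print(term.groups)

# k-means clustering of documents (transpose so rows are documents)
set.seed(122)
km <- kmeans(t(m), centers = 2, nstart = 10)
print(km$cluster)
```

Clustering rows of the matrix groups terms that appear in similar documents, while clustering the transposed matrix groups documents that share terms.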
TOPIC MODELLING
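A minimal sketch of fitting a topic model with the topicmodels package, assuming tm is used to build the document-term matrix. The six documents are invented, and the number of topics (k = 2) and the seed are arbitrary assumptions.

```r
library(tm)
library(topicmodels)

# Made-up, already cleaned documents
docs <- c("data mining with decision trees and clustering",
          "text mining and sentiment analysis of tweets",
          "clustering and classification for data mining",
          "tweets sentiment and text analysis",
          "decision trees for classification",
          "word cloud of tweets text")

# LDA() expects documents in rows and term counts in columns
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Fit a two-topic LDA model with a fixed seed for reproducibility
lda <- LDA(dtm, k = 2, control = list(seed = 123))

# Top three terms of each topic
top.terms <- terms(lda, 3)
print(top.terms)

# Most likely topic for each document
doc.topics <- topics(lda)
print(doc.topics)
```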