Getting and Cleaning Data with R
DESCRIPTION
This document explains how data can be read and cleaned using R. It also explains the use of the dplyr package and of lapply, sapply, tapply and more.
TRANSCRIPT
PARTHASARTHI CHAKRABORTY
Getting and cleaning data with R
Creating directory
if(!file.exists("Rdirectory")){
  dir.create("Rdirectory")
}
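Downloading files from the internet
fileurl="http://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
# The URL returns an Excel file, so it is saved with an .xlsx extension;
# mode="wb" keeps binary files intact on Windows.
download.file(fileurl,destfile="D:/Rdirectory/camera.xlsx",mode="wb")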
** If the URL starts with https://, you may need to specify method="curl" (mainly on older versions of R; recent versions usually handle https by default).
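The check-then-download pattern above can be sketched end to end as follows. To stay runnable offline, this is a stand-in built on assumptions: the "remote" file is a temporary CSV written by the script itself and addressed through a file:// URL, and the data values, the use of tempdir(), and the camera.csv name are all hypothetical.

```r
# Hypothetical stand-in for a remote file: a small CSV written to a temp location
src <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, speed = c(30, 45, 60)), src, row.names = FALSE)
fileurl <- paste0("file://", normalizePath(src, winslash = "/"))

# Create the target directory only if it does not exist yet
destdir <- file.path(tempdir(), "Rdirectory")
if (!dir.exists(destdir)) {
  dir.create(destdir)
}

# download.file() returns 0 (invisibly) on success
destfile <- file.path(destdir, "camera.csv")
status <- download.file(fileurl, destfile = destfile, mode = "wb", quiet = TRUE)

# Record when the file was fetched, since online data can change over time
dateDownloaded <- date()
list.files(destdir)
```

Recording the download date is a common habit rather than a requirement; it helps when the same URL later serves different data.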
Reading local files: R has a number of functions for reading local files from a directory. A few of them are:
read.table()
read.csv()
read.csv2()
read.xlsx() (from the xlsx package)
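As a small illustration of how these readers differ, the sketch below writes two tiny files and reads them back. The file contents and variable names are made up for the example; read.csv() expects comma separators and a decimal point, while read.csv2() expects semicolon separators and a decimal comma.

```r
# Hypothetical comma-separated file (decimal point)
csvFile <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1.5", "b,2.5"), csvFile)

# Hypothetical semicolon-separated file (decimal comma)
csv2File <- tempfile(fileext = ".csv")
writeLines(c("name;score", "a;1,5", "b;2,5"), csv2File)

viaCsv   <- read.csv(csvFile)                           # sep = ",", dec = "."
viaCsv2  <- read.csv2(csv2File)                         # sep = ";", dec = ","
viaTable <- read.table(csvFile, header = TRUE, sep = ",") # general-purpose reader

viaCsv
viaCsv2
```

read.table() is the general function; read.csv() and read.csv2() are convenience wrappers with different defaults.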
Reading Excel files: Excel files can be read by R using the read.xlsx function of the xlsx package.
install.packages("xlsx")
library(xlsx)
read.xlsx("D:/Rdirectory/studentdata1.xlsx",sheetIndex=1)
Reading XML files: R can read Extensible Markup Language (XML) files using the xmlTreeParse function of the XML package.
install.packages("XML")
library(XML)
fileUrl="http://www.w3schools.com/xml/simple.xml"
doc=xmlTreeParse(fileUrl,useInternalNodes=TRUE)
rootNode=xmlRoot(doc)
xmlName(rootNode)
names(rootNode)
Reading JSON files: JSON files can be read in R using the fromJSON function.
install.packages("jsonlite")
install.packages("curl")
library(jsonlite)
library(curl)
jsondata=fromJSON("http://api.github.com/users/jtleek/repos")
names(jsondata)
data.table package
install.packages("data.table")
library(data.table)
DF=data.frame(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
DT=data.table(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
head(DT,3)
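Callouts from the slides above:
Directory and URL: the local path and the web address used in the file-reading code
Need to specify the sheet index (sheetIndex) in read.xlsx()
xmlRoot(), xmlName() and names() access the parts of the XML file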
R output of head(DT,3):
            x y          z
1: -1.4783896 a -0.7029582
2:  0.4082885 a  1.0897369
3:  1.7128416 a -0.5908830
tables()
DT[c(2,3)]   # subset on rows
DT[,list(mean(x),sum(z))]
DT[,table(y)]
Adding new column
DT[,w:=z^2]
head(DT,4)
DT[,{tm=(x+z);log2(tm+4)}]
head(DT,4)
Subsetting and sorting:
mtcars[which(mtcars$hp>120),]
R output of tables():
     NAME NROW NCOL MB COLS  KEY
[1,] DT      9    3  1 x,y,z
Total: 1MB
R output of DT[c(2,3)]:
           x y         z
1: 0.4082885 a  1.089737
2: 1.7128416 a -0.590883
R output of DT[,list(mean(x),sum(z))]:
            V1        V2
1: -0.08098949 -2.374091
R output of DT[,table(y)]:
y
a b c
3 3 3
R output of head(DT,4) after adding column w:
            x y          z         w
1: -1.4783896 a -0.7029582 0.4941503
2:  0.4082885 a  1.0897369 1.1875264
3:  1.7128416 a -0.5908830 0.3491427
4:  0.5847973 b  0.3116074 0.0970992
R output of mtcars[which(mtcars$hp>120),]:
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Subsetting can also be done in other ways: with the subset() function, or with the filter() function of the dplyr package.
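A brief base R sketch of these alternatives, compared on the same condition (hp greater than 120). The toy data frame with a missing value is hypothetical and is there only to show why which() is often wrapped around the condition.

```r
# Three base R ways to keep the cars with more than 120 horsepower
byWhich   <- mtcars[which(mtcars$hp > 120), ]
byLogical <- mtcars[mtcars$hp > 120, ]
bySubset  <- subset(mtcars, hp > 120)

# Hypothetical toy data with a missing value, to show where the approaches differ
toy <- data.frame(car = c("a", "b", "c"), hp = c(130, NA, 100))
toyLogical <- toy[toy$hp > 120, ]        # keeps an all-NA row for the missing hp
toyWhich   <- toy[which(toy$hp > 120), ] # which() drops the NA
toySubset  <- subset(toy, hp > 120)      # subset() also drops the NA

nrow(toyLogical)
nrow(toyWhich)
```

On data without missing values the three forms give the same rows; with missing values, plain logical indexing is the one that behaves differently.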
Sorting:
with(mtcars,sort(hp))   # sorted values of hp only
mtcars[order(mtcars$hp),]   # whole data frame ordered by hp
Sorting can also be done using the arrange function of the plyr package.
library(plyr)
arrange(mtcars,mpg)   # ascending order
arrange(mtcars,desc(drat))   # descending order
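For comparison, a minimal base R sketch of the same sorts using order(), including a descending sort and a two-key sort. The variable names are chosen for the example.

```r
# Ascending by hp
byHp <- mtcars[order(mtcars$hp), ]

# Descending by drat
byDratDesc <- mtcars[order(mtcars$drat, decreasing = TRUE), ]

# Two keys: cyl ascending, then mpg descending within each cyl group
# (negating a numeric column reverses its order)
byCylThenMpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]

head(byCylThenMpg[, c("cyl", "mpg")])
```

Unlike arrange(), indexing with order() keeps the car names as row names.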
Summarizing data: Data can be summarized using the summary() and str() functions.
summary(mtcars)
str(mtcars)
Creating variable using %in%
restData=read.csv("D:/Datascience/R in action chapter wise code/R codes of coursera/Restaurants.csv")
names(restData)
names(restData)=c("Name","ZIP","Neighbourhood","Councildist","Policedist","Location")
restData$nearMe=restData$Neighbourhood %in% c("Roland Park","Homeland")
Creating binary variable
restData$ZIPwrong=ifelse(restData$ZIP<0,TRUE,FALSE)
table(restData$ZIPwrong,restData$ZIP<0)
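Because Restaurants.csv is a local file, the two steps above can be tried on a stand-in. The sketch below assumes a tiny, made-up restData with the same column names; the restaurant names and ZIP values are hypothetical.

```r
# Hypothetical stand-in for the restaurants data
restData <- data.frame(
  Name = c("Cafe A", "Diner B", "Grill C", "Bistro D"),
  ZIP = c(21210, -21226, 21212, 21201),
  Neighbourhood = c("Roland Park", "Downtown", "Homeland", "Fells Point"),
  stringsAsFactors = FALSE
)

# %in% returns TRUE for each row whose value appears in the given set
restData$nearMe <- restData$Neighbourhood %in% c("Roland Park", "Homeland")

# Binary flag for implausible (negative) ZIP codes
restData$ZIPwrong <- ifelse(restData$ZIP < 0, TRUE, FALSE)

# Cross-check the flag against the condition it was built from
table(restData$ZIPwrong, restData$ZIP < 0)
restData[restData$nearMe, "Name"]
```

Since restData$ZIP < 0 is already a logical vector, ifelse() is not strictly needed here; it becomes useful when the two outcomes are labels such as "wrong" and "ok".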
Reshaping data
library(reshape2)
head(mtcars)
names(mtcars)
mtcars$Carname=rownames(mtcars)
names(mtcars)
Melting the dataset
carMelt=melt(mtcars,id=c("Carname","gear","cyl"),measure.vars=c("mpg","hp"))
head(carMelt,5)
Casting the dataset
cylData=dcast(carMelt,cyl~variable,mean)
cylData
R output of head(carMelt,5):
            Carname gear cyl variable value
1         Mazda RX4    4   6      mpg  21.0
2     Mazda RX4 Wag    4   6      mpg  21.0
3        Datsun 710    4   4      mpg  22.8
4    Hornet 4 Drive    3   6      mpg  21.4
5 Hornet Sportabout    3   8      mpg  18.7
R output of cylData:
  cyl      mpg        hp
1   4 26.66364  82.63636
2   6 19.74286 122.28571
3   8 15.10000 209.21429
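The per-cylinder means from the cast step can be cross-checked without reshape2. The sketch below swaps in base R's aggregate(), which is a different technique from melt/dcast but should arrive at the same table of means.

```r
# Mean mpg and hp for each value of cyl, using the formula interface of aggregate()
cylMeans <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
cylMeans

# The same mpg means via tapply(), as an independent check
mpgByCyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpgByCyl
```

melt/dcast remains the more flexible route when several id variables and measure variables have to be rearranged at once.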
sapply, lapply and tapply
head(InsectSprays)   # InsectSprays is a dataset that ships with base R (datasets package)
tapply(InsectSprays$count,InsectSprays$spray,sum)
spIns=split(InsectSprays$count,InsectSprays$spray)
spIns
sprCount=lapply(spIns,sum)
sprCount
sapply(spIns,sum)
R output of tapply(InsectSprays$count,InsectSprays$spray,sum):
  A   B   C   D   E   F
174 184  25  59  42 200
R output of head(InsectSprays):
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A
R output of spIns (first three groups):
$A
 [1] 10  7 20 14 14 12 10 23 17 20 14 13
$B
 [1] 11 17 21 11 16 14 17 17 19 21  7 13
$C
 [1] 0 1 7 2 3 1 2 1 3 0 1 4
R output of sprCount:
$A
[1] 174
$B
[1] 184
$C
[1] 25
$D
[1] 59
$E
[1] 42
$F
[1] 200
R output of sapply(spIns,sum):
  A   B   C   D   E   F
174 184  25  59  42 200
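The outputs above differ mainly in shape, which the following sketch makes explicit. vapply() is added here as an assumption about what a reader may want next; it is not used in the slides.

```r
spIns <- split(InsectSprays$count, InsectSprays$spray)

byLapply <- lapply(spIns, sum)               # always a list
bySapply <- sapply(spIns, sum)               # simplified to a named vector
byVapply <- vapply(spIns, sum, numeric(1))   # like sapply, but the result type is declared
byTapply <- tapply(InsectSprays$count, InsectSprays$spray, sum)  # split + apply in one call

# When the function returns more than one value, sapply builds a matrix
rangeBySpray <- sapply(spIns, range)
rangeBySpray
```

tapply() is the shortest form when the grouping variable is already at hand; split() plus lapply()/sapply() is handy when the groups are reused for several computations.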
Working with dplyr
install.packages("dplyr")
library(dplyr)
Select function (subsetting the columns)
names(mtcars)
head(select(mtcars,mpg:hp))
head(select(mtcars,-(mpg:hp)))
Other way
mtcars[,-c(1:4)]
i=match("mpg",names(mtcars))
j=match("hp",names(mtcars))
head(mtcars[,-(i:j)])
Filter function (subsetting the rows)
y=filter(mtcars,mpg>16)
y
Select columns from mpg to hp. R output of head(select(mtcars,mpg:hp)):
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
Valiant           18.1   6  225 105
Select columns excluding mpg to hp. R output of head(select(mtcars,-(mpg:hp))):
                  drat    wt  qsec vs am gear carb
Mazda RX4         3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     3.90 2.875 17.02  0  1    4    4
Datsun 710        3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 3.15 3.440 17.02  0  0    3    2
Valiant           2.76 3.460 20.22  1  0    3    1
R output of y, the result of filter(mtcars,mpg>16) (first rows only; the output is cut off in the source):
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
7 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
8 22.8   4 140.8  95 3.92 3.150 22.90  1  0
arrange function for sorting data
srt1=arrange(mtcars,cyl)   # ascending order
head(srt1)
srt2=arrange(mtcars,desc(hp))   # descending order
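rename function
renamed=rename(mtcars,horse_power=hp)   # stored under a new name so that the code below can still use hp
mutate function (to create a new variable)
mtcars=mutate(mtcars,partha=mpg+log(hp))
summarize function
# general pattern: summarize(group_by(data,categorical_variable),new_name=function(numerical_variable))
summarize(group_by(mtcars,cyl),mean_mpg=mean(mpg))
Editing text variables
toupper(names(mtcars))   # all variable names to uppercase
tolower(names(mtcars))   # all variable names to lowercase
splitNames=strsplit(names(mtcars),"\\.")   # split each name at "."
splitNames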
R output of head(srt1):
   mpg cyl  disp hp drat    wt  qsec vs am gear carb
1 22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
2 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
3 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
4 32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
5 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
6 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1
R output of srt2 (first six rows):
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
2 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
3 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
4 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
6 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
R output of strsplit(names(mtcars),"\\.") (first four elements):
[[1]]
[1] "mpg"
[[2]]
[1] "cyl"
[[3]]
[1] "disp"
[[4]]
[1] "hp"
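Since the mtcars names contain no dots, the split leaves them unchanged. A sketch with made-up dotted names shows what the split is typically used for; varNames and the helper firstElement are hypothetical.

```r
# Hypothetical variable names of the kind read.csv() produces from "Location 1" etc.
varNames <- c("Location.1", "Police.dist", "ZIP")

lowerNames <- tolower(varNames)              # normalize case first
splitNames <- strsplit(lowerNames, "\\.")    # "\\." matches a literal dot

# Keep only the part before the first dot
firstElement <- function(x) x[1]
cleanNames <- sapply(splitNames, firstElement)
cleanNames

# Names without dots pass through unchanged
sapply(strsplit(names(mtcars), "\\."), firstElement)
```

The dot is escaped because an unescaped "." is a regular-expression wildcard that matches any character.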
sub and gsub functions
testname="I_am_happy"
sub("_","",testname)   # replaces only the first match: Iam_happy
gsub("_","",testname)   # replaces every match: Iamhappy
gsub("_",":",testname)   # I:am:happy
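One point the example above does not show is that the pattern is a regular expression by default. The sketch below contrasts that with fixed = TRUE; the string dotted is made up for the illustration.

```r
testname <- "I_am_happy"
firstOnly <- sub("_", "", testname)     # only the first underscore is removed
allGone   <- gsub("_", "", testname)    # every underscore is removed
withColon <- gsub("_", ":", testname)   # every underscore becomes a colon

# Hypothetical string containing dots
dotted <- "a.b.c"
asRegex <- gsub(".", "", dotted)                 # "." matches any character, so everything is removed
asFixed <- gsub(".", "", dotted, fixed = TRUE)   # treated literally, only the dots are removed
asFixed
```

Escaping the dot as "\\." has the same effect as fixed = TRUE in this case.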
Finding values using grep() and grepl()
grep(6,mtcars$cyl)   # gives the observations which have a cyl value equal to 6
R output: 1 2 4 6 10 11 30
table(grepl(8,mtcars$cyl))
R output:
FALSE  TRUE
   18    14
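grep() matches patterns inside text, so it agrees with an equality test here only because cyl takes the values 4, 6 and 8. The sketch below shows the agreement and then a made-up vector, ids, where substring matching gives a different answer.

```r
# On cyl, pattern matching and equality select the same rows
hits <- grep(6, mtcars$cyl)
sameAsWhich <- identical(hits, which(mtcars$cyl == 6))
sixCylCars <- rownames(mtcars)[hits]

# Hypothetical values where substring matching differs from equality
ids <- c(1, 10, 21, 100)
substringHits <- grep(1, ids)      # every element contains a "1"
exactHits     <- grep("^1$", ids)  # anchors restrict the match to exactly "1"

# value = TRUE returns the matching strings instead of their positions
mercs <- grep("^Merc", rownames(mtcars), value = TRUE)

# grepl() returns a logical vector, which can be summed or tabulated
eightCount <- sum(grepl(8, mtcars$cyl))
```

For numeric comparisons, which() with == is usually the safer choice; grep() and grepl() are better suited to text such as the car names.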
Working with dates
%d -> day as a number (01-31)
%a -> abbreviated weekday
%A -> unabbreviated weekday
%m -> month as a number (01-12)
%b -> abbreviated month
%B -> unabbreviated month
%y -> two-digit year
%Y -> four-digit year
d1=date()
d1   # "Sun Apr 26 08:52:14 2015"
class(d1)   # "character"
d2=Sys.Date()
class(d2)   # "Date"
format(d2,"%a %b %Y")
format(d2,"%A %B %Y")
weekdays(d2)
months(d2)
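A short sketch of the format codes in use, built on a fixed date so the results are reproducible. The date 2015-04-26 is taken from the d1 output above; the other variable names are chosen for the example.

```r
d <- as.Date("2015-04-26")

# Formatting a Date with numeric codes (not affected by the locale)
dmy      <- format(d, "%d/%m/%Y")   # "26/04/2015"
shortYr  <- format(d, "%y")         # "15"

# Parsing a character string back into a Date with the same codes
parsed <- as.Date("26/04/2015", format = "%d/%m/%Y")

# Dates support arithmetic
daysSinceNewYear <- as.numeric(d - as.Date("2015-01-01"))   # 115
nextWeek <- d + 7

# Names of weekdays and months depend on the locale of the R session
weekdays(d)
months(d)
```

Because %a, %A, %b and %B produce locale-dependent text, the numeric codes are the safer choice when dates are written to files and read back.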