Getting and Cleaning Data with R

PARTHASARTHI CHAKRABORTY Getting and cleaning data with R

Upload: parthasarathi-chakraborty

Post on 06-Feb-2016

DESCRIPTION

This document explains how data can be read and cleaned using R. It also explains the use of the dplyr package and of lapply, sapply, tapply and more.

TRANSCRIPT

Page 1: Getting and Cleaning data with R

PARTHASARTHI CHAKRABORTY

Getting and cleaning data with R

Page 2: Getting and Cleaning data with R

Creating directory

    if(!file.exists("Rdirectory")){
      dir.create("Rdirectory")
    }

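Downloading files from the internet

    fileurl="http://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
    download.file(fileurl,destfile="D:/Rdirectory/camera.xlsx",mode="wb")

The URL requests the Excel export (rows.xlsx), so the file is saved with an .xlsx extension; mode="wb" keeps a binary file intact on Windows.

** If the URL starts with https://, method="curl" may need to be specified, depending on the platform and the R version.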

Reading local files: R has a number of functions for reading local files from a directory. A few of them:

    read.table()
    read.csv()
    read.csv2()
    read.xlsx()    (from the xlsx package)
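As a minimal sketch of the directory and read steps above, the snippet below uses only base R. The folder under tempdir(), the file name students.csv and its contents are made up for illustration; they are not part of the original slides.

```r
# Hypothetical sandbox folder under tempdir(); it stands in for "Rdirectory".
data_dir <- file.path(tempdir(), "Rdirectory")
if (!file.exists(data_dir)) {
  dir.create(data_dir)
}

# Write a small made-up comma-separated file so there is something to read.
csv_path <- file.path(data_dir, "students.csv")
writeLines(c("name,score", "Asha,81", "Ben,74", "Chitra,90"), csv_path)

# read.table() needs the separator and the header stated explicitly ...
via_table <- read.table(csv_path, sep = ",", header = TRUE)

# ... while read.csv() already defaults to sep = "," and header = TRUE.
via_csv <- read.csv(csv_path)

str(via_csv)
```

read.csv2() follows the same pattern but expects semicolon-separated fields and a decimal comma.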

Reading Excel files: Excel files can be read in R with the read.xlsx function of the xlsx package.

    install.packages("xlsx")
    library(xlsx)
    read.xlsx("D:/Rdirectory/studentdata1.xlsx",sheetIndex=1)

Reading XML files: R can read Extensible Markup Language (XML) with the xmlTreeParse function of the XML package.

    install.packages("XML")
    library(XML)
    fileUrl="http://www.w3schools.com/xml/simple.xml"
    doc=xmlTreeParse(fileUrl,useInternalNodes=TRUE)
    rootNode=xmlRoot(doc)
    xmlName(rootNode)
    names(rootNode)

Reading JSON files: JSON files can be read in R with the fromJSON function of the jsonlite package.

    install.packages("jsonlite")
    install.packages("curl")
    library(jsonlite)
    library(curl)
    jsondata=fromJSON("http://api.github.com/users/jtleek/repos")
    names(jsondata)

data.table package

    install.packages("data.table")
    library(data.table)
    DF=data.frame(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
    DT=data.table(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
    head(DT,3)

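Slide callouts for the code on this page:

    Directory: "Rdirectory"
    URL: the address stored in fileurl
    read.xlsx: need to specify the sheet index
    xmlRoot, xmlName, names: accessing the parts of the XML file

R output of head(DT,3):

                x y          z
    1: -1.4783896 a -0.7029582
    2:  0.4082885 a  1.0897369
    3:  1.7128416 a -0.5908830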

Page 3: Getting and Cleaning data with R

    tables()

R output:

         NAME NROW NCOL MB COLS  KEY
    [1,] DT      9    3  1 x,y,z
    Total: 1MB

    DT[c(2,3)]    # subset on rows

R output:

               x y         z
    1: 0.4082885 a  1.089737
    2: 1.7128416 a -0.590883

    DT[,list(mean(x),sum(z))]

R output:

                V1        V2
    1: -0.08098949 -2.374091

    DT[,table(y)]

R output:

    y
    a b c
    3 3 3

Adding new column

    DT[,w:=z^2]
    head(DT,4)
    DT[,{tm=(x+z);log2(tm+4)}]
    head(DT,4)

R output:

                x y          z         w
    1: -1.4783896 a -0.7029582 0.4941503
    2:  0.4082885 a  1.0897369 1.1875264
    3:  1.7128416 a -0.5908830 0.3491427
    4:  0.5847973 b  0.3116074 0.0970992

Subsetting and sorting:

    mtcars[which(mtcars$hp>120),]

R output:

                         mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
    Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
    Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
    Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
    Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
    Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
    Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
    Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
    Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
    Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
    AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
    Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
    Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
    Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Page 4: Getting and Cleaning data with R

Subsetting can also be done in two other ways: with the base subset function, or with the filter function of the dplyr package.
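A short sketch of the base subset route, using the same hp > 120 condition as the which() example. The variable names by_which, by_subset and hp_mpg are illustrative only.

```r
# Rows with hp above 120, selected two ways on the built-in mtcars data.
by_which  <- mtcars[which(mtcars$hp > 120), ]
by_subset <- subset(mtcars, hp > 120)

# subset() can also keep only some columns through its select argument.
hp_mpg <- subset(mtcars, hp > 120, select = c(mpg, hp))

# Both row selections contain the same cars.
all.equal(by_which, by_subset)
nrow(by_subset)
```

Inside subset() the column names are written bare (hp rather than mtcars$hp), which is the main convenience over indexing with which().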

Sorting:

    with(mtcars,sort(hp))
    mtcars[order(mtcars$hp),]

Sorting can also be done with the arrange function of the plyr package:

    library(plyr)
    arrange(mtcars,mpg)           # ascending order
    arrange(mtcars,desc(drat))    # descending order
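order() also accepts several keys and a decreasing flag, which covers most sorting needs without an extra package. A minimal base R sketch (the variable names are illustrative):

```r
# Sort by cyl ascending, then by hp descending within each cyl group.
by_cyl_hp <- mtcars[order(mtcars$cyl, -mtcars$hp), ]
head(by_cyl_hp[, c("cyl", "hp")])

# decreasing = TRUE reverses the whole ordering.
by_hp_desc <- mtcars[order(mtcars$hp, decreasing = TRUE), ]
rownames(by_hp_desc)[1]
```

Negating a numeric key is a common trick for mixing ascending and descending keys in one call.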

Summarizing data: Data can be summarized with the summary() and str() functions.

    summary(mtcars)
    str(mtcars)

Creating a variable using %in%

    restData=read.csv("D:/Datascience/R in action chapter wise code/R codes of coursera/Restaurants.csv")
    names(restData)
    names(restData)=c("Name","ZIP","Neighbourhood","Councildist","Policedist","Location")
    restData$nearMe=restData$Neighbourhood %in% c("Roland Park","Homeland")

Creating a binary variable

    restData$ZIPwrong=ifelse(restData$ZIP<0,TRUE,FALSE)
    table(restData$ZIPwrong,restData$ZIP<0)
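Since Restaurants.csv is not bundled with R, the sketch below repeats both steps on a small made-up data frame. The object rest and all of its values are invented for illustration.

```r
# Made-up stand-in for Restaurants.csv; names and values are illustrative only.
rest <- data.frame(
  Name = c("Cafe A", "Diner B", "Grill C", "Bistro D"),
  ZIP = c(21210, -21226, 21212, 21231),
  Neighbourhood = c("Roland Park", "Downtown", "Homeland", "Fells Point")
)

# %in% returns TRUE for every row whose value appears in the lookup vector.
rest$nearMe <- rest$Neighbourhood %in% c("Roland Park", "Homeland")

# A comparison is already logical, so ifelse(cond, TRUE, FALSE) is optional.
rest$ZIPwrong <- rest$ZIP < 0

table(rest$nearMe)
table(rest$ZIPwrong, rest$ZIP < 0)
```

ifelse() becomes useful when the two outcomes are something other than TRUE and FALSE, for example labels such as "bad" and "ok".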

Reshaping data

    library(reshape2)
    head(mtcars)
    names(mtcars)
    mtcars$Carname=rownames(mtcars)
    names(mtcars)

Melting the dataset

    carMelt=melt(mtcars,id=c("Carname","gear","cyl"),measure.vars=c("mpg","hp"))
    head(carMelt,5)

R output:

                Carname gear cyl variable value
    1         Mazda RX4    4   6      mpg  21.0
    2     Mazda RX4 Wag    4   6      mpg  21.0
    3        Datsun 710    4   4      mpg  22.8
    4    Hornet 4 Drive    3   6      mpg  21.4
    5 Hornet Sportabout    3   8      mpg  18.7

Casting the dataset

    cylData=dcast(carMelt,cyl~variable,mean)
    cylData

R output:

      cyl      mpg        hp
    1   4 26.66364  82.63636
    2   6 19.74286 122.28571
    3   8 15.10000 209.21429
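The per-cylinder means produced by dcast can be cross-checked in base R with aggregate(), which is a different technique from melt and dcast but should give the same numbers. A minimal sketch (cyl_means is an illustrative name):

```r
# Mean mpg and hp per cylinder count, computed without melting first.
cyl_means <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
cyl_means
```

The melt and dcast route pays off when the same long table has to be cast along several different grouping variables.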

Page 5: Getting and Cleaning data with R

lapply, sapply and tapply

    head(InsectSprays)    # InsectSprays is a dataset available in the datasets package

R output:

      count spray
    1    10     A
    2     7     A
    3    20     A
    4    14     A
    5    14     A
    6    12     A

    tapply(InsectSprays$count,InsectSprays$spray,sum)

R output:

      A   B   C   D   E   F
    174 184  25  59  42 200

    spIns=split(InsectSprays$count,InsectSprays$spray)
    spIns

R output (groups A to C):

    $A
    [1] 10 7 20 14 14 12 10 23 17 20 14 13

    $B
    [1] 11 17 21 11 16 14 17 17 19 21 7 13

    $C
    [1] 0 1 7 2 3 1 2 1 3 0 1 4

    sprCount=lapply(spIns,sum)
    sprCount

R output:

    $A
    [1] 174

    $B
    [1] 184

    $C
    [1] 25

    $D
    [1] 59

    $E
    [1] 42

    $F
    [1] 200

    sapply(spIns,sum)

R output:

      A   B   C   D   E   F
    174 184  25  59  42 200
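The difference between these functions is mostly in the shape of the result. The sketch below puts them side by side and adds vapply(), which is not on the slides but is a base R function that checks the type of each result. The variable names are illustrative.

```r
sp_ins <- split(InsectSprays$count, InsectSprays$spray)

as_list   <- lapply(sp_ins, sum)              # list, one element per spray
as_vector <- sapply(sp_ins, sum)              # simplified to a named vector
checked   <- vapply(sp_ins, sum, numeric(1))  # like sapply, but type-checked

# tapply() does the split and the sum in a single step.
by_tapply <- tapply(InsectSprays$count, InsectSprays$spray, sum)

class(as_list)
as_vector
```

vapply() is often preferred in scripts because it stops with an error when a result does not match the declared type and length.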

Page 6: Getting and Cleaning data with R

Working with dplyr

    install.packages("dplyr")
    library(dplyr)

select function (subsetting the columns)

    names(mtcars)
    head(select(mtcars,mpg:hp))    # select columns from mpg to hp

R output:

                       mpg cyl disp  hp
    Mazda RX4         21.0   6  160 110
    Mazda RX4 Wag     21.0   6  160 110
    Datsun 710        22.8   4  108  93
    Hornet 4 Drive    21.4   6  258 110
    Hornet Sportabout 18.7   8  360 175
    Valiant           18.1   6  225 105

    head(select(mtcars,-(mpg:hp)))    # select columns excluding mpg to hp

R output:

                      drat    wt  qsec vs am gear carb
    Mazda RX4         3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     3.90 2.875 17.02  0  1    4    4
    Datsun 710        3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 3.15 3.440 17.02  0  0    3    2
    Valiant           2.76 3.460 20.22  1  0    3    1

Other ways of excluding the same columns:

    mtcars[,-c(1:4)]

    i=match("mpg",names(mtcars))
    j=match("hp",names(mtcars))
    head(mtcars[,-(i:j)])

filter function (subsetting the rows)

    y=filter(mtcars,mpg>16)
    y

R output:

       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    3 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    4 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    5 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    6 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    7 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    8 22.8   4 140.8  95 3.92 3.150 22.90  1  0

Page 7: Getting and Cleaning data with R

arrange function for sorting data

    srt1=arrange(mtcars,cyl)    # ascending order
    head(srt1)

R output:

       mpg cyl  disp hp drat    wt  qsec vs am gear carb
    1 22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
    2 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    3 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
    4 32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
    5 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
    6 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1

    srt2=arrange(mtcars,desc(hp))    # descending order
    head(srt2)

R output:

       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    1 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
    2 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
    3 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
    4 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
    5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
    6 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4

rename function

    renamed=rename(mtcars,horse_power=hp)

mutate function (to create a new variable)

    mutated=mutate(mtcars,partha=mpg+log(hp))

The results are stored under new names here. Overwriting mtcars with the renamed data would remove the hp column that the mutate call and the output below still use.

summarize function: general form summarize(group_by(data, categorical variable), name=function(numerical variable))

    summarize(group_by(mtcars,cyl),mean_mpg=mean(mpg))

Editing text variables

    toupper(names(mtcars))    # all variable names to upper case
    tolower(names(mtcars))    # all variable names to lower case
    t=strsplit(names(mtcars),"\\.")
    t

R output (first four elements):

    [[1]]
    [1] "mpg"

    [[2]]
    [1] "cyl"

    [[3]]
    [1] "disp"

    [[4]]
    [1] "hp"
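The names of mtcars contain no dots, so the split above returns every name unchanged. The effect of strsplit is easier to see on names that do contain a dot; the column names below are hypothetical and only serve as an illustration.

```r
# Hypothetical column names with dots, as read.csv() often produces.
col_names <- c("Location.1", "Police.District", "zipCode")

# "\\." is a literal dot; an unescaped "." would match any character.
parts <- strsplit(col_names, "\\.")
parts[[1]]

# Keep only the text before the first dot in every name.
first_part <- sapply(parts, function(x) x[1])
first_part
```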

Page 8: Getting and Cleaning data with R

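sub and gsub functions

    testname="I_am_happy"
    sub("_","",testname)     # replaces only the first match

R output: Iam_happy

    gsub("_","",testname)    # replaces every match

R output: Iamhappy

    gsub("_",":",testname)

R output: I:am:happy

Finding values using grep() and grepl()

    grep(6,mtcars$cyl)

grep gives the positions of the observations whose cyl value matches the pattern 6, which here are the observations with cyl equal to 6.

R output: 1 2 4 6 10 11 30

    table(grepl(8,mtcars$cyl))

R output:

    FALSE  TRUE
       18    14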
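grep() and grepl() match text patterns rather than numeric values, which matters once a column holds more than a few distinct numbers. The sketch below contrasts pattern matching with an exact comparison; the variable names are illustrative.

```r
# Pattern matching treats 6 as the text "6", so it also matches 62 or 264.
hp_hits <- grep(6, mtcars$hp)                        # positions containing a "6"
hp_hits_values <- grep(6, mtcars$hp, value = TRUE)   # the matching values as text

# For an exact numeric comparison, == with which() is the safer tool.
cyl_six <- which(mtcars$cyl == 6)

# cyl only takes the values 4, 6 and 8, so here both approaches agree.
identical(grep(6, mtcars$cyl), cyl_six)
```

grepl() returns one TRUE or FALSE per element, which makes it convenient inside table() or as a row filter.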

Working with dates

Format codes:

    %d -> day of the month as a number (01-31)
    %a -> abbreviated weekday
    %A -> unabbreviated weekday
    %m -> month as a number (01-12)
    %b -> abbreviated month
    %B -> unabbreviated month
    %y -> two-digit year
    %Y -> four-digit year

    d1=date()
    d1

R output: "Sun Apr 26 08:52:14 2015"

    class(d1)

R output: "character"

    d2=Sys.Date()
    class(d2)

R output: "Date"

    format(d2,"%a %b %Y")
    format(d2,"%A %B %Y")
    weekdays(d2)
    months(d2)
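The format codes work in both directions: format() turns a Date into text, and as.Date() turns text into a Date. A small sketch with a fixed date, chosen to match the day shown in d1 so the result does not depend on when the code is run:

```r
# A fixed date keeps the example reproducible.
d <- as.Date("2015-04-26")

format(d, "%d/%m/%y")   # two-digit day, month and year
format(d, "%Y-%m-%d")   # four-digit year first

# Parsing goes the other way: text plus a format string gives a Date.
parsed <- as.Date("26/04/2015", format = "%d/%m/%Y")
class(parsed)

# Dates support arithmetic; the difference is measured in days.
days_since_new_year <- as.numeric(d - as.Date("2015-01-01"))
days_since_new_year
```

Weekday and month names produced by %a, %A, %b and %B depend on the locale of the R session, so they are left out of this sketch.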