introduction flat files - amazon s3 · importing data into r flat files states.csv comma separated...

32
IMPORTING DATA INTO R Introduction Flat Files

Upload: others

Post on 23-Jul-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

IMPORTING DATA INTO R

Introduction Flat Files

Page 2: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

Importing data into R

?

Page 3: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

5 Types● Flat Files

● Excel Files

● Statistical So!ware

● Databases

● Data from the Web

!

"

Page 4: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

Flat FilesComma Separated Values states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

#

Field names

> wanted_df state capital pop_mill area_sqm 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 3 Oregon Salem 3.970 98381 4 Vermont Montpelier 0.627 9616 5 Hawaii Honolulu 1.420 10931

?

Page 5: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

utils package● Loaded by default when you start R

● read.table()

● Read any tabular file as a data frame

● Number of arguments is huge

Page 6: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

read.table()> read.table("states.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

#

Page 7: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

read.table() - file> read.table("states.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

> path <- file.path("~", "datasets", "states.csv")

What if file is not in current working directory?

path to flat file

> read.table(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

> path [1] "~/datasets/states.csv"

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

#

Page 8: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

read.table() - header> read.table("states.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

read.table() sets header = FALSE by default

first row lists variable names

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

#

Page 9: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

read.table() - sep> read.table("states.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

states.csv$

field separator is a comma

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

#

Page 10: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

> read.table("states.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

read.table() - stringsAsFactors

Import strings as categorical variables?

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

#

Page 11: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

read.table()> read.table("states.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

state capital pop_mill area_sqm 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 3 Oregon Salem 3.970 98381 4 Vermont Montpelier 0.627 9616 5 Hawaii Honolulu 1.420 10931

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

#

Page 12: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

Wrappers● Tiring to always specify

● CSV: common and standardized

● Wrapper with different defaults: read.csv()

● header = TRUE sep = ","

> read.table("states.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

> read.csv("states.csv", stringsAsFactors = FALSE)

Page 13: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

Tab-delimited files

> read.table("states.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

> read.delim("states.txt", stringsAsFactors = FALSE)

states.txt

state capital pop_mill area_sqm South Dakota Pierre 0.853 77116 New York Albany 19.746 54555 Oregon Salem 3.970 98381 Vermont Montpelier 0.627 9616 Hawaii Honolulu 1.420 10931

#

Page 14: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

Tab-delimited files> read.table("states.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

> read.delim("states.txt", stringsAsFactors = FALSE)

state capital pop_mill area_sqm 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 3 Oregon Salem 3.970 98381 4 Vermont Montpelier 0.627 9616 5 Hawaii Honolulu 1.420 10931

state capital pop_mill area_sqm 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 3 Oregon Salem 3.970 98381 4 Vermont Montpelier 0.627 9616 5 Hawaii Honolulu 1.420 10931

Page 15: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

Locale differences states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

#

states_eu.csv

state;capital;pop_mill;area_sqm South Dakota;Pierre;0,853;77116 New York;Albany;19,746;54555 Oregon;Salem;3,97;98381 Vermont;Montpelier;0,627;9616 Hawaii;Honolulu;1,42;10931

#

Page 16: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)

%

read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)

%

Locale differences

Page 17: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

states_eu.csv> read.csv("states_eu.csv", stringsAsFactors = FALSE) state.capital.pop_mill.area_sqm South Dakota;Pierre;0 853;77116 New York;Albany;19 746;54555 ... Hawaii;Honolulu;1 42;10931

> read.csv2("states_eu.csv", stringsAsFactors = FALSE) state capital pop_mill area_sqm 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 ... 5 Hawaii Honolulu 1.420 10931

states_eu.csv

state;capital;pop_mill;area_sqm South Dakota;Pierre;0,853;77116 New York;Albany;19,746;54555 Oregon;Salem;3,97;98381 Vermont;Montpelier;0,627;9616 Hawaii;Honolulu;1,42;10931

#

Page 18: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

IMPORTING DATA INTO R

Let’s practice!

Page 19: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

IMPORTING DATA INTO R

readr data.table

Page 20: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

Previously● utils package

● Specific R packages

● readr

● data.table

Page 21: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

readr● Hadley Wickham

● Fast, easy to use, consistent

● utils: verbose, slower

> install.packages("readr") > library(readr)

Page 22: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

read_delim()

> read_delim("states.csv", delim = ",")

> read.table("states.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

!

Page 23: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

read_delim()> read_delim("states.csv", delim = ",")

> read.table("states.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)

read.table() read_delim()

header (FALSE) col_names (TRUE)

stringsAsFactorscolClasses col_types

state capital pop_mill area_sqm 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 3 Oregon Salem 3.970 98381 4 Vermont Montpelier 0.627 9616 5 Hawaii Honolulu 1.420 10931

state capital pop_mill area_sqm 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 3 Oregon Salem 3.970 98381 4 Vermont Montpelier 0.627 9616 5 Hawaii Honolulu 1.420 10931

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

!

Page 24: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

col_names> read_delim("states2.csv", delim = ",", col_names = FALSE) X1 X2 X3 X4 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 3 Oregon Salem 3.970 98381 4 Vermont Montpelier 0.627 9616 5 Hawaii Honolulu 1.420 10931

> read_delim("states2.csv", delim = ",", col_names = c("state", "city", "pop", "area")) state city pop area 1 South Dakota Pierre 0.853 77116 2 New York Albany 19.746 54555 3 Oregon Salem 3.970 98381 4 Vermont Montpelier 0.627 9616 5 Hawaii Honolulu 1.420 10931

states2.csv

South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

!

Page 25: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

col_types> st <- read_delim("states.csv", delim = ",")

c = characterd = double i = integer l = logical _ = skip

> sapply(st, class) state capital pop_mill area_sqm "character" "character" "numeric" "integer"

> st2 <- read_delim("states.csv", delim = ",", col_types = "ccdd")

> sapply(st2, class) state capital pop_mill area_sqm "character" "character" "numeric" "numeric"

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

!

Page 26: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

col_types> st3 <- read_delim("states.csv", delim = ",", col_types = "ccd_")

> > st3 state capital pop_mill 1 South Dakota Pierre 0.853 2 New York Albany 19.746 3 Oregon Salem 3.970 4 Vermont Montpelier 0.627 5 Hawaii Honolulu 1.420

Other option: collector functions

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

!

c = characterd = double i = integer l = logical _ = skip

Page 27: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

skip and n_max> read_delim("states.csv", delim = ",", skip = 2, n_max = 3)

New York Albany 19.746 54555 1 Oregon Salem 3.970 98381 2 Vermont Montpelier 0.627 9616 3 Hawaii Honolulu 1.420 10931

> read_delim("states.csv", delim = ",", col_names = c("state","city","pop","area"), skip = 2, n_max = 3)

state city pop area 1 New York Albany 19.746 54555 2 Oregon Salem 3.970 98381 3 Vermont Montpelier 0.627 9616

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

! states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

!

Page 28: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

utils and readr

utils readr

read.table() read_delim()

read.csv() read_csv()

read.delim() read_tsv()

Page 29: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

data.table● Ma! Dowle & Arun Srinivasan

● Key metric: speed

● Data manipulation in R

● Function to import data: fread()

● Similar to read.table()

> install.packages("data.table") > library(data.table)

Page 30: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

fread()

> fread("states.csv") state capital pop_mill area_sqm 1: South Dakota Pierre 0.853 77116 2: New York Albany 19.746 54555 3: Oregon Salem 3.970 98381 4: Vermont Montpelier 0.627 9616 5: Hawaii Honolulu 1.420 10931

> fread("states2.csv") V1 V2 V3 V4 1: South Dakota Pierre 0.853 77116 2: New York Albany 19.746 54555 3: Oregon Salem 3.970 98381 4: Vermont Montpelier 0.627 9616 5: Hawaii Honolulu 1.420 10931

states.csv

state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

! states2.csv

South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555 Oregon,Salem,3.970,98381 Vermont,Montpelier,0.627,9616 Hawaii,Honolulu,1.420,10931

!

Page 31: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

Importing Data into R

fread()● Infer column types and separators

● It simply works

● Extremely fast

● Possible to specify numerous parameters

● Improved read.table()

● Fast, convenient, customizable

Page 32: Introduction Flat Files - Amazon S3 · Importing Data into R Flat Files states.csv Comma Separated Values state,capital,pop_mill,area_sqm South Dakota,Pierre,0.853,77116 New York,Albany,19.746,54555

IMPORTING DATA INTO R

Let’s practice!