
R Programming for Data Science

Sovello Hildebrand

Outline

● History of R
● Installation (Windows and Linux)
● Data Types
● Reading Data:
– Tabular
– Large datasets
● Textual Data Formats
● Subsetting:
– Lists, Matrices, Partial matching
– Removing missing values
● Vectorized operations
● Control Structures
– If-else
– For, while, repeat, next, break
● Functions
– Scoping
● Dates and Times
● Loop functions
– lapply, tapply, apply, mapply, split
● Simulation and profiling
– Generating random numbers, simulating a linear model, random sampling
● Visualizations

History of R

● R originates from the S language. S was initiated in 1976 as an internal statistical analysis environment, originally implemented as Fortran libraries
– History of S: http://www.stat.bell-labs.com/S/history.html
● R development history:
– https://en.wikipedia.org/wiki/R_(programming_language)

R and Statistics

● R developed from S, which is a statistical analysis tool, and so is R
● Its functionality is divided into modules (packages)
– You need to load a module to use the functionality it provides
● Has more sophisticated graphics capabilities than most other statistical packages
● Useful for interactive work: run from the terminal
● Contains a powerful programming language for developing new tools
– Tools: for visualizations and analysis

Design of the R System

● The “base” system, downloaded from CRAN
● “All other stuff”
● Packages in R
– The “base” system contains the base package, which is required to run R and has the most fundamental functions
– Other packages shipped with the “base” system: utils, stats, datasets, graphics, grDevices, tools, etc. Some of these (e.g. stats and graphics) are attached automatically when R starts; others (e.g. tools) must be loaded before use
– Recommended packages: boot, class, cluster, codetools, foreign, lattice, etc.
– Load packages with library() or require()

R Resources

● CRAN:
– http://cran.r-project.org
● Quick-R: a book
– http://www.statmethods.net/
● R-bloggers (platform): not a social network
– R-Bloggers is about empowering bloggers to empower other R users
– R-Bloggers.com is a blog aggregator of content contributed by bloggers who write about R (in English)
– https://www.r-bloggers.com/

Installation of R: Ubuntu

● Run from terminal:
– sudo apt-get install r-base r-base-dev
● If this doesn’t work, then you need to:
– Add the repository:
  echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" | sudo tee -a /etc/apt/sources.list
– Add the keyring:
  gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
  gpg -a --export E084DAB9 | sudo apt-key add -
– Install R-Base:
  sudo apt-get update; sudo apt-get install r-base r-base-dev
● You can also install from a PPA, which has the most recent versions
– Add the PPA:
  sudo add-apt-repository ppa:marutter/rrutter
– Install R-Base:
  sudo apt-get update; sudo apt-get install r-base r-base-dev

Installation of R: Windows

● Visit CRAN (Comprehensive R Archive Network): https://cran.r-project.org/
● Click/select Download R for Windows: https://cran.r-project.org/bin/windows/
● Then click/select base or install R for the first time: https://cran.r-project.org/bin/windows/base/
● Then click/select Download R X.X.X for Windows
● After the download has finished, locate the downloaded file and install it.

RStudio: www.rstudio.com


RStudio: Introduction

● RStudio is a set of integrated tools designed to help you be more productive with R.

● How?
– It includes a console,
– a syntax-highlighting editor that supports direct code execution,
– a variety of robust tools for plotting, viewing history, debugging and managing your workspace.

RStudio: Installation

● From the RStudio home page (http://www.rstudio.com/), go to Products then select RStudio
– Then scroll down and click Download RStudio Desktop
– Then click Download under RStudio Desktop Personal License (http://www.rstudio.com/products/rstudio/download/).
– Select RStudio for your platform. Clicking on the link will download the file directly.
– Locate the file in your system Downloads folder and start the installation.

RStudio: Parts

The Console is where you write and run code interactively

The Files tab shows all the files and folders in your default workspace as if you were on a PC/Mac window.

The Plots tab will show all your graphs.

The Packages tab will list a series of packages or add-ons needed to run certain processes.

For additional info see the Help tab

The Environment tab shows all the active objects

The History tab shows a list of commands used so far

RStudio: Working Directory

● It is important to organize all files for a particular project under one main/parent directory

● A working directory in RStudio is where all the files for a particular project are stored

● All paths used in the console to load data files and scripts are relative to the working directory.

● To set the working directory:
– Start RStudio the same way you start other programs on your computer
– From the File menu, select New Project, then New Directory, then Empty Project. Type the directory name (rprogramming), then under "Create project as subdirectory of" click Browse and select Desktop

R: Getting Started

● A few basic commands to try on the console
– getwd(): get the current working directory
– setwd("/path/to/directory"): set the working directory to the specified path
– install.packages("package_name"): install a package. Requires an internet connection
– library(package_name), require(package_name): load and attach add-on packages
– ?object: show documentation/help for an object, e.g. ?mtcars
– summary(object): provide a summary of an object such as a dataset, e.g. summary(mtcars)
● Every time you run library(package_name) and get the error “there is no package called ‘package_name’”, you need to install the package first and then call library() on it.
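The basic commands can be tried together in one short session. The sketch below is only an illustration: it uses the stats package (which ships with R) as a stand-in for any add-on package, so the install step is never actually triggered, and the object names wd and mtcars_summary are made up for this example.

```r
# Install a package only if it is missing, then load it.
# "stats" ships with R, so no internet connection is needed here;
# replace it with the add-on package you actually want.
pkg <- "stats"
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)  # requires an internet connection
}
library(pkg, character.only = TRUE)

# Where is R currently reading and writing files?
wd <- getwd()
print(wd)

# Column-by-column summary of a built-in dataset
mtcars_summary <- summary(mtcars)
print(mtcars_summary)
```

Checking with requireNamespace() first avoids the “there is no package called ...” error, because the package is installed before library() is called.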

Data Visualizations in R: Introduction

● R has different systems (packages) for making graphs (visualizations)
● Here we are going to use ggplot2 (http://docs.ggplot2.org/current/), which is more elegant and versatile than many of the others:
– ggvis: http://ggvis.rstudio.com/
– rgl: http://rgl.neoscientists.org/about.shtml
– htmlwidgets: http://www.htmlwidgets.org/
– googleVis: https://cran.rstudio.com/web/packages/googleVis
● ggplot2 is built upon “The Layered Grammar of Graphics”: http://vita.had.co.nz/papers/layered-grammar.pdf

Data Visualizations in R: Tidyverse

● Tidyverse is a set of packages
– The packages work in harmony because they share common data representations and API design.
● The tidyverse package makes it easy to install and load the core packages in a single command
● To install, run: install.packages("tidyverse")
● To use it, run: library(tidyverse), which loads the tidyverse core packages: ggplot2, tibble, tidyr, readr, purrr, and dplyr.
– Google each one of these packages to learn what they do

Data Visualizations: First Steps

● library(tidyverse) loads all the core packages from the tidyverse
● The library() call also reports any conflicts with base R or other packages that arise from loading the named package.
● e.g. in this case filter() and lag() from dplyr conflict with (mask) the functions of the same name in the stats package
● In this case you may need to call a function explicitly from a package, in the form package::function()
● e.g. ggplot2::ggplot() calls the ggplot() function from the ggplot2 package.
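To see why the package::function() form matters, here is a small sketch using only the stats package. stats::filter() is a time-series filter, which is a completely different function from dplyr's filter() for selecting rows. The vector x and the name moving_avg are made up for this example.

```r
# stats::filter() applies a linear filter to a series; here a two-point
# moving average that uses only the current and the previous value.
x <- c(2, 4, 6, 8)
moving_avg <- stats::filter(x, rep(1 / 2, 2), sides = 1)
print(moving_avg)

# Writing the package name explicitly guarantees that this version runs,
# even after library(tidyverse) has masked filter() with dplyr::filter().
```

The first value of the result is NA because there is no previous value to average with.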

● Which is more fuel efficient: cars with big engines or cars with small engines?
● The mpg data frame:
– A data frame is a rectangular collection of variables (in columns) and observations (in rows)
– The mpg data frame in ggplot2 contains observations collected by the US Environmental Protection Agency on 38 models of cars.
● Run ?mpg (from the console) to learn more about the data set.

First Steps: Creating a ggplot

● To answer the question about fuel efficiency, plot fuel efficiency (hwy, highway miles per gallon, on the y-axis) against engine size (displ, on the x-axis)
● See the magic of this command:
> ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))
● The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy): cars with bigger engines use more fuel.

Creating a ggplot

● In ggplot2:
– You begin with the function ggplot(). ggplot() creates a coordinate system that you can add layers onto. The first argument is the data set that you are going to use for plotting
– To complete the graph, add more layers to the coordinate system created by ggplot()
– The geom_point() function adds a layer of points to the plot (which creates a scatter plot in this case)
– Each geom function in ggplot2 takes a mapping argument, which defines how variables are mapped to visual properties.
– The mapping argument is always paired with aes(). The x and y arguments of aes() specify which variables to map to the x and y axes.
– ggplot2 looks for the mapped variables in the data argument, in this case mpg

Creating a ggplot: Template

● A graphing template for ggplot

● You can get a list of <GEOM_FUNCTION>s by following this link (http://docs.ggplot2.org/current/)
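The template image from the slide is not reproduced here. As a hedged reconstruction, the general pattern used throughout these slides is shown as a comment below, followed by one concrete instance; the placeholder names <DATA> and <MAPPINGS> are assumptions, chosen to match the <GEOM_FUNCTION> placeholder mentioned above.

```r
library(ggplot2)

# General pattern (placeholders in angle brackets are not valid R):
#   ggplot(data = <DATA>) +
#     <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# One concrete instance: <DATA> = mpg, <GEOM_FUNCTION> = geom_point,
# <MAPPINGS> = x = displ, y = hwy
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
print(p)
```

Swapping geom_point for another geom function, or changing the variables inside aes(), produces a different plot from the same template.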



ggplot: Aesthetic Mappings

● Look at the graph and note the circled dots

● What is special with these big engine cars?

ggplot: Aesthetics

● ggplot2 aesthetic mappings can help answer the question
● An aesthetic is a visual property of the objects in a plot.
– These are things like the size, shape or color of points.
● You can therefore display a point in different ways by changing the values of its aesthetic properties.
● You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset.
– e.g. you can map the colors of your points to the class variable to reveal the class of each car.

ggplot: Aesthetics

● New plot with aesthetics for class:
  ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = class))
● Try year and manufacturer as well, and look at the trends

ggplot: Aesthetics

● Other aesthetics:
– Size: best suited to ordered variables; the size of each point reflects the value of the mapped variable
– Alpha: controls the transparency of the points
– Shape: points will be of different shapes
● Exercise: try plotting the same geom with these different aesthetics
● ggplot2 takes care of selecting a reasonable scale to use with the aesthetic and constructs a legend

ggplot: Aesthetics

● The aesthetic properties of a geom can be set manually.
– For example:
  ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
– This will set all points to blue
– Note that color is outside aes()

ggplot: Facets

● When the data has categorical variables, it is possible to split the plot into facets.
● Facets are subplots that each display a subset of the data.
● To facet a plot by a single variable, use the function facet_wrap(formula, …)
– The formula is created with ~ variable-name
– Here, formula is the name of a data structure in R, not a synonym for equation.
– The variable (variable-name) should be discrete.

ggplot: Facets

● For example:
– ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = "red") +
    facet_wrap(~ class, nrow = 3)
● This will produce a plot for each distinct value of mpg$class, and the plots will be arranged in three rows.

ggplot: Facets

● Can we facet the plot using two discrete variables?
● Do this:
– ?facet_grid
– ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ cyl)
● In the plot, why do we have empty sub-plots?

ggplot: Facets

● Hack:
– With facet_grid(), what happens when you use a . in place of one variable?
– Is there an advantage of faceting over the color aesthetic? Any disadvantages? What if the dataset is very large?
– In facet_wrap(), what do nrow and ncol do?
– When using facet_grid(), put the variable with more unique levels in the columns (RHS of the formula). Why?
– Why doesn’t facet_grid() have nrow and ncol arguments?

ggplot2: Geometric objects (geoms)

● These are the geometric objects used to represent the data.
– e.g. bar geoms, point geoms, line geoms, smooth geoms, etc.
● To change the geom in your plot, change the geom function (geom_xxx())
● For example:
– ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
– ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))
● Not every aesthetic works with every geom
– e.g. you can set the shape of a point, but not the shape of a line
– Read: ?geom_point, ?geom_smooth

ggplot2: geoms

● ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
● Try:
– ggplot(data = mpg) +
    geom_line(mapping = aes(x = displ, y = hwy, linetype = drv))

ggplot2: geoms

● Plot:
– ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
– ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
● What is the difference? Which is better? Why?

ggplot2: combined geoms

● Can we use more than one geom on the same plot?
● Try:
– ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    geom_smooth(mapping = aes(x = displ, y = hwy))
● When using multiple geoms on the same plot you can use global mappings:
– ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth()
● This makes the code easier to read and modify.

ggplot2: combined geoms

● When you use global mappings and also set some mappings in a geom function, the geom's mappings are treated as local to that layer only: they extend or override the global mappings for that layer.
● For example:
– ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = class)) +
    geom_smooth()

ggplot2: combined geoms

● In the same way, you can specify different data for each layer.
– Say you only want to fit a smooth line for one class of cars
– ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = class)) +
    geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
– Hack: can we plot more than one of the same geom? Try a smooth geom for a different car class

Combined Geoms: exercise

ggplot2: geoms

● How many geoms does ggplot2 have?
– Visit this page: https://www.rstudio.com/resources/cheatsheets/ and look for the Data Visualization Cheat Sheet
● ggplot2 extensions provide more geoms to use. Take a look at the available extensions in this gallery: http://www.ggplot2-exts.org/gallery/

ggplot2: statistical transformations

● Read: ?diamonds
– ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut))
– Where does count come from?

Statistical Transformations

● Some plots plot raw values
– e.g. scatterplots
● Some plots use calculated values
– bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
– smoothers fit a model to your data and then plot predictions from the model. (Remember regression lines)
– boxplots compute a robust summary of the distribution and then display a specially formatted box.

Statistical Transformation

● The algorithm used to calculate new values for a graph is called a stat (statistical transformation)
● You can check which stat a geom uses by looking at the default value of its stat argument.
– geom_bar() uses count. Thus you can recreate the bar chart by running
  ggplot(data = diamonds) +
    stat_count(mapping = aes(x = cut))
● Every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation.

Statistical Transformation

● You can explicitly specify a stat:
● When you want to override the default stat, e.g. run
  demo <- tribble(
    ~a, ~b,
    "bar_1", 20,
    "bar_2", 30,
    "bar_3", 40
  )
  Then run
  ggplot(data = demo) +
    geom_bar(mapping = aes(x = a, y = b), stat = "identity")

Statistical Transformation

● Reasons to explicitly specify a stat (continued):
– You want to override the default mapping from transformed variables to aesthetics.
  ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
– This will draw a bar chart of proportions instead of counts (in recent versions of ggplot2, ..prop.. can be written as after_stat(prop))

Position Adjustments

● A bar chart can be colored in either of two ways: colour (the outline) and fill (the interior).
– ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, colour = cut))
– ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = cut))


● Check what the following plots look like
– ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity))
– ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
    geom_bar(alpha = 1/5, position = "identity")
– ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
    geom_bar(fill = NA, position = "identity")
– ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
– ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")


● Learn more about position adjustments
– ?position_dodge
– ?position_fill
– ?position_identity
– ?position_jitter
– ?position_stack

Position Adjustments: overplotting

● Recall: ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
– It displays fewer points than the 234 observations in the data set (can you count them?)
– The values of displ and hwy are rounded, so the points fall on a grid and many overlap each other. This problem is called overplotting.
● You can avoid this gridding by setting the position adjustment to "jitter"
– position = "jitter" adds a small amount of random noise to each point
– Since no two points are likely to receive the same amount of noise, they are going to be spread out.
● Jittering makes the graph less accurate at small scales, but more revealing at large scales.
● In ggplot2 the shorthand for geom_point(position = "jitter") is geom_jitter()

Position Adjustments: jitter

● ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

Thank You! Asanteni!

Working with Data

● In this part we are going to learn how to work with your data.
– Getting data: importing your own data, tidying data
– How to work with different data types: relational data, strings, factors, dates and times

Importing Data

● For importing files, we will use the readr package, which is part of the tidyverse core packages.
● Most readr functions turn flat files into data frames. A data frame is a tabular data format with rows and columns. It is a list of vectors of equal length.
– read_csv(): reads comma-separated files
– read_csv2(): reads semicolon-separated files
– read_tsv(): reads tab-delimited files
– read_delim(): reads files with any delimiter
● Activity:
– Check what read_table(), read_fwf() and read_log() do.

Importing Data: read_csv()

● The first argument is the path to the file to read
– read_csv("data/students.csv")
● read_csv() prints out a column specification
● read_csv() by default uses the first row as the column names
– You can use skip = n to skip the first n lines if they contain data you don’t need (most likely metadata)
– You can use comment = "#" to drop all lines that start with #, for example
– Use col_names = FALSE so that read_csv() doesn’t treat the first row as the column names
● Missing values in R are represented by NA. When loading files where missing values are written differently, use the na argument, e.g. na = "." if missing values are written as a period.
– What will this line do?
  read_csv("students.csv", skip = 2, comment = "//", col_names = FALSE, na = "-")
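The comment and na arguments can be tried without a file on disk. This is a minimal sketch that assumes readr 2.0 or later (for I(), which marks a string as literal data, and for show_col_types); the student names and scores are invented for the example.

```r
library(readr)

# Inline CSV text standing in for a file: one comment line, a header row,
# and "-" used to mark a missing score.
csv_text <- "# exported from the registry\nname,score\nAsha,12\nJuma,-\n"

students <- read_csv(
  I(csv_text),             # treat the string as data, not as a file path
  comment = "#",           # drop the line starting with #
  na = "-",                # read "-" as a missing value
  show_col_types = FALSE   # do not print the column specification
)
print(students)
```

Because "-" is declared as the missing-value marker, the score column is still read as numeric, with NA for Juma.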

Importing Data: Parsing

● The parse_*() functions:
– ?parse_logical, ?parse_integer, ?parse_date
● The parse functions take in a character vector and return a more specialized vector.
– Character strings can contain anything, all letters and numbers, e.g. "dLab", "2013", "xyz3", "12.09"
– A specialized vector contains, say, only integers, only decimal numbers, or only dates. This is what the parse functions do: return a vector of one specific type
● A vector in R is a sequence of values of the same type, created with c()
– For example
  names <- c("John", "Jean", "Giovanni", "Joni")
  dates_of_birth <- c("2012-12-31", "1988-05-02", "1990-01-06")

Importing Data: Parsing

● What happens to the following?
  parse_integer(c("1", "231", ".", "456"), na = ".")
  x <- parse_integer(c("123", "345", "abc", "123.45"))
● parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers, so they are not described further here.
● parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.
● parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.
● parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
● parse_datetime(), parse_date(), and parse_time() allow you to parse various date and time specifications. These are the most complicated because there are so many different ways of writing dates.
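A few of the parsers in action, as a small sketch. The input strings are invented for this example, and the date deliberately differs from the ones in the exercise that follows.

```r
library(readr)

# "." is declared as the missing-value marker, so it becomes NA
ints <- parse_integer(c("1", "231", ".", "456"), na = ".")
print(ints)

# parse_number() ignores non-numeric text around the number
# and drops the grouping mark (a comma in the default locale)
amount <- parse_number("$1,234.5")
print(amount)

# parse_date() with an explicit format: day/month/four-digit year
day <- parse_date("07/03/2015", format = "%d/%m/%Y")
print(day)
```

The format string tells parse_date() that 07 is the day and 03 the month, so the result is 7 March 2015 rather than 3 July.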

Importing Data: parsing

● One important thing to note when parsing characters is the encoding. UTF-8 is the most common, and specifying it may save you hours of fixing problems. Specify it when parsing characters, like
  x <- "El Niño was particularly bad this year"
  parse_character(x, locale = locale(encoding = "utf-8"))
● ?parse_datetime, ?parse_date, ?parse_time
● Generate correct format strings to parse each of the following dates and times
– d1 <- "January 1, 2010"
– d2 <- "2015-Mar-07"
– d3 <- "06-Jun-2017"
– d4 <- c("August 19 (2015)", "July 1 (2015)")
– d5 <- "12/30/14" # Dec 30, 2014
– t1 <- "1705"
– t2 <- "11:15:10.12 PM"

Importing Data: parsing files

● example_file <- read_csv(readr_example("challenge.csv"))
● Use the problems() function to look at any issues with the import
– problems(example_file)
● Specify the column types explicitly when reading the file
  example_file <- read_csv(readr_example("challenge.csv"),
    col_types = cols(
      x = col_double(),
      y = col_date()
    )
  )
● Use tail(dataframe, n = X) and head(dataframe, n = X) to look at the last and first X rows of the data frame.

Parsing files

● One more strategy to get the column types right is to use the guess_max option when reading in a file, so that more rows are used to guess each column's type.
  example_file2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)

Writing to a file

● If you want to save the data as CSV you can use either write_csv() or write_tsv(), where you need to specify
– The data frame you are saving
– The file path (location) where to save it
● Optionally:
– You can set how missing values are written with na
– You can also append to an existing file
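Writing, appending and reading back can be sketched as follows. The data frames scores and more_scores are invented, the file goes to a temporary location, and show_col_types assumes readr 2.0 or later.

```r
library(readr)

scores <- data.frame(name = c("Asha", "Juma"), score = c(12, NA))
out_path <- tempfile(fileext = ".csv")  # temporary file for the example

# Write the data frame, storing missing values as "-"
write_csv(scores, out_path, na = "-")

# Append one more row to the same file (no header is written when appending)
more_scores <- data.frame(name = "Neema", score = 9)
write_csv(more_scores, out_path, append = TRUE)

# Read the file back, telling readr that "-" means missing
back <- read_csv(out_path, na = "-", show_col_types = FALSE)
print(back)
```

The same na value has to be used when reading the file back; otherwise the "-" would be read as text and the score column would become character.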

Parsing Files

● Group Activity
– Download the dataset "Number of Trainees with Special Needs enrolled in Vocational Training Centres" from http://opendata.go.tz
– Read it into a data frame and do some manipulations, including making some plots
– Inspect read_rds() and write_rds() and see where you can use these functions
– Explore these packages: haven, readxl, DBI

Tidy Data

● A tidy dataset has these features
– Each variable is in its own column
– Each observation is in its own row
– Each value is in its own cell
● ?gather, ?spread
● Missing values:
– Can be explicit: stated with NA
– Can be implicit: not present in the data
● gather(…, na.rm = TRUE) drops the rows whose value is missing, turning explicit missing values into implicit ones
● You can use the complete() function to make implicit missing values explicit in tidy data.
– ?complete

Case Study

● Optionally download the data from http://www.who.int/tb/country/data/download/en/
● Load the data from the file or from the package: tidyr::who
● Looking at the data:
– country, iso2 and iso3 are redundant: all three identify a country
– year is clearly a variable
– The other columns have unclear names; look at the data dictionary

Case Study (continued)

● Gather all the other columns, removing all missing values
  who1 <- who %>%
    gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
● Look at the structure of the values in the new key column by counting
  who1 %>%
    count(key)
– Use the data dictionary for the definition of the keys
● Make the key names consistent by replacing "newrel" with "new_rel"
  who2 <- who1 %>%
    mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
● Separate the key variable into different columns
  who3 <- who2 %>%
    separate(key, c("new", "type", "sexage"), sep = "_")
● Look at the new column
  who3 %>%
    count(new)
● Drop the new column because it is constant
  who4 <- who3 %>%
    select(-new)
● Separate sexage into sex and age
  who5 <- who4 %>%
    separate(sexage, c("sex", "age"), sep = 1)

Writing Code in R

● Create new objects with <-, in the format object_name <- object_value
● The <- symbol is the assignment operator
● Examples:
– first_name <- "Sovello"
– date.of.birth <- "12/31/1980"
– PlaceOfBirth <- "Njombe"
– AGE <- 37
– x = 200 * 5
● Object names must start with a letter.
● Object names can only contain letters, numbers, underscore (_), and period (.)
– Look at the examples above

Writing code in R

● You can look at what is in an object by typing its name
● You can also print an object explicitly
– print(first_name)
  [1] "Sovello"
– The [1] shown in the output indicates that first_name is a vector and "Sovello" is its first element.

Writing code in R

● All character (text) values must be enclosed in double or single quotes ("value" or 'value')
– Look at the definition of place.of.birth in the screenshot
● Typos matter when using object names. Case matters too: surname and Surname are not the same.
● The # character indicates a comment. Anything to the right of # is ignored by R
● There are no multi-line comments; start each comment line with #

Group Exercise (5min)

● What is wrong with this code snippet?
  Surname <- "Mkulima"
  surname
● If you start typing a value for an object and press Enter before the closing quote or parenthesis, the console will look like
  college <- "College of informatics
  +
– A + means R is waiting for you to continue typing. What would you do to fix, stop or escape from the problem?
● Fix the errors in this piece of code until it works
  library(tidyverse)
  ggplot(dota = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))
  fliter(mpg, cyl = 8)

R Objects

● R has five basic (atomic) classes of objects
– Character
– Numeric (real numbers)
– Integer
– Complex
– Logical (TRUE/FALSE)
● The most basic type of R object is a vector. An empty vector can be created with vector()
● A vector can only contain objects of the same type.
● Numbers are generally treated as numeric objects
– If you want an integer, you have to explicitly add the L suffix: 1L is an integer, 1 is a real number
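The atomic classes can be checked directly with class(). The short sketch below also shows what happens when values of different types are combined in one vector; the names classes, empty and mixed are made up for this example.

```r
# Class of a few literal values
classes <- c(
  class("dLab"),   # character
  class(1),        # numeric (a real number)
  class(1L),       # integer, because of the L suffix
  class(2 + 4i),   # complex
  class(TRUE)      # logical
)
print(classes)

# vector() creates an "empty" vector of a given type and length
empty <- vector("numeric", length = 3)
print(empty)

# A vector holds one type only, so mixed values are coerced to a common type
mixed <- c(1.5, "a", TRUE)
print(class(mixed))
```

Because a character string is present, the number and the logical value in mixed are both converted to character.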

R Objects

● Inf is a special number which represents infinity.
– You can use Inf in calculations, e.g. 1/Inf is 0
● Creating vectors
● Use the c() function to create vectors
> x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex

Coercion of R objects

● You can explicitly coerce objects using the as.* functions: ?as.integer, ?as.character, ?as.logical, ?as.numeric
> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
● If R fails to coerce an object, it produces NAs.
> x <- c("a", "b", "c")
> as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
> as.logical(x)
[1] NA NA NA
> as.complex(x)
Warning: NAs introduced by coercion
[1] NA NA NA

R Objects: Matrices

● Matrices are vectors with a dimension attribute.
● The dimension is an integer vector of length 2 (number of rows, number of columns)
> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3

Matrices

● Matrices are constructed column-wise, so entries start at the “upper left” corner and run down the columns
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
● You can create matrices from vectors by adding a dimension attribute
> m <- 1:10
> m
 [1]  1  2  3  4  5  6  7  8  9 10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
● Every element of a matrix must be of the same class (e.g. all integers or all numeric).
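Column-wise filling is only the default. As a small sketch, the byrow argument of matrix() fills the same values row by row instead, and square brackets pick out single elements or whole rows; the names m_col and m_row are made up for this example.

```r
# Default: values fill the matrix down the columns
m_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: values fill the matrix across the rows
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(m_col)
print(m_row)

# Indexing is [row, column]; leaving one side empty selects everything
print(m_col[2, 1])  # second row, first column of the column-filled matrix
print(m_row[2, 1])  # same position in the row-filled matrix
print(m_col[1, ])   # the whole first row
```

The same position holds a different value in the two matrices, which is an easy way to check how a matrix was filled.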

Group work

● What do cbind() and rbind() do?
● Create 3 vectors and 3 matrices.
● Create 3 matrices from vectors
● Create 2 matrices using cbind() and rbind()
● Read about R lists: how to create them using list()

R Objects: Factors

● Factors represent categorical data
● Factors can be ordered or unordered
● Factor objects can be created with the factor() function
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no  yes no
Levels: no yes
> table(x)
x
 no yes
  2   3

Factors

● Say you want to sort a vector
> x1 <- c("Dec", "Apr", "Jan", "Mar")
> sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
● The target was to see the months sorted in calendar order: Jan, Mar, Apr, Dec
● To solve this problem we can make use of factors
– Create a vector of month levels
  month_levels <- c(
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
  )
● Then create a factor from x1 using the month levels.
> y1 <- factor(x1, levels = month_levels)
● Applying sort() to the new variable will produce the values in calendar order
> sort(y1)
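Put together as one runnable piece, the month example looks like this. The variable names follow the slide; sorted_months is an extra name introduced here to hold the result.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Plain character vectors sort alphabetically
print(sort(x1))

# The levels define the order that sort() should respect
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)

# Factors sort by level order, i.e. calendar order here
sorted_months <- sort(y1)
print(sorted_months)
```

Any value of x1 that is not among the levels would silently become NA, so it is worth checking the spelling of the levels.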

R Objects: missing values

● Missing values are denoted by NA; undefined mathematical operations give NaN
– is.na() is used to test objects if they are NA
– is.nan() is used to test for NaN
● NA values have a class also, so there are integer NA, character NA, etc.
● A NaN value is also NA but the converse is not true
> ## Create a vector with NAs in it
> x <- c(1, 2, NA, 10, 3)
> ## Return a logical vector indicating which elements are NA
> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
> ## Return a logical vector indicating which elements are NaN
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
● What is the difference between a missing value (NA) and zero?
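The relationship between NA and NaN is easier to see when both appear in the same vector. A small sketch, with an invented vector x that contains one of each:

```r
# 0/0 is an undefined operation and produces NaN
x <- c(1, 2, NA, 0 / 0, 3)

na_flags <- is.na(x)    # TRUE for both NA and NaN
nan_flags <- is.nan(x)  # TRUE for NaN only
print(na_flags)
print(nan_flags)

# na.rm = TRUE drops NA and NaN before computing the mean of 1, 2 and 3
clean_mean <- mean(x, na.rm = TRUE)
print(clean_mean)
```

A zero would have been included in the mean and pulled it down; a missing value is simply left out when na.rm = TRUE.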

R Objects: Data Frames

● Data frames store tabular data in R
● Data frames are represented as a special type of list where every element of the list has to have the same length.
● Each element of the list can be thought of as a column and the length of each element of the list is the number of rows.
● Unlike matrices, data frames can store different classes of objects in each column.

Data Frames

> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2

Writing Code in R

● Scripts:
– Turning interactive code into scripts

Data Transformation

● Filter rows with filter()
– Comparisons: >, >=, <, <=, !=, ==
– Beware of == with floating point numbers: sqrt(2) ^ 2 == 2 is FALSE; use near(sqrt(2) ^ 2, 2) instead
– Logical operators: and &, or |, not !
– x %in% y is shorthand for a series of | comparisons, e.g. 2 %in% c(1, 2, 3, 4)
– To detect missing values use is.na(x)
● Ordering: use arrange()
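The comparison and ordering tools can be tried on the built-in mtcars dataset. This is an illustrative sketch using dplyr; the thresholds and the name efficient_big are invented for the example.

```r
library(dplyr)

# Floating point arithmetic is approximate, so == can surprise you
exact_equal <- sqrt(2)^2 == 2        # FALSE
close_enough <- near(sqrt(2)^2, 2)   # TRUE

# Keep cars with 6 or 8 cylinders that do more than 20 miles per gallon,
# then order them from most to least efficient
efficient_big <- mtcars %>%
  filter(cyl %in% c(6, 8), mpg > 20) %>%
  arrange(desc(mpg))
print(efficient_big)
```

Conditions separated by commas inside filter() are combined with "and", so a row must satisfy all of them to be kept.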

Reading Data: large datasets

● With much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.
– Read the help page for read.table, which contains many hints
– Check that the dataset fits in memory: stop if your RAM is smaller than the size of the file
– Set comment.char = "" if there are no commented lines in your file.
– Use the colClasses argument. Specifying this option instead of using the default can make read.table run MUCH faster, often twice as fast. You have to know the class of each column
– Set nrows. This doesn’t make R run faster but it helps with memory usage.

Reading large datasets

● A quick way to figure out the classes of each column is the following:
> initial <- read.table("datatable.txt", nrows = 100)
> classes <- sapply(initial, class)
> tabAll <- read.table("datatable.txt", colClasses = classes)
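The same two-step approach can be tried end to end without a real dataset. The sketch below first writes an invented 500-row table to a temporary file, so everything about the data (column names, sizes, the path) is an assumption made for the example.

```r
# Create a small stand-in for a "large" file
path <- tempfile(fileext = ".txt")
fake_data <- data.frame(
  id = 1:500,
  value = rnorm(500),
  label = sample(letters, 500, replace = TRUE)
)
write.table(fake_data, path, row.names = FALSE)

# Step 1: read only the first 100 rows and record each column's class
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)
print(classes)

# Step 2: read the whole file, telling read.table the classes up front
tab_all <- read.table(path, header = TRUE, colClasses = classes,
                      comment.char = "")
print(dim(tab_all))
```

This works when the first rows are representative of the whole file; a column that only turns into text further down would cause an error in step 2.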

Control Structures

● Control structures allow you to control the flow of execution of a series of R expressions.

● Control structures allow you to put some “logic” into R code, rather than just always executing the same R code every time.

● Control structures allow you to respond to inputs or to features of the data and execute different R expressions accordingly.

Control Structures: if-else

● The if-else structure allows you to test a condition and act on it depending on whether it’s true or false
– You can use the if statement on its own
  if(<condition>) {
      ## do something
  }
  ## Continue with rest of code
● Or use the complete if-else
  if(<condition>) {
      ## do something
  } else {
      ## do something else
  }
● You can have a series of tests by following the initial if with any number of else ifs.
  if(<condition1>) {
      ## do something
  } else if(<condition2>) {
      ## do something different
  } else {
      ## do something different
  }

Example: if-else

● ## Generate a uniform random number
  x <- runif(1, 0, 10)
  if(x > 3) {
      y <- 10
  } else {
      y <- 0
  }
● This is the same as executing
  y <- if(x > 3) {
      10
  } else {
      0
  }

Control Structures: for

● For loops are the most commonly used looping construct in R
  for(x in sequence) {
      ## Execute code
  }
● For one-line loops, the curly braces are not strictly necessary.
  > x <- c("a", "b", "c", "d")
  > for(i in 1:4) print(x[i])
  [1] "a"
  [1] "b"
  [1] "c"
  [1] "d"
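Two other common ways of writing the same loop are sketched below: seq_along() generates the indices from the vector itself, and a for loop can also walk over the elements directly. The names by_index and by_element are made up for this example.

```r
x <- c("a", "b", "c", "d")

# Loop over the positions; seq_along(x) is 1, 2, 3, 4 here and
# stays correct if the length of x changes
by_index <- character(0)
for (i in seq_along(x)) {
  by_index <- c(by_index, paste(i, x[i]))
}
print(by_index)

# Loop over the elements themselves when the position is not needed
by_element <- character(0)
for (letter in x) {
  by_element <- c(by_element, toupper(letter))
}
print(by_element)
```

seq_along() is safer than 1:length(x) because it produces no iterations at all for an empty vector.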

Control Structures: while

● While loops begin by testing a condition
● If it is true, the loop body is executed and the condition is tested again, until the condition is false
  > count <- 0
  > while(count < 10) {
        print(count)
        count <- count + 1
    }

Control Structures: next

● next is used to skip an iteration of a loop
  for(i in 1:100) {
      if(i <= 20) {
          ## Skip the first 20 iterations
          next
      }
      ## Do something here
  }

Control Structures: break

● break is used to exit the loop immediately, regardless of what iteration the loop may be on.
  for(i in 1:100) {
      print(i)
      if(i >= 20) {
          ## Stop loop after 20 iterations
          break
      }
  }
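The outline also lists repeat, which has no slide of its own. As a hedged sketch, repeat runs its body forever unless break is called, and next works inside it just as in a for loop; the counter i and the running sum total are invented for this example.

```r
total <- 0
i <- 0

repeat {
  i <- i + 1
  if (i %% 2 == 0) {
    next    # skip even numbers and go straight to the next iteration
  }
  if (i > 9) {
    break   # leave the loop once the odd numbers up to 9 are done
  }
  total <- total + i   # adds 1, 3, 5, 7 and 9
}

print(total)
print(i)
```

Because repeat has no condition of its own, forgetting the break would leave the loop running until it is interrupted by hand.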

Functions

Functions: scoping

Dates and Times

Loop functions

Simulating and Profiling

Vectorized Operations
