Basic Programming in R

Kenneth R. Szulczyk

Upload: kenneth-szulczyk

Post on 21-Nov-2015


DESCRIPTION

I wrote a guide to help people learn R programming. Although I do not teach any profound statistical or mathematical concepts, a reader should become a proficient programmer after he or she understands all the code used in this guide.

TRANSCRIPT

  • 2

    Basic Programming in R Copyright 2015 by Kenneth R. Szulczyk All rights reserved Edition 1, February 2015

  • 3

    Table of Contents

    The Basics (p. 4)
    Handling, manipulating, and creating data (p. 6)
        Creating Variables (p. 6)
        Importing a spreadsheet saved as a CSV file (p. 8)
        Importing an Excel Spreadsheet (p. 10)
        Directory (p. 10)
    Basic Statistics (p. 12)
    Linear Regression (p. 16)
    Time Series Analysis (p. 22)
    Writing Programs (p. 26)
        The Car and Sunspot Program (p. 26)
        Simple Program (p. 26)
        Amortization Table (p. 27)
    Matrices and Vectors (p. 30)
        Least Squares (p. 30)
        Eigenvalues and eigenvectors (p. 33)
        Choleski Factorization (p. 34)
    Appendix: The Programs (p. 36)
        The Car Analysis (p. 36)
        The Periodogram (p. 38)
        Fitting an ARIMA to the Sunspot Data (p. 39)
        Amortization Table (p. 41)

  • 4

    The Basics

    I assume readers are familiar with basic statistics and linear algebra. I do not teach any profound empirical techniques. Instead, I give a comprehensive overview of programming in R. After finishing this book, you should be able to use R and tailor it to your own needs. R is open-source software for mathematics and statistics. Researchers can download and use the software for free from:

    http://cran.r-project.org/

    Researchers can use R in two ways: enter commands directly into the console, or write a program and run the program in the console. I show the console below:

    I have heard that someone created a Graphical User Interface (GUI) for R, where users execute commands via pull-down menus, but I have not found it yet. Using the console, enter the command 2 + 2. The greater-than sign, >, indicates R is waiting for a command. Any text, commands, and equations in red indicate commands one can enter directly into R, while blue indicates the output.

    2 + 2

  • 5

    R calculates the answer. R works with vectors and matrices, and a single number is simply a vector of length 1 (a scalar). The brackets, [ ], always indicate an index: the [1] at the start of the output line is the index of the first element printed on that line.

    [1] 4

    Note: R remembers all the variables and subroutines created in the console. Once I finish a program that seems to work, I close R and re-open it to wipe its memory. Then I check if the program still works.
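As an alternative to closing and re-opening R, the remembered objects can be inspected and removed by hand. The sketch below is only an illustration: the variables trend and answer are made up, and removing objects does not unload packages the way a fresh session does.

```r
# Hypothetical variables created during a console session
trend <- 1:100
answer <- 2 + 2

# ls() lists the objects R currently remembers
print(ls())

# rm() removes the named objects; rm(list = ls()) would remove all of them
rm(trend, answer)

remaining <- ls()
print(remaining)
```

Restarting R remains the more thorough check, because it also clears loaded packages and options.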

    Type in the command:

    license()

    The output:

    This software is distributed under the terms of the GNU General Public License, either Version 2, June 1991 or Version 3, June 2007. The terms of version 2 of the license are in a file called COPYING which you should have received with this software and which can be displayed by RShowDoc("COPYING"). Version 3 of the license can be displayed by RShowDoc("GPL-3").

    Copies of both versions 2 and 3 of the license can be found at http://www.R-project.org/Licenses/.

    A small number of files (the API header files listed in R_DOC_DIR/COPYRIGHTS) are distributed under the LESSER GNU GENERAL PUBLIC LICENSE, version 2.1 or later. This can be displayed by RShowDoc("LGPL-2.1"), or obtained at the URI given. Version 3 of the license can be displayed by RShowDoc("LGPL-3").

    'Share and Enjoy.'

  • 6

    Handling, manipulating, and creating data

    Creating Variables

    Create a trend variable starting at 1 and ending at 100. We represent the equal sign as

  • 7

    Refer to the chart below for the main codes:

    Let's say you want only the first 50 observations of noise. Did you notice the command? I create a new vector called noise.2 and copy the first 50 observations from the noise vector.

    noise.2

  • 8

     [6] -0.163007504  0.466372338  0.813717075  0.077994673  0.141952803
    [11] -0.417129484  0.247227624  1.019927832  0.991367337 -1.597887670
    [16] -0.008955807 -0.229873035  0.079801223  0.135923409  0.883680805
    [21] -1.375899489 -1.709316587  0.104933419  0.403238585 -0.630627316
    [26]  0.133480828 -1.049068429  1.539173832 -0.638730541 -1.541867808
    [31] -1.973639450  1.151270347 -1.022709096  1.336546494  0.422457580
    [36]  0.963737604  0.466905061 -1.503421731  1.956695046  0.172492546
    [41] -0.374648161  0.612023483 -0.469123654 -0.531325273  1.402142448
    [46]  1.749785524 -0.650935475 -0.613352014  1.881250587 -0.290325872
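The code that produced this output is cut off in this transcript. A minimal sketch of the idea follows, assuming the trend was built with the colon operator and the noise was drawn with rnorm(); the seed is my own addition so the numbers repeat, and they will not match the output above.

```r
set.seed(1)               # assumption: fixed seed so the sketch is repeatable

trend <- 1:100            # trend variable from 1 to 100
noise <- rnorm(100)       # assumption: 100 standard normal draws

# Copy the first 50 observations into a new vector
noise.2 <- noise[1:50]

print(length(noise.2))
```

The index 1:50 is itself a vector, so R returns the elements at positions 1 through 50.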

    Importing a spreadsheet saved as a CSV file

    A user can easily import a spreadsheet into R, but it must first be saved in comma-delimited (CSV) format. The first row contains the headings, while the data falls below the headings. Headings can be upper and/or lower case. You can also use periods in a name but no other special characters. The spreadsheet should fit the format below:

    We will import the spreadsheet and name it dataset. dataset
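The import command is cut off here. A hedged sketch of what such an import can look like is below; it uses read.csv(), which the sunspot program in this guide also uses. The file contents and column names are made up so the example runs without the original spreadsheet.

```r
# Write a tiny, made-up CSV file so the sketch runs anywhere
csv.file <- tempfile(fileext = ".csv")
writeLines(c("Model,MPG,Eng.Size",
             "Accord,25,2.3",
             "Passat,24,1.8",
             "Mustang,20,3.8"), csv.file)

# header = TRUE tells R that the first row holds the headings
dataset <- read.csv(csv.file, header = TRUE)

print(dataset)
print(names(dataset))
```

With a real file, replace csv.file with the file name in quotes, for example a CSV stored in the working directory.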

  • 9

    Note: You must be consistent when using upper and lower case letters. For example, R treats MPG, mpg, and Mpg as three different variables. If you create a variable MPG and later assign a value to mpg, R silently creates a second variable called mpg; if you only read from mpg, R reports that the object does not exist. Be consistent with your names and labels.

    I will create the variables I want to use. Remember, I used capital letters for miles per gallon (MPG) in the original dataset. The dataset is an object, and the $ allows a user to pull a named piece of information out of this object. In our case, the $ extracts one column (variable) from the dataset.

    mpg
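A small sketch of the $ operator follows. The data frame is made up and stands in for the imported car data; only the column name MPG comes from the text.

```r
# Made-up stand-in for the imported dataset
dataset <- data.frame(MPG = c(25, 24, 20),
                      Eng.Size = c(2.3, 1.8, 3.8))

# Pull individual columns out of the dataset with $
mpg <- dataset$MPG
eng.size <- dataset$Eng.Size

print(mpg)

# Names are case sensitive: there is no column called mpg, so this is NULL
print(is.null(dataset$mpg))
```

The last line shows why consistent capitalization matters: asking for a column with the wrong case returns nothing rather than the data.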

  • 10

    Then I multiply large.eng to get a vector of large engine sizes. The variable will only have engine sizes greater than 2; smaller engines are transformed into zeros. The asterisk, *, means multiply. For vectors, R multiplies the first element of one vector by the first element of the second vector, then the second elements, the third elements, and so on.

    large.eng
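The element-by-element multiplication can be sketched as follows. The engine sizes are made up, and I assume the indicator was created by comparing the engine size to 2, since that step is not visible in this transcript.

```r
eng.size <- c(1.8, 2.3, 3.8, 2.0, 5.7)   # made-up engine sizes

# Assumption: an indicator that is TRUE for engines larger than 2
large.eng <- eng.size > 2

# In arithmetic, TRUE counts as 1 and FALSE as 0, so small engines become 0
large.eng <- large.eng * eng.size

print(large.eng)
```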

  • 11

    If you are not sure which directory R has installed itself in, then run this program. list.dirs

  • 12

    Basic Statistics

    I can use many commands to get descriptive statistics. Remember, this dataset has many categorical variables.

    summary(dataset)

    The partial output:

     Observations        Model            Make           MPG
     Min.   :  1.00   Passat  :  6   BMW       : 25   Min.   :13.00
     1st Qu.: 94.75   Accord  :  5   Mitsubishi: 24   1st Qu.:26.00
     Median :188.50   Cavalier:  5   Volkswagen: 22   Median :29.00
     Mean   :188.50   Mustang :  5   Chevrolet : 21   Mean   :29.48
     3rd Qu.:282.25   S70     :  5   Ford      : 20   3rd Qu.:32.00
     Max.   :376.00   Sunfire :  5   Toyota    : 18   Max.   :50.00
                      (Other) :345   (Other)   :246

    Remember, we created new variables. If I want the summary statistics for the variables I created, then I use cbind to combine the variables into a matrix. cbind takes the vectors and binds them together as the columns of a matrix; the c refers to columns. The matrix x will have three columns. We also have the command rbind, which combines rows.

    x
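A short sketch of cbind and rbind with made-up vectors is below. The three variable names are illustrative; the exact variables the author combined are not visible in this transcript.

```r
# Made-up vectors standing in for the car variables
mpg <- c(25, 24, 20, 31)
eng.size <- c(2.3, 1.8, 3.8, 1.6)
cylinders <- c(4, 4, 6, 4)

# cbind: each vector becomes one column of the matrix
x <- cbind(mpg, eng.size, cylinders)
print(dim(x))          # rows, then columns
print(summary(x))      # summary statistics column by column

# rbind: each vector becomes one row instead
x.rows <- rbind(mpg, eng.size, cylinders)
print(dim(x.rows))
```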

  • 13

    I want to calculate the correlations on my data. I redefine my x matrix to add more variables. x
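The correlation command is cut off in this transcript. A minimal sketch, assuming the base R function cor() applied to a matrix of made-up values:

```r
# Made-up vectors combined into a matrix, one variable per column
mpg <- c(25, 24, 20, 31, 28)
eng.size <- c(2.3, 1.8, 3.8, 1.6, 2.0)
cylinders <- c(4, 4, 6, 4, 4)
x <- cbind(mpg, eng.size, cylinders)

# cor() returns the matrix of pairwise correlations between the columns
r <- cor(x)
print(round(r, 3))
```

The result is symmetric with ones on the diagonal, because every variable is perfectly correlated with itself.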

  • 14

    I will calculate something a little more complicated - canonical correlation. I create two matrices x and y. x

  • 15

    Similarly, I need the third column from the $ycoef. We can index any matrix by using [row, column]. The row is blank, so R will copy all the rows. The 3 indicates we only want the third column. vector
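The indexing described above can be sketched with the cancor() function from base R. The data below is simulated, so the numbers have no connection to the car data; only the $ycoef element and the [ , 3] indexing come from the text.

```r
set.seed(42)                      # assumption: seed added for repeatability
n <- 50

# Two simulated matrices with three columns each
x <- cbind(rnorm(n), rnorm(n), rnorm(n))
y <- cbind(x[, 1] + rnorm(n), rnorm(n), rnorm(n))

cc <- cancor(x, y)
print(cc$cor)                     # the canonical correlations

# Blank row index = all rows; 3 = third column only
vector <- cc$ycoef[, 3]
print(vector)
```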

  • 16

    Linear Regression

    With linear regression, we explain a dependent variable, $y_i$, with one or more explanatory variables. Refer to the equation below:

    $$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_k x_{k,i} + \varepsilon_i$$

    We define the variables as:

    - i represents one observation. If we have time series data, then we switch i to t.
    - The dependent variable, $y_i$
        o We try to explain or predict $y_i$ based on the x variables
    - The independent variables, $x_{j,i}$
        o We assume these variables are fixed and constant
    - $\varepsilon_i$ represents the white noise process, assumed to be normal with a mean of zero and constant variance.
    - We need to estimate the parameters
        o The intercept, $\beta_0$
        o $\beta_1$, $\beta_2$, until $\beta_k$ are the slopes.

    I have data for 376 cars in 1998, or 376 observations. I believe the following relationship holds:

    - A car's petrol consumption depends on the explanatory variables.
        o $y_i$ is measured in miles per gallon, or mpg.
    - The explanatory variables
        o Compact cars should use less petrol than regular cars.
            Dummy variable: one if the car is compact, zero otherwise.
        o Cars with automatic transmissions use more petrol than sticks.
            Dummy variable
        o Larger engines use more petrol than smaller engines.
        o An engine with more cylinders uses more petrol.

    $$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \beta_4 x_{4,i} + \varepsilon_i$$

    In R, we use lm as the command for linear regression. I store all the results in the object fit. fit is not a simple variable; it is an object containing many pieces of information. In this case, the data = dataset argument is redundant. I could drop it because I already created the vectors, or variables, in R.

    fit

  • 17

    Call:
    lm(formula = mpg ~ compact + trans + eng.size + cylinders, data = dataset)

    Residuals:
        Min      1Q  Median      3Q     Max
    -6.9325 -2.2384 -0.2697  1.7869 17.1587

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  40.1067     0.6928  57.892  < 2e-16 ***
    compact       0.5901     0.4347   1.357  0.17547
    trans        -1.7810     0.3930  -4.532 7.90e-06 ***
    eng.size     -1.2785     0.4767  -2.682  0.00764 **
    cylinders    -1.2091     0.2932  -4.124 4.59e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 3.581 on 371 degrees of freedom
    Multiple R-squared: 0.4735, Adjusted R-squared: 0.4678
    F-statistic: 83.41 on 4 and 371 DF, p-value: < 2.2e-16

    Researchers study the residuals. The residuals represent the errors, and we assume they are normally distributed. The object, fit, contains the objects from the linear regression. I pull out the residuals and store them in a vector called resid.

    resid
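Because the 1998 car file is not included here, the sketch below repeats the same steps on mtcars, a car dataset that ships with R. The choice of mtcars and of its columns am, disp, and cyl (transmission, displacement, cylinders) is my assumption; the results will differ from the output above.

```r
# Fit a linear regression on the built-in mtcars data (stand-in dataset)
fit <- lm(mpg ~ am + disp + cyl, data = mtcars)

# summary() prints the coefficient table, R-squared, and F-statistic
print(summary(fit))

# Pull the residuals out of the fit object and store them in a vector
resid <- residuals(fit)
print(head(resid))
```

The same information is available as fit$residuals, since fit is a list-like object.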

  • 18

    We can check whether the residuals are normally distributed. I will extract the standardized residuals from the object, fit. Standardized means the program will subtract the average and divide by the standard deviation. We use the command below: resid.standard

  • 19

    My data has outliers, so I can estimate a median regression, which is a special case of quantile regression. If you remember your statistics, an outlier represents an extreme point or observation, the exception to the rule. When you calculate an average, outliers pull it away from the typical value. The median, on the other hand, is another measure of the center. The median is the value in the middle, and it is not sensitive to outliers. For quantile regression, I need to install the package quantreg.

    install.packages("quantreg")

    You can also install it using the Packages menu. R should also install SparseM because quantreg relies on that package for its calculations.

    Note: You are assuming two things when you download and use someone else's package.

    1. You assume the code works correctly and calculates what it is supposed to calculate.
    2. You enter the correct parameters into the function when you use it.

    You must load the library before you use it.

    library("quantreg")

    I estimate the median regression via the command rq. I only include the engine size because I want to graph it.

  • 20

    fit.median

  • 21

    Note: The c() command lets us cheat in R. Many commands in R only allow the user to input one argument or variable. Thus, we can use c() to combine many variables or arguments into one element. Then we can enter the combined element as one command into an R function.
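A small illustration of passing several values through one argument with c() follows. The numbers are made up, and quantile() is just one example of a function whose probs argument accepts a combined vector.

```r
# c() combines separate numbers into one vector
values <- c(13, 26, 29, 32, 50)

# The probs argument takes a single object, so both probabilities go inside c()
quartiles <- quantile(values, probs = c(0.25, 0.75))
print(quartiles)
```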

  • 22

    Time Series Analysis

    I downloaded the dataset from the National Aeronautics and Space Administration (NASA) at http://solarscience.msfc.nasa.gov/SunspotCycle.shtml. I read the data into an object named sundata.

    sundata = read.csv("spot_num.csv", header = TRUE)

    I created two vectors: sunspots and year.

    sunspots

  • 23

    We can use the par(mfrow) command to combine multiple plots onto the same graph. The c(2,1) refers to two rows and one column, or c(rows, columns). Remember, the c() means combine the elements, and it differs from the command, cbind().

    par(mfrow=c(2,1))
    acf(sunspots, 30, main="Sunspots")
    pacf(sunspots, 30, main="Sunspots")
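Without the NASA file, the same plots can be tried on sunspot.year, a yearly sunspot series that ships with R. Using that series is my assumption; it is not the monthly data from the text, so the plots will look different.

```r
# Built-in yearly sunspot numbers, used here as a stand-in series
sunspots <- as.numeric(sunspot.year)

# Two rows, one column: ACF on top, partial ACF below
par(mfrow = c(2, 1))
acf.out <- acf(sunspots, 30, main = "Sunspots")
pacf.out <- pacf(sunspots, 30, main = "Sunspots")

# Reset the layout to a single plot
par(mfrow = c(1, 1))
```

Both functions draw the plot and also return the computed values, which is why the results can be stored in acf.out and pacf.out.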

    I fit an AutoRegressive Integrated Moving Average (ARIMA) model to the data. The ARIMA is difficult to explain, so let's assume our data does not have the Integrated part. That leaves an ARMA, which is defined below:

    $$x_t = c + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$$

    We define the variables as:

    - The time series in the current period, $x_t$
    - $x_{t-1}$ is the time series in the previous time period, t-1; $x_{t-2}$, and so on.
        o Thus, $x_t$ depends on previous values of itself
    - $\varepsilon_t$ represents the white noise process, assumed to be normal with a mean of zero and constant variance.
    - $\varepsilon_{t-1}$ is the white noise in the previous time period; $\varepsilon_{t-2}$ and so on.
        o The noise from previous periods can influence $x_t$.
        o Since we are in period t, we know what the noise is in previous periods, which is why we can estimate the $\theta$ parameters.
    - We need to estimate the parameters
        o The intercept, c
        o $\phi_i$ are the coefficients for the autoregression; in a first-order model, $\phi_1$ must lie between -1 and 1.
        o $\theta_i$ are the coefficients for the moving average; in a first-order model, $\theta_1$ must lie between -1 and 1.

  • 24

    o Otherwise, the equation becomes unstable, which happens when the magnitude of $\phi_1$ or $\theta_1$ exceeds one.

    According to the ACF plot, the autocorrelations tail off. Thus, we may have one moving average term. The partial ACF plot also tails off. Hence, we may have one autoregressive term. We will estimate the ARIMA:

    $$x_t = c + \phi_1 x_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}$$

    We set up the estimation below. The c(1,0,1) means c(number of autoregressive terms, order of integration, number of moving average terms).

    fit.arima
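The estimation command is cut off in this transcript. A hedged sketch using the arima() function from base R is below, again on the built-in sunspot.year series rather than the NASA file, so the estimates will not match the author's.

```r
# Built-in yearly sunspot numbers, used here as a stand-in series
sunspots <- as.numeric(sunspot.year)

# order = c(1, 0, 1): one autoregressive term, no differencing, one MA term
fit.arima <- arima(sunspots, order = c(1, 0, 1))

print(fit.arima)

# The estimated coefficients: ar1, ma1, and the intercept
print(coef(fit.arima))
```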

  • 25

    Scientists at NASA claim the number of sunspots has an 11-year cycle. Unfortunately, R will not let me estimate a seasonal ARIMA(1,0,0) with an 11-year seasonal component because that means the data has a 132-month cycle. However, I found an annual cycle, and we can estimate an ARIMA(1,0,1) with a seasonal ARIMA(1,0,0).

    fit.arima.season

  • 26

    Writing Programs

    The Car and Sunspot Program

    R uses a scripting language. We can write programs in text files using Notepad. However, we save the file with the extension .R and not .txt. We did many calculations for the car data. I organized everything into a program, which is available in the Appendix. I added comments using the # mark and output my results using the print function. Everything else is the same. Look at the code. I wanted a blank line to separate the output, so I placed the command cat("\n"). cat stands for concatenate, while \n is a newline character. I named the file car.R and can run it by:

    source("car.R")

    Similarly, I wrote an R program to analyze the sunspot data. To run the program, type in:

    source("arima.R")

    Note: R's scripting language allows users to write sloppy code. Imagine you return a year later and try to figure out what you wrote. Thus, these rules come in handy.

    1. Get into the habit of reviewing your code and simplifying it.
    2. Use # to include comments in your code.
    3. Print out your variables and calculations to verify them.

    Simple Program

    Programmers use a loop to repeat a process. R gives a programmer three ways to construct a loop (for, while, and repeat), but I show only one. The body of a loop starts and ends with a curly bracket. In the loop below, the index starts at 1 and ends at 100. Here is a quirk with R: print outputs only one object at a time. So I create an x variable that contains my quote and add the index number to it by using c().

    for(i in 1:100) { x
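The loop is cut off in this transcript. A minimal sketch of the pattern described above follows; the quoted text is made up, and I run only three iterations instead of 100 to keep the output short.

```r
for (i in 1:3) {
  # c() combines the text and the index into one object that print can show
  x <- c("This is loop number", i)
  print(x)
}
```

Because c() can hold only one type, R converts the index to text, so x ends up as a character vector with two elements.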

  • 27

    Amortization Table

    I used R to calculate an amortization table. I created a subroutine to calculate the monthly payment. We would not do this in real practice because the program only calls the subroutine once. We normally use subroutines to compute repeated calculations. The subroutine comes first in your program. In the program, I utilize the subroutine to calculate the monthly payment. My variable is payment and I named the subroutine loan.payment. I pass the variables principal, interest, years, and number into the subroutine.

    - The principal is the loan amount.
    - Interest is the annual interest rate.
    - The years is the number of years for the loan.
    - Number is the number of payments per year.

    payment
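The body of the subroutine is not visible in this transcript. The sketch below assumes the standard annuity formula for a fixed payment; the function name and its four arguments come from the text, while the internal names rate and periods are mine. The loan values are the ones the guide uses for its example.

```r
# Subroutine: fixed payment for a fully amortizing loan (assumed formula)
loan.payment <- function(principal, interest, years, number) {
  rate <- interest / number      # interest rate per payment period
  periods <- years * number      # total number of payments
  principal * rate / (1 - (1 + rate)^(-periods))
}

# $150,000 loan, 8% annual interest, 30 years, 12 payments per year
payment <- loan.payment(150000, 0.08, 30, 12)
print(round(payment, 2))
```

In R a function returns the value of its last expression, so no explicit return() is needed here.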

  • 28

    The main program begins executing. I define the parameters of the loan. The loan is $150,000. The bank charges an 8% annual interest rate. The borrower makes 12 payments per year and repays the loan in 30 years. Below, I define the parameters.

    principal

  • 29

    I record this information into the matrix by indexing specific elements. The loop chooses the row, i, while the second number determines the column. Please do not forget the closing bracket, }, that comes at the loop's end.

    table.amort[i,1]
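The loop itself is cut off here. A hedged sketch of how such a table can be filled row by row is below. The names table.amort, n, and interest.payment appear in the guide; the column layout, the other variable names, and the payment formula are my assumptions.

```r
principal <- 150000
interest <- 0.08
years <- 30
number <- 12

rate <- interest / number
n <- years * number                                  # number of payments
payment <- principal * rate / (1 - (1 + rate)^(-n))  # assumed annuity formula

# One row per payment; columns chosen for illustration
table.amort <- matrix(0, nrow = n, ncol = 4)
colnames(table.amort) <- c("payment", "interest", "principal", "balance")

balance <- principal
for (i in 1:n) {
  interest.payment <- balance * rate                 # interest owed this period
  principal.payment <- payment - interest.payment    # rest repays the loan
  balance <- balance - principal.payment

  table.amort[i, 1] <- payment
  table.amort[i, 2] <- interest.payment
  table.amort[i, 3] <- principal.payment
  table.amort[i, 4] <- balance
}

print(round(table.amort[1:3, ], 2))
```

The balance in the last row should be zero apart from rounding error, which is a quick check that the payment and the loop agree.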

  • 30

    Matrices and Vectors

    Least Squares

    R has several annoyances when dealing with vectors and matrices. Technically a column vector with n rows equals a matrix of dimension n X 1. However, R treats the indexing differently for vectors and matrices. R indexes a vector by [row] and a matrix by [row, column]. We already created the vector, mpg. Type it in and show it. The number in the [ ] identifies the row in the vector.

    mpg

    The output:

    [1] 25 24 25 28 31 31 31 24 24 31 32 29 29 27 29 27 27 28 26 26 25 24
    [26] 31 32 26 28 24 24 24 20 24 27 30 26 28 31 32 27 30 26 28 28 28

    Let's say I want the value in the 26th row, then I type in:

    mpg[26]

    The output:

    [1] 31

    I want to calculate linear regression using matrices. In matrix notation, linear regression equals:

    $$Y = X\beta + \varepsilon$$

    The Y is a vector with dimensions n by 1. The n equals the number of observations or rows. The X matrix has dimensions n by k, where k equals the number of beta parameters. It also includes an intercept. The $\beta$ vector is k by 1. Finally, $\varepsilon$ represents the white noise, assumed to be normally distributed with a mean of zero and a constant variance. The $\varepsilon$ vector has n rows and one column. First, we must create a column of ones for the intercept. We create a variable n for the number of observations. The command length returns the number of rows in the vector.

    n

  • 31

    I create the intercept as a vector with dimensions n X 1. intercept

  • 32

    compact    0.5901217
    trans     -1.7809514
    eng.size  -1.2784526
    cylinders -1.2090972

    Did you notice? This should be a vector, but R defines it as a matrix. The 1 identifies the first column. I want to predict the y values while using the x and the estimated $\hat{\beta}$. The hats mean I estimated the parameters from the data. We have seen this before; remember $Y = X\beta + \varepsilon$. The random noise is missing, and I added some hats.

    $$\hat{Y} = X\hat{\beta}$$

    predict
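The matrix steps are cut off in this transcript. The sketch below assumes the usual least squares formula, beta = (X'X)^(-1) X'Y, and applies it to the built-in mtcars data as a stand-in for the 1998 car file. The names y.hat and the mtcars columns are my choices.

```r
y <- mtcars$mpg
n <- length(y)                       # number of observations

# Column of ones for the intercept, then the explanatory variables
intercept <- rep(1, n)
X <- cbind(intercept, am = mtcars$am, disp = mtcars$disp, cyl = mtcars$cyl)

# t() transposes, %*% is matrix multiplication, solve() inverts
beta <- solve(t(X) %*% X) %*% t(X) %*% y
print(beta)                          # a k x 1 matrix, not a plain vector

# Predicted values: X times the estimated beta
y.hat <- X %*% beta
print(head(y.hat))
```

The coefficients should agree with those from lm(mpg ~ am + disp + cyl, data = mtcars), which is a convenient way to check the matrix algebra.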

  • 33

    R has another quirk. The sigma is a matrix with dimensions 1 X 1. Thus, we have a single number, or scalar. However, R does not allow sigma to be multiplied by the matrix. So we must use the command, as.numeric, to convert the matrix into a number, or scalar. cov
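This quirk can be reproduced with a tiny made-up example. The matrices below are illustrative and are not the regression quantities from the text; only the use of as.numeric and the name cov follow the guide.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)        # made-up 2 x 2 matrix
sigma <- matrix(4, nrow = 1, ncol = 1)      # a 1 x 1 matrix, not a scalar

# Multiplying the 1 x 1 matrix by the 2 x 2 matrix fails
attempt <- try(sigma * A, silent = TRUE)
print(inherits(attempt, "try-error"))

# as.numeric() drops the dimensions, so R treats sigma as a plain number
cov <- as.numeric(sigma) * A
print(cov)
```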

  • 34

    [4,] -0.45030242 -0.27372179  0.2311537 -0.77480083  0.2618316
    [5,] -0.87263670 -0.01362799 -0.1439916  0.37549820 -0.2767435

    I pull out the first eigenvalue and first eigenvector. I used as.numeric to convert the eigenvalue into a scalar.

    lambda
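A small sketch of extracting the first eigenvalue and eigenvector with the base R function eigen() follows. The 2 x 2 matrix is made up; the name lambda follows the guide, while v is my own.

```r
A <- matrix(c(4, 1, 1, 3), nrow = 2)   # made-up symmetric matrix

e <- eigen(A)

# First eigenvalue as a scalar, first eigenvector as a column of $vectors
lambda <- as.numeric(e$values[1])
v <- e$vectors[, 1]

# Check the definition: A times v should equal lambda times v
print(A %*% v)
print(lambda * v)
```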

  • 35

    Verify the Choleski factorization by multiplying $R^T R$. Remember, we must use matrix multiplication, %*%.

    t(R)%*%R

              intercept compact  trans eng.size cylinders
    intercept     376.0    93.0  236.0   994.60    1952.0
    compact        93.0    93.0   56.0   220.40     451.0
    trans         236.0    56.0  236.0   664.40    1289.0
    eng.size      994.6   220.4  664.4  2973.66    5668.5
    cylinders    1952.0   451.0 1289.0  5668.50   11028.0

    It does indeed equal the A matrix.
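The same check can be tried on a small made-up matrix. The sketch uses chol() from base R, which returns the upper triangular factor R; the 2 x 2 matrix A is illustrative only.

```r
A <- matrix(c(4, 2, 2, 3), nrow = 2)   # made-up symmetric, positive definite matrix

# chol() returns the upper triangular matrix R with t(R) %*% R equal to A
R <- chol(A)
print(R)

# Verify the factorization with matrix multiplication
print(t(R) %*% R)
```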

  • 36

    Appendix The Programs

    The Car Analysis

    # Import the dataset

    dataset

  • 37

    fit

  • 38

    The Periodogram

    ############################################################
    #
    # This program calculates the periodogram on the residuals
    # Ken Szulczyk
    #
    ############################################################

    # reads the variable residuals

    n = length(residuals)

    response

  • 39

    Fitting an ARIMA to the Sunspot Data

    # Import the data into R

    sundata = read.csv("spot_num.csv", header = TRUE)

    # Create the vectors, or variables

    sunspots

  • 40

    ############################################################
    #
    # This program calculates the periodogram on the residuals
    # Ken Szulczyk
    #
    ############################################################

    # reads the variable residuals

    n = length(residuals)

    response

  • 41

    Amortization Table

    ####################################################
    # Create a subroutine to calculate a loan payment
    ####################################################

    loan.payment

  • 42

    for(i in 1:n) {

    interest.payment