2 - 5 - Representing Data in R (13-18)

Upload: m-faheem-aslam

Post on 13-Apr-2018

TRANSCRIPT

  • 7/27/2019 2 - 5 - Representing Data in R (13-18)

    In the previous video we talked about representing data in text when you are trying to communicate. Now we are going to talk a bit about representing data in R, which is where we will be doing most of our data analysis.

    First we are going to talk about the important data types in R: the classes of data that you can have, such as characters, numeric values, integers, and logicals, as well as objects like vectors, matrices, data frames, lists, factors, and missing values. Then we'll talk a little bit about operations like subsetting and logical subsetting. For more information, see the data types video created for the Computing for Data Analysis class, which is also included as background material for this course.

    So we're going to start off with characters. In R, you can assign a variable, like firstName, to have a value, like jeff, where jeff is in quotes. If you look at the class of this variable by applying the class function to firstName, you get back "character", so this is a character variable. You can also type firstName to see the variable's assigned value, which is jeff. Character variables are good for storing text, as opposed to numeric values.

    Numeric values can be stored in numeric variables. For example, here I'm storing my height in cm in the variable heightCM, and I'm assigning it a value of 188.2. I can look at the class of this variable to see that it's numeric, or I can type the variable and hit return to see its value, 188.2. Much of the data that we'll be using in this class will be numeric data, and will be assigned to numeric variables.

    You can also assign integer variables. In some cases, the data you are representing are discrete: they should not be able to take on any continuous value, only integer values. To do this, we take the integer that we care about, in this case 1, and follow it with a capital L.
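The three basic classes described so far can be checked with a short sketch like the one below. It assumes the variable names used in the lecture and uses the `<-` assignment arrow, which is a stylistic assumption; the lecture slides may well have used `=` instead.

```r
# Character variable: the value is written in quotes.
firstName <- "jeff"
class(firstName)    # "character"

# Numeric variable: a height in centimetres.
heightCM <- 188.2
class(heightCM)     # "numeric"

# Integer variable: the trailing L asks R to store an integer.
numberSons <- 1L
class(numberSons)   # "integer"

# Typing a variable name at the console prints its value.
firstName
heightCM
numberSons
```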

    So we are assigning the number of sons that I have, the variable numberSons, to be equal to 1L. You can look at the class of this variable and see that it's an integer, or you can type numberSons and hit return, and you get the 1 out. Note that if you had assigned the value 1 without the L, you could have done the same analysis, and when you typed numberSons you would still get 1 out, but it would have a different class: numeric rather than integer. If you care about a variable being an integer, you need to assign it with the L suffix.

    Another kind of variable that will become useful, especially when writing code that requires for loops, if statements, or other control structures, is the logical value. Here we can assign a variable called teachingCoursera, because I'm teaching the Coursera course, and we can set it to be equal to TRUE. If we look at the class of this variable, it's a logical variable, and if we type teachingCoursera and hit return we get TRUE. We can use these variables to perform comparisons that we can later use in logical structures. I'll be talking about these types of variables as we go on in the class, about what their properties are and when they're used, but it's a good idea to review what the different types are.

    Once we've assigned variables of a particular class, we can create sets of those values and assign them to vectors. Vectors are a set of values with the same class. We can create them with the c function, where c stands for concatenate. Here I'm setting heights to be the values 188.2, 181.3, 193.5. If I type heights, I then get all 3 of those values.
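A minimal sketch of the points above: the effect of leaving off the L, a logical variable used in a control structure, and a vector built with `c`. The name `numberSonsNoL` and the printed message are made up for illustration and do not come from the lecture.

```r
# With the L suffix the value is stored as an integer.
numberSons <- 1L
class(numberSons)             # "integer"

# Without the L the same value is stored as a numeric (hypothetical name).
numberSonsNoL <- 1
class(numberSonsNoL)          # "numeric"
numberSons == numberSonsNoL   # TRUE: same value, different class

# Logical variable, usable directly in an if statement.
teachingCoursera <- TRUE
class(teachingCoursera)       # "logical"
if (teachingCoursera) {
  message("Currently teaching the Coursera course")
}

# A vector: values of the same class joined with c().
heights <- c(188.2, 181.3, 193.5)
class(heights)                # "numeric"
length(heights)               # 3
```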

    Values in the c function are separated by commas. You can also create a vector that consists of character values. For example, here I'm creating a vector called firstNames that consists of 4 character values: jeff, roger, andrew, and brian. Again, I've separated them by commas, and if I type firstNames and hit return, I see those values back out.

    Sometimes we might want to combine values that have different classes. A vector of values of possibly different classes is called a list. Here I create 2 vectors, vector1 and vector2. vector1 has 3 values that are numeric; vector2 has 4 values that are characters. I can then use the list command to put those two vectors together into one object called a list. If I type myList, I see that both the heights and the first names have been stored in the variable myList.

    Another type of object that might be of interest during the class is the matrix. Matrices are just vectors with multiple dimensions. So instead of storing one set of values, you can store values in multiple rows. Here I'm creating a variable myMatrix by assigning the values 1, 2, 3 and 4 to it and telling R to store them by rows. In other words, it fills from left to right across a row until it hits the end of the row, then starts a new row and fills from left to right again. Here I'm telling it to have 2 rows in the matrix, so the values start off as 1, 2, it hits the end of the first row, and then it starts the second row with 3, 4.

    The objects that we'll use most commonly in this class are data frames. These are multiple vectors, of possibly different classes, all of the same length. For example, if I create those same two vectors, with three numeric values and four character values, and try to create a data frame from them, R reports an error saying that the arguments have different numbers of rows. The reason is that the two vectors have different lengths, three and four. Because the data frame was never created, it cannot be found when we ask for it. However, if we add a fourth measurement to vector1, so that we have 4 numeric values and 4 text values, we can create a data frame using the data.frame command and assign it the values of heights and first names. If you look at my data frame, each column is labeled with its variable name, heights or firstNames, and each row contains the corresponding values of heights and first names. The way a data frame is structured, the values in each row should correspond to one observation. For example, 188.2 is assumed to be the height of Jeff, and 181.3 would be the height of Roger.

    Another type of variable that we'll use quite often in this class is the factor. Factors are qualitative variables that can be included in models. It's often hard to include character vectors directly in statistical models in R, so a different category of variable was created for that purpose. For example, suppose we create a vector smoker that consists of the characters yes, no, yes, no, four values, one for each observation in the study, and we want to use this vector to analyse something about the differences between smokers and non-smokers. We would generally create a factor with the as.factor function. We now have a factor version of the smoker variable, and when we print it, you see the values yes, no, yes, no and the levels that correspond to those values, no and yes. If there were 3 distinct values, including a maybe, you would see three levels: maybe, no and yes, since R sorts levels alphabetically by default. We will talk about how factors are used when we get on to statistical modelling.

    Another important kind of value that we will be considering is the missing value; in R, missing values are coded as NA. For example, in the vector1 that I have created here, there are 3 numeric values and 1 missing value, which I've just typed as NA. If I type vector1, I get the values 188.2, 181.3, 193.4 and NA, which indicates that there's a missing value. I can also use the command is.na to determine which of the values are missing in this particular vector: the first 3 values are not missing, but the last value is missing. Throughout the course, we will code missing values with NA and learn how to deal with them.

    Next I'm going to briefly go over subsetting. While we're doing our data analyses, we will often want to consider only part of a particular vector or data frame. Here I've generated two vectors again, one numeric with four values and one character vector, again with four values, and put them together into a single data frame. Now if I want to access just the first value in the vector1 variable, I can type vector1, open square bracket, 1, close square bracket. That returns just the first value from vector1. Similarly, I can use the concatenate function to pass the indexes of particular values. Suppose I wanted the first, second, and fourth values of vector1. I can subset vector1 using the same square brackets, passing in the indexes 1, 2, and 4 to say which values in that vector I would like to access.

    Similarly, we can look at specific rows and columns in a data frame. Here, for this data frame, I am looking at the values in the first row and the first 2 columns. That returns the height and first name from the first row. Alternatively, I can access a particular column using the $ operator. This operator is applied by typing the name of the data frame, then $, then the name of the variable you want to access, in this case firstNames. This gives all the first names in the data frame. We can also subset by particular logical operations.
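The objects and index-based subsetting described above can be sketched as follows. This is an illustration under stated assumptions, not a copy of the lecture's slides: the fourth height (192.3), the list element names, and the names `mismatch`, `smokerFactor` and `heightsWithNA` are all made up here, and the spelling `myDataFrame` is assumed.

```r
# Two vectors of different lengths and classes.
vector1 <- c(188.2, 181.3, 193.4)
vector2 <- c("jeff", "roger", "andrew", "brian")

# A list can hold both, even though lengths and classes differ.
myList <- list(heights = vector1, firstNames = vector2)

# A matrix filled by rows: row 1 is 1 2, row 2 is 3 4.
myMatrix <- matrix(c(1, 2, 3, 4), byrow = TRUE, nrow = 2)

# A data frame needs columns of equal length, so this attempt fails.
# try() captures the error instead of stopping the script.
mismatch <- try(data.frame(heights = vector1, firstNames = vector2),
                silent = TRUE)
inherits(mismatch, "try-error")   # TRUE

# Add a fourth measurement (placeholder value) and the data frame works.
vector1 <- c(188.2, 181.3, 193.4, 192.3)
myDataFrame <- data.frame(heights = vector1, firstNames = vector2)

# Factor built from a character vector; levels are sorted alphabetically.
smoker <- c("yes", "no", "yes", "no")
smokerFactor <- as.factor(smoker)
levels(smokerFactor)              # "no" "yes"

# Missing values are coded as NA and detected with is.na().
heightsWithNA <- c(188.2, 181.3, 193.4, NA)
is.na(heightsWithNA)              # FALSE FALSE FALSE TRUE

# Subsetting by position.
vector1[1]                        # first value
vector1[c(1, 2, 4)]               # first, second and fourth values
myDataFrame[1, 1:2]               # first row, first two columns
myDataFrame$firstNames            # one whole column by name
```

Wrapping the failing `data.frame` call in `try` is only a convenience so the sketch runs from top to bottom; typed directly at the console, the same call simply prints the error about differing numbers of rows.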

    Recall that we talked about logical variables at the beginning of this lecture. They can be either true or false. Here's an example. I have my data frame, and suppose that I want to identify all the rows in that data frame corresponding to cases where the first name is equal to jeff. In that case, I can use the variable name firstNames and check, using the == operator, whether it's equal to jeff. This comparison returns a logical vector that is true only where jeff appears. Subsetting with it is equivalent to asking for only the first row of the data frame, since only the first row has a first name of jeff, and indeed that's what's returned here. You can also do things like identify the parts of the data frame corresponding to heights in a particular range. For example, this returns only the rows in the data frame corresponding to heights less than 190. Alternatively, I can put conditions after the comma, and those will apply to columns rather than rows.

    A quick note on variable naming conventions in R. Variable names should be short but descriptive. There are some common styles. For example, camel caps, in which variable names alternate between lower case and upper case, with the first letter of each new word in upper case. Another example is putting underscores between each separate word, or using dots between each separate word. Each of these conventions is used by different people at different times. You should pick whichever one is most comfortable to you.
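As a recap of the logical subsetting and naming styles just described, here is a small sketch. The data frame is rebuilt so the block runs on its own; the fourth height (192.3) is a placeholder, and `isJeff`, `shortRows` and the three example variable names are hypothetical names chosen for illustration.

```r
# Rebuild the example data frame (fourth height is a placeholder value).
myDataFrame <- data.frame(
  heights    = c(188.2, 181.3, 193.4, 192.3),
  firstNames = c("jeff", "roger", "andrew", "brian")
)

# A comparison returns one logical value per row.
isJeff <- myDataFrame$firstNames == "jeff"
isJeff                                   # TRUE FALSE FALSE FALSE

# Logical vector before the comma selects rows; nothing after it keeps all columns.
myDataFrame[isJeff, ]

# Rows with heights below 190 (the first two rows in this example).
shortRows <- myDataFrame[myDataFrame$heights < 190, ]
shortRows

# The three naming styles mentioned above, applied to the same idea.
heightInCm   <- 188.2   # camel caps
height_in_cm <- 188.2   # underscores
height.in.cm <- 188.2   # dots
```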

    You can see style guides at these different websites that I've linked to here, and you'll see that each style guide suggests naming variables, functions and so forth in slightly different ways.

    I know that this has been a quick tour of these particular concepts. If you are having any trouble following along in the lectures that follow, please consider viewing the Computing for Data Analysis videos created by Roger, so that you will be able to understand all of the data analyses we are performing.