II. Tools for Computational Communication Research: R (Introduction to R for CCR)
Hai Liang 梁海
Fudan University, 2014 FIST course: Computational Communication Research
Why Programming + Why R
Why use R?
Inexpensive
Cross-platform
Extensible
Graphics better than many
You already know it
Familiarity with matrix algebra
Must be explicit
Need integrated calculator
Why avoid R?
Steep learning curve
Data cleaning can be difficult
Support limited
Extensibility needed to do what you need
Data types can be confusing
Limits to "Big Data"
Must be explicit
Base 1 (not 0)
Outline
Section I:
1. R + RStudio
2. I/O
3. Basic Syntax
4. Data Structure
5. Programming Tools
Section II:
1. Data Management
2. Statistics
3. Hands-On
Readings
1. Torfs & Brauer (2014). A (very) short introduction to R.
2. Kabacoff (2011). R in action: Data analysis and graphics with R.
Section I
Data Structure
o Vector
o Matrix
o List
Programming
o If-statement
o For-loop
o Function
1. R + RStudio
1) Install R
2) Install RStudio
1. R + RStudio
1) RStudio layout
2) Working directory
3) R packages
2. Input/Output
1) TXT, CSV: read.table(file=""), read.csv(file=""); write.table(), write.csv()
2) SAV, DTA: library(foreign), read.spss(file="", to.data.frame = TRUE), read.dta(file="")
3) Save & load x.Rdata files: save(data, file="data.Rdata"), load("data.Rdata")
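A minimal sketch of the round trips listed above, assuming a small made-up data frame (`toy`) and temporary file paths rather than a real dataset:

```r
# Hypothetical toy data; the column names are illustrative only
toy <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Write to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, file = csv_path, row.names = FALSE)
toy_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Save the object in R's own format and restore it under its original name
rdata_path <- tempfile(fileext = ".Rdata")
save(toy, file = rdata_path)
rm(toy)
load(rdata_path)   # recreates the object `toy`
```

Note that load() restores objects under the names they were saved with, so its result does not need to be assigned.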
3. R Basic Syntax
1) +, -, *, /, ^, sqrt
2) Variables
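height <- 180; weight <- 50; print(height)
height*weight # names are case-sensitive: Height and height are different variables
3) Using functions
sum(1,2,3)
mean(c(1,2,3)) # mean(1,2,3) returns 1: only the first argument is averaged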
3. R Basic Syntax – Operators
Arithmetic Operators
Operator    Description          Example
<-          Assign a value       a <- 1+2
+           Add                  x+y
-           Subtract             x-y
*           Multiply             x*y
/           Divide               x/y
** or ^     Exponentiation       x^y or x**y
%%          Modulus              x%%y
%/%         Integer division     x%/%y
Logical Operators
Operator    Description                      Example
<, >        Less, greater than               x<y
<=, >=      Less, greater than or equal to   x>=y
==          Equal to                         x==y
!=          Not equal to                     x!=y
!           Not                              !x
|           Or                               x | y
&           And                              x & y
isTRUE()    Test if true                     isTRUE(x==y)
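A short sketch of the less familiar operators from the tables, using two made-up values for x and y:

```r
x <- 7
y <- 2

x %% y     # modulus (remainder): 1
x %/% y    # integer division: 3
x ^ y      # exponentiation: 49

x > y               # TRUE
(x > 5) & (y > 5)   # and: FALSE
(x > 5) | (y > 5)   # or: TRUE
!(x == y)           # not: TRUE
isTRUE(x == y)      # FALSE
```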
4. Data Types
There are three general modes of data (examples in parentheses), plus a marker for missing values:
Strings ("Why, hi there")
Numbers (5)
TRUE/FALSE (TRUE)
Missing data (NA). Note: NA is written without quotes; "NA" in quotes is an ordinary string.
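As a quick check of these modes, class() reports how R stores a value, and is.na() separates the missing-value marker from the string "NA". The variable names below are illustrative:

```r
s <- "Why, hi there"
n <- 5
b <- TRUE

class(s)   # "character"
class(n)   # "numeric"
class(b)   # "logical"

is.na(NA)     # TRUE: a missing value
is.na("NA")   # FALSE: an ordinary string
```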
4. Data Structure
1) Vector
2) Matrices
3) Data frames
4) Lists
Source: Kabacoff (2011)
4. Data Structure – Vector
A vector is an ordered sequence of values of one mode
[numeric, logical, or string]
Define a vector
V <- c()
V <- c(1,2,"hi") # mixed modes are coerced to strings: "1" "2" "hi"
V <- seq(5,9,0.5)
V <- c(1:7)
Vector access
V[1], V[1:2], V[c(1,3)]
Vector names
names(V) <- c("first","second","third")
V["first"]
Vector math
V {+,-,*,/} 1
V+V == V*2
V*V == V^2
sqrt(V)
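The vector operations above can be tried on small examples. This sketch reuses seq(5,9,0.5) from the slide and adds a made-up named vector (`scores`):

```r
V <- seq(5, 9, 0.5)   # 5.0, 5.5, ..., 9.0
length(V)             # 9
V[1]                  # first element (indexing starts at 1): 5
V[c(1, 3)]            # first and third elements: 5 6

scores <- c(first = 10, second = 20, third = 30)   # hypothetical named vector
scores["second"]                     # access by name
all(scores + scores == scores * 2)   # math is element-wise: TRUE
sqrt(c(4, 9))                        # 2 3
```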
4. Data Structure – Matrices
Data in rows and
columns (same mode)
Define a matrix
m <- matrix()
matrix(1,5,5)
V <- c(1:9)
m<-matrix(V,3,3)
Matrix access
m[1,2]; m[1,]; m[,2]
m[,2:3]; m[,c(1,3)]
Matrix math
m {+,-,*,/} 1
m+m == m*2
m%*%m # matrix multiplication
cbind(m,m); rbind(m,m)
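A small sketch contrasting element-wise multiplication with matrix multiplication on the 3x3 matrix from the slide (the names `elementwise` and `product` are illustrative):

```r
m <- matrix(1:9, 3, 3)   # filled column by column
m[1, 2]                  # row 1, column 2: 4
m[, 2]                   # second column: 4 5 6

elementwise <- m * m     # multiplies matching cells
product <- m %*% m       # matrix multiplication
elementwise[1, 2]        # 4 * 4 = 16
product[1, 2]            # 1*4 + 4*5 + 7*6 = 66

dim(cbind(m, m))         # 3 6
dim(rbind(m, m))         # 6 3
```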
4. Data Structure – Data Frames
Matrix + columns with
different modes
Define a data frame
Weights <- c(1:8)
Prices <- c(2:9)
Types <- c(T,F,F,…)
Data <- data.frame(Weights, Prices, Types)
Data frame access
Data[1,2]
Data$Prices, Data[["Prices"]]
Data frame math
Data$Weights*Data$Prices
mean(Data$Prices)
merge(data1,data2,by="Prices",all=T)
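A runnable sketch of the data frame steps. The slide elides the values of Types, so the alternating TRUE/FALSE pattern here is an assumption, and the second table (`extra`) is made up to show merge():

```r
Weights <- c(1:8)
Prices <- c(2:9)
Types <- rep(c(TRUE, FALSE), 4)   # assumed values; the slide elides them
Data <- data.frame(Weights, Prices, Types)

Data[1, 2]                   # row 1, column 2 (Prices): 2
mean(Data$Prices)            # 5.5
Data$Weights * Data$Prices   # element-wise product of two columns

# Hypothetical second table, merged on the shared column Prices
extra <- data.frame(Prices = c(2, 3, 10), Label = c("a", "b", "c"))
merged <- merge(Data, extra, by = "Prices", all = TRUE)
nrow(merged)                 # 8 rows from Data plus 1 unmatched row: 9
```

With all = TRUE, rows without a match on either side are kept and padded with NA.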
4. Data Structure – Lists
vector + matrix + data frame etc.
Define a list
v1<-c(1,6,7,8)
v2<-c(2,4)
m<-matrix(1,2,4)
L <- list(v1, v2, m)
List access
L[[1]], L[['name']] # access by name needs a named list, e.g. list(name=v1)
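A sketch of list access by position and by name. The component names (first, second, mat) are illustrative and not from the slide:

```r
v1 <- c(1, 6, 7, 8)
v2 <- c(2, 4)
m <- matrix(1, 2, 4)
L <- list(first = v1, second = v2, mat = m)   # names are illustrative

L[[1]]          # by position: the vector v1
L[["second"]]   # by name: the vector v2
L$mat           # $ also works for named components
length(L)       # 3
class(L[1])     # single brackets return a sub-list: "list"
```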
5. Programming Tools
1) If-statement
2) For-loop
3) Function
if (cond) statement else statement
ifelse(condition, true, false)
if (cond) {
  statement
} else {
  statement
}
5. Programming Tools
1) If-statement
2) For-loop
3) Function
An example
x <- 60 # example value
if (x>50) {
  x <- 100
  print(x)
} else if (x<=50 & x>10) {
  x <- 50
  print(x)
} else {
  print(x)
}
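if() tests a single value, while ifelse() works on a whole vector. A sketch of the difference, using made-up values and labels:

```r
x <- c(5, 30, 80)   # hypothetical values

# if() handles one value at a time, so loop over the vector
labels <- character(length(x))
for (i in seq_along(x)) {
  if (x[i] > 50) {
    labels[i] <- "high"
  } else if (x[i] > 10) {
    labels[i] <- "medium"
  } else {
    labels[i] <- "low"
  }
}

# ifelse() is vectorized and classifies every element in one call
labels2 <- ifelse(x > 50, "high", ifelse(x > 10, "medium", "low"))
identical(labels, labels2)   # TRUE
```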
5. Programming Tools
1) If-statement
2) For-loop
3) Function
for (name in expr_1) {statements}
while (cond) {statements}
x=c("LH","Jonanthan","winson","Qinjie")
for (name in x) {
print (nchar(name))
}
for (i in 1:4) {
print (nchar(x[i]))
}
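The while form is listed above without an example. A minimal sketch that counts characters for the same vector (the counter `i` and the result vector `lengths` are illustrative names):

```r
x <- c("LH", "Jonanthan", "winson", "Qinjie")

i <- 1
lengths <- c()
while (i <= length(x)) {
  lengths <- c(lengths, nchar(x[i]))   # append the length of the i-th name
  i <- i + 1                           # without this the loop never ends
}
lengths                        # 2 9 6 6
identical(lengths, nchar(x))   # nchar() is vectorized, so no loop is needed: TRUE
```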
5. Programming Tools
1) If-statement
2) For-loop
3) Function
myfunction <- function(arg1=default,arg2,…) {
statements
return (objects)
}
space <- function(len=5,wid=20){
sp<-len*wid
return (sp)
}
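Calling the function shows how default arguments behave. The definition is repeated so the sketch runs on its own:

```r
space <- function(len = 5, wid = 20) {
  sp <- len * wid
  return(sp)
}

space()          # both defaults: 100
space(len = 2)   # one default overridden: 40
space(3, 4)      # arguments matched by position: 12
```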
Hands-On
Exercise 1.
http://tryr.codeschool.com/
Homework 2.1
a. Create a list of length 10. The first component, list[[1]], has dimension 1, the second 1*2, the third 3*3, the fourth 4*4, and so on. Draw the values randomly from 1:100.
b. For each component in the list, select the values > 50.
c. Write a function that returns sd(values)/mean(values) when length(values) > 1, and 0 otherwise.
d. Loop over the components to get 10 values, then calculate their sum.
e. Repeat the process many times (can you find any patterns?).
Section II
Data Management
o Aggregating
o Reshaping
Statistics
o Descriptive
o Graphics
o Linear Model
6. Data Management
Basics
Creating new variables
Recoding variables
Renaming variables
Missing values
Merging datasets
Subsetting datasets
Advanced
Aggregating datasets
The reshape package
o install.packages("reshape")
o library(reshape)
o cast()
o melt()
6. Data Management – Basics I
Creating a new variable
load("sampleData.Rdata") # set the working directory first
sampleData$fn <- sampleData$User_followers_count+sampleData$User_friends_count # ??!! the counts are not numeric yet, so convert them first
sampleData$User_followers_count<-as.numeric(sampleData$User_followers_count)
sampleData$User_friends_count<-as.numeric(sampleData$User_friends_count)
sampleData$fn <- sampleData$User_followers_count+sampleData$User_friends_count
Exercise: calculate a variable indicating favorites per post
Recoding a variable
sampleData <- within(sampleData,{
popcat <- NA
popcat[User_followers_count > 282] <- "Popular"
popcat[User_followers_count <= 282] <- "Unpopular" })
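Since sampleData.Rdata is not included here, the same two steps can be tried on a made-up stand-in (`toy`) whose counts are stored as strings. The values are hypothetical; the cutoff of 282 is the one used on the slide:

```r
# Hypothetical stand-in for sampleData; counts stored as strings
toy <- data.frame(
  User_followers_count = c("100", "500", "282"),
  User_friends_count   = c("50", "20", "300"),
  stringsAsFactors = FALSE
)

# Adding strings fails, so convert to numeric first
toy$User_followers_count <- as.numeric(toy$User_followers_count)
toy$User_friends_count   <- as.numeric(toy$User_friends_count)
toy$fn <- toy$User_followers_count + toy$User_friends_count

# Recode into two categories using the cutoff of 282
toy <- within(toy, {
  popcat <- NA
  popcat[User_followers_count > 282]  <- "Popular"
  popcat[User_followers_count <= 282] <- "Unpopular"
})
toy$popcat   # "Unpopular" "Popular" "Unpopular"
```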
6. Data Management – Basics II
Renaming variables
load("sampleData.Rdata")
colnames(sampleData)[c(2,3)]<-c("article_id","content")
Missing values
load("sampleData.Rdata")
sampleData$User_verified_reason[nchar(sampleData$User_verified_reason)==0]<-NA
is.na(sampleData$User_verified_reason)
sampleData$User_verified_reason[is.na(sampleData$User_verified_reason)]<-"Unknown"
sum(c(1,2,3,NA)) # ?! returns NA
sum(c(1,2,3,NA), na.rm=TRUE) # returns 6
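The missing-value steps can be tried on a short made-up vector (`reason`) in place of the real column:

```r
reason <- c("media", "", "scholar", "")   # hypothetical values
reason[nchar(reason) == 0] <- NA          # empty strings become missing
n_missing <- sum(is.na(reason))           # 2 missing values

reason[is.na(reason)] <- "Unknown"        # replace the missing values

total_with_na <- sum(c(1, 2, 3, NA))                # NA: one missing value makes the sum missing
total_no_na <- sum(c(1, 2, 3, NA), na.rm = TRUE)    # 6
```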
6. Data Management – Basics III
Subsetting datasets
o Selecting/keeping variables
newdata<-sampleData[,c(1,3)]
newdata<-sampleData[,c("created_at","mid")]
o Dropping variables
newdata<-sampleData[!(names(sampleData)%in%c("text","source"))]
newdata<-sampleData[c(-3,-4)]
sampleData$text<-NULL
o Selecting observations
newdata<-sampleData[c(2:30),]
newdata<-sampleData[which(sampleData$User_gender=="m"&
sampleData$User_verified=="FALSE"),]
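A sketch of the three kinds of subsetting on a made-up four-row table (`toy`). The column names mirror the slide, but the values are hypothetical:

```r
toy <- data.frame(
  mid = 1:4,
  text = c("a", "b", "c", "d"),
  User_gender = c("m", "f", "m", "m"),
  User_verified = c("FALSE", "TRUE", "FALSE", "TRUE"),
  stringsAsFactors = FALSE
)

kept <- toy[, c("mid", "User_gender")]        # keep two columns
dropped <- toy[!(names(toy) %in% c("text"))]  # drop one column by name
rows <- toy[toy$User_gender == "m" & toy$User_verified == "FALSE", ]
rows$mid                                      # 1 3
```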
6. Data Management – Basics IV
Merging datasets
o Adding columns
load("netD.Rdata")
load("sampleData.Rdata")
newdata<-merge(netD,sampleData[,c("User_screen_name","User_gender")],
by.x="sender",by.y="User_screen_name")
newdata <-newdata[order(newdata$sender, newdata$receiver),] # sort dataset
o Adding rows
total <- rbind(dataframeA,dataframeB)
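A sketch of the same merge on two made-up tables, an edge list (`edges`) and a user table (`users`), standing in for netD and sampleData:

```r
# Hypothetical edge list and user table
edges <- data.frame(
  sender   = c("u2", "u1", "u1"),
  receiver = c("u1", "u3", "u2"),
  stringsAsFactors = FALSE
)
users <- data.frame(
  User_screen_name = c("u1", "u2", "u3"),
  User_gender      = c("m", "f", "m"),
  stringsAsFactors = FALSE
)

# Add each sender's gender; the key has a different name in each table
merged <- merge(edges, users, by.x = "sender", by.y = "User_screen_name")
merged <- merged[order(merged$sender, merged$receiver), ]   # sort dataset
merged$User_gender   # "m" "m" "f"

# Adding rows: both tables must have the same columns
stacked <- rbind(edges, edges)
nrow(stacked)        # 6
```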
6. Data Management – Advanced I
Aggregating
o aggregate(data,by,FUN)
load("sampleData.Rdata")
sampleData$User_followers_count<-as.numeric(sampleData$User_followers_count)
aggregate(sampleData$User_followers_count,by=list(sampleData$User_verified),FUN="mean",na.rm=T)
aggregate(sampleData$User_followers_count,by=list(sampleData$User_verified),FUN="median",na.rm=T)
o aggregate(y~x,data,FUN)
aggregate(User_followers_count~User_verified,data=sampleData,FUN="median",na.rm=T)
o aggregate(cbind(y1,y2)~x,data,FUN)
aggregate(cbind(User_followers_count,User_friends_count)~User_verified,data=sampleData,FUN="median",na.rm=T)
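Both interfaces of aggregate() can be checked by hand on a made-up five-row table (`toy`; the values are hypothetical):

```r
toy <- data.frame(
  User_verified        = c("TRUE", "TRUE", "FALSE", "FALSE", "FALSE"),
  User_followers_count = c(1000, 3000, 10, 20, 60),
  User_friends_count   = c(100, 300, 50, 70, 90)
)

# by= interface: group means of one column
by_mean <- aggregate(toy$User_followers_count,
                     by = list(toy$User_verified), FUN = mean)
by_mean$x   # groups sorted as "FALSE", "TRUE": 30 2000

# formula interface: group medians of two columns at once
by_median <- aggregate(cbind(User_followers_count, User_friends_count) ~ User_verified,
                       data = toy, FUN = median)
by_median$User_friends_count   # 70 200
```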
6. Data Management – Advanced II
Reshaping data with the melt() and cast() functions: see Kabacoff (2011), p. 115.
6. Data Management – Advanced II
With aggregation
load("netD.Rdata")
library("reshape")
netD$freq<-1
withagg<-cast(netD,sender~receiver,sum)
Without aggregation
withoutagg<-cast(netD,sender+receiver~issue)
Reverse aggregation
nda<-melt(withoutagg,id=c("sender","receiver"))
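The wide-versus-long idea can also be tried without installing anything. The sketch below is not the reshape workflow itself: it swaps in base R's xtabs() and as.data.frame() on a made-up edge list (`net`) to go from long to wide and back.

```r
# Hypothetical edge list: one row per message
net <- data.frame(
  sender   = c("a", "a", "b", "c"),
  receiver = c("b", "b", "c", "a")
)
net$freq <- 1

# Long to wide with aggregation: a sender-by-receiver table of summed counts
adj <- xtabs(freq ~ sender + receiver, data = net)
adj["a", "b"]   # a sent to b twice: 2

# Wide back to long: one row per sender-receiver pair, counts in column Freq
long <- as.data.frame(adj)
nrow(long)      # 3 senders x 3 receivers: 9
```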
7. R for Statistics
Descriptive
Descriptive statistics
Chi-square test
T-test
Correlation
One way ANOVA
Graphs
Bar plot
Histogram
Scatter plot
Linear Model
Estimation
Diagnostics
Information
7. R for Statistics – Descriptive I
Descriptive statistics
via summary()
load("sampleData.Rdata")
sampleData$User_followers_count<-as.numeric(sampleData$User_followers_count)
sampleData$User_friends_count<-as.numeric(sampleData$User_friends_count)
sampleData$User_gender<-as.factor(sampleData$User_gender)
summary(sampleData[c("User_followers_count","User_friends_count","User_gender")])
via by()
by(sampleData[c("User_followers_count","User_friends_count")],sampleData$User_gender,summary)
7. R for Statistics – Descriptive II
Descriptive statistics
via table()
table(sampleData$User_gender,sampleData$User_verified)
table(netD$sender,netD$receiver) # edgelist=>matrix
Chi-square test [significance indicates ‘Not Independent’]
install.packages("vcd")
library(vcd)
mytable<-xtabs(~User_gender+User_verified, data=sampleData)
mytable<-table(sampleData$User_gender,sampleData$User_verified)
chisq.test(mytable)
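A sketch of the chi-square test on made-up data in which gender and verification are deliberately associated, so the test should reject independence:

```r
# Hypothetical data: 50 men (40 verified) and 50 women (10 verified)
gender   <- c(rep("m", 50), rep("f", 50))
verified <- c(rep("TRUE", 40), rep("FALSE", 10),
              rep("TRUE", 10), rep("FALSE", 40))

mytable <- table(gender, verified)
result <- chisq.test(mytable)
result$p.value < 0.05   # TRUE here: the two variables are not independent
```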
7. R for Statistics – Descriptive III
Descriptive statistics
Association between categorical variables
mytable<-xtabs(~User_gender+User_verified, data=sampleData)
assocstats(mytable) # from the vcd package
Correlation
cor(sampleData[c("User_followers_count","User_friends_count")],method="spearman",use="complete.obs")
cor.test(sampleData$User_followers_count, sampleData$User_friends_count, alternative="two.sided", method="pearson")
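A sketch of both correlation calls on two short made-up vectors. The second call appends a pair with a missing value to show what use="complete.obs" does:

```r
followers <- c(10, 20, 30, 40, 50)   # hypothetical values
friends   <- c(12, 25, 29, 48, 51)

# Both vectors have the same rank order, so Spearman's rho is 1
rho <- cor(followers, friends, method = "spearman")

# A pair with a missing value is dropped by use = "complete.obs"
rho_na <- cor(c(followers, NA), c(friends, 60),
              method = "spearman", use = "complete.obs")

# cor.test() adds a significance test to the estimate
test <- cor.test(followers, friends,
                 alternative = "two.sided", method = "pearson")
test$estimate
test$p.value
```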
7. R for Statistics – Descriptive IV
Descriptive statistics
T-test
t.test(User_followers_count~User_gender,sampleData) # gender difference in the number of followers
t.test(sampleData$User_followers_count, sampleData$User_friends_count)
One-way ANOVA
fit<-aov(User_followers_count~User_province,data=sampleData)
summary(fit)
TukeyHSD(fit)
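A sketch of the t-test and one-way ANOVA on simulated data. The three groups and their means are assumptions made for illustration, and the grouping variable is created as a factor because TukeyHSD() expects one:

```r
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 20)),
  value = c(rnorm(20, mean = 0), rnorm(20, mean = 0.5), rnorm(20, mean = 3))
)

# t-test: compares exactly two groups, so drop group "c" first
two <- droplevels(toy[toy$group != "c", ])
tt <- t.test(value ~ group, data = two)
tt$estimate   # the two group means

# One-way ANOVA across all three groups, then pairwise comparisons
fit <- aov(value ~ group, data = toy)
summary(fit)
tukey <- TukeyHSD(fit)
rownames(tukey$group)   # "b-a" "c-a" "c-b"
```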
7. R for Statistics – Graphs I
count<-table(sampleData$User_verified)
barplot(count,main="Simple Bar Plot",
xlab="Verified", ylab="Frequency")
barplot(count,main="Simple Bar Plot",
xlab="Verified", ylab="Frequency", horiz=TRUE)
7. R for Statistics – Graphs II
hist(sampleData$User_followers_count,
breaks=20,
col="red",
xlab="Number of followers",
main="Colored histogram with 20 bins")
hist(sampleData$User_followers_count,
freq=FALSE, #new line
breaks=20,
col="red",
xlab="Number of followers",
main="Histogram, rug plot, density curve")
rug(jitter(sampleData$User_followers_count))
lines(density(sampleData$User_followers_count), col="blue", lwd=2)
7. R for Statistics – Graphs III
plot(sampleData$User_followers_count, sampleData$User_friends_count,
main="Basic Scatter plot of Followers vs. Friends", xlab="No. of Followers",
ylab="No. of Friends", pch=19)
# regress the y-axis variable on the x-axis variable so the line matches the plot
abline(lm(User_friends_count~User_followers_count,data=sampleData), col="red", lwd=2, lty=1)
lines(lowess(sampleData$User_followers_count,sampleData$User_friends_count), col="blue", lwd=2, lty=2)
?! (compare with the log-log version that follows)
7. R for Statistics – Graphs III
plot(log(sampleData$User_followers_count),log(sampleData$User_friends_count),
main="Basic Scatter plot of Followers vs. Friends",xlab="log_No. of Followers",
ylab="log_No. of Friends", pch=19)
# regress the y-axis variable on the x-axis variable so the line matches the plot
abline(lm(log(User_friends_count)~log(User_followers_count),data=sampleData), col="red", lwd=2, lty=1)
lines(lowess(log(sampleData$User_followers_count),log(sampleData$User_friends_count)), col="blue", lwd=2, lty=2)
7. R for Statistics – Linear Model I
Estimation
Linear models are estimated using the lm() function. It is a good idea to assign the model to an object in order to access model information. The dependent variable is listed first; all independent variables follow the ~.
fit<-lm(log(User_followers_count)~log(User_friends_count)+User_gender+as.factor(User_verified), data=sampleData)
Let’s see the results…
summary(fit)
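A sketch of the same kind of model on simulated data. The data-generating values (a log-log slope of 0.8, the noise level, the sample size) are assumptions chosen for illustration and say nothing about the real dataset:

```r
set.seed(42)
n <- 200
toy <- data.frame(
  friends = round(runif(n, min = 10, max = 1000)),
  gender  = sample(c("m", "f"), n, replace = TRUE)
)
# Followers grow with friends, with multiplicative noise (values stay positive)
toy$followers <- toy$friends^0.8 * exp(rnorm(n, sd = 0.5))

# Dependent variable on the left of ~, predictors on the right
fit <- lm(log(followers) ~ log(friends) + gender, data = toy)
summary(fit)
coef(fit)["log(friends)"]   # should land close to the simulated slope of 0.8
```

Taking logs requires strictly positive counts, since log(0) is -Inf and lm() cannot use such rows.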
7. R for Statistics – Linear Model II
Post-estimation Plots
There is a series of diagnostic plots available with the plot() command.
plot(fit)
All four plots can also be shown in a single chart:
layout(matrix(c(1,2,3,4),2,2))
plot(fit)
[Figure: the four diagnostic plots produced by plot(fit): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours)]
7. R for Statistics – Linear Model III
Information
In addition, there is quite a bit of information available within the fit object (a comprehensive list is here: http://www.inside-r.org/r-doc/stats/lm).
x<-residuals(fit) # access the model residuals
plot(x) # plot the residuals
abline(a=0, b=0, col="red") # add a horizontal line: intercept (a) = 0, slope (b) = 0, red (col)
confint(fit) # confidence intervals for each coefficient
fitted(fit) # predicted values
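These accessors can be tried on any fitted model. The sketch below uses the `cars` dataset that ships with R rather than the course data:

```r
fit <- lm(dist ~ speed, data = cars)   # `cars` is a built-in dataset

res  <- residuals(fit)   # observed minus predicted
pred <- fitted(fit)      # predicted values
ci   <- confint(fit)     # 95% confidence intervals by default

dim(ci)                                      # one row per coefficient: 2 2
all.equal(unname(pred + res), cars$dist)     # fitted + residual = observed: TRUE
abs(sum(res)) < 1e-8                         # residuals sum to zero with an intercept: TRUE
```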
Hands-On
Homework 3.1
a. Read the csv file “authorlist.csv”.
b. Select the columns “Author Name” and “Discipline”. The variable Discipline contains one word or a set of words.
c. Output: a co-occurrence matrix M, e.g., M[1,1] = communication, health, 5 authors
Test the hypothesis:
People are more inclined to follow verified users, controlling for gender and the number of friends.
THANK YOU & CONTACT US @
weblab.com.cityu.edu.hk