Page 1: 二、计算新闻传播学工具介绍:R Introduction to R for …weblab.com.cityu.edu.hk/workshops/fudan-ccr/Intro_R_for_CCR.pdfExtensibility needed to do what ... The reshape

II. Tools for Computational Communication Research: R
Introduction to R for CCR

Hai Liang 梁海

Fudan University, 2014 FIST course "Computational Communication Research"

Why Programming + Why R

Why use R?

Inexpensive

Cross-platform

Extensible

Graphics better than many

You already know it

Familiarity with matrix algebra

Must be explicit

Need integrated calculator

Why avoid R?

Steep learning curve

Data cleaning can be difficult

Support limited

Extensibility needed to do what you need

Data types can be confusing

Limits to "Big Data"

Must be explicit

Indexing starts at 1 (not 0)

Outline

Section I:

1. R + RStudio

2. I/O

3. Basic Syntax

4. Data Structure

5. Programming Tools

Section II:

1. Data Management

2. Statistics

3. Hands-On

Readings

1. Torfs & Brauer (2014). A (very) short introduction to R.

2. Kabacoff (2011). R in action: Data analysis and graphics with R.

Section I

Data Structure

o Vector

o Matrix

o List

Programming

o If-statement

o For-loop

o Function

1. R + RStudio

1) RStudio layout

2) Working directory

3) R packages

2. Input/Output

1) TXT, CSV

2) SAV, DTA

3) Save & load x.Rdata files

1) read.table(file=""), read.csv(file=""); write.table(), write.csv()

2) library(foreign); read.spss(file="", to.data.frame=TRUE), read.dta(file="")

3) save(data, file="data.Rdata"), load("data.Rdata")
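As a rough illustration of routes 1) and 3), the sketch below round-trips a small data frame through a CSV file and an .Rdata file. The data frame `d` and the temporary file paths are made up for the example; `read.spss()` and `read.dta()` are left out because they need real SPSS or Stata files.

```r
# Toy data (hypothetical; not one of the course files)
d <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# 1) CSV: write the data frame, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(d, file = csv_path, row.names = FALSE)
d_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# 3) .Rdata: save() stores the object under its own name,
#    load() restores it under that same name
rdata_path <- tempfile(fileext = ".Rdata")
save(d, file = rdata_path)
rm(d)
load(rdata_path)   # recreates the object `d`
```

Note that `load()` does not return the data; it recreates the saved objects in the workspace.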

3. R Basic Syntax

1) +, -, *, /, ^, sqrt

2) Variables

height <- 180; weight <- 50; print(height)

height*weight

3) Using functions

sum(1,2,3)

mean(c(1,2,3))   # note: mean(1,2,3) returns 1, because only the first argument is averaged
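Two details of this slide are easy to trip over: variable names are case-sensitive, and `mean()` expects its data as a single vector while `sum()` accepts several arguments. A minimal sketch (the variable names `product`, `total`, `wrong`, and `right` are chosen only for this example):

```r
height <- 180
weight <- 50
product <- height * weight   # names are case-sensitive: Height is not height

total <- sum(1, 2, 3)        # sum() adds up all of its arguments
wrong <- mean(1, 2, 3)       # only the first argument is treated as data
right <- mean(c(1, 2, 3))    # wrap the values in one vector with c()
print(c(total, wrong, right))
```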

3. R Basic Syntax—Operators

Arithmetic Operators

Operator    Description         Example
<-          Assign a value      a <- 1+2
+           Add                 x+y
-           Subtract            x-y
*           Multiply            x*y
/           Divide              x/y
** or ^     Exponentiation      x^y or x**y
%%          Modulus             x%%y
%/%         Integer division    x%/%y

Logical Operators

Operator    Description                      Example
<, >        Less than, greater than          x<y
<=, >=      Less/greater than or equal to    x>=y
==          Equal to                         x==y
!=          Not equal to                     x!=y
!           Not                              !x
|           Or                               x | y
&           And                              x & y
isTRUE()    Test if true                     isTRUE(x==y)
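A few of the less familiar operators, tried on two made-up numbers (the result names are only for illustration):

```r
x <- 7
y <- 2

remainder <- x %% y                 # modulus: what is left after dividing 7 by 2
quotient  <- x %/% y                # integer division: how many times 2 fits in 7
same_pow  <- x^y == x**y            # both forms mean exponentiation
combined  <- (x > y) & !(x == y)    # logical AND combined with NOT
strict    <- isTRUE(x == y)         # TRUE only for a single TRUE value
```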

4. Data Types

There are three general modes of data (examples in parentheses), plus a marker for missing values:

Strings ("Why, hi there")

Numbers (5)

TRUE/FALSE (TRUE)

Missing data (NA). Note that NA is written without quotes; "NA" in quotes is an ordinary string.
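The modes can be checked directly with `mode()`, and `is.na()` shows the difference between a missing value and the string "NA". A small sketch (the names `modes` and `missing_flags` are made up for the example):

```r
modes <- c(mode("Why, hi there"),   # "character"
           mode(5),                 # "numeric"
           mode(TRUE))              # "logical"

missing_flags <- c(is.na(NA),       # TRUE: a real missing value
                   is.na("NA"))     # FALSE: just a two-letter string
```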

4. Data Structure

1) Vector

2) Matrices

3) Data frames

4) Lists

Source: Kabacoff (2011)

4. Data Structure – Vector

A vector is an ordered set of values of a single mode (numeric, logical, or string)

Define a vector
V <- c()
V <- c(1, 2, "hi")   # mixed values are coerced to strings
V <- seq(5, 9, 0.5)
V <- c(1:7)

Vector access
V[1], V[1:2], V[c(1,3)]

Vector names
names(V) <- c("first", "second", "third")
V["first"]

Vector math
V {+,-,*,/} 1
V+V == V*2
V*V == V^2
sqrt(V)
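The following sketch walks through the same steps with concrete values. The three-element vector used for naming is an assumption made so that every element gets a name.

```r
V <- c(1, 2, "hi")
mixed_class <- class(V)        # "character": the numbers were coerced to strings

V <- seq(5, 9, 0.5)            # 5.0, 5.5, ..., 9.0
picked <- V[c(1, 3)]           # first and third elements

V <- c(10, 20, 30)             # assumed values for the naming example
names(V) <- c("first", "second", "third")
by_name <- V["first"]          # access by name instead of position

elementwise <- all(V + V == V * 2)   # vector arithmetic is element-wise
```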

4. Data Structure – Matrices

Data in rows and columns (all of the same mode)

Define a matrix
m <- matrix()
matrix(1,5,5)   # 5 x 5 matrix filled with 1
V <- c(1:9)
m <- matrix(V,3,3)

Matrix access
m[1,2]; m[1,]; m[,2]
m[,2:3]; m[,c(1,3)]

Matrix math
m {+,-,*,/} 1
m+m == m*2
m%*%m   # matrix multiplication
cbind(m,m); rbind(m,m)
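The difference between `*` and `%*%` is worth seeing once with numbers. A minimal sketch using the 3 x 3 matrix from the slide (the result names are made up):

```r
m <- matrix(1:9, 3, 3)     # filled column by column
cell <- m[1, 2]            # row 1, column 2

elementwise <- m * m       # each cell multiplied by itself
product <- m %*% m         # matrix product (rows times columns)

wide <- cbind(m, m)        # columns side by side: 3 x 6
tall <- rbind(m, m)        # rows stacked: 6 x 3
```

Because `matrix()` fills by column, the second column of `m` is 4, 5, 6, so `m[1, 2]` is 4.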

4. Data Structure – Data Frames

Like a matrix, but columns can have different modes

Define a data frame
Weights <- c(1:8)
Prices <- c(2:9)
Types <- c(T,F,F,…)
Data <- data.frame(Weights, Prices, Types)

Data frame access
Data[1,2]
Data$Prices, Data[["Prices"]]

Data frame math
Data$Weights*Data$Prices
mean(Data$Prices)
merge(data1, data2, by="Prices", all=T)
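A runnable version of the data frame steps might look like this. The values of `Types` are an assumption, since the slide elides them, and the `Cost` column is added only for illustration.

```r
Weights <- c(1:8)
Prices  <- c(2:9)
Types   <- rep(c(TRUE, FALSE), 4)    # assumed values; the slide elides them
Data <- data.frame(Weights, Prices, Types)

first_price <- Data[1, 2]                                  # row 1, column 2
same_column <- identical(Data$Prices, Data[["Prices"]])    # two ways to get one column
Data$Cost <- Data$Weights * Data$Prices                    # column-wise arithmetic
avg_price <- mean(Data$Prices)
```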

4. Data Structure – Lists

A list can combine vectors, matrices, data frames, etc.

Define a list
v1 <- c(1,6,7,8)
v2 <- c(2,4)
m <- matrix(1,2,4)
L <- list(v1, v2, m)

List access
L[[1]], L[["name"]]   # access by name requires a named list
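Access by name only works when the components have names. The sketch below builds both an unnamed and a named list; the component names `first`, `second`, and `mat` are made up for the example.

```r
v1 <- c(1, 6, 7, 8)
v2 <- c(2, 4)
m  <- matrix(1, 2, 4)

L <- list(v1, v2, m)              # unnamed: access by position only
first_item <- L[[1]]              # double brackets return the component itself
wrapper <- class(L[1])            # single brackets return a list of length 1

named <- list(first = v1, second = v2, mat = m)
second_item <- named[["second"]]  # equivalent to named$second
```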

5. Programming Tools

1) If-statement

2) For-loop

3) Function

if (cond) statement else statement

ifelse(condition, true, false)

if (cond) {
  statement
} else {
  statement
}
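The two forms differ in what they test: `if` expects a single TRUE or FALSE, while `ifelse()` works through a vector element by element. A small sketch with made-up values:

```r
x <- 3
if (x > 2) label <- "big" else label <- "small"   # one condition, one branch

scores <- c(40, 75, 90)
grades <- ifelse(scores >= 60, "pass", "fail")    # one result per element
```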

5. Programming Tools – If-statement

An example

x <- 30   # example value (assumed; the slide leaves x undefined)
if (x > 50) {
  x <- 100
  print(x)
} else if (x <= 50 & x > 10) {
  x <- 50
  print(x)
} else {
  print(x)
}

5. Programming Tools – For-loop

for (name in expr_1) {statements}

while (cond) {statements}

x <- c("LH","Jonanthan","winson","Qinjie")

for (name in x) {
  print(nchar(name))
}

for (i in 1:4) {
  print(nchar(x[i]))
}
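The slide gives the `while` template without an example. Below is a sketch of the same traversal written as a `while` loop, next to the vectorized call that avoids the loop altogether (the names `lens` and `vectorized` are made up):

```r
x <- c("LH", "Jonanthan", "winson", "Qinjie")

lens <- integer(0)
i <- 1
while (i <= length(x)) {     # stop once every element has been visited
  lens[i] <- nchar(x[i])
  i <- i + 1
}

vectorized <- nchar(x)       # nchar() is vectorized, so no loop is required
```

In practice the vectorized form is usually preferred in R; explicit loops are mainly useful when each step depends on the previous one.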

5. Programming Tools – Function

myfunction <- function(arg1=default, arg2, ...) {
  statements
  return(objects)
}

space <- function(len=5, wid=20) {
  sp <- len*wid
  return(sp)
}
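Calling `space()` in different ways shows how default, positional, and named arguments interact. The function is repeated here so the sketch runs on its own; the result names are made up.

```r
space <- function(len = 5, wid = 20) {
  sp <- len * wid
  return(sp)
}

default_area    <- space()          # both defaults: 5 * 20
positional_area <- space(10)        # first argument by position: 10 * 20
named_area      <- space(wid = 2)   # one argument by name: 5 * 2
```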

Hands-On

Exercise 1.

http://tryr.codeschool.com/

Homework 2.1

a. Create a list of length 10: the first component, list[[1]], has dimension 1, the second is 1*2, the third is 3*3, the fourth is 4*4, and so on. The values should be drawn randomly from 1:100.

b. For each component in the list, select the values > 50.

c. Write a function that returns sd(values)/mean(values) when length(values) > 1, and 0 otherwise.

d. Loop over the components to obtain 10 values, then calculate the sum of those 10 values.

e. Repeat the process many times (can you find any patterns?).

Section II

Data Management

o Aggregating

o Reshaping

Statistics

o Descriptive

o Graphics

o Linear Model

6. Data Management

Basics

Creating new variables

Recoding variables

Renaming variables

Missing value

Merging datasets

Subsetting datasets

Advances

Aggregating dataset

The reshape package

o install.packages("reshape")

o library(reshape)

o cast()

o melt()

6. Data Management – Basics I

Creating a new variable
load("sampleData.Rdata")   # set the working directory first
sampleData$fn <- sampleData$User_followers_count+sampleData$User_friends_count   # ??!! fails if the counts are not stored as numbers
sampleData$User_followers_count <- as.numeric(sampleData$User_followers_count)
sampleData$User_friends_count <- as.numeric(sampleData$User_friends_count)
sampleData$fn <- sampleData$User_followers_count+sampleData$User_friends_count

Exercise: calculate a variable indicating favorites per post.

Recoding a variable
sampleData <- within(sampleData, {
  popcat <- NA
  popcat[User_followers_count > 282] <- "Popular"
  popcat[User_followers_count <= 282] <- "Unpopular"
})
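Since sampleData.Rdata is not included here, the sketch below repeats both steps on a made-up stand-in called `toy`. The column names mirror the slide; the values are invented, and the counts are deliberately stored as text to show why the conversion is needed.

```r
# Toy stand-in for sampleData (values are made up)
toy <- data.frame(User_followers_count = c("100", "282", "5000"),
                  User_friends_count   = c("50", "10", "300"),
                  stringsAsFactors = FALSE)

# Counts stored as text must be converted before arithmetic.
# If a column is a factor, use as.numeric(as.character(x)) instead,
# because as.numeric() on a factor returns the internal level codes.
toy$User_followers_count <- as.numeric(toy$User_followers_count)
toy$User_friends_count   <- as.numeric(toy$User_friends_count)
toy$fn <- toy$User_followers_count + toy$User_friends_count

# Recode followers into two categories using the slide's cut-off of 282
toy <- within(toy, {
  popcat <- NA
  popcat[User_followers_count > 282]  <- "Popular"
  popcat[User_followers_count <= 282] <- "Unpopular"
})
```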

6. Data Management – Basics II

Renaming variables
load("sampleData.Rdata")
colnames(sampleData)[c(2,3)] <- c("article_id","content")

Missing values
load("sampleData.Rdata")
sampleData$User_verified_reason[nchar(sampleData$User_verified_reason)==0] <- NA
is.na(sampleData$User_verified_reason)
sampleData$User_verified_reason[is.na(sampleData$User_verified_reason)] <- "Unknown"
sum(c(1,2,3,NA))   # ?! returns NA
sum(c(1,2,3,NA), na.rm=TRUE)   # returns 6
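The same missing-value steps on a small made-up vector (`reason` stands in for the User_verified_reason column):

```r
reason <- c("celebrity", "", "journalist", "")   # made-up values
reason[nchar(reason) == 0] <- NA                  # empty strings become missing
n_missing <- sum(is.na(reason))                   # count the missing entries
reason[is.na(reason)] <- "Unknown"                # replace them with a label

with_na    <- sum(c(1, 2, 3, NA))                 # NA: one missing value spoils the sum
without_na <- sum(c(1, 2, 3, NA), na.rm = TRUE)   # missing values dropped first
```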

6. Data Management – Basics III

Subsetting datasets

o Selecting/keeping variables
newdata <- sampleData[,c(1,3)]
newdata <- sampleData[,c("created_at","mid")]

o Dropping variables
newdata <- sampleData[!(names(sampleData) %in% c("text","source"))]
newdata <- sampleData[c(-3,-4)]
sampleData$text <- NULL

o Selecting observations
newdata <- sampleData[c(2:30),]
newdata <- sampleData[which(sampleData$User_gender=="m" & sampleData$User_verified=="FALSE"),]
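A self-contained sketch of the three kinds of subsetting on a made-up data frame `toy`. The last line uses `subset()`, a base R function that is not on the slide, as an alternative way to write the row filter.

```r
toy <- data.frame(mid = 1:4,
                  text = c("a", "b", "c", "d"),
                  User_gender = c("m", "f", "m", "m"),
                  User_verified = c("FALSE", "TRUE", "FALSE", "TRUE"),
                  stringsAsFactors = FALSE)

kept    <- toy[, c("mid", "User_gender")]              # keep columns by name
dropped <- toy[!(names(toy) %in% c("text"))]           # drop columns by name
males   <- toy[which(toy$User_gender == "m" &
                     toy$User_verified == "FALSE"), ]  # filter rows
same    <- subset(toy, User_gender == "m" & User_verified == "FALSE")
```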

6. Data Management – Basics IV

Merging datasets

o Adding columns
load("netD.Rdata")
load("sampleData.Rdata")
newdata <- merge(netD, sampleData[,c("User_screen_name","User_gender")], by.x="sender", by.y="User_screen_name")
newdata <- newdata[order(newdata$sender, newdata$receiver),]   # sort dataset

o Adding rows
total <- rbind(dataframeA, dataframeB)
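By default `merge()` keeps only rows that match in both tables. The sketch below uses a made-up edge list and user table to contrast that default with `all.x = TRUE`, which keeps every row of the first table.

```r
# Made-up edge list and user table
netD_toy <- data.frame(sender   = c("u2", "u1", "u3"),
                       receiver = c("u1", "u3", "u1"),
                       stringsAsFactors = FALSE)
users <- data.frame(User_screen_name = c("u1", "u2"),
                    User_gender = c("m", "f"),
                    stringsAsFactors = FALSE)

# Default: only senders that appear in both tables
inner <- merge(netD_toy, users, by.x = "sender", by.y = "User_screen_name")

# all.x = TRUE keeps sender u3 as well, with NA for the missing gender
left <- merge(netD_toy, users, by.x = "sender", by.y = "User_screen_name",
              all.x = TRUE)
left <- left[order(left$sender, left$receiver), ]   # sort dataset
```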

6. Data Management – Advances I

Aggregating

o aggregate(data,by,FUN)
load("sampleData.Rdata")
sampleData$User_followers_count <- as.numeric(sampleData$User_followers_count)
aggregate(sampleData$User_followers_count, by=list(sampleData$User_verified), FUN="mean", na.rm=T)
aggregate(sampleData$User_followers_count, by=list(sampleData$User_verified), FUN="median", na.rm=T)

o aggregate(y~x,data,FUN)
aggregate(User_followers_count~User_verified, data=sampleData, FUN="median", na.rm=T)

o aggregate(cbind(y1,y2)~x,data,FUN)
aggregate(cbind(User_followers_count,User_friends_count)~User_verified, data=sampleData, FUN="median", na.rm=T)
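The three call styles can be compared on a made-up data frame (`toy` and its values are invented; the column names follow the slide):

```r
toy <- data.frame(User_verified = c("TRUE", "TRUE", "FALSE", "FALSE", "FALSE"),
                  User_followers_count = c(1000, 3000, 10, 20, 60),
                  User_friends_count   = c(100, 300, 5, 15, 25))

# Style 1: data vector plus a list of grouping variables
by_list <- aggregate(toy$User_followers_count,
                     by = list(verified = toy$User_verified), FUN = mean)

# Style 2: formula interface, one outcome
by_formula <- aggregate(User_followers_count ~ User_verified,
                        data = toy, FUN = median)

# Style 3: formula interface, several outcomes bound with cbind()
both <- aggregate(cbind(User_followers_count, User_friends_count) ~ User_verified,
                  data = toy, FUN = median)
```

With the list style the result column is called `x`; the formula styles keep the original variable names.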

6. Data Management – Advances II

Reshaping data with the melt() and cast() functions: see Kabacoff (2011), p. 115.

With aggregation
load("netD.Rdata")
library("reshape")
netD$freq <- 1
withagg <- cast(netD, sender~receiver, sum)

Without aggregation
withoutagg <- cast(netD, sender+receiver~issue)

Reverse aggregation
nda <- melt(withoutagg, id=c("sender","receiver"))
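For readers without the reshape package installed, the idea behind the aggregated cast can be sketched with base R's `xtabs()`. This is a swapped-in alternative, not the reshape call itself, and the edge list `netD_toy` is made up.

```r
netD_toy <- data.frame(sender   = c("a", "a", "b", "a"),
                       receiver = c("b", "c", "a", "b"),
                       stringsAsFactors = FALSE)
netD_toy$freq <- 1

# Edge list -> sender-by-receiver table, summing freq within each cell
adj <- xtabs(freq ~ sender + receiver, data = netD_toy)

# Back to long form: one row per sender/receiver pair, zero cells included
long <- as.data.frame(adj, responseName = "freq")
```

Unlike `melt()`, the long form produced here also lists the pairs that never occur, with a frequency of 0.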

7. R for Statistics

Descriptive

Descriptive

Chi-square test

T-test

Correlation

One way ANOVA

Graphs

Bar plot

Histogram

Scatter plot

Linear Model

Estimation

Diagnostics

Information

7. R for Statistics – Descriptive I

Descriptive statistics

via summary()

load("sampleData.Rdata")

sampleData$User_followers_count<-as.numeric(sampleData$User_followers_count)

sampleData$User_friends_count<-as.numeric(sampleData$User_friends_count)

sampleData$User_gender<-as.factor(sampleData$User_gender)

summary(sampleData[c("User_followers_count","User_friends_count","User_gender")])

via by()

by(sampleData[c("User_followers_count","User_friends_count")], sampleData$User_gender, summary)

7. R for Statistics – Descriptive II

Descriptive statistics

via table()

table(sampleData$User_gender,sampleData$User_verified)

table(netD$sender,netD$receiver) # edgelist=>matrix

Chi-square test [a significant result indicates that the two variables are not independent]

install.packages("vcd")

library(vcd)

mytable<-xtabs(~User_gender+User_verified, data=sampleData)

mytable<-table(sampleData$User_gender,sampleData$User_verified)

chisq.test(mytable)
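To see what the test returns without the course data, here is a sketch on a made-up 2 x 2 table of counts (rows are gender, columns are verified status):

```r
# Made-up counts, not taken from sampleData
mytable <- matrix(c(30, 10,
                    20, 40), nrow = 2, byrow = TRUE,
                  dimnames = list(gender = c("f", "m"),
                                  verified = c("FALSE", "TRUE")))

res <- chisq.test(mytable)
p_value  <- res$p.value     # a small p-value is evidence against independence
expected <- res$expected    # counts expected if the two variables were independent
```

Comparing `expected` with the observed table shows where the association comes from.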

7. R for Statistics – Descriptive III

Descriptive statistics

Association between categorical variables

mytable<-xtabs(~User_gender+User_verified, data=sampleData)

assocstats(mytable)

Correlation

cor(sampleData[c("User_followers_count","User_friends_count")], method="spearman", use="complete.obs")

cor.test(sampleData$User_followers_count, sampleData$User_friends_count, alternative="two.sided", method="pearson")

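The correlation calls from the preceding slide can be tried on two made-up vectors. One value is missing on purpose to show why `use = "complete.obs"` is needed in `cor()`.

```r
followers <- c(10, 200, 35, 4000, 150, NA)   # made-up values
friends   <- c(5, 150, 40, 900, 100, 60)

# Rank correlation, using only the pairs where both values are present
rho <- cor(followers, friends, method = "spearman", use = "complete.obs")

# Pearson correlation with a significance test
test <- cor.test(followers, friends, alternative = "two.sided",
                 method = "pearson")
r_estimate <- test$estimate
```

In this toy data the two rankings agree exactly, so the Spearman coefficient is 1 even though the relationship is far from a straight line.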

7. R for Statistics – Descriptive IV

Descriptive statistics

T-test

t.test(User_followers_count~User_gender, data=sampleData)   # gender difference in number of followers

t.test(sampleData$User_followers_count, sampleData$User_friends_count)

One-way ANOVA

fit<-aov(User_followers_count~User_province,data=sampleData)

summary(fit)

TukeyHSD(fit)
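A sketch of both tests on simulated data. Everything in `toy` is generated for the example; the group means are arbitrary choices.

```r
set.seed(1)   # reproducible made-up data
toy <- data.frame(
  followers = c(rnorm(20, mean = 5), rnorm(20, mean = 6), rnorm(20, mean = 8)),
  province  = factor(rep(c("A", "B", "C"), each = 20)),
  gender    = factor(rep(c("f", "m"), times = 30))
)

tt <- t.test(followers ~ gender, data = toy)     # two groups
fit_aov <- aov(followers ~ province, data = toy) # three or more groups
summary(fit_aov)
pairs <- TukeyHSD(fit_aov)                       # which pairs of groups differ
```

`TukeyHSD()` reports one row per pair of groups, which is why it only makes sense after an ANOVA with more than two groups.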

7. R for Statistics – Graphs I

count<-table(sampleData$User_verified)

barplot(count,main="Simple Bar Plot",

xlab="Verified", ylab="Frequency")

barplot(count,main="Simple Bar Plot",

xlab="Frequency", ylab="Verified", horiz=TRUE)   # the axes swap when horiz=TRUE

7. R for Statistics – Graphs II

hist(sampleData$User_followers_count,

breaks=20,

col="red",

xlab="Number of followers",

main="Colored histogram with 20 bins")

hist(sampleData$User_followers_count,

freq=FALSE, #new line

breaks=20,

col="red",

xlab="Number of followers",

main="Histogram, rug plot, density curve")

rug(jitter(sampleData$User_followers_count))

lines(density(sampleData$User_followers_count), col="blue", lwd=2)

7. R for Statistics – Graphs III

plot(sampleData$User_followers_count, sampleData$User_friends_count,

main="Basic Scatter plot of Followers vs. Friends", xlab="No. of Followers",

ylab="No. of Friends", pch=19)

abline(lm(User_friends_count~User_followers_count, data=sampleData), col="red", lwd=2, lty=1)   # y ~ x, matching the plot axes

lines(lowess(sampleData$User_followers_count, sampleData$User_friends_count), col="blue", lwd=2, lty=2)

?!

7. R for Statistics – Graphs III

plot(log(sampleData$User_followers_count), log(sampleData$User_friends_count), main="Basic Scatter plot of Followers vs. Friends", xlab="log_No. of Followers", ylab="log_No. of Friends", pch=19)

abline(lm(log(User_friends_count)~log(User_followers_count), data=sampleData), col="red", lwd=2, lty=1)   # y ~ x, matching the plot axes

lines(lowess(log(sampleData$User_followers_count), log(sampleData$User_friends_count)), col="blue", lwd=2, lty=2)

7. R for Statistics – Linear Model I

Estimation

Linear models are estimated using the lm() function. It is a good idea to assign the model to an object so that the model information can be accessed later. The dependent variable is listed first; all independent variables follow the ~.

fit <- lm(log(User_followers_count)~log(User_friends_count)+User_gender+as.factor(User_verified), data=sampleData)

Let's see the results:

summary(fit)
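To check that `lm()` recovers what was put in, the sketch below simulates data with known coefficients and fits the same kind of log-log model. All variable names and values are made up; nothing here comes from sampleData.

```r
set.seed(42)   # made-up data with known coefficients on the log scale
n <- 200
friends   <- exp(rnorm(n, mean = 5))
verified  <- rep(c(TRUE, FALSE), length.out = n)
followers <- exp(1 + 0.8 * log(friends) + 0.5 * verified + rnorm(n, sd = 0.3))
toy <- data.frame(followers, friends, verified)

fit_toy <- lm(log(followers) ~ log(friends) + as.factor(verified), data = toy)
est <- coef(fit_toy)   # estimates should land near 1, 0.8, and 0.5
summary(fit_toy)
```

The coefficient on `log(friends)` reads as an elasticity: a 1% increase in friends goes with roughly a 0.8% increase in followers in this simulated data.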

7. R for Statistics – Linear Model II

Post-estimation Plots

There is a series of diagnostic plots available with the plot() command.

plot(fit)

This is also accessible in a single chart

layout(matrix(c(1,2,3,4),2,2))

plot(fit)

[Figure: the four diagnostic plots produced by plot(fit): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance).]

7. R for Statistics – Linear Model III

Information

In addition, quite a bit of information is available within the fit object (a comprehensive list is here: http://www.inside-r.org/r-doc/stats/lm).

x <- residuals(fit)   # access the model residuals
plot(x)   # plot the residuals
abline(a=0, b=0, col="red")   # add a horizontal line: intercept (a) = 0, slope (b) = 0, red (col)
confint(fit)   # confidence intervals for each coefficient
fitted(fit)   # predicted values
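The accessor functions can be tried on `cars`, a small dataset that ships with R, in place of the course data (the names `fit_cars`, `res`, `ci`, and `pred` are made up for this sketch):

```r
fit_cars <- lm(dist ~ speed, data = cars)   # cars is a built-in dataset

res  <- residuals(fit_cars)   # observed minus fitted values
ci   <- confint(fit_cars)     # 95% confidence intervals by default
pred <- fitted(fit_cars)      # predicted values for the observed data

# Fitted value plus residual gives back the observed outcome
check <- all.equal(unname(pred + res), cars$dist)
```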

Hands-On

Homework 3.1

a. Read the csv file "authorlist.csv".

b. Select the columns "Author Name" and "Discipline". The variable Discipline contains one word or a set of words.

c. Output: a co-occurrence matrix M, e.g., M[1,1] = communication, health, 5 authors.

Test the hypothesis:

People are more inclined to follow the ones who are verified when controlling for gender difference and the number of friends.
