Digital Text and Data Processing

Introduction to R

□ Tools themselves are often based on specific assumptions / subjective decisions

□ There is subjectivity in the way in which tools are used

□ Reproducible results

□ Rockwell & Ramsay, in “Developing Things”: A tool is a theory

Objectivity of DH Research

http://dhdebates.gc.cuny.edu/debates#text/11

Willard McCarty, Humanities Computing (Palgrave, 2005)

"The point of all modelling exercises, as of scholarly research generally, is the process seen in and by means of a developing product, not the definitive achievement" (p. 22).

Models, "however finely perfected, are better understood as temporary states in a process of coming to know rather than fixed structures of knowledge" (p. 27).

-> Clash between the scholar’s tacit, intuitive knowledge and the computer’s need for consistency and explicitness

□ Data creation

□ Data analysis

Two stages in text mining

□ Finding distinctive vocabulary

□ Finding stylistic or grammatical differences and similarities

□ Examining topics or themes

□ Clustering texts on the basis of quantifiable aspects

Types of analyses

# Collect the names of all .txt files in $dir into @files.
opendir(my $dh, $dir) or die "Can't open directory $dir: $!";
while (my $file = readdir($dh)) {
    push(@files, $file) if $file =~ /\.txt$/;
}
closedir($dh);

Reading a directory

Inverse document frequency

For an application, see Stephen Ramsay, Algorithmic Criticism

http://www.digitalhumanities.org/companion/view?docId=blackwell/9781405148641/9781405148641.xml&chunk.id=ss1-6-7
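The formula itself is not preserved on this slide. A commonly used definition is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents that contain the term t. Other variants (smoothing, different logarithm bases) exist. The sketch below assumes this basic form and uses a small term-document matrix whose counts are invented for illustration only.

```r
# Toy term-document matrix: rows are documents, columns are terms.
# The counts are invented purely for illustration.
counts <- matrix(
  c(3, 0, 1,
    2, 0, 0,
    4, 5, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("the", "whale", "ship"))
)

n_docs   <- nrow(counts)          # N: number of documents
doc_freq <- colSums(counts > 0)   # df(t): documents that contain each term
idf      <- log(n_docs / doc_freq)  # idf(t) = log(N / df(t)), natural log

# tf-idf: multiply every count by the idf of its term (its column).
tf_idf <- sweep(counts, 2, idf, "*")
print(round(tf_idf, 2))
```

A term that occurs in every document ("the" in this toy matrix) receives an idf of zero, so its tf-idf weight vanishes regardless of how often it is used.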

□ Both a programme and a programming language

□ Successor of “S”

□ “a free software environment for statistical computing and graphics”

□ The capabilities of R can be extended via external “packages”

□ Names may contain any combination of alphanumerical characters, underscores and dots

□ Unlike Perl variables, they do not begin with a $

□ The first character cannot be a digit or an underscore. The second character cannot be a digit if the first character is a dot

Variables in R

Allowed:          Not allowed:
data              3rdDataSet
my.data           .4thData.set
my_2ndDataSet
.myCsv
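One way to check these rules in R itself is a small sketch built on the base function make.names(), which rewrites any string into a syntactically valid name. A string that comes back unchanged was already valid. Note that make.names() also alters reserved words such as "if", which goes slightly beyond the rules listed on the slide.

```r
# Candidate names taken from the slide.
candidates <- c("data", "my.data", "my_2ndDataSet", ".myCsv",
                "3rdDataSet", ".4thData.set")

# make.names() returns a valid version of each string;
# unchanged strings were valid to begin with.
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates
print(is_valid)
```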

□ A collection of indexed values

□ Can be created using the c() function, or by supplying a range

□ N.B. The assignment operator in R is <-

□ Examples:

Vectors

x <- c(4, 5, 3, 7)

y <- 1:30
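As a brief illustration of what "indexed values" means in practice, the following sketch reuses the two vectors from the slide and reads elements back by position. Only base R is used.

```r
x <- c(4, 5, 3, 7)   # built from individual values
y <- 1:30            # built from a range: 1, 2, ..., 30

length(x)    # number of elements
x[1]         # first element; R counts from 1, not from 0
x[2:3]       # a range of positions returns a sub-vector
x * 2        # arithmetic is applied to every element
```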

□ A collection of vectors, all of the same length

□ Each column of the table is stored in R as a vector.

Data frame

      V1   V2   V3
R1     3    4    5
R2     1   21    8
R3    23    5    6
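The table on this slide could be built by hand as follows. This is a minimal sketch using the base function data.frame(), with one vector per column.

```r
# Each named argument becomes one column; all vectors need the same length.
df <- data.frame(
  V1 = c(3, 1, 23),
  V2 = c(4, 21, 5),
  V3 = c(5, 8, 6),
  row.names = c("R1", "R2", "R3")
)

print(df)
dim(df)       # number of rows and number of columns
df[["V2"]]    # a single column, returned as a plain vector
```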

Comma Separated Values

i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356

N.B. The first row has one column less

□ Use the read.csv function, with parameter header = TRUE

□ The CSV file will be represented as a data frame

□ The values on the first line will be used as colnames, and the first value of each subsequent line as rownames

Reading data

data <- read.csv("data.csv", header = TRUE)

colnames(data)
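To try this without a file on disk, the CSV content from the earlier slide can be passed to read.csv() through its text argument. The sketch below assumes exactly that sample data. Because the header line has one field fewer than the other lines, R uses the first field of each data line as the row name.

```r
# CSV content from the slide, held in a string instead of a file.
csv_text <- "i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356"

data <- read.csv(text = csv_text, header = TRUE)

colnames(data)   # taken from the first line
rownames(data)   # taken from the first value of each later line
```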

□ Can be accessed using the $ operator

Data frame columns

data <- read.csv("data.csv", header = TRUE)

data$you

□ max(), min(), mean(), sd()

Calculations

y <- data$you

max(y)

sd(y)
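A self-contained version of these steps might look as follows. It assumes the sample CSV data shown earlier and reads it from a string, so that no file is needed. sum() is added here because it is useful for corpus totals, although it is not listed on the slide.

```r
csv_text <- "i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356"

data <- read.csv(text = csv_text, header = TRUE)

y <- data$you   # the column "you" as a numeric vector

max(y)    # highest value
min(y)    # lowest value
mean(y)   # arithmetic mean
sd(y)     # sample standard deviation
sum(y)    # total over all texts
```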

□ Run the program “typeToken.pl”

□ Use the file “ratio.csv” that is created by this program.

□ Print a list of all the texts that have been read

□ Calculate the average number of tokens

□ Calculate the total number of tokens in the full corpus

□ Identify the lowest number in the column “types”

□ Identify the highest number in the column “ratio”

Exercise

d <- read.csv("data.csv")

cell <- d[1, 2]    # the value in row 1, column 2

row2 <- d[2, ]     # the whole second row

od <- d[order(d$ratio), ]    # all rows, sorted by the column "ratio" (ascending)

Subsetting and sorting
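The bracket notation takes the form [row, column], and leaving one side empty selects everything along that dimension. The sketch below illustrates this on a small data frame. The column names follow the exercise slide, but the texts and values are invented for illustration.

```r
# Invented values; only the column names come from the exercise slide.
d <- data.frame(
  tokens = c(1000, 2500, 1800),
  types  = c(400, 700, 630),
  ratio  = c(0.40, 0.28, 0.35),
  row.names = c("textA", "textB", "textC")
)

d[1, 2]         # single value: row 1, column 2
d[2, ]          # whole second row (still a data frame)
d[, "ratio"]    # whole column, as a vector

# order() returns the row positions that sort the column in ascending order.
od <- d[order(d$ratio), ]

# decreasing = TRUE reverses the direction.
od_desc <- d[order(d$ratio, decreasing = TRUE), ]
```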

□ Qualitative data (categorical)

    □ Nominal scale (unordered scale), e.g. eye colour, marital status

    □ Ordinal scale (ordered scale), e.g. educational level

□ Quantitative data

    □ Interval (scale with no mathematical zero)

    □ Ratio (multipliable scale), e.g. age

Quantitative and Qualitative

Source: Seminar Basic Statistics, Laura Bettens
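In R, these scales roughly correspond to different data types: unordered factors for nominal data, ordered factors for ordinal data, and numeric vectors for quantitative data. The sketch below shows this mapping with invented sample values. The mapping is a common convention rather than something stated in the source.

```r
# Nominal: categories without an inherent order.
eye_colour <- factor(c("brown", "blue", "green", "brown"))

# Ordinal: the order is declared through levels and ordered = TRUE.
education <- factor(
  c("secondary", "primary", "tertiary", "secondary"),
  levels = c("primary", "secondary", "tertiary"),
  ordered = TRUE
)

# Ratio scale: an ordinary numeric vector.
age <- c(34, 21, 58, 40)

table(eye_colour)         # counting is meaningful for nominal data
education < "tertiary"    # comparisons need an ordered factor
mean(age)                 # arithmetic is meaningful for quantitative data
```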

□ Two quantitative variables can be visualised in a variety of ways (e.g. line chart, pie chart)

□ A combination of one qualitative variable and one quantitative variable is best presented using a bar chart or a dot chart

Diagrams
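A bar chart and a dot chart for one qualitative variable (the novel) and one quantitative variable (a word count) can be drawn with the base functions barplot() and dotchart(). The sketch below uses the counts for "you" from the CSV slide. Writing to a temporary PDF file is a choice made here so that the code also runs outside an interactive session.

```r
# Counts of "you" per novel, as given on the CSV slide.
you <- c(Emma = 3178, Persuasion = 1284, PrideAndPrejudice = 2068)

# Draw into a temporary PDF so the sketch also runs non-interactively;
# leave out pdf() and dev.off() to draw on screen instead.
out <- tempfile(fileext = ".pdf")
pdf(out)

barplot(you, ylab = "Occurrences of 'you'")    # one bar per novel
dotchart(you, xlab = "Occurrences of 'you'")   # one dot per novel

invisible(dev.off())   # close the device and finish writing the file
```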
