GETTING STARTED WITH PYTHON AND R FOR TEXT ANALYSIS

JASON T. KILEY, OKLAHOMA STATE UNIVERSITY

Link to Reference Handout

http://bit.ly/1eXyIFN


Overview

Goal: familiarization with tools and specific resources for learning to use Python (and R) for text analysis.

How to get started with Python

Gathering and processing data

Using data for analyses

Getting started: Good news!

You know more about programming basics than you may realize.

Most statistical software eventually requires you to learn about different ways of formatting data (e.g. strings and dates).

Commands often require that you specify options in particular ways and provide particular kinds of data, much like functions in Python and R.
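
To make the comparison concrete, here is a minimal sketch of a Python function that takes data plus options, much as a statistical command does. The function name `summarize` and its options `digits` and `drop_missing` are hypothetical, chosen only for illustration.

```python
def summarize(values, digits=2, drop_missing=True):
    """Return the mean of values, rounded to `digits` decimal places."""
    if drop_missing:
        # Treat None as a missing value, like a blank cell in a dataset.
        values = [v for v in values if v is not None]
    return round(sum(values) / len(values), digits)


# The data is passed first; the options are passed by name.
mean_score = summarize([3, 4, None, 5], digits=1)
print(mean_score)
```

The function insists on a particular kind of data (a list of numbers) and accepts options in a particular form, which is the same discipline that statistical commands already impose.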

Getting started: software

Download Python and R.

Choose a good text editor that is designed for coding. I prefer Atom.

Download add-on software for your text editor, as needed.

Install the analysis packages that you would like to try out. Hint: start with TextBlob.

PYTHON: HELLO
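
A first program might look like the sketch below. The function name `greeting` is an illustrative assumption, not taken from the slides.

```python
def greeting(name="world"):
    """Build a greeting string for the given name."""
    return f"Hello, {name}!"


print(greeting())
```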

TEXTEDIT (BAD)

ATOM (GOOD)

Getting started: What to learn

Start with the basics: data types, operators, and control structures. These are things that your statistical software (partially) hides from you.

Learn how to read and write files and work with filenames and paths.

Spend less time on classes and inheritance.

Once you can comfortably manipulate your text data into desired forms (e.g. splitting files, extracting titles and body text, combining texts and metadata into CSVs), move to analysis tools.
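
The basics listed above fit in a few lines. The following is a minimal sketch, assuming a hypothetical file `release_01.txt` whose first line is a title; the file is created in a temporary folder so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# Hypothetical folder and file, created here so the sketch is self-contained.
workdir = Path(tempfile.mkdtemp())
source = workdir / "release_01.txt"
source.write_text(
    "Acme Hires CEO\nAcme named a new chief executive today.\n",
    encoding="utf-8",
)

text = source.read_text(encoding="utf-8")   # data type: str
lines = text.splitlines()                   # data type: list of str
title, body_lines = lines[0], lines[1:]     # indexing and slicing

word_count = 0                              # data type: int
for line in body_lines:                     # control structure: loop
    if line.strip():                        # control structure: condition
        word_count += len(line.split())     # operators: += and function calls

# Build an output path from the input path, then write the result.
target = source.with_name(source.stem + "_summary.txt")
target.write_text(f"{title}\t{word_count}\n", encoding="utf-8")
print(target.name, word_count)
```

Nothing here uses classes or inheritance; strings, lists, loops, and paths cover most of the early text-wrangling work.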

LEXISNEXIS: MULTIPLE TEXTS PER FILE
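
When one download holds many texts, the first step is to split it. Below is a minimal sketch, assuming each document starts with a separator line of the form "1 of 50 DOCUMENTS". That pattern is an assumption to check against your own files, and `split_documents` is a hypothetical helper name.

```python
import re

# Assumed separator: a line such as "2 of 3 DOCUMENTS". Verify against your download.
SEPARATOR = re.compile(r"^\s*\d+ of \d+ DOCUMENTS\s*$", flags=re.MULTILINE)


def split_documents(raw):
    """Split one multi-document string into a list of single-document strings."""
    parts = SEPARATOR.split(raw)
    # Drop the empty piece before the first separator and trim whitespace.
    return [part.strip() for part in parts if part.strip()]


sample = """1 of 2 DOCUMENTS

First story headline
First story body.

2 of 2 DOCUMENTS

Second story headline
Second story body.
"""

documents = split_documents(sample)
print(len(documents))
```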

Data: Collecting and reading

Gather data in forms that are easiest to work with.

Process new (or existing) data into usable formats.

Extract the information that we want to analyze.

Use data for analyses.

Data: gathering

In general, plain text is best, and the closer a format is to plain text, the better.

Some other formats (e.g. CSV) are plain text files that adhere to a further specification.

LexisNexis: choose plain text (*.txt).

VIEWING A CSV AS A SPREADSHEET

VIEWING A CSV AS TEXT
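
The two views are the same data. This minimal sketch uses Python's standard `csv` module to show the plain text behind a small table and to read it back into rows; the sample rows are made up for illustration.

```python
import csv
import io

rows = [
    ["id", "headline", "source"],
    [1, "Acme Hires CEO, Plans Expansion", "wire"],
    [2, 'Analyst: "Strong quarter"', "blog"],
]

# Write the rows as CSV into an in-memory text buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)  # the plain text that a text editor would show

# Reading the same text back yields rows and columns again (all values as strings).
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Fields containing commas or quotation marks are quoted in the text form, which is the "further specification" that separates a CSV from arbitrary plain text.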

Data: gathering other types

“But, my data is .rtf, .doc, HTML, Morse code...!”

You will need some additional processing steps, but you should be fine.

Factiva: gather .rtf files and process them into plain text.

HTML: strip tags or use Beautiful Soup to parse pages.
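
As a minimal sketch of the tag-stripping option, the example below uses only the standard library's `html.parser` so that it runs without installing anything; Beautiful Soup is a third-party alternative that handles messy pages more gracefully. The class name `TextExtractor`, the helper `strip_tags`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_tags(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Acme Hires CEO</h1>"
    "<p>Acme named a <b>new</b> chief executive.</p></body></html>"
)
print(strip_tags(page))
```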

Data: extracting information

We often want something less than the full text that we gathered.

Examples

Press releases and news stories: analyze headlines separately or with a weight.

Web pages: analyze the body content or comments.

Some libraries have their own tools, but you may have to extract data yourself using regular expressions.
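
A minimal sketch of extraction with regular expressions follows, assuming a hypothetical layout in which fields are labeled `HEADLINE:` and `BODY:`. Real sources differ, so the patterns would need to be adapted to the files at hand.

```python
import re

# Hypothetical layout with labeled fields; adjust the patterns to your own data.
text = """HEADLINE: Acme Hires CEO
DATE: May 1, 2014
BODY:
Acme named a new chief executive today.
The appointment takes effect in June.
"""

# MULTILINE lets ^ and $ match at the start and end of each line.
headline_match = re.search(r"^HEADLINE:\s*(.+)$", text, flags=re.MULTILINE)
# DOTALL lets . match newlines, so the group captures everything after "BODY:".
body_match = re.search(r"^BODY:\s*\n(.*)", text, flags=re.MULTILINE | re.DOTALL)

headline = headline_match.group(1).strip() if headline_match else ""
body = body_match.group(1).strip() if body_match else ""

print(headline)
print(body)
```

With the headline and body held in separate variables, they can be analyzed separately or combined with a weight.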

Data: workflow

Gather raw data.

Write code that extracts the data you want from one text. This is often the most challenging part.

Make the single-text code into a function.

Write the code that opens files, processes each one using your function, and writes out the data that you want to analyze.

EXAMPLE: FUNCTION
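
A minimal sketch of the workflow: a single-text function, then a loop that opens each file, applies the function, and writes a CSV. It assumes each file's first line is its title; the names `extract_fields` and `process_folder` and the sample files are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def extract_fields(text):
    """Single-text step: return the title (first line) and body (the rest)."""
    title, _, body = text.strip().partition("\n")
    # Collapse line breaks and repeated spaces in the body.
    return {"title": title.strip(), "body": " ".join(body.split())}


def process_folder(in_dir, out_csv):
    """Open each .txt file, apply extract_fields, and write one CSV row per file."""
    rows = []
    for path in sorted(Path(in_dir).glob("*.txt")):
        fields = extract_fields(path.read_text(encoding="utf-8"))
        rows.append({"file": path.name, **fields})
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "title", "body"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Hypothetical raw data in a temporary folder so the sketch runs anywhere.
raw_dir = Path(tempfile.mkdtemp())
(raw_dir / "a.txt").write_text(
    "Acme Hires CEO\nAcme named a new\nchief executive.\n", encoding="utf-8"
)
(raw_dir / "b.txt").write_text(
    "Acme Opens Plant\nThe plant opens in May.\n", encoding="utf-8"
)

rows = process_folder(raw_dir, raw_dir / "texts.csv")
print(rows[0]["title"])
```

Keeping the single-text logic in its own function means it can be tested on one text before it is run over the whole folder.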

Analyses

Generally, you will use collections of strings (perhaps with metadata) for text analysis.

You may also process texts into CSVs that you can use for fast human coding, either of a variable of interest or as a training set for machine learning.

As Laura showed us, there are many techniques and tools available, so read up on the particular library that you intend to use.
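
A minimal sketch of both uses, assuming a hypothetical two-text collection: a simple word count over strings with metadata, and a coding sheet with an empty `code` column for a human coder. A dedicated text analysis library would replace the counting step in practice.

```python
import csv
import io
from collections import Counter

# Hypothetical collection: each text is a string plus a little metadata.
texts = [
    {"id": "a", "year": 2013, "text": "Acme named a new chief executive."},
    {"id": "b", "year": 2014, "text": "The new plant opens in May."},
]

# A very simple analysis: word frequencies across the collection.
counts = Counter()
for item in texts:
    words = item["text"].lower().replace(".", "").split()
    counts.update(words)

# A coding sheet: one row per text, with an empty column for the coder to fill in.
sheet = io.StringIO()
writer = csv.DictWriter(sheet, fieldnames=["id", "year", "text", "code"])
writer.writeheader()
for item in texts:
    writer.writerow({**item, "code": ""})

print(counts.most_common(3))
print(sheet.getvalue())
```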

COMMENTS AND QUESTIONS
