Getting Started with Python and R for Text Analysis
Jason T. Kiley, Oklahoma State University
Overview
Goal: familiarization with tools and specific resources for learning to use Python (and R) for text analysis.
How to get started with Python
Gathering and processing data
Using data for analyses
Getting started: Good news!
You know more about programming basics than you may realize.
Most statistical software eventually requires you to learn about different ways of formatting data (e.g. strings and dates).
Commands often require that you specify options in particular ways and provide particular kinds of data, much like functions in Python and R.
Getting started: software
Download Python and R.
Choose a good text editor that is designed for coding. I prefer Atom.
Download add-on software for your text editor, as needed.
Install the analysis packages that you would like to try out. Hint: start with TextBlob.
PYTHON: HELLO
TEXTEDIT (BAD)
ATOM (GOOD)
Getting started: What to learn
Start with the basics: data types, operators, and control structures. These are things that your statistical software (partially) hides from you.
Learn how to read and write files and work with filenames and paths.
Spend less time on classes and inheritance.
Once you can comfortably manipulate your text data into desired forms (e.g. splitting files, extracting titles and body text, combining texts and metadata into CSVs), move to analysis tools.
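As a rough illustration of the file-handling skills listed above, the sketch below reads a folder of plain-text files, pulls a date out of each filename, and combines the texts and metadata into one CSV. The filenames, the "title, blank line, body" layout, and the column names are all hypothetical, chosen only for the example.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data: two small text files in a temporary folder.
workdir = Path(tempfile.mkdtemp())
(workdir / "release_2015-01-05.txt").write_text(
    "Acme Opens New Plant\n\nAcme Corp. said today that it will expand.",
    encoding="utf-8",
)
(workdir / "release_2015-02-10.txt").write_text(
    "Acme Names New CEO\n\nThe board announced a new chief executive.",
    encoding="utf-8",
)

rows = []
for path in sorted(workdir.glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    # Assumed layout: title, blank line, then body.
    title, _, body = text.partition("\n\n")
    # Assumed filename pattern: <label>_<date>.txt
    date = path.stem.split("_")[1]
    rows.append(
        {"filename": path.name, "date": date, "title": title, "body": body.strip()}
    )

# Write texts and metadata together as one CSV.
out_path = workdir / "texts.csv"
with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["filename", "date", "title", "body"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to {out_path}")
```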
LEXISNEXIS: MULTIPLE TEXTS PER FILE
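One way to handle a download that holds several texts is to split on the separator line between documents. The sketch below assumes each document begins with a line such as "1 of 2 DOCUMENTS"; check your own downloads, since the exact separator is an assumption here and the sample text is made up.

```python
import re

# Made-up sample mimicking a download with two documents in one file.
raw = """\
                               1 of 2 DOCUMENTS

The Daily Example

Acme Opens New Plant

Body of the first story.

                               2 of 2 DOCUMENTS

The Daily Example

Acme Names New CEO

Body of the second story.
"""

# Assumed separator: a line containing only "<n> of <m> DOCUMENTS".
SEPARATOR = re.compile(r"^[ \t]*\d+ of \d+ DOCUMENTS[ \t]*$", flags=re.MULTILINE)

# Split on the separator and drop empty chunks.
documents = [chunk.strip() for chunk in SEPARATOR.split(raw) if chunk.strip()]

for number, document in enumerate(documents, start=1):
    print(number, document.splitlines()[0])
```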
Data: Collecting and reading
Gather data in forms that are easiest to work with.
Process new (or existing) data into usable formats.
Extract the information that we want to analyze.
Use data for analyses.
Data: gathering
In general, plain text is best, and the closer a format is to plain text, the better.
Some other formats (e.g. CSV) are plain text files that adhere to a further specification.
LexisNexis: choose plain text (*.txt).
VIEWING A CSV AS A SPREADSHEET
VIEWING A CSV AS TEXT
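To make the point concrete, the short sketch below treats a CSV first as ordinary text and then as rows, using Python's built-in csv module. The column names and values are invented for the example.

```python
import csv
import io

# A CSV is just text: commas separate fields, and quotes protect
# fields that themselves contain commas.
csv_text = (
    "id,source,headline\n"
    '1,Daily Example,"Acme Opens New Plant, Adds Jobs"\n'
    "2,Example Wire,Acme Names New CEO\n"
)

print(csv_text)  # what you would see in a text editor

# The same text parsed into rows, as a spreadsheet would show it.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["id"], "|", row["source"], "|", row["headline"])
```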
Data: gathering other types
“But, my data is .rtf, .doc, HTML, Morse code...!”
You will need some additional processing steps, but you should be fine.
Factiva: gather .rtf files and process them into plain text.
HTML: strip tags or use Beautiful Soup to parse pages.
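For the tag-stripping option, a minimal sketch using only the standard library's html.parser is shown below. The TextExtractor class and html_to_text function are hypothetical names written for this example, and the sample page is invented. For messy real-world pages, Beautiful Soup is usually the more robust choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside script or style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><style>p {color: red;}</style></head><body>"
    "<h1>Acme Opens New Plant</h1>"
    "<p>Acme Corp. said <b>today</b> that it will expand.</p>"
    "</body></html>"
)
text = html_to_text(page)
print(text)
```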
Data: extracting information
We often want something less than the full text that we gathered.
Examples
Press releases and news stories: analyze headlines separately or with a weight.
Web pages: analyze the body content or comments.
Some libraries have their own tools, but you may have to extract data yourself using regular expressions.
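A small sketch of regular-expression extraction follows. It assumes a news story with labeled metadata lines such as "BYLINE:" and "LENGTH:"; the sample text and field layout are illustrative, so adjust the patterns to whatever your own source actually contains.

```python
import re

# Invented sample story with labeled metadata lines.
text = """Acme Opens New Plant

BYLINE: Jane Doe

SECTION: BUSINESS; Pg. 1

LENGTH: 412 words

Acme Corp. said today that it will open a new plant.
"""

# re.MULTILINE lets ^ and $ match at the start and end of each line.
byline = re.search(r"^BYLINE:\s*(.+)$", text, flags=re.MULTILINE)
length = re.search(r"^LENGTH:\s*(\d+) words$", text, flags=re.MULTILINE)

# Assumed layout: the headline is the first non-empty line.
headline = text.strip().splitlines()[0]

print(headline)
print(byline.group(1) if byline else None)
print(int(length.group(1)) if length else None)
```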
Data: workflow
Gather raw data.
Write code that extracts the data you want from one text. This is often the most challenging part.
Make the single-text code into a function.
Write the code that opens files, processes each one using your function, and writes out the data that you want to analyze.
EXAMPLE: FUNCTION
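The workflow steps can be sketched as follows. This is not the function from the original slide; it is a minimal illustration with hypothetical names (extract_fields, process_folder) and an assumed "title, blank line, body" layout. The first function handles one text, and the second opens each file, applies the function, and writes the results to a CSV.

```python
import csv
import re
import tempfile
from pathlib import Path


def extract_fields(text):
    """Extract title, word count, and body from one text.

    Assumed layout: title on the first line, a blank line, then the body.
    """
    title, _, body = text.strip().partition("\n\n")
    body = body.strip()
    return {
        "title": title.strip(),
        "n_words": len(re.findall(r"\w+", body)),
        "body": body,
    }


def process_folder(in_dir, out_csv):
    """Run extract_fields on every .txt file in in_dir and write one CSV."""
    fieldnames = ["filename", "title", "n_words", "body"]
    count = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for path in sorted(Path(in_dir).glob("*.txt")):
            row = extract_fields(path.read_text(encoding="utf-8"))
            row["filename"] = path.name
            writer.writerow(row)
            count += 1
    return count


# Demonstration with two made-up files in a temporary folder.
in_dir = Path(tempfile.mkdtemp())
(in_dir / "a.txt").write_text(
    "Acme Opens New Plant\n\nAcme Corp. said today that it will expand.",
    encoding="utf-8",
)
(in_dir / "b.txt").write_text(
    "Acme Names New CEO\n\nThe board announced a new chief executive.",
    encoding="utf-8",
)
out_csv = in_dir / "extracted.csv"
n_processed = process_folder(in_dir, out_csv)
print(f"Processed {n_processed} files into {out_csv}")
```

Getting the single-text function right first makes the loop trivial: any change to what you extract happens in one place.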
Analyses
Generally, you will use collections of strings (perhaps with metadata) for text analysis.
You may also process texts into CSVs that you can use for fast human coding of either a variable of interest or as a training set for machine learning.
As Laura showed us, there are many techniques and tools available, so read up on the particular library that you intend to use.
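As a very simple example of analyzing a collection of strings with metadata, the sketch below counts word frequencies across a tiny invented corpus using only the standard library. The corpus, the tokenize helper, and the tokenization rule are assumptions for illustration; a dedicated library will offer far better tokenization and many more techniques.

```python
import re
from collections import Counter

# A tiny invented corpus: each text carries a piece of metadata.
corpus = [
    {"source": "Daily Example", "text": "Acme opens new plant and adds jobs."},
    {"source": "Example Wire", "text": "Acme names new CEO as plant opens."},
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Count word frequencies across the whole corpus.
counts = Counter()
for doc in corpus:
    counts.update(tokenize(doc["text"]))

print(counts.most_common(4))
```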
COMMENTS AND QUESTIONS