GETTING STARTED WITH PYTHON AND R FOR TEXT ANALYSIS JASON T. KILEY OKLAHOMA STATE UNIVERSITY

Post on 08-Dec-2015

Page 1: Getting Started With Python and R for Text Analysis

GETTING STARTED WITH PYTHON AND R FOR TEXT ANALYSIS

JASON T. KILEY, OKLAHOMA STATE UNIVERSITY

Link to Reference Handout

http://bit.ly/1eXyIFN

Overview

Goal: familiarization with tools and specific resources for learning to use Python (and R) for text analysis.

How to get started with Python

Gathering and processing data

Using data for analyses

Getting started: Good news!

You know more about programming basics than you may realize.

Most statistical software eventually requires you to learn about different ways of formatting data (e.g. strings and dates).

Commands often require that you specify options in particular ways and provide particular kinds of data, much like functions in Python and R.
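
As a small illustration of that parallel, here is a minimal sketch using Python's built-in sorted function. The word list is made up for the example; the point is only that keyword arguments such as key and reverse play the same role as options on a statistical command.

```python
# Hypothetical example data: a few words to sort.
words = ["banana", "apple", "Cherry"]

# Default behavior: uppercase letters sort before lowercase ones.
default_order = sorted(words)

# Options (keyword arguments) change the behavior of the same function.
ignoring_case = sorted(words, key=str.lower)
reversed_order = sorted(words, key=str.lower, reverse=True)

print(default_order)
print(ignoring_case)
print(reversed_order)
```

The function also expects a particular kind of data (something it can loop over), just as a command expects variables of a particular type.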

Getting started: software

Download Python and R.

Choose a good text editor that is designed for coding. I prefer Atom.

Download add-on software for your text editor, as needed.

Install the analysis packages that you would like to try out. Hint: start with TextBlob.

PYTHON: HELLO
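
The original slide showed a screenshot that is not preserved in this transcript. A first Python script along these lines would serve the same purpose; the variable name and message are assumptions, not the presenter's code.

```python
# Store a string in a variable, then print it to the screen.
greeting = "Hello, world!"
print(greeting)
```

Save this in a file such as hello.py (a hypothetical name) and run it from the command line with the Python interpreter.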

TEXTEDIT (BAD)

ATOM (GOOD)

Getting started: What to learn

Start with the basics: data types, operators, and control structures. These are things that your statistical software (partially) hides from you.

Learn how to read and write files and work with filenames and paths.

Spend less time on classes and inheritance.

Once you can comfortably manipulate your text data into desired forms (e.g. splitting files, extracting titles and body text, combining texts and metadata into CSVs), move to analysis tools.
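
A minimal sketch of the file and path skills listed above, using only the standard library. The file name and its contents are made up, and the sketch assumes the title is everything before the first blank line; a temporary folder keeps the example self-contained.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Build a path from parts and write a small sample text (hypothetical content).
    path = folder / "release_001.txt"
    path.write_text(
        "Acme announces new product\n\nThe product ships in May.\n",
        encoding="utf-8",
    )

    # Read the file back and split it into a title and a body.
    text = path.read_text(encoding="utf-8")
    title, _, body = text.partition("\n\n")
    title = title.strip()
    body = body.strip()

    # Work with the filename: its stem, and the same name with a new extension.
    stem = path.stem
    csv_name = path.with_suffix(".csv").name

    # List every .txt file in the folder.
    txt_names = sorted(p.name for p in folder.glob("*.txt"))

print(title)
print(body)
print(stem, csv_name, txt_names)
```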

LEXISNEXIS: MULTIPLE TEXTS PER FILE
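
The screenshot for this slide is not preserved. As a rough sketch of how such a file might be split, the code below assumes each text is preceded by a separator line of the form "1 of 2 DOCUMENTS"; that layout and the sample content are assumptions, so check your own download before relying on the pattern.

```python
import re

# Hypothetical download containing two texts in one file.
raw = """\
                    1 of 2 DOCUMENTS

The Daily Example
Acme announces new product

                    2 of 2 DOCUMENTS

The Daily Example
Acme names new CEO
"""

# A separator is a line holding only "N of M DOCUMENTS" (assumed format).
separator = re.compile(r"^[ \t]*\d+ of \d+ DOCUMENTS[ \t]*$", re.MULTILINE)

# Split on the separator and drop the empty piece before the first text.
documents = [part.strip() for part in separator.split(raw) if part.strip()]

for number, document in enumerate(documents, start=1):
    print(number, repr(document))
```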

Data: Collecting and reading

Gather data in forms that are easiest to work with.

Process new (or existing) data into usable formats.

Extract the information that we want to analyze.

Use data for analyses.

Data: gathering

In general, plain text is best, and the closer a format is to plain text, the better.

Some other formats (e.g. CSV) are plain text files that adhere to a further specification.

LexisNexis: choose plain text (*.txt).

VIEWING A CSV AS A SPREADSHEET

VIEWING A CSV AS TEXT
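
The screenshots for these two slides are not preserved. To make the same point in code, this sketch shows a CSV as the plain text it is and then parses it with Python's csv module. The column names and rows are made up for illustration.

```python
import csv
import io

# A CSV is just text: commas separate fields, and quotes protect commas in a field.
csv_text = (
    "id,headline,words\n"
    '1,"Acme, Inc. announces new product",5\n'
    "2,Acme names new CEO,4\n"
)
print(csv_text)

# The csv module turns each line into a dictionary keyed by the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["id"], row["headline"], row["words"])
```

Note that every value comes back as a string, so numeric columns need converting (for example with int) before you do arithmetic on them.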

Data: gathering other types

“But my data is .rtf, .doc, HTML, Morse code...!”

You will need some additional processing steps, but you should be fine.

Factiva: gather .rtf files and process them into plain text.

HTML: strip tags or use Beautiful Soup to parse pages.
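
For the "strip tags" route, here is a minimal sketch using the standard library's html.parser module rather than Beautiful Soup, so that it runs without installing anything. The TextExtractor class and the sample page are hypothetical; for messy real-world pages, Beautiful Soup is usually the more forgiving choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


# Hypothetical page content.
page = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Acme news</h1><p>Sales rose <b>10%</b> this year.</p></body></html>"
)

parser = TextExtractor()
parser.feed(page)
parser.close()
plain_text = " ".join(parser.parts)
print(plain_text)
```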

Data: extracting information

We often want something less than the full text that we gathered.

Examples

Press releases and news stories: analyze headlines separately or with a weight.

Web pages: analyze the body content or comments.

Some libraries have their own tools, but you may have to extract data yourself using regular expressions.
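
A minimal sketch of extraction with regular expressions. The "HEADLINE:" and "DATE:" labels and the sample story are a hypothetical layout chosen for the example; your source will have its own markers, and the patterns need to be adapted to them.

```python
import re

# Hypothetical news story with labeled fields followed by the body text.
text = """HEADLINE: Acme announces new product
DATE: 2015-06-01

Acme said on Monday that it will ship a new product this fall.
Analysts expect strong demand.
"""

# re.MULTILINE lets ^ and $ match at the start and end of each line.
headline_match = re.search(r"^HEADLINE:\s*(.+)$", text, flags=re.MULTILINE)
date_match = re.search(r"^DATE:\s*(\d{4}-\d{2}-\d{2})$", text, flags=re.MULTILINE)

# Guard against texts where a field is missing.
headline = headline_match.group(1).strip() if headline_match else None
date = date_match.group(1) if date_match else None

# Treat everything after the first blank line as the body.
_, _, body = text.partition("\n\n")
body = body.strip()

print(headline)
print(date)
print(body)
```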

Data: workflow

Gather raw data.

Write code that extracts the data you want from one text. This is often the most challenging part.

Make the single-text code into a function.

Write the code that opens files, processes each one using your function, and writes out the data that you want to analyze.
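
The workflow above can be sketched as follows. This is one possible arrangement, not the presenter's code: extract_fields and process_folder are hypothetical names, the sample files are made up, and the extraction rule (first line is the headline, the rest is the body) is an assumption that stands in for whatever your texts require.

```python
import csv
import tempfile
from pathlib import Path


def extract_fields(text):
    """Single-text code: return the headline (first line) and the body (the rest)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    headline = lines[0] if lines else ""
    body = " ".join(line for line in lines[1:] if line)
    return {"headline": headline, "body": body}


def process_folder(in_dir, out_path):
    """Open each .txt file, apply extract_fields, and write one CSV row per file."""
    rows = []
    for path in sorted(Path(in_dir).glob("*.txt")):
        fields = extract_fields(path.read_text(encoding="utf-8"))
        fields["file"] = path.name
        rows.append(fields)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "headline", "body"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demonstration with two made-up raw files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.txt").write_text(
        "Acme announces new product\n\nIt ships in May.\n", encoding="utf-8"
    )
    (folder / "b.txt").write_text(
        "Acme names new CEO\n\nThe board voted.\n", encoding="utf-8"
    )
    out_file = folder / "texts.csv"
    rows = process_folder(folder, out_file)
    csv_lines = out_file.read_text(encoding="utf-8").splitlines()

print(csv_lines)
```

Keeping the single-text logic in its own function means you can test it on one difficult text at a time before running it over the whole folder.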

EXAMPLE: FUNCTION

Analyses

Generally, you will use collections of strings (perhaps with metadata) for text analysis.

You may also process texts into CSVs that you can use either for fast human coding of a variable of interest or as a training set for machine learning.

As Laura showed us, there are many techniques and tools available, so read up on the particular library that you intend to use.
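
As one simple illustration of working with a collection of strings plus metadata, this sketch counts words from a small word list in each text. The firms, texts, and word list are all made up, and a dedicated library will offer far more than this; it is only meant to show the shape of the data.

```python
import re
from collections import Counter

# A collection of strings with metadata (hypothetical data).
texts = [
    {"firm": "Acme", "text": "Strong growth and strong profit this quarter."},
    {"firm": "Birch", "text": "Weak sales led to a loss."},
]

# Hypothetical word list; real studies use validated dictionaries.
positive_words = {"strong", "growth", "profit"}


def tokenize(text):
    """Lowercase the text and return its words."""
    return re.findall(r"[a-z']+", text.lower())


results = []
for item in texts:
    tokens = tokenize(item["text"])
    counts = Counter(tokens)
    n_positive = sum(counts[word] for word in positive_words)
    results.append(
        {"firm": item["firm"], "n_words": len(tokens), "n_positive": n_positive}
    )

for row in results:
    print(row)
```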

COMMENTS AND QUESTIONS

Link to Reference Handout

http://bit.ly/1eXyIFN