Basic Text Analysis Tools

Markus Dickinson
Dept. of Linguistics, Indiana University
Catapult Workshop Series; February 1, 2013

Outline:
- Online Search
- Concordancing (AntConc)
- Word Frequency
- Annotating (UAM Corpus Tool)
- Automatically annotating




Basic text analysis

Before any sophisticated analysis, we want ways to get a “sense” of text data:

- online searching
- concordancing
- word frequency counting
- annotating

I recommend working with a scripting language like Perl or Python to get even more capabilities for these things.

A good resource to find various tools: http://tiny.cc/corpora (see “Software, Tools, Freq Lists, etc.”)


Corpus linguistics

The material I’ll be presenting is adapted from my course, L615 (Corpus Linguistics)
- Corpus linguistics is the study of how to gather & analyze linguistic data
- Feel free to grab whatever materials you find useful at: http://cl.indiana.edu/~md7/13/615/


Online search tools

You’ll often find web interfaces to corpora, which allow for searching

1. Pros: nice interfaces; useful for generating basic statistics, etc.; access to corpora you might not otherwise have access to
2. Cons: you’re limited by what the designer wants you to search for; you can’t run a tagger or parser on it; etc.


Some search tools

- Mark Davies’ corpus search interface: http://corpus.byu.edu/
- SketchEngine: http://www.sketchengine.co.uk
- Intellitext: http://corpus.leeds.ac.uk


Concordancing: AntConc

Concordancing is simply viewing words by their contexts, a.k.a. Keyword in Context (KWIC)

We’ll look specifically at AntConc, by Laurence Anthony (http://www.antlab.sci.waseda.ac.jp/software.html)
- Installation is relatively straightforward
- The README files provide a lot of information

For this example, we’ll use Les Miserables by Victor Hugo
- Available from Project Gutenberg: http://www.gutenberg.org/ebooks/135


AntConc: File loading & concordancing

Load in a file (File → Open File(s))
- Make sure this is plain text, i.e., not a Word document or anything with extra mark-up

1. Enter search term
2. Determine window size
3. Start the search
4. Then, select sorting options & sort

Clicking on a word allows you to see the original context
- Advanced options: can use word fragments, multiple search terms, regular expressions, context words

Note that you can export your search results to text files
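
The same steps (search term, window size, sorting) can be sketched in a few lines of Python. This is only an illustration of the KWIC idea, not how AntConc works internally; the function names, the lowercasing, the simple `\w+` tokenization, and the sample sentence are all assumptions made for this example.

```python
import re

def kwic(text, term, window=4):
    """Return (left, keyword, right) triples for each hit of `term`.

    `window` is the number of context tokens kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == term.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits

def sort_by_right(hits):
    """Sort hits alphabetically by the words following the keyword."""
    return sorted(hits, key=lambda hit: hit[2])

# Made-up sample text, not taken from the novel.
sample = "The bishop smiled. The old bishop walked home. A bishop must pray."
for left, kw, right in sort_by_right(kwic(sample, "bishop", window=2)):
    print(f"{' '.join(left):>12} [{kw}] {' '.join(right)}")
```

Sorting on the left context instead would only require changing the sort key to the reversed left-hand token list.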

[Two screenshots]

AntConc: Concordance plots

Concordance plots show where in a text the hits for a search term fall, so one can see their distribution at a glance
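
A rough text-only version of such a plot can be sketched in Python. The function names, the `\w+` tokenization, and the toy input are assumptions for illustration; AntConc's own plot is graphical and may be computed differently.

```python
import re

def hit_positions(text, term):
    """Return each hit's relative position in the text, from 0.0 to 1.0."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return []
    return [i / len(tokens) for i, tok in enumerate(tokens) if tok == term.lower()]

def text_plot(positions, width=40):
    """Draw a one-line 'barcode': '|' marks a slot containing a hit."""
    slots = ["-"] * width
    for pos in positions:
        slots[min(int(pos * width), width - 1)] = "|"
    return "".join(slots)

# Toy input: eight tokens, with hits at the first and fifth token.
sample = "war peace peace peace war peace peace peace"
print(text_plot(hit_positions(sample, "war"), width=8))
```

With a real text, a larger `width` gives a finer picture of where the hits cluster.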


AntConc: Clusters & N-grams

You can search for:
- Clusters involving a particular word
- All “n-grams” of a particular size
- Collocations involving a particular word
  - Note: there are a lot of methods & tools for finding collocations (e.g., http://www.collocations.de, http://www.d.umn.edu/~tpederse/nsp.html)
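
To make the first two options concrete, here is a minimal Python sketch. It assumes that a "cluster" can be approximated as an n-gram containing the search word, which is a simplification; the function names and the sample sentence are made up for this example.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clusters(tokens, word, n):
    """Return only the n-grams that contain `word`."""
    return [gram for gram in ngrams(tokens, n) if word in gram]

sample = "the cat sat on the mat and the cat slept"
tokens = re.findall(r"\w+", sample.lower())

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))        # the most frequent bigram
print(Counter(clusters(tokens, "cat", 2)))  # bigrams involving "cat"
```

Collocation measures go a step further by asking whether two words co-occur more often than chance would predict; that is not covered by this sketch.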

[Three screenshots]

AntConc: Word lists

Word lists are easy to generate
- This is an easy way to get word frequency counts

Note: you can set the preferences so as to use lemmas, if you have a file listing lemmas
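
The effect of a lemma list on a word list can be sketched as follows. The tab-separated "form, lemma" layout used here is a hypothetical format chosen for the example and is not claimed to be the format AntConc expects; the names and the sample text are likewise made up.

```python
import re
from collections import Counter

# Hypothetical lemma list, one "form<TAB>lemma" pair per line.
LEMMA_LINES = """is\tbe
was\tbe
were\tbe
cats\tcat"""

def load_lemmas(lines):
    """Build a form -> lemma dictionary from tab-separated lines."""
    lemmas = {}
    for line in lines.splitlines():
        form, lemma = line.split("\t")
        lemmas[form] = lemma
    return lemmas

def lemma_counts(text, lemmas):
    """Count tokens, replacing each known form by its lemma."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(lemmas.get(tok, tok) for tok in tokens)

counts = lemma_counts("The cat was here. The cats were there. It is a cat.",
                      load_lemmas(LEMMA_LINES))
print(counts.most_common(3))
```

Forms missing from the list are counted as they appear, so the quality of the counts depends on the coverage of the lemma file.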

[Screenshot]

AntConc: Keyword Lists

You can also generate Keyword Lists, which show unusually (in)frequent words as compared to some reference corpus
- To do this, you have to load your reference corpus (or word list) under (Tool) Preferences → Keyword (List)

The next two slides show:
1. The keywords of Leaves of Grass, as compared to a book called Camping
   - unusually frequent, given the reference corpus
2. The negative keywords of Leaves of Grass, as compared to a book called Camping
   - unusually infrequent, given the reference corpus
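
One common way to score "unusually (in)frequent" is the log-likelihood statistic. The sketch below assumes that statistic and attaches a sign to separate keywords from negative keywords; which statistic a given tool actually uses depends on its settings, so this should be read as an illustration only. The toy word counts are invented and are not from either book.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Two-term log-likelihood for a word seen `a` times in a target corpus
    of `c` tokens and `b` times in a reference corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)   # expected count in the target
    e2 = d * (a + b) / (c + d)   # expected count in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keyness(target, reference):
    """Return {word: signed score}; positive = unusually frequent in target."""
    c, d = sum(target.values()), sum(reference.values())
    scores = {}
    for word in set(target) | set(reference):
        a, b = target[word], reference[word]
        sign = 1 if a / c >= b / d else -1
        scores[word] = sign * log_likelihood(a, b, c, d)
    return scores

# Invented toy counts standing in for a target text and a reference text.
target = Counter("grass grass grass leaves sing the the".split())
reference = Counter("tent tent fire the the the camp".split())
scores = keyness(target, reference)
print(sorted(scores, key=scores.get, reverse=True)[:1])  # strongest keyword
```

Words with large positive scores correspond to keywords, and words with large negative scores to negative keywords.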

[Two screenshots]

Word Lists & Frequency

There are other tools on the David Lee site to calculate word frequency
- There are also word lists separate from corpus data, should you need those
  - e.g., most frequent academic words

Calculating word frequencies on your own is easy with something like Perl
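
The same is true of Python. A minimal sketch, assuming lowercased text and a deliberately simple tokenizer (letters and apostrophes only); the function name and the sample sentence are chosen for this example.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = "It was the best of times, it was the worst of times."
freqs = word_frequencies(sample)
for word, count in freqs.most_common(3):
    print(f"{count:>3}  {word}")
```

For a file on disk, the text could be read first with `open(path, encoding="utf-8").read()` and passed to the same function.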


Annotating data

Corpus annotation is information about the corpus included in the corpus itself
- Meta-data helps recover the original context & allows for a wider range of questions to be explored
- Linguistic annotation:
  - helps train NLP tools
  - helps find & categorize examples
    - what is the plural form of fish?
    - are there subjectless sentences in German? (Yes, e.g., Mir ist kalt. (‘To me is cold.’))

Annotation makes it possible to find phenomena that would otherwise disappear in masses of data


Benefits of annotation

Corpus annotation adds much value to a corpus
- Extracting information is easier:
  - Can easily isolate “left” in its adjective uses
  - Can find information on a language you don’t speak
- Corpus is more reusable
  - Insights are more accessible to others
- Corpus is more multifunctional
- Annotation provides a clear record of analysis, open to future scrutiny
  - Decisions are more reproducible
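
The "left" example can be made concrete with a small Python sketch. The tagged sentence is toy data with Penn Treebank-style tags assigned by hand for illustration, and `find_uses` is a hypothetical helper, not part of any tool mentioned here.

```python
# Hand-tagged toy data: (word, POS tag) pairs, tags assigned by hand.
tagged = [
    ("She", "PRP"), ("left", "VBD"), ("the", "DT"), ("room", "NN"),
    ("with", "IN"), ("her", "PRP$"), ("left", "JJ"), ("hand", "NN"),
    ("raised", "VBN"),
]

def find_uses(tagged_tokens, word, tag):
    """Return the positions where `word` carries the POS tag `tag`."""
    return [i for i, (w, t) in enumerate(tagged_tokens)
            if w.lower() == word and t == tag]

adjective_uses = find_uses(tagged, "left", "JJ")
print(adjective_uses)
```

Without the tags, a plain string search for "left" would return both the verb and the adjective.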


Annotation tools

A lot of annotation tools are available, though many are geared towards people with computational skill
- See http://tiny.cc/corpora → Software, Tools, ...
  - Scroll to “Tools & Resources for Transcribing, Annotating or Analysing texts”
  - Many tools for working with corpora are XML-based; some are not
- Or check out the Linguistic Annotation Wiki: http://annotation.exmaralda.org/index.php/Linguistic_Annotation

We’ll focus on the UAM Corpus Tool, which is a little more user-friendly than many


UAM Corpus Tool

UAM Corpus Tool (http://www.wagsoft.com/CorpusTool/) is designed to be usable by non-computational linguists
- See write-up: http://www.aclweb.org/anthology-new/P/P08/P08-4004.pdf

The manual that comes with the download is very thorough & easy to use
- We will walk through using some of it

Note on annotation tools:
- They tend to work best by splitting your corpus into many smaller sub-corpora/small files
  - Especially if the tool also automatically processes part of the corpus
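
Splitting a long text into smaller pieces is easy to script. A minimal sketch, assuming that a fixed word count per piece is an acceptable way to split (chapters or paragraphs would often be a better unit); the function name and the `part_001`-style labels are made up for this example.

```python
def split_into_chunks(text, max_words=1000):
    """Split `text` into pieces of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = split_into_chunks("one two three four five six seven", max_words=3)
for number, chunk in enumerate(chunks, start=1):
    # A real script might write each chunk to its own file instead of printing.
    print(f"part_{number:03d}: {chunk}")
```

Note that this version discards the original line breaks, which may matter for poetry or transcripts.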


UAM Corpus Tool: Main Steps

1. Start a new project
2. Add (an) annotation layer(s)
   - You can use some pre-built annotation schemes or design your own
3. Add files
   - Incorporated files are ones you have already started annotating
4. Annotate


UAM Corpus Tool: Defining the Annotation Scheme

If you are creating a new scheme, you are given the opportunity to specify this when creating an annotation layer
- system = name of the type of annotation (cf. attribute)
- feature = alternatives for each system (cf. value)

Or you can re-use a scheme


Penn Treebank (PTB) Part-of-Speech (POS) Scheme (extract 1)

<SYSTEM>
  <NAME>NOUN-TYPE</NAME>
  <EC>noun</EC>
  <FEATURES>
    <FEATURE>
      <NAME>common</NAME>
      <STATE>active</STATE>
    </FEATURE>
    <FEATURE>
      <NAME>proper</NAME>
      <STATE>active</STATE>
    </FEATURE>
  </FEATURES>
</SYSTEM>
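
Because the scheme is plain XML, it can be inspected with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a copy of the extract above. Reading `EC` as the feature that must already be chosen before this system applies is an interpretation, not something stated on the slide, and `read_system` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Copy of the extract above, stored as a string for the example.
SCHEME = """
<SYSTEM>
  <NAME>NOUN-TYPE</NAME>
  <EC>noun</EC>
  <FEATURES>
    <FEATURE><NAME>common</NAME><STATE>active</STATE></FEATURE>
    <FEATURE><NAME>proper</NAME><STATE>active</STATE></FEATURE>
  </FEATURES>
</SYSTEM>
"""

def read_system(xml_text):
    """Return (system name, EC value, list of feature names)."""
    system = ET.fromstring(xml_text.strip())
    name = system.findtext("NAME")    # direct child <NAME> of <SYSTEM>
    entry = system.findtext("EC")     # assumed to be the entry condition
    features = [f.findtext("NAME") for f in system.iter("FEATURE")]
    return name, entry, features

print(read_system(SCHEME))
```

In the terms of the previous slide, the system is NOUN-TYPE and its features are the alternatives common and proper.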


PTB Scheme (extract 2)

<SYSTEM>
  <NAME>COMMON-TYPE</NAME>
  <EC>common</EC>
  <FEATURES>
    <FEATURE>
      <NAME>nn</NAME>
      <STATE>active</STATE>
      <GLOSS>0/7/03</GLOSS>
    </FEATURE>
    <FEATURE>
      <NAME>nns</NAME>
      <STATE>active</STATE>
    </FEATURE>
    ...
  </FEATURES>
</SYSTEM>


UAM Corpus Tool: Annotating the Corpus

- Annotate document
- Annotate segments (e.g., words)
  - Segment definitions are flexible
  - Note: I had freezing issues when I tried to have automatic segmentation for the annotation layer
- Automatic analysis with Stanford parser (English only)
  - There are various options to automatically annotate the data, which we won’t go over


POS Annotation


UAM Corpus Tool: Searching the Corpus

- Basic searching
- Searching across layers
- Fancier string searching (for English)
  - Wildcard token (*): ca* matches cat, caffeine, etc.
  - Classes: ca*@noun matches nouns starting with ca
    - see the appendix for features which can be searched
    - note: this can take some time
  - % matches inflections: e.g., be% matches be, is, ...

Note: to search for classes, do the search on a document-level annotation layer
- Segment-level annotation only searches across segments you have defined

Some of the work you can do in AntConc you can do here
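
The meaning of the `ca*` and `ca*@noun` queries can be imitated in Python to see what they select. This sketch only mimics the wildcard and class parts of the query syntax described above and says nothing about how UAM Corpus Tool implements them. The tagged word list and its coarse class labels are invented, and Python's `fnmatchcase` also gives `?` and `[...]` a special meaning, which the tool may not.

```python
from fnmatch import fnmatchcase

# Hand-tagged toy data; the coarse class labels are made up for illustration.
tagged = [("cat", "noun"), ("call", "verb"), ("caffeine", "noun"),
          ("dog", "noun"), ("carry", "verb")]

def search(tagged_tokens, query):
    """Match queries of the form 'pattern' or 'pattern@class'."""
    pattern, _, word_class = query.partition("@")
    return [word for word, cls in tagged_tokens
            if fnmatchcase(word, pattern)
            and (not word_class or cls == word_class)]

print(search(tagged, "ca*"))
print(search(tagged, "ca*@noun"))
```

The `%` operator for inflections is not imitated here, since it needs a lemma resource rather than string matching.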


Searching


UAM Corpus Tool: Corpus statistics


Automatically annotating

You might want to explore using some automatic tools to be able to refer to linguistic classes (e.g., nouns)
- A good collection of NLP tools are the Stanford ones: http://nlp.stanford.edu/software/index.shtml
- Some good slides connecting digital humanities with NLP: http://nlp.stanford.edu/~manning/courses/DigitalHumanities/DH2011-Manning.pdf
- There will be a tutorial on the Natural Language Toolkit (NLTK) on March 8: http://www.nltk.org
