Computer Corpora and What They Can Tell Us about How People Use Language

Introduction to Information Science (情報科学入門), 11 July 2011

“Corpus”?

• Latin “corpus” = body
• Latin “corpora” = bodies
• English “corpus” = collection of texts
• English “corpora” = collections of texts
• Japanese “コーパス” = 文書などの集大成 (a compilation of documents and similar material)

What is a computer corpus?

A corpus is a collection of texts stored on a computer.

Books, magazines, letters, Internet pages, e-mails, or parts of these.

Or transcriptions of speeches, phone calls, or radio programs.

Often stored as a single file in simple text format.

How big is a computer corpus?

• It can be very big or very small.

• The biggest (e.g. the British National Corpus and the Corpus of Contemporary American English) have many millions of words.

• A small corpus might have only a few hundred words.

Benefits of computer corpora

• In what way do you think computer corpora might be useful?

– Any ideas?

What are computer corpora for?

• We can use corpora to study language.

– What are the most common words?

– What words are used together?

– What words of a particular type are used together (e.g., under + NOUN)?

– If we compare two corpora (e.g. e-mail and textbooks), is a word more common in one?

– How do people use words in sentences?

Computer corpora and dictionaries

• All major English dictionaries are now based on computer corpora.
– How common is a word?
– How many different meanings does it have?
– What are some examples of its use?
– Is it used in a good or bad sense?
– What grammatical patterns is it used with?
– What other words is it used with?

Word frequencies

• What do you think are the most common words in English?

• Make a list of about five words.

The most common English words

(Oxford English Corpus)

1. The
2. Be
3. To
4. Of
5. And
6. A
7. In
8. That
9. Have
10. I
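A frequency list of this kind can be sketched in a few lines of Python. The sample text and the simple tokenizer below are assumptions made for illustration. A real study would read a full corpus file and would also group inflected forms (is, was, are) under one headword such as "be", which this sketch does not do.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word form occurs, ignoring case."""
    # Assumption: a word is a run of letters, optionally followed by an apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

# Hypothetical mini-corpus; a real corpus would be read from a text file.
sample = "The cat sat on the mat. The dog sat on the cat."
freqs = word_frequencies(sample)
top_three = freqs.most_common(3)
print(top_three)
```

Even in this tiny sample the most frequent word is "the", as in the list above.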

In various situations

Concordances

• One of the most common ways to study computer corpora is to use a concordance.

• A concordance finds all the instances of a word or phrase in a corpus.

• It presents a list of the instances, often with the search word in the middle of the screen.

Example of a concordance list

What does this tell us?

• In the words before forget, there are

– many examples of negative words:

• not, won’t, don’t, couldn’t, shouldn’t, never, nobody

– many contractions:

• won’t, don’t, you’ll, couldn’t, shouldn’t, you’d

– several examples of to

What does this tell us?

• In the words after forget, there are
– several examples of to
– several examples of –ing
– several examples of what and that
– several examples of the
– several examples of he, she, you, it, and we
• Notice also that forget usually comes in the middle of a sentence, not at the beginning or end.

Open a concordance on your PC

• Go to http://corpus.byu.edu/coca/.

• This site allows you to access the Corpus of Contemporary American English (COCA).

• The largest free corpus in the world:

– 425 million words, 5 types of text
  • Spoken
  • Fiction
  • Magazine
  • Newspaper
  • Academic

Display

• At the top left, you will see under Display:
– List: Shows a list of words in the right column
– Chart: Shows two charts in the right column
  • Types of text (spoken, fiction, magazine, etc.)
  • Time (1990-1994, 1995-1999, etc.)
– KWIC (Key Words in Context): Shows nouns, verbs, etc. around the search string
– Compare: Shows results for two words

Search String

• Under Search String, you will see:
– Word: Type a word (e.g. head).
– Collocates: Type a word used near head.
  • The two boxes next to Collocates show
    – Maximum number of words before head
    – Maximum number of words after head
– POS (Part Of Speech): Select a part of speech (e.g., noun, verb, etc.) used near head.
– Random: This chooses a random search string.
– Search: Click this to begin your search.
– Reset: Clear the left column.

Sections

• Show: Check this box to show charts for
– Type of text (Spoken, Magazine, etc.)
– Time
• 1: Choose a type of text for the search string
– Ignore (= all types)
– Spoken
– Magazine
– Newspaper
– Academic
• 2: If you are comparing two search strings, choose the type of text for the second string.

Search syntax

– To find two words:
  • To find “good luck”, type “good luck” in Word(s).
– To find the neighboring word:
  • To find what word comes after “dog”, type “dog *”.
  • To find what word comes before “dog”, type “* dog”.
– To find two words with 1–4 words between:
  • Word(s): dog
  • Collocate: bark
  • Finds: “dog bark”, “dog will bark”, “dog will often bark”, “dog will not always bark”, “dog will in no situation bark”.
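A collocate search of this kind can be sketched as a window scan over a list of tokens. This is only an illustration of the idea. The function name `find_collocations` and its `max_after` parameter are hypothetical, the sample sentence is invented, and the sketch does not claim to count the span exactly as COCA does.

```python
def find_collocations(tokens, node, collocate, max_after=4):
    """Return the token spans where `collocate` occurs within
    `max_after` tokens after `node`."""
    hits = []
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # The window is the next `max_after` tokens to the right of the node word.
        window = tokens[i + 1:i + 1 + max_after]
        if collocate in window:
            end = i + 1 + window.index(collocate)
            hits.append(" ".join(tokens[i:end + 1]))
    return hits

# Hypothetical sample, already lowercased and split on spaces.
sample = ("the dog will often bark but a dog can not bark quietly "
          "and this dog sleeps").split()
hits = find_collocations(sample, "dog", "bark")
print(hits)
```

The third "dog" is not reported because "bark" does not occur in its window.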


Query syntax (2)

– To find different forms of a word:
  • Word(s): [blow] away
  • Finds: “blow away”, “blows away”, “blew away”, “blowing away”, “blown away”
– To find all the words that begin the same way:
  • Word(s): comp*
  • Finds: “compare”, “compute”, “computer”, “compiler”, “comply”, etc.
– To find all of a set of words:
  • Word(s): cut|cuts|cutting
  • Finds: “cut”, “cuts”, “cutting”.
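To make the three query types concrete, here is a minimal sketch that translates a simplified version of this syntax into Python regular expressions. It is not how COCA works internally. The function `query_to_regex` and the hand-written `LEMMA_FORMS` table are assumptions; a real concordancer looks up inflected forms in a full lexicon.

```python
import re

# Hand-written lemma table: an assumption made for this sketch only.
LEMMA_FORMS = {"blow": ["blow", "blows", "blew", "blowing", "blown"]}

def query_to_regex(query):
    """Translate a simplified query (lemma, wildcard, alternatives) into a regex."""
    parts = []
    for term in query.split():
        if term.startswith("[") and term.endswith("]"):
            # [blow] -> any inflected form listed in the lemma table
            forms = LEMMA_FORMS[term[1:-1]]
            parts.append("(?:" + "|".join(forms) + ")")
        elif "|" in term:
            # cut|cuts|cutting -> any one of the listed words
            alternatives = [re.escape(t) for t in term.split("|")]
            parts.append("(?:" + "|".join(alternatives) + ")")
        else:
            # "*" stands for any run of word characters (possibly empty)
            parts.append(re.escape(term).replace(r"\*", r"\w*"))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

# Hypothetical sample sentence.
text = ("The wind blew away the tent while I compared two computer "
        "programs and cut the rope.")
lemma_hits = query_to_regex("[blow] away").findall(text)
wildcard_hits = query_to_regex("comp*").findall(text)
set_hits = query_to_regex("cut|cuts|cutting").findall(text)
print(lemma_hits, wildcard_hits, set_hits)
```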

Try the COCA concordance

• In the top right corner, type

– Your e-mail address.

– Your password.

• In the Word(s) box, type “played”.

• Click on “Search”

• In the top right column, click on “PLAYED”.

• What topics are most of the examples about?

COCA concordance for “played”

Findings for “played”

• Acted
– Played a role, played a key role, played in the movie
• Sports
– Played football, played 158 games, played his last game
• Other games
– Played cards
• Music
– The Paris orchestra played, bands played, pianist played
• Amused oneself (遊んだ)
– Played among easels

Word frequency

• At the top right, under TOT, you see “52589”.
– The corpus contains 52589 examples of played.
• Under Display, select CHART.
• Click the Search button.
• The right column shows the frequency of played in different types of text.
– In which type is it most common? Why?
• You can also see the frequency for 5-year periods.
– In which period was it most common?
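Raw counts are hard to compare between text types or periods of different sizes, so frequencies are often expressed per million words. The sketch below applies that idea to the whole corpus, using the two figures given in these slides (52,589 hits, 425 million words). The function name `per_million` is an assumption. The sizes of the individual sections are not given here, so no per-section rate is computed.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000

# Figures from the slides: 52,589 hits for "played" in a 425-million-word corpus.
rate = per_million(52589, 425_000_000)
print(round(rate, 1))  # 123.7
```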

Try a two-word search

• Click the Reset button.

• In the Word(s) box, type “* friend of *”.

• Click on “Search”.

• Notice the words before and after “friend of”. What did you find?

Collocations for “friend+of”

Findings for “friend of”

• Before
– “a”
– “good”
– “close”
– “old”
• After
– “mine”
– “the”
– “his”
– “hers”
– “ours”
– “theirs”
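The "* friend of *" search can be imitated on a small scale by counting the word on each side of the phrase. This is a sketch under stated assumptions: the function name `neighbors` and the sample sentences are invented, and the counts say nothing about COCA itself.

```python
import re
from collections import Counter

def neighbors(text, phrase):
    """Count the words immediately before and after each occurrence of `phrase`."""
    pattern = r"(\w+)\s+" + re.escape(phrase) + r"\s+(\w+)"
    before, after = Counter(), Counter()
    for left, right in re.findall(pattern, text.lower()):
        before[left] += 1
        after[right] += 1
    return before, after

# Hypothetical sample text.
sample = ("He is a good friend of mine. She is an old friend of the family. "
          "A close friend of mine called.")
before, after = neighbors(sample, "friend of")
print(before.most_common())
print(after.most_common())
```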

Two words with an optional gap

• Click the Reset button.

• In the Word(s) box, type “a”.

• Click on Collocates.

• In the Collocates box, type “teacher”.

• Click “Search”.

• In the top right column, click on “TEACHER”

• Notice the words between “a” and “teacher”. What did you find?


Concordance for “a . . . teacher”

Findings for “a . . . teacher”

• “a former teacher”
• “a retired teacher”
• “a 10-year teacher”
• “a young teacher”
• “a physical education teacher”
• “a full-time coach and teacher”
• “a social studies teacher”
• “a job as an English teacher”, etc.
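To pull out the words that fill the gap, one option is a regular expression with an optional group between the two words. The sketch below is illustrative only. The function name `gap_fillers`, the `max_gap` parameter, and the sample text are assumptions.

```python
import re

def gap_fillers(text, first, last, max_gap=4):
    """Return the word sequences found between `first` and `last`
    (from 0 up to `max_gap` words)."""
    pattern = (r"\b" + re.escape(first)
               + r"\s+((?:[\w-]+\s+){0," + str(max_gap) + r"}?)"
               + re.escape(last) + r"\b")
    # An empty string in the result means the two words were adjacent.
    return [gap.strip() for gap in re.findall(pattern, text, flags=re.IGNORECASE)]

# Hypothetical sample text.
sample = ("She is a former teacher. He is a physical education teacher. "
          "I met a teacher yesterday.")
gaps = gap_fillers(sample, "a", "teacher")
print(gaps)
```

Counting the returned strings would show which modifiers most often come between the two words.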

Word + Part of Speech (POS)

noun.ALL: all common nouns (名詞)
verb.ALL: all verbs (動詞)
adj.ALL: all adjectives (形容詞)
adv.ALL: all adverbs (副詞)
neg.ALL: all instances of “not”, “n’t”
art.ALL: all articles (“a”, “an”, “the”)
det.ALL: all determiners (“this”, “these”, etc.)
pron.ALL: all pronouns (代名詞)
poss.ALL: all possessive pronouns (“my”, “your”, etc.)
prep.ALL: all prepositions (前置詞)
conj.ALL: all conjunctions (接続詞)
noun.ALL+: all common and proper nouns (名詞)
noun.SG: singular noun (単数の名詞)
noun.PL: plural noun (複数の名詞)
noun.CMN: common noun (普通名詞)
noun.+PROP: proper nouns (固有名詞)
verb.BASE: base form of verb (“know”, “think”, etc.)
verb.INF: infinitive form of verb (“be”, “have”, etc.)
verb.MODAL: modal form of verb (“may”, “might”, etc.)
verb.3SG: 3rd person singular verb (“has”, “goes”, etc.)
verb.ED: past tense verb (“went”, “played”, etc.)
verb.ING: “ing” form of verb (“going”, “playing”, etc.)
etc.
PUNC: all punctuation marks (. , ; : ! ? - etc.)

You can also search for a word with a POS.

– E.g., made me + VERB (動詞)

Click on the POS button in the left column.

Search for a word + POS

• In the left column, click “Reset”

• In the Word(s) box, type “made me”.

• Click “POS”

• In the POS box, type VERB(ALL)

• Click “Search”.

• Notice the words after “me”. What did you find?

Concordance for “made me VERB”

Findings for “made me”

• All the words after “me” were bare infinitives.

• The most common verb was “want” (299).

• There were many “thinking” verbs, e.g., “realize”, “see”, “believe”, “think”, “understand”.

• There were also some “action” verbs, e.g., “do”, “look”, “take”, “get”.
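A word + POS search works on text in which every word carries a part-of-speech tag. The sketch below imitates “made me” + VERB on a tiny hand-tagged sample. The tags, the sample sentences, and the function name `verbs_after` are assumptions. A real corpus is tagged automatically with a much finer tag set.

```python
from collections import Counter

def verbs_after(tagged_tokens, phrase):
    """Count the verbs that directly follow `phrase` in a list of (word, tag) pairs."""
    words = [word.lower() for word, _ in tagged_tokens]
    n = len(phrase)
    counts = Counter()
    for i in range(len(words) - n):
        # Match the phrase, then keep the next word only if it is tagged as a verb.
        if words[i:i + n] == phrase and tagged_tokens[i + n][1] == "VERB":
            counts[words[i + n]] += 1
    return counts

# Hand-tagged toy data (an assumption, not real corpus output).
tagged = [
    ("It", "PRON"), ("made", "VERB"), ("me", "PRON"), ("want", "VERB"),
    ("more", "ADJ"), (".", "PUNC"),
    ("That", "PRON"), ("made", "VERB"), ("me", "PRON"), ("think", "VERB"),
    (".", "PUNC"),
    ("She", "PRON"), ("made", "VERB"), ("me", "PRON"), ("a", "ART"),
    ("cake", "NOUN"), (".", "PUNC"),
    ("It", "PRON"), ("made", "VERB"), ("me", "PRON"), ("want", "VERB"),
    ("coffee", "NOUN"), (".", "PUNC"),
]
counts = verbs_after(tagged, ["made", "me"])
print(counts.most_common())
```

The sentence “She made me a cake” is skipped because the word after “me” is tagged as an article, not a verb.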

Inflected forms

• Click “Reset”

• In the Word(s) box, type “I wish I [be]”.

• Click “Search”.

• Notice the word after “I wish I”. What did you find?

Concordance for “I wish I [be]”

Findings for “I wish I”

• “I wish I was” (202 cases)
• “I wish I were” (197 cases)
• Grammatically, “I wish I were” is correct.
• Native English speakers do not always use English “correctly”.
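The two counts can be turned into percentages to show how evenly usage is split. The short calculation below uses only the figures from this slide.

```python
# Counts reported on the slide for the search "I wish I [be]".
counts = {"I wish I was": 202, "I wish I were": 197}
total = sum(counts.values())
# Share of each form as a percentage of the two forms combined.
shares = {form: round(100 * n / total, 1) for form, n in counts.items()}
print(shares)  # {'I wish I was': 50.6, 'I wish I were': 49.4}
```

In this search the two forms were almost equally common.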

Pre-lecture quiz

What answers did you get?

• happy ______
• What a _______
• I haven’t a _______
• as good as ______
• ______ the winter
• I’ve _____ arrived
• Don’t be a ______
• a ______ breakfast
• He didn’t take any _______
• She ______ her head

Answers to the pre-lecture quiz

• happy [to, with, and, about, birthday]
• What a [great, lot, good, wonderful, difference]
• I haven’t a [clue, thing, single, choice]
• as good as [the, it, they, any, a, I, you]
• [in, during, for, of, through] the winter
• I’ve [just, now, finally, always, already] arrived
• Don’t be a [fool, stranger, hero, jerk, baby]
• a [big, good, hearty, late, quick] breakfast
• He didn’t take any [questions, of, shit, precautions]
• She [shook, shakes, turned, tilted] her head

Summary

• We can learn a lot about language from computer corpora.

• In particular, concordances can show us how people really use language in practice.

• Concordances are useful for students of English
– To check how vocabulary is used.
– To check grammatical constructions.

Some other online concordances

• Michigan Corpus of Academic Spoken English (MICASE)
– http://quod.lib.umich.edu/m/micase/
• Web Concordancer (English)
– http://www.edict.com.hk/concordance/WWWConcappE.htm
• Corpus Concordance English
– http://www.lextutor.ca/concordancers/concord_e.html

Post-lecture quiz

• Please complete the quiz paper I gave you today.

• Submit it to me by tomorrow evening.

• If you don’t submit it, you will not get any points for attending this lecture.

That’s it, folks!