Using Corpus Tools in Discourse Analysis (Discourse and Pragmatics, Week 12)
Post on 15-Dec-2015
What is a corpus?
A collection of a large number of texts of a particular type in digital format, which can be easily searched and manipulated with computer programs
What is corpus linguistics?
The analysis of collections of texts (corpora) with computer tools in order to detect grammatical, lexical or discourse-level patterns, often with the aim of comparing those patterns with those found in other collections of texts
Examples of corpus-assisted discourse analysis
Flowerdew (1997, 2002)
Analysis of the speeches of Gov. Chris Patten and CE Tung Chee Hwa
Common themes: free market economy, freedom of the individual, rule of law
Divergent themes: democracy, stability and harmony
Rey (2001)
Star Trek characters from 1966 to 1993
Female language has shifted from being more relational to more informational
Male language has shifted from being more informational to more relational
Advantages of using corpora
Easily detecting grammatical and lexical patterns in a large number of texts
Reducing researcher bias
Efficiently detecting differences among varieties, registers, genres, and Discourses
Corpus-based (deductive) vs. corpus-driven (inductive) analysis
Disadvantages of using corpora
Separation of discourse from its social context
Corpus data usually confined to text (cannot account for images, non-verbal behavior and other aspects of multimodal discourse)
Frequency does not equal importance (sometimes very important messages are implicit or ‘taken for granted’ rather than explicit)
‘People don’t say what they mean and people don’t mean what they say’
Words have multiple meanings, and word meanings change over time and according to the context in which they are used
Tools for corpus analysis
Online corpora and concordancers: Collins Bank of English, British National Corpus, Corpus of Contemporary American English, International Corpus of English
General vs. specialized corpora
Software tools: AntConc, ConcApp, WordSmith Tools
Preparing corpora
Collecting data (Internet? Scanning files?)
Txt files
Separate files for different texts
‘Cleaning’ files
‘Tagging’
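The preparation steps above (plain .txt files, one file per text, cleaning) could be sketched roughly as follows. This is only an illustration: the function names `clean` and `load_corpus` are made up for this example, and the cleaning rules (dropping markup-like tags and collapsing whitespace) are assumptions about what 'cleaning' might involve for web-collected data.

```python
import re
from pathlib import Path

def clean(text):
    """Strip markup-like tags and collapse runs of whitespace (assumed cleaning rules)."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove anything that looks like an HTML/XML tag
    text = re.sub(r"\s+", " ", text)       # collapse spaces, tabs and line breaks
    return text.strip()

def load_corpus(folder):
    """Read every .txt file in a folder into a {filename: cleaned text} dictionary."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.name] = clean(path.read_text(encoding="utf-8"))
    return corpus
```

Keeping one dictionary entry per file preserves the 'separate files for different texts' arrangement, so individual texts can still be compared later. If the corpus is tagged with angle-bracket annotation, the tag-stripping line would have to be skipped or adapted.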
Procedures in corpus analysis
Type token ratio
Dispersion plots
Frequency lists
Concordance data
Collocation calculations
Keyword calculations
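Two of the procedures listed above, concordance data and dispersion, can be illustrated with a small sketch. It assumes the text has already been split into a list of lowercase tokens; the function names and the default context width are hypothetical choices, not settings taken from AntConc or WordSmith Tools.

```python
def concordance(tokens, node, width=3):
    """Return key-word-in-context rows: (left context, node word, right context)."""
    rows = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

def dispersion(tokens, node):
    """Return the relative position (0 = start, approaching 1 = end) of each hit."""
    return [i / len(tokens) for i, token in enumerate(tokens) if token == node]

# Made-up sample line, used only to show the shape of the output.
sample = "i want your love and i want your revenge".split()
for left, node, right in concordance(sample, "want", width=2):
    print(f"{left:>10} [{node}] {right}")
```

A dispersion plot is essentially a picture of the positions returned by `dispersion`: hits clustered near one value suggest the word is confined to one part of the text.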
Type Token Ratio
Low indicates narrow range of subjects, lack of variety or frequent repetition
High indicates wide range of subjects, great variation, less frequent repetition
BNC Written = 45.53
BNC Spoken = 32.96
Baker’s Holiday Pamphlets = 40.03
100 Song Corpus = 9.07
Gaga Corpus = 11.4
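As a rough illustration of how a type token ratio is obtained, the sketch below divides the number of distinct word forms (types) by the number of running words (tokens) and expresses the result as a percentage. The slide does not say how its figures were computed, so the tokenising rule, the percentage scale and the chunk-averaged `standardised_ttr` variant are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def type_token_ratio(tokens):
    """Distinct word forms (types) as a percentage of running words (tokens)."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)

def standardised_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks, so texts of different length are comparable."""
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:                      # text shorter than one chunk: fall back to plain TTR
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

# Made-up sample lines, not taken from either song corpus.
repetitive = tokenize("oh oh oh baby baby oh oh oh baby baby")
varied = tokenize("the committee postponed its final decision until spring")
print(type_token_ratio(repetitive))   # 2 types / 10 tokens
print(type_token_ratio(varied))       # 8 types / 8 tokens
```

Plain TTR falls as a text gets longer, because common words keep recurring; averaging over fixed-size chunks is one common way to make corpora of different sizes comparable.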
Frequency
Function words (articles, prepositions, conjunctions, pronouns, etc.)
Useful in answering questions about style, register
Pronouns can be particularly important
Content words (nouns, verbs, adjectives, adverbs)
Useful in answering questions about topics/Discourses
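A frequency list split into function and content words might be built along the following lines. The `FUNCTION_WORDS` set here is a tiny hand-made list for illustration only; a real analysis would need a fuller stop list or part-of-speech tagging, and the sample line is invented.

```python
import re
from collections import Counter

# Tiny illustrative list; a real study would use a much fuller inventory.
FUNCTION_WORDS = {"i", "you", "me", "my", "your", "it", "the", "a", "an",
                  "and", "or", "but", "of", "in", "on", "to"}

def split_by_class(tokens):
    """Separate tokens into (function words, content words) using the list above."""
    function = [t for t in tokens if t in FUNCTION_WORDS]
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return function, content

def top_words(tokens, corpus_size, top_n=5):
    """Most frequent words with raw count and percentage of the whole corpus."""
    counts = Counter(tokens)
    return [(w, c, round(100.0 * c / corpus_size, 2)) for w, c in counts.most_common(top_n)]

tokens = re.findall(r"[a-z]+", "I love you and you love me and I want your love".lower())
function, content = split_by_class(tokens)
print(top_words(function, len(tokens)))
print(top_words(content, len(tokens)))
```

Percentages are calculated against the size of the whole corpus rather than against each word class, so that figures for function words and content words stay comparable.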
Top 5 function words
100 Song Corpus: I, you, the, and, it
Gaga Corpus: I, you, the, oh, me
I = 4.4%, me = 2.03%
I = 5.09%, me = 1.3%
Murphey 1992: The word count revealed that the total referents in first person (I, me, my, mine, etc.) amounted to 10% of the total words
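A count of first-person referents like the one Murphey reports could be approximated as in the sketch below. The `FIRST_PERSON` set is an assumption, since the source only lists "I, me, my, mine, etc.", and the tokeniser simply splits contractions such as "I'm" into "i" and "m".

```python
import re

# Illustrative list only; the "etc." in the source leaves the full set unspecified.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def first_person_share(text):
    """Return (first-person hits, total tokens, hits as a percentage of all tokens)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for t in tokens if t in FIRST_PERSON)
    share = 100.0 * hits / len(tokens) if tokens else 0.0
    return hits, len(tokens), share

# Made-up sample line.
hits, total, share = first_person_share("I want my baby and my baby wants me")
print(hits, total, round(share, 2))
```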
Top 5 content words
100 Song Corpus: like, no, can, baby, know (love: 0.42%)
Gaga Corpus: love (0.98%), baby, can, want, know
Collocation
‘Co-location’
The frequency with which words appear close to other words
‘You shall know a word by the company it keeps.’ (Firth 1957)
Span (xL, xR)
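The span notation (xL, xR) means counting words up to x positions to the left and x positions to the right of the node word. A minimal sketch of that counting, assuming a pre-tokenised text and raw co-occurrence counts only (concordancing tools typically also offer statistical association measures, which are not shown here):

```python
from collections import Counter

def collocates(tokens, node, left=1, right=1):
    """Count the words found within `left` tokens before and `right` tokens after the node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts

# Made-up sample line.
sample = "i want your love and i want your revenge".split()
print(collocates(sample, "i", left=1, right=1).most_common(5))     # span 1L, 1R
print(collocates(sample, "love", left=5, right=5).most_common(5))  # span 5L, 5R
```

A narrow span such as 1L, 1R mostly picks up grammatical neighbours, while a wider span such as 5L, 5R tends to bring in words related by topic.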
Top 5 collocates for ‘I’
100 Song Corpus: ’m, and, can, know, ’ll
Gaga Corpus: ’m, want, ’ll, don’t, can
Span: 1L, 1R
Top 5 collocates of ‘love’
100 Song Corpus: I, you, my, me, the
Gaga Corpus: I, fu, want, ’t, revenge
Span: 5L, 5R
Keywords
The frequency of words in a corpus in relation to another corpus
The statistical significance of a keyword's frequency in a given corpus, relative to a reference corpus
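The slide does not name the statistic behind its keyword calculations. One widely used option is log-likelihood, sketched below in a simplified two-term form that compares a word's observed frequency in each corpus with the frequency expected if both corpora used the word at the same rate. The threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) and the toy corpora are assumptions for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """a, b = word frequency in study / reference corpus; c, d = corpus sizes in tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total

def keywords(study_tokens, reference_tokens, threshold=3.84):
    """Words used relatively more often in the study corpus, sorted by keyness."""
    study, reference = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    rows = []
    for word, a in study.items():
        b = reference.get(word, 0)
        score = log_likelihood(a, b, c, d)
        if a / c > b / d and score >= threshold:
            rows.append((word, a, b, score))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Toy corpora, invented for the example.
study = ["love"] * 10 + ["the"] * 10
reference = ["love"] * 1 + ["the"] * 19
for word, a, b, score in keywords(study, reference):
    print(word, a, b, round(score, 2))
```

Because the score depends on corpus sizes as well as counts, a word can be frequent without being key, which is why "the" drops out of the toy result even though it is as frequent as "love" in the study corpus.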
Keywords: semantic domains
lover*, romance*, love*, loves
fame*, fancy*, ribbons*, glitter, fashion, vanity, rich, presents, famous
retro*, bang*, shake*, dirty*, grease*, bad*, teeth, monster, filthy
oh*, eh*