quantitative corpus linguistics and fieldwork data, coedl seminar, october 2015
TRANSCRIPT
Quantitative Corpus Linguistics and Fieldwork Data
Danielle Barth
ARC Centre of Excellence for the Dynamics of Language
Australian National University
October 30, 2015
Seminar Outline
What is a corpus and what are quantitative corpus techniques?
Why do we want to use quantitative corpus techniques for data exploration?
When can we start using quantitative corpus techniques for data exploration?
What are some quantitative corpus techniques for data exploration (i.e., data
mining)?
Case Studies with Matukar Panau data:
Careful and Casual Speaking Styles in Matukar Panau
Directional Construction Choice in Matukar Panau
Conclusion
When is data a corpus?
Structure and Size
Machine readable
Annotation and tagging (and clean-text)
One million words (Fang, 1993; Leech, 1991)
British National Corpus: 100,000,000 words
Corpus of Contemporary American English: 450,000,000 words
Corpus of Global Web-based English: 1,900,000,000 words
Quantitative Corpus Linguistics
Techniques and Theory
(Biber and Conrad, 2001; McEnery and Wilson, 2001): corpora reveal and
counteract the unreliability of our intuitions about language use
Frequency & Probability
Corpora show distribution of variation
Instead of asking what the possibilities are, we ask which possibilities are most
likely given the current data
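A minimal sketch of this shift from possibility to probability, in Python. The token list is invented for illustration and does not come from any corpus cited here; `variant_probabilities` is a hypothetical helper name.

```python
from collections import Counter

def variant_probabilities(tokens):
    """Return the relative frequency of each variant in a list of observed tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}

# Invented observations of a two-way alternation (not real corpus data).
observed = ["full"] * 30 + ["reduced"] * 70
probs = variant_probabilities(observed)

# Both variants are possible; the question is which one is more likely.
most_likely = max(probs, key=probs.get)
```

Both variants are attested here, so a possibility-only description would treat them alike; the relative frequencies are what distinguish them.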
Examples of research questions with
large corpora
De Cuypere (2015) examines ACC-DAT v. DAT-ACC word order in Old English
using the York-Toronto-Helsinki Parsed Corpus of Old English Prose (Taylor et
al., 2003) (1.5 million words), with 2,000 tokens
Shih and Zuraw (2014) examine adjective-noun order variation in 4.8 million
words extracted from the web and automatically tagged for POS, with
150,000 tokens
Hilpert (2015) examines the increase in token and type frequency of
noun-participle compounding in English using the Corpus of Historical American
English (Davies, 2010) (400 million words), with 31,700 tokens
Barth (2015) examines function word shortening and contraction in English
using the Buckeye Corpus (Pitt et al., 2007) (307,000 words, time-stamped to
the phoneme level)
Why do we care?
Question of where variation fits in our first principles
Is there a perfect example of a language (I-Language)?
If so, then questions of variation are secondary
Is variation inherent in language?
Is language a probabilistic system?
If so, then being aware of variation is part of understanding a language from day
one
If so, then seeing the patterns of variation helps us understand how a language
works and how speakers work
If so, then the likelihood of variants and their conditions is part of Language
When can we start?
As soon as we have enough data given our research question
As soon as we have an appropriate question given our data
Depends highly on data types and how we think about our data
Examples of techniques with smaller
“corpora”
Meyerhoff and Walker (2013) examine variation in Bequia (St Vincent and the
Grenadines) English existentials in 30 interviews
“The Bequia corpus is too small for us to undertake a quantitative analysis of the
semantic distinctions they make within this category.” (Meyerhoff and Walker,
2013: 425)
Meyerhoff and Walker (2012) examine verbal negation words, copulas
(including zero copulas) and past/non-past distribution of other grammatical
words using 62 interviews
Meyerhoff (2015) examines the distribution of a subject prefix that inflects for
(ir)realis in Nkep (Vanuatu)
Barth (2015) examines function word reduction and contraction in child and
caregiver English using the Redford Corpus (78,000 words)
Data Mining
Looking for patterns, associations, groups, sequences in data
Discovery/Exploration: atheoretical, kitchen-sink
Evaluation: search for theoretically motivated patterns
Artificial Neural Networks: Non-linear predictive models, machine learning
Genetic Algorithms: Survival, mutation, machine learning
Nearest Neighbour: classifies records/observations based on base-dataset
Rule induction: if-then rules extracted from the data
Decision trees/forests: classification of data in a tree-shaped structure
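As one illustration of the techniques listed above, a minimal nearest-neighbour sketch in Python. The base dataset, its feature coding and the function names are all invented for illustration; they are not the seminar's data or method.

```python
from collections import Counter

def hamming(a, b):
    """Count the feature positions where two observations differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(base, query, k=3):
    """Label `query` by majority vote among its k closest records in `base`.

    `base` is a list of (features, label) pairs; features are tuples of
    categorical values of equal length.
    """
    ranked = sorted(base, key=lambda rec: hamming(rec[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented base dataset: (gender, age group, uses Tok Pisin) -> variant.
base = [
    (("f", "older", "no"), "careful"),
    (("f", "younger", "no"), "careful"),
    (("f", "older", "yes"), "careful"),
    (("m", "younger", "yes"), "casual"),
    (("m", "older", "yes"), "casual"),
    (("m", "younger", "no"), "casual"),
]
prediction = knn_classify(base, ("m", "younger", "yes"), k=3)
```

The new observation is labelled by the records it most resembles in the base dataset, which is all "classification based on base-dataset" amounts to here.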
Matukar Panau
The Language and the Corpus
Spoken in two villages in Madang Province, Papua New Guinea:
Matukar (population 479)
Surumarang (population 219)
Also called Matukar, Matugar, Matukar Panau
Under 30 years old ~ Native, dominant Tok Pisin speakers: 541
30-50 ~ First language Panau, but dominant Tok Pisin speakers: 131
Over 50 ~ Native, proficient Panau speakers: 26
Matukar Panau
The People and the Corpus
71 texts (plus elicitation data) from 33 speakers
20,000 words
Texts primarily about family stories, traditional way of life, traditional
aspects of culture, some songs, some narrations of videos of typical village
activities like gardening and cooking, some traditional/mythical stories
Members of Saky (St. Augustine Katolik Yut) community organization:
Amos Barui, Alfred Barui, Michael Balias, Justin Willie, Zebedee Kreno
Kadagoi Rawad Forepiso
Matukar Panau
Clear and Casual Speech
Case Study for Variation in Lexical Items.
Several words (mostly adverbs) have a full form and one or more reduced forms
Full forms are associated with a more careful speech style and reduced forms are associated with a more casual speech style
Women more likely to use formal terms of address than men in Japanese (Ogawa and Shibamoto Smith, 1997)
Women tend to use more formal phonological variants than men in Norwich English (Trudgill, 1974)
Strong association between women and standard speech in Western societies (Nordberg, 1971; Romaine, 2003; Trudgill, 1983) and women lead change from above, while men lead change from below (Gal, 1979; Labov, 1990)
BUT: In non-Western societies, “women are further away from the prestige norms of society” (Romaine 2003:109, cf. Nordberg and Sundgren, 1998; Nichols, 1983; Romaine, 1982) due to lack of education and being less active outside the home (but cf. Keenan, 1974 for Malagasy)
Matukar Panau
Clear and Casual Speech
Field has shifted to talk of constructing identities (Eckert, 1989; Kendall, 2003; Ochs, 1992)
Zhang (2008): transnational business managers/yuppies in Beijing use a full tone more common in Hong Kong and Taiwan and avoid rhoticization associated with male Beijing urban professionals (women especially avoid the “Beijing smooth operator” persona)
Podesva (2007) finds that a particular (gay male) speaker uses longer /t/ releases when constructing a “diva” persona at a BBQ than at work; the same hyper-articulated /t/ release variable can index carefulness, politeness, education (Eckert, 2012)
Creaky voice has moved from being associated with masculinity (Dilley et al., 1996; Henton and Bladon, 1988; Pittam, 1987) to being associated with nonaggressive, educated, urban-oriented females in the US (Yuasa, 2010)
Women are more status-bound than men: they cannot accumulate wealth or power with impunity, so they rely on accumulating social capital, however that might be manifested in a particular community (Eckert, 1989). This creates a drive to assert membership (partially through linguistic variables) in a community
Matukar Panau
Lexical Access
Bilinguals may code switch because of lack of word or concept in one
language or because a word is more readily available in one language
(Grosjean, 1982)
Bilinguals may code switch to emphasize identity or group membership, to
draw attention to particular words, to engage a particular addressee, or to
comment on ongoing discourse (Appel and Muysken, 1987; Romaine, 1995).
Therefore, code switching by bilinguals can be used as a part of identity
construction or be used as a means of solving problems of lexical access.
Matukar Panau
Lexical Access
Tok Pisin words instead of Matukar Panau words (examples to follow)
Use of milok as place holder word
Milok ‘something’
Used nominally:
ha-di kalago milo-k bakbak katalu-n-ama-n-da di-ngale-mbawai
CL-3pl basket something-UNPOSS insect egg-3s-APS-3s-COM 3pl-take-DESID
‘They wanted to take their basket with whatsit, with rice’ Kadagoi Rawad Forepiso – Kudas Custom 2 20110802:2
Used verbally:
dop bor yukaup yenaba, manig milokap yenaba
dop bor yuka-p y-en-aba, manig milo-k-ap y-en-aba
CONJ:DEP pig hang.up-IRR.DEP 3sg-lay-FUT like.this something-UNPOSS-IRR.DEP 3sg-lay-FUT
‘And the pig will be hung up and whatsit stay like this’ Paul Sarr Tagog – Pig: Trap, Net, Dog 20130422:30
Milok & Tok Pisin use
Paul Sarr Tagog, Willy Patal Kumung, Barry Kuyau Barui
Watch for:
milokaba, milok, milokap
milok, organisation, clan, makim
car, taun, haussumuk, suga, rais, trausis, siot
But Tok Pisin can also be used
stylistically
OK ngahau mam main clan leader Bantibun maror
OK 1sg.POSS father PROX clan leader clan.name clan.leader
‘OK, my father was the clan leader, the Bantibun clan leader.’ – Tomas Taleu Kreno – Life Story 20100331: 5
Matukar as ples paiin
Matukar origin place woman
‘(I am a) ur-Matukar woman’ – Sel Pain Wadom – Life Story 20100405: 10
Careful v. Casual Words
in Matukar Panau
Careful          Casual                     Gloss
mainangan        mainan                     ‘this one PROX’
manaiyami        mainami                    ‘that’s all’
alohage          alo                        ‘after’
wagamami         wagami ~ gegemi            ‘before’
gaumomoni        gauni ~ gaumoni ~ gaumo    ‘now ~ today’
ebalo ~ ebala    ebla ~ eblo ~ ebo          COMP
ngahamam         hamam                      1pl.excl.POSS
Methodology
Recursive partitioning using the ctree function in R (R Core Team, 2013), from
the {party} package (Hothorn et al. 2006a, Hothorn et al. 2006b, Strobl et al.
2007, Strobl et al. 2008)
Advantages: can handle collinear variables and non-monotonic effects; ctree
provides p-values for significance testing
Disadvantages: no random effects, no interactions; in large trees, results can
be tricky to interpret
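The general idea of recursive partitioning can be sketched in Python as follows. This is a deliberately simplified stand-in, not the ctree algorithm: it assumes binary predictors and a binary outcome, uses a plain Pearson chi-square statistic with a fixed critical value instead of ctree's permutation-test framework and multiplicity adjustment, and runs on invented toy data. All names are hypothetical.

```python
from collections import Counter

CRITICAL_CHI2_DF1 = 3.841  # chi-square critical value for df = 1, alpha = 0.05

def chi2_2x2(rows, predictor, outcome):
    """Pearson chi-square statistic for a binary predictor against a binary outcome."""
    cells = Counter((r[predictor], r[outcome]) for r in rows)
    p_tot = Counter(r[predictor] for r in rows)
    o_tot = Counter(r[outcome] for r in rows)
    if len(p_tot) < 2 or len(o_tot) < 2:
        return 0.0  # no variation left to test at this node
    n = len(rows)
    stat = 0.0
    for p in p_tot:
        for o in o_tot:
            expected = p_tot[p] * o_tot[o] / n
            stat += (cells[(p, o)] - expected) ** 2 / expected
    return stat

def grow(rows, predictors, outcome):
    """Split on the predictor most strongly associated with the outcome, then recurse."""
    scores = {p: chi2_2x2(rows, p, outcome) for p in predictors}
    best = max(scores, key=scores.get) if scores else None
    if best is None or scores[best] < CRITICAL_CHI2_DF1:
        return {"leaf": Counter(r[outcome] for r in rows)}
    rest = [p for p in predictors if p != best]
    return {
        "split": best,
        "children": {
            level: grow([r for r in rows if r[best] == level], rest, outcome)
            for level in sorted({r[best] for r in rows})
        },
    }

# Invented toy data in which style depends on gender only.
rows = (
    [{"gender": "f", "tok_pisin": "yes", "style": "careful"}] * 10
    + [{"gender": "f", "tok_pisin": "no", "style": "careful"}] * 10
    + [{"gender": "m", "tok_pisin": "yes", "style": "casual"}] * 10
    + [{"gender": "m", "tok_pisin": "no", "style": "casual"}] * 10
)
tree = grow(rows, ["gender", "tok_pisin"], "style")
```

At each node the predictor with the strongest association is chosen, the data are split on it, and splitting stops once no remaining predictor passes the significance threshold.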
DV: Casual or Careful
IVs: Age (younger/older), Gender, Village, Clan,
Proportion of Tok Pisin in Text, Tok Pisin or none in Text,
Proportion of milok in Text, milok or none in Text,
Combined fluency measure
Length of Text, Text Quartile of word occurrence
Word Category, Word Length
Speaker, Word Meaning
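Two of the text-level predictors above could be derived along these lines. This is a hedged Python sketch: the (word, language) representation, the "tp"/"mp" tags, the placeholder words and the function names are assumptions made for illustration, not the coding scheme actually used.

```python
def tok_pisin_proportion(text):
    """Share of words in a text tagged as Tok Pisin.

    `text` is a list of (word, language) pairs; "tp" is an assumed tag
    for Tok Pisin and "mp" for Matukar Panau.
    """
    if not text:
        return 0.0
    return sum(1 for _, lang in text if lang == "tp") / len(text)

def text_quartile(position, text_length):
    """Quartile (1-4) of the text in which the word at 0-based `position` falls."""
    return min(4, position * 4 // text_length + 1)

# Placeholder text of eight words, two of them tagged as Tok Pisin.
text = [("w1", "mp"), ("w2", "mp"), ("w3", "tp"), ("w4", "mp"),
        ("w5", "mp"), ("w6", "tp"), ("w7", "mp"), ("w8", "mp")]

proportion = tok_pisin_proportion(text)   # continuous predictor
has_tok_pisin = proportion > 0            # binary "Tok Pisin or none" predictor
quartiles = [text_quartile(i, len(text)) for i in range(len(text))]
```

The same two functions would apply to milok by swapping the tag test for a lemma test.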
Matukar Panau
Casual v. Careful Style Summary
One speaker has strong, consistent behaviour and dominates the data
Men more likely to use casual words than women
People who use more Tok Pisin, more likely to use casual variants
Young people who don’t use Tok Pisin are even more careful than older people who don’t use Tok Pisin
Young people who use Tok Pisin are even more casual than older people who use Tok Pisin
Women who don’t use milok even more careful than women who do
Another instance of people, particularly women, using linguistic variation to establish a style within a community: keepers v. shakers
Matukar Panau
Directional Constructions
Case Study for Variation in Construction Use
Should expect less variation conditioned by social variables than for
phonological or lexical variation (Cheshire, 1998; Labov, 1993; Meyerhoff and
Walker, 2010)
Directional Construction Types
DIR: P-V1-DIR.SUFFIX-TAM
SVC-z: Px-V1 Px-V2-TAM
Personx-Verb Personx-Verb-TAM
SVC-m: Px-V1-NF Px-V2-TAM
Personx-Verb-Nonfinite.Suffix Personx-Verb-TAM
Nonfinite marking can be realis or irrealis,
depending on final TAM modality
Directional Construction Examples
(1) turan nga-fure-nge maine [ng-abi-sa-nge] [nga-bal-so-nggo]
another 1.sg.S-pull.out-R.DEP here 1.sg.S-hold-UH-R.DEP 1.sg.S-throw-VEN-R.IMPERF.INDEP
‘I pulled out another one, it came up, I am throwing it over here.’ (Garden V 3.6)
(2) a-das-e ab ilonlo [a-la a-pid-e]
2.P.S-ascend-R.DEP house inside [2.P.S-go 2.P.S-descend-R.PERF.INDEP]
‘You go inside the house!’ (Manus Sa 14.1)
(3) [di-karak-e diy-a-go]
[3.pl.S-crawl-R.DEP 3.pl.S-go-R.IMPERF.INDEP]
‘They are crawling away’ (Hermit C 4.1)
A small wrinkle – DIR_SVC-z
Most verbs take zero person marking for 2s and 3s subject inflections
In these cases DIR and SVC-z look the same
DIR: V1-DIR.SUFFIX-TAM
SVC-z: V1 V2-TAM
Except for DIR with V1 la or sa and suffix pid in which there
is a frozen i:
ngamlaipide, ngamsaipide ~ laipide, saipide
vs. ngamla ngampide ~ *la pide
*unattested
IVs
Phonotactic Constraints
Presence of subject or object NPs, adverbs or other contributions to clause
length
Directional meaning (up/down/go/come & geocentric/speaker-centric)
Main Verb/V1 meaning (directional/motion/non-motion/PCU [Givón, 1990])
Speaker-based variables: village, gender, clan
TAM categories (irrealis/realis & dependent/independent &
perfective/imperfective/irrealis), Subject
Priming (Poplack, 1980; Torres Cacoullos and Travis, 2013; Travis, 2007)
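One way a priming predictor might be coded, sketched in Python. The operationalization (did the same speaker's previous token use the same construction?) is an assumption made for illustration and may differ from the coding in the cited studies; the tokens and speaker labels are invented.

```python
def add_priming(tokens):
    """Annotate each token with the construction the same speaker used last.

    `tokens` is a list of dicts with "speaker" and "construction" keys, in
    corpus order. Each output dict gains "previous" (None for a speaker's
    first token) and "primed" (True when previous equals the current choice).
    """
    last_by_speaker = {}
    annotated = []
    for tok in tokens:
        prev = last_by_speaker.get(tok["speaker"])
        annotated.append({**tok, "previous": prev,
                          "primed": prev == tok["construction"]})
        last_by_speaker[tok["speaker"]] = tok["construction"]
    return annotated

# Invented token sequence from two hypothetical speakers.
tokens = [
    {"speaker": "A", "construction": "DIR"},
    {"speaker": "B", "construction": "SVC-m"},
    {"speaker": "A", "construction": "DIR"},
    {"speaker": "A", "construction": "SVC-z"},
]
coded = add_priming(tokens)
```

A window limit (for example, only counting a previous token within the same text) would be a natural refinement, but is left out of this sketch.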
DVs
4-way division of constructions
Removing ambiguous cases for 3-way division
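The 4-way and 3-way divisions can be sketched in Python as below. The four boolean codings and the decision order are one reading of the construction templates and the zero-marking wrinkle described earlier, not a confirmed coding protocol; the function and constant names are hypothetical.

```python
AMBIGUOUS = "DIR_SVC-z"

def code_construction(nonfinite_v1, overt_person, prefix_on_v2, frozen_i):
    """Assign a token to one of the four construction categories.

    All arguments are booleans from an assumed hand-coding scheme:
      nonfinite_v1  V1 carries a nonfinite suffix
      overt_person  the subject person prefix is overt (not zero-marked)
      prefix_on_v2  the person prefix is repeated on a second verb
      frozen_i      the frozen i is present (e.g. laipide, saipide)
    """
    if nonfinite_v1:
        return "SVC-m"
    if frozen_i:
        return "DIR"
    if not overt_person:
        return AMBIGUOUS  # zero person marking: DIR and SVC-z look the same
    return "SVC-z" if prefix_on_v2 else "DIR"

def three_way(codes):
    """Drop ambiguous tokens, leaving the 3-way division."""
    return [c for c in codes if c != AMBIGUOUS]

codes = [
    code_construction(False, True, False, True),    # ngamlaipide
    code_construction(False, True, True, False),    # ngamla ngampide
    code_construction(False, False, False, True),   # laipide
    code_construction(False, False, False, False),  # zero-marked, no frozen i
    code_construction(True, True, True, False),     # Px-V1-NF Px-V2-TAM
]
unambiguous = three_way(codes)
```

Removing the ambiguous category shrinks the dataset, which is the trade-off behind the two versions of the dependent variable.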
Matukar Panau
Directional Constructions Summary
Phonotactics:
Avoidance of [a] deletion and glide insertion
No avoidance of the frozen [i], may indicate that these combinations are lexically
stored
V2 Semantics:
‘down’ patterns against ‘up’, ‘go’ patterns against ‘come’
V1 Semantics:
Sa highly associated with DIR, ngale driving SVC-m
May not be enough data yet, especially with removing ambiguous cases
Need more non 3s and 2s examples
Summary
Quantitative Corpus Linguistics
Data Mining – Recursive tree-based partitioning
Examination of Style in Matukar Panau looking at lexical items
Social aspects important
Examination of synchronic status of a particular historical development in
Matukar Panau looking at constructions
Language-internal aspects (like phonotactics, semantics) important
Conclusion
Quantitative Corpus Techniques can be used for
Statistical modelling and evaluation of patterns
Pattern discovery for further fieldwork
More in-depth look at a particular pattern
Filling in holes (hole discovery)
Improving researcher instinct about own data
Basic needs like finding appropriate examples
References
Barth, D. (2015). To have and to be: function word reduction in child speech, child directed speech and inter-adult speech (Doctoral dissertation, University of Oregon).
Barth, D., & Anderson, G. D. (2015). Directional Constructions in Matukar Panau. Oceanic Linguistics, 54(1), 206-239.
Barth, D., & Kapatsinski, V. (Under Review). Evaluating logistic mixed-effects models of corpus-linguistic data. Proceedings from Leuven Statistics Days 2012.
Biber, D., & Conrad, S. (2001). Quantitative corpus-based research: Much more than bean counting. TESOL Quarterly, 331-336.
Cheshire, J. (1998). Taming the vernacular: Some repercussions for the study of syntactic variation and spoken grammar. Te Reo, 41, 6-27.
Davies, M. (2010). The Corpus of Historical American English (COHA): 400+ million words, 1810-2009. Available online at http://corpus.byu.edu/coha.
De Cuypere, L. (2015). A multivariate analysis of the Old English ACC+DAT double object alternation. Corpus Linguistics and Linguistic Theory, 11(2), 225–254.
Dilley, L., Shattuck-Hufnagel, S., & Ostendorf, M. (1996). Glottalization of word-initial vowels as a function of prosodic structure. Journal of Phonetics, 24(4), 423-444.
Eckert, P. (1989). The whole woman: Sex and gender differences in variation. Language Variation and Change, 1(3), 245-267.
Eckert, P. (2008). Variation and the indexical field. Journal of Sociolinguistics, 12, 453-476.
Eckert, P. (2012). Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41, 87-100.
Fang, A. C. (1993). Building a corpus of the English of computer science. English Language Corpora: Design, Analysis and Exploitation. Amsterdam and Atlanta, GA: Rodopi, 73-8.
Gal, S. (1979). Language Shift: Social Determinants of Linguistic Change in Bilingual Austria. New York: Academic Press.
Givón, T. (1990). Syntax: A Functional-typological Introduction, Vol. 2. Amsterdam: John Benjamins.
Henton, C. & Bladon, A. (1988). Creak as a Sociophonetic Marker. In L. M. Hyman and Ch. N. Li (Eds.), Language, Speech and Mind: Studies in Honour of Victoria A. Fromkin, 3–29. London: Routledge.
Hilpert, M. (2015). From hand-carved to computer-based: Noun-participle compounding and the upward strengthening hypothesis. Cognitive Linguistics, 26(1), 113-147.
Johnson, D. E. (2009). Rbrul package for R.
Keenan, E. (1974). Norm-makers, norm-breakers: Uses of speech by men and women in a Malagasy community. In R. Bauman and J. Sherzer (Eds.), Explorations in the Ethnography of Speaking, 125-43. Cambridge: Cambridge University Press.
Kendall, S. (2003). Creating gendered demeanours of authority at work and at home. In J. Holmes and M. Meyerhoff (Eds.), Handbook of language and gender, 600–23. Oxford: Blackwell.
Labov, W. (1973). The linguistic consequences of being a lame. Language in Society, 2(1), 81-115.
Labov, W. (1990). The intersection of sex and social class in the course of linguistic change. Language Variation and Change, 2, 205-54.
Labov, W. (1993). The unobservability of structure and its linguistic consequences. Paper presented at NWAV 22, University of Ottawa.
Leech, G. (1991). The state of the art in corpus linguistics. In K. Aijmer and B. Altenberg (Eds.), English Corpus Linguistics: Studies in honour of Jan Svartvik, 8-29. London: Longman.
McEnery, T., & Wilson, A. (2001). Corpus linguistics: An introduction. Edinburgh University Press.
Meyerhoff, M. (2015). Turning variation on its head: Analysing subject prefixes in Nkep (Vanuatu) for language documentation. Asia-Pacific Language Variation, 1(1), 78-108.
Meyerhoff, M., & Walker, J. A. (2012). Grammatical variation in Bequia (St Vincent and the Grenadines). Journal of Pidgin and Creole Languages, 27(2), 209-234.
Meyerhoff, M., & Walker, J. A. (2013). An existential problem: The sociolinguistic monitor and variation in existential constructions on Bequia (St. Vincent and the Grenadines). Language in Society, 42(4), 407-428.
Nichols, P. (1983). Linguistic options and choices for Black women in the rural south. In B. Thorne, C. Kramarae, and N. Henley (Eds.), Language, Gender and Society, 54-68. Rowley, MA: Newbury House.
Nordberg, B. & Sundgren, E. (1998). On Observing Language Change: A Swedish Case Study. FUMS Rapport nr. 190. Institutionen for nordiska sprak vid Uppsala Universitet.
Pennebaker, J. W. & Stone, L. D. (2003). Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology, 85(2), 291-301.
Poplack, S. (1980). The notion of the plural in Puerto Rican Spanish: Competing constraints on (s) deletion. Locating language in time and space, 1, 55-67.
Ochs, E. (1992). Indexing gender. In A. Duranti and C. Goodwin (Eds.), Rethinking Context: Language as an Interactive Phenomenon, 335-358. Cambridge: Cambridge University Press.
Ogawa, N. & Shibamoto Smith, J. (1997). The gendering of the gay male sex class in Japan: A case study based on "Rasen no Sobyo". In A. Livia and K. Hall (Eds.), Queerly Phrased: Language, Gender, and Sexuality, 402-415. New York and Oxford: Oxford University Press.
R Development Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Romaine, S. (1982). Socio-historical Linguistics: Its Status and Methodology. Cambridge: Cambridge University Press.
Romaine, S. (2003). Variation in Language and Gender. In J. Holmes and M. Meyerhoff (Eds.), Handbook of language and gender, 98-118. Oxford: Blackwell.
Shih, S. S., & Zuraw, K. (2014). Phonological factors in Tagalog adjective-noun word order variation. Paper presented at the Linguistics Society of America in Minneapolis, MN.
Taylor, A., Warner, A., Pintzuk, S., & Beths, F. (2003). The York-Toronto-Helsinki parsed corpus of Old English prose. University of York.
Torres Cacoullos, R. & Travis, C. E. (2013). Prosody, priming and particular constructions: The patterning of English first-person singular subject expression in conversation. Journal of Pragmatics, 63, 19-34.
Travis, C. E. (2007). Genre effects on subject expression in Spanish: Priming in narrative and conversation. Language Variation and Change, 19, 101-35.
Trudgill, P. (1974). The Social Differentiation of English in Norwich. Cambridge: Cambridge University Press.
Trudgill, P. (1983). On Dialect. Oxford: Blackwell.
Yuasa, I. P. (2010). Creaky voice: A new feminine voice quality for young urban-oriented upwardly mobile American women? American Speech, 85(3), 315-337.
Zhang, Q. (2008). Rhotacization and the “Beijing smooth operator”: The social meaning of a linguistic variable. Journal of Sociolinguistics, 12, 201-222.
Support
Support for this project comes from Living Tongues, Enduring Voices, National
Geographic and Firebird Foundation for Collection of Oral Literature
Primary consultant & teacher: Kadagoi Rawad Forepiso
Other speakers in the corpus:
Peter Barui, John Bogg, Thomas Taleu Kreno, Tukan Pain Francis, Bruce Kainor
Kaluk, Sel Pain Wadom, Wendy Pulu, Rebecca Wille, Barry Kuyau Barui, Cathy
Samun Wiliang, Griffin Mait, Pauline Griffin Mait, Mod Tabalib, Julie Nabog,
Boipain Sibon, Monica Malik Gim, Sara Duwagu, Gabriel Nali Gall, Willy Patal
Kumung, Kasan Barui, Jenny Kusum Gim, Sareg Erwin, Warangia Sangmei,
Kadagoi Lovinea Rapalau, Rosa Kibis Dikoi, Mulung Garog Tagog, Clara Kusos
Darr, Mod Wiliang, Kasarom Sapak Magop, Margaret Lem Kaluk, Michole
Sangmei Barui, Paul Sarr Tagog