Quantitative Corpus Linguistics and Fieldwork Data, CoEDL Seminar, October 2015

Quantitative Corpus Linguistics and Fieldwork Data Danielle Barth ARC Centre of Excellence for the Dynamics of Language Australian National University October 30, 2015

Quantitative Corpus Linguistics and Fieldwork Data

Danielle Barth

ARC Centre of Excellence for the Dynamics of Language

Australian National University

October 30, 2015

Seminar Outline

What is a corpus and what are quantitative corpus techniques?

Why do we want to use quantitative corpus techniques for data exploration?

When can we start using quantitative corpus techniques for data exploration?

What are some quantitative corpus techniques for data exploration (i.e., data mining)?

Case Studies with Matukar Panau data:

Careful and Casual Speaking Styles in Matukar Panau

Directional Construction Choice in Matukar Panau

Conclusion

When is data a corpus?

Structure and Size

Machine readable

Annotation and tagging (and clean-text)

One million words (Fang, 1993; Leech, 1991)

British National Corpus: 100,000,000 words

Corpus of Contemporary American English: 450,000,000 words

Corpus of Global Web-based English: 1,900,000,000 words

Quantitative Corpus Linguistics

Techniques and Theory

(Biber and Conrad, 2001; McEnery and Wilson, 2001): corpora show and counteract the unreliability of our intuitions about language use

Frequency & Probability

Corpora show distribution of variation

Instead of asking what are the possibilities, we ask what possibilities are most likely given the current data
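As a minimal sketch of this shift from possibilities to likelihoods, the probability of each variant in a given context can be estimated as its relative frequency in the data. The contexts, variant labels and counts below are invented for illustration and do not come from any corpus discussed here.

```python
from collections import Counter, defaultdict


def conditional_probabilities(observations):
    """Estimate P(variant | context) from (context, variant) pairs."""
    by_context = defaultdict(Counter)
    for context, variant in observations:
        by_context[context][variant] += 1
    return {
        context: {v: n / sum(counts.values()) for v, n in counts.items()}
        for context, counts in by_context.items()
    }


# Invented observations: which variant was used in which context.
observations = (
    [("formal", "full")] * 30 + [("formal", "reduced")] * 10
    + [("informal", "full")] * 5 + [("informal", "reduced")] * 15
)
probs = conditional_probabilities(observations)
for context, dist in probs.items():
    print(context, dist)
```

Both variants are possible in both contexts; what the counts add is how likely each one is in each context.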

Examples of research questions with large corpora

De Cuypere (2015) examines ACC-DAT v. DAT-ACC word order in Old English using the York-Toronto-Helsinki Parsed Corpus of Old English Prose (Taylor et al., 2003) (1.5 million words), with 2,000 tokens

Shih and Zuraw (2014) examine adjective-noun order variation in 4.8 million words extracted from the web and automatically tagged for POS, with 150,000 tokens

Hilpert (2015) examines the increase in token and type frequency of noun-participle compounding in English using the Corpus of Historical American English (Davies, 2010) (400 million words), with 31,700 tokens

Barth (2015) examines function word shortening and contraction in English using the Buckeye Corpus (Pitt et al., 2007) (307,000 words, time-stamped to the phoneme level)

Sources: Hilpert, 2015:22; Shih and Zuraw, 2014:4; Barth, 2015:13

Why do we care?

Question of where variation fits in our first principles

Is there a perfect example of a language (I-Language)?

If so, then questions of variation are secondary

Is variation inherent in language?

Is language a probabilistic system?

If so, then being aware of variation is part of understanding a language from day one

If so, then seeing the patterns of variation helps us understand how a language works and how speakers work

If so, then the likelihood of variants and their conditions is part of Language

When can we start?

As soon as we have enough data given our research question

As soon as we have an appropriate question given our data

Depends highly on data types and how we think about our data

Examples of techniques with smaller “corpora”

Meyerhoff and Walker (2013) examine variation in Bequia (St Vincent and the Grenadines) English existentials in 30 interviews

“The Bequia corpus is too small for us to undertake a quantitative analysis of the semantic distinctions they make within this category.” (Meyerhoff and Walker, 2013: 425)

Meyerhoff and Walker (2012) examine verbal negation words, copulas (including zero copulas) and past/non-past distribution of other grammatical words using 62 interviews

Meyerhoff (2015) examines distribution of a subject prefix that inflects for (ir)realis.

Barth (2015) examines function word reduction and contraction in child and caregiver English using the Redford Corpus (78,000 words)

Zipfian Word Distribution in small (80,000 word) Corpus
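A Zipfian distribution means that a word's frequency is roughly inversely proportional to its frequency rank, so log frequency falls approximately linearly with log rank, with a slope near -1. The sketch below shows one way this could be checked for any word list. The toy list is constructed for illustration and is not the 80,000-word corpus; the function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return (rank, frequency) pairs, most frequent word first."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_slope(pairs):
    """Least-squares slope of log(frequency) regressed on log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Toy word list built so that the word at rank r occurs about 120 / r times.
toy_words = []
for rank in range(1, 21):
    toy_words.extend([f"w{rank}"] * round(120 / rank))

pairs = rank_frequency(toy_words)
slope = zipf_slope(pairs)
print(pairs[:5], round(slope, 2))
```

With real fieldwork texts the same two functions can be run on the tokenised corpus; a slope far from -1 would be a reason to look more closely at tokenisation or at the mix of text types.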

Data Mining

Looking for patterns, associations, groups, sequences in data

Discovery/Exploration: atheoretical, kitchen-sink

Evaluation: search for theoretically motivated patterns

Artificial Neural Networks: Non-linear predictive models, machine learning

Genetic Algorithms: Survival, mutation, machine learning

Nearest Neighbour: classifies records/observations based on base-dataset

Rule induction: if-then rules extracted from the data

Decision trees/forests: classification of data in a tree-shaped structure
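To make one of these techniques concrete, here is a minimal nearest-neighbour sketch: a new observation receives the majority label of the k most similar records in a base dataset. The two numeric features, the labels and all values are invented, and the features are not rescaled, which a real analysis would normally do.

```python
from collections import Counter


def knn_classify(query, base, k=3):
    """Label `query` by majority vote among its k nearest records in `base`.

    `base` is a list of (features, label) pairs; distance is squared Euclidean.
    """
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(base, key=lambda record: distance(record[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented base dataset: two numeric features per record plus a class label.
base = [
    ((8, 0.00), "careful"), ((9, 0.05), "careful"), ((7, 0.02), "careful"),
    ((4, 0.30), "casual"), ((5, 0.25), "casual"), ((3, 0.40), "casual"),
]
print(knn_classify((4, 0.35), base))
print(knn_classify((8, 0.01), base))
```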

Matukar Panau

The Language and the Corpus

Spoken in two villages in Madang Province, Papua New Guinea:

Matukar (population 479)

Surumarang (population 219)

Also called Matukar, Matugar, Matukar Panau

Under 30 years old ~ Native, dominant Tok Pisin speakers: 541

30-50 ~ First language Panau, but dominant Tok Pisin speakers: 131

Over 50 ~ Native, proficient Panau speakers: 26

Matukar Panau

The People and the Corpus

71 texts (plus elicitation data) from 33 speakers

20,000 words

Texts primarily about family stories, traditional way of life, traditional aspects of culture, some songs, some narrations of videos of typical village activities like gardening and cooking, some traditional/mythical stories

Members of Saky (St. Augustine Katolik Yut) community organization:

Amos Barui, Alfred Barui, Michael Balias, Justin Willie, Zebedee Kreno

Kadagoi Rawad Forepiso

Matukar Panau

Clear and Casual Speech

Case Study for Variation in Lexical Items.

Several words (mostly adverbs) have a full form and a more reduced form(s)

Full forms are associated with a more careful speech style and reduced forms are associated with a more casual speech style

Women more likely to use formal terms of address than men in Japanese (Ogawa and Shibamoto Smith, 1997)

Women tend to use more formal phonological variants than men in Norwich English (Trudgill, 1974)

Strong association between women and standard speech in Western societies (Nordberg, 1971; Romaine, 2003; Trudgill, 1983) and women lead change from above, while men lead change from below (Gal, 1979; Labov, 1990)

BUT: In non-Western societies, “women are further away from the prestige norms of society” (Romaine 2003:109, cf. Nordberg and Sundgren, 1998; Nichols, 1983; Romaine, 1982) due to lack of education and being less active outside the home (but cf. Keenan, 1974 for Malagasy)

Matukar Panau

Clear and Casual Speech

Field has shifted to talk of constructing identities (Eckert, 1989; Kendall, 2003; Ochs, 1992)

Zhang (2008): transnational business managers/yuppies in Beijing use a full tone more common in Hong Kong and Taiwan and avoid rhoticization associated with male Beijing urban professionals (women especially avoid the “smooth Beijing operator” persona)

Podesva (2007) finds that a particular (gay male) speaker uses longer /t/ releases when constructing a “diva” persona at a BBQ than at work. The same hyper-articulated /t/ release variable can index carefulness, politeness, education (Eckert, 2012)

Creaky voice has moved from being associated with masculinity (Dilley et al., 1996; Henton and Bladon, 1988; Pittam, 1987) to being associated with nonaggressive, educated, urban-oriented females in the US (Yuasa, 2010)

Women are more status-bound than men and cannot accumulate wealth or power with impunity, so they rely on accumulating social capital, however that might be manifested in a particular community (Eckert, 1989); hence a drive to assert membership (partially through linguistic variables) in a community

Matukar Panau

Lexical Access

Bilinguals may code switch because of lack of word or concept in one language or because a word is more readily available in one language (Grosjean, 1982)

Bilinguals may code switch to emphasize identity or group membership, to draw attention to particular words, to engage a particular addressee, or to comment on ongoing discourse (Appel and Muysken, 1987; Romaine, 1995).

Therefore, code switching by bilinguals can be used as a part of identity construction or be used as a means of solving problems of lexical access.

Matukar Panau

Lexical Access

Tok Pisin words instead of Matukar Panau words (examples to follow)

Use of milok as place holder word

Milok ‘something’

Used nominally:

ha-di kalago milo-k bakbak katalu-n-ama-n-da di-ngale-mbawai
CL-3pl basket something-UNPOSS insect egg-3s-APS-3s-COM 3pl-take-DESID
‘They wanted to take their basket with whatsit, with rice’
Kadagoi Rawad Forepiso – Kudas Custom 2 20110802:2

Used verbally: dop bor yukaup yenaba, manig milokap yenaba

dop bor yuka-p y-en-aba, manig milo-k-ap y-en-aba
CONJ:DEP pig hang.up-IRR.DEP 3sg-lay-FUT like.this something-UNPOSS-IRR.DEP 3sg-lay-FUT
‘And the pig will be hung up and whatsit stay like this’
Paul Sarr Tagog – Pig: Trap, Net, Dog 20130422:30

Milok & Tok Pisin use

Paul Sarr Tagog, Willy Patal Kumung, Barry Kuyau Barui

Watch for:

milokaba, milok, milokap

milok, organisation, clan, makim

car, taun, haussumuk, suga, rais, trausis, siot

But Tok Pisin can also be used stylistically

OK ngahau mam main clan leader Bantibun maror

OK 1sg.POSS father PROX clan leader clan.name clan.leader

‘OK, my father was the clan leader, the Bantibun clan leader.’ – Tomas Taleu Kreno – Life Story 20100331: 5

Matukar as ples paiin

Matukar origin place woman

‘(I am a) ur-Matukar woman’ – Sel Pain Wadom – Life Story 20100405: 10

Careful v. Casual Words in Matukar Panau

Careful        Casual                     Gloss
mainangan      mainan                     ‘this one PROX’
manaiyami      mainami                    ‘that’s all’
alohage        alo                        ‘after’
wagamami       wagami ~ gegemi            ‘before’
gaumomoni      gauni ~ gaumoni ~ gaumo    ‘now ~ today’
ebalo ~ ebala  ebla ~ eblo ~ ebo          COMP
ngahamam       hamam                      1pl.excl.POSS

Intercepts for Mixed-Effects Model of Careful v. Casual word use in Panau

Methodology

Recursive partitioning using ctree function in R (R Core Team, 2013) as part of {party} package (Hothorn et al. 2006a, Hothorn et al. 2006b, Strobl et al. 2007, Strobl et al. 2008)

Advantages: can handle collinear variables and non-monotonic effects; ctree provides p-values for significance testing

Disadvantages: no random effects, no interactions, in large trees results can be tricky to interpret

DV: Casual or Careful

IVs: Age (younger/older), Gender, Village, Clan,

Proportion of Tok Pisin in Text, Tok Pisin or none in Text,

Proportion of milok in Text, milok or none in Text,

Combined fluency measure

Length of Text, Text Quartile of word occurrence

Word Category, Word Length

Speaker, Word Meaning
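The logic of recursive partitioning with significance testing can be sketched as follows: at each node, test every predictor against the outcome, split on the predictor with the smallest adjusted p-value if it falls below a threshold, and repeat within each subset. This is an illustration in Python, not the ctree implementation. ctree relies on a permutation-test framework, whereas this sketch substitutes a Pearson chi-square test on binary predictors with a Bonferroni adjustment. The tokens, field names and function names are invented.

```python
import math
from collections import Counter


def chi_square_p(rows, predictor, outcome):
    """Pearson chi-square test (df = 1) of a binary predictor against a binary outcome.

    Returns (statistic, p-value), or (0.0, 1.0) when either variable does not
    have exactly two levels in these rows, so that it is never selected.
    """
    p_levels = sorted({r[predictor] for r in rows})
    o_levels = sorted({r[outcome] for r in rows})
    if len(p_levels) != 2 or len(o_levels) != 2:
        return 0.0, 1.0
    n = len(rows)
    stat = 0.0
    for p in p_levels:
        row_total = sum(1 for r in rows if r[predictor] == p)
        for o in o_levels:
            col_total = sum(1 for r in rows if r[outcome] == o)
            observed = sum(1 for r in rows if r[predictor] == p and r[outcome] == o)
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    # For one degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def best_split(rows, predictors, outcome, alpha=0.05):
    """Return (predictor, adjusted p) for the strongest association, or (None, p)."""
    p_values = {p: chi_square_p(rows, p, outcome)[1] for p in predictors}
    best = min(p_values, key=p_values.get)
    adjusted = min(1.0, p_values[best] * len(predictors))  # Bonferroni adjustment
    return (best, adjusted) if adjusted < alpha else (None, adjusted)


def grow(rows, predictors, outcome, alpha=0.05):
    """Recursively partition the rows until no predictor is significant."""
    best, p = best_split(rows, predictors, outcome, alpha)
    if best is None:
        return {"leaf": dict(Counter(r[outcome] for r in rows))}
    children = {}
    for level in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == level]
        children[level] = grow(subset, predictors, outcome, alpha)
    return {"split": best, "p": p, "children": children}


def make(gender, tok_pisin, style, n):
    """Build n identical invented tokens."""
    return [{"gender": gender, "tok_pisin": tok_pisin, "style": style}] * n


# Invented token counts, for illustration only.
tokens = (
    make("female", "none", "careful", 18) + make("female", "none", "casual", 2)
    + make("male", "none", "careful", 16) + make("male", "none", "casual", 4)
    + make("female", "some", "careful", 5) + make("female", "some", "casual", 15)
    + make("male", "some", "careful", 3) + make("male", "some", "casual", 17)
)
tree = grow(tokens, ["gender", "tok_pisin"], "style")
print(tree)
```

In this toy data the tree splits once, on presence of Tok Pisin, and then stops because gender is not significant within either subset. Multi-level or numeric predictors such as Clan or Length of Text would need a different test than the 2x2 chi-square used here.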

Matukar Panau

Casual v. Careful Style Summary

One speaker has strong, consistent behaviour and dominates the data

Men more likely to use casual words than women

People who use more Tok Pisin, more likely to use casual variants

Young people who don’t use Tok Pisin are even more careful than older people who don’t use Tok Pisin

Young people who use Tok Pisin are even more casual than older people who use Tok Pisin

Women who don’t use milok even more careful than women who do

Another instance of people, particularly women, using linguistic variation to establish a style within a community: keepers v. shakers

Matukar Panau

Directional Constructions

Case Study for Variation in Construction Use

Should expect less variation conditioned by social variables than for phonological or lexical variation (Cheshire, 1998; Labov, 1993; Meyerhoff and Walker, 2010)

Directional Construction Types

DIR: P-V1-DIR.SUFFIX-TAM

SVC-z: P_x-V1 P_x-V2-TAM (Person_x-Verb Person_x-Verb-TAM)

SVC-m: P_x-V1-NF P_x-V2-TAM (Person_x-Verb-Nonfinite.Suffix Person_x-Verb-TAM)

Nonfinite marking can be realis or irrealis, depending on final TAM modality

Directional Construction Examples

(1) turan nga-fure-nge maine [ng-abi-sa-nge] [nga-bal-so-nggo]
    another 1.sg.S-pull.out-R.DEP here 1.sg.S-hold-UH-R.DEP 1.sg.S-throw-VEN-R.IMPERF.INDEP
    ‘I pulled out another one, it came up, I am throwing it over here.’ (Garden V 3.6)

(2) a-das-e ab ilonlo [a-la a-pid-e]
    2.P.S-ascend-R.DEP house inside [2.P.S-go 2.P.S-descend-R.PERF.INDEP]
    ‘You go inside the house!’ (Manus Sa 14.1)

(3) [di-karak-e diy-a-go]
    [3.pl.S-crawl-R.DEP 3.pl.S-go-R.IMPERF.INDEP]
    ‘They are crawling away’ (Hermit C 4.1)

A small wrinkle – DIR_SVC-z

Most verbs take zero person marking for 2s and 3s subject inflections

In these cases DIR and SVC-z look the same

DIR: V1-DIR.SUFFIX-TAM

SVC-z: V1 V2-TAM

Except for DIR with V1 la or sa and suffix pid, in which there is a frozen i:

ngamlaipide, ngamsaipide ~ laipide, saipide

vs. ngamla ngampide ~ *la pide

(* = unattested)

Distribution of Construction Types

IVs

Phonotactic Constraints

Presence of subject or object NPs, adverbs or other contributions to clause length

Directional meaning (up/down/go/come & geocentric/speaker-centric)

Main Verb/V1 meaning (directional/motion/non-motion/PCU [Givón, 1990])

Speaker-based variables: village, gender, clan

TAM categories (irrealis/realis & dependent/independent & perfective/imperfective/irrealis), Subject

Priming (Poplack, 1980; Torres Cacoullos and Travis, 2013; Travis, 2007)

DVs

4-way division of constructions

Removing ambiguous cases for 3-way division
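One way the priming predictor and the reduced 3-way outcome might be coded is sketched below. The field names, the use of DIR_SVC-z as the label for ambiguous tokens, and the definition of the prime as the previous directional construction in the same text are all assumptions made for illustration. The token list is invented.

```python
def add_priming(tokens):
    """Return copies of the tokens with a 'prime' field added.

    The prime is the construction type of the preceding token in the same
    text, or None for the first token of a text. Tokens are assumed to be
    listed in order of occurrence.
    """
    last_in_text = {}
    coded = []
    for tok in tokens:
        coded.append({**tok, "prime": last_in_text.get(tok["text"])})
        last_in_text[tok["text"]] = tok["construction"]
    return coded


def drop_ambiguous(tokens, ambiguous="DIR_SVC-z"):
    """Keep only tokens whose construction type is unambiguous (3-way outcome)."""
    return [tok for tok in tokens if tok["construction"] != ambiguous]


# Invented tokens, in order of occurrence within each text.
tokens = [
    {"text": "A", "construction": "DIR"},
    {"text": "A", "construction": "DIR_SVC-z"},
    {"text": "B", "construction": "SVC-m"},
    {"text": "A", "construction": "SVC-z"},
    {"text": "B", "construction": "DIR"},
]
coded = add_priming(tokens)
three_way = drop_ambiguous(coded)
for tok in three_way:
    print(tok["text"], tok["construction"], tok["prime"])
```

Priming is coded before the ambiguous tokens are dropped, so an ambiguous token can still serve as the prime of a later token. Coding it after filtering is an equally defensible choice and only requires swapping the two calls.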

Matukar Panau

Directional Constructions Summary

Phonotactics:

Avoidance of [a] deletion and glide insertion

No avoidance of the frozen [i], may indicate that these combinations are lexically stored

V2 Semantics:

‘down’ patterns against ‘up’, ‘go’ patterns against ‘come’

V1 Semantics:

Sa highly associated with DIR, ngale driving SVC-m

May not be enough data yet, especially with removing ambiguous cases

Need more non 3s and 2s examples

Summary

Quantitative Corpus Linguistics

Data Mining – Recursive tree-based partitioning

Examination of Style in Matukar Panau looking at lexical items

Social aspects important

Examination of synchronic status of a particular historical development in Matukar Panau looking at constructions

Language-internal aspects (like phonotactics, semantics) important

Conclusion

Quantitative Corpus Techniques can be used for

Statistical modelling and evaluation of patterns

Pattern discovery for further fieldwork

More in-depth look at a particular pattern

Filling in holes (hole discovery)

Improving researcher instinct about own data

Basic needs like finding appropriate examples

References

Barth, D. (2015). To have and to be: function word reduction in child speech, child directed speech and inter-adult speech (Doctoral dissertation, University of Oregon).

Barth, D., & Anderson, G. D. (2015). Directional Constructions in Matukar Panau. Oceanic Linguistics, 54(1), 206-239.

Barth, D. & Kapatsinski, V. (Under Review). Evaluating logistic mixed-effects models of corpus-linguistic data. Proceedings from Leuven Statistics Days 2012.

Biber, D., & Conrad, S. (2001). Quantitative corpus-based research: Much more than bean counting. TESOL Quarterly, 331-336.

Cheshire, J. (1998). Taming the vernacular: Some repercussions for the study of syntactic variation and spoken grammar. Te Reo, 41, 6-27.

Davies, M. (2010). The Corpus of Historical American English (COHA): 400+ million words, 1810-2009. Available online at http://corpus.byu.edu/coha.

De Cuypere, L. (2015). A multivariate analysis of the Old English ACC+DAT double object alternation. Corpus Linguistics and Linguistic Theory, 11(2), 225-254.

Dilley, L., Shattuck-Hufnagel, S., & Ostendorf, M. (1996). Glottalization of word-initial vowels as a function of prosodic structure. Journal of Phonetics, 24(4), 423-444.

Eckert, P. (1989). The whole woman: Sex and gender differences in variation. Language Variation and Change, 1(3), 245-267.

Eckert, P. (2008). Variation and the indexical field. Journal of Sociolinguistics, 12, 453-476.

Eckert, P. (2012). Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41, 87-100.

Fang, A. C. (1993). Building a corpus of the English of computer science. English Language Corpora: Design, Analysis and Exploitation. Amsterdam and Atlanta, GA: Rodopi, 73-8.

Gal, S. (1979). Language Shift: Social Determinants of Linguistic Change in Bilingual Austria. New York: Academic Press.

Givón, T. (1990). Syntax: A Functional-typological Introduction, Vol. 2. Amsterdam: John Benjamins.

Henton, C. & Bladon, A. (1988). Creak as a Sociophonetic Marker. In L. M. Hyman and Ch. N. Li (Eds.), Language, Speech and Mind: Studies in Honour of Victoria A. Fromkin, 3-29. London: Routledge.

Hilpert, M. (2015). From hand-carved to computer-based: Noun-participle compounding and the upward strengthening hypothesis. Cognitive Linguistics, 26(1), 113-147.

Johnson, D. E. (2009). Rbrul package for R.

Keenan, E. (1974). Norm-makers, norm-breakers: Uses of speech by men and women in a Malagasy community. In R. Bauman and J. Sherzer (Eds.), Explorations in the Ethnography of Speaking, 125-43. Cambridge: Cambridge University Press.

Kendall, S. (2003). Creating gendered demeanours of authority at work and at home. In J. Holmes and M. Meyerhoff (Eds.), Handbook of language and gender, 600-23. Oxford: Blackwell.

Labov, W. (1973). The linguistic consequences of being a lame. Language in Society, 2(1), 81-115.

Labov, W. (1990). The intersection of sex and social class in the course of linguistic change. Language Variation and Change, 2, 205-54.

Labov, W. (1993). The unobservability of structure and its linguistic consequences. Paper presented at NWAV 22, University of Ottawa.

Leech, G. (1991). The state of the art in corpus linguistics. In K. Aijmer and B. Altenberg (Eds.), English Corpus Linguistics: Studies in honour of Jan Svartvik, 8-29. London: Longman.

McEnery, T., & Wilson, A. (2001). Corpus linguistics: An introduction. Edinburgh University Press.

Meyerhoff, M. (2015). Turning variation on its head: Analysing subject prefixes in Nkep (Vanuatu) for language documentation. Asia-Pacific Language Variation, 1(1), 78-108.

Meyerhoff, M., & Walker, J. A. (2012). Grammatical variation in Bequia (St Vincent and the Grenadines). Journal of Pidgin and Creole Languages, 27(2), 209-234.

Meyerhoff, M., & Walker, J. A. (2013). An existential problem: The sociolinguistic monitor and variation in existential constructions on Bequia (St. Vincent and the Grenadines). Language in Society, 42(4), 407-428.

Nichols, P. (1983). Linguistic options and choices for Black women in the rural south. In B. Thorne, C. Kramarae, and N. Henley (Eds.), Language, Gender and Society, 54-68. Rowley, MA: Newbury House.

Nordberg, B. & Sundgren, E. (1998). On Observing Language Change: A Swedish Case Study. FUMS Rapport nr. 190. Institutionen for nordiska sprak vid Uppsala Universitet.

Pennebaker, J. W. & Stone, L. D. (2003). Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology, 85(2), 291-301.

Ochs, E. (1992). Indexing gender. In A. Duranti and C. Goodwin (Eds.), Rethinking Context: Language as an Interactive Phenomenon, 335-358. Cambridge: Cambridge University Press.

Ogawa, N. & Shibamoto Smith, J. (1997). The gendering of the gay male sex class in Japan: A case study based on "Rasen no Sobyo". In A. Livia and K. Hall (Eds.), Queerly Phrased: Language, Gender, and Sexuality, 402-415. New York and Oxford: Oxford University Press.

Poplack, S. (1980). The notion of the plural in Puerto Rican Spanish: Competing constraints on (s) deletion. Locating language in time and space, 1, 55-67.

R Development Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

Romaine, S. (1982). Socio-historical Linguistics: Its Status and Methodology. Cambridge: Cambridge University Press.

Romaine, S. (2003). Variation in Language and Gender. In J. Holmes and M. Meyerhoff (Eds.), Handbook of language and gender, 98-118. Oxford: Blackwell.

Shih, S. S., & Zuraw, K. (2014). Phonological factors in Tagalog adjective-noun word order variation. Paper presented at the Linguistics Society of America in Minneapolis, MN.

Taylor, A., Warner, A., Pintzuk, S., & Beths, F. (2003). The York-Toronto-Helsinki parsed corpus of Old English prose. University of York.

Torres Cacoullos, R. & Travis, C. E. (2013). Prosody, priming and particular constructions: The patterning of English first-person singular subject expression in conversation. Journal of Pragmatics, 63, 19-34.

Travis, C. E. (2007). Genre effects on subject expression in Spanish: Priming in narrative and conversation. Language Variation and Change, 19, 101-35.

Trudgill, P. (1974). The Social Differentiation of English in Norwich. Cambridge: Cambridge University Press.

Trudgill, P. (1983). On Dialect. Oxford: Blackwell.

Yuasa, I. P. (2010). Creaky voice: A new feminine voice quality for young urban-oriented upwardly mobile American women?. American Speech, 85(3), 315-337.

Zhang, Q. (2008). Rhotacization and the “Beijing smooth operator”: The social meaning of a linguistic variable. Journal of Sociolinguistics, 12, 201-222.

Support

Support for this project comes from Living Tongues, Enduring Voices, National Geographic and Firebird Foundation for Collection of Oral Literature

Primary consultant & teacher: Kadagoi Rawad Forepiso

Other speakers in the corpus:

Peter Barui, John Bogg, Thomas Taleu Kreno, Tukan Pain Francis, Bruce Kainor Kaluk, Sel Pain Wadom, Wendy Pulu, Rebecca Wille, Barry Kuyau Barui, Cathy Samun Wiliang, Griffin Mait, Pauline Griffin Mait, Mod Tabalib, Julie Nabog, Boipain Sibon, Monica Malik Gim, Sara Duwagu, Gabriel Nali Gall, Willy Patal Kumung, Kasan Barui, Jenny Kusum Gim, Sareg Erwin, Warangia Sangmei, Kadagoi Lovinea Rapalau, Rosa Kibis Dikoi, Mulung Garog Tagog, Clara Kusos Darr, Mod Wiliang, Kasarom Sapak Magop, Margaret Lem Kaluk, Michole Sangmei Barui, Paul Sarr Tagog