LIN 3098 – Corpus Linguistics Albert Gatt

Post on 17-Dec-2015

In this lecture

Corpora for the study of genre/register variation
  revisit the concept of representativeness and balance
  external vs. internal criteria: Biber (1993)
  introduce the multi-dimensional approach to register/genre variation (Biber 1988)

Part 1

The concept of register/genre

A preliminary example

Compare the following:
  It is hard to resolve this problem.
  I find it hard to resolve this problem.

Is one intuitively more “formal”? Why?

A preliminary example

Extraposed to-clause: It is hard to resolve this problem.
  It (expletive)
  Verb be
  An adjective (hard) or participle (boring)
  Clause starting with to + infinitive verb

Tends to be associated with a formal, “anonymous” style.

Tends to be “static”: the adjective or participle denotes a state, not a dynamic event.

If our intuitions are correct, we would expect the distribution of this clause to vary across genres and registers.
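One way to test this expectation is to count the construction in samples from different registers and normalise by text length. The sketch below is only a rough illustration: the regular expression is a surface heuristic (it does not check parts of speech and cannot tell expletive "it" from referential "it"), and the two sample texts and all function names are invented for the example.

```python
import re

# Rough surface pattern for the extraposed to-clause:
# "it" + a form of "be" + one word (adjective/participle) + "to" + one word.
# Assumption: this is a heuristic, not a parser; it will over- and under-match.
EXTRAPOSED_TO = re.compile(
    r"\bit(?:\s+(?:is|was|will\s+be|would\s+be)|'s)\s+(\w+)\s+to\s+(\w+)",
    re.IGNORECASE,
)


def rate_per_1000(text):
    """Return the number of pattern matches per 1,000 word tokens."""
    n_tokens = len(re.findall(r"\w+", text))
    n_hits = len(EXTRAPOSED_TO.findall(text))
    return 1000 * n_hits / n_tokens if n_tokens else 0.0


# Invented toy samples, one per register (hypothetical data).
samples = {
    "academic": "It is hard to resolve this problem. "
                "It is important to note the limits.",
    "conversation": "I find it hard to resolve this problem. "
                    "You know what I mean?",
}

rates = {register: rate_per_1000(text) for register, text in samples.items()}
for register, rate in rates.items():
    print(f"{register}: {rate:.1f} per 1,000 words")
```

Normalising to a rate per 1,000 words makes texts of different lengths comparable; with real data the pattern would normally be run over a part-of-speech tagged corpus rather than raw text.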

What is a register?

Would you consider the following to be registers?
1. recipe English
2. legal Maltese
3. specialised language used by shipbuilders

What are the crucial characteristics of register?

Defining register

Possible definitions (see overview in Paolillo 2000):
  register = “a field of discourse” or “topic”
  register = “a combination of all the parameters of the communicative situation”
  register = “an occupationally determined variety of language”

Defining genre

In discourse analysis and related fields, genre is given a “sociologically oriented” definition:
  “A socially ratified way of using language in connection with a particular type of social activity”
  suggests “typical” settings in which language is used
  e.g. interview, lecture, story…

Why is this relevant?

Reminder (see lecture 2):
  general-purpose corpora aim for balance and representativeness
  how genre/register are defined affects the structure and the uses of the corpus
  corpus-based studies of variation across/within registers need a well-defined notion of genre/register

Balance and representativeness

Balance:
  refers to the range of types of text in the corpus
  e.g. the BNC’s construction was based on an a priori classification of texts by domain, time and medium

Representativeness:
  refers to the extent to which the corpus contains the full range of variation in the language.

Representativeness depends on balance as a prerequisite

Biber (1993) on achieving balance

Biber distinguishes:
  external criteria:
    social and communicative contexts in which a particular sample of text/speech is produced
    external criteria define registers or genres
  internal criteria:
    linguistic (e.g. lexico-grammatical) features that distinguish texts
    internal criteria define text types

External vs. internal

Example: academic writing vs. spoken conversation

Some external criteria of differentiation:
  primary channel (spoken/written/…)
  type of addressee
  factuality

Some internal criteria of differentiation:
  more use of personal pronouns in spoken discourse
  more use of passives in academic writing
  …

Which should come first?

Biber’s argument:
  “in defining the population for a corpus, register/genre distinctions [i.e. external criteria] take precedence over text-type distinctions. […] identification of the salient text-type distinctions in a language requires a representative corpus of texts…”

Biber’s external criteria

1. Primary channel: written/spoken/scripted

2. Format: published/unpublished

includes various publication formats

3. Setting: institutional/other/private-personal

4. Addressee/receiver
  a. Plurality: unenumerated/plural/individual/self
  b. Presence: present/absent
  c. Interactiveness: none/little/extensive
  d. Shared knowledge: general/specialised/personal

5. Addressor:
  a. Demographic variation: age, sex etc.
  b. Acknowledgement: acknowledged individual/institution

6. Factuality: factual-informational / intermediate / imaginative

7. Purposes: persuade, entertain, edify, inform, instruct…

8. Topics: [cf. the “Domain” definition in BNC texts]

The logic behind genre/register comparison

A priori distinction between different genres/registers, each adequately sampled to be representative

Given these externally-based distinctions, the question is:
  what linguistic features are characteristic of (give rise to) different genres?

Part 2

The multifeature/multidimensional framework (Biber 1988, Biber 1995)

Biber (1988, 1995)

Compared twenty-one genres in spoken and written British English

Used a precompiled list of 67 linguistic features, comparing:
  the extent to which these features “cluster together” across genres
    e.g. high relative frequency of personal pronouns => high relative frequency of questions
  the extent to which these clusters are more clearly present in different genres

Primary goals

1. identify the main dimensions (clusters of features) of variation underlying all registers

2. find similarities and differences between different registers

Dimensions

Dimension:
  group of features that are empirically determined to co-occur in text

Functional interpretation:
  given a set of features forming a dimension
    e.g. pers. pronouns + questions
  the crucial question is: how do we interpret it functionally?
    e.g. the cluster containing pers. pronouns and questions shows a high level of interpersonal focus in the text

Factor analysis

The MF/MD approach uses factor analysis
  a statistical technique that groups together related features based on their co-occurrence
  the resulting clusters of features (“factors”) are then interpreted and given a label
  this is the process of identification and functional interpretation of dimensions
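Factor analysis starts from the pattern of correlations among feature frequencies across texts. The sketch below shows only that first ingredient, a correlation table, and is not factor analysis itself; the feature rates are invented and the helper name `pearson` is just for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Invented rates per 1,000 words; position i in each list is the same text.
# Texts 0-1 stand in for conversations, texts 2-3 for academic prose.
features = {
    "pronouns": [95.0, 88.0, 20.0, 15.0],
    "questions": [12.0, 10.0, 1.0, 0.5],
    "nouns": [150.0, 160.0, 280.0, 300.0],
}

names = list(features)
corr = {(a, b): pearson(features[a], features[b]) for a in names for b in names}

for a in names:
    print(a.ljust(10), " ".join(f"{corr[(a, b)]:6.2f}" for b in names))
```

In this toy table pronouns and questions correlate positively with each other and negatively with nouns, which is the kind of co-occurrence pattern a factor analysis would group into a single factor with opposite signs.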

Biber’s methodology

1. identify the grammatical features
  based on review of existing literature

2. tag all relevant features in the corpus texts

3. post-edit the texts to ensure accuracy

4. count frequency of each feature in each text

5. apply factor analysis to compute co-occurrence patterns among features

6. interpret the resulting dimensions functionally

7. compare different registers to see how much each dimension is represented in them

Types of features

Lexical features
  type-token ratio (indicates the average no. of different types given the number of tokens)
  word length

Lexical semantic features
  e.g. word classes like hedges (probably, possibly…); speech act verbs (declare), etc.
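The lexical features above are simple to compute once a text is tokenised. The following is a minimal sketch, assuming a crude regular-expression tokeniser and a tiny illustrative hedge list (not Biber's actual word list); the sample sentence is invented.

```python
import re

# Illustrative hedge list only; an assumption, not the list used by Biber.
HEDGES = {"probably", "possibly", "perhaps", "maybe"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_word_length(tokens):
    """Average token length in characters."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def hedge_rate(tokens):
    """Hedges per 1,000 tokens."""
    hits = sum(1 for t in tokens if t in HEDGES)
    return 1000 * hits / len(tokens) if tokens else 0.0


tokens = tokenize("The cat sat on the mat. The cat probably slept.")
ttr = type_token_ratio(tokens)
avg_len = mean_word_length(tokens)
hedges = hedge_rate(tokens)
print(ttr, avg_len, hedges)
```

Note that the type-token ratio falls as texts get longer, so it is only comparable across samples of similar size.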

Types of features

Grammatical feature classes
  nouns, prepositional phrases, attributive and predicative adjectives, etc.

Syntactic features:
  relative clauses, that-complements, pied-piping constructions (Which car does he like?), conditional subordination (should you ever…)

The dimensions identified

1. Involved vs. informational production
2. Narrative vs. non-narrative production
3. Elaborated vs. situation-dependent reference
4. Overt expression of persuasion
5. Abstract vs. non-abstract style

NB. Many of these dimensions define “poles of opposition”

Dimension 1: involved vs. informational

Features:
  1st & 2nd person pronouns, questions, reductions, stance verbs, hedges, emphatics, adverbial subordination
    Typical of conversations, letters (high personal involvement)
  nouns, adjectives, prepositional phrases, long words
    Typical of informational exposition, e.g. in official documents and academic writing

Dimension 2: Narrative vs. non-narrative

Features:
  past tense, perfect aspect, 3rd person pronouns, speech act verbs
    Typical of fiction
  present tense, attributive adjectives
    Typical of broadcasts, telephone conversations, professional letters

Dimension 3: elaborated vs. situation-dependent reference

Features:
  wh-relative clauses, pied-piping, phrasal coordination
    Typical of “elaborated” text (“situation-independent language”): official documents, professional letters, written exposition
  time adverbials, place adverbials
    Typical of “situation-dependent language”, e.g. broadcasts, fiction, personal letters

Dimension 4: Overt expression of persuasion

Features:
  modals, conditional subordination
    Defines an “overt expression of persuasion” type, e.g. editorials, professional letters
  lack of any of the above
    Language which does not overtly seek to persuade

Dimension 5: Abstract vs. non-abstract style

Features:
  agentless passives, by-passives, …
    An “abstract style”: technical prose, academic prose, official documents
  lack of any of the above
    Language which is typically not abstract: conversation, public speeches, broadcasts…

Biber’s main argument

No one dimension is enough to characterise the properties of a particular register
  dimensions are coherent, correlated groupings of features
  every register could be defined in terms of the relative prominence of all 5 dimensions
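To make "relative prominence" concrete, a text can be given a score on a dimension by standardising each feature frequency across the corpus, then adding the features at one pole and subtracting those at the other. The sketch below does this for a cut-down version of Dimension 1; the data, the text names and the four-feature selection are invented for illustration, and the population standard deviation is an assumption of this example.

```python
from statistics import mean, pstdev

# Invented rates per 1,000 words for four texts (hypothetical data).
texts = {
    "conversation_1": {"pronouns": 95.0, "questions": 12.0,
                       "nouns": 150.0, "prepositions": 60.0},
    "conversation_2": {"pronouns": 88.0, "questions": 10.0,
                       "nouns": 160.0, "prepositions": 70.0},
    "academic_1": {"pronouns": 20.0, "questions": 1.0,
                   "nouns": 280.0, "prepositions": 140.0},
    "academic_2": {"pronouns": 15.0, "questions": 0.5,
                   "nouns": 300.0, "prepositions": 150.0},
}

# Cut-down feature sets for the two poles of Dimension 1 (illustrative).
INVOLVED = ["pronouns", "questions"]
INFORMATIONAL = ["nouns", "prepositions"]


def z_scores(texts, feature):
    """Standardise one feature across all texts (mean 0, SD 1)."""
    values = [counts[feature] for counts in texts.values()]
    mu, sigma = mean(values), pstdev(values)
    return {name: (counts[feature] - mu) / sigma
            for name, counts in texts.items()}


def dimension_scores(texts, positive, negative):
    """Sum z-scores of positive-pole features, subtract negative-pole ones."""
    scores = {name: 0.0 for name in texts}
    for feature in positive:
        for name, z in z_scores(texts, feature).items():
            scores[name] += z
    for feature in negative:
        for name, z in z_scores(texts, feature).items():
            scores[name] -= z
    return scores


scores = dimension_scores(texts, INVOLVED, INFORMATIONAL)
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

Texts with positive scores sit towards the involved pole and texts with negative scores towards the informational pole; repeating the calculation for each dimension and averaging over the texts of a register gives that register's profile across all five dimensions.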

Biber’s main argument

Biber finds no evidence of an absolute difference between spoken and written language
  e.g. conversations often display similar characteristics to non-spoken genres

Better to identify different types of speech (broadcast, scripted, spontaneous)
  and view their similarities to and differences from different types of writing

Summary

Biber’s MF/MD approach has proved highly influential in the study of register and genre

Crucially, relies on a priori definition of:
  features (“what to look for”)
  registers (“situationally-defined uses of language”)

References

Paolillo, J. C. (2000). Formalising formality. Journal of Linguistics, 36: 215-259.

Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8 (4): 243-258.

Biber, D. (1995). On the role of computational, statistical and interpretive techniques in multi-dimensional analysis of register variation. Text, 15 (3): 314-370.