Multivariate methods for analysis of categorical data for linguists

Natalia Levshina

Mainz, May 9 2016


Outline

1. What are categorical data like?

2. Exploratory analysis
• Correspondence Analysis
  • Simple (2 variables)
  • Multiple (> 2 variables)
  • Correspondence Analysis with supplementary points
• Multidimensional Scaling and exemplar-based semantic maps

3. Confirmatory analysis
• Logistic regression
• Conditional inference trees and random forests


Categorical data

• Binary (biological sex, rural or urban, left the village or not) or having more than 2 categories (case in Russian, gender, valency)

• Ordered (level of education) or unordered (case in Russian)

• Pervasive in linguistics

• Poorly described in statistical textbooks


Examples from WALS

http://wals.info/feature

"Classical" categorical variables:

• Feature 107A: Passive Constructions
• Feature 87A: Order of Adjective and Noun
• Feature 1A: Consonant Inventories

Tricky variables (might need recoding before analysis):

• Feature 30A: Number of Genders (loss of information ("5 and more")?)
• Feature 72A: Imperative-Hortative Systems (two variables in fact?)
• Feature 144U: Double Negation in Verb-Initial Languages (one category – one count)
• Feature 142A: Para-Linguistic Usages of Clicks (is it good to have "other or none"?)


Contingency table

• Cross-tabulated counts of two or more categorical variables (a table with two or more dimensions)

• Example: Dryer's WALS F86 (WO GEN + Noun, rows) and F87 (WO ADJ + Noun, columns):

                  ADJ_Noun  Noun_ADJ  No dominant WO
GEN_Noun               232       342              38
Noun_GEN                65       342              26
No dominant WO          21        48              21
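A minimal Python sketch of how such a table comes about: one pair of values per language is counted into cells, and the margins are summed. The cell counts are the ones on the slide. The helper names (`expand`, `cross_tabulate`, `margins`) are invented for illustration, and the per-language observations are reconstructed from the counts, because the slide does not list individual languages.

```python
from collections import Counter

# Counts from the slide: rows = order of genitive and noun,
# columns = order of adjective and noun.
COLS = ["ADJ_Noun", "Noun_ADJ", "No dominant WO"]
TABLE = {
    "GEN_Noun": [232, 342, 38],
    "Noun_GEN": [65, 342, 26],
    "No dominant WO": [21, 48, 21],
}


def expand(table, cols):
    """Turn the counts back into one (row value, column value) pair per language."""
    return [
        (row, col)
        for row, counts in table.items()
        for col, k in zip(cols, counts)
        for _ in range(k)
    ]


def cross_tabulate(pairs):
    """Count how often each combination of two categorical values occurs."""
    return Counter(pairs)


def margins(cells):
    """Row and column totals of a two-way table stored as {(row, col): count}."""
    rows, cols = Counter(), Counter()
    for (row, col), k in cells.items():
        rows[row] += k
        cols[col] += k
    return rows, cols


observations = expand(TABLE, COLS)
cells = cross_tabulate(observations)
row_totals, col_totals = margins(cells)
print(len(observations), dict(row_totals), dict(col_totals))
```

A third variable would simply turn each pair into a triple, which gives the 3-dimensional table.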


A 3-dimensional contingency table

• WALS F86 + F87 + F89 (WO Num + Noun)

                  ADJ_Noun  Noun_ADJ  No dominant WO
GEN_Noun                 9        13               6
Noun_GEN                 1        16               2
No dominant WO           0         3               1

                  ADJ_Noun  Noun_ADJ  No dominant WO
GEN_Noun                15       226              13
Noun_GEN                16       184               4
No dominant WO           1        28               0

                  ADJ_Noun  Noun_ADJ  No dominant WO
GEN_Noun               176        45              10
Noun_GEN                37       100              15
No dominant WO          15        10               5

Each sub-table corresponds to one value of F89: Num_Noun, Noun_Num or No dominant WO.


Exploratory vs. confirmatory multivariate methods

• Exploratory:
  • most commonly, finding patterns in large complex data sets

• Confirmatory (or hypothesis-testing):
  • statistical inference, i.e. how confident can we be that we can reject the null hypothesis (frequentist statistics) / believe in the alternative hypothesis (Bayesian statistics)?
  • Frequentist statistics: p-values, confidence intervals
  • Bayesian statistics: posterior probabilities, credible intervals


Exploratory methods discussed today

• Correspondence Analysis

• Multidimensional Scaling based on Gower's distances


Correspondence Analysis

• Represents associations between categorical variables on a map

• Simple CA: two variables

• Multiple CA: more than two variables

• Supplementary points: all kinds of metainformation


Colour terms in different registers of COCA

          spoken  fiction  academic  press
black      20335    41118     26892  73080
blue        4693    22093      3605  21210
brown       1185    10914      1201  11539
gray        1168    12140      1289   6559
green       3860    14398      4477  26837
orange       931     3496       474   5766
pink         962     7312       584   6356
purple       613     3366       429   3403
red         7230    25111      5621  34596
white      14474    40745     26336  54883
yellow      1349    10553      1855  10382


Simple CA map


Interpretation

• Rows (colours) are close to one another if they have similar profiles.
  • black [20335, 41118, 26892, 73080], expressed as proportions [0.13, 0.25, 0.17, 0.45]
  • white [14474, 40745, 26336, 54883], expressed as proportions [0.11, 0.30, 0.19, 0.40]
  • gray [1168, 12140, 1289, 6559], expressed as proportions [0.06, 0.57, 0.06, 0.31]

Which colours have similar profiles?
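One way to explore this question is to compute the row profiles directly. The Python sketch below uses the COCA counts from the colour table, turns each row into proportions, and compares profiles with the chi-squared distance, which is the distance that Correspondence Analysis approximates on its maps. The chi-squared distance is the standard CA definition rather than something stated on the slides, and the function names are invented for illustration.

```python
import math

REGISTERS = ["spoken", "fiction", "academic", "press"]
COUNTS = {
    "black": [20335, 41118, 26892, 73080],
    "blue": [4693, 22093, 3605, 21210],
    "brown": [1185, 10914, 1201, 11539],
    "gray": [1168, 12140, 1289, 6559],
    "green": [3860, 14398, 4477, 26837],
    "orange": [931, 3496, 474, 5766],
    "pink": [962, 7312, 584, 6356],
    "purple": [613, 3366, 429, 3403],
    "red": [7230, 25111, 5621, 34596],
    "white": [14474, 40745, 26336, 54883],
    "yellow": [1349, 10553, 1855, 10382],
}


def profile(counts):
    """Row profile: each count divided by the row total."""
    total = sum(counts)
    return [c / total for c in counts]


# Column masses: share of each register in the whole table.
grand_total = sum(sum(row) for row in COUNTS.values())
col_mass = [
    sum(row[j] for row in COUNTS.values()) / grand_total
    for j in range(len(REGISTERS))
]


def chi2_distance(a, b):
    """Chi-squared distance between the profiles of two colours."""
    pa, pb = profile(COUNTS[a]), profile(COUNTS[b])
    return math.sqrt(sum((x - y) ** 2 / m for x, y, m in zip(pa, pb, col_mass)))


for colour in ("black", "white", "gray"):
    print(colour, [round(p, 2) for p in profile(COUNTS[colour])])
print("black-white:", round(chi2_distance("black", "white"), 3))
print("black-gray: ", round(chi2_distance("black", "gray"), 3))
```

The smaller the distance, the more similar the two profiles, so black should come out closer to white than to gray.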


Interpretation (cont.)

• The same holds for columns (registers).

• The absolute distances between columns and rows are not always meaningful (this depends on the type of CA map). What always matters, however, is the dimensional interpretation (e.g. in which quadrants do you find the rows and columns?).

• Less frequent categories are usually further from the origin.

• The absolute values of the coordinates and their sign usually do not matter.


What can you say about the map?


How good is the 2-dimensional solution?

• Dimension 1: 77.9% of variation explained

• Dimension 2: 19.2% of variation explained

• Dimension 3: 2.9% of variation explained

Total for dimensions 1 and 2: 97.1%

An excellent result!


Exercise: WO (2 var)

• How can you interpret the plot? Is the CA solution good?


Stuhl oder Sessel? (Chair or armchair?)

    Function  Age       Back  Soft  Arms  Upholstery  Mat_Seat
1   Eat       Adult     Low   No    No    No          Plastic
2   Eat       Children  Mid   No    No    No          Wood
3   NotSpec   Adult     Mid   No    Yes   No          Rattan
4   Eat       Adult     High  Yes   No    Yes         Fabric
5   Eat       Children  High  No    Yes   No          Plastic
6   Work      Adult     High  Yes   Yes   Yes         Fabric
7   NotSpec   Adult     Mid   No    No    No          Wood
8   Relax     Adult     High  Yes   Yes   Yes         Leather
9   Eat       Adult     Mid   No    No    No          Wood
10  Eat       Adult     Mid   No    No    Yes         Fabric

[188 observations, 16 variables in total]


What is shown?

• The black points are the values of the categorical variables (e.g. Rattan from Material, Work from Function).

• The grey points are the individual observations.

• Again, the safe interpretation is dimensional.


How good is the solution?

• Unfortunately, the traditional MCA inflates the variance (inertia), so the quality often seems lower than it actually is.

• Use the adjusted version of MCA, which provides a correction (see Levshina 2015: Ch. 19).


Supplementary points

• Represent variables or individuals that have a different nature from the rest (e.g. demographic information against linguistic variables, linguistic variables against the referential features).

• Are passive (do not influence the orientation of the axes).

• Can be plotted onto the maps.


Supplementary points


Confidence ellipses


Exercise: MCA of WO data (3 var)

• Interpret the map.


A fictitious example: data


A fictitious example: compute Gower’s distances
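A minimal Python sketch of this step, assuming all variables are categorical: in that case Gower's distance between two observations reduces to the share of variables on which they differ, with variables that are missing in either observation left out. The function name is invented for illustration. Since the fictitious data from the slide are not available here, the sketch is applied to the first three chair observations from the Stuhl/Sessel table instead.

```python
def gower_categorical(a, b):
    """Share of variables on which two observations differ.

    None marks a missing value; such variables are skipped for that pair.
    """
    compared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not compared:
        raise ValueError("no variable is observed in both items")
    return sum(x != y for x, y in compared) / len(compared)


# Function, Age, Back, Soft, Arms, Upholstery, Mat_Seat
CHAIRS = {
    1: ["Eat", "Adult", "Low", "No", "No", "No", "Plastic"],
    2: ["Eat", "Children", "Mid", "No", "No", "No", "Wood"],
    3: ["NotSpec", "Adult", "Mid", "No", "Yes", "No", "Rattan"],
}

# Pairwise distance matrix (upper triangle only).
distances = {
    (i, j): gower_categorical(CHAIRS[i], CHAIRS[j])
    for i in CHAIRS
    for j in CHAIRS
    if i < j
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A matrix of such pairwise distances is what Multidimensional Scaling then represents as points on a map.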


A fictitious example: MDS


Interpretation

• Large distances between points suggest little overlap of the formal expressions cross-linguistically; small distances suggest great overlap of the formal expressions cross-linguistically.

• The coordinates (the absolute magnitude and the sign) normally do not matter.

• What matters are the dimensions and clusters on the map. However, their interpretation is not provided by the algorithm. This is a task for a linguist (not always easy).


A real example

• ParTy corpus of film and TED talk subtitles (see my website)

• English + ten other languages (including Finnish, French, Indonesian, Japanese, Mandarin, Russian, Turkish and Vietnamese)

• Causal connectives (because, so, so that, that’s why, etc.) – in total, 205 instances.


MDS map with English categories


MDS map with Indonesian categories


How to interpret the points?

• You can provide any metainformation (e.g. original sentences) in a clickable plot (package googleVis)

• See an example here: http://www.natalialevshina.com/plots/bubblechart1.html


Exploratory methods: summary

• CA:
  + Shows all variables (features of furniture) and individuals (furniture items) on one map
  + Shows the average positions of variables (features of furniture)
  + The number of variables is not very important
  - Very sensitive to outliers (rare categories)
  - Missing data can be a problem

• MDS:
  - Shows only individuals (instances of connectives) with at most one variable (French, Chinese, etc.)
  - Does not show the average positions of variable categories (language-specific causal connectives)
  - Too few variables do not create enough variation
  + Rare categories are not a big deal
  + Missing data are not a big deal (of course, to a reasonable extent)


Logistic regression

• Models the relationship between a categorical response (e.g. active or passive voice, going to or gonna) and one or more predictors (e.g. direct or indirect causation, spoken or written data, the country, formal or informal speech…)

- Two outcomes: binomial (dichotomous)

- Three or more outcomes: multinomial (polytomous)


A case study

• Dutch causative verbs doen "do, make" and laten "let"

• Semantics: doen expresses more direct causation than laten

• Syntax: doen is used more often with intransitive verbs

• Geographic variation: causative doen occurs more frequently in Belgian Dutch

(1) Hij deed me denken aan mijn vader.
    He did me think at my father
    "He reminded me of my father."

(2) Ik liet hem mijn huis schilderen.
    I let him my house paint
    "I had him paint my house."


Data (first 6 observations)

   Aux    Country  Causation   EPTrans  EPTrans1
1  laten  NL       Inducive    Intr     Intr
2  laten  NL       Physical    Intr     Intr
3  laten  NL       Inducive    Tr       Tr
4  doen   BE       Affective   Intr     Intr
5  laten  NL       Inducive    Tr       Tr
6  laten  NL       Volitional  Intr     Intr


Table of coefficients

                         Coef    S.E.  Wald Z  Pr(>|Z|)
Intercept              1.8631  0.3771    4.94   <0.0001
Causation=Inducive    -3.3725  0.3741   -9.01   <0.0001
Causation=Physical     0.4661  0.6275    0.74    0.4576
Causation=Volitional  -3.7373  0.4278   -8.74   <0.0001
EPTrans=Tr            -1.2952  0.3394   -3.82    0.0001
Country=BE             0.7085  0.2841    2.49    0.0126

The coefficients of the variables are log-odds ratios. They show by how much the log-odds of doen against laten increase (if > 0) or decrease (if < 0) in comparison with the reference levels (Causation = Affective; EPTrans = Intr; Country = NL).
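To make the log-odds scale more tangible, the Python sketch below exponentiates a coefficient to get an odds ratio and converts a sum of coefficients into a predicted probability of doen with the inverse logit. The coefficient values are copied from the table; the function names are invented for illustration, and the sketch assumes a model without interaction terms, as the table suggests.

```python
import math

# Coefficients from the table (log-odds of doen vs. laten).
COEF = {
    "Intercept": 1.8631,
    "Causation=Inducive": -3.3725,
    "Causation=Physical": 0.4661,
    "Causation=Volitional": -3.7373,
    "EPTrans=Tr": -1.2952,
    "Country=BE": 0.7085,
}


def odds_ratio(term):
    """Odds ratio of doen vs. laten for one level against its reference level."""
    return math.exp(COEF[term])


def prob_doen(*terms):
    """Predicted probability of doen.

    `terms` lists the non-reference levels that apply; with no terms the
    context is Affective causation, intransitive Effected Predicate, NL.
    """
    log_odds = COEF["Intercept"] + sum(COEF[t] for t in terms)
    return 1 / (1 + math.exp(-log_odds))


print("OR Country=BE:", round(odds_ratio("Country=BE"), 2))
print("P(doen | reference levels):", round(prob_doen(), 3))
print("P(doen | Inducive, Tr, NL):",
      round(prob_doen("Causation=Inducive", "EPTrans=Tr"), 3))
```

For instance, the coefficient 0.7085 for Country=BE corresponds to an odds ratio of about 2: other things being equal, the odds of doen are roughly twice as high in the Belgian data as in the Netherlandic data.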


Goodness of fit

• Provides an estimate of how well the model fits the data

• The most popular measures:
  • Pseudo-R² (here: 0.61), ranges from 0 to 1. Caution: it is usually lower for logistic regression models than its counterpart in linear regression.
  • A better option: Concordance index C (here: 0.89)


Goodness of fit: concordance index C

• Take all possible pairs that consist of one sentence with doen and one sentence with laten. The statistic C is the proportion of these pairs in which the model predicts a higher probability of doen for the sentence with doen (and, accordingly, a higher probability of laten for the sentence with laten).

• Rule of thumb:

C = 0.5          no discrimination
0.7 ≤ C < 0.8    acceptable discrimination
0.8 ≤ C < 0.9    excellent discrimination
C ≥ 0.9          outstanding discrimination
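The pairwise definition of C can be sketched in a few lines of Python. The function name and the four predicted probabilities are invented for illustration; counting tied pairs as half a concordant pair is a common convention and an assumption here, since the slide does not mention ties.

```python
def concordance_index(probs, outcomes, positive):
    """Proportion of (positive, negative) pairs ranked correctly by the model.

    probs    -- predicted probabilities of the `positive` outcome
    outcomes -- observed outcomes, one per probability
    Tied pairs count as 0.5 (an assumed convention).
    """
    pos = [p for p, y in zip(probs, outcomes) if y == positive]
    neg = [p for p, y in zip(probs, outcomes) if y != positive]
    if not pos or not neg:
        raise ValueError("both outcomes must occur in the data")
    score = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                score += 1.0
            elif p == q:
                score += 0.5
    return score / (len(pos) * len(neg))


# Hypothetical predicted probabilities of doen for four sentences.
probs = [0.9, 0.6, 0.7, 0.2]
outcomes = ["doen", "doen", "laten", "laten"]
c = concordance_index(probs, outcomes, positive="doen")
print(c)
```

In this toy example three of the four doen/laten pairs are ordered correctly; only the doen sentence with 0.6 is ranked below the laten sentence with 0.7.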


Exercise: nerd or geek?

• Data:

Noun  Num  Century  Register  Eval
nerd  pl   XX       ACAD      Neutral
geek  pl   XXI      MAG       Neutral
geek  pl   XX       NEWS      Neutral
geek  sg   XXI      MAG       Neutral
nerd  sg   XXI      SPOK      Neg
geek  sg   XX       SPOK      Positive

[1316 observations in total]


Table of coefficients

                  Coef    S.E.  Wald Z  Pr(>|Z|)
Intercept      -1.5038  0.3515   -4.28   <0.0001
Num=sg          0.2724  0.1291    2.11    0.0348
Century=XXI     0.8063  0.1220    6.61   <0.0001
Register=MAG    0.7457  0.3208    2.32    0.0201
Register=NEWS   0.5962  0.3301    1.81    0.0709
Register=SPOK   0.5729  0.3310    1.73    0.0835
Eval=Neutral   -0.0991  0.1942   -0.51    0.6098
Eval=Positive   1.5084  0.2375    6.35   <0.0001

The coefficients show the effect for geek vs. nerd!


Does the model fit well?

• Pseudo-R2 = 0.17

• C = 0.69


Trees and forests

• These are methods based on recursive binary partitioning.
  • Why binary? The algorithm tests if any independent variable is associated with the response variable (e.g. geek or nerd) and chooses the one that has the strongest association with the response. It makes a binary split in this variable and divides the dataset into two subsets: those with value A and those with value B.
  • Why recursive? The previous steps are repeated again and again until there are no more variables associated with the outcome at the given level of statistical significance (e.g. α = 0.05).
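The two ideas, binary and recursive, can be illustrated with a much simplified Python sketch. It is not the conditional inference tree algorithm itself: as stated assumptions, it measures association with a permutation test on the chi-squared statistic, applies no correction for multiple testing, and only considers splits of the type "one level vs. the rest". All function names and the toy geek/nerd data are invented for illustration.

```python
import random
from collections import Counter


def chi2_stat(xs, ys):
    """Pearson chi-squared statistic for two equally long lists of labels."""
    n = len(xs)
    joint, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    total = 0.0
    for a in cx:
        for b in cy:
            expected = cx[a] * cy[b] / n
            total += (joint[(a, b)] - expected) ** 2 / expected
    return total


def perm_p_value(xs, ys, rng, n_perm=499):
    """Share of shuffled responses that are at least as associated with xs."""
    observed = chi2_stat(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi2_stat(xs, shuffled) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def grow(rows, response, predictors, rng, alpha=0.05):
    """Return a nested dict: a leaf with class counts, or a binary split."""
    ys = [r[response] for r in rows]
    leaf = {"leaf": dict(Counter(ys))}
    if len(set(ys)) < 2:
        return leaf
    # Step 1: test every predictor that still varies; keep the smallest p-value.
    tests = []
    for var in predictors:
        xs = [r[var] for r in rows]
        if len(set(xs)) > 1:
            tests.append((perm_p_value(xs, ys, rng), var))
    if not tests:
        return leaf
    p, var = min(tests)
    if p >= alpha:
        return leaf  # nothing is significantly associated: stop here
    # Step 2: binary split "one level vs. the rest" with the largest chi-squared.
    levels = sorted({r[var] for r in rows})
    level = max(levels, key=lambda v: chi2_stat([r[var] == v for r in rows], ys))
    yes = [r for r in rows if r[var] == level]
    no = [r for r in rows if r[var] != level]
    # Step 3: repeat the procedure in both subsets.
    return {"split": f"{var} == {level}", "p": p,
            "yes": grow(yes, response, predictors, rng, alpha),
            "no": grow(no, response, predictors, rng, alpha)}


# Invented toy data: Century separates the nouns perfectly, Num is unrelated.
rows = [{"Noun": "geek", "Century": "XXI", "Num": "sg" if i % 2 == 0 else "pl"}
        for i in range(10)]
rows += [{"Noun": "nerd", "Century": "XX", "Num": "sg" if i % 2 == 0 else "pl"}
         for i in range(10)]
tree = grow(rows, "Noun", ["Num", "Century"], random.Random(1))
print(tree["split"], tree["yes"], tree["no"])
```

In the toy data the first split is made on Century; within each resulting subset the response no longer varies, so the recursion stops and both branches end in leaves.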


A case study

• Variation in the English causatives make + V, have + V and cause + to V

• 50 examples of each construction from a corpus = 150 observations in total

• 6 categorical variables:
  • CrSem: semantics of the Causer (Animate, Inanimate)
  • CeSem: semantics of the Causee (Animate, Inanimate)
  • CdEv: semantics of the infinitive (Mental, Physical, Social)
  • Neg: negation
  • Coref: coreferentiality between Cr and other participants (Yes, No)
  • Poss: possessive markers that suggest a possessive relationship between Cr and other participants (Yes, No)


Conditional inference tree


Goodness of fit: classification accuracy

[Confusion matrix: observed outcomes vs. predicted outcomes]

Accuracy = (35 + 42 + 24)/150 = 0.67
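The arithmetic can be checked with a short Python sketch, which also compares the result with a simple baseline. The three counts of correctly classified observations come from the slide (which count belongs to which construction is not stated there). The baseline of always predicting the same construction follows from the design of 50 examples per construction; the function name is invented for illustration.

```python
def accuracy(correct_per_class, n_total):
    """Share of observations whose predicted outcome equals the observed one."""
    return sum(correct_per_class) / n_total


# Diagonal of the confusion matrix, as given on the slide.
acc = accuracy([35, 42, 24], 150)

# Baseline: always predict one construction (50 of 150 observations each).
baseline = 50 / 150

print(round(acc, 2), round(baseline, 2))
```

An accuracy of about two thirds is thus twice as high as what one would get by always guessing the same construction.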


Random forests

• Consist of many conditional inference trees (e.g. 1000): the tree algorithm is repeated many times, each time on a different random subset of the observations and predictors.

• We can compute the conditional variable importance scores for each variable. ‘Conditional’, because they are computed given the impact of all other variables and interactions with them.

• Important: the variable importance scores are relative. They cannot be compared across different models.


Conditional variable importance


Confirmatory methods: summary

• Conditional inference trees and random forests are used in situations where the use of regression is problematic:
  • 'Small n, large p' (the maximum number of coefficients in the binary logistic regression model is the frequency of the less frequent response category divided by 10)
  • Complex interactions
  • Outliers

• Unlike regression, the partitioning methods do not return coefficients. However, it is possible to obtain relative variable importance measures.
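The 'small n, large p' rule of thumb above is easy to apply in code. In this Python sketch the function name is invented and the outcome frequencies in the example are hypothetical, chosen only to show the calculation.

```python
def max_coefficients(n_outcome_a, n_outcome_b, per_coefficient=10):
    """Rule-of-thumb upper limit on coefficients in a binary logistic model:
    frequency of the less frequent outcome divided by 10 (rounded down)."""
    return min(n_outcome_a, n_outcome_b) // per_coefficient


# Hypothetical data set: 380 instances of one variant, 120 of the other.
limit = max_coefficients(380, 120)
print(limit)
```

If the predictors of interest require more coefficients than this limit allows (remember that a categorical predictor with k levels needs k - 1 coefficients), trees and forests may be the safer choice.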


Acknowledgements

• All analyses were performed with R, a free statistical environment available from https://cran.r-project.org/

Thanks for your attention!


References

• All methods and many of the case studies are discussed in my textbook:
  • Levshina, N. 2015. How to Do Linguistics with R: Data exploration and statistical analysis. Amsterdam: John Benjamins.

• Correspondence Analysis:
  • Greenacre, M. 2007. Correspondence Analysis in Practice (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.

• Multidimensional Scaling:
  • Borg, I. & Groenen, P. 2005. Modern Multidimensional Scaling: Theory and Applications (2nd ed.). New York: Springer.

• Logistic regression:
  • Hosmer, D. W., Lemeshow, S. & Sturdivant, R. X. 2013. Applied Logistic Regression. New York: Wiley.

• Conditional inference trees and random forests:
  • Tagliamonte, S. & Baayen, R. H. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2): 135-178.


R code

• See the textbook and the companion website

• Exemplar-based MDS maps:
  • See my page on Academia.edu, paper "How to make semantic maps with R (based on contextual features of exemplars and Multidimensional Scaling)"