Multiple Sequence Analysis: a contextualized narrative approach to longitudinal data

University of Stirling, September 2007

Gary Pollock

Department of Sociology

Manchester Metropolitan University

[email protected]

Longitudinal processes

• start and end times (event history analysis, EHA)

• competing risk, multi-episode (EHA)

• contiguous states as a single DV (sequence analysis, SA)

i.e. SA offers an alternative (complementary) approach to EHA

Sequence analysis using OMA

1. Sequences of statuses are processed by….

2. Optimal Matching Analysis (OMA) which results in …

3. A distance matrix representing the closeness (proximity) of each sequence with all others which can then be processed by…

4. Cluster analysis which leads to the construction of…

5. A typology of sequence categories

Single Sequences

• social class (S/N/M)

• e.g. case 1: SSSSSSSSSS

• case 2: NNNNNSSSSS

• case 3: NNNNNMMMMM etc.

• Case Analysis: resulting typology is an end-in-itself

• Variable Analysis: typology as a predictor or a dependent variable

• Class, employment status, qualifications, housing, marital status etc. can all be analysed in this way, giving a range of typologies, but these don't account for interactions as they are each independently arrived at

• why not combine sequence data prior to analysis in order to capture interactions?

Analysis: process

• Create sequence data file

• Determine what to do with internal gaps (fill, delete or skip)

• Determine the ‘costs’ to be used in the OMA (indel and substitution). These are the parameters which define the distances between the sequences. They work by giving low distance scores to similar sequences and high scores to dissimilar sequences

• Perform the OMA (though there are other SA techniques)

• Weight the distance scores to account for different sequence lengths

• Perform cluster analysis

• Analyse clusters (i. sequence progression ii. covariates)
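The weighting step can be sketched as follows. The slides do not say which weighting was used; dividing the raw distance by the length of the longer sequence is assumed here as one common choice, and the function name `normalise_by_length` is hypothetical.

```python
def normalise_by_length(raw_distance, seq_a, seq_b):
    """Divide a raw OMA distance by the length of the longer sequence.

    Assumption: this is one common weighting, not necessarily the one
    used in the analysis reported here.
    """
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0  # two empty sequences are treated as identical
    return raw_distance / longest


# A raw distance of 10 between two ten-year sequences becomes 1.0
print(normalise_by_length(10, "SSSSSSSSSS", "NNNNNSSSSS"))
```

Without some weighting of this kind, long sequences tend to look more distant from each other than short ones simply because they have more positions that can differ.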

Indel and substitution costs

case 1: SSSSSSSSSS

case 2: NNNNNSSSSS

case 3: NNNNNMMMMM

• If INDEL = 1 and SUBS = 2 (often a default setting)

• 1,2 = 10

• 1,3 = 20

• 2,3 = 10

• If INDEL = 1 and SUBS = 2 (NM, MN, SM, MS) and 1.5 (NS, SN)

• 1,2 = 7.5

• 1,3 = 17.5

• 2,3 = 10
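The distances for the three example cases can be checked with a minimal dynamic-programming sketch of optimal matching. This is the standard edit-distance recursion with separate indel and substitution costs, not the software used for the analysis; the names `om_distance`, `flat` and `graded` are hypothetical.

```python
def om_distance(a, b, sub_cost, indel=1.0):
    """Optimal matching distance between sequences a and b.

    sub_cost(x, y) gives the cost of substituting state x by state y;
    identical states always cost 0.
    """
    n, m = len(a), len(b)
    # d[i][j] = cheapest way of turning a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost(a[i - 1], b[j - 1])
            d[i][j] = min(
                d[i - 1][j] + indel,    # delete a state from a
                d[i][j - 1] + indel,    # insert a state from b
                d[i - 1][j - 1] + sub,  # match or substitute
            )
    return d[n][m]


def flat(x, y):
    """Every substitution costs 2."""
    return 2.0


def graded(x, y):
    """N<->S costs 1.5, all other substitutions cost 2."""
    return 1.5 if {x, y} == {"N", "S"} else 2.0


case1, case2, case3 = "SSSSSSSSSS", "NNNNNSSSSS", "NNNNNMMMMM"
print(om_distance(case1, case2, flat), om_distance(case1, case2, graded))
```

Lowering the N/S substitution cost pulls cases 1 and 2 closer together while leaving the distance between cases 2 and 3 unchanged, which is how the cost settings shape the eventual clusters.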

Data: BHPS 1991-2007

• born 1970-1975

• tracked from age 21 to 29

• data shifted to a common time axis

• class and qualifications examined here (housing, marriage, employment status and fertility status also processed)

• All internal gaps filled

• All sequence lengths included

• Year on year transitions used to inform substitution costs
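Filling internal gaps can be sketched as below. The slides do not state how gaps were filled; carrying the last observed status forward is an assumption made purely for illustration, as is the use of -1 as the missing code and the function name `fill_internal_gaps`. Missing values before the first or after the last observation are not internal gaps and are left alone.

```python
MISSING = -1  # assumed missing-value code


def fill_internal_gaps(seq):
    """Carry the last observed state forward across internal gaps only."""
    observed = [i for i, s in enumerate(seq) if s != MISSING]
    if not observed:
        return list(seq)  # nothing observed, nothing to fill
    first, last = observed[0], observed[-1]
    filled = list(seq)
    for i in range(first + 1, last):
        if filled[i] == MISSING:
            filled[i] = filled[i - 1]
    return filled


# The two-year internal gap is filled; the trailing gap is kept
print(fill_internal_gaps([3, -1, -1, 2, -1]))
```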

Sequence gaps over waves A to N

Length of gaps across all waves A to N (note that a single case may have more than one internal gap on a single variable, hence a straight % of no. gaps/N of sequences is problematic; instead % = of all gaps on that variable)

10,264 sequences

Gap length    Class          Highest qualification
1 yr gap      2,018 (54%)    1,224 (70%)
2 yr gap        761 (20%)      274 (16%)
3 yr gap        321 (9%)       121 (7%)
4 yr gap        230 (6%)        67 (4%)
5 yr gap        144 (4%)        19
6 yr gap         87 (2%)        14
7 yr gap         63 (2%)        10
8 yr gap         35              6
9 yr gap         34              5
10 yr gap        17              3
11 yr gap        10              1
12 yr gap        10              4
N             3,730          1,748

Data: BHPS 1991-2007

[Figure: BHPS waves plotted against age for each birth cohort]

Waves (year): A 91, B 92, C 93, D 94, E 95, F 96, G 97, H 98, I 99, J 00, K 01, L 02, M 03, N 04

Rows: Year of birth = 1970, 1971, 1972, 1973, 1974, 1975

Age axis: 16 17 18 19 20 21 22 23 24 25 26 27 28 29
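Shifting wave-indexed data onto a common age axis can be sketched as follows. Age is taken here as wave year minus year of birth, which matches the wave and cohort labels above but ignores interview month; the names `WAVE_YEARS` and `to_age_axis` and the -1 missing code are assumptions for illustration.

```python
# Wave letters A to N correspond to survey years 1991 to 2004
WAVE_YEARS = {wave: 1991 + i for i, wave in enumerate("ABCDEFGHIJKLMN")}


def to_age_axis(birth_year, by_wave, ages=range(21, 30), missing=-1):
    """Re-index a {wave letter: status} record by age instead of wave.

    Ages with no observation get the missing code; observations
    outside the requested age range are dropped.
    """
    by_age = {WAVE_YEARS[w] - birth_year: s for w, s in by_wave.items()}
    return [by_age.get(age, missing) for age in ages]


# Someone born in 1975 reaches age 21 at wave F (1996) and 29 at wave N (2004)
print(to_age_axis(1975, {"A": 0, "F": 2, "N": 1}))
```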

Single Sequences: class

C    21  22  23  24  25  26  27  28  29
1     3   3   3   3   3  -1  -1  -1  -1
2     3   3   3   3   3   3   2   3   3
3     4   3   3   3   3   3   3   2   3
4     3   4   4   4   3   3   2   1  -1
5     6   6   4   4   6  -1  -1  -1  -1
6     5   5   5   5   5   6   6   6   6
7     0   0   3   3   3   3   3   3   3
8     0   0   0   2   4   2   2   2   2
9     3   3   3   3   3   3   3   1   3
10    2   2   4   4   2   2   5   6   1

0 = no job yet

1 = Service class Higher

2 = Service class Lower

3 = Non-manual

4 = Self

5 = Skilled

6 = unskilled

Proportions of time spent in a particular class

N=810

Proportion   sch    scl    nm     self   skil   unsk
0.0          75.9   54.1   50.9   87.9   66.9   60.9
0.2           8.5   10.4    8.6    4.1    8.1    7.3
0.4           8.0   14.9   11.7    3.8    6.8   11.1
0.6           4.3    9.8   10.0    2.0    7.0    7.7
0.8           2.2    5.6    7.7    0.9    3.5    5.4
1.0           1.0    5.3   11.1    1.4    7.7    7.7

Year on year class transitions

         0     sch    scl    nm     self   skil   unsk   Total
0        242    12     27     29     10     14     36     370
sch        0   247     75     32      6     14     10     384
scl        0   114    725    126     15     41     30    1051
nm         0    73    202    960     14     21     45    1315
self       0     5     13     10    175     14     25     242
skil       0    21     57     27     23    597     97     822
unsk       0    18     41     82     25    108    686     960
Total    242   490   1140   1266    268    809    929    5144

Year on year class transitions: off diagonal proportions (N = 1512)

           SCH   SCL   NM    Self   Skilled   Unsk
None        1     2     2     1       1        2
SCH               5     2     0       1        1
SCL         8           8     1       3        2
NM          5    13           1       1        3
Self        0     1     1             1        2
Skilled     1     4     2     2                6
Unsk        1     3     5     2       7

Total 100
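The off-diagonal percentages can be reproduced from the year on year transition counts with a short sketch: each off-diagonal cell is divided by the total number of off-diagonal transitions (1512) and rounded. The names `STATES`, `COUNTS` and `off_diagonal_percentages` are hypothetical; how these percentages were then converted into substitution costs is not stated in the slides and is not attempted here.

```python
STATES = ["none", "sch", "scl", "nm", "self", "skil", "unsk"]

# Rows = class in year t, columns = class in year t+1 (counts from the slide)
COUNTS = [
    [242, 12, 27, 29, 10, 14, 36],
    [0, 247, 75, 32, 6, 14, 10],
    [0, 114, 725, 126, 15, 41, 30],
    [0, 73, 202, 960, 14, 21, 45],
    [0, 5, 13, 10, 175, 14, 25],
    [0, 21, 57, 27, 23, 597, 97],
    [0, 18, 41, 82, 25, 108, 686],
]


def off_diagonal_percentages(counts):
    """Return (number of changes, matrix of rounded % of all changes).

    Diagonal cells (no change of status) are set to None.
    """
    k = len(counts)
    total = sum(counts[i][j] for i in range(k) for j in range(k) if i != j)
    pct = [
        [None if i == j else round(100 * counts[i][j] / total) for j in range(k)]
        for i in range(k)
    ]
    return total, pct


total, pct = off_diagonal_percentages(COUNTS)
scl, nm = STATES.index("scl"), STATES.index("nm")
print(total, pct[scl][nm], pct[nm][scl])
```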

Class substitution costs

        None   sch    scl    nm     self   skil   unsk
None    0.0    1.8    1.8    1.8    1.8    1.8    1.8
sch     1.8    0.0    1.2    1.3    1.8    1.7    1.7
scl     1.8    1.2    0.0    1.1    1.7    1.3    1.3
nm      1.8    1.3    1.1    0.0    1.7    1.6    1.3
self    1.8    1.8    1.7    1.7    0.0    1.6    1.6
skil    1.8    1.7    1.3    1.6    1.6    0.0    1.2
unsk    1.8    1.7    1.3    1.3    1.6    1.2    0.0

Cluster analysis of class sequences

An eight cluster solution produces the following:

Clus % cases description

1 17 non manual, little if any mobility

2 12 service class, lower, little mobility

3 13 unskilled, little mobility

4 12 moving from unskilled to skilled work

5 15 mixed

6 6 skilled, little mobility

7 19 upwards mobility, NM, SCL, SCH

8 5 self employed, little mobility
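Going from a distance matrix to clusters can be sketched with a small agglomerative procedure. The slides do not name the clustering method used; average linkage is assumed here for illustration only, and `average_linkage` is a hypothetical name. The example matrix uses the distances between the three S/N/M example cases under the graded substitution costs (7.5, 17.5, 10).

```python
def average_linkage(dist, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a sorted list of case indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def between(c1, c2):
        # mean pairwise distance between members of the two clusters
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > max(n_clusters, 1):
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


dist = [
    [0.0, 7.5, 17.5],
    [7.5, 0.0, 10.0],
    [17.5, 10.0, 0.0],
]
print(average_linkage(dist, 2))
```

The number of clusters is chosen by the analyst, as with the eight cluster solution above; the procedure itself gives no formal rule for where to stop.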

Single Sequences: highest qualification

C    21  22  23  24  25  26  27  28  29
1     2   2   2   2   2  -1  -1  -1  -1
2     2   2   2   2   2   2   2   1   1
3     2   2   2   2   2   2   2   2   2
4     2   1   1   1   1   1   1  -1  -1
5     5   5   5   5   5   5   5   5   5
6     3   3   3   3   3   3   3   3   3
7     2   2   2   2   2   2   2   2   2
8     3   3   2   2   2   2   2   2   2
9     3   3   3   3   3   3   3   3   2
10    2   2   2   2   2   2   2   1   1

1 = HE

2 = Post GCSE/O grade

3 = GCSE / O grade

4 = Other

5 = None/at school

Proportions of time spent in highest qualification statuses

N=831

Proportion   HE     Post GCSE   GCSE   Other   None
0            79.8   35.5        72.8   89.5    92.9
0.2           0.5    7.9         1.9    0.4     0.1
0.4           2.2    8.9         1.9    0.7     0.7
0.6           1.6    4.7         2.4    1       0.4
0.8           6.4    4.7         1.9    0.2     0.6
1.0           9.6   38.3        19      8.2     5.3

Year on year changes in HEQ

         HE     A      O      Other   None   Total
HE       710      0      0      0       0     710
A        136   2426      0      0       0    2562
O          4     73   1133      0       0    1210
Other      0     15      5    485       0     505
None       0     14      0      1     264     279
Total    850   2528   1138    486     264    5266

Year on year changes in HEQ: off diagonal proportions (N = 248)

         HE    A     O     Other   None
HE             0     0     0       0
A        55          0     0       0
O         2    29          0       0
Other     0     6     2            0
None      0     6     0     0

Total

HEQ substitution costs

         None   HE     A      O      oth    none
None     0.0    2.0    2.0    2.0    2.0    2.0
HE       1.8    0.0    2.0    2.0    2.0    2.0
A        1.8    1.1    0.0    2.0    2.0    2.0
O        1.8    1.8    1.2    0.0    2.0    2.0
Other    1.8    1.7    1.6    1.7    0.0    1.8
None     1.8    1.8    1.6    1.7    1.7    0.0

Cluster analysis of HEQ sequences

A seven cluster solution produces the following:

Clus % cases description

1 17 from GCSE to post-GCSE

2 7 ‘late’ post GCSE to HE

3 30 post GCSE, stable

4 13 ‘early’ post GCSE to HE

5 6 no qualifications

6 14 GCSE, stable

7 11 other, stable

Multiple Sequence Analysis (MSA)

• combine different sequences prior to OMA processing

• eg. class, qualifications, (housing, marital and fertility statuses) are combined in a single measure

• the sequences represent a narrative of change (or stability) on the measured dimensions

• the resulting typology can be analysed using case and variable methods as before but is in itself a representation of complex time embedded associations between the source variables

Multiple Sequences: class and highest qualification

C    21  22  23  24  25  26  27  28  29
1    23  23  23  23  23  -1  -1  -1  -1
2    23  23  23  23  23  23  22  13  13
3    24  23  23  23  23  23  23  22  23
4    23  14  14  14  13  13  12  -1  -1
5    56  56  54  54  56  -1  -1  -1  -1
6    35  35  35  35  35  36  36  36  36
7    20  20  23  23  23  23  23  23  23
8    30  30  20  22  24  22  22  22  22
9    33  33  33  33  33  33  33  31  23
10   22  22  24  24  22  22  25  16  11

1st Digit (highest qualification):
1 = HE
2 = Post GCSE/O grade
3 = GCSE / O grade
4 = Other
5 = None/at school

2nd Digit (class):
0 = no job yet
1 = Service class Higher
2 = Service class Lower
3 = Non-manual
4 = Self
5 = Skilled
6 = unskilled
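Building the combined measure can be sketched as below: the qualification code becomes the first digit and the class code the second, and a year is treated as missing if either source is missing, which is what the example cases show (case 4 at age 28). The function name `combine` and the `MISSING` constant are hypothetical.

```python
MISSING = -1


def combine(heq_seq, class_seq):
    """Two-digit state per year: tens = highest qualification, units = class."""
    combined = []
    for q, c in zip(heq_seq, class_seq):
        if q == MISSING or c == MISSING:
            combined.append(MISSING)  # missing on either dimension
        else:
            combined.append(10 * q + c)
    return combined


# Case 4 from the single-sequence tables
heq_case4 = [2, 1, 1, 1, 1, 1, 1, -1, -1]
class_case4 = [3, 4, 4, 4, 3, 3, 2, 1, -1]
print(combine(heq_case4, class_case4))
```

With 5 qualification codes and 7 class codes this gives 35 possible combined states, so each additional source variable multiplies the size of the state space.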

Year on year changes

• This is a large (35 by 35) matrix

• Calculation of substitution costs as for single sequence structure

• Frequent transitions:

• 12 → 11 (2.9%)

• 13 → 12 (2.3%)

• 21 → 22 (2.6%)

• 22 → 21 (2.6%)

• 22 → 23 (4.5%)

• 23 → 22 (5.9%)

• 26 → 25 (2.4%)

Sequence analysis of class-HEQ data

Clus % description

1 11 post GCSE, NM, stable

2 8 post GCSE → HE, NM → SCL

3 5 no quals, self emp → unsk

4 10 GCSE, mixed emp (self, sk, unsk)

5 7 post GCSE, NM → SCL

6 7 GCSE, NM both stable

7 4 post GCSE skilled, both stable

8 6 from unsk and sk → SCH, HE

9 15 mixed

10 4 other quals and SCL, SCH

11 3 post GCSE, SCH/SCL switching

12 6 other and sk/unsk, stable

13 8 post GCSE, unsk stable

14 2 post GCSE, self, stable

Advantages of MSA

• Is not limited to a single sequence measure

• Is not limited to a single event type

• Articulates the full scope of related sequences together

Issues

• Increasing complexity of the measure as new variables are drawn in

• computing time / software switching

• Lack of formal rules in executing the OMA and clustering processes

• Largely exploratory: scope to develop in relation to EHA
