Multiple Sequence Analysis: a contextualized narrative approach to longitudinal data
University of Stirling, September 2007
Gary Pollock
Department of Sociology
Manchester Metropolitan University
[email protected]
Longitudinal processes
start and end times (EHA)
competing risk, multi-episode (EHA)
contiguous states as a single DV (SA)
i.e. SA offers an alternative (complementary) approach to EHA
Sequence analysis using OMA
1. Sequences of statuses are processed by….
2. Optimal Matching Analysis (OMA) which results in …
3. A distance matrix representing the closeness (proximity) of each sequence with all others which can then be processed by…
4. Cluster analysis which leads to the construction of…
5. A typology of sequence categories
Single Sequences
• social class (S/N/M)
• e.g. case 1: SSSSSSSSSS
• case 2: NNNNNSSSSS
• case 3: NNNNNMMMMM etc.
• Case Analysis: resulting typology is an end in itself
• Variable Analysis: typology as a predictor or a dependent variable
• Class, employment status, qualifications, housing, marital status etc. can all be analysed in this way, giving a range of typologies, but these don’t account for interactions as each is arrived at independently
• why not combine sequence data prior to analysis in order to capture interactions?
Analysis: process
• Create sequence data file
• Determine what to do with internal gaps (fill, delete or skip)
• Determine the ‘costs’ to be used in the OMA (indel and substitution). These are the parameters which define the distances between the sequences. They work by giving low distance scores to similar sequences and high scores to dissimilar sequences
• Perform the OMA (though there are other SA techniques)
• Weight the distance scores to account for different sequence lengths
• Perform cluster analysis
• Analyse clusters (i. sequence progression ii. covariates)
Indel and substitution costs
case 1: SSSSSSSSSS
case 2: NNNNNSSSSS
case 3: NNNNNMMMMM
• If INDEL = 1 and SUBS = 2 (often a default setting)
• 1,2 = 10
• 1,3 = 20
• 2,3 = 10
• If INDEL = 1 and SUBS = 2 (NM, MN, SM, MS) and 1.5 (NS, SN)
• 1,2 = 7.5
• 1,3 = 17.5
• 2,3 = 10
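The distances on this slide can be reproduced with a standard edit-distance recursion. The following is a minimal sketch, not the software used for the analysis: the function name oma_distance and its parameters are illustrative, and it assumes each pair of sequences is scored by the cheapest mix of insertions, deletions and substitutions.

```python
def oma_distance(a, b, indel=1.0, sub_cost=None, default_sub=2.0):
    """Optimal matching (edit) distance between two state sequences.

    sub_cost maps a pair of states to its substitution cost; pairs that
    are not listed cost default_sub. Names here are illustrative only.
    """
    sub_cost = sub_cost or {}

    def sub(x, y):
        if x == y:
            return 0.0
        # Look the pair up in either order, then fall back to the default.
        return sub_cost.get((x, y), sub_cost.get((y, x), default_sub))

    # prev[j]: cheapest cost of turning the prefix of a seen so far
    # into the first j states of b (row 0 is j insertions).
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel]  # turning i states into nothing: i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel,          # delete x
                curr[j - 1] + indel,      # insert y
                prev[j - 1] + sub(x, y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


cases = {1: "SSSSSSSSSS", 2: "NNNNNSSSSS", 3: "NNNNNMMMMM"}
pairs = [(1, 2), (1, 3), (2, 3)]

# First setting: INDEL = 1, SUBS = 2 for every pair of states.
flat = {p: oma_distance(cases[p[0]], cases[p[1]]) for p in pairs}

# Second setting: N/S substitutions cost 1.5, all other pairs cost 2.
graded = {p: oma_distance(cases[p[0]], cases[p[1]], sub_cost={("N", "S"): 1.5})
          for p in pairs}

print(flat)    # {(1, 2): 10.0, (1, 3): 20.0, (2, 3): 10.0}
print(graded)  # {(1, 2): 7.5, (1, 3): 17.5, (2, 3): 10.0}
```

Lowering the N/S substitution cost pulls cases 1 and 2 closer together while leaving cases 2 and 3 unchanged, which is the point of the second setting.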
Data: BHPS 1991-2007
• born 1970-1975
• tracked from age 21 to 29
• data shifted to a common time axis
• class and qualifications examined here (housing, marriage, employment status and fertility status also processed)
• All internal gaps filled
• All sequence lengths included
• Year on year transitions used to inform substitution costs
Sequence gaps over waves A to N
Length of gaps across all waves A to N, 10,264 sequences (note that a single case may have more than one internal gap on a single variable, hence a straight % of no. gaps/N of sequences is problematic; instead % = of all gaps on that variable)

Gap length    Class          Highest qualification
1 yr gap      2,018 (54%)    1,224 (70%)
2 yr gap        761 (20%)      274 (16%)
3 yr gap        321 (9%)       121 (7%)
4 yr gap        230 (6%)        67 (4%)
5 yr gap        144 (4%)        19
6 yr gap         87 (2%)        14
7 yr gap         63 (2%)        10
8 yr gap         35              6
9 yr gap         34              5
10 yr gap        17              3
11 yr gap        10              1
12 yr gap        10              4
N             3,730          1,748
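A gap-length table of this kind depends on how an internal gap is identified. The sketch below is one possible reading, not the procedure used for the slides: it assumes missing years are coded -1 (the code that appears in the example sequences) and that a gap is internal only when observed states sit on both sides of it. The function name and the example sequence are hypothetical.

```python
MISSING = -1  # assumed missing-data code


def internal_gap_lengths(seq, missing=MISSING):
    """Lengths of runs of missing values with observed states on both sides."""
    observed = [i for i, state in enumerate(seq) if state != missing]
    # The number of positions skipped between two consecutive observed
    # positions is the length of the gap between them.
    return [later - earlier - 1
            for earlier, later in zip(observed, observed[1:])
            if later - earlier > 1]


# Hypothetical sequence: a 2-year gap, a 1-year gap, then trailing missing
# years that are not counted because nothing is observed after them.
example = [3, -1, -1, 3, -1, 2, -1, -1]
gaps = internal_gap_lengths(example)
print(gaps)  # [2, 1]
```

Because one sequence can contribute several gaps, the lengths are best tallied per gap rather than per case, which matches the note on how the percentages are computed.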
Data: BHPS 1991-2007
[Figure: BHPS waves A (1991) to N (2004) aligned to a common age axis (16 to 29) for each year of birth, 1970 to 1975]
Single Sequences: class
C    21  22  23  24  25  26  27  28  29
1     3   3   3   3   3  -1  -1  -1  -1
2     3   3   3   3   3   3   2   3   3
3     4   3   3   3   3   3   3   2   3
4     3   4   4   4   3   3   2   1  -1
5     6   6   4   4   6  -1  -1  -1  -1
6     5   5   5   5   5   6   6   6   6
7     0   0   3   3   3   3   3   3   3
8     0   0   0   2   4   2   2   2   2
9     3   3   3   3   3   3   3   1   3
10    2   2   4   4   2   2   5   6   1
0 = no job yet
1 = Service class Higher
2 = Service class Lower
3 = Non-manual
4 = Self
5 = Skilled
6 = unskilled
Proportions of time spent in a particular class
N=810   sch    scl    nm     self   Skil   unsk
0.0     75.9   54.1   50.9   87.9   66.9   60.9
0.2      8.5   10.4    8.6    4.1    8.1    7.3
0.4      8.0   14.9   11.7    3.8    6.8   11.1
0.6      4.3    9.8   10.0    2.0    7.0    7.7
0.8      2.2    5.6    7.7    0.9    3.5    5.4
1.0      1.0    5.3   11.1    1.4    7.7    7.7
Year on year class transitions
        0     sch   scl    nm     self   skil   unsk   Total
0       242    12     27     29    10     14     36     370
sch       0   247     75     32     6     14     10     384
scl       0   114    725    126    15     41     30    1051
nm        0    73    202    960    14     21     45    1315
self      0     5     13     10   175     14     25     242
skil      0    21     57     27    23    597     97     822
unsk      0    18     41     82    25    108    686     960
Total   242   490   1140   1266   268    809    929    5144
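A table like this can be tabulated directly from the sequences by counting each pair of consecutive years. The sketch below is an illustration under stated assumptions rather than the original procedure: it assumes -1 marks a missing year and that a pair is skipped when either year is missing. The function name is hypothetical; the two sequences are cases 1 and 2 from the class sequences slide.

```python
from collections import Counter

MISSING = -1  # assumed missing-data code


def count_transitions(sequences, missing=MISSING):
    """Count (state in year t, state in year t+1) pairs across sequences."""
    counts = Counter()
    for seq in sequences:
        for origin, destination in zip(seq, seq[1:]):
            # Skip any pair in which either year is unobserved.
            if origin != missing and destination != missing:
                counts[(origin, destination)] += 1
    return counts


cases = [
    [3, 3, 3, 3, 3, -1, -1, -1, -1],  # case 1
    [3, 3, 3, 3, 3, 3, 2, 3, 3],      # case 2
]
transitions = count_transitions(cases)
print(dict(transitions))  # {(3, 3): 10, (3, 2): 1, (2, 3): 1}
```

Stays such as (3, 3) land on the diagonal of the table and moves such as (3, 2) land off it.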
Year on year class transitions: off diagonal proportions (N = 1512)
          SCH   SCL   NM   Self   Skilled   Unsk
None        1     2    2      1         1      2
SCH         -     5    2      0         1      1
SCL         8     -    8      1         3      2
NM          5    13    -      1         1      3
Self        0     1    1      -         1      2
Skilled     1     4    2      2         -      6
Unsk        1     3    5      2         7      -
Total 100
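The off-diagonal proportions follow from the transition counts by dropping the diagonal (no change) and expressing each remaining cell as a percentage of all moves. A minimal sketch, reading rows as the earlier year and columns as the later year (which the all-zero column for "no job yet" suggests); the function and variable names are illustrative.

```python
labels = ["none", "sch", "scl", "nm", "self", "skil", "unsk"]

# Year on year class transition counts, rows = earlier year (assumed).
counts = [
    [242,  12,  27,  29,  10,  14,  36],
    [  0, 247,  75,  32,   6,  14,  10],
    [  0, 114, 725, 126,  15,  41,  30],
    [  0,  73, 202, 960,  14,  21,  45],
    [  0,   5,  13,  10, 175,  14,  25],
    [  0,  21,  57,  27,  23, 597,  97],
    [  0,  18,  41,  82,  25, 108, 686],
]


def off_diagonal_percentages(matrix):
    """Return the number of moves and each move as a rounded % of all moves."""
    n = len(matrix)
    moves = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    pct = [[None if i == j else round(100 * matrix[i][j] / moves)
            for j in range(n)]
           for i in range(n)]
    return moves, pct


moves, pct = off_diagonal_percentages(counts)
print(moves)                                        # 1512
print(pct[labels.index("nm")][labels.index("scl")])  # 13
```

The most common single move, from non-manual to lower service class, accounts for 13% of all changes of class.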
Class substitution costs
       None  sch   scl   nm    self  skil  unsk
None   0.0,  1.8,  1.8,  1.8,  1.8,  1.8,  1.8,
sch    1.8,  0.0,  1.2,  1.3,  1.8,  1.7,  1.7,
scl    1.8,  1.2,  0.0,  1.1,  1.7,  1.3,  1.3,
nm     1.8,  1.3,  1.1,  0.0,  1.7,  1.6,  1.3,
Self   1.8,  1.8,  1.7,  1.7,  0.0,  1.6,  1.6,
Skil   1.8,  1.7,  1.3,  1.6,  1.6,  0.0,  1.2,
unsk   1.8,  1.7,  1.3,  1.3,  1.6,  1.2,  0.0;
Cluster analysis of class sequences
An eight cluster solution produces the following:
Clus % cases description
1 17 non manual, little if any mobility
2 12 service class, lower, little mobility
3 13 unskilled, little mobility
4 12 moving from unskilled to skilled work
5 15 mixed
6 6 skilled, little mobility
7 19 upwards mobility, NM, SCL, SCH
8 5 self employed, little mobility
Single Sequences: highest qualification
C    21  22  23  24  25  26  27  28  29
1     2   2   2   2   2  -1  -1  -1  -1
2     2   2   2   2   2   2   2   1   1
3     2   2   2   2   2   2   2   2   2
4     2   1   1   1   1   1   1  -1  -1
5     5   5   5   5   5   5   5   5   5
6     3   3   3   3   3   3   3   3   3
7     2   2   2   2   2   2   2   2   2
8     3   3   2   2   2   2   2   2   2
9     3   3   3   3   3   3   3   3   2
10    2   2   2   2   2   2   2   1   1
1 = HE
2 = Post GCSE/O grade
3 = GCSE / O grade
4 = Other
5 = None/at school
Proportions of time spent in highest qualification statuses
N=831   HE     Post GCSE   GCSE   Other   None
0       79.8   35.5        72.8   89.5    92.9
0.2      0.5    7.9         1.9    0.4     0.1
0.4      2.2    8.9         1.9    0.7     0.7
0.6      1.6    4.7         2.4    1       0.4
0.8      6.4    4.7         1.9    0.2     0.6
1.0      9.6   38.3        19      8.2     5.3
Year on year changes in HEQ
        HE    A      O      Other   None   Total
HE      710      0      0      0       0     710
A       136   2426      0      0       0    2562
O         4     73   1133      0       0    1210
Other     0     15      5    485       0     505
None      0     14      0      1     264     279
Total   850   2528   1138    486     264    5266
Year on year changes in HEQ: off diagonal proportions (N = 248)
        HE   A    O   Other   None
HE       -    0   0     0       0
A       55    -   0     0       0
O        2   29   -     0       0
Other    0    6   2     -       0
None     0    6   0     0       -
Total 100
HEQ substitution costs
        None  HE    A     O     oth   none
None    0.0,  2.0,  2.0,  2.0,  2.0,  2.0,
HE      1.8,  0.0,  2.0,  2.0,  2.0,  2.0,
A       1.8,  1.1,  0.0,  2.0,  2.0,  2.0,
O       1.8,  1.8,  1.2,  0.0,  2.0,  2.0,
Other   1.8,  1.7,  1.6,  1.7,  0.0,  1.8,
None    1.8,  1.8,  1.6,  1.7,  1.7,  0.0;
Cluster analysis of HEQ sequences
A seven cluster solution produces the following:
Clus % cases description
1 17 from GCSE to post-GCSE
2 7 ‘late’ post GCSE to HE
3 30 post GCSE, stable
4 13 ‘early’ post GCSE to HE
5 6 no qualifications
6 14 GCSE, stable
7 11 other, stable
Multiple Sequence Analysis (MSA)
• combine different sequences prior to OMA processing
• e.g. class and qualifications (and housing, marital and fertility statuses) are combined in a single measure
• the sequences represent a narrative of change (or stability) on the measured dimensions
• the resulting typology can be analysed using case and variable methods as before, but is in itself a representation of complex, time-embedded associations between the source variables
Multiple Sequences: class and highest qualification
C    21  22  23  24  25  26  27  28  29
1    23  23  23  23  23  -1  -1  -1  -1
2    23  23  23  23  23  23  22  13  13
3    24  23  23  23  23  23  23  22  23
4    23  14  14  14  13  13  12  -1  -1
5    56  56  54  54  56  -1  -1  -1  -1
6    35  35  35  35  35  36  36  36  36
7    20  20  23  23  23  23  23  23  23
8    30  30  20  22  24  22  22  22  22
9    33  33  33  33  33  33  33  31  23
10   22  22  24  24  22  22  25  16  11

1st Digit:
1 = HE
2 = Post GCSE/O grade
3 = GCSE / O grade
4 = Other
5 = None/at school

2nd Digit:
0 = no job yet
1 = Service class Higher
2 = Service class Lower
3 = Non-manual
4 = Self
5 = Skilled
6 = unskilled
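The two-digit states can be built by pairing the qualification and class codes year by year. A minimal sketch, assuming -1 marks a missing year and that a year missing on either variable is treated as missing in the combined sequence (as case 4 at age 28 suggests); the function name is illustrative.

```python
MISSING = -1  # assumed missing-data code


def combine(heq_seq, class_seq, missing=MISSING):
    """Pair qualification (1st digit) and class (2nd digit) for each year."""
    combined = []
    for heq, cls in zip(heq_seq, class_seq):
        if heq == missing or cls == missing:
            combined.append(missing)  # missing on either variable
        else:
            combined.append(10 * heq + cls)
    return combined


# Cases 2 and 4 from the single-sequence slides.
heq_case2, class_case2 = [2, 2, 2, 2, 2, 2, 2, 1, 1], [3, 3, 3, 3, 3, 3, 2, 3, 3]
heq_case4, class_case4 = [2, 1, 1, 1, 1, 1, 1, -1, -1], [3, 4, 4, 4, 3, 3, 2, 1, -1]

combined_case2 = combine(heq_case2, class_case2)
combined_case4 = combine(heq_case4, class_case4)
print(combined_case2)  # [23, 23, 23, 23, 23, 23, 22, 13, 13]
print(combined_case4)  # [23, 14, 14, 14, 13, 13, 12, -1, -1]

# Five qualification codes by seven class codes give the combined state space.
all_states = [10 * heq + cls for heq in range(1, 6) for cls in range(0, 7)]
print(len(all_states))  # 35
```

Each additional variable multiplies the number of possible states, so the transition and substitution cost matrices grow quickly as more sequences are combined.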
Year on year changes
• This is a large (35 by 35) matrix
• Calculation of substitution costs as for single sequence structure
• Frequent transitions:
• 12 → 11 (2.9%)
• 13 → 12 (2.3%)
• 21 → 22 (2.6%)
• 22 → 21 (2.6%)
• 22 → 23 (4.5%)
• 23 → 22 (5.9%)
• 26 → 25 (2.4%)
Sequence analysis of class-HEQ data
Clus % description
1 11 post GCSE, NM, stable
2 8 post GCSE → HE, NM → SCL
3 5 no quals, self emp → unsk
4 10 GCSE, mixed emp (self, sk, unsk)
5 7 post GCSE, NM → SCL
6 7 GCSE, NM both stable
7 4 post GCSE skilled, both stable
8 6 from unsk and sk to SCH, HE
9 15 mixed
10 4 other quals and SCL, SCH
11 3 post GCSE, SCH/SCL switching
12 6 other and sk/unsk, stable
13 8 post GCSE, unsk stable
14 2 post GCSE, self, stable
Advantages of MSA
• Is not limited to a single sequence measure
• Is not limited to a single event type
• Articulates the full scope of related sequences together
Issues
• Increasing complexity of the measure as new variables are drawn in
• computing time / software switching
• Lack of formal rules in executing the OMA and clustering processes
• Largely exploratory: scope to develop in relation to EHA