data analysis tools and associated scientific developments
WP - C Data analysis tools
Marco Bink & Gerrit Gort
Outline
Overview Work Package C
C1: Upgrade standard tools
• Partly presented by M. Frisch, HOH
C2: Novel map-based tools
C3: Genome-wide and locus specific tools
C4: Large-data mining tools
C5: Germplasm Simulator
• Presented by M. Frisch, HOH
Concluding remarks
WP-C I Upgrading statistical analysis tools
Objective: Upgrade standard cluster and correlation tools, able to handle large data sets
Case: cluster analysis in S-Plus
  clustering based on (genetic) distance matrix
  S-Plus functions not sufficient for large data sets
• May depend on computer capacity
BigClus algorithm (Gerrit Gort, PRI)
• Written in C-code, accessible in S-Plus via dynamic link library (DLL)
WP-C I BigClus algorithm characteristics
Methods of clustering: Single link, Complete link, Average link, McQuitty’s, Ward’s
Distance measures: Euclidean, Jaccard
Allow missing values: Jaccard
Large datasets
  Ordinary dendrograms will not suffice (e.g., 5000 plants, 100 markers, Jaccard distance, UPGMA)
  Ability to look at part of the dendrogram, e.g. show first 25 clusters from top, show number of observations below each leaf.
  S-PLUS functions to plot top of tree, plot summary information on tree, like frequencies, cluster averages of covariates.
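The slide lists Jaccard as the distance measure that allows missing values. A minimal sketch of how such a distance could be computed is given below. It is written in Python rather than the C-code of BigClus, and the rule of skipping any marker that is missing in either plant is an assumption, not something the slides specify.

```python
from typing import Optional, Sequence


def jaccard_distance(a: Sequence[Optional[int]],
                     b: Sequence[Optional[int]]) -> Optional[float]:
    """Jaccard distance between two 0/1 marker profiles.

    None marks a missing score. Markers missing in either profile are
    skipped (an assumed rule, not taken from BigClus).
    """
    shared = 0  # markers scored 1 in both plants
    either = 0  # markers scored 1 in at least one plant
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        if x == 1 or y == 1:
            either += 1
            if x == 1 and y == 1:
                shared += 1
    if either == 0:
        return None  # no informative marker left: distance undefined
    return 1.0 - shared / either


# Hypothetical marker scores for two plants
plant_1 = [1, 1, 0, None, 1]
plant_2 = [1, 0, 0, 1, None]
print(jaccard_distance(plant_1, plant_2))  # 0.5
```

Only the first three markers are scored in both plants; one of the two informative markers is shared, which gives a distance of 0.5.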
WP-C I Dendrograms (from BigClus)
WP-C II Novel map-based tools
Two important issues
  Account for genetic linkage map information
    Consider molecular markers to be dependent variables
  Combine information from (a) trait characteristics, (b) passport data and (c) molecular markers
Map-based diversity tools, cluster & correlation analysis software
Core selection
WP-C II Account for genetic linkage maps
Unlinked markers
Loosely linked markers
Closely linked markers
Genetic distances
Rationale: Data on genetic markers are likely correlated due to the underlying genetic map
  Utilise correlation structure?
  Account for correlation!
• Allow different weights for markers
Unequal distribution of markers across genome
WP-C II Account for genetic linkage maps
Correlation among linked markers:
  erodes with increasing number of meioses separating two individuals, due to recombination
  increases due to linkage disequilibrium (non-random mating / selection pressure)
Use all available markers
  calculate weights for every marker locus
• Partial regression coefficients (Zeng, PNAS ’93)
• Meioses factor (Mf) = Expected average number of meioses separating two individuals
WP-C II Account for genetic linkage maps
Example!
[Figure: marker weights (W) and effective number of markers (Meff) for three marker configurations]
Unlinked markers: W = 1.0; Meff = 5.0
Loosely linked markers: W = 0.5 and 0.7; Meff = 2.9
Closely linked markers: W = 0.2 and 0.3; Meff = 1.2
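The slides do not define Meff. The example values are, however, consistent with reading Meff as the sum of the marker weights over five markers per group, with the two end markers carrying the larger weight. The sketch below checks that arithmetic. Both the five-marker layout and the assignment of weights to end and interior markers are inferred from the numbers, not stated in the source.

```python
# Weights per marker, as read from the slide example.
# Assumption: five markers per group, end markers carry the larger weight.
weights = {
    "unlinked": [1.0, 1.0, 1.0, 1.0, 1.0],
    "loosely linked": [0.7, 0.5, 0.5, 0.5, 0.7],
    "closely linked": [0.3, 0.2, 0.2, 0.2, 0.3],
}


def effective_markers(w):
    """Effective number of markers, taken here as the sum of the weights."""
    return sum(w)


for group, w in weights.items():
    print(f"{group}: Meff = {effective_markers(w):.1f}")
```

Under these assumptions the sums come out at 5.0, 2.9 and 1.2, matching the Meff values on the slide.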
WP-C II Combine passport, trait & marker info
S-Plus software offers a very limited possibility to combine different types of data
  Function “Daisy()” applies normalization to all data variables, no specification of weights across variables
Improve/extend function “Daisy()”
  Allow user-defined weights for every variable
  S-Plus function WeightedDaisy()
• E.g., use weights for markers (from S-Plus function WeightMap() )
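To illustrate what user-defined weights per variable can mean for mixed data, here is a minimal Python sketch of a weighted Gower-type dissimilarity. It is not the WeightedDaisy() code. The function name, the variable kinds and the example values are hypothetical, and markers are treated as simple nominal variables for brevity.

```python
from typing import Optional, Sequence


def weighted_dissimilarity(a: Sequence, b: Sequence,
                           kinds: Sequence[str],
                           ranges: Sequence[float],
                           weights: Sequence[float]) -> Optional[float]:
    """Weighted average of per-variable dissimilarities (Gower-type).

    kinds[i] is "numeric" or "nominal"; ranges[i] is the observed range of
    a numeric variable (ignored for nominal ones). Variables with a missing
    value (None) in either record are skipped.
    """
    num = 0.0
    den = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if x is None or y is None:
            continue
        if kind == "numeric":
            d = abs(x - y) / rng if rng > 0 else 0.0  # scaled to [0, 1]
        else:
            d = 0.0 if x == y else 1.0  # simple matching
        num += w * d
        den += w
    return num / den if den > 0 else None


# Hypothetical accessions: one trait, one passport field, two markers
kinds = ["numeric", "nominal", "nominal", "nominal"]
ranges = [10.0, 1.0, 1.0, 1.0]
acc_1 = [4.0, "NL", 1, 0]
acc_2 = [9.0, "NL", 0, 0]

equal = weighted_dissimilarity(acc_1, acc_2, kinds, ranges, [1.0, 1.0, 1.0, 1.0])
mapped = weighted_dissimilarity(acc_1, acc_2, kinds, ranges, [1.0, 1.0, 0.5, 0.2])
print(equal, mapped)
```

With equal weights the two accessions differ by 0.375. Down-weighting the two markers, as map-based weights would, lowers the contribution of the marker mismatch.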
WP-C II Multiple sources of data for cluster analysis
[Figure: three dendrograms of accessions numbered 1 to 442 (vertical axis: Height), clustered on phenotypes, AFLP markers and MS markers. Panel annotations: "Poor distinction", "Poor distinction", "Fair distinction".]
WP-C II Combining multiple sources of data
[Figure: two dendrograms of accessions numbered 1 to 442 (vertical axis: Height). Panels: standard weights (daisy()) and user-defined weights (weighteddaisy()).]
WP-C II Example marker weights
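[Figure: linkage map with example marker weights. Markers and map positions per linkage group:]
Group 1: m04 (44.8), m05 (47.3), m10 (113.3), m13 (151.4)
Group 2: m15 (30.2), m16 (35.6), m18 (60.7), m19 (70.2), m20 (92.5), m22 (111.3), m23 (131.2), m24 (141.5), m25 (151.4), m26 (166.4), m29 (198.7), m30 (199.3)
Group 3: m31 (0.1), m33 (6.6), m34 (41.8), m38 (67.6), m39 (77.0), m42 (115.9), m43 (124.0), m44 (126.7)
Group 4: m46 (8.8), m51 (81.0)
Group 5: m52 (10.6), m53 (12.5), m54 (38.1), m55 (43.2), m56 (57.1), m58 (87.7), m59 (90.9), m60 (100.1), m61 (100.2), m62 (101.4)
Group 6: m66 (5.8), m67 (10.0), m69 (14.1), m70 (15.9), m74 (45.9), m75 (48.5), m76 (61.2), m77 (70.5), m78 (71.1), m79 (72.5), m80 (73.8)
Group 7: m81 (3.0), m82 (20.4)
Group 8: m88 (36.8), m89 (47.5), m90 (57.9), m91 (60.4)
Group 9: m93 (14.4), m94 (21.4), m95 (27.6)
Weights shown in the figure: 1.00, 0.36, 0.54, 0.52, 0.13, 0.18, 0.43, 1.00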
WP-C II Results of cluster analyses with & without weights
[Figure: two dendrograms of 100 accessions (vertical axis: Height), one with map-based weights and one with standard weights. Accessions highlighted in both panels: 1, 63, 95; 10, 11, 57; 55; 74.]
WP-C II (next step) Core Selection
Form N (e.g., 6) distinct groups
  cluster analysis tree
  Cut tree at arbitrary level
  Our example: group sizes
• No weights: 81, 7, 6, 2, 2, and 2
• Map-based weights: 4, 7, 81, 2, 4, and 2
Sample/select from each group a given number
  Define core selection, e.g., 12
Sampling strategy    Standard         Map-based
• Constant           [2 2 2 2 2 2]    [2 2 2 2 2 2]
• Proportional       [7 1 1 1 1 1]    [1 1 7 1 1 1]
• Log-proportional   [5 2 2 1 1 1]    [1 2 5 1 2 1]
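The sampling strategies can be sketched as one allocation routine that differs only in how a group size is turned into a weight. The Python sketch below uses largest-remainder rounding, which is an assumption: the slide does not state its rounding rule, so other allocations on the slide (for instance the proportional one) may not be reproduced by this routine.

```python
import math
from typing import Callable, List


def allocate(sizes: List[int], total: int,
             weight: Callable[[int], float]) -> List[int]:
    """Split `total` picks over groups in proportion to weight(size).

    Rounding uses the largest-remainder rule (assumed, not from the slides).
    Picks are not capped at the group size.
    """
    w = [weight(n) for n in sizes]
    quotas = [total * x / sum(w) for x in w]
    picks = [math.floor(q) for q in quotas]
    # Leftover picks go to the groups with the largest fractional parts.
    order = sorted(range(len(sizes)),
                   key=lambda i: quotas[i] - picks[i], reverse=True)
    for i in order[: total - sum(picks)]:
        picks[i] += 1
    return picks


sizes = [81, 7, 6, 2, 2, 2]  # group sizes without weights, from the slide
constant = allocate(sizes, 12, lambda n: 1.0)
log_prop = allocate(sizes, 12, math.log)
print(constant)  # [2, 2, 2, 2, 2, 2]
print(log_prop)  # [5, 2, 2, 1, 1, 1]
```

For the unweighted grouping this reproduces the constant and log-proportional rows of the slide. A group of size 1 has log weight 0 and would receive no pick unless a minimum per group is added.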
WP-C II Core selection (log-proportional sampling)
[Figure: dendrograms of 100 accessions (vertical axis: Height) with a cut-off line; tree from map-based clustering.]
12 accessions selected from 6 clusters: 3; 42; 24, 77; 4, 23; 5; 14, 25, 68, 94, 95
WP-C III Genome-wide and Locus-specific mapping
Objective was to develop novel map-based tools for searching systematically for useful genes and alleles in germplasm collections
Genome-wide search
Tagged loci search (fine mapping)
WP-C III Genome-wide mapping
Marker-marker association
  Assemble genome-wide map of AFLP markers (no map available)
  Only few markers could be mapped last summer (KeyGene)
  Are high associations indicative of distance between markers on the genome?
Marker-trait association
  More interesting to associate markers to traits, e.g. Bremia resistance, to map genes coding for the trait
  But: if high associations between markers are not indicative of distance between markers, does it make sense to associate markers to traits?
WP-C III Retrieval of linkage map from genome-wide pair-wise marker associations
Multi Dimensional Scaling (MDS)
  One-dimensional representation of markers from pair-wise distances is achieved, corresponding to a marker map.
  Correction for population structure is very important
• Logistic regression correction by stratification
  Three types of MDS (S-PLUS) evaluated
• Classical (= PCO = Principal Coordinate Analysis)
• Kruskal's (= non-metric MDS)
• Sammon’s MDS (minimizes weighted “stress”) (performs best)
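As an illustration of how a one-dimensional MDS turns pair-wise distances into a marker order, here is a minimal Python sketch of the classical variant (principal coordinate analysis), chosen because it is the simplest of the three. It is not the S-PLUS code used in the project and not Sammon's method. The marker positions are hypothetical, and the power iteration assumes the leading eigenvalue of the centred matrix is positive.

```python
from typing import List


def classical_mds_1d(dist: List[List[float]], iterations: int = 200) -> List[float]:
    """One-dimensional classical MDS of a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    # Power iteration for the leading eigenvector of B
    v = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm
    # Coordinates: eigenvector scaled by the square root of the eigenvalue
    return [eigenvalue ** 0.5 * x for x in v]


# Hypothetical map positions of four markers, turned into distances
positions = [0.0, 10.0, 25.0, 40.0]
dist = [[abs(p - q) for q in positions] for p in positions]
coords = classical_mds_1d(dist)
order = sorted(range(len(coords)), key=coords.__getitem__)
print(order)
```

The recovered coordinates reproduce the pair-wise distances, so the marker order is recovered up to reflection of the map. With real association-based distances the fit is only approximate.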
WP-C III Example MDS to form linkage map
WP-C III Resolution of QTL (fine) mapping
Experiments of linkage analysis
  2 or 3 generations of individuals
  limited number of meioses in experiment
  dense marker maps hardly improve QTL map resolution
• Even with RIL populations: 5 - 10 centiMorgan
Higher resolution desired
  to allow better (molecular) study of gene involved
• cloning, comparative mapping, etc.
  identify tightly linked markers
• more efficient marker-assisted breeding
WP-C III Locus specific (Fine) mapping
Linkage disequilibrium mapping successful in mapping genetic disorders:
= Identify a chromosomal region that is identical by descent (IBD) among diseased individuals (region may contain disease gene)
The IBD region is detected by closely linked marker loci that carry identical alleles at this region in the diseased individuals.
Size of IBD region decreases with the number of meioses since the disease mutation occurred and may be small.
This leads to the detection of a small region containing the disease gene.
Key-paper: Meuwissen & Goddard (2000) Genetics 155:421-430
WP-C III Methodology LD fine-mapping of QTL
QTL position known up to 5 - 20 cM precision
effective population size for many discrete generations
phenotypes available for last generation of individuals
Fully inbred individuals (selfed by single seed descent)
(1) Expected correlation matrix among marker haplotypes
  Whether two marker haplotypes have identical alleles in a region depends on the position of the QTL. Hence, the covariance between haplotype effects depends on the position of the QTL.
  Identity By Descent (IBD) probability
(2) Maximum Likelihood estimation of QTL position
  Linear model (phenotypes normally distributed)
  ML estimates for each marker interval
WP-C III Calculate power of QTL fine-mapping
20 markers = 19 intervals
• Simulation: QTL between marker 10 and 11
• Estimated interval: Interval with highest ML estimate
– ML_base = ML without QTL
– ML_QTL,I = ML with QTL in interval I
– Test statistic: ML_QTL,I - ML_base
Deviations of estimated interval from true interval
• -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
Power: % replicates in true or next to true interval (interval -1, 0, 1)
1000 replicates/scenario
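The power definition above can be written down directly. The Python sketch below picks the interval with the highest test statistic and counts a replicate as a hit when the estimate lies in the true interval or one next to it. The function names and the example values are hypothetical; the actual simulations were run with separate C modules.

```python
from typing import List, Sequence


def estimated_interval(ml_qtl: Sequence[float], ml_base: float) -> int:
    """Index (0-based) of the interval with the highest test statistic."""
    stats = [m - ml_base for m in ml_qtl]
    return max(range(len(stats)), key=stats.__getitem__)


def power(deviations: List[int], tolerance: int = 1) -> float:
    """Fraction of replicates with the estimate within `tolerance` intervals."""
    hits = sum(1 for d in deviations if abs(d) <= tolerance)
    return hits / len(deviations)


# Hypothetical deviations (estimated minus true interval) for 10 replicates
example = [0, 0, 1, -1, 0, 3, 0, -2, 0, 1]
print(power(example))  # 0.8
```

With 19 intervals and the QTL between markers 10 and 11, the true interval has 0-based index 9, so the deviation of a replicate is `estimated_interval(...) - 9`.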
[Figure: histogram of the deviation from the true interval (x-axis: -9 to 9; y-axis: frequency, 0 to 600) for 1000 replicates. Power = 0.91]
WP-C III Results of power from simulations
[Figure: power (axis from 0.40 to 0.80) by number of generations and marker distance. Tick values shown: 25, 50, 100 and 2, 5, 10, 20.]
WP-C III Locus-specific search
Separate modules (C language) for
  Calculation of IBD probabilities (= expected correlation matrix)
  Simulation of data sets & Max. Likelihood estimation
Paper on methodology and simulation results
  Bink & Meuwissen (2004) Euphytica, in press
WP-C IV Large-data mining tools
Objective was to find important patterns within the germplasm data set, which are not apparent from visual analysis and to compare and contrast these patterns with those found from the classical statistical analyses
WP-C IV Large-data mining tools
Methods
Data Mining methods (JIC)
• Decision Trees, built with C4.5 (Quinlan) [ DAM ]
• Rule induction, Simulated Annealing: Witness Miner 2001
Artificial Neural Networks (PRI)
• Learning Vector Quantisation [ LVQ ]
• Support Vector Machines [ SVM ]
(Classical) Statistical analysis (PRI)
• LDA/Linear Regression [ CS ]
WP-C IV Large-data mining ‘data set’
Data: 1423 Lactuca sativa accessions, CGN
  X1: 167 AFLP markers
  X2: 20 (2x10) STMS markers
  Y: 5 traits, all treated as categorical variables
• Y1: [n=1413] seed colour (black, white, varied)
• Y2: [n= 761] flowering time (< 41 d., 41-60, …. 101-120 d.)
• Y3: [n=1208] leaf colour (yellow, green, grey, blue, red)
• Y4: [n= 927] resistance to Bremia 1 (resistant, susceptible)
• Y5: [n= 919] resistance to Bremia 3 (resistant, susceptible)
Data split into training and test sample (50 - 50)
Objective: use X to predict Y
Criteria: coverage, accuracy, applicability
            Rule:
Reality:    A       B       Tot   Cov
A           10      20      30    10/30
B           0       20      20    20/20
Tot         10      40      50
Accur       10/10   20/40
Applic      10/50   40/50
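The three criteria follow from the row and column totals of the table. The Python sketch below recomputes them from the counts on the slide, reading rows as reality and columns as the class predicted by the rule. The function name `criteria` is hypothetical.

```python
from fractions import Fraction

# Rows: reality; columns: class predicted by the rule (counts from the slide)
table = {"A": {"A": 10, "B": 20},
         "B": {"A": 0, "B": 20}}
classes = ["A", "B"]


def criteria(table, k):
    """Return (coverage, accuracy, applicability) for class k."""
    total = sum(table[r][c] for r in classes for c in classes)
    real_k = sum(table[k][c] for c in classes)  # row total: truly k
    pred_k = sum(table[r][k] for r in classes)  # column total: rule says k
    coverage = Fraction(table[k][k], real_k)    # share of real k that is found
    accuracy = Fraction(table[k][k], pred_k)    # share of predictions k that is right
    applicability = Fraction(pred_k, total)     # share of cases where rule says k
    return coverage, accuracy, applicability


for k in classes:
    cov, acc, app = criteria(table, k)
    print(k, cov, acc, app)
```

Fractions are reduced automatically, so 10/30 prints as 1/3 and 40/50 as 4/5. A class that the rule never predicts would give a zero column total, which this sketch does not handle.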
Results: Resistance to Bremia 1
Performance across all traits:
  LVQ lowest
  CS good
  SVM & DAM best
Note: differences not very large!
Trade-off between increasing accuracy and maintaining coverage/applicability!
Concluding remarks
Novel statistical tools available
Answer questions you could not answer before?
Applicability & integration with other WP–tools
© Wageningen UR