Data analysis tools and associated scientific developments

WP - C Data analysis tools

Marco Bink & Gerrit Gort

Outline

Overview Work Package C

C1: Upgrade standard tools

• Partly presented by M. Frisch, HOH

C2: Novel map-based tools

C3: Genome-wide and locus-specific tools

C4: Large-data mining tools

C5: Germplasm Simulator

• Presented by M. Frisch, HOH

Concluding remarks

WP-C I Upgrading statistical analysis tools

Objective: upgrade the standard cluster and correlation tools so that they can handle large data sets

Case: cluster analysis in S-Plus

• Clustering is based on a (genetic) distance matrix
• S-Plus functions are not sufficient for large data sets (this may depend on computer capacity)

BigClus algorithm (Gerrit Gort, PRI)

• Written in C, accessible in S-Plus via a dynamic link library (DLL)

WP-C I BigClus algorithm characteristics

Methods of clustering: single link, complete link, average link, McQuitty’s, Ward’s

Distance measures: Euclidean, Jaccard

Allow missing values: Jaccard

Large data sets: ordinary dendrograms will not suffice (e.g., 5000 plants, 100 markers, Jaccard distance, UPGMA)

• Ability to look at part of the dendrogram, e.g. show the first 25 clusters from the top and the number of observations below each leaf
• S-PLUS functions to plot the top of the tree and to plot summary information on the tree, such as frequencies and cluster averages of covariates
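To make the distance step concrete, here is a minimal Python sketch of a Jaccard distance on 0/1 marker profiles that skips missing scores. It is not the BigClus C code; the function names, the use of `None` for a missing score and the handling of two empty profiles are assumptions made for illustration. The pair count shows why a full distance matrix becomes heavy for the 5000-plant example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two 0/1 marker profiles.

    None marks a missing score; a marker missing in either plant is skipped
    (an assumed convention, not taken from BigClus).
    """
    both = either = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # ignore markers that are missing in either profile
        if x == 1 and y == 1:
            both += 1
        if x == 1 or y == 1:
            either += 1
    if either == 0:
        return 0.0  # assumption: two profiles without any band count as identical
    return 1.0 - both / either


def n_pairs(n_plants):
    """Number of pairwise distances in the lower triangle of the distance matrix."""
    return n_plants * (n_plants - 1) // 2


def distance_matrix(profiles):
    """Lower-triangular distances: entry [i][j] holds the distance for j < i."""
    return [[jaccard_distance(profiles[i], profiles[j]) for j in range(i)]
            for i in range(len(profiles))]


if __name__ == "__main__":
    plants = [[1, 1, 0, None], [1, 0, 0, 1], [0, 0, 1, 1]]
    print(distance_matrix(plants))
    print(n_pairs(5000))  # pairwise distances needed for 5000 plants
```

Storing only the lower triangle halves the memory needed, which matters once the number of plants runs into the thousands.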

WP-C I Dendrograms (from BigClus)

WP-C II Novel map-based tools

Two important issues

• Account for genetic linkage map information: consider molecular markers to be dependent variables
• Combine information from (a) trait characteristics, (b) passport data and (c) molecular markers

Map-based diversity tools, cluster & correlation analysis software

Core selection

WP-C II Account for genetic linkage maps

[Figure: three marker configurations: unlinked markers, loosely linked markers, closely linked markers]

Genetic distances

Rationale: data on genetic markers are likely to be correlated because of the underlying genetic map

Utilise the correlation structure? Account for the correlation!

• Allow different weights for markers
• Unequal distribution of markers across the genome

Correlation among linked markers

• erodes with an increasing number of meioses separating two individuals, due to recombination
• increases due to linkage disequilibrium (non-random mating / selection pressure)

Use all available markers and calculate a weight for every marker locus

• Partial regression coefficients (Zeng, PNAS ’93)
• Meioses factor (Mf) = expected average number of meioses separating two individuals


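[Figure: example of marker weights (W) and effective number of markers (Meff) for three marker configurations]

Unlinked markers: W = 1.0 and W = 1.0; Meff = 5.0
Loosely linked markers: W = 0.5 and W = 0.7; Meff = 2.9
Closely linked markers: W = 0.2 and W = 0.3; Meff = 1.2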
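The W and Meff values in the example are consistent with reading Meff as the sum of the marker weights over five markers. The short Python check below rests on that reading, which is an assumption; so is the placement of the larger weight on the two end markers, and the function name is hypothetical.

```python
import math


def effective_marker_number(weights):
    """Assumed definition: Meff is the sum of the per-marker weights."""
    return sum(weights)


# Five markers per configuration (assumption). The larger weight is placed on
# the two end markers, which have a linked neighbour on one side only.
unlinked = [1.0, 1.0, 1.0, 1.0, 1.0]
loosely_linked = [0.7, 0.5, 0.5, 0.5, 0.7]
closely_linked = [0.3, 0.2, 0.2, 0.2, 0.3]

if __name__ == "__main__":
    for name, w in [("unlinked", unlinked),
                    ("loosely linked", loosely_linked),
                    ("closely linked", closely_linked)]:
        print(name, round(effective_marker_number(w), 1))
    assert math.isclose(effective_marker_number(loosely_linked), 2.9)
```

Under this reading, five closely linked markers carry little more information than a single independent marker.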

WP-C II Combine passport, trait & marker info

S-Plus software offers only a very limited possibility to combine different types of data

• Function “Daisy()” applies normalization to all data variables, with no specification of weights across variables

Improve/extend function “Daisy()”

• Allow user-defined weights for every variable: S-Plus function WeightedDaisy()
• E.g., use weights for markers (from S-Plus function WeightMap())
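The idea of user-defined weights can be sketched in Python as a weighted Gower-type dissimilarity for mixed data. This is an illustration under assumptions, not the WeightedDaisy() code: the function name, the argument layout and the treatment of missing values are hypothetical.

```python
def weighted_gower(x, y, kinds, ranges, weights):
    """Weighted Gower-type dissimilarity between two records.

    kinds[k]   -- "numeric" or "nominal" for variable k
    ranges[k]  -- range of a numeric variable over the data set (None if nominal)
    weights[k] -- user-defined weight of variable k (e.g. a map-based marker weight)
    Variables missing (None) in either record are left out (assumed convention).
    """
    num = den = 0.0
    for xk, yk, kind, rng, w in zip(x, y, kinds, ranges, weights):
        if xk is None or yk is None:
            continue
        if kind == "numeric":
            d = abs(xk - yk) / rng if rng else 0.0  # range-normalised difference
        else:
            d = 0.0 if xk == yk else 1.0            # simple matching
        num += w * d
        den += w
    return num / den if den else float("nan")


if __name__ == "__main__":
    kinds = ("numeric", "nominal", "nominal")   # trait, passport item, marker
    ranges = (40.0, None, None)
    a = (10.0, "green", 1)
    b = (20.0, "green", 0)
    print(weighted_gower(a, b, kinds, ranges, (1.0, 1.0, 1.0)))  # equal weights
    print(weighted_gower(a, b, kinds, ranges, (1.0, 1.0, 0.5)))  # marker down-weighted
```

Setting all weights to 1 gives the unweighted dissimilarity; lowering the weight of a marker reduces its share of the total.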

WP-C II Multiple sources of data for cluster analysis

[Figure: three dendrograms of the same 442 accessions (leaves numbered 1 to 442; vertical axis: Height), based on phenotypes, AFLP markers and MS markers. Panel annotations: poor distinction, poor distinction, fair distinction.]

WP-C II Combining multiple sources of data

[Figure: two dendrograms of the 442 accessions (vertical axis: Height), one with standard weights (daisy()) and one with user-defined weights (weighteddaisy()).]

WP-C II Example marker weights

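[Figure: linkage map showing marker positions on nine linkage groups, annotated with example marker weights]

Linkage group 1: m04 (44.8), m05 (47.3), m10 (113.3), m13 (151.4)
Linkage group 2: m15 (30.2), m16 (35.6), m18 (60.7), m19 (70.2), m20 (92.5), m22 (111.3), m23 (131.2), m24 (141.5), m25 (151.4), m26 (166.4), m29 (198.7), m30 (199.3)
Linkage group 3: m31 (0.1), m33 (6.6), m34 (41.8), m38 (67.6), m39 (77.0), m42 (115.9), m43 (124.0), m44 (126.7)
Linkage group 4: m46 (8.8), m51 (81.0)
Linkage group 5: m52 (10.6), m53 (12.5), m54 (38.1), m55 (43.2), m56 (57.1), m58 (87.7), m59 (90.9), m60 (100.1), m61 (100.2), m62 (101.4)
Linkage group 6: m66 (5.8), m67 (10.0), m69 (14.1), m70 (15.9), m74 (45.9), m75 (48.5), m76 (61.2), m77 (70.5), m78 (71.1), m79 (72.5), m80 (73.8)
Linkage group 7: m81 (3.0), m82 (20.4)
Linkage group 8: m88 (36.8), m89 (47.5), m90 (57.9), m91 (60.4)
Linkage group 9: m93 (14.4), m94 (21.4), m95 (27.6)

Example weights shown in the figure: 1.00, 0.36, 0.54, 0.52, 0.13, 0.18, 0.43, 1.00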

WP-C II Results of cluster analyses with & without weights

[Figure: two dendrograms of 100 accessions (vertical axis: Height), one with map-based weights and one with standard weights. Accessions 1, 63, 95, 10, 11, 57, 55 and 74 are labelled in both trees.]

WP-C II (next step) Core Selection

Form N (e.g., 6) distinct groups

• Cluster analysis tree
• Cut the tree at an arbitrary level
• Our example: group sizes
  – No weights: 81, 7, 6, 2, 2, and 2
  – Map-based weights: 4, 7, 81, 2, 4, and 2

Sample/select a given number from each group

• Define the core selection, e.g., 12

Sampling strategy    Standard         Map-based
Constant             [2 2 2 2 2 2]    [2 2 2 2 2 2]
Proportional         [7 1 1 1 1 1]    [1 1 7 1 1 1]
Logproportional      [5 2 2 1 1 1]    [1 2 5 1 2 1]
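The three sampling strategies can be sketched in Python as follows. The slide does not state how fractional quotas are rounded, so the rule used here (largest-remainder rounding, at least one accession per group, surplus taken from the largest allocation) is an assumption, and the function names are hypothetical. With the standard group sizes it reproduces the allocations in the example.

```python
import math


def allocate(weights, n):
    """Split n picks over groups in proportion to weights (assumed rounding rule)."""
    total = sum(weights)
    quotas = [n * w / total for w in weights]
    alloc = [math.floor(q) for q in quotas]
    # hand out the remaining picks to the largest fractional remainders
    order = sorted(range(len(weights)),
                   key=lambda i: quotas[i] - alloc[i], reverse=True)
    for i in order[: n - sum(alloc)]:
        alloc[i] += 1
    # every group contributes at least one accession (assumes n >= number of groups)
    alloc = [max(1, a) for a in alloc]
    # take any surplus from the largest allocation
    while sum(alloc) > n:
        alloc[alloc.index(max(alloc))] -= 1
    return alloc


def core_allocation(sizes, n, strategy):
    """Number of accessions to sample per group for a core selection of size n."""
    if strategy == "constant":
        weights = [1.0] * len(sizes)
    elif strategy == "proportional":
        weights = [float(s) for s in sizes]
    elif strategy == "logproportional":
        weights = [math.log(s) for s in sizes]
    else:
        raise ValueError("unknown strategy: " + strategy)
    return allocate(weights, n)


if __name__ == "__main__":
    standard = [81, 7, 6, 2, 2, 2]
    for s in ("constant", "proportional", "logproportional"):
        print(s, core_allocation(standard, 12, s))
```

For the map-based group sizes the two groups of size 4 have identical logproportional quotas, so which of them receives the extra accession depends on tie-breaking.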

WP-C II Core selection (logproportional sampling)

[Figure: two dendrograms of the 100 accessions (vertical axis: Height). The tree from map-based clustering is cut at a cut-off line; 12 accessions are selected from 6 clusters. Selected accessions as labelled in the figure: 3; 42; 24, 77; 4, 23; 5; 14, 25, 68, 94, 95.]

WP-C III Genome-wide and locus-specific mapping

Objective: develop novel map-based tools for searching systematically for useful genes and alleles in germplasm collections

• Genome-wide search
• Tagged loci search (fine mapping)

WP-C III Genome-wide mapping

Marker-marker association

• Assemble a genome-wide map of AFLP markers (no map available)
• Only a few markers could be mapped last summer (KeyGene)
• Are high associations indicative of the distance between markers on the genome?

Marker-trait association

• It is more interesting to associate markers with traits, e.g. Bremia resistance, in order to map the genes coding for the trait
• But: if high associations between markers are not indicative of the distance between markers, does it then make sense to associate markers with traits?

WP-C III Retrieval of linkage map from genome-wide pair-wise marker associations

Multi-Dimensional Scaling (MDS)

• A one-dimensional representation of the markers is obtained from the pair-wise distances, corresponding to a marker map
• Correction for population structure is very important
  – Logistic regression correction by stratification
• Three types of MDS (S-PLUS) evaluated
  – Classical (= PCO = Principal Coordinate Analysis)
  – Kruskal's (= non-metric MDS)
  – Sammon’s MDS (minimizes weighted “stress”) (performs best)
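As an illustration of how a one-dimensional marker order can be recovered from pair-wise distances, the Python sketch below implements classical MDS (PCO) restricted to the first axis, using power iteration. It is not the S-PLUS code used in the project, and it shows the classical variant rather than Sammon's MDS; the function name and the start vector are assumptions.

```python
import math


def classical_mds_1d(dist, iterations=200):
    """First principal coordinate of a symmetric distance matrix.

    Coordinates are centred and determined only up to sign.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary start vector (assumption)
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n  # all distances zero, or an unlucky start vector
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue belonging to v
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]


if __name__ == "__main__":
    positions = [0.0, 10.0, 25.0, 40.0]  # hypothetical marker positions
    dist = [[abs(a - b) for b in positions] for a in positions]
    print(classical_mds_1d(dist))
```

For distances that are exactly one-dimensional the original spacing is recovered; with association-based distances from real data the result is only an approximate ordering.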

WP-C III Example MDS to form linkage map

WP-C III Resolution of QTL (fine) mapping

Experiments of linkage analysis

• 2 or 3 generations of individuals
• Limited number of meioses in the experiment
• Dense marker maps hardly improve the map resolution of QTL
  – Even with RIL populations: 5 - 10 centiMorgan

Higher resolution is desired

• to allow better (molecular) study of the gene involved
  – cloning, comparative mapping, etc.
• to identify tightly linked markers
  – more efficient marker-assisted breeding

WP-C III Locus specific (Fine) mapping

Linkage disequilibrium mapping has been successful in mapping genetic disorders:

• Identify a chromosomal region that is identical by descent (IBD) among diseased individuals (the region may contain the disease gene)
• The IBD region is detected by closely linked marker loci that carry identical alleles at this region in the diseased individuals
• The size of the IBD region decreases with the number of meioses since the disease mutation occurred, and may be small
• This leads to the detection of a small region containing the disease gene

Key paper: Meuwissen & Goddard (2000) Genetics 155:421-430

WP-C III Methodology LD fine-mapping of QTL

• QTL position known up to 5 - 20 cM precision
• Effective population size for many discrete generations
• Phenotypes available for the last generation of individuals
• Fully inbred individuals (selfed by single seed descent)

(1) Expected correlation matrix among marker haplotypes

• Whether two marker haplotypes have identical alleles in a region depends on the position of the QTL. Hence, the covariance between haplotype effects depends on the position of the QTL.
• Identity By Descent (IBD) probability

(2) Maximum Likelihood estimation of QTL position

• Linear model (phenotypes normally distributed)
• ML estimates for each marker interval

WP-C III Calculate power of QTL fine-mapping

20 markers = 19 intervals

• Simulation: QTL between marker 10 and 11
• Estimated interval: interval with the highest ML estimate
  – ML_base = ML without QTL
  – ML_QTL,I = ML with QTL in interval I
  – Test statistic: ML_QTL,I - ML_base

Deviations of the estimated interval from the true interval

• -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

Power = % of replicates in the true or next-to-true interval (deviation -1, 0 or 1)

1000 replicates per scenario
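The power calculation can be sketched in Python as follows. The function names, the numbering of intervals from 1 and the toy numbers are assumptions for illustration; the ML values themselves would come from the separate estimation step.

```python
def best_interval(ml_qtl, ml_base):
    """Interval (numbered from 1) with the highest test statistic ML_QTL,I - ML_base."""
    stats = [ml - ml_base for ml in ml_qtl]
    return max(range(len(stats)), key=stats.__getitem__) + 1


def fine_mapping_power(estimated, true_interval, tolerance=1):
    """Fraction of replicates whose estimated interval deviates by at most
    `tolerance` intervals from the true interval."""
    hits = sum(1 for e in estimated if abs(e - true_interval) <= tolerance)
    return hits / len(estimated)


if __name__ == "__main__":
    # toy example: estimated intervals from ten replicates, true interval = 10
    estimates = [10, 11, 9, 14, 10, 3, 10, 12, 10, 11]
    print(fine_mapping_power(estimates, true_interval=10))
    print(best_interval([-105.0, -101.5, -103.2], ml_base=-106.0))
```

With 1000 replicates per scenario, the same function applied to the 1000 estimated intervals gives the power reported for that scenario.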

WP-C III Results of power from simulations

[Figure: histogram of the deviation of the estimated interval from the true interval (-9 to 9) against frequency (0 to 600) for 1000 replicates; Power = 0.91]

[Figure: power (axis 0.40 to 0.80) plotted against generations and marker distance; tick values 25, 50, 100 and 2, 5, 10, 20]

WP-C III Locus-specific search

Separate modules (C language) for

• Calculation of IBD probabilities (= expected correlation matrix)
• Simulation of data sets & Maximum Likelihood estimation

Paper on methodology and simulation results: Bink & Meuwissen (2004) Euphytica, in press

WP-C IV Large-data mining tools

Objective was to find important patterns within the germplasm data set, which are not apparent from visual analysis and to compare and contrast these patterns with those found from the classical statistical analyses

WP-C IV Large-data mining tools

Methods

Data mining methods (JIC)

• Decision trees, built with C4.5 (Quinlan) [ DAM ]
• Rule induction, simulated annealing: Witness Miner 2001

Artificial Neural Networks (PRI)

• Learning Vector Quantisation [ LVQ ]
• Support Vector Machines [ SVM ]

(Classical) statistical analysis (PRI)

• LDA/Linear Regression [ CS ]

WP-C IV Large-data mining ‘data set’

Data: 1423 Lactuca sativa accessions, CGN

• X1: 167 AFLP markers
• X2: 20 (2x10) STMS markers
• Y: 5 traits, all treated as categorical variables
  – Y1 [n=1413]: seed colour (black, white, varied)
  – Y2 [n= 761]: flowering time (< 41 d., 41-60, …, 101-120 d.)
  – Y3 [n=1208]: leaf colour (yellow, green, grey, blue, red)
  – Y4 [n= 927]: resistance to Bremia 1 (resistant, susceptible)
  – Y5 [n= 919]: resistance to Bremia 3 (resistant, susceptible)

Data split into training and test sample (50 - 50)

Objective: use X to predict Y

Criteria: coverage, accuracy, applicability

              Rule: A   Rule: B   Tot   Cov
Reality: A       10        20      30   10/30
Reality: B        0        20      20   20/20
Tot              10        40      50
Accur          10/10     20/40
Applic         10/50     40/50
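The three criteria follow directly from the cross-table of reality against rule. The Python sketch below recomputes them for the example counts; the function name and the dictionary layout are assumptions for illustration.

```python
from fractions import Fraction


def rule_metrics(table, classes):
    """Coverage, accuracy and applicability per class.

    table[reality][rule] holds the number of cases with that true class
    and that predicted class.
    """
    total = sum(sum(row.values()) for row in table.values())
    out = {}
    for c in classes:
        n_reality = sum(table[c].values())          # row total
        n_rule = sum(table[r][c] for r in classes)  # column total
        correct = table[c][c]
        out[c] = {
            "coverage": Fraction(correct, n_reality),   # correct / cases truly in c
            "accuracy": Fraction(correct, n_rule),      # correct / cases predicted c
            "applicability": Fraction(n_rule, total),   # share of cases predicted c
        }
    return out


if __name__ == "__main__":
    table = {"A": {"A": 10, "B": 20},
             "B": {"A": 0, "B": 20}}
    for cls, m in rule_metrics(table, ["A", "B"]).items():
        print(cls, m)
```

The rule for class A is perfectly accurate (10/10) but covers only a third of the true A cases (10/30) and applies to a fifth of all cases (10/50). A class that the rule never predicts would need separate handling, since its accuracy is undefined.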

Results: Resistance to Bremia 1

Performance across all traits: LVQ lowest, CS good, SVM & DAM best

Note: differences not very large!

Trade-off between increasing accuracy and maintaining coverage/applicability!

Concluding remarks

• Novel statistical tools available
• Answer questions you could not answer before?
• Applicability & integration with other WP tools

© Wageningen UR
