Local Label Descriptor for Example-based Semantic Image Labeling
Supplementary Material for ECCV Submission
Paper ID 1378

In this supplementary document, we provide details of our tree construction and additional experimental results.
A Tree Construction
We construct a random tree in the same way as [1], except that we make some changes to adapt the procedure to our label histogram descriptors (instead of raw label patches).

Specifically, we first collect the feature-descriptor and label-descriptor pairs (descriptor pairs), denoted by (gj, qj), for all the patches in the training set; these pairs form the root of one tree. We randomly select nh feature descriptor components to generate split hypotheses; each split hypothesis is a random threshold generated within the range of values of the selected component. A random label descriptor component is also selected to evaluate all nh split hypotheses. The split that yields the minimum entropy at the selected label descriptor component is chosen as the final split, dividing this set (the root) into two parts (nodes). We then split each node recursively using the same randomized procedure.

The splitting stops when the feature descriptors in a node are similar enough. In the end, this recursive procedure generates a random binary tree whose leaf nodes each contain a small set of descriptor pairs. If the label descriptors in a leaf are also similar, we randomly select one descriptor pair as an exemplar. Otherwise, we cluster the descriptor pairs based on their label descriptors and randomly pick one descriptor pair from each cluster as an exemplar. This maintains label diversity in the leaf nodes.

Note that [1] uses raw label patches, which is equivalent to using label descriptors with a label cell size of 1. In the splitting procedure, we use nh = 20.
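As a concrete illustration, the randomized split selection described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the entropy estimator, the bin count, and the assumption that descriptor components lie in [0, 1] are our own choices, and `best_random_split` is a hypothetical helper name.

```python
import numpy as np

def entropy(values, bins=10):
    """Discrete entropy of one label-descriptor component
    (assumes values in [0, 1]; bin count is our choice)."""
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def best_random_split(G, Q, n_h=20, rng=np.random.default_rng(0)):
    """Pick, among n_h random split hypotheses, the one minimizing the
    entropy of a randomly chosen label-descriptor component.

    G: (n, d_f) feature descriptors; Q: (n, d_l) label descriptors.
    Returns (feature_index, threshold)."""
    label_dim = rng.integers(Q.shape[1])      # random label component
    best_f, best_t, best_h = None, None, np.inf
    for _ in range(n_h):
        f = rng.integers(G.shape[1])          # random feature component
        lo, hi = G[:, f].min(), G[:, f].max()
        t = rng.uniform(lo, hi)               # random threshold in range
        left = G[:, f] < t
        if left.all() or not left.any():
            continue                          # degenerate split, skip
        # size-weighted entropy of the chosen label component after the split
        h = (left.sum() * entropy(Q[left, label_dim])
             + (~left).sum() * entropy(Q[~left, label_dim])) / len(G)
        if h < best_h:
            best_f, best_t, best_h = f, t, h
    return best_f, best_t
```

In a full tree builder, this split would be applied recursively to each resulting node until the feature descriptors in a node are similar enough.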
B A Synthetic Experiment on CamVid
The quality of the candidate sets plays a critical role in the final performance of both the baseline algorithm and ours. In an extreme case, if a testing image has a completely different visual appearance from the training images (e.g., training on street views but testing on desert scenes), neither algorithm would perform well. To isolate the effect of candidate sets when evaluating our method, we perform the following synthetic experiment.
Setting                    Global  Avg(Class)  Avg(Pascal)
Baseline                   85.55   64.00       55.12
(1) Off, (2) On, (3) On    91.37   72.85       65.40
(1) On, (2) On, (3) On     94.00   78.47       69.74

(1) Over-Segmentation  (2) Continuous Optimization  (3) Label Descriptor
Baseline [1] (our implementation) = (1) Off, (2) Off, (3) Off

Table 1: Overall accuracy in the synthetic experiment (described in Section B). In this experiment, all methods use the same nearly ideal candidate sets. Continuous optimization (convex relaxation) and the label descriptor noticeably improve performance over the baseline, in both overall correctness and average per-class scores. The low-level over-segmentation further boosts performance, as pixels of similar colors are constrained to have the same label.
[Figure 1 is a bar chart; the per-class accuracies (%) recovered from it are listed below.]

Class       baseline  (1) off, (2) on, (3) on  (1) on, (2) on, (3) on
bicyclist   64.20     77.99                    84.19
luggage     41.05     56.34                    64.15
child       36.68     56.51                    62.20
text        66.68     73.43                    81.21
pedestrian  46.85     67.72                    74.19
sidewalk    78.66     89.85                    91.54
SUV         64.85     89.68                    92.50
vegetation  58.16     85.85                    87.87

(1) Over-Segmentation  (2) Continuous Optimization  (3) Label Descriptor
Baseline [1] (our implementation) = (1) off, (2) off, (3) off
Fig. 1: Performance comparison on small object classes in the synthetic experiment (described in Section B). Combined with continuous optimization (convex relaxation), our label descriptor better identifies small objects in the images by alleviating the misalignment problem. With over-segmentation, our method further improves over the baseline on the small object classes.
Fig. 2: Testing images used in Figure 3.
[Figure 3 layout: ground truth for the images in Figure 2 (top row), followed by one row of results per method. The per-image numbers recovered from the figure are listed below: overall accuracy, then per-class accuracy for Sidewalk / Car / Pedestrian / Tree / Column.]

Method                   Image 1                          Image 2                          Image 3
Baseline                 87.4 (76.5 90.9  8.5 N/A 20.7)   82.9 (83.1 87.3 14.5 47.5 27.2)  73.8 (63.2 33.3 N/A 60.3 28.8)
(1) Off, (2) On, (3) On  93.9 (92.0 93.1 44.0 N/A  5.9)   92.3 (91.8 90.4 84.8 58.4 22.2)  88.3 (74.6 55.0 N/A 88.3 28.0)
(1) On, (2) On, (3) On   96.1 (93.6 94.7 53.6 N/A 19.8)   94.4 (91.4 93.5 88.6 65.5 46.7)  93.7 (81.5 55.9 N/A 94.4 51.0)

(1) Over-Segmentation  (2) Continuous Optimization  (3) Label Descriptor
Baseline [1] (our implementation) = (1) Off, (2) Off, (3) Off
Fig. 3: Three example results from the synthetic experiment (described in Section B). Overall, the performance of our method improves as more design choices are applied; when all design choices are applied, our method often performs the best. It is interesting to note that, although COLUMNs appear in the baseline's result for the third image (second row, third column), they are not in the correct orientation, because the baseline method uses raw label patches for label map inference and lacks the flexibility to handle label patch misalignment. Using label descriptors and over-segmentation, our method alleviates this problem and generates better results (third and fourth rows, third column).
For each training image, we generate a slightly rotated (by 5 degrees) and scaled (by a factor of 0.9) version as a testing image. For each patch in the testing (transformed) image, we collect its top nearest neighbors from the corresponding original (non-transformed) image to form the candidate set. The nearest neighbors are defined by the Euclidean distance between feature descriptors. Such candidate sets are nearly optimal, subject to label patch misalignment and possible feature descriptor matching failures, both of which are introduced by the geometric transformation.
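The candidate-set retrieval step above can be sketched in a few lines. Only the Euclidean-distance nearest-neighbor rule comes from the text; the function name `candidate_sets` and the choice of k are illustrative assumptions.

```python
import numpy as np

def candidate_sets(test_desc, train_desc, k=5):
    """For each test-patch feature descriptor, return the indices of its
    k nearest training patches under Euclidean distance.

    test_desc: (m, d) array; train_desc: (n, d) array.
    Returns an (m, k) index array into train_desc."""
    # squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (np.sum(test_desc ** 2, axis=1, keepdims=True)
          - 2.0 * test_desc @ train_desc.T
          + np.sum(train_desc ** 2, axis=1))
    # sort each row by distance and keep the k closest indices
    return np.argsort(d2, axis=1)[:, :k]
```

For large patch counts, `np.argpartition` would avoid the full sort, but the brute-force version suffices to convey the construction.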
On this synthetic testing dataset (generated from the 377 training images in the CamVid dataset), we evaluate the baseline method and three design choices of our method. The overall accuracy of the different methods is shown in Table 1; the performance on small object classes is shown in Figure 1. Our methods outperform the baseline in almost every per-class accuracy, especially on the small object classes. Several qualitative visual comparisons are provided in Figure 3.

Note that in this experiment, all methods use the same nearly ideal candidate sets. The performance differences should therefore be attributed to the algorithm design choices.
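The three scores reported in Tables 1 and 2 are standard labeling metrics. Assuming Global is overall pixel accuracy, Avg(Class) is mean per-class recall, and Avg(Pascal) is the PASCAL VOC mean intersection-over-union (the paper does not spell out the definitions, so this is our reading), they can be computed from a confusion matrix as follows; `scores` is a hypothetical helper.

```python
import numpy as np

def scores(conf):
    """Global, Avg(Class), and Avg(Pascal) scores (in %) from a confusion
    matrix conf, where conf[i, j] counts pixels of true class i predicted
    as class j. Assumes every class appears at least once."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                              # correctly labeled pixels
    global_acc = tp.sum() / conf.sum()              # overall pixel accuracy
    class_acc = tp / conf.sum(axis=1)               # per-class recall
    # intersection-over-union: tp / (true + predicted - tp), per class
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return 100 * global_acc, 100 * class_acc.mean(), 100 * iou.mean()
```

Under these definitions, Avg(Pascal) is always the strictest of the three, which matches its being the lowest column in both tables.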
C Additional Real Experiments on CamVid
We provide additional experiments and visualizations to further evaluate our algorithm design choices. We use all 233 real testing images (including the images at dusk) in the CamVid dataset and test on the same 11 classes as [1]. To measure the impact of each design choice, we start from the baseline method, then turn on one or two design choices at a time and evaluate the performance.

The overall accuracy for each combination of design choices is shown in Table 2. The per-class accuracy on daylight images is shown in Figure 4 for large object classes and in Figure 5 for small object classes; images at dusk are excluded from the per-class evaluation because small objects there are too similar to the background, making them hard to recognize even for human vision. Several qualitative visual comparisons are provided in Figure 7.
Setting                            Global  Avg(Class)  Avg(Pascal)
Baseline                           59.40   28.38       20.09
(1) On, (2) Off, (3) Off, (4) Off  62.07   29.33       21.50
(1) On, (2) On, (3) On, (4) Off    66.25   32.54       24.76
(1) On, (2) On, (3) On, (4) On     67.93   33.95       26.35

(1) Over-Segmentation  (2) Continuous Optimization
(3) Label Descriptor  (4) Candidate Set Evolution
Baseline [1] (our implementation) = (1) Off, (2) Off, (3) Off, (4) Off

Table 2: Overall accuracy in the experiment of Section C. As each design choice is turned on, all scores increase.
Overall, the performance of our method improves as more design choices are applied; when all design choices are applied, our method often performs the best.

We also note that the accuracy for the smaller object classes in Figure 5 is much lower than that for the large object classes in Figure 4, which suggests that small object classes remain an open problem for this approach.
References
1. Kontschieder, P., Bulo, S., Bischof, H., Pelillo, M.: Structured class-labels in random forests for semantic image labelling. In: ICCV (2011)
[Figure 4 is a bar chart; the per-class accuracies (%) recovered from it are listed below.]

Class     baseline  (1) on, rest off  (1)(2)(3) on, (4) off  all on
road      97.85     98.38             97.86                  97.92
building  77.28     72.34             80.37                  82.30
sky       64.44     87.74             89.66                  90.13
tree      53.06     59.96             65.26                  68.16

(1) Over-Segmentation  (2) Continuous Optimization
(3) Label Descriptor  (4) Candidate Set Evolution
Baseline [1] (our implementation) = (1) Off, (2) Off, (3) Off, (4) Off
Fig. 4: Results for the experiment in Section C. Per-class accuracy for the largest object classes: ROAD, BUILDING, SKY, and TREE, which account for 29.7%, 28.3%, 17.2%, and 7.5% of the pixels in the testing images, respectively. The performance of all methods is comparable on the largest class, ROAD; our methods, especially with all design choices ON, are generally better than the baseline on the other large classes, BUILDING, SKY, and TREE. Note that over-segmentation significantly improves the performance on SKY, because the sky often appears as a region of uniform color, making it relatively easy to generate a good segment for SKY in the image. TREE often contains more structural detail than the other three classes; as a result, combining the label descriptor with continuous optimization (convex relaxation) shows a very noticeable improvement on TREE.
[Figure 5 is a bar chart; the per-class accuracies (%) recovered from it are listed below.]

Class       baseline  (1) on, rest off  (1)(2)(3) on, (4) off  all on
sidewalk    9.32      5.20              8.11                   14.13
bicyclist   0.12      0.00              0.01                   0.61
sign        0.00      0.00              0.07                   0.11
fence       0.00      0.00              0.07                   0.02
pedestrian  0.07      0.00              0.51                   0.87
car         9.98      8.02              16.17                  19.05
column      1.39      1.17              3.21                   4.20

(1) Over-Segmentation  (2) Continuous Optimization
(3) Label Descriptor  (4) Candidate Set Evolution
Baseline [1] (our implementation) = (1) Off, (2) Off, (3) Off, (4) Off
Fig. 5: Results for the experiment in Section C. Per-class accuracy for smaller object classes. Compared with the baseline, combining continuous optimization and the local label descriptor noticeably increases the accuracy on smaller objects such as BICYCLIST, PEDESTRIAN, CAR, and COLUMN. The appearance of SIDEWALK is often confused with ROAD, so the initial candidate sets can be wrong; candidate set evolution helps to improve the accuracy for SIDEWALK. No method handles FENCE and SIGN well; we suspect this is due to the lack of training data for these classes.
Fig. 6: Testing images and corresponding ground truth labels for Figure 7.
[Figure 7 shows one row of results per method; the per-image numbers recovered from it are listed below: overall accuracy, then per-class accuracy for Sidewalk / Car / Tree / Column.]

Method                             Image 1                    Image 2                     Image 3
Baseline                           80.6 (34.7 N/A N/A  0.7)   82.3 (35.5 56.1 N/A  9.2)   86.4 (18.8 N/A 66.0  7.9)
(1) On, (2) Off, (3) Off, (4) Off  80.3 (10.2 N/A N/A  0.0)   80.0 (12.9 59.6 N/A  6.4)   61.4 ( 1.5 N/A 73.1  3.5)
(1) On, (2) On, (3) On, (4) Off    89.0 (58.6 N/A N/A  2.9)   85.2 (33.3 64.5 N/A 10.0)   63.4 ( 1.7 N/A 79.5 18.3)
(1) On, (2) On, (3) On, (4) On     92.6 (92.1 N/A N/A  7.5)   92.4 (91.5 64.5 N/A 10.5)   71.1 (69.9 N/A 80.9 39.4)

(1) Over-Segmentation  (2) Continuous Optimization
(3) Label Descriptor  (4) Candidate Set Evolution
Baseline [1] (our implementation) = (1) Off, (2) Off, (3) Off, (4) Off
Fig. 7: Three example results from the experiment in Section C. Overall, the performance of our method improves as more design choices are applied; when all design choices are applied, our method often performs the best.