

© 2004 BIOSEC Consortium

Prof. Davide Maltoni

Independent Evaluation of Biometric System Performance

BioLab - Biometric Systems Lab, University of Bologna - ITALY
http://bias.csr.unibo.it/research/biolab

1st BioSec Workshop, Barcelona, June 28th, 2004


Evaluation of Biometric Systems

Technology, Scenario and Operational evaluations.

Off-line evaluation (on a pre-collected DB):
• Comparable
• Reproducible

On-line evaluation.


How much can we trust an evaluation?

Trustworthiness grows with the cost/resources required:
• In-house evaluation: Self-defined test
• In-house evaluation: Existing benchmark
• Independent evaluation: Weakly Supervised
• Independent evaluation: Supervised
• Independent evaluation: Strongly Supervised


Why independent?

Judging one's own work is hard, and judging dispassionately impossible...

For his mother, he is certainly a lovely puppy!


In-house: Self-defined test

DB: usually internally collected.

Protocol: self-defined training and test sets; often not well documented.

Results: not comparable; not reproducible by a third party.


In-house: Self-defined test (2)

An iterative approach to achieve a good performance...

1. Acquire a database.
2. Measure FRR/FAR (a sketch of this measurement follows below).
3. Has the desired performance been achieved? If yes: update brochures and papers.
4. If no: analyze the difficult cases that caused most of the errors, and (find a good motivation to) remove the corresponding samples from the database.
5. Is the database now too small? If yes: add new (not too hard) samples. Either way, go back to step 2.
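For concreteness, here is how FRR/FAR can be measured once genuine and impostor match scores have been collected (a minimal sketch; the score values and the threshold are hypothetical, and higher scores are assumed to mean better matches):

```python
def measure_far_frr(genuine_scores, impostor_scores, threshold):
    """FRR: fraction of genuine attempts rejected (score below threshold).
    FAR: fraction of impostor attempts accepted (score at/above threshold)."""
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return far, frr

# Hypothetical similarity scores in [0, 1]
genuine = [0.91, 0.85, 0.60, 0.95, 0.78]   # same-person comparisons
impostor = [0.12, 0.35, 0.66, 0.08, 0.20]  # different-person comparisons
print(measure_far_frr(genuine, impostor, threshold=0.5))  # -> (0.2, 0.0)
```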


In-house: Existing benchmark

DB: publicly available database.

Protocol: existing test protocol for the database.

Results: comparable with others obtained using the same protocol on the same database.


In-house: Existing benchmark (2)

Some examples:
• Most of the recent relevant scientific papers
• Face Authentication Contest at ICPR2000 (Part I)
• FAC2004: ICBA Face Verification Contest

Main drawback:
• Overfitting (partially mitigated in case of a very large test set)

The overfitting arises from an iterative loop: start from initial parameters; evaluate the algorithm using the training and test sets defined in the protocol; if not finished, modify the parameters and evaluate again; when finished, report the best results (sketched in code below).
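To make the drawback concrete, a sketch of this tuning loop (the evaluation function and parameter grid are hypothetical): when parameters are chosen by repeatedly scoring the benchmark's own test set, the reported "best result" is an optimistic, overfitted estimate.

```python
import itertools

def tune_on_benchmark(evaluate_eer, param_grid):
    """Anti-pattern: select parameters by the error they achieve on the
    benchmark's own test set, then report that same error."""
    best_params, best_eer = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        eer = evaluate_eer(params)  # evaluated on the benchmark test set
        if eer < best_eer:
            best_params, best_eer = params, eer
    return best_params, best_eer    # the "best result" that gets reported

# Hypothetical usage:
# grid = {"enhancement": ["gabor", "none"], "threshold": [0.4, 0.5, 0.6]}
# tune_on_benchmark(my_system_test_eer, grid)
```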


Independent: Weakly Supervised

DB: sequestered data, made available only during the test; data is unlabeled (images with random filenames).

Protocol: test executed at the testee's site; time constraints.

Results: generated by the evaluator from the matching scores obtained during the test.
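Because the filenames are randomized, only the evaluator can tell which submitted scores are genuine and which are impostor. A minimal sketch of the evaluator-side result generation (the tuple format and the secret filename-to-subject map are hypothetical):

```python
def split_scores(score_rows, subject_of):
    """score_rows: (file_a, file_b, score) tuples reported by the testee.
    subject_of: the evaluator's secret map from random filename to subject ID.
    Returns genuine and impostor score lists, from which FAR/FRR
    can then be computed."""
    genuine, impostor = [], []
    for file_a, file_b, score in score_rows:
        if subject_of[file_a] == subject_of[file_b]:
            genuine.append(score)
        else:
            impostor.append(score)
    return genuine, impostor

# Hypothetical usage:
# genuine, impostor = split_scores(reported_scores, secret_map)
```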


Independent: Weakly Supervised (2)

[Diagram] At the testee's site, the matching algorithm runs on the testee's hardware, following the testing procedure, with an operator from the testee's staff; the match scores computed over the DB are sent to the evaluator's site, where the results are generated.


Independent: Weakly Supervised (3)

Some examples:
• FERET (1996): 3813 images cross-matched; time limit: 72 hours
• FRVT2000: 13872 images cross-matched (>192 million comparisons); time limit: 72 hours (>740 matches per second, checked below)
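The FRVT2000 throughput figure follows directly from the numbers above; a quick sanity check:

```python
comparisons = 13872 ** 2      # full cross-matching: 192,432,384 (>192 million)
seconds = 72 * 3600           # the 72-hour time limit
print(comparisons / seconds)  # ~742 matches per second (>740)
```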

Main drawback:
• Human intervention and result editing cannot be ruled out, and could in principle be carried out with huge resources


Independent: Supervised

DB: sequestered data, made available only during the test; data is unlabeled (images with random filenames).

Protocol: test executed at the evaluator's site, on the testee's hardware; time constraints.

Results: generated by the evaluator from the matching scores obtained during the test.


Independent: Supervised (2)

[Diagram] At the evaluator's site, the matching algorithm runs on the testee's hardware, following the testing procedure, with an operator from the testee's staff working under the evaluator's supervisor; the match scores computed over the DB are used by the evaluator to generate the results.


Independent: Supervised (3)

Some examples:
• FRVT2002
• FpVTE2003

Drawbacks:
• No way to compare efficiency (different hardware can be used)
• Other interesting statistics (template size, memory usage) cannot be obtained
• Cannot avoid "score normalization" and "template consolidation"


Independent: Strongly Supervised

DB: sequestered data, not made available during the test.

Protocol: software components compliant with a given input/output protocol are tested on the evaluator's hardware.

Results: generated by the evaluator from the matching scores obtained during the test.
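As an illustration of testing against such an input/output protocol, here is a sketch of an evaluator-side harness invoking a submitted matching component (loosely inspired by FVC-style submissions; the command-line syntax below is hypothetical, not the actual FVC specification):

```python
import subprocess

def run_match(match_exe, template_path, image_path, score_path):
    """Run the testee's MATCH component under the evaluator's control.
    Hypothetical protocol: MATCH <template> <image> <output-score-file>."""
    subprocess.run([match_exe, template_path, image_path, score_path],
                   check=True, timeout=5.0)  # enforce a per-match time limit
    with open(score_path) as f:
        return float(f.read())               # score written by the testee's code

# The evaluator loops run_match over all sequestered pairs, collecting the
# scores; the testee's code never sees the labels or touches the results.
```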


Independent: Strongly Supervised (2)

[Diagram] At the evaluator's site, the matching algorithm runs on the evaluator's hardware, following the testing procedure, operated by the evaluator's staff; the match scores computed over the DB are used to generate the results.

The tested algorithm is executed in a totally controlled environment, where all input/output operations are strictly monitored.


Independent: Strongly Supervised (3)

Some examples:
• FVC2000
• FVC2002
• FVC2004
• SVC2004

Main drawback:
• Very time- and resource-consuming


…but in practice

Are Independent Strongly Supervised evaluations always practicable?

High cost:
• Database acquisition and tuning (appropriate size and difficulty)
• Hardware and logistics
• Personnel (supervising the test, solving compatibility problems)

Most biometric companies cannot afford such costs.

Biometric systems are characterized by frequent updates, which call for continuous performance evaluation.


Performance evaluations vs Comparative evaluations

Performance evaluation:
• Aimed at measuring the absolute performance of a system
• Large database, representative population, ...

Comparative evaluation:
• Aimed at measuring the relative performance of a system
• Less expensive, easy to reproduce for a given DB

Comparative evaluation + inference → estimation of real performance?


Inferring performance

[Diagram] Reference databases DB1, DB2, ..., DBN; reference systems S1, S2, ..., SM; reference case studies C1, C2, ..., CK (large databases, scenario, operational). The reference systems are tested both on the reference databases and on the case studies. A new system NS is tested on the reference databases only; inference from the reference results then yields the estimated performance of NS on the case studies.
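As a purely illustrative realization of the inference step (the slide does not prescribe a method, and all EER values below are hypothetical), one could fit a linear model mapping each reference system's error rates on the reference databases to its error rate on a case study, then apply that model to the new system:

```python
import numpy as np

# Hypothetical EERs (%) of M=4 reference systems on N=2 reference databases...
ref_db_eer = np.array([[1.2, 2.0], [2.5, 3.1], [0.8, 1.5], [4.0, 5.2]])
# ...and of the same systems on one reference case study C1
case_eer = np.array([2.1, 3.8, 1.4, 6.0])

# Least-squares fit: case_eer ~ a0 + a1*eer_db1 + a2*eer_db2
X = np.column_stack([np.ones(len(case_eer)), ref_db_eer])
coef, *_ = np.linalg.lstsq(X, case_eer, rcond=None)

# Estimate the new system NS on C1 from its reference-database results alone
ns_db_eer = np.array([1.0, 1.8])
print(np.concatenate(([1.0], ns_db_eer)) @ coef)  # estimated EER of NS on C1
```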


Conclusions

Avoid self-defined in-house evaluation when standard databases and protocols are available; otherwise, provide your database to the community!

Strongly Supervised independent evaluations are the most trustworthy but also the most expensive.

To what extent can we use comparative evaluation + inference to estimate real performance?


Ranking difference in FVC2002

Rank   Database 1   Database 2   Database 3   Database 4
1st    PA15         PA27         PA15         PA15
2nd    PB27         PA15         PA27         PB27
3rd    PA27         PB27         PB15         PA27
4th    PB05         PA08         PB27         PA02
...    ...          ...          ...          ...
31st   PA03         PA03         PA03         PA03

Ranking Difference for PB27 (ranked 2nd, 3rd, 4th and 2nd on the four databases):
(|2-3| + |2-4| + |2-2| + |3-4| + |3-2| + |4-2|) / 6 = 1.17
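The same computation applies to any algorithm: average the absolute differences between its ranks over all pairs of databases. A minimal sketch reproducing the PB27 figure:

```python
from itertools import combinations

def ranking_difference(ranks):
    """Mean absolute difference between an algorithm's ranks,
    taken over all pairs of databases."""
    pairs = list(combinations(ranks, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

print(round(ranking_difference([2, 3, 4, 2]), 2))  # PB27 -> 1.17
```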

FVC2002 Average Ranking Difference = 2.84 (St.Dev = 2.51)