TRANSCRIPT
PND Facial Search Accuracy
Evaluation
HO Centre for Applied Science and Technology
Author: G Whitaker
V1.0 23rd October 2015
OFFICIAL – SENSITIVE
Contents

Executive Summary
1 Introduction
2 Aim of the evaluation
3 Scope
5 Evaluation Objectives
6 Evaluation Approach
   6.1 Evaluation Methodology
   6.2 Test dataset compilation
   6.3 Matching Thresholds
   6.4 Searching (Stage 1)
   6.5 Results Recording
   6.6 Searching (Stage 2)
   6.7 Results analysis and reporting
7 Results Analysis
   7.1 Probe Image Statistics
   7.2 Search Results – Stage 1
   7.3 Search Results – Stage 2
   7.4 Results of the Automated Analysis
   7.5 Manual Analysis of Results
   7.6 Interpretation of Results
8 Conclusions and Recommendations
   8.1 General
   8.2 Image Quality
   8.3 Comparison Thresholds and Candidate List Length
   8.4 Multiple Match Groups and Duplicate Images
Annex A – An example of the probe image data spreadsheet
Annex B – An example of the results analysis spreadsheet
Executive Summary
This paper describes the approach explored by HO CAST to attempt an assessment of the performance
(on the operational system) of the Facial Search capability that was implemented on the Police National
Database (PND) in 2013.
Testing on a ‘live’ system in this way is a significant challenge for reasons explained in this paper and
was only undertaken since no other formal accuracy testing had been carried out as part of the
implementation. While the results obtained with regard to search accuracy are of limited value they do
nonetheless provide an insight into the way facial images are currently stored on PND and their
suitability for use with automated facial recognition algorithms.
One of the key drivers for this work was concern being expressed over the lack of testing prior to
operational deployment of the facial search capability. However it is equally important not to lose sight
of the fact that significant operational successes are now being reported where it has been used as part
of police investigations. Moving forward the emphasis should be on ensuring that any future
developments of the facial matching capability (indeed government and law enforcement biometric
services generally) take note of the recommendations in this report.
Key Findings
It is essential that performance assessment of biometric systems be undertaken prior to
operational deployment, and the requirements governing this must be adequately covered within
the contractual arrangements with the supplier. A ‘lessons learned’ exercise should be
undertaken to understand why such formal testing was not included within the scope of this
deployment. Accuracy assurance should be embedded as an essential requirement in the
development and implementation of future systems which have a biometric component within
them.
This evaluation has revealed significant facial image duplication within the PND, potentially
impeding operational effectiveness, and incurring additional cost. Since PND acts as a central
repository, holding and sharing copies of information that is already stored locally, it may be
assumed that such duplication is also occurring within individual forces and that similar additional
costs are also being incurred at a local level. Alternative approaches to the storage and
management of custody image data should be explored; for example a centralised solution such
as is currently in place for fingerprints and DNA data would improve data quality and integrity,
and might also be expected to be more cost effective than continuing to maintain separate
custody image repositories within each force.
The evaluation has revealed significant variation between forces in the quality of facial images
submitted to PND (assuming the image files being uploaded are the same as those being held
locally). In general, forces are failing to ensure that their custody images comply with previously
issued national standards, both in terms of image quality and associated metadata. The
introduction of suitable (automated) facial image quality checking software is recommended, both
at the point of capture within individual forces, and also at the point of enrolment into the PND
facial search database.
This evaluation has not revealed any significant issues with the way the FR algorithm is working
or the matching threshold settings as currently applied. However the effect of the recent change
in the threshold from 0.7 to 0.65 should be carefully monitored to determine what, if any, benefits
it has brought and whether further reductions may be beneficial.
1 Introduction
This paper describes the approach explored by HO CAST to attempt an assessment of the performance
(on the operational system) of the Facial Search1 capability that was implemented on the Police National
Database (PND) in 2013.
There is no large scale test environment for the PND facial search capability, and no formal evaluation of
search performance was carried out on the operational system prior to ‘go-live’. Evaluating biometric
search accuracy on a live system, where previously ground-truthed test data cannot be introduced into
the database (and subsequently removed), is a significant challenge.
This has necessitated an alternative approach to performance assessment. Note that this should not be
considered 'best practice' for this type of testing, but in the absence of any additional funding or
contractual obligation on the Systems Integrator (SI) to carry out such testing on a dedicated test
environment, it should as a minimum provide insights into the operation of the facial search capability as
currently implemented on PND.
2 Aim of the evaluation
Although a facial search capability has been live on PND for more than a year, there has to date been no
objective assessment of its performance. The aim of this evaluation was to provide quantitative data on
the performance of the Facial Recognition (FR) algorithm as currently installed on the operational
system when typical custody quality photographs are used as enquiry images. It was also an
opportunity to better understand how facial images are uploaded and stored on PND, and to identify
areas where improvements might be made.
Although the scale of this test was limited (due to the lack of any batch capability on PND) this
information will be of value in informing further development of PND as well as other large scale
(national) HO deployments of FR technologies. The results will also be helpful in informing policy
development for future uses of facial images (capture, storage, and searching).
3 Scope
This evaluation is primarily concerned with an assessment of the search functionality, following the
enrolment of face images. Assurance of the enrolment process was addressed as part of the release
testing activities for Rel18.
However the evaluation also provides an opportunity to look at variations in image quality between
forces, to investigate how images are uploaded and stored on PND, and to highlight those areas where
changes might be made which would lead to improvements in the performance of the facial search
capability.

1 ‘Facial Search’ is the term commonly used to describe the use of a face recognition algorithm to search the custody images on PND.
Within this document Facial Search is used interchangeably with the term Face Recognition or ‘FR’.
Excluded from the scope was operator training / expertise (which can have a major impact on end-to-
end operational accuracy) and the use of CCTV, Facebook or other similar ‘non-compliant’ facial
images. The evaluation only looked at the searching of custody images against a database of custody
images, due to the difficulties in obtaining sufficient numbers of other types of imagery with known
matches in the database.
5 Evaluation Objectives
The overarching objective of this work was to explore a methodology that could be used on the live
system to:
1) validate that the capability has been implemented correctly by the supplier;
2) provide quantitative data to inform the setting of matching thresholds - to maximise the likelihood
of a match being found (if one exists in the database) whilst minimising the total number of
search results that the operator has to view;
3) provide a baseline figure for the search accuracy of automated facial recognition technology
when used to search good quality enquiry images against a database of 13M plus operational
custody images;
4) learn more about the way in which custody images are uploaded and stored on PND and to
identify areas where improvements might be made;
5) inform the user community about system capability, providing information that will help with
formulating guidance and best practice for the use of this capability, as well as future national
deployments of FR.
6 Evaluation Approach
In the absence of a dedicated large scale test environment for the facial search capability, two major
challenges in developing a methodology which would meet these objectives have been:
the lack of any 'ground truth' for the operational images on PND, coupled with an inability to
'seed' test data2 into what is now a live operational database, and
obtaining sufficient numbers of enquiry images, representative of operational data and with
known matches in the database.
2 Mixing test data with operational data is not considered good practice, as the test data could be returned in response to an operational search. Such testing should always be conducted on a dedicated test environment.
The test approach described here attempts to overcome these challenges by making use of a 'window of
opportunity' each month during which new images have been uploaded to PND by forces but have not
yet been enrolled into the FR system.
6.1 Evaluation Methodology
New images are routinely uploaded to PND from local force custody imaging systems, and these are
then linked to the relevant POLE (Person, Object, Location, Event) data for the subject (normally but not
necessarily a new arrestee). The diagram below shows a high level outline of the process.
After uploading to PND such images will appear in the results list of a POLE search, but as they are only
enrolled into the automated FR database on a monthly basis (being held in a 'pool' of newly uploaded
images until that time) they will not initially be searchable by the FR algorithm.
These images therefore constitute a good source of enquiry (probe) data; namely operational custody
images that are not in the FR database and therefore cannot match against themselves, but a proportion
of which will be of subjects who do already have images in the database (i.e. from previous arrests).
Unfortunately there is no easy way of determining which of the images are of first time offenders and
which are of recidivists (and who probably have other images from previous arrests already in the
database). For all searches that return a recognisable 'match' this is not an issue but for those that do
not it is difficult to say whether this was because a match existed but was not found by the algorithm, or
if there was no matching image present in the database.
To address this it is necessary to repeat these searches at a later date, once the images have all been
enrolled in the FR system. This is explained in more detail later in this report.
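In outline, the two-stage approach can be sketched as follows. This is purely illustrative: PND provides no such programmatic interface (every search in this evaluation had to be launched manually), and all names here, including the toy matcher that scores 1.0 only for an exact duplicate image, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    id: str
    template: str  # stand-in for a facial template derived from the image

class ToyFRSystem:
    """Toy stand-in for the PND FR database (illustration only)."""
    def __init__(self, enrolled):
        self.enrolled = list(enrolled)  # (image_id, template) pairs

    def enrol(self, probe):
        self.enrolled.append((probe.id, probe.template))

    def search(self, probe):
        # Return (image_id, score) hits; score 1.0 only for an exact
        # duplicate template, mirroring the report's key assumption.
        return [(iid, 1.0) for iid, t in self.enrolled if t == probe.template]

def run_two_stage(probes, fr):
    # Stage 1: probes are not yet enrolled, so they cannot match themselves;
    # any hit is a genuine candidate match.
    stage1 = {p.id: fr.search(p) for p in probes}
    # Monthly enrolment moves the 'pool' of new images into the FR database.
    for p in probes:
        fr.enrol(p)
    # Stage 2: every probe should now self-match with score 1, giving an
    # approximate 'ground truth' for the Stage 1 results.
    stage2 = {p.id: fr.search(p) for p in probes}
    return stage1, stage2
```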
6.2 Test dataset compilation
Testing of biometric systems requires the use of personal biometric data from real subjects; it is not
possible to synthesise operationally representative test data. Prior to starting work on creating the test
datasets, confirmation was obtained from the Information Commissioner’s Office that it would be
acceptable to use un-enrolled operational images from police forces as probes, as described in this test
approach.
Operational facial images were then sourced from the 'pool' of images recently uploaded to PND by
individual forces. CGI (the systems integrator) was requested to extract a randomised subset of such
images (approx 2000) which were then transferred via FTP to the PND 'secure area' at HO
headquarters where they were then loaded onto a dedicated secure laptop with access to PND via the
‘Restricted’ channel.
After reviewing the data it was found that CGI’s extraction process had corrupted the images, rendering
them unusable, and a request was made for them to be re-sent. When the second set was examined it
was discovered that the overwhelming majority appeared to have originated from a single force. The
original requirement for this data was that it should be an operationally representative sample, so CGI
were asked to extract a second batch of 2000 images which proved to be more diverse, although still
only from a small number of forces.
From these 4000 images a smaller test sample was selected, chosen to be representative of the overall
data, in terms of both image quality (size, resolution, pose, illumination, expression etc) and
demographics (age, gender, ethnicity). This information was recorded on a spreadsheet (see Annex A).
6.3 Matching Thresholds
For current operational use the matching threshold for the facial recognition algorithm is set at the
default level as recommended by the supplier (Cognitec). However for testing purposes it is desirable to
have this set to a lower level in order to ensure that a sufficient number of candidate responses are
always returned from a search.
Therefore, after consultation with members of the PND team and the Systems Integrator, prior to the
start of the test CGI were requested to lower the matching threshold from the default setting of 0.7 to 0.5.
It was subsequently discovered that CGI had made the threshold change on the ‘Confidential’ channel
and not the ‘Restricted’ channel. This was not corrected until the second day resulting in the first day's
searches not returning the expected results and thus limiting their value in this evaluation.
6.4 Searching (Stage 1)
There is currently no batch capability for the facial search on PND. Therefore each image from the
chosen test set had to be manually selected as a probe/enquiry image and then a search launched. The
search was based solely on the image; with no entry of additional search criteria (such as age range)
and was against all 13M plus images in the FR database.
Note that no additional image processing (cropping, rotation, sharpening etc) was applied to the probe
images. Provided an image met the size criterion (<500KB) and both eyes were visible on human
inspection, it was submitted for searching in the ‘as received’ condition.
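The acceptance criteria for probe images amount to a trivial check, sketched below. This is an assumption-laden illustration only: in the evaluation the eye-visibility judgement was made by human inspection, not by software, so it appears here simply as a boolean input.

```python
MAX_FILE_SIZE = 500 * 1024  # the report's <500KB size criterion

def accept_probe(file_size_bytes, both_eyes_visible):
    """Return True if an image qualifies for submission 'as received'.
    both_eyes_visible stands in for the testers' human judgement."""
    return file_size_bytes < MAX_FILE_SIZE and both_eyes_visible
```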
Throughout the testing there were performance problems with the connection from the secure laptop to
PND which would regularly freeze and require a restart. It was not possible to identify the cause of this
but as a result the number of searches that could be launched in the time available was significantly
reduced.
6.5 Results Recording
The results of all 'successful' searches were recorded, success being defined as a completed search
that returned one or more results above the threshold. With a very low threshold in place for the test,
all searches should return results, which could then be viewed online as well as being recorded in the
electronic 'results log' provided by CGI after all the searches were completed.
Note - Match Groups
PND assigns data relating to the same individual into a 'match group'. This also applies to images
where there is sufficient additional metadata to link the image to an existing match group. However in
many cases there is not sufficient information in the record that has been uploaded by the force, and as
a result images relating to a particular individual (in some cases possibly the exact same image) may be
allocated to different or even new match groups.
The effect of this is that the same individual may be returned in multiple positions in the respondent list.
However, only one of these match groups will contain a copy of the probe image, and it is the position
and score of this match group that matters when recording the result of the search.
This may result in some true matches being reported as 'misses' but this is an inevitable consequence of
the way PND uses match groups as well as not having any reliable ground truth against which to assess
the result.
6.6 Searching (Stage 2)
After all the Stage 1 searches had completed, CGI were asked to enrol the probe images into the FR
database, and the searches were then resubmitted. However this time round, each probe was expected to always match against
itself, returning a score of '1' (maximum score), along with details of the match group to which the image
had just been added.
In effect, this second search established the 'ground truth' for the data, although as stated in the box
note above this may not be 100% reliable due to the way match groups are used on PND.
6.7 Results analysis and reporting
Once both sets of searches had been completed (Stages 1 and 2) CGI was asked to provide the results
spreadsheets, listing the Search ID, Match Groups and Image Scores for each search image both before
and after the images were enrolled into the FR system. These were then cross referenced (using the
Search IDs) with the author’s Enquiry Image spreadsheet, linking each probe image to the results
obtained both before (the genuine search result) and after that same image was enrolled (the ‘ground
truth’).
For each probe image the match group containing the highest scoring image returned by the first search
was compared with that from the second search and if they were the same then the result was declared
to be a match. The score for the second search was expected to always be a '1' (as it will have been
matched against itself) while the score from the first search is recorded as the genuine match score for
that search.
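The cross-referencing step described above can be sketched as follows. The field names ('search_id', 'match_group', 'score') are assumed stand-ins for the columns in the CGI results spreadsheets, not their actual headings.

```python
def cross_reference(stage1_rows, stage2_rows):
    """Join Stage 1 and Stage 2 result rows on their Search ID and compare
    the highest-scoring match group from each stage."""
    by_id = {row['search_id']: row for row in stage2_rows}
    combined = []
    for r1 in stage1_rows:
        r2 = by_id.get(r1['search_id'])
        if r2 is None:
            continue  # no usable Stage 2 result; excluded from analysis
        combined.append({
            'search_id': r1['search_id'],
            # Same top match group in both stages => declared a match.
            'match': r1['match_group'] == r2['match_group'],
            # The Stage 1 score is recorded as the genuine match score.
            'genuine_score': float(r1['score']),
        })
    return combined
```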
If the match groups were not the same there are two possibilities:
1. At the time of the first search there was no matching image in the database to be found. After
enrolment the image was assigned to a newly created match group. This result is therefore
considered to be a genuine non match as there was nothing for the probe to match against the
first time round.3
2. There is more than one match group containing images of the probe subject, and the one
returning the highest scoring response differs from the one to which the image was actually
added. If these match groups genuinely relate to the same individual the result should be
recorded as a match but this information will not be evident solely from the results
spreadsheets.
In both the above cases it was acknowledged that a manual check of the results would be required to
determine what had happened but, depending on the numbers, it might only be possible to check a
subset of the searches.
7 Results Analysis
7.1 Probe Image Statistics
Before considering the search results it is useful to consider the breakdown of probe images by subject
gender / ethnicity / file size / quality etc as this provides an insight into the type of data currently being
uploaded by forces to PND.
Distribution by Ethnicity and Gender:

                              Number of Images    Percentage
Male Caucasian                      124               58%
Female Caucasian                     50               24%
Male non-Caucasian                   20               9.5%
Female non-Caucasian                 18               8.5%
Total number of images              212              100%
3 Note that on the operational service, whether this would be reported as a true non-match or a false match will depend on where the
threshold is set. For this evaluation the threshold was lowered to ensure that all searches would return results.
Distribution by file size:
Image dimensions and file sizes varied considerably between forces. The smallest file size included in
the test was 7KB while the largest was 185KB. The smallest images were 200 by 200 pixels and the
largest were 1280 by 1024.
[Chart: Probe image file size distribution (in KB)]

[Chart: Probe image size distribution (pixels), binned at 50,000-pixel intervals up to >500,000; the 600 x 480 pixel size is marked for reference]
Other statistics:

                                 Number of Images    Percentage
No. subjects wearing glasses            8                4%
No. subjects with facial hair          39               18%
Non-compliant pose                     15               7.5%

Estimated age distribution:

Juvenile (<18)                          4                2%
Young (18-35)                          93               44%
Middle aged (35-60)                    98               46%
Old (>60)                              17                8%
Contrary to expectations it was noted that obtaining a compliant pose was not a major issue in most
cases and subjects were generally front facing and with a neutral expression. Approximately 7% of
cases exhibited significant deviation from this; typically these were images where the eyes were closed
or the mouth was open. In a few cases cuts or bruising was also evident on the face.
Lighting was an issue in just over a third of the images. Typically this was due to uneven illumination
across the image, resulting in strong shadows on parts of the face. In other cases it appeared that no
special lighting had been employed and that the only illumination was from standard fluorescent strip
lights.
Image quality is known to be the biggest single factor influencing the performance of FR algorithms.
While the above figures on file size, lighting etc are useful indicators of quality they are not sufficient in
themselves. For example in many of the larger images (which might have been thought to be of better
quality) the subject often occupied only a very small part of the image; if these were cropped along the
lines of a conventional custody image the actual facial image size would be towards the lower end of the
size scale.
While many of the images initially appeared acceptable on the laptop screen, when enlarged their
shortcomings became very apparent. Many were very ’soft’; either due to the faces being out of focus or
perhaps a result of poor quality or dirty lenses on the cameras. A significant proportion also exhibited
JPEG artefacts as a result of high levels of data compression.
The following chart shows the image quality as judged (subjectively) by the testers. Note that an image
that fully complied with the existing police image standard would score a 10; NONE of the probe images
used in this test fully achieved this level of quality.
7.2 Search Results – Stage 1
Total number of searches launched: 212
Number of searches not returning any result for Stage 1 (see footnote 4): 10
Number of searches returning any result for Stage 1: 202
Number of searches returning a result in Stage 1 with a score of less than 1: 134
Number of searches returning a result in Stage 1 with a score of 1: 68
A key assumption of the test was that images returning a match with a perfect score of '1' must have
matched against an exact duplicate image that was previously enrolled in the FR database. Such
images cannot be used as part of an accuracy evaluation and so were excluded from the subsequent
performance analysis.
7.3 Search Results – Stage 2
Following enrolment into the FR database, it was expected that when the probe images were launched a
second time for a search against images in the database, they would all match against the copy now
enrolled in the database and, being exact duplicates, would return a matching score of 1.
4 The Cognitec software does not provide a reason for specific failures to match. If a search returns no results above the matching threshold it will be marked as having failed; there were examples of this on Day 1 as the matching threshold had not been lowered as requested. Others may have failed for reasons such as very poor image quality or because the eyes could not be located by the software (e.g. due to the subject wearing glasses). One search failed to generate a transaction ID and was therefore excluded from Stage 2.
[Chart: Distribution of Image Quality – subjective quality score (0-10) against number of images]
Total number of searches launched: 212
Number of searches not returning any result for Stage 2 (see footnote 5): 15
Number of searches returning any results for Stage 2: 196
Number of searches returning a match in Stage 2 with a score of less than 1: 21
Number of searches returning a match in Stage 2 with a score of 1: 175
21 searches returned a score of less than the expected '1' and these were manually examined to identify
possible reasons. It was discovered that all of those with a match score greater than 0.99 (a further 10
images) had nonetheless matched against themselves and were thus genuine duplicate images despite
not scoring a ‘1’.
The remaining 11 images were confirmed as being genuine non-matches (the scores for these ranged
from 0.67 up to 0.85). It may be that they had not been enrolled in the FR database as anticipated, even
though they were all of reasonable quality and had all been previously accepted by the system as
probes. However it has not been possible to verify this.
7.4 Results of the Automated Analysis
A key objective of this work was to assess the feasibility of developing a largely automated method for
evaluating search accuracy on the live PND system.
The following sections describe the approach. Note that only 1st position respondents were considered
(based on the highest scoring image that was returned for each search).
Using the Probe Image ID as a common key between the two sets of data the results of the Stage 1 and
Stage 2 searches were combined. This resulted in a total of 189 ‘usable’ searches (i.e. searches
returning results for both Stage 1 and Stage 2 where the results could be cross-compared).
For those cases where a duplicate copy of the image had been found, it was assumed that both
of the following would be true:
Stage 1 comparison score =1 (i.e. the probe had matched against a duplicate image)
Stage 2 comparison score =1 (i.e. the probe had matched against itself or the previous duplicate image)
If both of these conditions were true then the probe image was considered to be a confirmed
duplicate. 68 images returned a score of 1 in Stage 1, while 175 did so in Stage 2. However only 49
images returned a score of 1 in both Stages 1 and 2.
For those cases where a ‘true’ match had been found, it was assumed that the following would
be true:
Stage 1 comparison score <1 (i.e. the probe had not matched against a duplicate image)
Stage 2 comparison score =1 (i.e. the probe had matched against itself)
5 The expectation was that all images that returned results for Stage 1 would also return a result for Stage 2. It is not clear why this figure
for Stage 2 is higher than it was for Stage 1.
Highest scoring Match Group in Stage 1 is the Highest scoring Match Group in Stage 2
Image ID in the Match Group for Stage 1 is not the Image ID in the Match Group for Stage 2
Image Count within the Match Group in Stage 2 is higher than the Image Count for that Match Group in
Stage 1
If ALL of these conditions were met, the result was considered to be a true match, confirmed by the
fact that the probe image (once enrolled) had been added to the same Match Group as that found in
the Stage 1 search.
Analysis of the data revealed that only 20 searches met all of these criteria.
For those cases where there was no corresponding image in the database to be found, it was
assumed that the following would be true:
Stage 1 comparison score <1 (i.e. the probe had not matched against a duplicate image) OR no result
was returned (i.e. no image in the database scored above the matching threshold)
Stage 2 comparison score =1 (i.e. the probe had matched against itself)
Highest scoring Match Group in Stage 1 is not the Highest scoring Match Group in Stage 2 (i.e. the image
had been added to a different match group to the one that returned the highest scoring image in Stage 1).
If ALL of these conditions were met it meant that the probe image had been put into a new Match Group,
the implication being that no matching image existed in the database at the time of the first search.
Analysis of the data revealed that 101 searches met these criteria.
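Taken together, the three rule sets above amount to a simple classification of each combined search record, which can be sketched as below. The boolean inputs are assumed to have been derived from the match-group columns of the combined spreadsheet; that derivation (and the report's actual tooling, which is not described) is not shown here.

```python
def classify(s1_score, s2_score, same_top_group, same_image_id, count_increased):
    """Classify one combined search record using the report's three rule sets.
    s1_score may be None (no Stage 1 result above the matching threshold)."""
    no_s1_duplicate = s1_score is None or s1_score < 1
    if s1_score == 1 and s2_score == 1:
        return 'confirmed duplicate'       # matched an exact copy already enrolled
    if no_s1_duplicate and s2_score == 1:
        if same_top_group and not same_image_id and count_increased:
            return 'true match'            # probe joined the match group it found
        if not same_top_group:
            return 'no match in database'  # probe went into a new match group
    return 'unresolved'                    # requires manual examination
```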
For the remaining images there are several possibilities; some may have been duplicates where one or
both searches had not returned a score of 1 as expected. In other cases it may be that the probe
image was added to a different match group to the one it (correctly) matched against the first time round.
This can happen due to insufficient or inaccurate metadata associated with the image.
Note that there were also 10 cases where the comparison score for Stage 2 was lower than that seen in
Stage 1. It is not clear how this could happen since even if the image had failed to enrol (thereby
explaining why the score in Stage 2 was not ‘1’), it should still have matched against the same image
that was found the first time round, and with the same score. One possibility is that the image returning
the original match had since been removed from the database but there is no easy way to determine if
this was actually the case.
Examples of the data returned from both the Stage 1 and Stage 2 searches are provided in Annex B.
7.5 Manual Analysis of Results
In order to validate the results above it was acknowledged that it would be necessary to manually verify
the results of at least a proportion of the searches.
During the test process and as the results were reviewed it became clear that some of the assumptions
on which this evaluation was based were not valid and that the figures given above were therefore
questionable; some of the duplicate images were scoring less than ‘1’ and in many cases matching
images had been added to different match groups from the ones expected and as a result were not
being recorded by the automated analysis.
The results of all of the Stage 1 searches were therefore manually examined to try to determine visually
whether or not they had returned a true match.
It is important to note that human comparison of unfamiliar faces is difficult, even for experts trained in
this field and working with high quality, high resolution images. In some cases, although the actual
images were different the subject’s clothing and the custody environment were clearly the same, and
making a decision was relatively straightforward. In others, multiple images of the potentially matching
subject were returned and these were used to help confirm the decision, but inevitably the results in this
section are a subjective assessment and must be treated with caution.
Of the total of 211 original probe (enquiry) images:
- 70 probe images were determined to have matched against exact duplicates in the database before addition of the probe image in Stage 2, regardless of whether or not they were eventually both included in the same match group;
- 56 probe searches were determined to be true matches to other, pre-existing, images of faces in the database, regardless of whether or not they were eventually added to the same match group;
- 43 probe images were determined to be genuine non-matches (i.e. a new match group had been created for them and they did not appear in any of the other match groups returned by the search).
For the remaining images it was impossible to reach a conclusion.
In summary, out of the initial 211 searches, the automated facial search of PND identified just 20 true matches, whereas visual examination by the tester identified a total of 56. For declared non-matches, the visual examination identified 43 images, whereas the automated analysis declared 101.
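As a quick arithmetic check on the headline figures (this assumes the 20 automated matches are a subset of the 56 manually confirmed ones, which the text implies but does not state explicitly):

```python
# Headline figures from the manual validation in Section 7.5.
auto_matches, manual_matches = 20, 56
auto_non_matches, manual_non_matches = 101, 43

# Proportion of manually confirmed matches that the automated
# analysis also identified (assumed subset relationship).
match_recall = auto_matches / manual_matches
print(f"Automated analysis confirmed {match_recall:.0%} of manual matches")

# The automated analysis declared this many more non-matches than
# the manual examination did.
print(f"Excess automated non-matches: {auto_non_matches - manual_non_matches}")
```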
7.6 Interpretation of Results
The large discrepancies between the automated and manual analysis indicate that this methodology is
not suitable for extension to larger tests of the automated facial search. In the main, this can be
attributed to shortcomings in the custody data which is uploaded to PND by police forces, not just in
terms of image quality but in the quality and quantity of associated metadata accompanying the images.
Evidence for this is provided by the following:
Duplicate Images
Over 40% of all searches launched found at least one (and often many more) duplicate images
already in the database. PND is designed to accept data from forces ‘as is’ and thus does not
carry out any checks that might highlight such duplication prior to enrolment in the database.
This issue of duplicate images is not a new problem. FR evaluations undertaken by PITO6 in
2006 using data from the FIND7 pilot indicated that around 60% of those images were present
more than once. With such a small sample size it is impossible to know if the 40% figure found
here is truly representative, but it should be noted that for a database of over 13 million images,
this would equate to around 5M images that are held in the database unnecessarily, with all of
the associated storage and licence costs. As many of the duplicates appear multiple times there
6 Police IT Organisation – the pre-cursor to the National Policing Improvement Agency 7 Facial Images National Database – a PITO project to establish a national custody image database but which resulted in a pilot system but
which was subsequently cancelled by the NPIA in 2008
OFFICIAL – SENSITIVE
16 OFFICIAL - SENSITIVE
is reason to believe the true figure may actually be much higher.
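The "around 5M" figure quoted above can be reproduced with simple arithmetic; this is a coarse scaling of the 40% sample rate to the full database, not a measured count:

```python
database_size = 13_000_000   # approximate number of images held on PND
duplicate_rate = 0.40        # fraction of test searches that hit a duplicate

# If ~40% of images have at least one duplicate copy already in the
# database, roughly that fraction of the collection is redundant.
# This is a rough extrapolation from a very small sample.
redundant = int(database_size * duplicate_rate)
print(f"Estimated redundant images: ~{redundant:,}")
```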
Images assigned to multiple Match Groups
On receipt of custody records from individual forces, PND assigns them into Match Groups. If
there is sufficient demographic data to link a record to an existing one, it will be assigned to that
Match Group, otherwise a new group will be created. As with duplicate images, shortcomings in
the custody data uploaded to PND by forces often result in insufficient metadata to link records of
an individual, with the result that a single individual appears in multiple match groups.
The conclusion from this study is that the use of Match Groups by PND makes analysis of the
data from tests like these extremely difficult.
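The assignment behaviour described above can be sketched as follows. The linkage key (surname plus date of birth) is purely illustrative; PND's actual demographic matching rules are not described in this report.

```python
def assign_match_group(record, groups):
    """Place a custody record into a Match Group and return the group key.

    If the demographic data links the record to an existing group it is
    added there; otherwise a new group is created. `groups` maps group
    keys to lists of records.
    """
    key = (record.get("surname"), record.get("dob"))
    if all(key) and key in groups:
        groups[key].append(record)       # linked to an existing group
        return key
    # Missing or unlinkable metadata forces a fresh group -- this is how
    # one individual can end up spread across several Match Groups.
    new_key = key if all(key) else ("unlinked", len(groups))
    groups.setdefault(new_key, []).append(record)
    return new_key
```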
Comparison Scores and Matching Thresholds
Despite concerns over the statistical validity of the results from such a small sample size it has been
possible to plot comparison score distributions for the manually confirmed ‘matching’ and ‘non-matching’
searches.
In an ‘ideal’ biometric system, the two distributions would be well separated, allowing a threshold to be
set that returned all of the ‘true’ matches but no ‘false’ matches. In practice this is rarely, if ever,
observed.
The original aim of this test was to pave the way for a much larger scale evaluation using the same
methodology, thereby producing more statistically valid results. In the light of the results reported in this
paper, it now seems that this approach is inadequate.
Nevertheless, even with the limited data available the existence of two separate distributions is apparent,
with most non-matching images scoring no higher than 0.8, while the overwhelming majority of true
matches scored in excess of 0.95. These very high scores may be attributed to the fact that custody
images were used as the probes and, despite concerns around image quality, were matching against
images from previous arrests taken under the same conditions and often in the same custody suite.
Note also that using non-custody images taken with different camera equipment and in different
environments would produce very different (and generally lower) comparison scores and the differences
between matches and non-matches would not be so clear.
[Figure: Comparison Score distribution – number of images against comparison score (0.5 to 1.0), plotted separately for True Matches and True Non-Matches]
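The separation described above can be illustrated with invented scores consistent with the plotted distributions (non-matches at or below about 0.8, true matches mostly above 0.95); these values are examples, not the test data:

```python
# Invented scores approximating the two distributions described in the text.
true_match_scores = [0.96, 0.97, 0.99, 1.0, 0.93, 0.98]
non_match_scores  = [0.55, 0.62, 0.70, 0.74, 0.78, 0.80]

def rates(threshold):
    """Fraction of true matches and of non-matches at or above threshold."""
    tpr = sum(s >= threshold for s in true_match_scores) / len(true_match_scores)
    fpr = sum(s >= threshold for s in non_match_scores) / len(non_match_scores)
    return tpr, fpr

# With these example values a high threshold separates the two
# distributions cleanly, while a lower one also admits non-matches.
print(rates(0.90))
print(rates(0.65))
```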
8 Conclusions and Recommendations
8.1 General
As highlighted in the House of Commons Science and Technology Committee’s 6th report (Current and
Future Uses of Biometric Data and Technologies):
‘When biometric systems are employed by the state in ways that impact upon citizens’ civil
liberties, it is imperative that they are accurate and dependable. Rigorous testing and evaluation
must therefore be undertaken prior to, and after, deployment, and details of performance levels
published.’
The evaluation approach detailed in this report was developed as a possible way of mitigating the fact
that no formal accuracy assessment had been carried out prior to operational deployment of the facial
search capability on PND.
In terms of the original evaluation objectives, the small sample size and lack of any ground truth for the
images on PND means that it has not been possible to draw any firm conclusions with regards to search
accuracy and, given the very ‘noisy’ nature of the data, even large scale tests using this approach seem
unlikely to succeed. Nonetheless, in the absence of any additional funding or contractual obligation on
the Systems Integrator to carry out such performance testing, this work does still provide insights as to
the way facial images are currently stored on PND and their suitability for use with an automated facial
recognition algorithm as currently implemented on PND.
RECOMMENDATION 1. It is essential that performance assessment of biometric systems be undertaken as part of the implementation process, prior to operational deployment. A 'lessons learned' exercise should be undertaken to understand why such formal testing was not included within the scope of this particular deployment.
RECOMMENDATION 2. Accuracy assurance should be embedded as an essential requirement
in the development and implementation of future systems which include a biometric element
within them. Requirements governing this must be adequately covered within the contractual
arrangements with the supplier(s).
OBSERVATION 1. The benefits of undertaking a full scale evaluation on the current system at
this stage in the contract would most likely be outweighed by the cost as no suitable test
environment currently exists. However, future enhancements to the system may provide
opportunities whereby such testing could be accommodated for relatively little additional
expense. For example, if a second (backup) FR capability for PND was to be procured to
provide business continuity, it could potentially also be used for accuracy evaluation and other
testing purposes, provided the requirements for such activities are identified at the design stage.
8.2 Image Quality
The quality of custody images being uploaded to PND by forces is a cause for concern, with none of the
images used in this test fully meeting the 2008 police custody image standard (which itself was only ever
intended to be a minimum standard).
While PND accepts data ‘as is’ from forces, it is understood that steps have now been taken to exclude
images smaller than 10KB from enrolment into the FR database and to recommend an optimum image
file size of 50KB to forces. However, these measures alone are not sufficient to guarantee image quality
as they do not take into account other aspects of the quality of the image (e.g. what percentage of the
image the head actually occupies).
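A sketch of such an enrolment gate follows, using the 10KB minimum from the text together with an invented head-to-frame check as one example of a quality aspect that file size alone cannot capture; the acceptable-range bounds are illustrative assumptions.

```python
MIN_FILE_SIZE_KB = 10      # images below this are excluded from FR enrolment
OPTIMUM_FILE_SIZE_KB = 50  # file size recommended to forces

def passes_enrolment_checks(file_size_kb, head_area_fraction):
    """Return (ok, reasons) for a candidate custody image.

    head_area_fraction is the fraction of the frame occupied by the
    head; the 0.1-0.8 acceptable range here is an assumed example.
    """
    reasons = []
    if file_size_kb < MIN_FILE_SIZE_KB:
        reasons.append("file smaller than 10 kB")
    if not 0.1 <= head_area_fraction <= 0.8:
        reasons.append("head occupies unsuitable fraction of frame")
    return (not reasons, reasons)
```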
Hence it is recommended that:
RECOMMENDATION 3. Consideration should be given to implementing a dedicated face image
quality software check as part of the FR enrolment process on PND, with parameters chosen8 to
exclude those images that are unsuitable for use with an FR algorithm.
RECOMMENDATION 4. Periodic feedback should be provided to forces on the percentage of
images which are rejected along with the reasons for rejection, in order to encourage the capture
of better quality images in the future.
RECOMMENDATION 5. By the time an image is uploaded to PND it is too late to address any
shortcomings in quality. In the longer term therefore the use of image quality software at the
point of capture is recommended (similar to that used on Livescan units for fingerprints). This
would require the facial image of the subject to be re-captured if it failed to meet certain quality
criteria considered necessary for effective use with facial recognition systems.
OBSERVATION 2. While not directly relevant to this particular evaluation it should also be noted
that internationally there is a trend towards taking much higher resolution custody images, as well
as taking images at multiple pose angles, to support forensic facial image comparison as well as
automated facial image matching.
8.3 Comparison Thresholds and Candidate List Length
The selection of appropriate matching thresholds is very dependent on the type of images being used,
both those in the database and the probe images. This particular evaluation only used custody images
as probes and the results suggest that a very high threshold setting would yield acceptable results in
such cases. However this type of image is not typical of those used in the majority of operational cases.
In general the greater the difference in image quality between the probe image and those in the
database, the lower the threshold scores for true matches are likely to be.
Prior to these tests the supplier’s default setting of 0.7 was in use, but a decision has now been taken by
the National User Group (NUG) to lower this to 0.65. This has the effect of increasing the number of
candidate images returned from the database for the investigating officer to review manually, but
increases the likelihood of a true match being returned in the list (if one actually exists in the database).
Balancing threshold settings and the maximum number of respondents returned, in order to maximise
the operational benefit that the FR (or any other biometric) capability delivers, is not a trivial exercise and
would require a far more extensive test than the one being reported on here.
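The trade-off can be illustrated as follows; the scores and the cap on candidate list length are invented for illustration and are not PND parameters.

```python
def candidate_list(scores, threshold, max_candidates=20):
    """Return candidate scores at or above threshold, best first, capped."""
    hits = sorted((s for s in scores if s >= threshold), reverse=True)
    return hits[:max_candidates]

# Example database scores for one search (invented values).
scores = [0.66, 0.68, 0.71, 0.64, 0.73, 0.69]

# Lowering the threshold from 0.70 to 0.65 lengthens the list the
# investigating officer must review, but raises the chance that a
# true match (if one exists) is included.
print(len(candidate_list(scores, 0.70)))
print(len(candidate_list(scores, 0.65)))
```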
RECOMMENDATION 6. The data from this evaluation is insufficient to draw any firm
conclusions about what the correct threshold setting should be. A formal process should be
introduced to monitor the impact of the recent change in threshold from 0.7 to 0.65, and feedback
sought from the user community as to whether the lower value has led to an increase in the
number of matches being found.
8 Note that optimisation of face image quality software requires significant testing to ensure that the parameters chosen are appropriate
for a particular application. The large variations seen in image size and quality would make this particularly challenging in the case of the police custody images currently being uploaded to PND: balancing the need to exclude those that provide no value (and may actually degrade overall performance) against the benefits of having as many images as possible in the database.
8.4 Multiple Match Groups and Duplicate Images
Although the test sample size was small, this evaluation suggests that there is a high proportion of
duplicate images on PND, both within a single Match Group and across multiple Match Groups. In many
biometric systems de-duplication of the data is routinely undertaken to improve data integrity and to
reduce the total number of images / templates that need to be stored, thereby reducing both storage and
licence costs.
These tests suggest that there is the potential to reduce the total number of images stored on PND (and
thus templates in the FR database) by around a third, and possibly much more. The cost of extending
the Cognitec licence beyond 10M templates is understood to have been £120k plus an ongoing charge
of £18k per year thereafter, something that would not have been necessary had de-duplication been built
into the enrolment process or, preferably, addressed at a local force level prior to uploading images onto
PND.
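De-duplication at enrolment of the kind described could be sketched as follows. Hashing the raw image bytes catches exact byte-level duplicates only; a real system would also need near-duplicate detection (e.g. an image-similarity measure), which is not shown here.

```python
import hashlib

def deduplicate(images):
    """Return the images with exact byte-level duplicates removed,
    preserving first-seen order.

    `images` is an iterable of raw image byte strings.
    """
    seen, unique = set(), []
    for img in images:
        digest = hashlib.sha256(img).hexdigest()
        if digest not in seen:
            seen.add(digest)      # first copy is kept and enrolled
            unique.append(img)    # later identical copies are skipped
    return unique
```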
The large number of multiple Match Groups associated with a duplicate image of a single individual is
largely due to the poor quality / lack of metadata associated with the images, preventing consolidation
into a single Match Group. While data quality will always be a challenge on a system designed to store
intelligence data there is no technical reason why this should be the case with custody images which are
taken in a controlled environment alongside fingerprints and DNA.
The police image standard developed by PITO / NPIA provided some information on the metadata that
should be provided with the images and the NPIA also developed a Corporate Data Model to improve
consistency and interoperability of police data. However, take up of both of these with respect to
custody images has been limited to date. In the longer term, the development and implementation of
common data formats should be mandated.
RECOMMENDATION 7. Police forces need to be made aware of the importance of properly
managing their local custody image collections and capture processes if the benefits of having
national collection are to be fully realised.
RECOMMENDATION 8. Police forces should be made aware of the importance of providing
sufficient and accurate metadata with the images, so that records relating to the same subject
can be properly linked. In the longer term, the development and implementation of common data
formats should be mandated.
RECOMMENDATION 9. Some of the issues observed with duplicate images and inconsistent metadata are a result of the fragmented landscape of custody imaging systems and processes across forces. Alternative approaches to the storage and management of custody image data should be explored. Better integration between local custody imaging systems and other national systems (e.g. IDENT1 and PNC) would be one means of improving data integrity. Alternatively, a centralised solution such as that currently in place for fingerprints and DNA data would improve data quality and integrity and should, in the longer term, be more cost effective than continuing to maintain separate systems in each force.
Annex A – An example of the probe image data spreadsheet
Enquiry Image Details:
- UniqueID / Test Image number
Subject details:
- M/F
- Ethnicity
- Age
Image Details:
- Compliant pose
- Glasses
- Beard / Moustache
- Other comments
- File size (kB)
- Image dimensions
- I2I distance
- Overall quality (1-10)
Annex B – An example of the results analysis spreadsheet
Enquiry Image_ID | Score Part 2 | Match Group Part 2 | Image ID Part 2 | Score Part 1 | Match Group Part 1 | Image ID Part 1
1075 | 1 | 96534931561 | 57D1FDD5C44C404EBA2BEB1F90D8FCC3 | 0.60667384 | 349464132480 | F2E014A7F26442538658D5C891478A2C
1085 | 1 | 835941528399 | 87F522589B634E2B8D272182D631A53E | 0.7349086 | 74831202655 | 94FB87D83306495D930B52DAD221C06D
1090 | 1 | 3748440744 | 1B0AF15831BE4B7C954B005BCCBA5C15 | 1 | 3748440744 | 1B0AF15831BE4B7C954B005BCCBA5C15
1110 | 1 | 108686818734 | 67DEF9451429415D8A5815B641A339FB | 0.9995563 | 108686818734 | 5454E7AF7BAE45EDA8D8681ACDBC4E99
3005 | 1 | 65566390764 | F40C85415E584B18A932BA3F9B79073A | 1 | 65566390764 | 775558CAFAC84DD592E4E895819319FE
2012 | 1 | 613539372441 | A9292FBDF54A4285967EB6AE916435EE | 0.6472585 | 50152545500 | D06C224236A641F1B53E5F964E98922B
3000 | 0.72495764 | 421996099964 | 8670F9F5D87F416E8F24DF2598BE6F3F | 1 | 738700910263 | 09352D5150E947C3BBC4C1E8F2B31E9A
3003 | 1 | 243620570040 | 121BF5DFAD8148C0BC83CCBD77D85517 | 1 | 243620570040 | 121BF5DFAD8148C0BC83CCBD77D85517
3005 | 1 | 65566390764 | F40C85415E584B18A932BA3F9B79073A | 0.8936661 | 65566390764 | 775558CAFAC84DD592E4E895819319FE
3006 | 0.9999833 | 718376855844 | C321E7803CC04604B0904E9293979712 | 0.9999833 | 718376855844 | 4F639197E4A64BDA9E97BA2A3EF4336B