schofield 1996

Pergamon Pattern Recognition, Vol. 29, No. 8, pp. 1421-1428, 1996

Elsevier Science Ltd Copyright © 1996 Pattern Recognition Society

Printed in Great Britain. All rights reserved 0031-3203/96 $15.00 + .00

0031-3203 (95) 00163-8

A SYSTEM FOR COUNTING PEOPLE IN VIDEO IMAGES USING NEURAL NETWORKS TO IDENTIFY THE BACKGROUND SCENE

A. J. SCHOFIELD,* P. A. MEHTA and T. J. STONHAM Department of Electrical Engineering and Electronics, Brunel University, Uxbridge UB8 3PH, U.K.

(Received 26 January 1995; in revised form 7 November 1995; received for publication 24 November 1995)

Abstract--A method for counting the number of people in any pre-defined scene is described. The method has three distinct stages: image pre-processing, background identification and object search. The method was designed to provide accurate counts, even when the background scene was allowed to vary. This tolerance to changes in the background scene was achieved using RAM-based neural network classifiers to identify sections of the background scene in each test image. The system was implemented on relatively low cost hardware and was found to give good results at moderately high frame rates. Copyright 1996 Pattern Recognition Society. Published by Elsevier Science Ltd.

People counting Neural network Serial search

RAM-based classifier Background identification

1. INTRODUCTION

1.1. The task

The system described in this paper was designed to address the task of automatically counting people in complex and variable scenes. Images were captured using a CCD video camera and then analysed to determine the number of people present. The background scene (that is, the set of all images of the scene containing no people) was free to vary in a number of ways, including variations in lighting levels and patterns, and the occasional movement of objects, such as doors, that might appear in the scene. The background scene was also complex in that it could contain a number of objects with a variety of reflectance properties which might compound the effects of variations in lighting.

Figures 1 (a) and (b) show two example images of the same background scene under different lighting conditions. This type of background scene is typical of the scenes that the system was designed to deal with. These images were taken from a ceiling mounted camera, fitted with a wide angle lens, looking down on the scene. In one image the room lights are off while in the other they are on, resulting in both a general improvement in contrast and strong reflections on the floor. Figure 1 (c) shows the same scene with some people present; the task of the people counting system is to count the number of people in such images.

*Address for correspondence: Dr A. J. Schofield, School of Psychology, University of Birmingham, Edgbaston, Birmingham, B15 2TT.

1.2. Applications

The principal application of the system described here is to count the number of passengers using and waiting to use the lifts in large buildings so as to improve the service provided by the lifts by reducing the waiting and travelling times of the passengers. For this application, plan views, like those used in this report, are quite appropriate since the space available in lift cars and lift lobby areas is limited and the overhead view maximizes the field of view and minimizes the number of occlusions. So, Chan, Kuok and Liu(1) describe a people-counting system for use with lift systems that is somewhat similar to that described here. Their system relies on image subtraction alone at the background removal stage and hence is less likely to be able to deal with changes in the background than the neural network approach described here.

Other potential application areas include pedestrian flow surveys, crowd control, security access control and building usage surveys. It is clear, however, that some of these applications might require cameras to be sited such that they give a panoramic view and the method has not been tested in such situations.

1.3. Object search versus background identification

The problem of counting the number of people in an image is an instance of the more general problem of counting objects. In order to count objects their presence must be registered or detected. This does not imply that the objects need to be recognized as individual items or that their locations be specified; rather, the pattern recognition system is simply required to know that some object(s) are present that do not normally occur in the empty background scene and be able to estimate the number of individual objects. This problem might be tackled in two ways: object search, and background identification.

Fig. 1. (a) An example background scene with room lights on. (b) The same scene with room lights off. Notice the strong reflections on the floor in (b) and the general improvement in contrast when the lights are on. (c) The same scene with seven occupants.

Object search involves searching the image looking for shapes that correspond to some archetypal (or model) object shape (for example, see paper by Sullivan(2)). Although objects need not match the archetype exactly to be counted, such methods perform best when the objects of interest have a relatively stable shape, that is, when objects of the same type are similar and individual objects are non-deformable. People are not easily modelled because they represent a heterogeneous class. Moreover, people are not rigid objects and hence consecutive images of the same person, taken from the same relative camera position, can differ considerably. Another approach for detecting objects involves searching for key features that are common to objects in the target class. For example, movement has been used as a feature for detecting people.(3-5) However, the search for moving objects may not be appropriate when the people to be counted are standing still, such as when they are in a queue.

Background identification involves detecting all parts of the image that conform to the patterns found in the normal background scene when no people are present. Those parts of the image that are not part of the background are either assumed to be objects of interest (in the current case people) and can be counted or may be compared with some object description prior to being added to the count. The identification of background regions facilitates both the segmentation of images into background and foreground regions (figure/ground segmentation), and the removal of the background image (background removal). For the purposes of the system described here background removal and figure/ground segmentation would be just as useful to the counting process as background identification alone and hence solutions to these tasks must be considered.

The two methods for detecting objects (object search and background identification) are best suited to different situations. Object search is most effective when the objects are less variable than the background image. Background identification works best when the background is less variable than the objects. In the case of people counting, although the background is free to vary, it is less variable than the people, and hence background identification was chosen as the more appropriate method. Various approaches to background identification are considered in the next section.

1.4. Background identification methods

One method for background removal is to subtract a typical (previously stored) background image from each image under test. This method has been applied with some success to scenes with a simple, uniform and fixed background; (` *) here an estimate of the number of people was obtained by totalling the number of pixels that were not part of the background scene. If, however, the background scene varies considerably over time (see for example Fig. 1), then it is possible for an image to contain no objects (people) and yet be very different from the stored background image. Image subtraction using a single background image (without any further processing) cannot deal with the sort of variations found in background scenes like that of Fig. 1.
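The weakness of plain subtraction can be seen in a minimal sketch. It assumes grey-scale images held as nested lists; the function name `count_changed_pixels`, the `tolerance` value and the tiny test images are illustrative assumptions, not taken from the systems cited above.

```python
def count_changed_pixels(image, background, tolerance=10):
    """Count pixels whose grey level differs from the stored background.

    `tolerance` is a hypothetical noise margin, not a value from the paper.
    """
    changed = 0
    for row_img, row_bg in zip(image, background):
        for pixel, bg_pixel in zip(row_img, row_bg):
            if abs(pixel - bg_pixel) > tolerance:
                changed += 1
    return changed


background = [[100, 100, 100], [100, 100, 100]]
with_person = [[100, 30, 100], [100, 35, 100]]  # dark object in the middle column
brighter = [[140, 140, 140], [140, 140, 140]]   # same empty scene, lighting changed

person_pixels = count_changed_pixels(with_person, background)
lighting_pixels = count_changed_pixels(brighter, background)
```

In this toy case the empty but brighter scene yields more changed pixels than the occupied scene, which is the failure mode described above.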


Background removal methods that can deal with variations in the background scene have been proposed. Many such systems attempt to generate an average background image based on a number of example images. Such systems have the disadvantage that relatively uncommon versions of the background can become 'buried' in the average image. Also, the average background image will be slightly different from each of the contributing images and hence there is scope for detecting spurious objects in individual examples of the background scene. Kilger and Dietl,(6) among others,(7-9) have proposed adaptive methods for generating a background template from a continuous stream of input images including images containing non-background objects. While these systems are very effective they still result in a single background template which may differ from the actual background images in some unspecified way; this is especially true if the background scene is subject to rapid variations. Most of these methods involve a time constant for updating the background images which must be sufficiently low to allow changes in the background scene to be incorporated into the average background yet high enough to ensure that people are not included in the background even if they remain stationary in the scene for some time. Further, such systems always result in a single background image which may under-represent certain background features which might then be erroneously counted.
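The trade-off in choosing the time constant can be illustrated with a simple exponential running average. This is one common form of adaptive background update, used here as an assumption; it is not claimed to be the scheme of the cited methods, and the function name and numerical values are hypothetical.

```python
def update_background(background, frame, time_constant):
    """Blend a new frame into the background template.

    Each pixel moves 1/time_constant of the way toward the new frame, so a
    larger time constant means slower adaptation.
    """
    rate = 1.0 / time_constant
    return [
        [(1.0 - rate) * b + rate * f for b, f in zip(row_b, row_f)]
        for row_b, row_f in zip(background, frame)
    ]


template = [[100.0, 100.0]]
frame_with_person = [[100.0, 20.0]]  # a stationary person darkens the second pixel

fast = template
slow = template
for _ in range(20):  # the person stands still for 20 frames
    fast = update_background(fast, frame_with_person, time_constant=5)
    slow = update_background(slow, frame_with_person, time_constant=200)
```

With the short time constant the stationary person is almost fully absorbed into the template after 20 frames, whereas the long time constant leaves the template close to its original value but would be equally slow to follow a genuine change in the background.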

In the system described here RAM (random access memory)-based neural networks(10,11) were used to create a generalized representation of the background scene that incorporated features from many background images. The networks operated on the difference images obtained from a subtraction process and were designed to counter the limitations of image subtraction as a method for background removal. RAM-based networks are somewhat like look-up tables that can be used to record the occurrence of certain patterns during an initial training phase, but unlike look-up tables, they are able to generalize such that a test pattern need not be exactly like any of the training examples to be recognized by a network once training is complete. Further, these networks will recognize new combinations of the original training patterns.
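The difference between a look-up table and a RAM-based network can be shown with a toy example. The sketch below is a simplification under stated assumptions: patterns are only 4 bits long, each memory is addressed by 2 consecutive bits rather than by randomly chosen pixels, and all names are illustrative.

```python
def tuples_of(pattern, tuple_size=2):
    """Split a binary pattern into consecutive address tuples."""
    return [tuple(pattern[i:i + tuple_size])
            for i in range(0, len(pattern), tuple_size)]


training = [(0, 0, 0, 0), (1, 1, 1, 1)]

lookup_table = set(training)                    # stores whole patterns only
rams = [set() for _ in tuples_of(training[0])]  # one small memory per tuple
for pattern in training:
    for ram, address in zip(rams, tuples_of(pattern)):
        ram.add(address)


def recognized(pattern):
    """True if every memory saw its part of the pattern during training."""
    return all(address in ram for ram, address in zip(rams, tuples_of(pattern)))


novel = (0, 0, 1, 1)   # a new combination of parts seen in training
unseen = (0, 1, 0, 1)  # contains parts never seen in training
```

Here `novel` is absent from the look-up table yet is accepted by the tuple memories, because each of its parts occurred in some training pattern; `unseen` is rejected by both.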

Unlike an average image, the representation formed by RAM-based networks gives equal weight to all of the contributing images and is thus able to simultaneously represent many features at each location of the image. When test images are compared with the generalized representation, areas of background scene can be identified. The representation preserves all of the relevant information from the many background images, reducing the risk of detecting spurious objects.

The major drawback of the RAM-based approach used here is that it requires binary images. This implies a thresholding operation which introduces the risk of information loss early in the counting process. While methods for inputting grey-scale images into RAM-based networks have been proposed,(11) they were not deemed appropriate for use in this instance.

1.5. Overview

The people-counting algorithm can be split into three distinct stages: pre-processing, background identification and object counting. The pre-processing stage was concerned with thresholding and resolution reduction; the operation of this stage is discussed in Section 2. The background identification stage is discussed in Section 3 and the object counting stage in Section 4. Issues relating to the implementation of a prototype system are discussed in Section 5 and the performance of the resulting system is discussed in Section 6. Comments regarding further work are made in Section 7.

2. IMAGE PRE-PROCESSING

2.1. Image sub-sampling

It is often assumed that high-resolution images are required for pattern analysis and recognition systems; however, it is frequently the case that sufficient information can be obtained from images with surprisingly low resolution. Initially the people-counting method described below was applied to medium resolution images (362 x 268). Once the algorithm had been tested and found adequate for the people counting task it was tested on low-resolution images and was found to work equally well when the resolution was reduced to just 96 x 72 pixels. This reduction in resolution was achieved by sub-sampling the input images. This sub-sampling took place before any image processing operations and resulted in a significant reduction in the overall processing time.
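Sub-sampling of this kind amounts to keeping one pixel in every N on one line in every N. The sketch below uses the figures given in Section 5 (a 768 x 576 frame sampled at 1 in 8); the function name and the synthetic test frame are assumptions made for illustration.

```python
def subsample(image, step):
    """Keep every `step`-th pixel of every `step`-th line."""
    return [row[::step] for row in image[::step]]


# A synthetic frame of 576 lines by 768 pixels (PAL-sized, as in Section 5).
frame = [[(x + y) % 256 for x in range(768)] for y in range(576)]
small = subsample(frame, 8)
```

The result has 72 lines of 96 pixels, i.e. 1/64 of the original number of pixels, which is where the saving in processing time arises.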

2.2. Threshold

It was noted earlier that the RAM-based neural networks used in this project required binary images as input. Thresholding is a process that requires some care because it removes much information from the image. A simple fixed threshold at (say) grey level 128 was not considered appropriate since this method takes no account of variations in ambient lighting levels. An adaptive global threshold, where the threshold level is a function of the mean grey level of the image, would have dealt with global changes in light levels, but would not have been able to deal with non-uniform changes in lighting.

Non-uniform changes in illumination can be dealt with using an adaptive local threshold. Here the grey level of each pixel is compared with the average grey level of the pixels in its immediate neighbourhood. For small neighbourhoods local adaptive thresholding is similar to edge detection. A system using this type of adaptive local thresholding has already been described.(12) Due to the differential nature of edge detection this system tended to increase the noise level of the


RAM-based neural networks were originally conceived of and constructed as hardware devices comprising arrays of hardwired random access memories. Perhaps as a consequence of this history RAM-based networks are most easily described in terms of a hardware implementation, as is the case below. However, the emergence of fast serial computers has allowed software implementation of these networks. Despite the hardware description given here, the neural network classifiers used in this project were implemented entirely in software.

3.2. Construction and training

Figure 3 shows a schematic diagram of the background removal stage. The pre-processed images were divided into 4 x 4 pixel sections. The pixel values from each section were applied to a single RAM-based neural network classifier. The structure of each classifier was identical to that shown inside the upper insert of Fig. 3. Each unit in each classifier can be thought of as a small 1-bit deep RAM with four address lines, containing 16 (1-bit) memory locations. Unlike conventional memory systems the RAMs were not used to store the image data. Instead the pixels of the input image were used as addresses for the RAMs. Groups of 4 pixels were selected, at random, from the image sections and together formed the address for one unit. This random mapping of pixels to address lines was a one-to-one mapping and remained constant throughout the lifetime of the classifier.

The classifiers were trained as follows. Background images were applied to the address lines of the RAMs. A logic "1" was written into each RAM at the addressed location. Another image was applied, generating a new set of addresses, and another "1" written into each RAM. This process was repeated for a number of training examples. Training images were only shown once because once a "1" had been written into a location it could not be overwritten by a "0".
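The construction and training of one classifier might be sketched in software as follows. This is a minimal illustration, not the authors' implementation: the class name, the use of a seeded shuffle to obtain the fixed random mapping, and the representation of a section as a flat list of 16 binary pixels are all assumptions.

```python
import random


class SectionClassifier:
    """One RAM-based classifier for a 16-pixel (4 x 4) image section."""

    def __init__(self, n_pixels=16, tuple_size=4, seed=0):
        order = list(range(n_pixels))
        random.Random(seed).shuffle(order)  # fixed random one-to-one mapping
        self.groups = [order[i:i + tuple_size]
                       for i in range(0, n_pixels, tuple_size)]
        # Each unit is a 1-bit-deep RAM with 2**tuple_size locations.
        self.rams = [[0] * (2 ** tuple_size) for _ in self.groups]

    def addresses(self, section):
        """Form one address per RAM from the binary pixels wired to it."""
        result = []
        for group in self.groups:
            address = 0
            for pixel_index in group:
                address = (address << 1) | section[pixel_index]
            result.append(address)
        return result

    def train(self, section):
        """Write a 1 at the addressed location of every RAM."""
        for ram, address in zip(self.rams, self.addresses(section)):
            ram[address] = 1


classifier = SectionClassifier()
blank = [0] * 16
classifier.train(blank)
classifier.train(blank)  # repeating an example changes nothing
stored_ones = sum(sum(ram) for ram in classifier.rams)
```

Because training only ever writes 1s, presenting the same example twice leaves the memories unchanged, which is why a single pass through the training set suffices.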

3.3. Image testing/background detection

Once training was complete images were tested as follows. Each test image was applied to the address lines of the RAMs and the contents of each memory read out on the data-out lines. Whenever an individual RAM-unit was presented with an address that it had seen during training it output a logic "1"; if the address had not been seen during training the RAM output a "0". Most image sections produced a mixture of 1s and 0s on the data-out lines, but when an image section was similar to the corresponding section in the background scene there were more 1s than 0s. Adding together all of the outputs for each image section generated section scores that were high for image sections that were similar to the background and low for sections that were part of a non-background object. The output scores from the classifiers were then inverted such that the sections of the image that were least like the background produced the highest scores. The adjusted scores were then used to build up a section scores image in which sections corresponding to the background were shaded dark and those corresponding to part of a non-background object were shaded light.
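The testing procedure can be sketched as a function that turns a binary image into a grid of inverted section scores. The sketch makes several assumptions that are not stated in the text: each RAM is represented by a set of seen addresses rather than a 16-location memory (an equivalent form), the same random mapping is reused for every section, and all function names are illustrative.

```python
import random

SECTION = 4      # sections are 4 x 4 pixels
TUPLE_SIZE = 4   # pixels per RAM unit


def make_mapping(seed=0):
    """Fixed random grouping of the 16 section pixels into 4 tuples."""
    order = list(range(SECTION * SECTION))
    random.Random(seed).shuffle(order)
    return [order[i:i + TUPLE_SIZE] for i in range(0, len(order), TUPLE_SIZE)]


def section_addresses(image, top, left, mapping):
    """Addresses presented to the RAMs of the section at (top, left)."""
    pixels = [image[top + r][left + c]
              for r in range(SECTION) for c in range(SECTION)]
    return [tuple(pixels[i] for i in group) for group in mapping]


def train(memory, image, mapping):
    """Record the addresses seen in every section of a background image."""
    for top in range(0, len(image), SECTION):
        for left in range(0, len(image[0]), SECTION):
            seen = memory.setdefault((top, left), [set() for _ in mapping])
            addresses = section_addresses(image, top, left, mapping)
            for ram, address in zip(seen, addresses):
                ram.add(address)


def section_scores(memory, image, mapping):
    """Inverted scores: 0 for background-like sections, 4 for novel ones."""
    scores = []
    for top in range(0, len(image), SECTION):
        row = []
        for left in range(0, len(image[0]), SECTION):
            addresses = section_addresses(image, top, left, mapping)
            hits = sum(address in ram
                       for ram, address in zip(memory[(top, left)], addresses))
            row.append(len(mapping) - hits)  # invert: high means non-background
        scores.append(row)
    return scores


mapping = make_mapping()
memory = {}
empty = [[0] * 8 for _ in range(4)]  # a 4 x 8 binary image: two sections
train(memory, empty, mapping)

test_image = [row[:] for row in empty]
for r in range(4):
    for c in range(4, 8):
        test_image[r][c] = 1         # an object fills the right-hand section
scores = section_scores(memory, test_image, mapping)
```

In this toy case the left-hand section matches the training data and scores 0 (dark), while the right-hand section matches none of its stored addresses and receives the maximum score of 4 (light).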

Figure 4 shows an example section scores image. The neural networks had previously been trained on a number of images of the same scene under a variety of lighting conditions, but with no people present. The test image, which contained seven people, is shown in Fig. 1 (c).

[Fig. 3: schematic diagram of the background removal stage; diagram labels include the sub-sampled image, the random mapping, the classifier and its data-out lines.]


images. Further, the method tended to remove all but the edges of objects within the scene and this tended to discourage the subsequent object counting stage from locating the true centres of the people being counted, causing it to occasionally count the gap between two people.

The current system used an alternative thresholding method that was less prone to noise and which preserved information from the centres of objects as well as their edges. A single example background image was stored. This image was then subtracted from each test image and the absolute values found (this step is similar to background removal by subtraction; however, it was not responsible for detecting background objects, but was intended only as a pre-processor for the RAM-based networks). The resulting difference image was a grey-scale image. The mean grey level of this image was estimated from a random sample of 2000 pixels. The difference image was then binarized according to a threshold grey level corresponding to the estimated average plus an offset. 2000 samples corresponds to ca 1/3 of the total number of pixels in the difference image and the resulting estimated mean values were considered good estimates of the true mean but required less time to calculate. Linking the threshold level to the mean grey level (as opposed to using a fixed level) gave the system a certain immunity to changes in overall illumination. The offset applied to the threshold helps to increase the overall noise immunity of the system by causing it to ignore small fluctuations in pixel values due to noise on the video signal. The offset was determined by selecting the minimum value that would result in a blank thresholded image for input images that were identical to the stored image except for noise. In the case of the example sequence discussed here an offset of 10 was found to be appropriate.
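The thresholding step described above might be sketched as follows. The offset of 10 and the sample size of 2000 are taken from the text; sampling with replacement, the strict "greater than" comparison, the function name and the synthetic images are assumptions.

```python
import random


def binarize_difference(image, background, offset=10, n_samples=2000, seed=0):
    """Threshold |image - background| at (estimated mean + offset)."""
    diff = [[abs(p - b) for p, b in zip(row_i, row_b)]
            for row_i, row_b in zip(image, background)]
    flat = [value for row in diff for value in row]
    rng = random.Random(seed)
    # Estimate the mean from a random sample rather than from every pixel.
    sample = [rng.choice(flat) for _ in range(n_samples)]
    threshold = sum(sample) / len(sample) + offset
    return [[1 if value > threshold else 0 for value in row] for row in diff]


width, height = 96, 72
background = [[100] * width for _ in range(height)]
image = [row[:] for row in background]
for y in range(30, 40):
    for x in range(40, 50):
        image[y][x] = 200  # a bright 10 x 10 object

binary = binarize_difference(image, background)
blank = binarize_difference(background, background)
```

An image identical to the stored background gives a completely blank result, while the synthetic object survives thresholding as a solid blob rather than as an outline.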

Whenever the image under test was similar to the background template a completely blank image was presented to the background identification system, greatly easing its task. When the lighting conditions changed the resulting thresholded image contained apparent objects corresponding to the differences in the two background images [see, for example, Fig. 2(a)]. When people were present in the scene they also appeared as bright blobs in the thresholded image [see Fig. 2(b)].

As was noted in Section 1, image subtraction cannot in itself provide a basis for people counting because it cannot distinguish between changes due to people and changes due to lighting and normal background variations. Due to this limitation the thresholded image was not used as the basis for the count. Rather, it was further processed using the neural network technique described in Section 3. The purpose of the neural networks was to discriminate between normal background variations (including lighting changes) and variations due to non-background objects, thus the limitations of the subtraction method were overcome.

Fig. 2. The effects of resolution reduction and thresholding. (a) A thresholded image of the background scene with the lights off when the image used in the subtraction process had the lights on. (b) The result of thresholding the image of Fig. 1 (c).

3. BACKGROUND IDENTIFICATION

3.1. Introduction to RAM-based neural network classifiers

The background removal method described here utilized RAM-based neural networks that were conceptually similar in structure to the neural network classifiers used in the WISARD pattern recognition system.(10,11) RAM-based networks have two advantages over other neural network architectures: they can be trained using examples of a single training class and they are fully trained after a single pass through the training set. In the current application the RAM-based networks were trained on examples of the background images only (that is, no examples of the scene including people were presented to the network during training), yet the system was able to discriminate between parts of the background scene and non-background objects. This is important since it would not have been practically possible to train a neural network on a representative set of all non-background scenes.


for crowded images than it is for images containing only a few people, the minimum distance was reduced as the count for each image increased. For the scene used throughout this paper the minimum distance was set to 5.5 and reduced toward 4.5 for crowded images. The brightness threshold for these images was ca 12% of the maximum possible section score. For plan views, the only calibration required was to determine base values for the minimum distance and brightness thresholds.
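The serial search for peaks described in Section 4 might be sketched as follows. The base and crowded distances (5.5 and 4.5) come from the text, but the linear schedule for reducing the distance, the `crowded_count` parameter, the assumption that distances are measured in sections, and the function name are all hypothetical. The Gaussian smoothing step is omitted: the input is assumed to be an already smoothed, non-negative section score image.

```python
import math


def count_peaks(scores, threshold, base_distance=5.5, crowded_distance=4.5,
                crowded_count=10):
    """Serial search for bright peaks in a pre-smoothed section score image."""
    image = [row[:] for row in scores]  # work on a copy
    peaks = []
    for _ in range(len(image) * len(image[0])):  # at most one pass per section
        value, y, x = max((v, r, c) for r, row in enumerate(image)
                          for c, v in enumerate(row))
        if value <= threshold:
            break                       # brightest remaining section is too dim
        image[y][x] = 0                 # set this section to black
        # Hypothetical schedule: shrink the distance linearly with the count.
        fraction = min(len(peaks) / crowded_count, 1.0)
        min_distance = base_distance - fraction * (base_distance - crowded_distance)
        if all(math.hypot(y - py, x - px) >= min_distance for py, px in peaks):
            peaks.append((y, x))        # far enough from all recorded peaks
    return peaks


scores = [[0] * 12 for _ in range(8)]
scores[2][2] = 4  # first person
scores[2][3] = 3  # same person, neighbouring section
scores[5][9] = 4  # second person
peaks = count_peaks(scores, threshold=1)
```

The returned list gives both the count (its length) and the location of each detected person; the bright section adjacent to the first person is blacked out without being counted because it lies within the minimum distance of an existing peak.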

Fig. 4. Example section score image. The image was produced by processing the image of Fig. 1 (c) using the pre-processing outlined in Section 2 and the background identification method outlined in this Section. The neural network classifiers had previously been trained on multiple examples of the background scene under a variety of lighting conditions. The room lights were on in the background image of the thresholding stage.

4. PEOPLE COUNTING BY SERIAL SEARCH FOR PEAKS IN THE SECTION SCORES IMAGE

Having produced the section score images described above these images were then inspected to determine the number of people in each image. Three methods were considered: finding the total score for each image, grouping sections into person-sized regions and detecting bright spots or peaks in the section score images (peak detection). The total score method was found to be unreliable because the score was not linearly related to the number of people in the scene and the nature of the non-linearity varied from scene to scene. The grouping method has been implemented with some success,(12) but was found to be difficult to calibrate for changes in camera viewpoint and lens type. The peak detection method described here was no more accurate than the grouping method, but required less effort during calibration.

The section score images were first smoothed using a Gaussian filter. Next the brightest section in the image was found and, provided that it was above a detection threshold, its location was noted and the count incremented to 1. This section was then set to black and a search made for the next brightest section. If this section was above the detection threshold it was set to black. If it was also sufficiently far from the first section then its location was recorded and the count incremented. Searching continued until the brightest remaining section was below threshold. At each step the count was only incremented if the new bright spot was sufficiently far from all the previously recorded points. The distance between sections was calculated according to the Euclidean distance measure. The minimum acceptable distance between peaks was set to equal the average distance between the centres of people in the image. Since this distance tends to be less

5. IMPLEMENTATION ISSUES

The people counting method outlined in Sections 2, 3 and 4 was implemented on a dedicated image processing card fitted with a single Inmos T805 transputer. The input resolution of this device was 768 x 576 when grabbing images from a PAL format source. The input images were sub-sampled at a rate of 1 pixel (line) in 8 to produce images at the operating resolution of 96 x 72. All parts of the algorithm were implemented in software on the T805, which was also responsible for controlling the frame grabber hardware and a graphical user interface through which an operator could pass instructions to the system. The time taken to process each image was found to be dependent on the number of people present, but was normally in the range 0.7-1.0 s. Once training and calibration were complete the system was capable of continuous, unsupervised operation. The accuracy of the system is discussed in Section 6.

6. RESULTS

Figure 5 shows the results of the people-counting process. These results were obtained from the demonstration system processing a previously recorded video tape. The system had previously been instructed to acquire a background image for the thresholding stage and trained to recognize variations in the background scene. The images used in this report were extracted from this video footage. The mean estimated number of people was calculated for each actual number of people and is plotted in Fig. 5. The standard deviations about these mean values are also shown. The actual number of people was supplied by the operator who used a "mouse" connected to the transputer hardware to increment (decrement) the count whenever a person entered (left) the scene. To aid the operator in this task the people taking part in the video were instructed to enter (leave) the scene one at a time.
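The summary statistics plotted in Fig. 5 could be computed along the following lines. The data pairs below are invented purely for illustration and are not the paper's results; the use of the population standard deviation is also an assumption, since the text does not say which form was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(pairs):
    """Group estimated counts by actual count; return {actual: (mean, std)}."""
    grouped = defaultdict(list)
    for actual, estimated in pairs:
        grouped[actual].append(estimated)
    return {actual: (mean(values), pstdev(values))
            for actual, values in sorted(grouped.items())}


# Made-up (actual, estimated) pairs for illustration only.
observations = [(1, 1), (1, 1), (2, 2), (2, 3), (2, 1), (3, 3), (3, 4)]
summary = summarize(observations)
```

Each entry of `summary` corresponds to one plotted point: the mean estimate for a given actual count, with the standard deviation giving the length of its error bar.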

The mean estimated values were all close to the actual values and the standard deviations indicate that the estimated count was normally within 1 person of the actual count. For images containing few people the system made very few errors. It must also be remembered that there is some scope for operator errors in the supply of actual counts, although the operator (the first author) was well practised at this task.


[Fig. 5 plot: estimated number of people against actual number of people (1 to 7).]

Fig. 5. Mean estimated number of people versus actual number of people.

7. FURTHER WORK AND CONCLUDING REMARKS

7.1. Tracking movements

The peak detection method for counting described above automatically provided an indication of the location of each peak in the section scores image and, therefore, each person in the scene. Although the current system cannot track individuals, the positional information could be used to aid the tracking of the movements of individuals around the scene. This could in turn be used to determine the actions of people moving through the scene. For example, it might be used to discriminate between those boarding a lift and those alighting. People tracking would also increase the utility of the system as a surveillance tool and data gathering instrument.

7.2. Training and calibration

The algorithm described here was designed to be generally applicable to a variety of environments, provided the camera is sited so as to give a plan view. The number of adjustable parameters has been minimized so as to reduce the effort required to calibrate the system for each situation. The main requirement of the system is that it should be allowed to view the background scene so that background images can be acquired and the neural networks trained. Since RAM-based neural networks can be trained on a single pass through the training set, this initial training period need not be very long.

The minimum distance and brightness thresholds (see Section 4) must also be adjusted in relation to the height of the camera from the floor and the type of lens in use. In the on-line prototype the user is able to adjust these parameters, via a keyboard, so as to tune the system for best results.

8. CONCLUSIONS

In conclusion, RAM-based neural networks can be used as part of a general-purpose background identification method that can deal with poorly constrained background scenes. This algorithm can be used to facilitate the location and counting of objects in the scene. The method does not require high resolution imagery and hence processing is fast even using modest hardware. RAM-based networks are particularly suited to this task, because they can be trained on examples of the background scene only. Training is fast and does not require multiple iterations through the training set. When applied to the task of people counting the method was found to produce reasonably accurate results, even when large variations in the background scene were allowed.

REFERENCES

1. A. T. P. So, W. L. Chan, H. S. Kuok and S. K. Liu, A computer vision based supervisory control system, Elevator Technology 4, Proc. Elevcon 249-258. IAEE, Stockport, U.K. (1992).

2. G. D. Sullivan, Visual interpretation of known objects in constrained scenes, Phil. Trans. R. Soc. London B 337, 361-370 (1992).

3. A. Rouke and M. G. H. Bell, Video image-processing techniques and their application to pedestrian data collection, Research report No. 83. Transport Operations Research Group, University of Newcastle upon Tyne (1992).


4. S. A. Velastin, J. H. Yin, A. C. Davies, M. A. Vicencio-Silva, R. E. Allsop and A. Penn, Analysis of crowd movements and densities in built-up environments using image processing, Proc. IEE Colloquium Image Process. Transport Appl. Digest No: 1993/236 (1993).

5. A. Del Bimbo and P. Nesi, A vision system for estimating people flow. Technical report DSI-RT 15/93. Department of Systems and Informatics, University of Florence, Italy (1993).

6. M. Kilger and T. Dietl, Interpretation-driven low-level parameter adaption in scene analysis, Comput. Aided Syst. Theory, EUROCAST '93. F. Pichler and R. Moreno Diaz, eds, pp. 380-387. Springer-Verlag, Berlin (1993).

7. S. Brofferio and L. Carnimeo, A background updating algorithm for moving object scenes, Time Varying Image Process. Moving Object Recognition 2, 297-307 (1990).

8. K. Karmann and A. von Brandt, Moving object recognition using an adaptive background memory, Time Varying Image Process. Moving Object Recognition 2, 289-297 (1990).

9. N. L. Seed and A. D. Houghton, Background updating for real-time image processing at TV rates, Proc. SPIE 901, 172-180 (1988).

10. I. Aleksander, W. Thomas and P. Bowden, Wisard, a radical step forward in image recognition, Sensor Rev. 4(3), 120-124 (1984).

11. I. Aleksander and T. J. Stonham, Guide to pattern recognition using random-access memories, IEE J. Comput. Digital Techn. 2, 29-40 (1979).

12. A. J. Schofield, T. J. Stonham and P. A. Mehta, Automated people counting using image processing and neural network techniques. Proc. 3rd Int. Conf. Automat. Robot Comput. Vis. Vol. 2, pp. 903-906, Singapore, 9-11 November (1994). Pub. Nanyang Technological University, Singapore.

About the Author--ANDREW SCHOFIELD received his B.Eng. degree in Electronics from Brunel University in 1990 and his Ph.D. in Neuroscience from Keele University, Staffordshire, U.K., in 1994. His research interests are in image processing and neural network applications.

About the Author--PRATAP MEHTA is a Reader in the Department of Electrical Engineering and Electronics at Brunel University, West London, U.K. His research interests include power electronics and intelligent buildings.

About the Author--JOHN STONHAM is Professor of Neural Systems Engineering in the Department of Electrical Engineering and Electronics at Brunel University, West London, U.K. He received his B.Sc. degree in Electronics (1970), M.Sc. in Hybrid Computing (1972) and Ph.D. in Pattern Recognition (1974) from Kent University, U.K. His research interests are the theory and applications of neural networks, pattern recognition, image processing and image database management.
