
3D Cascade of Classifiers for Open and Closed Eye Detection in Driver Distraction Monitoring

Mahdi Rezaei and Reinhard Klette

The .enpeda.. Project, The University of Auckland, Tamaki Innovation Campus, Auckland, New Zealand

[email protected], [email protected]

Abstract. Eye status detection and localization is a fundamental step for driver awareness detection. The efficiency of any learning-based object detection method depends strongly on the training dataset as well as on the learning parameters. This research develops optimum values of Haar-training parameters to create a nested cascade of classifiers for real-time eye status detection. The detectors can detect an eye status of open, closed, or diverted, not only for frontal faces but also for rotated or tilted head poses. We discuss the unique features of our robust training database that significantly influenced the detection performance. The system has been implemented and tested in real-world, real-time processing, with satisfactory results on determining a driver's level of vigilance.

1 Introduction

The automotive industry implements active safety systems in its top-end cars for lane departure warning, safe-distance driving, stop and speed sign recognition, and, recently, also first systems for driver monitoring [Wardlaw 2011]. Stereo vision and pedestrian detection are further examples of components of a driver assistance system (DAS).

Any sort of driver distraction or drowsiness can lead to catastrophic traffic crashes, not only for the driver and passengers in the ego-vehicle (i.e. the car the DAS is operating in) but also for surrounding traffic participants. Face pose and eye status are two main features for evaluating a driver's level of fatigue, drowsiness, distraction, or drunkenness. Successful methods for face detection emerged in the 2000s; research is now focusing on real-time eye detection. Concerns in eye detection still exist for non-forward-looking face positions, tilted heads, occlusion by eyeglasses, or restricted lighting conditions.

According to [Zhang and Zhang 2010], research on eye localization can be classified into four categories. Knowledge-based methods include some predefined rules for eye detection. Template-matching methods generally judge the presence or absence of an eye based on a generic eye shape used as a reference; the search for eyes can run over the whole image or over pre-selected windows. Since eye models vary between people, the localization results are heavily affected by eye-model initialization and image contrast; high computational cost also prevents wide application of this method. Feature-based approaches are based on fundamental eye structures; typically, a method starts here with determining properties such

A. Berciano et al. (Eds.): CAIP 2011, LNCS 6855, pp. 171–179, 2011. © Springer-Verlag Berlin Heidelberg 2011


as edges, the intensity of iris and sclera, and the colour distribution of the skin around the eyes, to identify 'main features' of eyes [Niu et al. 2006]. This approach is relatively robust to lighting but fails in cases of face rotation or eye occlusion (e.g. by hair or eyeglasses). Appearance-based methods learn different types of eyes from a large dataset and differ from template matching: the learning process is based on common photometric features of the human eye, collected from a set of eye images with different head poses. This paper develops the last, appearance-based type of method.

2 Cascade Classifiers Using Haar-Like Masks

Such a system was developed by [Viola and Jones 2001] as a face detector. The detector combines three techniques: the use of a comprehensive set of Haar-like masks (also called 'features' by Viola and Jones) that are in analogy to basis functions of the Haar transform, the application of a boosting algorithm to select a set of masks for classifier training, and the forming of a cascade of strong classifiers by merging weak classifiers. Haar-like masks are defined by adjacent dark and light rectangular regions; see Fig. 1.

The selection process for an object is based on the value distributions in the dark and light regions of a mask, which model expected intensity distributions. For example, the mask in Fig. 2, left, relates to the idea that in a face the eye regions are darker than the bridge of the nose. Similarly, the mask in Fig. 2, right, models that the central part of an eye (the iris) is darker than the sclera.

Computing Mask Values. Mean values in rectangular mask regions are calculated by applying the integral image as proposed in [Viola and Jones 2001]; see Fig. 3. For a given M × N picture P, at first the integral image

I(x, y) = \sum_{0 \le i \le x \,\wedge\, 0 \le j \le y} P(i, j) \qquad (1)

is calculated. The sum P(R1) of all P-values in the rectangular region R1 (see Fig. 3) is then given by I(D) + I(A) − I(B) − I(C). Analogously, we calculate the sums P(R2) and P(R3) from corner values in the integral image I. The values of contributing regions are weighted by reals ω_i, which create regional mask values of the form

Fig. 1. Four different sets of masks for calculating Haar-like mask values


Fig. 2. Left: Application of two triple masks for collecting mean intensities in bright or dark regions. Right: Camera assembly in HAKA1 for driver distraction detection.

v_i = ω_i · P(R_i), and then a total mask value; for the shown example this is V_i = ω_1 · P(R_1) + ω_2 · P(R_2) + ω_3 · P(R_3). The signs of the ω_i are opposite for light and dark regions. In generalizing this approach, we also allow for arbitrary rotations. R_i is now defined by five parameters x, y, w, h, and ϕ, where x and y are the coordinates of the lower-right corner, w and h are width and height, and ϕ is the rotation angle [Zhang and Zhang 2010]. For example, P_ϕ(R_1) = I_ϕ(B) + I_ϕ(C) − I_ϕ(A) − I_ϕ(D), and for ϕ = 45° we have

I_{45^\circ}(x, y) = \sum_{|x - i| \le y - j \,\wedge\, 0 \le j \le y} P(i, j) \qquad (2)

For any angle ϕ, the calculation of all M × N integral values I_ϕ takes time O(M × N). This allows for real-time calculation of features based on Haar-like masks.
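For illustration, the following sketch (not the authors' implementation) computes an integral image per Eq. (1) and evaluates a three-rectangle mask value with four table lookups per rectangle; the region coordinates and weights are hypothetical example values.

import numpy as np

def integral_image(p):
    # I(x, y) = sum of P(i, j) for 0 <= i <= x and 0 <= j <= y
    return p.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    # Sum over the w x h rectangle with top-left corner (x, y), computed
    # as I(D) + I(A) - I(B) - I(C): four lookups regardless of size.
    A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
    B = ii[y - 1, x + w - 1] if y > 0 else 0
    C = ii[y + h - 1, x - 1] if x > 0 else 0
    D = ii[y + h - 1, x + w - 1]
    return D + A - B - C

# Total mask value V_i = w1*P(R1) + w2*P(R2) + w3*P(R3); weights carry
# opposite signs for dark and light regions (values here are illustrative).
img = np.random.randint(0, 256, (24, 24)).astype(np.int64)
ii = integral_image(img)
V = (-1.0 * rect_sum(ii, 0, 8, 8, 8)     # dark region R1 (left eye area)
     + 2.0 * rect_sum(ii, 8, 8, 8, 8)    # light region R2 (nose bridge)
     - 1.0 * rect_sum(ii, 16, 8, 8, 8))  # dark region R3 (right eye area)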

Cascaded Classifiers via Boosted Learning. In a search window of 24 × 24 pixels there are more than 180,000 different rectangular masks of different shapes, sizes, or rotations. However, a small number of masks (usually fewer than 100) is sufficient to detect a desired object (e.g. an eye) in an image. In addition to defining the regional mask weights ω_i, a boosting algorithm lets the classifier learn to select the prominent masks μ_i based on their overall weights W_i. These weights determine the importance of each mask in the object detection process, so we arrange all the masks in cascaded nodes, as in Fig. 4.

Each node (a weak classifier) tries to determine whether the object (e.g. an eye) is inside the search window or not. The first classifier simply rejects non-objects if

Fig. 3. Illustration for calculating a mask value using integral images. The coordinate origin is in the upper left corner.


the main masks (such as in Fig. 2) do not exist. If they do exist, then more detailed masks are evaluated in the next classifiers, and the process continues. Each node represents a boosted classifier adjusted not to miss any object while rejecting non-objects that do not match the desired masks. Although each node on its own is only a weak classifier, together they form a strong classifier: reaching the final node means that all non-objects have already been rejected and only an object (here: an eye) remains. The function μ_i returns +1 if the mask value V_i is greater than or equal to a trained threshold T_i, and −1 if not:

\mu_i = \begin{cases} +1 & \text{if } V_i \ge T_i \\ -1 & \text{if } V_i < T_i \end{cases} \qquad (3)

μ_i = +1 means that the current weak classifier matches the object and we can proceed to the next classifier. Statistically, about 75% of non-objects are rejected by the first two classifiers; the remaining 25% go to a more detailed analysis. This speeds up the process of object detection. In order to train the algorithm we need a database of positive images (e.g. eyes); on the first pass through the positive image database, we learn a threshold T_1 for μ_1 such that it best classifies the input. Boosting then uses the resulting errors to calculate the overall weight W_1. Once the first node is trained, boosting continues for the other nodes, but with other masks that are more sophisticated than the previous ones [Freund et al. 1996].
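The boosting step sketched above can be made concrete as follows. This is a single round of Discrete AdaBoost in the style of [Freund et al. 1996] (the authors report in Section 5 that they use Gentle AdaBoost, whose update rule differs); the stump representation and the single decision polarity are simplifying assumptions.

import math

def boost_round(samples, labels, weights, candidate_stumps):
    # samples: one vector of mask values per training window;
    # labels in {+1, -1}; weights: normalized sample weights;
    # candidate_stumps: (mask_index, threshold T) pairs to try.
    best, best_err = None, float("inf")
    for m, T in candidate_stumps:
        err = sum(w for x, y, w in zip(samples, labels, weights)
                  if (1 if x[m] >= T else -1) != y)
        if err < best_err:
            best, best_err = (m, T), err
    # Confidence of the selected stump; higher for lower weighted error.
    alpha = 0.5 * math.log((1 - best_err) / max(best_err, 1e-12))
    m, T = best
    # Re-weight: misclassified samples gain weight for the next round.
    new_w = [w * math.exp(-alpha * y * (1 if x[m] >= T else -1))
             for x, y, w in zip(samples, labels, weights)]
    s = sum(new_w)
    return best, alpha, [w / s for w in new_w]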

Assume that each node (a weak classifier) is trained to correctly match and detect objects of interest at a true-positive (TP) rate of p = 99.9%. Since each stage alone is a weak classifier, many false detections of non-objects are to be expected, say at a false-positive (FP) rate of f = 50% per stage. This is still acceptable because, due to the serial nature of cascaded classifiers, the overall detection rate remains high (near 1), while the false-positive rate decreases exponentially (approaching 0).
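A minimal sketch of this cascade logic under the structure described above (per-node thresholds T_i and boosted weights W_i; assumed organization, not the authors' code), together with the serial-rate arithmetic:

def mu(v, t):
    # Weak decision of Eq. (3): +1 if the mask value meets its threshold.
    return 1 if v >= t else -1

def cascade_accepts(window, stages):
    # stages: one list per node of (mask_value_fn, T_i, boosted weight W_i).
    for node in stages:
        score = sum(W * mu(fn(window), T) for fn, T, W in node)
        if score < 0:        # node rejects: window is a non-object
            return False
    return True              # survived all nodes: accept as an eye

# The serial effect in numbers (idealized, assuming stage independence):
p, f, k = 0.999, 0.5, 15     # per-stage TP, per-stage FP, number of stages
print(p ** k)                # ~0.985  overall detection rate
print(f ** k)                # ~3.1e-05 overall false-positive rate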

3 Scenarios and 3D Cascaded Classifiers

Most eye detection algorithms, such as [Wang et al. 2010], just look for the eyes in an already localized face. Therefore, eye detection simply fails if there is

Fig. 4. Structure of cascaded classifiers for object detection


no full frontal view of a face, if parts of the face are occluded, or if parts of the face are outside of the camera's viewing angle.

Our method follows a dynamic approach: if the initial result of face detection is positive, then we only search within the face region. Detection of an eye in a previously detected face region provides a double confirmation and more confidence in the validity of the eye detection. But if the face is not detected, our 3D cascade looks for eyes in the whole image. In our particular context we consider driver fatigue, drowsiness, distraction, or drunkenness when the driver fails to look forward at the road, or when the eyes are closed for some long uninterrupted period of time (say, 1 s or more). As an example, when driving at a speed of 100 km/h, just one second of eye closure means passing 28 meters without paying attention; this can easily cause lane drift and a fatal crash. In our method we take the two states Looking Forward and Open Eyes as the important properties for judging a driver's vigilance. For face detection we follow the classifier in [Lienhart et al. 2003]; for eye status detection we design our own classifiers. The proposed 3D classifier is able to detect and distinguish five different scenarios while driving, as follows (see Fig. 5, from left to right):

Scenario 1: Obviously, the eyes are in the upper half of the face region. By assessing 200 different faces from different races, we derived that human eyes are geometrically located in a segment A between 0.55 and 0.75 of the face's height. Applying this rough estimate for eye localization, we already increase the search speed by a factor of 5 compared to a blind search, as we are only looking at 20% of the face's region. An eye pair is detectable in segment A while the driver is looking forward.

Scenario 2: It occasionally happens that only one eye is detectable in segment A, when the driver tilts his or her head. In that case we need to look for the second eye in a segment B in the opposite half of the face region. Segment B is taken to be between 0.35 and 0.95 of the face's height; this covers more than ±30 degrees of face tilt. The size of the search window in segment B is 30% of the face region, so in this case of a tilted face we search both segments A and B (in total, 50% of the face's region). In Scenarios 1 and 2, the driver is looking forward at the roadway, so if we detect two open eyes then we decide that the driver is in the Aware state.

Scenario 3: If a frontal face is not detectable and just one of the eyes is detected, then this can be due to more than 45° of face rotation. The driver is looking

Fig. 5. Left to right: Scenarios 1 to 5 for driver’s face and eye poses; see text for details


towards the right or left such that the second eye is occluded by the nose. The system immediately measures the period of time for which the driver is looking to the side instead of forward. This scenario also happens when the driver looks into the side mirrors (but this normally takes less than a second). Depending on the ego-vehicle's speed, any occurrence of this scenario that takes more than 1 s is considered a sign of Distraction, and the system will raise an alarm.

Scenario 4: Detection of closed eyes. Here we use an individual classifier for closed-eye detection. A closed-eye status happens frequently due to normal eye blinking, where the eye-closure time t_c is normally less than 0.3 s. Any longer eye closure is strong evidence of fatigue, drowsiness, or drunkenness. The system will raise an alarm for the Drowsiness state if no open eye and at least one closed eye is detected.

Scenario 5: The worst case is when neither the face, nor open eyes, nor closed eyes are detectable. This case occurs, for example, when the driver is looking over the shoulder, when the head drops down, or when the driver is performing secondary tasks. The system will raise an alarm for a detected Risky Driving state.

Considering all active detectors (face, open-eye, and closed-eye detectors), we have cascaded classifiers in three dimensions that work in parallel. Implementing separate detectors for open and closed eye detection is important because the open-eye detector may at times fail to detect open eyes, but this does not necessarily mean that the eyes are closed; missing eyes may be due to a specific head pose or bad lighting conditions. Having a separate closed-eye detector is a step toward high accuracy in driver distraction detection.
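The five scenarios can be summarized as one decision routine. The sketch below is our reading of Scenarios 1 to 5, not the authors' implementation: the detector wrappers are assumed callables returning rectangles, the segment fractions (given in the text relative to face height, with eyes in the upper half of the face box) are converted here into top-based image coordinates, and the timing thresholds of 1 s and 0.3 s come from the text.

def crop(img, rect):
    x, y, w, h = rect
    return img[y:y + h, x:x + w]

def segment_a(fx, fy, fw, fh):
    # Segment A: full face width, 0.55-0.75 of face height measured from
    # the chin upward (0.25-0.45 from the top): ~20% of the face region.
    return (fx, fy + int(0.25 * fh), fw, int(0.20 * fh))

def segment_b(fx, fy, fw, fh, eye_on_left):
    # Segment B: opposite half width, 0.35-0.95 from the chin upward
    # (0.05-0.65 from the top): ~30% of the face region.
    x = fx + fw // 2 if eye_on_left else fx
    return (x, fy + int(0.05 * fh), fw // 2, int(0.60 * fh))

def classify_frame(frame, face_det, open_det, closed_det, t_side, t_closed):
    # t_side / t_closed: seconds the side-looking / closed-eye condition
    # has already persisted (maintained by the caller across frames).
    faces = face_det(frame)
    if faces:
        fx, fy, fw, fh = faces[0]
        eyes = open_det(crop(frame, segment_a(fx, fy, fw, fh)))
        if len(eyes) == 1:                      # tilted head: Scenario 2
            on_left = eyes[0][0] < fw // 2
            eyes += open_det(crop(frame, segment_b(fx, fy, fw, fh, on_left)))
        if len(eyes) >= 2:                      # Scenarios 1 and 2
            return "Aware"
        search = crop(frame, (fx, fy, fw, fh))  # keep searching the face
    else:
        search = frame                          # no face: whole image
    if open_det(search):                        # one eye only: Scenario 3
        return "Distraction" if t_side > 1.0 else "Aware"
    if closed_det(search):                      # closed eyes: Scenario 4
        return "Drowsiness" if t_closed > 0.3 else "Aware"
    return "Risky Driving"                      # nothing found: Scenario 5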

4 Training Image Database

The process of selecting positive and negative images is a very important step that considerably affects the overall performance. After several experiments, we determined that, although a larger number of positive and negative images can improve the detection performance in general, it also increases the risk of mask mismatching during the training process. Thus, careful consideration of the number of positive and negative images, and of their content, is essential. In addition, the multi-dimensionality of the training parameters and the complexity of the feature space pose challenges. We propose optimized values of training parameters as well as unique features for our robust database.

From the initial negative image database, we removed all images that contained any objects similar to a human eye (e.g. animal eyes). We prepared the training database by manually cropping closed or open eyes from positive images. Important questions needed to be answered: how to crop the eye regions, and in what shape (e.g. circles, isothetic rectangles, squares)? There is a general belief that circles or horizontal rectangles are best for fitting eye regions; however, we obtained the best experimental results by cropping eyes in square form. We fit the square to enclose the full eye width; for the vertical positioning we select balanced portions of the skin area below and above the eye region. We cropped 12,000 eyes


from selected positive images of our own database plus six other databases: the FERET database sponsored by the DOD Counterdrug Technology Development Program Office [Phillips et al. 1998, Phillips et al. 2000], the Radbound face database [Langner et al. 2010], the Yale facial database B [Lee et al. 2005], the BioID database [Jesorsky et al. 2001], the PICS database [PICS], and the "Face of Tomorrow" [FTD]. The positive database includes more than 40 different poses and emotions for different faces, eye types, ages, and races:

– Gender and age: females and males between 6 and 94 years old,
– Emotion: neutral, happy, sad, angry, contempt, disgusted, surprised, feared,
– Looking angle: frontal (0°), ±22.5°, and profile (±45.0°), and
– Race: East Asians, Caucasians, dark-skinned people, and Latino-Americans.

The generated multifaceted database is unique, statistically robust, and competitive compared to other training databases.
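The square cropping rule described above (full eye width, balanced skin bands above and below the eye) can be sketched as follows; the eye centre and eye width would come from manual annotation, and the helper name is ours.

def square_eye_crop(img, cx, cy, eye_width):
    # Crop a square of side eye_width centred on the eye at (cx, cy); the
    # vertical centring leaves equal skin margins above and below the eye.
    # The caller must ensure the square lies inside the image bounds.
    half = eye_width // 2
    return img[cy - half:cy + half, cx - half:cx + half]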

We also selected 7,000 negative images (non-eye and non-face images), including a combination of common objects in indoor and outdoor scenes. Considering a search window of 24 × 24 pixels, we had about 7,680,000 sub-windows in our negative database. An increasing number of positive images in the training process caused a higher rate of true-positive cases (TP), which is good, but also increased the false-positive cases (FP), which is bad. Similarly, an increased number of negative training images led to a decrease in both FP and TP. Therefore we needed to find a good trade-off between the number of positive images and the number of negative sub-windows. For the eye classifiers, we obtained the highest TP rate and the lowest false-negative rate when we set the ratio N_p/N_n = 1.2 (this may vary for face detection).

5 AdaBoost Learning Parameters and Experiments

We implemented the training algorithm in OpenCV 2.1. With respect to our database, we obtained maximum performance by applying the following settings. Size of mask window: 21 × 21 pixels. Total number of classifiers (nodes): 15 stages; any smaller number of stages brought a lot of false-positive detections, and a larger number of stages reduced the rate of true-positive detections. Minimum acceptable hit rate per stage: 99.80% and increasing; a rate too close to 100% may cause the training process to take forever, or to fail early. Maximum acceptable false alarm for the first stage: 40.0% per stage; this error goes to zero exponentially as the number of stages increases. Weight-trimming threshold: 0.95; this is the similarity weight used to pass or fail an object in each stage. Boosting algorithm: among four types of boosting (Discrete AdaBoost, Real AdaBoost, Logit AdaBoost, and Gentle AdaBoost), we obtained about a 5% higher TP detection rate with Gentle AdaBoost; [Lienhart et al. 2003] also showed that GAB results in lower FP ratios for face detection.
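Trained cascades of this kind load through OpenCV's standard cascade API. A minimal application sketch follows; the eye-cascade XML file names are placeholders of our own (the paper does not publish its trained classifiers), while haarcascade_frontalface_alt.xml is a stock cascade shipped with OpenCV in the spirit of [Lienhart et al. 2003].

import cv2

face_det = cv2.CascadeClassifier("haarcascade_frontalface_alt.xml")
open_eye_det = cv2.CascadeClassifier("open_eye_cascade.xml")      # assumed name
closed_eye_det = cv2.CascadeClassifier("closed_eye_cascade.xml")  # assumed name

gray = cv2.cvtColor(cv2.imread("driver_frame.png"), cv2.COLOR_BGR2GRAY)
gray = cv2.equalizeHist(gray)  # helps under uneven cabin lighting

faces = face_det.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
for (x, y, w, h) in faces:
    roi = gray[y:y + h, x:x + w]
    open_eyes = open_eye_det.detectMultiScale(roi, 1.1, 3)
    closed_eyes = closed_eye_det.detectMultiScale(roi, 1.1, 3)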

We performed a performance evaluation on 2,000 images from the second part of the FERET database, plus 2,000 other images from sequences recorded in HAKA1, our research vehicle (see Fig. 2, right). None of the test images were


Table 1. Classifier accuracy (in %) in terms of true-positive (TP) and false-positive (FP) rates

                              Open-eye detection    Closed-eye detection
Facial status                 TP       FP           TP       FP
Frontal face                  98.6     0.0          97.7     0.20
Tilted face (up to ±30°)      98.2     0.002        97.1     0.54
Rotated face (up to ±45°)     96.8     0.0          96.8     0.7

included in the training process, and all images were recorded in daylight conditions. Table 1 shows the final results for the open- and closed-eye detection rates.

6 Conclusions

With the aim of driver distraction detection, we implemented a robust 3D detector, based on Haar-like masks and AdaBoost machine learning, that is able to inspect face pose, open eyes, and closed eyes at the same time. In contrast to similar research that only works on frontal faces, the developed classifier also works for tilted and rotated faces in real-time driving applications. There are no comprehensive data about performance evaluation for eye detection; comparing the results in [Kasinski and Schmidt 2010], [Niu et al. 2006], [Wang et al. 2010], and [Wilson and Fernandez 2006] with our results (see Table 1), our method appears to be superior in the majority of cases. The method still needs improvement for dark environments. High-dynamic-range cameras or some kind of preprocessing might be sufficient to obtain satisfactory detection accuracy also at night or in low-light environments.

References

[Langner et al. 2010] Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D.H.J., Hawk, S.T., Van Knippenberg, A.: Presentation and validation of the Radbound faces database. Cognition and Emotion 24, 1377–1388 (2010)

[Freund et al. 1996] Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Machine Learning, pp. 148–156 (1996)

[Jesorsky et al. 2001] Jesorsky, O., Kirchberg, K., Frischholz, R.: Robust face detection using the Hausdorff distance. J. Audio Video-based Person Authentication, 90–95 (2001)

[Kasinski and Schmidt 2010] Kasinski, A., Schmidt, A.: The architecture and performance of the face and eyes detection system based on the Haar cascade classifiers. J. Pattern Analysis Applications 3, 197–211 (2010)

[FTD] Face of Tomorrow database (2010), http://www.faceoftomorrow.com/posters.asp

[Lee et al. 2005] Lee, K.C., Ho, J., Kriegman, D.: Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Analysis Machine Intelligence 27, 684–698 (2005)

[Lienhart et al. 2003] Lienhart, R., Kuranov, A., Pisarevsky, V.: Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 297–304. Springer, Heidelberg (2003)

[Niu et al. 2006] Niu, Z., Shan, S., Yan, S., Chen, X., Gao, W.: 2D cascaded AdaBoost for eye localization. In: ICPR, vol. 2, pp. 1216–1219 (2006)

[Phillips et al. 1998] Phillips, P.J., Wechsler, H., Huang, J., Rauss, P.: The FERET database and evaluation procedure for face recognition algorithms. J. Image Vision Computing 16, 295–306 (1998)

[Phillips et al. 2000] Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face recognition algorithms. IEEE Trans. Pattern Analysis Machine Intelligence 22, 1090–1104 (2000)

[PICS] PICS image database: University of Stirling, Psychology Department (2011), http://pics.psych.stir.ac.uk/

[Viola and Jones 2001] Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR, pp. 511–518 (2001)

[Wang et al. 2010] Wang, H., Zhou, L.B., Ying, Y.: A novel approach for real time eye state detection in fatigue awareness system. IEEE Robotics Automation Mechatronics, 528–532 (2010)

[Wardlaw 2011] Wardlaw, C.: 2012 Mercedes-Benz C-Class preview (2011), http://www.vehix.com:80/articles/auto-previewstrends/2012-mercedes-benz-c-class-preview

[Wilson and Fernandez 2006] Wilson, P.I., Fernandez, J.: Facial feature detection using Haar classifiers. J. Computing Science 21, 127–133 (2006)

[Zhang and Zhang 2010] Zhang, C., Zhang, Z.: A survey of recent advances in face detection. Technical Report MSR-TR-2010-66, Microsoft Research (2010)