
Increasing Security of Mobile Devices by Decreasing User Effort in Verification

Elena Vildjiounaite, Satu-Marja Mäkelä, Mikko Lindholm, Vesa Kyllönen and Heikki Ailisto
Technical Research Centre of Finland

[email protected]

Abstract

Reliable user verification is important for the security of computers and personal devices; however, most well-performing verification methods require explicit user effort. As a consequence, access is granted for a long time after a single successful verification, which allows an impostor to take the place of the authorized user, as is often the case with mobile phones. This work proposes a method of frequent user verification based on cascading unobtrusive biometrics with more reliable biometrics provided explicitly, in such a way that explicit effort is required only if unobtrusive verification fails. Experiments with voice, gait and fingerprint data have shown that in most noise conditions the cascade was able to satisfy a security requirement of a False Accept Rate of 1% and to achieve an overall False Reject Rate of 3% or less, while requiring explicit effort in 10-60% of cases.

1. Introduction

User verification can be performed with something the user knows (a password), something the user owns (a token), or the user's features (biometrics). None of these methods guarantees perfect security; however, biometrics has the advantage that it is always with the person and that spoofing a multimodal biometric system requires certain skills and effort. Thus, protecting personal devices with biometrics would help to avoid the current situation, in which anyone can simply pick up a mobile phone and make calls or view personal data. Since not only the mobile devices themselves, but also the services they provide and the stored information (names, addresses, images, short messages, the user's calendar, etc.) have significant monetary and personal value, the risk of a mobile device ending up in the wrong hands presents a significant and unfortunately very common threat to information security and user privacy.

The main reason for this threat is the lack of authentication methods that allow user authentication to be performed frequently and as unobtrusively as possible. This work proposes a method of multimodal-biometrics-based user verification which aims at increasing the security of personal devices by reducing the user effort required for verification. The verification is performed in two stages: the first stage attempts to verify the user by unobtrusive biometrics only, and the second stage, which requires explicit effort, is invoked only if the first stage fails. Since security requirements imply that the False Accept Rate (FAR) should not be too high, the cascade is trained in such a way that the FAR of both stages is kept within the desired limits and the False Reject Rate (FRR) is as low as possible for the target FAR.

The proposed method can be applied to different combinations of biometric modalities. For mobile devices the most suitable unobtrusive modalities are voice (users talk to each other via a mobile device or in its close proximity, or talk directly to their mobile devices when using speech recognition functionality) and walking style, i.e. gait (users carry devices with them), which can be acquired via embedded accelerometers. The most suitable explicit modality is fingerprint, because fingerprint sensors for mobile devices exist and are often perceived as a “cool” feature.

Gait biometrics has been studied for more than a decade [1-3], and it has been acknowledged that deliberate imitation of another person's gait is difficult, whereas the gait of an individual is fairly stable [4]. However, gait recognition was mainly video-based until recently, when accelerometer-based gait recognition was suggested [5]. Gait biometrics is still in its infancy, and its performance is not very good: Equal Error Rates (for a definition see Section 3.4) for gait exceed 10%.

In contrast, voice biometrics (speaker recognition) is a widely researched area. The usability of such systems, however, is limited by the vulnerability of speech to the background noise present in real-life situations. The performance of speaker recognition can deteriorate from an Equal Error Rate (EER) of 0.7% on clean speech to an EER of 28.08% in the presence of white noise with a Signal-to-Noise Ratio (SNR) of 0 dB [6]. Another example [7] shows how a speaker identification rate of 97.7%, achieved under car noise with an SNR of 30 dB, can drop to 21% at an SNR of 5 dB. Increasing noise robustness improves performance: in reference [6] the use of a noise-robust method for 0 dB white noise reduced the EER from 28.08% to 11.68%, while for white noise at 18 dB, where the initial EER was 1.36%, the noise-robust method improved the EER to 1.06%. However, noise-robust methods are usually computationally expensive, which can be a problem given the limited capabilities of mobile devices.

Since mobile devices are often stolen or lost in noisy urban environments, unobtrusive user verification by fusion of voice and gait biometrics may be feasible, but its recognition rates in noisy conditions are not sufficiently high either [8]. Cascading unobtrusive and fingerprint verification is one solution to the problem, because sweep fingerprint sensors for mobile devices already exist, although their error rates are higher than those of other fingerprint sensors; e.g., the work of Zhang et al. [9] reports EERs of 3.5-4% for different methods.

Cascaded multimodal biometric fusion has not received much attention among researchers, probably because the main effort in fusion is devoted to improving overall system performance, not to reducing user effort. Thus, parallel systems (those which fuse all modalities simultaneously) are most commonly studied, because they are flexible and easy to train. Cascaded systems have been proposed mainly for identification purposes, in order to increase operational speed: the system first finds the best matches for one modality and then searches for the best match for the second modality only among these candidates [10].

For verification purposes, a few cascaded systems have been proposed. The work of Takahashi et al. [11] proposes a cascade that allows users to choose the order of modalities. This should help to increase user-friendliness and population coverage, but it can also facilitate spoofing. The Sequential Probability Ratio Test is used for decision fusion, and experiments on a database of five people demonstrate its ability to keep the FAR within desired limits. However, False Reject Rates in different system configurations are not presented.

The work of Erzin et al. [12] proposes a method for selecting the order of modalities in a cascade to improve overall performance; however, the method assumes that all modalities are available simultaneously. The method, called an adaptive classifier cascade, was applied to identification with audio and video data (face and lip movement), where five different sets of scores were produced from the audio and video streams. The order of classifiers in the cascade depends on the estimated reliability of the modality, based on the assumption that a correct speaker model produces a likelihood ratio significantly higher than the likelihood ratios of the other speaker models. Experimental results on a database of 50 persons show that the proposed fusion method outperforms fusion schemes such as the product rule and max likelihood when selecting the three best of the five single modalities.

Unlike the previous work, the main goal of this work is to study how to decrease user effort while maintaining acceptable error rates. The novel idea of cascading unobtrusive and explicit user verification was tested in offline experiments on gait, voice and fingerprint data collected from the same real persons (unlike the common practice of using so-called “virtual” persons, created by combining biometric data of one modality from one person with biometric data of other modalities from other people). The experimental results confirmed the feasibility of the idea.

2. Overview of the proposed method

In order to decrease user effort, we propose to perform user authentication by means of unobtrusive biometrics first, and to require explicit user authentication only if the unobtrusive stage fails and an application-dependent security risk is present (see Figure 1). Examples of a security risk are user login or the start of a sensitive application (when an accept/reject decision is required immediately), or failure of unobtrusive verification during a certain time period. The security risk can be considered lower if a mobile device is in the user's home, and higher in urban streets and new places. UK statistics, for example, show that "a mobile phone [is] stolen approximately every three minutes" [13], mainly in urban environments. The proposed method can be used with different biometric modalities; however, since security and privacy issues related to mobile devices are becoming crucial, we have tested it on data related to mobile devices.

Figure 1. Overview of the proposed method.
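The two-stage logic of Figure 1 can be sketched as follows. This is a minimal illustration only: the function names, the accept-by-default behaviour when no security risk is present, and the single unobtrusive threshold are our own assumptions based on the description above.

```python
def verify_user(unobtrusive_score, unobtrusive_threshold,
                security_risk_present, request_explicit_sample, explicit_verify):
    """Two-stage cascade: accept on unobtrusive evidence when possible,
    ask for an explicit sample (e.g. a fingerprint sweep) only when needed."""
    if unobtrusive_score >= unobtrusive_threshold:
        return True                       # stage 1: unobtrusive verification succeeded
    if not security_risk_present():
        return True                       # no application-dependent risk: do not disturb the user yet
    sample = request_explicit_sample()    # stage 2: explicit user effort is required
    return explicit_verify(sample)        # accept/reject decision after the last stage
```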

The experiments aimed at answering the following questions: first, how does the performance of the cascade compare with the performance of fingerprint recognition alone; second, how much can the cascade reduce the need for explicit user effort; and third, which are the better choices for the unobtrusive stage (voice alone or fusion of gait and voice) and for the explicit stage (fingerprint alone or fusion).

3. Single modalities

For all three single modalities we collected data from the same real persons (32 test subjects) and obtained similarity scores by comparing each person's data against that of every other person.

3.1 Gait data collection and processing

Gait data was collected in the form of a three-dimensional acceleration signal from an accelerometer module carried by the test subjects while walking. We collected two sets of data (training and test data) in two sessions with a one-month interval between sessions.

Figure 2. Gait data collection setup.

During each session the subjects walked along a corridor (about 20 meters) at their normal walking speed. The three-dimensional accelerometer, composed of two perpendicularly positioned Analog Devices ADXL202JQ accelerometers, was embedded into a mobile phone-like module. The attachment system mimicked two common places where people often carry things: the breast pocket of a shirt and the hip pocket of trousers (see Figure 2). The accelerometer signals were recorded by a laptop with a National Instruments Data Acquisition Card at a sampling frequency of 256 Hz (the data was decimated later).

All three axes of the accelerometer signal were used for gait recognition. The data for both the training and testing phases were first preprocessed, i.e. normalized to the range [-1, 1], low-pass filtered and decimated by a factor of two. Two similarity scores (a correlation score and an FFT score) were calculated by comparing the test data with the corresponding training data. More details of the correlation-based recognition are provided in [5], where the method was applied to data from an accelerometer module carried at the waist, in the middle of the back. The FFT (Fast Fourier Transform) coefficients were calculated in a 256-sample window with a 100-sample overlap. The 128 FFT coefficients of each training file were clustered with the K-means algorithm into eight clusters, and the FFT score was produced by finding the minimum distance of the test data from the trained clusters.
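A minimal sketch of the FFT-based gait score described above is given below. The window length (256 samples), the overlap (100 samples), the 128 FFT coefficients and the eight K-means clusters follow the description; the low-pass cutoff frequency, the use of SciPy/NumPy routines and the Euclidean distance to the nearest cluster centre are our own assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate
from scipy.cluster.vq import kmeans, vq

def preprocess(acc, fs=256, cutoff_hz=10.0):
    """Normalize one acceleration axis to [-1, 1], low-pass filter and decimate by two."""
    acc = 2.0 * (acc - acc.min()) / (acc.max() - acc.min()) - 1.0
    b, a = butter(4, cutoff_hz / (fs / 2.0), btype="low")
    return decimate(filtfilt(b, a, acc), 2)

def fft_features(signal, win=256, overlap=100):
    """Magnitude spectra (128 coefficients) of overlapping windows."""
    step = win - overlap
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, step)]
    return np.array([np.abs(np.fft.rfft(f))[:128] for f in frames])

def train_gait_model(train_signal, n_clusters=8):
    """Cluster the FFT coefficients of the training data with K-means."""
    centroids, _ = kmeans(fft_features(preprocess(train_signal)), n_clusters)
    return centroids

def fft_score(test_signal, centroids):
    """FFT score: mean distance of the test windows to their nearest trained cluster."""
    _, dists = vq(fft_features(preprocess(test_signal)), centroids)
    return dists.mean()   # smaller distance means a more similar gait
```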

3.2 Voice data collection and processing

The speech samples were collected with a computer in a quiet environment. Each subject spoke the required four utterances (training data) in the first session and one utterance (test data) in the second session, each utterance being an eight-digit string. We used a very small amount of training data because we were aiming at decreasing user effort. The data were collected in wave format at a sampling frequency of 8000 Hz. The speech samples were normalized and contaminated with white, city and car noise at three SNR conditions: 20, 10 and 0 dB. The white noise was artificially generated; the city and car noise were taken from the NTT-AT Ambient Noise Database [14].
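Contaminating the normalized speech with noise at a given SNR, as described above, can be done roughly as follows. This is a sketch under our own assumptions (simple power-based scaling of the noise); the actual mixing tool is not specified in the paper.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech so that the resulting signal-to-noise ratio equals snr_db."""
    noise = np.resize(noise, speech.shape)           # repeat or cut the noise to the speech length
    p_speech = np.mean(speech.astype(float) ** 2)    # average power of the speech
    p_noise = np.mean(noise.astype(float) ** 2)      # average power of the noise
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: the three conditions used in the experiments.
# noisy_20 = add_noise(clean_utterance, city_noise, snr_db=20)
# noisy_10 = add_noise(clean_utterance, city_noise, snr_db=10)
# noisy_0  = add_noise(clean_utterance, city_noise, snr_db=0)
```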

Speaker recognition was text-independent and was performed using the widely known MASV (Munich Automatic Speaker Verification) environment [15]. MASV uses a Gaussian Mixture Model (GMM) classifier. In this work the GMM was used with 32 components, and the feature vector contained 39 components (12 Mel Frequency Cepstrum Coefficients and log energy, together with their first and second derivatives). The world model was generated from a small subset of the training samples.
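The GMM-based, text-independent verification described above can be sketched with open-source tools as follows. The 32 mixture components and the 39-dimensional feature vector (12 MFCCs plus log energy with first and second derivatives) follow the description; the use of python_speech_features and scikit-learn instead of MASV, and the log-likelihood-ratio scoring against the world model, are our own assumptions.

```python
import numpy as np
from python_speech_features import mfcc, delta
from sklearn.mixture import GaussianMixture

def voice_features(signal, fs=8000):
    """12 MFCCs + log energy, with first and second derivatives (39 dimensions)."""
    base = mfcc(signal, samplerate=fs, numcep=13, appendEnergy=True)
    d1 = delta(base, 2)
    d2 = delta(d1, 2)
    return np.hstack([base, d1, d2])

def train_gmm(utterances, fs=8000, n_components=32):
    """Train a 32-component GMM on the concatenated training utterances."""
    feats = np.vstack([voice_features(u, fs) for u in utterances])
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(feats)

def voice_score(test_utterance, speaker_gmm, world_gmm, fs=8000):
    """Average log-likelihood ratio of the test utterance against the world model."""
    feats = voice_features(test_utterance, fs)
    return speaker_gmm.score(feats) - world_gmm.score(feats)
```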

3.3 Fingerprint data collection and processing

A commercially available optical fingerprint sensor, the Biometrika FX2000, with the associated software development kit FX3 SDK, was used in this study. The fingerprint recognition method employed in FX3 uses minutia matching and four other types of features, which encode the ridge-line flow and density and the ridge-line shape in some regions [16]. In this work a template based on a single fingerprint image was used.

3.4 Summary of single modalities

The performance of a biometric system is usually evaluated by its False Accept Rate (FAR) and False Reject Rate (FRR). In a concise form, the system performance can be presented by its Equal Error Rate (EER). The EER equals half of the TER (Total Error Rate, calculated as the sum of FAR and FRR) at the point where FAR and FRR are approximately equal. The EERs of the single modalities are presented in Table 1.
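The EER values reported in Table 1 can be computed from sets of genuine and impostor scores roughly as follows; the sweep over all observed score values as candidate thresholds is our own implementation choice.

```python
import numpy as np

def far_frr(genuine, impostor, threshold):
    """FAR and FRR when scores greater than or equal to the threshold are accepted."""
    far = np.mean(np.asarray(impostor) >= threshold)
    frr = np.mean(np.asarray(genuine) < threshold)
    return far, frr

def equal_error_rate(genuine, impostor):
    """EER = half of the Total Error Rate (FAR + FRR) where FAR and FRR are closest."""
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    best = min(thresholds, key=lambda t: abs(np.subtract(*far_frr(genuine, impostor, t))))
    far, frr = far_frr(genuine, impostor, best)
    return (far + frr) / 2.0
```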

Table 1. Equal error rates of single modalities

Modality      Use Case                          EER, %
Gait, Corr.   Accelerometer in breast pocket    14.8
Gait, Corr.   Accelerometer in hip pocket       14.1
Gait, FFT     Accelerometer in breast pocket    13.7
Gait, FFT     Accelerometer in hip pocket       16.8
Fingerprint   All                                3.2
Voice         Clean speech                       2.9
Voice         City noise, SNR 20 dB              2.8
Voice         City noise, SNR 10 dB              2.9
Voice         City noise, SNR 0 dB              12.1
Voice         Car noise, SNR 20 dB               3.1
Voice         Car noise, SNR 10 dB              12.1
Voice         Car noise, SNR 0 dB               27.8
Voice         White noise, SNR 20 dB            21.2
Voice         White noise, SNR 10 dB            31.3
Voice         White noise, SNR 0 dB             41.6

4. Cascaded fusion of single modalities

For each person-to-person comparison we had a set of scores obtained by comparing a biometric sample of one user against a sample of the same user (genuine score) or of a different user (impostor score), and the scores for all single modalities were provided by the same persons. The data was divided into two halves, “Set 1” and “Set 2”. Experiments were performed in such a way that each algorithm was first trained on “Set 1” and tested on “Set 2”; after that, the algorithm was trained on “Set 2” and tested on “Set 1”, and the test results were averaged.

In this study we used score-level fusion with the Weighted Sum fusion rule for the different possible configurations of the cascade, due to its simplicity. We used a Weighted Sum of three scores (the correlation and FFT scores of the gait modality and the voice score) for the first stage of the gait-voice-fingerprint cascade, and a Weighted Sum of all four scores for the last stage. For the cascade without gait we used the voice score only at the first stage, and a Weighted Sum of the voice and fingerprint scores at the last stage. Additionally, we tested a cascade which uses only the fingerprint score (without any fusion with unobtrusive modalities) at the last stage, but the error rates in this case were higher, and we omit those results here.
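The weighted-sum fusion used in the two cascade configurations amounts to the following. The weight values and dictionary keys below are hypothetical and only illustrate the two stages; how the weights are actually chosen is described in the training procedure below.

```python
def weighted_sum(scores, weights):
    """Fused score: sum over the modalities of this stage of weight times score."""
    return sum(weights[m] * scores[m] for m in weights)

# Scores for one person-to-person comparison (hypothetical values).
scores = {"gait_corr": 0.62, "gait_fft": 0.55, "voice": 0.71, "finger": 0.90}

# First stage of the gait-voice-fingerprint cascade: two gait scores plus voice.
stage1_score = weighted_sum(scores, {"gait_corr": 0.3, "gait_fft": 0.3, "voice": 0.4})

# Last stage: all four scores.
stage2_score = weighted_sum(scores, {"gait_corr": 0.2, "gait_fft": 0.2,
                                     "voice": 0.3, "finger": 0.3})
```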

The cascade was trained as follows: first, we set the upper limit of the target FAR (the same for both stages) and the lower limit, which we selected to be equal to 60% of the upper limit. Next, we train the first stage to achieve the target FAR. In the case of fusion in the first stage, the weights of the single modalities are calculated according to Formula 1 (TER is the Total Error Rate of a modality on the training set for the target FAR, and K is the number of fused modalities).

W_i = \frac{\sum_{k \neq i} TER_k}{(K - 1) \cdot \sum_{k} TER_k} ,    (1)

When the weights for fusion have been calculated (this step is not needed when only voice is used at the first stage), the threshold for the first stage of the cascade is set so that the FAR on the training set falls within the desired limits and the TER is as small as possible for this target FAR. After that, the second stage is trained in a similar way, so that the FAR after two stages falls within the desired limits. For this study we chose the following upper limits for the target FAR: 0.5%, 1%, 2%, 3%, and 4%. Considering the fairly high error rates of our modalities, setting a lower target FAR was not feasible.
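The training of one cascade stage, as described above (weights from Formula 1, then a threshold chosen so that the training-set FAR falls within the target limits while the TER is as small as possible), can be sketched as follows; the helper functions and the exhaustive threshold search are our own assumptions.

```python
import numpy as np

def formula1_weights(ters):
    """Formula 1: W_i is the sum of the other modalities' TERs divided by
    (K - 1) times the sum of all TERs, so better modalities get larger weights."""
    total = sum(ters.values())
    k = len(ters)
    return {m: (total - t) / ((k - 1) * total) for m, t in ters.items()}

def train_threshold(genuine, impostor, far_upper, far_lower):
    """Pick the threshold whose FAR on the training set lies in [far_lower, far_upper]
    and whose TER (FAR + FRR) is smallest; returns None if no threshold qualifies."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    best_t, best_ter = None, np.inf
    for t in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)
        frr = np.mean(genuine < t)
        if far_lower <= far <= far_upper and far + frr < best_ter:
            best_t, best_ter = t, far + frr
    return best_t

# Example usage (hypothetical TER values and fused training scores):
# weights = formula1_weights({"gait_corr": 0.30, "gait_fft": 0.28, "voice": 0.06})
# threshold = train_threshold(fused_genuine, fused_impostor, far_upper=0.01, far_lower=0.006)
```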

5. Cascade performance in different noise conditions and for different target FARs

Experiments were performed in order to compare the following configurations: 1) a cascade with voice-gait fusion at the first stage and voice-gait-fingerprint fusion at the last stage; 2) a cascade with only voice at the first stage and voice-fingerprint fusion at the last stage; 3) a cascade with fingerprint only at the last stage and the two different configurations of the first stage; 4) one-stage fingerprint verification.

Figures 3 and 4 present the performance of unobtrusive user verification in low and high noise conditions, respectively.

Figure 3. Performance of unobtrusive authentication in low noise conditions.

Unlike with traditional multimodal biometric systems, the performance of such a cascade cannot be evaluated only by its FAR and FRR after the last stage. The FRR of the unobtrusive stage is also important because, in the case of a false rejection in the presence of a security risk, the user will be disturbed and asked to provide more biometric data. Figure 3 shows that for most low noise conditions and target FARs, unobtrusive verification succeeds in 80-90% of cases. It can also be seen that fusion with gait decreases performance in most cases (compared with voice alone), although not significantly.

Figure 4. Performance of unobtrusive authentication in high noise conditions.

Figure 4 shows that fusion with gait improves recognition rates in high noise conditions, and that unobtrusive verification is possible in a large number of cases (40-70% of cases, depending on the type of noise and the target FAR) for city and car noise with SNR 0 dB and for white noise with SNR 20 dB. The chances of unobtrusive verification succeeding decrease to 30-50% under white noise with SNR 10 dB and 0 dB; however, white noise is not very common in real life.

Figures 5 and 6 present the cascade performance after the last stage, together with the performance of one-stage fingerprint verification, for fairly low and fairly high noise conditions, respectively. Figure 5 shows that the performance of the cascade in low noise conditions in both configurations (with and without gait) is better than that of fingerprint verification alone, and that fusion with gait improves performance in most cases. Figure 6 shows that in high noise conditions both configurations of the cascade (with and without gait) perform better than fingerprint verification alone, with the exception of car noise at 0 dB and a certain range of target FARs for white noise at 0 dB.

Figure 6 also shows that fusion with gait decreases the FRR in most cases, while slightly increasing the FAR in some cases, compared with the voice/fingerprint cascade.

Figures 5 and 6 also show that, for both low and high noise conditions, increasing the target FAR does not necessarily decrease the FRR, and that the target FAR was maintained fairly well in all configurations.

Figure 5. Performance after the last stage of cascade in low noise conditions.

Figure 6. Performance after the last stage of cascade in high noise conditions.

6. Conclusion

This work proposes a method for increasing the security of personal devices by reducing the user effort required for verification, thereby making it possible to verify users more frequently. The method was tested in offline experiments with unobtrusive biometric modalities natural for mobile users: gait and voice. Since the performance of unobtrusive biometrics is not sufficiently high to achieve reasonably low False Accept and False Reject Rates in all noise conditions, it is proposed to train the system in such a way that its False Accept Rate falls within desired limits (for such not-so-well-performing unobtrusive modalities as gait and voice, a reasonably low target FAR would be 1-2%), and to complement unobtrusive verification with a more reliable explicit modality (fingerprint) when unobtrusive verification fails and a security risk is present (e.g., when unobtrusive verification has failed during a certain time period, or when the user starts certain applications).

The experimental results confirm the feasibility of the proposed method: for fairly low noise levels (such as clean speech, city and car noise with SNR 20 dB, and city noise with SNR 10 dB) the unobtrusive verification rate was not less than 80%, while the overall False Accept Rate was less than 1% and the False Reject Rate was in the range of 1-2%. For fairly high noise levels (such as city and car noise with SNR 0 dB, and white noise with SNR 0-20 dB) the method showed a False Reject Rate of 3-7% (no worse than if verification were based solely on the fingerprint modality) for the same FAR. The rate of successful unobtrusive verification decreased to 40% for car noise and 60% for city noise, but even 40% is not a bad result, considering that in these experiments only one attempt at unobtrusive verification was allowed, whereas a more realistic scenario would be to perform several attempts at unobtrusive verification before requiring explicit verification. We plan to evaluate the method's performance in such a scenario.

Further data collection and experiments are needed in order to investigate the performance of gait recognition in realistic scenarios, and especially how different walking speeds, user tiredness and health affect recognition. Nevertheless, these initial experiments are encouraging, because they have shown that cascading unobtrusive and obtrusive biometrics (in both configurations, with and without gait) could significantly reduce user effort in most noise conditions, and that the overall recognition rates of the cascade were usually better than those of the best (obtrusive) modality alone. Although the recognition rates achieved in these experiments were not very good, we hope that they can be significantly improved as gait, voice and fingerprint recognition improve.

7. References

[1] S.A. Niyogi and E.H. Adelson, “Analyzing and recognizing walking figures in XYT”, Conf. on Computer Vision and Pattern Recognition, Seattle, WA, 1994.
[2] M. Nixon, J. Carter, J. Shutler, and M. Grant, “New Advances in Automatic Gait Recognition”, Information Security Technical Report, 2002, 7(4), pp. 23-35.
[3] L. Wang, T. Tan, W. Hu, and H. Ning, “Automatic gait recognition based on statistical shape analysis”, IEEE Trans. Image Processing, 2003, 12(9), pp. 1120-1131.
[4] L. Bianchi, D. Angelini, and F. Lacquaniti, “Individual characteristics of human walking mechanics”, Eur. J. Physiol., 1998, 436, pp. 343-358.
[5] H. Ailisto, M. Lindholm, J. Mäntyjärvi, E. Vildjiounaite, and S.-M. Mäkelä, “Identifying people from gait pattern with accelerometers”, SPIE, Vol. 5779, 2005, pp. 7-14.
[6] N.B. Yoma and M. Villar, “Speaker verification in noise using a stochastic version of the weighted Viterbi algorithm”, IEEE Trans. on Speech and Audio Processing, 10(3), 2002, pp. 158-166.
[7] H. Guangrui and W. Xiaodong, “Improved robust speaker identification in noise using auditory properties”, Intelligent Multimedia, Video and Speech Processing, 2001, pp. 17-19.
[8] E. Vildjiounaite, S.-M. Mäkelä, M. Lindholm, R. Riihimäki, V. Kyllönen, J. Mäntyjärvi, and H. Ailisto, “Unobtrusive Multimodal Biometrics for Ensuring Privacy and Information Security with Personal Devices”, Pervasive 2006.
[9] Y.-L. Zhang, J. Yang, and H.-T. Wu, “Sweep fingerprint sequence reconstruction for portable devices”, Electronics Letters, 42(4), 2006.
[10] L. Hong and A. Jain, “Integrating Faces and Fingerprints for Personal Identification”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(12), 1998.
[11] K. Takahashi, M. Mimira, Y. Isobe, and Y. Seto, “A Secure and User-Friendly Multimodal Biometric System”, SPIE Vol. 5404.
[12] E. Erzin, Y. Yemez, and A.M. Tekalp, “Multimodal Speaker Identification Using an Adaptive Classifier Cascade Based on Modality Reliability”, IEEE Trans. on Multimedia, 7(5), 2005, pp. 840-852.
[13] BBC News, http://news.bbc.co.uk/1/hi/uk/1748258.stm
[14] NTT-AT Ambient Noise Database: http://www.ntt-at.com/products_e/noise-DB/
[15] MASV: http://www.bas.uni-muenchen.de/Bas/SV/
[16] Biometrika: http://www.biometrika.it/eng/wp_fx3.html