SRPeek: Super Resolution Enabled Screen Peeking via COTS Smartphone

Jialuo Du∗, Chenning Li†, Zhenge Guo‡, Zhichao Cao†
∗Tsinghua University, †Michigan State University, ‡Xi’an Jiaotong University

Abstract—The screens of our smartphones and laptops display our private information persistently. The term “shoulder surfing” refers to the behavior of unauthorized people peeking at our screens, which can easily cause severe privacy leakages. Many countermeasures have been used to prevent naked-eye peeking by reducing the possible peeking distance. However, the risk from modern smartphones with powerful cameras is underestimated. In this paper, we propose SRPeek, a long-distance shoulder surfing attack method using smartphones. Our key observation is that although a single image captured by a smartphone camera is blurred, the attacker can leverage super-resolution (SR) techniques to recover the information from multiple blurry images. We design an end-to-end system deployed on commercial smartphones, including an innovative deep neural network (DNN) architecture, StARe, for efficient multi-image SR. We implement SRPeek in Android and conduct extensive experiments to evaluate its performance. The results demonstrate that we can recognize 90% of characters at a distance of 6 m with telephoto lenses and 1.8 m with common lenses, calling for vigilance against the quietly growing shoulder surfing threat.

Index Terms—shoulder surfing, deep learning, super resolution

I. INTRODUCTION

Digital screens play an unprecedented role in our daily life, and screen privacy has been well researched [1]–[4] for a long time. Unauthorized people peeking at your screen over your shoulder, known as “shoulder surfing”, takes place frequently in public places. A malicious attacker can directly gain access to considerable amounts of private information, including passwords, chat messages, and texts. For example, this puts the widely used verification code messages at risk, potentially leading to catastrophic results.

When attackers use only their naked eyes, multiple works aim to evaluate the risks in real life [5], [6], measure human adversaries [2], [7], and assess shoulder-surfing susceptibility [8], [4]. Some methods are proposed to defend against this threat by shortening the readable distance of the screen [9], increasing user vigilance [10]–[12], [3], or hiding critical information [1], [13]. Unfortunately, the full potential of shoulder surfing is drastically underestimated when the attacker is equipped with a modern smartphone, which can capture snapshots of a faraway screen and is probably capable of deciphering critical information from these images.

We present SRPeek, a new shoulder-surfing attack system deployed on a COTS smartphone, taking photos and running a super-resolution (SR) neural network as shown in Figure 1. When taking multiple snapshots of the same scene, the images slightly differ from each other, providing additional information that multi-frame SR algorithms can use to produce high-resolution images [14]. We use the burst mode of smartphone cameras to obtain more snapshots, but fusing the information is challenging and time-consuming. Our solution is StARe, a novel and lightweight Super-resolution Architecture for a large number of Repeated images. This model contains three novel designs, parameter sharing, Feature Zero, and feature-wise merge, to improve performance and reduce computation. Although StARe is a byproduct of an attack model, its value is not limited to malicious attacks: it can be utilized in a great variety of scenarios ranging from surveillance to self-driving cars, or even mobile Augmented Reality (AR) applications that require distortion resistance [15]. StARe enables a camera to provide high-quality imaging of a stationary object by staring at it for a slightly longer time.

Fig. 1: An illustrative attack scenario: a commercial smartphone captures massive blurred snapshots of complex characters at a range of 1.8 m / 6 m and an angle of up to 30°.

A possible attack scenario is shown in Fig. 1. During the attack, we hypothesize that the attacker (on the left) gains a line of sight (LOS) to the victim’s screen (on the right) within a 6 m range, and that any information of interest appears on the screen for half a second. Given the burst mode, the attacker can take ten snapshots in this time and process these images afterward with multi-frame SR algorithms to generate a high-resolution result, obtaining the information within 2 seconds. This whole process can even be looped to achieve real-time surveillance, while few would notice or suspect someone 6 meters away.

Summary: Our contributions in this work are as follows:

• A new threat model: We reveal the threat posed by present-day smartphones and SR technology. The experimental multi-frame SR network we designed can reconstruct text from highly blurred snapshots. We evaluate its impact on screen privacy, but it can also be applied to a wide range of long-range visual perception applications.

• A new attack system: We present SRPeek, an end-to-end threat model of long-range shoulder surfing, deployed on commercial smartphones. To the best of our knowledge, we are the first to consider the presence of smartphone cameras and SR algorithms in shoulder surfing scenarios.

• A new threat analysis: We evaluate this new shoulder-surfing threat model in multiple scenarios. SRPeek outperforms the state of the art for content recognition, calling for new privacy concerns.

• A new defense exploration: We discuss and study the effectiveness of a set of passive and active countermeasures to prevent information leakage under this unprecedented threat.

The rest of the paper is organized as follows. After the background information presented in Section II, we introduce the design principles in Section III and the specific system design in Section IV, followed by the implementation in Section V and the evaluation in Sections VI and VII. We then wrap up the paper by discussing the limitations and countermeasures of our system in Section VIII. Section IX describes the related work, especially the state of the art in shoulder surfing and SR techniques, and the conclusion is given in Section X.

II. BACKGROUND

Telephoto lens: Present-day smartphones are mostly equipped with multiple cameras. Among them, the telephoto camera, also known as the periscope camera, gives these smartphones up to 50× or 100× zoom. These cameras have a much longer focal length, providing 5× to 10× magnification and enhancing the smartphone’s ability to image faraway objects. We utilize this camera in SRPeek.

Burst mode: Improvements in memory read and write speed have kept increasing the frame rate of burst mode and video capture. By holding down the shutter button, users can take 10 to 20 snapshots per second with recent smartphones, providing more graphical information for super resolution. Compared to video recording, burst mode has a lower frame rate but can generate images with higher resolution, which is beneficial for super resolution tasks. SRPeek uses burst mode to capture multiple images as the inputs for multi-frame SR algorithms.

Computational abilities: The new generation of smartphones is equipped with unprecedented computational abilities. Most high-end phones have 8-core CPUs and programmable GPUs installed, making it possible to run lightweight deep neural network (DNN) models locally. However, computational abilities are still extremely limited for traditional SR DNN models. How to run an SR system in real time is the main challenge in designing SRPeek.

III. DESIGN PRINCIPLES

A. Challenges

Fig. 2: An example of concept drifting: (a) reconstructed image, (b) ground truth.

The core of the SRPeek system is StARe, a lightweight multi-frame SR architecture that aims to extract more information from a large number of input frames while reducing computational complexity. The designs of StARe are based on the nature and unique challenges of our application:

• Computational Complexity: Traditional SR DNN models are computationally expensive, and the increased number of input images exacerbates the problem. To achieve real-time SR, only 0.05 seconds is available for the model to process each snapshot. In contrast, existing SR models often contain more parameters and operations than image classification models, as they have a much larger output vector.

• Concept Drifting: The blurring functions may vary drastically with different distances, illumination, and smartphone lenses, leading to concept drift. Furthermore, the reconstructed images might be perceptually satisfactory but incorrect. For example, in Fig. 2, the upper part is wrongly reconstructed, and the network probably mistakes it as part of a left-falling stroke.

• Information Fusion: Increasing the number of input images can lead to enhanced reconstruction quality, but this requires sophisticated algorithms to fuse information between the noisy and perceptually heterogeneous images, which is often computationally expensive. Highly efficient algorithms are required for this fusion process, avoiding redundant computation while extracting and preserving essential features. This is rarely addressed in previous works, and StARe aims to fill this gap.

B. Primary Design Features

Different from traditional multi-frame SR networks, and to address the challenges of this unique application, three core improvements are made in StARe:

• Parameter Sharing and Channel Isolation: We use convolutional layers to process images, but each image is processed individually with the same set of parameters, reducing the parameter count and training complexity. Information is still shared between the images at each layer. However, unlike 3D convolutions, which are computationally intensive, this cross-channel data flow is performed by simple statistical calculations, reducing the computational cost.

Fig. 3: Workflow of the SRPeek system (input, alignment, adjustment, processing, output).

• Feature Zero: To encourage the network to generate images closer to the ground truth, we extract a set of simple features with a CNN from each original image, called Feature Zero. The features are then merged into the feature maps of this image repeatedly at each layer. As an added benefit, the model has varying depths, thus extracting differentials between inputs and reconstructing high-resolution images at multiple scales, which is critical in existing SR techniques [16].

• Feature-wise Merge: With Channel Isolation convolutions, a set of feature maps is extracted from each image individually, and parameter sharing ensures that these feature maps are comparable. Consequently, we can merge the corresponding feature maps from each image by simply calculating the max, min, and mean values, resolving the consensus of the information collected from each image. This result is then distributed along with Feature Zero to each image; in this way, we enable efficient horizontal data flow without any additional parameters (a minimal sketch of this merge follows this list).
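
To make the feature-wise merge concrete, here is a minimal Python sketch (assuming PyTorch tensors; names and shapes are illustrative, not the on-device implementation): the per-frame feature maps are reduced with parameter-free mean, max, and min statistics across the N frames, and the result is broadcast back to every frame.

import torch

def feature_wise_merge(feats):
    # feats: tensor of shape (N, C, H, W), one feature-map stack per input frame.
    # Parameter-free statistics across the N frames (dim=0).
    f_mean = feats.mean(dim=0, keepdim=True)           # consensus of all frames
    f_max, _ = feats.max(dim=0, keepdim=True)          # prominent bright features
    f_min, _ = feats.min(dim=0, keepdim=True)          # prominent dark features
    merged = torch.cat([f_mean, f_max, f_min], dim=1)  # (1, 3*C, H, W)
    # Broadcast the merged statistics back to every frame so each branch sees
    # both its own features and the cross-frame consensus.
    return merged.expand(feats.size(0), -1, -1, -1)    # (N, 3*C, H, W)

frames = torch.randn(10, 16, 64, 64)                   # 10 frames, 16 channels
print(feature_wise_merge(frames).shape)                # torch.Size([10, 48, 64, 64])

Because the statistics carry no learnable weights, this horizontal data flow adds no parameters regardless of how many frames are fed in.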

IV. SYSTEM DESIGN

We propose a portable, unobtrusive, and robust system to facilitate screen activity type recognition and sensitive information reconstruction. In this section, we introduce the input of the network and the preprocessing procedures it requires, the design of the network architecture, and its detailed structure. The workflow of SRPeek is shown in Fig. 3.

A. Image input and preprocessing

1) Burst mode and Alignment: The attacker points his camera at the victim’s screen and obtains multiple images of the same scene with burst mode. These images are then aligned to remove slight shifts due to hand tremors. In our application, we use the border of the screen for more accurate alignment, as shown in Fig. 4. Specifically, we use the Hough transform [17] to detect the edges roughly (Fig. 4(b)). Then we select the border edges and refine them at the pixel level, using the contrast between the luminescent screen and its dimmer background (red rectangle in Fig. 4(c)). Finally, we crop out the screen in each image and turn it into a fixed-size rectangle using an affine transformation (Fig. 4(d)). After these images are perfectly aligned, they will then undergo the preprocessing phase.
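
As an illustration of this alignment step, the Python sketch below (using OpenCV; thresholds, output sizes, and the use of a perspective warp in place of the stated affine transform are our assumptions) detects candidate border lines with a Hough transform and warps the screen quadrilateral to a fixed-size rectangle:

import cv2
import numpy as np

def detect_border_lines(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                  # rough edge map
    # Probabilistic Hough transform: long, near-straight segments are
    # candidates for the luminous screen border.
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=100, maxLineGap=10)
    return lines                                      # refine to the border before cropping

def crop_screen(img, corners, out_w=480, out_h=960):
    # corners: four screen corners (top-left, top-right, bottom-right, bottom-left),
    # e.g., obtained by intersecting the refined border lines.
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    M = cv2.getPerspectiveTransform(np.float32(corners), dst)
    return cv2.warpPerspective(img, M, (out_w, out_h))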

Fig. 4: The alignment process: (a) the original image, (b) the detected edges, (c) border detection, (d) cropped image.

2) Preprocessing: Images are preprocessed to remove environment-specific features before being sent to the neural networks, so that features learned from images in one environment can be used in all other environments. To reduce the workload, we crop out the non-text areas from the images, which can be detected with the density of edges from the Hough transform result in the alignment process (Fig. 4(b)). Due to the scarceness of text and the large line spacing in chat apps, in most cases we can filter out more than 70% of the image, reducing the workload of future procedures. Then the images are normalized to a value range of 0 to 1 and a standard deviation of 1, before being processed by the neural network StARe.
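
A minimal preprocessing sketch, assuming NumPy arrays and one plausible reading of the normalization described above (scale to [0, 1], then standardize); the text-region mask is taken as given here:

import numpy as np

def preprocess(img, text_mask):
    # img: uint8 grayscale snapshot of the cropped screen.
    # text_mask: boolean mask of the text area, e.g., derived from the
    # density of Hough edges during alignment.
    ys, xs = np.where(text_mask)
    crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(np.float32)
    crop /= 255.0                                       # scale values into [0, 1]
    crop = (crop - crop.mean()) / (crop.std() + 1e-8)   # unit standard deviation
    return crop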

B. Network Design

The core of the SRPeek system is a specially designed multi-frame SR neural network, accepting a group of $N$ images indexed $x^{(0)}_1$ to $x^{(0)}_N$ as input and generating an image with higher resolution $y$ as output. The detailed structure is shown in Fig. 5. The network comprises $L$ layers, each of which implements a non-linear transformation $H_l(\cdot)$, where $l$ indexes the layer. As mentioned before, in each layer the images are processed separately, with the merging layers serving as the avenue for communication, so that each output of a layer corresponds to one input. We denote the outputs of the $l$-th layer as $x^{(l)}_1$ to $x^{(l)}_N$, which are also the inputs of the $(l+1)$-th layer. Up to this point, the model is no different from traditional SR models:

$x^{(l)}_i = H_l(x^{(l-1)}_i), \quad i = 1, 2, \ldots, N. \qquad (1)$

Additionally, we introduce the initial images $x^{(0)}_i$ as an input to all the layers:

$x^{(l)}_i = H_l(x^{(l-1)}_i, x^{(0)}_i), \quad i = 1, 2, \ldots, N. \qquad (2)$

Fig. 5: Core network architecture of SRPeek. Each layer H comprises Conv 1, Merge, Conv 2, and Conv 3, taking the previous layer’s output together with the input profile information from the original images.

The last layer is an exception: it yields a single image $y$ as output.

In these layers, nothing is done to raise the resolution of the images, so the resolution of $x^{(0)}_i$ to $x^{(l-1)}_i$ and $y$ remains the same. To increase the resolution, we insert several 2× nearest-neighbor upsampling layers $U$ evenly throughout the architecture between the layers:

$x^{(l)}_i \leftarrow U(x^{(l)}_i), \quad x^{(0)}_i \leftarrow U(x^{(0)}_i), \quad i = 1, 2, \ldots, N. \qquad (3)$

We upsample the input images $x^{(0)}_i$ simultaneously to keep the two inputs of the following layers $H_l(x^{(l-1)}_i, x^{(0)}_i)$ consistent in resolution. Inside each layer $H_l$ there are three convolutional layers $Conv1_l(\cdot)$, $Conv2_l(\cdot)$, $Conv3_l(\cdot)$ and one merging layer $Merge_l(\cdot)$.

Conv1: The first convolutional layer, $Conv1_l$, accepts the layer’s first input parameter $x^{(l-1)}_i$ as input. Note that all three convolutional layers accept a single image (or its feature maps from the previous convolutional layer) as input, the convolutional process is repeated for all the images, and calculations within the same convolutional layer always share the same group of parameters. The parameters are denoted as $Params_l$ for convolutional layer $Convs_l$, $s = 1, 2, 3$.

$a^{(l)}_i = Conv1_l(x^{(l-1)}_i, Param1_l), \quad i = 1, 2, \ldots, N. \qquad (4)$

Merge: The results of the previous step for all the images, $\{a^{(l)}_1, a^{(l)}_2, \ldots, a^{(l)}_N\}$, are then passed to the merging layer $Merge_l$ to generate $T$ groups of feature maps. Suppose the result of $Conv1_l$ consists of $R$ channels:

$a^{(l)}_i = \{a^{(l)}_{i1}, a^{(l)}_{i2}, \ldots, a^{(l)}_{iR}\}, \quad i = 1, 2, \ldots, N. \qquad (5)$

The data in each channel are merged separately in the merging layer. The output is $T \times R$ channels, denoted as $b^{(l)}_{tr}$ ($t = 1, 2, \ldots, T$, $r = 1, 2, \ldots, R$):

$b^{(l)}_{tr}(p,q) = \sum_{i=1}^{N} a^{(l)}_{ir}(p,q)\, e^{k_t a^{(l)}_{ir}(p,q)} \Big/ \sum_{i=1}^{N} e^{k_t a^{(l)}_{ir}(p,q)}, \qquad (6)$

where $(p,q)$ denotes the pixel at this coordinate, and $k_t$ is a set of fixed parameters shared by all the merging layers throughout the model, controlling the behavior of the merging process. Evidently, $k=0$ leads to averaging, $k=+\infty$ leads to the max operator, and $k=-\infty$ leads to the min operator. We use $T=5$ and $k = -1, -0.5, 0, 0.5, 1$ in our model, giving consideration to both consensus ($k=0$, averaging) and prominent features ($k=1$, ‘soft’ max, and $k=-1$, ‘soft’ min). These $T \times R$ channels $b^{(l)}_{tr}$ are the output of this merging layer $Merge_l$.
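
For concreteness, Eq. (6) is a per-pixel, per-channel softmax-weighted average over the $N$ frames; a minimal sketch (assuming PyTorch, with tensor shapes as illustrative assumptions) is:

import torch

def soft_merge(a, ks=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    # a: per-frame feature maps of one layer, shape (N, R, H, W).
    # Returns T*R merged channels, shape (T*R, H, W), one group per k value.
    outs = []
    for k in ks:
        w = torch.softmax(k * a, dim=0)     # weights over the N frames (Eq. 6)
        outs.append((w * a).sum(dim=0))     # weighted average, shape (R, H, W)
    return torch.cat(outs, dim=0)           # k=0 -> mean, large |k| -> ~max / ~min

merged = soft_merge(torch.randn(10, 8, 64, 64))
print(merged.shape)                          # torch.Size([40, 64, 64])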

Conv2: $Conv2_l$ is a replica of $Conv1_l$, processing the layer’s second input parameter $x^{(0)}_i$ and also generating $N$ outputs with $R$ channels per output, denoted as $c^{(l)}_{ir}$, $i = 1, 2, \ldots, N$, $r = 1, 2, \ldots, R$:

$c^{(l)}_i = Conv2_l(x^{(0)}_i, Param2_l), \quad c^{(l)}_i = \{c^{(l)}_{i1}, c^{(l)}_{i2}, \ldots, c^{(l)}_{iR}\}, \quad i = 1, 2, \ldots, N. \qquad (7)$

Conv3: The data from Merge and Conv2 are merged together: all $T \times R$ channels of $b^{(l)}_{tr}$ are replicated $N$ times and stacked with each one of the $N$ outputs of $Conv2_l$, before these $N$ outputs, each with $(T+1) \times R$ channels, are passed through the third convolutional layer $Conv3_l$. There are also $N$ outputs of this convolutional layer, denoted as $d^{(l)}_i$, $i = 1, 2, \ldots, N$:

$d^{(l)}_i = Conv3_l(\mathrm{Stack}(c^{(l)}_i, b^{(l)}), Param3_l), \quad i = 1, 2, \ldots, N. \qquad (8)$

Output: If $l < L$, this is not the last layer, and the $N$ outputs $d^{(l)}_i$ of the previous step are the output of layer $H_l$. Otherwise, we add another merging layer and a common convolutional layer after $Conv3_L$ to merge the data into one output image $y$. The merging layer is identical to the previous $Merge_l$, merging the $N$ outputs $d^{(L)}_i$ into $T \times R$ channels $e^{(L)}_{tr}$. This is followed by a convolutional layer that generates a single channel of output $y$:

$\{e^{(L)}_{tr},\ t \le T,\ r \le R\} = Merge'_L(\{d^{(L)}_i,\ i \le N\}), \quad y = Conv'_L(\{e^{(L)}_{tr},\ t \le T,\ r \le R\}). \qquad (9)$
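
Putting the pieces together, the following sketch of a single layer $H_l$ is our own reconstruction from Eqs. (4)–(8) in PyTorch; channel counts, kernel sizes, and activation functions are assumptions rather than the published configuration:

import torch
import torch.nn as nn

class StAReLayer(nn.Module):
    # One layer H_l: shared-parameter convs applied per frame plus a parameter-free merge.
    def __init__(self, in_ch, r=16, ks=(-1.0, -0.5, 0.0, 0.5, 1.0)):
        super().__init__()
        self.ks = ks
        self.conv1 = nn.Conv2d(in_ch, r, 3, padding=1)        # processes x_i^(l-1), Eq. (4)
        self.conv2 = nn.Conv2d(1, r, 3, padding=1)             # Feature Zero from x_i^(0), Eq. (7)
        self.conv3 = nn.Conv2d((len(ks) + 1) * r, r, 3, padding=1)  # fuses merge + Feature Zero, Eq. (8)

    def merge(self, a):                                         # a: (N, R, H, W), Eq. (6)
        outs = [(torch.softmax(k * a, dim=0) * a).sum(dim=0) for k in self.ks]
        return torch.cat(outs, dim=0)                           # (T*R, H, W)

    def forward(self, x_prev, x0):
        # x_prev: (N, C, H, W) outputs of the previous layer; x0: (N, 1, H, W) original frames.
        a = torch.relu(self.conv1(x_prev))                      # shared weights: frames act as the batch
        b = self.merge(a)                                       # cross-frame consensus channels
        c = torch.relu(self.conv2(x0))                          # Feature Zero branch
        b_rep = b.unsqueeze(0).expand(a.size(0), -1, -1, -1)    # replicate the merge for each frame
        return torch.relu(self.conv3(torch.cat([c, b_rep], dim=1)))

layer = StAReLayer(in_ch=16)
y = layer(torch.randn(10, 16, 64, 64), torch.randn(10, 1, 64, 64))
print(y.shape)  # torch.Size([10, 16, 64, 64])

Because the convolutions treat the N frames as a batch, the parameter count is independent of N, while the merge step carries no parameters at all.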

V. IMPLEMENTATION & TRAINING

In our experiments, we had to collect the training dataset on our own. To the best of our knowledge, there is no publicly available image dataset built for shoulder surfing, and because of the uniqueness of our application, i.e., blurriness, targeting characters, working with burst-mode snapshots, etc., we could not find any publicly available substitutes. To collect training data, we use two smartphones: one for the attacker, taking snapshots and running SRPeek, and one for the victim, displaying Chinese and English characters while taking screenshots of itself as ground truth. The experimental setting of this data collection phase is illustrated in Fig. 6(a).

The data was collected at different times of the day and night, with different illumination, and at different positions and angles (tilting no more than 30 degrees). We collected 800,000 images in this way, modifying these environmental parameters every 2,000 images. This process is time-consuming but can be largely automated, and it is crucial for training a robust model.

VI. MODEL EVALUATION

We perform the following experiments with two commercial off-the-shelf (COTS) smartphones: a Redmi 6A, with a single 13-megapixel rear camera and digital zoom only, and a HUAWEI P40 Pro, with multiple rear cameras. Its telephoto camera offers up to 5× optical zoom, which we utilize fully in our experiments. The Peak Signal-to-Noise Ratio (PSNR) metric and Optical Character Recognition (OCR) services are used to evaluate the accuracy of our system (the latter measured as accuracy per character).

Fig. 6: The experimental setting, with the distance between the attacker and the victim shortened for demonstration: (a) data collection phase; (b) real-life attack phase.
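
For reference, the PSNR metric used throughout the evaluation can be computed as follows (a minimal sketch assuming images normalized to [0, 1]):

import numpy as np

def psnr(reconstructed, ground_truth, peak=1.0):
    # Peak Signal-to-Noise Ratio in dB; higher is better.
    mse = np.mean((reconstructed.astype(np.float64) -
                   ground_truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)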

A. Performance in Controlled Environments

In these experiments, we train and test the model with images captured with the same environment parameters, as shown in Fig. 7. The traditional-lens group is trained and tested at 1–2 meters, while the optical-lens group is at 5–7.5 meters, where less than 5% of the characters can be read with the naked eye. The model can achieve an OCR accuracy above 90% at 1.8 m with a traditional lens and at 6 m with an optical lens. Performance is relatively consistent between day and night, while increased distances mean less data, causing more artifacts like missing or misplaced strokes.

B. Performance in Random Environments

We train the model with data captured under varying environmental parameters and test its ability in a new environment setting. The results are shown in Fig. 8. The model can achieve an OCR accuracy above 85% at 1.8 m with a traditional lens, and above 90% at 6 m with an optical lens. This verifies the efficiency of our model for environment adaptation.

C. Performance with Fewer Available Images

As mentioned in Sec. IV, our model is designed to work with any number of input images, which is a requisite because in specific scenarios the data displayed on the victim’s screen is transient and ever-changing, e.g., password entry. We evaluate the impact of fewer available images on the performance of the SR model; see Fig. 9. This and the following experiments are performed in the 1.8 m daytime scenario for the traditional lens and the 6 m daytime scenario for the optical lens. The results show that the model can achieve decent performance with at least ten images, which is ample for most real-life scenarios.

D. Adapting Ability

We expose the model to training data containing fewer variations of a specific environment parameter, and examine its performance in new environments. The ’All’ group uses all the training data, while the ’Light’, ’Distance’, and ’Angle’ groups only use data with the same lighting, distance, and angle, respectively. The results are shown in Fig. 10. For the traditional lens, the effective range is closer, so the distance parameter has little importance. In contrast, variations in the light and angle parameters in the training data are crucial to a robust model. For optical lenses, where the effective range is farther, the distance parameter surpasses the light parameter in importance for training. The results indicate that attackers can drastically reduce preparation time by omitting the distance or light parameter variations during training data collection.

TABLE I: Comparison with existing systems.
System         SRPeek    SRCNN     VideoSR
PSNR           13.3 dB   7.69 dB   8.40 dB
FLOPs          419K      857K      1020K
Parameters     405K      85K       1370K
OCR Accuracy   100%      10%       23%

TABLE II: Success rate in different tasks.
Accuracy    Read text   Type text   Enter PIN   Enter password
Raw Image   5%          0           0           0
SRPeek      100%        100%        100%        80%

E. Comparison with Other Architectures

We train and test other widely used networks with the same sets of data and evaluate their results. We chose SRCNN [18], a commonly used single-image SR network, applying it to every image before merging the results by pixel-level averaging. We also used a multi-frame CNN with 3D convolutions, originally designed for video super resolution [19] (VideoSR). However, as mentioned above, it is challenging for single-image approaches to utilize cross-frame information and distinguish the noisy and deformed patterns. In contrast, VideoSR approaches rely upon consistency between frames, so they fail to give satisfactory results. The results are shown in Table I. We can see that the PSNR of SRPeek is 13.32 dB, which is 73.2% and 58.6% higher than SRCNN and VideoSR, with less memory and computation overhead. StARe has the fewest floating-point operations (FLOPs). SRCNN has a smaller parameter count than StARe, but it is a single-frame SR model and has to be run once for each input frame. For OCR accuracy, SRPeek can recognize all characters, while SRCNN and VideoSR can only recognize 10% and 23% of them, respectively. The results show the superiority of our SR model.

VII. CASE STUDY

A. Accuracy

We build the system on smartphones and evaluate its performance in real-life scenarios (shown in Fig. 6(b)). We experiment with a Redmi 6A smartphone (with a 13-megapixel camera and no optical zoom) for the attacker and a HUAWEI Mate8 smartphone for the victim.

Fig. 7: Performance in controlled environments, measured by OCR accuracy (%) and PSNR (dB). (a) Traditional lens, day/night at 1.4 m and 1.8 m. (b) Optical lens, day and night at 5–7.5 m.

Fig. 8: Performance in random environments, measured by OCR accuracy (%) and PSNR (dB). (a) Traditional lens, day/night at 1.4 m and 1.8 m. (b) Optical lens, day and night at 5–7.5 m.

Fig. 9: Performance with fewer available images (20, 15, 10, 6, 3), measured by OCR accuracy (%) and PSNR (dB). (a) Traditional lens. (b) Optical lens.

Fig. 10: Performance when adapting to new environments (All, Light, Distance, Angle), measured by OCR accuracy (%) and PSNR (dB). (a) Traditional lens. (b) Optical lens.

TABLE III: Recognition accuracy in various scenarios.
Scenarios   Home   Transport   Theater
Naked Eye   5%     0           0
OCR         100%   10%         23%
Human       95%    85%         70%

The experimental setting is 1.8 meters, daytime, with a traditional lens. We instructed five human participants to read the reconstructed characters to evaluate the usability of our model. No participant could read the unprocessed images, but all of them could decipher the information in the reconstructed image without much difficulty. The results are shown in Table III.

The results show that humans can read 95%, 85%, and 70% of the content at home, in transport, and in the theater, while the OCR accuracy is 100%, 10%, and 23%, respectively. This verifies that humans can obtain most of the information from peeking in various environments. In transport, the vibration of the smartphone and the darker environment can fool the OCR model compared to human recognition, which leads to the lower OCR accuracy in the transport and theater scenarios.

B. Influence of Hand Tremors

We change the attacker’s camera and/or the victim’s target screen from stationary to handheld, introducing tremors on the camera and/or target side, to see how our system deals with a moving target screen. We ask participants to hold the attacker’s and/or victim’s phones still in their hands. The results are shown in Table IV. We can see that in the presence of tremors, the recognition accuracy drops from 95% to 85%/80% for both OCR tools and humans. In addition, hand tremors can cause motion blur and erratic shifts at the sub-pixel level, impacting performance.

TABLE IV: Impact of hand tremors while holding the phone.
Accuracy    None   Camera   Target   Both
Naked Eye   5%     0        5%       5%
OCR         95%    85%      80%      80%
Human       95%    85%      80%      85%

C. Success rate in different tasks

We test the success rate of obtaining crucial information when the victim performs several tasks on the phone: reading text messages, typing text messages, entering a PIN, and typing passwords with numbers, English letters, and special characters (typing at two characters per second). We use accuracy per character as the evaluation metric. Fewer photos are available in the PIN and password tasks, but deciphering English characters is also easier than Chinese characters, and we use specifically trained models (with the same structure but different training data). Results are shown in Table II.

We conclude that SRPeek functions normally in everydayscenarios and poses a subtle threat to screen privacy.

D. Perceived shoulder surfing susceptibility

We asked the participants to rate the perceived shoulder-surfing susceptibility after the experiment. The attacker sits or stands at a 1.8 m range, pretending to interact with their phone while continuously running the shoulder-surfing app. None of the participants reported suspicion of shoulder surfing. Thus, our system can enable a malicious attacker to gather large amounts of critical information from the victim while remaining unnoticed.

VIII. DISCUSSION

A. Limitations

There are a few limitations to this work. We require a certain degree of image capturing and processing ability from the attacker’s phone, and we also expect the victim not to exert too much disturbance on the target phone.

• Image Capturing Ability: The latest models of smartphones can easily capture images at 10 frames per second in burst mode; however, this ability is not common in phones that are three years old. Heavily used phones may also take a longer time to capture images in burst mode.

• Processing Ability: To achieve the best performance, the user needs a phone with strong processing capabilities to run the neural network in real time. As neural networks have become commonplace in numerous modern apps, most phones of the latest generation have upgraded processing ability to run them, but older models might not possess such processing power and cannot process images in real time.

• Motion and Line of Sight: We assume the observed user holds his/her phone still, but there might be extreme cases where frequent movement of the screen causes severe motion blur, degrading the result. Also, our work assumes LOS to the victim’s screen and an angle within 30 degrees, which might not be possible if the user holds the phone too close to his or her body.

B. Countermeasures

Although SRPeek proves to be highly efficient against unprotected screens, there are some simple methods to mitigate this unique threat without encumbering the user.

• Dynamic background. Most multi-frame SR algorithms, including ours, assume the consistency of the scene. By deploying a dynamic background behind the characters, such as tiny moving dots, we can break this assumption and confuse the sensitive SR algorithms (a toy sketch of such a pattern follows this list). Furthermore, the blurriness of the captured images makes this background interference hard to remove, providing stable protection.

• Active scanning. There are several works providing active countermeasures against naked-eye shoulder surfing, scanning for passers-by and analyzing their gaze directions with front-facing cameras [11] [10] [12] [3]. To the best of our knowledge, none of these works has included cameras in its detection scope, but we believe it is practical to implement such features.

• Adversarial machine learning methods. Recent studies have discovered weaknesses of neural networks: microscopic changes, undetectable to the human eye, may severely confuse them. Theoretically, by applying a certain pattern to the victim’s screen, one can confuse the attacker’s SR algorithms and provide protection.
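
As a toy illustration of the dynamic-background idea above (a NumPy sketch under our own assumptions, not an evaluated defense), each displayed frame redraws a sparse field of dim dots at fresh random positions, breaking the scene-consistency assumption that multi-frame SR relies on:

import numpy as np

def dotted_background(h, w, density=0.002, dot_value=0.08, rng=None):
    # Returns one frame of a faint dot pattern in [0, 1]; regenerating it every
    # display frame makes consecutive snapshots of the screen inconsistent.
    rng = rng or np.random.default_rng()
    frame = np.zeros((h, w), dtype=np.float32)
    n_dots = int(density * h * w)
    ys = rng.integers(0, h, n_dots)
    xs = rng.integers(0, w, n_dots)
    frame[ys, xs] = dot_value
    return frame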

IX. RELATED WORK

A. Shoulder Surfing

Shoulder surfing has been studied heavily in recent years [5]–[7]. To mitigate this threat, some systems hide critical information [10] or warn the user [12] once malicious passers-by are sensed; others modify the user interfaces, including creating honeypots (for passwords) [13], confusing unauthorized parties [20], making the interactions invisible [21], or making them unreadable from a distance [9].

TABLE V: A comparison of the state of the art on shoulder surfing.
Reference    Scenarios     Metric   Quantitative   Distance (a)
Eiband [5]   Naked eye     ×        ×              -
Kwon [7]     Naked eye     ✓        ×              1 m
Schaub [8]   Naked eye     ✓        ×              - (b)
Maggi [22]   Camera        ✓        ×              - (b)
SRPeek       COTS phones   ✓        ✓              1.8 / 6 m
(a) The maximum distance to the victim's screen. (b) The attacker stands next to the victim.

Most of these works assume that the attacker is a casual passer-by, taking occasional peeks with the naked eye, as is the case most of the time [5], [20]. Given assisted equipment, however, a malicious attacker can readily acquire sensitive information (passwords, business correspondence, etc.) and do real harm.

However, compared to the various works focusing on defenses against shoulder surfing, the works studying and modeling this threat are sparse and outdated. Most of them focus on scenarios where the attacker peeks at the phone with his/her naked eye, investigating real-life stories [5], designing threat models [7], or evaluating the shoulder-surfing susceptibility of different keyboards [8]. Under the naked-eye limitation, these works fail to uncover the full potential of shoulder surfing. There are also works where the attacker is equipped with auxiliary devices, e.g., a digital camera [22], but they fail to consider the use of SR networks, limiting their performance. The related works are summarized in Table V. To the best of our knowledge, we are the first to design and model this new form of shoulder surfing attack with the assistance of smartphones and multi-frame SR algorithms.

B. Super Resolution

Image super resolution is the process of reconstructing an image with a higher spatial resolution. Multi-image SR techniques work on a set of pictures of the same scene, collecting extra data from slight differences between these pictures to reconstruct high-quality images. Recently, deep learning networks have been widely applied to multi-image SR due to their advantages in noise reduction and feature extraction from large amounts of complex data [23]. The most commonly used architectures are Convolutional Neural Networks (CNNs) [18] and Generative Adversarial Networks (GANs) [24]; the former often gets closer to the ground truth, while the latter generates fewer artifacts and is more pleasing to the human eye. To solve multi-image SR tasks, say video SR, some works [19], [25] use 3-dimensional convolutions to exploit the sequentiality and consistency between adjacent frames. Some works also modify the data flow among the network layers to merge neighboring frames [26], or recurrently process the frames under the guidance of the output of the previous frame [27]. For images without consistency or sequential information, like satellite images, most works choose hybrid methods, solving the multi-image SR problem with multiple single-image SR procedures. They either merge the results of single-image SR algorithms for efficiency [28], or build a multi-image network to create a comprehensive view based on single-image SR networks [29]. Due to the extreme blurriness of the snapshots and the absence of consistency, the above methods are not competent for the new shoulder surfing threat model we propose, and we design StARe to fill this gap.

X. CONCLUSION

In this work, we designed a holistic system, SRPeek, for shoulder surfing with smartphones, which serves as an up-to-date threat model for shoulder surfing, and we proved its efficiency. We showed that this threat to screen privacy is imminent and can steal critical information, including personal texts or passwords, from long distances, thus escaping detection. It is our wish that this work can stir discussion in the field of screen privacy protection and propagate defense mechanisms across critical mobile apps.

The core of SRPeek is a specially designed multi-frame SR network. With its innovative architecture, this network outperforms other algorithms in the same field in our application. The design ideology gives this network a higher level of data integration ability while keeping a low computation profile, and we believe the elements of this design can be used in other applications with large amounts of data, such as natural language processing or anomaly detection. Our model can also be used in OCR tasks when multiple images are available, functioning as a preprocessing phase to improve the quality of the images and increase accuracy.

ACKNOWLEDGEMENT

This study is supported in part by NSFC Grants 61972218, 61872081, and 61772446.

REFERENCES

[1] P. Nimbalkar, Y. Pachpute, N. Bansode, and Vaishali Bhorde. A survey on shoulder surfing resistant graphical authentication system. Int J Sci Eng, 2017.

[2] Leon Bosnjak and Bostjan Brumen. Shoulder surfing: From an experimental study to a comparative framework. International Journal of Human-Computer Studies, 2019.

[3] Shiguo Lian, Wei Hu, Xingguang Song, and Zhaoxiang Liu. Smart privacy-preserving screen based on multiple sensor fusion. IEEE Transactions on Consumer Electronics, 2013.

[4] Furkan Tari, A. Ant Ozok, and Stephen H. Holden. A comparison of perceived and real shoulder-surfing risks between alphanumeric and graphical passwords. In Proceedings of the Second Symposium on Usable Privacy and Security, 2006.

[5] Malin Eiband, Mohamed Khamis, Emanuel Von Zezschwitz, Heinrich Hussmann, and Florian Alt. Understanding shoulder surfing in the wild: Stories from users and observers. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 4254–4265, 2017.

[6] Wendy Goucher. Look behind you: the dangers of shoulder surfing. Computer Fraud & Security, 2011.

[7] Taekyoung Kwon, Sooyeon Shin, and Sarang Na. Covert attentional shoulder surfing: Human adversaries are more powerful than expected. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2013.

[8] Florian Schaub, Ruben Deyhle, and Michael Weber. Password entry usability and shoulder surfing susceptibility on different smartphone platforms. In Proceedings of the 11th International Conference on Mobile and Ubiquitous Multimedia, 2012.

[9] Chun-Yu Daniel Chen, Bo-Yao Lin, Junding Wang, and Kang G. Shin. Keep others from peeking at your mobile device screen! In The 25th Annual International Conference, 2019.

[10] Frederik Brudy, David Ledo, Saul Greenberg, and Andreas Butz. Is anyone looking? Mitigating shoulder surfing on public displays through awareness and protection. In Proceedings of The International Symposium on Pervasive Displays, 2014.

[11] Hee Jung Ryu and Florian Schroff. Electronic screen protector with efficient and robust mobile vision. In Demos section, Neural Information Processing Systems Conference, 2017.

[12] Alia Saad, Michael Chukwu, and Stefan Schneegass. Communicating shoulder surfing attacks to users. In Proceedings of the 17th International Conference on Mobile and Ubiquitous Multimedia, 2018.

[13] Nilesh Chakraborty and Samrat Mondal. Tag digit based honeypot to detect shoulder surfing attack. In International Symposium on Security in Computing and Communication, 2014.

[14] Bartlomiej Wronski, Ignacio Garcia-Dorado, Manfred Ernst, Damien Kelly, Michael Krainin, Chia-Kai Liang, Marc Levoy, and Peyman Milanfar. Handheld multi-frame super-resolution. ACM Transactions on Graphics (TOG), 2019.

[15] Guohao Lan, Zida Liu, Yunfan Zhang, Tim Scargill, Jovan Stojkovic, Carlee Joe-Wong, and Maria Gorlatova. Edge-assisted collaborative image recognition for mobile augmented reality. ACM Trans. Sen. Netw., 2021.

[16] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[17] Richard O. Duda and Peter E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 1972.

[18] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[19] Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K. Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2016.

[20] Susan Wiedenbeck, Jim Waters, Leonardo Sobrado, and Jean-Camille Birget. Design and evaluation of a shoulder-surfing resistant graphical password scheme. In Proceedings of the Working Conference on Advanced Visual Interfaces, 2006.

[21] Manu Kumar, Tal Garfinkel, Dan Boneh, and Terry Winograd. Reducing shoulder-surfing by using gaze-based password entry. In Proceedings of the 3rd Symposium on Usable Privacy and Security, 2007.

[22] Federico Maggi, Alberto Volpatto, Simone Gasparini, Giacomo Boracchi, and Stefano Zanero. Poster: Fast, automatic iPhone shoulder surfing. In Proceedings of the 18th ACM Conference on Computer and Communications Security, 2011.

[23] Chenning Li, Zhichao Cao, and Yunhao Liu. Deep AI enabled ubiquitous wireless sensing: A survey. ACM Comput. Surv., 2021.

[24] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[25] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[26] Yan Huang, Wei Wang, and Liang Wang. Video super-resolution via bidirectional recurrent convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[27] Mehdi S. M. Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6626–6634, 2018.

[28] Michal Kawulok, Pawel Benecki, Szymon Piechaczek, Krzysztof Hrynczenko, Daniel Kostrzewa, and Jakub Nalepa. Deep learning for multiple-image super-resolution. IEEE Geoscience and Remote Sensing Letters, 2019.

[29] Z. Dong, S. Zhang, B. Ma, D. Qi, L. Luo, and M. Zhou. A hybrid multi-frame super-resolution algorithm using multi-channel memristive pulse coupled neural network and sparse coding. In Proceedings of the 7th International Conference on Information, Communication and Networks (ICICN), 2019.
