
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 1

Deep Joint Encryption and Source-Channel Coding: An Image Privacy Protection Approach

Jialong Xu, Student Member, IEEE, Bo Ai, Senior Member, IEEE, Wei Chen, Senior Member, IEEE, Ning Wang, Member, IEEE, and Miguel Rodrigues, Senior Member, IEEE

Abstract—Joint source and channel coding (JSCC) has achieved great success due to the introduction of deep learning. Compared with traditional separate source-channel coding (SSCC) schemes, the advantages of DL based JSCC (DJSCC) include high spectrum efficiency, high reconstruction quality, and relief of the “cliff effect”. However, it is difficult to couple encryption-decryption mechanisms with DJSCC, in contrast with traditional SSCC schemes, which hinders the practical usage of this emerging technology. To this end, our paper proposes a novel method called DL based joint encryption and source-channel coding (DJESCC) for images that can successfully protect the visual information of the plain image without significantly sacrificing image reconstruction performance. The idea of the design is to use a neural network to conduct image encryption, which converts the plain image to a visually protected one with consideration of its interaction with DJSCC. During the training stage, the proposed DJESCC method learns: 1) deep neural networks for image encryption and image decryption, and 2) an effective DJSCC network for image transmission in the encrypted domain. Compared with perceptual image encryption methods with DJSCC transmission, the DJESCC method achieves much better reconstruction performance and is more robust to ciphertext-only attacks.

Index Terms—Image encryption, image transformation, privacy protection, security, joint source-channel coding, deep learning.

I. INTRODUCTION

THE modular design principle based on Shannon's separation theorem [1] is the cornerstone of modern communications and has enjoyed great success in the development of wireless communications. However, the assumptions of unlimited codeword length, delay, and complexity in the separation theorem do not hold in real wireless environments, leading to sub-optimal separate source-channel coding (SSCC). Moreover, for time-varying channels, when the channel quality is lower than the target channel quality, SSCC cannot decode any information due to the collapse of channel coding;

(Corresponding authors: Bo Ai; Wei Chen)

Jialong Xu is with the State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]).

Bo Ai is with the State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]).

Wei Chen is with the State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]).

Ning Wang is with the School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China (e-mail: [email protected]).

Miguel Rodrigues is with the Department of Electronic and Electrical Engineering, University College London, London WC1E 7JE, U.K. (e-mail: [email protected]).

when the channel quality is higher than the target quality, separation coding cannot further improve the reconstruction quality. This is the famous “cliff effect” [2], which increases the cost of SSCC during wireless transmission. In past years, joint source-channel coding (JSCC) has been demonstrated theoretically to have a better error exponent than SSCC for discrete memoryless sources and channels [3]–[6], motivating the development of various JSCC designs over the years [7]–[10].

More recently, deep learning (DL) based approaches have been proposed for source coding [11]–[14], channel coding [15]–[18], and JSCC [19]–[22]. Compared with SSCC schemes (e.g., JPEG/JPEG2000 for image coding and LDPC for channel coding), the DL based JSCC (DJSCC) scheme designed in [19] has better image restoration quality, especially in the low signal-to-noise ratio (SNR) regime. To adapt to variable bandwidth and exploit the utility of channel output feedback in real wireless environments, DJSCC schemes with adaptive-bandwidth image transmission and with channel output feedback are proposed by [20] and [21], respectively. However, all the aforementioned schemes are trained and inferred under the same channel condition (a single SNR) to ensure optimality, demanding multiple trained networks to cover a range of SNRs, which leads to considerable storage requirements in transceivers. To overcome this challenging problem in DJSCC, [22] proposed a single DJSCC network which can adapt to a wide range of SNR conditions to meet the memory limit of devices in real wireless scenarios. So far, by using the data-driven approach, DJSCC successfully reduces the difficulty of coding design in traditional JSCC, and balances performance against storage requirements. These benefits brought by DL make DJSCC methods easier to deploy than traditional JSCC schemes.

Yet, to protect information privacy and confidentiality, one must also couple encryption and decryption mechanisms with DJSCC based wireless communication systems, as illustrated in Fig. 1. The image owner intends to transmit a plain image (an unencrypted image) to the image recipient through a network containing an untrusted wired network and a wireless transmission service. To protect the visual information of the plain image, the image owner encrypts the plain image into an encrypted image (the visually protected image) before providing it to the wireless service provider. Then the encrypted image is transmitted by the wireless service provider through the DJSCC transmission. After the DJSCC wireless transmission, the corrupted encrypted image (the image decoded by the DJSCC

arXiv:2111.03234v1 [cs.CR] 5 Nov 2021


Fig. 1. The DJSCC based wireless communication system.

Fig. 2. Different strategies for protecting image privacy in SSCC and DJSCC. (a) Encrypt the data encoded by the source encoder before the channel encoder in SSCC; (b) encrypt the image source before the source encoder in SSCC; (c) encrypt the image source before the deep joint source-channel encoder in DJSCC.

decoder) is transmitted to the image recipient through the untrusted network. The image recipient decrypts the corrupted encrypted image to recover the plain image. Even if the encrypted image or the decoded image is leaked or stolen during the wired network transmission process, the visual information of the plain image cannot be acquired directly.

In SSCC, there are two strategies to safeguard image privacy: 1) encrypting the data encoded by the source encoder before the channel encoder, as shown in Fig. 2(a), and 2) encrypting the image source before the source encoder, as shown in Fig. 2(b) [23]. The first strategy, regarded as compression-then-encryption (CtE), applies the source encoder to compress the image source into binary data, and then uses some encryption method (e.g., the Data Encryption Standard [24], the Advanced Encryption Standard [25], or Rivest–Shamir–Adleman [26]) to encrypt the binary data, generating the ciphertext. The second strategy, called encryption-then-compression (EtC), fits the typical scenario where the image provider only cares about protecting image privacy and the telecommunications provider has an overriding interest in improving spectrum efficiency. Following the EtC strategy, various perceptual image encryption methods have been developed by transforming an original image into a visually protected image [23], [27]–[48].

To protect image privacy in DJSCC transmission, the encryption module should be placed in front of the deep joint source-channel encoder, as shown in Fig. 2(c), which is akin to the position of the encryption module in Fig. 2(b). However, a major issue is that EtC changes the visual structure of the plain image, causing DJSCC transmission degradation. To the best of our knowledge, there are no works addressing the encryption problem in the DJSCC framework. In [48]–[52], different encryption methods are developed for DNN-based classification tasks. In addition, Homomorphic Encryption (HE) [53] is a promising privacy-preserving computation method, and some HE-based methods have been applied in the DL domain [54]–[57]. However, the high computational complexity, the loss of calculation accuracy due to the use of polynomials instead of nonlinear activation functions, and the large ciphertext expansion when combining HE with DNNs are obstacles to the use of HE in DJSCC.

In this paper, we design a DL based joint encryption and source-channel coding method that can generate a visually protected image suitable for DJSCC transmission with high reconstruction performance. The inspiration for our proposed method originates from image transformation, illustrated in Fig. 3. Image transformation has been successfully applied in image compression (e.g., JPEG [58], JPEG2000 [59]) and image processing (e.g., image classification and image semantic segmentation [60]) to facilitate subsequent operations. Although the transformed image contains the information of the plain image, little visual information can be perceived after the transformation. Compared with existing perceptual image encryption methods, the proposed method not only protects image privacy, but also has better reconstruction performance. Another advantage is that the proposed method is more robust against ciphertext-only attacks than existing perceptual image encryption methods. Moreover, the fully convolutional architecture of the proposed method makes it more flexible in dealing with images of different sizes without loss of visual protection performance.

Fig. 3. Illustration of image transformation. (a) The plain image; (b) the transformed image using the discrete cosine transform (DCT).

The rest of this paper is organized as follows. Section II presents related work on deep joint source-channel coding, perceptual image encryption, and image transformation. The proposed method is presented in Section III. In Section IV, the proposed method is evaluated on low-resolution and high-resolution datasets, respectively. The visual security of the proposed method is evaluated by ciphertext-only attacks in Section V. Finally, Section VI concludes this paper.

II. RELATED WORK

A. Deep Joint Source Channel Coding

As shown in the lower part of Fig. 1, DJSCC follows the end-to-end autoencoder architecture [61], which replaces the source encoder and the channel encoder (with/without modulation) with a deep joint source-channel encoder (DJSCE) at the transmitter, and similarly replaces the corresponding source decoder and channel decoder (with/without demodulation) with a deep joint source-channel decoder (DJSCD).

The initial DJSCC work proposed a recurrent neural network (RNN) based DJSCE and DJSCD for text transmission over binary erasure channels [62]. Since then, DJSCC has attracted increasing interest, especially for image compression and transmission. In [19], fully convolutional neural networks (FCNNs) are used for the DJSCE and DJSCD, and are shown to outperform SSCC methods, especially in the low SNR regime. Furthermore, the DJSCC method in [19] provides graceful performance degradation in communications scenarios suffering from large channel estimation errors, in contrast with the well-known “cliff effect” in SSCC.

In wireless communication systems, the transmission bandwidth is always dynamically allocated according to user requirements and system load. DJSCC-l, proposed by [20], can transmit progressively in layers: when the available bandwidth is limited, codewords of the first few layers are transmitted to the receiver to reconstruct the transmitted image with lower quality; when the available bandwidth increases, codewords of the residual layers are transmitted to the receiver and combined with the first codewords to reconstruct the transmitted image with higher quality.

Once channel output feedback is available, DJSCC-f, proposed by [21], can further improve the reconstruction quality by exploiting the channel output feedback. The difference between DJSCC-l and DJSCC-f is that DJSCC-f not only aggregates the layered transmission information at the receiver, but also jointly processes the transmitted image and the processed channel output feedback at the transmitter. Compared with DJSCC without channel output feedback, DJSCC with either noisy or noiseless channel output feedback can achieve considerable performance gains. However, [19]–[21] must train multiple networks over a range of SNRs and select one of them depending on the real SNR condition to maintain optimality, which leads to a heavy storage burden on devices.

Inspired by the resource assignment strategy in traditional JSCC, [22] introduces SNR feedback to DJSCC and proposes a general DJSCC method named Attention DL based JSCC (ADJSCC). ADJSCC can dramatically reduce the storage overhead while maintaining similar performance by using attention mechanisms. In addition, [63] proposed a DJSCC scheme based on maximizing the mutual information between the source and the noisy received codeword for the binary erasure channel and the binary symmetric channel. [64] and [65] model their DJSCC systems via a variational autoencoder and a manifold variational autoencoder for a Gaussian source, respectively. However, none of the aforementioned works is designed to guarantee information privacy protection.

B. Perceptual Image Encryption

Perceptual image encryption aims to transform a plain image into an encrypted image, which downgrades visual quality to protect the original perceptual information.

One can change image visual information along two dimensions: the space dimension and the pixel dimension. Space based encryption involves re-arranging (scrambling) the pixels within the image. There are numerous image scrambling algorithms, e.g., Arnold [27], Baker Transformation [28], Fibonacci Transformation [29], Magic Square [30], RGB Scramble [31], Chaos Scramble [32], and SCAN Pattern [33]. Pixel based encryption only changes pixel values to protect visual information. Popular methods include DES [34], the Hill algorithm [35], the chaotic logistic map [36], and cellular automata [37] based methods. Combining space based encryption and pixel based encryption can make encryption more robust. Well-known techniques include the Arnold and Chen's chaotic system based method [38], the compound chaotic sequence based method [39], and the 3D chaotic map based method [40]. However, the aforementioned encryption methods do not take into account the effect of subsequent operations such as source or channel coding.


Considering the compression task in EtC systems, stream cipher methods using a pseudo-random key generator followed by Slepian-Wolf coding and resolution progressive compression have been used for lossless compression in [42], [43]. To further improve the compression ratio, several schemes have been proposed for lossy compression: pixel based image encryption followed by a lossy scalable compression technique [44], image encryption via prediction error clustering and random permutation followed by context-adaptive arithmetic coding [45], pseudorandom permutation based image encryption followed by orthogonal transformation based lossy compression [46], and block scrambling based image encryption followed by JPEG lossy compression [23]. However, when the final signal processing task is changed from image reconstruction to, e.g., image classification, the methods designed for EtC systems stop working.

DL has led to state-of-the-art performance in various image processing tasks, motivating more recently the application of deep learning techniques to encrypted images. The methods proposed by [48]–[52] are designed for DL based classification. Tanaka's method [48] is a hybrid encryption method, adding an adaptation network prior to the DNN. The pixel based image encryption methods with/without key management [49], [50] can be fed directly into DNNs for classification. DL based encryption methods, e.g., generative adversarial networks (GANs) and DNNs, are proposed for image encryption in [51] and [52], respectively. Both [51] and [52] employ image transformation (transforming the image from one domain to another) to protect visual information. However, the image encryption methods designed for DL based classification are still not fit for DJSCC: the encrypted images for classification only retain some specific semantic information relevant to the image class while the pixel based information of the image is discarded, which causes performance degradation for the image reconstruction task, i.e., DJSCC.

III. DEEP JOINT ENCRYPTION AND SOURCE-CHANNEL CODING

The motivation of our method is to protect the visual information of the plain image without significantly sacrificing image reconstruction performance in DJSCC. Conventional image transformations, e.g., the discrete cosine transform (DCT) and the discrete wavelet transform (DWT), can facilitate subsequent operations and simultaneously generate a visually protected image. However, [60], [66], [67] demonstrate that conventional image transformations are hardly compatible with DNNs if the structures of the DNNs are not modified. In this section, a DL based joint encryption and source-channel coding (DJESCC) method is proposed for protecting the visual information of plain images in DJSCC transmission.

A. System Model

Consider a visually protected DJSCC transmission system as shown in the lower part of Fig. 4. A plain image is represented by x ∈ Rn, where R denotes the set of real numbers and n = h × w × c; h, w, and c denote the height, width, and number of channels of the image, respectively. The encryption network transforms the plain image into a visually protected image. This encryption process can be expressed as:

y = eµ(x) ∈ Rn, (1)

where eµ(·) represents an encryption deep neural network parameterized by the set of parameters µ. Note that the plain image x and the visually protected image y have the same size.

Then the visually protected image y is encoded by the joint source-channel encoder as:

z = fθ(y) ∈ Ck, (2)

where C denotes the set of complex numbers, k represents the size of the channel input symbols, and fθ(·) represents a joint source-channel encoder parameterized by the set of parameters θ. The encoded complex-valued vector z represents the transmitted signals at the transmitter. The real parts and the imaginary parts of z are considered as the in-phase components I and quadrature components Q of the transmitted signals, respectively. Due to the average power constraint at the transmitter, (1/k)E(zz∗) ≤ 1 must be satisfied, where z∗ is the complex conjugate transpose of z.
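As an illustration, the average power constraint can be met by explicitly scaling the encoder output. The snippet below is a sketch under the assumption of a flattened complex codeword; the function name `power_normalize` is ours, not taken from the authors' implementation.

```python
# Sketch: enforce the average power constraint (1/k) E[z z*] <= 1 by
# normalizing the encoder output, a common choice in DJSCC implementations.
import numpy as np

def power_normalize(z: np.ndarray) -> np.ndarray:
    """Scale a complex codeword so its average symbol power is exactly 1."""
    k = z.size
    power = np.sum(np.abs(z) ** 2) / k   # (1/k) * z z^*
    return z / np.sqrt(power)

rng = np.random.default_rng(0)
z = rng.normal(size=64) + 1j * rng.normal(size=64)  # raw encoder output
z_norm = power_normalize(z)
avg_power = np.mean(np.abs(z_norm) ** 2)            # 1.0 after scaling
```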

The transmitted signals are corrupted by the wireless channel. We adopt the well-known AWGN model¹ given by:

ẑ = η(z) = z + ω, (3)

where ẑ ∈ Ck is the channel output and ω ∈ Ck denotes the additive noise, modeled as ω ∼ CN(0, σ²I), where σ² represents the average noise power and CN(·, ·) denotes a circularly symmetric complex Gaussian distribution.
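The channel model of Eq. (3) can be simulated in a few lines. This is a sketch, with the noise power σ² derived from an SNR in dB under the assumption of unit average signal power; the function name `awgn_channel` is ours, not from the paper.

```python
# Sketch of the AWGN channel: add circularly symmetric complex Gaussian
# noise whose power follows from a target SNR, assuming unit signal power.
import numpy as np

def awgn_channel(z: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    sigma2 = 10 ** (-snr_db / 10)  # noise power for unit signal power
    # each real/imaginary part has variance sigma^2 / 2
    noise = np.sqrt(sigma2 / 2) * (rng.normal(size=z.shape)
                                   + 1j * rng.normal(size=z.shape))
    return z + noise
```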

In turn, the channel output ẑ is decoded by the joint source-channel decoder as:

ŷ = gφ(ẑ) ∈ Rn, (4)

where gφ(·) represents a joint source-channel decoder parameterized by the set of parameters φ. The decoded image ŷ ∈ Rn has the same size as the visually protected image y.

Similarly to the encryption step, a decryption network is employed to convert the decoded image into the decrypted image, as follows:

x̂ = dν(ŷ) ∈ Rn, (5)

where dν(·) represents a decryption deep neural network parameterized by the set of parameters ν, and the decrypted image x̂ ∈ Rn is an estimate of the plain image. The bandwidth ratio R is defined as k/n, where n is the source size (i.e., image size) and k is the channel bandwidth (i.e., channel input size).

¹By applying equalization at the receiver, the flat fading channel model can be represented as an AWGN model, while the noise has a different distribution.
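Putting Eqs. (1)-(5) together, the test-stage data flow can be sketched at the level of shapes alone. All five mappings below are toy stand-ins (hypothetical, chosen only to keep dimensions consistent), not the trained networks of the paper; the channel is taken as noiseless for simplicity.

```python
# Shape-level sketch of the test-stage pipeline: encrypt -> encode ->
# channel -> decode -> decrypt, with dimension-preserving stub mappings.
import numpy as np

h, w, c = 32, 32, 3
n = h * w * c                # source size
k = n // 6                   # channel input size, so bandwidth ratio R = 1/6

encrypt = lambda x: x[::-1].copy()           # e_mu : R^n -> R^n (stub)
encode  = lambda y: y[:k] + 1j * y[k:2 * k]  # f_theta : R^n -> C^k (stub)
channel = lambda z: z                        # eta, noiseless here
decode  = lambda z: np.concatenate([z.real, z.imag, np.zeros(n - 2 * k)])
decrypt = lambda y: y[::-1].copy()           # d_nu : R^n -> R^n (stub)

x = np.random.default_rng(0).random(n)       # plain image, flattened
x_hat = decrypt(decode(channel(encode(encrypt(x)))))
R = k / n                                    # bandwidth ratio
```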


Fig. 4. The system model of the proposed DL based joint encryption and source-channel coding method.

B. The Proposed Method

In sharp contrast with the DJSCC methods [19]–[22], we require our DJESCC method to address two issues:

1) hiding the visual information of a plain image;
2) extracting effective features from the visually protected image for subsequent DJSCC transmission.

The classical full-reference metric of image similarity is the peak signal-to-noise ratio (PSNR) between the original image and the restored image, which is defined as:

PSNR = 10 log10(MAX² / MSE) (dB), (6)

where MAX is the maximum possible value of the image pixels and MSE is the mean square error between the original image and the restored image. Although PSNR predictions are not always consistent with quality as perceived by the human visual system, its simplicity and low computational cost make it widely used in the field of image processing [68].
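For concreteness, Eq. (6) for 8-bit images (MAX = 255) can be computed as follows; this is a generic PSNR routine, not code from the paper.

```python
# PSNR per Eq. (6): 10 * log10(MAX^2 / MSE), assuming 8-bit images.
import numpy as np

def psnr(original: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    diff = original.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.full((8, 8), 100.0)
b = np.full((8, 8), 110.0)   # constant offset of 10 -> MSE = 100
# psnr(a, b) = 10 * log10(255^2 / 100) ≈ 28.13 dB
```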

However, as illustrated in Fig. 5, PSNR is not a good metric for assessing the effect of hiding visual information. Fig. 5(b) has a lower PSNR than Fig. 5(c), yet the visual information (e.g., the birds and the leaf) is more easily identified in Fig. 5(b) than in Fig. 5(c). To solve this problem, a feature extraction network is employed to measure the effect of hiding visual information. Feature extraction was initially used to measure the similarity between two images [69] and then successfully used to measure the difference between two images [52].

In the training stage, features of the plain image x, the encrypted image y, and the decoded image ŷ are extracted by the feature extraction network hψ in Fig. 4. The feature loss Le between the plain image x and the encrypted image y is expressed as:

Le = (1/m)‖hψ(x) − hψ(y)‖₂², (7)

and the feature loss Ld between the plain image x and the decoded image ŷ is expressed as:

Ld = (1/m)‖hψ(x) − hψ(ŷ)‖₂², (8)

Fig. 5. PSNR comparison. (a) Plain image; (b) inverted image (the intensity values of the plain image are subtracted from 255), PSNR: 6.179 dB; (c) shuffled image (the intensity values of the plain image are randomly shuffled in the space and channel dimensions), PSNR: 9.957 dB.

where hψ(x) ∈ Rm, hψ(y) ∈ Rm, and hψ(ŷ) ∈ Rm are the features of the plain image x, the encrypted image y, and the decoded image ŷ extracted by the feature extraction network, and m = hf × wf × cf; hf, wf, and cf denote the height, width, and number of channels of an extracted feature, respectively. The reconstruction loss Lr between the plain image x and the decrypted image x̂ is expressed as:

Lr = d(x, x̂) = (1/n)‖x − x̂‖₂² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̂ᵢ)², (9)

where xᵢ and x̂ᵢ represent the intensity values of the plain image x and the decrypted image x̂, respectively.
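The three losses in Eqs. (7)-(9) can be sketched numerically. The feature extractor below is a fixed random linear projection used purely as a stand-in (an assumption for illustration) for the pretrained feature extraction network; the images are random vectors.

```python
# Sketch of the feature losses L_e, L_d and the reconstruction loss L_r,
# with a random linear projection standing in for h_psi.
import numpy as np

rng = np.random.default_rng(0)
n, m = 3072, 512
W = rng.normal(size=(m, n)) / np.sqrt(n)   # stand-in for h_psi
h_psi = lambda img: W @ img

x     = rng.random(n)   # plain image (flattened)
y     = rng.random(n)   # encrypted image
y_hat = rng.random(n)   # decoded image
x_hat = rng.random(n)   # decrypted image

L_e = np.sum((h_psi(x) - h_psi(y)) ** 2) / m       # feature loss, Eq. (7)
L_d = np.sum((h_psi(x) - h_psi(y_hat)) ** 2) / m   # feature loss, Eq. (8)
L_r = np.sum((x - x_hat) ** 2) / n                 # reconstruction MSE, Eq. (9)
```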

Different from the image-to-image translation tasks [51], [69], which minimize the feature loss in the training stage, the proposed method maximizes Le and Ld to hide the visual information of the encrypted image y and the decoded image ŷ. The total loss used to train the proposed method is:

Ltotal = Lr − λeLe − λdLd, (10)

where λe ∈ R+ and λd ∈ R+ are the weights of Le and Ld, respectively. Under a certain bandwidth ratio R, DJESCC learns the parameters of the encryption network µ, the deep joint source-channel encoder θ, the deep joint source-channel decoder φ, and the decryption network ν by minimizing the end-to-end distortion as follows:

(µ∗, θ∗, φ∗, ν∗) = argmin_{µ,θ,φ,ν} E_{p(σ²)} E_{p(x,x̂)} (Ltotal), (11)

where µ∗, θ∗, φ∗, ν∗ are the optimal parameters, p(x, x̂) represents the joint probability distribution of the plain image x and the decrypted image x̂, and p(σ²) represents the probability distribution of the average noise power σ². Note that a probability distribution over the average noise power, instead of a fixed average noise power, is adopted due to considerations of storage overhead and the difficulty of acquiring the average noise power at the image owner/recipient. In addition, the empirical average instead of the statistical average is adopted in the training stage. During the training stage, the proposed DJESCC method learns: 1) an effective method to hide the visual information of the plain image, 2) an image domain from which features are easily extracted for the subsequent DJSCC transmission, 3) an effective DJSCC transmission method, and 4) an effective method to reconstruct the plain image.
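A single training step under objective (11) might be sketched as follows: the SNR (hence σ²) is drawn per batch rather than fixed, and the total loss of Eq. (10) subtracts the weighted feature losses so that minimizing it pushes Le and Ld up. The weights and loss values here are hypothetical placeholders, not real network outputs.

```python
# Sketch of one training step: sample sigma^2 from p(sigma^2) via a random
# SNR, then combine the losses as in Eq. (10): L_total = L_r - l_e*L_e - l_d*L_d.
import numpy as np

rng = np.random.default_rng(0)
lambda_e, lambda_d = 0.1, 0.1      # hypothetical loss weights

snr_db = rng.uniform(0.0, 20.0)    # per-batch SNR draw, realizing p(sigma^2)
sigma2 = 10 ** (-snr_db / 10)      # noise power used by the channel layer

L_r, L_e, L_d = 0.02, 5.0, 4.0     # placeholder loss values
L_total = L_r - lambda_e * L_e - lambda_d * L_d   # Eq. (10): minimize this
```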

After the DJESCC network has been trained, the encryption network eµ and the decryption network dν are securely distributed to the image owner and the image recipient, respectively, by some security protocol, e.g., the Secure Sockets Layer (SSL) protocol. The deep joint source-channel encoder and the deep joint source-channel decoder are distributed to the DJSCC transmission service provider.

In the test stage, the plain image x is first converted to the encrypted image by the image owner using Eq. (1). Then the encrypted image is sent to the DJSCC transmission service provider. DJSCC transmission is executed using Eq. (2) and Eq. (4), and the decoded image is obtained. The DJSCC transmission service provider sends the decoded image to the image recipient, which uses Eq. (5) to decrypt it. The test-stage process is illustrated in the lower part of Fig. 4.
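The test-stage pipeline can be sketched as below. Here encrypt, djscc_encode, awgn_channel, djscc_decode, and decrypt are hypothetical stand-ins for the trained networks and Eqs. (1), (2), (4), and (5); they are not functions from the released code.

```python
def transmit(plain_image, encrypt, djscc_encode, awgn_channel,
             djscc_decode, decrypt):
    encrypted = encrypt(plain_image)      # image owner, Eq. (1)
    symbols = djscc_encode(encrypted)     # service provider, Eq. (2)
    received = awgn_channel(symbols)      # wireless channel
    decoded = djscc_decode(received)      # service provider, Eq. (4)
    return decrypt(decoded)               # image recipient, Eq. (5)

# With identity stand-ins, the pipeline returns its input unchanged.
identity = lambda x: x
print(transmit([1, 2, 3], identity, identity, identity, identity, identity))  # [1, 2, 3]
```

The structure makes explicit that the service provider only ever sees the encrypted and decoded images, never the plain image.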

IV. EXPERIMENTAL RESULTS

Considering the generality of the DJESCC method, there are multiple architecture choices for the encryption network, the DJSCC encoder, the DJSCC decoder, and the decryption network. To demonstrate the potential of our proposed method, the original DJSCC network architecture in [19] is chosen in the subsequent experiments. It is worth noting that our proposed method can be applied to other extensions of the original DJSCC network architecture.

The DJSCC architecture proposed in [19] is shown in Fig. 6; its encoder and decoder networks are adopted in our DJESCC method. The DJSCC encoder consists of a normalization layer, five alternating convolutional and PReLU layers, a reshape layer, and a power normalization layer. The DJSCC decoder consists of a reshape layer, five alternating transposed convolutional and activation layers (i.e., four PReLU layers and one sigmoid layer), and a denormalization layer. The normalization layer converts the input image with pixel values in [0, 255] to an image with pixel values in [0, 1], and the denormalization layer performs the opposite operation. The notation F × F × K|S for a convolution/transposed convolution layer denotes K filters of size F with down/up-sampling stride S. The power normalization layer enforces the average power constraint at the transmitter. The number of channels of the last convolutional layer in the DJSCC encoder is t, and the bandwidth ratio of the proposed method is R = t/96. The architecture shown in Fig. 7, a shallow version of U-Net [70], is employed for both the encryption network and the decryption network. In the training stage, the parameters of the feature extraction network are fixed.
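The bandwidth ratio R = t/96 follows from the tensor shapes in Fig. 6. A minimal sketch of the arithmetic (bandwidth_ratio is our own helper, not part of the released code):

```python
def bandwidth_ratio(h, w, t, c=3):
    """Bandwidth ratio R = k/n of the DJSCC encoder: two stride-2
    convolutions give an (h/4, w/4, t) tensor, whose entries are paired
    into k = h*w*t/32 complex channel symbols, while the source image
    has n = h*w*c pixel values. For c = 3 this reduces to R = t/96."""
    k = (h // 4) * (w // 4) * t // 2   # complex channel uses
    n = h * w * c                      # source dimension
    return k / n

# t = 16 channels in the last encoder layer gives R = 16/96 = 1/6.
print(bandwidth_ratio(32, 32, 16))  # 0.16666666666666666 (= 1/6)
```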

TensorFlow [71] and its high-level API Keras are used to implement the proposed DJESCC method2. The proposed method is trained with the SNR drawn from a uniform distribution over [0, 20] dB. The following experiments run on a Linux server with twelve octa-core Intel(R) Xeon(R) Silver 4110 CPUs and sixteen GTX 1080Ti GPUs. Each experiment was assigned six CPU cores and one GPU.
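The random-SNR training channel can be sketched in NumPy as below. The noise-power formula assumes unit average signal power, which the power normalization layer guarantees; the function names are ours, not from the released code.

```python
import numpy as np

def awgn(symbols, snr_db, rng):
    """Add real white Gaussian noise at the given SNR, assuming the
    transmitted symbols have unit average power."""
    sigma2 = 10 ** (-snr_db / 10)  # average noise power for unit signal power
    return symbols + rng.normal(scale=np.sqrt(sigma2), size=symbols.shape)

# During training, the SNR is redrawn uniformly from [0, 20] dB.
rng = np.random.default_rng(0)
snr_db = rng.uniform(0.0, 20.0)
noisy = awgn(np.zeros(10_000), snr_db, rng)
# The empirical noise variance matches the target noise power closely.
print(abs(noisy.var() / 10 ** (-snr_db / 10) - 1) < 0.1)  # True
```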

We first consider the performance of the proposed method on the CIFAR-10 dataset, which consists of 60000 32 × 32 × 3 color images in 10 classes, with 6000 images per class. Note that the goal of our proposed method is to generate visually protected images for untrusted transmission channels and to reconstruct the plain image at the receiver, so the class label of each image is not used in the following experiments. The training dataset and the test dataset contain 50000 and 10000 images, respectively. We use part of VGG16 [72] with batch normalization, a classical network for classification tasks, as the feature extraction network, pretrained on CIFAR-10. All of the networks were trained for 500 epochs using the Adam optimizer with an initial learning rate of 10−3. Once learning stagnated for 10 epochs, the learning rate was reduced by a factor of 10. The performance of the DJESCC networks was evaluated at specific SNRtest ∈ [0, 20] dB on the CIFAR-10 test dataset. To alleviate the randomness caused by the wireless channel, each image in the CIFAR-10 test dataset is transmitted 10 times. PSNR is used to evaluate the reconstruction performance between the plain image and the decrypted image.
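PSNR, the reconstruction metric used throughout, can be computed as below; this is the standard definition for 8-bit images, not code from the paper.

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio between the plain image x and the
    decrypted image y, in dB, for pixel values in [0, max_val]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

# A uniform pixel offset of 16 gives 20*log10(255/16) ≈ 24.05 dB.
x = np.zeros((32, 32, 3))
print(round(psnr(x, x + 16.0), 2))  # 24.05
```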

Fig. 8 compares the reconstruction performance of the proposed method with different loss weights (e.g., λe = λd = 0.005, 0.05, 0.5) at bandwidth ratio R = 1/6. The reconstruction performance of DJSCC without visual protection, i.e., λe = λd = 0, is also plotted. The reconstruction performance gradually decreases as λe and λd increase, since training then favors the protection of visual information over the reconstruction of the images. Fig. 9 shows the plain image, the encrypted images transformed by the image owner, the decoded images produced by the DJSCC transmission service provider, and the decrypted images transformed by the image recipient for SNR = 0, 10, 20 dB with different loss weights λe and λd. The plain image comes from the CIFAR-10 test dataset. Fig. 9(a) shows that even though the feature losses between the plain image and the encrypted/decoded image are not used, the visual information

2Source code for the proposed method is available at: https://github.com/alexxu1988/DJESCC.


Fig. 6. The architecture of the DJSCC network [19] adopted in this paper.

Fig. 7. The architecture of the encryption and decryption networks adopted in this paper.

Fig. 8. Performance of DJESCC and DJSCC trained on the CIFAR-10 training dataset and evaluated on the CIFAR-10 test dataset with R = 1/6 over the AWGN channel.

of the plain image is partially protected in the encrypted image and the decoded image. The outline of the dog can be vaguely identified in the encrypted image in Fig. 9(a), and less visual information is captured in the decoded images than in the encrypted image. As the loss weights λe and λd increase, the visual protection in the encrypted image is enhanced, and the corresponding visual pattern in the decoded image also changes. In Fig. 9(b), with loss weights λe = λd = 0.005, the outline of the dog can still be vaguely identified in the encrypted image. In Fig. 9(c), with loss weights λe = λd = 0.05, the encrypted image degenerates into a regular black-and-white lattice carrying no visual information of the plain image. As the loss weights increase to λe = λd = 0.5 in Fig. 9(d), the left part of the encrypted image contains alternating red and blue stripes and the right part contains a twisted black-and-white lattice, which is more irregular than the encrypted image in Fig. 9(c). In Figs. 9(b), (c), and (d), all of the decoded images at the different SNRs protect the visual information of the plain image. The proposed method with λe = λd = 0.05 shows a better trade-off between reconstruction performance and visual protection than the others.

Since our proposed method is the first work that constructs visually protected images for DJSCC transmission, we compare the proposed method with two perceptual image encryption methods, i.e., the learnable image encryption (LE) method [48] and the pixel-based image encryption (PE) method [50], which are designed for image classification tasks. The classification accuracy of ResNet-20 [73] based on the LE method and the PE method is 87.02% and 86.99%, respectively, on the CIFAR-10 test dataset [52].

Fig. 10 compares the DJESCC method with the LE-based DJSCC (DJSCC_LE) method and the PE-based DJSCC (DJSCC_PE) method for bandwidth ratios R = 1/6 and R = 1/12. The performance of DJSCC_PE with R = 1/6 increases only slightly, remaining around 11.5 dB, as the SNR increases from 0 dB to 20 dB, which is much lower than that of DJESCC with λe = λd = 0.05 and R = 1/6. Although the performance of DJSCC_PE with R = 1/6 is better than that of DJSCC_LE with R = 1/6, it is still 5 dB lower than that of DJESCC at SNRtest = 0 dB, and the performance gap between DJSCC_PE and DJESCC with λe = λd = 0.05 widens further as the SNR increases. Comparing the methods with R = 1/12 shows similar results. As R increases from 1/12 to 1/6, DJESCC achieves larger gains than DJSCC_PE and DJSCC_LE. Fig. 11 shows the


Fig. 9. Visually protected images generated by the DJESCC method. The image in the first column is the plain image. The images in the second column are the encrypted images transformed by the image owner. The images in the third column are the decoded images produced by the DJSCC decoder for SNR = 0, 10, 20 dB. The images in the last column are the decrypted images transformed by the image recipient for SNR = 0, 10, 20 dB. PSNR and structural similarity index (SSIM) are given under the images. (a) λe = λd = 0, (b) λe = λd = 0.005, (c) λe = λd = 0.05, (d) λe = λd = 0.5.

corresponding visual results for DJESCC, DJSCC_LE, and DJSCC_PE with SNR = 0, 10, 20 dB and R = 1/6. Different from the black-and-white lattice characteristic of the encrypted image produced by the DJESCC method, the encrypted images of the DJSCC_LE method and the DJSCC_PE method appear as noisy images. However, the pixels in the parts of the encrypted images corresponding to the dog region of the plain image show less randomness than the pixels in other parts. The decoded images of the DJSCC_LE method and the DJSCC_PE method are similar to their corresponding encrypted images. The decrypted images of the DJSCC_LE method at SNR = 0, 10, 20 dB are disturbed by noisy pixels, while the decrypted images of the DJSCC_PE method at SNR = 0, 10, 20 dB look blurred.

V. VISUAL SECURITY OF THE PROPOSED METHOD

As described in Section I, network transmission is not secure. The visually protected images—the encrypted image generated by the image owner and then sent to the DJSCC provider through the Internet, and the decoded image processed by the DJSCC provider and then sent to the image recipient—may be leaked or stolen in the wired network. Hence the ciphertext-only attack (CoA), where the attacker has access to a decoded image and wishes to recover the plain image, must be considered to evaluate the visual security of the proposed method. In this paper, we focus on two CoAs: the feature reconstruction (FR) attack and the generative adversarial network based (GAN-based) attack.

A. FR Attack

Based on the fact that many visual protection methods process the plain image patch-wise (e.g., the LE method operates


Fig. 10. Performance of DJESCC, DJSCC_LE, and DJSCC_PE evaluated on the CIFAR-10 test dataset over the AWGN channel. DJESCC is trained on the CIFAR-10 training dataset.

on the pixels in a 4 × 4 block) or pixel-wise (e.g., the PE method only processes pixel values without changing their positions), the FR attack proposed in [74] is employed to reconstruct the edge information of plain images from the encrypted images. The FR attack is described in Algorithm 1, where ⌊·⌋ denotes the floor function.

Algorithm 1 The FR-attack method [74]
Input: the 8-bit visually protected image y ∈ R^(h×w×3), leading bit b ∈ {0, 1}
Output: the 8-bit restored image x ∈ R^(h×w×3)
1: divide the 8-bit RGB image into m = h × w pixels, where y_i ∈ R^(1×1×3) denotes the i-th pixel
2: for i = 1 : 1 : m do
3:   for j = 0 : 1 : 2 do
4:     if ⌊y_i[j]/2^7⌋ ≠ b then
5:       x_i[j] = y_i[j] ⊕ (2^8 − 1)
6:     end if
7:   end for
8: end for
9: assemble x_i, i ∈ [1, m] to form x ∈ R^(h×w×3)
10: return x
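Algorithm 1 can be sketched in NumPy as below; the leading-bit test on each 8-bit value is implemented with a right shift and the bit flip with XOR against 2^8 − 1. This is our reading of [74], not the authors' released code.

```python
import numpy as np

def fr_attack(y, b=0):
    """FR attack: every 8-bit channel value whose leading (most
    significant) bit differs from b has all of its bits flipped by
    XOR with 2^8 - 1; other values are kept, so edge structure that
    survives the encryption is exposed."""
    y = np.asarray(y, dtype=np.uint8)
    leading = y >> 7                    # most significant bit of each value
    return np.where(leading != b, y ^ 0xFF, y)

# 200 has leading bit 1, so it is flipped to 255 - 200 = 55; 100 has
# leading bit 0 and is kept unchanged (with b = 0).
print(fr_attack(np.array([200, 100], dtype=np.uint8), b=0))  # [ 55 100]
```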

B. GAN-based Attack

The GAN-based attack adopted in [75] has shown state-of-the-art attack performance for CoA. As illustrated in Fig. 12, the generator intends to generate images from encrypted images to fool the discriminator, while the purpose of the discriminator is to distinguish whether the images fed into it originate from the plain image dataset or from the decrypted images. In the training stage, the training dataset containing plain images is divided into X1 and X2 of the same size. X2 is encrypted by some encryption method. Both X1 and E(X2) are employed to alternately train the generator and the discriminator. Note that paired images are not required in the training stage. With the progress of training,

the generator gradually generates images that look more and more like plain images from the encrypted images. In other words, the generator decrypts the encrypted images and recovers the visual information.

The generators proposed in [75] are used as the generators of the GAN-based attack for the DJSCC_LE method and the DJSCC_PE method, except that the sigmoid activation is used instead of the tanh activation in the last layer. The transformation architecture shown in Fig. 7 is employed as the generator for the DJESCC method. The discriminator contains three convolutional layers with kernel size 2, stride 2, 16, 16, and 32 channels, and leaky ReLU activations, followed by a dense layer with sigmoid activation. The GAN-based attack networks were trained for 600 epochs using the Adam optimizer with a learning rate of 0.0001 and a batch size of 64.
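The alternating, unpaired training described above can be sketched as a skeleton; update_discriminator and update_generator are hypothetical stand-ins for one Adam step (learning rate 10^-4, batch size 64) on the usual adversarial losses.

```python
def train_gan_attack(plain_batches, encrypted_batches,
                     update_discriminator, update_generator, epochs=1):
    """Alternate discriminator and generator updates over unpaired
    batches of plain images X1 and encrypted images E(X2)."""
    history = []
    for _ in range(epochs):
        for x1, e_x2 in zip(plain_batches, encrypted_batches):
            d_loss = update_discriminator(x1, e_x2)  # real X1 vs fake G(E(X2))
            g_loss = update_generator(e_x2)          # push G(E(X2)) toward "real"
            history.append((d_loss, g_loss))
    return history

# With dummy update steps, two batches over one epoch give two loss pairs.
hist = train_gan_attack([1, 2], [3, 4], lambda a, b: 0.5, lambda e: 0.7)
print(len(hist))  # 2
```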

C. Visual Performance of CoAs

We employ the STL-10 dataset in the experiments for a better visual comparison. The fully convolutional architecture of DJSCC can handle images of different sizes in the test phase [19], [22]. The DJESCC model is trained on CIFAR-10, which is unknown to the attacker. The LE method and the PE method are size sensitive, so the DJSCC_LE method and the DJSCC_PE method were trained on the STL-10 dataset with the SNR drawn from a uniform distribution over [0, 20] dB. The FR attack can be directly applied to the cipher images (the encrypted image and the decoded image generated by DJESCC, DJSCC_LE, or DJSCC_PE), while the GAN-based attack must first be trained before being applied to the cipher images. The GAN-based attack was trained and tested on the STL-10 dataset for DJESCC, DJSCC_LE, and DJSCC_PE, respectively.

The visual information reconstructed by the FR attack and the GAN-based attack at SNRtest = 10 dB is shown in Fig. 13. When attacking encrypted images with the FR attack, the edge information against the PE method can be successfully reconstructed and the edge information against the LE method can be vaguely reconstructed. However, the black image generated by the FR attack against the DJESCC method shows that the FR attack cannot reconstruct any edge information from the encrypted image generated by the DJESCC method. Similar results are found when attacking decoded images with the FR attack, except that the edge information cannot be reconstructed against the PE method because of the low DJSCC transmission quality in the DJSCC_PE method. The GAN-based attack can successfully recover visual information from the encrypted image protected by the DJSCC_LE method, the decoded image protected by the DJSCC_LE method, and the encrypted image protected by the DJSCC_PE method, while little visual information is recovered from the encrypted image and the decoded image protected by the DJESCC method. Compared with the DJSCC_LE method and the DJSCC_PE method, the visually protected images (i.e., the encrypted image and the decoded image) generated by the DJESCC method are more robust against the FR attack and the GAN-based attack.


Fig. 11. Comparison of visually protected images generated by the DJESCC method, the DJSCC_LE method, and the DJSCC_PE method with R = 1/6. The image in the first column is the plain image. The images in the second column are the encrypted images transformed by the image owner. The images in the third column are the decoded images produced by the DJSCC decoder for SNR = 0, 10, 20 dB with the different methods. The images in the last column are the decrypted images transformed by the image recipient for SNR = 0, 10, 20 dB with the different methods. PSNR and SSIM are given under the images. (a) DJESCC with λe = λd = 0.05, (b) DJSCC_LE, (c) DJSCC_PE.

[Fig. 12 shows the generator mapping the encrypted images E(X2) to G(E(X2)), and the discriminator classifying the unpaired plain images X1 as real and G(E(X2)) as fake.]

Fig. 12. GAN-based attack.

VI. CONCLUSION

Inspired by image transformation, we have proposed a novel DJESCC method. Through end-to-end training, the proposed DJESCC method learns two DNNs to transform images: one transforms from the plain image domain to the encrypted image domain for encryption, and the other transforms from the DJSCC decoded image domain back to the plain image domain for decryption. In addition, the proposed DJESCC method simultaneously learns an effective DJSCC transmission method in the encrypted domain during the training stage.

Compared with perceptual image encryption methods (e.g., the LE method and the PE method) for DJSCC, the proposed DJESCC method has shown much better reconstruction performance. In addition, the experimental results demonstrate that the proposed DJESCC method is robust to the two types of CoA, i.e., the FR attack and the GAN-based attack. It is worth noting that the proposed DJESCC method is a general method for protecting the visual information of plain images transmitted via DJSCC. With appropriate modifications, the proposed mechanism can be applied to various DJSCC architectures, e.g., DJSCC-l [20], DJSCC-f [21], and ADJSCC [22].

REFERENCES

[1] T. M. Cover, Elements of Information Theory. John Wiley & Sons, 1999.

[2] M. Skoglund, N. Phamdo, and F. Alajaji, “Hybrid digital-analog source-channel coding for bandwidth compression/expansion,” IEEE Transactions on Information Theory, vol. 52, no. 8, pp. 3757–3763, 2006.

[3] R. G. Gallager, Information Theory and Reliable Communication. Springer, 1968, vol. 2.

[4] I. Csiszar, “Joint source-channel error exponent,” Problems of Control & Information Theory, vol. 9, pp. 315–328, 1980.

[5] Y. Zhong, F. Alajaji, and L. L. Campbell, “Joint source-channel coding error exponent for discrete communication systems with Markovian memory,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4457–4472, 2007.

[6] ——, “Error exponents for asymmetric two-user discrete memoryless source-channel systems,” in 2007 IEEE International Symposium on Information Theory. IEEE, 2007, pp. 1736–1740.


Fig. 13. Visual information reconstructed from the encrypted image and the decoded image by the FR attack and the GAN-based attack at SNRtest = 10 dB. The image in the first column is the plain image. The images in the second column are the encrypted image, the FR attack on the encrypted image, and the GAN-based attack on the encrypted image. The images in the third column are the decoded image, the FR attack on the decoded image, and the GAN-based attack on the decoded image. PSNR and structural similarity index (SSIM) are given under the images. (a) DJESCC with λe = λd = 0.05, (b) DJSCC_LE, (c) DJSCC_PE.

[7] B. Belzer, J. D. Villasenor, and B. Girod, “Joint source channel coding of images with trellis coded quantization and convolutional codes,” in Proceedings, International Conference on Image Processing, vol. 2. IEEE, 1995, pp. 85–88.

[8] S. Heinen and P. Vary, “Source-optimized channel coding for digital transmission channels,” IEEE Transactions on Communications, vol. 53, no. 4, pp. 592–600, 2005.

[9] J. Cai and C. W. Chen, “Robust joint source-channel coding for image transmission over wireless channels,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 6, pp. 962–966, 2000.

[10] T. Guionnet and C. Guillemot, “Joint source-channel decoding of quasiarithmetic codes,” in Data Compression Conference, 2004. Proceedings. DCC 2004. IEEE, 2004, pp. 272–281.

[11] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” arXiv preprint arXiv:1511.06085, 2015.

[12] J. Balle, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” arXiv preprint arXiv:1611.01704, 2016.

[13] D. Minnen, J. Balle, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, pp. 10771–10780.

[14] D. Minnen and S. Singh, “Channel-wise autoregressive entropy models for learned image compression,” arXiv preprint arXiv:2007.08739, 2020.

[15] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.

[16] J. Xu, W. Chen, B. Ai, R. He, Y. Li, J. Wang, T. Juhana, and A. Kurniawan, “Performance evaluation of autoencoder for coding and modulation in wireless communications,” in 2019 11th International Conference on Wireless Communications and Signal Processing (WCSP). IEEE, 2019, pp. 1–6.

[17] Y. Jiang, H. Kim, H. Asnani, S. Kannan, S. Oh, and P. Viswanath, “Learn codes: Inventing low-latency codes via recurrent neural networks,” IEEE Journal on Selected Areas in Information Theory, 2020.

[18] ——, “Turbo autoencoder: Deep learning based channel codes for point-to-point communication channels,” in Advances in Neural Information Processing Systems, 2019, pp. 2758–2768.

[19] E. Bourtsoulatze, D. B. Kurka, and D. Gunduz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019.

[20] D. B. Kurka and D. Gunduz, “Bandwidth-agile image transmission with deep joint source-channel coding,” IEEE Transactions on Wireless Communications, 2021.

[21] ——, “Deepjscc-f: Deep joint source-channel coding of images with feedback,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 178–193, 2020.

[22] J. Xu, B. Ai, W. Chen, A. Yang, P. Sun, and M. Rodrigues, “Wireless image transmission using deep source channel coding with attention modules,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.

[23] T. Chuman, W. Sirichotedumrong, and H. Kiya, “Encryption-then-compression systems using grayscale-based image encryption for jpeg images,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 6, pp. 1515–1525, 2019.

[24] W. Diffie and M. E. Hellman, “Exhaustive cryptanalysis of the NBS data encryption standard,” Computer, vol. 10, no. 6, pp. 74–84, 1977.

[25] J. Daemen and V. Rijmen, “Rijndael: The advanced encryption standard,” Dr. Dobb’s Journal: Software Tools for the Professional Programmer, vol. 26, no. 3, pp. 137–139, 2001.

[26] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.

[27] Z. Hui, “Arnold transformation of n dimensions and its periodicity,”Journal of North China University of Technology, vol. 14, no. 1, pp.21–25, 2002.

[28] K. Yano and K. Tanaka, “Image encryption scheme based on a trun-cated baker transformation,” IEICE Transactions on Fundamentals ofElectronics, Communications and Computer Sciences, vol. 85, no. 9,pp. 2025–2035, 2002.

[29] J. Zou, R. K. Ward, and D. Qi, “A new digital image scrambling methodbased on fibonacci numbers,” in 2004 IEEE International Symposium onCircuits and Systems (IEEE Cat. No. 04CH37512), vol. 3. IEEE, 2004,pp. III–965.

[30] W. Zhong, Y. H. Deng, and K. T. Fang, “Image encryption by usingmagic squares,” in 2016 9th International Congress on Image andSignal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE, 2016, pp. 771–775.

[31] R. Mathews, A. Goel, P. Saxena, and V. P. Mishra, “Image encryptionbased on explosive inter-pixel displacement of the rgb attributes ofa pixel,” in Proceedings of the World Congress on Engineering andComputer Science, vol. 1. Citeseer, 2011, pp. 19–22.

[32] M. Prasad, K. Sudha et al., “Chaos image encryption using pixelshuffling,” CCSEA, vol. 1, pp. 169–179, 2011.

[33] S. S. Maniccam and N. G. Bourbakis, “Image and video encryptionusing scan patterns,” Pattern Recognition, vol. 37, no. 4, pp. 725–737,2004.

[34] S. F. El-Zoghdy, Y. A. Nada, and A. Abdo, “How good is the desalgorithm in image ciphering?” International Journal of AdvancedNetworking and Applications, vol. 2, no. 5, pp. 796–803, 2011.

[35] B. Acharya, S. K. Panigrahy, S. K. Patra, and G. Panda, “Imageencryption using advanced hill cipher algorithm,” International Journalof Recent Trends in Engineering, vol. 1, no. 1, pp. 663–667, 2009.

[36] N. K. Pareek, V. Patidar, and K. K. Sud, “Image encryption using chaoticlogistic map,” Image and Vision Computing, vol. 24, no. 9, pp. 926–934,2006.

[37] J. Jun, “Image encryption method based on elementary cellular au-tomata,” in IEEE Southeastcon 2009. IEEE, 2009, pp. 345–349.

[38] Z. H. Guan, F. Huang, and W. Guan, “Chaos-based image encryptionalgorithm,” Physics Letters A, vol. 346, no. 1-3, pp. 153–157, 2005.

[39] X. Tong and M. Cui, “Image encryption with compound chaoticsequence cipher shifting dynamically,” Image and Vision Computing,vol. 26, no. 6, pp. 843–850, 2008.

[40] A. Kanso and M. Ghebleh, “A novel image encryption algorithm basedon a 3d chaotic map,” Communications in Nonlinear Science andNumerical Simulation, vol. 17, no. 7, pp. 2943–2959, 2012.

[41] B. Ferreira, J. Rodrigues, J. Leitao, and H. Domingos, “Privacy-preserving content-based image retrieval in the cloud,” in 2015 IEEE34th symposium on reliable distributed systems (SRDS). IEEE, 2015,pp. 11–20.

[42] M. Johnson, P. Ishwar, V. Prabhakaran, D. Schonberg, and K. Ramchan-dran, “On compressing encrypted data,” IEEE Transactions on SignalProcessing, vol. 52, no. 10, pp. 2992–3006, 2004.

[43] W. Liu, W. Zeng, L. Dong, and Q. Yao, “Efficient compression ofencrypted grayscale images,” IEEE Transactions on Image Processing,vol. 19, no. 4, pp. 1097–1102, 2010.

[44] X. Kang, A. Peng, X. Xu, and X. Cao, “Performing scalable lossycompression on pixel encrypted images,” EURASIP Journal on Imageand Video Processing, vol. 2013, no. 1, pp. 1–6, 2013.

[45] J. Zhou, X. Liu, O. C. Au, and Y. Y. Tang, “Designing an efficient imageencryption-then-compression system via prediction error clustering andrandom permutation,” IEEE Transactions on Information Forensics andSecurity, vol. 9, no. 1, pp. 39–50, 2014.

[46] X. Zhang, “Lossy compression and iterative reconstruction for encryptedimage,” IEEE Transactions on Information Forensics and Security,vol. 6, no. 1, pp. 53–58, 2011.

[47] T. Maekawa, A. Kawamura, Y. Kinoshita, and H. Kiya, “Privacy-preserving svm computing in the encrypted domain,” in 2018 Asia-Pacific Signal and Information Processing Association Annual Summitand Conference (APSIPA ASC). IEEE, 2018, pp. 897–902.

[48] M. Tanaka, “Learnable image encryption,” in 2018 IEEE InternationalConference on Consumer Electronics-Taiwan (ICCE-TW). IEEE, 2018,pp. 1–2.

[49] W. Sirichotedumrong, Y. Kinoshita, and H. Kiya, “Pixel-based imageencryption without key management for privacy-preserving deep neuralnetworks,” IEEE Access, vol. 7, pp. 177 844–177 855, 2019.

[50] W. Sirichotedumrong, T. Maekawa, Y. Kinoshita, and H. Kiya, “Privacy-preserving deep neural networks with pixel-based image encryptionconsidering data augmentation in the encrypted domain,” in 2019 IEEEInternational Conference on Image Processing (ICIP). IEEE, 2019, pp.674–678.

[51] W. Sirichotedumrong and H. Kiya, “A gan-based image transformationscheme for privacy-preserving deep neural networks,” in 2020 28thEuropean Signal Processing Conference (EUSIPCO). IEEE, 2021, pp.745–749.

[52] H. Ito, Y. Kinoshita, M. Aprilpyone, and H. Kiya, “Image to perturbation: An image transformation network for generating visually protected images for privacy-preserving deep neural networks,” IEEE Access, vol. 9, pp. 64 629–64 638, 2021.

[53] A. Acar, H. Aksu, A. S. Uluagac, and M. Conti, “A survey on homomorphic encryption schemes: Theory and implementation,” ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1–35, 2018.

[54] Y. Aono, T. Hayashi, L. Wang, S. Moriai et al., “Privacy-preserving deep learning via additively homomorphic encryption,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 5, pp. 1333–1345, 2018.

[55] R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing, “CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy,” in International Conference on Machine Learning. PMLR, 2016, pp. 201–210.

[56] E. Hesamifard, H. Takabi, and M. Ghasemi, “CryptoDL: Deep neural networks over encrypted data,” arXiv preprint arXiv:1711.05189, 2017.

[57] R. Xu, J. B. Joshi, and C. Li, “CryptoNN: Training neural networks over encrypted data,” in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2019, pp. 1199–1209.

[58] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.

[59] M. Rabbani, “JPEG2000: Image compression fundamentals, standards and practice,” Journal of Electronic Imaging, vol. 11, no. 2, p. 286, 2002.

[60] K. Xu, M. Qin, F. Sun, Y. Wang, Y.-K. Chen, and F. Ren, “Learning in the frequency domain,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1740–1749.

[61] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[62] N. Farsad, M. Rao, and A. Goldsmith, “Deep learning for joint source-channel coding of text,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2326–2330.

[63] K. Choi, K. Tatwawadi, A. Grover, T. Weissman, and S. Ermon, “Neural joint source-channel coding,” in International Conference on Machine Learning. PMLR, 2019, pp. 1182–1192.

[64] Y. M. Saidutta, A. Abdi, and F. Fekri, “Joint source-channel coding for Gaussian sources over AWGN channels using variational autoencoders,” in 2019 IEEE International Symposium on Information Theory (ISIT). IEEE, 2019, pp. 1327–1331.

[65] ——, “Joint source-channel coding of Gaussian sources over AWGN channels via manifold variational autoencoders,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 514–520.

[66] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, “An end-to-end compression framework based on convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3007–3018, 2018.

[67] D. Mishra, S. K. Singh, and R. K. Singh, “Wavelet-based deep autoencoder-decoder (WDAED)-based image compression,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 4, pp. 1452–1462, 2021.

[68] P. Mohammadi, A. Ebrahimi-Moghadam, and S. Shirani, “Subjective and objective quality assessment of image: A survey,” arXiv preprint arXiv:1406.7799, 2014.

[69] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694–711.

[70] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[71] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.

[72] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[73] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[74] A. H. Chang and B. M. Case, “Attacks on image encryption schemes for privacy-preserving deep neural networks,” arXiv preprint arXiv:2004.13263, 2020.

[75] W. Sirichotedumrong and H. Kiya, “Visual security evaluation of learnable image encryption methods against ciphertext-only attacks,” in 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2020, pp. 1304–1309.