
Deep learning control model for adaptive optics systems

ZHENXING XU,1,2,3,4,5 PING YANG,1,3,4,* KE HU,1,3,4 BING XU,1,3,4 AND HEPING LI2

1Key Laboratory on Adaptive Optics, Chinese Academy of Sciences, Chengdu 610209, China
2School of Optoelectronic Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
3The Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
4University of Chinese Academy of Sciences, Beijing 100039, China
5e-mail: [email protected]
*Corresponding author: [email protected]

Received 13 September 2018; revised 20 January 2019; accepted 20 January 2019; posted 24 January 2019 (Doc. ID 345676); published 7 March 2019

To correct wavefront aberrations, adaptive optics (AO) systems commonly employ proportional-integral control, in which the control process depends strictly on the response matrix of the deformable mirror. The alignment error between the Hartmann–Shack wavefront sensor and the deformable mirror is caused by various factors in AO systems. In the conventional control method, the response matrix can be recalibrated to reduce the impact of alignment error, but the impact cannot be eliminated. This paper proposes a control method based on a deep learning control model (DLCM) to compensate for wavefront aberrations, eliminating the dependence on the deformable mirror response matrix. Based on the wavefront slope data, the cost functions of the model network and the actor network are defined, and a gradient optimization algorithm improves the efficiency of network training. The model network guarantees the stability and convergence speed, while the actor network improves the control accuracy, realizing online identification and self-adaptive control of the system. A parameter-sharing mechanism is adopted between the model network and the actor network to control the system gain. Simulation results show that the DLCM has good adaptability and stability. Through self-learning, it improves the convergence accuracy and iterations, as well as the adjustment tolerance of the system. © 2019 Optical Society of America

    https://doi.org/10.1364/AO.58.001998

    1. INTRODUCTION

In a general adaptive optics (AO) system, a Hartmann–Shack (HS) wavefront sensor is used to measure the aberration. After the aberration information is obtained, the control system drives the wavefront correction device to compensate for the aberrations, thus controlling the beam quality [1–3]. Among these systems, proportional-integral (PI) control and the corresponding algorithms [4] are often direct and obtain good practical results. However, traditional PI control strictly depends on the response matrix of the deformable mirror, and this response matrix needs to be calibrated [5]. The alignment error between the HS wavefront sensor and the deformable mirror is caused by various factors in AO systems, so a model calibrated at one time may not be as accurate at another [6,7]. In this case, the stability of the system will be greatly affected if there is no parameter self-tuning using good adaptive algorithms.

AO control systems can use the online system identification of a neural network, because a neural network can fit complex nonlinear functions and is able to learn online. However, a shallow neural network is prone to problems such as local optima during training, so it sometimes cannot accurately describe dynamic systems and is difficult to train. The development of deep learning has brought inspiration for solving this problem. Some studies have focused on the use of deep learning methods for system identification: the system model is replaced by a deep neural network, and the task of system identification is transformed into parameter optimization of that network. There have been reports of using a deep learning model to fit a proportional-integral-differential controller [8]; this work proved that deep learning can be used for control strategy calculations. Other studies have used single-layer networks, multilayer networks, and recurrent neural networks (RNNs) for rough-terrain motion systems [9]. In these studies, the user mode of the network is regarded as the input, and the action is the output. The training is carried out with the guided strategy search method, and the supervision signals come from existing paradigms. The network takes state-action samples from existing successful motion paradigms for network training. In some studies, the system dynamic models [10–12] are fitted with a deep rectified linear unit (ReLU) network model, whose main idea is to use a section of historical time data to predict the system output in the future.

1998 Vol. 58, No. 8 / 10 March 2019 / Applied Optics — Research Article

1559-128X/19/081998-12 Journal © 2019 Optical Society of America

In this study, a deep neural network is applied to the development of a wavefront correction scheme, called the deep learning control model (DLCM). This scheme improves the convergence accuracy, speed, and adjustment tolerance of the system through online learning. The DLCM consists of a model network, an actor network, cost functions, and a decision sample space. The model network is used to stabilize the actor network and accelerate convergence. The actor network is used to directly output a control voltage as close as possible to the optimal control amount.

This paper is organized as follows. Section 2 illustrates the system model used in this study. After consideration of existing methods, a DLCM correction method is proposed in Section 3. Numerical simulations are given in Section 4, and conclusions are drawn in Section 5.

2. DLCM AND TRADITIONAL CONTROL METHOD

In an AO system with a wavefront sensor, the DLCM aims to minimize the wavefront residual error measured by the HS sensor, and it drives the deformable mirror to produce the optimal phase compensation by changing the control voltage, thus achieving the correction of wavefront distortion. The PI and DLCM control models are shown in Figs. 1(a) and 1(b). The centroid slope S_0 detected by the HS sensor is affected by various disturbances. G_n is the detection noise, S_n is the wavefront slope, and V is the control voltage vector of the deformable mirror. Z_A^(-1) and Z_B^(-1) are the system's HS sensor delay and control calculation delay, respectively. The DLCM consists of a model network M(S, V_e|θ_m), an actor network A(S, V|θ_a), a decision sample space R, and two cost functions; the cost functions of the model network are J and J0, the cost function of the actor network is J0, and the thick dotted line is a gradient data stream. J performs supervised training on historical optimal control decisions for the model network and updates the actor network through parameter sharing. J0 guides the actor network to output the control voltage, keeping the DLCM updated online in real time.

Generating wavefront to voltage (GWV) is the voltage solution link during PI control. S_n is the centroid slope obtained by the HS sensor. The voltage solution link GWV solves for the voltage and inputs it to the PI controller to obtain the control voltage V. If the detection noise G_n is omitted, the relationship between the subaperture slope and the control voltage can be described in matrix form as follows:

S_n = B V,   (1)

where B is the slope response matrix between the deformable mirror and the HS sensor, which can be obtained by sequentially loading a unit control voltage on each actuator of the deformable mirror and recording the centroid slope response of the HS sensor subapertures. The centroid slope vector S_n can be derived directly from the HS sensor measurement. Using singular value decomposition to solve for the generalized inverse matrix B⁺ of the matrix B, the control voltage vector in the least-squares sense can be obtained:

V = B⁺ S_n.   (2)

After the residual wavefront control voltage V is obtained, the total control voltage loaded onto the wavefront corrector can be solved using the PI control algorithm, and B⁺S_n can be solved in real time. It can be seen from Eqs. (1) and (2) that the operational stability and error propagation of the AO system are closely related to the wavefront recovery algorithm used. The input of the PI controller strictly depends on the response matrix B of the deformable mirror, and B needs to be determined by accurate calibration of the deformable mirror.
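The voltage solution of Eqs. (1) and (2) can be sketched in a few lines of NumPy. The dimensions (61 actuators; 224 slope values, i.e., x and y slopes for 112 subapertures) follow the system simulated in Section 4, and the random matrix B is only a stand-in for a calibrated response matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration: 61 actuators, and 224 slope
# values (x and y slopes for 112 subapertures), as in Section 4.
n_act, n_slope = 61, 224

# Slope response matrix B of Eq. (1): column j holds the HS slope
# response to a unit voltage on actuator j. A random stand-in here.
B = rng.standard_normal((n_slope, n_act))

# Eq. (2): least-squares control voltage via the SVD-based
# generalized inverse B+ (numpy.linalg.pinv uses the SVD).
B_plus = np.linalg.pinv(B)
S_n = rng.standard_normal(n_slope)   # measured centroid slope vector
V = B_plus @ S_n

# Sanity check: slopes generated by a known voltage are mapped back
# to that voltage when B has full column rank.
V_true = rng.standard_normal(n_act)
V_rec = B_plus @ (B @ V_true)
```

Any miscalibration of B propagates directly into V, which is the dependence the DLCM is designed to remove.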

In a typical AO system, the deformable mirror and the HS wavefront sensor are required to be at two conjugate positions. However, in actual application, the alignment accuracy between the two is often affected by various factors [5], as shown in Fig. 2. A response matrix B measured at one moment may succeed but fail at another, which affects the performance of the PI controller.

In Fig. 2, α is the rotation angle relative to the z axis; Δx, Δy, and Δz are the relative translations in the x, y, and z directions, respectively. Δz does not affect the space-matching relationship, but it does affect the conjugate relationship between the deformable mirror and the HS wavefront sensor. In general, we can ignore the adjustment deviation of Δz. Therefore, alignment errors caused by environmental factors primarily include

    Fig. 1. Controller for an AO system. (a) PI; (b) DLCM.


horizontal deviation Δx and vertical deviation Δy of the center of the deformable mirror relative to the center of the HS detector, as well as the rotation error α of the deformable mirror. Alignment error changes the designed matching relationship between the deformable mirror and the HS sensor and therefore leads to a large error in the response matrix. Because the PI controller strictly relies on the response matrix B, this directly affects the performance of the PI controller [7,13,14].

    3. IMPLEMENTATION PRINCIPLE OF THE DLCM

The DLCM consists of the model network M(S, V_e|θ_m), the actor network A(S, V|θ_a), the decision sample space R, and the cost functions (J and J0), as defined in Eqs. (4) and (9). The model network and the actor network have different roles. The model network has three functions: (1) it shares its parameters with the actor network; (2) it stabilizes the output of the actor network; (3) it improves the convergence speed of the actor network. The actor network has two functions: (1) it updates the decision sample space and guides the update of the model network; (2) it improves the control accuracy of the DLCM. The decision sample space is a queue structure in which the optimal and most recent decision samples are stored. The cost function J is used to train the model network and learn control rules from the decision samples. The cost function J0 is used to update the actor network and the model network in real time. The calculation of the cost function J0 is the key to the system.

A. Model Network

The model network adopts a deep neural network (DNN) [15]. Essentially, the DNN implements an input-to-output mapping, and it is capable of learning a large number of such mappings with no need for any accurate mathematical expression between the input and the output. The network acquires this mapping capability once it is trained, and the training is supervised. Before training starts, the network's parameters are initialized with small random numbers. The training of the network is divided into two phases:

(a) Forward propagation phase. (S, V) is extracted from the decision sample space R, and S is input to the deep neural network. The information is transmitted from the input layer to the output layer through a step-by-step transformation. The corresponding actual output of the deep neural network is calculated as

V_e = f_n(⋯ f_2(f_1(S θ_m1) θ_m2) ⋯ θ_mn).   (3)

If there are n samples, the decision sample space is

R = [S_1 V_1; S_2 V_2; …; S_n V_n],

where (f_1, f_2, f_3, …, f_n) are the activation functions of each layer and (θ_m1, θ_m2, θ_m3, …, θ_mn) are the network parameters of each layer.

(b) Backward error propagation phase. The difference between the actual output V_e and the ideal output V of the deep neural network is calculated as

J(θ_m) = (1/2) Σ_j (V_j − V_ej)².   (4)

The model network parameter θ_m is adjusted by minimizing J(θ_m).
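As a concrete sketch of Eqs. (3) and (4), the forward pass and cost of a small fully connected model network can be written as follows. The layer sizes (two ReLU hidden layers of 224 neurons, a linear output of 61 voltages) are taken from Section 4; the weight initialization scale is an illustrative assumption:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(S, params):
    """Forward pass of Eq. (3): V_e = f_n(... f_2(f_1(S θ_m1) θ_m2) ... θ_mn).
    Hidden layers use ReLU and the output layer is linear, as in Section 4."""
    a = S
    *hidden, last = params
    for W, b in hidden:
        a = relu(a @ W + b)
    W, b = last
    return a @ W + b

def cost_J(V, V_e):
    """Eq. (4): J(θ_m) = (1/2) Σ_j (V_j − V_ej)²."""
    return 0.5 * np.sum((np.asarray(V) - np.asarray(V_e)) ** 2)

# Illustrative architecture: 224 slopes in, two hidden layers of 224
# neurons, 61 voltages out.
rng = np.random.default_rng(1)
sizes = [224, 224, 224, 61]
params = [(0.05 * rng.standard_normal((m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

S = rng.standard_normal(224)
V_e = forward(S, params)
```

Training then backpropagates J through these layers to adjust θ_m; the backward pass is omitted here for brevity.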

    For an AO system, the voltage and slope have the followingequation:

    S � BV : (5)When the disturbance voltage ΔV is added, we have thefollowing equation:

    ΔS � S � B�ΔV � V �: (6)Then, Eq. (7) can be obtained by Eqs. (5) and (6):

    S � BΔV : (7)Equation (7) can obtain the training sample �ΔS ,ΔV � of themodel network M �S,V ejθm�. In the actual system, �ΔS,ΔV �is easily obtained, so the training sample can be obtained effi-ciently by the above method. Even if the training sample�ΔS ,ΔV � is not optimal, the actor network improves the ac-curacy of the control voltage because the cost function of theactor network is defined by the wavefront residuals measuredby the HS sensor. In Section 3.B, we will define the cost func-tion for the actor network in detail.
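Collecting a training pair (ΔS, ΔV) per Eqs. (5)–(7) only requires two slope measurements around a voltage disturbance. In this sketch the plant S = BV is simulated with a random matrix, and measure_slopes is an assumed stand-in for an HS readout:

```python
import numpy as np

rng = np.random.default_rng(2)
n_act, n_slope = 61, 224
B = rng.standard_normal((n_slope, n_act))   # stand-in plant of Eq. (5)

def measure_slopes(V):
    """Stand-in for the HS measurement S = B V of Eq. (5)."""
    return B @ V

def collect_sample(V, amp=0.01):
    """One (ΔS, ΔV) pair via Eqs. (6)-(7): perturb the current voltage
    and difference the two slope measurements, so ΔS = B ΔV."""
    S = measure_slopes(V)
    dV = amp * rng.standard_normal(V.shape)   # disturbance voltage ΔV
    S_pert = measure_slopes(V + dV)           # left side of Eq. (6)
    dS = S_pert - S                           # Eq. (7)
    return dS, dV

V0 = np.zeros(n_act)
dS, dV = collect_sample(V0)
```

Each pair is pushed into the decision sample space R and later drawn in minibatches to minimize J.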

B. Actor Network

Figure 3 shows the structure of the model network on the left and the structure of the actor network on the right. The two networks have the same structure: each takes the slope data as input and outputs a voltage. Because the model network only obtains an estimated voltage, the actor network is needed to improve the precision of the output voltage. The actor network aims to minimize the wavefront residual error measured by the HS sensor. It drives the deformable mirror to produce the optimal compensation by changing the control voltage, thus achieving real-time correction of wavefront distortion. We can directly calculate the cost function J0 based on the wavefront slope information.

Fig. 2. Spatial relationship of deformable mirror and HS sensor.

The HS wavefront sensor is currently the most widely used wavefront sensor. It performs tile sampling of the wavefront using a lens array. Each sublens acts as a subaperture and focuses the beam into one spot of a light-spot array. The center coordinates of each spot can be measured with an array camera. First, standard parallel light illuminates the lens array to measure the spot center coordinates of each subaperture as a reference datum. When wavefront distortion occurs in the incident beam, the wavefront tilt within a subaperture causes a deflection of the light spot. After the deflection of the spot center in the two directions is measured, the wavefront slope in the two directions within the subaperture can be calculated:

X_c = Σ X_i I_i / Σ I_i = (λf / 2πS) ∫∫_A (∂Φ(x, y)/∂x) dx dy = (λf / 2π) G_X,
Y_c = Σ Y_i I_i / Σ I_i = (λf / 2πS) ∫∫_A (∂Φ(x, y)/∂y) dx dy = (λf / 2π) G_Y,   (8)

where f is the focal length of the microlens, I_i is the intensity of pixel i, and X_i and Y_i are the coordinates of pixel i. (X_c, Y_c) are the coordinates of the spot centroid, (G_X, G_Y) is the slope, and A is the subaperture area. After the subaperture slope data are obtained, the cost function J0 given in Eq. (9) can be defined:

J0(G_X, G_Y) = Σ_{i=1}^{N} (G_Xi² + G_Yi²) / N,   (9)

where N is the number of subapertures and J0(G_X, G_Y) has a unique extremum.
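The centroid and cost calculations of Eqs. (8) and (9) reduce to a few array operations. The 5 × 5 spot image and the 112-subaperture slope vectors below are illustrative:

```python
import numpy as np

def centroid(I, X, Y):
    """Intensity-weighted spot centroid, the discrete part of Eq. (8):
    X_c = Σ X_i I_i / Σ I_i (and likewise for Y_c)."""
    s = I.sum()
    return (X * I).sum() / s, (Y * I).sum() / s

def cost_J0(GX, GY):
    """Eq. (9): J0(G_X, G_Y) = Σ_i (G_Xi² + G_Yi²) / N."""
    GX, GY = np.asarray(GX), np.asarray(GY)
    return np.mean(GX**2 + GY**2)

# Toy subaperture image: all intensity in the pixel at (x=3, y=2).
X, Y = np.meshgrid(np.arange(5.0), np.arange(5.0))
I = np.zeros((5, 5))
I[2, 3] = 1.0
xc, yc = centroid(I, X, Y)

# A flat wavefront has zero slope in all 112 subapertures, so J0 = 0:
# J0 reaches its unique extremum at the corrected wavefront.
J0_flat = cost_J0(np.zeros(112), np.zeros(112))
```

Because J0 is built only from measured slopes, it never touches the response matrix B, which is what frees the actor network from calibration errors.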

The cost function J0 is used to update the actor network and the model network in real time, and its calculation is the key to the system. It not only guarantees the training efficiency of the model network but also improves the DLCM's convergence speed and stability. The actor network aims to minimize the J0 measured by the HS sensor, so the control precision of the DLCM can be guaranteed.

C. Calculation of Gradient

To update the network parameter θ_a of the actor network, we need to know the value of ∇_V J0. Because the gradient ∇_θa V(θ_a) can be calculated, the gradient information of the network parameters can be obtained through the chain rule:

∇_θa J0 = ∇_V J0 ∇_θa V(θ_a).   (10)

The network parameters are then updated by the gradient descent (GD) method:

θ_a = θ_a − α ∇_θa J0,   (11)

where α is the learning rate. The accuracy of ∇_V J0 directly affects the convergence performance of the system and is the core factor that determines system performance. There are many mature methods for gradient estimation in AO systems, from the early hill-climbing method [16] and the multivariate high-frequency vibration method [17,18] to the genetic algorithm [19,20], the ant colony algorithm [21], the simulated annealing algorithm [22], and the stochastic parallel gradient descent (SPGD) algorithm [23] in the past decade. SPGD is used here to calculate the gradient ∇_V J0. The specific calculation steps for ∇_V J0 are as follows:

(1) Evaluate the cost function of the current control voltage V: J0(t).
(2) Generate the disturbance voltage ΔV.
(3) Load the disturbance voltage: V(t + 1) = V(t) + ΔV.
(4) Evaluate the cost function with the disturbance voltage ΔV loaded: J0(t + 1).
(5) Obtain the gradient: ∇_V J0 = (J0(t + 1) − J0(t)) ΔV.
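The SPGD steps above can be sketched directly. The quadratic stand-in for J0, the Bernoulli perturbation amplitude, and the loop gain are illustrative assumptions chosen for this toy problem, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def spgd_gradient(J0, V, sigma=0.05):
    """One SPGD gradient estimate, following the listed steps:
    evaluate J0(t), apply a random disturbance ΔV, evaluate J0(t+1),
    and form ∇_V J0 = (J0(t+1) − J0(t)) ΔV."""
    j_before = J0(V)                                    # step (1): J0(t)
    dV = sigma * rng.choice([-1.0, 1.0], size=V.shape)  # step (2): ΔV
    j_after = J0(V + dV)                                # steps (3)-(4): J0(t+1)
    return (j_after - j_before) * dV                    # step (5)

# Stand-in cost with a unique minimum at V_star, qualitatively like Eq. (9).
V_star = np.ones(61)
J0 = lambda V: float(np.sum((V - V_star) ** 2))

# Descend along the SPGD estimate; the estimate is noisy per step but
# correct in expectation, so the cost decreases over many iterations.
V = np.zeros(61)
for _ in range(500):
    V = V - 3.0 * spgd_gradient(J0, V)
```

In the DLCM this estimate of ∇_V J0 is not applied to V directly; it is chained through Eqs. (10) and (13) to update the network parameters.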

To update the network parameter θ_m of the model network, θ_m can be adjusted by minimizing J and J0. When θ_m is updated using the decision sample space, the cost function J is used, and the gradient ∇_θm J(θ_m) of the network parameter can be obtained.

The network parameter is updated with the GD method, with β_1 and β_2 the learning rates:

θ_m = θ_m − β_1 ∇_θm J(θ_m).   (12)

When the network parameter θ_m is updated online in real time, the cost function J0 is used, and the gradient of the network parameter is obtained as

∇_θm J0 = ∇_V J0 ∇_θm V_e(θ_m).   (13)

The network parameter is updated with the GD method:

θ_m = θ_m − β_2 ∇_θm J0.   (14)

Fig. 3. Network structure of the DNN. (a) Model network; (b) actor network.

D. Gradient Optimization

Under the GD method, the convergence speed is too slow, whereas gradient optimization methods can effectively improve the convergence speed. The Nesterov accelerated gradient (NAG) is an effective gradient optimization method [24]. The network parameter updating formulas are as follows:

v_t = γ v_{t−1} + α ∇_θ J(θ − γ v_{t−1}),   (15)

θ = θ − v_t,   (16)

where γ is the momentum parameter and α is the learning rate. When the gradient points in the actual moving direction, the momentum term v_t increases; when the gradient is opposite to the actual moving direction, v_t decreases. This means the momentum term updates the parameters only for the relevant samples, reducing unnecessary parameter updates, which results in faster and more stable convergence and shortens the oscillation process.
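A minimal sketch of the NAG update of Eqs. (15) and (16) on a toy quadratic cost; γ = 0.9 and α = 0.1 are illustrative values:

```python
import numpy as np

def nag_update(theta, v, grad_fn, gamma=0.9, alpha=0.1):
    """One NAG step: Eq. (15) evaluates the gradient at the look-ahead
    point θ − γ v_{t−1}; Eq. (16) then applies θ = θ − v_t."""
    v_new = gamma * v + alpha * grad_fn(theta - gamma * v)
    return theta - v_new, v_new

# Toy cost J(θ) = ||θ − θ*||², with gradient 2(θ − θ*).
theta_star = np.array([1.0, -2.0])
grad = lambda th: 2.0 * (th - theta_star)

theta, v = np.zeros(2), np.zeros(2)
for _ in range(100):
    theta, v = nag_update(theta, v, grad)
```

The look-ahead evaluation is the only difference from classical momentum; it damps the overshoot that plain momentum exhibits near a minimum.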

AdaDelta [25] is also an effective gradient optimization method. To reduce the possibility that the learning rate decays too quickly, AdaDelta adapts a different learning rate for each parameter, using three measures: (1) use a window; (2) use the mean of the parameter gradients over the history window (excluding the current gradient); (3) take the final mean as the time-decayed weighted average of the historical window mean and the current gradient.

The historical gradient mean is expressed by E[g²]_t:

E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²,   (17)

where γ is a parameter of the momentum term, and the parameter update vector can be expressed as

Δθ_t = −(α / √(E[g²]_t + ε)) g_t.   (18)

Replacing the denominator of Eq. (18) with the root mean square RMS[g]_t leads to

Δθ_t = −(α / RMS[g]_t) g_t.   (19)

The units in the above formula do not match, so the unit of Δθ_t is corrected by multiplying Eq. (19) by the mean of the parameter update vector, which at the same time also avoids fast attenuation of the learning rate:

E[Δθ²]_t = γ E[Δθ²]_{t−1} + (1 − γ) Δθ_t².   (20)

The root mean square of the parameter update vector is

RMS[Δθ]_t = √(E[Δθ²]_t).   (21)

RMS[Δθ]_t is unknown, so RMS[Δθ]_{t−1} is used as an approximate estimate. The learning rate α in Eq. (19) is then replaced to obtain the parameter update form of AdaDelta:

Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) g_t,   (22)

θ = θ + Δθ_t.   (23)

Equations (20) and (21) avoid the mismatched units of the learning rate as well as its fast attenuation. In addition, they adapt a different learning rate for each parameter.
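Eqs. (17)–(23) map directly onto a small per-parameter optimizer. The decay γ = 0.95 and ε = 1e-6 below are typical illustrative values, and the quadratic driving cost is a stand-in:

```python
import numpy as np

class AdaDelta:
    """Per-parameter update of Eqs. (17)-(23). A small ε keeps both
    RMS terms nonzero, as is conventional."""
    def __init__(self, shape, gamma=0.95, eps=1e-6):
        self.gamma, self.eps = gamma, eps
        self.Eg2 = np.zeros(shape)    # E[g²]_t of Eq. (17)
        self.Edx2 = np.zeros(shape)   # E[Δθ²]_t of Eq. (20)

    def step(self, theta, g):
        self.Eg2 = self.gamma * self.Eg2 + (1 - self.gamma) * g**2
        rms_g = np.sqrt(self.Eg2 + self.eps)          # RMS[g]_t
        rms_dx = np.sqrt(self.Edx2 + self.eps)        # RMS[Δθ]_{t−1}, Eq. (21)
        dtheta = -(rms_dx / rms_g) * g                # Eq. (22)
        self.Edx2 = self.gamma * self.Edx2 + (1 - self.gamma) * dtheta**2
        return theta + dtheta                         # Eq. (23)

theta = np.array([5.0, -3.0])
opt = AdaDelta(theta.shape)
for _ in range(200):
    # Gradient of ||θ||²; steps grow slowly from √ε, so many
    # iterations are needed on this toy cost.
    theta = opt.step(theta, 2.0 * theta)
```

Note that no global learning rate appears in the update: the ratio of the two RMS accumulators sets the step size per parameter.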

To make each parameter adapt to a different learning rate, the second-order derivative information of the objective function is used to increase the convergence speed. Combining the advantages of NAG and AdaDelta, an optimization algorithm, accelerated gradient descent (AGD), is proposed (see Algorithm 1):

Algorithm 1. Computing the AGD update at time t

Initial decay rates γ_1 ∈ (0, 1), β ∈ (0, 1); constant ε
Initial parameter vector θ; J(θ): cost function with parameters θ
Initialize accumulation variables E[g²]_0 = 0, E[Δθ²]_0 = 0, v_0 = 0
for t = 1 : T
    Accumulate moment: v_t = γ_1 v_{t−1} + (1 − γ_1)(∇_θt J(θ_t) − ∇_θt−1 J(θ_t−1))
    Accumulate moment: E[v²]_t = β E[v²]_{t−1} + (1 − β) v_t²
    Compute update: Δθ_t = −(RMS[Δθ]_{t−1} / (RMS[v]_t + ε)) v_t
    Accumulate updates: E[Δθ²]_t = β E[Δθ²]_{t−1} + (1 − β) Δθ_t²
    Apply update: θ = θ + Δθ_t
end for

In the AGD algorithm, the first step is to initialize the hyperparameters γ_1 ∈ (0, 1) and β ∈ (0, 1) and set the constant ε (ε is generally a very small value, to avoid RMS[v]_t being equal to zero). Second, the cumulative variable v_t, which contains the gradient variation, is calculated so that the trend of the gradient variation can be estimated more accurately. The historical momentum mean is expressed by E[v²]_t, which is obtained from the current v_t and the previous momentum mean. The cumulative update amount E[Δθ²]_t is obtained from the mean of the current update amount Δθ_t and the historical cumulative update amount; finally, the network parameters are updated through AGD. The main idea of the AGD optimization algorithm is to use the second-order derivative information of the objective function. The significance of the second-order derivative is that if the current gradient is larger than the previous gradient, there is reason to believe it will continue to increase, and the growing part should be added in advance; if the current gradient is smaller than the previous one, the case is similar. AGD can adapt a different learning rate to each parameter and predict the update position, thereby increasing the convergence speed of the deep neural network.
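Algorithm 1 can be sketched as follows. One detail the listing leaves open is the gradient at t = 0; this sketch takes ∇_θ0 J = 0, which is an assumption, and the toy quadratic cost is illustrative only:

```python
import numpy as np

def agd(grad_fn, theta, T=100, gamma1=0.9, beta=0.95, eps=1e-6):
    """Sketch of Algorithm 1 (AGD): the momentum v_t accumulates the
    gradient *difference* (a second-order-like signal), and the step
    uses the AdaDelta-style ratio RMS[Δθ]_{t−1} / (RMS[v]_t + ε)."""
    v = np.zeros_like(theta)       # v_0 = 0
    Ev2 = np.zeros_like(theta)     # E[v²]_0 = 0
    Edx2 = np.zeros_like(theta)    # E[Δθ²]_0 = 0
    g_prev = np.zeros_like(theta)  # assumed: ∇_θ0 J treated as 0
    for _ in range(T):
        g = grad_fn(theta)
        v = gamma1 * v + (1 - gamma1) * (g - g_prev)     # accumulate moment
        Ev2 = beta * Ev2 + (1 - beta) * v**2             # E[v²]_t
        dtheta = -(np.sqrt(Edx2 + eps) / (np.sqrt(Ev2) + eps)) * v
        Edx2 = beta * Edx2 + (1 - beta) * dtheta**2      # E[Δθ²]_t
        theta = theta + dtheta                           # apply update
        g_prev = g
    return theta

# Toy quadratic cost J(θ) = ||θ||², with gradient 2θ.
theta_out = agd(lambda th: 2.0 * th, np.array([5.0, -3.0]), T=50)
```

In the DLCM the gradients fed to AGD come from the noisy SPGD estimates, so consecutive gradient differences are informative rather than zero.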

E. Algorithm Flow

To avoid system oscillation caused by excessive system gain, a parameter-sharing mechanism is adopted. Parameter sharing between the model network and the actor network is as follows:

θ_a ← τ θ_m + (1 − τ) θ_a.   (24)

The hyperparameter τ (0 ≤ τ ≤ 1) controls the stability of the actor network and adjusts the system gain. A suitable τ can accelerate the convergence of the actor network; lower values of τ provide slower but more stable control. Parameter sharing enables the model network and the actor network to train independently, perform parallel computation, and improve operation speed.
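Eq. (24) is a soft parameter copy from the model network to the actor network; over lists of per-layer weight arrays it is a one-liner. The layer shapes below are illustrative:

```python
import numpy as np

def share_parameters(theta_m, theta_a, tau=0.98):
    """Eq. (24): θ_a ← τ θ_m + (1 − τ) θ_a, applied layer by layer.
    τ near 1 copies the model network quickly; smaller τ gives
    slower but more stable control."""
    return [tau * m + (1 - tau) * a for m, a in zip(theta_m, theta_a)]

# Illustrative two-layer parameter lists.
theta_m = [np.ones(4), np.full(3, 2.0)]
theta_a = [np.zeros(4), np.zeros(3)]
theta_a = share_parameters(theta_m, theta_a, tau=0.98)
```

With τ = 0.98, as in Section 4, the actor network tracks the model network closely while retaining a small stabilizing memory of its own parameters.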


Based on the above principles, the flow of the deep learning control algorithm is as follows (see Algorithm 2):

Algorithm 2. DLCM

Initialize the model network M(S, V_e|θ_m) and the actor network A(S, V|θ_a)
Initialize the decision sample space R
for episode = 1, M do
    Obtain the wavefront slope information S_0
    Share the network parameters: θ_a ← τ θ_m + (1 − τ) θ_a
    for t = 0, T do
        The actor network A(S_t, V_t|θ_a) outputs the control voltage V_t
        Estimate the gradients: ∇_θa J0 = ∇_V J0 ∇_θa V(θ_a), ∇_θm J0 = ∇_V J0 ∇_θm V_e(θ_m)
        Obtain a sample: (ΔS, ΔV)
        Calculate the Δθ_a of the actor network according to AGD
        Update the actor network A(S_0, V|θ_a): θ_a = θ_a − Δθ_a
        Calculate the Δθ_m of the model network according to AGD
        Update the model network M(S_0, V_e|θ_m): θ_m = θ_m − Δθ_m
        Store the sample: R ← (ΔS, ΔV)
    end for
    Sample a random minibatch from R
    Update the model network by AGD, minimizing the loss J
end for

    4. SIMULATION RESULTS AND ANALYSIS

According to the methods and steps described in Section 3, the DLCM was applied to a 61-element AO system to perform the simulation. The HS sensor is composed of a 12 × 12 lenslet array (112 subapertures used) and a 288 pixel × 288 pixel CCD camera; each subaperture has 24 pixels. The correction time delay is three sampling cycles. The DLCM is terminated by the cost function when it falls below a threshold, or by the iteration count if the iterations do not converge. The model network and the actor network have the same structure: the number of hidden layers was two; the activation function of the hidden layers was ReLU; the number of neurons in each hidden layer was 224; and the activation function of the output layer was linear. The basic principle for determining the number of hidden-layer nodes was to take as compact a structure as possible under the premise of satisfying the accuracy requirement, that is, as few hidden-layer nodes as possible. The hyperparameter τ was 0.98, and the capacity of the decision sample space was 900; we found that the training performance tended to be stable when the decision sample space was greater than 900 with the model network trained by the method in this paper. The model network was trained with the guided strategy method; the supervisory signal was derived from the wavefront slope ΔS and the optimal control voltage ΔV obtained from online sampling of the actor network, and the AGD method was used to optimize the gradient.

In the simulation, the wavefront aberration W to be compensated is generated randomly from the first 35 Zernike polynomial terms on the circular domain [26,27]. A total of 100 initial wavefronts (W_1, W_2, …, W_100) generated in the simulation is used to verify the DLCM. The iteration count, convergence accuracy, and tolerance of the system are compared by correcting various aberrations. The evaluation index J0 defined in Eq. (9) and the Strehl ratio (SR) are used as system performance indices.

    We usually calculate the energy concentration of the spot image to assess the SR by

    SR ≈ |max(I(i))|² / Σ_{i=1}^{N} (I(i))²,   (25)

    where I(i) is the gray value of the ith pixel and N is the number of pixels.
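A sketch of this energy-concentration estimate, reading Eq. (25) as peak intensity squared over the total squared intensity (our reconstruction of the formula; under this reading, a perfectly concentrated spot gives SR = 1 and a uniformly spread spot gives SR = 1/N):

```python
import numpy as np

def strehl_ratio(image):
    """Approximate SR from the energy concentration of a spot image, per Eq. (25)."""
    I = np.asarray(image, dtype=float).ravel()
    return np.abs(I.max()) ** 2 / np.sum(I ** 2)

# A spot with all energy in one pixel is perfectly concentrated: SR = 1.
spot = np.zeros((288, 288))   # CCD size from the simulation setup
spot[144, 144] = 1000.0
print(strehl_ratio(spot))  # 1.0
```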

    Now, we change the alignment error and use it for closed-loop correction under the new conditions. For this purpose, the actuator spacing of the 61-element deformable mirror is set as d. The position (Δx, Δy, α) of the deformable mirror is adjusted as follows:

    (1) Under the condition of (Δy = 0, α = 0) alignment, make movements of −4d/10, −2d/10, 0, 2d/10, and 4d/10 along the x direction.

    (2) Under the condition of (Δx = 0, α = 0) alignment, make movements of −4d/10, −2d/10, 0, 2d/10, and 4d/10 along the y direction.

    (3) Under the condition of (Δx = 0, Δy = 0) alignment, make rotations of −6°, −3°, 0°, 3°, and 6° of the rotation angle α about the z axis.
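The three sweeps above give 15 single-axis misalignment conditions in total, which can be enumerated as (Δx, Δy, α) tuples (translations expressed in units of the actuator spacing d):

```python
# Enumerate the misalignment conditions: one axis is varied at a time.
steps = [-0.4, -0.2, 0.0, 0.2, 0.4]   # translations, in units of the spacing d
angles = [-6.0, -3.0, 0.0, 3.0, 6.0]  # rotations about z, in degrees

conditions = ([(dx, 0.0, 0.0) for dx in steps] +   # sweep Δx with Δy = α = 0
              [(0.0, dy, 0.0) for dy in steps] +   # sweep Δy with Δx = α = 0
              [(0.0, 0.0, a) for a in angles])     # sweep α with Δx = Δy = 0

print(len(conditions))  # 15 positions, each corrected in closed loop by PI and DLCM
```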

    First, the PI response matrix is not recalibrated after each position adjustment. PI and the DLCM are used to make the AO system work in a closed loop, aberration W1 is corrected, and the iteration times, convergence accuracy, and tolerance of the two are compared. The correction results are shown in Section 4.A.

    Second, the PI response matrix is recalibrated after each position adjustment. PI control is used to make the AO system work in a closed loop. Under the condition that the control parameters remain unchanged, the iteration times, convergence accuracy, and tolerance of PI and the DLCM are compared. The correction results are shown in Section 4.B.

    A. Keep PI’s Response Matrix Constant

    The closed-loop correction results of aberration W1 with various alignment errors are shown in Figs. 4–6. The closed-loop iteration performs 30 steps at a time.

    To compare the convergence accuracy, semilogarithmic coordinates are used in Figs. 4(a)–4(e). Figures 4(a)–4(e) show that the DLCM has high correction accuracy and convergence speed under various rotations α; as the aberration compensation gets better, the evaluation index J0 decreases. Rotation misalignment has a great influence on PI; if the misalignment is too large, it can lead to closed-loop instability [Figs. 4(a) and 4(e)]. The DLCM eliminates the influence of the response matrix through self-learning and maintains control stability under various rotation conditions. According to Fig. 4(f), the DLCM has high control stability and tolerance under various α. It is worth noting that when the rotation misalignment α = 0, the SR of PI is 0.971 and the SR of the DLCM is 0.985. As the aberration compensation gets better, the SR increases, indicating that the correction effect of the DLCM is better than that of PI.

    Research Article Vol. 58, No. 8 / 10 March 2019 / Applied Optics 2003

    Fig. 4. Correction capability of PI and DLCM under various rotations α. (a)–(e) System convergence process; (f) SR value after system convergence.

    Fig. 5. Correction capability of PI and DLCM under various translations Δx. (a)–(e) System convergence process; (f) SR value after system convergence. When Δx = 0, the SR of PI is 0.971; the SR of the DLCM is 0.985. Under other Δx conditions, the SR value of PI is obviously less than that of the DLCM.

    Figures 5(a)–5(e) and 6(a)–6(e) show that the DLCM has better convergence accuracy and speed than PI under various Δx and Δy, and translation misalignment has a great impact on PI. Too much misalignment leads to closed-loop instability, which is caused by the failure of the response matrix [Figs. 5(a), 5(e), 6(a), and 6(e)]. The DLCM maintains its adaptability under various conditions through self-learning. Figures 5(f) and 6(f) show that the DLCM has better system tolerance than PI, and a better SR can be obtained. Obviously, as the translation misalignments Δx and Δy gradually increase, the control system’s capability to correct aberration becomes worse and worse, which is consistent with Ref. [5].

    To further compare the capability of the DLCM and PI to correct various aberrations, the Monte Carlo method is used to compare the mean value of the SR after 100 aberrations are corrected under various conditions. The mean of the SR is calculated as SR̄ = mean(SR1, …, SR100).

    Figure 7 shows that the DLCM has better tolerance and convergence than PI over the 100-aberration correction, and a better SR can be obtained. The results follow a rule similar to that seen after the correction of aberration W1 (Figs. 4–6).

    Table 1 shows the comparison of the iteration times and accuracy of the control models without alignment error. The iteration is terminated when J0 becomes stable. When the trained model network corrects the wavefront aberrations, J0 decreases from 9.38 to 0.323 after one iteration. As there is no actor network to continue correcting the aberrations, the aberrations can only be corrected to 0.323 (the hyperparameter τ decreases as this value increases), which indicates that the model network can stabilize the control voltage so that the output voltage will not increase too much in a single step. Its convergence speed is faster than that of the actor network, but its control precision is lower; the advantage of the model network is convergence speed. When no parameters are shared, the actor network can correct the J0 of the aberrations to 0.237 through 68 iterations, indicating that its convergence speed is slower than the model network’s but its control precision is higher; therefore, the advantage of the actor network is control precision.

    Fig. 6. Correction capability of PI and DLCM under various translations Δy. (a)–(e) System convergence process; (f) SR value after system convergence. When Δy = 0, the SR of PI is 0.971; the SR of the DLCM is 0.985.

    Fig. 7. Mean of the SR after 100 aberrations are corrected. (a)–(c) SR mean after system convergence.

    The DLCM combines both advantages through the parameter-sharing mechanism; thus, the control accuracy and iteration times can both be improved. Because the computational cost of the DLCM is higher than that of PI, a well-chosen network structure and fast hardware are needed to reduce the time cost of a single DLCM iteration.
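The exact parameter-sharing rule is not specified in this excerpt. A common way to couple two networks of identical structure while controlling the gain is a soft (Polyak-style) blend governed by the hyperparameter τ; the sketch below is an assumption in that spirit, not the authors' exact mechanism:

```python
import numpy as np

TAU = 0.98  # hyperparameter from the text; weights how much of the old parameters is kept

def share_parameters(model_params, actor_params, tau=TAU):
    """Softly blend actor parameters into the model network, layer by layer.

    Both networks have the same structure, so parameters can be mixed
    element-wise: theta_model <- tau * theta_model + (1 - tau) * theta_actor.
    """
    return [(tau * Wm + (1 - tau) * Wa, tau * bm + (1 - tau) * ba)
            for (Wm, bm), (Wa, ba) in zip(model_params, actor_params)]

# Toy check: blending same-shaped layers preserves shapes and moves values gradually.
rng = np.random.default_rng(0)
model = [(rng.standard_normal((4, 3)), np.zeros(3))]
actor = [(rng.standard_normal((4, 3)), np.ones(3))]
shared = share_parameters(model, actor)
print(shared[0][1])  # biases move 2% of the way toward the actor's values
```

A τ close to 1 keeps the model network's fast, stable behavior dominant while slowly absorbing the actor network's higher-precision corrections.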

    B. Recalibrate the Response Matrix

    The closed-loop correction results for aberration W1 with various alignment errors are shown in Figs. 8–10. The closed-loop iteration performs 30 steps at a time.

    Figures 8(a)–8(f) show that, for various rotations α, the control performance of PI after recalibrating the response matrix is significantly improved. The influence of the alignment error is reduced, but it cannot be eliminated; meanwhile, PI still has poor stability. Obviously, the DLCM maintains stable performance through self-learning and can obtain a better SR. What is important is that the DLCM does not need to remeasure the response matrix, so it can achieve stability under various rotations α.

    Figures 10(a)–10(e) and 9(a)–9(e) show that, under various Δx and Δy, the control performance of PI after recalibrating the response matrix is significantly improved, but the convergence accuracy and speed of the DLCM are still better than those of PI. Figures 10(f) and 9(f) show that the DLCM has better convergence and tolerance than PI and can obtain a better SR. As the translation misalignments Δx and Δy gradually increase, the control system’s ability to correct aberration becomes increasingly worse. By remeasuring the response matrix, the control stability of PI is maintained and its convergence accuracy is improved; the DLCM is still slightly better than PI. Numerical calculation under various conditions shows that as long as the effective light spots are within the effective range of the actuator, the stability of the DLCM and PI can be guaranteed, and the control process is basically stable.

    Figure 11 shows that, after recalibrating the response matrix of PI, the mean value of the SR after 100 aberrations are corrected is still slightly lower than that of the DLCM. The DLCM improves the performance under various alignment errors: there is no need to remeasure the response matrix, and the control performance improves to a certain extent. Through self-learning, it brings better control performance and system adjustment tolerance.

    Table 1. Comparison Result of the Control Model (Δx = 0, Δy = 0, α = 0)^a

    Aberration (λ = 632.8 nm)   Control Model   J0-Mean   Iteration Times-Mean
    W1, W2, …, W100             model net       0.323     1
                                actor net       0.237     68
                                DLCM            0.109     8
                                PI              0.221     23

    ^a The initial mean of J0 is 9.38.

    Fig. 8. Correction capability of PI and DLCM under various rotations α. (a)–(e) System convergence process; (f) SR value after system convergence.

    Fig. 9. Correction capability of PI and DLCM under various translations Δy. (a)–(e) System convergence process; (f) SR value after system convergence.

    Fig. 10. Correction capability of PI and DLCM under various translations Δx. (a)–(e) System convergence process; (f) SR value after system convergence.

    Figure 12 shows the convergence curve when the model network is trained on the decision samples. AGD basically converges after 120 iterations, while GD basically converges after 835 iterations; convergence is thus about seven times faster. This shows that AGD can significantly improve the learning efficiency, because AGD adapts a different learning rate for each parameter and carries second-order-derivative information about the cost function; it accelerates the network convergence and improves training efficiency. At the same time, AGD can improve the efficiency of parallel computing.
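The exact AGD variant is not detailed in this excerpt. As an illustration of why per-parameter adaptive learning rates help, the sketch below compares plain gradient descent with Adagrad-style updates [25] on a deliberately ill-conditioned quadratic; the problem, step sizes, and tolerance are illustrative assumptions:

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 100*x1^2 + x2^2, with gradient (200*x1, 2*x2).
curv = np.array([100.0, 1.0])
grad = lambda x: 2.0 * curv * x

def plain_gd(x0, lr=0.004, tol=1e-3, max_iter=5000):
    """Fixed learning rate: bounded by the steepest direction, so the flat one crawls."""
    x = x0.copy()
    for it in range(1, max_iter + 1):
        x -= lr * grad(x)
        if np.max(np.abs(x)) < tol:
            return it
    return max_iter

def adagrad(x0, lr=0.9, tol=1e-3, max_iter=5000, eps=1e-8):
    """Per-parameter learning rates, scaled by the accumulated squared gradients."""
    x, G = x0.copy(), np.zeros_like(x0)
    for it in range(1, max_iter + 1):
        g = grad(x)
        G += g * g                         # running sum of squared gradients
        x -= lr * g / (np.sqrt(G) + eps)   # effective step differs per coordinate
        if np.max(np.abs(x)) < tol:
            return it
    return max_iter

x0 = np.array([1.0, 1.0])
print(plain_gd(x0), adagrad(x0))  # the adaptive method needs far fewer iterations
```

On this problem the accumulated-gradient scaling equalizes the contraction rate of both coordinates, so the adaptive method converges in a handful of steps while fixed-rate GD needs hundreds, which mirrors the 835-versus-120 gap reported for Fig. 12.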

    5. CONCLUSION

    The DLCM features good convergence performance and control precision. Adopting better gradient optimization and gradient estimation methods is the key to improving the efficiency of the algorithm. The actor network realizes adaptive adjustment of the deformable mirror and effectively improves the control precision. The model network learns the control rules from the decision samples sampled online; it improves the convergence speed and stabilizes the output of the actor network through parameter sharing. The DLCM does not require calibration of the response matrix of the deformable mirror, thus eliminating the dependence on the response matrix. The cost function defined by the centroid slope is very important and is related to the control precision and stability. The DLCM achieves online identification of the system model and adaptive control, making the system more adaptable. An obvious shortcoming of the DLCM is the need for sample collection and estimation, but this does not affect online operation, because an appropriate network structure can relieve the computational burden. A well-chosen network structure, fast hardware, and parallel-distributed computing can simplify the data collection and estimation.

    Funding. National Natural Science Foundation of China (NSFC) (60978049); Youth Innovation Promotion Association of the Chinese Academy of Sciences (2012280); Sciences Innovation Fund, Chinese Academy of Sciences (CAS) (CXJJ-16M208).

    REFERENCES

    1. Z. Y. Zheng, C. W. Li, B. M. Li, and S. J. Zhang, “Analysis and demonstration of PID algorithm based on arranging the transient process for adaptive optics,” Chin. Opt. Lett. 11, 110101 (2013).

    2. Z. J. Yan, X. Y. Li, and C. H. Rao, “Multi channel adaptive control algorithm for closed-loop adaptive optics system,” Acta Opt. Sin. 33, 0301002 (2013).

    3. J. Tesch, T. Truong, R. Burruss, and S. Gibson, “On-sky demonstration of optimal control for adaptive optics at Palomar Observatory,” Opt. Lett. 40, 1575–1578 (2015).

    4. Y. M. Guo, X. Y. Ma, and C. H. Rao, “Optimal closed-loop bandwidth of tip-tilt correction loop in adaptive optics system,” Acta Phys. Sin. 63, 069502 (2014).

    5. Y. M. Guo, C. H. Rao, H. Ba, A. Zhang, and K. Wei, “Direct computation of the interaction matrix of adaptive optical system,” Acta Phys. Sin. 63, 149501 (2014).

    6. H. Xiaochuan, P. Jiaqi, and Z. Bin, “Thermal distortion of deformable mirror and its influence on beam quality,” Chin. J. Lasers 42, 45–53 (2015).

    Fig. 11. Mean of the SR after 100 aberrations are corrected.

    Fig. 12. Comparison result of convergence speed of AGD and GD for deep network.


    7. N. T. Gu, Z. P. Yang, L. H. Huang, and C. H. Rao, “Measurement method of alignment error between Hartmann-Shack sensor and deformable mirror in adaptive optics system,” J. Opt. 12, 095504 (2010).

    8. K. Cheon, J. Kim, M. Hamadache, and D. Lee, “On replacing PID controller with deep learning controller for DC motor system,” J. Automation Control Eng. 3, 452–456 (2015).

    9. S. Levine, “Exploring deep and recurrent architectures for optimal control,” arXiv:1311.1761 (2013).

    10. A. Punjani and P. Abbeel, “Deep learning helicopter dynamics models,” in IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2015), pp. 3223–3230.

    11. I. Lenz, R. Knepper, and A. Saxena, “DeepMPC: learning deep latent features for model predictive control,” in Robotics: Science and Systems (RSS), Rome, Italy (2015).

    12. C. W. Anderson, M. Lee, and D. L. Elliott, “Faster reinforcement learning after pretraining deep networks to predict state dynamics,” in International Joint Conference on Neural Networks (IJCNN) (IEEE, 2015), pp. 1–7.

    13. F. Cao, L. Z. Chen, C. Long, and Y. Li, “Analysis of the influence of spatial mismatch between deformable mirror and wavefront sensor on fitting accuracy,” Mod. Appl. Phys. 4, 5–8 (2013).

    14. H. Jing, J. Wenhan, and L. Ning, “The misalignment errors of Hartmann-Shack wavefront sensors and deformable mirror in the two kinds of adaptive optics systems,” J. Opt. 23, 750–755 (2003).

    15. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521, 436–444 (2015).

    16. W. Jiang, S. Huang, N. Ling, and X. Wu, “Hill-climbing wavefront correction system for large laser engineering,” Proc. SPIE 965, 266–272 (1988).

    17. R. K. Tyson, Principles of Adaptive Optics (Academic, 1991).

    18. T. R. O’Meara, “The multidither principle in adaptive optics,” J. Opt. Soc. Am. 67, 306–315 (1977).

    19. P. Yang, Y. Liu, M. Ao, S. Hu, and B. Xu, “A wavefront sensor-less adaptive optical system for a solid-state laser,” Opt. Commun. 278, 377–381 (2007).

    20. P. Yang, M. Ao, Y. Liu, B. Xu, and W. Jiang, “Intracavity transverse modes control by a genetic algorithm based on Zernike mode coefficients,” Opt. Express 15, 17051–17062 (2007).

    21. L. Dong, P. Yang, and B. Xu, “Adaptive aberration correction based on ant colony algorithm for solid-state lasers: numerical simulations,” Appl. Phys. B 96, 527–533 (2009).

    22. S. Zommer, E. N. Ribak, S. G. Lipson, and J. Adler, “Simulated annealing in ocular adaptive optics,” Opt. Lett. 31, 939–941 (2006).

    23. M. Vorontsov, J. Riker, G. Carhart, V. S. R. Gudimetla, L. Beresnev, T. Weyrauch, and L. C. Roberts, “Deep turbulence effects compensation experiments with a cascaded adaptive optics system using a 3.63 m telescope,” Appl. Opt. 48, A47–A57 (2009).

    24. Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k²),” Sov. Math. Dokl. 27, 372–376 (1983).

    25. J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res. 12, 2121–2159 (2011).

    26. W. Jang, S. Huang, and B. Xu, “Hill-climbing adaptive optics wavefront correction system,” Chin. Phys. Lasers 15, 27–31 (1988).

    27. https://github.com/rconan/OOMAO.
