Abstract
The prevalence of machine learning (ML) opens up new directions for many scientific fields, and the development of optics technologies also benefits from it. However, owing to the nonlinear and dynamic properties of optical systems, optical system control with ML is still in its infancy. In this manuscript, to demonstrate the feasibility of optical system control using reinforcement learning (RL), a branch of ML, we solve the linearization problem in frequency modulated continuous wave (FMCW) generation with a model-based RL method. The experimental results indicate an excellent improvement in the linearity of the generated FMCW, showing a sharp peak in the frequency spectrum. We confirm that the RL method learns the implicit physical characteristics well and accomplishes the goal of linear FMCW generation effectively, indicating that the marriage of ML and optical systems has the potential to open a new era for optical system control.
© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement
1. Introduction
Modern optics technology is undergoing a revolution from first-principles models to data-driven approaches. The photonics world, represented by complex physics equations, remains elusive, while the data produced by simulation and experiment is abundant and waiting to be mined for something valuable. Therefore, there is a trend of machine learning (ML) approaches being merged into the photonics world, demonstrated by optical microscopy [1], photonic inverse design [2], optical communication [3], and ultrafast optics [4]. However, compared with the aforementioned successful examples, the application of ML to optical system control is still in its infancy because interfering with and managing optical systems, with their nonlinear and dynamic properties, is even more complicated [5–7]. Is it possible to use ML as an alternative approach to accomplish optical system control successfully? In this manuscript, we accomplish this mission by generating a linear frequency modulated continuous wave (FMCW) for light detection and ranging (LiDAR) with reinforcement learning (RL), a branch of ML that is currently the state of the art for control.
FMCW LiDAR is a cutting-edge technology for high-precision detection, finding applications ranging from medical imaging, biometrics, and precision instrument manufacturing to surveying and mapping. Generally, the frequency-swept laser (FSL) is the core of an FMCW LiDAR system. With an FSL, the generated optical signal is split into a reference signal and a detection signal, which interfere with each other after passing through different optical paths. Based on the optical FMCW interference principle, the frequency of the beat signal is proportional to the detection distance for a stationary target. Unfortunately, the inherent nonlinearity of tunable lasers distorts the linear mapping between the frequency response of the FSL and the modulation signal, which causes spectral broadening and considerably degrades the accuracy of distance measurement. Consequently, laser nonlinearity is deemed one of the key factors limiting the development of FMCW LiDAR.
Traditional FMCW linearization is achieved mainly through either active correction of the FSL or passive linearization of the sampled data. Active methods mainly depend on an optical phase-locked loop (OPLL) [8,9] or other negative-feedback methods [10] to achieve real-time, highly precise linearization. Since active methods rely on a very complex optical system, many researchers have sought passive linearization methods in post-processing, such as frequency resampling [11] and iteration methods [12], to decrease system complexity. However, passive methods are underpinned by physical formulas and generally lack the ability to learn dynamic system characteristics. Faced with a complex dynamic system, it is quite difficult for them to obtain a stable control policy. Hence, a feasible and effective approach is needed to learn the dynamic characteristics of the system and improve the linearity of the FMCW.
As a branch of ML, RL is an emerging control method for optimization problems formalized as Markov decision processes (MDPs). With RL, the agent learns the characteristics of the environment in the process of interacting with it, and the neural network (NN) structure makes it powerful in optimizing the control policy. Therefore, compared with traditional methods, RL has advantages in capturing and learning dynamic characteristics, and the learned control policy can adapt to complex dynamic environments, as demonstrated by the games of Go and StarCraft [13,14], autonomous robots [15], and self-driving cars [16]. In this work, we present a model-based RL control method capable of generating a linear FMCW signal. We design a light-weight NN to ensure the efficiency of the method in physical experiments. A model of the FMCW generation system is used as the external environment to avoid possible damage to the system during the control process; it also improves training efficiency and reduces the requirements on computation and signal synchronization. In the training process, we apply a pre-training step to reach a good initial guess, improving the learning efficiency and stability of the control model. The proposed method controls the slope of the modulation signal of the FSL in continuous space under the real-time conditions of the system model. The experimental results confirm an effective improvement in the linearization and stability of the generated FMCW. We believe that the proposed model-based RL will enable a new research routine for optical system control.
2. Methodology
We divide the schematic diagram of the model-based RL control method into two main parts, as depicted in Fig. 1. The first part is the environment, consisting of the FMCW generation system and its model, and the second part is the RL agent, which optimizes the modulation-signal control of the FSL according to the real-time FMCW state. Given the current state $s_t$, the agent, guided by a policy $\pi \left (s\right )$, selects an action $a_t$ related to the modulation signal and interacts with the environment, which modifies its internal state and produces performance feedback. The agent then receives the updated internal state in the form of $s_{t+1}$ and a reward $r_t$ representing the performance feedback. In this section, we introduce the construction of the environment consisting of the FMCW generation system and its model, the RL agent, and the training process.
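The interaction loop described above can be sketched in a few lines of Python. `SystemModel` and `Agent` here are hypothetical toy stand-ins, not the authors' implementation: the environment simply drifts toward the action, and the reward penalizes deviation from a zero target.

```python
class SystemModel:
    """Toy stand-in for the FMCW generation-system model (assumed dynamics)."""
    def __init__(self, state=1.0):
        self.state = state

    def step(self, action):
        # Internal state drifts toward the action; the reward penalizes
        # deviation from the target state (zero, for simplicity).
        self.state += 0.3 * (action - self.state)
        reward = -abs(self.state)
        return self.state, reward


class Agent:
    """Placeholder policy pi(s): push the state toward zero."""
    def policy(self, state):
        return -state


env = SystemModel(state=1.0)
agent = Agent()
s_t = env.state
trajectory = []
for t in range(5):
    a_t = agent.policy(s_t)        # select action a_t from pi(s_t)
    s_next, r_t = env.step(a_t)    # environment returns s_{t+1} and r_t
    trajectory.append((s_t, a_t, r_t))
    s_t = s_next
```

Even with this trivial policy, the rewards improve step by step as the state is driven toward the target, which is the essence of the loop in Fig. 1(a).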
2.1 FMCW generation system and model
In terms of the environment, the FMCW generation system serves as the light source of the FMCW LiDAR system, as shown in Fig. 1(c). The FMCW generation system starts with an FSL driven by the scanning modulation signal to provide a chirped optical signal periodically. An isolator guarantees unidirectional transmission of the light, followed by an amplitude controller to finely equalize the output-power dynamics caused by the modulation signal and surrounding noise. A small part of the output is coupled into the FMCW generation system through the 10/90 fiber coupler 1. The chirped signal passes through a fiber Mach-Zehnder interferometer (MZI) with a delay time $\tau$. After interference, a photodetector (PD) receives the beat signal, and the photocurrent is sent to a field-programmable gate array (FPGA) as the measurement of the laser frequency sweep.
Since RL methods often require hundreds of thousands or even millions of training iterations to explore the optimal control policy, working directly with experimental data is quite expensive. Therefore, we utilize an RL algorithm with a simplified model of the system to enhance data efficiency, as shown in Fig. 1(d). In this way, the training of the RL agent is decoupled from the real environment, which avoids the influence of the agent on the actual physical system and fundamentally eliminates the risk of system damage during training. In particular, in our system the FSL is extremely sensitive to the modulation current: once the modulation current, corresponding to the action in the RL algorithm, exceeds the safe range of the FSL, the laser could be damaged immediately.
Based on the principle of optical FMCW interference and the nonlinear nature of the FSL, the frequency of the FMCW can be defined as
$$f\left( t \right) =f_0+F\left( u\left( t \right) \right),$$
where $f_0$ is the central frequency, and $F\left (\cdot \right )$ describes the nonlinear relationship between the laser frequency and the modulation signal $u\left ( t \right )$. In addition, assuming the optical delay between the MZI's arms is small enough, the detected photocurrent generated from the MZI is given as
$$i\left( t \right) \propto \cos \left[ 2\pi \tau f\left( t \right) +\varphi _0 \right],$$
where $\varphi _0$ is a constant phase term.

Based on the proposed model, we define the experience composed of states $s_t$, actions $a_t$, and rewards $r_t$ to formulate the linear FMCW generation problem as an MDP. In particular, because of the nonlinearity of the laser, the same observed beat-signal frequency may demand a different degree of correction, so it is not enough to represent the system state by relying only on the output of the model. Given the influencing factors of the nonlinearity, we introduce the modulation signal into the state description, and to enrich the system information contained in the state, we also introduce the frequency difference. Therefore, the final state $s_t$ is designed as
$$s_t=\left( f_{b,t},\, u_t,\, \Delta f_t \right),$$
where $f_{b,t}$ is the beat frequency observed from the model, $u_t$ is the modulation signal, and $\Delta f_t$ is the frequency difference.
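As a numerical illustration of why laser nonlinearity matters here: the beat frequency of the delayed self-interference is approximately $\tau \cdot \mathrm{d}f/\mathrm{d}t$, so any nonlinearity in $F\left( \cdot \right)$ appears as a time-varying beat frequency. The sketch below uses an assumed quadratic toy nonlinearity (not the laser's actual response), with $\tau = 5$ ns and a 58.5 GHz sweep over 3 ms as in the experiment of Section 3.

```python
tau = 5e-9    # MZI delay (s), value from the experiment in Section 3
T = 3e-3      # sweep duration of the region of interest (s)
B = 58.5e9    # frequency excursion over the ROI (Hz)
N = 3000      # evaluation points

def sweep_rate(F, t, dt=1e-7):
    """Instantaneous sweep rate dF/dt by central difference."""
    return (F(t + dt) - F(t - dt)) / (2 * dt)

linear_chirp = lambda t: B * t / T            # ideal linear sweep
toy_nonlinear = lambda t: B * (t / T) ** 2    # assumed toy nonlinearity

def beat_freqs(F):
    """Approximate beat frequency tau * dF/dt over the sweep."""
    ts = [T * (i + 0.5) / N for i in range(N)]
    return [tau * sweep_rate(F, t) for t in ts]

fb_lin = beat_freqs(linear_chirp)     # constant, ~97.5 kHz
fb_non = beat_freqs(toy_nonlinear)    # time-varying

def spread(fb):
    return max(fb) - min(fb)
```

For the linear chirp the beat frequency is constant at $B\tau /T \approx 97.5$ kHz, while the toy nonlinearity spreads it over roughly 200 kHz, mirroring the spectral broadening described above.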
2.2 Network model of model-based RL
As shown in Fig. 1(b), the RL agent controls the linear FMCW generation with an "actor-critic" architecture. Generally, deep RL methods based on actor-critic employ a pair of neural networks (NNs) trained with different objectives: the actor network optimizes the policy $\pi \left ( s \right )$, which determines the probability distribution of the state-to-action mapping, while the critic network optimizes the state-action value function $Q\left ( s,a \right )$, representing the discounted cumulative reward of the state-action pair $\left ( s,a \right )$ under the current policy $\pi$.
where $R_t=\sum _{t'=t}^T{\gamma _r ^{t'-t}r_{t'}}$ describes the discounted accumulation of future rewards of a state with the reward discount factor $\gamma _r$, here set to $0.8$. The optimal action-value function is defined as $Q ^{*}\left ( s_t,a_t \right )$, and the Bellman equation is an important identity for it:
$$Q^{*}\left( s_t,a_t \right) =\mathbb{E}\left[ r_t+\gamma _r\max_{a_{t+1}}{Q^{*}\left( s_{t+1},a_{t+1} \right)} \right].$$

Furthermore, we introduce auxiliary NNs, namely target networks, which are copies of the previous NNs, to reduce the network shock caused by the $Q\left ( s,a \right )$ update during the training process, as shown in Fig. 1(b). The actor evaluation NN and the actor target NN are parameterized as $\mu \left ( s|\theta ^{\mu } \right )$ and $\mu ' \left ( s|\theta ^{\mu '} \right )$; the critic evaluation NN and the critic target NN are parameterized as $Q \left ( s, a|\theta ^{Q} \right )$ and $Q' \left ( s, a|\theta ^{Q'} \right )$. The structures of the two kinds of NNs are shown in Fig. 1(e). The actor NN is composed of three fully connected layers. The neuron numbers of the input and output layers correspond to the dimensions of $s_t$ and $a_t$, respectively, while that of the hidden layer is a hyper-parameter. The input layer and the hidden layer are followed by a rectified linear unit (ReLU) and a hyperbolic tangent function, respectively, as activation functions. The critic NN is likewise composed of three fully connected layers. The neuron number of the input layer is the sum of the dimensions of $s_t$ and $a_t$, and that of the output layer is $1$. As in the actor NN, the neuron number of the hidden layer is a hyper-parameter. Except for the final layer, each layer is followed by a ReLU activation function.
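A minimal PyTorch sketch of one reading of the network description above. The exact layer depth and the state/action dimensions are assumptions for illustration; the hidden width of 64 matches the hyper-parameter chosen later in Section 3.1.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 3, 1, 64  # assumed dims; HIDDEN=64 per Sec. 3.1

class Actor(nn.Module):
    """State -> bounded action; ReLU after the input layer, tanh at the output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Concatenated (s, a) -> scalar Q-value; ReLU on all but the final layer."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```

The tanh output keeps the action bounded, which is convenient when the action maps to a physically limited modulation-current adjustment.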
As for the goal of the agent, we introduce the behavior objective $J\left ( \theta ^\mu \right )=\mathbb {E}\left [ R_t \right ]$. The actor evaluation NN is updated by the gradient of the behavior objective:
$$\nabla _{\theta ^{\mu}}J\approx \mathbb{E}\left[ \nabla _aQ\left( s,a|\theta ^Q \right) |_{a=\mu \left( s \right)}\nabla _{\theta ^{\mu}}\mu \left( s|\theta ^{\mu} \right) \right].$$
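The updates above follow the deep deterministic policy gradient style of [16]: the critic is regressed onto the Bellman target built from the target networks, the actor ascends $Q\left( s,\mu \left( s \right) \right)$, and the target networks track the evaluation networks through a soft update. A compact sketch, where `TAU` and the function names are assumptions rather than the authors' code:

```python
import torch
import torch.nn.functional as F

GAMMA_R = 0.8   # reward discount factor, as set in the text
TAU = 0.01      # soft-update rate (an assumed value)

def critic_loss(critic, target_actor, target_critic, batch):
    """TD error against the Bellman target r + gamma * Q'(s', mu'(s'))."""
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + GAMMA_R * target_critic(s_next, target_actor(s_next))
    return F.mse_loss(critic(s, a), y)

def actor_loss(actor, critic, s):
    """Ascend Q(s, mu(s)) by descending its negative."""
    return -critic(s, actor(s)).mean()

def soft_update(target, source):
    """theta' <- TAU * theta + (1 - TAU) * theta'."""
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1 - TAU).add_(TAU * p.data)
```

The soft update is what keeps the target networks slowly varying, reducing the "network shock" mentioned above.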
2.3 Training method
The training procedure of the proposed model-based RL consists of episodes, each of which starts with a random state $s_t$ and continues for 3000 time steps corresponding to the whole modulation period. Before the formal training, we design a pre-training step to initialize the parameters of the actor and critic NNs, with the purpose of accelerating the optimization process and improving network stability. The pre-training process comprises 500 episodes, each lasting 200 time steps corresponding to the strongly nonlinear region of the modulation period; the weights and biases of the NNs are initialized randomly at this stage. In addition, we introduce a replay buffer into the algorithm, as in standard RL, since the data explored from the environment has an inconsistent distribution and strong correlations. At the beginning of training, the agent takes actions based on a random policy and the corresponding experiences are sent to the replay buffer; the actor and critic NNs are not trained until the replay buffer has accumulated enough experience. During training, to maximize the reward, the agent tends to repeat actions tried in the past that yielded good rewards. However, to find such actions, the agent has to try actions not previously selected. This is the exploration-exploitation trade-off. To manage this trade-off and avoid being stuck in a locally optimal policy, we add normally distributed action noise to the actions produced by the actor network. The variance of the action noise, ${V}_a$, is discounted by the factor $\gamma _a = 0.9999$, and its final value, i.e., the exploration rate (ER), is a hyper-parameter. We summarize the training process of the model-based RL algorithm as Algorithm 1.
3. Results and discussion
As a proof of concept, we establish the FMCW generation experiment system as the real environment, which employs a commercial distributed feedback (DFB) laser with a 1550 nm center wavelength. The optical frequency is modulated by a time-variant signal $u\left (t\right )$ superimposed on a 300 mA DC bias. The laser output is sent to an MZI with a relative delay $\tau =5$ ns, and the beat signal is received by a PD (PDA10CS-EC, Thorlabs). The data is fed into an FPGA (ALTERA CYCLONE IV EP4CE6F17C8, ALINX), and the model-based RL runs on a laptop computer (MI Air 13.3). The modulation signal is a sawtooth waveform with a 200 Hz frequency ($T_m=5$ ms). During the linearization process, a 3 ms region of interest (ROI), i.e., $60\%$ of the period, is used to evaluate the linearity. The corresponding laser frequency excursion is around $B=58.5$ GHz in each ROI. According to Eq. (7), we calculate the beat frequency $f_{B}$ as the goal of the control, and the result is $97.5$ kHz.
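The quoted control target can be checked directly from the stated parameters: for an ideal linear sweep of $B$ over the ROI duration, the beat frequency is $f_B=\left( B/T_{\mathrm{ROI}} \right)\tau$.

```python
# Checking the beat-frequency target from the experimental parameters above.
B = 58.5e9      # laser frequency excursion over the ROI (Hz)
T_ROI = 3e-3    # region of interest (s)
tau = 5e-9      # MZI relative delay (s)

f_B = (B / T_ROI) * tau   # sweep rate times delay -> 97.5 kHz
```

This reproduces the 97.5 kHz goal stated in the text.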
Before the training process, we build the model of the environment based on the experimental data. The entire training process is carried out in the model instead of the real experimental environment, which enhances data efficiency effectively. We list all the hyper-parameters used in the training process in Table 1. For the hyper-parameters affecting the linearization significantly, we provide the corresponding analysis in Section 3.1. The convergence of the NN is shown in Section 3.2. After the training process, we apply the well-trained agent to the experiment system. The performance of the agent is reported in Section 3.3, with a classical iteration method introduced as a comparison.
The simulation platform is built on the Windows 10 operating system with Python 3.6.13 and PyTorch. The CPU is an Intel i5 processor with a base frequency of 1.6 GHz, the GPU is an NVIDIA GeForce MX150, and the memory is 8 GB; the GPU is used to speed up the training process.
3.1 Parameter setting
To ensure the model-based RL network achieves the best performance, we investigate the effects of different hyper-parameters, including the size of the replay buffer (SRB), the number of hidden neurons (HN), the learning rates (LR), and the ER, as shown in Fig. 2. Figure 2(a) shows the performance curves of the agent with different SRB. The replay buffer stores the experience obtained during the training process, from which batches of samples are randomly drawn to train the NNs. Thus, the buffer size affects the mean reward. The smaller the size, the greater the impact of new data on the sample distribution, which means the buffering effect is worse and results in network instability during training. As shown in Fig. 2(a), when the SRB equals 5e3, the fluctuation of the reward curve is apparent. However, an excessive buffer size increases the resource occupancy and reduces data efficiency, leading to slower network convergence; among the reward curves in Fig. 2(a), the final reward is smaller than the others when the SRB equals 7e4. We therefore set the SRB to 5e4. Figure 2(b) shows the influence of different numbers of HN on network training. Generally, a larger number of HN means a more powerful fitting ability to learn abundant features. After 300 episodes, the average reward curves of the model-based RL gradually converge. As the number of HN increases, the average reward improves, and for networks with 64 and 128 HN the average rewards are close to −0.016, obviously better than the other settings. Thus, further considering the network complexity and the possibility of over-fitting, we set the HN to 64 to construct a light-weight NN. Furthermore, we investigate the effect of the LR on network convergence. As shown in Fig. 2(c), after 300 episodes, networks with different LR reach a converged state. The LR determines the step size in the optimization of the loss function.
A higher LR indicates a greater influence of new experience on the network training, which may lead to network divergence. As shown in the detailed view of episodes 350-500, when the LR of the critic NNs (LR-c) equals 0.01 and the LR of the actor NNs (LR-a) equals 0.001, the average reward curve shows obvious fluctuations compared with the other settings. However, a smaller learning rate reduces the convergence efficiency of the network, as shown by the curve in Fig. 2(c) with LR-c equal to $0.0001$ and LR-a equal to $0.001$. Considering the overall performance of the network under different LR-c and LR-a, we set both LR-c and LR-a to 0.001. Moreover, the ER needs to be specified to determine the trade-off between exploration and exploitation during training. A higher ER indicates a greater probability that the agent chooses its action at random rather than from explored experience, which may yield either higher or lower rewards; conversely, with too low an ER, the optimal policy may be missed. As shown in Fig. 2(d), the network performs best when the ER equals $0.001$, so the ER is set to 0.001.
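For reference, the hyper-parameter choices arrived at in this subsection can be gathered into a single configuration dictionary (a convenience sketch; the full list appears in Table 1, and the key names here are our own):

```python
# Hyper-parameters selected in Section 3.1 and earlier in the text.
config = {
    "replay_buffer_size": 50_000,  # SRB
    "hidden_neurons": 64,          # HN
    "lr_actor": 1e-3,              # LR-a
    "lr_critic": 1e-3,             # LR-c
    "exploration_rate": 1e-3,      # ER (final action-noise variance)
    "reward_discount": 0.8,        # gamma_r, from Section 2.2
    "noise_discount": 0.9999,      # gamma_a, from Section 2.3
}
```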
3.2 Convergence of model-based RL network
After the training process, we plot the training reward of the model-based RL against the training episodes, as shown in Fig. 3, where the solid line is the average over multiple training runs and the shaded area indicates the $95\%$ confidence interval. According to Eq. (9), a greater reward means better linearity of the FMCW generated under the modulation-signal control policy. Clearly, in the first episodes the average reward is very small, indicating that the beat-signal frequency deviates greatly from the ideal value due to the significant sweep nonlinearity of the model. Moreover, since the action noise strongly influences action selection in the early training period, a large proportion of bad data with low rewards is stored in the replay buffer. Therefore, the orange line in Fig. 3 shows that, even after pre-training, the networks still learn a bad policy in the early stage of formal training, causing the control effect to deteriorate. But as training proceeds, the action noise decreases rapidly and the samples are updated, improving the reward quickly from −1.17 to −0.0157 within about 30 episodes. Although the average reward curve shows slight fluctuations after convergence, the agent finds the policy for linear FMCW generation by the end of training. In contrast, the blue line in Fig. 3 shows that the algorithm without pre-training diverges from the one with pre-training after about 100 episodes; in this situation, the agent cannot perform appropriate actions to generate a linear FMCW. Considering that the nonlinearity differs across different parts of the modulation period, the pre-training process provides more opportunities for the agent to learn about the strongly nonlinear parts, which means the formal training process begins with a better initial policy.
Therefore, during the formal training process, the chance of exploring worse experiences is greatly reduced, and the proportion of experiences in the replay buffer is shifted toward those beneficial to the convergence of the network. The reward curve demonstrates that the designed agent structure is capable of learning the nonlinear characteristics while interacting with the model and of finding an efficient policy for optimization control. Compared with training on the whole period directly, the pre-training step in our scheme confirms its validity in accelerating network convergence.
3.3 Linearization performance
We apply the modulation signals obtained from the proposed model-based RL and from the iterative method to drive the FSL in the FMCW generation experiment system, and then investigate the experimental spectra of the beat signals with and without the control algorithms. Figure 4 shows the frequency vs. time curves of the beat signals. Initially, a linear modulation signal is employed to drive the laser, and the beat frequency would be time-invariant if the system were linear. As shown in Fig. 4, the extracted frequency without control is time-variant with a large fluctuation range, indicating that the system is highly nonlinear. In contrast, the fluctuation range of the frequency is greatly reduced by our model-based RL, indicating an improvement in the linearity of the system. To evaluate the control capacity, we define the linearity as:
4. Conclusion
In conclusion, this work investigated the linearization of an FMCW LiDAR system using our model-based RL. We constructed a light-weight NN to guarantee the real-time performance of the system, and the established system model provided a more convenient way to obtain data, improving the network optimization efficiency and the system safety during training. From the limited experimental data embodied in the model, the proposed agent learned the implicit physical model of the system and controlled the system to generate the desired linear frequency sweep. This demonstrates that RL, as a branch of ML, has the potential to control complicated optical systems toward different optimization goals in a wide range of application scenarios. However, from the perspective of practical scenarios, the FMCW generated under model-based RL control is not perfectly linear, probably because the precision-limited model provides somewhat indistinct feedback to the agent. The development of a dynamic system model with complete statistical regularities via sufficient data analysis is therefore a direction for future research.
Funding
Special Science Foundation of Quzhou (2021D010, 2021D017); Yangzhou Lvyang Jinfeng Grant 2019; Funding Scheme of Sichuan Province to Outstanding Scientific and Technological Programs by Chinese Students Abroad 2019.
Disclosures
The authors declare no conflicts of interest.
Data availability
Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.
References
1. Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica 4(11), 1437–1443 (2017). [CrossRef]
2. Z. Liu, D. Zhu, L. Raju, and W. Cai, “Tackling photonic inverse design with machine learning,” Adv. Sci. 8(5), 2002923 (2021). [CrossRef]
3. B. Lusch, J. N. Kutz, and S. L. Brunton, “Deep learning for universal linear embeddings of nonlinear dynamics,” Nat. Commun. 9(1), 4950 (2018). [CrossRef]
4. G. Genty, L. Salmela, J. M. Dudley, D. Brunner, A. Kokhanovskiy, S. Kobtsev, and S. K. Turitsyn, “Machine learning and applications in ultrafast photonics,” Nat. Photonics 15(2), 91–101 (2021). [CrossRef]
5. C. M. Valensise, A. Giuseppi, G. Cerullo, and D. Polli, “Deep reinforcement learning control of white-light continuum generation,” Optica 8(2), 239–242 (2021). [CrossRef]
6. J. Nousiainen, C. Rajani, M. Kasper, and T. Helin, “Adaptive optics control using model-based reinforcement learning,” Opt. Express 29(10), 15327–15344 (2021). [CrossRef]
7. H. Tünnermann and A. Shirakawa, “Deep reinforcement learning for coherent beam combining applications,” Opt. Express 27(17), 24223–24230 (2019). [CrossRef]
8. C. Lu, Y. Xiang, Y. Gan, B. Liu, F. Chen, X. Liu, and G. Liu, “FSI-based non-cooperative target absolute distance measurement method using PLL correction for the influence of a nonlinear clock,” Opt. Lett. 43(9), 2098–2101 (2018). [CrossRef]
9. Y. Feng, W. Xie, Y. Meng, L. Zhang, Z. Liu, W. Wei, and Y. Dong, “High-performance optical frequency-domain reflectometry based on high-order optical phase-locking-assisted chirp optimization,” J. Lightwave Technol. 38(22), 6227–6236 (2020). [CrossRef]
10. H. Tsuchida, “Waveform measurement technique for phase/frequency-modulated lights based on self-heterodyne interferometry,” Opt. Express 25(5), 4793–4799 (2017). [CrossRef]
11. G. Shi, F. Zhang, X.-H. Qu, and X. Meng, “High-resolution frequency-modulated continuous-wave laser ranging for precision distance metrology applications,” Opt. Eng. 53(12), 122402 (2014). [CrossRef]
12. X. Zhang, J. Pouls, and M. C. Wu, “Laser frequency sweep linearization by iterative learning pre-distortion for FMCW LiDAR,” Opt. Express 27(7), 9965–9974 (2019). [CrossRef]
13. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature 550(7676), 354–359 (2017). [CrossRef]
14. O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature 575(7782), 350–354 (2019). [CrossRef]
15. X. Da, Z. Xie, D. Hoeller, B. Boots, A. Anandkumar, Y. Zhu, B. Babich, and A. Garg, “Learning a contact-adaptive controller for robust, efficient legged locomotion,” arXiv preprint arXiv:2009.10019 (2020).
16. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971 (2015).
17. K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” Adv. Neural Inf. Process. Syst. 31 (2018).