Linearization of nonlinear frequency modulated continuous wave generation using model-based reinforcement learning

Haohao Zhao; Haohao Zhao; Guohui Yuan; Guohui Yuan; Jian Xiao; Jian Xiao; Junfeng Li; Junfeng Li; Hai Zhang; Hai Zhang; Kai Fang; Zhuoran Wang; Zhuoran Wang; Zhuoran Wang

doi:10.1364/OE.458924

1. Introduction

Modern optics technology is undergoing a revolution from first-principles models to data-driven approaches. The photonics world represented by complex physics equations remains elusive and data produced by simulation and experiments is abundant and waiting for mining something valuable. Therefore, there is a trend that some approaches of machine learning (ML) have been merged into the photonics world, demonstrated by optical microscopy [1], photonic inverse design [2], optical communication [3] and ultrafast optics [4]. However, compared with the aforementioned successful examples, the application of optical system control is still in its infancy because interfering and managing optical systems with the properties of nonlinear and dynamic is even more complicated [5–7]. Is there any possible to use ML as a potential alternative approach to accomplish optical system control successfully? In this manuscript, we accomplish this mission by generating linear frequency modulated continuous wave (FMCW) in light detection and ranging (LiDAR) from the perspective of the reinforcement learning (RL) method, which is currently the state of the art for control as a branch of ML.

FMCW LiDAR is the cutting-edge technology to achieve high-precision detection finding many applications ranging from medical imaging, biometric, and precision instrument manufacture to surveying and mapping. Generally, how to realize a frequency-swept laser (FSL) is the core of FMCW LiDAR system. With an FSL, the generated optical signal is split into the reference and the detection signals, and they interfere with each other after passing through different optical paths. Based on the optical FMCW interference principle, the frequency of the beat signal is proportional to the detection distance in the stationary target detection. Unfortunately, the inherent nonlinearity of tunable lasers twists the linear mapping relation between the frequency response of the FSL and the modulation signal, which causes spectral expansion and degradation of the accuracy of distance measurement considerably. Consequently, the laser nonlinearity is always deemed as one of the key factors limiting the development of FMCW LiDAR.

Traditional FMCW linearization can be achieved mainly through either active correction of the FSL or passive linearization of the sampled data. The active methods mainly depend on the optical phase-locked loop (OPLL) [8,9] or other negative feedback methods [10] to achieve real-time and high precise linearization. Since the active methods rely on a very complex optical system, many researchers have been seeking passive linearization methods in post-processing, such as frequency resampling [11] and iteration methods [12], to decrease the system complexity effectively. However, passive methods are underpinned by physical formulas being short of the ability to learn dynamic system characteristics generally. Faced with the complex dynamic system, it would be quite difficult to obtain a stable control policy. Hence, a feasible and effective approach is demanded to learn the dynamic characteristics in the system to improve the linearity of FMCW effectively.

As a branch of ML, RL is an emerging control method for the optimization problem formalized as a Markov decision process (MDP). With the help of RL, the agent learns the characteristics of the environment in the process of interacting with the environment. And the neural network (NN) structure makes it powerful in optimizing the control policy. Therefore, compared with traditional methods, RL has more advantages in capturing and learning dynamic characteristics, and the learned control policy can adapt to the complex dynamic environment, such as the games of GO and StarCraft [13,14], autonomous robots [15], and self-driving cars [16]. In this work we present a model-based RL control method capable of generating a linear FMCW signal. We design a light-weight NN in the method to ensure the efficiency of the method in physical experiments. The model based on the FMCW generation system is used as the external environment to avoid possible damage to the system during the control process. And it is also beneficial to improve training efficiency and reduce the requirements in computation and signal synchronization. In the training process, we apply the pre-training process to approach a good initial guess to improve the learning efficiency and stability of the control model. The proposed method controls the slope of the modulation signal of the FSL in the continuous space under the real-time conditions of the system model. The experimental results confirm effective improvement in linearization and stability of the generated FMCW. We believe that the proposed model-based RL will enable a new research routine for optical system control.

2. Methodology

We divide the schematic diagram of the model-based RL control method into two main parts, as depicted in Fig. 1. The first part is the environment consisting of the FMCW generation system and its model, and the second part is the RL agent optimizing the modulation signal control of the FSL according to the real-time FMCW state. With the current state $s_t$, the agent guided by a policy $\pi \left (s\right )$ selects an action $a_t$ related to the modulation signal, and interacts with the environment to modify its internal state and performance feedback. Then, the agent receives the updated internal state in the form of $s_{t+1}$ and a reward $r_t$ representing the performance feedback. In this section, we introduce the construction of the environment consisting of the FMCW generation system and its model, the RL agent, and the training process.

Fig. 1. Schematic diagram of the (a) environment where (d) system model is established based on (c) FMCW generation system in FMCW LiDAR system, the (b) RL agent, and the (e) neural network structure of RL agent. RL, reinforcement learning; Opt., optimizer; NN, neural network; Eval., evaluation; MZI, Mach-Zehnder interferometer; PD, photodetector; FPGA, field-programmable gate array.

Download Full Size | PDF

2.1 FMCW generation system and model

In terms of the environment, the FMCW generation system serves as the light source in the FMCW LiDAR system as shown in Fig. 1(c). The FMCW generation system starts with an FSL driven by the scanning modulation signal to provide a chirp optical signal periodically. An isolator guarantees the unidirectional transmission of the light, followed by an amplitude controller in order to finely equalize the dynamic of output power caused by the modulation signal and surrounding noise. A small part of the output is coupled into the FMCW generation system through the 10/90 fiber coupler 1. The chirp signal passes through a fiber Mach-Zehnder interferometer (MZI) with a delay time $\tau$. After interference, a photodetector (PD) receives the beat signal, and the photocurrent is sent to a field-programmable gate array (FPGA) as the measurement of the laser frequency sweep.

Since RL methods often require hundreds of thousands or even millions of times of training to explore the optimal control policy, working directly with experiment data is quite expensive. Therefore, we utilize a RL algorithm with a simplified model of the system to enhance the data efficiency, as shown in Fig. 1(d). In this way, the training of RL is not related to the real environment, which avoids the influence of the agent on the actual physical system and fundamentally eliminates the risk of system damage during the training process. Particularly, in our system, the FSL is extremely sensitive to the modulated current. Once the modulated current, corresponding to the action in RL algorithm, exceeds the safe range of the FSL, the laser could be damaged immediately.

Based on the principle of optical FMCW interference and the nonlinear nature of the FSL, the frequency of FMCW can be defined as:

(1)$$f\left( t \right) =f_0+F\left( u\left( t \right) \right),$$

where $f_0$ is the central frequency, and $F\left (\cdot \right )$ describes the nonlinear relationship between the laser frequency vs. the modulation signal $u\left ( t \right )$. In addition, assuming the optical delay of MZI’s arms is small enough, the detected photocurrent generated from the MZI is given as

(2)$$I\left( \tau ,t \right) \propto cos\left( \phi\left( t \right) -\phi\left( t-\tau \right) \right) \approx cos\left( \tau \frac{d\phi\left( t \right)}{dt}\right) =cos \left(\tau f\left(t\right)\right) =cos\left( \phi _b \left(t\right)\right),$$

where $\phi \left ( t \right )$, $\phi \left ( t-\tau \right )$, $\phi _b\left ( t \right )$ are the phase of the reference signal, detection signal and beat signal separately. In the data processing step shown in Fig. 1, the phase of the beat signal is extracted via a Hilbert Transform (HT). Then, the frequency of the beat signal $f_b\left ( t \right )$ can be obtained as

(3)$$f_b\left( t \right) =\frac{d\phi _b\left( t \right)}{dt}=\tau \frac{df\left( t \right)}{dt}=\tau \frac{du\left( t \right)}{dt}F^{'}\left( u \right) =G\left( u \right) \zeta \left( t \right),$$

where $G\left ( u \right ) =F^{'}\left ( u \right ) \tau$ and $\zeta \left ( t \right ) = \frac {du\left ( t \right )}{dt}$ is the slope function of the modulation signal. As the frequency sweep is nonlinear, the beat signal frequency $f_b\left ( t \right )$ is time-varying. Furthermore, $G\left ( u \right )$ can be used as the system characteristic curve to describe the essential relationship between the beat signal and the modulation signal. With the slope function of modulation signal $\zeta \left ( t \right )$ and the corresponding $f_b\left ( t \right )$, the current $G\left ( u \right )$ can be determined using Eq. (3) in the data process step. This fixed $G\left ( u \right )$ is used to map the nonlinearity property of the system. Furthermore, as the phase noise of lasers obeys the Wiener distribution, we assume the frequency noise is independent and white, following a normal probability distribution. Then, for any given slope of the modulation signal, the beat signal frequency simulated by the model $f_{b,m}\left ( t \right )$ can be expressed as

(4)$$f_{b,m}\left( t \right) =\zeta \left( t \right) G\left( u \right) +n\left( t \right),$$

where $n\left ( t \right )$ is the frequency noise. The established model is able to provide the system state as the environmental feedback to meet the requirement of RL and does not rely on complex physics equations. And the agent is trained to find the slope of the modulation signal leading to the linear FSL frequency sweep in the model. Once the linearization operation is accomplished successfully, the beat signal frequency $f_{b}\left (t\right )$ equals to a constant $f_{B}$ and the target distance $D$ can be inverted as

(5)$$D=\frac{1}{2}c\tau_D =\frac{c}{2 \xi}f_{B},$$

where $\xi =\frac {df\left ( t \right )}{dt}=\frac {B}{T_m}$ denotes the FSL frequency sweep rate. And $T_m$ is the modulation period, $B$ is the bandwidth in the period, $c$ is the speed of light in free space, $\tau _D$ is the propagation time corresponding to $D$. Furthermore, the distance resolution is represented as

(6)$$\delta D = \delta f_{B} \cdot \frac{c}{2 \xi},$$

where $\delta f_{B}$ is the spectral peak width of the beat signal. Notably, based on Eq. (3) and Eq. (5), we can also express the beat signal frequency in the ideal condition with the known $\xi$ and $\tau$:

(7)$$f_{b,i} =\xi \tau,$$

where $f_{b,i}$ is the ideal beat signal frequency, which equals to $f_{B}$ when the perfect linearization is achieved.

Based on the proposed model, we define the experience composed by states $s_t$, actions $a_t$, and rewards $r_t$ to formulate the linear FMCW generation problem as an MDP. Especially, because of the nonlinearity of the laser, the same observed beat signal frequency may demand a different degree of correction. Thus, it is not enough to represent the system state relying only on the output of the model. Given the influencing factors of nonlinearity, we introduce the modulation signal as part of the state description. To enrich the system information obtained in the state, we invite the frequency difference into the state description. Therefore, the final state $s_t$ is designed as

(8)$$s_t=normalization\left[\left[ u\left( t \right) ,f_{b,m}\left( t \right) ,f_{b,m}\left( t \right) -f_{b,m}\left( t-1 \right) \right] \right],$$

where $normalization\left [x \right ]=\left (x-x_{min} \right )/\left (x_{max}-x_{min} \right )$ is the normalization function. $x_{max}$ and $x_{min}$ are the minimum and the maximum of $x$ in all samples. When $x$ is a vector, the function is applied to each dimension of the vector. In addition, we define the slope of the modulation signal as the optional action $a_t$ in the continuous action space. Once the action is determined by the agent, the modulation signal will be calculated. With the known $G\left ( u \right )$, the beat signal frequency can be determined based on the mapping relationship in Eq. (4). In order to improve the optimization efficiency, we expect the beat signal frequency to be as close to the ideal frequency as possible. Therefore, we define the reward function as follows

(9)$$r_t={-}normalization\left[ |f_{b,m}\left( t \right) -f_{b,i} | \right].$$

Equation (9) indicates that the closer the beat signal frequency is to the ideal frequency, the greater the reward is. And when the beat signal frequency achieves the ideal frequency, it will obtain the maximum reward of 0.

2.2 Network model of model-based RL

As shown in Fig. 1(b), the RL agent is adopted to control the linear FMCW generation with an "actor-critic" architecture. Generally, deep RL methods based on actor-critic employ a pair of neural networks (NNs) trained with different objectives: the actor network optimizes the policy $\pi \left ( s \right )$ which determines the probability distribution of state-to-action mapping; while the critic network optimizes the objective state-action value function $Q\left ( s,a \right )$, representing the discount cumulative reward of the state-action pair $\left ( s,a \right )$ following the current policy $\pi$.

(10)$$Q\left( s_t,a_t \right) =\mathbb{E}\left[ R_t|s_t,a_t \right],$$

where $R_t=\sum _{t^{'}=t}^T{\gamma _r ^{t^{'}-t}r_{t^{'}}}$ describes the discount accumulation for future rewards of a state with the discount factor of the reward $\gamma _ r$. Here, it is set as $0.8$. And the optimal action-value function is defined as $Q ^{*}\left ( s_t,a_t \right )$. Bellman equation is an important identity of it, which is represented as

(11)$$Q^{*}\left( s_t,a_t \right) =\mathbb{E}\left[r\left( s_t,a_t \right) +\gamma _r max_{a_{t+1}} Q^{*}\left( s_{t+1},a_{t+1} \right) \right].$$

The basic idea behind the critic network is to estimate $Q\left ( s,a \right )$ by using the Bellman equation as an iterative formula, $Q_{i+1}\left ( s_t,a_t \right ) =\mathbb {E}\left [r\left ( s_t,a_t \right ) +\gamma _r max_{a_{t+1}} Q_{i}\left ( s_{t+1},a_{t+1} \right ) \right ]$, where $i$ is the number of iterations.

Furthermore, we introduce auxiliary NNs namely target networks, the copies of the previous NNs, to reduce the network shock caused by the $Q\left ( s,a \right )$ update during the training process, as shown in Fig. 1(b). The actor evaluation NN and the actor target NN are parameterized as $\mu \left ( s|\theta ^{\mu } \right )$ and $\mu ^{'} \left ( s|\theta ^{\mu ^{'}} \right )$. The critic evaluation NN and the critic target NN are parameterized as $Q \left ( s, a|\theta ^{Q} \right )$ and $Q^{'} \left ( s, a|\theta ^{Q^{'}} \right )$. The network structures of two kinds of NNs are shown in Fig. 1(e). The actor NN is composed of three fully connected layers. The neuron numbers of the input and the output layers are related to dimensions of $s_t$ and $a_t$ separately, while that of the hidden layer is a hyper-parameter. The input layer and the hidden layer are followed by a rectified linear unit (ReLU) and a hyperbolic tangent function as the activation function separately. Considering the critic NN, it is composed of three full-connect layers. The neuron number of the input layer is the sum of dimensions of $s_t$ and $a_t$, and that of the output layer is $1$. The same as the actor NN, the neuron number of the hidden layer is a hyper-parameter. Except for the final layer, other layers are followed by a ReLU function as activation functions.

So as to the goal of the agent, we invite the behavior objective as $J\left ( \theta ^\mu \right )=\mathbb {E}\left [ R_t \right ]$. And the actor evaluation NN is updated by the gradient of the behavior objective:

(12)$$\nabla _{\theta ^{\mu}}J=\mathbb{E}\left[ \nabla _{\theta ^{\mu}}Q\left( s,a|\theta ^Q \right) |s=s_t,a=\mu \left( s_t|\theta ^{\mu} \right) \right].$$

While the critic evaluation NN is optimized by the loss function:

(13)$$L\left( \theta ^Q \right) =\mathbb{E}\left[ \left( Q\left( s_t,a_t|\theta ^Q \right) -y_t \right) ^2 \right],$$

where $y_t=r\left ( s_t,a_t \right ) +\gamma _r \left [ Q^{'}\left ( s_{t+1},\mu ^{'}\left ( s_{t+1}|\theta ^{\mu ^{'}} \right ) |\theta ^{Q^{'}} \right ) \right ]$. The Adam optimizer is employed. For the target NNs, the parameters are updated by

(14)$$\theta ^{'}\gets k \theta +\left( 1-k \right) \theta ^{'},$$

where $\theta$ and $\theta ^{'}$ represent the parameters of the evaluation NNs and the target NNs, separately, and $k=0.01$ represents the update rate.

Algorithm 1. Model-based RL

View Table | View all tables in this article

2.3 Training method

The training procedure of the proposed model-based RL consists of episodes of training, each of which starts with a random state $s_t$ and continues for 3000 time steps corresponding to the whole modulation period. Especially, before the formal training, we design a pre-training step to initialize the parameters of actor and critic NNs for the purpose of accelerating the optimization process and improving the network stability. The pre-training process includes 500 episodes and lasts for 200 time steps related to the strong nonlinear region in the modulation period. The weights and biases of the NNs are initialized randomly in the pre-training process. In addition, we introduce a replay buffer in the algorithm as in the standard RL since the data explored from the environment has inconsistent distribution and strong correlation. At the beginning of training, the agent adopts actions based on a random policy and the corresponding experiences are sent to the replay buffer. The actor and critic NNs will not be trained until the replay buffer accumulates enough experience. During the training process, to maximize the reward, the agent tends to repeat actions tried in the past and received good rewards. However, in order to find these actions, the agent has to try out different actions not previously selected. This is called the exploration-exploitation trade-off. To guarantee the exploration-exploitation trade-off and avoid the risk of being stuck in a local optimal policy, we introduce the action noise that obeys the normal distribution to the actions obtained by the actor network. The variance of the action noise, ${V}_a$, is updated by the discount factor $\gamma _a = 0.9999$. And the final variance, i.e., the exploration rate (ER), is a hyper-parameter. We summarize the training process of the model-based RL algorithm as Algorithm 1.

3. Results and discussion

As a proof of concept, we establish the FMCW generation experiment system as the real environment, which employs a commercial distributed feedback (DFB) laser with a 1550nm center wavelength. The optical frequency is modulated by a time-variant signal $u\left (t\right )$ superimposed on a 300mA DC bias. The laser output is sent to a MZI with a relative delay $\tau =5$ns, and the beat signal is received by a PD (PDA10CS-EC, Thorlab). The data is fed into through a FPGA (ALTERA CYCLONE IV EP4CE6F17C8, ALINX), and the model-based RL is produced by a laptop computer (MI Air 13.3). The modulation signal is a sawtooth waveform with 200Hz frequency ($T_m=5$ms). During the linearization process, a 3ms region of interest (ROI), i.e., $60\%$ of the period, is used to evaluate the linearity. The corresponding laser frequency excursion is around $B=58.5$GHz in each ROI. According to the Eq. (7), we calculate the beat frequency $f_{B}$ as the goal of the control, and the result is $97.5$kHz.

Before the training process, we build the model of the environment based the experimental data. The entire training process is done in the model instead of the real experimental environment to enhance the data efficiency effectively. We list all the hyper-parameters used in the training process in detail, as shown in Table 1. For the hyper-parameters effecting the linearization significantly, we provide the corresponding analysis in Section 3.1. The convergence of the NN is shown in Section 3.2. After the training process, we apply the well-trained agent to the experiment system. The performance of the agent is reported in Section 3.3, and a classical iteration method is introduced as a comparison.

Table 1. Hyper-parameters of model-based RL

View Table | View all tables in this article

The simulation platform is built up on the Windows 10 operating system with Python 3.6.13 and Pytorch environment. The CPU of the computer is Intel i5 processor with dominant frequency of $1.6$GHz, the GPU is NVIDIA GeForce MX150, and the running memory is 8G. GPU is applied to speed up the training process.

3.1 Parameter setting

To ensure the model-based RL network achieves the best performance, we investigate the effects of different hyper-parameters including the size of replay buffer (SRB), the number of hidden neurons (HN), learning rates (LR), and ER, as shown in Fig. 2. Figure 2(a) represents the performance curves of the agent with different SRB. The replay buffer is used to store the obtained experience during the training process. In the training process, the batch of samples is randomly sampled from the buffer to train the NNs. Thus, the buffer size affects the mean reward. The smaller the size is, the greater the impact of the new data on the sample distribution, which means that the buffering effect is worse and results in network instability during training. As shown in Fig. 2(a), when the SRB equals to 5e3, the fluctuation of the reward curves is apparent. However, the excessive buffer size increases resource occupancy rate and reduces data efficiency, leading to slower network convergence. Among the reward curves shown in Fig. 2(a), the final reward is smaller than the others when the SRB equals to 7e4. So we set the SRB to 5e4. Figure 2(b) shows the influence of different numbers of HN on network training. Generally, a larger number of HN means more powerful fitting ability of the network to learn abundant features. After 300 episodes, the average reward curves of the model-based RL gradually converge. With the enhancement of the number of HN, the average reward is improved. And for networks with 64 and 128 HN, the average rewards are close to −0.016, which are obviously better than other HNs. Thus, further considering the network complexity and the possibility of over-fitting, we set the HN as 64 to construct a light-weight NN. Furthermore, we investigate the effect of LR on network convergence. As shown in Fig. 2(c), after 300 episodes, networks with different LR access convergence state. The LR determines the time step in the optimization process of the loss function. A higher LR indicates a greater influence of new experience on the network training, which may lead to network divergence. Therefore, as shown in the detailed diagram of 350-500 episodes, when the LR of the critic NNs (LR-c) equals to 0.01 and the LR of the actor NNs (LR-a) equals to 0.001, the average reward curve represents obvious fluctuations compared with those with other parameters. However, a smaller learning rate reduces the convergence efficiency of the network, as the curve shows in Fig. 2(c), where the LR-c equals to $0.0001$ and the LR-a equals to $0.001$. Considering the comprehensive performance of the network under different LR-c and LR-a, we set both LR-c and LR-a as 0.001. Moreover, the ER needs to be specified to determine the trade-off between exploration and utilization of the network in the training process. A higher ER indicates a greater probability that the agent determines the action at random rather than the explored experience, which may lead to more or less reward. While, with a too lower ER, the optimal policy may be omitted. As shown in Fig. 2(d), the network performs best when the ER equals to $0.001$. Thus, the ER is set to 0.001.

Fig. 2. Performance curves tracking average reward of agents based on different hyper-parameters.

Download Full Size | PDF

3.2 Convergence of model-based RL network

After the training process, we plot the training reward of the model-based RL in terms of the training episodes as shown in Fig. 3, where the solid line is the average of multiple training processes, and the shaded area indicates the $95\%$ confidence interval. According to Eq. (9), a greater reward means better linearity of the generated FMCW following the modulation signal control policy. Clearly, in the beginning episodes, the average reward is very small, indicating that the frequency of the beat signal at this point has a large deviation from the ideal value due to the significant sweep nonlinear characteristic of the model. Moreover, since the action noise has a great influence on the action selection in the early training period, a large proportion of bad data with low rewards is stored to the replay buffer. Therefore, the orange line in Fig. 3 shows that even after pre-training, the networks still learn the bad policy in the early stage of formal training causing the deterioration of the control effect. But with the on-going training, the action noise decreases rapidly and the samples are updated, making the reward performance improved quickly from −1.17 to −0.0157 within almost 30 episodes. Although the average reward curve shows slight fluctuations after convergence, the agent finds the policy for linear FMCW generation at the end of the training anyway. On the contrary, the blue line in Fig. 3 shows that the algorithm without pre-training diverges from the one with pre-training after almost 100 episodes. Thus, in this situation, the agent cannot perform appropriate actions to generate linear FMCW. Considering that the nonlinearity is relatively different for different parts of the modulation period, the pre-training process provides more opportunities for the agent to learn about the strong nonlinear parts, which means the formal training process begins with a better initial policy. Therefore, during the formal training process, the chance of exploring worse experiences is greatly reduced, and then in the replay buffer, the proportion of experiences following different policies is constrained to those beneficial to the convergence of the network. The reward curve demonstrates that the design of the agent structure is capable of learning the nonlinear characteristics in the process of interacting with the model and finding an efficient policy in the optimization control. And compared with training the whole period directly, the pre-training step in our scheme confirms its validity in accelerating network convergence.

Fig. 3. Performance curves tracking average reward of agents based on different training processes. The solid line is the average of multiple training processes, and the shaded indicates the $95\%$ confidence interval.

Download Full Size | PDF

3.3 Linearization performance

We apply the modulated signal obtained from the proposed model-based RL and the iterative method to drive the FSL in the FMCW generation experiment system. Then we investigate the experimental spectra of beat signals with and without control algorithms. Figure 4 shows the frequency vs. time curves of the beat signals. In the beginning, a linear modulation signal is employed as the driver of the laser, and the beat frequency is expected to be time-invariant if the system is linear. As shown in Fig. 4, the extracted frequency without control is time-variant with a large fluctuation range indicating that the system is highly nonlinear. While the fluctuation range of the frequency is greatly reduced by our model-based RL, suggesting an improvement in the linearity of the system. To evaluate the control capacity, we define the linearity as:

(15)$$f_{n}=\frac{1}{f_{B}}\sqrt{\frac{1}{T}\sum_{t=1}^{T}\left( f_{b}\left(t\right)-f_{B} \right) ^{2}},$$

where $T=T_c \cdot f_s$, $T_c$ is the control period and $f_s$ is the sampling rate. The calculated result is $1.64\%$ with the control of the model-based RL, better than $3.02\%$ of the iteration control result, which demonstrates that our proposed method is more efficient in the linearity control. Moreover, Fig. 5 shows an apparent broadening of the beat signal spectrum measured without control. It is hard to distinguish the target frequency information from multiple frequency components. Consequently, it will seriously decrease the inversion accuracy of the detected distance based on Eq. (5). On the contrary, the power spectra of the beat signals with two control strategies show sharp peaks, where the side-mode suppression ratio (SMSR) of our method is $13.31$dB, much better than $5.54$dB of the iteration method. We also introduce the distance resolution to evaluate the linearization effect on the distance ranging quantitatively. Considering that the linear FMCW generation system used in this paper has the same principle as the LiDAR system, we evaluate the performance of the corrected FSL in distance ranging based on the obtained beat frequency signals. The spectral peak width of the beat signal before correction is $6.9$kHz. After being controlled by the model-based RL and the iterative method, the width is narrowed to $1.34$kHz and $1.04$kHz, respectively. According to Eq. (6), the corresponding distance resolutions are $0.053$m, $0.01$m, $0.008$m. The performance of our RL method is not as good as the iterative method, which is mainly because of our simplified environment model with the limited representation ability. As mentioned in [17], the model capacity is a critical ingredient in the success of model-based RL methods. The low-capacity model struggles to represent very complex dynamical systems. Therefore, we will seek a better modeling method in the future to achieve a balance between the model capacity and complexity. From the perspective of RL, the agent captures the dynamic characteristic of the system successfully during the interaction with the model. Thus, with the current situation, the agent is capable of dealing with the possible nonlinearity at the next time step and then outputs the corresponding slope of the modulation signal to maintain linear FMCW generation. Our results also confirm that it is feasible to control the optical system with RL methods.

Fig. 4. Instantaneous frequency of the beat signal vs. time with and without control algorithms.

Download Full Size | PDF

Fig. 5. Power spectra of the beat signals with and without control algorithms.

Download Full Size | PDF

4. Conclusion

In conclusion, this work investigated the linearization of the FMCW LiDAR system using our model-based RL. We constructed a light-weight NN to guarantee the real-time performance of the system. And the established system model provided a more convenient way to obtain data improving the network optimization efficiency and system safety in the training process. From the limited experimental data provided by the model, the proposed agent learned the implicit physical model of the system and controlled the system to generate the desired linear frequency sweep. It is demonstrated that RL, as a branch of ML, has the potential for controlling complicated optical systems to achieve different optimization goals for dozens of application scenarios. However, from the perspective of practical scenarios, under the control of model-based RL, the generated FMCW is not perfectly linear. It is probably because the precision-limited model provides feedback to the agent with a certain indistinctness. So the development of a dynamic systems model with complete statistical regulations via sufficient data analysis would be the future research direction.

Funding

Special Science Foundation of Quzhou (2021D010, 2021D017); Yangzhou Lvyang Jinfeng Grant 2019; Funding Scheme of Sichuan Province to Outstanding Scientific and Technological Programs by Chinese Students Abroad 2019.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica 4(11), 1437–1443 (2017). [CrossRef]

2. Z. Liu, D. Zhu, L. Raju, and W. Cai, “Tackling photonic inverse design with machine learning,” Adv. Sci. 8(5), 2002923 (2021). [CrossRef]

3. B. Lusch, J. N. Kutz, and S. L. Brunton, “Deep learning for universal linear embeddings of nonlinear dynamics,” Nat. Commun. 9(1), 4950 (2018). [CrossRef]

4. G. Genty, L. Salmela, J. M. Dudley, D. Brunner, A. Kokhanovskiy, S. Kobtsev, and S. K. Turitsyn, “Machine learning and applications in ultrafast photonics,” Nat. Photonics 15(2), 91–101 (2021). [CrossRef]

5. C. M. Valensise, A. Giuseppi, G. Cerullo, and D. Polli, “Deep reinforcement learning control of white-light continuum generation,” Optica 8(2), 239–242 (2021). [CrossRef]

6. J. Nousiainen, C. Rajani, M. Kasper, and T. Helin, “Adaptive optics control using model-based reinforcement learning,” Opt. Express 29(10), 15327–15344 (2021). [CrossRef]

7. H. Tünnermann and A. Shirakawa, “Deep reinforcement learning for coherent beam combining applications,” Opt. Express 27(17), 24223–24230 (2019). [CrossRef]

8. C. Lu, Y. Xiang, Y. Gan, B. Liu, F. Chen, X. Liu, and G. Liu, “Fsi-based non-cooperative target absolute distance measurement method using pll correction for the influence of a nonlinear clock,” Opt. Lett. 43(9), 2098–2101 (2018). [CrossRef]

9. Y. Feng, W. Xie, Y. Meng, L. Zhang, Z. Liu, W. Wei, and Y. Dong, “High-performance optical frequency-domain reflectometry based on high-order optical phase-locking-assisted chirp optimization,” J. Lightwave Technol. 38(22), 6227–6236 (2020). [CrossRef]

10. H. Tsuchida, “Waveform measurement technique for phase/frequency-modulated lights based on self-heterodyne interferometry,” Opt. Express 25(5), 4793–4799 (2017). [CrossRef]

11. G. Shi, F. Zhang, X.-H. Qu, and X. Meng, “High-resolution frequency-modulated continuous-wave laser ranging for precision distance metrology applications,” Opt. Eng. 53(12), 122402 (2014). [CrossRef]

12. X. Zhang, J. Pouls, and M. C. Wu, “Laser frequency sweep linearization by iterative learning pre-distortion for fmcw lidar,” Opt. Express 27(7), 9965–9974 (2019). [CrossRef]

13. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of go without human knowledge,” Nature 550(7676), 354–359 (2017). [CrossRef]

14. O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, “Grandmaster level in starcraft ii using multi-agent reinforcement learning,” Nature 575(7782), 350–354 (2019). [CrossRef]

15. X. Da, Z. Xie, D. Hoeller, B. Boots, A. Anandkumar, Y. Zhu, B. Babich, and A. Garg, “Learning a contact-adaptive controller for robust, efficient legged locomotion,” arXiv preprint arXiv:2009.10019 (2020).

16. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971 (2015).

17. K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” Advances in neural information processing systems 31, (2018).

Symbol	Description	Value in pre-training / formal training
$N$	Number of episodes	500 / 500
$M$	Number of time steps	200 / 3000
BS	Batch size of training samples	512 / 1024
SRB	Size of replay buffer	5000 / 50000
LR-a	Learning rate of actor NN	0.001 / 0.001
LR-c	Learning rate of critic NN	0.001 / 0.001
$γ_{r}$	Discount factor of the reward	0.8 / 0.8
$γ_{a}$	Discount factor of the action noise	0.9999 / 0.9999
ER	Final variance of the action noise	0.001 / 0.001
$V_{a, m a x}$	Initial variance of the action noise	3 / 3
$k$	Update rate of target NNs	0.01 / 0.01
HN	Number of hidden neurons	64 / 64

Symbol	Description	Value in pre-training / formal training
$N$	Number of episodes	500 / 500
$M$	Number of time steps	200 / 3000
BS	Batch size of training samples	512 / 1024
SRB	Size of replay buffer	5000 / 50000
LR-a	Learning rate of actor NN	0.001 / 0.001
LR-c	Learning rate of critic NN	0.001 / 0.001
$γ_{r}$	Discount factor of the reward	0.8 / 0.8
$γ_{a}$	Discount factor of the action noise	0.9999 / 0.9999
ER	Final variance of the action noise	0.001 / 0.001
$V_{a, m a x}$	Initial variance of the action noise	3 / 3
$k$	Update rate of target NNs	0.01 / 0.01
HN	Number of hidden neurons	64 / 64

Linearization of nonlinear frequency modulated continuous wave generation using model-based reinforcement learning

Abstract

1. Introduction

2. Methodology

2.1 FMCW generation system and model

2.2 Network model of model-based RL

2.3 Training method

3. Results and discussion

3.1 Parameter setting

3.2 Convergence of model-based RL network

3.3 Linearization performance

4. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (5)

Tables (2)

Equations (15)

Optics Express