
Designing high efficiency asymmetric polarization converter for blue light: a deep reinforcement learning approach


Abstract

Conventional polarization converters selectively preserve the required polarization state by absorbing, reflecting or refracting light with the unwanted polarization state, leading to a theoretical transmittance limit of 0.5 for linearly polarized output under unpolarized incidence. Meanwhile, owing to the high-dimensional structure parameter space and time-consuming numerical simulations, designing a converter with satisfactory performance is extremely difficult and relies heavily on human experts’ experience and manual intervention. To address these open issues, in this paper we first propose an asymmetric polarization converter that exhibits both high transmittance for one linearly polarized light and high transmittance for the orthogonal linearly polarized light rotated by 90°, in the blue wavelength region. To maximize the performance of the proposed structure, a deep reinforcement learning approach is further proposed to search for the optimal set of structure parameters. To avoid overly long training time when numerical simulations serve as the environment, a deep neural network is proposed as a surrogate model, which achieves prediction accuracies of 96.6% and 95.5% in the two orthogonal polarization directions with microsecond-grade simulation time. With the optimized structure, the average transmittance exceeds 0.5 over the wavelength range from 444 to 466 nm, with a maximum of 0.605 at 455 nm, which is 21% higher than the theoretical limit of 0.5 of conventional polarization converters.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Manipulation of the polarization state of light plays an important role in a variety of applications ranging from communications to imaging [1–3]. Conventional polarization converters (or polarizers) selectively preserve the required polarization state by absorbing, reflecting or refracting light with the unwanted polarization state. In these polarization converters, the theoretical maximum transmittance of the desired polarization is limited to $0.5$ for unpolarized incident light [4–6]. In practice, commercial polarization converters typically exhibit transmittance even lower than this theoretical limit. To break the $0.5$ transmittance limit for one linearly polarized light under unpolarized incidence, it is desirable to design a polarization converter that not only rotates light with polarization perpendicular to its principal axis by $90^\circ$ but also preserves the transmission of light with polarization parallel to its principal axis. However, such a polarization converter, referred to as an asymmetric polarization converter, is difficult to design, especially for blue light where the absorption loss of materials is not negligible.

In general, the objective of nanophotonic structure design is to optimize optical properties such as absorption, reflection or transmission spectra, where the key lies in an accurate and efficient mapping from structures to properties. For many years, the design of nanophotonic structures for specific requirements has involved brute-force calculations [7] or solving inverse scattering problems via the Maxwell formulation [8]. Due to the high-dimensional structure parameter space, however, it is often quite time-consuming to calculate the properties of a given device structure using numerical simulation methods [9], which makes it extremely difficult to obtain the optimal structure parameters. To reduce the time consumption, control-variable methods have long been adopted to design a structure with satisfactory performance, which typically leads only to a locally optimal solution and depends heavily on human experts’ experience and manual intervention.

In the past few years, deep learning methods have received growing attention in the fields of nanophotonics, metasurfaces and materials science, showing great potential for intelligent and data-driven approaches to the design and optimization of nanophotonic devices. Specifically, supervised and unsupervised learning methods were adopted in [10–28] to inversely design various device structure parameters so as to satisfy a given target requirement. For instance, the thickness of a nanomaterial shell was inversely designed to obtain a target spectrum of nanoparticle shells in [10,11]; a bidirectional deep neural network was proposed to design plasmonic nanostructures with a given transmission spectrum in [12]; the parameters of chiral metamaterials were inversely designed to obtain target reflection spectra in [13]; and the two-dimensional parameters of an all-dielectric isotropic shell were inversely designed by unsupervised learning with given target scattered-field amplitudes in [27]. Clearly, to apply inverse design methods, the target requirement needs to be given and pre-specified as prior knowledge. In this paper, however, our objective is to maximize the maximum average transmittance in the $y$-$y$ and $y$-$x$ directions without knowing the spectrum curve in advance, and therefore the inverse design methods cannot be directly applied.

Recently, reinforcement learning (RL) has been considered as a new and promising approach for optimal nanophotonic structure design. Specifically, RL algorithms have been adopted to optimize dielectric nanostructures for color generation, metasurface holograms, ultra-broadband absorbers and optical thin films [29–34]. In contrast to traditional methods, RL-based agents can reach an optimal strategy via iterative interactions with an environment, which requires little manual intervention and can search throughout the entire parameter space. Nevertheless, due to the requirement of real-time iterative interaction with the environment, the training of RL agents can be very time-consuming. In the existing studies, the transfer matrix method (TMM) [32–35], the Finite Element Method (FEM) or the Finite Difference Time Domain (FDTD) method [29–31] were used as the environment. Among them, TMM is time-efficient, yet limited to devices with layer-by-layer structures. FDTD and FEM are suitable for more general and complex optical structures, but they take considerable simulation time for each iteration, leading to a relatively long training duration for RL-based agents. For instance, it took one entire month to complete the training of the RL algorithm proposed in [30] when FDTD simulations were used as the environment.

In this paper, we propose a deep reinforcement learning (DRL) approach based on the asynchronous advantage actor-critic (A3C) [36] algorithm to design an optimal asymmetric polarization converter for blue light. The unit cell of the asymmetric polarization converter consists of a dielectric material layer sandwiched between a double-rod resonator structure and a cut-wire structure. Parameters including the layer thickness of the double-rod, the thickness of the dielectric layer, the width of the rod and the refractive index of the dielectric layer are considered for optimization. The optimization target is to obtain linearly polarized light with transmittance larger than $0.5$ under unpolarized incident light.

To avoid extremely long training time, a deep neural network based on a residual structure, named the Transmittance Prediction Network (TPN), is designed to serve as the simulation environment that interacts with the agent. Specifically, TPN predicts the transmittance for given structure parameters of the asymmetric polarization converter, with prediction accuracies of $96.6\%$ and $95.5\%$ for the two polarizations, respectively. Meanwhile, each TPN prediction only requires microsecond-grade simulation time. Thanks to the time efficiency of TPN, the training of the A3C agents takes only 45 minutes to converge, which is much shorter than using FDTD simulation as the environment.

With the optimal structure parameters found by the A3C-based optimization, the average transmittance of our proposed asymmetric polarization converter is larger than $0.5$ for wavelengths ranging from $444$ nm to $466$ nm, with a maximum value of $0.605$ at the wavelength of $455$ nm, which is $21\%$ higher than the theoretical limit of $0.5$. The proposed reinforcement learning and deep learning approach provides practical insights for the design of high-efficiency asymmetric polarization converters, especially for blue light.

2. Methodology overview

In this section, we will introduce the basic structure of the proposed asymmetric polarization converter and its optical properties. An overview of our reinforcement-learning-based optimization approach will then be given.

2.1 Asymmetric polarization converter structure

A three-dimensional schematic of one unit cell of the proposed asymmetric polarization converter is illustrated in Fig. 1. It consists of a double-rod resonator (aluminium, shown in yellow in Fig. 1) and a cut-wire structure (aluminium) as the two metasurface layers, which are separated by one layer of dielectric material (shown as transparent in Fig. 1). The period of the converter is chosen to be $400$ nm along both the $x$- and $y$-axis. Based on comprehensive parameter sweep analysis, the structure parameters shown in Fig. 1 are explored to determine the performance of the asymmetric polarization converter. The parameters {$H_a$, $H_c$, $K_a$, $n$} in Fig. 1 denote the layer thickness of the double-rod, the thickness of the dielectric layer, the width of the rod and the refractive index of the dielectric layer, respectively, from top to bottom of the metasurface. To characterize the optical properties of the proposed structure, the Jones matrix [38] is adopted to relate the electric fields of the orthogonal $x$ and $y$ polarizations of the transmitted light to those of the incident light, which is given by

$$\binom{E_{t}^{x}}{E_{t}^{y}}=\begin{pmatrix} t_{xx} & t_{xy}\\ t_{yx} & t_{yy} \end{pmatrix} \binom{E_{i}^{x}}{E_{i}^{y}} = {T} \binom{E_{i}^{x}}{E_{i}^{y}},$$
where $T$ is the Jones matrix of the metasurface, and {$E_{i}^{x}$, $E_{i}^{y}$} and {$E_{t}^{x}$, $E_{t}^{y}$} are the incident and transmitted electric fields of the propagating light polarized along the $x$ and $y$ directions, respectively. The element $t_{nm}$ of $T$ denotes the transmission coefficient from the $m$-polarized electric field to the $n$-polarized electric field, $m, n \in \{x,y\}$, and is in general a complex number (containing both the field transmission amplitude and phase information). Values of {$E_{i}^{x}$, $E_{i}^{y}$, $E_{t}^{x}$, $E_{t}^{y}$} can be extracted from FDTD simulations, and the transmittance $T_{nm}$ is equal to the square of $\left | t_{nm} \right |$.
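
To make the notation concrete, the short sketch below (Python/NumPy) computes the transmission coefficients and transmittances from complex field amplitudes in the way described above; the numerical field values and variable names are placeholders for illustration, not simulation results.

```python
import numpy as np

# Hypothetical complex field amplitudes from two FDTD runs
# (placeholder values, not simulation results):
# run A uses x-polarized incidence, run B uses y-polarized incidence.
E_i = 1.0 + 0.0j                     # normalized incident field amplitude
E_t_y_from_x = 0.55 - 0.42j          # y-polarized output for x-polarized input (run A)
E_t_y_from_y = 0.61 + 0.33j          # y-polarized output for y-polarized input (run B)

# Jones-matrix elements t_nm: m-polarized input -> n-polarized output.
t_yx = E_t_y_from_x / E_i
t_yy = E_t_y_from_y / E_i

# Transmittances are the squared moduli of the transmission coefficients.
T_yx = np.abs(t_yx) ** 2
T_yy = np.abs(t_yy) ** 2

# For unpolarized incidence, the y-polarized output power fraction is the
# average of the two channels, which is the quantity optimized later.
T_A = 0.5 * (T_yy + T_yx)
print(T_yy, T_yx, T_A)
```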

Fig. 1. Structure of asymmetric polarization converter. Light and dark blue arrows represent $x$-polarized and $y$-polarized light, respectively.

The FDTD method is employed to simulate the asymmetric polarization converter structure shown in Fig. 1. Periodic boundary conditions are applied in both the $x$ and $y$ directions, and a perfectly matched layer boundary condition is applied in the $z$ direction. A fine mesh size of $20$ nm is chosen for all simulations. To ensure the accuracy of the calculation results, an even finer mesh grid is applied at the interface between the metal and the dielectric ($20$ nm in the $x$ and $y$ directions, and $2$ nm in the $z$ direction). To deal with practical dispersive materials, a polynomial curve-fitting technique is used [37].

The designed and characterized multi-layer metamaterial linear polarizer rotates light with polarization perpendicular to its principal axis by $90^\circ$, while light with polarization parallel to its principal axis is transmitted undisturbed. Thereby, such a polarizer is able to output linearly polarized light from unpolarized input with a transmittance that can be substantially higher than the theoretical limit of $0.5$. Hence, $T_{yx}$ and $T_{yy}$ are both supposed to be higher than $0.5$ over the same frequency range.

2.2 Overview of the reinforcement learning approach

To search for the optimal structure parameter set {$H_a$, $H_c$, $K_a$, $n$} of the asymmetric polarization converter that maximizes the maximum average transmittance of $T_{yy}$ and $T_{yx}$ for blue light, we propose a deep reinforcement learning agent based on the asynchronous advantage actor-critic (A3C) algorithm, as illustrated in Fig. 2. In contrast to the existing studies, where numerical simulation methods were adopted as the simulation environment interacting with the agent and led to overly long training times, in this paper a deep neural network is proposed as a surrogate model and serves as the simulation environment. In particular, a deep fully-connected neural network, named the transmittance prediction network (TPN), is designed and trained to predict the transmittance of our proposed asymmetric polarization converter with given structure parameters {$H_a$, $H_c$, $K_a$, $n$}, replacing the FDTD numerical simulation. Compared with traditional simulation-based methods, the proposed method provides a time-efficient, computing-resource-saving and global optimization solution for asymmetric polarization converter design, which requires little prior professional knowledge or manual intervention.

Fig. 2. Graphic illustration of the optimization procedure for asymmetric polarization converter proposed in this paper.

3. Transmittance prediction network based on deep learning

In this section, we will first introduce the data set used to train the transmittance prediction network. The architecture of the transmittance prediction network will then be presented, and its accuracy and time consumption will be evaluated.

3.1 Data set

In this paper, we formulate a data set with $12500$ sets of numerical simulation data from FDTD, which are paired data of the structure parameter set {$H_a$, $H_c$, $K_a$, $n$} and the transmittance ($T_{yy}$ and $T_{yx}$) of the asymmetric polarization converter. Table 1 presents the value space and the step size of each structure parameter in the FDTD simulation data set. For each structure parameter set, we run the FDTD simulation of the asymmetric polarization converter and obtain the transmittance of the metasurface. Each set of unprocessed $T_{yy}$ and $T_{yx}$ data contains $200$ points, with wavelengths ranging from $400$ nm to $800$ nm. Since we are optimizing the asymmetric polarization converter for blue light, we extract the $81$ points with wavelengths between $430$ nm and $550$ nm from the unprocessed $T_{yy}$ and $T_{yx}$ data, respectively. In order to improve the prediction accuracy of TPN, we further downsample $T_{yy}$ and $T_{yx}$ to $28$ points each. The data set and source code required to reproduce our results can be found in Dataset 1 on GitHub [39].
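
The preprocessing described above can be summarized by the following sketch; the uniform wavelength grid and the uniform-index downsampling to 28 points are our assumptions for illustration, and the authors' exact sampling of the FDTD monitor output may differ.

```python
import numpy as np

# wavelengths: (200,) sample wavelengths in nm reported by the FDTD monitor;
# T_raw: (200,) corresponding transmittance. Both are placeholders standing
# in for data loaded from a simulation output file.
wavelengths = np.linspace(400.0, 800.0, 200)   # assumed grid for illustration
T_raw = np.random.rand(200)                    # placeholder spectrum

# Keep only the blue region of interest (430-550 nm).
mask = (wavelengths >= 430.0) & (wavelengths <= 550.0)
wl_blue, T_blue = wavelengths[mask], T_raw[mask]

# Thin the blue-region samples down to 28 points; uniform index selection
# is an assumption, the authors' exact scheme may differ.
idx = np.linspace(0, len(T_blue) - 1, 28).round().astype(int)
wl_28, T_28 = wl_blue[idx], T_blue[idx]
print(wl_28.shape, T_28.shape)   # (28,) (28,)
```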

Table 1. Value space and step size of the structure parameters $H_{a}$, $H_{c}$, $K_{a}$ and $n$ in the data set.

3.2 Architecture of transmittance prediction network

Figure 3 presents the architecture of TPN, which contains a shared layers part (SLP) and two split output parts (SOPY and SOPX). TPN takes the structure parameters {$H_a$, $H_c$, $K_a$, $n$} of the asymmetric polarization converter as input data and predicts the transmittances $T_{yy}$ and $T_{yx}$ after SOPY and SOPX, respectively. In the SLP, a batch normalization [40] (BN) layer is adopted at the end to speed up the training process and prevent over-fitting. The SLP is followed by SOPY and SOPX, which both include residual structures [41] to prevent vanishing gradients during back-propagation. In SOPY and SOPX, we design multiple densely connected layers with the ReLU activation function [42], and apply the LeakyReLU [43] activation function at the end of SOPY and SOPX to address the dying ReLU problem [44].
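
Since the exact layer widths of Fig. 3 are not reproduced in this text, the Keras sketch below only mirrors the described topology: a shared fully-connected part (SLP) ending in batch normalization, followed by two split output parts (SOPY and SOPX) built from residual blocks with ReLU activations and a LeakyReLU output layer producing the 28 transmittance points. The layer width (256), block counts and the helper name `residual_dense_block` are assumptions, not the authors' exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_dense_block(x, units):
    """Two dense layers with a skip connection (widths are assumptions)."""
    y = layers.Dense(units, activation="relu")(x)
    y = layers.Dense(units, activation="relu")(y)
    if x.shape[-1] != units:            # project the input if widths differ
        x = layers.Dense(units)(x)
    return layers.Add()([x, y])

def build_tpn(n_params=4, n_points=28, width=256):
    inp = keras.Input(shape=(n_params,))          # {Ha, Hc, Ka, n}

    # Shared layers part (SLP) ending with batch normalization.
    x = layers.Dense(width, activation="relu")(inp)
    x = layers.Dense(width, activation="relu")(x)
    x = layers.BatchNormalization()(x)

    # Two split output parts with residual structure and a LeakyReLU head.
    outputs = []
    for name in ("SOPY", "SOPX"):
        b = residual_dense_block(x, width)
        b = residual_dense_block(b, width)
        out = layers.Dense(n_points)(b)
        out = layers.LeakyReLU(name=name)(out)
        outputs.append(out)

    return keras.Model(inp, outputs, name="TPN")

model = build_tpn()
model.summary()
```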

Fig. 3. Architecture of the transmittance prediction network. Dense $X$ means that this layer has $X$ neurons.

3.3 Performance evaluation of transmittance prediction network

3.3.1 Network training

TPN is trained on a laptop equipped with an i$7$-$7700$ CPU ($4$ cores and $8$ threads) and $16$ GB of RAM. We use hold-out validation and randomly divide the data set generated by FDTD into five subsets, four of which are used for training, while the remaining one is evenly split into validation and test data.

We adopt Keras [45] as the deep learning library. The Mean Square Error (MSE) and Mean Absolute Percentage Error (MAPE) are considered as the key performance metrics to evaluate TPN and are defined in Table 2, where $T_{yy, i}$, $T_{yx, i}$ and $T'_{yy, i}$, $T'_{yx, i}$ denote the $i$th observed transmittance values in the FDTD data set and the corresponding $i$th values $T^{\prime }_{yy}$ or $T^{\prime }_{yx}$ predicted by TPN, respectively. MSE is also chosen as the loss function for training. The Adamax algorithm, a variant of Adam [46] based on the infinity norm, is used as the training optimizer for TPN. The key parameters $\beta _1$ and $\beta _2$ in Adamax are set to $0.9$ and $0.999$, respectively, and the learning rate is $0.005$. Figure 4 illustrates how the training and validation loss values vary with the number of epochs. It can be clearly seen from Fig. 4 that for both SOPY and SOPX, the MSE rapidly decreases as the number of epochs grows, and the training process converges at about $1000$ epochs. Moreover, TPN does not over-fit the training data set, as the training and validation losses remain close.
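
The stated training configuration (MSE loss, Adamax with $\beta_1=0.9$, $\beta_2=0.999$, learning rate $0.005$, convergence at about 1000 epochs) translates into Keras roughly as follows; the placeholder arrays, the exact split sizes and the batch size are assumptions for illustration.

```python
from tensorflow import keras
import numpy as np

# X: (N, 4) structure parameters; Y_yy / Y_yx: (N, 28) transmittance spectra.
# Placeholders below stand in for the FDTD data set split into training,
# validation and test subsets as described in the text.
X_train = np.random.rand(10000, 4)
Y_yy_train, Y_yx_train = np.random.rand(10000, 28), np.random.rand(10000, 28)
X_val = np.random.rand(1250, 4)
Y_yy_val, Y_yx_val = np.random.rand(1250, 28), np.random.rand(1250, 28)

model = build_tpn()   # from the architecture sketch above
model.compile(
    optimizer=keras.optimizers.Adamax(learning_rate=0.005,
                                      beta_1=0.9, beta_2=0.999),
    loss="mse",                      # MSE is the training loss for both heads
)
history = model.fit(
    X_train, [Y_yy_train, Y_yx_train],
    validation_data=(X_val, [Y_yy_val, Y_yx_val]),
    epochs=1000,                     # training converges at about 1000 epochs
    batch_size=64,                   # assumed batch size
    verbose=0,
)
```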

Fig. 4. MSE of (a) SOPY and (b) SOPX training and validation $versus$ the number of epochs.

Table 2. Definitions of metrics.

3.3.2 Accuracy

To evaluate the performance of TPN in predicting $T_{yy}$ and $T_{yx}$, the MAPE defined in Table 2 is calculated over the test data set. The average MAPE of $T_{yy}$ and $T_{yx}$ is $3.4$ and $4.2$, respectively, indicating that TPN achieves accuracies of $96.6\%$ and $95.5\%$ relative to the FDTD simulations of $T_{yy}$ and $T_{yx}$. Figure 5 presents the histogram of the MSE of $T_{yy}$ and $T_{yx}$ over the entire test data. It can be seen that for $T_{yy}$, more than 95% of the predictions in the test data set have an MSE within $2.88 \times 10^{-3}$; for $T_{yx}$, more than 95% of the predictions have an MSE within $1.20 \times 10^{-3}$.
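
Table 2 itself is not reproduced in this text, so the sketch below uses the standard definitions of MSE and MAPE averaged over the spectral points, which is how the metrics in this section are read; the function names are ours.

```python
import numpy as np

def mse(T_true, T_pred):
    """Mean squared error over all spectral points."""
    return np.mean((T_true - T_pred) ** 2)

def mape(T_true, T_pred, eps=1e-8):
    """Mean absolute percentage error (in percent); eps guards division by zero."""
    return 100.0 * np.mean(np.abs(T_true - T_pred) / (np.abs(T_true) + eps))

# Example: metrics for the T_yy head on a placeholder test batch.
T_true = np.random.rand(100, 28)                    # stand-in for FDTD spectra
T_pred = T_true + 0.01 * np.random.randn(100, 28)   # stand-in for TPN output
print(mse(T_true, T_pred), mape(T_true, T_pred))
```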

Fig. 5. MSE histogram of (a) $T_{yy}$ and (b) $T_{yx}$.

Figure 6 further compares the values of $T_{yy}$ and $T_{yx}$ predicted by TPN with those simulated by FDTD. Due to limited space, we randomly select 4 sets out of the test data for illustration. As shown in Fig. 6, a good match can be observed between the TPN and FDTD results for $T_{yy}$ and $T_{yx}$, indicating that the proposed TPN can serve as an accurate surrogate model of the FDTD simulations for performance prediction of the proposed asymmetric polarization converter.

Fig. 6. Comparison of $T_{yy}$ and $T_{yx}$ between simulation and prediction by FDTD and TPN in the test data.

3.3.3 Time-consumption

Given that the TPN predictions agree well with the FDTD results, let us now compare their time efficiency. Specifically, each TPN prediction over the data set takes only about $10^{-5}$ seconds on our laptop, while each FDTD simulation requires about $120$ seconds on average on a server equipped with two XEON-$2650$ v$4$ CPUs ($12$ cores and $24$ threads each) and $128$ GB of RAM. That is, TPN is much more time-efficient than FDTD (microsecond grade versus second grade), and less demanding on computing resources as well.

4. Optimal structure design based on A3C algorithm

In this section, we will first state the optimization problem and introduce the A3C algorithm, including its optimization process and agent architecture. Then, the performance of the A3C algorithm will be presented, and the optimal structure found by the A3C algorithm will be evaluated.

4.1 Problem statement

For asymmetric polarization converter optimization, especially for our proposed structure, the two important target optical properties of the converter are the transmittances $T_{yy}$ and $T_{yx}$, which are determined by the device’s internal structure parameters {$H_a$, $H_c$, $K_a$, $n$}. To search for the optimal asymmetric polarization converter structure, we aim to maximize the maximum value $T^{m}_A$ of the average transmittance of $T_{yy}$ and $T_{yx}$ for blue light. Hence, the optimization problem can be written as

$$\max_{H_a, H_c, K_a, n} \quad T^{m}_A=\textrm{max}\left(\dfrac{T_{yy} + T_{yx}}{2}\right) .$$
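
In terms of the TPN surrogate introduced in Section 3, evaluating the objective of Eq. (2) for one candidate parameter set amounts to averaging the two predicted 28-point spectra wavelength by wavelength and taking the maximum; a minimal sketch follows (the function name is ours, and the two-output model is assumed to be the TPN sketched in Section 3).

```python
import numpy as np

def objective(params, tpn_model):
    """Evaluate T_A^m = max over wavelengths of (T_yy + T_yx) / 2
    for one parameter set {Ha, Hc, Ka, n} using the TPN surrogate."""
    T_yy, T_yx = tpn_model.predict(np.asarray(params)[None, :], verbose=0)
    T_avg = 0.5 * (T_yy[0] + T_yx[0])    # (28,) averaged spectrum
    return float(T_avg.max())
```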

4.2 A3C algorithm design

In this subsection, we will present the detailed settings of the A3C algorithm, including the state, action and reward function. The local-global agent training approach and the pseudo-code of the algorithm will then be illustrated.

4.2.1 Optimization process of A3C algorithm

We use the A3C algorithm to search for better structure parameters via iterative interactions with TPN as the simulation environment. Figure 7 shows the optimization process of the asymmetric polarization converter. At the beginning of each epoch, the structure parameters {$H_a$, $H_c$, $K_a$, $n$} are initialized randomly within their valid boundaries defined in Table 3 as the state of the A3C algorithm. Then, the A3C agent takes an action to change the state according to the reward value calculated from the average transmittance fed back by TPN. After a number of iterations, the reward converges and the optimal structure is obtained by the A3C algorithm. Note that the search range of each structure parameter is bounded in this paper to avoid an overly large state space and therefore an excessively long time to converge. The lower and upper bounds of each parameter are carefully selected based on simulation or analytical results. For instance, the width of the rod is upper-bounded by $112$ nm, as the period of the converter is fixed to $400$ nm. Simulation results show that the transmittance deteriorates significantly when the structure parameters lie outside the ranges given in Table 3.

Fig. 7. Asynchronous advantage actor-critic algorithm used for finding better asymmetric polarization converter structure parameters.

Table 3. Range of values of structure parameters $H_a$, $H_c$, $K_a$ and $n$.

Next, let us define the state, action and reward function in detail.

State: Define the state $s$ as the structure parameters {$H_a$, $H_c$, $K_a$, $n$}. Table 3 lists the range of values for the structure parameters. A state in which any parameter falls outside its allowable range is considered invalid.

Action: Actions adjust the state within its boundaries. Referring to existing design methods [29,30,33] and considering that the minimum fabrication precision of {$H_a$, $H_c$, $K_a$} is normally limited to $1$ nm, the adjustment steps for {$H_a$, $H_c$, $K_a$} are set to $1$ nm and $2$ nm. The action space contains $2 \times 4=8$ different actions, covering the increment and decrement of each parameter value. The list of all actions is shown in Table 4.

Table 4. Definition of actions in A3C algorithm.
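
Table 4 is not reproduced in this text, so the following is a purely hypothetical encoding of the eight increment/decrement actions: the $1$ nm step for {$H_a$, $H_c$, $K_a$} follows the description above, while the step for $n$ and the parameter bounds are illustrative placeholders rather than the values of Tables 3 and 4.

```python
# Eight actions: increment or decrement one of the four parameters.
# The 1 nm step for Ha, Hc, Ka follows the text; the step for n and the
# bounds below are illustrative placeholders, not the Table 3/4 values.
STEPS = {"Ha": 1.0, "Hc": 1.0, "Ka": 1.0, "n": 0.01}
BOUNDS = {"Ha": (100.0, 300.0), "Hc": (200.0, 400.0),
          "Ka": (20.0, 112.0), "n": (1.0, 2.0)}       # placeholder bounds
ACTIONS = [(p, sign) for p in ("Ha", "Hc", "Ka", "n") for sign in (+1, -1)]

def apply_action(state, action_id):
    """Return (new_state, is_valid) after applying one of the 8 actions."""
    param, sign = ACTIONS[action_id]
    new_state = dict(state)
    new_state[param] += sign * STEPS[param]
    lo, hi = BOUNDS[param]
    return new_state, lo <= new_state[param] <= hi
```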

Reward: The reward estimates the performance of the action chosen by the agent in the current step. As the objective is to maximize the maximum value of the average transmittance, the higher the transmittance of the current structure, the higher the reward value. Moreover, a negative reward is given to the agent when the performance is unsatisfactory, for instance when the structure exceeds its boundaries or the transmittance is lower than $0.3$. The reward function $R_t$ of our optimization process with the A3C algorithm is therefore given by

$$R_t=\left\{ \begin{array}{ll} -0.5, & \textrm{invalid states}\\ \exp(T^{m}_{yy}+T^{m}_{yx}-1)-1, & T^{m}_{yy} <0.3\; \textrm{or}\; T^{m}_{yx} <0.3 \\ 0, &0.3 \leq \{T^{m}_{yy}, T^{m}_{yx}\} < 0.5 \\ \exp(T^{m}_{yy}+T^{m}_{yx}-1), & T^{m}_{yy} \geq 0.5 \; \textrm{and} \; T^{m}_{yx} \geq 0.5 \\ \dfrac{\exp(T^{m}_{yy}+T^{m}_{yx}-1)}{2}, & \textrm{otherwise}, \end{array} \right.$$
where $T^{m}_{yy}$ and $T^{m}_{yx}$ represent the maximum values of $T_{yy}$ and $T_{yx}$ predicted by TPN, respectively.
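
For reference, Eq. (3) transcribes directly into code as follows; the function name and the boolean flag marking invalid states are ours.

```python
import math

def reward(T_yy_m, T_yx_m, valid):
    """Reward of Eq. (3). T_yy_m and T_yx_m are the maxima of the TPN-predicted
    spectra; valid is False when the state has left its boundaries."""
    if not valid:
        return -0.5
    e = math.exp(T_yy_m + T_yx_m - 1.0)
    if T_yy_m < 0.3 or T_yx_m < 0.3:
        return e - 1.0
    if T_yy_m < 0.5 and T_yx_m < 0.5:        # both in [0.3, 0.5)
        return 0.0
    if T_yy_m >= 0.5 and T_yx_m >= 0.5:
        return e
    return e / 2.0                           # one in [0.3, 0.5), the other >= 0.5
```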

4.2.2 Architecture of A3C algorithm

Reinforcement learning allows the agent to learn how to perform actions $a$, much like a human, from the reward $R$ returned by the environment, so as to obtain the optimal action policy $\pi$. More specifically, as shown in Fig. 8, the agents in the A3C algorithm are divided into one global agent and a number of local agents, and all agents share an identical network structure. The global agent collects the gradients from the local agents, updates the parameters of its own actor and critic networks, and then synchronizes these parameters to the local agents. That is, each independent local agent can learn different knowledge from its own environment and share it with the global agent; afterwards, the local agent synchronizes the updated knowledge from the global agent. Such a global-local training scheme lets every agent benefit from the experience of the others.

Fig. 8. Architecture of global and local agents interactions in A3C algorithm.

As Fig. 8 shows, the A3C algorithm combines the advantages of value-based and policy-based reinforcement learning, and consists of two main components: an actor network and a critic network. Specifically, the actor network implements the policy $\pi (a_t \mid s_t; \theta _{a})$ to decide the action $a_t$ that should be taken in state $s_t$ at the current time step $t$, where $\theta _{a}$ denotes the parameters of the actor network. The critic network serves as the value function $V(s_t;\theta _{v})$ to estimate the value of the current state $s_t$, i.e., the expected sum of rewards from the current state to the final state.

The optimization procedure of the A3C algorithm is depicted in Algorithm 1. At the beginning of the optimization, we initialize a global agent and $N$ local agents according to the number of processor threads of the CPU. The network configurations of the global agent and the local agents are identical. Within each epoch, every local agent selects the action $a_t$ for the current state $s_t$ through the policy $\pi (a_t \mid s_t; \theta _{a})$ and appends the feedback reward $R_t$, together with $s_t$ and $a_t$, to its memory pool until the current epoch terminates or the gradients need to be pushed to the global agent. When the epoch terminates or the time-step counter $t$ reaches its maximum value (set to 20 in this paper), the loss of the local agent is computed and its gradient is used to update the global agent parameters. An entropy regularization term based on $\pi (a_t \mid s_t; \theta _{a})$ for state $s_t$ is further included in the objective function to prevent the A3C algorithm from converging to suboptimal deterministic policies in early iterations [47]. Moreover, at every epoch, we use the $R_t$, $s_t$ and $a_t$ stored in the memory pool for each time step and adopt $n$-step returns so that rewards propagate to the actor and critic networks more quickly [48,49]. After the global agent parameters are updated, the local agent synchronizes the network parameters from the global agent.

Algorithm 1. A3C algorithm for transmittance optimization of asymmetric polarization converter.
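
Algorithm 1 is provided as a figure in the original article; as a rough, framework-agnostic sketch of the local-agent update it describes — $n$-step returns accumulated over at most 20 time steps, an advantage-weighted policy term, a squared-error value term and an entropy regularizer — the fragment below computes the corresponding scalar losses from a rollout stored in the memory pool. The discount factor and the entropy weight are assumed hyperparameters, and a real implementation would back-propagate these losses through the actor and critic networks.

```python
def nstep_returns_and_losses(rewards, values, log_probs, entropies,
                             bootstrap_value, gamma=0.99, beta=0.01):
    """Sketch of one local-agent update in the spirit of Algorithm 1.

    rewards, values, log_probs, entropies: lists collected over at most
    20 time steps; bootstrap_value is V(s_T) (0 if the epoch terminated).
    gamma (discount) and beta (entropy weight) are assumed hyperparameters.
    """
    R = bootstrap_value
    policy_loss, value_loss = 0.0, 0.0
    for t in reversed(range(len(rewards))):          # n-step returns
        R = rewards[t] + gamma * R
        advantage = R - values[t]
        policy_loss += -log_probs[t] * advantage     # actor (policy-gradient) term
        value_loss += advantage ** 2                 # critic (value) term
        policy_loss += -beta * entropies[t]          # entropy regularization
    return policy_loss, value_loss
```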

4.3 Performance evaluation of A3C algorithm

The agents of the A3C algorithm are trained on a server equipped with an i$9$-$10900$X CPU ($20$ cores) and $64$ GB of RAM. As discussed in Section 4.2.2, the A3C algorithm can take advantage of multiple processes to accelerate the optimization. Hence, we deploy a local agent on each core and let them interact with their respective environments independently and in parallel. Figure 9 illustrates the reward of the A3C agents while searching for the optimal structure. It can be seen from Fig. 9 that the A3C agents fully converge within $2500$ epochs, where the average reward is about $16$. The agents reach the optimal structure parameter set {$H_a$, $H_c$, $K_a$, $n$} = {$218$ nm, $348$ nm, $53$ nm, $1.314$}, which achieves the maximum transmittance calculated through TPN among all iterations. Note that the experiments were repeated with different epoch budgets, including $2500$, $3500$ and $10000$ epochs, and the optimal structure parameter set obtained in the different experiments is the same. Furthermore, the A3C runs with $3500$ and $10000$ epochs also converge within $2500$ epochs. The training of $2500$ epochs takes about $45$ minutes, which is much shorter than the training time of one month reported in Ref. [30].

Fig. 9. A3C algorithm reward $versus$ the number of training steps.

Let us take a closer look at the time consumption of the A3C agent training with the TPN prediction as the environment versus the FDTD simulation as the environment. Specifically, as each FDTD simulation requires about $120$ seconds, it takes around $17$ days to collect the training data set with $12500$ paired samples for TPN. After that, TPN is used as the surrogate model of FDTD and as the environment interacting with the A3C agent. In this case, the training of $2500$ epochs takes about $45$ minutes, which includes $75973$ time steps, i.e., $75393$ interactions with TPN. On the other hand, if the FDTD simulation is adopted as the environment directly, no data collection is required in advance; nevertheless, during each time step of the A3C training the agent needs to wait for the feedback from FDTD. Assuming the same number of time steps, $75393$ interactions with FDTD would take around $105$ days. Note that if further hyperparameter and network structure tuning is needed, more rounds of training are required, implying even more time if FDTD is adopted as the environment. In all, the major time cost of the TPN-based DRL approach is the data set collection, which is a one-time cost, while the time consumption of an FDTD-based DRL approach is bottlenecked by the running time of the simulation environment, which severely limits the training efficiency of the agent.
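
Both time estimates follow directly from the average of $120$ seconds per FDTD run:

$$12500 \times 120\ \textrm{s} \approx 1.5\times 10^{6}\ \textrm{s} \approx 17\ \textrm{days}, \qquad 75393 \times 120\ \textrm{s} \approx 9.0\times 10^{6}\ \textrm{s} \approx 105\ \textrm{days}.$$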

4.4 Performance evaluation of optimal structure

The optimal structure found by the A3C algorithm is {$H_a$, $H_c$, $K_a$, $n$} = {$218$ nm, $348$ nm, $53$ nm, $1.314$}. To further investigate the asymmetric polarization conversion performance of this optimal structure, we simulate it in FDTD and compare the result with the theoretical transmittance limit of $0.5$. As shown in Fig. 10, $T_{yy}$ is higher than $0.5$ for wavelengths between $444$ nm and $470$ nm, and $T_{yx}$ is higher than $0.5$ for wavelengths between $444$ nm and $461$ nm. The average transmittance $T_A$ is larger than $0.5$ from $444$ nm to $466$ nm. Moreover, the maximum value of $T_A$ is $0.605$ at the wavelength of $455$ nm, which is $21\%$ higher than the theoretical limit of $0.5$.

Fig. 10. Transmittance of $T_{yy}$, $T_{yx}$ and average transmittance $T_{A}$ $versus$ wavelength of proposed optimal structure in FDTD simulation.

To further validate the accuracy of TPN, Fig. 11 shows the prediction results of $T_{yy}$ and $T_{yx}$ for the optimal structure obtained by TPN and FDTD, respectively. The results match well for both $T_{yy}$ and $T_{yx}$, with MAPE values of $4.09$ and $6.21$, respectively.

Fig. 11. Comparison of (a) $T_{yy}$ and (b) $T_{yx}$ between TPN and FDTD for the proposed optimal structure.

Considering the ease of experimental preparation and the closeness of its refractive index to the optimal solution, $SiO_2$, a widely used dielectric material, is chosen as the dielectric layer material for further verification. As shown in Fig. 12, with the practical material $SiO_2$ and {$H_a$, $H_c$, $K_a$} = {$218$ nm, $348$ nm, $53$ nm}, $T_{yx}$ is higher than $0.5$ within the wavelength range from $475$ nm to $486$ nm and $T_{yy}$ is higher than $0.5$ within the wavelength range from $466$ nm to $488$ nm. The average transmittance $T_{A}$ is larger than $0.5$ for the wavelength range from $471$ nm to $487$ nm, indicating that the target of transmittance larger than $0.5$ can still be fulfilled in this range. Our optimization results also provide insight for material synthesis: if a material with the same refractive index as our optimized result can be prepared, the performance of the asymmetric polarization converter can be further improved.

Fig. 12. The performance of device with $SiO_2$ as the dielectric layer material. {$H_a$, $H_c$, $K_a$} = {$218$ nm, $348$ nm, $53$ nm}.

5. Conclusion

This paper presents a highly accurate, time-efficient and low-complexity intelligent approach for the structure optimization of an asymmetric polarization converter for blue light. A fully-connected deep neural network with residual structure, named TPN, is designed to predict the transmittances $T_{yy}$ and $T_{yx}$ in the two directions from the structure parameters {$H_a$, $H_c$, $K_a$, $n$}. With an FDTD numerical simulation data set, the network is trained and converges with average accuracies of $96.6\%$ and $95.5\%$. Moreover, each prediction with TPN only takes about $10^{-5}$ seconds, which is dramatically shorter than an FDTD simulation, indicating that TPN can serve as a good candidate for the simulation environment to interact with RL agents in a real-time manner.

Thanks to the time efficiency of TPN, we further design a DRL agent to search for the optimal structure to maximize the average transmittance using A3C algorithm. The optimal structure parameters set is found to be {$H_a$, $H_c$, $K_a$, $n$} = {$218$ nm, $348$ nm, $53$ nm, $1.314$}. Simulation results show that the average transmittance of the optimal structure is larger than $0.5$ for the wavelengths ranging from $444$ to $466$ nm with a maximum value of $0.605$ at the wavelength of $455$ nm.

Although the idea of using deep neural network (DNN)-based surrogate models as the simulation environment of deep reinforcement learning agents benefits DRL training efficiency and time consumption, it also has some limitations. Firstly, a data set needs to be prepared for training TPN. If more structure parameters are included for optimization, a larger data set is required, which takes more time to collect, and the network needs to be retrained; in this case, fine-tuning techniques can be adopted to reduce the amount of new data required. Moreover, although DNN-based surrogate models have shown good generalization capabilities and can, to some extent, work accurately in regions outside the training set, they remain bounded: surrogate models cannot replace FDTD simulations over an unlimited parameter space.

Note that in addition to the structure parameters considered in this paper, a number of other parameters, such as the rotation of the double-rod and the material of the dielectric layer, can be taken into consideration for design optimization in the future. Moreover, A3C is adopted as the DRL algorithm in this paper, while it is of great interest to further compare the optimization results of different DRL algorithms such as Q-learning [50], Deep Q-Network (DQN) [51], Double DQN [52], A2C [53] and so forth. Apart from DRL, other optimization techniques such as heuristic algorithms also deserve further study for nanophotonic structure design. Last but not least, this paper focuses on the design of an asymmetric polarization converter for blue light; how to design a broadband high-efficiency polarization converter is another interesting topic for future study.

Funding

National Natural Science Foundation of China (62075173).

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Number 62075173.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Dataset 1 [39].

References

1. H. Choi, S. W. Cho, J. Kim, and B. Lee, “A thin 3D-2D convertible integral imaging system with a pinhole array on a polarizer,” Opt. Express 14(12), 5183–5190 (2006). [CrossRef]  

2. X. Yu and H. S. Kwok, “Application of Wire-Grid Polarizers to Projection Displays,” Appl. Opt. 42(31), 6335–6341 (2003). [CrossRef]  

3. O. Poncelet, G. Tallier, P. Simonis, A. Cornet, and L. A. Francis, “Synthesis of bio-inspired multilayer polarizers and their application to anti-counterfeiting,” Bioinspir. Biomim. 10(2), 026004 (2015). [CrossRef]  

4. S. E. Mun, J. Hong, G. Jeong, and B. Lee, “Broadband circular polarizer for randomly polarized light in few-layer metasurface,” Sci. Rep. 9(1), 2543 (2019). [CrossRef]  

5. L. Zhou, Y. Zhou, B. L. Fan, F. Nan, G. H. Zhou, Y. Y. Fan, W. J. Zhang, and Q. D. Qu, “Tailored Polarization Conversion and Light-Energy Recycling for Highly Linearly Polarized White Organic Light-Emitting Diodes,” Laser Photonics Rev. 14(7), 1900341 (2020). [CrossRef]  

6. Y. Zhang, L. Yang, X. K. Li, Y. L. Wang, and C. P. Huang, “Dual functionality of a single-layer metasurface: polarization rotator and polarizer,” J. Opt. 22(3), 035101 (2020). [CrossRef]  

7. Y. Kiarashinejad, Y. Abdollahramezani, S. Zandehshahvar, M. O. Hemmatyar, and A. Adibi, “Deep learning reveals underlying physics of light–matter interactions in nanophotonic devices,” Adv. Theory Simul. 2(9), 1900088 (2019). [CrossRef]  

8. J. Petschulat, C. Menzel, A. Chipouline, C. Rockstuhl, A. Tünnermann, F. Lederer, and T. Pertsch, “Multipole approach to metamaterials,” Phys. Rev. A 78(4), 043811 (2008). [CrossRef]  

9. B. Gallinet, J. Butet, and O. J. F. Martin, “Numerical methods for nanophotonics: standard problems and future challenges,” Laser Photonics Rev. 9(6), 577–603 (2015). [CrossRef]  

10. J. Peurifoy, Y. Shen, L. Jing, Y. Yang, F. Cano-Renteria, B. G. DeLacy, J. D. Joannopoulos, M. Tegmark, and M. Soljačić, “Nanophotonic particle simulation and inverse design using artificial neural networks,” Sci. Adv. 4(6), eaar4206 (2018). [CrossRef]  

11. S. So, J. Mun, and J. Rho, “Simultaneous inverse design of materials and structures via deep learning: demonstration of dipole resonance engineering using core–shell nanoparticles,” ACS Appl. Mater. Interfaces 11(27), 24264–24268 (2019). [CrossRef]  

12. I. Malkiel, M. Mrejen, A. Nagler, U. Arieli, L. Wolf, and H. Suchowski, “Plasmonic nanostructure design and characterization via deep learning,” Light: Sci. Appl. 7(1), 60 (2018). [CrossRef]  

13. W. Ma, F. Cheng, and Y. Liu, “Deep-learning-enabled on-demand design of chiral metamaterials,” ACS Nano 12(6), 6326–6334 (2018). [CrossRef]  

14. J. Jiang and A. F. Jonathan, “Global optimization of dielectric metasurfaces using a physics-driven neural network,” Nano Lett. 19(8), 5366–5372 (2019). [CrossRef]  

15. Y. Deng, S. Ren, K. Fan, J. M. Malof, and W. J. Padilla, “Neural-adjoint method for the inverse design of all-dielectric metasurfaces,” Opt. Express 29(5), 7526–7534 (2021). [CrossRef]  

16. D. Liu, Y. Tan, E. Khoram, and Z. Yu, “Training deep neural networks for the inverse design of nanophotonic structures,” ACS Photonics 5(4), 1365–1369 (2018). [CrossRef]  

17. O. Hemmatyar, S. Abdollahramezani, Y. Kiarashinejad, M. Zandehshahvar, and A. Adibi, “Full color generation with Fano-type resonant HfO2 nanopillars designed by a deep-learning approach,” Nanoscale 11(44), 21266–21274 (2019). [CrossRef]  

18. X. Xu, C. Sun, Y. Li, J. Zhao, J. Han, and W. Huang, “An improved tandem neural network for the inverse design of nanophotonics devices,” Opt. Commun. 481, 126513 (2021). [CrossRef]  

19. H. A. Akbar, A. Rehman, Z. Karim, A. Usman, and S. H. Asim, “Deep Learning enabled Forward Modeling and Inverse Design of Integrated Nanophotonic Gratings,” in Frontiers in Optics (Optical Society of America, 2021), paper JTh5A-100.

20. M. V. Zhelyeznyakov, S. Brunton, and A. Majumdar, “Deep learning to accelerate scatterer-to-field mapping for inverse design of dielectric metasurfaces,” ACS Photonics 8(2), 481–488 (2021). [CrossRef]  

21. R. Yan, T. Wang, X. Jiang, X. Huang, L. Wang, X. Yue, and Y. Wang, “Efficient inverse design and spectrum prediction for nanophotonic devices based on deep recurrent neural networks,” Nanotechnology 32(33), 335201 (2021). [CrossRef]  

22. Z. Liu, S. P. Rodrigues, K. T. Lee, and W. Cai, “Generative model for the inverse design of metasurfaces,” Nano Lett. 18(10), 6570–6576 (2018). [CrossRef]  

23. S. So and J. Rho, “Designing nanophotonic structures using conditional deep convolutional generative adversarial networks,” Nanophotonics 8(7), 1255–1261 (2019). [CrossRef]  

24. W. Ma, F. Cheng, Y. Xu, Q. Wen, and Y. Liu, “Probabilistic representation and inverse design of metamaterials based on a deep generative model with semi-supervised learning strategy,” Adv. Mater. 31(35), 1901111 (2019). [CrossRef]  

25. J. Jiang, D. Sell, S. Hoyer, J. Hickey, J. Yang, and J. A. Fan, “Free-form diffractive metagrating design based on generative adversarial networks,” ACS Nano 13(8), 8872–8878 (2019). [CrossRef]  

26. S. An, B. Zheng, H. Tang, M. Y. Shalaginov, L. Zhou, H. Li, and H. Zhang, “Generative multi-functional metaatom and metasurface design networks,” arXiv preprint arXiv:1908.04851 (2019).

27. A. P. Blanchard-Dionne and O. J. F. Martin, “Successive training of a generative adversarial network for the design of an optical cloak,” OSA Continuum 4(1), 87–95 (2021). [CrossRef]  

28. C. Yeung, R. Tsai, B. Pham, B. King, Y. Kawagoe, D. Ho, and A. P. Raman, “Global inverse design across multiple photonic structure classes using generative deep learning,” Adv. Opt. Mater. 9(20), 2100548 (2021). [CrossRef]  

29. I. Sajedian, T. Badloe, and J. Rho, “Optimisation of colour generation from dielectric nanostructures using reinforcement learning,” Opt. Express 27(4), 5874–5883 (2019). [CrossRef]  

30. I. Sajedian, H. Lee, and J. Rho, “Double-deep Q-learning to increase the efficiency of metasurface holograms,” Sci. Rep. 9(1), 10899 (2019). [CrossRef]  

31. T. Badloe, I. Kim, and J. Rho, “Biomimetic ultra-broadband perfect absorbers optimised with reinforcement learning,” Phys. Chem. Chem. Phys. 22(4), 2337–2342 (2020). [CrossRef]  

32. H. Wang, Z. Zheng, C. Ji, and L. Guo, “Automated multi-layer optical design via deep reinforcement learning,” Mach. Learn.: Sci. Technol. 2(2), 025013 (2021). [CrossRef]  

33. A. Jiang, Y. Osamu, and L. Chen, “Multilayer optical thin film design with deep Q learning,” Sci. Rep. 10(1), 12780 (2020). [CrossRef]  

34. H. Wankerl, M. L. Stern, A. Mahdavi, C. Eichler, and E. W. Lang, “Parameterized reinforcement learning for optical system optimization,” J. Phys. D: Appl. Phys. 54(30), 305104 (2021). [CrossRef]  

35. L. A. A. Pettersson, L. S. Roman, and O. Inganäs, “Modeling photocurrent action spectra of photovoltaic devices based on organic thin films,” J. Appl. Phys. 86(1), 487–496 (1999). [CrossRef]  

36. V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” arXiv:1602.01783 (2016).

37. Q. G. Du, C. H. Kam, H. V. Demir, H. Y. Yu, and X. W. Sun, “Broadband absorption enhancement in randomly positioned silicon nanowire arrays for solar cell applications,” Opt. Lett. 36(10), 1884–1886 (2011). [CrossRef]  

38. Y. F. Yu, A. Y. Zhu, R. Paniagua-Domínguez, Y. H. Fu, B. Luk’yanchuk, and A. I. Kuznetsov, “High-transmission dielectric metasurface with 2π phase control at visible wavelengths,” Laser Photonics Rev. 9(4), 412–418 (2015). [CrossRef]  

39. C. Yi, “Optimization for Asymmetric Polarization Converter,” GitHub (accessed Jan. 2022), https://github.com/ChuqiaoYi/Optimization-for-Asymmetric-Polarization-Converter.

40. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” International Conference on Machine Learning, PMLR, 448–456 (2015).

41. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv:1512.03385 (2016).

42. X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 315–323 (2011).

43. A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML Workshop on Deep Learning for Audio, Speech and Language Processing 30(1), 2013.

44. L. Lu, Y. Shin, Y. Su, Y. Su, and G. E. Karniadakis, “Dying ReLU and Initialization: Theory and Numerical Examples,” Commun. Comput. Phys. 28(5), 1671–1706 (2020). [CrossRef]  

45. F. Chollet, Keras, https://github.com/fchollet/keras, (2015).

46. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 (2014).

47. R. J. Williams and J. Peng, “Function optimization using connectionist reinforcement learning algorithms,” Connection Sci. 3(3), 241–268 (1991). [CrossRef]  

48. C. J. C. H. Watkins, “Learning from delayed rewards,” PhD thesis, University of Cambridge England (1989).

49. J. Peng and R. J. Williams, “Incremental multi-step Q-learning,” Machine Learning Proceedings 226–232, (1994).

50. C. J. C. H. Watkins and P. Dayan, “Technical Note: Q-Learning,” Mach. Learn. 8(3-4), 279–292 (1992). [CrossRef]  

51. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv:1312.5602 (2013).

52. H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” Proceedings of the AAAI Conference on Artificial Intelligence 30(1) (2016).

53. OpenAI, https://github.com/openai/baselines/tree/master/baselines/a2c, (2019).

Supplementary Material (1)

Dataset 1: Optimization for Asymmetric Polarization Converter
