Training optronic convolutional neural networks on an optical system through backpropagation algorithms

Open Access

Abstract

The development of optical neural networks has greatly eased the urgent demand for fast computing approaches capable of handling big data. However, most optical neural networks that follow the paradigm of electronic training and optical inference do not fully exploit optical computing to reduce the computational burden. Taking the widely used optronic convolutional neural network (OPCNN) as an example, the convolutional operations still require vast amounts of computation on the computer during training. To address this issue, this study proposes an in-situ training algorithm that trains the network directly in optics. We derive the backpropagation algorithms of OPCNN so that the complicated gradient calculations of the backward pass can be obtained through optical computing. Both forward propagation and backward propagation are executed on the same optical system. Furthermore, we introduce optical nonlinearity into the network using the photorefractive crystal SBN:60 and derive the corresponding backpropagation algorithm. Numerical simulations of classification performance on several datasets validate the feasibility of the proposed algorithms. Through in-situ training, the reduction in performance resulting from the inconsistency of the platform between the training and inference stages can be eliminated completely. For example, we demonstrate that, with the optical training approach, OPCNN gains strong robustness under several misalignment situations, which enhances its practicability and greatly expands its application range.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

As a novel research direction in machine learning, deep learning [1,2] makes machines more intelligent and plays a significant role in solving pattern recognition [3,4], object classification [5] and language processing problems [6–8]. The development of deep learning technology benefits from the evolving architectures of artificial neural networks (ANNs). Although complex networks improve task performance, large-scale models incur massive computational cost in electronic processing [9]. Furthermore, the acceleration of hardware computing faces the fundamental limits of transistor scaling described by Moore’s law. Hence the demand for alternative computing methods is urgent and necessary [10,11].

To break through the bottlenecks of electronic computing, applying optical technologies to big data processing is a promising approach owing to their ultrafast processing speed and massive parallelism. In recent years, optical neural networks (ONNs) have developed rapidly and have been realized in both free-space and integrated settings [12–18]. Among them, free-space ONNs transmit information through optical diffraction, exploiting parallel computing at the speed of light to the fullest. The diffractive deep neural network (${{\rm {D}}^{2}}{\rm {NN}}$) proposed by Lin et al., a pioneering work of great significance, promoted the development of free-space ONNs [19–22]. Subsequent works continually improved the ${{\rm {D}}^{2}}{\rm {NN}}$ framework to make models deeper [23], more robust [24] and even trainable directly on an optical platform [25]. However, in contrast to the mainstream approach of implementing convolutional operations in hidden layers to extract features, the ${{\rm {D}}^{2}}{\rm {NN}}$ uses fully connected operations, making its framework equivalent to a multilayer perceptron. Meanwhile, since the optical $4f$ system can realize convolutional operations at the speed of light in free space, it is feasible and promising to construct convolutional neural networks (CNNs) in optics [26]. Frameworks of optical CNNs were usually hybrid optical-electronic, achieving only the convolutional operations in optics, whereas the other operations were still performed via electronic computing [27–29]. The hardware platform still had to perform a massive number of operations, so the advantages of optical computing were not fully exploited. Therefore, in our previous work, we proposed an optronic convolutional neural network (OPCNN) capable of executing all computational operations in optics [30]. By implementing the convolutional module, down-sampling module and fully connected module in optics, our framework performed object classification tasks in a similar fashion to digital convolutional neural networks. The classification performance on the Modified National Institute of Standards and Technology (MNIST) dataset [31] and the Fashion-MNIST dataset [32] demonstrated the feasibility of OPCNN.

However, the training stage of most free-space ONNs is performed on an electronic platform, and the optical platform only carries out the inference stage after loading the pre-trained weights. Realizing free-space ONNs in this way leads to two main challenges. The first challenge is how to eliminate the discrepancy arising from the platform inconsistency between the training and inference stages. For example, weights are trained on the electronic platform under the assumption that the whole framework is strictly aligned, whereas, once loaded onto the optical platform, layer-to-layer misalignment errors lead to weight mismatch and thus degrade the practical inference performance. Faced with this challenge, researchers have proposed several methods to mitigate these effects, such as adding potential misalignment errors in the $x$, $y$, and $z$ directions of each diffractive layer during training, or improving the robustness of the network by substituting layers insensitive to displacement [24,33]. These methods are effective to some extent but cannot tackle the problem at its root. Besides misalignment errors, fabrication errors such as lens curvature error and the quantization error introduced when encoding weights onto optical components also influence the optical computing results; moreover, these errors are unmeasurable and therefore cannot be modeled in the training stage. The second challenge is how to thoroughly reduce the dependence on hardware with high computing power. Although inference is executed on the optical platform with high processing speed and low energy consumption, training executed on the electronic platform still depends on the processing ability of the hardware. Accelerating training requires high computing power, but more advanced hardware brings higher energy consumption. Under this condition, realizing the training stage on the optical platform is necessary. Hence the essential issue is how to implement, via optical computing, the backpropagation algorithm that optimizes the network by minimizing the loss function. In-situ training was first proposed for optical interference neural networks, and its performance demonstrated the feasibility of implementing backpropagation directly in optics [34–36]. In 2020, Zhou et al. realized the in-situ training of ${{\rm {D}}^{2}}{\rm {NN}}$ in free space [25]. They applied the gradient descent algorithm and calculated the gradient of the loss function with respect to the parameters by measuring the forward- and backward-propagated optical fields at each layer. Since the whole derivation is conducted through optical computing, the computational cost on the electronic platform is almost negligible. However, the approximately fully connected structure of the ${{\rm {D}}^{2}}{\rm {NN}}$ differs from the convolutional structure of OPCNN, so this backpropagation algorithm needs to be modified.

In this work, we propose an optical backpropagation algorithm for OPCNN to train the neural network directly on an optical platform. We show that the gradients of each layer are related only to the optical fields of the forward and backward propagation processes, so the gradients can be calculated by measuring these fields. In the in-situ training process, the major computational operations are executed in optics, and the energy cost of the electronic computing that controls data loading and image acquisition is far lower than that of training the network on an electronic computer. We introduce optical nonlinearity into OPCNN by inserting a photorefractive crystal (SBN:60) between adjacent convolutional layers and then derive the corresponding optical backpropagation algorithm of the framework [37–39]. The photorefractive crystal achieves nonlinearity through phase modulation of the complex field: its refractive index changes with the intensity of the incident light, so an additional phase is modulated onto the output light. This self-focusing nonlinearity is tunable with a voltage applied across the crystal's $c$-axis. Compared with introducing nonlinearity on the electronic platform, optical nonlinearity avoids the time delay of acquiring and then regenerating the optical field. To update the weights in real time according to the gradients, reconfigurable spatial light modulators (SLMs) are used to load the parameters. Complex optical fields are measured with an sCMOS camera recording the amplitude information and a wavefront sensor recording the phase information. Complex optical fields are generated via a complex field generation module (CFGM) consisting of a $4f$ optical system with a low-pass filter at its Fourier plane [40]. By simulating with the physical parameters of each optical component, we validate that the in-situ trained OPCNN can solve complex object classification tasks on several datasets. The simulation results also show that introducing optical nonlinearity significantly improves performance compared with the linear optical framework. Furthermore, by analyzing the variation of optical inference performance under misalignment, we demonstrate that optically trained networks are more robust than electronically trained ones. Compared with other in-situ training algorithms, ours is applicable to free-space optical neural networks whose structures are mainly composed of convolutional modules. The proposed channel superposition simplifies the optical structure and gives OPCNN the capacity to solve complex classification tasks.

The rest of the paper is organized as follows. Section 2 presents the backpropagation algorithms of OPCNN without and with optical nonlinearity. Section 3 then shows the numerical simulation results of in-situ training OPCNN for object classification tasks under different situations. The conclusions are drawn in Section 4.

2. Optical backpropagation algorithms

Before deriving the backpropagation algorithm, we first discuss the framework of OPCNN. In our previous work, the key components of the proposed OPCNN included a convolutional module, a down-sampling module, a nonlinear activation module and a global average pooling (GAP) module (Fig. 1). The main functions of the down-sampling module are to decrease the number of parameters and to remove redundant feature information [41]. Consequently, the down-sampling module is integral to reducing the computational cost of electronic training, where hardware computing power is limited. Nevertheless, this module is no longer essential in optical training because the speed of parallel computing in optics is independent of the amount of computation. Hence the framework designed in this work consists only of the convolutional module, the nonlinear activation module and the GAP module. The implementation of the convolutional layer is discussed in Ref. [30]. We employ an optical $4f$ system with a lenslet array to realize multi-channel parallel convolutional operations. An additional phase is modulated onto the frequency spectrum of the kernels to shift the output images of all channels to the same position, which realizes a three-dimensional convolution. In this work, however, we implement the convolutional module with separate channels: all convolutional layers contain an equal number of channels, and the output of a channel in one layer is fed as input to the corresponding channel in the next layer. This implementation helps downsize the framework, since, by the associative law of convolution, the cascaded convolutional layers can be simplified into a single convolutional layer with one newly generated kernel; combining all convolutional layers into one layer compresses the scale of the framework and makes it convenient to calibrate and fabricate. For different datasets, we adjust the number of channels such that the number of channels $N$ and the number of dataset categories $C$ satisfy $N = mC$, where $m$ is a positive integer. In the simulations, the MNIST and Fashion-MNIST datasets used to train and validate the network contain ten object categories, so we design ten channels in the convolutional layers. When using more complicated datasets such as the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset [42,43], we increase the number of channels for better classification performance, to forty in this work. The output optical fields of these forty channels are measured by an sCMOS camera and then superposed into ten outputs, each corresponding to one category. After carrying out global average pooling on these ten outputs, the scores are used to calculate the loss function or to predict the classification result, with the channel attaining the highest score corresponding to the predicted category. The input images and kernels are all loaded onto SLMs for reprogramming and real-time data processing. In the rest of this section, we derive the backpropagation algorithms of OPCNN without and with optical nonlinearity, respectively.

2.1 Without optical nonlinearity

Before considering the backward process, we need to present the forward process of OPCNN in matrix multiplication form. First we derive the convolutional process implemented through an optical $4f$ system. It is well known that a $2f$ system consisting of a single Fourier lens is capable of performing a Fourier transform: the input image is loaded at the front focal plane of the Fourier lens and its frequency spectrum appears at the back focal plane. According to the angular spectrum propagation algorithm, this process can be expressed as:

$$U_{0}^{\prime}(u, v)=\frac{A}{j \lambda f} \iint U_{0}(x, y) \exp \left[{-}j \frac{2 \pi}{\lambda f}(x u+y v)\right] d x d y,$$
where ${U_0}\left ( {x,y} \right )$ represents the optical field of the input image and ${U_0}^{\prime } \left ( {u,v} \right )$ represents the optical field of the output, i.e., the frequency spectrum; $\lambda$ and $f$ denote the wavelength of the light and the focal length of the Fourier lens, respectively. Ignoring the complex constant factor, the integral is a standard Fourier transform. For convenience of derivation, we rewrite this integral in matrix multiplication form. Assuming the input image is a matrix of size $M \times M$, we convert it into a column vector ${{\mathbf {U}}_{\mathbf {0}}}$. Likewise, the output matrix is converted into a column vector ${{\mathbf {U}}_{\mathbf {0}}}^{\prime }$. Then the exponential term in the integral can be rewritten as a matrix ${\mathbf {P}}$ of size ${M^{2}} \times {M^{2}}$. Hence this Fourier transform process can be expressed as Eq. (2).
$${{\mathbf{U}}_{\mathbf{0}}}^{\prime} = {\mathbf{P}}{{\mathbf{U}}_{\mathbf{0}}},$$
where
$$\left( {\begin{array}{c} {{U_0}^{\prime} \left( {{u_1},{v_1}} \right)}\\ {{U_0}^{\prime} \left( {{u_1},{v_2}} \right)}\\ \vdots \\ {{U_0}^{\prime} \left( {{u_M},{v_M}} \right)} \end{array}} \right){\rm{ = }}\left( {\begin{array}{ccc} {{e^{ - j\frac{{2\pi }}{{\lambda f}}\left( {{x_1}{u_1} + {y_1}{v_1}} \right)}}} & \cdots & {{e^{ - j\frac{{2\pi }}{{\lambda f}}\left( {{x_M}{u_1} + {y_M}{v_1}} \right)}}}\\ \vdots & \ddots & \vdots \\ {{e^{ - j\frac{{2\pi }}{{\lambda f}}\left( {{x_1}{u_M} + {y_1}{v_M}} \right)}}} & \cdots & {{e^{ - j\frac{{2\pi }}{{\lambda f}}\left( {{x_M}{u_M} + {y_M}{v_M}} \right)}}} \end{array}} \right) \times \left( {\begin{array}{c} {{U_0}\left( {{x_1},{y_1}} \right)}\\ {{U_0}\left( {{x_1},{y_2}} \right)}\\ \vdots \\ {{U_0}\left( {{x_M},{y_M}} \right)} \end{array}} \right),$$

At the back focal plane, the frequency spectrum of the input image is multiplied element-wise by the kernel matrix, whose size is $M \times M$. Therefore, we first vectorize the kernel matrix and then convert the column vector into a diagonal matrix of size ${M^{2}} \times {M^{2}}$. Thus the element-wise product can be expressed as Eq. (4).

$$\begin{array}{c} {{\mathbf{U}}_{\mathbf{0}}}^{\prime \prime } = {{\mathbf{W}}_{\mathbf{1}}}{{\mathbf{U}}_{\mathbf{0}}}^{\prime} \\ {{\mathbf{W}}_{\mathbf{1}}} = diag\left( {{e^{j{{\mathbf{\Phi }}_{\mathbf{1}}}}}} \right) = \left( {\begin{array}{cccc} {{e^{j{\phi _{1,1}}}}} & 0 & \cdots & 0\\ 0 & \ddots & {} & \vdots \\ \vdots & {} & \ddots & \vdots \\ 0 & \cdots & \cdots & {{e^{j{\phi _{M,M}}}}} \end{array}} \right) \end{array},$$
where the phase terms ${{\mathbf {\Phi }}_{\mathbf {1}}}{\rm {\ =\ }}\left ( {{\phi _{1,1}},{\phi _{1,2}}, \ldots,{\phi _{M,M}}} \right )$ are the parameters to be trained. The reason for training only the phase term is that phase information plays a substantial role in the Fourier representation of a signal. The optical $4f$ system is composed of two cascaded $2f$ systems, hence the whole propagation process can be expressed as Eq. (5).
$${{\mathbf{U}}_{\mathbf{1}}}{\rm{ = }}{\mathbf{P}}{{\mathbf{W}}_{\mathbf{1}}}{\mathbf{P}}{{\mathbf{U}}_{\mathbf{0}}},$$
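As a concrete reference, the matrix form of Eqs. (2)–(5) for a single channel can be sketched in a few lines of Python. The grid size and sampling are illustrative, the constant factor in Eq. (1) is dropped, and the spatial and frequency planes share the same sampling grid; this is a numerical sketch rather than the simulation code of Section 3.

```python
import numpy as np

# Numerical sketch of Eqs. (2)-(5) for a single channel.
M, wavelength, f = 8, 532e-9, 30e-3
pitch = 8e-6                                   # sampling interval of the SLM pixels
coords = (np.arange(M) - M // 2) * pitch

# Flatten the 2-D grids so an M x M image becomes a column vector of length M^2.
x, y = [g.ravel() for g in np.meshgrid(coords, coords, indexing="ij")]
u, v = x, y                                    # frequency-plane coordinates (same grid)

# Transmission matrix P of one 2f stage, Eq. (3).
P = np.exp(-2j * np.pi / (wavelength * f) * (np.outer(u, x) + np.outer(v, y)))

# Trainable phase-only kernel W_1 at the Fourier plane, Eq. (4).
phi = np.random.uniform(-np.pi, np.pi, M * M)
W1 = np.diag(np.exp(1j * phi))

U0 = np.random.rand(M * M)                     # vectorised input image
U1 = P @ W1 @ P @ U0                           # Eq. (5): output of the 4f layer
```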

Therefore, after the coherent light propagates through $N$ cascaded convolutional layers with multiple channels, the optical field at the output layer can be expressed as Eq. (6).

$${{\mathbf{U}}_{\mathbf{N}}^{k}{\rm{ = }}\prod_{i = 1}^{N} {\left( {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} \right)} {\mathbf{U}}_{\mathbf{0}}^{k}} {\kern 1cm} {k = 1,2, \ldots ,channel\_num},$$
where $i$ represents the $i$-th convolutional layer and $k$ represents the $k$-th channel. We use the sCMOS camera to measure the intensity of the optical field at the output plane and compute its amplitude before carrying out the global average pooling operation. Note that this operation is equivalent to introducing nonlinearity through electronic computing. The subsequent operations executed on the computer are shown below:
$$\begin{array}{cc} {{\mathbf{O}}^{k}} = \left| {{\mathbf{U}}_{\mathbf{N}}^{k}} \right|{\rm{ = }}\left| {\prod_{i = 1}^{N} {\left( {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} \right)} {\mathbf{U}}_{\mathbf{0}}^{k}} \right|\\ \begin{array}{ll} {{\mathbf{O}}_{{\mathbf{Superpose}}}^{i}{\rm{ = }}\sum_{j = 1}^{m} {\left( {{{\mathbf{O}}^{i + C(j - 1)}}} \right)} } & {i = 1, \ldots ,C} \end{array}\\ {\mathbf{O}}_{{\mathbf{GAP}}}^{i} = avg\left( {{\mathbf{O}}_{{\mathbf{Superpose}}}^{i}} \right)\\ {\mathbf{O}}_{{\mathbf{softmax}}}^{i} = \frac{{\exp \left( {{\mathbf{O}}_{{\mathbf{GAP}}}^{i}} \right)}}{{\sum\nolimits_j {\exp \left( {{\mathbf{O}}_{{\mathbf{GAP}}}^{j}} \right)} }}\\ L ={-} \sum\nolimits_i {{{\mathbf{T}}^{i}}\ln \left( {{\mathbf{O}}_{{\mathbf{softmax}}}^{i}} \right)} . \end{array}.$$
where $m$ represents the ratio of the number of channels to the number of categories $C$ (a positive integer); ${{\mathbf {O}}^{k}}$ represents the output amplitude of the $k$-th channel and ${\mathbf {O}}_{{\mathbf {Superpose}}}^{i}$ represents the $i$-th output after the superposition operation; ${\mathbf {T}}$ represents the ground-truth labels and $L$ represents the loss function. Here we use the softmax function to normalize the output of the GAP layer and cross entropy to construct the loss function. We have now derived the forward propagation expression of OPCNN without optical nonlinearity.
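For reference, the electronic post-processing of Eq. (7) can be sketched in a few lines of Python; the array shapes, the helper name and the small numerical epsilon are illustrative choices, not the code used in the paper.

```python
import numpy as np

def electronic_head(U_N, C, target):
    """Eq. (7): U_N holds the complex output fields of all m*C channels,
    shape (m*C, M*M); target is a one-hot label vector of length C."""
    O = np.abs(U_N)                                   # amplitude from the sCMOS measurement
    m = O.shape[0] // C
    O_superpose = O.reshape(m, C, -1).sum(axis=0)     # superpose the m channels of each category
    O_gap = O_superpose.mean(axis=1)                  # global average pooling
    e = np.exp(O_gap - O_gap.max())
    O_softmax = e / e.sum()                           # softmax normalisation
    loss = -np.sum(target * np.log(O_softmax + 1e-12))  # cross-entropy loss
    return O_softmax, loss
```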

To optimize the network, we need to minimize the loss function through the gradient descent algorithm. Therefore, the key point of backward propagation is to calculate the gradient of the loss function $L$ with respect to the parameters ${\mathbf {\Phi }}_{\mathbf {n}}^{k}$ of the $k$-th channel in the $n$-th convolutional layer:

$$\begin{aligned} \frac{{\partial L}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}} & = \frac{{\partial L}}{{\partial {\mathbf{U}}_{\mathbf{N}}^{k}}}\frac{{\partial {\mathbf{U}}_{\mathbf{N}}^{k}}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}} + \frac{{\partial L}}{{\partial {\mathbf{U}}{{_{\mathbf{N}}^{k}}^ * }}}\frac{{\partial {\mathbf{U}}{{_{\mathbf{N}}^{k}}^ * }}}{{\partial {\mathbf{\Phi }}{{_{\mathbf{n}}^{k}}^ * }}}\\ & = 2{\mathop{\rm Re}\nolimits} \left[ {{{\left( {\frac{{\partial L}}{{\partial {{\mathbf{O}}^{k}}}} \odot \frac{{{\mathbf{U}}{{_{\mathbf{N}}^{k}}^ * }}}{{\left| {{\mathbf{U}}_{\mathbf{N}}^{k}} \right|}}} \right)}^{T}}\frac{{\partial {\mathbf{U}}_{\mathbf{N}}^{k}}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}}} \right]\\ & = 2{\mathop{\rm Re}\nolimits} \left\{ {{{\left[ \begin{array}{l} \frac{1}{m}tile{\left( {upsample{{\left( {{\mathbf{O}}_{{\mathbf{softmax}}}^{i} - {{\mathbf{T}}^{i}}} \right)}_{{M^{2}} \times 1}}} \right)_m}\\ \odot \frac{{{\mathbf{U}}{{_{\mathbf{N}}^{k}}^ * }}}{{\left| {{\mathbf{U}}_{\mathbf{N}}^{k}} \right|}} \end{array} \right]}^{T}}\frac{{\partial {\mathbf{U}}_{\mathbf{N}}^{k}}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}}} \right\}, \end{aligned}$$
where $upsample{\left ( {\cdot } \right )_{{M^{2}} \times 1}}$ denotes the upsampling operation that inverts the global average pooling: for a value $A$, the output is an ${M^{2}} \times 1$ vector whose elements all equal $A/(M \times M)$. $tile{\left ( {\cdot } \right )_m}$ denotes tiling the matrix $m$ times, inverting the superposition operation. According to Eq. (8), the next step is to derive the gradient of the output of the last convolutional layer ${\mathbf {U}}_{\mathbf {N}}^{k}$ with respect to the parameters ${\mathbf {\Phi }}_{\mathbf {n}}^{k}$ of the $k$-th channel in the $n$-th convolutional layer.
$$\frac{{\partial {\mathbf{U}}_{\mathbf{N}}^{k}}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}}{\rm{ = }}j * \left[ {\left( {\prod_{i = n + 1}^{N} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} } \right){\mathbf{P}}} \right] * diag\left[ {{\mathbf{W}}_n^{k}{\mathbf{P}}{\rm{(}}\prod_{i = 1}^{n - 1} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} ){\mathbf{U}}_{\mathbf{0}}^{k}} \right],$$
if we let
$${{\mathbf{E}}^{k}}{\rm{ = }}\frac{1}{m}tile{\left( {upsample{{\left( {{\mathbf{O}}_{{\mathbf{softmax}}}^{i} - {{\mathbf{T}}^{i}}} \right)}_{{M^{2}} \times 1}}} \right)_m} \odot \frac{{{\mathbf{U}}{{_{\mathbf{N}}^{k}}^ * }}}{{\left| {{\mathbf{U}}_{\mathbf{N}}^{k}} \right|}},$$

Eq. (8) can be simplified to

$$\begin{aligned} \frac{{\partial L}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}} & = 2{\mathop{\rm Re}\nolimits} \left\langle {{{\left( {{{\mathbf{E}}^{k}}} \right)}^{T}} * \left\{ {j * \left[ {\left( {\prod_{i = n + 1}^{N} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} } \right){\mathbf{P}}} \right] * diag\left[ {{\mathbf{W}}_n^{k}{\mathbf{P}}{\rm{(}}\prod_{i = 1}^{n - 1} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} ){\mathbf{U}}_{\mathbf{0}}^{k}} \right]} \right\}} \right\rangle \\ & = 2{\mathop{\rm Re}\nolimits} {\left\{ {j * \left[ {{\mathbf{W}}_{\mathbf{n}}^{k}{\mathbf{P}}{\rm{(}}\prod_{i = 1}^{n - 1} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} ){\mathbf{U}}_{\mathbf{0}}^{k}} \right] \odot \left[ {{\mathbf{P}}\left( {\prod_{i = N}^{n + 1} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} } \right){{\mathbf{E}}^{k}}} \right]} \right\}^{T}}, \end{aligned}$$
if we let
$$\begin{aligned} {\mathbf{F}}_{\mathbf{n}}^{k} & = {\mathbf{W}}_{\mathbf{n}}^{k}{\mathbf{P}}{\rm{(}}\prod_{i = 1}^{n - 1} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} ){\mathbf{U}}_{\mathbf{0}}^{k}\\ {\mathbf{B}}_{\mathbf{n}}^{k} & = {\mathbf{P}}\left( {\prod_{i = N}^{n + 1} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} } \right){{\mathbf{E}}^{k}}, \end{aligned}$$

Eq. (11) can be simplified to

$$\frac{{\partial L}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}}{\rm{ = }}2{\mathop{\rm Re}\nolimits} {\left[ {j * {\mathbf{F}}_{\mathbf{n}}^{k} \odot {\mathbf{B}}_{\mathbf{n}}^{k}} \right]^{T}}.$$

Analyzing Eq. (12), we find that the vector ${\mathbf {F}}_{\mathbf {n}}^{k}$ represents the optical field right after the coherent light passes through kernel ${\mathbf {W}}_{\mathbf {n}}^{k}$ of the $k$-th channel in the $n$-th convolutional layer. If we take ${{\mathbf {E}}^{k}}$, a vector of the same size as ${\mathbf {U}}_{\mathbf {0}}^{k}$, as the input of the network and propagate backward starting from the last convolutional layer, the vector ${\mathbf {B}}_{\mathbf {n}}^{k}$ represents the optical field just before the coherent light passes through kernel ${\mathbf {W}}_{\mathbf {n}}^{k}$ of the $k$-th channel in the $n$-th convolutional layer. The gradient of the loss function with respect to the parameters is related to these two vectors only, so it is possible to realize the backpropagation process on the optical platform.
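As a numerical illustration of Eqs. (10) and (13), the error field and the final gradient can be assembled from the measured quantities as follows; the function names, array shapes and the small numerical epsilon are our own illustrative conventions.

```python
import numpy as np

def error_field(O_softmax, target, U_N, m, M):
    """Eq. (10): build E^k for all m*C channels from the softmax error and the
    recorded output fields U_N (shape (m*C, M*M))."""
    C = len(target)
    per_cat = (O_softmax - target) / (M * M)          # upsample of (O_softmax - T)
    err = np.repeat(per_cat, M * M).reshape(C, -1)    # spread each scalar over M*M pixels
    err = np.tile(err, (m, 1)) / m                    # tile over the m channel groups
    return err * np.conj(U_N) / (np.abs(U_N) + 1e-12)

def phase_gradient(F_n, B_n):
    """Eq. (13): dL/dPhi_n^k = 2 Re[(j * F_n^k) element-wise-times B_n^k]."""
    return 2.0 * np.real(1j * F_n * B_n)
```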

Before designing the optical system, we should determine which optical fields need to be measured. First, we need to measure the output ${\mathbf {U}}_{\mathbf {N}}^{k}$ of the last convolutional layer, which is used to generate the error optical field ${{\mathbf {E}}^{k}}$. Then we need to measure the optical field ${\mathbf {F}}_{\mathbf {n}}^{k}$ during the forward propagation process and the optical field ${\mathbf {B}}_{\mathbf {n}}^{k}$ during the backward propagation process, where $n$ denotes the $n$-th convolutional layer. To measure these complex optical fields, we utilize an sCMOS camera to record their amplitude information and a wavefront sensor to record their phase information. Besides measuring complex optical fields, we also need to generate the complex error optical field ${{\mathbf {E}}^{k}}$. In the optical system, an amplitude SLM is used to load the real-valued image ${\mathbf {U}}_{\mathbf {0}}^{k}$, and phase-only SLMs are used to load the real-valued phase coefficients ${\mathbf {\Phi }}_{\mathbf {n}}^{k}$ to generate $\exp \left ( {j{\mathbf {\Phi }}_{\mathbf {n}}^{k}} \right )$. However, neither an amplitude SLM nor a phase-only SLM alone can generate an arbitrary complex optical field. Therefore, we employ the method proposed in Refs. [25] and [40], which synthesizes the amplitude and phase information from a phase-only SLM through a CFGM. The CFGM is composed of a $4f$ optical system with a low-pass filter at its Fourier plane. A complex field $U\left ( {x,y} \right ) = A\left ( {x,y} \right ){e^{j\varphi \left ( {x,y} \right )}}$ can be rewritten as $U\left ( {x,y} \right ) = B{e^{j\theta \left ( {x,y} \right )}} + B{e^{j\vartheta \left ( {x,y} \right )}}$, where $B = {A_{\max }}/2$ is a constant, $\theta \left ( {x,y} \right ) = \varphi \left ( {x,y} \right ) + {\cos ^{ - 1}}\left ( {A\left ( {x,y} \right )/{A_{\max }}} \right )$ and $\vartheta \left ( {x,y} \right ) = \varphi \left ( {x,y} \right ) - {\cos ^{ - 1}}\left ( {A\left ( {x,y} \right )/{A_{\max }}} \right )$. We then generate a checkerboard pattern interleaving the phases $\theta \left ( {x,y} \right )$ and $\vartheta \left ( {x,y} \right )$ and load it on the phase-only SLM. After the Fourier transform, the low-pass filter at the Fourier plane blocks all diffraction orders except the zeroth one, which carries the frequency spectrum of $U\left ( {x,y} \right )$. After another Fourier transform, the complex field $U\left ( {x,y} \right )$ is generated at the back focal plane. Based on the above analysis, the designed optical system is shown in Fig. 1.
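A minimal sketch of the CFGM encoding described above, following Refs. [25,40]; the exact checkerboard interleaving and the clipping are our own illustrative choices.

```python
import numpy as np

def checkerboard(shape):
    yy, xx = np.indices(shape)
    return (xx + yy) % 2 == 0

def cfgm_phase_pattern(A, phi):
    """Encode the complex field A*exp(j*phi) as a single phase pattern: the two
    constant-amplitude components theta and vartheta are interleaved in a
    checkerboard; the low-pass filter of the CFGM recovers the complex field."""
    A_max = A.max()
    delta = np.arccos(np.clip(A / A_max, 0.0, 1.0))
    theta, vartheta = phi + delta, phi - delta
    return np.where(checkerboard(A.shape), theta, vartheta)  # loaded on the phase-only SLM
```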

The flow of in-situ training OPCNN is now clear. Taking an OPCNN whose convolutional layers contain ten channels as an example, the training process is as follows (a numerical sketch is given after this paragraph): in step one, images are loaded onto the input channels and randomly generated phase parameters are loaded onto each convolutional layer; in step two, the optical field ${\mathbf {F}}_{\mathbf {n}}^{k}$ at each layer and the output optical field ${\mathbf {U}}_{\mathbf {N}}^{k}$ of the last convolutional layer are recorded by the sCMOS camera and wavefront sensor; in step three, ${\mathbf {U}}_{\mathbf {N}}^{k}$ is fed to the computer for the remaining computing operations and the error optical field ${{\mathbf {E}}^{k}}$ is generated; in step four, the complex optical field ${{\mathbf {E}}^{k}}$ is encoded onto the optical system through the CFGM and fed as input to OPCNN for backward propagation; in step five, the optical field ${\mathbf {B}}_{\mathbf {n}}^{k}$ at each layer is recorded; in step six, the gradient of the loss function with respect to the parameters is calculated according to Eq. (13) and the parameters are updated according to ${\left ( {{\mathbf {\Phi }}_{\mathbf {n}}^{k}} \right )_{new}} = {\mathbf {\Phi }}_{\mathbf {n}}^{k} - \eta \frac {{\partial L}}{{\partial {\mathbf {\Phi }}_{\mathbf {n}}^{k}}}$, where $\eta$ denotes the learning rate; in step seven, ${\left ( {{\mathbf {\Phi }}_{\mathbf {n}}^{k}} \right )_{new}}$ is loaded onto OPCNN for the next round of training. These seven steps are repeated until the loss function $L$ is stable and below a threshold value. The pseudo code of the in-situ training algorithm without optical nonlinearity (Algorithm 1) is shown in the first table of the Appendix.
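To make the loop concrete, the following self-contained toy emulation runs the seven steps numerically for a single channel and a single convolutional layer; a unitary DFT stands in for the $2f$ propagation and a simple quadratic loss on the GAP score replaces the cross-entropy head, so this only illustrates the update rule of Eq. (13) and is not the laboratory control code.

```python
import numpy as np

M = 8
P = np.fft.fft(np.eye(M * M), norm="ortho")        # symmetric unitary stand-in for the 2f stage
Phi = np.random.uniform(-np.pi, np.pi, M * M)      # step 1: random phase parameters
U0 = np.random.rand(M * M)                         # one vectorised input image
target, lr = 0.5, 0.01                             # toy scalar target and learning rate

for step in range(200):
    W = np.exp(1j * Phi)
    F = W * (P @ U0)                               # step 2: field just after the kernel
    U_N = P @ F                                    # output field of the (only) layer
    O = np.abs(U_N)
    score = O.mean()                               # GAP score of this channel
    dL_dO = np.full(M * M, (score - target) / (M * M))  # step 3: toy loss 0.5*(score-target)^2
    E = dL_dO * np.conj(U_N) / (O + 1e-12)         # error field, Eq. (10)
    B = P @ E                                      # steps 4-5: backward pass through P
    grad = 2.0 * np.real(1j * F * B)               # step 6: gradient of Eq. (13)
    Phi -= lr * grad                               # update and reload (step 7)
```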

2.2 With optical nonlinearity

After deriving the backpropagation algorithm of OPCNN without nonlinear activation, we now consider introducing optical nonlinearity into the network using the photorefractive crystal SBN:60. A photorefractive crystal is a nonlinear material whose nonlinear characteristic depends on the applied voltage. The principle of realizing nonlinearity with a photorefractive crystal is as follows. When light illuminates the crystal, the material absorbs photons and produces non-uniformly distributed free space charges. The drift, diffusion and recapture of space charge in the crystal redistribute the charge and generate a space-charge field. The space-charge field modulates the refractive index via the linear electro-optic effect, forming a phase-shift grating. Therefore, the output light acquires a different additional phase at each position, whose value depends on the intensity of the input light, thereby implementing optical nonlinearity [38,39].

Supposing the base refractive index of the photorefractive crystal is ${n_0}$, the change of index under light illumination can be expressed as $\Delta n = \kappa {E_{app}}\left\langle I \right\rangle / \left( {1 + \left\langle I \right\rangle } \right)$, where $\left \langle I \right \rangle$ is an intensity perturbation above a spatially homogeneous background intensity $\left \langle {{I_0}} \right \rangle$, ${E_{app}}$ is an electric field applied across the $c$-axis and $\kappa = {n_0}{r_{33}}\left ( {1 + \left \langle {{I_0}} \right \rangle } \right )$ is a constant depending on the base refractive index ${n_0}$, the electro-optic coefficient ${r_{33}}$ and the background intensity $\left \langle {{I_0}} \right \rangle$. From this equation we find that the refractive index change of the photorefractive crystal is driven by the applied electric field and that its magnitude depends on the intensity of the illuminating light. According to the above theory, the generated additional phase is related to the change of refractive index. Supposing a coherent beam propagates through 10 $mm$ of SBN:60 photorefractive crystal under a voltage of 1000 $V$, the generated phase shift can be expressed as

$$\Delta \varphi \left( I \right) = \pi \frac{{\left\langle I \right\rangle }}{{1 + \left\langle I \right\rangle }},$$
hence the optical field of output light can be expressed as
$${E_{output}} = g({E_{input}}) = {E_{input}}{e^{j\pi \frac{{\left\langle {{{\left| {{E_{input}}} \right|}^{2}}} \right\rangle }}{{1 + \left\langle {{{\left| {{E_{input}}} \right|}^{2}}} \right\rangle }}}},$$
where ${E_{input}}$ and ${E_{output}}$ represent the optical fields of the input and output light; ${\left | {{E_{input}}} \right |^{2}}$ represents the intensity of the input light; $g\left ( {\cdot } \right )$ represents the nonlinear activation function.
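A short sketch of this activation and of one reading of its derivative; the text does not define $g'(\cdot)$ explicitly, so taking the Wirtinger derivative $\partial g/\partial E$ and applying the saturable response pixel-wise are assumptions.

```python
import numpy as np

def g(E):
    """Eq. (15): SBN:60 phase nonlinearity applied pixel-wise to the field E."""
    I = np.abs(E) ** 2
    return E * np.exp(1j * np.pi * I / (1.0 + I))

def g_prime(E):
    """Assumed g'(.) for Eqs. (17)-(19): Wirtinger derivative dg/dE."""
    I = np.abs(E) ** 2
    phase = np.pi * I / (1.0 + I)
    return np.exp(1j * phase) * (1.0 + 1j * I * np.pi / (1.0 + I) ** 2)
```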

As shown in Fig. 2, a photorefractive crystal can be inserted at the output plane of each convolutional layer, and the nonlinearly activated optical field is fed as input to the next convolutional layer. In this way, Eq. (5) is rewritten as

$${\mathbf{U}}_{\mathbf{i}}^{k}{\rm{ = }}g\left( {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{PU}}_{{\mathbf{i}} - {\mathbf{1}}}^{k}} \right)$$
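Using the $g(\cdot)$ sketched above, the layered forward pass of Eq. (16) can be written compactly as follows; $P$ is the transmission matrix of Eq. (3) and each entry of W_list holds the vectorised kernel phase factors of one layer (illustrative names).

```python
def forward_with_nonlinearity(U0, P, W_list):
    # Eq. (16): each 4f layer (P, element-wise kernel, P) is followed by the
    # SBN:60 nonlinearity g(.) before feeding the next layer.
    U = U0
    for W in W_list:
        U = g(P @ (W * (P @ U)))
    return U
```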

The remaining operations of the forward propagation process are the same as in Eq. (7). Then we need to calculate the gradient of the loss function with respect to the parameters. Compared with Eq. (8), the gradient of the output of the last convolutional layer ${\mathbf {U}}_{\mathbf {N}}^{k}$ with respect to the parameters ${\mathbf {\Phi }}_{\mathbf {n}}^{k}$ of the $k$-th channel in the $n$-th convolutional layer changes because of the introduction of optical nonlinearity. This derivative can be expressed as

$$\begin{aligned} \frac{{\partial {\mathbf{U}}_{\mathbf{N}}^{k}}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}} & = \prod_{i = N}^{n + 1} {(\frac{{\partial {\mathbf{U}}_{\mathbf{i}}^{k}}}{{\partial {\mathbf{U}}_{{\mathbf{i - 1}}}^{k}}})} \frac{{\partial {\mathbf{U}}_{\mathbf{n}}^{k}}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}}\\ & = j * \left\{ {\prod_{i = N}^{n + 1} {diag\left[ {g'\left( {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{PU}}_{{\mathbf{i}} - {\mathbf{1}}}^{k}} \right)} \right]{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}} } \right\} * \left\{ {diag\left[ {g'\left( {{\mathbf{PW}}_{\mathbf{n}}^{k}{\mathbf{PU}}_{{\mathbf{n}} - {\mathbf{1}}}^{k}} \right)} \right]} \right\}\\ & * \left\{ {{\mathbf{P}}diag\left[ {{\mathbf{W}}_{\mathbf{n}}^{k}{\mathbf{PU}}_{{\mathbf{n}} - {\mathbf{1}}}^{k}} \right]} \right\}, \end{aligned}$$
where $g'\left ( {\cdot } \right )$ represents the derivative of the nonlinear function. Then $\frac {{\partial L}}{{\partial {\mathbf {\Phi }}_{\mathbf {n}}^{k}}}$ can be expressed as:
$$\frac{{\partial L}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}} = 2{\mathop{\rm Re}\nolimits} {\left\langle {j * \left[ {{\mathbf{W}}_{\mathbf{n}}^{k}{\mathbf{PU}}_{{\mathbf{n}} - {\mathbf{1}}}^{k}} \right] \odot \left\{ \begin{array}{l} {\mathbf{P}}diag\left[ {g'\left( {{\mathbf{PW}}_{\mathbf{n}}^{k}{\mathbf{PU}}_{{\mathbf{n}} - {\mathbf{1}}}^{k}} \right)} \right] * \\ \left( {\prod_{i = N}^{n + 1} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}diag\left[ {g'\left( {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{PU}}_{{\mathbf{i}} - {\mathbf{1}}}^{k}} \right)} \right]{{\mathbf{E}}^{k}}} } \right) \end{array} \right\}} \right\rangle ^{T}},$$
if we let
$$\begin{aligned} {\mathbf{F}}_{\mathbf{n}}^{k} & = {\mathbf{W}}_{\mathbf{n}}^{k}{\mathbf{PU}}_{{\mathbf{n}} - {\mathbf{1}}}^{k}\\ {\mathbf{B}}_{\mathbf{n}}^{k} & = {\mathbf{P}}diag\left[ {g'\left( {{\mathbf{PW}}_{\mathbf{n}}^{k}{\mathbf{PU}}_{{\mathbf{n}} - {\mathbf{1}}}^{k}} \right)} \right] * \left( {\prod_{i = N}^{n + 1} {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{P}}diag\left[ {g'\left( {{\mathbf{PW}}_{\mathbf{i}}^{k}{\mathbf{PU}}_{{\mathbf{i}} - {\mathbf{1}}}^{k}} \right)} \right]{{\mathbf{E}}^{k}}} } \right), \end{aligned}$$

Eq. (19) can be simplified to

$$\frac{{\partial L}}{{\partial {\mathbf{\Phi }}_{\mathbf{n}}^{k}}}{\rm{ = }}2{\mathop{\rm Re}\nolimits} {\left[ {j * {\mathbf{F}}_{\mathbf{n}}^{k} \odot {\mathbf{B}}_{\mathbf{n}}^{k}} \right]^{T}},$$


Fig. 1. The in-situ training algorithm of OPCNN without optical nonlinearity. (a) The forward propagation and (b) backward propagation processes are implemented through the same optical system. The sCMOS camera and wavefront sensor are utilized to record the amplitude and phase information of the output images. For convenient recording, we construct a $4f$ system to extend the output plane. (a) In the forward propagation process, the output ${\mathbf {F}}_{\mathbf {n}}^{k}$ of each convolutional layer is recorded. The output of the last convolutional layer is fed into the computer to calculate the error optical field ${{\mathbf {E}}^{k}}$. Before backward propagation, we encode the complex information ${{\mathbf {E}}^{k}}$ on a phase-only SLM through the CFGM, which consists of a $4f$ optical system with a low-pass filter at its Fourier plane. (b) In the backward propagation process, the output ${\mathbf {B}}_{\mathbf {n}}^{k}$ of each layer is also recorded. The optical fields ${\mathbf {F}}_{\mathbf {n}}^{k}$ and ${\mathbf {B}}_{\mathbf {n}}^{k}$ recorded during the forward and backward propagation processes are fed into the computer to obtain the gradient through a simple calculation. The updated weights are reloaded on the SLMs for the next round of in-situ training.



Fig. 2. The in-situ training algorithm of OPCNN with optical nonlinearity. (a) The forward propagation and (b)(c) the backward propagation processes are implemented through the same optical system. (a) In the forward propagation process, we introduce optical nonlinearity into OPCNN by placing the photorefractive crystal SBN:60 at the output plane of each convolutional layer. The optical fields ${\mathbf {F}}_{\mathbf {n}}^{k}$ and ${\mathbf {U}}_{\mathbf {n}}^{k}$ of each convolutional layer are recorded. (b) In the backward propagation process, we first generate and encode the complex optical field ${\mathbf {E}}_{{\mathbf {n + 1}}}^{k}$ through the CFGM and record the optical field ${\left ( {{\mathbf {B}}_{\mathbf {n}}^{k}} \right )^{\prime } }$. (c) Then we generate and encode the complex optical field ${\mathbf {E}}_{\mathbf {n}}^{k}$ from ${\left ( {{\mathbf {B}}_{\mathbf {n}}^{k}} \right )^{\prime } }$ and record the optical field ${\mathbf {B}}_{\mathbf {n}}^{k}$. The recorded optical fields ${\mathbf {F}}_{\mathbf {n}}^{k}$ and ${\mathbf {B}}_{\mathbf {n}}^{k}$ are fed into the computer to obtain the gradient through a simple calculation. The updated weights are reloaded on the SLMs for the next round of in-situ training.


Because of the introduction of optical nonlinearity, the backpropagation process becomes slightly more complicated. Besides recording the optical field ${\mathbf {F}}_{\mathbf {n}}^{k}$ and the output optical field ${\mathbf {U}}_{\mathbf {N}}^{k}$ of the last convolutional layer, we also need to record the output optical field ${\mathbf {U}}_{\mathbf {n}}^{k} = g\left ( {{\mathbf {PW}}_{\mathbf {n}}^{k}{\mathbf {PU}}_{{\mathbf {n}} - {\mathbf {1}}}^{k}} \right )$ of every layer during the forward propagation process. Before each step of the backward propagation, we calculate ${\left ( {{\mathbf {B}}_{\mathbf {n}}^{k}} \right )^{\prime } } = \prod _{i = N}^{n + 1} {{\mathbf {PW}}_{\mathbf {i}}^{k}{\mathbf {P}}diag\left ( {{\mathbf {G}}_{\mathbf {i}}^{k}} \right ){{\mathbf {E}}^{k}}} = {\mathbf {PW}}_{{\mathbf {n + 1}}}^{k}{\mathbf {PE}}_{{\mathbf {n + 1}}}^{k}$, where ${\mathbf {G}}_{\mathbf {i}}^{k} = g'\left ( {{\mathbf {PW}}_{\mathbf {i}}^{k}{\mathbf {PU}}_{{\mathbf {i}} - {\mathbf {1}}}^{k}} \right )$, and generate the new error field ${\mathbf {E}}_{\mathbf {n}}^{k}$. ${\mathbf {E}}_{\mathbf {n}}^{k}$ is fed as input to the $n$-th convolutional layer for backward propagation, and the optical field ${\mathbf {B}}_{\mathbf {n}}^{k}$ is recorded by the camera and wavefront sensor.

$$\begin{aligned} {\mathbf{E}}_{\mathbf{n}}^{k} & = diag\left( {{\mathbf{G}}_{\mathbf{n}}^{k}} \right) * {\left( {{\mathbf{B}}_{\mathbf{n}}^{k}} \right)^{\prime} }\\ {\mathbf{B}}_{\mathbf{n}}^{k} & = {\mathbf{PE}}_{\mathbf{n}}^{k}. \end{aligned}$$

After recording the optical fields ${\mathbf {F}}_{\mathbf {n}}^{k}$ and ${\mathbf {B}}_{\mathbf {n}}^{k}$, the gradient can be calculated according to Eq. (20) and the parameters are updated according to ${\left ( {{\mathbf {\Phi }}_{\mathbf {n}}^{k}} \right )_{new}} = {\mathbf {\Phi }}_{\mathbf {n}}^{k} - \eta \frac {{\partial L}}{{\partial {\mathbf {\Phi }}_{\mathbf {n}}^{k}}}$. The remaining operations are the same as in Section 2.1. The pseudo code of the in-situ training algorithm with optical nonlinearity (Algorithm 2) is shown in the second table of the Appendix. A numerical sketch of the layer-by-layer backward recursion is given below.
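The following sketch emulates this layer-by-layer backward recursion numerically; in the real system each ${\mathbf {B}}_{\mathbf {n}}^{k}$ is recorded optically, and here $P$, the kernel vectors and the stored $g'$ values are assumed to be available as arrays.

```python
import numpy as np

def backward_with_nonlinearity(E, P, W_list, G_list):
    """Eq. (21): return the fields B_n^k for n = N..1 given the output error E^k.
    W_list[n-1] holds the vectorised kernel of layer n, G_list[n-1] the stored
    g'(P W_n P U_{n-1}) values from the forward pass."""
    B = {}
    B_prime = E                                   # (B_N^k)' is E^k itself at the last layer
    for n in range(len(W_list), 0, -1):
        E_n = G_list[n - 1] * B_prime             # E_n^k = diag(G_n^k) (B_n^k)'
        B[n] = P @ E_n                            # B_n^k = P E_n^k
        if n > 1:                                 # propagate the error one layer back:
            B_prime = P @ (W_list[n - 1] * (P @ E_n))   # (B_{n-1}^k)' = P W_n P E_n^k
    return B
```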

The above are the backpropagation algorithms for in-situ training of OPCNN without and with optical nonlinearity. In the next section, we numerically simulate the in-situ training process of OPCNN with the physical parameters of the optical components to evaluate the validity of the derived algorithms.

3. Simulation results and analysis

To demonstrate the feasibility of the proposed in-situ training algorithm, we perform experiments that train OPCNN on the optical platform or on the electronic platform and compare the performance of the two types of training. The MNIST, Fashion-MNIST and MSTAR datasets are used as training and testing sets in the experiments.

3.1 Performance of object classification tasks

First we validate the feasibility of the in-situ training algorithm of OPCNN without optical nonlinearity. The framework designed in the experiments contains one convolutional layer with ten channels, one nonlinear activation layer (realized by recording only the amplitude information with the sCMOS camera) and one GAP layer. The light propagation and optical components are simulated under ideal conditions. In the simulation, we load input images on amplitude SLMs (Hes6001, Holoeye) with a resolution of 1920 $\times$ 1080 pixels and load kernels on phase SLMs (Pluto-2, Holoeye) of the same resolution. The pixel pitch of these SLMs is $8\ \mu m$. Between the SLMs we place lenslet arrays ($f = 30\ mm$, $\phi = 2\ mm$, Edmund Optics) to implement multi-channel parallel convolution. At the output plane, we use an ORCA-Flash 4.0 V3 sCMOS camera (Hamamatsu, C13440-20CU) to record the amplitude information and a wavefront sensor (Phasics SID4) to record the phase information. Because the focal length of the lenslet array is too short to place the recording devices, we utilize a $4f$ system comprising two convex lenses ($f = 250\ mm$, $\phi = 25.4\ mm$, Thorlabs) to extend the output plane for convenient measurement of the optical field. The CFGM consists of the same type of $4f$ system and an adjustable pinhole used as a low-pass filter. The whole optical system is illuminated by coherent laser light with a wavelength of $532\ nm$. Next we discuss the physical parameters of the network. The physical size of the SLMs is 15.36 $\times$ 8.64 mm and the diameter of each lens in the lenslet array is 2 $mm$, so a single SLM is capable of containing ten channels. Theoretically, one channel with a size of 3 $\times$ 3 $mm$ should encode data of 250 $\times$ 250 pixels; thus MNIST or Fashion-MNIST images are first upsampled by a factor of eight to 224 $\times$ 224 pixels and then zero-padded to 250 $\times$ 250 pixels to fit the channel (see the preprocessing sketch after this paragraph). The kernels are initially generated randomly in the range $\left ( { - \pi,\pi } \right )$ with a size of 250 $\times$ 250 pixels. In the simulated forward propagation of both optical and electronic training, the convolutional layers are implemented on the optical platform, and the subsequent recording, global average pooling and category prediction are implemented on the electronic platform. In backward propagation, the electronically trained network updates its weights with the Adam optimizer [44], whereas the optically trained network updates its weights with the proposed backpropagation algorithm. Both optical training and electronic training are performed under Python version 3.5.0 with the PyTorch framework on a desktop computer (GPU: NVIDIA TITAN RTX). The training set of either MNIST or Fashion-MNIST includes 60000 images and the testing set includes 10000 images. All networks are trained on these datasets for 20 epochs with an initial learning rate of 0.01.
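The input preprocessing described above can be sketched as follows; nearest-neighbour upsampling is an assumption, since the paper does not specify the interpolation method.

```python
import numpy as np

def prepare_input(img28):
    """Upsample a 28x28 image by a factor of 8 to 224x224 and zero-pad it to the
    250x250-pixel channel aperture."""
    img224 = np.kron(img28, np.ones((8, 8)))       # nearest-neighbour upsampling
    pad = (250 - 224) // 2                         # 13 pixels on each side
    return np.pad(img224, pad, mode="constant")
```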

Figure 3(a) presents the forward propagation process. The convolutional operations are executed optically and the outputs are recorded by the sCMOS camera. After global average pooling and softmax normalization on the computer, the channel with the highest score indicates the predicted category. After electronic training and optical training, the convolutional outputs and softmax outputs of the two training approaches for the same input images are shown in Fig. 3(b). This comparison of output values demonstrates that the kernels updated on the two platforms differ because of the different ways the gradients are calculated. For the designed framework, OPCNN achieves 96.97$\%$ classification accuracy on the MNIST dataset with optical training and 96.61$\%$ with electronic training. On the Fashion-MNIST dataset, OPCNN achieves 84.09$\%$ classification accuracy with optical training and 83.92$\%$ with electronic training. The comparable classification performance validates the feasibility of the proposed in-situ training algorithm.


Fig. 3. The performance comparison of OPCNN trained with the two approaches. (a) The forward propagation process of OPCNN. All convolutional layers are executed on the optical system and the remaining operations, including softmax normalization and prediction, are executed on the electronic platform. (b) The output comparison of the two training approaches. Although the output values differ because of the different gradient calculations, the prediction results remain the same. (c) The classification performance of OPCNN through optical training or electronic training on the MNIST dataset. Both training algorithms converge quickly after a few epochs.


We then discuss the validity of the training algorithm for OPCNN with optical nonlinearity. First, we extend the framework shown in Fig. 3(a) by increasing the number of convolutional layers from one to two. Based on this framework, OPCNN provides 96.26$\%$ classification accuracy with optical training and 96.44$\%$ with electronic training on the MNIST dataset; on the Fashion-MNIST dataset, it provides 82.16$\%$ with optical training and 82.51$\%$ with electronic training. Then we insert one piece of SBN:60 between these two convolutional layers. The nonlinear response of SBN:60 is shown in Fig. 4(b). The rest of the electronic operations remain unchanged. Introducing nonlinearity improves the expressive ability of the neural network. As a result, the performance of OPCNN increases to 97.02$\%$ and 97.07$\%$ on the MNIST dataset, and to 84.86$\%$ and 85.20$\%$ on the Fashion-MNIST dataset. All classification performances are shown in Table 1, and the feasibility of the derived algorithm is verified at the same time.


Fig. 4. The performance comparison of OPCNN with optical nonlinearity trained with the two approaches. (a) The forward propagation process of OPCNN with optical nonlinearity. The photorefractive crystal SBN:60 is placed at the output plane of the first convolutional layer. (b) The nonlinear response of SBN:60. This nonlinear material generates an additional phase according to the intensity of the incident light. (c) The classification performance of different frameworks on the two datasets. The introduction of optical nonlinearity increases the accuracy of the models owing to the improved expressive ability.


When dealing with more complicated datasets, an OPCNN with ten convolutional channels cannot reach the expected classification performance, so we next discuss the performance improvement of OPCNN as the number of channels increases. Here we use the MSTAR dataset, which is more difficult to classify than MNIST and Fashion-MNIST, as the training and testing data. This dataset was collected by the Sandia National Laboratory SAR sensor platform. The collection was jointly sponsored by the Defense Advanced Research Projects Agency and the Air Force Research Laboratory as part of the MSTAR program. Thousands of Synthetic Aperture Radar (SAR) images containing ground targets were collected, covering different target types, aspect angles, depression angles, serial numbers and articulations, of which only a small subset is publicly available on the website. The publicly released data sets include ten different categories of ground targets (armored personnel carrier: BMP-2, BRDM-2, BTR-60, and BTR-70; tank: T-62, T-72; rocket launcher: 2S1; air defense unit: ZSU-234; truck: ZIL-131; bulldozer: D7). To the human eye, their categories are hard to distinguish, unlike the handwritten digits and fashion items. For a two-layer OPCNN, the framework with ten convolutional channels reaches only 66.07$\%$ classification accuracy. Increasing the number of channels to forty significantly improves the accuracy to 93.49$\%$, as shown in Fig. 5(d). The forward propagation process and superposition operation are shown in Fig. 5(a). The multi-channel outputs of the last convolutional layer can always be reduced, through the superposition operation, to a number of outputs equal to the number of categories. Similarly to three-dimensional convolution, we can increase the number of convolutional channels to extract enough features and perform multi-feature fusion through the superposition operation, improving the cognitive ability of OPCNN. The maximum number of channels that one SLM can accommodate depends on the physical size of the SLM and lenslet array; the physical size of a single channel is proportional to the diameter of a micro-lens in the lenslet array. Therefore, if the optical system is fixed but the number of channels exceeds the maximum capacity of the SLM, we can implement the channels group by group in a cyclic fashion. The independence of the channels makes it possible to realize them separately, although the time consumption of forward or backward propagation then increases. It is therefore advisable to increase the maximum number of channels one SLM can accommodate as far as possible, and there are two approaches to achieve this. The first is to use a lenslet array with smaller micro-lenses. For example, the lenslet array used in Ref. [28] has a lens diameter of only 0.57 mm; with a Pluto-2 SLM as the optical modulator, at least 96 channels can then be contained in a single layer. The second approach is to splice SLMs. In our earlier research, we implemented an optronic high-resolution SAR processing scheme by splicing SLMs [45]: each SLM could only process a small part of the large-scale SAR data, so the data were divided over several SLMs, processed in parallel, and then combined into a full-resolution image. In the same way, we can divide all convolutional channels over several SLMs and use beam splitters to splice them together. This method may cause calibration problems in the system, so a PI hexapod must be used to control the movement of the elements for precise motion and positioning when setting up the system.


Fig. 5. The classification performance of OPCNN with different numbers of convolutional channels. (a) The forward propagation process of OPCNN with multiple channels. The superposition operation is implemented on the electronic platform. (b) SAR images (bottom) and the corresponding optical images (top) of the targets in MSTAR. In this work, we use the SAR images as training images. (c) The classification performance of OPCNN with/without optical nonlinearity via optical training or electronic training on the MSTAR dataset. (d) The performance comparison of OPCNN with different numbers of convolutional channels.



Table 1. The classification accuracy of different models on two datasets

From Fig. 5(c) we also find that introducing optical nonlinearity improves the performance of OPCNN in simulation, whether via optical training or electronic training. The classification accuracy of OPCNN via electronic training increases from 93.61$\%$ to 96.04$\%$; with the in-situ training algorithm, the accuracy increases from 93.49$\%$ to 94.76$\%$. These trends agree with the analysis above. All simulation results demonstrate that OPCNN with separate convolutional channels is capable of classifying complicated datasets when the number of channels is increased.

3.2 Analysis of OPCNN via in-situ training

In this subsection, we discuss the contributions of the in-situ training algorithm to the two main challenges mentioned in Section 1. For the first challenge, we test the performance stability of OPCNN when the training and inference processes are implemented on the same optical platform.

Because the training and inference processes are implemented on the same optical system, misalignment errors between the optical components do not influence the performance of OPCNN; in other words, the robustness of OPCNN is enhanced without changing the framework. By contrast, if the network is trained on the electronic platform and the trained weights are then loaded onto the optical system, the performance of OPCNN drops dramatically [24]. Here we take an OPCNN containing one convolutional layer as an example; the framework is shown in Fig. 6. In this optical system, the relative position of the input plane and the kernel plane is the main source of misalignment error. Hence we first analyze how the performance is influenced by a position shift of the input plane. At the input plane, the input images of the multiple channels are encoded on the SLM and aligned with the optical axis of each lens. Because the area of the input images is far smaller than the SLM, we place each input image at a separate location and pad the rest of the area with zeros. Although a tiny lateral shift of the input plane moves the images away from the optical axis, the whole image can still propagate through the corresponding lens thanks to the zero padding. Therefore, its frequency spectrum remains at the original position with only a small additional tilt, and the convolution result does not change much. As for an axial shift of the input plane, the frequency spectrum remains unchanged apart from an additional quadratic phase, which also does not influence the performance. We then analyze how the performance is influenced by a position shift of the kernel plane. Because the kernel plane is located in Fourier space, a lateral shift alters the intercepted part of the frequency spectrum, as shown in Fig. 6(a), and an axial shift affects the implementation of the Fourier transform. Both situations degrade the performance of OPCNN.


Fig. 6. Misalignment situations occurring in the optical system. Analysis shows that displacement of the kernel plane is the dominant factor degrading the performance. Hence we simulate (a) the lateral position shift and (b) the axial position shift of the kernel plane and compare the performance of OPCNN with optical training and inference against that with electronic training and optical inference. The simulation results demonstrate that the network trained with the in-situ algorithm is much more robust.

Download Full Size | PDF

First we discuss the performance of OPCNN trained through the two kinds of approaches when a lateral position shift occurs in the kernel plane. The size of the kernel is 250 $\times$ 250 pixels, hence the movable range of the kernel plane is (0, 250 pixels). Here we set the maximum displacement to 200 pixels. From Fig. 6(a), the performance of the network through optical training remains stable, with a classification accuracy no less than 87.05$\%$, whereas a large performance drop occurs in the network through electronic training because the position of the kernel no longer matches the position of the frequency spectrum. Then we turn to an axial position shift of the kernel plane. Under this circumstance, the information arriving at the kernel plane through the Fourier lens is no longer the frequency spectrum of the input image. Hence we need to modify the transmission matrix ${\mathbf {P}}$ in Eq. (5) to

$$\begin{array}{c} {\mathbf{P'}}{\rm{ = }}\left( {\begin{array}{ccc} {{e^{ - j\frac{{2\pi }}{{\lambda \left( {f - \Delta z} \right)}}\left( {{x_1}{u_1} + {y_1}{v_1}} \right)}}} & \cdots & {{e^{ - j\frac{{2\pi }}{{\lambda \left( {f - \Delta z} \right)}}\left( {{x_M}{u_1} + {y_M}{v_1}} \right)}}}\\ \vdots & \ddots & \vdots \\ {{e^{ - j\frac{{2\pi }}{{\lambda \left( {f - \Delta z} \right)}}\left( {{x_1}{u_M} + {y_1}{v_M}} \right)}}} & \cdots & {{e^{ - j\frac{{2\pi }}{{\lambda \left( {f - \Delta z} \right)}}\left( {{x_M}{u_M} + {y_M}{v_M}} \right)}}} \end{array}} \right)\\ {{{\mathbf{P'}}}_{{\mathbf{inverse}}}} = diag\left( {{e^{ - j\frac{\pi }{{\lambda f}}\left( {1 - \frac{{f - \Delta z}}{f}} \right)\left( {{u_i}^{2} + {v_i}^{2}} \right)}}} \right)\left( {\begin{array}{ccc} {{e^{ - j\frac{{2\pi }}{{\lambda f}}\left( {{x_1}{u_1} + {y_1}{v_1}} \right)}}} & \cdots & {{e^{ - j\frac{{2\pi }}{{\lambda f}}\left( {{x_M}{u_1} + {y_M}{v_1}} \right)}}}\\ \vdots & \ddots & \vdots \\ {{e^{ - j\frac{{2\pi }}{{\lambda f}}\left( {{x_1}{u_M} + {y_1}{v_M}} \right)}}} & \cdots & {{e^{ - j\frac{{2\pi }}{{\lambda f}}\left( {{x_M}{u_M} + {y_M}{v_M}} \right)}}} \end{array}} \right), \end{array}$$
where $\Delta z$ denotes the axial position shift and $i$ runs from 1 to M. Considering that the focal length of the lenslet array is 0.03 $m$, we set the range of $\Delta z$ from -0.009 $m$ to 0.009 $m$. As shown in Fig. 6(b), the network through optical training is more robust than the network through electronic training.
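
For illustration, a small sketch of how this modified matrix could be assembled numerically is given below; the grid size and pixel pitch are our own assumptions, and the sketch only reproduces the structure of ${\mathbf {P'}}$, not the full simulation.

```python
# Hedged sketch of the axially shifted transmission matrix P' for a small
# vectorized M x M grid; dx (pixel pitch) and M are illustrative assumptions.
import numpy as np

def shifted_transmission_matrix(M=8, wavelength=532e-9, f=0.03, dz=0.003, dx=8e-6):
    coords = (np.arange(M) - M / 2) * dx            # sampling shared by both planes
    x, y = np.meshgrid(coords, coords, indexing="xy")
    xf, yf = x.ravel(), y.ravel()                   # (x_m, y_m) on the input plane
    uf, vf = x.ravel(), y.ravel()                   # (u_i, v_i) on the kernel plane
    # P'[i, m] = exp(-j * 2*pi / (lambda * (f - dz)) * (x_m * u_i + y_m * v_i))
    phase = -2j * np.pi / (wavelength * (f - dz)) * (np.outer(uf, xf) + np.outer(vf, yf))
    return np.exp(phase)

P_shifted = shifted_transmission_matrix()
print(P_shifted.shape)   # (64, 64): an M^2 x M^2 matrix acting on the vectorized field
```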

Besides the axial and lateral misalignment, the rotation of the SLMs also influences the accuracy of the optical computing. In the following part, we analyze the impact of rotation errors on the performance of OPCNN.

First we discuss the circumstance in which the SLM located in the kernel plane rotates about the $x$-axis or $y$-axis. Because the surface of the SLM is symmetrical, we only discuss rotation about the $y$-axis here. We suppose the rotation angle ranges from 0 to ${12^ \circ }$. As in the misalignment situation of an axial position shift, the information arriving at the rotated kernel plane is no longer the frequency spectrum of the input images but carries an additional phase shift. This phase shift depends on both the rotation angle $\theta$ and the position of the pixel. The performance of OPCNN as a function of the rotation angle $\theta$ is shown in Fig. 7(a). From the figure we can see that the OPCNN via in-situ training maintains its performance as the rotation angle increases, with an average classification accuracy reaching 93.19$\%$. In contrast, the OPCNN via electronic training and optical inferencing is susceptible to the rotational misalignment: its accuracy fluctuates only slightly when the rotation angle is small and drops sharply as the rotation angle increases.
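
One crude first-order way to picture this position-dependent phase (our own simplification, not necessarily the model used in the simulations) is that a pixel at transverse coordinate $u$ on the tilted SLM lies roughly $u\sin\theta$ further along the optical axis, so it accumulates an extra propagation phase of $k\,u\sin\theta$:

```python
# Crude first-order model (an assumption for illustration) of the extra phase on
# a kernel plane tilted about the y-axis: each column of pixels is displaced
# axially by u*sin(theta), giving an extra propagation phase of k*u*sin(theta).
import numpy as np

def tilt_phase(M=256, pitch=8e-6, wavelength=532e-9, theta_deg=5.0):
    u = (np.arange(M) - M / 2) * pitch                 # column coordinate on the SLM
    extra_path = u * np.sin(np.deg2rad(theta_deg))     # per-column axial displacement
    return np.tile(2 * np.pi / wavelength * extra_path, (M, 1))  # same for every row

spectrum = np.ones((256, 256), dtype=complex)           # placeholder frequency spectrum
tilted_spectrum = spectrum * np.exp(1j * tilt_phase())  # field seen by the tilted kernel
```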

Fig. 7. The misalignment situations that occur in the optical system. When the kernel plane rotates about the (a) $y$-axis or (b) $z$-axis, the performance of OPCNN through electronic training decreases greatly. In contrast, the strong robustness of the network with in-situ training allows it to maintain its performance under these misalignment situations.

Then we discuss the circumstance in which the SLM located in the kernel plane rotates about the $z$-axis. In this circumstance, not only does the correspondence of pixels change, but the region of the element-wise product is also truncated. The truncated area expands with the increase of the rotation angle, which dramatically influences the result of the convolutional operation. After truncation, the result of the element-wise product is no longer a square matrix, so we zero-pad it for the convenience of implementing the Fourier transform in simulation. In simulation, we suppose the rotation angle $\theta$ ranges from 0 to ${5^ \circ }$. The simulation result is shown in Fig. 7(b). As shown in the figure, a slight rotation causes a sharp decrease in the classification accuracy of OPCNN through electronic training; the mismatch of pixels has a greater impact on performance than the other misalignment situations. When training OPCNN optically, this influence is weakened and the average accuracy reaches 91.89$\%$.
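
A minimal sketch of this truncate-and-pad procedure is shown below, assuming scipy's image rotation as the rotation model and random arrays as stand-ins for the frequency spectrum and kernel phase pattern.

```python
# Minimal sketch of the z-axis rotation described above: rotate the kernel
# pattern in its own plane, keep only the pixels still covered by the SLM
# (truncation), zero-pad the rest, then apply the inverse Fourier transform.
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(1)
M = 250
spectrum = rng.random((M, M)) + 1j * rng.random((M, M))
kernel_phase = rng.random((M, M)) * 2 * np.pi

theta = 3.0                                               # rotation about the z-axis, degrees
rotated = rotate(kernel_phase, angle=theta, reshape=False, order=1, cval=np.nan)
valid = ~np.isnan(rotated)                                # pixels still covered by the rotated SLM

product = np.zeros_like(spectrum)                         # zero-padding of the truncated region
product[valid] = spectrum[valid] * np.exp(1j * rotated[valid])
output = np.fft.ifft2(product)                            # square array again, so the IFT applies
```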

From the above two kinds of simulations we can see that the rotation of optical components indeed degrades the performance of OPCNN without in-situ training. Between the two situations, the mismatch of elements caused by rotation about the $z$-axis brings a more serious performance reduction to OPCNN, which is similar to the trend of performance under lateral misalignment. Our proposed in-situ training algorithm is an effective approach to resist these performance reductions when misalignment exists in the optical system. In a real optical system, these misalignments cannot be eliminated completely even with accurate calibration and fabrication; we can only minimize the position errors as far as possible during the calibration process through precise instruments such as PI Hexapods. With the help of the in-situ training algorithm, however, it is promising to neglect these impacts on OPCNN and reduce the difficulty of calibrating the system in real applications.

In general, the environment difference between electronic training and optical inferencing makes OPCNN sensitive to misalignment errors, hence it is necessary to implement in-situ training. Moreover, the in-situ training algorithm obviates the need for strict calibration of the optical components and reduces the complexity of constructing the optical system. Especially for a deeper network containing multiple kernel layers, the accumulation of misalignment errors seriously affects the practical performance of OPCNN trained electronically. From the simulation results, it is predictable that as the number of kernel layers increases, the robustness of the in-situ trained network will be much stronger than that of a network trained on the electronic platform.

The influence of fabrication errors is similar to that of misalignment errors. Because these errors are inherent but unmeasurable, we cannot model them in code in advance to make electronic training close to practice; thus, if the trained weights are encoded on the optical system for inferencing, the performance is much lower than the training accuracy. In contrast, the in-situ training approach directly takes the physical dimensions of the optical components into account, and the weights are trained on the basis of the existing fabrication errors. Meanwhile, the inferencing process is implemented on the same optical system, so the trained weights are perfectly matched to the framework.

The quantization errors derive from the pixel depth of the SLMs. The quantization effect is important in optical imaging, as discussed in [46]. In that article, researchers numerically evaluated in detail the influence of the quantization of the SLM, for both amplitude and phase, on the quality of holographic reconstructions. The impact of other parameters such as resolution, zero-padding size, reconstruction distance, wavelength, random phase, pixel pitch, bit depth, phase modulation deviation, and filling factor on the reconstruction quality of a computer-generated hologram (CGH) is also discussed. In our setup, the wavelength of the laser used to illuminate the optical system is 532 $nm$. The green color channel of the SLMs is used for addressing 8-bit gray level patterns. Hence the trained weights need to be quantized to 256 gray levels before encoding. In this work, the phase information of the kernel is defined as the weight for training. That is to say, the phase value range (0, 2$\pi$) is first quantized to the gray value range (0, 255), and the SLMs generate the phase information from the encoded gray value of each pixel according to a look-up table. The pixel depth determines the gray value range of the SLMs. Because the gray value must be an integer, nearby phase values may be quantized to the same gray value. Consequently, the phase information generated on the SLMs differs from the phase information before quantization. As shown in Fig. 8, the comparison of gray values under different gray level quantizations demonstrates that the quantization errors can be reduced, but not eliminated, by utilizing high modulation-precision SLMs. Considering that the gray levels of state-of-the-art SLMs are all 8-bit, improving the quantization range is hard to realize. Therefore, we use the in-situ training approach to suppress the quantization errors as far as possible through multiple training iterations on the optical system. For this kind of error, the fundamental solution is finding the optimal quantization for future SLM devices as analyzed in [46], and we believe the performance of our network could increase with the development of future SLM fabrication. At present, we hope the application of the in-situ training approach can ensure that the OPCNN still achieves ideal performance despite the quantization errors.
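
The round trip below sketches this quantization step under the simplifying assumption of a linear look-up table; the two phase values are hypothetical and merely show how an 8-bit depth can merge nearby phases that a 12-bit depth would keep apart.

```python
# Small sketch of the quantization step above, assuming a linear look-up table
# for illustration (real SLM look-up tables are calibrated and may be nonlinear).
import numpy as np

def phase_to_gray(phase, bits=8):
    levels = 2 ** bits - 1
    return np.round(phase / (2 * np.pi) * levels).astype(int)

def gray_to_phase(gray, bits=8):
    levels = 2 ** bits - 1
    return gray / levels * 2 * np.pi

phase = np.array([2.000, 2.008])                 # two nearby, hypothetical phase values (rad)
print(phase_to_gray(phase, bits=8))              # [81 81]     -> merged at 8-bit depth
print(phase_to_gray(phase, bits=12))             # [1303 1309] -> still distinct at 12 bits
print(np.abs(gray_to_phase(phase_to_gray(phase, 8), 8) - phase))  # residual quantization error
```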

Fig. 8. The comparison of gray values under different gray level quantizations. (a) Look-up table for an 8-bit SLM. For the unequal phase values 5.403 and 5.024 in the (b) original phase map, their gray values after (c) 8-bit quantization are both equal to 205. Their gray values can be distinguished after (d) 12-bit quantization.

For the second challenge, we compare the training speed and computational cost of the electronic training approach and the optical training approach. Here we take OPCNN without optical nonlinearity as an example. Except for the convolutional operation, the remaining operations in optical training, such as the superposition operation, GAP operation, softmax normalization, error optical field generation and weight updating, are implemented on the electronic platform, so their total time consumption equals that in electronic training. Therefore, the difference in training speed between the two approaches derives from the time consumed by the same operations implemented on the two platforms. For training the network with one image per iteration, the total time consumed in the optical computing part (${T_{total}}$) is the sum of the time to encode data on the SLMs (${T_{encode}}$), the propagation time (${T_{propagate}}$) and the time to measure the optical fields (${T_{measure}}$). For the encoding time, the input frame rate of the Pluto-2 phase-only SLM is 60 $Hz$, which means the time to encode one frame on the SLM is 1/60 $s$. The propagation time equals the propagation distance divided by the speed of light $3 \times {10^{8}}$ $m/s$, which is roughly at the microsecond level. For the measurement time, the readout frame rate of the Hamamatsu C13440-20CU sCMOS camera is 80 $Hz$ under the conditions of full resolution, USB 3.0 protocol and 8-bit gray level, which means the time to measure one frame of amplitude information of the optical field is 1/80 $s$; the readout frame rate of the Phasics SID4 wavefront sensor is 60 $Hz$, which means the time to measure one frame of phase information is 1/60 $s$. Because the physical sizes of the SLM and the lenslet array allow us to encode data of only ten channels on one SLM per frame, we need to repeat the encoding and measuring processes when implementing a multi-channel OPCNN. Therefore, the total time consumed in optical forward propagation and backward propagation can be expressed as

$${T_{total}} = {T_{encode}} + {T_{propagate}} + {T_{measure}} \approx \frac{m}{{15}}{\rm{s}}.$$
where $m$ represents the multiple of the channel number with respect to the number of categories and also the number of times the encoding and measuring processes are repeated. For an OPCNN with forty convolutional channels and two convolutional layers, the total time consumed in optical computing is ${T_{total}} = 0.27\ s$. The same operations implemented via electronic computing on a desktop consume 1.81 $s$, which is much longer than ${T_{total}}$. Besides, if the number of convolutional layers increases, the growth rate of the time consumption in optical computing is approximately zero, while in electronic computing it grows much faster, because the processing speed of the convolutional operation in optical computing is not affected by the computational cost.
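
A back-of-the-envelope transcription of this timing model is given below; the "ten channels per SLM frame" figure comes from the text, while the ceiling-division helper and its argument names are our own convenience.

```python
# Back-of-the-envelope transcription of the timing expression above:
# T_total ≈ m/15 s, with m the number of SLM frames needed (channel count
# divided by the roughly ten channels that fit on one SLM, per the text).
def optical_time(channels, channels_per_slm=10):
    m = -(-channels // channels_per_slm)     # ceiling division: repeats of encoding/measuring
    return m / 15.0                          # seconds per training image, per the approximation

print(optical_time(40))            # ~0.27 s for the forty-channel example
print(1.81 / optical_time(40))     # ~6.8x speed-up over the quoted 1.81 s desktop time
```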

As for the computational cost, the main source of the reduction is the implementation of the convolutional operations via optical computing. Hence we count the multiplication and addition operations in the convolutional layers of the electronic training process. For an OPCNN consisting of one convolutional layer with one convolutional channel, training with an image of size $M \times M$ requires $3{M^{3}}$ complex-valued multiplications and $3{M^{2}}\left ( {M - 1} \right )$ complex-valued additions according to Eq. (5), which corresponds to $12{M^{3}}$ real-valued multiplications and $6{M^{2}}\left ( {2M - 1} \right )$ real-valued additions. For an OPCNN consisting of N convolutional layers with K convolutional channels, the total number of real-valued operations in the convolutional layers during forward and backward propagation equals $12NK{M^{2}}\left ( {4M - 1} \right )$. By in-situ training OPCNN on the optical platform, this computational cost is removed, greatly reducing the computational burden on the electronic hardware.
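
The operation counts above translate directly into a short helper, shown here only to make the scaling with M, N, and K explicit.

```python
# Direct transcription of the operation counts above: forward-pass real-valued
# multiplications/additions per the text, plus the combined forward + backward
# total, as functions of image size M, layer count N and channel count K.
def conv_real_ops(M, N=1, K=1):
    mults = 12 * N * K * M ** 3                    # real-valued multiplications (forward)
    adds = 6 * N * K * M ** 2 * (2 * M - 1)        # real-valued additions (forward)
    total = 12 * N * K * M ** 2 * (4 * M - 1)      # forward + backward total from the text
    return mults, adds, total

print(conv_real_ops(250, N=2, K=40)[2])   # ~6.0e10 operations moved into optics
```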

The above simulations demonstrate the feasibility of the proposed in-situ training algorithms for OPCNN. In real applications, however, a large-scale optical system may cause severe power loss, and the reduced light power cannot support multiple layers of high-resolution optical computing. For this challenge, it is promising to use a knowledge distillation approach to simplify the structure of OPCNN: fewer optical elements and a shorter propagation distance help reduce the impact of power loss over the whole process. Besides the structural improvement, we can also enhance the power of the laser beam via coherent beam combining technology to ensure a sufficient power supply during propagation. Another challenge comes from the dynamic range of the SLMs and detectors. The low dynamic range of optical modulators and detectors influences the imaging quality in optical computing. Under the conditions of existing optical components, we hope the impact of the low dynamic range can be reduced by the in-situ training approach as far as possible, and we believe this limitation will be eliminated with the development of fabrication technology.

4. Conclusion

In this work, we derive the optical backpropagation algorithms of OPCNN and propose a novel and effective approach to realize in-situ training of neural networks on an optical platform. With our scheme, the complex gradient calculations can be implemented rapidly through optical computing at the speed of light with low computational cost. Because of real-time training on the same optical system, the performance of OPCNN is no longer affected by the platform inconsistency between the training stage and the inferencing stage, which increases the system robustness. Furthermore, we introduce optical nonlinearity into OPCNN through a photorefractive crystal and derive the corresponding backpropagation algorithm. To validate the feasibility of the proposed algorithms, we conduct several simulations to test the classification performance of the in-situ trained OPCNN on the MNIST and Fashion-MNIST datasets. The simulations show that optical training and electronic training reach approximately the same accuracy when the OPCNN is free of misalignment. In contrast, if position errors exist in the optical system, the performance of the network with electronic training and optical inferencing decreases dramatically, while the network trained optically maintains its accuracy on account of its strong robustness.

There is still room for improvement in our proposed algorithm. Although the calculating speed is no longer the limit in optical training, we need to figure out how to process a batch of input images simultaneously during one training iteration. The convolutional outputs and errors of a batch of images would be mixed together and thus hard to separate, unlike electronic training, where these values can be stored in matrices. In addition, the backward propagating operations of OPCNN with optical nonlinearity are complicated due to the multiple recording and encoding steps. Therefore, we need to simplify the whole procedure by improving the algorithm or altering the framework. One promising way is to take advantage of knowledge distillation to retain the nonlinearity of the neural network without nonlinear layers. If this method is combined with the in-situ training algorithm, the practicability and simplicity can be further enhanced.

Appendix

A. In-situ training algorithms of OPCNN


Algorithm 1. Backpropagation without optical nonlinearity


Algorithm 2. Backpropagation with optical nonlinearity

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Refs. [31,32].

References

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

2. J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks 61, 85–117 (2014). [CrossRef]  

3. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385 (2015).

4. E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(1), 640–651 (2016). [CrossRef]  

5. A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” Neural Information Processing Systems 25, 1097–1105 (2012).

6. I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” arXiv: 1409.3215 (2014).

7. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” arXiv:1310.4546 (2013).

8. J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805 (2019).

9. G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science 313(5786), 504–507 (2006). [CrossRef]  

10. K. I. Kitayama, M. Notomi, M. Naruse, K. Inoue, and A. Uchida, “Novel frontier of photonics for data processing-photonic accelerator,” APL Photonics 4(9), 090901 (2019). [CrossRef]  

11. P. R. Prucnal, B. J. Shastri, T. F. de Lima, M. A. Nahmias, and A. N. Tait, “Recent progress in semiconductor excitable lasers for photonic spike processing,” Adv. Opt. Photonics 8(2), 228–299 (2016). [CrossRef]  

12. M. Miscuglio, A. Mehrabian, Z. Hu, S. I. Azzam, J. George, A. V. Kildishev, M. Pelton, and V. J. Sorger, “All-optical nonlinear activation function for photonic neural networks,” Opt. Mater. Express 8(12), 3851–3863 (2018). [CrossRef]  

13. Y. Shen, N. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljacic, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017). [CrossRef]  

14. M. H. Tahersima, K. Kojima, T. Koike-Akino, D. Jha, B. Wang, C. Lin, and K. Parsons, “Deep neural network inverse design of integrated photonic power splitters,” arXiv:1809.03555 (2018).

15. J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wight, A. Sebsastian, T. G. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589(7840), 52–58 (2021). [CrossRef]  

16. W. Ma, Z. Liu, Z. A. Kudyshev, A. Boltasseva, W. Cai, and Y. Liu, “Deep learning for the design of photonic structures,” Nat. Photonics 15(2), 77–90 (2020). [CrossRef]  

17. G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljacic, C. Denz, D. A. B. Miller, and D. Psaltis, “Inference in artificial intelligence with deep optics and photonics,” Nature 588(7836), 39–47 (2020). [CrossRef]  

18. Y. Zuo, B. Li, Y. Zhao, Y. Jiang, Y. Chen, P. Chen, G. Jo, J. Liu, and S. Du, “All-optical neural network with nonlinear activation functions,” Optica 6(9), 1132–1137 (2019). [CrossRef]  

19. X. Lin, Y. Rivenson, N. Yardimci, M. Veli, M. Jarrahi, and A. Ozcan, “All-optical machine learning using diffractive deep neural networks,” Science 361(6406), 1004–1008 (2018). [CrossRef]  

20. J. Li, D. Mengu, Y. Luo, Y. Rivenson, and A. Ozcan, “Class-specific differential detection in diffractive optical neural networks improves inference accuracy,” Adv. Photonics 1(06), 1–13 (2019). [CrossRef]  

21. Y. Luo, D. Mengu, N. T. Yardimci, Y. Rivenson, M. Veli, M. Jarrahi, and A. Ozcan, “Design of task-specific optical systems using broadband diffractive neural networks,” arXiv:1909.06553 (2019).

22. D. Mengu, Y. Luo, Y. Rivenson, and A. Ozcan, “Analysis of diffractive optical neural networks and their integration with electronic neural networks,” IEEE J. Sel. Top. Quantum Electron. 26(1), 1–14 (2019). [CrossRef]  

23. H. Dou, Y. Deng, T. Yan, H. Wu, X. Lin, and Q. Dai, “Residual D2NN: training diffractive deep neural networks via learnable light shortcuts,” Opt. Lett. 45(10), 2688–2691 (2020). [CrossRef]  

24. D. Mengu, Y. Zhao, N. Yardimci, Y. Rivenson, M. Jarrahi, and A. Ozcan, “Misalignment resilient diffractive optical networks,” Nanophotonics 9(13), 4207–4219 (2020). [CrossRef]  

25. T. Zhou, L. Fang, T. Yan, J. Wu, Y. Li, J. Fan, H. Wu, X. Lin, and Q. Dai, “In situ optical backpropagation training of diffractive optical neural networks,” Photonics Res. 8(6), 940–953 (2020). [CrossRef]  

26. J. Goodman, Introduction to Fourier Optics, 2nd. ed. (Roberts & Company Publishers, 1996).

27. J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Sci. Rep. 8(1), 12324 (2018). [CrossRef]  

28. S. Colburn, Y. Chu, E. Shilzerman, and A. Majumdar, “Optical frontend for a convolutional neural network,” Appl. Opt. 58(12), 3179–3186 (2019). [CrossRef]  

29. M. Miscuglio, Z. Hu, S. Li, J. George, R. Capanna, P. Bardet, P. Gupta, V. Sorger, and H. Dalir, “Massively parallel amplitude-only fourier neural network,” Optica 7(12), 1812–1819 (2020). [CrossRef]  

30. Z. Gu, Y. Gao, and X. Liu, “Optronic convolutional neural networks of multi-layers with different functions executed in optics for image classification,” Opt. Express 29(4), 5877–5889 (2021). [CrossRef]  

31. Y. Lecun, C. Cortes, and C. J. Burges, “The mnist database of handwritten digits,” At http://yann.lecun.com/exdb/-mnist/.

32. H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747 (2017).

33. Z. Gu, Y. Gao, and X. Liu, “Position-robust optronic convolutional neural networks dealing with images position variation,” Opt. Commun. 505, 127505 (2022). [CrossRef]  

34. A. Cruz-Cabrera, M. Yang, G. Cui, E. Behrman, J. Steck, and S. Skinner, “Reinforcement and backpropagation training for an optical neural network using self-lensing effects,” IEEE Trans. Neural Netw. 11(6), 1450–1457 (2000). [CrossRef]  

35. T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5(7), 864–871 (2018). [CrossRef]  

36. X. Guo, T. D. Barrett, Z. M. Wang, and A. I. Lvovsky, “Backpropagation through nonlinear units for the all-optical training of neural networks,” Photonics Res. 9(3), B71–80 (2021). [CrossRef]  

37. T. Yan, J. Wu, T. Zhou, H. Xie, F. Xu, J. Fan, L. Fang, X. Lin, and Q. Dai, “Fourier-space diffractive deep neural network,” Phys. Rev. Lett. 123(2), 023901 (2019). [CrossRef]  

38. L. Waller, G. Situ, and J. W. Fleischer, “Phase-space measurement and coherence synthesis of optical beams,” Nat. Photonics 6(7), 474–479 (2012). [CrossRef]  

39. D. N. Christodoulides, T. H. Coskun, M. Mitchell, and M. Segev, “Theory of incoherent self-focusing in biased photorefractive media,” Phys. Rev. Lett. 78(4), 646–649 (1997). [CrossRef]  

40. O. Mendoza-Yero, G. Mínguez-Vega, and J. Lancis, “Encoding complex fields by using a phase-only optical element,” Opt. Lett. 39(7), 1740–1743 (2014). [CrossRef]  

41. O. Rippel, J. Snoek, and R. Adams, “Spectral representations for convolutional neural networks,” arXiv:1506.03767 (2015).

42. S. Chen, H. Wang, F. Xu, and Y.-Q. Jin, “Target classification using the deep convolutional networks for sar images,” IEEE Trans. Geosci. Remote Sensing 54(8), 4806–4817 (2016). [CrossRef]  

43. E. R. Keydel, S. W. Lee, and J. T. Moore, “Mstar extended operating conditions: a tutorial,” Algorithms for Synth. Aperture Radar Imag. 2757, 228–242 (1996). [CrossRef]  

44. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 (2017).

45. L. Liu, Y. Gao, and X. Liu, “Optronic high-resolution SAR processing with the capability of full-resolution imaging,” 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 8913–8916 (2018).

46. Z. He, X. Sui, G. Jin, D. Chu, and L. Cao, “Optimal quantization for amplitude and phase in computer-generated holography,” Opt. Express 29(1), 119–133 (2021). [CrossRef]  




