
Efficient training of unitary optical neural networks

Open Access

Abstract

Deep learning has profoundly reshaped the technology landscape in numerous scientific areas and industrial sectors. This advance is, nevertheless, confronted with severe bottlenecks in digital computing. Optical neural networks present a promising solution owing to their ultra-high computing speed and energy efficiency. In this work, we present a systematic study of the unitary optical neural network (UONN) as an approach towards optical deep learning. Our results show that the UONN can be trained to high accuracy through a special unitary gradient descent optimization, and that the UONN is robust against physical imperfections and noise, making it more suitable for physical implementation than existing ONNs.

Published by Optica Publishing Group under the terms of the Creative Commons Attribution 4.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

1. Introduction

In the past decade, machine learning has achieved unprecedented success, surpassing human-level performance in complex computer vision [1] and natural language processing tasks [2]. It has also been used to solve long-standing scientific challenges such as protein folding [3] and algorithm discovery [4,5]. The latest triumph of large language models represents another major milestone that holds promise for the holy grail of artificial general intelligence [6].

The success of machine learning hinges on the availability of a massive amount of computing power in the past decade. However, the trend of Moore’s law, which held for several decades and stated that the number of transistors on a chip doubles every two years, has undeniably come to an end as transistors approach their physical limits. Conventional digital electronics cannot keep up with the rapid growth of machine learning workloads [7]. Additionally, the training, inference and deployment of neural networks incur significant energy consumption and carbon emissions [8]. All of these limitations highlight the pressing need for novel computing hardware with better sustainability. Optics emerges as a highly promising platform for achieving ultra-high computing speed and energy efficiency, thanks to the inherent coherence and superposition properties of light and its various degrees of freedom for parallel information processing.

Optical neural networks (ONNs) can be implemented on different optical platforms, including free-space optics [9–11], fiber optics [12,13] and integrated photonics [14,15]. Free-space neural networks use bulk optical components, such as lenses and diffractive elements, and operate on the principles of diffraction and interference. However, it is challenging to deploy them in real-world scenarios due to the difficulty of miniaturization and mass production. Fiber-based ONNs are mostly special-purpose analog solvers with limited applications. Integrated photonics offers a more practical solution due to the small footprint and rich functionality offered by different materials and circuit architectures [16].

A popular photonic neural network architecture constructs the linear layer with meshes of Mach-Zehnder interferometers (MZIs) [14], mainly because the MZI mesh network can be conveniently designed, fabricated and programmed, and because this photonic circuit has already been pursued for quantum optical processing [17], making its adoption for machine learning straightforward. The linear layers in conventional neural networks are represented by arbitrary weight matrices. However, the MZI mesh only achieves a unitary transform rather than an arbitrary linear mapping between the input and output. Therefore, two such MZI meshes in conjunction with an array of amplitude modulators are used to physically construct an arbitrary weight matrix via singular value decomposition (SVD). This leads to several practical issues that limit ONN performance. Firstly, it requires two unitary matrices to implement one weight matrix per layer, which doubles the hardware complexity and cost. Secondly, since any physical system exhibits a certain amount of noise and imperfection, using two MZI mesh circuits in one layer doubles the system error and optical loss and degrades the system performance. Furthermore, the array of amplitude modulators is usually implemented using attenuators, which results in energy loss during light transmission; overall optical amplification is required if the light becomes too weak to be detected. Using either attenuators or amplifiers inevitably degrades the system signal-to-noise ratio.
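
To make the SVD construction concrete, the following NumPy sketch (illustrative only, not the circuit programming procedure of [14]) factors an arbitrary weight matrix into two unitary matrices and a diagonal of singular values; in hardware, the two unitaries correspond to two MZI meshes and the diagonal to an attenuator array.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))   # arbitrary (non-unitary) weight

# SVD: W = U @ diag(s) @ Vh, with U and Vh unitary and s the singular values.
U, s, Vh = np.linalg.svd(W)

# In hardware, U and Vh map to two MZI meshes, while diag(s) (after scaling so
# all entries are <= 1, i.e. purely lossy) maps to the attenuator array.
print(np.allclose(W, U @ np.diag(s) @ Vh))        # True: two unitaries + a diagonal suffice
print(np.allclose(U @ U.conj().T, np.eye(4)))     # each unitary factor checks out
```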

The above limitations can be avoided if we construct ONNs that comprise unitary linear layers. Unitary neural networks have been introduced to address the issue of vanishing and exploding gradients in recurrent neural networks [18], as a unitary matrix preserves the norm of the signals it acts on, thereby inherently ensuring stable gradients. Several optimization methods have been proposed to train such unitary neural networks. Arjovsky et al. [18] constructed the weight matrix from products of several elementary unitary matrices, but the expressivity of this parameterization is limited because the number of trainable parameters is much smaller than that of a full unitary matrix. Another method uses the matrix exponential and its approximation to compute the gradient update [19], but this is computationally heavy. Other methods rely on soft constraints, so unitarity is not guaranteed. Wisdom et al. introduced a full-capacity solution that optimizes the weight matrices directly over the differential manifold of unitary matrices [20]; the optimization procedure is therefore exact and can reach any unitary matrix with full expressivity.

In this work, we present a systematic numerical study of the UONN on different classification and regression tasks, using the full-capacity unitary optimizer during training. We further analyze the robustness of the UONN under different experimental imperfections and noise. Our results demonstrate that the UONN offers significant advantages over traditional ONNs. In the future, a multi-layer deep UONN combined with error correction algorithms [21] will be a feasible route towards advanced optical artificial intelligence.

2. UONN

Neural networks typically consist of an input layer, at least one hidden layer and an output layer, as illustrated in Fig. 1(a). This network architecture is also known as a multi-layer perceptron or feedforward neural network. Information propagates through the network layer by layer, where the connection between layers can be represented as a matrix multiplication. After this linear connection, a nonlinear activation function is applied to each layer of neurons, enabling the neural network to learn complex nonlinear mappings between the network input and output. The training process of a neural network can be described as follows: (1) feed the training data through the input layer and obtain the output by forward calculation through the neural network; (2) calculate the error between the network outputs and the labels through a loss function, also known as an objective function; (3) calculate the gradients of the loss function with respect to all parameters through a procedure called error backpropagation; (4) update the network parameters using an optimization algorithm such as stochastic gradient descent (SGD). Several variants of SGD, such as MomentumSGD, RMSProp, and Adam, have been shown to improve the convergence speed and generalization performance of neural networks on various tasks. The choice of optimization algorithm depends on the specific problem and the properties of the dataset.
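
As an illustration of steps (1)-(4), here is a minimal NumPy sketch of an SGD training loop for a single linear layer with a mean-square-error loss; the layer sizes, data and learning rate are arbitrary values chosen for demonstration, not those used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 10))        # one mini-batch of inputs
T = rng.normal(size=(32, 2))         # corresponding labels
W = 0.1 * rng.normal(size=(10, 2))   # model parameters

lr = 1e-2                            # learning rate
for step in range(100):
    Y = X @ W                                   # forward pass          (step 1)
    loss = np.mean((Y - T) ** 2)                # loss function         (step 2)
    grad = 2.0 * X.T @ (Y - T) / Y.size         # gradient dL/dW        (step 3)
    W -= lr * grad                              # SGD parameter update  (step 4)
```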


Fig. 1. The architecture of UONN. (a) General architecture of a neural network with input layer X, m hidden layers h and output layer Y. (b) Implementation of the connection between layers in the UONN in the form of a $4\times 4$ unitary matrix $U$. (c) A schematic diagram of the mode transformation $T_{mn}(\theta,\varphi )$. A line corresponds to an optical mode, and crossings between two modes correspond to a variable beam splitter, which can be implemented by an MZI.


Here we explain the physical implementation of a UONN. It is known that any unitary transformation $U(N)$ can be physically realized by meshes of MZIs [22,23], with an example shown in Fig. 1(b). A single MZI can be constructed from two 50:50 balanced beam splitters and two phase shifters, as illustrated in Fig. 1(c). While the fixed 50:50 beam splitters are not configurable, the two phase shifters, parameterized by $\theta$ and $\varphi$, are learned during training. The transformation between channels $m$ and $n$ $(m=n-1)$ corresponds to a lossless beam splitter operation between channels $m$ and $n$ with reflectivity $\cos (\theta )$ $(\theta \in [0, \pi /2])$ and a phase shift $\varphi$ at input port $m$. Mathematically, this transformation is described by the matrix $T_{mn}$ with parameters $\theta$ and $\varphi$, which is a $2\times 2$ MZI rotation matrix embedded in the $N$-dimensional unitary space:

$$T_{mn}(\theta,\varphi) = \begin{bmatrix} 1 & 0 & \dots & \dots & \dots & \dots & 0 \\ 0 & 1 & & & & & \vdots \\ \vdots & & \ddots & & & & \vdots \\ \vdots & & & e^{i\varphi}\cos(\theta) & -\sin(\theta) & & \vdots \\ \vdots & & & e^{i\varphi}\sin(\theta) & \cos(\theta) & & \vdots\\ \vdots & & & & \ddots & & \vdots \\ \vdots & & & & & 1 & 0 \\ 0 & \dots & \dots & \dots & \dots & 0 & 1 \end{bmatrix}_{N \times N}$$
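
As a sanity check of Eq. (1), the sketch below builds $T_{mn}$ for neighbouring channels in NumPy (0-based indices and the phase values are illustrative assumptions) and verifies that it is unitary.

```python
import numpy as np

def T(m, theta, phi, N):
    """Two-mode transformation of Eq. (1) acting on channels m and m+1 (0-based)."""
    t = np.eye(N, dtype=complex)
    t[m, m]         =  np.exp(1j * phi) * np.cos(theta)
    t[m, m + 1]     = -np.sin(theta)
    t[m + 1, m]     =  np.exp(1j * phi) * np.sin(theta)
    t[m + 1, m + 1] =  np.cos(theta)
    return t

Tmn = T(1, theta=0.3, phi=1.2, N=4)
print(np.allclose(Tmn @ Tmn.conj().T, np.eye(4)))   # True: T_mn is unitary
```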

Following the procedure outlined by Clements et al. [23], an arbitrary unitary matrix can be decomposed into a diagonal matrix $D'$ followed by a specific ordered sequence $S$ of two-mode transformations $T_{mn}$, where $D'$ is implemented by phase shifts on all individual channels at the output of the interferometer. A schematic diagram of the implementation of a $4\times 4$ unitary matrix is shown in Fig. 1(b), and the unitary matrix $U$ can be rewritten as:

$$U = D'T_{23}T_{34}T_{12}T_{23}T_{34}T_{12}.$$

By construction, Eq. (2) physically corresponds to a 4-port interferometer, and the sequence of the $T_{mn}$ matrices follows the propagation direction of the optical signals in the interferometer. The values of $\theta$ and $\varphi$ in the $T_{mn}$ matrices determine the beam splitting ratios and phase shifts that must be programmed to implement $U$. This decomposition principle generalizes to any matrix dimension.
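
The composition in Eq. (2) can also be checked numerically. The sketch below multiplies the six $T_{mn}$ blocks in the listed order with random phase settings (an illustrative assumption in place of programmed values) and confirms that the product is always unitary.

```python
import numpy as np

def T(m, theta, phi, N=4):
    """Two-mode block of Eq. (1) on channels m, m+1 (0-based)."""
    t = np.eye(N, dtype=complex)
    t[m:m + 2, m:m + 2] = [[np.exp(1j * phi) * np.cos(theta), -np.sin(theta)],
                           [np.exp(1j * phi) * np.sin(theta),  np.cos(theta)]]
    return t

rng = np.random.default_rng(2)
D = np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, 4)))    # output phase shifts D'

# Eq. (2): U = D' T23 T34 T12 T23 T34 T12  (channels 2,3 -> index 1, etc.)
U = D
for m in [1, 2, 0, 1, 2, 0]:
    U = U @ T(m, rng.uniform(0, np.pi / 2), rng.uniform(0, 2 * np.pi))

print(np.allclose(U @ U.conj().T, np.eye(4)))             # True for any phase settings
```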

3. Unitary gradient descent

In machine learning, the optimizer plays a critical role as the algorithm used to update model parameters and minimize the loss function during the training process. Optimizers iteratively adjust the model parameters in the direction of the steepest descent of the loss function. This is typically achieved by computing the gradient of the loss function with respect to the model parameters and then using this gradient to update the parameters.

There are various optimizers available that aim to solve different optimization tasks. Among these, SGD is a popular optimization algorithm that updates the model parameters iteratively using a subset of the training data (a mini-batch) at each step. The basic steps of SGD are as follows: (1) initialize the model parameters $p$ randomly; (2) split the dataset into batches of a certain size; (3) for each batch, compute the gradient $g$ of the loss function $f$ with respect to the parameters $p$ using the current batch; (4) update the parameters $p$ at training iteration $k$ using the gradient $g$ and a learning rate $\lambda$ according to:

$$p^{(k+1)} = p^{(k)} - \lambda g.$$

To train our UONN, we follow the method developed in [20] and apply a UnitarySGD algorithm that updates parameters on the Stiefel manifold, the differential manifold of unitary matrices, thus ensuring the unitarity of the weight matrices throughout the training process. At each training iteration $k$ it computes a descent curve along the manifold, given by the matrix product of the Cayley transformation of $A^{(k)}$ with the current solution $W^{(k)}$. The update is mathematically defined as:

$$Y^{(k)}(\lambda) = \Bigg(I+\frac{\lambda}{2}A^{(k)}\Bigg)^{{-}1}\Bigg(I-\frac{\lambda}{2}A^{(k)}\Bigg)W^{(k)},$$
where $A^{(k)} = {G^{(k)}}^{H}W^{(k)} - {W^{(k)}}^{H}G^{(k)}$ is a skew-Hermitian matrix and $G^{(k)}$ is the usual gradient of the loss function $f$ with respect to the matrix $W^{(k)}$. Gradient descent proceeds by performing the update $W^{(k+1)} = Y^{(k)}(\lambda)$, so the unitary matrix is optimized directly over the differentiable manifold of unitary matrices.
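
A minimal NumPy sketch of one UnitarySGD step following Eq. (4) is given below; the weight, gradient and learning rate are random illustrative values, and the final check confirms that the Cayley update keeps $W$ exactly unitary.

```python
import numpy as np

def unitary_sgd_step(W, G, lr):
    """One UnitarySGD update (Eq. (4)): Cayley transform of the skew-Hermitian A."""
    A = G.conj().T @ W - W.conj().T @ G                        # A^(k), skew-Hermitian
    I = np.eye(W.shape[0], dtype=complex)
    return np.linalg.solve(I + 0.5 * lr * A, (I - 0.5 * lr * A) @ W)

rng = np.random.default_rng(3)
W, _ = np.linalg.qr(rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8)))  # unitary init
G = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))                   # loss gradient

W_next = unitary_sgd_step(W, G, lr=1e-2)
print(np.allclose(W_next @ W_next.conj().T, np.eye(8)))        # unitarity is preserved
```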

4. Numerical experiment

We conduct a set of numerical experiments to investigate the properties of UONN and the performance of the UnitarySGD optimization algorithm. The experiments involve two distinct tasks: classification and regression.

4.1 Classification of high-dimensional geometric shapes

For the classification task, we utilize a high-dimensional geometry dataset, where each data point has ten feature dimensions. The data are divided into two classes with a parabolic boundary surface:

$$\begin{cases} |X_{10}| < \sum_{i=1}^{9}X_{i}^2, & \text{Class I} \\ |X_{10}| \geq\sum_{i=1}^{9}X_{i}^2, & \text{Class II} \end{cases}$$

To create the dataset, we initially sample 4000 data points from a 10-dimensional normal distribution, then scale the tenth dimension of the data points by a constant factor to evenly split the dataset into two classes.
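
A sketch of this dataset construction is given below; the scale factor applied to the tenth dimension is an assumed value for illustration, since the exact constant is not stated above.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(4000, 10))       # 4000 samples from a 10-dimensional normal distribution
X[:, 9] *= 9.0                        # assumed scale factor on the tenth dimension

# Labels according to Eq. (5): Class I if |X_10| < sum_{i=1..9} X_i^2, else Class II.
labels = (np.abs(X[:, 9]) < np.sum(X[:, :9] ** 2, axis=1)).astype(int)
print(np.bincount(labels))            # inspect the resulting class split
```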

Our UONN consists of two layers of unitary weights, where each weight matrix has a size of $20\times 20$ and is realized using 190 MZIs; in total, 380 MZIs are required to assemble the complete UONN. At the output, we apply a modulus-square nonlinearity to represent the intensity detection of the complex-valued light field. To evaluate the performance of our UONN, we train it using the UnitarySGD optimization algorithm with the following hyper-parameters: mini-batch size $N_{batch}=50$, learning rate $\lambda =10^{-4}$, and epochs $N_{epoch}=100$. The network weights are initialized as random unitary matrices obtained from the QR decomposition of random complex-valued matrices. As a benchmark for this task, we also train a standard network without the unitarity constraint (non-UONN) of the same network size using the Adam optimizer. Implementing each of its weight matrices requires 400 MZIs (380 MZIs for two $20\times 20$ unitary matrices and 20 MZIs for the $20\times 20$ diagonal matrix), so in total 800 MZIs are needed to construct an equivalent-sized non-UONN. To ensure reliable results, we conduct five runs for both the UONN and the non-UONN; the results are shown in Fig. 2.
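
For illustration, the forward pass of this two-layer UONN can be sketched as follows; the port assignment (features on the first ten input ports, class intensities read from two output ports) and the random unitary initialization are assumptions consistent with the description above, not the exact implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
# Random unitary initialization via QR decomposition of complex Gaussian matrices.
U1, _ = np.linalg.qr(rng.normal(size=(20, 20)) + 1j * rng.normal(size=(20, 20)))
U2, _ = np.linalg.qr(rng.normal(size=(20, 20)) + 1j * rng.normal(size=(20, 20)))

def forward(x10):
    """Two unitary layers followed by modulus-square (intensity) detection."""
    x = np.zeros((x10.shape[0], 20), dtype=complex)
    x[:, :10] = x10                    # encode the 10 features on 10 input ports
    h = x @ U1.T                       # first 20x20 unitary layer
    y = h @ U2.T                       # second 20x20 unitary layer
    return np.abs(y[:, :2]) ** 2       # intensities on two output ports as class scores

scores = forward(rng.normal(size=(4, 10)))
print(scores.shape)                    # (4, 2)
```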


Fig. 2. The result of the 10-dimensional parabolic classification task. (a) Visualization of the classification result and UONN architecture. The figure on the left is the dataset with a parabolic boundary surface. The figure on the right is the decision boundary learned by the linear UONN. The figure in the middle shows the UONN architecture, which consists of two rectangular MZI meshes of size $20\times 20$ working as the two linear layers of the network. (b) Training loss and (c) validation accuracy of the UONN using UnitarySGD and the non-UONN using Adam. (d) The decision boundary learned by a UONN with the SA optical activation function.


For this classification task, our UONN trained with UnitarySGD achieves comparable performance to the non-UONN trained with Adam, scoring $84.78{\% }$ and $84.8{\% }$ accuracy respectively. This shows the efficacy of the UONN, even though the weights are heavily constrained to be unitary. We note that in the UONN as shown in Fig. 2(a), we use a $20\times 20$ MZI mesh for the unitary linear layer, and this is achievable with current photonic technology. In this classification experiment we only use ten input ports and two output ports. Similarly, the non-UONN does not utilize all of the MZIs comprising the weight matrices. One may also use the full mesh on other datasets with more input dimensions and output classes.

For this simple classification task, we should be able to achieve higher accuracy. The limited accuracy of the above network is probably due to the lack of a nonlinear activation function at the hidden layer. Therefore, we further incorporate a saturable absorption (SA) optical nonlinearity into the UONN. SA can be conveniently implemented using various materials such as atomic vapour and graphene [24]. We choose a reasonable absorption coefficient of 1.5 following each unitary layer. The resulting new decision boundary is plotted in Fig. 2(d), which indicates that the addition of the nonlinear function increases the curvature of the boundary and improves the classification ability of our UONN, raising the validation accuracy to 93.6%.

4.2 Learning matrix inverse

For the regression task, we train our UONN to learn the inverse of a $100\times 100$ unitary matrix denoted as $U$. We first prepare a dataset of $50,000$ vectors $X_{in}$ of size $1\times 100$ that are randomly sampled from a normal distribution. These vectors are then transformed by the randomly generated unitary matrix $U$. The transformed data $X_{T}=X_{in} \times U$ are sent to the UONN as network input, and the network output is compared with the label, which is the raw input $X_{in}$. When the learned weight matrix is exactly the inverse of $U$, the network output $Y=X_{T}\times W = X_{in}$, yielding zero mean square error (MSE).
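
The construction of this regression dataset can be sketched as below; generating the random unitary $U$ by QR decomposition of a complex Gaussian matrix is an assumed but standard choice.

```python
import numpy as np

rng = np.random.default_rng(6)
U, _ = np.linalg.qr(rng.normal(size=(100, 100)) + 1j * rng.normal(size=(100, 100)))

X_in = rng.normal(size=(50_000, 100))       # 50,000 vectors from a normal distribution
X_T = X_in @ U                              # transformed data fed to the network

# A perfect network learns W = U^{-1} = U^H, so that X_T @ W recovers X_in exactly.
print(np.allclose(X_T @ U.conj().T, X_in))  # True -> zero MSE is attainable
```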

The UONN used in this study consists of only two unitary layers, each weight matrix having a size of $100\times 100$, and we do not use any nonlinear activation units because the target transformation is linear and should be learnable by the unitary weight matrices. At the output we define the loss function as a combination of real-part and imaginary-part MSE, since the network is complex-valued:

$$\mathcal{L} = \frac{1}{N}\Bigg(\sum\left[Re(Y)-Re(X_{in})\right]^2 + \sum\left[Im(Y)-Im(X_{in})\right]^2\Bigg).$$
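
Eq. (6) translates directly into code; a minimal version (with $1/N$ interpreted as normalization by the batch size, an assumption) reads:

```python
import numpy as np

def complex_mse(Y, X_in):
    """Loss of Eq. (6): MSE of real and imaginary parts, normalized by batch size N."""
    N = Y.shape[0]
    return (np.sum((Y.real - X_in.real) ** 2)
            + np.sum((Y.imag - X_in.imag) ** 2)) / N
```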

In the numerical experiment we use the following hyper-parameters: learning rate $\lambda = 0.1$, epochs $N_{epoch}=100$, and mini-batch size $N_{batch}=500$. We run the experiment five times to obtain reliable results, which are shown in Fig. 3(a). The validation accuracy reaches 100% with a proper learning rate, indicating excellent performance on this task. The real part of the product between the arbitrary transform matrix $U$ and the trained weights is extremely close to the identity matrix, and the imaginary part is much smaller than $10^{-3}$ and can be ignored. This shows that the UONN is capable of learning a target matrix transformation to high accuracy.


Fig. 3. Results of matrix inversion learning and robustness comparison. (a) Training loss and validation accuracy of the UONN in learning the inverse of an arbitrary unitary matrix. (b) Robustness comparison between the UONN and the non-UONN with increasing component error.


4.3 Network robustness

Since photonic integrated circuits usually exhibit imperfections that are hard to eliminate [25], we next investigate the robustness of the UONN using the matrix inverse learning task as an example. In this task we reduce the UONN size to $10\times 10$ because this is closer to the physical setting. Firstly, we assume the unitary layer is implemented following the Clements design briefly summarized above, as this is the currently prevalent design. Using the aforementioned decomposition algorithm, the unitary matrix is decomposed into an array of reflectivities $\theta$ and an array of phase shifts $\varphi$, which are set with internal and external phase shifters respectively. Next, a deviation $\sigma _{BS}$ of the reflectivity is added to the two balanced beam splitters of each MZI to simulate imperfections such as fabrication errors and a suboptimal working wavelength. We further add random Gaussian noise $\sigma _{PS}$ (rad) to the internal and external phase shifters during each training epoch to simulate the effects of phase shifter cross-talk or inaccurate calibration. Both $\sigma _{PS}$ and $\sigma _{BS}$ are referred to as component error. In this way we are able to simulate all the major physical imperfections in the system.
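
A sketch of this component-error model is given below. For simplicity the beam-splitter deviation is folded into the programmed reflectivity angles $\theta$; the function name, parameter values and clipping range are illustrative assumptions rather than the exact simulation code.

```python
import numpy as np

rng = np.random.default_rng(7)

def add_component_error(thetas, phis, sigma_bs, sigma_ps):
    """Perturb the decomposed mesh settings with Gaussian component errors."""
    noisy_thetas = np.clip(thetas + rng.normal(0.0, sigma_bs, thetas.shape), 0.0, np.pi / 2)
    noisy_phis = phis + rng.normal(0.0, sigma_ps, phis.shape)   # phase-shifter noise (rad)
    return noisy_thetas, noisy_phis

n_mzi = 10 * 9 // 2                              # 45 MZIs in a 10x10 Clements mesh
thetas = rng.uniform(0, np.pi / 2, size=n_mzi)   # decomposed reflectivity angles
phis = rng.uniform(0, 2 * np.pi, size=n_mzi)     # decomposed phase shifts
noisy_thetas, noisy_phis = add_component_error(thetas, phis, sigma_bs=0.01, sigma_ps=0.01)
```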

To benchmark the performance of our proposed UONN, we perform a numerical experiment on a non-UONN as described before. The robustness comparison is shown in Fig. 3(b). For each component error level we run both the UONN and the non-UONN five times with $N_{epoch}=20$ epochs and a proper learning rate $\lambda$ to obtain the mean value and standard deviation. The results show that both networks degrade with increasing component error. However, the UONN is more robust against noise than the non-UONN, which is expected since the former uses roughly half as many MZIs. We note that various error correction algorithms [21] can be applied to the UONN to further improve the network performance.

5. Conclusion

In this work, we present a systematic study of a novel type of ONN featuring unitary linear layers. Our UONN can be realized with existing unitary photonic chips developed for quantum optical tasks such as Boson sampling [17]. Compared with previous integrated ONNs that use SVD and multiple matrix blocks for one layer [14], our design simplifies the network, reduces hardware error, and improves the overall network performance.

We demonstrate the efficacy of the proposed UONN using a UnitarySGD optimizer that enables optimization over the full unitary space. We show that the UONN trained with the UnitarySGD algorithm performs well in both classification and regression tasks. Furthermore, our analysis confirms the robustness of the UONN in the presence of practical noise and hardware imperfections. We anticipate that the UONN will serve as a valuable reference model for deep ONNs in the future, and our research offers practical guidance for enhancing the efficiency and resource utilization of ONNs by minimizing energy loss and reducing component counts.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 770–778.

2. T. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners,” Advances in neural information processing systems 33, 1877–1901 (2020).

3. J. Jumper, R. Evans, A. Pritzel, et al., “Highly accurate protein structure prediction with AlphaFold,” Nature 596(7873), 583–589 (2021). [CrossRef]

4. A. Fawzi, M. Balog, A. Huang, et al., “Discovering faster matrix multiplication algorithms with reinforcement learning,” Nature 610(7930), 47–53 (2022). [CrossRef]  

5. D. J. Mankowitz, A. Michi, A. Zhernov, et al., “Faster sorting algorithms discovered using deep reinforcement learning,” Nature 618(7964), 257–263 (2023). [CrossRef]  

6. S. Bubeck, V. Chandrasekaran, R. Eldan, et al., “Sparks of artificial general intelligence: Early experiments with GPT-4,” arXiv, arXiv:2303.12712 (2023). [CrossRef]

7. J. Sevilla, L. Heim, A. Ho, et al., “Compute trends across three eras of machine learning,” in 2022 International Joint Conference on Neural Networks (IJCNN), (IEEE, 2022), pp. 1–8.

8. E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in NLP,” arXiv, arXiv:1906.02243 (2019). [CrossRef]

9. X. Lin, Y. Rivenson, N. T. Yardimci, et al., “All-optical machine learning using diffractive deep neural networks,” Science 361(6406), 1004–1008 (2018). [CrossRef]  

10. T. Zhou, X. Lin, J. Wu, et al., “Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit,” Nat. Photonics 15(5), 367–373 (2021). [CrossRef]  

11. J. Spall, X. Guo, and A. I. Lvovsky, “Hybrid training of optical neural networks,” Optica 9(7), 803–811 (2022). [CrossRef]  

12. X. Xu, M. Tan, B. Corcoran, et al., “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature 589(7840), 44–51 (2021). [CrossRef]

13. U. Teğin, M. Yıldırım, İ. Oğuz, et al., “Scalable optical learning operator,” Nat. Comput. Sci. 1(8), 542–549 (2021). [CrossRef]  

14. Y. Shen, N. C. Harris, S. Skirlo, et al., “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017). [CrossRef]  

15. J. Feldmann, N. Youngblood, M. Karpov, et al., “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589(7840), 52–58 (2021). [CrossRef]  

16. W. Bogaerts, D. Pérez, J. Capmany, et al., “Programmable photonic circuits,” Nature 586(7828), 207–216 (2020). [CrossRef]  

17. J. Wang, F. Sciarrino, A. Laing, et al., “Integrated photonic quantum technologies,” Nat. Photonics 14(5), 273–284 (2020). [CrossRef]  

18. M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent neural networks,” in International conference on machine learning, (PMLR, 2016), pp. 1120–1128.

19. M. Lezcano-Casado and D. Martınez-Rubio, “Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group,” in International Conference on Machine Learning, (PMLR, 2019), pp. 3794–3803.

20. S. Wisdom, T. Powers, J. Hershey, et al., “Full-capacity unitary recurrent neural networks,” Advances in neural information processing systems 29 (2016).

21. S. Bandyopadhyay, R. Hamerly, and D. Englund, “Hardware error correction for programmable photonics,” Optica 8(10), 1247–1255 (2021). [CrossRef]  

22. M. Reck, A. Zeilinger, H. J. Bernstein, et al., “Experimental realization of any discrete unitary operator,” Phys. Rev. Lett. 73(1), 58–61 (1994). [CrossRef]  

23. W. R. Clements, P. C. Humphreys, B. J. Metcalf, et al., “Optimal design for universal multiport interferometers,” Optica 3(12), 1460–1465 (2016). [CrossRef]  

24. X. Guo, T. D. Barrett, Z. M. Wang, et al., “Backpropagation through nonlinear units for the all-optical training of neural networks,” Photonics Res. 9(3), B71–B80 (2021). [CrossRef]  

25. M. Y.-S. Fang, S. Manipatruni, C. Wierzynski, et al., “Design of optical neural networks with component imprecisions,” Opt. Express 27(10), 14009–14029 (2019). [CrossRef]  

