Parallel optical coherent dot-product architecture for large-scale matrix multiplication with compatibility for diverse phase shifters

Shaofu Xu; Jing Wang; Sicheng Yi; Xinrui Zhao; Binshuo Liu; Jiayi Shao; Weiwen Zou

doi:10.1364/OE.471519

1. Introduction

Matrix multiplication is a basic and indispensable building block in the computational models of various information technologies, including artificial intelligence (AI) [1], wireless communications [2], autonomous driving [3], and data mining [4]. Typically, matrix multiplication consumes most of the computational resources (memory, space, and time) in these models since its complexity grows quadratically with the input vector length [5]. Because next-generation information technologies must process high-dimensional vectors and large-scale matrices, fast and efficient matrix multiplication processors are a fundamental requirement. In recent years, photonic analog computing has become a promising candidate for high-performance matrix multiplication processors. Large-bandwidth photonic modulation and detection enable ultrafast clock frequencies of over tens of GHz, greatly surpassing those of electronic counterparts. Light propagating through a controlled structure physically performs analog matrix multiplication with low energy consumption [6,7].

Integrated photonic matrix multiplication processors take advantage of multiple degrees of freedom of light, including space [8–10], time [11,12], wavelength [13–15], and guided modes [16], to implement matrix multiplication. In particular, architectures using coherent light have the unique feature of processing complex optical fields; thus, their applications cover not only real-valued scenarios (e.g. AI and data mining) but also complex-valued ones (e.g. communications and scientific computing). The typical coherent-light-based architecture exploits cascaded phase shifters for unitary transformations [8,17]. Then, following singular value decomposition (SVD), two cascaded unitary transformations and a diagonal transformation construct a universal matrix multiplication. However, a major challenge is to realize a matrix multiplication processor with high throughput, high precision (or signal-to-noise ratio, SNR), high energy efficiency, and high compactness simultaneously. The reasons for this challenge are illustrated in Fig. 1. Insertion loss (IL), power use, and footprint are three basic figures of merit (FoMs) for integrated phase shifters, and they typically trade off against one another. As the integration scale grows, computing throughput becomes higher but the accumulated IL reduces precision. To maintain a desired computing precision, high-power lasers, amplifiers, and ultralow-loss thermo-optic interferometers are adopted at the cost of energy efficiency and footprint. With fixed values of speed and pump power, the conventional architecture can achieve high-performance computing only if all three FoMs of the phase shifter are optimized. In contrast, if the IL accumulation can be broken, some lossy phase shifters become acceptable in exchange for better power use and footprint. Recently, the optical coherent dot-product (OCD) architecture was demonstrated [18]. Inside the OCD, phase shifters are deployed fully in parallel, providing a novel pathway that avoids IL accumulation.

Fig. 1. Computing FoMs vs. phase shifter FoMs. Throughput, SNR, energy efficiency, and area efficiency are four basic FoMs for processors. These processor FoMs are directly determined by the integration scale, insertion loss, power use, and footprint of the phase shifters. ‘Speed’ represents the modulation speed of the input modulators.

In this paper, we present and study a parallel optical coherent dot-product (P-OCD) architecture, which connects parallel OCDs through a routing structure with negligible loss. Matrix multiplications are computed with fully parallel phase shifters instead of cascaded ones. The IL of the architecture is then essentially independent of the integration scale (see Fig. 1). Large-scale integration of phase shifters does not accumulate extra IL; thus, high throughput is achieved without affecting computing precision or energy/area efficiency. In addition, the P-OCD architecture accommodates various photonic integration processes, including standard silicon photonics, nano-electrooptic mechanics, phase-change materials, electro-absorptive modulation, ring-assisted Mach-Zehnder interferometers, and so forth. All these processes can realize a high-performance matrix multiplication processor. We examine the P-OCD architecture in terms of principle, computing FoMs, and performance on practical tasks. Results show that the P-OCD architecture reaches higher FoMs than cascaded-interferometer-based architectures, especially when lossy interferometers are used, and that it completes practical tasks (artificial neural networks and image reconstruction as benchmarks) more precisely and faster.

2. Principles

2.1 P-OCD architecture

In principle, a matrix multiplication consists of multiple dot products between the matrix rows and the input column vector, so an OCD can serve as the physical building block of a matrix multiplication processor. Deploying multiple OCDs in parallel yields the P-OCD architecture (shown in Fig. 2). A complex-valued vector is represented by multiple input optical fields. A local oscillator (LO) optical field is also input for coherent homodyne detection. In each OCD, a uniform portion of light is tapped from the bus waveguides via inter-layer couplers (ILCs), and an array of interferometers loads complex-valued weights onto the input optical fields. After the interferometers, the weighted optical fields interfere in the optical combiners, yielding the complex-valued dot product. The optical path lengths from the bus waveguides to the combiners are designed to be equal to support high-speed operation of the OCDs. Homodyne detection with the LO light and the 90°-shifted LO light is applied to detect the real part (I) and the imaginary part (Q) of the complex-valued output.
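The I/Q readout described above can be illustrated numerically. The following is a minimal sketch, assuming an idealized balanced detector with unit responsivity behind a lossless 50/50 combiner; the text specifies only homodyne detection with the LO and its 90°-shifted copy, so the balanced configuration and the function names `balanced_homodyne` and `detect_iq` are illustrative assumptions.

```python
def balanced_homodyne(signal, lo):
    """Difference photocurrent of an idealized balanced detector.

    Assumption: a lossless 50/50 combiner delivers (signal + lo)/sqrt(2) and
    (signal - lo)/sqrt(2) to two photodiodes with unit responsivity.
    The difference equals 2 * Re(signal * conj(lo)).
    """
    plus = abs(signal + lo) ** 2 / 2
    minus = abs(signal - lo) ** 2 / 2
    return plus - minus


def detect_iq(signal, lo_amplitude=1.0):
    """Recover (I, Q) of a complex field using the LO and a 90-degree-shifted LO."""
    lo = complex(lo_amplitude, 0.0)
    i_part = balanced_homodyne(signal, lo) / (2 * lo_amplitude)
    q_part = balanced_homodyne(signal, 1j * lo) / (2 * lo_amplitude)
    return i_part, q_part


field = 0.3 - 0.4j  # example complex dot-product output (arbitrary units)
i_part, q_part = detect_iq(field)
print(i_part, q_part)  # close to the real and imaginary parts of `field`
```

In this idealized model the in-phase LO isolates the real part and the quadrature LO isolates the imaginary part; noise and detector imperfections are ignored.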

Fig. 2. The P-OCD architecture. A cross-layer routing structure allocates input optical signals to OCDs uniformly with negligible insertion loss. (a) Detailed structure of the inter-layer waveguide crossing. An inter-layer coupler (ILC) transfers upper-layer (blue) light to the lower layer (green). The height between layers is denoted as ‘h’. (b) Detailed structure of the interferometer. Two phase shifters are adopted to manipulate the complex optical field.

The size of the matrix is assumed to be N × N. The number of input ports is N + 1, the number of OCDs is N, and the maximum number of waveguide crossings in the routing structure is N × (N-1). When the matrix becomes large, conventional planar waveguide crossings introduce unacceptable loss and crosstalk. Because the P-OCD architecture separates the routing structure from the weighting structure, the crossing waveguides can be placed on multiple layers, as shown in Fig. 2(a). Between two layers, insertion loss is caused by evanescent-wave coupling, whose intensity drops exponentially with the layer separation (‘h’ in Fig. 2(a)). Because a wider waveguide confines the optical mode more tightly, broadening the waveguide effectively reduces evanescent-wave coupling between layers. With this method, negligible insertion loss below 0.0001 dB/crossing is achievable with proper design of ILCs and waveguide crossings [19].

Figure 2(b) shows the structure of a weighting interferometer, with two phase shifters at the upper and lower arm, respectively. This structure is capable of manipulating amplitude and phase of an optical field. We assume the input optical field and the targeted output field are A₁·exp(iθ₁) and A₂·exp(iθ₂), respectively. The phase shifts of the upper and the lower arm are represented as follows:

(1)$${\varphi _1} = ({{\theta_2} - {\theta_1}} )+ \arccos \left( {\frac{{{A_2}}}{{{A_1}}}} \right)$$

(2)$${\varphi _2} = ({{\theta_2} - {\theta_1}} )- \arccos \left( {\frac{{{A_2}}}{{{A_1}}}} \right).$$

The above equations represent a complex-valued multiplication with the weight of (A₂/A₁)·exp(i(θ₂-θ₁)).
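As a numerical check of Eqs. (1) and (2), the sketch below models the weighting interferometer as an ideal lossless device that splits the field 50/50, applies φ₁ and φ₂ to the two arms, and recombines them. This idealized transfer function and the helper names `arm_phases` and `interferometer_output` are assumptions made for illustration, not a description of a fabricated device.

```python
import cmath
import math


def arm_phases(a_in, theta_in, a_out, theta_out):
    """Phase shifts (phi1, phi2) of the upper and lower arms per Eqs. (1) and (2)."""
    if a_in <= 0 or not 0 <= a_out <= a_in:
        raise ValueError("a passive interferometer needs 0 <= A2 <= A1 and A1 > 0")
    common = theta_out - theta_in      # sets the output phase
    diff = math.acos(a_out / a_in)     # sets the output amplitude
    return common + diff, common - diff


def interferometer_output(field_in, phi1, phi2):
    """Idealized two-arm interferometer: 50/50 split, phase shift, recombine."""
    return 0.5 * field_in * (cmath.exp(1j * phi1) + cmath.exp(1j * phi2))


e_in = cmath.rect(1.0, 0.3)      # A1 = 1.0, theta1 = 0.3 rad
target = cmath.rect(0.6, -1.1)   # A2 = 0.6, theta2 = -1.1 rad
phi1, phi2 = arm_phases(1.0, 0.3, 0.6, -1.1)
e_out = interferometer_output(e_in, phi1, phi2)
print(abs(e_out - target))       # numerically close to zero
```

The sum of the two arm terms factors into exp(i(φ₁+φ₂)/2)·cos((φ₁-φ₂)/2), which is why the common phase sets θ₂-θ₁ and the differential phase sets A₂/A₁.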

In this section, we introduce the principle of the P-OCD architecture to perform complex-valued matrix multiplication. At the routing structure, 1/N portion of power is allocated to each interferometer. Inside an OCD, complex-valued weights are multiplied, then N outputs of the interferometers are combined coherently. These procedures can be formulated as follows:

(3)$${E_c} = \sqrt P {\left( {\frac{1}{{\sqrt N }}} \right)^2}\sum\limits_{k = 1}^N {{w_k}} ,$$

where P is the input optical power of each port, E_c is the combined optical field, and w_k are the complex-valued weights inside the unit circle (|w_k|≤1). When all weights are in phase with unit amplitude, the combined optical field reaches its maximum amplitude and the combined power equals P. By configuring the amplitude or phase, the interferometers can realize other weight combinations. When the matrix is sparse or nearly sparse, the fixed optical power splitting and combining may introduce power loss (only 1/N² of the power remains in the worst case). One may use the tunable binary tree scheme [20] to design the splitters and combiners and compensate for this power loss. In practice, integrated photonic devices are imperfect, causing a non-uniform distribution of optical power among the interferometers. From the matrix perspective, such imperfection introduces an error into every matrix weight. Because the interferometers are fully parallel, the matrix weights can be configured independently, so the imperfect power distribution can be compensated. To maintain the fidelity of the matrix multiplication, the maximum output power of every interferometer has to be normalized to that of the lossiest one. In other words, device imperfections incur an IL penalty but do not compromise the fidelity of the P-OCD architecture.
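The splitting, weighting, and combining steps of Eq. (3) can be sketched with NumPy. Two points here are assumptions added for illustration: the input symbols `x` are included explicitly (Eq. (3) corresponds to all inputs equal to 1), and the imperfection model in `compensate_weights` treats each interferometer path as a fixed field transmission in (0, 1]. Function names are hypothetical.

```python
import numpy as np


def pocd_matvec(weights, x, power=1.0):
    """Combined fields of N parallel OCDs (one output per matrix row).

    weights: (N, N) complex array with |w| <= 1.
    x:       (N,) complex input symbols with |x| <= 1 (assumed extension of Eq. (3)).
    """
    n = weights.shape[0]
    split = 1.0 / np.sqrt(n)    # routing hands 1/N of the power to each interferometer
    combine = 1.0 / np.sqrt(n)  # N-to-1 coherent combiner
    return np.sqrt(power) * split * combine * (weights @ x)


def compensate_weights(weights, transmission):
    """Pre-scale weights so every path looks as lossy as the lossiest one.

    transmission: (N, N) real field transmissions in (0, 1] (assumed model).
    """
    return weights * transmission.min() / transmission


n = 8
ones = np.ones(n, dtype=complex)

# All weights in phase with unit amplitude: combined power equals P.
full = pocd_matvec(np.ones((n, n), dtype=complex), ones)

# Worst-case sparsity (one non-zero weight per row): only P / N^2 remains.
sparse_w = np.zeros((n, n), dtype=complex)
sparse_w[:, 0] = 1.0
sparse = pocd_matvec(sparse_w, ones)
print(abs(full[0]) ** 2, abs(sparse[0]) ** 2)

# Non-uniform paths: after compensation the effective weights are a uniformly
# scaled copy of the targets, i.e. a pure IL penalty with unchanged fidelity.
rng = np.random.default_rng(0)
target_w = rng.uniform(-0.5, 0.5, (n, n)) + 1j * rng.uniform(-0.5, 0.5, (n, n))
transmission = rng.uniform(0.8, 1.0, (n, n))
effective = transmission * compensate_weights(target_w, transmission)
print(np.allclose(effective, transmission.min() * target_w))
```

Because the scaling factor `transmission.min() / transmission` never exceeds 1, the compensated weights stay inside the unit circle and remain realizable by passive interferometers.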

3. Performances

3.1 Insertion loss and SNR

The major difference between the P-OCD architecture and interferometer-cascaded architectures is that all weighting phase shifters are deployed in parallel. We take the symmetric version of Clements’ Mach-Zehnder interferometer mesh (MZIM) architecture as the representative of the cascaded architectures because it is, to date, the most compact and shallowest architecture [27]. It comprises N + 1 stages of interferometers (two unitary matrices and one diagonal matrix) to perform arbitrary matrix multiplications. From the input port to the output port (or photodetection), its insertion loss accumulation can be formulated as:

(4)$$I{L_{MZIM}}[dB] = A + B \cdot (N + 1)$$

where all coefficients are in dB, A stands for the constant IL, including the IL from input light coupling and the device imperfection penalty, and B stands for the IL that accumulates linearly with integration scale, including the IL from phase shifters and waveguide transmission. For the P-OCD architecture, the IL accumulation is formulated as:

(5)$$I{L_{POCD}}[dB] = A^{\prime} + B^{\prime} \cdot N + C^{\prime} \cdot N(N - 1).$$

In the P-OCD architecture, light travelling from the input to the output incurs the interferometer IL only once. The constant term A’ therefore includes the IL of input light coupling, the ILC, the phase shifter, and the device imperfection penalty; B’ accounts only for waveguide transmission loss; and C’ is the IL of a waveguide crossing. Although the last term grows quadratically with integration scale, the coefficient C’ can be extremely low [19]. Therefore, the P-OCD architecture has the potential to decouple IL from integration scale.

Table 1 (columns in ‘C1’) lists typical IL values of the different sources. ‘α’ denotes the IL of a phase shifter; it is a key parameter and depends strongly on the manufacturing process. Using these values, the coefficients of IL accumulation (all in dB) are A = 3, B = α + 0.01, A’ = 3.4 + α, B’ = 0.01, and C’ = 2 × 10⁻⁴. Figure 3(a) illustrates the IL accumulation of the MZIM and P-OCD architectures. At ultralow α, the MZIM architecture does not accumulate significant loss and performs better than the P-OCD. As α increases, the IL of the MZIM rapidly exceeds that of the P-OCD because its linear accumulation coefficient B becomes large. With α above 0.5 dB, the MZIM easily reaches an unacceptable IL, which limits large-scale integration and computing throughput. In contrast, the P-OCD can achieve large-scale integration even with lossy phase shifters. The P-OCD architecture is therefore more tolerant of phase shifter IL and is a promising candidate for achieving better computing precision at identical input optical power.
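Equations (4) and (5) with the coefficients quoted above can be evaluated directly. The short sketch below does so for a few scales and phase shifter losses; the chosen values of N and α are arbitrary sample points rather than those plotted in Fig. 3(a), and the function names are hypothetical.

```python
def il_mzim_db(n, alpha):
    """Eq. (4) with A = 3 dB and B = alpha + 0.01 dB."""
    return 3.0 + (alpha + 0.01) * (n + 1)


def il_pocd_db(n, alpha):
    """Eq. (5) with A' = 3.4 + alpha, B' = 0.01, and C' = 2e-4 (all in dB)."""
    return (3.4 + alpha) + 0.01 * n + 2e-4 * n * (n - 1)


for alpha in (0.01, 0.1, 0.5):          # phase shifter IL in dB (sample values)
    for n in (16, 64, 128):             # matrix size N (sample values)
        print(f"alpha={alpha:4.2f} dB  N={n:3d}  "
              f"MZIM={il_mzim_db(n, alpha):6.2f} dB  "
              f"P-OCD={il_pocd_db(n, alpha):6.2f} dB")
```

For N = 64, the MZIM is slightly better at α = 0.01 dB (about 4.3 dB versus 4.9 dB), while at α = 0.5 dB it accumulates about 36 dB compared with roughly 5.3 dB for the P-OCD, which matches the trend described in the text.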

Fig. 3. IL and SNR evaluation of the MZIM and P-OCD architectures. (a) Comparison of IL accumulation of the MZIM and P-OCD architecture. The solid and the dashed lines show the IL of MZIM and P-OCD architecture, respectively. Different phase shifter loss is shown with different colors. (b) Achievable SNR vs. scale and phase shifter loss of the MZIM. The SNR is shown as contours. For example, the ‘6-bit’ contour surrounds an area whose SNR at least supports 6-bit computing precision (i.e. SNR ≥ 37.88 dB). (c) Achievable SNR of the P-OCD. (b) and (c) share the same color bar.

Table 1. Parameters used in this study

SNR measures the precision of computing. The noise of the P-OCD is contributed by the laser, the input signal generator (digital-to-analog converter), and the amplified photodetector (PD). Here, deviations of the weighting interferometers and splitting imperfections are not considered as noise since they can be compensated with various methods; the effects of device imperfections are accounted for as IL. The SNR is then modelled with the relative intensity noise (RIN) of the lasers and the noise of the amplified PDs (amplifier noise and shot noise), formulated as follows:

(6)$$SNR = \frac{{{\alpha _l}{M^2}{P_{in}}}}{{NEP \cdot {R_{pd}}\sqrt {\varDelta f} + 2 \cdot {\alpha _l}{M^2}q{R_{pd}}\varDelta f{P_{in}} + \frac{1}{2}{\alpha _l}{M^2}RIN \cdot {R_{pd}}\varDelta f{P_{in}}^2}}$$

where α_l is the IL of optical link in linear unit, M is the modulation depth of the input optical signal (M = 0.5 in this work), P_in is the input optical power, NEP is the noise-equivalent power of the PD amplifier, R_pd is the responsivity of PD, Δf is the bandwidth (computing speed), q is the elementary charge, and RIN is the relative intensity noise of the laser. The values of these parameters are listed in Table 1 (columns in ‘C2’). The noise from PD amplifiers is a constant with fixed bandwidth and is the most impactful factor when input optical power is weak. When the input optical power is large, the RIN noise dominates and induces an upper limit to feasible SNR.

Typically, an optical processor adopts the lowest possible input optical power. The major noise source is then the first term, the amplified photodetection noise, which is fixed once the photodetector (PD) type and bandwidth are fixed. Without extra pump light, a larger insertion loss of the optical link directly results in lower SNR. Figures 3(b) and 3(c) show the achievable SNR at different scales and phase shifter losses (α). The area enclosed by a contour shows the feasible scales and phase shifter losses for the corresponding SNR, expressed as bit precision in the figure. For the MZIM architecture, there is a trade-off between scale and phase shifter loss. The loss requirement on phase shifters becomes extremely stringent if both large-scale integration and an adequate SNR are desired, and fabricating such phase shifters is challenging. For the P-OCD architecture, large-scale integration and adequate SNR can both be achieved over a much wider range of phase shifter IL. Thus, various types of phase shifters (e.g. thermo-optic (TO), micro-electromechanical system (MEMS), and phase-change material (PCM)) and various photonic integration processes (e.g. standard silicon-on-insulator (SOI), III-V) are all feasible for building the matrix multiplication processor.
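The bit-precision contours can be related to SNR values with a short helper. The 37.88 dB quoted for 6-bit precision in the caption of Fig. 3 coincides with the standard ideal-quantizer relation SNR = 6.02·b + 1.76 dB; the sketch below assumes that the same relation is used for the other contours, which the text does not state explicitly.

```python
def snr_db_for_bits(bits):
    """SNR (dB) needed for a given bit precision, assuming SNR = 6.02*b + 1.76 dB."""
    return 6.02 * bits + 1.76


def bits_for_snr_db(snr_db):
    """Effective bit precision supported by a given SNR (inverse of the relation above)."""
    return (snr_db - 1.76) / 6.02


for bits in (4, 6, 8):
    print(bits, "bit ->", round(snr_db_for_bits(bits), 2), "dB")

# The 20 dB SNR used later for the area-efficiency comparison corresponds to
# roughly 3 effective bits under this assumption.
print(round(bits_for_snr_db(20.0), 2))
```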

3.2 Throughput and energy efficiency

Computing throughput is determined by the operating frequency and the number of computing units. For all interferometer-based architectures (including the P-OCD), the operating frequency is limited by the input electro-optic modulators and output PDs, because the bandwidths of optical splitters, interferometers, and couplers are much larger than those of the electro-optic conversions. In this work, we use a fixed operating frequency of 10 Gbaud for evaluations. With the operating frequency fixed, the way to increase computing throughput is to build larger-scale computing units. For the MZIM architecture, the feasible integration scale is severely limited by the IL of every interferometer. From Fig. 3 we read that achieving over 100 TOPS (tera-operations per second, i.e. 10¹² operations per second) with over 4-bit precision requires phase shifters with IL lower than 0.15 dB. For the P-OCD architecture, the feasible integration scale is limited not by interferometer loss but by routing loss. With a well-designed inter-layer routing structure, the throughput of the P-OCD can exceed 500 TOPS with over 4-bit precision, without constraining the IL of the phase shifters.

The most interesting feature of photonic matrix multiplication is that, with low-consumption weighting devices, the energy consumed per operation is inversely proportional to the integration scale, because the energy consumption grows linearly with the lateral scale while the throughput grows quadratically with it. As the P-OCD architecture is a promising candidate for achieving large-scale integration, its advantage lies in energy efficiency. Several representative phase shifters reported in the literature [28–41] are chosen for the evaluation of energy consumption. Table 2 lists the key specifications of these phase shifters. TO phase shifters are typically low-loss, but their energy consumption is high. MEMS- and PCM-based phase shifters feature ultra-low or even zero static energy consumption, but their insertion loss is sensitive to the fabrication process.

Table 2. Parameters of phase shifters adopted for performance evaluation.

For photonic matrix multiplication processors, energy is dissipated mainly by lasers (or optical amplifiers), digital-to-analog converters (DACs), modulator drivers, weight tuning, PD amplifiers, analog-to-digital converters (ADCs), and affiliated electronics. Because linear computing accounts for the major part of the energy consumption, the energy dissipation of affiliated electronics is excluded from our model. The power use of lasers, DACs, modulators, PD amplifiers, and ADCs scales linearly with the lateral scale of the matrix, while that of weight tuning scales quadratically. The total power use is formulated as follows:

(7)$${P_{tot}} = N \cdot ({{P_{laser}}/{\eta_{wpe}} + {P_{DAC}} + {P_{mod}} + {P_{TIA}} + {P_{ADC}}} )+ {N^2} \cdot {P_{weight}}$$

where η_wpe is the wall-plug efficiency of a laser, 10% chosen in this work. Power use of different parts is listed in Table 1 (columns in ‘C3’). The operating frequency is 10 Gbaud. The power use of weighting interferometers can be found in Table 2 and a weighting interferometer contains two phase shifters.
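The scaling behaviour implied by Eq. (7) can be explored with a small sketch. The per-channel powers below are placeholders chosen only to show the trend; they are not the Table 1 or Table 2 values. Counting two operations (one multiplication and one addition) per matrix element per symbol is likewise an assumption, since the text does not define how operations are counted.

```python
def total_power_w(n, p_laser, p_dac, p_mod, p_tia, p_adc, p_weight, eta_wpe=0.10):
    """Total power of an N x N processor following Eq. (7), in watts."""
    per_channel = p_laser / eta_wpe + p_dac + p_mod + p_tia + p_adc
    return n * per_channel + n ** 2 * p_weight


def throughput_ops(n, baud=10e9, ops_per_mac=2):
    """Operations per second; ops_per_mac = 2 is an assumed counting convention."""
    return ops_per_mac * n ** 2 * baud


def efficiency_tops_per_w(n, **power_terms):
    """Energy efficiency in TOPS/W for a given scale and set of power terms."""
    return throughput_ops(n) / total_power_w(n, **power_terms) / 1e12


# Placeholder per-channel powers in watts (NOT the values of Table 1).
placeholder = dict(p_laser=1e-3, p_dac=50e-3, p_mod=10e-3, p_tia=20e-3, p_adc=50e-3)

for p_weight in (0.0, 10e-3):  # zero-static-power vs. power-hungry weighting (placeholders)
    row = [round(efficiency_tops_per_w(n, p_weight=p_weight, **placeholder), 2)
           for n in (16, 64, 256)]
    print(f"P_weight = {p_weight * 1e3:.0f} mW ->", row, "TOPS/W")
```

With zero weighting power, the efficiency grows in proportion to N because only the linear term of Eq. (7) remains. With a non-zero weighting power, the N² term dominates at large scale and the efficiency saturates at `ops_per_mac * baud / p_weight`, which is the kind of upper limit observed for TO phase shifters.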

Figure 4 illustrates the energy efficiency evaluation for the MZIM and P-OCD architectures. Figures 4(a) and 4(b) show that, with TO phase shifters, energy efficiency has an upper limit of around 2 TOPS/W, because phase shifters account for the majority of the energy consumption at large scale and the number of phase shifters is linearly proportional to throughput. However, with low-consumption MEMS-based (Figs. 4(c) and 4(d)) and PCM-based (Figs. 4(e) and 4(f)) phase shifters, high energy efficiency can be achieved at large scale. Because the MZIM architecture accumulates IL, its achievable scale and energy efficiency are limited. Using MEMS-based phase shifters (IL = 0.04 dB), over 100 TOPS/W can be obtained at a scale of around 30 × 30 with a low SNR requirement. Using PCM-based phase shifters (IL = 0.5 dB), the energy efficiency cannot surpass 100 TOPS/W because even this modest increase in IL limits the achievable scale. For the P-OCD architecture, an energy efficiency of over 100 TOPS/W is achievable with both MEMS and PCM mechanisms owing to the decoupling of IL and scale. At a scale of around 150 × 150, the energy efficiency can reach ∼120 TOPS/W.

Fig. 4. Energy efficiency of the MZIM and P-OCD architectures with different phase shifter mechanisms. Energy efficiencies with TO ((a) and (b)), MEMS ((c) and (d)), and PCM ((e) and (f)) phase shifters. The upper row shows the energy efficiency of the MZIM and the lower row shows that of the P-OCD.

3.3 Area efficiency

The P-OCD architecture accepts various types of phase shifters to build high-performance matrix multiplication processor. It is feasible for moderately lossy but compact phase shifters to achieve high area efficiency without affecting throughput and SNR. Table 2 also lists the length of different types of phase shifters, which are used in the area efficiency evaluation. Figure 5 illustrates the energy/area efficiency of the MZIM architecture and P-OCD architecture with different phase shifters. The SNR is set at 20 dB and the best integration scale in the feasible set is chosen for the calculation of energy efficiency and area efficiency.

Fig. 5. Area and energy efficiency with different types of phase shifters. Different phase shifters are represented by different markers. Each phase shifter is calculated twice. Grey markers show their performance in the MZIM architecture. Colored markers show their performance in the P-OCD architecture. The red stars represent Nvidia A100 GPU at ‘int-8’ and ‘int-4’ precisions for reference. ‘Int-8’ performance is the published data and ‘int-4’ performance is calculated via typical digital processor principle.

Compared with the MZIM, the P-OCD architecture has an extra routing structure, while the number of phase shifters is the same (N²). Therefore, the area efficiency of the P-OCD is inevitably lower than that of the MZIM. The footprint of the routing structure is assumed to be N × N × 10 μm × 100 μm, considering the footprint of the ILCs and the waveguide interval. For the interferometers, the interval between phase shifters is set to 100 μm. For large phase shifters (e.g. MOS and BTO), the interferometer structure occupies most of the area, so the extra routing structure does not noticeably influence the area efficiency. For compact phase shifters, however, the area efficiency of the P-OCD is 1% to 30% lower than that of the MZIM. Note that the energy efficiency of the P-OCD is about ten times higher, so this sacrifice in area efficiency is acceptable in many cases. One way of increasing throughput and area efficiency is wavelength division multiplexing (WDM), i.e. multi-channel data on multiple wavelengths enter one processor to perform parallel computing. The proposed architecture can be upgraded with broadband couplers or binary tree splitters to support WDM operation. We also observe that low-consumption phase shifters (excluding TO) reach similar energy efficiencies (near 35 TOPS/W) in the P-OCD architecture, again because the P-OCD decouples IL from integration scale; fluctuation of the phase shifter loss does not noticeably affect energy efficiency. It can thus be inferred that the optimal computing performance is obtained with compact and low-consumption phase shifters, and that their IL is not the most impactful factor.

3.4 Computing tasks

Matrix multiplication is a general and fundamental operation in various applications. Different computing tasks require different precision to obtain reliable results. Here, we simulate two representative computing tasks, a deep neural network (DNN) and image reconstruction, to evaluate the P-OCD architecture. For the DNN task, the model is ResNet-50 [42] and the test dataset is ImageNet. ResNet-50 comprises multiple convolutional layers. All convolutional operations are transformed into matrix multiplications using general matrix multiplication (GeMM), so that all linear layers of ResNet-50 can be executed by the photonic processor. The network structure and parameters are downloaded from an open-source pretrained model. The photonic processor performs only the inference phase. The baseline accuracy is calculated from 500 randomly picked ImageNet images without introducing noise. By changing the power of the added Gaussian noise, we calculate the accuracy at different bit precisions. For the image reconstruction task, sparse k-space reconstruction is simulated. The original image is Fourier transformed into a k-space map, the 30% largest k-space values are retained, and the other values are set to zero. The image is then reconstructed from the k-space map via inverse Fourier transformation, which is performed by the matrix multiplication processor. The structural similarity index measure (SSIM) is chosen as the FoM for image reconstruction; its baseline is calculated from the noise-free reconstructed image and the original image. By adding different levels of Gaussian noise to the inverse Fourier transformation, the SSIM at different bit precisions can be calculated.
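The sparse k-space reconstruction procedure can be sketched as follows. This is a simplified illustration under several assumptions: a synthetic 32 × 32 test image replaces ‘Cameraman’, the noise is added to the output of each matrix product at a prescribed SNR, and a relative RMS error stands in for SSIM to keep the example dependent on NumPy only. All function names are hypothetical.

```python
import numpy as np


def dft_matrix(n):
    """Unnormalized n x n DFT matrix."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)


def sparse_kspace(image, keep=0.30):
    """Fourier transform the image and zero all but the largest `keep` fraction of values."""
    kspace = np.fft.fft2(image)
    threshold = np.quantile(np.abs(kspace), 1.0 - keep)
    return np.where(np.abs(kspace) >= threshold, kspace, 0)


def noisy_matmul(a, b, snr_db, rng):
    """Matrix product with complex Gaussian noise added at the given output SNR."""
    out = a @ b
    noise_power = np.mean(np.abs(out) ** 2) / 10 ** (snr_db / 10)
    sigma = np.sqrt(noise_power / 2)
    noise = rng.normal(scale=sigma, size=out.shape) + 1j * rng.normal(scale=sigma, size=out.shape)
    return out + noise


def reconstruct(kspace, snr_db, rng):
    """2-D inverse DFT carried out as two noisy matrix multiplications."""
    n_rows, n_cols = kspace.shape
    inv_rows = np.conj(dft_matrix(n_rows)) / n_rows
    inv_cols = np.conj(dft_matrix(n_cols)) / n_cols
    return noisy_matmul(noisy_matmul(inv_rows, kspace, snr_db, rng), inv_cols, snr_db, rng).real


def relative_error(reference, estimate):
    """Relative RMS error, used here as a simple stand-in for SSIM."""
    return np.linalg.norm(estimate - reference) / np.linalg.norm(reference)


rng = np.random.default_rng(0)
y, x = np.mgrid[-1:1:32j, -1:1:32j]
image = np.exp(-((x - 0.3) ** 2 + (y + 0.2) ** 2) / 0.1) + 0.05 * rng.random((32, 32))

kspace = sparse_kspace(image)
for snr_db in (10, 25, 40):
    recon = reconstruct(kspace, snr_db, rng)
    print(snr_db, "dB ->", round(relative_error(image, recon), 4))
```

The reconstruction error falls as the SNR rises and approaches the floor set by discarding 70% of the k-space values, mirroring the precision-dependent behaviour described for SSIM.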

The achievable throughput is directly related to the integration scale, which is determined by the desired SNR. The SNR is in turn directly related to the FoMs of the computing tasks. Therefore, we can construct a direct relationship between throughput and task FoMs, shown in Fig. 6. Figure 6(a) shows the classification accuracy of the DNN against the achievable throughput of the matrix multiplication processor. At low bit precision, throughput is high but accuracy is not acceptable. Increasing the desired SNR and classification accuracy sacrifices throughput. This trade-off between throughput and accuracy is unavoidable owing to the analog nature of the computation. As shown in the figure, the upper limit of the trade-off differs between the P-OCD and the MZIM. Using ultralow-loss (IL = 0.04 dB) MEMS phase shifters, the MZIM performs better than the P-OCD architecture at low bit precision and slightly worse at high bit precision. However, when PCM phase shifters (IL = 0.5 dB) are used, the P-OCD still shows a similar throughput-accuracy trade-off, while the throughput of the MZIM decreases by ∼100 times. Figure 6(b) shows the SSIM and achievable throughput in the image reconstruction task. Similar to the DNN task, there is a trade-off between SSIM and throughput. For the P-OCD architecture, a larger phase shifter IL hardly influences the trade-off limit, but the throughput of the MZIM architecture drops by ∼100 times when lossy PCM phase shifters are used. This result further supports the conclusion that the P-OCD architecture is tolerant of phase shifter IL and compatible with a plethora of integration technologies.

Fig. 6. Performance of photonic matrix multiplication on computing tasks. (a) Top-5 classification accuracy of ResNet-50 on the ImageNet dataset. Dashed lines show the performance of the MZIM architecture and solid lines show the performance of the P-OCD architecture. Different colors mark different kinds of phase shifters. (b) SSIM of image reconstruction of the ‘Cameraman’ image. Insets are the reconstructed images at different SSIM levels.

4. Conclusion

We present and numerically study the P-OCD architecture for large-scale complex-valued matrix multiplication. The inter-layer routing structure enables parallel deployment of multiple OCDs with ultralow loss. As a result, the weighting phase shifters are fully parallel instead of cascaded as in conventional architectures, and their IL does not accumulate at large integration scale. In many cases, there is a trade-off between power use, footprint, and loss: a large number of phase shifters are energy-efficient and compact but lossy. The P-OCD architecture can take advantage of these phase shifters to achieve high-performance computing. We also examine the P-OCD architecture in terms of IL, SNR, throughput, energy efficiency, and area efficiency. Results show that the P-OCD architecture can achieve better performance than conventional non-parallel architectures. When the phase shifter is lossy (0.5 dB), the P-OCD architecture can reach 100× higher throughput and 10× higher energy efficiency than conventional ones. Two computing tasks (DNN and image reconstruction) are simulated. The throughput of the P-OCD is hardly affected by the loss of the phase shifters, demonstrating the compatibility of this architecture with diverse phase shifters. We believe that the presented architecture can significantly lower the technical challenge of photonic processor fabrication. It has the potential to meet the demand for fast and efficient matrix processing in applications such as AI, wired/wireless communications, autonomous driving, and data mining.

Funding

National Key Research and Development Program of China (2019YFB2203700); National Natural Science Foundation of China (62205203, T2225023).

Disclosures

The authors declare no conflict of interest.

Data availability

Data underlying the results presented in this paper are publicly available from the references.

References

1. S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient Primitives for Deep Learning,” preprint at arXiv, https://arxiv.org/abs/1410.0759, (2014).

2. V. Lottici, A. D’Andrea, and U. Mengali, “Channel estimation for ultra-wideband communications,” IEEE J. Select. Areas Commun. 20(9), 1638–1645 (2002).

3. J. Choi, J. Lee, D. Kim, G. Soprani, P. Cerri, A. Broggi, and K. Yi, “Environment-detection-and-mapping algorithm for autonomous driving in rural or off-road environment,” IEEE Trans. Intell. Transport. Syst. 13(2), 974–982 (2012).

4. H. Kargupta, W. Huang, K. Sivakumar, B. Park, and S. Wang, “Collective principal component analysis from distributed, heterogeneous data,” European Conference on Principles of Data Mining and Knowledge Discovery, 452–547 (2002).

5. Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang, “A survey of accelerator architectures for deep neural networks,” Engineering 6(3), 264–274 (2020).

6. B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. P. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal, “Photonics for artificial intelligence and neuromorphic computing,” Nat. Photonics 15(2), 102–114 (2021).

7. H. Zhou, J. Dong, J. Cheng, W. Dong, C. Huang, Y. Shen, Q. Zhang, M. Gu, C. Qian, H. Chen, Z. Ruan, and X. Zhang, “Photonic matrix multiplication lights up photonic accelerator and beyond,” Light: Sci. Appl. 11(1), 30 (2022).

8. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017).

9. S. Xu, J. Wang, R. Wang, J. Chen, and W. Zou, “High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays,” Opt. Express 27(14), 19778–19787 (2019).

10. G. Mourgias-Alexandris, A. Totović, A. Tsakyridis, N. Passalis, K. Vyrsokinos, A. Tefas, and N. Pleros, “Neuromorphic photonics with coherent linear neurons using dual-IQ modulation cells,” J. Lightwave Technol. 38(4), 811–819 (2020).

11. X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, and D. J. Moss, “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature 589(7840), 44–51 (2021).

12. S. Xu, J. Wang, and W. Zou, “Optical convolutional neural network with WDM-based optical patching and microring weighting banks,” IEEE Photonics Technol. Lett. 33(2), 89–92 (2021).

13. A. Tait, T. Ferreira de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep. 7(1), 7430 (2017).

14. J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice, “All-optical spiking neurosynaptic networks with self-learning capabilities,” Nature 569(7755), 208–214 (2019).

15. J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589(7840), 52–58 (2021).

16. C. Wu, H. Yu, S. Lee, R. Peng, I. Takeuchi, and M. Li, “Programmable phase-change metasurfaces on waveguides for multimode photonic convolutional neural network,” Nat. Commun. 12(1), 96 (2021).

17. R. Tang, R. Tanomura, T. Tanemura, and Y. Nakano, “Ten-port unitary optical processor on a silicon photonic chip,” ACS Photonics 8(7), 2074–2080 (2021).

18. S. Xu, J. Wang, H. Shu, Z. Zhang, S. Yi, B. Bai, X. Wang, and W. Zou, “Optical coherent dot-product chip for sophisticated deep learning regression,” Light: Sci. Appl. 10(1), 221 (2021).

19. J. Chiles, S. Buckley, N. Nader, S. Nam, R. P. Mirin, and J. M. Shainline, “Multi-planar amorphous silicon photonics with compact interplanar couplers, cross talk mitigation, and low crossing loss,” APL Photonics 2(11), 116101 (2017).

20. D. A. B. Miller, “Self-aligning universal beam splitter,” Opt. Express 21(5), 6360–6370 (2013).

21. J. Sun, E. Timurdogan, A. Yaacobi, E. S. Hosseini, and M. R. Watts, “Large-scale nanophotonic phased array,” Nature 493(7431), 195–199 (2013).

22. G. Mourou, B. Brocklesby, T. Tajima, and J. Limpert, “The future is fibre accelerators,” Nat. Photonics 7(4), 258–261 (2013). [CrossRef]

23. H. Huang and T. Kuo, “A 0.07-mm² 162-mW DAC achieving >65 dBc SFDR and < −70 dBc IM3 at 10 GS/s with output impedance compensation and concentric parallelogram routing,” IEEE J. Solid-State Circuits 55(9), 2478–2488 (2020). [CrossRef]

24. C. Wang, M. Zhang, X. Chen, M. Bertrand, A. Shams-Ansari, S. Chandrasekhar, P. Winzer, and M. Lončar, “Integrated lithium niobate electro-optic modulators operating at CMOS-compatible voltages,” Nature 562(7725), 101–104 (2018). [CrossRef]

25. S. Zohoori, M. Dolatshahi, M. Pourahmadi, and M. Hajisafari, “A CMOS, low-power current-mirror-based transimpedance amplifier for 10 Gbps optical communications,” Microelectron. J. 80, 18–27 (2018). [CrossRef]

26. B. Murmann, “The race for the extra decibel: a brief review of current ADC performance trajectories,” IEEE Solid-State Circuits Mag. 7(3), 58–66 (2015). [CrossRef]

27. B. A. Bell and I. A. Walmsley, “Further compactifying linear optical unitaries,” APL Photonics 6(7), 070804 (2021). [CrossRef]

28. M. Mendez-Astudillo, M. Okamoto, Y. Ito, and T. Kita, “Compact thermo-optic MZI switch in silicon-on-insulator using direct carrier injection,” Opt. Express 27(2), 899–906 (2019). [CrossRef]

29. P. Sun and R. M. Reano, “Submilliwatt thermo-optic switches using freestanding silicon-on-insulator strip waveguides,” Opt. Express 18(8), 8406–8411 (2010). [CrossRef]

30. J. Parra, J. Hurtado, A. Griol, and P. Sanchis, “Ultra-low loss hybrid ITO/Si thermo-optic phase shifter with optimized power consumption,” Opt. Express 28(7), 9393–9404 (2020). [CrossRef]

31. R. Baghdadi, M. Gould, S. Gupta, M. Tymchenko, D. Bunandar, C. Ramey, and N. C. Harris, “Dual slot-mode NOEM phase shifter,” Opt. Express 29(12), 19113–19119 (2021). [CrossRef]

32. P. Edinger, C. Errando-Herranz, and K. B. Gylfason, “Low-loss MEMS phase shifter for large scale reconfigurable silicon photonics,” MEMS Conference 2019, 27–31 (2019).

33. C. Papon, X. Zhou, H. Thyrrestrup, Z. Liu, S. Stobbe, R. Schott, A. D. Wieck, A. Ludwig, P. Lodahl, and L. Midolo, “Nanomechanical single-photon routing,” Optica 6(4), 524–530 (2019). [CrossRef]

34. Q. Zhang, Y. Zhang, J. Li, R. Soref, T. Gu, and J. Hu, “Broadband nonvolatile photonic switching based on optical phase change materials: beyond the classical figure-of-merit,” Opt. Lett. 43(1), 94–97 (2018). [CrossRef]

35. N. Dhingra, J. Song, G. J. Saxena, E. K. Sharma, and B. M. A. Rahman, “Design of a compact low-loss phase shifter based on optical phase change material,” IEEE Photonics Technol. Lett. 31(21), 1757–1760 (2019). [CrossRef]

36. P. Xu, J. Zheng, J. K. Doylend, and A. Majumdar, “Low-loss and broadband nonvolatile phase-change directional coupler switches,” ACS Photonics 6(2), 553–557 (2019). [CrossRef]

37. M. Takenaka, J. Han, J. Park, F. Boeuf, J. Fujikata, S. Takahashi, and S. Takagi, “High-efficiency, low-loss optical phase modulator based on III-V/Si hybrid MOS capacitor,” Optical Fiber Communication Conference, Tu3K.3 (2018).

38. J. Han, F. Boeuf, J. Fujikata, S. Takahashi, S. Takagi, and M. Takenaka, “Efficient low-loss InGaAsP/Si hybrid MOS optical modulator,” Nat. Photonics 11(8), 486–490 (2017). [CrossRef]

39. Y. Xing, T. Ako, J. P. George, D. Korn, H. Yu, P. Verheyen, M. Pantouvaki, G. Lepage, P. Absil, A. Ruocco, C. Koos, J. Leuthold, K. Neyts, and J. Beeckman, “Digitally controlled phase shifter using an SOI slot waveguide with liquid crystal infiltration,” IEEE Photonics Technol. Lett. 27(12), 1269–1272 (2015). [CrossRef]

40. R. Shamy, A. A. Osama, A. E. Afifi, and M. A. Swillam, “A compact 100 GHz femtojoule silicon-organic hybrid modulator based on a novel Mach–Zehnder interferometer design,” J. Opt. 23(9), 095801 (2021). [CrossRef]

41. F. Eltes, C. Mai, D. Caimi, M. Kroh, Y. Popoff, G. Winzer, D. Petousi, S. Lischke, J. E. Ortmann, L. Czornomaz, L. Zimmermann, J. Fompeyrine, and S. Abel, “A BaTiO₃-based electro-optic Pockels modulator monolithically integrated on an advanced silicon photonics platform,” J. Lightwave Technol. 37(5), 1456–1462 (2019). [CrossRef]

42. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” The IEEE/CVF Computer Vision and Pattern Recognition Conference, 770–778 (2016).

C1. Insertion loss parameters
IL source	Value (dB)
Input coupling	2
Waveguide transmission	0.01/0.1 mm
Phase shifter	α
ILC	0.4
Inter-layer crossing [19]	0.0002
Penalty [21]	1

C2. SNR modelling
Parameter	Value
NEP	10 pW/√Hz
R_pd	0.8 A/W
Δf	5 GHz
RIN	-145 dB/Hz

C3. Energy consumption modelling
Parameter	Value
η_wpe	0.1 [22]
P_DAC	2.5 mW [23]
P_mod	0.02 mW [24]
P_TIA	1.4 mW [25]
P_ADC	30.7 mW [26]

Type	Reference	IL (dB)	Power use (μW/π)	Footprint (μm)
TO	[28]	1.1	28000	32.5
	[29]	2.8	540	100
	[30]	0.01	10000	70
MEMS	[31]	0.04	0.1^a	40
	[32]	0.3		70
	[33]	0.67		26
PCM	[34]	0.32	0^b	80
	[35]	0.5		50
	[36]	1		50
MOS	[37]	0.25	0.1^a	500
MOS	[38]	0.23	0.1^a	500
LCOS	[39]	0.35	0.1^a	70
SOH	[40]	1.15	0.1^a	167
BTO	[41]	0.58	0.1^a	1000
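
As a rough illustration of why the per-device IL in the table above matters when losses accumulate, the following minimal Python sketch converts each listed IL to a power transmission and to the total loss of several identical phase shifters in series. The stage count `N_STAGES` and the helper names are hypothetical and chosen only for illustration. The simple series model is an assumption and is not the loss model used in the paper.

```python
# IL values (dB) copied from the phase-shifter table above.
IL_DB = {
    "TO [28]": 1.1, "TO [29]": 2.8, "TO [30]": 0.01,
    "MEMS [31]": 0.04, "MEMS [32]": 0.3, "MEMS [33]": 0.67,
    "PCM [34]": 0.32, "PCM [35]": 0.5, "PCM [36]": 1.0,
    "MOS [37]": 0.25, "MOS [38]": 0.23,
    "LCOS [39]": 0.35, "SOH [40]": 1.15, "BTO [41]": 0.58,
}

N_STAGES = 64  # hypothetical number of phase shifters in series (assumption)


def db_to_transmission(il_db: float) -> float:
    """Convert an insertion loss in dB to a linear power transmission."""
    return 10 ** (-il_db / 10)


def cascaded_il_db(il_db: float, n_stages: int) -> float:
    """Total IL in dB of n identical stages in series (losses in dB add)."""
    return il_db * n_stages


if __name__ == "__main__":
    # List devices from lowest to highest per-stage insertion loss.
    for name, il in sorted(IL_DB.items(), key=lambda kv: kv[1]):
        total = cascaded_il_db(il, N_STAGES)
        print(f"{name:10s} {il:5.2f} dB/stage -> {total:7.2f} dB total, "
              f"T = {db_to_transmission(total):.3e}")
```

Under this assumed series model, the gap between a 0.01 dB and a 1 dB device grows linearly in dB with the number of stages, which is the accumulation effect described in the Introduction.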

Optics Express