Optica Publishing Group
Journal of Lightwave Technology, Vol. 41, Issue 12, pp. 3737-3749 (2023)

Peta-Scale Embedded Photonics Architecture for Distributed Deep Learning Applications

Open Access

Abstract

As Deep Learning (DL) models grow larger and more complex, training jobs are increasingly distributed across multiple Computing Units (CUs) such as GPUs and TPUs. Each CU processes a sub-part of the model and synchronizes its results with the others. Communication among these CUs has emerged as a key bottleneck in the training process. In this work, we present SiPAC, a Silicon Photonic Accelerated Compute cluster. SiPAC accelerates distributed DL training through two co-designed components: a photonic physical layer and a novel collective algorithm. The physical layer exploits embedded photonics to bring peta-scale I/O directly to the CUs of a DL-optimized cluster and uses resonator-based optical wavelength selectivity to realize hardware multicasting. The collective algorithm builds on this hardware multicasting primitive. Together, these components expedite a variety of collective communications commonly employed in DL training and have the potential to drastically ease the communication bottleneck. We demonstrate the feasibility of the SiPAC architecture through 1) an optical testbed experiment in which an array of comb laser wavelengths is shuffled by a cascaded ring switch, with each ring selecting and forwarding multiple wavelengths to increase the effective communication bandwidth, thereby demonstrating the hardware multicasting primitive, and 2) a four-GPU testbed running a realistic DL workload that achieves a 22% system-level performance improvement relative to a similarly sized leaf-spine topology. Large-scale simulations show that SiPAC achieves a 1.4× to 5.9× reduction in communication time compared to state-of-the-art compute clusters for representative collective communications.
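To build intuition for why a hardware multicast primitive can shorten collective communication, consider a minimal step-count sketch (not the authors' SiPAC algorithm): in a classic ring AllGather over N CUs, each CU forwards one shard per step, so N - 1 steps are needed, whereas if each optical transmission reaches k receivers at once, shard coverage grows geometrically and roughly ceil(log_{k+1}(N)) steps suffice. The fan-out k and the uniform step model below are illustrative assumptions, not parameters from the paper.

```python
# Illustrative comparison of collective step counts, assuming a uniform
# step model and an assumed multicast fan-out k. This is a sketch, not
# the SiPAC collective algorithm.

import math


def ring_allgather_steps(n: int) -> int:
    """Classic ring AllGather: each CU passes one shard per step, N - 1 steps."""
    return n - 1


def multicast_allgather_steps(n: int, fanout: int) -> int:
    """AllGather where each transmission is multicast to `fanout` receivers.

    After each step, every shard is held by (fanout + 1) times as many CUs,
    so full coverage takes about ceil(log base (fanout + 1) of N) steps.
    """
    return math.ceil(math.log(n, fanout + 1))


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(f"N={n:4d}  ring={ring_allgather_steps(n):3d}  "
              f"multicast(k=4)={multicast_allgather_steps(n, 4):2d}")
```

At N = 256 this toy model gives 255 ring steps versus 4 multicast steps with k = 4, which is the qualitative effect the abstract attributes to combining ring-resonator wavelength selectivity with a multicast-aware collective.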

More Like This
Flexible silicon photonic architecture for accelerating distributed deep learning

Zhenguo Wu, Liang Yuan Dai, Yuyang Wang, Songli Wang, and Keren Bergman
J. Opt. Commun. Netw. 16(2) A157-A168 (2024)

Fast and scalable all-optical network architecture for distributed deep learning

Wenzhe Li, Guojun Yuan, Zhan Wang, Guangming Tan, Peiheng Zhang, and George N. Rouskas
J. Opt. Commun. Netw. 16(3) 342-357 (2024)

Modoru: Clos nanosecond optical switching for distributed deep training [Invited]

Cen Wang, Noboru Yoshikane, Daniel Elson, Yuta Wakayama, Daiki Soma, Shohei Beppu, and Takehiro Tsuritani
J. Opt. Commun. Netw. 16(1) A40-A52 (2024)


