High-port low-latency optical switch architecture with optical feed-forward buffering for 256-node disaggregated data centers

Open Access

Abstract

Departing from traditional server-centric data center architectures towards disaggregated systems that can offer increased resource utilization at reduced cost and energy envelopes, the use of high-port switching with highly stringent latency and bandwidth requirements becomes a necessity. We present an optical switch architecture exploiting a hybrid broadcast-and-select/wavelength routing scheme with small-scale optical feedforward buffering. The architecture is experimentally demonstrated at 10Gb/s, reporting error-free performance with a power penalty of <2.5dB. Moreover, network simulations for a 256-node system revealed low-latency values of only 605nsec at throughput values reaching 80% when employing 2-packet-size optical buffers, while multi-rack network performance was also investigated.

© 2018 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

The growing diversity of emerging cloud and high-performance computing (HPC) workloads has substantially transformed data center (DC) traffic patterns from north–south to east–west, following at the same time two major trends in compute and storage infrastructure: virtualization and convergence. Currently, DC traffic is dominated by traffic flows that reside within the DC, accounting for almost 75% of the total traffic [1] and calling for architectures optimized in terms of resource utilization and energy efficiency. Traditional server-centric DC architectures that rely on general-purpose server blades, with a fixed set of computing, memory and storage resources, have been reported to exhibit significant heterogeneity in resource usage per machine and workload [2, 3], leaving a considerable number of resources underutilized in large-scale DCs. In this context, resource disaggregation and rack-scale computing have been gaining momentum in DC architectures as a way to increase resource utilization, while at the same time minimizing the energy consumption and technology upgrade costs. However, interconnecting a vast number of compute, memory and storage components in a resource disaggregation paradigm imposes huge pressure on the DC network switching infrastructure due to the highly stringent latency and throughput requirements [4, 5].

The tighter co-integration of optical interconnects with the switch ASIC has contributed to a substantial upgrade of switching capacities up to 8.2 Tb/s [6], while at the same time various switch architectures are being investigated, exploiting different all-optical switching technologies. To this end, optical circuit switching (OCS) based on 3D Micro-Electromechanical Systems (MEMS) can yield the necessary high-port connectivity, scaling to hundreds of ports [7], but its millisecond-scale switching times effectively limit its employment to slowly reconfigurable backplanes [8]. Extensive research efforts towards high-port-count packet-level switching have focused on the combined use of passive Arrayed Waveguide Grating Routers (AWGRs) [9] and tunable lasers or wavelength converters that take advantage of the wavelength routing capabilities of the cyclic Arrayed Waveguide Grating (AWG). In order to scale the I/O port number beyond the number of channels offered by the AWGR technology, switch architectures relying on hybrid schemes have been proposed and realized by combining Delivery-and-Coupling switches with cascaded cyclic AWGRs [10] or AWGs [11, 12]. These schemes require a tunable laser at every input, prior to the Delivery-and-Coupling switch, and employ a buffer-less contention resolution strategy, necessitating a centralized arbitration approach among all inputs towards establishing the required switch state. Another approach utilized a buffer-less Broadcast-and-Select (BS) architecture in conjunction with low-port-count cyclic AWGRs [13], where Semiconductor Optical Amplifier (SOA) ON/OFF gates were employed as the selection mechanism in the BS switch part, again requiring tunable lasers at the BS front-end and centralized control over all input ports.

Arbitrating collisions through a centralized control scheme forms a bottleneck that can seriously degrade latency performance as the port count increases. Recent work has indicated that sub-μs latencies along with high-radix connectivity, as required in disaggregated DCs, can be offered via distributed control over small clusters of inputs [14]. This has been demonstrated in a SOA ON/OFF-based, buffer-less Broadcast-and-Select switch implementation, employing electronic edge buffering and retransmission and achieving sub-μs latencies even for up to 1024x1024 switch designs. However, scalability in Broadcast-and-Select architectures can be challenging due to the high splitting ratio that introduces excessive losses and degrades the Optical Signal-to-Noise Ratio (OSNR) as the switch port count increases. Moreover, the absence of any intra-switch buffering stage and the sole reliance on wavelength routing for contention resolution limit the switch performance to 70% throughput for layouts larger than 64x64 ports.

In this paper, we demonstrate a novel optical switch architecture for disaggregated DCs exploiting Field Programmable Gate Array (FPGA)-based distributed control over a combined BS and AWGR-based wavelength routing scheme that incorporates small-size optical feedforward buffering in order to offer increased throughput while retaining low-latency characteristics. The proposed architecture overcomes the inherent scalability limitations of BS schemes by distributing the switching functions in small and independent switching clusters, termed switch Planes, and by exploiting the multi-wavelength routing capabilities of AWGR devices. Moreover, low-latency forwarding is enabled by arbitrating only the intra-Plane traffic and avoiding time-consuming packet drop and retransmit procedures, while throughput is improved by employing contention resolution stages based on feedforward buffering, building upon the theoretical findings of [15]. Feasibility of the 256-node switch has been experimentally validated for 10Gb/s optical data packets, utilizing a 1:16 optical BS layout and a 2-packet buffer contention resolution stage, followed by a SOA-Mach-Zehnder Interferometer (MZI) tunable wavelength converter (WC) and a 16x16 AWGR wavelength router. Additional SOA-MZI WCs along with fiber delay lines were used for the buffer implementation between the BS and the AWGR stages, allowing for the experimental demonstration of successful routing between contending packets originating from two different input ports. Error-free performance for all different switch input/output combinations has been obtained with a power penalty of <2.5dB. The architecture is evaluated via simulations for a 256-node system and 64-byte-long optical packets at 10Gb/s, with the packet length corresponding to a 64-byte cache memory line, revealing that even a small optical buffer size of 2 optical packets per contention resolution stage yields a 605nsec mean packet latency with >85% throughput for loads up to 100%. Increasing the buffer size to 14 packet slots leads to an almost lossless switch layout with 97% throughput for loads up to 100%, still retaining a sub-μs mean latency of 950nsec. The switch credentials for realistic DC applications have also been investigated in a multi-rack simulation analysis, revealing throughput values up to 90% and a mean latency of ~900ns with 4-packet-slot buffers.

The rest of the paper is organized as follows: Section 2 describes the proposed optical switch architecture, while Section 3 reports on its experimental evaluation. Section 4 presents the throughput and latency performance analysis of the architecture when employed in single and multi-rack configurations. Finally, Section 5 concludes the paper.

2. Switch Architecture

The proposed optical switch architecture, when employed in a 256-node disaggregated DC rack system, is illustrated in Fig. 1. The system is composed of 16 rack-trays, each one incorporating 16 nodes, interconnected via a 256x256 optical Top-of-Rack (ToR) switch. A fiber pair is employed to connect each node to the switch, utilizing a single optical link at a fixed wavelength on the transmitter side of the node, whereas on the receiver side the nodes support the reception of a Wavelength Division Multiplexed (WDM) signal by using an optical demultiplexer. The WDM signal is formed by 16 wavelength channels, i.e., a number of channels equal to the number of nodes per tray. In a disaggregated rack-scale environment, where compute, memory and storage nodes are organized in pools residing on different trays, employment of the proposed switch architecture eliminates the need for additional intra-tray switching stages that would eventually lead to higher latency values for inter-tray communication. Taking advantage of the high port count offered by the proposed switch, every node connects directly to the switch, so that packets experience the same low latency irrespective of whether they are destined to a node on the same tray or on a different tray within the rack.


Fig. 1 Schematic illustration of a 256-node D.C., organized in 16 Rack-Trays with 16 nodes per tray, interconnected via a 256x256 ToR switch.


The switch is organized in 16 Planes followed by 16 AWGR devices. Considering the switch ingress path, input port allocation per Plane is performed so that node#i from every tray connects to Plane#i, meaning that Plane#i input (j,i) designates a connection to node#i of tray#j, with i, j = 1, 2, …, 16. To clarify the input port allocation scheme, the nodes in Fig. 1 are colored in accordance with the Plane they are connected to, i.e., all nodes connecting to Plane#1 are yellow. The Plane's role is to aggregate traffic from 16 input ports and forward it via a BS switching scheme to the proper contention resolution block, which includes optical feed-forward buffers at packet-size granularity and provides contention resolution in the time domain. At this stage, contention resolution is performed on a per-output-tray basis, featuring an independent contention resolution block per output tray and therefore providing the highest throughput under uniform traffic distribution among all output trays.
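To make the ingress port-allocation rule concrete, the following minimal sketch (with illustrative naming that is not part of the authors' implementation) maps node#i of tray#j to input (j,i) of Plane#i:

```python
# Minimal sketch of the switch-ingress port allocation described above.
# Assumptions: 16 trays x 16 nodes, Plane#i aggregates node#i of every tray,
# and its input (j, i) is fed by node#i of tray#j (1-based indices).

N_TRAYS = 16
NODES_PER_TRAY = 16

def ingress_plane(tray: int, node: int) -> tuple[int, int]:
    """Return (plane, plane_input) for node#node of tray#tray."""
    plane = node              # Plane#i serves node#i of every tray
    plane_input = tray        # input (j, i): position j within Plane#i
    return plane, plane_input

# Example: node#3 of tray#7 enters Plane#3 through its 7th input port.
assert ingress_plane(tray=7, node=3) == (3, 7)
```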

The switch architecture utilizes a distributed control scheme, with one FPGA controller per switch Plane, thereby allowing independent header processing and buffering among the switch I/Os that are connected to the same Plane. Synchronous slotted operation is considered at the Plane edge, where the FPGA simultaneously processes all routing requests. Finally, each Plane is connected to the 16 16x16 AWGR devices, each one linking to a specific tray and providing contention resolution in the wavelength domain. The use of an AWGR with a port number equal to the number of trays ensures that no congestion will occur at this part of the switch.

To ease the description of our design, we classify the architecture into 3 functional stages, presented in the detailed layout of the switch in Fig. 2(a). Stage A embodies header processing and signal broadcasting, Stage B is responsible for tray selection and contention resolution, while destination node selection takes place at Stage C. At Stage A, part of the optical signal emerging at every input port is sent to the FPGA for header processing following optoelectronic conversion. The remaining part of the signal is delayed to account for the required header processing time and is subsequently amplified and broadcast via a 1:16 splitter to the next stage.


Fig. 2 (a) Detailed Layout of the 256x256 switch architecture utilizing optical feed-forward buffers with a max size of K packets, (b) Process flow on FPGA controller for incoming packets.


At Stage B, tray selection is realized by activating the appropriate tunable WC that forwards the packet to the respective Tray Contention Resolution (TCR) block, so that the activation of WC#k directs the incoming packet to TCR#k. At the same time, the WC control signal dictates the WC input wavelength value, so that the packet is forwarded to the desired delay line of the TCR block through an AWG demultiplexer located at the WC output. Each TCR block comprises a feed-forward optical buffer with K delay lines, inducing delays ranging from 0 to K-1 packet slots (tp) and providing buffering capability of up to K-1 packets. Each delay line of TCR#i collects optical packets from all 16 WC#i's using a 16:1 optical power combiner. The K delay lines of the same TCR block are then multiplexed via an AWG, with the FPGA controller resolving contention between the packets propagating in the same TCR and ensuring collision avoidance by allowing only a single packet to exit the multiplexer in a given packet slot.
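The joint role of the WC selection and of its input wavelength can be summarized in the short sketch below; the 57.6 ns packet-slot value is taken from the simulation analysis of Section 4, and the function naming is purely illustrative:

```python
# Sketch of the Stage B control decision.
# Activating WC#k steers a packet to TCR#k; the CW wavelength fed to that WC
# selects which of the K delay lines the packet enters via the output AWG.

T_PACKET_NS = 57.6                       # one packet slot (from Section 4)

def stage_b_control(dest_tray: int, buffer_index: int, k_buffers: int):
    """Return (active_wc, delay_ns) for a packet heading to dest_tray.
    buffer_index in [0, K-1] maps to a delay of buffer_index packet slots."""
    assert 0 <= buffer_index < k_buffers
    active_wc = dest_tray                # WC#k forwards to TCR#k
    delay_ns = buffer_index * T_PACKET_NS
    return active_wc, delay_ns

# e.g. the second delay line of TCR#5 holds a packet for one packet slot:
print(stage_b_control(dest_tray=5, buffer_index=1, k_buffers=2))  # (5, 57.6)
```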

At Stage C, routing to the desired node is achieved using tunable WCs and AWGRs. Every input of AWGR#k connects to the output WC#k of a different Plane, i.e., to the WC outputs destined to tray#k, while the AWGR#k outputs are connected to all nodes of tray#k. The node to which the packet will be forwarded is again defined by the FPGA controller, by selecting the input wavelength of the Stage C WCs.
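As an illustration of this wavelength-routing step, the sketch below assumes the standard cyclic-AWGR routing relation, i.e., output = (input + wavelength channel) mod N; the actual channel plan of the employed device is not detailed here, so this mapping should be read as an assumption:

```python
# Assumed cyclic-AWGR routing rule (the device's exact channel plan is not
# specified in the text): a packet injected at input p on wavelength channel w
# of an N x N cyclic AWGR exits at output (p + w) mod N.

def awgr_output(inp: int, wavelength_ch: int, n: int = 16) -> int:
    return (inp + wavelength_ch) % n

def wavelength_for(inp: int, out: int, n: int = 16) -> int:
    """Wavelength channel the Stage C WC must produce so that a packet
    entering AWGR input `inp` reaches output `out`."""
    return (out - inp) % n

assert awgr_output(inp=1, wavelength_ch=wavelength_for(1, 8)) == 8
```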

The operations performed during the processing of a routing request by the FPGA controller are illustrated in the flowchart of Fig. 2(b). A routing table lookup is carried out for every incoming packet in order to determine the TCR block to which it should be forwarded. After determining the respective TCR for all packets on the Plane, the scheduling algorithm defines the priority according to which the packets will be forwarded to the respective buffers of the TCR. This procedure takes place in parallel for every TCR block with contending packets. A 2-step arbitration scheme has been implemented for the prototype evaluation of the proposed architecture, where the scheduler first serves contending packets according to the priority assigned to each one via a specific field in its header. During the second arbitration step, packets with the same priority are served according to a round-robin policy. The above scheme enables quality-of-service features by employing the packet-designated priority scheduler, while at the same time preventing starvation of the different flows via a basic-fairness round-robin policy. Subsequently, for every packet the controller examines the current state of the switch to check for available buffer delay lines in that specific TCR block. If at least one buffer line is free, the first available buffer is marked as occupied, the switch state is updated accordingly and an activation signal is generated towards the appropriate WC, triggering the conversion to the wavelength that corresponds to this buffer. If no buffers are available, the packet is dropped at the switch, simply by not activating any WC.
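A compact software model of this two-step arbitration and buffer-assignment procedure is sketched below; the data structures and names are illustrative and do not reflect the actual FPGA logic:

```python
# Illustrative model of the per-Plane scheduling step of Fig. 2(b):
# step 1 sorts contending packets by the priority field carried in the header,
# step 2 breaks ties round-robin over input ports; each packet then takes the
# first free delay line of its TCR block or is dropped if none is available.

def schedule_slot(requests, free_buffers, rr_pointer):
    """requests: list of (input_port, dest_tray, priority)
    free_buffers: dict dest_tray -> sorted list of free delay-line indices
    rr_pointer: round-robin starting port for this slot.
    Returns a list of (input_port, dest_tray, buffer_index or None=drop)."""
    def rr_rank(port, n_ports=16):
        # distance of the input port from the round-robin pointer
        return (port - rr_pointer) % n_ports

    decisions = []
    ordered = sorted(requests, key=lambda r: (-r[2], rr_rank(r[0])))
    for port, tray, _prio in ordered:
        free = free_buffers.get(tray, [])
        if free:
            buf = free.pop(0)                        # first available delay line
            decisions.append((port, tray, buf))
        else:
            decisions.append((port, tray, None))     # drop: no WC activated
    return decisions

# Two packets contend for tray#2 with equal priority; a 2-line buffer serves both.
print(schedule_slot([(1, 2, 0), (5, 2, 0)], {2: [0, 1]}, rr_pointer=4))
```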

3. Proof-of-Concept Experimental Evaluation

To demonstrate the working principle of the proposed architecture and at the same time assess the feasibility of the 256x256 switch, a fully functional switch Plane has been experimentally evaluated, comprising 2 input ports and 1 tray contention resolution block with 2 optical packet buffers, interconnected to a 16x16 AWGR. Figure 3 illustrates the corresponding experimental setup. A Stratix V FPGA has been used for generating the 2 incoming packet streams, employing small form-factor pluggable (SFP) and 10 Gigabit small form-factor pluggable (XFP) modules. The same FPGA was utilized in parallel as the switch controller described in Section 2. A commercially available AWGR device, fabricated by Semicon, was used for the wavelength routing stage, while the wavelength converters were realized with CIP SOA-MZI devices.


Fig. 3 Experimental setup of a complete switch Plane comprising 2 input ports and 1 tray contention resolution block with 2 optical packet buffers, interconnected to a 16x16 AWGR


The two input streams comprise three 405-bit-long 10.3125Gb/s NRZ data packets and two dummy packets, used for receiver synchronization purposes, with an inter-packet guardband of 35 bits. For each stream, a 70/30 splitter is used to feed part of the input signal to the FPGA for header processing, while the Broadcast-and-Select stage of the full 256x256 switch is emulated via a 1:16 splitter. The latency introduced by the FPGA during the header processing operations was measured to be 456ns, with 444ns originating from the FPGA transceiver circuit functions and 12ns from the processing functions. To ensure that header processing is successfully completed before forwarding the data, a fiber delay line inducing 456ns of delay was used to delay the optical signal, whereas an Erbium-Doped Fiber Amplifier (EDFA) was used to compensate for the splitting losses. The SFP data stream (Input #1) entering Stage B splits into two identical signals that arrive at ports D and E of Input #1's WC#1, serving as control signals for SOA-MZI-1. The same procedure is followed for the XFP data stream (Input #2). At the same stage, three continuous-wave (CW) laser beams tuned at 1556.7nm, 1559.1nm and 1559.7nm were modulated to produce 425-bit-long 4.5 MHz envelopes. All three envelopes are subsequently multiplexed in an AWG and fed as the SOA-MZI-1 input signal into port G. The same procedure is followed for Input #2's WC#1. The output signals from ports C and B of SOA-MZI-1 and SOA-MZI-2, respectively, are demultiplexed in separate AWGs at the Tray#1 Contention Resolution block. The demultiplexed signals are combined into the appropriate delay line of the TCR block, according to each signal's wavelength, while 16:1 combiners were used to emulate the full 256x256 switch losses.

The signal entering Stage C, after being amplified and filtered in a 5nm optical band-pass filter (OBPF), splits into two identical signals that arrive at ports A and H of the Output #1 WC, serving as control signals for SOA-MZI-3. An optical beam produced by a Tunable Laser Source (TLS) is modulated to produce 415-bit-long, 4.5 MHz packet envelopes that are subsequently fed into port C, serving as the input signal of SOA-MZI-3. The wavelength to which the TLS is tuned at each packet slot defines the AWGR output port to which the packet will be forwarded. Finally, the SOA-MZI-3 output is injected into Input #1 of a 16x16 AWGR device. The signal at the AWGR outputs is recorded on a digital sampling oscilloscope and evaluated with a Bit-Error-Rate (BER) tester. All WCs relied on the differentially-biased SOA-MZI configuration [16], while the optical envelope signals at all three WCs were modulated utilizing LiNbO3 modulators driven by the FPGA.

Figures 4(a)-4(n) illustrate the experimental results obtained when two data streams are simultaneously injected into the switch and all incoming packets need to be forwarded to the same output tray. The SFP input data stream (Input#1), consisting of packets #A, #B and #C, is illustrated in Fig. 4(a), while the XFP stream (Input #2) with packets #D, #E, #F is illustrated in Fig. 4(b). Figures 4(e)-4(i) illustrate the respective traces collected, using Altera's SignalTap application, throughout the FPGA forwarding operation of packets #A, #B, #C, #D and #E. The FPGA traces correspond to the beginning of every packet; packets are processed and forwarded by the FPGA in pairs, i.e., packets #A and #D, then packets #B and #E, while after the comparison between packets #C and #F only #C is forwarded and the respective trace is presented. Considering, for example, the case of packets #A and #D, during the first FPGA timeslot (~4ns) a comparison is performed between the packet headers and the port-ID, leading to the generation of a positive comparison result (logic-level ‘1’). Regarding packet #A, the enable_direct1 signal, driving the optical modulator of the control signal of Input#1's WC#1, is asserted after 3 timeslots, dictating that the packet will be forwarded to the “direct” TCR buffer. It should be noted that packet #A is routed to the “direct” buffer since the switch buffers are considered empty at the beginning of the evaluation. Processing of packet #D's header, as depicted in Fig. 4(f), results in assertion of the enable_delay1t2 signal, forcing packet #D to enter the “tp” TCR buffer. For our experimental evaluation, a static QoS priority scheme was followed during packet generation, where packets from Input#1 were assigned a higher priority value than packets from Input#2. As a result of this scheme, packet #F is dropped, while the order of the incoming packets contending for the same output is changed to #A, #D, #B, #E, #C at the switch output. The first routing scenario evaluated corresponded to the case where all packets are destined to the same destination node, attached to AWGR port #8 or #9. The resulting output traces at ports #8 and #9, along with the corresponding optical spectra, when sequentially tuning the TLS of Stage C to λ8 = 1552nm and to λ9 = 1552.8nm for a duration equal to that of the entire packet stream, are depicted in Figs. 4(c) and 4(d). In the second routing scenario, packets #A, #D, #B, #E, #C are independently forwarded to AWGR outputs 7-11 by sequentially tuning the TLS to wavelengths λ7, λ8, λ9, λ10, λ11, each for a duration of 1 packet slot, revealing proper routing and buffering on a packet-level basis. The respective oscilloscope traces and eye diagrams of each packet are depicted in Figs. 4(j)-4(n).


Fig. 4 (a) Incoming SFP and (b) XFP data stream traces and eyes, (c)-(d) Output oscilloscope traces and spectra when routing all packets to output channels 8 & 9 of the AWGR, (e)-(i) FPGA traces during the header processing/scheduling operations of packets #A, #D, #B, #E and #C, (j)-(n) Output oscilloscope traces and eyes when routing each packet to a different output of the AWGR.


To quantify the signal quality degradation, BER measurements were performed individually for each packet when routed to AWGR output port #8. The internal oscillators of the FPGA and the BER tester were phase-locked by means of a common 10 GHz sinusoidal signal generated by a signal generator. Figure 5(a) illustrates the BER curves obtained, revealing an average power penalty of 2dB against back-to-back (BtB) measurements of the input SFP and XFP data streams. An extensive analysis of the signal degradation was also performed, through BER measurements, for all 48 possible combinations of input and control wavelengths for SOA-MZI 3. The control wavelength has 3 possible values, based on the delay path the optical packet traversed prior to arriving at SOA-MZI 3, while the input wavelength has 16 possible values, corresponding to the possible output channels of the AWGR. The system was initially optimized for operation using the combination of input and control signals that corresponds to Buffer 1 and AWGR port#6, and subsequently BER measurements were obtained for all the other combinations without altering the SOA-MZI 3 driving conditions. Figure 5(b) depicts the measured power penalty, at a 10−9 bit error rate, for all 48 possible wavelength combinations, yielding a mean value of 2.01dB with a standard deviation of 0.218dB. The minimum and maximum BER power penalty values were 1.5dB and 2.44dB, respectively, while lower power penalties were obtained for packets traversing the first buffer, as they correspond to the initially optimized combination.


Fig. 5 (a) BER measurements for packets routed to AWGR output port #8, (b) Power penalty deviation at 10−9 error rate, for all possible combinations of buffering at the TCR block and subsequent AWGR output channel selection at the TLS.


Finally, regarding the power consumption of our switch architecture, a value of 4.516 W per port is estimated for an N² × N² implementation with K buffers, assuming 0.6 W for every SOA, 1 W for every EDFA, 1 W for every FPGA and 0.062 W for every laser. Considering 40 Gb/s operation, which has been shown to be supported by the utilized WC biasing scheme [16], an energy efficiency of 112.9pJ/bit is achieved. This low power consumption value, along with the expected advances in silicon photonics fabrication technologies [17], could allow an integrated low-cost, low-power implementation of the proposed switch architecture.
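As a quick consistency check of the reported figures (per-port power and line rate taken from the text; the individual component counts are not re-derived here):

```python
# Quick check of the reported energy-efficiency figure: 4.516 W per port,
# assuming every port sustains the 40 Gb/s operation discussed above.

P_PORT_W = 4.516          # reported per-port power consumption
LINE_RATE_BPS = 40e9      # 40 Gb/s operation supported by the WC biasing scheme

energy_per_bit = P_PORT_W / LINE_RATE_BPS
print(f"{energy_per_bit * 1e12:.1f} pJ/bit")   # -> 112.9 pJ/bit

# Total power of the full 256-port switch under the same per-port figure:
print(f"{256 * P_PORT_W:.0f} W")               # -> 1156 W
```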

4. System-level Performance analysis

The system-level performance of the switch architecture in terms of throughput and packet end-to-end latency has been investigated using the OMNeT++ discrete event simulation platform. The primary network functions were implemented by deploying the Node and the Switch models, with the Node model being responsible for data traffic generation and sinking. A synchronous slotted network operation is assumed and the packets are generated at predefined packet slots, each one lasting 57.6ns. The destination node for every generated packet is selected from among all the remaining nodes, following a uniform random distribution. The Switch model emulates the operations of the FPGA controller, as illustrated in the flowchart of Fig. 2(b), along with the underlying data forwarding plane described by means of Fig. 2(a). Both the Node and the Switch models collect individual simulation statistics, in conjunction with the global statistics collection that is performed via an auxiliary model developed for this purpose. In order to offer a thorough evaluation of the architecture's latency performance, both the mean packet delay and the 90th-percentile (p90) delay metrics were collected in the simulations. The switch port allocation to the network nodes was performed according to the procedure described in Section 2 and illustrated in the layout of Fig. 1.

In our first analysis, we assumed a single-rack DC system incorporating 256 computing nodes (servers), featuring 10Gbps channel bandwidth for every optical link between the nodes and the switch. Computing nodes were modelled to generate fixed-length optical packets with a packet size of 72 bytes, with 8 bytes covering header, synchronization and guardband requirements and 64 bytes forming the data payload, matching in this way the size of a typical single cache-line transfer. The FPGA processing latency was set equal to the experimentally measured value of 456ns, with the biggest part originating from the Physical Coding Sublayer (PCS) / Physical Medium Attachment (PMA) and Serializer/Deserializer (SerDes) functions. The propagation latency owing to the time of flight through the various optical components of the switch (fibers, amplifiers, AWGRs), excluding the buffer delay lines, was set to 35ns.
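The following worked check, under the stated parameters, reproduces the 57.6 ns packet slot and the ~550 ns zero-buffering latency floor of Fig. 6(b); the decomposition into FPGA, propagation, serialization and buffering terms is our reading of the model, not the authors' simulation code:

```python
# Worked check of the simulation timing parameters (values from the text).

PACKET_BYTES = 72                 # 8 B header/sync/guardband + 64 B payload
LINE_RATE_BPS = 10e9

slot_ns = PACKET_BYTES * 8 / LINE_RATE_BPS * 1e9
print(slot_ns)                    # 57.6 ns packet slot

FPGA_NS = 456                     # measured header-processing latency
PROP_NS = 35                      # time of flight through the optical path

def latency_ns(buffer_slots_traversed: int) -> float:
    """Latency of a packet that waits `buffer_slots_traversed` packet slots
    in a TCR delay line (one packet slot of serialization included)."""
    return FPGA_NS + PROP_NS + slot_ns + buffer_slots_traversed * slot_ns

print(latency_ns(0))              # ~548.6 ns, matching the ~550 ns floor
print(latency_ns(1))              # each extra delay line adds one 57.6 ns step
```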

The performance of the switch architecture was evaluated as a function of the available buffer delay lines per TCR block. Figure 6(a) presents the throughput versus offered load results for different numbers of buffers per TCR, ranging from 0 to 14 packet slots. As expected, throughput increases with increasing buffer size, approaching 97% at 100% offered load when the buffer size equals 14 packet slots. It is important to note that throughput increases almost linearly up to 70% load even for the experimentally employed buffer size of only 2 packet slots, reaching a maximum throughput of ~85%. Figure 6(b) presents the mean packet delay versus offered load, showing that latency ranges between 550 and 610 ns for buffer sizes between 0 and 2 packet slots and for loads up to 100%. Mean latency increases as the buffer size increases, reaching a maximum value of 950nsec for a buffer size of 14 packet slots, where throughput also reaches its maximum value. Figure 6(c) presents the p90 delay versus offered load, revealing a p90 value of 600ns for loads lower than 50%, where almost no collisions are introduced, while reaching 1.24μs at maximum load with 14 buffers per TCR. As can be observed, the p90 delay exhibits step-wise jumps as contention occurs, since packets are forwarded to longer TCR buffers that introduce delays at packet-duration granularity. As an example, for the case of 14 buffers, the increased contention encountered at 50% load enforces the use of an extra TCR buffer, which effectively translates to a p90 latency jump from 600ns at 40% load to ~650ns. At this point, it should be noted that the latency introduced by the FPGA transceiver circuit, which includes the SerDes, word alignment and synchronization functions and accounts for the biggest part of the overall latency, depends mainly on the specific FPGA model and the respective PHY IP block offered by the FPGA manufacturer. Custom transceiver designs can be stripped down to include only the necessary blocks and have been reported to offer latency values of just 57.9ns [18]. Finally, the latency does not become unbounded beyond the saturation point in the graph, since no packet retransmission mechanism has been employed in the simulated network layout.


Fig. 6 Simulation results for different number of buffers per TCR (a) Single-rack throughput, (b) Single-rack mean latency, (c) Single-rack p90 latency, (d) schematic of the 4-rack system, (e) 4-rack throughput, (f) 4-rack mean latency, (g) 4-rack p90 latency


In order to retain the applicability perspective for disaggregated DCs while also evaluating the performance of the proposed architecture in a larger-scale environment, a multi-rack system has been modelled where packets may experience multi-hop delays before reaching their destination. Figure 6(d) illustrates the respective layout, where 4 racks incorporating 768 nodes in total are interconnected via 4 ToR 256x256 switches along with 1 aggregation 256x256 switch in a 2-layer switching hierarchy.

Switch port allocation in the 256-port ToR switches was performed in accordance with an oversubscription ratio of 3:1, resulting in 192 intra-rack and 64 inter-rack links. The traffic pattern was modified accordingly, so that 75% of the traffic generated by every node was uniformly distributed to nodes of the same rack, while the remaining 25% was uniformly distributed to nodes of the other 3 racks. Every rack comprised 192 nodes, evenly distributed in 12 trays, with every tray incorporating 16 nodes. Link allocation between the intra-rack nodes and the ToR switch input ports was performed in accordance with the procedure described in detail in Section 2. Moreover, the inter-rack links originating from the aggregation switch were evenly distributed over every ToR switch's Planes. Using the aforementioned policy, every ToR switch Plane aggregates input traffic from 12 intra-rack nodes and 4 inter-rack links, while at the same time being able to forward traffic directly to every intra-rack node and to 4 inter-rack links.
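The traffic pattern described above can be summarized by the following illustrative destination-selection sketch (naming and structure are ours, not the authors' OMNeT++ model):

```python
import random

# Illustrative generator of the 4-rack traffic pattern: 75% of a node's packets
# go to a uniformly chosen node of its own rack, 25% to a uniformly chosen node
# of one of the other three racks (192 nodes per rack, 768 nodes in total).

NODES_PER_RACK = 192
N_RACKS = 4

def pick_destination(src_rack: int, src_node: int) -> tuple[int, int]:
    if random.random() < 0.75:
        rack = src_rack                                   # intra-rack traffic
    else:
        rack = random.choice([r for r in range(N_RACKS) if r != src_rack])
    while True:
        node = random.randrange(NODES_PER_RACK)
        if (rack, node) != (src_rack, src_node):          # never self-address
            return rack, node

random.seed(1)
print(pick_destination(src_rack=0, src_node=10))
```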

Input port allocation at the aggregation switch was performed so that the outgoing inter-rack links from Plane#i of every ToR switch are connected to the input ports of the aggregation switch's Plane#i. As an example, Plane#1 of the aggregation switch aggregates traffic from Plane#1 of Rack#1's switch, Rack#2's switch, Rack#3's switch and Rack#4's switch. The proposed scheme ensures minimum contention between traffic flows on each Plane of the aggregation switch, since contention between inter-rack links from the same rack has already been resolved in the respective ToR switch. In our analysis, we considered that the multi-wavelength signal exiting the ToR switch Planes is demultiplexed before being injected into the aggregation switch input ports.

Figure 6(e) illustrates the respective throughput results for different numbers of buffers per TCR, ranging from 0 to 4 packet slots. Despite the fact that the throughput of the 4-rack system is slightly decreased compared to the single-rack system, due to the extra packet drops encountered in the intermediate switching layer, a maximum throughput of 90% is obtained with 4 buffers per TCR. The mean packet delay, illustrated in Fig. 6(f), is increased compared to the single-rack system, reaching a maximum value of ~900 ns with 4 buffers. Finally, the p90 delay, illustrated in Fig. 6(g), is also increased due to the extra switching layer introduced, reaching a maximum value of 1.75 μs with 4 buffers.

5. Conclusions

The transition from traditional server-centric DC architectures towards disaggregated compute, memory and network systems imposes strict requirements on the switching infrastructure, which has to ensure high-port connectivity along with reduced latency and increased throughput. This article presents a high-radix optical switch architecture that exploits limited optical feedforward buffering for contention resolution within a hybrid Broadcast-and-Select and AWGR-based wavelength-routed switching scheme. The use of independent BS switch Planes prior to the wavelength-routed stage enables a distributed control scheme that achieves sub-μs latency values, while overcoming the scalability challenges of high-port switches relying exclusively on BS architectures. The feasibility of the proposed switch architecture is experimentally demonstrated with 10Gb/s optical data packets entering through different switch input ports and contending for the same AWGR input, revealing error-free performance with a power penalty of <2.5dB for all possible wavelength combinations. The use of optical feedforward buffers provides a throughput of 85%, even with just 2 buffers per contention resolution stage, while the latency remains below 605ns, as validated through our simulation analysis. Finally, the multi-rack simulation analysis revealed throughput values up to 90% with 4 buffers per TCR, while the maximum mean delay reached ~900ns, validating the switch credentials for realistic DC applications.

Funding

Horizon 2020 Framework Programme (688172, 688544, 687632).

References and links

1. Cisco, “Cisco Global Cloud Index: Forecast and Methodology, 2015–2020” (2016), retrieved http://www.cisco.com/c/dam/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.pdf.

2. S. Di, D. Kondo, and F. Cappello, “Characterizing Cloud Applications on a Google Data Center,” in 2013 42nd International Conference on Parallel Processing (2013), pp. 468–473. [CrossRef]  

3. C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch, “Heterogeneity and dynamicity of clouds at scale: Google trace analysis,” in Proceedings of the Third ACM Symposium on Cloud Computing (ACM, 2012), pp. 1–13. [CrossRef]  

4. S. Han, N. Egi, A. Panda, S. Ratnasamy, G. Shi, and S. Shenker, “Network support for resource disaggregation in next-generation datacenters,” in Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks (ACM, 2013), pp. 1–7. [CrossRef]  

5. K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodoropoulos, I. Koutsopoulos, K. Hasharoni, D. Raho, C. Pinto, F. Espina, S. Lopez-Buedo, Q. Chen, M. Nemirovsky, D. Roca, H. Klos, and T. Berends, “Rack-scale disaggregated cloud data centers: The dReDBox project vision,” in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2016), pp. 690–695.

6. A. V. Krishnamoorthy, H. D. Thacker, O. Torudbakken, S. Müller, A. Srinivasan, P. J. Decker, H. Opheim, J. E. Cunningham, I. Shubin, X. Zheng, M. Dignum, K. Raj, E. Rongved, and R. Penumatcha, “From Chip to Cloud: Optical Interconnects in Engineered Systems,” J. Lightwave Technol. 35(15), 3103–3115 (2017). [CrossRef]  

7. Polatis, “SERIES 7000 - 384x384 port Software-Defined Optical Circuit Switch” (2016), retrieved http://www.polatis.com/series-7000-384x384-port-software-controlled-optical-circuit-switch-sdn-enabled.asp.

8. Q. Chen, V. Mishra, N. Parsons, and G. S. Zervas, “Hardware Programmable Network Function Service Chain on Optical Rack-Scale Data Centers,” in Optical Fiber Communication Conference, OSA Technical Digest (online) (Optical Society of America, 2017), paper Th2A.35. [CrossRef]  

9. R. Proietti, Y. Yin, R. Yu, C. J. Nitta, V. Akella, C. Mineo, and S. J. B. Yoo, “Scalable Optical Interconnect Architecture Using AWGR-Based TONAK LION Switch With Limited Number of Wavelengths,” J. Lightwave Technol. 31(24), 4087–4097 (2013). [CrossRef]  

10. K. i. Sato, H. Hasegawa, T. Niwa, and T. Watanabe, “A large-scale wavelength routing optical switch for data center networks,” IEEE Commun. Mag. 51(9), 46–52 (2013). [CrossRef]  

11. K. Ueda, Y. Mori, H. Hasegawa, K. i. Sato, and T. Watanabe, “Large-Scale and Simple-Configuration Optical Switch Enabled by Asymmetric-Port-Count Subswitches,” IEEE Photonics J. 8(2), 1–10 (2016). [CrossRef]  

12. K. Ueda, Y. Mori, H. Hasegawa, and K. i. Sato, “Large-Scale Optical Switch Utilizing Multistage Cyclic Arrayed-Waveguide Gratings for Intra-Datacenter Interconnection,” IEEE Photonics J. 9(1), 1–12 (2017). [CrossRef]  

13. Y.-K. Yeo, Z. Xu, D. Wang, J. Liu, Y. Wang, and T.-H. Cheng, “High-speed optical switch fabrics with large port count,” Opt. Express 17(13), 10990–10997 (2009). [CrossRef]   [PubMed]  

14. S. D. Lucente, N. Calabretta, J. A. C. Resing, and H. J. S. Dorren, “Scaling low-latency optical packet switches to a thousand ports,” J. Opt. Commun. Netw. 4(9), A17–A28 (2012). [CrossRef]  

15. W. D. Zhong and R. S. Tucker, “Wavelength routing-based photonic packet buffers and their applications in photonic packet switching systems,” J. Lightwave Technol. 16(10), 1737–1745 (1998). [CrossRef]

16. M. Spyropoulou, N. Pleros, K. Vyrsokinos, D. Apostolopoulos, M. Bougioukos, D. Petrantonakis, A. Miliou, and H. Avramopoulos, “40 Gb/s NRZ Wavelength Conversion Using a Differentially-Biased SOA-MZI: Theory and Experiment,” J. Lightwave Technol. 29(10), 1489–1499 (2011). [CrossRef]  

17. JEPPIX, “The road to a multi-billion Euro market in Integrated Photonics” (2015), retrieved https://phi.ele.tue.nl/jpx/JePPIXRoadmap2015.pdf.

18. B. Deng, M. He, J. Chen, D. Gong, D. Guo, S. Hou, X. Li, F. Liang, C. Liu, G. Liu, P. K. Teng, A. C. Xiang, T. Xu, Y. Yang, J. Ye, X. Zhao, and T. Liu, “Component Prototypes Towards a Low-Latency, Small-Form-Factor Optical Link for the ATLAS Liquid Argon Calorimeter Phase-I Trigger Upgrade,” IEEE Trans. Nucl. Sci. 62(1), 250–256 (2015). [CrossRef]  
