
Out-of-core diffraction algorithm using multiple SSDs for ultra-high-resolution hologram generation

Open Access

Abstract

The diffraction calculation is critical in computer-generated holography (CGH). However, it becomes a performance bottleneck when generating ultra-high-resolution holograms due to limited physical memory space. We propose a novel out-of-core (OOC) diffraction algorithm that utilizes multiple solid-state drives (SSDs) to address this issue. Our method employs the implicit diffraction approach and exploits its even-odd separation characteristic to utilize multiple SSDs optimally. We implement our algorithm on two machines, each with four SSDs, and compare it with prior OOC diffraction methods and a RAID-based solution. Our approach achieves up to 2.43 times higher performance than prior OOC methods for large-scale diffraction calculations, with continued performance improvement observed by adding more SSDs. Additionally, our method reduces the generation time for ultra-high-resolution holograms (200K × 200K) by 38% compared to the prior OOC method with multiple SSDs. These results demonstrate the effectiveness of our algorithm for extreme-scale CGH.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Diffraction calculation is fundamental in various applications, including computer-generated holography (CGH) and other optical simulations [1]. For instance, the CGH process employs diffraction calculations (e.g., Fresnel diffraction or the angular spectrum method [1]) to compute the propagation of light from objects to the hologram plane. Diffraction calculation is also one of the most time-consuming steps in many applications, and various acceleration techniques have been proposed [2–7]. Generally, diffraction calculation can be accelerated by fast Fourier transform (FFT)-based convolution [2]. Consequently, an efficient FFT algorithm also improves diffraction calculation. There have been numerous attempts to accelerate FFT computation, including minimizing the number of arithmetic operations [8] and employing parallel computing resources [9–13]. These acceleration algorithms for diffraction and FFT computation can significantly decrease the diffraction calculation time.

However, most prior approaches fail to show a benefit for extreme-scale problems (e.g., ultra-high-resolution hologram generation) because they assume that the system memory is large enough, which may not hold for large-scale problems. For example, a holographic display needs an ultra-high resolution to achieve a wide angle-of-view (AoV), since a smaller pixel pitch yields a broader AoV [14]. The resolution of a 15.6 $\times$ 15.6 cm hologram with 40 degrees of AoV is $200K \times 200K$, and its data size is about 596 gigabytes (GB). Because the memory space needed during the diffraction process is about two or three times the data size, the computation actually requires hundreds of GB to several terabytes (TB). Furthermore, the memory overhead increases when zero-padding is used, as in FFT-based diffraction methods. However, recent commodity computing systems (e.g., PCs or workstations) have 16-256 GB of system memory. Therefore, the required space for ultra-high-resolution holograms is much larger than the system memory size.

Bowman and Roberts [15] proposed a dealiased convolution algorithm that does not require zero-padding, reducing the memory needed for the diffraction process. Shimobaba et al. [7] employed an implicit convolution method that likewise avoids zero-padding to reduce memory usage for diffraction calculation. Also, Kang et al. [13] proposed an in-place algorithm that performs the 2D-FFT on the original data without extra memory space for computation. Although these works significantly reduce the memory usage for diffraction calculation, it is still hard to handle extreme-scale problems where the data size itself is much larger than the system memory.

One simple solution is to rely on the virtual memory system of the operating system. The virtual memory system makes it possible to run a program that requires more space than the physical memory size [17]. When the required memory space exceeds the available system memory, parts of the data are physically kept on the swap device. The swap device is part of a disk such as a hard disk drive (HDD) or solid-state drive (SSD). During the computation, the operating system moves data between memory and the swap device according to data access requests. However, accessing a swap device is much slower (e.g., hundreds of times) than accessing physical memory. Therefore, processing performance drops sharply once the required memory size exceeds the physical memory space. A solution for minimizing this performance degradation is pre-fetching the required data, but the operating system cannot know the exact data access sequence. Unlike the operating system, we know, or can predict, the data access pattern of our algorithm. This means we can optimize the performance of the target algorithm by managing the data transfer between the swap device and memory manually instead of depending on the virtual memory system. Such an approach is called out-of-core processing [12,18,19].

Out-of-core diffraction computation algorithms have also been proposed [3–6,20], and we can divide them into three categories. The first category includes shifted diffraction methods that propagate shifted wave planes along the x- and y-coordinates [5,6]. The main idea of the shifted diffraction methods is to divide the planes into sub-planes (i.e., tiles) and propagate each sub-plane of the source plane to every sub-plane of the destination plane. With this approach, only one sub-plane at a time needs to be loaded, not a whole wave plane, during the propagation of a sub-plane. Therefore, the required memory size can be reduced to the data size of two sub-planes, and we can perform diffraction calculations with limited memory space. However, it can be significantly slower than a single propagation of the entire plane since the computational cost grows with the square of the number of sub-planes [16]. Note that this tiling approach propagates each sub-plane of the source plane to all sub-planes of the destination plane.

Instead of dividing the wave planes as in shifted diffraction, we can decompose the FFT computation itself. This is the second category, the FFT decomposition method [3,4]. Similar to the shifted-diffraction method, it also divides the data into sub-blocks. However, from the perspective of the diffraction calculation, it behaves like a single propagation of the entire plane. The time complexity of this method is linearly proportional to the number of sub-blocks, so its computational cost is much lower than that of the shifted-diffraction method.

The last category of out-of-core diffraction calculation algorithms is strip-based processing. Blinder and Shimobaba [16] decompose the hologram into strips (e.g., single lines of the hologram) along one dimension and express the 2D-FFT as a set of 1D-FFTs with multiple transposition operations. Since a strip resides in a contiguous memory (or disk) region, unlike a tile (i.e., a 2D block), data access for a strip is more efficient than for a tile. Therefore, they can minimize read and write operations during the diffraction calculation. Blinder and Shimobaba [16] used an SSD for out-of-core processing combined with strip-based processing. As a result, they achieved significant speedup (e.g., 500-fold) over the rectangular tile-based diffraction method. However, we found that disk access time is still a bottleneck of the out-of-core diffraction calculation. This disk access bottleneck makes it hard to take advantage of the high computing capability of a GPU. Also, these prior out-of-core approaches still require zero-padding, which demands considerable additional memory space and increases the computational cost.

Contributions: In this paper, we introduce a novel out-of-core diffraction calculation algorithm that utilizes multiple SSDs to mitigate the performance degradation caused by disk access latency (Sec. 3). To fully harness the capabilities of multiple SSDs, our method leverages the inherent even and odd separation characteristics of the implicit diffraction approach. We decompose the diffraction operation into three distinct phases (Sec. 3.2) and design dedicated out-of-core algorithms for each phase (Sec. 3.4–3.6). This enables us to effectively utilize the available SSDs and optimize the distribution of the I/O workload.

We implemented our method on two different computing environments with multiple SSDs and compared it with prior out-of-core diffraction approaches (i.e., Blinder and Shimobaba [16]) and a RAID (Redundant Array of Independent Disks)-based solution (Sec. 4). Our method achieves up to 2.43 times higher performance than prior approaches using multiple SSDs for large-scale diffraction computation. Moreover, we observed that our approach continues to improve performance with up to four SSDs, unlike prior or naive approaches that show degraded performance with more SSDs in some cases. To validate the effectiveness of our approach in actual CGH processes, we applied our method to generate a layer-based hologram with a resolution of $200K \times 200K$ (Sec. 5). Our approach reduced the generation time by about 38% compared to the prior method. These results demonstrate the usefulness of our approach for generating ultra-high-resolution holograms.

2. Preliminaries

This section provides the preliminaries for understanding the proposed algorithm, including the implicit convolution method and properties of SSD related to our approach.

2.1 Implicit convolution

When FFT-based convolution is applied to non-periodic data, aliasing must be removed to avoid cyclic (wrap-around) output. A typical approach is wrapping the input data with extra pads of zeros, called zero-padding. However, zero-padding not only requires extra memory space (e.g., two or three times the input data) but also increases the computational cost. The implicit convolution method [15], which uses implicit padding, mitigates the spatial and computational overhead of the conventional (or explicit) convolution method. The conventional convolution method first pads zeros around the input sequence. Then, it computes $u_2(x_2, y_2)= \mathscr {F}^{-1}\left [ \mathscr {F}\left [ u_1(x_1,y_1) \right ]H(f_x, f_y) \right ]$, where $u_1$ and $u_2$ are the input and output sequences, $\mathscr {F}$ and $\mathscr {F}^{-1}$ are the forward and inverse Fourier transforms, and $H$ is the transfer function of the convolution system [2]. Finally, it extracts the target sequence from the result.

For a given sequence of $N$ complex numbers $\{u[j]\}:= u[0], u[1],\ldots, u[N-1]$, the result of the discrete Fourier transform (DFT) with zero-padding, $\{U[m]\}:=U[0], U[1],\ldots, U[2N-1]$, is defined by Eq. (1), where the twiddle factor $\omega ^\alpha _\beta =\exp (-2\pi i\frac {\alpha }{\beta })$ and $i = \sqrt {-1}$.

$$U[m]=\sum_{j=0}^{2N-1}u[j] \omega_{2N}^{jm}=\sum_{j=0}^{N-1}u[j] \omega_{2N}^{jm}$$

Since the values in the zero-padded region are zero, we can avoid looping over it. Based on this property, implicit convolution splits Eq. (1) into even and odd terms (Eq. (2)), defined in Eq. (3) and Eq. (4).

$$U[m]=\begin{cases} U_{even}[\lfloor \frac{m}{2}\rfloor] & \text{if } m \text{ is even}\\ U_{odd}[\lfloor \frac{m}{2}\rfloor] & \text{if } m \text{ is odd} \end{cases},~m = \{0, 1,\ldots, 2N-1\}$$
$$U_{even}[k]=\sum_{j=0}^{N-1}u[j] \omega_{2N}^{j(2k)}= \sum_{j=0}^{N-1}u[j] \exp{({-}2\pi i \frac{j(2k)}{2N})}= \sum_{j=0}^{N-1}u[j] \omega_{N}^{jk}$$
$$U_{odd}[k]=\sum_{j=0}^{N-1}u[j] \omega_{2N}^{j(2k+1)}= \sum_{j=0}^{N-1}u[j] \exp({-}2\pi i\frac{jk}{N}) \exp(-\pi i\frac{j}{N})= \sum_{j=0}^{N-1}u[j] \omega_{N}^{jk}\omega_{2N}^{j}$$

As a result, we can perform the FFT without padding while avoiding aliases. We can likewise split the transfer function of the convolution system (i.e., $H$) into even and odd indices. Consequently, we can perform the convolution without zero-padding while separating the even and odd terms, as in Eq. (5).

$$(u \ast h)[j]= \sum_{k=0}^{N-1}U_{even}[k] H(2k) \omega^{{-}jk}_{N} + \omega^{{-}j}_{2N}\sum_{k=0}^{N-1}U_{odd}[k] H(2k+1) \omega^{{-}jk}_{N}$$

The concept of convolution without zero-padding is commonly called an implicit convolution, which serves as the foundation of our algorithm. By leveraging the even and odd separation computation properties, we can efficiently utilize multiple SSDs in our computation.
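As a concrete check of this separation, the following NumPy sketch (independent of the paper's FFTW-based implementation) compares the implicit convolution of Eqs. (2)–(5) against an explicit zero-padded FFT convolution for an arbitrary set of transfer-function samples. The factor of 1/2 appears only because NumPy's inverse FFT normalizes by the transform length, which Eq. (5) leaves implicit.

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
u = rng.standard_normal(N) + 1j * rng.standard_normal(N)
H = rng.standard_normal(2 * N) + 1j * rng.standard_normal(2 * N)  # arbitrary transfer samples

# Explicit (zero-padded) convolution: pad to 2N, filter in the frequency domain, keep N samples.
u_pad = np.concatenate([u, np.zeros(N, dtype=complex)])
explicit = np.fft.ifft(np.fft.fft(u_pad) * H)[:N]

# Implicit convolution (Eqs. (2)-(5)): two N-point transforms instead of one 2N-point transform.
j = np.arange(N)
U_even = np.fft.fft(u)                                # Eq. (3)
U_odd = np.fft.fft(u * np.exp(-1j * np.pi * j / N))   # Eq. (4): extra twiddle omega_{2N}^j
even_term = np.fft.ifft(U_even * H[0::2])             # uses the even samples H(2k)
odd_term = np.fft.ifft(U_odd * H[1::2])               # uses the odd samples H(2k+1)
implicit = 0.5 * (even_term + np.exp(1j * np.pi * j / N) * odd_term)

print(np.allclose(explicit, implicit))                # True
```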

2.2 Properties of solid state drive (SSD)

Since SSDs are much faster than HDDs (hard disk drives) for both sequential and random access, most recent out-of-core approaches employ SSDs [21–23]. Sequential access is generally faster than random access on disk systems, including SSDs. However, random writes on an SSD show performance comparable to sequential writes [18,22].

In our diffraction algorithm, the read and write operations have different access patterns (sequential or random). To handle these operations efficiently, we exploit the strength of SSDs in random write operations.
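A rough micro-benchmark sketch of this behavior is shown below; bench.bin is a hypothetical scratch file, and operating-system caching and device differences mean the timings are indicative only (the cited behavior itself comes from [18,22]).

```python
import os
import time
import numpy as np

block, count = 1 << 20, 256                       # 1 MiB blocks, 256 MiB total
data = os.urandom(block)
random_order = np.random.permutation(count) * block

def timed_write(offsets):
    # Write one block at each offset of a scratch file and report the elapsed time.
    with open("bench.bin", "wb") as f:
        f.truncate(block * count)
        t0 = time.perf_counter()
        for off in offsets:
            f.seek(int(off))
            f.write(data)
        f.flush()
        os.fsync(f.fileno())
        return time.perf_counter() - t0

print("sequential writes:", timed_write(np.arange(count) * block), "s")
print("random writes:    ", timed_write(random_order), "s")
```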

2.3 Utilizing multiple disks: RAID vs. manual handling

A prevalent method for harnessing multiple disks is the RAID (Redundant Array of Independent Disks) architecture. This technology merges multiple physical disk drives into one logical unit, delivering benefits such as data redundancy, higher performance, or a combination of both. Among its configurations, RAID-0 stands out because it distributes data uniformly across several disks without any redundancy. In theory, this increases I/O throughput in direct proportion to the disk count. However, actual performance may fall short of theoretical predictions due to RAID controller overhead and irregular data access patterns. Although RAID-0 boosts I/O performance with minimal user involvement in disk management, its inflexibility limits performance optimization through customization.

To overcome the limitations of RAID architecture, an alternative approach is to manage multiple disks independently using custom algorithms. Although this manual handling of each disk is more complex than using RAID, it allows for better utilization of I/O performance by leveraging specific knowledge about the target task. This approach enables greater flexibility and control in optimizing performance.

We adopt a manual handling approach and propose an out-of-core algorithm that maximally leverages the potential of utilizing multiple SSDs. By designing our algorithm to distribute the workload across the SSDs effectively, we aim to optimize the performance of diffraction calculations for large-scale problems.

3. Out-of-core diffraction with multiple SSDs

3.1 Problem formulation

Our work targets the diffraction process in CGH (e.g., numerical wave-front propagation), especially for ultra-high-resolution holograms requiring much more space than the physical memory size. For spatial data such as a hologram, the zero-frequency component of the transform is located at the top-left corner rather than at the center of the plane. To rectify this, the zero-frequency component must be moved to the center of the plane. There are two solutions: applying an FFT-shift before and after the FFTs [12], and the centered DFT (CDFT) [24,25]. Unlike the FFT-shift approach, which swaps data in memory, the CDFT centers the zero frequency by multiplying twiddle factors. We employ the CDFT approach because, for data so large that out-of-core computation is required, the data swapping overhead is much greater than a few additional multiplications. With the CDFT, our target diffraction computation for hologram generation is represented as Eq. (6).

$$(u \ast h)[j]= \omega^{j}_{2}\sum_{k=0}^{N-1}C_{even}[k] H(2k) \omega^{{-}jk}_{N} \omega^{k}_{2} + \omega^{{-}j}_{2N} \omega^{j}_{2} \sum_{k=0}^{N-1}C_{odd}[k] H(2k+1) \omega^{{-}jk}_{N} \omega^{k}_{2}\omega_{4}^{1},$$
where
$$C_{even}[k]=\sum_{j=0}^{N-1}u[j] \omega_{2N}^{(j-0.5N)(2k-N)}= \exp{(\pi i k)} \sum_{j=0}^{N-1}u[j] \omega_{N}^{jk} \exp{(\pi i j)}$$
$$C_{odd}[k]=\sum_{j=0}^{N-1}u[j] \omega_{2N}^{(j-0.5N)(2k+1-N)}= \exp{(\pi i (k+0.5))}\sum_{j=0}^{N-1}u[j] \omega_{N}^{jk} \exp{(\pi i j(1-\frac{1}{N}))}$$

Figure 1-(b) shows the process of the diffraction computation based on our problem formulation for the 1D case. A key point of the computation process is that it has two computation streams for the even and odd parts. The $\phi$s are twiddle factors multiplied before and after the FFT and inverse FFT. The subscript of a twiddle factor ($\phi _{O}$) indicates the even or odd stream, and the superscript ($\phi ^{O}$) indicates the point at which the multiplication occurs: $\phi ^{pre}$ and $\phi ^{post}$ are multiplied before and after the FFT (or inverse FFT), respectively. Equations (9)–(12) give the twiddle factors for the FFT.

$$\phi^{pre}_{even}(j)=\exp{(\pi i j)}$$
$$\phi^{post}_{even}(k)=\exp{(\pi i k)}$$
$$\phi^{pre}_{odd}(j)=\exp{(\pi i j(1-\frac{1}{N}))}$$
$$\phi^{post}_{odd}(k)=\exp{(\pi i (k+0.5))}$$

Equations (13)–(16) give the twiddle factors for the inverse FFT.

$${\phi^{{-}1}}^{pre}_{even}(k) = \exp{(-\pi ik)}$$
$${\phi^{{-}1}}^{post}_{even}(j) = \exp{(-\pi i j)}$$
$${\phi^{{-}1}}^{pre}_{odd}(k) = \exp{(-\pi i(k+0.5))}$$
$${\phi^{{-}1}}^{post}_{odd}(j) = \exp{(\pi i j(\frac{1}{N}-1))}$$
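The following small numerical check, a sketch independent of the paper's FFTW-based code, verifies the factored forms of Eqs. (7) and (8) against their direct definitions and confirms that the inverse twiddles of Eqs. (13)–(16) undo the forward transform; N is taken as a power of two, as in the paper's target resolutions.

```python
import numpy as np

N = 16                                   # power of two, as in the paper's target resolutions
rng = np.random.default_rng(1)
u = rng.standard_normal(N) + 1j * rng.standard_normal(N)
j = np.arange(N)
k = np.arange(N)
w = lambda a, b: np.exp(-2j * np.pi * a / b)          # twiddle factor omega_b^a

# Direct evaluation of the centered, implicitly padded DFT samples (Eqs. (7) and (8)).
C_even_ref = np.array([np.sum(u * w((j - 0.5 * N) * (2 * kk - N), 2 * N)) for kk in k])
C_odd_ref = np.array([np.sum(u * w((j - 0.5 * N) * (2 * kk + 1 - N), 2 * N)) for kk in k])

# Factored form: pre-twiddle, N-point FFT, post-twiddle (Eqs. (9)-(12)).
C_even = np.exp(1j * np.pi * k) * np.fft.fft(u * np.exp(1j * np.pi * j))
C_odd = np.exp(1j * np.pi * (k + 0.5)) * np.fft.fft(u * np.exp(1j * np.pi * j * (1 - 1 / N)))
print(np.allclose(C_even, C_even_ref), np.allclose(C_odd, C_odd_ref))   # True True

# The inverse twiddles (Eqs. (13)-(16)) undo the forward transform and recover u.
u_even = np.exp(-1j * np.pi * j) * np.fft.ifft(C_even * np.exp(-1j * np.pi * k))
u_odd = np.exp(1j * np.pi * j * (1 / N - 1)) * np.fft.ifft(C_odd * np.exp(-1j * np.pi * (k + 0.5)))
print(np.allclose(u_even, u), np.allclose(u_odd, u))                     # True True
```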

Fig. 1. The diffraction computation processes of two methods for the 1D case.

Each stream’s process can be thought of as a sequence of an FFT, application of the transfer function, and an inverse FFT. When we handle 2D data such as a hologram, we extend the process to the 2D case by performing 2D-FFTs.

3.2 Out-of-core diffraction algorithm

As formulated in Sec. 3.1, we need to perform a 2D-FFT, multiply by the transfer function, and then perform an inverse 2D-FFT for the even and odd streams, respectively. However, implementing this process is challenging since the data for ultra-high-resolution holograms exceeds the available memory, even without considering zero-padding. We employ strip-based processing, as in Blinder and Shimobaba [16], to overcome the memory limitation. As a result, we carry out the 2D-FFT (or inverse 2D-FFT) as a combination of row-wise 1D-FFTs and column-wise 1D-FFTs.
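A minimal in-memory NumPy sketch of this decomposition (toy sizes, no disk I/O) shows that row-wise followed by column-wise 1D-FFTs reproduce the 2D-FFT, and that storing the intermediate result transposed turns the column pass into another contiguous row-wise pass:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))

# Row-wise 1D-FFTs followed by column-wise 1D-FFTs reproduce the 2D-FFT.
step1 = np.fft.fft(A, axis=1)                   # "phase 1": FFT along every row (strip)
step2 = np.fft.fft(step1, axis=0)               # "phase 2": FFT along every column
print(np.allclose(step2, np.fft.fft2(A)))       # True

# Out-of-core variant: store the intermediate result transposed, so the column pass
# becomes another contiguous, row-wise pass over the stored data (see Sec. 3.4).
stored = step1.T.copy()                         # what phase 1 would write to disk
step2_t = np.fft.fft(stored, axis=1)            # sequential, row-wise reads in phase 2
print(np.allclose(step2_t.T, np.fft.fft2(A)))   # True
```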

Figure 2 shows the overview of our algorithm. The calculation of column-wise 1D-FFT must follow the completion of the row-wise 1D-FFT step, which requires synchronization between the two steps. This synchronization implies that the result of the row-wise 1D-FFT must be saved on the disk, as the memory limitations prevent it from being stored in memory. Our algorithm is divided into phases based on disk access synchronization points. The first phase encompasses all steps from loading the source data to completing the first synchronization step. This phase involves three disk accesses, including reading source data from a disk, and writing the results of 1D-FFT for both the even and odd streams to a disk. We can use up to three SSDs simultaneously to maximize disk input/output (I/O) efficiency during this phase.

Fig. 2. This figure shows the workflow and disk accesses in our out-of-core algorithm.

The second phase of our algorithm starts with the column-wise FFT, which is performed using the CDFT approach to avoid zero-padding. This results in two sub-streams (i.e., sub-even and sub-odd streams) for each major stream. The following steps are the multiplication of the transfer function and the inverse FFT. The transfer function multiplication is an element-wise operation that can be applied directly to the column-wise 1D-FFT result, which is represented as a 1D vector. We can also apply the inverse 2D-FFT by performing the column-wise 1D inverse FFT (IFFT) first, followed by the row-wise 1D-IFFT. This allows us to perform the column-wise FFT, transfer function multiplication, and column-wise inverse FFT in memory without saving intermediate results to disk. The results of the two sub-streams are merged by addition into a single 1D vector. Finally, the result is stored on a disk, which constitutes the second synchronization point. The even and odd streams can run in parallel in the second phase. Each stream requires two disk access operations, one for reading the output of the first phase from a disk and the other for writing the result of the second phase to a disk. To enhance I/O performance, up to four SSDs can be used during this phase.

In the final step of our algorithm, the row-wise inverse 1D-FFT is performed. This is the third phase of the algorithm, in which the results from the second phase are read and the row-wise inverse 1D-FFT is applied. The results from the even and odd streams are then combined into the final result, which is written to the destination disk. Similar to the first phase, we can simultaneously utilize up to three SSDs for the third phase.

In summary, our out-of-core diffraction algorithm consists of three phases, with two synchronization steps between them. It has two main streams, called even and odd, each with two sub-streams in the second phase. The two main streams can be executed in parallel, allowing us to utilize up to four SSDs simultaneously for efficient processing. The specifics of each phase and how the algorithm further improves the I/O performance are described in the following subsections.

3.3 System configuration and work unit

We design our method to run on a computing system with four SSDs. Although four SSDs are needed to utilize the potential I/O performance maximally, the method can run with fewer SSDs (or HDDs) by giving multiple roles to a single drive. As mentioned in Sec. 3.2, we employ strip-based processing whose basic work unit is a 1D vector (or line). However, the memory space is generally large enough to hold more than one line, even though it is not sufficient to load the whole plane. Since a larger data block generally leads to better I/O performance [18], we make a chunk of multiple consecutive lines the work unit of our method. The size of a chunk (i.e., the number of lines) can vary depending on the size of physical memory and how many chunks we handle at once. Equation (17) gives the maximum size of a chunk, where $|line|$, $|memory|$, and $b$ are the data size of a line, the physical memory size, and the number of buffers for holding upcoming chunks, respectively.

$$chunkSize = \frac{|memory|}{|line| \times b}$$
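For example, reading Eq. (17) with the double-precision test plane of Sec. 4 (one line of 131,072 complex128 values is 2 MiB) and 256 GB of system memory gives the following upper bounds on the chunk size; the chunk sizes reported elsewhere in Sec. 4 (e.g., 8,192 lines in Fig. 7) are well below these bounds.

```python
line_bytes = 131_072 * 16            # one double-precision complex line = 2 MiB
memory_bytes = 256 * 2**30           # 256 GB of system memory
for b in (2, 4, 8):                  # number of buffers held at once
    print(b, "buffers ->", memory_bytes // (line_bytes * b), "lines per chunk (upper bound)")
```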

3.4 First phase

Figure 3 provides a detailed overview of the first phase. Initially, the source plane is stored on a disk (i.e., Disk A). To begin, our method loads a chunk of data from Disk A into a buffer in memory. It pre-fetches the next chunks into other buffers while processing the current chunk. Once a chunk is loaded into a buffer, two copies of the chunk are created so that the even and odd streams can process it in parallel. For each stream, a row-wise FFT is performed while the corresponding twiddle factors (e.g., $\phi ^{pre}_{even}$ and $\phi ^{post}_{even}$) are multiplied before and after the computation.

Fig. 3. This figure shows the process of the first phase (i.e., Phase 1) of our method.

The results of the FFT computation for each stream must be written back to disk to make room for processing other chunks. Before writing the result to disk, we transpose the chunk in memory. In the second phase, the first step is a column-wise FFT (Sec. 3.2), which would require random read accesses along the column direction. As discussed in Sec. 2.2, random reads on a disk such as an SSD are significantly slower than sequential reads, whereas random writes on an SSD show performance comparable to sequential writes. To exploit this property of SSDs and improve disk access efficiency, we transpose the chunk before writing the result to disk in the first phase. This makes each column occupy a contiguous region on disk, which can then be accessed with sequential reads in the second phase. Although this approach requires additional transposition steps, we found that it improves our method’s performance by tens of times compared to the same method without transposition, because memory operations are significantly faster than disk accesses.

As the results of the two streams are independent, we can store them on separate disks (i.e., Disk B and Disk C) to increase the I/O bandwidth. Additionally, we process the two streams in parallel using two different threads while the next chunk is being loaded into a buffer. This allows us to utilize up to three SSDs concurrently and to overlap disk I/O operations with computation, reducing the impact of disk access latency. By utilizing multiple disks and overlapping I/O and computation, we improve our method’s overall performance.
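The following is a simplified, single-process sketch of this phase-1 data flow, under assumptions that differ from the actual implementation: the "disks" are ordinary files with hypothetical names (disk_a.bin, disk_b.bin, disk_c.bin), NumPy stands in for FFTW and the Windows API, pre-fetch buffering is omitted, and normalization constants are ignored.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N, chunk = 1024, 128                                   # toy plane size and lines per chunk
j = np.arange(N)
twiddles = {                                           # (pre, post) factors from Sec. 3.1
    "even": (np.exp(1j * np.pi * j), np.exp(1j * np.pi * j)),
    "odd": (np.exp(1j * np.pi * j * (1 - 1 / N)), np.exp(1j * np.pi * (j + 0.5))),
}

# "Disk A" holds the source plane; "Disk B" and "Disk C" receive the even/odd results.
src = np.memmap("disk_a.bin", dtype=np.complex64, mode="w+", shape=(N, N))
src[:] = (np.random.rand(N, N) + 1j * np.random.rand(N, N)).astype(np.complex64)
outs = {s: np.memmap(f"disk_{d}.bin", dtype=np.complex64, mode="w+", shape=(N, N))
        for s, d in (("even", "b"), ("odd", "c"))}

def process_stream(rows, r0, stream):
    pre, post = twiddles[stream]
    res = post * np.fft.fft(rows * pre, axis=1)        # row-wise FFT with pre/post twiddles
    outs[stream][:, r0:r0 + rows.shape[0]] = res.T     # transposed write: columns become
                                                       # contiguous for the second phase

with ThreadPoolExecutor(max_workers=2) as pool:        # even and odd streams run in parallel
    for r0 in range(0, N, chunk):
        rows = np.array(src[r0:r0 + chunk])            # read one chunk from Disk A
        for stream in ("even", "odd"):
            pool.submit(process_stream, rows, r0, stream)

for m in outs.values():
    m.flush()
```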

3.5 Second phase

Figure 4 shows a detailed overview of the second phase. In the first phase, the intermediate results of the even and odd streams were stored, in transposed form, on separate SSDs (i.e., Disk B and C). Although the two streams are handled independently, their workflows are identical.

Fig. 4. This figure shows the process of the second phase (i.e., Phase 2) of our method.

The first step of the second phase is the column-wise FFT. As we avoid zero-padding by using the CDFT approach, the column-wise processing step also generates two sub-streams, the even and odd sub-streams (e.g., $Even_{even}$ and $Even_{odd}$ for the even stream). Since the first phase stored the data transposed, each column occupies a contiguous region on disk and can be read with sequential read operations instead of random access. Therefore, the computational process is similar to the row-wise FFT in the first phase, only with different twiddle factors. As in the first phase, a chunk is loaded from the disk holding the results of the first phase into a buffer, and two copies are made, one for each sub-stream. Note that the two disks (i.e., Disk B and C) are used simultaneously, one for each stream. The first logical step, the 2D-FFT, is completed at this point.

The next logical steps involve multiplying the transfer function and the inverse 2D-FFT. The second step of the second phase is multiplying the transfer function. For each chunk, it is directly applied to the result of the column-wise FFT in memory. The inverse 2D-FFT, which is the last logical step, is performed through the second and third phases. The order of the row-wise and column-wise FFTs for 2D-FFT calculation using 1D-FFTs does not affect the result. In the second phase, we are working with a chunk containing columns, and it is more efficient to utilize it directly instead of transposing it again for the inverse 2D-FFT. Therefore, for the inverse 2D-FFT, we perform the column-wise inverse FFT first, which is different from the 2D-FFT step. Finally, the results of the two sub-streams are added into a result chunk, completing the column-wise processing step.

The resulting chunk is then written back to a disk to make room for the next chunks. Before writing the result of the second phase, we transpose the chunk to improve the I/O performance of the third phase. The next step (i.e., the first step of the third phase) is the row-wise inverse FFT, and this transposition makes the elements of a row lie contiguously on the disk, enabling sequential disk access. Although this requires a random (i.e., column-wise) write to the disk, random writes show performance comparable to sequential writes on SSDs (Sec. 2.2).

With multiple buffers, our method can simultaneously read a chunk from a disk into one buffer, process another chunk, and write the result of the other chunk to another disk. In addition, we can overlap three tasks by using multiple threads. As a result, we can utilize up to four disks (two disks for each stream) simultaneously while maximizing the available I/O bandwidth. Note that we can use the disk containing the source plane (i.e., Disk A) as one of the four disks, as the first and second phases are independent.
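A minimal sketch of this read/compute/write overlap, with the buffer count $b$ enforced by bounded queues, might look as follows; the file names and the per-chunk work (a stand-in row-wise FFT) are purely illustrative.

```python
import queue
import threading
import numpy as np

N, chunk, b = 2048, 256, 2                             # toy sizes; b buffers in flight
src = np.memmap("phase_in.bin", dtype=np.complex64, mode="w+", shape=(N, N))
dst = np.memmap("phase_out.bin", dtype=np.complex64, mode="w+", shape=(N, N))
to_compute, to_write = queue.Queue(maxsize=b), queue.Queue(maxsize=b)

def reader():                                          # reads chunks from the input "disk"
    for r0 in range(0, N, chunk):
        to_compute.put((r0, np.array(src[r0:r0 + chunk])))
    to_compute.put(None)

def worker():                                          # stand-in for the per-chunk work
    while (item := to_compute.get()) is not None:
        r0, data = item
        to_write.put((r0, np.fft.fft(data, axis=1)))
    to_write.put(None)

def writer():                                          # writes finished chunks to the output "disk"
    while (item := to_write.get()) is not None:
        r0, data = item
        dst[r0:r0 + chunk] = data
    dst.flush()

threads = [threading.Thread(target=f) for f in (reader, worker, writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```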

3.6 Third phase

The third phase of our algorithm is detailed in Fig. 5. In this phase, the results of the second phase for the even and odd streams are stored on two disks (i.e., Disk D and A), respectively. The first step of the third phase is the row-wise inverse FFT, which is performed for each stream. It reads a chunk from the disk into a buffer and performs the inverse FFT, multiplying the stream's inverse-FFT twiddle factors (e.g., ${\phi ^{-1}}^{pre}_{even}$ and ${\phi ^{-1}}^{post}_{even}$ for the even stream). The two streams are then merged by adding their chunks at the same position. The merged result, which is the final output of our algorithm, is written to the disk storing the resulting plane.

Fig. 5. This figure shows the process of the third phase (i.e., Phase 3) of our method.

Once we have processed all the chunks, we obtain the final result plane in the destination disk (i.e., Disk B). During this third phase, we can simultaneously utilize up to three disks: two for reading chunks for each stream and one for writing the resulting chunk.
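A per-chunk sketch of this phase-3 step for corresponding even- and odd-stream chunks might look as follows (disk reads and writes, the final constants of Eq. (6), and normalization are omitted; the chunk contents here are random placeholders):

```python
import numpy as np

N, chunk = 1024, 128
j = np.arange(N)
even_chunk = np.random.rand(chunk, N) + 1j * np.random.rand(chunk, N)   # read from Disk D
odd_chunk = np.random.rand(chunk, N) + 1j * np.random.rand(chunk, N)    # read from Disk A

# Row-wise inverse FFT with the inverse twiddles of each stream (Eqs. (13)-(16)).
out_even = np.exp(-1j * np.pi * j) * np.fft.ifft(even_chunk * np.exp(-1j * np.pi * j), axis=1)
out_odd = (np.exp(1j * np.pi * j * (1 / N - 1))
           * np.fft.ifft(odd_chunk * np.exp(-1j * np.pi * (j + 0.5)), axis=1))

final_chunk = out_even + out_odd                       # merged chunk, written to Disk B
```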

4. Result and analysis

We implemented our out-of-core diffraction algorithm using multiple SSDs on two distinct machine configurations: a SATA machine and an NVMe machine. Both machines have identical hardware configurations, consisting of an AMD Ryzen Threadripper 3960X 24-core CPU and 256 GB of system memory, with the only difference being the type of SSDs used. The SATA machine has four SATA SSDs (Samsung 870 EVO 2TB), while the NVMe machine contains four NVMe SSDs (Samsung 980 Pro 2TB). We employed OpenMP to run multiple threads, FFTW [10] for the FFT calculations, and the Windows API for I/O operations.

To compare the performance of our approach with the prior work and its naive extensions, we implemented three alternative methods.

  • $Blinder$ is an implementation of the long-distance diffraction algorithm proposed by Blinder and Shimobaba [16] that utilizes a single SSD and performs computation (e.g., 1D-FFT and matrix operations) on a GPU. It serves as the baseline for the performance comparison in our study. Blinder and Shimobaba used a chunk size of 512 in their paper. However, we employed different sizes ranging from 1,024 to 8,192, depending on the machine and data size, as it provided the best performance in our experiments. We also found that using a CPU for computation performs almost the same as using a GPU since the performance bottleneck is I/O. Hence, we used a CPU for this algorithm, like our method, to focus on evaluating the I/O performance improvement.
  • $Blinder_{RAID}$ is based on the RAID technology, a conventional approach to improving the I/O performance. For this method, the algorithm is the same as $Blinder$, but it runs with a disk volume composed of four SSDs grouped by RAID-0 (Sec. 2.3).
  • $Blinder_{multiSSD}$ is an extended version of the $Blinder$ method. This method handles each SSD manually, assigning a dedicated SSD for each write and read operation. The method increases the overall bandwidth by utilizing two SSDs simultaneously in each phase. In this method, the SSDs switch roles (e.g., write or read) at each phase.

We applied our method and the three alternative methods to the angular spectrum method (ASM) for diffraction planes. To test our algorithm’s performance, we generated random diffraction planes with a resolution of $131,072 \times 131,072$ ($2^{17} \times 2^{17}$), a pixel pitch of 5 um, a propagation distance of z = −90 mm, and a wavelength of 660 nm, in both single- and double-precision complex values. The data sizes are 128 GB and 256 GB for single- and double-precision, respectively. Therefore, the required memory space for the ASM is 512 GB and 1024 GB for single- and double-precision, respectively.
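For reference, a sketch of the band-limited angular-spectrum transfer function for these parameters on a small in-core grid is shown below; note that the paper's implicit/CDFT formulation samples H at even and odd centered frequency indices rather than forming it on a padded grid, so this only illustrates the underlying transfer function.

```python
import numpy as np

n, pitch, z, wl = 1024, 5e-6, -90e-3, 660e-9           # the full test uses n = 131,072
fx = np.fft.fftfreq(n, d=pitch)                        # spatial frequencies (cycles/m)
FX, FY = np.meshgrid(fx, fx, indexing="xy")
arg = 1.0 / wl**2 - FX**2 - FY**2
kz = np.sqrt(np.maximum(arg, 0.0))
H = np.where(arg > 0, np.exp(2j * np.pi * z * kz), 0.0)  # evanescent components suppressed

u = np.exp(2j * np.pi * np.random.rand(n, n))          # toy source plane (random phase)
u_z = np.fft.ifft2(np.fft.fft2(u) * H)                 # in-core, circular propagation
                                                       # (illustration only; no padding)
```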

SATA SSD vs. NVMe SSD: To fully comprehend the results of our experiment, it is essential to elucidate the fundamental distinctions between SATA SSDs and NVMe SSDs. A widely recognized disparity between the two is that a SATA SSD (e.g., 500 MB/s) exhibits considerably slower performance than an NVMe SSD (e.g., 5,000 MB/s). A crucial dissimilarity between these storage devices is their support for parallel I/O operations. NVMe SSDs are specifically engineered to facilitate a high degree of parallelism and high-bandwidth dual-simplex interface, enabling the simultaneous processing of multiple I/O requests [23]. In contrast, SATA SSDs employ a low-bandwidth and half-duplex interface, reducing overall throughput, even if they have internal parallelism. Consequently, utilizing multiple SSDs to execute I/O operations concurrently is an effective strategy in a SATA SSD-based system. However, in the case of an NVMe SSD system, employing multiple SSDs for parallel I/O is less effective than with SATA SSDs, given that a single NVMe SSD inherently supports such functionality.

4.1 Results

Table 1 presents the processing times of each algorithm for the diffraction process, with the best combination of chunk size and number of buffers reported for each algorithm. In the SATA machine, $Blinder_{multiSSD}$ achieves 56% and 33% higher performance than Blinder in single- and double-precision, respectively (Table 1-(a)). However, in the NVMe machine (Table 1-(b)), the performance benefit of using multiple SSDs is not evident for $Blinder_{multiSSD}$. As SATA SSDs process I/O operations serially, utilizing multiple SSDs enables the overlapping of read and write operations, which subsequently reduces the I/O bottleneck in the SATA system. In contrast, NVMe SSDs inherently support parallel I/O and boast sufficient data transfer rates to accommodate these operations. Consequently, employing multiple NVMe SSDs does not confer any substantial advantage over using a single NVMe SSD, given that the latter already facilitates parallel I/O.

Table 1. This table presents each algorithm’s processing times (in seconds), utilizing the optimal chunk size and the number of buffers to achieve the highest performance on each machine configuration.

Although $Blinder_{RAID}$ utilizes four SSDs, the performance improvement over Blinder is only marginal and not significant. Furthermore, $Blinder_{RAID}$ shows lower performance than $Blinder_{multiSSD}$ in most cases. This is because although each I/O operation takes less time than $Blinder_{multiSSD}$ due to the RAID system, all I/O operations are still processed serially as there is only one logical disk volume. Additionally, the overhead of RAID control prevents $Blinder_{RAID}$ from achieving the theoretical maximum bandwidth. Our proposed algorithm, Ours, outperformed all other algorithms in all test cases. Specifically, on the NVMe machine, Ours demonstrated 1.52 and 1.71 times higher performance than Blinder for single- and double-precision cases, respectively. Furthermore, Ours achieved up to 1.52 times higher performance than two extensions of Blinder that utilized multiple SSDs. In the SATA machine, Ours achieved even greater performance improvements than on the NVMe machine, as the I/O bottleneck was more severe for other algorithms in the SATA machine. Specifically, for single-precision, Ours showed 2.43 and 1.56 times higher performance than Blinder and $Blinder_{multiSSD}$, respectively. For double-precision, Ours demonstrated 2.06 and 1.55 times higher performance than Blinder and $Blinder_{multiSSD}$, respectively.

We can achieve high performance with Ours as it effectively reduces the I/O bottleneck by overlapping up to four I/O operations while simultaneously processing up to four computational streams.

4.2 I/O performance

To assess the improvement in I/O performance achieved by our method, we conducted measurements of the time required for I/O operations in Blinder, $Blinder_{multiSSD}$, and Ours. This analysis focused solely on the I/O time by bypassing the FFT computation. It is important to note that computation and I/O operations are performed concurrently in the actual algorithm.

Figure 6 shows the time consumed for I/O operations in the three algorithms. On the SATA interface, our method (Ours) demonstrates a reduction in I/O time of 58% and 57% compared to Blinder for single- and double-precision, respectively. While $Blinder_{multiSSD}$ also decreases I/O time compared to Blinder by utilizing multiple SSDs, our method (Ours) outperforms it, achieving 41% and 45% lower I/O time. Similarly, on the NVMe interface, Ours achieves a reduction in I/O time of 35% and 32% compared to $Blinder_{multiSSD}$. This improvement in I/O performance is attributed to our out-of-core algorithm, which evenly distributes I/O operations across multiple SSDs.

Fig. 6. This graph compares the time consumed for I/O operations in three algorithms.

To further explore the advantages of our method in the I/O process, we analyzed the individual time requirements of each I/O operation during chunk processing. This analysis breaks the operations within each phase into three components: read time, write time, and waiting time for other I/O operations to complete. Examining these components separately gives insight into the time spent in each operation and the synchronization between read and write operations. For example, in the first phase, the read of the next chunk can only commence once the write of the current chunk has completed, so the read thread remains idle until that write finishes; likewise, the write of one stream's chunk can only commence once the write of the other stream's chunk in the other thread has completed, so the current thread remains idle until that write finishes.

Figure 7 presents the analysis results on NVMe SSD for $Blinder_{multiSSD}$ and Ours. Our analysis revealed that in $Blinder_{multiSSD}$, there is a significant waiting time between I/O operations in the first and third phases due to imbalanced data amounts between read and write operations. For example, in the first phase, the write operation has twice the data amount compared to the read operation. In contrast, our method (Ours) minimizes the waiting time by distributing the I/O workload for operations handling larger data (e.g., write in the first phase) across multiple SSDs. These results validate our method’s efficient utilization of multiple SSDs, leading to improved I/O performance.

Fig. 7. This stacked chart shows the time consumed by each component related to I/O operations during the processing of a chunk (8,192 lines). For this analysis, we utilized NVMe SSDs.

Additionally, while Ours may appear to have a longer processing time in the second phase, its actual processing time is lower than that of $Blinder_{multiSSD}$ due to the simultaneous handling of two chunks (Sec. 3.4). Furthermore, although Fig. 7 is presented as a stacked chart, the read and write threads actually run in parallel.

4.3 Chunk size and the number of buffers

The performance of our method is influenced by two parameters: the chunk size and the number of buffers. As expected, allocating more memory generally leads to better performance (Fig. 8). Specifically, when the chunk size is fixed, using more buffers generally results in faster processing, and when the number of buffers is fixed, a larger chunk size generally leads to faster processing.

Fig. 8. Comparison of the computation time for various memory usage.

Several combinations of chunk size and number of buffers fit within the same memory budget. To investigate their effect, we measured the processing time for various configurations with the same memory usage, and Table 2 shows the results. On the SATA machine, two buffers usually achieved the best performance. We found that the I/O time is much larger than the computation time in our algorithm, so one additional buffer is enough to hide the computation time of another buffer; therefore, increasing the chunk size is more efficient than using more than two buffers. On the other hand, two or four buffers show the best performance on the NVMe machine. This is because, as the gap between I/O and computation cost narrows, there is a greater chance to overlap the computation of other buffers when more than two buffers are used.

Table 2. This table presents the processing time of our algorithm for various configurations of the chunk size and the number of buffers in a buffer set.

4.4 Number of SSDs

To evaluate the impact of using multiple SSDs in our method, we measured the performance of Ours with different numbers of SSDs (Fig. 9). With a single SSD, all I/O operations are performed on that SSD. Using RAID-0 with multiple SSDs provides a higher maximum bandwidth than a single SSD while following the same workflow, since it is logically identical to the single-SSD case. With two SSDs, we use one for read operations and the other for write operations.

Fig. 9. This plot shows a comparison of our algorithm’s performance as a function of the number of SSDs employed. In the case of RAID-0, we employed four SSDs, which were configured to function as a single logical disk.

On the SATA machine, Ours shows a continued improvement in performance as the number of SSDs increases. Despite the inherent lack of parallel I/O support in SATA SSDs, employing multiple SSDs provides parallel I/O capacity and enables our approach to leverage its parallel processing nature. However, employing RAID-0 with four SATA SSDs results in similar or even lower performance than Ours with a single SSD, as in the Blinder case. This is due to the serial processing of I/O operations and the overhead of RAID control, both of which affect performance negatively. In contrast, Ours with four SSDs achieves up to 1.99 times higher performance than with a single SSD, and up to 153% higher performance than Ours using RAID-0. By fully exploiting parallel I/O, Ours with four SSDs performs better than Ours using RAID-0. These results demonstrate the advantage of our approach in utilizing multiple SSDs to improve performance in SATA-based systems.

The performance of Ours on the NVMe machine remains constant regardless of the number of SSDs used, in contrast to the SATA machine. This is because NVMe SSDs already support parallel I/O and have sufficient bandwidth to handle multiple I/O operations in our algorithm, allowing the benefits of the parallel I/O feature of Ours to be fully utilized with only one SSD. Nonetheless, Ours outperforms $Blinder_{multiSSD}$ and $Blinder_{RAID}$ by up to 1.52 times even when using the same number of SSDs. This is because Ours leverages higher parallelism for I/O and computation than those methods. These results clearly demonstrate the advantage of our algorithm over prior approaches.

5. Application to computer generated holography

To verify the benefit of our approach in an actual application, we applied the out-of-core diffraction algorithms to the layer-based hologram generation process [26]. One of the major challenges of generating layer-based holograms with out-of-core diffraction methods (including our method and Blinder’s method) is the temporary storage required to hold all diffracted layers before they are summed into the hologram. Additionally, loading the complex wave planes and the hologram incurs high I/O costs and requires excessive memory space. To overcome these challenges, we combine the out-of-core diffraction and layer-based algorithms. Specifically, we incrementally propagate each image plane to the next image plane and, finally, to the hologram, similar to the wavefront recording plane (WRP) method [27], instead of propagating all planes directly to the hologram plane as in the naive method. To propagate and accumulate layer by layer, we load and add the accumulated field during the read step of the first phase. This modified CGH method avoids the multiple read steps of a final summation into the hologram, since all layers are already summed while propagating to each layer.
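The sketch below illustrates this incremental accumulation on a tiny in-core example (toy resolution and random layer fields; asm_propagate is a hypothetical helper mirroring the transfer function sketched in Sec. 4):

```python
import numpy as np

def asm_propagate(u, z, pitch=500e-9, wl=660e-9):
    # Hypothetical helper: band-limited angular-spectrum propagation over distance z.
    n = u.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)
    FX, FY = np.meshgrid(fx, fx, indexing="xy")
    arg = 1.0 / wl**2 - FX**2 - FY**2
    H = np.where(arg > 0, np.exp(2j * np.pi * z * np.sqrt(np.maximum(arg, 0.0))), 0.0)
    return np.fft.ifft2(np.fft.fft2(u) * H)

n = 512                                                # toy resolution (the paper uses 200K x 200K)
layers = {-9e-3: np.random.rand(n, n) + 0j,            # toy complex field of each layer
          -6e-3: np.random.rand(n, n) + 0j,
          -3e-3: np.random.rand(n, n) + 0j}

depths = sorted(layers)                                # farthest (-9 mm) to nearest (-3 mm)
acc = layers[depths[0]]
for z_prev, z_next in zip(depths, depths[1:]):
    acc = asm_propagate(acc, z_next - z_prev) + layers[z_next]   # "load and sum" while reading
hologram = asm_propagate(acc, 0.0 - depths[-1])        # final propagation to the hologram plane
```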

We generated a hologram with three layers located at z = −3, −6, and −9 mm, with a hologram size of $200K \times 200K$, a pixel pitch of 500 nm, and a wavelength of 660 nm (Fig. 10). Note that we did not use iterative carrier-wave optimization methods such as the Gerchberg-Saxton algorithm [28] or stochastic gradient descent with an adaptive step size [29], as our goal in this experiment is to verify the effectiveness of our diffraction algorithm, not to propose a refined hologram generation method. Figure 11 shows the numerically reconstructed images from the hologram, where each image’s propagation distance corresponds to the depth of its layer. Our method successfully computes convolution-based diffraction such as the ASM. Moreover, the hologram generation process with Ours takes 685 minutes, whereas it takes 946 minutes with $Blinder_{multiSSD}$ on a SATA SSD, resulting in a computing time reduction of approximately 38% with our approach. Please note that in our evaluation, we measured the overall time for the layer-based hologram generation process, which includes all the necessary steps in addition to the diffraction calculation [26]. This result demonstrates the practical utility of our method in CGH applications.

Fig. 10. The layer images and their distances of the layer-based hologram.

Fig. 11. Numerically reconstructed images at different distances of the layers. (a) Reconstructed at a distance of z=-3mm, (b) reconstructed at a distance of z=-6mm, and (c) reconstructed at a distance of z=-9mm.

6. Conclusion

We presented a novel out-of-core diffraction calculation algorithm that utilizes multiple SSDs to improve performance for ultra-high-resolution hologram generation. To utilize multiple SSDs efficiently, we employed an implicit diffraction approach that does not require zero padding while exploiting the even and odd separation characteristics of implicit convolution computations. We implemented our method on two different computing environments with multiple SSDs and compared its performance with prior out-of-core diffraction approaches and a RAID-based solution. Our approach showed up to 2.43 times higher performance than prior approaches for large-scale planes requiring out-of-core processing. Additionally, we found that our method continues to improve performance by adding more SSDs, unlike prior or naive approaches that show even lower performance with more SSDs in some cases. To validate the usefulness of our method, we adapted it to generate layer-based holograms with a resolution of 200K $\times$ 200K. We found that our method reduces generation time by about 38% compared with the prior approach. Our findings demonstrate the benefits of our approach and its potential for practical applications.

In future work, we plan to explore ways to increase I/O performance and take advantage of the full capabilities of SSDs. One potential avenue is to utilize the DirectStorage API, recently introduced by Microsoft [30], which enables the GPU to access NVMe storage directly, reducing I/O overhead and enabling fast I/O and computation. Additionally, we plan to investigate using low-precision formats to reduce data size and improve computation performance in the GPU. This will allow us to create high-resolution holograms more efficiently. We also plan to explore other techniques to improve performance, such as optimizing data layout and memory access patterns. Ultimately, we aim to continue developing and refining our out-of-core diffraction approach to enable real-time or near-real-time hologram generation for various applications.

Funding

National Research Foundation of Korea (2021R1I1A3048263, 2021RIS-004).

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) through the Ministry of Education as part of the Basic Science Research Program under grant 2021R1I1A3048263 (High-Performance CGH Algorithms for Ultra-High Resolution Hologram Generation, 70%) and the Regional Innovation Strategy (RIS) under grant 2021RIS-004 (30%).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. J. Goodman, Introduction to Fourier Optics (W. H. Freeman, 2005).

2. T. Shimobaba and T. Ito, Computer Holography: Acceleration Algorithms and Hardware Implementations (CRC, 2019).

3. B. J. Jackin, S. Watanabe, K. Ootsu, T. Ohkawa, T. Yokota, Y. Hayasaki, T. Yatagai, and T. Baba, “Decomposition method for fast computation of gigapixel-sized Fresnel holograms on a graphics processing unit cluster,” Appl. Opt. 57(12), 3134–3145 (2018).

4. B. J. Jackin, H. Miyata, T. Ohkawa, K. Ootsu, T. Yokota, Y. Hayasaki, T. Yatagai, and T. Baba, “Distributed calculation method for large-pixel-number holograms by decomposition of object and hologram planes,” Opt. Lett. 39(24), 6867–6870 (2014).

5. R. P. Muffoletto, J. M. Tyler, and J. E. Tohline, “Shifted Fresnel diffraction for computational holography,” Opt. Express 15(9), 5631–5640 (2007).

6. K. Matsushima, “Shifted angular spectrum method for off-axis numerical propagation,” Opt. Express 18(17), 18453–18463 (2010).

7. T. Shimobaba, T. Takahashi, Y. Yamamoto, T. Nishitsuji, A. Shiraki, N. Hoshikawa, T. Kakue, and T. Ito, “Efficient diffraction calculations using implicit convolution,” OSA Continuum 1(2), 642 (2018).

8. J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Math. Comp. 19(90), 297–301 (1965).

9. Y. Zhou, F. He, N. Hou, and Y. Qiu, “Parallel ant colony optimization on multi-core SIMD CPUs,” Futur. Gener. Comput. Syst. 79, 473–487 (2018).

10. M. Frigo and S. G. Johnson, “FFTW: An adaptive software architecture for the FFT,” in International Conference on Acoustics, Speech and Signal Processing (Cat. No. 98CH36181), vol. 3 (IEEE, 1998), pp. 1381–1384.

11. NVIDIA, “CUFFT libraries,” https://docs.nvidia.com/cuda/cufft/index.html. Accessed: 2022-10-14.

12. J. Lee, H. Kang, H.-j. Yeom, S. Cheon, J. Park, and D. Kim, “Out-of-core GPU 2D-shift-FFT algorithm for ultra-high-resolution hologram generation,” Opt. Express 29(12), 19094–19112 (2021).

13. H. Kang, J. Lee, and D. Kim, “HI-FFT: Heterogeneous parallel in-place algorithm for large-scale 2D-FFT,” IEEE Access 9, 120261–120273 (2021).

14. J. H. Choi, J. Pi, C. Hwang, J. Yang, Y. Kim, G. H. Kim, H. Kim, K. Choi, J. Kim, and C. Hwang, “Evolution of spatial light modulator for high-definition digital holography,” ETRI Journal 41(1), 23–31 (2019).

15. J. C. Bowman and M. Roberts, “Efficient dealiased convolutions without padding,” SIAM J. Sci. Comput. 33(1), 386–406 (2011).

16. D. Blinder and T. Shimobaba, “Efficient algorithms for the accurate propagation of extreme-resolution holograms,” Opt. Express 27(21), 29905–29915 (2019).

17. B. Jacob and T. Mudge, “Virtual memory: Issues of implementation,” Computer 31(6), 33–43 (1998).

18. P. Godard, V. Loechner, and C. Bastoul, “Efficient out-of-core and out-of-place rectangular matrix transposition and rotation,” IEEE Trans. Comput., p. 1 (2020).

19. S. Boboila, Y. Kim, S. S. Vazhkudai, P. Desnoyers, and G. M. Shipman, “Active flash: Out-of-core data analytics on flash storage,” in 28th Symposium on Mass Storage Systems and Technologies (IEEE, 2012), pp. 1–12.

20. K. Matsushima and S. Nakahara, “Extremely high-definition full-parallax computer-generated hologram created by the polygon-based method,” Appl. Opt. 48(34), H54–H63 (2009).

21. T. Shimobaba, A. Sugiyama, H. Akiyama, S. Hashikawa, T. Kakue, and T. Ito, “Large scale calculation for holography using SSD,” Photonics Lett. Pol. 6(3), 1 (2014).

22. J. Lee, H. Roh, and S. Park, “External mergesort for flash-based solid state drives,” IEEE Trans. Comput. 65(5), 1518–1527 (2016).

23. B. Mao, S. Wu, and L. Duan, “Improving the SSD performance by exploiting request characteristics and internal parallelism,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 37(2), 472–484 (2018).

24. D. H. Mugler, “The centered discrete Fourier transform and a parallel implementation of the FFT,” in International Conference on Acoustics, Speech and Signal Processing (IEEE, 2011), pp. 1725–1728.

25. J. G. Vargas-Rubio and B. Santhanam, “On the multiangle centered discrete fractional Fourier transform,” IEEE Signal Process. Lett. 12(4), 273–276 (2005).

26. Y. Zhao, L. Cao, H. Zhang, D. Kong, and G. Jin, “Accurate calculation of computer-generated holograms using angular-spectrum layer-oriented method,” Opt. Express 23(20), 25440–25449 (2015).

27. T. Shimobaba, N. Masuda, and T. Ito, “Simple and fast calculation algorithm for computer-generated hologram with wavefront recording plane,” Opt. Lett. 34(20), 3133–3135 (2009).

28. R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the determination of phase from image and diffraction plane pictures,” Optik 35(2), 237–246 (1972).

29. Y. Peng, S. Choi, N. Padmanaban, and G. Wetzstein, “Neural holography with camera-in-the-loop training,” ACM Trans. Graph. 39(6), 1–14 (2020).

30. Microsoft, “DirectStorage API,” https://devblogs.microsoft.com/directx/directstorage-api-downloads/.
