Special-purpose computer HORN-8 for phase-type electro-holography

Takashi Nishitsuji; Yota Yamamoto; Takashige Sugie; Takanori Akamatsu; Ryuji Hirayama; Hirotaka Nakayama; Takashi Kakue; Tomoyoshi Shimobaba; Tomoyoshi Ito

doi:10.1364/OE.26.026722

1. Introduction

Since D. Gabor invented holography in 1947 [1], electro-holography has shown great promise as a display technology for realizing photorealistic three-dimensional (3D) images. However, it has not been put into practical use for two reasons: (1) the enormous computational burden of computer-generated holograms (CGHs) and (2) the pixel pitch of the spatial light modulator (SLM), which determines the frame rate and viewing angle of the 3D images. This study focuses on solving the first of these two problems.

Studies on the calculation processing of electro-holography are roughly divided into two approaches and can be further classified according to the description of the 3D information to be displayed, such as point-cloud [2–11], polygon [12], and multi-view [13, 14]. HORN-8 is designed for CGHs of point-cloud-based 3D models because of the simplicity of the CGH calculation; thus, the following explanations are based on a point-cloud CGH. The first approach is to reduce the computational complexity by devising algorithms [2–11]. Examples include the look-up table (LUT) method [2–5], the wavefront recording plane method [6, 7], and the inter-frame difference method [8, 9]. The second approach is to develop hardware systems based on, for example, graphics processing units (GPUs) [15–17], many-core processors [18], and field programmable gate arrays (FPGAs) [19–27].

In 1993, the authors developed a high-performance computation system for electro-holography, namely HOlographic ReconstructioN (HORN) [20–27]. Compared with general-purpose processors such as GPUs, a special-purpose computer achieves higher computational efficiency because the calculation target is restricted. Recently, we released the latest model of the HORN series, HORN-8, a peripheral-board-type dedicated computer with eight FPGAs (Fig. 1). We first implemented a circuit for an amplitude-type CGH and successfully synthesized an amplitude-type CGH of 100 million pixels at video rate [27]. However, this was not the final stage of HORN-8. Because an amplitude-type CGH has poorer light utilization efficiency than a phase-type CGH, the sharpness of the 3D image was lower. Therefore, to realize a photorealistic 3D image with the HORN system, it is indispensable to implement a phase-type CGH in HORN-8. In this paper, we report the phase-type HORN-8, which is based on our previous amplitude-type HORN-8. To distinguish between the two, we refer to the phase-type HORN-8 simply as HORN-8 and to the amplitude-type HORN-8 by its full name.

Fig. 1 Appearance of the HORN-8 board


The remainder of this paper is organized as follows. In Section 2, we introduce the brief history of the HORN-8 project. In Section 3, we describe the algorithms and architectures implemented in HORN-8. In Section 4, we describe the specification of the HORN-8 board and evaluate its performance. In Section 5, we conclude this work and discuss future work.

2. HORN-8 project

The HORN-8 project began in October 2012. We started with the detailed design of HORN-8 and then carried out board production and component mounting. The initial cost of the board production was 300,000 yen. We made the first board in 2013 and produced ten boards in total by 2014. The primary part costs were 120,000 yen for each calculation FPGA (Xilinx Virtex-5 XC5VLX110-2FF676C), 30,000 yen for the communication FPGA (Xilinx Virtex-5 XC5VLX30T-2FF665C), and 50,000 yen for the main circuit board including the implementation fee; with seven calculation FPGAs per board, the total cost of one HORN-8 board was thus about one million yen (approx. 10,000 U.S. dollars).

We then completed the development of the related software (e.g., a device driver) in 2015, and succeeded in operating a single HORN-8 board with the amplitude-type CGH calculation circuits. Then, we moved to the development phase to increase the scale of the system to a cluster. We succeeded in constructing an amplitude-type HORN-8 cluster system with two boards in 2016, and eight boards in 2017, which can produce 100 million pixels of amplitude-type CGH with 10 million point-light sources (PLSs) [27]. Finally, in 2018, we succeeded in producing a phase-type HORN-8 cluster system that we report in this paper.

3. Hardware design

3.1. Algorithm

A CGH is obtained by simulating, on a computer, the spherical waves emitted from the PLSs that constitute the 3D image and computing the complex amplitude distribution U(x_α, y_α) that those waves produce on the CGH plane, where (x_α, y_α) are the coordinates on the CGH.

Defining (x_j, y_j, z_j) as the coordinates of the j-th PLS out of a total of N_obj points, the complex amplitude distribution U(x_α, y_α) under the condition z_j ≫ x_j, y_j becomes

U (x_{α}, y_{α}) = \sum_{j = 1}^{N_{obj}} a_{j} \exp [i \frac{2 π p}{λ} {\frac{x_{α j}^{2} + y_{α j}^{2}}{2 | z_{j} |}}],

= \sum_{j = 1}^{N_{obj}} \exp [i 2 π Θ (x_{α j}, y_{α j})],

where x_αj = x_α − x_j, y_αj = y_α − y_j, i is the imaginary unit, λ is the wavelength of the light source, a_j is the intensity of the j-th PLS, which is set to 1 in HORN-8 for simplicity, and p is the sampling pitch; all coordinates are normalized by p. Note that all of the variables described in this paper are implemented in a fixed-point format.
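As a rough illustration of Eqs. (1) and (2), the following sketch evaluates U at one CGH pixel by direct summation in floating point. It is not the HORN-8 circuit, which uses fixed-point arithmetic; the function names, the sampling pitch, and the PLS coordinates are assumptions chosen for illustration only.

```python
import cmath
import math

def fresnel_phase(x_a, y_a, pls, p, lam):
    """Theta for one PLS: p * (x_aj^2 + y_aj^2) / (2 * lambda * |z_j|).

    Coordinates are dimensionless (normalized by p); p and lam are in metres.
    """
    x_j, y_j, z_j = pls
    dx, dy = x_a - x_j, y_a - y_j
    return p * (dx * dx + dy * dy) / (2.0 * lam * abs(z_j))

def complex_amplitude(x_a, y_a, points, p, lam):
    """U(x_a, y_a) with a_j = 1, summed over all PLSs."""
    return sum(cmath.exp(2j * math.pi * fresnel_phase(x_a, y_a, pt, p, lam))
               for pt in points)

# Illustrative values only (assumptions, not the paper's experimental setup).
P_PITCH = 8.0e-6       # sampling pitch p [m]
WAVELENGTH = 523e-9    # wavelength [m]
points = [(0, 0, 25000), (10, -5, 30000)]   # (x_j, y_j, z_j) in units of p

u = complex_amplitude(0, 0, points, P_PITCH, WAVELENGTH)
print(abs(u), cmath.phase(u))
```

A pixel located exactly in front of a single PLS has Θ = 0, so that PLS contributes exp(0) = 1 to U.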

The phase-type CGH is obtained by quantizing the argument of Eq. (1) as

s (x_{α}, y_{α}) = \left\lfloor \frac{\arg [U (x_{α}, y_{α})]}{2 π} \times b \right\rfloor,

where s is a pixel-value of CGH at (x_α, y_α), b is a quantization step which is set as 255 in this work, and arg[·] is the operator of taking an argument whose range of output is 0 to 2π.
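The quantization can be sketched as follows, assuming that the floor is applied after scaling the normalized argument by b, so that the pixel value is an integer between 0 and b − 1. The function name `quantize_phase` is hypothetical.

```python
import cmath
import math

def quantize_phase(u, b=255):
    """Map arg(u), taken in [0, 2*pi), to an integer pixel value in [0, b)."""
    frac = (cmath.phase(u) % (2.0 * math.pi)) / (2.0 * math.pi)
    if frac >= 1.0:   # guard against rounding when the phase is a tiny negative number
        frac = 0.0
    return math.floor(frac * b)

for u in (1 + 0j, 1j, -1 + 0j, -1j):
    print(u, quantize_phase(u))
```

Phases of 0, π/2, π, and 3π/2 map to 0, 63, 127, and 191, respectively.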

In the HORN-8 system, as in the conventional HORN systems, we pipelined the phase calculation in Eq. (2) using the recurrence relation algorithm [28]. The algorithm first defines the phase at a position separated by n pixels from the reference point (X_α, Y_α) as

Θ_{n} = \frac{p}{2 λ | z_{j} |} {{(X_{α j} + n)}^{2} + Y_{α j}^{2}},

X_{α j} = X_{α} - x_{j}, Y_{α j} = Y_{α} - y_{j} .

The recurrence relation between adjacent pixels becomes

Θ_{n} = Θ_{n - 1} + Δ_{n - 1},

Δ_{n - 1} = Δ_{0} + (n - 1) Γ,

where

Δ_{0} = \frac{p}{2 λ | z_{j} |} (2 X_{α j} + 1),

Γ = \frac{p}{λ | z_{j} |} .

According to the above equations, the algorithm reconfigures the phase calculation to only additions and subtractions; therefore, the HORN-8 system implements the phase calculation in a pipeline structure with a small circuit area so that it can operate at high frequency.
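The recurrence can be checked numerically. The sketch below is an illustration rather than the hardware implementation: it uses exact rational arithmetic, and the value of ρ = p/(2λ|z_j|) and the coordinates are arbitrary assumptions. It generates Θ_0 to Θ_319 using additions only and compares them with the direct quadratic formula.

```python
from fractions import Fraction

def theta_direct(rho, x_aj, y_aj, n):
    """Phase n pixels from the reference point, by the direct quadratic formula."""
    return rho * ((x_aj + n) ** 2 + y_aj ** 2)

def theta_recurrence(rho, x_aj, y_aj, n_pixels):
    """Generate Theta_0 .. Theta_{n_pixels-1}; the loop body uses additions only."""
    theta = rho * (x_aj ** 2 + y_aj ** 2)   # Theta_0, computed once
    delta = rho * (2 * x_aj + 1)            # Delta_0
    gamma = 2 * rho                         # Gamma = p / (lambda * |z_j|)
    out = []
    for _ in range(n_pixels):
        out.append(theta)
        theta += delta    # Theta_n = Theta_{n-1} + Delta_{n-1}
        delta += gamma    # Delta_n = Delta_{n-1} + Gamma
    return out

rho = Fraction(1, 4096)   # illustrative value of p / (2 * lambda * |z_j|)
row = theta_recurrence(rho, x_aj=-37, y_aj=12, n_pixels=320)
print(all(row[n] == theta_direct(rho, -37, 12, n) for n in range(320)))
```

The multiplications appear only in the initial terms, which is what allows the per-pixel units to be built from adders.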

In addition, we developed an approximation of the cosine and sine functions based on Nishitsuji’s approximation method [29]. It increases the degree of parallelism of the calculation circuit by eliminating the read-only memory and related resources (e.g., memory channels) that look-up tables of the cosine and sine functions would otherwise require. According to [29], CGHs created with Nishitsuji’s approximation method can reconstruct 3D images with sufficient image quality; thus, the approximation can replace the conventional LUT for the cos/sin functions.

In this approximation method, we first extract the 6 most significant bits of the fractional part of Θ and reinterpret them as a two’s complement value consisting only of fractional bits. We define this value as Θ_s, whose range is −0.5 ≤ Θ_s < 0.5. Treating Θ_s as a two’s complement fixed-point number with no integer bits, the approximated cosine function, c(Θ_s), and sine function, s(Θ_s), can be written as:

c (Θ_{s}) = \begin{cases} 0.25 - Θ_{s} & (Θ_{s} \geq 0), \\ 0.25 + Θ_{s} & (Θ_{s} < 0), \end{cases}

s (Θ_{s}) = \begin{cases} - 0.5 - Θ_{s} & (- 0.5 \leq Θ_{s} < - 0.25), \\ Θ_{s} & (- 0.25 \leq Θ_{s} < 0.25), \\ 0.5 - Θ_{s} & (0.25 \leq Θ_{s} < 0.5) . \end{cases}

Although the value ranges of s(Θ_s) and c(Θ_s), −0.25 ≤ s(Θ_s), c(Θ_s) ≤ 0.25, differ from those of the ground truths cos(2πΘ) and sin(2πΘ) (−0.5 ≤ Θ < 0.5), this does not affect the result, because the phase-type CGH outputs the argument obtained from the ratio between the real and imaginary parts of U(x_α, y_α), as in Eq. (3).
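The piecewise-linear approximation can be sketched in a few lines. The bit extraction in `to_theta_s` is an assumption about how the 6 fractional bits are taken (truncation, then two's complement reinterpretation); the helper names are hypothetical. The loop checks that the approximations have the same sign as cos(2πΘ_s) and sin(2πΘ_s) at all 64 representable values, skipping the zero crossings.

```python
import math

def to_theta_s(theta, bits=6):
    """Keep the top `bits` fractional bits of theta, read as two's complement in [-0.5, 0.5)."""
    scale = 1 << bits
    k = int(math.floor((theta - math.floor(theta)) * scale))   # 0 .. scale-1
    if k >= scale // 2:
        k -= scale
    return k / scale

def approx_cos(ts):
    """Triangular approximation of cos(2*pi*ts), scaled to [-0.25, 0.25]."""
    return 0.25 - ts if ts >= 0 else 0.25 + ts

def approx_sin(ts):
    """Triangular approximation of sin(2*pi*ts), scaled to [-0.25, 0.25]."""
    if ts < -0.25:
        return -0.5 - ts
    if ts < 0.25:
        return ts
    return 0.5 - ts

same_sign = True
for k in range(-32, 32):
    ts = k / 64
    for approx, true in ((approx_cos(ts), math.cos(2 * math.pi * ts)),
                         (approx_sin(ts), math.sin(2 * math.pi * ts))):
        if abs(true) > 1e-9 and (approx > 0) != (true > 0):
            same_sign = False
print(same_sign)
```

Because both functions are scaled by the same factor and share their zero crossings and peaks with the true functions, the quadrant of the accumulated complex amplitude is preserved.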

Figure 2 shows the comparison between the output of the cosine and sine functions by the conventional LUT as a ground truth and the output of this approximation method. Note that the value range of the ground truth is adjusted to the approximation method. According to Fig. 2, the approximate shape of c(Θ_s) and s(Θ_s) follows the ground truths, the positions of each peak completely match, and the phase value and the output values have a one-to-one correspondence within the range of Θ_s; therefore, the approximation method can be an alternative to the LUT.

Fig. 2 Cosine and Sine function approximation method


3.2. Implementation

Figure 3 shows the block diagram of the HORN-8 board. The host PC connected to the HORN-8 board controls the entire calculation process. Input and output data (e.g., commands, PLS data, and CGHs) are transferred via PCI-Express. The HORN-8 board has seven FPGAs for calculation, called calculation FPGAs, and one FPGA for control, called the communication FPGA, all of which are connected via a ring-bus. The calculation FPGAs share the CGH calculation among themselves, and the communication FPGA controls the ring-bus and the PCI-Express interface. Each calculation FPGA is allocated a portion of the CGH in units of rows, and the portions are processed in parallel; thus, every calculation FPGA stores the whole PLS data. The remainder of this section describes the details of the calculation FPGAs.

Fig. 3 Block diagram of HORN-8


Figure 4 shows the signal flow diagram of the calculation FPGAs. Each calculation FPGA acquires the PLS data and the allocated coordinates of the CGH, and outputs the calculated CGH via the ring-bus. Rx and Tx in Fig. 4 are the modules for reading and writing data from and to the ring-bus, respectively. Each calculation FPGA has a hierarchical structure, which is composed of HORN-CONTROL, a basic phase unit (BPU), an additional phase unit (APU), a complex amplitude unit (CAU), and other accompanying function blocks. In HORN-8, one calculation FPGA has one BPU and 319 APUs, i.e., one calculation FPGA calculates 320 pixels of CGH.

Fig. 4 Block diagram of calculation FPGA


HORN-CONTROL controls the overall CGH calculation. It consists of HORN-CORE, the CGH calculation circuit; a communication module that receives the PLSs and sends the calculated CGH (data receive control and data send control); and block random access memory (BRAM). In HORN-8, the maximum number of PLSs that can be processed at one time is 32,768 points owing to the BRAM capacity. Therefore, for a 3D model exceeding 32,768 points, HORN-8 divides the PLSs, calculates a CGH for each subset, and outputs the CGHs sequentially so that the afterimage effect visually integrates them into a single 3D image.
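The division of a large point cloud into passes can be illustrated as below. Only the 32,768-point limit comes from the text; the function name and the simple "fill each pass to capacity" policy are assumptions.

```python
def split_pls(n_obj, capacity=32768):
    """Sizes of the successive PLS batches when at most `capacity` points fit at once."""
    full, rest = divmod(n_obj, capacity)
    return [capacity] * full + ([rest] if rest else [])

print(split_pls(8000))    # [8000]
print(split_pls(40000))   # [32768, 7232]
print(split_pls(70000))   # [32768, 32768, 4464]
```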

HORN-CORE is the circuit block that performs the CGH calculation with the recurrence relation; it consists of the BPU, APU, CAU, a selector, and an arctangent table. The arctangent table is a two-input LUT of tan⁻¹[I/R], where R and I are the real and imaginary parts, respectively. HORN-CORE calculates 320 successive pixels of the CGH in a pipeline, given the start coordinate of the recurrence relation (X_α, Y_α) and the serially input PLS coordinates (x_j, y_j, ρ_j), where $ρ_{j} = \frac{p}{2 λ | z_{j} |}$ is pre-calculated to eliminate a division.

Figures 5–7 show the signal flow diagrams of the BPU, APU, and CAU, respectively. The numbers in those figures express the bit-length of each signal. The BPU calculates the initial term Θ₀, which is defined by Eq. (4). Each APU calculates Eq. (6); that is, the BPU and each of the APUs are each responsible for one of the 320 pixels allocated to each calculation FPGA. The CAU calculates the cosine and sine functions and accumulates the results as a complex amplitude distribution. After processing all of the PLS coordinates, the CAU normalizes the accumulated values and outputs them as U, which becomes the index of the arctangent table. In HORN-8, the first half of U is the real part and the latter half is the imaginary part.

Fig. 5 Signal flow diagram of the BPU


Fig. 6 Signal flow diagram of the APU


Fig. 7 Signal flow diagram of the CAU


The role of the normalization circuit is to reduce the scale of the arctangent table by shortening the bit-width of the index. The normalization circuit operates as follows: (a) it finds, for the real and imaginary parts respectively, the first position from the top bit at which a bit differs from the sign bit; (b) it extracts 5 bits based on the position found in (a). For example, when the bit-strings of the real and imaginary parts are “000 0000 1010 0100 0000” and “111 1111 1101 0011 0000,” the normalization circuit extracts 5 bits starting from the 7th bit of the real part and from the 9th bit of the imaginary part, i.e., the normalized bit-strings of the real and imaginary parts become “01010” and “10100,” respectively.
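The example above can be reproduced with a short sketch operating on bit-strings. Judging from the worked example, the 5-bit window appears to start one position before the first bit that differs from the sign bit, so that one sign bit is retained; this reading, the function name, and the handling of all-sign-bit inputs are assumptions. The final line concatenates the two results with the real part in the upper half, following the statement that the first half of U is the real part.

```python
def normalize(bits, width=5):
    """Return `width` bits starting one position before the first bit that differs from the sign bit."""
    sign = bits[0]
    first_diff = next((i for i, b in enumerate(bits) if b != sign), len(bits))
    start = max(0, min(first_diff - 1, len(bits) - width))
    return bits[start:start + width]

real = "000 0000 1010 0100 0000".replace(" ", "")
imag = "111 1111 1101 0011 0000".replace(" ", "")

r5, i5 = normalize(real), normalize(imag)
print(r5, i5)                    # 01010 10100
index = int(r5 + i5, 2)          # 10-bit index into the arctangent table
print(index)                     # 340
```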

4. Packaging and performance

4.1. Hardware specification

The HORN-8 board adopts PCI-Express Gen. 1 as the communication interface to the host PC and mounts seven calculation FPGAs (Xilinx Virtex-5 XC5VLX110-2-FF676) and one communication FPGA (Xilinx Virtex-5 XC5VLX30T-2-FF665). A single HORN-8 board can process 2,240 pixels in parallel because each of the seven calculation FPGAs calculates 320 pixels. This is half the parallelism of the amplitude-type HORN-8 [27], because the phase-type HORN-8 calculates both the real and imaginary parts simultaneously. The usage rates of the slices (logic cells) and Block RAM are both 98 %, and the operating frequency is 0.25 GHz.

Moreover, we constructed a cluster system with a maximum of eight HORN-8 boards. The cluster system consists of one master node PC, which does not mount a HORN-8 board, and four slave node PCs, each of which mounts two HORN-8 boards. We connected the PCs by Gigabit Ethernet and used MPICH3 for message passing.

4.2. Performance

4.2.1. Evaluation setup

For the performance evaluation, we implemented the CGH calculation on a central processing unit (CPU) and a GPU to compare the calculation time and image quality with those of HORN-8. The computer environment used for the comparison is as follows: OS: Windows 10 Enterprise 64-bit; CPU: Intel Core i7-6700K 4.00 GHz; GPU: NVIDIA GeForce GTX 1080Ti; memory: 16 GB; compiler (CPU): Intel C++ Compiler 17.0; compiler (GPU): NVIDIA CUDA Compiler driver 8.0. The resolution of the CGH is 1,920 × 1,080 pixels, and the number of PLSs in the 3D image is N_obj = 8,000.

For the CPU, we implemented the CGH calculation of Eq. (1) as a fixed-point program with the recurrence relation algorithm, look-up tables for the cos and sin functions, the built-in arctangent function of C++, and OpenMP for parallelization, for the comparison of computational speed. In addition, for the comparison of the image quality of the reconstructed images, we implemented the CGH calculation with no approximation as a double-precision floating-point program using the built-in trigonometric functions of C++.

For the GPU, we implemented the CGH calculation of Eq. (1) directly as a single-precision floating-point program; that is, we did not apply the recurrence relation and pipelined structure of HORN-8 to the GPU, because a GPU is suited to the simultaneous execution of independent operations, which is generally called single instruction multiple data. We adopted the following techniques for performance tuning: (a) using shared memory and (b) using built-in fast trigonometric functions. Although a GPU can execute parallel calculations at high speed, memory access often becomes a bottleneck for computational efficiency. To avoid this, we stored the PLS data in a cache memory, namely shared memory, which sits between the main memory of the GPU and the processors’ registers. We also did not apply Nishitsuji’s approximation to the cos and sin functions, because the GPU has special function circuits for the trigonometric functions, which are better optimized for a GPU program than Nishitsuji’s approximation method. Likewise, we did not apply a look-up table to the arctangent function but used the built-in arctangent function, because the memory of a GPU is limited and memory access speed often becomes a performance bottleneck.

4.2.2. Result

Table 1 compares the calculation times and frame rates of the CGHs created by the CPU, the GPU, and a single HORN-8 board. As shown in Table 1, the HORN-8 board is approximately 100 times faster than the CPU and about 1.03 times faster than the GPU. We also achieved a frame rate above the video rate (29.97 fps, the standard frame rate of the NTSC broadcasting system). Note that the GPU program is efficient enough for a fair comparison with HORN-8, as judged from indicators of NVIDIA’s profiling tool; for example, the achieved occupancy, which is the actual operation rate of the GPU’s processors, is 97%. The high achieved occupancy indicates that the threads of the GPU program are well distributed among the processors and that the program avoids inefficiencies such as memory-access stalls.

Table 1. Performance comparison between the HORN-8 board, CPU, and GPU


Figure 8 shows the performance of the HORN-8 cluster system. As shown in the figure, we succeeded in calculating the CGH of 32,000 PLSs at video rate. On the other hand, in the results for the cluster systems with six and eight HORN-8 boards, there is a section where the calculation time does not change smoothly. This is because, when N_obj exceeds 32,768, the calculation is divided and the results are transmitted separately owing to the limited memory capacity of the calculation FPGA, which increases the communication time.

Fig. 8 Performance of the HORN-8 cluster system


Figure 9 compares the reconstructed images of the CGHs with N_obj = 8,000, obtained by optical reconstruction and by numerical simulation of the CGHs created by HORN-8 and by the CPU with the double-precision floating-point program as the ground truth. We used a phase-modulated liquid crystal on silicon display (Holoeye PLUTO-2 Spatial Light Modulator) as the SLM and a high-intensity green light-emitting diode with wavelength λ = 523 nm as the optical source for optical reconstruction. We used the angular spectrum method with the CWO++ library [30] for numerical simulation. As shown in Fig. 9, the optically reconstructed image of the CGH output by HORN-8 matches both the numerical simulation result and the original 3D model well. The peak signal-to-noise ratio (PSNR) between the numerically simulated results shown in Figs. 9(c) and 9(d) is 29.88 dB. Since a PSNR of 30 dB or more is the criterion for good quality of a two-dimensional image [31], the image reconstructed with the HORN-8 system lies marginally below this threshold and can be considered close to good quality.

Fig. 9 Reconstructed images of CGHs: (a) Original point-cloud model (see Visualization 1), (b) Optically reconstructed images with CGH from the HORN-8 system, (c) and (d) numerically reconstructed images with CGH from the HORN-8 system and CPU with no approximation and double precision floating point calculation.


Finally, we describe the computational efficiency of the HORN-8 system. The theoretical value of the CGH calculation with the HORN-8 system T_horn is defined as follows [27]:

T_{horn} = \frac{N_{obj} \times N_{hol}}{f \times P \times C},

where N_hol is the total number of pixels of a CGH, f is the operating frequency, P is the number of pixels that one HORN-8 board can process in parallel, and C is the number of HORN-8 boards in a cluster system. We then define the computational efficiency E from T_horn and the total calculation time T_total needed to create the CGH, including the overhead time T_over, as follows:

E = \frac{T_{horn}}{T_{total}},

where T_total = T_horn + T_over. T_over is defined as T_over = T_in + T_out, where T_in and T_out are the times for data input and output, respectively. In the HORN-8 system, the input of the PLSs and the CGH calculation are performed sequentially, whereas the output of the calculated CGH is performed in parallel with the calculation because the data size of the CGH is much larger than that of the PLSs. Therefore, T_out is concealed and ideally becomes zero, and T_in becomes dominant in T_over.

Figure 10 compares the computational efficiency of the HORN-8 system for both single-board and cluster operation. As shown in Fig. 10, the computational efficiency of the HORN-8 system reaches over 90 % in the single-, two-, four-, and six-board systems and over 80 % in the eight-board cluster system. For example, in the single-board operation, T_horn = 29.62 ms and T_total = 30.01 ms, i.e., E = 98.7 %, when N_obj = 8,000 and N_hol = 1,920 × 1,080 pixels. In the cluster operation, T_horn = 29.62 ms and T_total = 31.51 ms, i.e., E = 94.0 %, in the four-board cluster system when N_obj = 32,000 and N_hol = 1,920 × 1,080 pixels.
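The quoted values can be reproduced from the definitions of T_horn and E. In the sketch below, f = 0.25 GHz and P = 2,240 are the board specifications given earlier, and the T_total values are the measured times quoted above; the function names are hypothetical.

```python
def t_horn(n_obj, n_hol, f_hz, p_parallel, c_boards):
    """Ideal calculation time in seconds: N_obj * N_hol / (f * P * C)."""
    return n_obj * n_hol / (f_hz * p_parallel * c_boards)

def efficiency(t_horn_s, t_total_s):
    """Computational efficiency E = T_horn / T_total."""
    return t_horn_s / t_total_s

N_HOL = 1920 * 1080
single = t_horn(8000, N_HOL, 0.25e9, 2240, 1)
four = t_horn(32000, N_HOL, 0.25e9, 2240, 4)

print(f"{single * 1e3:.2f} ms, E = {efficiency(single, 30.01e-3):.1%}")   # 29.62 ms, E = 98.7%
print(f"{four * 1e3:.2f} ms, E = {efficiency(four, 31.51e-3):.1%}")       # 29.62 ms, E = 94.0%
```

Both cases give the same T_horn because quadrupling the number of PLSs is offset by quadrupling the number of boards.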

Fig. 10 Comparison of the computational efficiency E


According to Fig. 10, the computational efficiency increases with the number of calculation points and converges to a constant value. In addition, as the number of boards in the cluster increases, the efficiency improves more gradually. The HORN-8 system divides the CGH calculation among the HORN-8 boards, so T_out is not concealed when the number of PLSs is small, i.e., T_out > T_horn + T_in. On the other hand, as seen in the results for the cluster systems with six and eight HORN-8 boards, the computational efficiency drops when the number of PLSs is around 40,000 and 60,000. When the CGH is created from more than N_obj = 32,768 PLSs, the PLSs must be transmitted in separate batches owing to the limited memory capacity of each calculation FPGA, and T_out increases with the number of batches; thus, T_out is not completely concealed in this situation. For example, in the case of N_obj = 40,000, HORN-8 divides the PLSs into batches of 32,768 and 7,232. According to Fig. 10, the computational efficiencies for N_obj = 32,768 and 7,232 are approximately 80% and 20%, respectively; thus, the overall efficiency is expected to be about 50%, which matches the result shown in Fig. 10. In the case of N_obj = 70,000, HORN-8 divides the calculation into two batches of 32,768 and the residual, so that the computational efficiency becomes higher than in the case of N_obj = 32,768.
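The estimate of about 50% for N_obj = 40,000 can be checked with a simple time-weighted model. The sketch assumes that the two batches are processed back to back, that T_horn is proportional to the number of PLSs, and that each batch runs at the approximate efficiency quoted above (80% and 20%); the function name is hypothetical.

```python
def combined_efficiency(batches):
    """Time-weighted efficiency of batches given as (n_obj, efficiency) pairs.

    T_horn is proportional to n_obj, so each batch takes n_obj / efficiency
    time units in total, and E = sum(T_horn) / sum(T_total).
    """
    ideal = sum(n for n, _ in batches)
    total = sum(n / e for n, e in batches)
    return ideal / total

e = combined_efficiency([(32768, 0.80), (7232, 0.20)])
print(f"{e:.1%}")   # 51.9%
```

Under these assumptions the combined efficiency is close to the roughly 50% stated in the text.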

5. Conclusion and future work

In this paper, we proposed the HORN-8 system, the latest model of the HORN series of special-purpose computers, for phase-type electro-holography. HORN-8 can reconstruct 3D videos composed of tens of thousands of PLSs at video rate, which could be applied to interactive systems such as televisions and telephones.

The performance of HORN-8 is approximately the same as that of the latest GPU; however, because the FPGAs mounted on HORN-8 are not the newest type, its performance will improve significantly when the latest FPGAs are used. For example, if we replace the FPGAs with the Xilinx Virtex UltraScale+ VU037P while keeping the same structure, we estimate that the performance of HORN-8 will increase 18 times, based on a comparison of the number of configurable logic block look-up tables (CLB LUTs). In other words, a single HORN-8 board could calculate the CGH created from N_obj = 140,000 PLSs above the video rate, and the cluster system with eight HORN-8 boards could calculate the CGH created from over N_obj = 570,000 PLSs. Since the performance of the current HORN-8 is almost the same as that of the latest GPU, this estimated performance is hard to reach with current GPUs, which suggests the superiority of the HORN-8 architecture.

Funding

Japan Society for the Promotion of Science (Grant-in-Aid No. 25240015)

References

1. D. Gabor, “A new microscopic principle,” Nature 161, 777–778 (1948).

2. M. E. Lucente, “Interactive computation of holograms using a look-up table,” J. Electron. Imaging 2, 28–34 (1993).

3. S.-C. Kim and E.-S. Kim, “Effective generation of digital holograms of three-dimensional objects using a novel look-up table method,” Appl. Opt. 47, D55–D62 (2008).

4. T. Nishitsuji, T. Shimobaba, T. Kakue, N. Masuda, and T. Ito, “Fast calculation of computer-generated hologram using the circular symmetry of zone plates,” Opt. Express 20, 27496–27502 (2012).

5. Y. Pan, X. Xu, S. Solanki, X. Liang, R. B. A. Tanjung, C. Tan, and T.-C. Chong, “Fast CGH computation using S-LUT on GPU,” Opt. Express 17, 18543–18555 (2009).

6. T. Shimobaba, N. Masuda, and T. Ito, “Simple and fast calculation algorithm for computer-generated hologram with wavefront recording plane,” Opt. Lett. 34, 3133–3135 (2009).

7. P. W. M. Tsang and T.-C. Poon, “Fast generation of digital holograms based on warping of the wavefront recording plane,” Opt. Express 23, 7667–7673 (2015).

8. S.-C. Kim, J.-H. Yoon, and E.-S. Kim, “Fast generation of three-dimensional video holograms by combined use of data compression and lookup table techniques,” Appl. Opt. 47, 5986–5995 (2008).

9. X.-B. Dong, S.-C. Kim, and E.-S. Kim, “MPEG-based novel look-up table for rapid generation of video holograms of fast-moving three-dimensional objects,” Opt. Express 22, 8047–8067 (2014).

10. T. Nishitsuji, T. Shimobaba, T. Kakue, and T. Ito, “Review of Fast Calculation Techniques for Computer-Generated Holograms With the Point-Light-Source-Based Model,” IEEE Trans. Ind. Inf. 13, 2447–2454 (2017).

11. T. Shimobaba, T. Kakue, and T. Ito, “Review of Fast Algorithms and Hardware Implementations on Computer Holography,” IEEE Trans. Ind. Inf. 12, 1611–1622 (2016).

12. K. Matsushima and S. Nakahara, “Extremely high-definition full-parallax computer-generated hologram created by the polygon-based method,” Appl. Opt. 48(34), H54–H63 (2009).

13. T. Yatagai, “Stereoscopic approach to 3-D display using computer-generated holograms,” Appl. Opt. 15(11), 2722–2729 (1976).

14. T. Mishina, M. Okui, and F. Okano, “Calculation of holograms from elemental images captured by integral photography,” Appl. Opt. 45(17), 4026–4036 (2006).

15. N. Masuda, T. Ito, T. Tanaka, A. Shiraki, and T. Sugie, “Computer generated holography using a graphics processing unit,” Opt. Express 14, 603–608 (2006).

16. Y. Ichihashi, R. Oi, T. Senoh, K. Yamamoto, and T. Kurita, “Real-time capture and reconstruction system with multiple GPUs for a 3D live scene by a generation from 4K IP images to 8K holograms,” Opt. Express 20, 21645–21655 (2012).

17. H. Niwase, N. Takada, H. Araki, H. Nakayama, A. Sugiyama, T. Kakue, T. Shimobaba, and T. Ito, “Real-time spatiotemporal division multiplexing electroholography with a single graphics processing unit utilizing movie features,” Opt. Express 22, 28052–28057 (2014).

18. K. Murano, T. Shimobaba, A. Sugiyama, N. Takada, T. Kakue, M. Oikawa, and T. Ito, “Fast computation of computer-generated hologram using Xeon Phi coprocessor,” Comput. Phys. Commun. 185, 2742–2757 (2014).

19. Z.-Y. Pang, Z.-X. Xu, Y. Xiong, B. Chen, H.-M. Dai, S.-J. Jiang, and J.-W. Dong, “Hardware architecture for full analytical Fraunhofer computer-generated holograms,” Opt. Eng. 54, 095101 (2015).

20. T. Ito, T. Yabe, M. Okazaki, and M. Yanagi, “Special-purpose computer HORN-1 for reconstruction of virtual image in three dimensions,” Comput. Phys. Commun. 82(2), 104–110 (1994).

21. T. Ito, H. Eldeib, K. Yoshida, S. Takahashi, T. Yabe, and T. Kunugi, “Special-purpose computer for holography HORN-2,” Comput. Phys. Commun. 93(1), 13–20 (1996).

22. T. Shimobaba, N. Masuda, T. Sugie, S. Hosono, S. Tsukui, and T. Ito, “Special-purpose computer for holography HORN-3 with PLD technology,” Comput. Phys. Commun. 130(1), 75–82 (2000).

23. T. Shimobaba, S. Hishinuma, and T. Ito, “Special-purpose computer for holography HORN-4 with recurrence algorithm,” Comput. Phys. Commun. 148(2), 160–170 (2002).

24. T. Ito, N. Masuda, K. Yoshimura, A. Shiraki, T. Shimobaba, and T. Sugie, “Special-purpose computer HORN-5 for a real-time electroholography,” Opt. Express 13(6), 1923–1932 (2005).

25. Y. Ichihashi, H. Nakayama, T. Ito, N. Masuda, T. Shimobaba, A. Shiraki, and T. Sugie, “HORN-6 special-purpose clustered computing system for electroholography,” Opt. Express 17(16), 13895–13903 (2009).

26. N. Okada, D. Hirai, Y. Ichihashi, A. Shiraki, T. Kakue, T. Shimobaba, N. Masuda, and T. Ito, “Special-purpose computer HORN-7 with FPGA technology for phase modulation type electro-holography,” IDW/AD’12 Proc. Int. Display Workshops 3Dp–26 (2012).

27. T. Sugie, T. Akamatsu, T. Nishitsuji, R. Hirayama, N. Masuda, H. Nakayama, Y. Ichihashi, A. Shiraki, M. Oikawa, N. Takada, Y. Endo, T. Kakue, T. Shimobaba, and T. Ito, “High-performance parallel computing for next-generation holographic imaging,” Nature Electronics 1(4), 254–259 (2018).

28. T. Shimobaba and T. Ito, “An efficient computational method suitable for hardware of computer-generated hologram with phase computation by addition,” Comput. Phys. Commun. 138, 44–52 (2001).

29. T. Nishitsuji, T. Shimobaba, T. Kakue, D. Arai, and T. Ito, “Simple and fast cosine approximation method for computer-generated hologram calculation,” Opt. Express 23, 32465–32470 (2015).

30. T. Shimobaba, J. Weng, T. Sakurai, N. Okada, T. Nishitsuji, N. Takada, A. Shiraki, N. Masuda, and T. Ito, “Computational wave optics library for C++: CWO++ library,” Comput. Phys. Commun. 183, 1124–1138 (2012).

31. R. Gomes, W. Junior, E. Cerqueira, and A. Abelem, “A QoE Fuzzy Routing Protocol for Wireless Mesh Networks,” in Future Multimedia Networking (Springer Berlin Heidelberg, 2010), pp. 1–12.

System	Time per hologram [ms]	Speed-up ratio	Frame rate [fps]
CPU: Intel Core i7-6700K, 4 cores, 4 GHz	2,975	1.0	0.336
GPU: NVIDIA GTX 1080Ti, 3,584 cores, 1.48 GHz	30.83	96.5	32.4
HORN-8 board (single): 2,240 parallel (cores), 0.25 GHz	30.01	99.1	33.3

Optics Express