Dual-stage hybrid network for single-shot fringe projection profilometry based on a phase-height model

Abstract

Single-shot fringe projection profilometry (FPP) is widely used in dynamic optical 3D reconstruction because of its high accuracy and efficiency. However, traditional single-shot FPP methods are unsatisfactory when reconstructing complex scenes containing noise and discontinuous objects. This paper therefore proposes a Deformable Convolution-Based HINet with Attention Connection (DCAHINet), a dual-stage hybrid network with a deformation extraction stage and a depth mapping stage. Specifically, a deformable convolution module and an attention gate are introduced into DCAHINet to enhance feature extraction and feature fusion, respectively. In addition, to address the long-standing problem of the insufficient generalization of deep-learning-based single-shot FPP methods across different hardware devices, DCAHINet outputs the phase difference, which can be converted into 3D shapes by a simple multiplication, rather than outputting 3D shapes directly. To the best of the authors' knowledge, DCAHINet is the first network that can be applied to different hardware devices. Experiments on virtual and real datasets show that the proposed method is superior to other deep learning and traditional methods and can be used in practical application scenarios.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In the field of optical measurement, fringe projection profilometry (FPP) [1], a high-precision active 3D imaging method, has been widely used in industrial measurement, cultural relic reconstruction, reverse engineering, medical diagnosis, etc. An FPP system comprises a projector that projects fringe patterns onto the object surface and a camera that captures, from a different perspective, the fringe patterns scattered by the object surface. The phase value of each pixel can then be obtained by phase retrieval and phase unwrapping, after which the optical triangulation method or the phase-height method [2] can reconstruct the 3D shape of the object. To ensure the accuracy of 3D reconstruction, multi-shot FPP methods project multiple fringe patterns with different frequencies or phase shifts to accurately calculate the unwrapped phase. However, the dynamic performance and computational efficiency of multi-shot FPP methods are barely satisfactory. Therefore, single-shot FPP methods, which only need to project one fringe pattern, are widely used for real-time online inspection and rapid modeling of moving objects. The earliest single-shot FPP method used the Fourier transform for phase retrieval and is hence also known as Fourier transform profilometry (FTP). Although FTP methods have been studied for many years [3,4], their reconstruction accuracy is still far inferior to that of multi-shot FPP methods [5–7]. In particular, when there are discontinuous objects in the scene, it is infeasible to reconstruct objects at different depths from a single fringe pattern, since single-shot FPP methods cannot resolve the discontinuous phase from spatial information alone. This, however, is not a problem for deep learning methods, because they do not need to compute the phase at all [8]. In summary, with the introduction of deep learning techniques into the field of FPP, a high-precision and highly dynamic single-shot FPP method has become possible.

Early work goes back to the pioneering approach of Feng et al. [9] in 2019, and since then deep learning methods have been widely applied to image preprocessing [10,11], high-reflection avoidance [12,13], phase retrieval [9,14], and phase unwrapping [15–17] in the field of FPP. To fully exploit the advantages of deep learning, multi-shot FPP methods based on deep learning achieve end-to-end 3D shape output [18,19], greatly improving the efficiency and accuracy of FPP. This work focuses on implementing end-to-end single-shot FPP methods based on deep learning, so multi-shot and non-end-to-end FPP methods are not elaborated further.

With the rapid development of deep learning technology in the field of FPP, achieving a high-precision, high-efficiency, and high-robustness single-shot FPP method has become feasible [20]. Machineni et al. [21] proposed a two-stage end-to-end deep learning framework in which the first stage used a Deep3D network to generate a reference fringe pattern, and the second stage used a stereo matching network to reconstruct 3D shapes from the reference and deformed fringe patterns. Although the two-stage network greatly improves accuracy, the ground truths of the two stages are different, which increases the difficulty of data acquisition. Van der Jeught and Dirckx [22] trained a convolutional neural network (CNN) on simulated fringe patterns to implement an end-to-end single-shot FPP method. Later, they [23] replaced the CNN with UNet and proposed a mixed gradient loss function to improve the accuracy of 3D reconstruction. To find the most suitable network architecture for single-shot FPP, Nguyen et al. [8,24] investigated the impact of different network structures such as FCN, AEN, and UNet on reconstruction performance and reported that UNet achieved the best results. Since then, UNet has been widely used in single-shot FPP methods and has achieved considerable reconstruction performance [25,26].

To achieve accurate 3D reconstruction, deep learning networks must accurately extract phase information, which is related not only to local fringe patterns but also to global image features. However, the convolution operation in UNet is a local operation with a small receptive field, which limits the performance of UNet in single-shot FPP. Although the receptive field can be enlarged by increasing the depth of the network, a deeper network not only introduces overfitting but also leads to the loss of context information. To tackle this problem, Nguyen et al. [27] introduced a global guidance path with multi-scale feature fusion into UNet, known as HNet. Similarly, Wang et al. [28] proposed a dual-path hybrid network based on UNet, which removes the deepest convolutional layers to avoid overfitting and introduces a swin transformer path to improve global perception. They later [29] improved the loss function by incorporating gradient-based structural similarity to enhance reconstruction details. Although swin transformers can greatly enlarge the receptive field of a network, they usually require larger datasets to train well. In addition, to the authors' knowledge, all existing networks are trained on a dataset generated by a single real or virtual FPP hardware device; because the parameters of each FPP device differ, the trained models cannot be generalized to other hardware devices.

To improve the reconstruction accuracy and generalization ability of single-shot FPP networks, this paper proposes the Deformable Convolution-Based HINet with Attention Connection (DCAHINet), which uses HINet from the image restoration field as its backbone [30]. Compared with UNet, HINet not only introduces a half-instance normalized residual block (HIN ResBlock) into each convolutional layer but also adopts a multi-stage fusion module. HINet therefore has greater global perception capability than UNet, which is why this work chooses it as the backbone. To further enhance the global perception and feature fusion capability of DCAHINet, this work introduces the attention gate [31], which improves information transfer and feature fusion between the encoder and decoder. In addition, because fringe pattern deformations are diverse in complex scenes, DCAHINet replaces ordinary convolution with deformable convolution (DC) [32], which flexibly combines local and global contextual information. To improve the generalization ability of a single-shot FPP network across different hardware devices, DCAHINet outputs phase difference values, which can be converted into 3D shapes by simply multiplying by the phase-height coefficient (calibrated with the phase-height method), rather than directly outputting 3D shapes. This allows the network to be applied to hardware devices with different parameters, such as focal length, baseline distance, distance to the tested object, and angle between visual axes. Accordingly, it is necessary to generate a single-shot FPP dataset covering different hardware device parameters. A dataset generation method based on the phase-height model is proposed in this paper to convert a public depth image dataset [33] into deformed fringe patterns; the deformed fringe patterns and depth maps together form the single-shot FPP dataset. In summary, the contributions of this paper are as follows:

  • 1. A novel end-to-end single-shot FPP network is proposed, which takes HINet as the backbone and incorporates a deformable convolution block and an attention gate to further enhance its ability to extract global contextual information and describe phase difference information.
  • 2. Similar to parallax in binocular reconstruction, the proposed network outputs phase difference values end-to-end, from which 3D shapes can be reconstructed through a simple multiplication, avoiding the influence of specific hardware devices. In addition, a dataset suitable for different hardware devices and its generation method are proposed.
  • 3. Extensive experiments demonstrate that the accuracy and efficiency of our method are superior to those of UNet [34], HNet [27], and traditional methods.

2. Network architecture

To propose a network model that is better suited to single-shot FPP, this paper takes HINet [30] from the field of image restoration as the backbone and further proposes DCAHINet. HINet is a multi-stage network composed of multiple UNets, which divides a complex task into several simpler ones that are solved progressively. Although HINet has achieved good results in image restoration, several factors still limit its performance in single-shot FPP. First, HINet uses ordinary convolutional kernels, whose fixed shape is difficult to adapt to different types of fringe pattern deformation. Second, stacking too many convolutional layers without feature fusion leads to the loss of spatial contextual details. Third, a filtering module is needed to ensure that the output depth map is smooth. To tackle these problems, DCAHINet introduces deformable convolutions, attention gates, and a Kornia filtering module on top of HINet.

2.1 Main backbone

DCAHINet is a dual-stage hybrid network with a deformation extraction stage and a depth mapping stage, as shown in Fig. 1. The first stage extracts the pixel-wise deformation of the fringe pattern, while the second stage establishes the relationship between the fringe deformation and the depth of the object. Both stages take the same deformed fringe pattern as input, and the final output of the network is the phase difference. Both stages are built on an encoder-decoder network similar to UNet [34].

Fig. 1. The proposed DCAHINet, which consists of deformation extraction stage and depth mapping stage. The input of this network is the deformed fringe pattern and the output is the phase difference.

At the beginning of each stage, the initial deformation features of the fringe pattern are extracted by a deformable convolution block [32], which consists of a 5 × 5 deformable convolution, batch normalization, and a rectified linear unit. The DC block can extract different types of deformation by automatically changing the shape of the convolutional kernel, enabling deformation features to be extracted from a more global perspective. The initial deformation features are then fed into an encoder-decoder architecture with four down-samplings and three up-samplings. In the encoder, the half-instance normalized residual block (HIN ResBlock) [30] is adopted to extract features at different scales from the initial deformation features. The HIN ResBlock consists of two 3 × 3 convolutional layers, and its normalization combines instance normalization and batch normalization to keep each image instance independent while retaining scale information. In the decoder, transposed convolution with a kernel size of 2 is used for up-sampling. To fuse encoder features directly into the decoder, an attention gate [31] is introduced to compensate for the information lost during resampling. Compared with a simple skip connection, it assigns different attention to different channel features, achieving efficient information fusion and noise filtering. A minimal sketch of the HIN ResBlock is given below.
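
As a concrete illustration of the building block described above, the following is a minimal PyTorch sketch of an HIN ResBlock (two 3 × 3 convolutions with the normalization split between instance and batch normalization across the channels). The channel split, activation placement, and 1 × 1 residual projection are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HINResBlock(nn.Module):
    """Half-instance normalized residual block, sketched from the description
    above; layer sizes and the residual projection are illustrative."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.half = out_ch // 2
        # One half of the channels is instance-normalized (keeps each image
        # instance independent); the other half is batch-normalized.
        self.inorm = nn.InstanceNorm2d(self.half, affine=True)
        self.bnorm = nn.BatchNorm2d(out_ch - self.half)
        self.act = nn.ReLU(inplace=True)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # residual projection

    def forward(self, x):
        y = self.conv1(x)
        a, b = torch.split(y, [self.half, y.shape[1] - self.half], dim=1)
        y = self.act(torch.cat([self.inorm(a), self.bnorm(b)], dim=1))
        y = self.act(self.conv2(y))
        return y + self.skip(x)
```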

Although the two stages share the same network structure, they serve different purposes and are combined to output the phase difference. Moreover, instead of simply cascading the two stages, DCAHINet uses the cross-stage feature fusion (CSFF) module [35] and the supervised attention module (SAM) [35] to facilitate information transfer between them. SAM provides ground-truth supervisory signals that are useful for deformation feature extraction. Through SAM, the useful deformation features of the first stage propagate to the second stage for depth mapping, while less informative features are suppressed by attention masks. The CSFF module transfers the multi-scale deformations extracted at different layers of the first stage to the corresponding layers of the second stage, which enriches the multi-scale deformation features of the depth mapping stage. A sketch of such a supervised attention module is shown below.
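
For illustration, the following is a minimal sketch of how such a supervised attention module can be realized, following the general structure of the SAM in [35]: the stage-1 features yield an intermediate prediction that is supervised by the ground truth, and an attention mask derived from that prediction gates the features passed on to stage 2. The layer sizes and exact gating are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SupervisedAttentionModule(nn.Module):
    """Sketch of a supervised attention module in the spirit of [35]."""
    def __init__(self, n_feat: int):
        super().__init__()
        self.to_pred = nn.Conv2d(n_feat, 1, 3, padding=1)       # intermediate prediction
        self.to_feat = nn.Conv2d(n_feat, n_feat, 3, padding=1)
        self.to_mask = nn.Conv2d(1, n_feat, 3, padding=1)

    def forward(self, feat):
        pred = self.to_pred(feat)                  # supervised by the ground-truth phase difference
        mask = torch.sigmoid(self.to_mask(pred))   # attention mask derived from the prediction
        gated = self.to_feat(feat) * mask          # suppress less informative features
        return feat + gated, pred                  # features for stage 2, intermediate output
```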

After the two-stage features are fused, a 3 × 3 deformable convolutional layer maps the features to the phase difference. To output a smoother phase difference, filtering is performed using the Kornia [36] module before the final output. The Kornia module is a small filtering network with higher training speed and GPU utilization than other filtering methods. Finally, the phase difference output by DCAHINet can be used to calculate depth values based on the phase-height method. The relationship between the phase difference and the height can be established according to the specific hardware device:

$$depth = \Delta \phi_x \cdot \frac{d}{l} \cdot \frac{p}{2\pi},$$
where $\Delta \phi_x$ represents the phase difference between the reference and the deformed fringe pattern, d represents the height of the camera and projector above the reference plane, l represents the distance between the camera and the projector, and p represents the fringe width. The parameters d, l, and p can be calibrated in the phase-height model or measured manually if the system is fixed. It is worth mentioning that DCAHINet only needs a deformed fringe pattern as input to obtain the phase difference; no additional reference fringe pattern is required. This is because, in all dataset images and physical images, the tested object occupies only a portion of the field of view, and the background fringes in each deformed fringe pattern represent the corresponding reference fringe pattern. Owing to the uniformity of the fringe pattern, deep learning methods can easily infer the entire reference fringe pattern from the background fringe information, so inputting the reference fringe pattern is unnecessary. A minimal sketch of the depth conversion is given below; detailed descriptions of the deformable convolution and the attention gate follow.
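
To make the post-processing step concrete, the following sketch applies Eq. (1) to a phase-difference map. The d/l ratio and fringe width p used here are illustrative, not calibrated values.

```python
import numpy as np

def phase_difference_to_depth(delta_phi, d_over_l, fringe_pitch_p):
    """Apply Eq. (1): depth = delta_phi * (d / l) * p / (2*pi).
    Units follow the calibration (e.g., mm if p is the fringe width in mm)."""
    return delta_phi * d_over_l * fringe_pitch_p / (2.0 * np.pi)

# Illustrative usage with assumed calibration values d/l = 5 and p = 10 mm.
delta_phi = np.zeros((128, 160), dtype=np.float32)   # stands in for the network output
depth_mm = phase_difference_to_depth(delta_phi, d_over_l=5.0, fringe_pitch_p=10.0)
```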

2.2 Deformable convolution

Deformable convolution is used for feature extraction at the beginning of each stage and for depth mapping at the end. DC can change the shape of the convolution kernel according to the deformation features at different scales, so it can flexibly use spatial context information as the situation requires.

DC adaptively changes the shape of the convolutional kernel by adding 2D offsets to the sampling positions on the feature map. The offsets are learned from the preceding feature maps through additional convolutional layers. The offset map has the same spatial resolution as the input feature and contains two offset channels, along the H and W directions, for each kernel sampling position. The offsets are then applied to the sampling locations of an ordinary convolution, which is equivalent to performing convolution with a deformed kernel. Nonetheless, because the implementation of DC is more complex than ordinary convolution and consumes considerable extra memory to record offset positions, this work only uses DC at the beginning and end of each stage. The effectiveness of the DC module is verified in Section 4. A sketch using the deformable convolution operator available in torchvision is given below.
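
A minimal sketch of such a deformable convolution block, built on the DeformConv2d operator from torchvision. The offset-predicting convolution and the 5 × 5 kernel follow the description above; the channel counts are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCBlock(nn.Module):
    """Deformable convolution block: offset prediction + 5x5 deformable
    convolution, followed by batch normalization and ReLU."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 5):
        super().__init__()
        # Two offsets (along H and W) for each of the k*k kernel sampling points.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        offsets = self.offset_conv(x)            # learned kernel-shape offsets
        return self.relu(self.bn(self.deform_conv(x, offsets)))
```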

2.3 Attention gate

DCAHINet replaces the skip connections in the encoder-decoder architecture with attention gates [31], as shown in Fig. 1. The attention gate ensures that only information useful for depth reconstruction (e.g., texture and edges) is passed to the decoder layer, while useless information (e.g., noise) is suppressed.

The attention gate is shown in Fig. 2. It has two inputs: ${A_{in1}}$, representing the encoder layer, and ${A_{in2}}$, representing the upsampled decoder layer. First, a 1 × 1 convolution is applied to ${A_{in1}}$ and ${A_{in2}}$ so that they have the same number of channels, and they are then summed pixel by pixel. This operation enhances signals in the same region of interest without neglecting detailed information. The summed feature map is then passed through convolution and normalization layers to obtain the attention coefficients, referred to as the $mask$. ${A_{mid}}$ is the element-wise product of the input feature map ${A_{in1}}$ and the attention coefficients $mask$, so that different positions receive different attention. Finally, ${A_{mid}}$ and ${A_{in2}}$ are concatenated along the channel dimension and convolved to obtain ${A_{out}}$.
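
A minimal PyTorch sketch of this attention gate follows. The intermediate channel size, the ReLU placement, and the final fusion convolution are assumptions; the data flow matches the description above.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention gate: project A_in1 (encoder) and A_in2 (decoder) to a common
    channel count, sum them, derive a mask, re-weight A_in1, then fuse."""
    def __init__(self, enc_ch: int, dec_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        self.proj_enc = nn.Conv2d(enc_ch, mid_ch, 1)
        self.proj_dec = nn.Conv2d(dec_ch, mid_ch, 1)
        self.to_mask = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, 1),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),            # attention coefficients in [0, 1]
        )
        self.fuse = nn.Conv2d(enc_ch + dec_ch, out_ch, 3, padding=1)

    def forward(self, a_in1, a_in2):
        mask = self.to_mask(self.proj_enc(a_in1) + self.proj_dec(a_in2))
        a_mid = a_in1 * mask                       # re-weighted encoder features
        return self.fuse(torch.cat([a_mid, a_in2], dim=1))
```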

Fig. 2. The proposed attention gate is used for skip connections to pass information efficiently between the encoder layers and decoder layers.

3. Dataset generation

The generation of large-scale, accurate, and realistic datasets is key to deep-learning-based single-shot FPP methods. Although some FPP datasets already exist [25,26,37], most of them provide wrapped or unwrapped phases as output and still require complex post-processing to obtain depth values. A few single-shot FPP datasets use depth values directly as ground truth and can support end-to-end depth output. However, these datasets are collected on a single optical system, so the trained model cannot be generalized to other hardware devices. To make the trained network model applicable to different hardware devices, we propose a dataset that uses the phase difference as the ground truth; the subsequent depth values can then be obtained with a simple multiplication. Similar to parallax in binocular reconstruction, the phase difference avoids the influence of different hardware devices.

To quickly generate datasets with different hardware parameters, this paper proposes a simulation method that generates a large amount of data. The phase-height method [2] is generally used to reconstruct object depth maps from phase differences. This paper applies the inverse process of the phase-height method to generate the dataset: the known depth map is used to calculate the phase difference, from which the corresponding deformed fringe pattern is generated.

Large-scale and realistic depth maps are the foundation of single-shot FPP dataset generation. The Dtu_Training dataset [33], obtained by shooting 128 objects from 48 perspectives, is chosen as the source of depth maps. In this way, 6144 depth maps with a size of 128 × 160 pixels and a height range of 0 to 500 mm are obtained. The depth value of each pixel can then be converted into a phase difference through Eq. (1). Of course, the ratio d/l must be pre-defined; to make the generated dataset suitable for different hardware devices, this work uses d/l values from 3 to 10 to generate phase differences. It is worth mentioning that the d/l value can be set according to the actual optical hardware layout, and the d/l value of commonly used layouts is generally between 3 and 10. A sketch of this inverse conversion is given below.
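
The inverse conversion is a single element-wise operation; a minimal sketch, in which the fringe width p is an assumed calibration value chosen consistently with the fringe synthesis that follows:

```python
import numpy as np

def depth_to_phase_difference(depth_map, d_over_l, fringe_pitch_p):
    """Inverse of Eq. (1): convert a known depth map into the per-pixel phase
    difference used as ground truth. d_over_l is drawn from the 3-10 range."""
    return depth_map * 2.0 * np.pi / (d_over_l * fringe_pitch_p)
```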

To obtain a deformed fringe pattern, a standard cosine fringe pattern is used as the reference fringe pattern. Based on the reference fringe pattern, the deformed fringe patterns can be synthesized by adding the corresponding phase differences pixel by pixel:

$$I(x,y) = A(x,y) + B(x,y)\cos\left( \frac{2\pi x}{T} + \Delta \phi_x \right),$$
where $A({x,y} )$ represents the background light intensity and $B({x,y} )$ represents the modulated light intensity. To make the synthesized dataset more realistic, the parameters $A({x,y} )$, $B({x,y} )$, and T must be as close to the actual situation as possible. The fringe period T is set to 8, which is a commonly used value when capturing small-resolution images with actual hardware. To ensure authenticity, the background light intensity $A({x,y} )$ and the modulated light intensity $B({x,y} )$ are set to random values between 140 and 160 and between 70 and 90, respectively, which were determined after several observations of actually captured deformed fringe patterns. To make the dataset more practical, random Gaussian noise is added to the deformed fringe patterns. The generated dataset contains complex scenes with discontinuous objects, including fruits, dolls, sculptures, plants, etc., as shown in Fig. 3. The dataset is divided into training (60%, 3744), test (20%, 1200), and validation (20%, 1200) subsets. A sketch of the fringe synthesis is given below.
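
A minimal sketch of this synthesis step following Eq. (2). Drawing A and B as per-image scalars and the Gaussian noise level are simplifying assumptions.

```python
import numpy as np

def synthesize_deformed_fringe(delta_phi, period_T=8, noise_std=2.0, rng=None):
    """Render a deformed fringe pattern from a phase-difference map via Eq. (2),
    using the intensity ranges quoted above (A in 140-160, B in 70-90) and
    additive Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = delta_phi.shape
    x = np.arange(w)[None, :].repeat(h, axis=0)          # pixel column index
    A = rng.uniform(140, 160)                            # background intensity
    B = rng.uniform(70, 90)                              # modulation intensity
    fringe = A + B * np.cos(2 * np.pi * x / period_T + delta_phi)
    fringe += rng.normal(0.0, noise_std, size=fringe.shape)
    return np.clip(fringe, 0, 255).astype(np.uint8)
```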

Fig. 3. A sample dataset which contains multiple model types to ensure data diversity and network reliability.

In addition to the synthesized dataset, a real dataset is also required, so the structured light 3D reconstruction system shown in Fig. 4 was built, consisting of a projector (resolution: 912 × 1140) and a camera (resolution: 2048 × 2448). Fringe patterns of the objects were collected, and the corresponding ground-truth depth maps were calculated by the twelve-step phase-shift and three-frequency heterodyne method. A sketch of the standard N-step phase-shift retrieval underlying this ground truth is given below.
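
For reference, the following is a minimal sketch of the standard N-step phase-shift formula that underlies the ground-truth computation; the sign convention is an assumption, and the three-frequency heterodyne unwrapping and system calibration are omitted.

```python
import numpy as np

def wrapped_phase_nstep(images):
    """Standard N-step phase-shift retrieval: `images` holds N fringe images
    captured with phase shifts of 2*pi*n/N."""
    N = len(images)
    shifts = 2.0 * np.pi * np.arange(N) / N
    num = sum(I * np.sin(s) for I, s in zip(images, shifts))
    den = sum(I * np.cos(s) for I, s in zip(images, shifts))
    return np.arctan2(-num, den)   # wrapped phase in (-pi, pi]
```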

Fig. 4. The structured light 3D reconstruction system.

4. Experiments and results

The proposed network was trained on an NVIDIA GeForce RTX 3090 for 300 epochs, which took about 20 hours. The initial learning rate was 0.0001, and the Adam optimizer was used with β1 = 0.9 and β2 = 0.999. During training, the batch size was set to 16 and the loss function from [23] was used, with the L1 loss in [23] replaced by the smooth L1 loss because of its robustness and low sensitivity to outliers. The loss is defined as follows:

$$Loss = L1( gt_i, est_i ) + ( 1 - SSIM( gt_i, est_i ) ) + MGE( gt_i, est_i ),$$
where L1 denotes the smooth L1 loss between the estimated value $es{t_i}$ and the ground truth $g{t_i}$; SSIM denotes the structural similarity, which measures brightness, contrast, and structural information; and MGE is the mixed gradient error (MixGE), i.e., the local gradient error of the depth map. A minimal sketch of this loss is given below.
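
A minimal sketch of the combined loss. The finite-difference gradient term is a simple stand-in for the MixGE of [23], and `ssim_fn` is any differentiable SSIM implementation assumed to be provided externally.

```python
import torch
import torch.nn.functional as F

def mixed_gradient_error(gt, est):
    """Mean absolute difference of horizontal and vertical image gradients,
    a simple realization of the mixed gradient error term."""
    def grads(x):
        gx = x[..., :, 1:] - x[..., :, :-1]
        gy = x[..., 1:, :] - x[..., :-1, :]
        return gx, gy
    gx_gt, gy_gt = grads(gt)
    gx_est, gy_est = grads(est)
    return (gx_gt - gx_est).abs().mean() + (gy_gt - gy_est).abs().mean()

def total_loss(gt, est, ssim_fn):
    """Eq. (3): smooth L1 + (1 - SSIM) + MGE, where ssim_fn returns a
    similarity value in [0, 1]."""
    return (F.smooth_l1_loss(est, gt)
            + (1.0 - ssim_fn(est, gt))
            + mixed_gradient_error(gt, est))
```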

In this section, the proposed method is evaluated in the following aspects. First, ablation experiments are performed to demonstrate the effectiveness of (i) the attention gate, (ii) the deformable convolution block, and (iii) the dual-stage design in our network. Second, the reconstruction effectiveness of the proposed network is verified for (i) different depth ranges and (ii) different hardware devices. Then, comparison experiments are conducted against (i) a traditional method (phase shift plus multi-frequency heterodyne), (ii) UNet, and (iii) HNet in complex scenes (e.g., scenes with discontinuous objects and noisy scenes). Finally, the generalization ability of the network is verified on unfamiliar datasets, and the performance of the proposed method is verified in practical application scenarios. It should be noted that all ground truths required by the experiments are calculated using the twelve-step phase-shift and three-frequency heterodyne method.

The mean absolute error (MAE), which directly reflects the reconstruction accuracy, is used as the quantitative evaluation metric in this paper:

$$MAE = \frac{\sum_{i = 1}^{n} |gt_i - est_i|}{n},$$
where n is the total number of pixels in the input image, $est$ represents the predicted depth map, and $gt$ represents the ground-truth depth map. A small helper implementing this metric is sketched below.
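
A short helper implementing Eq. (4); the optional validity mask is an added convenience, not part of the original definition.

```python
import numpy as np

def mae_mm(gt_depth, est_depth, mask=None):
    """Eq. (4): mean absolute error over all evaluated pixels."""
    err = np.abs(gt_depth - est_depth)
    return float(err[mask].mean()) if mask is not None else float(err.mean())
```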

4.1 Ablation experiments

To verify the effectiveness of the attention gate, the deformable convolution, and the dual-stage design, subnets 1 to 3 were constructed by removing the corresponding modules from DCAHINet. For visual comparison, the 3D shapes reconstructed by subnets 1 to 3 and our network are shown in Fig. 5, with MAE values of 0.52, 0.49, 1.05, and 0.28 mm, respectively. The 3D shapes reconstructed by subnets 1 to 3 are not as smooth as those of our network and show some distortion in the details. As a quantitative comparison, Table 1 shows the average MAE values of our network and subnets 1 to 3 on the training, test, and validation datasets. Removing any of the above modules increases the error relative to our full network. In summary, the deformable convolution, the attention gate, and the dual-stage design all help to improve the accuracy of depth map reconstruction and retain more details.

Fig. 5. As ablation experiments, qualitative illustrations of the 3D shapes reconstructed using subnets 1 to 3 and our network. subnet1 verifies the effectiveness of the attention gate, subnet2 verifies the effectiveness of the deformable convolution, and subnet3 verifies the effectiveness of the dual-stage method.

Table 1. Comparison Results of MAE for Our Net and Different Subnets (mm)

4.2 Varying depth ranges and hardware devices

To verify the reconstruction effectiveness of the proposed method for objects with different depth ranges, the datasets are divided into four categories with depth ranges of 0 to 200 mm, 0 to 300 mm, 0 to 400 mm, and 0 to 500 mm. The average MAE values of the training, validation, and test datasets for objects in each depth range are shown in Fig. 6. The error increases within a reasonable range as the depth increases, indicating that our network adapts well to scenes with different depth ranges. For a more intuitive comparison, objects with different depth ranges reconstructed by our network are shown in Fig. 7. The average MAE values for the depth ranges of 0 to 200, 0 to 300, 0 to 400, and 0 to 500 mm are 0.341, 0.313, 0.449, and 0.699 mm, respectively. It can therefore be concluded that objects within different depth ranges can be reconstructed faithfully with our method.

Fig. 6. The average MAE values for the training, validation, and test datasets with object depths ranging from 0 to 200, 300, 400, and 500 mm.

Fig. 7. The 3D shapes reconstructed by our network for objects within different depth ranges.

To verify the reconstruction results for fringe patterns captured by hardware devices with different d/l ratios (d is the camera/projector height, l is the distance between the camera and the projector), Fig. 8 shows the average MAE values of the training, validation, and test datasets for different d/l. The error does not change significantly as d/l varies, indicating that our network is applicable to different hardware devices. The 3D shapes reconstructed from fringe patterns captured with different d/l are shown in Fig. 9. Despite the change in d/l, the 3D shapes reconstructed by our method retain high accuracy and smoothness, which shows that the proposed method imposes no special restrictions on the distance and height of the camera and projector.

Fig. 8. The average MAE values for the training, validation, and test datasets with fringe patterns captured using different hardware device layouts (d/l of 4 to 8).

Fig. 9. The 3D shapes reconstructed by our network from fringe patterns captured using hardware devices with different d/l.

4.3 Comparison experiments

Comparison with the traditional method: Our network is compared with the four-step phase-shift and three-frequency heterodyne method. Qualitatively, when the input is a complex scene with noise, the 3D shape reconstructed by the traditional method still has burrs on the object surface even after filtering. In contrast, our method requires no pre- or post-processing and still reconstructs high-accuracy 3D shapes from noisy fringe patterns, as shown in Fig. 10(a). When the input is a complex scene with discontinuous objects, our method reconstructs 3D shapes with better detail and without burrs at the object edges, unlike the traditional method, as shown in Fig. 10(b). This indicates that the proposed network is better at reconstructing discontinuous objects. Quantitatively, Table 2 shows the MAE values of the traditional method and our method for noisy and clean inputs, illustrating that our method has lower error and higher accuracy than the traditional method. Although the traditional method uses twelve fringe patterns while our method uses only a single fringe pattern, our reconstruction is still better than that of the traditional method, especially in complex scenes.

Fig. 10. Qualitative comparison results of the 3D shapes reconstructed by the traditional method and our method when inputting (a) a deformed fringe pattern with random noise and (b) a deformed fringe pattern of noiseless discontinuous objects.

Table 2. Comparison Results of MAE for Our Net and Traditional Method (mm)

Comparison with UNet and HNet: To demonstrate that our network outperforms other deep learning methods, it is compared with UNet and HNet trained on the same dataset for 300 epochs. The reconstructed 3D shapes of the comparison experiment are shown in Fig. 11, covering different scenes (a single object and discontinuous objects). The depth maps reconstructed by our network are smoother than those of UNet and retain more detailed information than those of HNet. For quantitative comparison, the average MAE values of our network, UNet, and HNet on the training, validation, and test datasets are compared in Table 3. The convergence of our network, UNet, and HNet is shown in Fig. 12. As can be seen from Table 3 and Fig. 12, our network has the smallest error and the fastest convergence on all datasets. In summary, our network outperforms UNet and HNet in both reconstruction quality and overall accuracy.

Fig. 11. Qualitative comparison of the 3D shapes reconstructed from (a) single objects and (b) multiple discontinuous objects using our method, UNet, and HNet.

Fig. 12. The convergence process of training using UNet, HNet and our network.

Table 3. Comparison Results of MAE for Our Net, UNet, and HNet (mm)

4.4 Verify generalization ability

To illustrate the generalization ability of our method to different hardware devices, two publicly available datasets [25,37] with completely different and unknown hardware device parameters (including the internal/external parameters of the camera and projector) are selected for testing. Both datasets use virtual software to render fringe patterns and then use the phase-shift method and gray-code method to obtain the ground truth.

The reconstructed 3D shapes of the four scenes contained in the dataset of Ref. [25] are shown in Fig. 13; the 3D shapes reconstructed by our method are very close to the ground truth and restore the details very well. Quantitatively, the average error over the 40 scenes in the dataset of Ref. [25] is 0.014 mm. Qualitative results for the face and duck models in the dataset of Ref. [37], reconstructed with our network, UNet, and HNet, are shown in Fig. 14, with zoomed-in views of the facial details. The reconstruction MAE values of the face model using our method, HNet, and UNet are 0.11, 0.36, and 0.28 mm, respectively; for the duck model they are 0.13, 1.09, and 0.88 mm. The surfaces reconstructed by HNet are overly smooth and lack detail, and UNet produces wavy errors and burrs at the edges, whereas our network achieves higher accuracy and restores object details better. It can be concluded that our method can reconstruct depth with high accuracy from fringe patterns obtained with different hardware devices.

Fig. 13. Validating the performance of our network using the dataset from Ref. [25].

Fig. 14. Qualitative comparison of 3D shapes reconstructed by our net, HNet and UNet for (a) face model and (b) duck model in the dataset (Ref. [37]).

4.5 Practicality experiment

This section verifies the performance of our method in a real-world application scenario. First, our network was pre-trained on our synthesized dataset, and its results exhibited slight overfitting. This overfitting is attributed to the simplicity and homogeneity of the virtual dataset, which neglects interference factors such as shadows, reflections, and noise present in the real data. Consequently, transfer learning was used to fine-tune the network; after an additional 300 epochs of training on the real dataset, the MAE on the training dataset reached 0.067 mm, and the MAE on the validation and test datasets reached 0.115 mm.

Throughout the collection of the real dataset, the hardware device parameters (including the internal/external parameters and the distance/height of the camera and projector) were randomly adjusted without special control. The experiments therefore further demonstrate the generalization ability of our method to different hardware devices. The reconstructed 3D shapes of the three real scenes (single objects) are shown in Fig. 15(a). Even for the blade and vases with complex shapes, our network can still reconstruct high-precision 3D shapes. Taking the 12-step phase-shift and 3-frequency heterodyne reconstruction results as ground truth, the reconstruction MAE values of the blade, monitor, and vases are 0.114, 0.121, and 0.103 mm, respectively. The 3D reconstruction results for discontinuous objects are shown in Fig. 15(b). Since the deep learning method does not need to compute the phase, there are no burrs at the edges of the discontinuous objects, and the reconstructed 3D shapes are close to ideal. With the same ground truth, the reconstruction MAE values from top to bottom in Fig. 15(b) are 0.112, 0.124, and 0.101 mm, respectively.

Fig. 15. Qualitative illustration of 3D shapes of (a) single object and (b) discontinuous objects reconstructed using our network.

5. Conclusion

This paper presents a novel end-to-end single-shot FPP neural network, a dual-stage hybrid network with a deformation extraction stage and a depth mapping stage. The distinctive feature of our method is that the deformable convolution and the attention gate improve the dynamic perception capability of the network and increase the efficiency of information transfer, thereby significantly improving the accuracy of 3D reconstruction. Comparison experiments with the 4-step phase-shift and 3-frequency heterodyne method show that the proposed method obtains higher-quality 3D shapes using only a single fringe pattern in complex scenes with noise and discontinuous objects. Comparison experiments with UNet and HNet show that our network outperforms these deep learning methods, with an error of 1/4 that of UNet and 1/10 that of HNet.

More importantly, this work addresses the long-standing problem of the insufficient generalization ability of deep-learning-based single-shot FPP methods across different hardware devices. Inspired by parallax in binocular reconstruction, the network outputs phase difference values end-to-end, from which 3D shapes can be reconstructed through a simple multiplication, avoiding the influence of specific hardware devices. In addition, a dataset suitable for different hardware devices and its generation method are proposed. Experimental results on public datasets show that the network can reconstruct high-accuracy 3D shapes even when the hardware device parameters are different and unknown. Finally, the average reconstruction MAE on the real dataset is approximately 0.1 mm.

Funding

Guangdong Basic and Applied Basic Research Foundation (2022A1515110036); National Natural Science Foundation of China (12302245); Natural Science Basic Research Program of Shaanxi Province (2023-JC-QN-0026); Shuangchuang Program of Jiangsu Province (JSSCBS20220943).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Zhang, “Recent progresses on real-time 3D shape measurement using digital fringe projection techniques,” Optics and Lasers in Engineering 48(2), 149–158 (2010). [CrossRef]  

2. S. Feng, C. Zuo, L. Zhang, et al., “Calibration of fringe projection profilometry: A comparative review,” Optics and Lasers in Engineering 143, 106622 (2021). [CrossRef]  

3. M. Takeda, H. Ina, and S. Kobayashi, “Fourier-transform method of fringe-pattern analysis for computer-based topography and interferometry,” J. Opt. Soc. Am. 72(1), 156–160 (1982). [CrossRef]  

4. M. Takeda and K. Mutoh, “Fourier transform profilometry for the automatic measurement of 3-D object shapes,” Appl. Opt. 22(24), 3977–3982 (1983). [CrossRef]  

5. V. Srinivasan, H. C. Liu, and M. Halioua, “Automated phase-measuring profilometry of 3-D diffuse objects,” Appl. Opt. 23(18), 3105–3108 (1984). [CrossRef]  

6. X.-Y. Su, W.-S. Zhou, G. Von Bally, et al., “Automated phase-measuring profilometry using defocused projection of a Ronchi grating,” Opt. Commun. 94(6), 561–573 (1992). [CrossRef]  

7. X.-Y. Su, G. Von Bally, and D. Vukicevic, “Phase-stepping grating profilometry: utilization of intensity modulation analysis in complex objects evaluation,” Opt. Commun. 98(1-3), 141–150 (1993). [CrossRef]  

8. H. Nguyen, Y. Wang, and Z. Wang, “Single-Shot 3D Shape Reconstruction Using Structured Light and Deep Convolutional Neural Networks,” Sensors 20(13), 3718 (2020). [CrossRef]  

9. S. Feng, Q. Chen, G. Gu, et al., “Fringe pattern analysis using deep learning,” Adv. Photon. 1(02), 1 (2019). [CrossRef]  

10. K. Yan, Y. Yu, C. Huang, et al., “Fringe pattern denoising based on deep learning,” Opt. Commun. 437, 148–152 (2019). [CrossRef]  

11. M. Cywińska, F. Brzeski, W. Krajnik, et al., “DeepDensity: Convolutional neural network based estimation of local fringe pattern density,” Optics and Lasers in Engineering 145, 106675 (2021). [CrossRef]  

12. G. Yang, M. Yang, N. Zhou, et al., “High dynamic range fringe pattern acquisition based on deep neural network,” Opt. Commun. 512, 127765 (2022). [CrossRef]  

13. W. Li, T. Liu, M. Tai, et al., “Three-dimensional measurement for specular reflection surface based on deep learning and phase measuring profilometry,” Optik 271, 169983 (2022). [CrossRef]  

14. J. Shi, X. Zhu, H. Wang, et al., “Label enhanced and patch based deep learning for phase retrieval from single frame fringe pattern in fringe projection 3D measurement,” Opt. Express 27(20), 28929–28943 (2019). [CrossRef]  

15. J. Zhang, X. Tian, J. Shao, et al., “Phase unwrapping in optical metrology via denoised and convolutional segmentation networks,” Opt. Express 27(10), 14903–14912 (2019). [CrossRef]  

16. G. E. Spoorthi, S. Gorthi, and R. K. S. S. Gorthi, “PhaseNet: A Deep Convolutional Neural Network for Two-Dimensional Phase Unwrapping,” IEEE Signal Processing Letters 26(1), 54–58 (2019). [CrossRef]  

17. K. Wang, Y. Li, Q. Kemao, et al., “One-step robust deep learning phase unwrapping,” Opt. Express 27(10), 15100–15115 (2019). [CrossRef]  

18. S. Feng, C. Zuo, W. Yin, et al., “Micro deep learning profilometry for high-speed 3D surface imaging,” Optics and Lasers in Engineering 121, 416–427 (2019). [CrossRef]  

19. H. Yu, X. Chen, Z. Zhang, et al., “Dynamic 3-D measurement based on fringe-to-fringe transformation using deep learning,” Opt. Express 28(7), 9405–9418 (2020). [CrossRef]  

20. Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semi-Supervised Deep Learning for Monocular Depth Map Prediction,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2017), pp. 2215–2223.

21. R. C. Machineni, G. E. Spoorthi, K. S. Vengala, et al., “End-to-end deep learning-based fringe projection framework for 3D profiling of objects,” Computer Vision and Image Understanding 199, 103023 (2020). [CrossRef]  

22. S. V. der Jeught and J. J. J. Dirckx, “Deep neural networks for single shot structured light profilometry,” Opt. Express 27(12), 17091–17101 (2019). [CrossRef]  

23. S. V. D. Jeught, P. G. G. Muyshondt, and I. Lobato, “Optimized loss function in deep learning profilometry for improved prediction performance,” J. Phys. Photonics 3(2), 024014 (2021). [CrossRef]  

24. H. Nguyen, T. Tran, Y. Wang, et al., “Three-dimensional Shape Reconstruction from Single-shot Speckle Image Using Deep Convolutional Neural Networks,” Optics and Lasers in Engineering 143, 106639 (2021). [CrossRef]  

25. Y. Zheng, S. Wang, Q. Li, et al., “Fringe projection profilometry by conducting deep learning from its digital twin,” Opt. Express 28(24), 36568–36583 (2020). [CrossRef]  

26. F. Wang, C. Wang, and Q. Guan, “Single-shot fringe projection profilometry based on deep learning and computer graphics,” Opt. Express 29(6), 8024–8040 (2021). [CrossRef]  

27. H. Nguyen, K. L. Ly, T. Tran, et al., “hNet: Single-shot 3D shape reconstruction using structured light and h-shaped global guidance network,” Results in Optics 4, 100104 (2021). [CrossRef]  

28. L. Wang, D. Lu, R. Qiu, et al., “3D reconstruction from structured-light profilometry with dual-path hybrid network,” EURASIP J. Adv. Signal Process. 2022(1), 14 (2022). [CrossRef]  

29. L. Wang, D. Lu, J. Tao, et al., “Single-shot structured light projection profilometry with SwinConvUNet,” Opt. Eng. 61(11), 114101 (2022). [CrossRef]  

30. L. Chen, X. Lu, J. Zhang, et al., “HINet: Half Instance Normalization Network for Image Restoration,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2021), pp. 182–192.

31. O. Oktay, J. Schlemper, L. L. Folgoc, et al., “Attention U-Net: Learning Where to Look for the Pancreas,” arXiv, arXiv:1804.03999 (2018). [CrossRef]  

32. T. Lei, R. Wang, Y. Zhang, et al., “DefED-Net: Deformable Encoder-Decoder Network for Liver and Liver Tumor Segmentation,” IEEE Transactions on Radiation and Plasma Medical Sciences 6(1), 68–78 (2022). [CrossRef]  

33. Y. Yao, Z. Luo, S. Li, et al., “MVSNet: Depth Inference for Unstructured Multi-view Stereo,” in V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss, eds., Lecture Notes in Computer Science (Springer International Publishing, 2018), 11212, pp. 785–801.

34. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science 9351 (Springer, 2015), pp. 234–241.

35. S. W. Zamir, A. Arora, S. Khan, et al., “Multi-Stage Progressive Image Restoration,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), pp. 14816–14826.

36. E. Riba, D. Mishkin, D. Ponsa, et al., “Kornia: an Open Source Differentiable Computer Vision Library for PyTorch,” in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) (2020), pp. 3663–3672.

37. X. Zhu, Z. Zhang, L. Hou, et al., “Light field structured light projection data generation with Blender,” in 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA) (2022), pp. 1249–1253.
