Dual-stage hybrid network for single-shot fringe projection profilometry based on a phase-height model

Abstract

Single-shot fringe projection profilometry (FPP) is widely used in dynamic optical 3D reconstruction because of its high accuracy and efficiency. However, traditional single-shot FPP methods are unsatisfactory when reconstructing complex scenes containing noise and discontinuous objects. This paper therefore proposes a Deformable Convolution-Based HINet with Attention Connection (DCAHINet), a dual-stage hybrid network with a deformation extraction stage and a depth mapping stage. Specifically, a deformable convolution module and an attention gate are introduced into DCAHINet to enhance feature extraction and feature fusion, respectively. In addition, to address the long-standing problem of the insufficient generalization of deep-learning-based single-shot FPP methods across different hardware devices, DCAHINet outputs the phase difference, which can be converted into 3D shapes by a simple multiplication, rather than outputting 3D shapes directly. To the best of the authors' knowledge, DCAHINet is the first network that can be applied to different hardware devices. Experiments on virtual and real datasets show that the proposed method is superior to other deep learning and traditional methods and can be used in practical application scenarios.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In the field of optical measurement, fringe projection profilometry (FPP) [1], a high-precision active 3D imaging method, has been widely used in industrial measurement, cultural relic reconstruction, reverse engineering, medical diagnosis, etc. An FPP system comprises a projector that projects fringe patterns onto the object surface and a camera that captures, from a different perspective, the fringe patterns scattered by the object surface. The phase value of each pixel can then be obtained by phase retrieval and phase unwrapping, after which the optical triangulation method or the phase-height method [2] can reconstruct the 3D shape of the object. To ensure the accuracy of 3D reconstruction, multi-shot FPP methods project multiple fringe patterns with different frequencies or phase shifts to accurately calculate the unwrapped phase. However, the dynamic performance and computational efficiency of multi-shot FPP methods are barely satisfactory. Therefore, single-shot FPP methods, which only need to project one fringe pattern, are widely used for real-time online inspection and rapid modeling of moving objects. The earliest single-shot FPP method used the Fourier transform for phase retrieval and is hence also known as Fourier transform profilometry (FTP). Although FTP methods have been studied for many years [3,4], their reconstruction accuracy is still far inferior to that of multi-shot FPP methods [5–7]. In particular, when there are discontinuous objects in the scene, it is infeasible to reconstruct objects at different depths from a single fringe pattern, since single-shot FPP methods cannot resolve the discontinuous phase from spatial information alone. This, however, is not a problem for deep learning methods, because they do not need to compute the phase at all [8]. In summary, with the introduction of deep learning techniques into the field of FPP, a high-precision and highly dynamic single-shot FPP method has become possible.

Early work goes back to the pioneering approach of Feng et al. [9] in 2019, and since then deep learning methods have been widely applied to image preprocessing [10,11], high-reflection avoidance [12,13], phase retrieval [9,14], and phase unwrapping [15–17] in the field of FPP. To fully exploit the advantages of deep learning, multi-shot FPP methods based on deep learning achieve end-to-end 3D shape output [18,19], greatly improving the efficiency and accuracy of FPP. This work focuses on implementing end-to-end single-shot FPP methods based on deep learning, so multi-shot and non-end-to-end FPP methods are not elaborated further.

With the rapid development of deep learning technology in the field of FPP, achieving a high-precision, high-efficiency, and high-robustness single-shot FPP method has become feasible [20]. Machineni et al. [21] proposed a two-stage end-to-end deep learning framework in which the first stage used a Deep3D network to generate a reference fringe pattern, and the second stage used a stereo matching network to reconstruct 3D shapes from the reference and deformed fringe patterns. Although the two-stage network greatly improves accuracy, the ground truths of the two stages are different, which increases the difficulty of data acquisition. Van der Jeught and Dirckx [22] trained a convolutional neural network (CNN) on simulated fringe patterns to implement an end-to-end single-shot FPP method. Later, they [23] replaced the CNN with UNet and proposed a mixed gradient loss function to improve the accuracy of 3D reconstruction. To find the most suitable network architecture for single-shot FPP, Nguyen et al. [8,24] investigated the impact of different network structures such as FCN, AEN, and UNet on reconstruction performance and reported that UNet achieved the best results. Since then, UNet has been widely used in single-shot FPP methods and has achieved considerable reconstruction performance [25,26].

To achieve accurate 3D reconstruction, deep learning networks must accurately extract phase information, which is related not only to local fringe patterns but also to global image features. However, the convolution operation in UNet is a local operation with a small receptive field, which limits the performance of UNet in single-shot FPP. Although the receptive field can be enlarged by increasing the depth of the network, a deeper network not only introduces overfitting but also leads to the loss of context information. To tackle this problem, Nguyen et al. [27] introduced a global guidance path with multi-scale feature fusion into UNet, known as HNet. Similarly, Wang et al. [28] proposed a dual-path hybrid network based on UNet, which removes the deepest convolutional layers to avoid overfitting and introduces a swin transformer path to improve global perception. They later [29] improved the loss function by incorporating gradient-based structural similarity to enhance reconstruction details. Although swin transformers can greatly enlarge the receptive field of a network, they usually require larger datasets to train well. In addition, to the authors' knowledge, all existing networks are trained on a dataset generated by a single real or virtual FPP hardware device; because the parameters of each FPP device differ, the trained models cannot be generalized to other hardware devices.

To improve the reconstruction accuracy and generalization ability of single-shot FPP networks, this paper proposes the Deformable Convolution-Based HINet with Attention Connection (DCAHINet), which uses HINet from the image restoration field as its backbone [30]. Compared with UNet, HINet not only introduces a half-instance normalized residual block (HIN ResBlock) into each convolutional layer but also adopts a multi-stage fusion module. HINet therefore has greater global perception capability than UNet, which is why this work chooses it as the backbone. To further enhance the global perception and feature fusion capability of DCAHINet, this work introduces the attention gate [31], which improves information transfer and feature fusion between the encoder and decoder. In addition, because fringe pattern deformations are diverse in complex scenes, DCAHINet replaces ordinary convolution with deformable convolution (DC) [32], which flexibly combines local and global contextual information. To improve the generalization ability of a single-shot FPP network across different hardware devices, DCAHINet outputs phase difference values, which can be converted into 3D shapes by simply multiplying by the phase-height coefficient (calibrated with the phase-height method), rather than directly outputting 3D shapes. This allows the network to be applied to hardware devices with different parameters, such as focal length, baseline distance, distance to the tested object, and angle between visual axes. Accordingly, it is necessary to generate a single-shot FPP dataset covering different hardware device parameters. A dataset generation method based on the phase-height model is proposed in this paper to convert a public depth image dataset [33] into deformed fringe patterns; the deformed fringe patterns and depth maps together form the single-shot FPP dataset. In summary, the contributions of this paper are as follows:

  • 1. A novel end-to-end single-shot FPP network is proposed, which takes HINet as the backbone and incorporates a deformable convolution block and an attention gate to further enhance its ability to extract global contextual information and describe phase difference information.
  • 2. Similar to parallax in binocular reconstruction, the proposed network outputs phase difference values end-to-end, from which 3D shapes can be reconstructed through a simple multiplication, avoiding the influence of specific hardware devices. In addition, a dataset suitable for different hardware devices and its generation method are proposed.
  • 3. Extensive experiments demonstrate that the accuracy and efficiency of our method are superior to those of UNet [34], HNet [27], and traditional methods.

2. Network architecture

To propose a network model that is better suited to single-shot FPP, this paper takes HINet [30] from the field of image restoration as the backbone and further proposes DCAHINet. HINet is a multi-stage network composed of multiple UNets, which divides a complex task into several simpler ones that are solved progressively. Although HINet has achieved good results in image restoration, several factors still limit its performance in single-shot FPP. First, HINet uses ordinary convolutional kernels, whose fixed shape is difficult to adapt to different types of fringe pattern deformation. Second, stacking too many convolutional layers without feature fusion leads to the loss of spatial contextual details. Third, a filtering module is needed to ensure that the output depth map is smooth. To tackle these problems, DCAHINet introduces deformable convolutions, attention gates, and a Kornia filtering module on top of HINet.

2.1 Main backbone

DCAHINet is a dual-stage hybrid network with a deformation extraction stage and a depth mapping stage, as shown in Fig. 1. The first stage extracts the pixel-wise deformation of the fringe pattern, while the second stage establishes the relationship between the fringe deformation and the depth of the object. Both stages take the same deformed fringe pattern as input, and the final output of the network is the phase difference. Both stages are built on an encoder-decoder network similar to UNet [34].

Fig. 1. The proposed DCAHINet, which consists of deformation extraction stage and depth mapping stage. The input of this network is the deformed fringe pattern and the output is the phase difference.

At the beginning of each stage, the initial deformation features of the fringe pattern are extracted by a deformable convolution block [32], which consists of a 5 × 5 deformable convolution, batch normalization, and a rectified linear unit. The DC block can extract different types of deformation by automatically changing the shape of the convolutional kernel, enabling deformation features to be extracted from a more global perspective. The initial deformation features are then fed into an encoder-decoder architecture with four down-samplings and three up-samplings. In the encoder, the half-instance normalized residual block (HIN ResBlock) [30] is adopted to extract features at different scales from the initial deformation features. The HIN ResBlock consists of two 3 × 3 convolutional layers, and its normalization combines instance normalization and batch normalization to keep each image instance independent while retaining scale information. In the decoder, transposed convolution with a kernel size of 2 is used for up-sampling. To fuse encoder features directly into the decoder, an attention gate [31] is introduced to compensate for the information lost during resampling. Compared with a simple skip connection, it assigns different attention to different channel features, achieving efficient information fusion and noise filtering. A minimal sketch of the HIN ResBlock is given below.
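
As a concrete illustration of the building block described above, the following is a minimal PyTorch sketch of an HIN ResBlock (two 3 × 3 convolutions with the normalization split between instance and batch normalization across the channels). The channel split, activation placement, and 1 × 1 residual projection are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HINResBlock(nn.Module):
    """Half-instance normalized residual block, sketched from the description
    above; layer sizes and the residual projection are illustrative."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.half = out_ch // 2
        # One half of the channels is instance-normalized (keeps each image
        # instance independent); the other half is batch-normalized.
        self.inorm = nn.InstanceNorm2d(self.half, affine=True)
        self.bnorm = nn.BatchNorm2d(out_ch - self.half)
        self.act = nn.ReLU(inplace=True)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # residual projection

    def forward(self, x):
        y = self.conv1(x)
        a, b = torch.split(y, [self.half, y.shape[1] - self.half], dim=1)
        y = self.act(torch.cat([self.inorm(a), self.bnorm(b)], dim=1))
        y = self.act(self.conv2(y))
        return y + self.skip(x)
```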

Although the two stages share the same network structure, they serve different purposes and are combined to output the phase difference. Moreover, instead of simply cascading the two stages, DCAHINet uses the cross-stage feature fusion (CSFF) module [35] and the supervised attention module (SAM) [35] to facilitate information transfer between them. SAM provides ground-truth supervisory signals that are useful for deformation feature extraction. Through SAM, the useful deformation features of the first stage propagate to the second stage for depth mapping, while less informative features are suppressed by attention masks. The CSFF module transfers the multi-scale deformations extracted at different layers of the first stage to the corresponding layers of the second stage, which enriches the multi-scale deformation features of the depth mapping stage. A sketch of such a supervised attention module is shown below.
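
For illustration, the following is a minimal sketch of how such a supervised attention module can be realized, following the general structure of the SAM in [35]: the stage-1 features yield an intermediate prediction that is supervised by the ground truth, and an attention mask derived from that prediction gates the features passed on to stage 2. The layer sizes and exact gating are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SupervisedAttentionModule(nn.Module):
    """Sketch of a supervised attention module in the spirit of [35]."""
    def __init__(self, n_feat: int):
        super().__init__()
        self.to_pred = nn.Conv2d(n_feat, 1, 3, padding=1)       # intermediate prediction
        self.to_feat = nn.Conv2d(n_feat, n_feat, 3, padding=1)
        self.to_mask = nn.Conv2d(1, n_feat, 3, padding=1)

    def forward(self, feat):
        pred = self.to_pred(feat)                  # supervised by the ground-truth phase difference
        mask = torch.sigmoid(self.to_mask(pred))   # attention mask derived from the prediction
        gated = self.to_feat(feat) * mask          # suppress less informative features
        return feat + gated, pred                  # features for stage 2, intermediate output
```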

After the two-stage features are fused, a 3 × 3 deformable convolutional layer maps the features to the phase difference. To output a smoother phase difference, filtering is performed using the Kornia [36] module before the final output. The Kornia module is a small filtering network with higher training speed and GPU utilization than other filtering methods. Finally, the phase difference output by DCAHINet can be used to calculate depth values based on the phase-height method. The relationship between the phase difference and the height can be established according to the specific hardware device:

$$depth = \Delta \phi_x \cdot \frac{d}{l} \cdot \frac{p}{2\pi},$$
where $\Delta \phi_x$ represents the phase difference between the reference and the deformed fringe pattern, d represents the height of the camera and projector above the reference plane, l represents the distance between the camera and the projector, and p represents the fringe width. The parameters d, l, and p can be calibrated in the phase-height model or measured manually if the system is fixed. It is worth mentioning that DCAHINet only needs a deformed fringe pattern as input to obtain the phase difference; no additional reference fringe pattern is required. This is because, in all dataset images and physical images, the tested object occupies only a portion of the field of view, and the background fringes in each deformed fringe pattern represent the corresponding reference fringe pattern. Owing to the uniformity of the fringe pattern, deep learning methods can easily infer the entire reference fringe pattern from the background fringe information, so inputting the reference fringe pattern is unnecessary. A minimal sketch of the depth conversion is given below; detailed descriptions of the deformable convolution and the attention gate follow.
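
To make the post-processing step concrete, the following sketch applies Eq. (1) to a phase-difference map. The d/l ratio and fringe width p used here are illustrative, not calibrated values.

```python
import numpy as np

def phase_difference_to_depth(delta_phi, d_over_l, fringe_pitch_p):
    """Apply Eq. (1): depth = delta_phi * (d / l) * p / (2*pi).
    Units follow the calibration (e.g., mm if p is the fringe width in mm)."""
    return delta_phi * d_over_l * fringe_pitch_p / (2.0 * np.pi)

# Illustrative usage with assumed calibration values d/l = 5 and p = 10 mm.
delta_phi = np.zeros((128, 160), dtype=np.float32)   # stands in for the network output
depth_mm = phase_difference_to_depth(delta_phi, d_over_l=5.0, fringe_pitch_p=10.0)
```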

2.2 Deformable convolution

Deformable convolution is used for feature extraction at the beginning of each stage and for depth mapping at the end. DC can change the shape of the convolution kernel according to the deformation features at different scales, so it can flexibly use spatial context information as the situation requires.

DC adaptively changes the shape of the convolutional kernel by adding 2D offsets to the sampling positions on the feature map. The offsets are learned from the preceding feature maps through additional convolutional layers. The offset map has the same spatial resolution as the input feature and contains two offset channels, along the H and W directions, for each kernel sampling position. The offsets are then applied to the sampling locations of an ordinary convolution, which is equivalent to performing convolution with a deformed kernel. Nonetheless, because the implementation of DC is more complex than ordinary convolution and consumes considerable extra memory to record offset positions, this work only uses DC at the beginning and end of each stage. The effectiveness of the DC module is verified in Section 4. A sketch using the deformable convolution operator available in torchvision is given below.
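
A minimal sketch of such a deformable convolution block, built on the DeformConv2d operator from torchvision. The offset-predicting convolution and the 5 × 5 kernel follow the description above; the channel counts are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCBlock(nn.Module):
    """Deformable convolution block: offset prediction + 5x5 deformable
    convolution, followed by batch normalization and ReLU."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 5):
        super().__init__()
        # Two offsets (along H and W) for each of the k*k kernel sampling points.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        offsets = self.offset_conv(x)            # learned kernel-shape offsets
        return self.relu(self.bn(self.deform_conv(x, offsets)))
```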

2.3 Attention gate

DCAHINet replaces the skip connections in the encoder-decoder architecture with attention gates [31], as shown in Fig. 1. The attention gate ensures that only information useful for depth reconstruction (e.g., texture and edges) is passed to the decoder layer, while useless information (e.g., noise) is suppressed.

The attention gate is shown in Fig. 2. It has two inputs: ${A_{in1}}$, representing the encoder layer, and ${A_{in2}}$, representing the upsampled decoder layer. First, a 1 × 1 convolution is applied to ${A_{in1}}$ and ${A_{in2}}$ so that they have the same number of channels, and they are then summed pixel by pixel. This operation enhances signals in the same region of interest without neglecting detailed information. The summed feature map is then passed through convolution and normalization layers to obtain the attention coefficients, referred to as the $mask$. ${A_{mid}}$ is the element-wise product of the input feature map ${A_{in1}}$ and the attention coefficients $mask$, so that different positions receive different attention. Finally, ${A_{mid}}$ and ${A_{in2}}$ are concatenated along the channel dimension and convolved to obtain ${A_{out}}$.
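
A minimal PyTorch sketch of this attention gate follows. The intermediate channel size, the ReLU placement, and the final fusion convolution are assumptions; the data flow matches the description above.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention gate: project A_in1 (encoder) and A_in2 (decoder) to a common
    channel count, sum them, derive a mask, re-weight A_in1, then fuse."""
    def __init__(self, enc_ch: int, dec_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        self.proj_enc = nn.Conv2d(enc_ch, mid_ch, 1)
        self.proj_dec = nn.Conv2d(dec_ch, mid_ch, 1)
        self.to_mask = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, 1),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),            # attention coefficients in [0, 1]
        )
        self.fuse = nn.Conv2d(enc_ch + dec_ch, out_ch, 3, padding=1)

    def forward(self, a_in1, a_in2):
        mask = self.to_mask(self.proj_enc(a_in1) + self.proj_dec(a_in2))
        a_mid = a_in1 * mask                       # re-weighted encoder features
        return self.fuse(torch.cat([a_mid, a_in2], dim=1))
```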

Fig. 2. The proposed attention gate is used for skip connections to pass information efficiently between the encoder layers and decoder layers.

3. Dataset generation

The generation of large-scale, accurate, and realistic datasets is key to deep-learning-based single-shot FPP methods. Although some FPP datasets already exist [25,26,37], most of them provide wrapped or unwrapped phases as output and still require complex post-processing to obtain depth values. A few single-shot FPP datasets use depth values directly as ground truth and can support end-to-end depth output. However, these datasets are collected on a single optical system, so the trained model cannot be generalized to other hardware devices. To make the trained network model applicable to different hardware devices, we propose a dataset that uses the phase difference as the ground truth; the subsequent depth values can then be obtained with a simple multiplication. Similar to parallax in binocular reconstruction, the phase difference avoids the influence of different hardware devices.

To quickly generate datasets with different hardware parameters, this paper proposes a simulation method that generates a large amount of data. The phase-height method [2] is generally used to reconstruct object depth maps from phase differences. This paper applies the inverse process of the phase-height method to generate the dataset: the known depth map is used to calculate the phase difference, from which the corresponding deformed fringe pattern is generated.

Large-scale and realistic depth maps are the foundation of single-shot FPP dataset generation. The Dtu_Training dataset [33], obtained by shooting 128 objects from 48 perspectives, is chosen as the source of depth maps. In this way, 6144 depth maps with a size of 128 × 160 pixels and a height range of 0 to 500 mm are obtained. The depth value of each pixel can then be converted into a phase difference through Eq. (1). Of course, the ratio d/l must be pre-defined; to make the generated dataset suitable for different hardware devices, this work uses d/l values from 3 to 10 to generate phase differences. It is worth mentioning that the d/l value can be set according to the actual optical hardware layout, and the d/l value of commonly used layouts is generally between 3 and 10. A sketch of this inverse conversion is given below.
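
The inverse conversion is a single element-wise operation; a minimal sketch, in which the fringe width p is an assumed calibration value chosen consistently with the fringe synthesis that follows:

```python
import numpy as np

def depth_to_phase_difference(depth_map, d_over_l, fringe_pitch_p):
    """Inverse of Eq. (1): convert a known depth map into the per-pixel phase
    difference used as ground truth. d_over_l is drawn from the 3-10 range."""
    return depth_map * 2.0 * np.pi / (d_over_l * fringe_pitch_p)
```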

To obtain a deformed fringe pattern, a standard cosine fringe pattern is used as the reference fringe pattern. Based on the reference fringe pattern, the deformed fringe patterns can be synthesized by adding the corresponding phase differences pixel by pixel:

$$I(x,y) = A(x,y) + B(x,y)\cos\left( \frac{2\pi x}{T} + \Delta \phi_x \right),$$
where $A({x,y} )$ represents the background light intensity and $B({x,y} )$ represents the modulated light intensity. To make the synthesized dataset more realistic, the parameters $A({x,y} )$, $B({x,y} )$, and T must be as close to the actual situation as possible. The fringe period T is set to 8, which is a commonly used value when capturing small-resolution images with actual hardware. To ensure authenticity, the background light intensity $A({x,y} )$ and the modulated light intensity $B({x,y} )$ are set to random values between 140 and 160 and between 70 and 90, respectively, which were determined after several observations of actually captured deformed fringe patterns. To make the dataset more practical, random Gaussian noise is added to the deformed fringe patterns. The generated dataset contains complex scenes with discontinuous objects, including fruits, dolls, sculptures, plants, etc., as shown in Fig. 3. The dataset is divided into training (60%, 3744), test (20%, 1200), and validation (20%, 1200) subsets. A sketch of the fringe synthesis is given below.
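
A minimal sketch of this synthesis step following Eq. (2). Drawing A and B as per-image scalars and the Gaussian noise level are simplifying assumptions.

```python
import numpy as np

def synthesize_deformed_fringe(delta_phi, period_T=8, noise_std=2.0, rng=None):
    """Render a deformed fringe pattern from a phase-difference map via Eq. (2),
    using the intensity ranges quoted above (A in 140-160, B in 70-90) and
    additive Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = delta_phi.shape
    x = np.arange(w)[None, :].repeat(h, axis=0)          # pixel column index
    A = rng.uniform(140, 160)                            # background intensity
    B = rng.uniform(70, 90)                              # modulation intensity
    fringe = A + B * np.cos(2 * np.pi * x / period_T + delta_phi)
    fringe += rng.normal(0.0, noise_std, size=fringe.shape)
    return np.clip(fringe, 0, 255).astype(np.uint8)
```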

Fig. 3. A sample dataset which contains multiple model types to ensure data diversity and network reliability.

In addition to the synthesized dataset, a real dataset is also required, so the structured light 3D reconstruction system shown in Fig. 4 was built, consisting of a projector (resolution: 912 × 1140) and a camera (resolution: 2048 × 2448). Fringe patterns of the objects were collected, and the corresponding ground-truth depth maps were calculated by the twelve-step phase-shift and three-frequency heterodyne method. A sketch of the standard N-step phase-shift retrieval underlying this ground truth is given below.
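
For reference, the following is a minimal sketch of the standard N-step phase-shift formula that underlies the ground-truth computation; the sign convention is an assumption, and the three-frequency heterodyne unwrapping and system calibration are omitted.

```python
import numpy as np

def wrapped_phase_nstep(images):
    """Standard N-step phase-shift retrieval: `images` holds N fringe images
    captured with phase shifts of 2*pi*n/N."""
    N = len(images)
    shifts = 2.0 * np.pi * np.arange(N) / N
    num = sum(I * np.sin(s) for I, s in zip(images, shifts))
    den = sum(I * np.cos(s) for I, s in zip(images, shifts))
    return np.arctan2(-num, den)   # wrapped phase in (-pi, pi]
```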

Fig. 4. The structured light 3D reconstruction system.

4. Experiments and results

The proposed network was trained on an NVIDIA GeForce RTX 3090 for 300 epochs, which took about 20 hours. The initial learning rate was 0.0001, and the Adam optimizer was used with β1 = 0.9 and β2 = 0.999. During training, the batch size was set to 16 and the loss function from [23] was used, with the L1 loss in [23] replaced by the smooth L1 loss because of its robustness and low sensitivity to outliers. The loss is defined as follows:

$$Loss = L1( gt_i, est_i ) + ( 1 - SSIM( gt_i, est_i ) ) + MGE( gt_i, est_i ),$$
where L1 denotes the smooth L1 loss between the estimated value $es{t_i}$ and the ground truth $g{t_i}$; SSIM denotes the structural similarity, which measures brightness, contrast, and structural information; and MGE is the mixed gradient error (MixGE), i.e., the local gradient error of the depth map. A minimal sketch of this loss is given below.
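
A minimal sketch of the combined loss. The finite-difference gradient term is a simple stand-in for the MixGE of [23], and `ssim_fn` is any differentiable SSIM implementation assumed to be provided externally.

```python
import torch
import torch.nn.functional as F

def mixed_gradient_error(gt, est):
    """Mean absolute difference of horizontal and vertical image gradients,
    a simple realization of the mixed gradient error term."""
    def grads(x):
        gx = x[..., :, 1:] - x[..., :, :-1]
        gy = x[..., 1:, :] - x[..., :-1, :]
        return gx, gy
    gx_gt, gy_gt = grads(gt)
    gx_est, gy_est = grads(est)
    return (gx_gt - gx_est).abs().mean() + (gy_gt - gy_est).abs().mean()

def total_loss(gt, est, ssim_fn):
    """Eq. (3): smooth L1 + (1 - SSIM) + MGE, where ssim_fn returns a
    similarity value in [0, 1]."""
    return (F.smooth_l1_loss(est, gt)
            + (1.0 - ssim_fn(est, gt))
            + mixed_gradient_error(gt, est))
```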

In this section, the proposed method is evaluated in the following aspects. First, ablation experiments are performed to demonstrate the effectiveness of (i) the attention gate, (ii) the deformable convolution block, and (iii) the dual-stage design in our network. Second, the reconstruction effectiveness of the proposed network is verified for (i) different depth ranges and (ii) different hardware devices. Then, comparison experiments are conducted against (i) a traditional method (phase shift plus multi-frequency heterodyne), (ii) UNet, and (iii) HNet in complex scenes (e.g., scenes with discontinuous objects and noisy scenes). Finally, the generalization ability of the network is verified on unfamiliar datasets, and the performance of the proposed method is verified in practical application scenarios. It should be noted that all ground truths required by the experiments are calculated using the twelve-step phase-shift and three-frequency heterodyne method.

The mean absolute error (MAE), which directly reflects the reconstruction accuracy, is used as the quantitative evaluation metric in this paper:

$$MAE = \frac{\sum_{i = 1}^{n} |gt_i - est_i|}{n},$$
where n is the total number of pixels in the input image, $est$ represents the predicted depth map, and $gt$ represents the ground-truth depth map. A small helper implementing this metric is sketched below.
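
A short helper implementing Eq. (4); the optional validity mask is an added convenience, not part of the original definition.

```python
import numpy as np

def mae_mm(gt_depth, est_depth, mask=None):
    """Eq. (4): mean absolute error over all evaluated pixels."""
    err = np.abs(gt_depth - est_depth)
    return float(err[mask].mean()) if mask is not None else float(err.mean())
```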

4.1 Ablation experiments

To verify the effectiveness of the attention gate, the deformable convolution, and the dual-stage design, subnets 1 to 3 were constructed by removing the corresponding modules from DCAHINet. For visual comparison, the 3D shapes reconstructed by subnets 1 to 3 and our network are shown in Fig. 5, with MAE values of 0.52, 0.49, 1.05, and 0.28 mm, respectively. The 3D shapes reconstructed by subnets 1 to 3 are not as smooth as those of our network and show some distortion in the details. As a quantitative comparison, Table 1 shows the average MAE values of our network and subnets 1 to 3 on the training, test, and validation datasets. Removing any of the above modules increases the error relative to our full network. In summary, the deformable convolution, the attention gate, and the dual-stage design all help to improve the accuracy of depth map reconstruction and retain more details.

Fig. 5. As ablation experiments, qualitative illustrations of the 3D shapes reconstructed using subnets 1 to 3 and our network. subnet1 verifies the effectiveness of the attention gate, subnet2 verifies the effectiveness of the deformable convolution, and subnet3 verifies the effectiveness of the dual-stage method.

Table 1. Comparison Results of MAE for Our Net and Different Subnets (mm)

4.2 Varying depth ranges and hardware devices

To verify the reconstruction effectiveness of the proposed method for objects with different depth ranges, the datasets are divided into four categories with depth ranges of 0 to 200 mm, 0 to 300 mm, 0 to 400 mm, and 0 to 500 mm. The average MAE values of the training, validation, and test datasets for objects in each depth range are shown in Fig. 6. The error increases within a reasonable range as the depth increases, indicating that our network adapts well to scenes with different depth ranges. For a more intuitive comparison, objects with different depth ranges reconstructed by our network are shown in Fig. 7. The average MAE values for the depth ranges of 0 to 200, 0 to 300, 0 to 400, and 0 to 500 mm are 0.341, 0.313, 0.449, and 0.699 mm, respectively. It can therefore be concluded that objects within different depth ranges can be reconstructed faithfully with our method.

Fig. 6. The average MAE values for the training, validation, and test datasets with object depths ranging from 0 to 200, 300, 400, and 500 mm.

Fig. 7. The 3D shapes reconstructed by our network for objects within different depth ranges.

To verify the reconstruction results for fringe patterns captured by hardware devices with different d/l ratios (d is the camera/projector height, l is the distance between the camera and the projector), Fig. 8 shows the average MAE values of the training, validation, and test datasets for different d/l. The error does not change significantly as d/l varies, indicating that our network is applicable to different hardware devices. The 3D shapes reconstructed from fringe patterns captured with different d/l are shown in Fig. 9. Despite the change in d/l, the 3D shapes reconstructed by our method retain high accuracy and smoothness, which shows that the proposed method imposes no special restrictions on the distance and height of the camera and projector.

Fig. 8. The average MAE values for the training, validation, and test datasets with fringe patterns captured using different hardware device layouts (d/l of 4 to 8).

Fig. 9. The 3D shapes reconstructed by our network from fringe patterns captured using hardware devices with different d/l.

4.3 Comparison experiments

Comparison with the traditional method: Our network is compared with the four-step phase-shift and three-frequency heterodyne method. Qualitatively, when the input is a complex scene with noise, the 3D shape reconstructed by the traditional method still has burrs on the object surface even after filtering. In contrast, our method requires no pre- or post-processing and still reconstructs high-accuracy 3D shapes from noisy fringe patterns, as shown in Fig. 10(a). When the input is a complex scene with discontinuous objects, our method reconstructs 3D shapes with better detail and without burrs at the object edges, unlike the traditional method, as shown in Fig. 10(b). This indicates that the proposed network is better at reconstructing discontinuous objects. Quantitatively, Table 2 shows the MAE values of the traditional method and our method for noisy and clean inputs, illustrating that our method has lower error and higher accuracy than the traditional method. Although the traditional method uses twelve fringe patterns while our method uses only a single fringe pattern, our reconstruction is still better than that of the traditional method, especially in complex scenes.

Fig. 10. Qualitative comparison results of the 3D shapes reconstructed by the traditional method and our method when inputting (a) a deformed fringe pattern with random noise and (b) a deformed fringe pattern of noiseless discontinuous objects.

Table 2. Comparison Results of MAE for Our Net and Traditional Method (mm)

Comparison with UNet and HNet: To demonstrate that our network outperforms other deep learning methods, it is compared with UNet and HNet trained on the same dataset for 300 epochs. The reconstructed 3D shapes of the comparison experiment are shown in Fig. 11, covering different scenes (a single object and discontinuous objects). The depth maps reconstructed by our network are smoother than those of UNet and retain more detailed information than those of HNet. For quantitative comparison, the average MAE values of our network, UNet, and HNet on the training, validation, and test datasets are compared in Table 3. The convergence of our network, UNet, and HNet is shown in Fig. 12. As can be seen from Table 3 and Fig. 12, our network has the smallest error and the fastest convergence on all datasets. In summary, our network outperforms UNet and HNet in both reconstruction quality and overall accuracy.

Fig. 11. Qualitative comparison of the 3D shapes reconstructed from (a) single objects and (b) multiple discontinuous objects using our method, UNet, and HNet.

Fig. 12. The convergence process of training using UNet, HNet and our network.

Table 3. Comparison Results of MAE for Our Net, UNet, and HNet (mm)

4.4 Verify generalization ability

To illustrate the generalization ability of our method to different hardware devices, two publicly available datasets [25,37] with completely different and unknown hardware device parameters (including the internal/external parameters of the camera and projector) are selected for testing. Both datasets use virtual software to render fringe patterns and then use the phase-shift method and gray-code method to obtain the ground truth.

The reconstructed 3D shapes of the four scenes contained in the dataset of Ref. [25] are shown in Fig. 13; the 3D shapes reconstructed by our method are very close to the ground truth and restore the details very well. Quantitatively, the average error over the 40 scenes in the dataset of Ref. [25] is 0.014 mm. Qualitative results for the face and duck models in the dataset of Ref. [37], reconstructed with our network, UNet, and HNet, are shown in Fig. 14, with zoomed-in views of the facial details. The reconstruction MAE values of the face model using our method, HNet, and UNet are 0.11, 0.36, and 0.28 mm, respectively; for the duck model they are 0.13, 1.09, and 0.88 mm. The surfaces reconstructed by HNet are overly smooth and lack detail, and UNet produces wavy errors and burrs at the edges, whereas our network achieves higher accuracy and restores object details better. It can be concluded that our method can reconstruct depth with high accuracy from fringe patterns obtained with different hardware devices.

Fig. 13. Validating the performance of our network using the dataset from Ref. [25].

Fig. 14. Qualitative comparison of 3D shapes reconstructed by our net, HNet and UNet for (a) face model and (b) duck model in the dataset (Ref. [37]).

4.5 Practicality experiment

This section verifies the performance of our method in a real-world application scenario. First, our network was pre-trained on our synthesized dataset, and its results exhibited slight overfitting. This overfitting is attributed to the simplicity and homogeneity of the virtual dataset, which neglects interference factors such as shadows, reflections, and noise present in the real data. Consequently, transfer learning was used to fine-tune the network; after an additional 300 epochs of training on the real dataset, the MAE on the training dataset reached 0.067 mm, and the MAE on the validation and test datasets reached 0.115 mm.

Throughout the collection of the real dataset, the hardware device parameters (including the internal/external parameters and the distance/height of the camera and projector) were randomly adjusted without special control. The experiments therefore further demonstrate the generalization ability of our method to different hardware devices. The reconstructed 3D shapes of the three real scenes (single objects) are shown in Fig. 15(a). Even for the blade and vases with complex shapes, our network can still reconstruct high-precision 3D shapes. Taking the 12-step phase-shift and 3-frequency heterodyne reconstruction results as ground truth, the reconstruction MAE values of the blade, monitor, and vases are 0.114, 0.121, and 0.103 mm, respectively. The 3D reconstruction results for discontinuous objects are shown in Fig. 15(b). Since the deep learning method does not need to compute the phase, there are no burrs at the edges of the discontinuous objects, and the reconstructed 3D shapes are close to ideal. With the same ground truth, the reconstruction MAE values from top to bottom in Fig. 15(b) are 0.112, 0.124, and 0.101 mm, respectively.

Fig. 15. Qualitative illustration of 3D shapes of (a) single object and (b) discontinuous objects reconstructed using our network.

5. Conclusion

This paper presents a novel end-to-end single-shot FPP neural network, a dual-stage hybrid network with a deformation extraction stage and a depth mapping stage. The distinctive feature of our method is that the deformable convolution and the attention gate improve the dynamic perception capability of the network and increase the efficiency of information transfer, thereby significantly improving the accuracy of 3D reconstruction. Comparison experiments with the 4-step phase-shift and 3-frequency heterodyne method show that the proposed method obtains higher-quality 3D shapes using only a single fringe pattern in complex scenes with noise and discontinuous objects. Comparison experiments with UNet and HNet show that our network outperforms these deep learning methods, with an error of 1/4 that of UNet and 1/10 that of HNet.

More importantly, this work addresses the long-standing problem of the insufficient generalization ability of deep-learning-based single-shot FPP methods across different hardware devices. Inspired by parallax in binocular reconstruction, the network outputs phase difference values end-to-end, from which 3D shapes can be reconstructed through a simple multiplication, avoiding the influence of specific hardware devices. In addition, a dataset suitable for different hardware devices and its generation method are proposed. Experimental results on public datasets show that the network can reconstruct high-accuracy 3D shapes even when the hardware device parameters are different and unknown. Finally, the average reconstruction MAE on the real dataset is approximately 0.1 mm.

Funding

Guangdong Basic and Applied Basic Research Foundation (2022A1515110036); National Natural Science Foundation of China (12302245); Natural Science Basic Research Program of Shaanxi Province (2023-JC-QN-0026); Shuangchuang Program of Jiangsu Province (JSSCBS20220943).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Zhang, “Recent progresses on real-time 3D shape measurement using digital fringe projection techniques,” Optics and Lasers in Engineering 48(2), 149–158 (2010). [CrossRef]  

2. S. Feng, C. Zuo, L. Zhang, et al., “Calibration of fringe projection profilometry: A comparative review,” Optics and Lasers in Engineering 143, 106622 (2021). [CrossRef]  

3. M. Takeda, H. Ina, and S. Kobayashi, “Fourier-transform method of fringe-pattern analysis for computer-based topography and interferometry,” J. Opt. Soc. Am. 72(1), 156–160 (1982). [CrossRef]  

4. M. Takeda and K. Mutoh, “Fourier transform profilometry for the automatic measurement of 3-D object shapes,” Appl. Opt. 22(24), 3977–3982 (1983). [CrossRef]  

5. V. Srinivasan, H. C. Liu, and M. Halioua, “Automated phase-measuring profilometry of 3-D diffuse objects,” Appl. Opt. 23(18), 3105–3108 (1984). [CrossRef]  

6. X.-Y. Su, W.-S. Zhou, G. Von Bally, et al., “Automated phase-measuring profilometry using defocused projection of a Ronchi grating,” Opt. Commun. 94(6), 561–573 (1992). [CrossRef]  

7. X.-Y. Su, G. Von Bally, and D. Vukicevic, “Phase-stepping grating profilometry: utilization of intensity modulation analysis in complex objects evaluation,” Opt. Commun. 98(1-3), 141–150 (1993). [CrossRef]  

8. H. Nguyen, Y. Wang, and Z. Wang, “Single-Shot 3D Shape Reconstruction Using Structured Light and Deep Convolutional Neural Networks,” Sensors 20(13), 3718 (2020). [CrossRef]  

9. S. Feng, Q. Chen, G. Gu, et al., “Fringe pattern analysis using deep learning,” Adv. Photon. 1(02), 1 (2019). [CrossRef]  

10. K. Yan, Y. Yu, C. Huang, et al., “Fringe pattern denoising based on deep learning,” Opt. Commun. 437, 148–152 (2019). [CrossRef]  

11. M. Cywińska, F. Brzeski, W. Krajnik, et al., “DeepDensity: Convolutional neural network based estimation of local fringe pattern density,” Optics and Lasers in Engineering 145, 106675 (2021). [CrossRef]  

12. G. Yang, M. Yang, N. Zhou, et al., “High dynamic range fringe pattern acquisition based on deep neural network,” Opt. Commun. 512, 127765 (2022). [CrossRef]  

13. W. Li, T. Liu, M. Tai, et al., “Three-dimensional measurement for specular reflection surface based on deep learning and phase measuring profilometry,” Optik 271, 169983 (2022). [CrossRef]  

14. J. Shi, X. Zhu, H. Wang, et al., “Label enhanced and patch based deep learning for phase retrieval from single frame fringe pattern in fringe projection 3D measurement,” Opt. Express 27(20), 28929–28943 (2019). [CrossRef]  

15. J. Zhang, X. Tian, J. Shao, et al., “Phase unwrapping in optical metrology via denoised and convolutional segmentation networks,” Opt. Express 27(10), 14903–14912 (2019). [CrossRef]  

16. G. E. Spoorthi, S. Gorthi, and R. K. S. S. Gorthi, “PhaseNet: A Deep Convolutional Neural Network for Two-Dimensional Phase Unwrapping,” IEEE Signal Processing Letters 26(1), 54–58 (2019). [CrossRef]  

17. K. Wang, Y. Li, Q. Kemao, et al., “One-step robust deep learning phase unwrapping,” Opt. Express 27(10), 15100–15115 (2019). [CrossRef]  

18. S. Feng, C. Zuo, W. Yin, et al., “Micro deep learning profilometry for high-speed 3D surface imaging,” Optics and Lasers in Engineering 121, 416–427 (2019). [CrossRef]  

19. H. Yu, X. Chen, Z. Zhang, et al., “Dynamic 3-D measurement based on fringe-to-fringe transformation using deep learning,” Opt. Express 28(7), 9405–9418 (2020). [CrossRef]  

20. Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semi-Supervised Deep Learning for Monocular Depth Map Prediction,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2017), pp. 2215–2223.

21. R. C. Machineni, G. E. Spoorthi, K. S. Vengala, et al., “End-to-end deep learning-based fringe projection framework for 3D profiling of objects,” Computer Vision and Image Understanding 199, 103023 (2020). [CrossRef]  

22. S. V. der Jeught and J. J. J. Dirckx, “Deep neural networks for single shot structured light profilometry,” Opt. Express 27(12), 17091–17101 (2019). [CrossRef]  

23. S. V. D. Jeught, P. G. G. Muyshondt, and I. Lobato, “Optimized loss function in deep learning profilometry for improved prediction performance,” J. Phys. Photonics 3(2), 024014 (2021). [CrossRef]  

24. H. Nguyen, T. Tran, Y. Wang, et al., “Three-dimensional Shape Reconstruction from Single-shot Speckle Image Using Deep Convolutional Neural Networks,” Optics and Lasers in Engineering 143, 106639 (2021). [CrossRef]  

25. Y. Zheng, S. Wang, Q. Li, et al., “Fringe projection profilometry by conducting deep learning from its digital twin,” Opt. Express 28(24), 36568–36583 (2020). [CrossRef]  

26. F. Wang, C. Wang, and Q. Guan, “Single-shot fringe projection profilometry based on deep learning and computer graphics,” Opt. Express 29(6), 8024–8040 (2021). [CrossRef]  

27. H. Nguyen, K. L. Ly, T. Tran, et al., “hNet: Single-shot 3D shape reconstruction using structured light and h-shaped global guidance network,” Results in Optics 4, 100104 (2021). [CrossRef]  

28. L. Wang, D. Lu, R. Qiu, et al., “3D reconstruction from structured-light profilometry with dual-path hybrid network,” EURASIP J. Adv. Signal Process. 2022(1), 14 (2022). [CrossRef]  

29. L. Wang, D. Lu, J. Tao, et al., “Single-shot structured light projection profilometry with SwinConvUNet,” Opt. Eng. 61(11), 114101 (2022). [CrossRef]  

30. L. Chen, X. Lu, J. Zhang, et al., “HINet: Half Instance Normalization Network for Image Restoration,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2021), pp. 182–192.

31. O. Oktay, J. Schlemper, L. L. Folgoc, et al., “Attention U-Net: Learning Where to Look for the Pancreas,” arXiv, arXiv:1804.03999 (2018). [CrossRef]  

32. T. Lei, R. Wang, Y. Zhang, et al., “DefED-Net: Deformable Encoder-Decoder Network for Liver and Liver Tumor Segmentation,” IEEE Transactions on Radiation and Plasma Medical Sciences 6(1), 68–78 (2022). [CrossRef]  

33. Y. Yao, Z. Luo, S. Li, et al., “MVSNet: Depth Inference for Unstructured Multi-view Stereo,” in V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss, eds., Lecture Notes in Computer Science (Springer International Publishing, 2018), 11212, pp. 785–801.

34. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science 9351 (Springer, 2015), pp. 234–241.

35. S. W. Zamir, A. Arora, S. Khan, et al., “Multi-Stage Progressive Image Restoration,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), pp. 14816–14826.

36. E. Riba, D. Mishkin, D. Ponsa, et al., “Kornia: an Open Source Differentiable Computer Vision Library for PyTorch,” in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) (2020), pp. 3663–3672.

37. X. Zhu, Z. Zhang, L. Hou, et al., “Light field structured light projection data generation with Blender,” in 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA) (2022), pp. 1249–1253.
