NMSCANet: stereo matching network for speckle variations in single-shot speckle projection profilometry

Open Access

Abstract

In single-shot speckle projection profilometry (SSPP), the projected speckle inevitably undergoes changes in shape and size due to variations in viewing angle, complex surface modulations of the test object, and different projection ratios. These variations introduce randomness and unpredictability to the speckle features, resulting in erroneous or missing feature extraction and subsequently degrading 3D reconstruction accuracy across the tested surface. This work explores the relationship between speckle size variations and feature extraction, and addresses the issue solely from the perspective of network design by leveraging specific variations in speckle size without expanding the training set. Based on the analysis of the relationship between speckle size variations and feature extraction, we introduce NMSCANet, which enables the extraction of multi-scale speckle features. Multi-scale spatial attention is employed to enhance the perception of complex and varying speckle features in space, allowing comprehensive feature extraction across different scales. Channel attention is also employed to selectively highlight the most important and representative feature channels in each image, which enhances the detection capability of high-frequency 3D surface profiles. In particular, a real binocular 3D measurement system and its digital twin with the same calibration parameters are established. Experimental results show that, in the face of speckle size changes, NMSCANet exhibits more than 8 times higher point cloud reconstruction stability (Std) on the testing set and the smallest change range in terms of Mean dis (0.0614 mm - 0.4066 mm) and Std (0.0768 mm - 0.7367 mm) when measuring a standard sphere and plane, compared to other methods; meanwhile, NMSCANet boosts the disparity matching accuracy (EPE) by over 35% while reducing the matching error (N-PER) by over 62%. Ablation studies and validity experiments collectively substantiate that our proposed modules and constructed network significantly enhance network accuracy and robustness against speckle variations.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Recently, an increasing number of DL-based structured light 3D reconstruction methods have achieved promising results. However, most researchers have focused on network design and paid less attention to the relationship between encoding patterns and DL networks. This paper takes the perspective of varying speckle shapes and sizes: we enhance the robustness of the model to variations in speckle shape and size in the context of stereo matching without altering or expanding the training dataset, so as to stimulate researchers’ thinking and exploration on the interplay between structured light and networks. By virtue of being non-contact, highly accurate, and robust to weak textures, structured light 3D measurement technology based on binocular vision is widely applied in the fields of 3D face reconstruction, industrial parts shape measurement, and digital protection of cultural relics [1–4]. Among the many structured light encoding patterns, fringe patterns [5–8] and speckle patterns [9–13] are widely investigated and applied in 3D measurement practice. The 3D reconstruction methods using these two patterns are called fringe projection profilometry (FPP) and speckle projection profilometry (SPP), respectively.

FPP is an advanced technique for highly accurate 3D reconstruction that has undergone significant development in recent years. It is particularly well suited for applications that require precision but allow sufficient processing time. FPP can be categorized into two main types based on the number of cameras used: monocular vision fringe profilometry [14,15] and binocular vision (stereo) profilometry [16,17]. In monocular FPP, 3D measurement is achieved by mapping the demodulated phase to height information. In binocular FPP, by contrast, the technique leverages the fact that the absolute phase of the same object should remain consistent across different viewpoints. It performs phase unwrapping on the left and right views to obtain a unified phase unwrapping map; phase matching is then carried out to generate the disparity map, allowing accurate 3D shape reconstruction via triangulation. To generate a high-accuracy ground truth (GT), we employ binocular FPP in this paper. Although FPP enables high-precision 3D reconstruction, it requires the projection of multiple fringe patterns. This limitation restricts its application in scenarios where real-time measurements are necessary, such as dynamic 3D measurement applications.

Due to the global uniqueness of speckle encoding [18], SPP naturally tends to develop towards single-shot 3D reconstruction for many applications, which only requires a good speckle encoding strategy and a well-designed stereo matching network or algorithm to achieve high-precision 3D reconstruction. Since Zhou et al. [19] designed an accurate and robust speckle encoding method, the accuracy of the stereo matching algorithm or network has become the bottleneck of reconstruction accuracy. After obtaining the disparity from stereo matching, we can use the disparity and the system calibration parameters to implement 3D reconstruction. Recently, DL networks have made great progress in computer vision tasks, and many outstanding stereo matching networks have emerged that outperform traditional stereo matching algorithms in accuracy and speed. Since the end-to-end stereo matching network was proposed by Kendall et al. [20], many stereo matching researchers have followed their matching scheme and enhanced the matching results by focusing on three aspects of the stereo matching network: feature extraction, cost volume construction, and disparity regression. Some researchers [21–24] devoted themselves to changing feature extraction to improve the accuracy of stereo matching models; among them, Chang et al. [21] introduced a pyramid pooling structure into feature extraction to obtain features at different scales. Other researchers [20,25–28] modified cost volume construction: Guo et al. [26] and Mayer et al. [27] employed correlation to construct the cost volume, while Kendall et al. [20] directly utilized concatenation. Recently, Xu et al. [28] innovatively fused these two ways of constructing the cost volume; their use of a correlation volume to generate attention and remove redundant information in the concatenation volume greatly improved the accuracy and efficiency of matching.

As for DL-based single-shot SPP, researchers have made modifications from different perspectives of the end-to-end stereo matching scheme. However, the characteristics of the speckle pattern projected onto the objects are ignored. Yin et al. [29] proposed a 3D reconstruction method based on an end-to-end stereo matching network with a single-shot speckle; they introduced multi-scale residual modules into feature extraction to achieve sub-pixel matching results for SPP. Later, they introduced a lightweight 3D U-Net to implement efficient 4D cost aggregation [24]. Recently, Wang et al. [23] adopted a dense network for feature extraction to further enhance SPP accuracy and fused the concatenated and correlation cost volumes to boost matching speed. Even though these prior works reached the goal of SSPP and made improvements in various aspects of stereo matching networks to achieve more accurate matching results, they all overlooked a crucial and intuitive impact factor, namely the effect of speckle feature variations on the accuracy of stereo matching networks.

In practice, the projected speckle is distorted by the complex surfaces of test objects, various projection ratios of the projector, and perspective differences between the cameras and the projector. These distortions change the features extracted by the feature extractor, leading to stereo matching errors and a decrease in 3D reconstruction accuracy. This is unacceptable for applications that require measurement integrity (coverage) and consistency throughout the measured surface. For this particular issue, the simplest approach would be expanding the training datasets to cover the problem, enabling the model to acquire such capability. In this paper, we instead approach the problem by focusing on the impact of speckle variations on stereo matching networks and propose a targeted network design to address it. Specifically, we propose a normalized multi-scale spatial and channel attention network (NMSCANet). Firstly, a multi-scale spatial attention module with a strong ability to perceive distorted-speckle features is established in the feature extraction stage. Secondly, channel attention, which is capable of selecting the most important feature channels for each image to further enhance the 3D restoration of surface details, is introduced. Meanwhile, the input features are normalized with the Tanh function. Finally, the cost volume construction method of ACVNet is adopted, which allows full utilization of the extracted features. Comparative experiments indicate that our proposed network has obvious advantages in matching accuracy and 3D shape detail restoration. Meanwhile, it also demonstrates robustness to changes in speckle size (projection ratio) in comparison with the compared methods.

2. Principles

2.1 3D measurement system

In this paper, we adopt the same 3D measurement system and measurement method as Wang et al. [23]. We construct a measurement system based on binocular vision, which projects both speckles and sinusoidal fringes onto the object to be measured. The phase is then calculated using the captured fringe images, and phase matching is performed between the left and right views to obtain a high-precision disparity map. A pair of captured speckle images is used as the input for the training and testing of the model. To reduce the search range of the model for matching pixels and improve the training and prediction speed, we perform epipolar rectification [30] on the captured stereo image pairs, ensuring that corresponding points of the same object lie on the same image row (the same y-coordinate) in both the left and right views. To quantify the process of speckle size variation and better understand its impact on feature extraction, we employ a digital twinning approach to generate images under various sizes of speckle illumination. These images serve a dual purpose: aiding in the analysis of the feature evolution process and providing effectiveness tests for our constructed network. This allows us to evaluate whether our network performs better in SSPP when detecting different-sized speckles. It is worth noting that the data generated using digital twinning for different speckle sizes is solely used for testing and is not utilized in training the network. Further details are elaborated in the subsequent sections.
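As a reference, the rectification step can be sketched with OpenCV as below. This is a minimal illustration, not the authors’ implementation: the calibration matrices, distortion vectors, and the stereo rotation/translation (K1, D1, K2, D2, R, T) are placeholders standing in for the system calibration parameters.

```python
import cv2

def rectify_pair(img_l, img_r, K1, D1, K2, D2, R, T):
    """Epipolar rectification so that matching points share the same image row."""
    h, w = img_l.shape[:2]
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(
        K1, D1, K2, D2, (w, h), R, T, flags=cv2.CALIB_ZERO_DISPARITY, alpha=0)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map1x, map1y, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map2x, map2y, cv2.INTER_LINEAR)
    return rect_l, rect_r, Q  # Q can later re-project disparity to 3D points
```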

2.2 Speckle size change

In order to achieve better control over speckle size variation, we created a digital twin of our real-world sampling environment in the Blender software. This setup aligns with the approach adopted by Wang et al. [23]. We employ digital twin techniques to simulate scenarios where the projected speckle size changes. Utilizing the shading tree construction of the projector described in Ref. [31], we simulate adjustments to the projection ratio by manipulating the x-scale parameter of the first mapping node while keeping the other parameters constant. This simulation mimics real-world scenarios where the placement of the projector or a change in projection ratio leads to variations in the size of the speckle features projected onto the object. The specific procedure and the shading tree of the digital projector are illustrated in Fig. 1. We adopt projection ratios ranging from 0.25 to 1.00 with an interval of 0.05, yielding a total of 16 different speckle projection ratios. As we vary the x-scale of the first mapping node in the compositor window of Blender, the height and width of the projected pattern scale linearly. For example, when the projection ratio is set to 0.5, the width and height of the pattern are halved compared to the dimensions observed at a projection ratio of 1.0. The imaging comparison under these varying speckle sizes is shown in Fig. 2.
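The sweep over projection ratios can be scripted inside Blender. The sketch below is only illustrative and assumes the projector light and its first mapping node are named "Projector" and "Mapping"; the actual shading tree follows Ref. [31] and Fig. 1(b).

```python
import bpy

ratios = [round(0.25 + 0.05 * i, 2) for i in range(16)]  # 0.25, 0.30, ..., 1.00

projector = bpy.data.lights["Projector"]            # assumed name of the projector light
mapping = projector.node_tree.nodes["Mapping"]      # assumed name of the first mapping node

for ratio in ratios:
    mapping.inputs["Scale"].default_value[0] = ratio        # change the x-scale only
    bpy.context.scene.render.filepath = f"//speckle_{ratio:.2f}.png"
    bpy.ops.render.render(write_still=True)                 # render the current view
```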

Fig. 1. The digital twin setting and the parameter settings of the projector in the Blender software. Image (a) showcases a virtual capture scene constructed within Blender, illustrating the relative positioning of the object, camera, and projector. Image (b) depicts the shading tree of the projector, where three intermediate blue nodes map the normal direction of the projector from a 3D vector to the yz-plane. This allows direct scaling of the projected image along the x-axis. Image (c) displays the various projection scales utilized in our simulation.

Fig. 2. Images of different speckle sizes simulated by digital twin techniques in Blender. From the top left to the bottom right, the speckle size (projection ratio) changes from 0.25 to 1.00 with an interval of 0.05. The image within the red circular box is a magnified view of the nose region captured under speckle illumination. The image within the yellow rectangular box is the binarized result of a rectangular region, 20 pixels wide and high, at the tip of the nose.

In the yellow boxes of Fig. 2, we present the binary pattern obtained by binarizing a small 20$\times$20 region around the nose tip of a face model. As the projection ratio increases, the pattern projected onto the face model also enlarges. Within the yellow boxes, the pattern transitions from a complex and variable encoding (top left) to a larger area with reduced randomness (bottom right). Both the shape and size of the speckles have changed. This allows us to simulate different sizes of speckle encoding and verify the effectiveness and robustness of the model in the presence of speckle shape and size changes. It is important to note that the test images generated from these changed projection ratios are not involved in the training process; the training images are all obtained from speckle illumination at the same projection ratio.

2.3 Feature of speckle

With variations in speckle size and the modulation of speckle patterns by three-dimensional objects, the speckle features change. This section primarily investigates the relationship between changes in speckle size and the resulting changes in features. Both traditional stereo matching and deep-learning-based methods require feature extraction from the left and right images. One commonly used approach is block-wise matching with sliding windows [32,33]. In addition, shallow features in CNN-based deep learning predominantly encode local structural information [34,35], which can also be extracted using sliding-window techniques. To understand how the features change as the speckle size varies, we perform pixel-wise sliding-window detection on the images illuminated by different speckle patterns in Fig. 2, with a detection window of 3$\times$3 pixels. We identify and count the local features while eliminating duplicates. Sampling representations and the final feature counts are illustrated in Fig. 3. The blue line indicates a gradual decrease in the overall number of local 3$\times$3 features in the captured images as the speckle size increases. Moreover, due to modulation caused by certain three-dimensional shapes, smaller speckles (e.g., size 0.90) yield more extracted features. This presents a considerable challenge for the subsequent steps of stereo matching: in deep learning, if new features emerge in the test dataset due to variations in speckle size or the modulation of three-dimensional shapes, incomplete feature extraction may lead to decreased matching accuracy. However, augmenting the training dataset with variously sized illuminations entails significant costs, and it is impractical to include all speckle sizes in the training set. Thus, it becomes imperative to enhance the network’s robustness to speckle changes from the perspective of feature extraction. To gain a deeper understanding of how speckle variations affect feature extraction, we conduct more in-depth tests. We test different sizes of sliding windows and observe that as the window size increases, the impact of speckle size variations on the count of local speckle features gradually diminishes. When the sliding window reaches a size of 20$\times$20 pixels, the quantity of extracted local speckle features remains consistent across the various speckle sizes. We also depict this result in Fig. 3 using a red dashed line. The data indicate that similar quantities of local features are extracted across different speckle sizes, suggesting that increasing the window size for feature extraction could enhance the network’s robustness. However, simply enlarging the convolutional kernels in feature extraction would substantially escalate the computational load of the entire network. Hence, a more prudent approach is necessary to acquire large-scale feature information.
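The counting experiment above can be reproduced with a short script. The sketch below binarizes a region, slides an H$\times$W window in steps of (delta x, delta y), and counts the distinct local binary patterns; thresholding at the mean intensity is an assumption made only for illustration.

```python
import numpy as np

def count_unique_patches(region, h=3, w=3, dx=2, dy=2):
    """Count distinct binarized h x w patches found by a sliding window."""
    binary = (region > region.mean()).astype(np.uint8)   # simple binarization (assumed)
    patterns = set()
    for y in range(0, binary.shape[0] - h + 1, dy):
        for x in range(0, binary.shape[1] - w + 1, dx):
            patterns.add(binary[y:y + h, x:x + w].tobytes())  # duplicates collapse in the set
    return len(patterns)

# Blue curve in Fig. 3:  count_unique_patches(region, 3, 3, 2, 2)
# Red dashed line:       count_unique_patches(region, 20, 20, 3, 3)
```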

Fig. 3. Line chart illustrating the count of local features under different speckle sizes. We extract a region 300 pixels in width and height covering the nose area of the face image shown in Fig. 2 and binarize it in the same way as the yellow box in Fig. 2 (e.g., the detected image for speckle size 0.25 is shown in the center of the figure). The schematic diagram of the sliding-window process is illustrated in the top right corner, where we employ a sliding window of H$\times$W pixels, moving in steps of delta x pixels in the x-direction or delta y pixels in the y-direction. Each movement involves feature extraction, elimination of duplicate local features, and a tally of their occurrences for plotting on the line chart. The blue line shows the feature quantity for a block size of 3$\times$3 (delta x and delta y are both 2). The red dashed line represents the feature quantity for a block size of 20$\times$20 (delta x and delta y are both 3).

3. Network design

This section mainly introduces how to construct a network to overcome the feature changes mentioned above. After epipolar rectification of the image pair, stereo matching networks based on convolutional neural networks currently follow three steps: feature extraction, cost volume construction, and disparity regression. Feature extraction is the first of these steps and is of great significance: its quality directly determines the quality of the cost volume constructed later and further determines the efficiency of the subsequent disparity regression. As mentioned in Section 2.3, the feature quantity and type vary with the change of speckle sizes. Therefore, to obtain more accurate and robust matching results, the feature extraction module needs to be improved. Our approach is to introduce the normalized multi-scale spatial channel attention (NMSCA) module into the usual Conv2d-based feature extraction module to perform multi-scale attention calculation on the extracted features in space, so that the extracted features attend to speckles of different sizes and shapes. At the same time, channel attention is also calculated to allow the model to pay more attention to channels with prominent features. As a result, the stereo matching network can achieve good accuracy and robustness.

3.1 Network construction

ACVNet is a popular and highly accurate stereo matching network that innovatively combines two cost volume construction methods. It utilizes patch-match with a correlation cost volume to build spatial attention for the cost volume and then updates the concatenated cost volume using this attention. This approach absorbs the advantages of both cost volume construction methods, leading to significantly improved stereo matching accuracy. Although ACVNet has shown good reconstruction results on datasets such as KITTI [36] and SceneFlow [27], it still fails to meet our requirements of high precision and robustness in SSPP. Therefore, we introduce the NMSCA module into the feature extraction part of ACVNet and construct NMSCANet. The NMSCA module is a special convolutional layer that takes a $C\times H \times W$ feature map as input and outputs a feature map of the same shape ($C$ represents the number of channels, $H$ and $W$ represent the height and width, respectively). The process acts as a refinement of the features, so it can be adapted to any network that has a convolutional feature extraction module and needs multi-scale feature extraction ability. We append the NMSCA module after each of the three 2D convolutional layers in the feature extraction process, and also apply the NMSCA module to refine the features used for constructing the concatenated cost volume, which is shown in the green box in Fig. 4 and is formed by concatenating the outputs of the three NMSCA layers. The network structure is illustrated in Fig. 4.

Fig. 4. The overall framework of the proposed NMSCANet. The network consists of three modules: Feature Extraction, Volume Construction, and Disparity Regression. The Conv2d+NMSCA module in Feature Extraction contributes a significant improvement to reconstruction accuracy and robustness. A pair of rectified speckle images is taken as the input, and their disparity is the output of the network.

3.2 NMSCA module

The NMSCA module is derived from the lightweight convolutional block attention module (CBAM) [37,38], which was later modified by Chen et al. [39] into multi-scale spatial attention (MSCA) and applied to object recognition in remote sensing imagery. We absorb the advantages of MSCA and CBAM to build NMSCA. It includes two parts: a channel attention module and a spatial attention module. In channel attention, we introduce both average pooling and maximum pooling to ensure that channel attention can detect the significant speckle features more accurately. Meanwhile, we adopt the Tanh function to normalize the input features in order to reduce the difference between the attention maps produced by average and maximum pooling. As for the spatial attention module, multi-scale spatial convolution is employed to detect speckle features of different sizes and shapes. Meanwhile, the ReLU in traditional convolutional attention modules is replaced with Softplus to avoid too many neurons dying during training.

The calculation process of the NMSCA module is as follows: input a feature map $F^{C \times H \times W}$ and normalize it with $Tanh$ to obtain $F_{norm}^{C \times H \times W}$; use the NMSCA module to calculate an attention map $A^{C \times H \times W}$ with the same shape as the input features; then use this attention map to refine the input features and obtain a new feature map $F_{a}$ for downstream tasks. The entire process is expressed as:

$$F_{a} = F_{norm}^{C \times H \times W} + F_{norm}^{C \times H \times W} \otimes A^{C \times H \times W}$$

The attention map $A^{C \times H \times W}$ comes from two parts: spatial attention $A_s$ and channel attention $A_c$. They are expanded to the same size by channel expansion and spatial expansion, respectively; then element-wise multiplication is performed, and finally $A^{C \times H \times W}$ is obtained through the sigmoid activation function. The process is expressed below, and the module structure is shown in Fig. 5.

$$A^{C \times H \times W} = sigmoid\left( {A_{c} \otimes A_{s}} \right)$$
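Eqs. (1) and (2) amount to broadcasting the two attention maps to the feature shape and applying a residual refinement, as in the minimal sketch below (the batch dimension and implicit broadcasting stand in for the explicit channel/spatial expansion).

```python
import torch

def nmsca_refine(f_norm, a_c, a_s):
    # f_norm: (B, C, H, W) Tanh-normalized features
    # a_c:    (B, C, 1, 1) channel attention;  a_s: (B, 1, H, W) spatial attention
    a = torch.sigmoid(a_c * a_s)   # Eq. (2): element-wise product after expansion
    return f_norm + f_norm * a     # Eq. (1): residual attention refinement
```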

Fig. 5. NMSCA module. It first performs normalization and element-wise convolution using Tanh and a 1$\times$1 Conv2d, then calculates channel attention and multi-scale spatial attention and extends them over space and channels, respectively. After element-wise multiplication, the normalized features are refined with the normalized multi-scale spatial channel attention to obtain the output features.

3.3 Multi-scale spatial attention

The multi-scale spatial attention module functions to extract the spatial attention of feature maps. In the stereo matching network based on SSPP, the projector itself may project speckles of different sizes, and the size and shape of the speckles may change irregularly with the modulation of 3D objects; this causes some structural information to be lost in feature extraction and thus affects the matching accuracy. Multi-scale spatial attention is capable of both extracting speckle structural information of different sizes and strengthening speckle structural information that undergoes deformation and size changes, so as to avoid feature loss in subsequent processing. To achieve multi-scale spatial attention, we adopt dilated convolution [40] to change the receptive field of the convolution and detect features of speckles of different shapes and sizes. For the normalized feature map $F_{norm}$, we apply two successive 3$\times$3 convolutions with dilations of 1, 2, and 3, respectively, to obtain spatial attentions $A_{s1}$, $A_{s2}$, $A_{s3}$ of different scales. We then concatenate them and pass them through a 1$\times$1 convolution and the SoftPlus function to obtain the multi-scale spatial attention $A_{s}$ of the input feature map. The process is expressed below:

$$A_{s1} = {SConv}_{1}\left( {SConv}_{1}\left( F_{norm} \right) \right)$$
$$A_{s2} = {SConv}_{2}\left( {SConv}_{2}\left( F_{norm} \right) \right)$$
$$A_{s3} = {SConv}_{3}\left( {SConv}_{3}\left( F_{norm} \right) \right)$$
$$A_{s} = {SoftPlus}\left( {Conv2d}^{1 \times 1}\left( concat\left( {A_{s1},A_{s2},A_{s3}} \right) \right) \right)$$

$SConv_n$ represents the output after a 3$\times$3 convolution with dilation n followed by a SoftPlus function.
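A PyTorch sketch of this branch is given below. It follows Eqs. (3)-(6) under two assumptions not fixed by the text: the intermediate branches keep the input channel count, and the 1$\times$1 fusion reduces the concatenation to a single spatial map so that it can be expanded across channels in Eq. (2).

```python
import torch
import torch.nn as nn

class MultiScaleSpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        def sconv_pair(d):
            # SConv_d applied twice: 3x3 dilated convolution + Softplus, repeated
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d), nn.Softplus(),
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d), nn.Softplus())
        self.branch1, self.branch2, self.branch3 = sconv_pair(1), sconv_pair(2), sconv_pair(3)
        self.fuse = nn.Conv2d(3 * channels, 1, kernel_size=1)  # 1x1 fusion (single-map output assumed)
        self.act = nn.Softplus()

    def forward(self, f_norm):
        a1, a2, a3 = self.branch1(f_norm), self.branch2(f_norm), self.branch3(f_norm)
        return self.act(self.fuse(torch.cat([a1, a2, a3], dim=1)))  # A_s: (B, 1, H, W)
```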

3.4 Channel attention

The channel attention is a module that calculates attention across the channels of the features. Unlike spatial attention, channel attention focuses on the differences among channels. The feature extraction stage is trained on all image pairs in the training dataset; its purpose is to extract the various features present in the entire training dataset, which are then organized into different feature channels. Therefore, the optimal speckle features for a single stereo image pair used for disparity estimation may be concentrated in a few particular feature channels and need varying degrees of enhancement.

Similar to the spatial attention module, we first normalize the input features to obtain $F_{norm}$ and then calculate the global average pooling ($AP$) and global max pooling ($MP$), respectively. We then use a multi-layer perceptron ($MLP$) to obtain the average channel attention $A_{a}$ and the maximum channel attention $A_{m}$. Finally, these two attentions are added together and averaged to obtain the channel attention. The calculation is as follows:

$$A_{a} = MLP\left( AP\left( F_{norm} \right) \right)$$
$$A_{m} = MLP\left( MP\left( F_{norm} \right) \right)$$
$$A_{c} = \left( A_{a} + A_{m} \right)/2$$

The $MLP$ consists of two linear transformations, each followed by a SoftPlus activation function.
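A corresponding sketch of the channel branch (Eqs. (7)-(9)) is shown below; sharing one MLP between the pooled vectors and the reduction ratio r=8 are assumptions borrowed from CBAM rather than details given in the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=8):
        super().__init__()
        self.mlp = nn.Sequential(              # two linear layers, each followed by Softplus
            nn.Linear(channels, channels // r), nn.Softplus(),
            nn.Linear(channels // r, channels), nn.Softplus())

    def forward(self, f_norm):
        b, c, _, _ = f_norm.shape
        a_avg = self.mlp(f_norm.mean(dim=(2, 3)))     # A_a from global average pooling
        a_max = self.mlp(f_norm.amax(dim=(2, 3)))     # A_m from global max pooling
        return ((a_avg + a_max) / 2).view(b, c, 1, 1)  # A_c, Eq. (9)
```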

3.5 Loss function

In NMSCANet, we adopt the smoothL1Loss as the stereo matching loss function:

$$smoothL1\left( {x,target} \right) = \left\{ \begin{matrix} {0.5(x - target)^{2},\left( \left| {x - target} \right| < 0.5 \right)} \\ {\left| {x - target} \right| - 0.5,~\left( \left| {x - target} \right| \geq 0.5 \right)} \\ \end{matrix} \right.$$
where $x$ is the predicted disparity value and $target$ is the GT disparity value.
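A direct implementation of Eq. (10) as printed (threshold 0.5) is sketched below; PyTorch's built-in torch.nn.SmoothL1Loss provides the standard variant and could be substituted.

```python
import torch

def smooth_l1(pred_disp, target_disp, thresh=0.5):
    diff = torch.abs(pred_disp - target_disp)
    loss = torch.where(diff < thresh, 0.5 * diff ** 2, diff - 0.5)  # Eq. (10)
    return loss.mean()
```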

4. Experiment and analysis

Here, we primarily introduce the experiments we designed, the datasets, the experimental configurations, and the evaluation metrics. We conduct experiments to qualitatively and quantitatively analyze the accuracy and robustness of our proposed method in SSPP. Additionally, we compare our method with recently proposed approaches. Furthermore, we conduct ablation experiments to validate the effectiveness of the individual components (multi-scale spatial attention and channel attention) of our NMSCA module. Finally, validity experiments are designed to confirm that the improvements in accuracy and robustness achieved by our proposed method are not merely due to an increase in parameter count or network depth.

4.1 Data set

Given the primary focus of this work on 3D reconstruction using single-frame speckle projection, existing public datasets such as SceneFlow [27] and KITTI [36] lack speckle projection data. Consequently, a binocular stereo vision device was constructed for speckle projection, employing FPP to generate datasets [23]. Additionally, a digital twin in Blender was used to produce simulated images to augment the training set. We obtained 465 pairs of diverse masks ($Data_{m}$) and 153 pairs of real human faces ($Data_{r}$) in the real scene, and 1680 pairs of distinct face models ($Data_{h}$) in Blender. Notably, 417 pairs of $Data_{m}$ and 1600 pairs of $Data_{h}$ were used for training, with the remainder reserved for testing.

Additionally, to verify the robustness of our network to changes in speckle size, we select 49 3D models that did not participate in the generation of the training dataset to generate a dataset with different sizes of projected speckles in Blender. Each 3D model has 16 pairs of images generated under different sizes of speckle projection; in total, 784 (49$\times$16) pairs of images ($Data_{s}$) are prepared for testing. The speckle size is adjusted by changing the projection ratio of the virtual projector (shown in Fig. 1). The 16 projection ratios vary from 0.25 to 1.00 with an interval of 0.05.

In summary, the training dataset of our network consists of 1600 pairs of images from $Data_{h}$ and 417 pairs from $Data_{m}$. The testing datasets include 80 pairs from $Data_{h}$, 48 pairs from $Data_{m}$, 153 pairs from $Data_{r}$, and 784 pairs from $Data_{s}$. The ratio of the training dataset to the testing dataset is approximately 2:1. To better visualize our training and testing datasets, we list the data used in Table 1:

Table 1. Data sets used in our training and testing procedure

It is worth mentioning that the size of each image in the above datasets is 1280$\times$1024. The speckle size projected in $Data_{m}$ and $Data_{h}$ is the same, and equals the speckle size of $Data_{s}$ at a projection ratio of 0.65.

4.2 Experiment design & configurations

We select three stereo matching networks as our benchmark models. The first is the classic but effective PSMNet [21], which has a pyramid feature extraction module and an hourglass 3D convolution regression module; it is chosen because it possesses strong multi-scale spatial feature detection capabilities similar to NMSCANet. The second network is ACVNet [28], which serves as our backbone network. It focuses on optimizing cost volume construction by integrating two different cost volume construction methods, resulting in decent matching results; however, improvements are still required for the task of speckle stereo matching. The third network is the recently published DCSMNet [23], the latest and most effective stereo matching network for SSPP at the time of preparing this paper; this comparison aims to highlight the advantages of our NMSCANet in SSPP. In the experiments, the training sets of $Data_{m}$ and $Data_{h}$ are used for training. We cut each image into four equal horizontal parts (256$\times$1280) for training, while the main part of the face or mask is cropped out (688$\times$768) for testing.

The learning rate is set to 0.001, the number of training epochs to 146, and the batch size to 3. The training loss function is SmoothL1. The training and testing environment is Windows 10, PyTorch, and an NVIDIA 3090 GPU.
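For reference, a hedged sketch of this training configuration follows. Only the learning rate, epoch count, batch size, and loss come from the text; the optimizer (Adam) and the (left, right, disparity) batch interface are assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=146, lr=0.001, device="cuda"):
    """Train a stereo network with the settings of Sec. 4.2 (optimizer choice assumed)."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # Adam is an assumption
    for _ in range(epochs):
        for left, right, gt_disp in loader:                   # rectified 256x1280 crops, batch size 3
            left, right, gt_disp = left.to(device), right.to(device), gt_disp.to(device)
            pred = model(left, right)                         # assumed two-view forward interface
            loss = F.smooth_l1_loss(pred, gt_disp)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```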

4.3 Evaluation metrics

We evaluate the accuracy and robustness of the proposed method from the perspectives of disparity and point cloud. In terms of disparity, we choose two evaluation metrics: the n-pixel error rate ($N$-$PER$) and the end-point error ($EPE$). In terms of the point cloud, two evaluation metrics are adopted: the mean nearest-neighbor distance ($Mean~dis$) and the distance standard deviation ($Std$). Numerically, the smaller these metrics are, the higher the disparity prediction accuracy and the more robust the model is to speckle feature changes in shape and size. Specifically, we utilize $Std$ as an evaluation metric to assess the model’s stability when confronted with varying speckle sizes or shapes. A smaller $Std$ indicates higher accuracy in predicting disparity. In a single pair of images under one speckle size, variations in speckle size and shape are present as a consequence of the modulation by the complex surface of the tested model; therefore, a decrease in $Std$ indicates that the model exhibits greater stability when confronted with different speckle features within a given scene. Furthermore, the range over which $Std$ changes when altering the speckle size can also serve as an indicator of the model’s reconstruction stability: a smaller range of $Std$ changes in response to different speckle sizes signifies a more stable network. By evaluating the variations in these metrics during speckle size alterations, we can gain insight into the network’s robustness and its ability to consistently generate accurate reconstructions.

$$N-PER = \frac{1}{n}{\sum_{i = 1}^{n}\left\lbrack {\left( {\left| {{pre}_{i} - {gt}_{i}} \right| > N} \right)\ \&\ \left( \frac{\left| {{pre}_{i} - {gt}_{i}} \right|}{{gt}_{i}} > thres_{N} \right)} \right\rbrack}$$
$$EPE = \frac{1}{n}{\sum_{i = 1}^{n}\left| {pre}_{i} - {gt}_{i} \right|}$$
$$Mean~dis = \frac{1}{m}{\sum_{i = 1}^{m}{dis\left( {est}_{i},{gt}_{i} \right)}}$$
$$Std = \left\lbrack {\frac{1}{m}{\sum_{i = 1}^{m}\left( {dis\left( {est}_{i},{gt}_{i} \right) - Mean~dis} \right)^{2}}} \right\rbrack^{\frac{1}{2}}$$

The $thres_{N}$ is 0.01, 0.02, or 0.05 when $N$ is 0.5, 1, or 3, respectively. The $dis$ represents the Euclidean distance between a point in the reconstructed point cloud and its nearest neighbor in the GT point cloud.
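The four metrics can be computed as in the sketch below, where disparity maps are NumPy arrays, point clouds are (M, 3) arrays, and the nearest GT neighbor is found with a KD-tree; valid (positive) GT disparities are assumed for the relative-error term.

```python
import numpy as np
from scipy.spatial import cKDTree

def epe(pre, gt):
    return np.mean(np.abs(pre - gt))

def n_per(pre, gt, n, thres_n):
    err = np.abs(pre - gt)
    bad = (err > n) & (err / gt > thres_n)   # e.g. n=3, thres_n=0.05 for 3-PER
    return np.mean(bad)

def cloud_stats(est_points, gt_points):
    dist, _ = cKDTree(gt_points).query(est_points)   # nearest GT point for each estimate
    return dist.mean(), dist.std()                   # Mean dis and Std
```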

4.4 Visual analysis of results

In order to validate the effectiveness and stability of our designed network in the face of speckle size variations, we conduct comparison experiments on $Data_s$. We obtain the networks’ matching results under various speckle sizes and use the inferred disparity maps for three-dimensional point cloud reconstruction, which allows us to visually observe the matching results through point clouds. We select and arrange three representative speckle sizes and exhibit them in Fig. 6. The projection ratios corresponding to the three speckle sizes are 0.35, 0.7, and 1, representing small, normal, and large speckles, respectively. Additionally, we reconstruct the GT point cloud using the high-precision disparity maps obtained from FPP and calculate $Mean~dis$ and the distance $Std$ between the reconstructed point clouds of each network and the GT. The calculation results and point cloud visualizations are shown in Fig. 7. It can be clearly seen from the reconstructed point clouds in Fig. 6 and Fig. 7 that NMSCANet’s overall reconstruction results are smoother, more complete, and more detailed than those of the other three networks. Moreover, when the speckle size is enlarged or reduced to a larger extent, NMSCANet still maintains good reconstruction results, while the results of the compared networks degrade visibly and sharply. At the same time, when the speckle size changes, NMSCANet’s low-error area, which is more biased towards green, covers a larger area, whereas the yellow area, standing for larger errors, of the three comparison networks occupies a larger area when the speckle size is enlarged or shrunk, and obvious speckle-like errors appear, proving that these networks fail to extract complete speckle features, resulting in obvious errors in 3D reconstruction. This proves that NMSCANet is more robust to speckle changes.

Fig. 6. The reconstructed point clouds of three speckle sizes and GT. The first four columns of this figure show the point cloud results of each model under different speckle projections, with the speckle sizes from top to bottom being 0.35, 0.70, and 1.00 respectively. The fifth column shows the GT point cloud and the sixth column shows the corresponding input left images. The text under each image indicates two evaluation indicators for disparity $0.5PER$/$EPE$.

Fig. 7. The nearest neighbor distance of point clouds corresponding to Fig. 6. The first four columns of this figure show the results of the nearest neighbor distance between the reconstructed point clouds and the GT point clouds under different sizes of speckles. The closer the predicted points are to the GT points, the greener the color is. The fifth column shows the color map corresponding to different distances and the image generated by the model under white ambient light. The text under each image indicates two evaluation indicators for the reconstructed point clouds $Mean~dis$ / $Std$.

In order to verify the network’s inference capability in real-world scenarios, we also perform point cloud reconstruction and distance evaluation against the GT on the testing sets of $Data_{m}$ and $Data_{r}$ captured in the real scene. The results are displayed in Fig. 8 and Fig. 9. From the figures, we can see that NMSCANet has the smoothest and closest-to-GT global reconstruction result: its green region covers the largest area and is more complete. Additionally, from the zoomed-in area in Fig. 8, we can clearly identify that NMSCANet still enjoys the best reconstruction performance in areas with small abrupt changes, such as the edges of the eyes and the eyelashes, and yields relatively smaller $Mean~dis$ and $Std$ than the other models. This also implies that NMSCANet is more robust to the speckle shape changes caused by 3D surface modulation.

Fig. 8. 3D reconstruction results of the tested model, including the corresponding left input speckle image (bottom right) and the point cloud distance of $Data_{m}$. The first row of the first four columns represents the reconstruction results of the various networks when inputting real mask images. The second row shows the nearest-neighbor distance between the reconstructed point clouds and the GT point clouds for each model. The three pictures in the fifth column represent the GT point cloud, the input left image, and an image of the mask under white ambient light, respectively. The text beneath the sub-images has the same meaning as in Fig. 7.

Fig. 9. 3D reconstruction results of the tested model, including the corresponding left input speckle image (bottom right) and the point cloud distance of $Data_{r}$. The first row of the first four columns represents the reconstruction results of the various networks when inputting real face images. The second row shows the nearest-neighbor distance between the reconstructed point clouds and the GT point clouds for each model. The three pictures in the fifth column represent the GT point cloud, the input left image, and an image of the face under white ambient light, respectively. The text beneath the sub-images has the same meaning as in Fig. 7.

4.5 Quantitative analysis of results

In addition to visualizing the reconstruction results to evaluate the superiority of the network, we also quantify the differences in matching accuracy and robustness among the different networks by calculating $Mean~dis$ and $Std$ between the predicted point clouds and the ground-truth point clouds. These data are annotated below the reconstructed point clouds in Fig. 7, Fig. 8 and Fig. 9, allowing for better observation and correlation with the visual effects. From the metrics shown in the three figures, we can conclude that NMSCANet has the smallest $Mean~dis$ and $Std$ in the reconstruction of speckles of all sizes. Meanwhile, in order to visualize the matching accuracy and matching errors of each network under all speckle sizes more intuitively, we calculate the disparity evaluation metrics for each network under the different speckle pattern sizes and plot them in Fig. 10. For each speckle size, we have 49 pairs of input images; we calculate $0.5PER$, $1PER$, $3PER$, and $EPE$ for each pair and average them for each speckle size. From the plots, it is easy to observe that our NMSCANet achieves the best matching results under all tested speckle sizes. In particular, we list the matching results of all models at speckle size 0.75, which lies within the optimal interval [0.65, 0.75], in Table 2. The results show that the $EPE$ of our model is reduced by over 35%, and $0.5PER$, $1PER$, and $3PER$ are reduced by over 62%, 65%, and 69%, respectively. However, the prediction time (0.3864 s) is increased by 33% in comparison with ACVNet (0.2895 s), though it is slightly faster than PSMNet (0.3954 s).

Fig. 10. Comparison results of different models on $Data_{s}$. The x-axis of the four charts represents the size of the speckle projected in Blender, and the y-axis of (a), (b), (c), and (d) represents the $0.5PER$, $1PER$, $3PER$, and $EPE$, respectively.

Table 2. Disparity comparison under speckle size 0.75

From Figs. 10(a)-(d), we can conclude that NMSCANet maintains the best prediction results and the lowest error under different speckle sizes. Meanwhile, as the speckle size changes (the minimum and maximum of the metrics are shown on the right side of the y-axis), NMSCANet’s accuracy suffers the smallest change and always remains within a minimal range, with pixel error rates all below 15% and $EPE$ below 3.16 pixels, while the networks involved in the comparison show a significant drop in accuracy when the speckle size changes from the optimal interval [0.65, 0.75] towards either side. According to the $Mean~dis$ and $Std$ data of the point clouds shown in Fig. 7, it can be concluded that when the speckle changes to a larger extent, our method reduces the $Mean~dis$ of the reconstructed point clouds by over 93% and improves the overall error stability (see $Std$) by over 8 times.

Additionally, in order to provide a more standard quantitative analysis of the matching accuracy and robustness, we conduct experiments using a standard sphere (diameter: 100 mm) and a plane (dimensions: 100 mm $\times$ 80 mm), placed 862 mm away from the projector, under all sizes of speckles. The results for all 16 speckle sizes and all models are illustrated in Fig. 11. From these sub-figures, it is evident that our NMSCANet (red circle marker in the plots) consistently achieves the highest matching accuracy (the smallest $EPE$ and $N$-$PER$) and maintains a minimal change range across all speckle sizes. For a more detailed comparison, we specifically select the matching results obtained under four different speckle sizes (0.25, 0.45, 0.65, 0.85) to reconstruct the point clouds. As shown in Fig. 12, we visualize the nearest-neighbor differences between the reconstructed point clouds and the GT to compare the overall reconstruction accuracy of the different networks under those speckle sizes. This allows us to investigate the robustness of our network to size changes in speckle patterns. It can be concluded that our NMSCANet yields the best reconstruction results. Furthermore, we quantify the $Mean~dis$ and $Std$ and annotate them below the figures to provide a more accurate assessment of the reconstruction capabilities of each network. The results show that our method exhibits the smallest change range in terms of $Mean~dis$ (0.0614 mm - 0.4066 mm) and $Std$ (0.0768 mm - 0.7367 mm) compared to the other methods.

Fig. 11. Comparison results of different models on the sphere and plane. The x-axis of the four charts represents the size of the speckle projected in Blender, and the y-axis of (a), (b), (c), and (d) represents the $0.5PER$, $1PER$, $3PER$, and $EPE$, respectively.

Fig. 12. The nearest-neighbor distance of the point clouds of the tested sphere (diameter: 100 mm) and plane (size: 100 mm $\times$ 80 mm). The first four columns show the distance between the reconstructed point cloud and the GT point cloud under different sizes of speckles. The closer the predicted points are to the GT points, the greener the color. The fifth column shows the corresponding input images of the left camera. The sixth column shows the color map corresponding to different distances and the image generated by the model under white ambient light. The text under each image indicates two evaluation indicators for the reconstructed point clouds: $Mean~dis$ / $Std$.

To analyze these results, DCSMNet lacks the ability to detect multi-scale and multi-shape speckle features, so it obtains the worst results when confronted with speckle size changes. Although PSMNet adopts multi-scale pooling to obtain multi-scale speckle features, it lacks channel attention to optimize features for each image, so its accuracy is still not satisfactory. ACVNet mainly relies on cost volume construction to make full use of the features but does not include a feature extraction module efficient for SSPP.

4.6 Ablation experiments

To evaluate the effectiveness of each component in our proposed NMSCA module, we conduct ablation experiments to check the impact of the multi-scale spatial attention mechanism and the channel attention mechanism on the network’s performance and efficiency. We train ablated networks by removing the multi-scale spatial attention module and the channel attention module from NMSCANet, respectively; these networks are denoted as Net A and Net B. The settings of the compared networks are listed in Table 3, and the experimental results are shown in Fig. 13.

Fig. 13. Comparison results of the ablation experiments. The x-axis of the four charts represents the size of the speckle projected in Blender, and the y-axis of (a), (b), (c), and (d) represents the $0.5PER$, $1PER$, $3PER$, and $EPE$, respectively. Partially enlarged results for speckle sizes from 0.5 to 0.8 are shown in the subplots at the top right.

Table 3. Ablation experiment models settings

From Fig. 13, we can see that the method proposed in this paper significantly boosts the matching accuracy of stereo matching networks. When using either multi-scale spatial attention or channel attention, the network’s robustness to speckle size variations is greatly enhanced, and the overall accuracy of the network reaches its peak when both function together. Surprisingly, the performance improvement from channel attention in speckle variation scenarios is on par with the stability improvement from multi-scale spatial attention in handling different speckle patterns. We attribute this to the fact that, in the feature extraction process of multiple convolutional layers, different channels are trained to capture different speckle features. With the inclusion of channel attention, Net A is able to optimize the feature channels and select the few channels that are most suitable for each image pair, resulting in more accurate feature representation; this optimization leads to improved global matching accuracy for Net A. We further conduct a comparative analysis of the point cloud reconstruction results, as shown in Fig. 14, where we select eight speckle-illuminated images at regular intervals for reconstruction. Each row in the figure represents the reconstruction results of one network model under the different speckle illuminations. Comparing the second and third rows, Net A, with its channel attention mechanism that optimizes the feature channels for different image pairs, significantly improves the matching results; from a global perspective, Net A achieves higher reconstruction accuracy than Net B. However, it can be observed that even though Net A has lower $Mean~dis$ and $Std$, its point cloud reconstructions visually tend to have more speckle-like protrusions under speckle illuminations ranging from 0.3 to 0.5 and from 0.8 to 1. On the other hand, Net B, with its ability to detect multi-scale speckle features, allows for better local matching and results in smoother point cloud reconstruction. The fourth row demonstrates that the NMSCANet reconstruction results combine the advantages of both Net A and Net B: it achieves the reconstruction accuracy of Net A while also achieving the smoother and more accurate point cloud reconstruction of Net B.

Fig. 14. The comparative results of the point cloud reconstruction in the ablation experiments. Each row represents the reconstruction results of one model under speckle illuminations ranging from 0.3 to 1.0. The last column shows the GT point cloud of the tested object. The text under each image indicates two evaluation indicators for the reconstructed point clouds: $Mean~dis$ / $Std$.

In summary, both channel attention and multi-scale spatial attention contribute to the performance improvement of NMSCANet in terms of disparity accuracy, and they are equally important. Channel attention has the ability to optimize feature channels, resulting in significant improvement in global accuracy. On the other hand, multi-scale spatial attention focuses on the variations of local speckle features, leading to smoother point cloud reconstruction. By combining channel attention and multi-scale spatial attention, our network achieves high accuracy globally and produces good reconstruction results locally.

4.7 Validity experiments

Considering that our NMSCA module inevitably increases the number of parameters in the model, we also design validity experiments to verify whether the improved accuracy and robustness of NMSCANet stem from our proposed method or merely from the deepening of the model or an increase in parameters. Two basic blocks [41] are added to the original feature extraction of ACVNet, making its parameter count slightly larger than NMSCANet’s. We denote this deepened network as ACVDeepNet. The comparison results of ACVNet, NMSCANet, and ACVDeepNet on $Data_{s}$ are given in Table 4.

Table 4. Validity experiments results

From Table 4, we can see that NMSCANet has a similar number of parameters to ACVDeepNet, even slightly fewer by about 30,000 parameters. However, our network outperforms ACVDeepNet in both the point cloud and disparity evaluation metrics. It is evident that the superior matching performance of NMSCANet does not stem from increased network parameters or network depth, but rather from better network module design.

5. Conclusions

In summary, this paper investigates the relationship between speckle size variations in SSPP and feature extraction. It addresses the accuracy degradation caused by changes in speckle size and shape by introducing the NMSCA module into the network design. The spatial attention effectively extracts and enhances the features of speckles of different shapes and sizes to prevent the loss of local structural information in subsequent processing. On the channel side, NMSCA optimally selects speckle features for each input image pair, enhancing the most structurally rich feature channels and further helping the model reach high global matching accuracy. This paper compares NMSCANet with several existing competitive stereo matching networks, showing that the disparity $N$-$PER$ is reduced by over 62% and that the stability (seen from $Std$) of the reconstructed point cloud’s nearest-point distance is more than 8 times higher than that of the compared networks. The ablation and validity experiments also show the effectiveness of our proposed method. However, our proposed network is 33% slower than ACVNet in inference speed; further research is desirable to improve the prediction speed while maintaining accuracy.

Funding

National Natural Science Foundation of China (62101364, 61901287); China Postdoctoral Science Foundation (2021M692260); Key Research and Development Project of Sichuan Province (2021YFG0195, 2022YFG0053); The central government guides local funds for science and technology development (2022ZYD0111).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Zhang, “High-speed 3d shape measurement with structured light methods: A review,” Opt. Lasers Eng. 106, 119–131 (2018). [CrossRef]  

2. Z. Ma and S. Liu, “A review of 3d reconstruction techniques in civil engineering and their applications,” Adv. Eng. Inform. 37, 163–174 (2018). [CrossRef]  

3. Z. Sun, Y. Jin, M. Duan, et al., “3-d measurement method for multireflectivity scenes based on nonlinear fringe projection intensity adjustment,” IEEE Trans. Instrum. Meas. 70, 1–14 (2021). [CrossRef]  

4. Y. Guo, Z. Duan, Z. Zhang, et al., “Fast and accurate 3d face reconstruction based on facial geometry constraints and fringe projection without phase unwrapping,” Opt. Lasers Eng. 159, 107216 (2022). [CrossRef]  

5. S. Zhang, “Absolute phase retrieval methods for digital fringe projection profilometry: A review,” Opt. Lasers Eng. 107, 28–37 (2018). [CrossRef]  

6. C. Zuo, T. Tao, S. Feng, et al., “Micro fourier transform profilometry (uftp): 3d shape measurement at 10,000 frames per second,” Opt. Lasers Eng. 102, 70–91 (2018). [CrossRef]  

7. W. Yin, C. Zuo, S. Feng, et al., “High-speed three-dimensional shape measurement using geometry-constraint-based number-theoretical phase unwrapping,” Opt. Lasers Eng. 115, 21–31 (2019). [CrossRef]  

8. J. Wang, Y. Zhou, and Y. Yang, “A novel and fast three-dimensional measurement technology for the objects surface with non-uniform reflection,” Results Phys. 16, 102878 (2020). [CrossRef]  

9. H. Nguyen, T. Tran, Y. Wang, et al., “Three-dimensional shape reconstruction from single-shot speckle image using deep convolutional neural networks,” Opt. Lasers Eng. 143, 106639 (2021). [CrossRef]  

10. R. Kulkarni and P. Rastogi, “Optical measurement techniques - a push for digitization,” Opt. Lasers Eng. 87, 1–17 (2016). [CrossRef]  

11. W. Yin, C. Zuo, S. Feng, et al., “An opencl-based speckle matching on the monocular 3d sensor using speckle projection,” in 4th International Conference on Photonics and Optical Enginerring, vol. 11761 of Proceedings of SPIEJ. She, ed. (2021).

12. X. Ma, “Binocular vision three-dimensional imaging technology by using structural light projection,” in MATEC Web of Conferences, vol. 227 (2018), p. 02006.

13. W. Yin, L. Cao, H. Zhao, et al., “Real-time and accurate monocular 3d sensor using the reference plane calibration and an optimized sgm based on opencl acceleration,” Opt. Lasers Eng. 165, 107536 (2023). [CrossRef]  

14. M. Duan, B. Wang, Y. Zheng, et al., “Long-depth-volume 3d measurements using multithread monocular fringe projection profilometry,” Proc. SPIE 12169, 12169B9 (2022). [CrossRef]  

15. M. Zhang, Q. Chen, T. Tao, et al., “Robust and efficient multi-frequency temporal phase unwrapping: optimal fringe frequency and pattern sequence selection,” Opt. Express 25(17), 20381–20400 (2017). [CrossRef]  

16. L. Zhang, Q. Chen, C. Zuo, et al., “High dynamic range and real-time 3d measurement based on a multi-view system,” in Proceedings of 2nd Target Recognition and Artificial Intelligence Summit Forum, vol. 11427 (2020).

17. C. Yu, F. Ji, J. Xue, et al., “Adaptive binocular fringe dynamic projection method for high dynamic range measurement,” Sensors 19(18), 4023 (2019). [CrossRef]  

18. W. Yin, S. Feng, T. Tao, et al., “High-speed 3d shape measurement using the optimized composite fringe patterns and stereo-assisted structured light system,” Opt. Express 27(3), 2411 (2019). [CrossRef]  

19. P. Zhou, J. Zhu, and J. Hailong, “Optical 3-d surface reconstruction with color binary speckle pattern encoding,” Opt. Express 26(3), 3452–3465 (2018). [CrossRef]  

20. A. Kendall, H. Martirosyan, S. Dasgupta, et al., “End-to-end learning of geometry and context for deep stereo regression,” in International Conference on Computer Vision (IEEE, 2017), pp. 66–75.

21. J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), pp. 5410–5418.

22. G. Xu, H. Zhou, and X. Yang, “Cgi-stereo: Accurate and real-time stereo matching via context and geometry interaction,” arXiv, arXiv:2301.02789 (2023). [CrossRef]  

23. R. Wang, P. Zhou, and J. Zhu, “Accurate 3d reconstruction of single-frame speckle-encoded textureless surfaces based on densely connected stereo matching network,” Opt. Express 31(9), 14048–14067 (2023). [CrossRef]  

24. W. Yin, Y. Hu, S. Feng, et al., “Single-shot 3d shape measurement using an end-to-end stereo matching network for speckle projection profilometry,” Opt. Express 29(9), 13388–13407 (2021). [CrossRef]  

25. G. Xu, X. Wang, X. Ding, et al., “Iterative geometry encoding volume for stereo matching,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023), pp. 21919–21928.

26. X. Guo, K. Yang, W. Yang, et al., “Group-wise correlation stereo network,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 3268–3277.

27. N. Mayer, E. Ilg, P. Häusser, et al., “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 4040–4048.

28. G. Xu, J. Cheng, P. Guo, et al., “Attention concatenation volume for accurate and efficient stereo matching,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 12971–12980.

29. W. Yin, C. Zuo, S. Feng, et al., “An end-to-end speckle matching network for 3D imaging,” in SPIE/COS Photonics Asia (2020).

30. R. Hartley and R. Gupta, “Computing matched-epipolar projections,” in Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 1993), pp. 549–555.

31. F. Wang, C. Wang, and Q. Guan, “Single-shot fringe projection profilometry based on deep learning and computer graphics,” Opt. Express 29(6), 8024 (2021). [CrossRef]  

32. X. Mei, X. Sun, M. Zhou, et al., “On building an accurate stereo matching system on graphics hardware,” in International Conference on Computer Vision Workshops (IEEE, 2011), pp. 467–474.

33. C. Barnes, E. Shechtman, A. Finkelstein, et al., “Patchmatch: A randomized correspondence algorithm for structural image editing,” ACM Trans. Graph. 28(3), 1–11 (2009). [CrossRef]  

34. B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature 381(6583), 607–609 (1996). [CrossRef]  

35. S. Zhang, M. Wu, Y. Wu, et al., “Fixed window aggregation ad-census algorithm for phase-based stereo matching,” Appl. Opt. 58(32), 8950–8958 (2019). [CrossRef]  

36. A. Geiger, P. Lenz, C. Stiller, et al., “Vision meets robotics: The kitti dataset,” The Int. J. Robotics Res. 32(11), 1231–1237 (2013). [CrossRef]  

37. S. Woo, J. Park, J.-Y. Lee, et al., “Cbam: Convolutional block attention module,” in Computer Vision – ECCV 2018, vol. 11211 of Lecture Notes in Computer Science (2018), pp. 3–19.

38. Y. Li, F. Luo, and C. Xiao, “Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module,” Comp. Visual Media 8(4), 631–647 (2022). [CrossRef]  

39. J. Chen, L. Wan, J. Zhu, et al., “Multi-scale spatial and channel-wise attention for improving object detection in remote sensing imagery,” IEEE Geosci. Remote Sensing Lett. 17(4), 681–685 (2020). [CrossRef]  

40. Z. Zhang, X. Wang, and C. Jung, “Dcsr: Dilated convolutions for single image super-resolution,” IEEE Trans. on Image Process. 28(4), 1625–1635 (2019). [CrossRef]  

41. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 770–778.

Figures (14)

Fig. 1. The digital twin setting and the parameter settings of the projector in Blender. Image (a) showcases a virtual capture scene constructed within Blender, illustrating the relative positioning between the object, camera, and projector. Image (b) depicts the shading tree of the projector, where three intermediate blue nodes map the normal direction of the projector from a 3D vector to the yz-plane, which allows direct scaling of the projected image along the x-axis. Image (c) displays the various projection scales utilized within our simulation.
Fig. 2. Images of different speckle sizes simulated by digital twin techniques in Blender. From the top left to the bottom right, the speckle size (projection ratio) changes from 0.25 to 1.00 with an interval of 0.05. The image within the red circular box is a magnified view of the nose region captured under speckle illumination. The image within the yellow rectangular box is the binarized result of a rectangular region 20 pixels in width and height, representing the tip of the nose.
Fig. 3. Line chart illustrating the count of local features under different speckle sizes. We extract a region 300 pixels in width and height covering the nose area in the face image shown in Fig. 2 and apply the same binarization as in the yellow box of Fig. 2 (e.g., the detected image for a speckle size of 0.25 is shown in the center of the figure). The schematic diagram of the window sliding process is illustrated in the top right corner, where we employ a sliding window of H$\times$W pixels, moving in steps of $\Delta x$ pixels in the x-direction or $\Delta y$ pixels in the y-direction. Each movement involves feature extraction, eliminating duplicate local features, and tallying their occurrences for plotting on the line chart. The blue line shows the feature quantity for a block size of 3$\times$3 ($\Delta x$ and $\Delta y$ are both 2). The red dashed line represents the feature quantity for a block size of 20$\times$20 ($\Delta x$ and $\Delta y$ are both 3). A minimal procedural sketch of this counting process is given after the figure list.
Fig. 4. The overall framework of the proposed NMSCANet. The network comprises three modules: Feature Extraction, Volume Construction, and Disparity Regression. The Conv2d+NMSCA module in Feature Extraction contributes a significant improvement in reconstruction accuracy and robustness. A pair of rectified speckle images serves as the input, and their disparity is the output of the network.
Fig. 5. NMSCA module. It first performs normalization and element-wise convolution using Tanh and a 1$\times$1 Conv2d, then calculates channel attention and multi-scale spatial attention and expands them over space and channels, respectively. After element-wise multiplication, the normalized features are finally refined with the Normalized Multi-Scale Spatial Channel Attention to obtain the output features.
Fig. 6. The reconstructed point clouds for three speckle sizes and the GT. The first four columns show the point cloud results of each model under different speckle projections, with speckle sizes from top to bottom of 0.35, 0.70, and 1.00. The fifth column shows the GT point cloud, and the sixth column shows the corresponding input left images. The text under each image indicates two disparity evaluation indicators, $0.5PER$/$EPE$.
Fig. 7. The nearest neighbor distance of point clouds corresponding to Fig. 6. The first four columns of this figure show the results of the nearest neighbor distance between the reconstructed point clouds and the GT point clouds under different sizes of speckles. The closer the predicted points are to the GT points, the greener the color is. The fifth column shows the color map corresponding to different distances and the image generated by the model under white ambient light. The text under each image indicates two evaluation indicators for the reconstructed point clouds $Mean~dis$ / $Std$.
Fig. 8. 3D reconstruction result of the tested models for $Data_{m}$, including the corresponding left input speckle image (bottom right) and the point cloud distance. The first row of the first four columns represents the reconstruction results of the various networks when inputting real mask images. The second row shows the nearest neighbor distance between the reconstructed point clouds and the GT point clouds for each model. The three pictures in the fifth column represent the GT point cloud, the input left image, and an image of the mask under white ambient light, respectively. The text beneath the sub-images has the same meaning as in Fig. 7.
Fig. 9. 3D reconstruction result of the tested models for $Data_{r}$, including the corresponding left input speckle image (bottom right) and the point cloud distance. The first row of the first four columns represents the reconstruction results of the various networks when inputting real mask images. The second row shows the nearest neighbor distance between the reconstructed point clouds and the GT point clouds for each model. The three pictures in the fifth column represent the GT point cloud, the input left image, and an image of the mask under white ambient light, respectively. The text beneath the sub-images has the same meaning as in Fig. 7.
Fig. 10. Comparison results of different models on $data_{s}$. The x-axis of the four charts represents the size of the speckle projected in Blender, and the y-axis of (a), (b), (c), and (d) represents the $0.5PER$, $1PER$, $3PER$, and $EPE$, respectively.
Fig. 11. Comparison results of different models on the sphere and plane. The x-axis of the four charts represents the size of the speckle projected in Blender, and the y-axis of (a), (b), (c), and (d) represents the $0.5PER$, $1PER$, $3PER$, and $EPE$, respectively.
Fig. 12. The nearest neighbor distance of the point clouds of the tested sphere (diameter: 100 mm) and plane (size: 100 mm $\times$ 80 mm). The first four columns show the distance between the reconstructed point cloud and the GT point cloud under different sizes of speckles. The closer the predicted points are to the GT points, the greener the color is. The fifth column shows the corresponding input images of the left camera. The sixth column shows the color map corresponding to different distances and the image generated by the model under white ambient light. The text under each image indicates two evaluation indicators for the reconstructed point clouds, $Mean~dis$/$Std$.
Fig. 13. Comparison results of the ablation experiments. The x-axis of the four charts represents the size of the speckle projected in Blender, and the y-axis of (a), (b), (c), and (d) represents the $0.5PER$, $1PER$, $3PER$, and $EPE$, respectively. Partially enlarged results for speckle sizes from 0.5 to 0.8 are shown in the subplots at the top right.
Fig. 14. Comparative results of the point cloud reconstruction in the ablation experiments. Each row represents the reconstruction results of a model under speckle illuminations ranging from 0.3 to 1.0. The last column shows the GT point cloud of the tested object. The text under each image indicates two evaluation indicators for the reconstructed point clouds, $Mean~dis$/$Std$.
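
The counting procedure described in the caption of Fig. 3 reduces to binarizing the speckle region, sliding an H$\times$W window in steps of $\Delta x$ and $\Delta y$, and counting only the distinct binary patterns encountered. The Python sketch below illustrates this logic; the global mean threshold and the byte-string patch signature are illustrative assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of the sliding-window local-feature counting used to produce Fig. 3.
# A binarized speckle patch of size h x w slides over the region in steps of (dx, dy);
# duplicate binary patterns are discarded and the number of unique patterns is tallied.
import numpy as np


def count_unique_patterns(image, h, w, dx, dy, threshold=None):
    """Count distinct binary h x w patterns found while sliding over `image`."""
    if threshold is None:
        threshold = image.mean()          # simple global threshold as a stand-in
    binary = (image > threshold).astype(np.uint8)

    seen = set()
    rows, cols = binary.shape
    for y in range(0, rows - h + 1, dy):
        for x in range(0, cols - w + 1, dx):
            patch = binary[y:y + h, x:x + w]
            seen.add(patch.tobytes())     # hashable signature of the local feature
    return len(seen)


# Example (blue curve of Fig. 3): 3 x 3 patterns with a step of 2 pixels over a
# 300 x 300 crop of the binarized nose region.
# n_features = count_unique_patterns(region, h=3, w=3, dx=2, dy=2)
```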

Tables (4)

Table 1. Data sets used in our training and testing procedure

Table 2. Disparity comparison under speckle size 0.75

Table 3. Ablation experiment models settings

Table 4. Validity experiments results

Equations (14)


$F_a = F_{norm}^{C \times H \times W} + F_{norm}^{C \times H \times W} \otimes A^{C \times H \times W}$
$A^{C \times H \times W} = \mathrm{sigmoid}\left(A_c \otimes A_s\right)$
$A_{s1} = SConv_1\left(SConv_1\left(F_{norm}\right)\right)$
$A_{s2} = SConv_2\left(SConv_2\left(F_{norm}\right)\right)$
$A_{s3} = SConv_3\left(SConv_3\left(F_{norm}\right)\right)$
$A_s = \mathrm{SoftPlus}\left(\mathrm{Conv2d}_{1 \times 1}\left(\mathrm{concat}\left(A_{s1}, A_{s2}, A_{s3}\right)\right)\right)$
$A_a = \mathrm{MLP}\left(\mathrm{AP}\left(F_{norm}\right)\right)$
$A_m = \mathrm{MLP}\left(\mathrm{MP}\left(F_{norm}\right)\right)$
$A_c = \left(A_a + A_m\right)/2$
$\mathrm{smooth}_{L_1}(x, target) = \begin{cases} 0.5\,(x - target)^2, & |x - target| < 0.5 \\ |x - target| - 0.5, & |x - target| \ge 0.5 \end{cases}$
$N\text{-}PER = \frac{1}{n}\sum_{i=1}^{n}\left[\left(|pre_i - gt_i| > N\right) \,\&\, \left(\frac{|pre_i - gt_i|}{gt_i} > thres_N\right)\right]$
$EPE = \frac{1}{n}\sum_{i=1}^{n}\left|pre_i - gt_i\right|$
$Mean~dis = \frac{1}{m}\sum_{i=1}^{m} dis\left(est_i, gt_i\right)$
$Std = \left[\frac{1}{m}\sum_{i=1}^{m}\left(dis\left(est_i, gt_i\right) - Mean~dis\right)^2\right]^{\frac{1}{2}}$
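
For readers who prefer code to notation, the following PyTorch-style sketch mirrors the attention computation defined above: the Tanh + 1$\times$1 Conv2d normalization, the three twice-applied spatial branches $SConv_{1}$–$SConv_{3}$, the SoftPlus-fused spatial map $A_s$, the pooled-MLP channel term $A_c$, and the residual refinement of $F_{norm}$. The kernel sizes, dilation rates, single-channel spatial map, and MLP reduction ratio are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of the NMSCA attention computation (assumed hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F


class NMSCA(nn.Module):
    def __init__(self, channels, reduction=8, dilations=(1, 2, 4)):
        super().__init__()
        # Normalization branch: Tanh followed by a 1x1 convolution.
        self.norm_conv = nn.Conv2d(channels, channels, kernel_size=1)
        # Multi-scale spatial branches: each SConv_k is applied twice (A_s1..A_s3).
        # SConv_k is assumed here to be a depthwise 3x3 convolution with a branch-specific dilation.
        self.spatial_branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, groups=channels)
            for d in dilations
        ])
        # 1x1 fusion of the concatenated branches to a single spatial map (A_s).
        self.spatial_fuse = nn.Conv2d(len(dilations) * channels, 1, kernel_size=1)
        # Shared MLP for the channel term (A_c) over average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        f_norm = self.norm_conv(torch.tanh(x))                        # F_norm
        branches = [conv(conv(f_norm)) for conv in self.spatial_branches]
        a_s = F.softplus(self.spatial_fuse(torch.cat(branches, 1)))   # A_s: B x 1 x H x W
        a_a = self.mlp(F.adaptive_avg_pool2d(f_norm, 1))              # MLP(AP(F_norm))
        a_m = self.mlp(F.adaptive_max_pool2d(f_norm, 1))              # MLP(MP(F_norm))
        a_c = (a_a + a_m) / 2                                         # A_c: B x C x 1 x 1
        attn = torch.sigmoid(a_c * a_s)                               # A, broadcast to B x C x H x W
        return f_norm + f_norm * attn                                 # F_a
```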
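
Likewise, the loss and evaluation measures above translate directly into a few NumPy lines; the default values of N and thres_N below are placeholders, not the paper's reported settings.

```python
# Sketch of the smooth L1 loss and the evaluation metrics defined above.
import numpy as np


def smooth_l1(x, target):
    """Smooth L1 with a 0.5 transition point."""
    diff = np.abs(x - target)
    return np.where(diff < 0.5, 0.5 * diff ** 2, diff - 0.5)


def n_per(pre, gt, n=1.0, thres_n=0.05):
    """Fraction of pixels whose absolute error exceeds n pixels AND whose
    relative error exceeds thres_n (the N-PER outlier rate)."""
    err = np.abs(pre - gt)
    return ((err > n) & (err / gt > thres_n)).mean()


def epe(pre, gt):
    """End-point error: mean absolute disparity error."""
    return np.abs(pre - gt).mean()


def point_cloud_stats(dist):
    """Mean dis and Std of the nearest-neighbour distances dis(est_i, gt_i)."""
    mean_dis = dist.mean()
    std = np.sqrt(np.mean((dist - mean_dis) ** 2))
    return mean_dis, std
```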