
Real-time 3D shape measurement of dynamic scenes using fringe projection profilometry: lightweight NAS-optimized dual frequency deep learning approach

Open Access

Abstract

Achieving real-time and high-accuracy 3D reconstruction of dynamic scenes is a fundamental challenge in many fields, including online monitoring, augmented reality, and so on. On one hand, traditional methods, such as Fourier transform profilometry (FTP) and phase-shifting profilometry (PSP), struggle to balance measurement efficiency and accuracy. On the other hand, deep learning-based approaches, which offer the potential for improved accuracy, are hindered by large numbers of parameters and complex structures that are less amenable to real-time requirements. To solve this problem, we propose a neural architecture search (NAS)-based method for real-time processing and 3D measurement of dynamic scenes at a rate equivalent to single-shot. A NAS-optimized lightweight neural network is designed for efficient phase demodulation, while an improved dual-frequency strategy is employed in coordination for flexible absolute phase unwrapping. The experimental results demonstrate that our method can effectively perform 3D reconstruction at a speed of 58 fps, and realize high-accuracy measurement of dynamic scenes based on deep learning for what we believe to be the first time, with an average RMS error of about 0.08 mm.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Due to the advantages of non-contact operation, low cost, and high accuracy, 3D shape measurement methods based on fringe projection profilometry (FPP) [1–4] have been widely used in computer vision, biomedical engineering, and other fields [5–7]. In some applications, such as industrial online detection, it is important to perform 3D shape measurement of dynamic scenes with low latency, which challenges the accuracy and real-time capability of traditional FPP techniques.

The conventional 3D reconstruction process of FPP consists of four key steps: fringe pattern modulation, phase demodulation, phase unwrapping, and 3D calibration. To meet the demands of dynamic scenes, these steps need to be adapted and optimized for improved accuracy and real-time performance. Fourier transform profilometry (FTP) [8,9] and phase-shifting profilometry (PSP) [10,11] are the most common approaches to demodulate the wrapped phase from deformed fringe patterns. PSP-based methods utilize the least squares algorithm (LSA) and usually require at least three frames to demodulate the phase value. Such methods are robust to surface variation and non-uniform reflection but sensitive to motion-induced errors in the measurement of dynamic scenes [12]. FTP-based methods adopt a properly designed bandpass filter to separate the desired components from the frequency domain of the fringe pattern. The influence of object motion can be avoided radically since only one fringe frame is required. However, due to the influence of the filtering operations, it is difficult for FTP-based methods to perform high-accuracy measurement on complex surfaces [13].

In FPP, to eliminate the ambiguity of the demodulated wrapped phase, phase unwrapping must be carried out, which can mainly be divided into spatial phase unwrapping (SPU) [13–16] and temporal phase unwrapping (TPU) [17–19]. SPU methods usually determine the phase unwrapping path by analyzing the neighborhood of each pixel, or optimize an absolute phase by minimizing the difference of phase gradients. They are usually implemented from a single wrapped phase and are efficient for measurement. However, constrained by the Itoh condition [20], they are not suitable for locally isolated discontinuous regions. On the other hand, TPU methods can offer robust pixel-wise results with the aid of additional wrapped phases. As the number of frames required for wrapped phase calculation increases, TPU methods have to compromise on temporal resolution [18].

In order to overcome the difficulties of traditional methods in measuring dynamic scenes, researchers have developed various strategies [3,21–26] to reduce the number of required projected fringe patterns. One intuitive approach is to exploit redundant information to replace the original phase-shifting fringe patterns with fewer images: for example, Zuo's (3 + 2) method [22] shares the same background intensity between two sets of phase-shifting fringe patterns, and the (2 + 2) method [23] uses ramp patterns with a linear variation of intensity to compute the phase. However, motion-induced errors in phase demodulation still exist, since the height of the same point on the tested object may change between frames. Another approach is to modulate multiple fringe patterns with different carrier frequencies along orthogonal directions and composite them into one pattern image [25,26]. The corresponding wrapped phases can be separated from the spectrum of the composite fringe pattern by FTP and unwrapped by TPU. Although frequency multiplexing can significantly improve the efficiency of FPP, the spectrum aliasing problem inherent to FTP makes this approach unreliable for high-accuracy measurement.

Compared with traditional methods, many recently developed deep learning-based FPP methods have achieved excellent performance in various aspects, such as fringe enhancement [27], phase unwrapping [28–31], high dynamic range measurement [32–34] and fringe analysis [35–38]. For dynamic scene measurement with fewer patterns, Sam Van der Jeught et al. [39] constructed an end-to-end convolutional neural network (CNN) that transforms one fringe pattern to depth directly. This one-stage approach can predict depth without any additional intermediate processing steps, but it is difficult for the network to learn an accurate mapping between fringe intensity and height distribution, especially for complex surfaces. Nguyen et al. [40] designed a fringe-to-fringe CNN that separates multiple phase-shifted grayscale fringe patterns from a composite color-coded pattern, which are further used for PSP and TPU. Li et al. utilized a dual-frequency composite approach [41] and a frequency multiplexing approach [42], in which the numerator and denominator of the LSA and a coarse phase map are predicted from a single composite fringe pattern, and the unwrapped phase is obtained by TPU. Such multi-stage approaches mainly follow the traditional FPP pipeline and incorporate deep learning in the phase demodulation step for higher accuracy. However, because of the large number of parameters and the complexity of the network structures, a single inference is time-consuming, making real-time processing difficult for these methods.

Based on the neural architecture search (NAS) technique [43,44], a modified dual-frequency heterodyne fringe projection profilometry method is proposed in this paper for real-time, high-accuracy 3D measurement. For phase demodulation, a U-shaped architecture is selected as the backbone network, and the down- and up-sampling layers are replaced by corresponding cell architectures, which are searched from a space containing suitable primitive operations. For phase unwrapping, the pre-computed phase of a reference plane is employed to enable a heterodyne method with two high-frequency patterns. After proper training of the searched architecture, the wrapped phase can be recovered from two sinusoidal fringe patterns rapidly and robustly with a small number of network parameters. By reusing patterns from adjacent frames in dynamic measurement, each newly captured fringe pattern updates one corresponding 3D result. Experimental results show that our method can achieve a reconstruction rate of up to 58 frames per second (fps) and realize real-time high-accuracy measurement of moving objects.

2. Principle

2.1 Basic principle of fringe projection profilometry (FPP)

A typical monocular FPP system consists of a camera and a projector. Fringe patterns are projected onto the target object by the projector, and the distorted fringe patterns, which contain the desired depth information of the object's surface, are captured by the camera. A sinusoidal fringe pattern can be represented as:

$$I(x,y) = A(x,y) + B(x,y)\cos \phi(x,y).$$
where (x, y) denotes the pixel coordinate, A is the background intensity, B is the intensity modulation, and ϕ is the desired phase map. In PSP, the captured N-step phase-shifted fringe patterns can be expressed as:
$$I_n(x,y) = A(x,y) + B(x,y)\cos [\phi(x,y) - \delta_n(x,y)],$$
$$\delta_n(x,y) = 2\pi (n - 1)/N,\quad n = 1,2,\ldots,N.$$
where $\delta_n$ is the phase shift with index n. The phase map and intensity modulation can be calculated by the LSA:
$$\phi(x,y) = \arctan \frac{M(x,y)}{D(x,y)} = \arctan \frac{\sum\limits_{n = 1}^{N} I_n(x,y)\sin \delta_n(x,y)}{\sum\limits_{n = 1}^{N} I_n(x,y)\cos \delta_n(x,y)},$$
$$B(x,y) = \frac{2}{N}\sqrt{M(x,y)^2 + D(x,y)^2}.$$
where M(x,y) and D(x,y) represent the numerator and denominator of the arctangent function, respectively. Owing to the property of the arctangent function, the phase obtained from Eq. (4) is wrapped within (−π, π], and a phase unwrapping process must be performed to obtain a continuous phase distribution. The relationship between the wrapped phase ϕ and the unwrapped phase Φ can be represented as:
$$\Phi ({x,y} )= \phi ({x,y} )+ 2\pi k({x,y} ).$$
where k is the integer fringe order. The traditional dual-frequency TPU method [45] utilizes the heterodyne technique to construct an unambiguous phase that guides the phase unwrapping process. The equivalent phase $\phi_{eq}$, constructed as the difference between the high-frequency phase $\phi_h$ and the low-frequency phase $\phi_l$, can be expressed as:
$$\phi_{eq}(x,y) = \phi_h(x,y) - \phi_l(x,y).$$
where the beat frequency is $f_{eq} = f_h - f_l$, and the high frequency fh and the low frequency fl should be selected properly to ensure that $\phi_{eq}$ is unambiguous within the whole measurement field of view (FOV). The fringe orders kl and kh corresponding to $\phi_l$ and $\phi_h$ can then be calculated by:
$$k_l(x,y) = Round\left[ \frac{(f_l/f_{eq})\,\phi_{eq}(x,y) - \phi_l(x,y)}{2\pi} \right],$$
$$k_h(x,y) = Round\left[ \frac{(f_h/f_{eq})\,\phi_{eq}(x,y) - \phi_h(x,y)}{2\pi} \right].$$
where $Round[\cdot]$ denotes rounding to the nearest integer. Finally, the 3D shape can be recovered from the unwrapped phase using pre-calibrated parameters based on triangulation.
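As a concrete illustration of Eqs. (4)-(9), the following NumPy sketch (hypothetical function names; array shapes are assumptions) demodulates the wrapped phase from N phase-shifted patterns and determines the fringe orders from a pair of frequencies whose beat phase is unambiguous over the FOV:

import numpy as np

def demodulate(images):
    # images: (N, H, W) stack of N-step phase-shifted fringe patterns
    N = images.shape[0]
    delta = 2 * np.pi * np.arange(N) / N                      # phase shifts, Eq. (3)
    M = np.tensordot(np.sin(delta), images, axes=1)           # numerator of Eq. (4)
    D = np.tensordot(np.cos(delta), images, axes=1)           # denominator of Eq. (4)
    phi = np.arctan2(M, D)                                    # wrapped phase
    B = 2.0 / N * np.sqrt(M ** 2 + D ** 2)                    # modulation, Eq. (5)
    return phi, B

def fringe_orders(phi_l, phi_h, f_l, f_h):
    # Traditional dual-frequency heterodyne TPU, Eqs. (7)-(9), assuming the beat
    # frequency f_eq = f_h - f_l gives an equivalent phase unambiguous over the FOV.
    f_eq = f_h - f_l
    phi_eq = np.mod(phi_h - phi_l, 2 * np.pi)                 # equivalent phase, Eq. (7)
    k_l = np.round(((f_l / f_eq) * phi_eq - phi_l) / (2 * np.pi))  # Eq. (8)
    k_h = np.round(((f_h / f_eq) * phi_eq - phi_h) / (2 * np.pi))  # Eq. (9)
    return k_l, k_h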

While traditional FPP techniques have demonstrated their efficacy in 3D shape measurement, they often struggle to achieve both high accuracy and real-time processing. On the other hand, although deep learning-based FPP methods have shown great potential for high-accuracy reconstruction with few fringe patterns, the manual design of neural architectures is often heuristic and time-consuming. Therefore, we seek to advance 3D shape measurement by combining traditional FPP with NAS-assisted deep learning, offering real-time processing capability and enhanced accuracy in phase demodulation.

2.2 Basic principle of neural architecture search (NAS)

Since the advent of deep learning techniques, numerous successful neural architectures have been introduced in the field of CNNs, such as ResNet [46], UNet [47], and so on. However, the design of a neural architecture is heuristic in most cases and relies heavily on the prior knowledge of the designer. As a result, NAS was developed to find the optimal architecture automatically with minimal human intervention.

NAS involves three key components [44]: the search space, the search strategy, and the performance estimation strategy. These components collaborate with each other, as illustrated in Fig. 1(a), to accomplish the task of searching for an optimal architecture. The search process is similar to the parameter optimization in usual CNN training, except that it seeks an optimized architecture rather than optimized network weights. The search strategy defines how to explore the search space, which typically includes predefined primitive operations and architectural hyperparameters, to construct a candidate architecture. The candidate architecture found by the search strategy is then quantitatively analyzed by the performance evaluation strategy to obtain a score on the target task. The performance information is then fed back to the search strategy to adjust its inner parameters for the next search. The whole process is repeated until a given termination condition is satisfied. Once the final optimal architecture is found, it is retrained as a normal neural network and applied to practical inference.

Fig. 1. (a) General flowchart of NAS. (b) Flowchart of NAS with cell-based search space and gradient descent search strategy.

To search for a proper architecture, one direct approach is to start from scratch and employ reinforcement learning or recurrent neural networks [46] to discretely search among all possible combinations of candidate operations, which implies a very large search space and inefficient adjustment of the neural architecture. Rather than searching for all components of the neural architecture, we adopt the cell-based search space of differentiable architecture search (DARTS) [48]. The best cell architectures are searched first, and the overall neural architecture is constructed by repeatedly stacking the cells on the backbone network. In this way, existing well-proven neural architectures can be reused, greatly reducing the search cost.

As shown in Fig. 1(b), every cell in the cell-based search space can be regarded as a directed acyclic graph (DAG) consisting of a topologically ordered sequence of nodes. There are two input nodes and one output node in each cell, where the input nodes are the outputs of the two previous cells, and the results of all intermediate nodes are concatenated to generate the cell output. Each intermediate node has a latent representation xi to be learned. Each edge in the DAG is a mixed operation $\bar{o}^{(i,j)}(x)$ over the candidate primitive set that transforms xi to compose xj, which can be expressed as a weighted sum of all possible operations $o(x)$:

$$\bar{o}^{(i,j)}(x) = \sum\limits_{o \in O} w_o^{(i,j)} \cdot o(x)$$
where O represents the search space and $w_o^{(i,j)}$ indicates the weight of operation o; in DARTS, w is the softmax of the architecture parameters of all operations, while in the one-shot search strategy it is the constant 1. Considering its lower computational resource requirements and reduced search time compared to multi-shot search strategies, the one-shot strategy ProxylessNAS [49], a low-memory-consumption optimized version of differentiable architecture search, is selected to guide the selection of models. After the training of the architecture parameters is completed, a compact architecture can be derived by pruning redundant paths. By stacking the searched cell architectures on the backbone network, the final neural network is obtained.
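For illustration, the continuous relaxation of Eq. (10) can be written as a small PyTorch module; the candidate set below is only a toy example, and the actual primitives used in this work are listed in Table 1:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    # Weighted sum of candidate operations on one edge of the cell DAG, Eq. (10)
    def __init__(self, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)                      # candidate primitive operations
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))   # architecture parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)                          # DARTS: weights are a softmax over alphas
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# assumed toy candidate set, for illustration only
candidates = [nn.Conv2d(16, 16, 3, padding=1),
              nn.Conv2d(16, 16, 3, padding=2, dilation=2),        # dilated convolution
              nn.Identity()]
edge = MixedOp(candidates)
out = edge(torch.rand(1, 16, 64, 64))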

NAS explores a wide range of architectures and discovers patterns and configurations that may not have been considered by human designers, which has the potential to find more reliable and robust architectures for fringe analysis and 3D reconstruction. Besides, during the continuous optimization of the architecture, NAS eliminates unnecessary operations and identifies more efficient configurations, which enables the identification of complex and effective combinations of operations that accurately capture and represent features of fringe patterns. In summary, NAS can revolutionize fringe analysis by automating the search for optimal neural architectures. By leveraging its ability to explore vast search spaces, refine architectures, and optimize computational efficiency, NAS can offer improved reliability, accuracy, and speed with limited model size, ultimately enhancing the effectiveness of fringe analysis and reconstruction processes. For this reason, NAS is the first choice of our neural architecture in this research work.

2.3 NAS-optimized dual frequency real-time 3D reconstruction method

As discussed before, traditional methods face challenges in achieving high-accuracy 3D reconstruction with a limited number of fringe patterns. Meanwhile, the inference times of most deep learning-based methods struggle to meet the requirements of real-time 3D measurement in dynamic scenes, especially when dealing with high-resolution images. To address this problem, we propose an efficient phase demodulation scheme based on a NAS-optimized lightweight neural network for accurate fringe analysis, which combines high resolution, high information utilization, and low computational density, together with a modified dual-frequency heterodyne phase unwrapping strategy that allows higher fringe frequencies and thus provides more information for feature extraction in the neural network. The diagram of the proposed NAS-based real-time 3D reconstruction method is shown in Fig. 2.

Fig. 2. Proposed NAS-optimized dual frequency real-time 3D reconstruction method. Steps 1 to 4 represent NAS-assisted phase demodulation, wrapped phase calculation, reference plane-assisted dual-frequency heterodyne phase unwrapping, and 3D shape reconstruction, respectively.

Step 1: Feed the low-frequency and high-frequency distorted fringe patterns separately to the well-trained NAS-optimized lightweight network to obtain the numerator M and denominator D of the arctangent function in Eq. (4) for the corresponding frequency. As introduced in Section 2.1, the traditional dual-frequency heterodyne method has to construct a unit beat frequency, which limits its application in high-precision real-time 3D reconstruction. More specifically, it exhibits poor noise resistance at high frequencies, which makes it prone to phase unwrapping failures. Besides, existing neural networks for fringe analysis often demonstrate lower accuracy when dealing with single-frame low-frequency fringes, primarily because spectral overlapping makes it difficult to extract reliable information. Therefore, two frequencies whose beat frequency is larger than 1 are selected for higher fringe frequencies and more robust phase unwrapping, and the unwrapping of the resulting equivalent phase is handled with the phase of a reference plane. In this paper, we choose fl = 56 and fh = 64 to realize phase unwrapping.

Step 2: Calculate the wrapped phases $\phi_h$ and $\phi_l$ according to Eq. (4).

Step 3: Calculate the unwrapped phases $\Phi_h$ and $\Phi_l$ with the assistance of the precomputed unambiguous phase of a reference plane $\Phi_r$. The beat frequency chosen for the proposed method is set larger than 1, which means we can use a pair of frequencies such as 56 and 64 to modulate the object's information with more phase variation, resulting in higher accuracy. Before the measurement, a reference plane is placed at the maximum measurement depth, and its unwrapped phase is recovered using the conventional method. During the measurement, the phase of the reference plane is used to assist the unwrapping of the equivalent phase obtained by heterodyning, which further eliminates the ambiguity of the low- and high-frequency phases. The reference plane is introduced to remove the requirement of a unit frequency in the traditional three-frequency phase unwrapping method without increasing the error amplification rate during phase unwrapping, enabling equivalent single-shot processing. As shown in Fig. 3, since the equivalent phase obtained by subtracting the phases of the selected frequencies is still wrapped with the frequency $f_{eq} = f_h - f_l = 8$, an unwrapped phase of the reference plane with the same frequency is employed to eliminate this ambiguity. The plane is placed at the maximum depth of the measurement range, and its unwrapped phase $\Phi_r$ of frequency $f_{eq}$ is stored in advance, so the unwrapped equivalent phase can be solved as:

$$\Phi_{eq} = \phi_{eq} + 2\pi \cdot Round\left[ \frac{\Phi_r - \phi_{eq}}{2\pi} \right]$$

Fig. 3. Proposed dual-frequency heterodyne phase unwrapping method assisted by the reference plane. (a) Equivalent phase obtained by subtracting the phase of low and high frequencies. (b) Unwrap the equivalent phase assisted by the reference plane. (c) Unwrap the wrapped phase of high frequency.

Furthermore, we can perform phase unwrapping for both high-frequency and low-frequency wrapped phases using the following equations:

$$\Phi_h = \phi_h + 2\pi \cdot Round\left[ \frac{(f_h/f_{eq})\,\Phi_{eq} - \phi_h}{2\pi} \right]$$
$$\Phi_l = \phi_l + 2\pi \cdot Round\left[ \frac{(f_l/f_{eq})\,\Phi_{eq} - \phi_l}{2\pi} \right]$$

Step 4: Reconstruct 3D shapes from the unwrapped phases using the traditional calibration method.

The predictions of M and D for the low- and high-frequency fringes share the same well-trained neural network, which is searched by NAS to implement phase demodulation with low latency and high accuracy. By cyclically projecting the dual-frequency fringes, a high-precision reconstruction result can be obtained with each newly captured fringe pattern. Since the difference between the frequencies 56 and 64 is small, the accuracy of the reconstruction results obtained at the two frequencies is also similar. From a temporal perspective, this approach achieves single-shot 3D reconstruction.
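The reference plane-assisted unwrapping of Step 3 (Eqs. (11)-(13)) can be summarized in a short NumPy sketch (hypothetical function and variable names, assuming the wrapped phases of the two frequencies and the stored reference-plane phase are available as arrays of the same size):

import numpy as np

def unwrap_with_reference(phi_l, phi_h, Phi_r, f_l=56, f_h=64):
    # phi_l, phi_h: wrapped phases of the two frequencies; Phi_r: stored reference-plane phase
    f_eq = f_h - f_l                                              # beat frequency, here 8
    phi_eq = np.mod(phi_h - phi_l, 2 * np.pi)                     # wrapped equivalent phase, Eq. (7)
    # Eq. (11): remove the ambiguity of the equivalent phase with the reference plane
    Phi_eq = phi_eq + 2 * np.pi * np.round((Phi_r - phi_eq) / (2 * np.pi))
    # Eqs. (12)-(13): unwrap the high- and low-frequency wrapped phases
    Phi_h = phi_h + 2 * np.pi * np.round(((f_h / f_eq) * Phi_eq - phi_h) / (2 * np.pi))
    Phi_l = phi_l + 2 * np.pi * np.round(((f_l / f_eq) * Phi_eq - phi_l) / (2 * np.pi))
    return Phi_l, Phi_h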

The U-shaped architecture illustrated in Step 1 of Fig. 2 is chosen as the backbone network to construct the final network with various cells, as it is the most commonly used architecture in works combining FPP and deep learning due to its proven effectiveness in diverse areas, such as fringe analysis and fringe enhancement. Inspired by NAS-UNet [50] in medical image segmentation, the original UNet's convolutional layers are substituted with cell structures, integrating the down- and up-sampling operations into these cells, as shown in Fig. 4(a). In the contracting path, the down-sampling cells are traversed three times to capture features at different levels. Conversely, the expansive path employs symmetric up-sampling cells to restore spatial information from the contracting path's multi-level features. Notably, the skip connection is considered a learnable operation within the up-sampling cell. Lastly, a 1 × 1 convolutional operation precedes the final up-sampling cell, reducing the filter dimensionality to 2 for phase demodulation.

Fig. 4. (a) U-shaped backbone of the NAS network. The green arrow “identity” represents passing the feature maps directly to the next cell, as all the down- and up-sampling operations are completed inside the cells. The orange arrow “connection” represents the learnable operations in up-sampling cells corresponding to the skip connection in UNet. (b) Cell architecture of down- and up-sampling cells, where the red arrows are searched from different primitive operations in these two types of cells.

The cell architecture is shown in Fig. 4(b). As introduced in Section 2.2, each cell includes two input nodes and one output node. The input nodes receive the outputs of the two previous cells through 1 × 1 convolutions that keep the dimensions consistent. In the down-sampling cell, Cellk-1 and Cellk-2 represent the two previous cells. In the up-sampling cell, Cellk-1 indicates the previous cell and Cellk-2 is the corresponding cell from the horizontal connection. To construct the input of the first down-sampling cell, the input fringe pattern is passed to a regular 1 × 1 convolution and a 3 × 3 convolution with a stride of 2 and padding of 1. The operations adjacent to the input nodes (marked by the red lines) either reduce the spatial resolution and double the dimension of the feature map, or increase the resolution and halve the dimension. The former is called the Down operation, and the latter the Up operation. The normal operations (marked by the green lines) are used to extract features while keeping the spatial resolution unchanged. The concatenate operations (marked by the blue lines) collect all the feature maps from the intermediate nodes and pass them to the next cell.
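The data flow inside a down-sampling cell can be sketched as follows (a simplified PyTorch stand-in with assumed operation choices; the real Down and normal operations on the red and green edges are determined by the search):

import torch
import torch.nn as nn

class DownCell(nn.Module):
    # Simplified down-sampling cell: two input nodes, intermediate nodes, concatenated output
    def __init__(self, c_prev2, c_prev1, c):
        super().__init__()
        self.pre2 = nn.Conv2d(c_prev2, c, 1)                      # 1x1 conv to align channel dimensions
        self.pre1 = nn.Conv2d(c_prev1, c, 1)
        self.down2 = nn.Conv2d(c, 2 * c, 3, stride=2, padding=1)  # placeholder for a searched Down operation
        self.down1 = nn.Conv2d(c, 2 * c, 3, stride=2, padding=1)
        self.normal = nn.Conv2d(2 * c, 2 * c, 3, padding=1)       # placeholder for a searched normal operation

    def forward(self, x_prev2, x_prev1):
        # assumes both inputs share the same spatial resolution
        s = self.down2(self.pre2(x_prev2)) + self.down1(self.pre1(x_prev1))  # first intermediate node
        n = self.normal(s)                                                    # second intermediate node
        return torch.cat([s, n], dim=1)                                       # concatenate intermediate results

cell = DownCell(16, 16, 16)
out = cell(torch.rand(1, 16, 128, 128), torch.rand(1, 16, 128, 128))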

In order to realize real-time and high-accuracy processing with the searched neural network, our approach focuses on selecting candidate primitive operations that prioritize two critical factors: large receptive fields and fewer parameters. By emphasizing a large receptive field, the network can effectively capture and analyze a broader context, enabling robust feature extraction and a more comprehensive understanding of complex visual patterns. Simultaneously, reducing the number of parameters helps optimize computational efficiency, allowing for faster inference and real-time performance, which usually means a better capability to process high-resolution images. The selected operations are listed in Table 1, in which the Up and Down operations are searchable in the up- and down-sampling cells, respectively. The 'up' and 'down' in the operation type mean that the operation is derived from the corresponding 'normal' operation by setting the stride of its first convolution or transposed convolution to 2.


Table 1. Candidate primitive set of normal, up and down operations. The depth Conv indicates depth-wise separable convolution. The SE, ECA and SGE indicate squeeze-and-excitation block, efficient channel attention block and spatial group-wise enhance block respectively

In addition to the original convolution, the depth-wise separable convolution [51] is selected for its computational efficiency. The purpose of introducing dilated convolution [52] is to increase the receptive field without increasing the number of parameters, since high-resolution input is usually required for high-accuracy FPP.

Because not all regions in the captured deformed fringes contain valid object information, attention mechanisms [53] are included in the candidate operations to selectively focus on informative features and improve the overall performance of the final model, as shown in Fig. 5. The squeeze-and-excitation (SE) block [54] learns a set of weights through squeeze and excitation operations and emphasizes the most useful features by assigning the learned weights to the channel dimension. The efficient channel attention (ECA) block, which uses a lightweight convolution across channels to implement local channel attention with fewer parameters, is added to the candidate operation set to explore the possibility of achieving lower latency. The spatial group-wise enhance (SGE) block is selected for a similar reason. The main idea of SGE is to group feature maps and treat each group as representing a semantic feature. By leveraging the similarity between local and global features, an attention mask is generated to guide the enhanced spatial distribution of semantic features; the attention factor is determined by the similarity between global and local features within each group. As a result, SGE is highly lightweight.
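Of these, ECA is particularly compact; a typical implementation applies global average pooling followed by a small 1D convolution across channels (a reference sketch, not necessarily the exact configuration used in the searched cells):

import torch
import torch.nn as nn

class ECABlock(nn.Module):
    # Efficient channel attention: local cross-channel interaction with a lightweight convolution
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x):
        # x: (B, C, H, W) -> per-channel descriptor (B, C, 1, 1)
        y = self.pool(x)
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        return x * torch.sigmoid(y)                               # reweight the channels

attn = ECABlock()
out = attn(torch.rand(1, 16, 128, 128))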

Fig. 5. Schematic diagram of the attention blocks. (a) squeeze-and-excitation (SE) block. (b) Efficient channel attention (ECA) block. (c) Spatial group-wise enhance (SGE) block.

In addition, some other commonly used operations, such as identity, max pooling, and average pooling, are also included in the search space. All convolution operations have a kernel size of 3 × 3, and the kernel sizes of max pooling and average pooling are 2 × 2. A combination of batch normalization [55] and ReLU [56] is placed after each convolution-type operation.

We estimate the performance of the searched cells using the following simple criterion:

$$L = \frac{1}{2}\left[ (M_p - M_g)^2 + (D_p - D_g)^2 \right]$$
where L is the L2-norm loss on the predicted arctangent terms for phase demodulation [35], and the subscripts p and g denote the prediction and the ground truth, respectively. Once the cell architectures are determined, the final network architecture is constructed on the backbone network and retrained using the same loss function.

By exploring different architectural choices, NAS can identify more efficient operations and structures for processing fringe patterns. This results in the model requiring fewer parameters and calculations, leading to faster inference times of phase demodulation and lower resource requirements. Meanwhile, combined with the reference plane-assisted dual-frequency heterodyne phase unwrapping strategy, real-time and high-accuracy 3D reconstruction can be performed in complex measurement scenes.

3. Experiment

3.1 Data acquisition

The experimental setup of the FPP system contained a camera (Baumer VCXU-31M) with a resolution of 1280 × 800 pixels and a projector (TI DLP4500) with a resolution of 912 × 1140 pixels. The objective lens of the camera had a focal length of 12 mm. The exposure times of the camera and the projector were both set to 10 ms. The angle between the optical axes of the camera and the projector was about 16.32°, which was obtained from the triangular stereo calibration process. The field of view of the system was about 300 × 150 mm2, and the working distance was about 600 mm. The schematic diagram of the experimental FPP system is shown in Fig. 6(a).

Fig. 6. (a) Schematic diagram of the experimental FPP system. (b) Collection of the training and validation data. The 1st column shows the different measured objects, including toys, masks, sculptures, etc. The 2nd and 3rd columns show the ground-truth numerator and denominator of the arctangent function at the low frequency 56, which are obtained from 12-step phase-shifting fringe patterns. The 4th and 5th columns show the corresponding data at the high frequency 64.

Twelve-step phase-shifting fringe patterns of frequencies 56 and 64 were projected to generate the ground truth of the numerator and denominator for the phase demodulation task and to build the dataset for neural network searching, training, and validation. As shown in Fig. 6(b), we collected 100 sets of data in different scenes, including objects with diffuse surfaces such as dolls, masks, sculptures, etc., and divided them into training data and validation data with a split ratio of 9:1. The training set was used for architecture search and fixed neural network optimization. The validation set, which was never seen by the model, was used to evaluate the performance of the neural network. To avoid unreliable learning in low-modulation regions, the threshold of the average intensity was set to 10. All fringe patterns were normalized by dividing by 255 (the maximum intensity of an 8-bit grayscale image) before searching and training. A cutoff transform, which randomly selects a square region with a width of 64-256 pixels and sets the pixels within the region to zero, was applied during searching and training as a data augmentation technique to improve the generalization and robustness of the model, where the random numbers are sampled from a uniform distribution.
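The normalization and the cutoff augmentation can be reproduced roughly as follows (a NumPy sketch with hypothetical function names and assumed array shapes):

import numpy as np

def normalize(fringe):
    # fringe: (H, W) 8-bit fringe pattern -> float values in [0, 1]
    return fringe.astype(np.float32) / 255.0

def cutoff(fringe, rng=np.random.default_rng()):
    # zero out a randomly placed square region with a side length of 64-256 pixels
    h, w = fringe.shape
    size = int(rng.integers(64, 257))
    y0 = int(rng.integers(0, max(1, h - size)))
    x0 = int(rng.integers(0, max(1, w - size)))
    out = fringe.copy()
    out[y0:y0 + size, x0:x0 + size] = 0.0
    return out

augmented = cutoff(normalize(np.random.randint(0, 256, (800, 1280), dtype=np.uint8)))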

Similar to the absolute phase unwrapping method using geometric constraints [57], our approach, which unwraps the phase through a precomputed unambiguous reference plane, has to compromise on measurement depth, since the phase variation of the same point on the target object between the maximum and minimum depth should be limited within 2π to avoid incorrect fringe order determination. As the projected fringe pattern of our system occupies the entire FOV, according to trigonometry, the physical period of the heterodyne fringe can be calculated by dividing the width of the FOV by the equivalent frequency, $300/(64 - 56) = 37.5\ \textrm{mm}$, and the approximate maximum allowed depth range is $37.5/\tan 16.32^\circ \approx 128\ \textrm{mm}$, which is reasonable for many applications.
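Restating this constraint symbolically, with FOV width W, fringe frequencies fl and fh, and triangulation angle θ, the allowed depth range used above is approximately:

$$\Delta z_{\max} \approx \frac{W}{(f_h - f_l)\tan \theta} = \frac{300\ \textrm{mm}}{(64 - 56)\tan 16.32^\circ} \approx 128\ \textrm{mm}.$$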

3.2 Network searching and training

The network was implemented in PyTorch 1.12.1 [58]. All searching and training processes were completed in a hardware environment with an Intel Xeon Gold 6226R 2.90 GHz CPU and four 16 GB NVIDIA Tesla P100 GPUs. The latency was tested on a laptop with an Intel i9-13800HX 2.20 GHz CPU and an 8 GB NVIDIA GeForce RTX 4070 Laptop GPU through TensorRT 8.6.0. The channel dimension of all operations was 16. We used the AdamW [59] optimizer with an initial learning rate of 10−3 to minimize the loss function of Eq. (14) for both neural network searching and training. The batch size was set to 4. We searched the architectures of the down- and up-sampling cells for 250 epochs and trained the final network for 500 epochs, where the corresponding process was terminated early if there was no improvement on the training set for 25 and 50 epochs, respectively. A cosine annealing scheduler with a learning rate from 10−3 to 10−5 was applied to reduce the learning rate according to the epoch.
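The training configuration can be reproduced roughly with the following PyTorch sketch, where the single convolution is only a stand-in for the searched network and the random tensors are stand-ins for the dataset:

import torch
import torch.nn as nn

model = nn.Conv2d(1, 2, kernel_size=3, padding=1)                 # stand-in for the searched NAS network (fringe -> M, D)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500, eta_min=1e-5)
criterion = nn.MSELoss()                                          # L2 error on the predicted M and D, cf. Eq. (14)

fringe = torch.rand(4, 1, 128, 128)                               # stand-in training batch
target_md = torch.rand(4, 2, 128, 128)                            # ground-truth M and D channels

for epoch in range(2):                                            # 500 epochs in the actual training
    optimizer.zero_grad()
    loss = criterion(model(fringe), target_md)
    loss.backward()
    optimizer.step()
    scheduler.step()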

The total time for network searching and training was approximately 9.3 hours. The loss curves are shown in Fig. 7(a). It can be seen that the loss value during the search is higher than that during training. This is because the architecture search involves exploring a wide range of possible architectures, which can include ineffective ones that do not generalize well to the training data. Although the loss values during the search may be higher, the selected architecture is expected to perform better after a full training process. The searched cell architectures are shown in Fig. 7(b) and (c). It can be seen that dilated convolution and the ECA block are the most frequently selected operations in the searched results, which indicates that a lightweight neural network for phase demodulation requires a large receptive field to process high-resolution fringe patterns and a powerful attention mechanism to effectively locate feature maps of interest.

Fig. 7. (a) Loss curves of neural network searching, training and validation. (b)-(c) Searched down-sampling and up-sampling cell architectures of low frequency and high frequency.

For comparison purposes, Feng's method [35] and a UNet with a depth of 5 [36] were also trained and validated on the same dataset. Figure 8 shows the latency and parameter amount of Feng's method, UNet, and the proposed network, and the details of the latency of the three methods are given in Fig. 8(d). The representation of floating-point numbers on GPUs is generally similar to that on CPUs: both use IEEE 754 standard formats, including FP16 (half-precision floating-point format) and FP32 (single-precision floating-point format). FP32 is commonly used for most computations during training and inference due to its higher precision. In scenarios where memory usage or computational efficiency is a concern, such as on resource-constrained devices or in large-scale distributed training, FP16 may be employed to speed up computations and reduce the memory footprint, since it uses only half as many bits as FP32 to represent numbers. However, the reduced precision of FP16 can lead to some loss of accuracy and potential numerical instability.
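As an illustration of how such a PyTorch model is commonly deployed in FP16 (one typical route, not necessarily the exact toolchain used here), the trained network can be exported to ONNX and then built into a half-precision TensorRT engine:

import torch
import torch.nn as nn

model = nn.Conv2d(1, 2, kernel_size=3, padding=1).eval()          # stand-in for the trained network
dummy = torch.rand(1, 1, 800, 1280)                               # camera resolution used in this work
torch.onnx.export(model, dummy, "phase_net.onnx", input_names=["fringe"], output_names=["md"])
# Then build an FP16 engine with TensorRT, e.g.:
#   trtexec --onnx=phase_net.onnx --fp16 --saveEngine=phase_net.engine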

Fig. 8. Performance comparison of Feng’s method, UNet and the proposed method. (a) Inference time of three methods in precision FP32. The MAE/RMS/STD of phase error (rad) are given below the methods’ names. (b) Corresponding results of three methods in precision FP16. (c) Model size of three methods. (d) Details of parameters, GPU usage (MiB) and latency (ms) of three methods.

As seen in Fig. 8, the mean inference speed of our model reaches 35 fps in FP32 and 58 fps in FP16, while UNet and Feng's method can hardly meet the requirement of real-time 3D reconstruction once the exposure time of the camera is considered. Moreover, the proposed method has a remarkably smaller model size, approximately 1/234 of UNet and approximately 1/5 of Feng's method. Besides, our method occupies approximately half of the GPU memory compared with the other two methods. This reduction in size makes our model well suited for deployment on edge devices.

The mean absolute error (MAE), root mean square error (RMS) and standard error (STD) of the phase error on the validation set are also shown in Fig. 8. With the reduction in model size and acceleration in inference speed, compared to Feng's method and UNet, the proposed method shows an increase of approximately 0.02 rad in MAE, 0.03 rad in RMS, and 0.03 rad in STD. It can be seen that there is minimal change in the accuracy of the model when utilizing FP16 data representation. The slight reduction in MAE and STD when using FP16 compared to FP32 can be attributed to the regularization effect introduced by the noise inherent in FP16 representation. This noise can sometimes aid in improving the generalization performance of the model, resulting in better accuracy when applied to unseen data. The experimental results in the following section were all tested in FP16.

3.3 Quantitative evaluation of standard components

To verify the effectiveness of the proposed method, we first conducted a quantitative evaluation experiment on two standard spheres. As shown in Fig. 9(a), the diameters of the two standard spheres are 50.7991 mm and 50.7970 mm, respectively, and the center distance is 100.2537 mm. The ground truth values of these parameters were obtained by high-precision contact measurement using a three-coordinate measuring machine. Figures 9(b)-(c) illustrate the reconstructed results, which are obtained by the proposed NAS phase demodulation method and the reference plane-assisted dual-frequency heterodyne phase unwrapping method. The corresponding error distributions and their histograms are shown in Fig. 9(d)-(e) and Fig. 9(f)-(i), respectively. The MAE values of the two spheres are 0.0816 mm and 0.0722 mm for the low-frequency fringe and 0.0724 mm and 0.0869 mm for the high-frequency fringe, while the corresponding RMS errors are 0.0821 mm, 0.0731 mm and 0.0809 mm, 0.0919 mm. It is worth noting that as the frequency changes from low to high, the error of the right sphere increases. This could be attributed to the right sphere being positioned farther from the camera and experiencing a more severe reduction in modulation than the left sphere, leading to decreased measurement accuracy. The results demonstrate that the proposed method can achieve high-accuracy 3D reconstruction at low latency with an absolute error of less than 0.1 mm. Moreover, the accuracy of the obtained results is very similar in the low-frequency and high-frequency cases, which provides robust support for stable measurements in dynamic scenes.

Fig. 9. Measurement results of two standard spheres. Symbols dA and dB represent the diameter of the left and right spheres respectively, and c represents the center distance. (a) Standard spheres to be measured. (b)-(c) Reconstructed results of the proposed method using low frequency and high frequency respectively. (d)-(e) Corresponding phase error distribution of (b)-(c). (f)-(g) Histogram of the error of the left and right sphere in low frequency. (h)-(i) Corresponding histogram of high frequency.

3.4 Quantitative evaluation of complex scenes

Experiments on four scenarios, including sculptures, masks, and toys, were further conducted to show the performance of our method on complex surfaces. These scenes were never seen during the search and training process. The wrapped phases calculated by the 12-step phase-shifting method and the correspondingly reconstructed 3D shapes were used as the ground truth. The phase error distributions and their histograms for the proposed method and the traditional windowed Fourier transform (WFT) method [60] are shown in Fig. 10, and the MAE and RMS errors of both methods at the low frequency fl = 56 and the high frequency fh = 64 are given in Table 2.

Fig. 10. Phase error comparison results of the proposed method and traditional method in four scenarios. The phase error distributions and their histogram are shown in odd and even rows, respectively. The wrapped phase error of low frequency and high frequency are calculated in the 2nd and 4th column respectively, and the corresponding results of the traditional WFT method are shown in the 3rd and 5th column.


Table 2. MAE and RMS of the phase error of the proposed method and the traditional method in four different scenarios

The proposed method achieves high-accuracy phase demodulation with maximum MAE and RMS values of 0.0607 rad and 0.0802 rad, while the traditional WFT method has lower precision, with RMS errors larger than 0.2 rad in all four scenes, especially in complex regions with large depth variation. Due to the scarcity of pixels with low modulation depth in the training samples, it is also challenging for the neural network to learn the mapping between the fringe patterns in these regions and the corresponding ground truth; as a result, regions with low modulation depth may not obtain accurate phase values. Besides, the difference between the phase results obtained from the high- and low-frequency fringe patterns is small, with RMS differences of less than 0.04 rad. The phase error of the proposed method for the high-frequency fringe pattern is generally smaller than that for the low-frequency fringe pattern, and the difference between them is approximately in the range of 0.001 to 0.009 rad, which is acceptable in actual measurement; the WFT method shows similar trends. Since the two frequencies used in our method are relatively close, their accuracy difference is slight and acceptable for actual measurement.

The 3D reconstruction results from the phases predicted by the NAS network, as illustrated in Fig. 11, are obtained by the proposed dual-frequency heterodyne phase unwrapping method and the triangular stereo calibration model. When the phase errors become large, such as at the object edges processed by the WFT method, the amplification of errors during phase unwrapping causes phase variations exceeding 2π, resulting in incorrect fringe order determination. Therefore, a multi-frequency phase unwrapping method was employed to unwrap the wrapped phase of the WFT method for a better visual comparison.

Fig. 11. 3D reconstruction results of the proposed method and traditional WFT method.

It can be observed that the proposed dual-frequency method can successfully realize high-frequency phase unwrapping with the assistance of the phase of the reference plane. In comparison with the traditional method, our method demonstrates a significant improvement in phase extraction and provides robust phase unwrapping results in complex scenes.

3.5 Real-time 3D measurement of a dynamic scene

To verify the performance of the proposed method in real-time applications, a measurement system was developed in the same hardware environment as the test cases. In the measurement of dynamic scenes, two fringe patterns of frequencies 56 and 64 were projected cyclically and processed by the corresponding NAS network to obtain the phase information. Phase unwrapping was then performed using the dual-frequency heterodyne method assisted by the stored phase of the reference plane. This approach allows a new 3D reconstruction result to be updated with each newly captured fringe pattern, enabling real-time processing. A sculpture with complex surface variation placed on a rotary table was measured in the dynamic scene. The point clouds were displayed through OpenGL, as shown in Fig. 12. The real-time measurement result is shown in Visualization 1, which demonstrates the potential of the proposed approach for real-time 3D measurement.

Fig. 12. Real-time measurement process and the result of the rotating sculpture with complex surface variation (Visualization 1).

The implementation details of the proposed method are summarized in the flowchart shown in Fig. 13. The calibration parameters and the phase of the reference plane are stored on the GPU in advance. The process begins with the camera capturing modulated fringes, which are first transferred from the CPU to the GPU. Subsequently, the image is normalized and passed to the neural network for processing. The well-trained NAS neural network is then employed to obtain the numerator and denominator of the arctangent function. To reduce the number of pixels to be processed in later stages and speed up the reconstruction, the modulation B is calculated from the arctangent terms according to Eq. (5), and a threshold is applied to generate a mask; any position with a mask value of 0 is excluded from further processing. After the wrapped phase is obtained and combined with the phase of the other frequency calculated from the previous frame, the improved dual-frequency unwrapping method is utilized to obtain the unwrapped phase quickly and accurately. Finally, the 3D point clouds are computed using the calibration parameters.

Fig. 13. Flowchart of the proposed algorithm implementation for real-time 3D measurement and the time consumed by each step.

All of the aforementioned steps are implemented in C++ and CUDA. We measured the average time of each step over 1000 runs, as shown in the last row of Fig. 13. It can be observed that the time of the entire 3D reconstruction process is mainly consumed by the inference of the neural network, highlighting the necessity of network optimization. The total time for 3D reconstruction is approximately 14.2 ms (about 70.4 fps), which fully satisfies the real-time requirements. Taking into account other time-consuming operations, such as image transfer between the CPU and GPU and point cloud visualization, we set the projector's frame interval to 17 ms (about 58.8 fps) during actual operation.
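The per-frame processing of Fig. 13 can be summarized in the following Python sketch (hypothetical names; the actual implementation runs in C++/CUDA, and unwrap_with_reference refers to the sketch given in Section 2.3):

import numpy as np

def process_frame(fringe, infer, phi_prev, f_curr, f_prev, Phi_r, thresh=0.01):
    # fringe: newly captured 8-bit pattern of frequency f_curr; phi_prev: wrapped phase of
    # the previous frame, whose frequency f_prev is the other member of the dual-frequency pair
    M, D = infer(fringe.astype(np.float32) / 255.0)               # NAS network: arctangent terms
    phi_curr = np.arctan2(M, D)                                   # wrapped phase of the current frame
    mask = np.sqrt(M ** 2 + D ** 2) > thresh                      # modulation mask, cf. Eq. (5)
    # order the two phases as (low, high) and unwrap with the reference plane, Eqs. (11)-(13)
    if f_curr > f_prev:
        Phi_l, Phi_h = unwrap_with_reference(phi_prev, phi_curr, Phi_r, f_prev, f_curr)
        Phi = Phi_h
    else:
        Phi_l, Phi_h = unwrap_with_reference(phi_curr, phi_prev, Phi_r, f_curr, f_prev)
        Phi = Phi_l
    return np.where(mask, Phi, np.nan), phi_curr                  # 3D points follow from calibration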

4. Conclusions and discussion

In this paper, we utilize the NAS technique to automatically search for lightweight neural networks from carefully selected candidate operation sets. This approach enables us to discover efficient network architectures for real-time high-precision phase demodulation. A pre-measured reference plane's phase is introduced to assist in the dual-frequency phase unwrapping process. By incorporating this additional information, we enhance the accuracy and reliability of the phase unwrapping step, leading to improved overall measurement quality. Our main contributions can be summarized in the following three points:

  • 1. Efficient phase demodulation with the NAS-optimized lightweight neural network: combining high resolution, high information utilization, and low computational density for fringe analysis, NAS enables the automatic search and selection of the most suitable network architecture for phase demodulation. Moreover, phase retrieval maintains high accuracy despite the effectively reduced computational complexity.
  • 2. Modified absolute dual-frequency heterodyne phase unwrapping strategy: with the help of the pre-computed phase of the reference plane, the selectable range of frequencies for heterodyning is greatly increased, which brings higher accuracy and benefits the network's feature extraction from the input fringes.
  • 3. Real-time 3D reconstruction at a rate equivalent to single-shot: by cyclically projecting dual-frequency fringe patterns, the phase information predicted by the NAS network from adjacent frames mutually assists the phase unwrapping, allowing one 3D reconstruction result for each newly captured image. This enables high accuracy with an error of less than 0.1 mm and achieves real-time measurement based on deep learning for what we believe to be the first time. The experimental results demonstrate that our method achieves low latency with a reconstruction speed of 58 fps.

Compared with prior works that utilize deep learning to extract the phase information (e.g., the method reported in Ref. [41] achieves a reconstruction speed of 15 fps at a resolution of 640 × 480 pixels), a significant step forward has been taken in this paper by addressing the challenges of high resolution and real-time processing, culminating in a feasible and lightweight solution. The presented method extends beyond the initial step of obtaining the arctangent terms to the broader goal of real-time high-precision 3D reconstruction in dynamic scenes. The real-time measurement architecture proposed in this paper can potentially be applied to other FPP methods, accelerating the development of real-time measurement methods based on deep learning in various applications. Compared with the classical three-frequency heterodyne method, the proposed modified dual-frequency phase unwrapping method does have certain limitations concerning measurement depth and fringe frequency, as indicated in Section 3.1. Nevertheless, the proposed method requires fewer fringe patterns, making it more suitable for the measurement of dynamic scenes.

However, several aspects need to be improved in future investigations. Firstly, compared with neural networks with larger numbers of parameters, although the proposed lightweight NAS networks can achieve faster inference, a compromise in accuracy has to be made to reduce the model complexity. Many lightweight networks are available, such as SCNN [61] and BiSeNet [62], but they are usually not specifically designed for FPP measurements in dynamic scenarios, which may result in unsatisfactory performance in tasks like fringe analysis. Further research is needed to ensure that lightweight models can reach the accuracy of large models while enabling real-time inference. Meanwhile, it is also crucial to design network architectures that achieve a well-balanced trade-off between accuracy and speed, in accordance with the actual requirements and the physical model of fringe analysis.

Secondly, it should be noted that certain factors in dataset construction, such as fringe frequency, intensity noise, and object shape, can have a measurable impact on the precision of phase demodulation. When the fringe frequency is excessively high (resulting in reduced modulation) or when the intensity noise is significant, phase errors become large, leading to inaccuracies in the ground truth of the training dataset. Conversely, when the fringe frequency is too low, the information of the object and the background is heavily mixed from a spectral perspective, making it challenging for neural networks to extract useful features and generate accurate outputs. When there are significant mismatches between the training dataset and the testing scenarios, the performance of the neural network may not be optimal due to the inherent data dependency of supervised learning. Hence, the quantitative selection of appropriate parameters during dataset construction requires more in-depth discussion and research to mitigate these adverse effects.

Thirdly, while fixed thresholding was utilized for fast mask generation, it may introduce inaccuracies and result in the presence of noisy points in the final results. The development of robust and adaptive thresholding methods is needed to enhance the reliability of the measurements in real-time performance.

Fourthly, while this paper has indeed achieved successful training and testing of the NAS network with datasets of different frequencies, it is necessary to acknowledge that supervised learning inherently carries a degree of data dependency, which can be further addressed by constructing larger simulated datasets [38,63,64] or exploring unsupervised learning approaches [37].

Lastly, several model compression techniques, such as pruning, quantization, and knowledge distillation, can be applied to a well-trained network to further decrease the number of parameters or accelerate the model, which makes it much friendlier to on-chip systems or personal computers with limited hardware resources.

Funding

National Natural Science Foundation of China (62075143, 62105227, 62205226); National Postdoctoral Program for Innovative Talents (BX2021199); Sichuan Science and Technology Program (2022YFS0113); Key Science and Technology Research and Development Program of Jiangxi Province (20224AAC01011); Young Elite Scientists Sponsorship Program by CAST (2022QNRC001).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [65].

References

1. S. S. Gorthi and P. Rastogi, “Fringe projection techniques: Whither we are?” Opt. Laser Eng. 48(2), 133–140 (2010). [CrossRef]  

2. J. Salvi, S. Fernandez, T. Pribanic, et al., “A state of the art in structured light patterns for surface profilometry,” Pattern Recognit. 43(8), 2666–2680 (2010). [CrossRef]  

3. S. Van der Jeught and J. J. J. Dirckx, “Real-time structured light profilometry: a review,” Opt. Laser Eng. 87, 18–31 (2016). [CrossRef]  

4. Z. Wu, W. Guo, Y. Li, et al., “High-speed and high-efficiency three-dimensional shape measurement based on Gray-coded light,” Photon. Res. 8(6), 819–829 (2020). [CrossRef]  

5. J. Xu and S. Zhang, “Status, challenges, and future perspectives of fringe projection profilometry,” Opt. Laser Eng. 135, 106193 (2020). [CrossRef]  

6. K. Harding, “Engineering precision,” Nature Photon. 2(11), 667–669 (2008). [CrossRef]  

7. J. Geng, “Structured-light 3D surface imaging: a tutorial,” Adv. Opt. Photonics 3(2), 128–160 (2011). [CrossRef]  

8. X. Su and W. Chen, “Fourier transform profilometry: a review,” Opt. Laser Eng. 35(5), 263–284 (2001). [CrossRef]  

9. W.-H. Su and H. Liu, “Calibration-based two-frequency projected fringe profilometry: a robust, accurate, and single-shot measurement for objects with large depth discontinuities,” Opt. Express 14(20), 9178–9187 (2006). [CrossRef]  

10. V. Srinivasan, H. C. Liu, and M. Halioua, “Automated phase-measuring profilometry of 3-D diffuse objects,” Appl. Opt. 23(18), 3105–3108 (1984). [CrossRef]  

11. C. Zuo, S. Feng, L. Huang, et al., “Phase shifting algorithms for fringe projection profilometry: A review,” Opt. Laser Eng. 109, 23–59 (2018). [CrossRef]  

12. L. Lu, V. Suresh, Y. Zheng, et al., “Motion induced error reduction methods for phase shifting profilometry: A review,” Opt. Laser Eng. 141, 106573 (2021). [CrossRef]  

13. Z. H. Zhang, “Review of single-shot 3D shape measurement by phase calculation-based fringe projection techniques,” Opt. Laser Eng. 50(8), 1097–1106 (2012). [CrossRef]  

14. X. Su and W. Chen, “Reliability-guided phase unwrapping algorithm: a review,” Opt. Laser Eng. 42(3), 245–261 (2004). [CrossRef]  

15. D. C. Ghiglia and M. D. Pritt, Two-Dimensional Phase Unwrapping: Theory, Algorithms, and Software (Wiley-Interscience, 1998).

16. M. Zhao, L. Huang, Q. Zhang, et al., “Quality-guided phase unwrapping technique: comparison of quality maps and guiding strategies,” Appl. Opt. 50(33), 6214–6224 (2011). [CrossRef]  

17. H. O. Saldner and J. M. Huntley, “Temporal phase unwrapping: application to surface profiling of discontinuous objects,” Appl. Opt. 36(13), 2770–2775 (1997). [CrossRef]  

18. C. Zuo, L. Huang, M. Zhang, et al., “Temporal phase unwrapping algorithms for fringe projection profilometry: A comparative review,” Opt. Laser Eng. 85, 84–103 (2016). [CrossRef]  

19. Z. Wu, W. Guo, and Q. Zhang, “Two-frequency phase-shifting method vs. Gray-coded-based method in dynamic fringe projection profilometry: A comparative review,” Opt. Laser Eng. 153, 106995 (2022). [CrossRef]  

20. K. Itoh, “Analysis of the phase unwrapping algorithm,” Appl. Opt. 21(14), 2470 (1982). [CrossRef]  

21. C. Guan, L. G. Hassebrook, and D. L. Lau, “Composite structured light pattern for three-dimensional video,” Opt. Express 11(5), 406–417 (2003). [CrossRef]  

22. C. Zuo, Q. Chen, G. Gu, et al., “High-speed three-dimensional shape measurement for dynamic scenes using bi-frequency tripolar pulse-width-modulation fringe projection,” Opt. Laser Eng. 51(8), 953–960 (2013). [CrossRef]  

23. C. Zuo, Q. Chen, G. Gu, et al., “High-speed three-dimensional profilometry for multiple objects with complex shapes,” Opt. Express 20(17), 19493–19510 (2012). [CrossRef]  

24. G. Sansoni, M. Trebeschi, and F. Docchio, “Fast 3D profilometer based upon the projection of a single fringe pattern and absolute calibration,” Meas. Sci. Technol. 17(7), 1757–1766 (2006). [CrossRef]  

25. H.-M. Yue, X.-Y. Su, and Y.-Z. Liu, “Fourier transform profilometry based on composite structured light pattern,” Opt. Laser Eng. 39(6), 1170–1175 (2007). [CrossRef]  

26. M. Takeda, Q. Gu, M. Kinoshita, et al., “Frequency-multiplex Fourier-transform profilometry: a single-shot three-dimensional shape measurement of objects with large height discontinuities and/or surface isolations,” Appl. Opt. 36(22), 5347–5354 (1997). [CrossRef]  

27. H. Yu, D. Zheng, J. Fu, et al., “Deep learning-based fringe modulation-enhancing method for accurate fringe projection profilometry,” Opt. Express 28(15), 21692–21703 (2020). [CrossRef]  

28. K. Wang, Q. Kemao, J. Di, et al., “Deep learning spatial phase unwrapping: a comparative review,” Adv. Photon. Nexus 1(01), 014001 (2022). [CrossRef]  

29. J. Zhang and Q. Li, “EESANet: edge-enhanced self-attention network for two-dimensional phase unwrapping,” Opt. Express 30(7), 10470–10490 (2022). [CrossRef]  

30. K. Wang, Y. Li, Q. Kemao, et al., “One-step robust deep learning phase unwrapping,” Opt. Express 27(10), 15100–15115 (2019). [CrossRef]  

31. G. E. Spoorthi, R. K. S. S. Gorthi, and S. S. Gorthi, “PhaseNet 2.0: Phase unwrapping of noisy data based on deep learning approach,” IEEE Trans. Image Process. 29, 4862–4872 (2020). [CrossRef]

32. K. Ueda, K. Ikeda, O. Koyama, et al., “Absolute phase retrieval of shiny objects using fringe projection and deep learning with computer-graphics-based images,” Appl. Opt. 61(10), 2750–2756 (2022). [CrossRef]  

33. J. Zhang, B. Luo, F. Li, et al., “Single-exposure optical measurement of highly reflective surfaces via deep sinusoidal prior for complex equipment production,” IEEE Trans. Ind. Inf. 19(2), 2039–2048 (2023). [CrossRef]  

34. L. Zhang, Q. Chen, C. Zuo, et al., “High-speed high dynamic range 3D shape measurement based on deep learning,” Opt. Laser Eng. 134, 106245 (2020). [CrossRef]  

35. S. Feng, Q. Chen, G. Gu, et al., “Fringe pattern analysis using deep learning,” Adv. Photon. 1(02), 1 (2019). [CrossRef]  

36. S. Feng, C. Zuo, L. Zhang, et al., “Generalized framework for non-sinusoidal fringe analysis using deep learning,” Photon. Res. 9(6), 1084–1098 (2021). [CrossRef]

37. H. Yu, B. Han, L. Bai, et al., “Untrained deep learning-based fringe projection profilometry,” APL Photonics 7(1), 016102 (2022). [CrossRef]  

38. Y. Li, W. Guo, J. Shen, et al., “Motion-induced phase error compensation using three-stream neural networks,” Appl. Sci. 12(16), 8114 (2022). [CrossRef]  

39. S. Van der Jeught and J. J. J. Dirckx, “Deep neural networks for single shot structured light profilometry,” Opt. Express 27(12), 17091–17101 (2019). [CrossRef]  

40. H. Nguyen and Z. Wang, “Accurate 3D shape reconstruction from single structured-light image via fringe-to-fringe network,” Photonics 8(11), 459 (2021). [CrossRef]  

41. Y. Li, J. Qian, S. Feng, et al., “Deep-learning-enabled dual-frequency composite fringe projection profilometry for single-shot absolute 3D shape measurement,” Opto-Electron. Adv. 5(5), 210021 (2022). [CrossRef]  

42. Y. Li, J. Qian, S. Feng, et al., “Composite fringe projection deep learning profilometry for single-shot absolute 3D shape measurement,” Opt. Express 30(3), 3424–3442 (2022). [CrossRef]  

43. P. Ren, Y. Xiao, X. Chang, et al., “A comprehensive survey of neural architecture search: challenges and solutions,” ACM Comput. Surv. 54(4), 1–34 (2022). [CrossRef]  

44. D. Baymurzina, E. Golikov, and M. Burtsev, “A review of neural architecture search,” Neurocomputing 474, 82–93 (2022). [CrossRef]

45. Y.-Y. Cheng and J. C. Wyant, “Two-wavelength phase shifting interferometry,” Appl. Opt. 23(24), 4539–4543 (1984). [CrossRef]  

46. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 770–778.

47. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, et al., eds., Lecture Notes in Computer Science (Springer International Publishing, 2015), 9351, pp. 234–241.

48. H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” arXiv, arXiv:1806.09055 [cs.LG] (2018).

49. H. Cai, L. Zhu, and S. Han, “ProxylessNAS: Direct neural architecture search on target task and hardware,” arXiv, arXiv:1812.00332 [cs.LG] (2018).

50. Y. Weng, T. Zhou, Y. Li, et al., “NAS-Unet: Neural architecture search for medical image segmentation,” IEEE Access 7, 44247–44257 (2019). [CrossRef]  

51. F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition (2017), pp. 1251–1258.

52. F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv, arXiv:1511.07122 (2015). [CrossRef]  

53. M.-H. Guo, T.-X. Xu, J.-J. Liu, et al., “Attention mechanisms in computer vision: A survey,” Comp. Visual Media 8(3), 331–368 (2022). [CrossRef]  

54. J. Hu, L. Shen, S. Albanie, et al., “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition (2018), pp. 7132–7141.

55. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv, arXiv:1502.03167 [cs] (2015). [CrossRef]  

56. V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010), pp. 807–814.

57. Y. An, J.-S. Hyun, and S. Zhang, “Pixel-wise absolute phase unwrapping using geometric constraints of structured light system,” Opt. Express 24(16), 18445–18459 (2016). [CrossRef]  

58. A. Paszke, S. Gross, F. Massa, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32 (2019).

59. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv, arXiv:1711.05101 (2017). [CrossRef]  

60. Q. Kemao, “Windowed Fourier transform for fringe pattern analysis,” Appl. Opt. 43(13), 2695–2702 (2004). [CrossRef]  

61. R. P. K. Poudel, S. Liwicki, and R. Cipolla, “Fast-SCNN: Fast semantic segmentation network,” arXiv, arXiv:1902.04502 (2019). [CrossRef]

62. C. Yu, J. Wang, C. Peng, et al., “BiSeNet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European conference on computer vision (ECCV) (2018), pp. 325–341.

63. Y. Zheng, S. Wang, Q. Li, et al., “Fringe projection profilometry by conducting deep learning from its digital twin,” Opt. Express 28(24), 36568–36583 (2020). [CrossRef]  

64. F. Wang, C. Wang, and Q. Guan, “Single-shot fringe projection profilometry based on deep learning and computer graphics,” Opt. Express 29(6), 8024–8040 (2021). [CrossRef]  

65. Y. Li, “12-step phase-shifting fringes of freq 56 and 64,” https://www.kaggle.com/datasets/lyyscu/12-step-phase-shifting-fringes-of-freq-56-and-64.

Supplementary Material (1)

Visualization 1: Real-time measurement process and the result of the rotating sculpture with complex surface variation.

Figures (13)

Fig. 1. (a) General flowchart of NAS. (b) Flowchart of NAS with cell-based search space and gradient descent search strategy.
Fig. 2. Proposed NAS-optimized dual-frequency real-time 3D reconstruction method. Steps 1 to 4 represent NAS-assisted phase demodulation, wrapped phase calculation, reference plane-assisted dual-frequency heterodyne phase unwrapping, and 3D shape reconstruction, respectively.
Fig. 3. Proposed dual-frequency heterodyne phase unwrapping method assisted by the reference plane. (a) Equivalent phase obtained by subtracting the low-frequency phase from the high-frequency phase. (b) Unwrapping the equivalent phase assisted by the reference plane. (c) Unwrapping the wrapped phase of the high frequency.
Fig. 4. (a) U-shaped backbone of the NAS network. The green arrow “identity” represents passing the feature maps directly to the next cell, as all the down- and up-sampling operations are completed inside the cells. The orange arrow “connection” represents the learnable operations in up-sampling cells corresponding to the skip connections in UNet. (b) Cell architecture of down- and up-sampling cells, where the red arrows are searched from different primitive operations in these two types of cells.
Fig. 5. Schematic diagram of the attention blocks. (a) Squeeze-and-excitation (SE) block. (b) Efficient channel attention (ECA) block. (c) Spatial group-wise enhance (SGE) block.
Fig. 6. (a) Schematic diagram of the experimental FPP system. (b) Collection of the training and validation data. The 1st column shows the different measured objects, including toys, masks, sculptures, etc. The 2nd and 3rd columns show the ground-truth numerator and denominator of the arctangent function at low frequency 56, obtained from 12-step phase-shifting fringe patterns. The 4th and 5th columns show the corresponding data at high frequency 64.
Fig. 7. (a) Loss curves of neural network searching, training, and validation. (b)-(c) Searched down-sampling and up-sampling cell architectures for the low frequency and the high frequency.
Fig. 8. Performance comparison of Feng’s method, UNet, and the proposed method. (a) Inference time of the three methods in FP32 precision. The MAE/RMS/STD of the phase error (rad) are given below the methods’ names. (b) Corresponding results of the three methods in FP16 precision. (c) Model size of the three methods. (d) Details of parameters, GPU usage (MiB), and latency (ms) of the three methods.
Fig. 9. Measurement results of two standard spheres. Symbols dA and dB represent the diameters of the left and right spheres, respectively, and c represents the center distance. (a) Standard spheres to be measured. (b)-(c) Reconstructed results of the proposed method using the low frequency and the high frequency, respectively. (d)-(e) Corresponding phase error distributions of (b)-(c). (f)-(g) Histograms of the error of the left and right spheres at the low frequency. (h)-(i) Corresponding histograms at the high frequency.
Fig. 10. Phase error comparison of the proposed method and the traditional method in four scenarios. The phase error distributions and their histograms are shown in the odd and even rows, respectively. The wrapped phase errors of the low frequency and the high frequency are given in the 2nd and 4th columns, respectively, and the corresponding results of the traditional WFT method are shown in the 3rd and 5th columns.
Fig. 11. 3D reconstruction results of the proposed method and the traditional WFT method.
Fig. 12. Real-time measurement process and the result of the rotating sculpture with complex surface variation (Visualization 1).
Fig. 13. Flowchart of the proposed algorithm implementation for real-time 3D measurement and the time consumed by each step.

Tables (2)

Table 1. Candidate primitive set of normal, up, and down operations. Depth Conv indicates depth-wise separable convolution; SE, ECA, and SGE indicate the squeeze-and-excitation, efficient channel attention, and spatial group-wise enhance blocks, respectively.

Table 2. MAE and RMS of the phase error of the proposed method and the traditional method in four different scenarios

Equations (14)

$$I(x, y) = A(x, y) + B(x, y)\cos\phi(x, y). \tag{1}$$
$$I_n(x, y) = A(x, y) + B(x, y)\cos[\phi(x, y) - \delta_n(x, y)], \tag{2}$$
$$\delta_n(x, y) = 2\pi (n - 1)/N, \quad n = 1, 2, \ldots, N. \tag{3}$$
$$\phi(x, y) = \arctan\frac{M(x, y)}{D(x, y)} = \arctan\frac{\sum_{n=1}^{N} I_n(x, y)\sin\delta_n(x, y)}{\sum_{n=1}^{N} I_n(x, y)\cos\delta_n(x, y)}, \tag{4}$$
$$B(x, y) = \frac{2}{N}\sqrt{M(x, y)^2 + D(x, y)^2}. \tag{5}$$
$$\Phi(x, y) = \phi(x, y) + 2\pi k(x, y). \tag{6}$$
$$\phi_{eq}(x, y) = \phi_h(x, y) - \phi_l(x, y). \tag{7}$$
$$k_l(x, y) = \mathrm{Round}\left[\frac{(f_l/f_{eq})\,\phi_{eq}(x, y) - \phi_l}{2\pi}\right], \tag{8}$$
$$k_h(x, y) = \mathrm{Round}\left[\frac{(f_h/f_{eq})\,\phi_{eq}(x, y) - \phi_h}{2\pi}\right]. \tag{9}$$
$$\bar{o}^{(i,j)}(x) = \sum_{o \in O} w_o^{(i,j)}\, o(x) \tag{10}$$
$$\Phi_{eq} = \phi_{eq} + 2\pi\,\mathrm{Round}\left[\frac{\Phi_r - \phi_{eq}}{2\pi}\right] \tag{11}$$
$$\Phi_h = \phi_h + 2\pi\,\mathrm{Round}\left[\frac{(f_h/f_{eq})\,\Phi_{eq} - \phi_h}{2\pi}\right] \tag{12}$$
$$\Phi_l = \phi_l + 2\pi\,\mathrm{Round}\left[\frac{(f_l/f_{eq})\,\Phi_{eq} - \phi_l}{2\pi}\right] \tag{13}$$
$$L = \frac{1}{2}\left[(M_p - M_g)^2 + (D_p - D_g)^2\right] \tag{14}$$
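As a concrete numerical illustration of Eqs. (3)-(4), (7), and (11)-(13), the following minimal NumPy sketch demodulates the wrapped phase from an N-step fringe stack and then performs reference-plane-assisted dual-frequency heterodyne unwrapping. This is a sketch under stated assumptions, not the authors' implementation: the array names (frames_l, frames_h, phi_ref) and the wrapping of the equivalent phase into [0, 2π) are illustrative choices, and in the proposed pipeline the numerator M and denominator D of Eq. (4) are predicted by the NAS-searched network rather than computed from a phase-shifted stack.

import numpy as np

def wrapped_phase(frames):
    # N-step phase-shifting demodulation, Eqs. (3)-(4).
    # frames: array of shape (N, H, W), with phase shifts 2*pi*(n-1)/N.
    N = frames.shape[0]
    delta = 2 * np.pi * np.arange(N) / N                 # Eq. (3)
    M = np.tensordot(np.sin(delta), frames, axes=1)      # numerator of Eq. (4)
    D = np.tensordot(np.cos(delta), frames, axes=1)      # denominator of Eq. (4)
    return np.arctan2(M, D)                              # wrapped phase in (-pi, pi]

def heterodyne_unwrap(phi_l, phi_h, phi_ref, f_l=56, f_h=64):
    # Reference-plane-assisted dual-frequency heterodyne unwrapping, Eqs. (7), (11)-(13).
    # phi_ref: absolute equivalent phase of the flat reference plane (Phi_r in Eq. (11)).
    f_eq = f_h - f_l                                      # equivalent (beat) frequency
    phi_eq = np.mod(phi_h - phi_l, 2 * np.pi)             # Eq. (7), rewrapped into [0, 2*pi)
    Phi_eq = phi_eq + 2 * np.pi * np.round((phi_ref - phi_eq) / (2 * np.pi))                 # Eq. (11)
    Phi_h = phi_h + 2 * np.pi * np.round(((f_h / f_eq) * Phi_eq - phi_h) / (2 * np.pi))      # Eq. (12)
    Phi_l = phi_l + 2 * np.pi * np.round(((f_l / f_eq) * Phi_eq - phi_l) / (2 * np.pi))      # Eq. (13)
    return Phi_l, Phi_h

# Hypothetical usage with 12-step fringe stacks, e.g. the frequency-56/64 data of Ref. [65]:
# phi_l = wrapped_phase(frames_l)    # low-frequency stack, shape (12, H, W)
# phi_h = wrapped_phase(frames_h)    # high-frequency stack, shape (12, H, W)
# Phi_l, Phi_h = heterodyne_unwrap(phi_l, phi_h, phi_ref)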