
FPP-SLAM: indoor simultaneous localization and mapping based on fringe projection profilometry

Open Access

Abstract

Simultaneous localization and mapping (SLAM) plays an important role in autonomous driving, indoor robotics and AR/VR. Outdoor SLAM has been widely used with the assistance of LiDAR and the Global Navigation Satellite System (GNSS). However, for indoor applications, the commonly used LiDAR sensor does not satisfy the accuracy requirement and GNSS signals are blocked. Thus, an accurate and reliable 3D sensor and suitable SLAM algorithms are required for indoor SLAM. Fringe projection profilometry (FPP), one of the most promising 3D perceiving techniques, shows great potential but has not yet prevailed in indoor SLAM. In this paper, we introduce FPP to indoor SLAM for the first time and accordingly propose suitable SLAM algorithms, thus enabling a new FPP-SLAM. The proposed FPP-SLAM achieves millimeter-level and real-time mapping and localization without any expensive equipment assistance. The performance is evaluated in both controlled and real room-sized scenes. The experimental results demonstrate that our method outperforms other state-of-the-art methods in terms of efficiency and accuracy. We believe this method paves the way for FPP in indoor SLAM applications.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Simultaneous localization and mapping (SLAM) is of great importance in many fields that range from autonomous driving to augmented reality [1–4]. It constructs a map of an unknown environment and simultaneously locates the sensor in the constructed map, and can be generally divided into two categories: outdoor [5] and indoor [6]. In practice, both of them require high-definition scene perceiving for path planning, autonomous navigation, obstacle evasion, etc. High-definition perceiving means accurate and dense 2D/3D information acquisition of the applied scene. Outdoor SLAM has been widely used for autonomous driving [2]. Although high-definition perceiving is difficult to achieve in large and complex outdoor scenes, GPS signals can be used to pre-load high-definition maps constructed in advance, e.g., the Vehicle Mapping Service provided by Google [7]. However, indoor SLAM usually works in narrow scenes without GNSS signals, and high-definition perceiving is necessary for its proper functioning [8].

A typical SLAM system includes front-end 2D or 3D sensors and back-end mapping and localization algorithms [9]. The performance of SLAM heavily relies on the perceiving quality of the employed front-end sensors. LiDAR [10], stereo vision [11] and time of flight (ToF) [12] have been introduced for SLAM. LiDAR is usually used to locate the user/robot in outdoor SLAM, but it cannot construct high-definition maps because of its sparse data structure and computational complexity [13]. Stereo vision struggles to perceive indoor scenes with little texture, such as white walls and cabinets [14]. ToF has been used for indoor applications, such as the Microsoft Kinect [15], but it suffers from measurement distortion and insufficient resolution. Fringe projection profilometry (FPP) plays an important role in fast and high-fidelity 3D perceiving due to its high resolution, accuracy and speed [16–18], and thus shows great potential as the front-end sensor for indoor SLAM. The availability of FPP has great potential to revolutionize indoor SLAM, but related works have not prevailed because of the lack of suitable back-end algorithms that fully leverage FPP.

Traditional back-end algorithms can be divided into 2D [19,20], 3D [12,21,22], and 2D-3D [5,23] categories. The 2D category locates the sensor by tracking 2D features extracted from the captured image and simultaneously constructs a map by triangulating these 2D features into 3D map points [19]. The resulting map is sparse and inaccurate, and only rudimentary scene information can be represented [20]. The 3D category maps the scene by using 3D point cloud fusion (e.g., the Iterative Closest Point (ICP) algorithm [21] or its variants [24]) and simultaneously locates the sensor based on rotation and translation transformation matrices [12,22]. The 3D back-end algorithm is more accurate but more time-consuming due to its computational complexity. The 2D-3D category uses both the 2D and 3D data, where the 2D texture information can be used to accelerate the 3D data fusion [23]; for example, a combination of stereo vision and LiDAR systems is used to perceive the 2D and 3D data, respectively [5]. The 2D-3D algorithm performs more robustly but increases system cost and computational complexity. Our FPP system can simultaneously perceive both high-definition 2D and 3D data without adding other perceiving systems. However, traditional back-end algorithms cannot meet the requirements of FPP-SLAM, such as dense point cloud processing, real-time mapping and sensor localization, etc. Specifically, first, the perceived data of FPP is several times greater than that of Kinect or LiDAR [25,26], which obviously reduces the SLAM efficiency. Second, FPP can achieve an accuracy better than 0.1 mm/m [17], but traditional algorithms only achieve cm- [27] or dm-level [28] SLAM accuracy. Thus, more efficient and accurate back-end algorithms are required for FPP-SLAM.

In this paper, we propose a new FPP-SLAM for indoor mapping and localization applications, which includes a front-end data perceiving system and back-end mapping and localization algorithms. The FPP system flexibly perceives the desired high-definition 2D and 3D data for the subsequent algorithms. To improve the SLAM efficiency, a sparse-to-dense mapping strategy is designed to reduce computational complexity. To improve the SLAM accuracy, a local-to-global localization strategy is designed to achieve accurate localization and simultaneously improve the mapping. The provided experiments validate that the proposed FPP-SLAM achieves mm-level SLAM in a room-sized (30 m$^2$) scene within ten minutes.

The rest of this paper is organized as follows. Section 2 presents the principle of the proposed FPP-SLAM. Section 3 provides experiments. Section 4 concludes this paper.

2. Principle of the proposed FPP-SLAM

The framework of the proposed FPP-SLAM is provided in Fig. 1. The FPP system is introduced for high-definition indoor scene perceiving. The corresponding back-end mapping and localization algorithms are specially designed to make full use of FPP and achieve efficient and accurate indoor SLAM. We emphasize that the mapping and localization collaborate with each other for global scene mapping and consistent sensor localization.

Fig. 1. The overall framework of the proposed FPP-SLAM.

In the mapping, a sparse map is first constructed from sparse 2D and 3D features and then refined by dense geometric consistency given by complete 3D data. The mapping is basically implemented on sparse features rather than dense full-size data, thus obviously reducing computational complexity. In the localization, the state of the FPP system is initially estimated from the inversion relationship between the camera and world coordinate systems and then optimized by using local common features and a global localization array. The optimized adjacent frame localization transformation matrices also improve the mapping accuracy.

The details of our FPP-SLAM are provided as follows.

2.1 High-definition 2D and 3D data perceiving based on FPP

For SLAM, FPP is still affected by ambient light, dynamic scenes, complex scenes, etc. With the development of technologies and devices, these issues can be basically addressed. First, we limit the application scenes to indoor environments to avoid ambient light interference. Second, the dynamic issue can be alleviated by using the defocusing FPP (DFPP) [16,29] or compensated by motion error analysis [30,31]. Third, many compensation methods [32] can be introduced for sensing complex scenes.

For a relatively large indoor scene, the FPP system needs to perceive the desired data multiple times. For each time of data perceiving, the FPP system can be flexibly placed under different views. For each perceiving view, the FPP system perceives both the 2D image and the 3D point cloud of the measured scene. Taking a scene including a cube as an example, Fig. 2 illustrates the data-perceiving process of FPP-SLAM. The perceived data can be considered as a live frame stream $\mathfrak {f} = \{ {f_1};{f_2};\cdots ;{f_N}\}$, where $f_n$ denotes the perceived $n$-th frame and $N$ denotes the total number of perceiving times. For each frame, we have $f_n = \{ I_n, P_n\}$, where $I_n$ and $P_n$ denote the 2D image and 3D point cloud of the frame, respectively.
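As a concrete illustration of this notation, the following minimal Python sketch models one frame $f_n = \{I_n, P_n\}$ and the frame stream; the container name and array shapes are our assumptions, not part of the proposed method.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Frame:
    """One perceived frame f_n = {I_n, P_n}."""
    image: np.ndarray        # I_n: 2D texture image, shape (H, W)
    point_cloud: np.ndarray  # P_n: 3D points, shape (H, W, 3), pixel-aligned with image

# The live frame stream f = {f_1; f_2; ...; f_N}, filled as the FPP sensor perceives.
frame_stream = []
```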

Fig. 2. Schematic diagram of data perceiving process of FPP-SLAM.

As we know, the FPP system mainly includes a projector and a camera. The former projects patterns designed by a computer, and the latter captures the patterns reflected from the object surface [33,34]. 2D texture images and 3D point clouds can be simply perceived by adjusting the projection sequence provided in Fig. 3.

Fig. 3. Schematic diagram of projection sequence.

The 3D point clouds are perceived by projecting phase-shifting patterns combined with Gray-code patterns, which are shown in the red and blue boxes of Fig. 3, respectively. The 2D texture image is extracted by calculating the modulation of the phase-shifting patterns. Phase-shifting sinusoidal patterns are used to calculate the initial (wrapped) phase [35], and Gray-code patterns are used to retrieve the absolute phase for the 3D reconstruction [36–38]. In the future, to achieve higher-speed indoor 3D perceiving, we will introduce a jump-error compensation strategy in our FPP-SLAM to eliminate the jump errors of the Gray-code method [39,40].
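For reference, the modulation-based texture and the wrapped phase of an $N$-step phase-shifting sequence can be computed as in the following sketch; the sign convention of the phase depends on how the projected patterns are defined, so treat this as an assumption rather than the authors' exact implementation.

```python
import numpy as np

def phase_and_texture(patterns):
    """Wrapped phase and modulation from N-step phase-shifting images.

    patterns: list of N camera images of sinusoidal patterns with phase
    shifts 2*pi*n/N (n = 0..N-1), each a float array of shape (H, W).
    """
    N = len(patterns)
    shifts = 2.0 * np.pi * np.arange(N) / N
    num = sum(I * np.sin(d) for I, d in zip(patterns, shifts))
    den = sum(I * np.cos(d) for I, d in zip(patterns, shifts))
    wrapped_phase = np.arctan2(-num, den)              # initial phase, in (-pi, pi]
    modulation = (2.0 / N) * np.sqrt(num**2 + den**2)  # used here as the 2D texture
    return wrapped_phase, modulation
```

The Gray-code patterns then supply the fringe order that turns the wrapped phase into the absolute phase used for 3D reconstruction.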

Figure 4 illustrates the schematic diagram of FPP. When the desired phase is retrieved from the captured phase-shifting and Gray-code patterns [18,41], the correspondence between the projector and the camera can be established. Combined with the system parameters calibrated in advance, the desired 3D data can be calculated by using the typical triangulation method [42,43]. In Fig. 4, we randomly select a surface point B for the analysis. The light emitted from projector pixel A is reflected from the surface point B and reaches the camera pixel C. The encoded phase of A does not change during the light transmission, and the unique correspondence of pixels A and C can be easily established from the retrieved phase.
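The triangulation step can be sketched as a small linear (DLT) solve, assuming the absolute phase has already been converted into a projector coordinate along the fringe direction and that both devices are described by calibrated 3 × 4 projection matrices; the function and variable names are ours.

```python
import numpy as np

def triangulate(cam_pixel, proj_col, P_cam, P_proj):
    """Linear triangulation of one surface point (e.g. point B) from the
    phase-established correspondence: camera pixel (u_c, v_c) <-> projector
    column u_p. P_cam and P_proj are the 3x4 projection matrices of the
    calibrated camera and projector. Returns the 3D point.
    """
    u_c, v_c = cam_pixel
    # Each known image coordinate gives one homogeneous linear equation in X.
    M = np.vstack([
        u_c * P_cam[2] - P_cam[0],
        v_c * P_cam[2] - P_cam[1],
        proj_col * P_proj[2] - P_proj[0],  # only the fringe direction is encoded
    ])
    _, _, Vt = np.linalg.svd(M)            # least-squares null vector
    X = Vt[-1]
    return X[:3] / X[3]
```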

Fig. 4. Schematic diagram of FPP.

To be close to the real scene, the texture image is also rendered on the 3D point cloud [44] and shown at the top of Fig. 4.

However, due to the multiple-shot characteristic of FPP, motion errors are indeed inevitable during dynamic measurements. Benefiting from the development of devices and algorithms, DFPP can achieve a perceiving speed of up to hundreds of frames per second [16], so the motion-induced errors can be roughly alleviated. In addition, our FPP-SLAM is applied to indoor scenes, so the movement speed of the measurement device can be controlled to a low level (<0.05 m/s). Thus, the motion-induced errors can be largely suppressed. However, to achieve higher-quality 3D perceiving, the dynamic problems still need to be solved. Fortunately, many state-of-the-art methods have been proposed, such as the micro Fourier transform profilometry proposed by Zuo et al. [45], the cyclic complementary Gray-code light proposed by Wu et al. [46], deep-learning-based methods [47,48], and so on. To make our FPP-SLAM more practical, we will consider combining these dynamic methods in the future. In addition, since the motion of the measurement device is controllable, this motion can be predicted and compensated for by adjusting the projection patterns of FPP, which is more favorable for solving the dynamic problem of FPP-SLAM.

2.2 Sparse-to-dense mapping

The mapping process calculates the matrices that transform each perceived frame of the point cloud into the same world coordinate system [49]. The map can be constructed by

$$M = {\cal M}\left( {\mathbb{P},\mathfrak{m}} \right),$$
where $\cal M$ describes the incremental mapping process, $\mathbb {P} = \{ {P_1};{P_2};\cdots ;{P_N}\}$ denotes the 3D data included in the data stream $\mathfrak {f}$, $\mathfrak {m} = \{ {m_1};{m_2};\cdots ;{m_N}\}$ denotes the transformation matrices to be solved, and ${m_n} = [R_n^P,{\rm {T}}_n^P]$ denotes a 3 $\times$ 4 transformation matrix that transforms $P_n$ into the same coordinate system. In $m_n$, $R_n^P$ and $T_n^P$ are the rotation and translation matrices, respectively, where the superscript "$P$" refers to the 3D features. Accordingly, the superscript "$I$" used below refers to the 2D features. For example, $R_n^I$ and $T_n^I$ are the 2 $\times$ 2 rotation and 2 $\times$ 1 translation matrices, respectively, that describe the 2D affine transformation between adjacent images [50].
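In practice, applying one mapping matrix $m_n = [R_n^P, T_n^P]$ to a frame's point cloud is a single affine operation; a minimal sketch (array shapes are our assumption) is:

```python
import numpy as np

def transform_point_cloud(P_n, m_n):
    """Map the point cloud P_n into the world/map coordinate system.

    P_n: (M, 3) array of 3D points.
    m_n: (3, 4) transformation matrix with m_n[:, :3] = R_n^P, m_n[:, 3] = T_n^P.
    """
    R, T = m_n[:, :3], m_n[:, 3]
    return P_n @ R.T + T
```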

To solve the desired transformation matrices, sparse 2D and 3D features are first extracted from the perceived frame stream by using feature corner detection. When the feature correspondence is established by matching similar features, the transformation matrix is first calculated by using iterative matching minimization and then refined by using dense geometric structures given by the complete 3D data.

The above-mentioned feature extraction, correspondence establishment, and transformation matrix calculation are detailed as follows.

2.2.1 Feature extraction

The 2D features are first extracted by Harris corner detection [51], and the corresponding 3D features are then extracted based on the 2D-to-3D coordinate transformation [42]. For clarity, the features of the object's surface are described by $\mathbb {F} = \{ F_1^{};\cdots ;F_K^{}\}$. The extracted 2D and 3D features are $\mathbb {F}_n^I = \{ F_{n\_1}^I;F_{n\_2}^I;\cdots ;F_{n\_K}^I\}$ and $\mathbb {F}_n^P = \{ F_{n\_1}^P;F_{n\_{\rm {2}}}^P;\cdots ;F_{n\_K}^P\}$, respectively, where $K$ denotes the total number of features and the subscript $n{\_}k$ refers to the $k$-th feature of $f_n$.

Features are sparse points that describe the surface properties of the scene. For example, 2D features correspond to drastic grayscale changes in the 2D texture images; thus, these 2D features can be first extracted by calculating the curvature and gradient of the pixels [52]. From them, the desired 3D features can be extracted by [53]

$$\mathbb {F}_n^P{\rm{ = }}s \cdot {\left[ {A \cdot E} \right]^{ - 1}} \cdot \mathbb {F}_n^I,$$
where $A$, $E$ and $s$ are the camera intrinsic matrix, extrinsic matrix, and scaling factor, respectively.

Figure 5 illustrates a cube including seven corner features, and we have $\mathbb {F} = \{ F_1^{};\cdots ;F_7^{}\}$, which are marked by different color dots. Taking the first frame ${f_1} = \{ {I_1},{P_1}\}$ as an example, we can extract four 2D features $\mathbb {F}_1^I = \{ F_{1\_1}^I;F_{1\_2}^I;F_{1\_3}^I;F_{1\_4}^I\}$ from the texture image $I_1$, and then extract 3D features $\mathbb {F}_1^P = \{ F_{1\_1}^P;F_{1\_{\rm {2}}}^P;F_{1\_3}^P;F_{1\_4}^P\}$ from the point cloud $P_1$.
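A compact sketch of this feature extraction step is given below. It uses OpenCV's Harris detector for the 2D corners; because the FPP point cloud is pixel-aligned with the texture image, the 2D-to-3D lift is written here as a simple per-pixel lookup, which is our simplification of Eq. (2) rather than the authors' exact procedure.

```python
import numpy as np
import cv2  # OpenCV, used only for Harris corner detection

def extract_features(frame):
    """Extract sparse 2D features F_n^I and their 3D counterparts F_n^P.

    frame.image: (H, W) texture image; frame.point_cloud: (H, W, 3)
    pixel-aligned 3D data perceived by the FPP sensor.
    """
    gray = np.float32(frame.image)
    response = cv2.cornerHarris(gray, 2, 3, 0.04)     # block size, Sobel size, k
    # Keep only strong corners (simple threshold, no non-maximum suppression).
    rows, cols = np.where(response > 0.01 * response.max())
    feats_2d = np.stack([cols, rows], axis=1)         # F_n^I: (u, v) pixel coordinates
    feats_3d = frame.point_cloud[rows, cols]          # F_n^P: (x, y, z) surface points
    return feats_2d, feats_3d
```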

Fig. 5. Schematic diagram of 2D and 3D features extraction.

2.2.2 Sparse mapping based on feature matching

The 3D features describe the complete 3D point cloud, and their corresponding 3D transformation matrix can be used for coarse mapping. To reduce calculation complexity, the 2D features are first matched to accelerate 3D feature matching.

For clarity, we select two frames ${f_1} = \{ {I_1},{P_1}\}$ and ${f_2} = \{ {I_2},{P_2}\}$ from Fig. 5 and illustrate the coarse mapping process in Fig. 6. To match 2D features, the feature correspondences can be established based on color and location similarity search [54]. As shown at the top of Fig. 6, the similar 2D features $\mathbb {F}_1^I = \{ F_{1\_1}^I;F_{1\_2}^I;F_{1\_3}^I;F_{1\_4}^I\}$ and $\mathbb {F}_2^I = \{ F_{2\_1}^I;F_{2\_2}^I;F_{2\_3}^I;F_{2\_4}^I\}$ are matched in pairs and connected by colored dashed lines. The 2D correspondence between $\mathbb {F}_1^I/I_1^{}$ and $\mathbb {F}_2^I/I_2^{}$ can be established by

$$\mathbb {F}_1^I = [R_2^I,T_2^I] \cdot \mathbb {F}_2^I.$$
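Given the matched 2D features, the affine matrix $[R_2^I, T_2^I]$ of Eq. (3) can be estimated by ordinary least squares; a minimal sketch (not the authors' implementation) is:

```python
import numpy as np

def estimate_2d_transform(feats1_2d, feats2_2d):
    """Least-squares 2D affine transform [R_2^I, T_2^I] mapping the matched
    2D features of frame 2 onto frame 1, following Eq. (3).

    feats1_2d, feats2_2d: (K, 2) arrays of matched pixel coordinates.
    Returns a (2, 3) matrix [R | T].
    """
    K = feats2_2d.shape[0]
    X2 = np.hstack([feats2_2d, np.ones((K, 1))])   # (K, 3): [u2, v2, 1]
    A = np.zeros((2 * K, 6))
    A[0::2, 0:3] = X2                              # rows constraining u1
    A[1::2, 3:6] = X2                              # rows constraining v1
    b = feats1_2d.reshape(-1)                      # [u1, v1, u1, v1, ...]
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params.reshape(2, 3)                    # [[r11, r12, t1], [r21, r22, t2]]
```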

Fig. 6. Schematic diagram of sparse mapping.

Then, the 3D features are matched to solve the desired 3D transformation matrix ${m_2} = [R_2^P,{\rm {T}}_2^P]$. However, due to a poor initial position, 3D matching easily falls into a local optimum [55], resulting in inefficient and inaccurate mapping. Thus, the above-calculated 2D transformation matrix $\left [ {R_2^I , T_2^I} \right ]$ is converted into an initial 3D transformation matrix $\left [ {R_2^P , T_2^P} \right ]$ based on Eq. (2) to improve the 3D feature matching.

Next, the similar 3D features are matched on the basis of a structure-consistency search of topological structure and spatial location [56], where the matrix $\left [ {R_2^P , T_2^P} \right ]$ is used as a prior and then recomputed to obtain a more accurate 3D transformation matrix. The Singular Value Decomposition (SVD) algorithm [57] is adopted to iteratively recompute $\left [ {R_2^P , T_2^P} \right ]$, which can be formulated as (see Appendix A for details)

$$J(R_2^P,T_2^P) = \arg \min \sum {{{\left\| {\mathbb {F}_1^P - (R_2^P\mathbb {F}_2^P + T_2^P)} \right\|}^2}},$$
where $J$ denotes the iterative minimization solver for the mapping matrix, and $\mathbb {F}_1^P = \{ F_{1\_1}^P;F_{1\_2}^P;F_{1\_3}^P;F_{1\_4}^P\}$ and $\mathbb {F}_2^P = \{ F_{2\_1}^P;F_{2\_2}^P;F_{2\_3}^P;F_{2\_4}^P\}$ are the paired 3D features extracted from $P_1$ and $P_2$, respectively. As shown in the second row of Fig. 6, similar 3D features $\mathbb {F}_1^P$ and $\mathbb {F}_2^P$ are matched and connected by colored dashed lines. The 3D correspondence between $\mathbb {F}_1^P/P_1^{}$ and $\mathbb {F}_2^P/{P_2}$ can be established by
$$\mathbb {F}_1^P = [R_2^P,T_2^P] \cdot \mathbb {F}_2^P = {m_2} \cdot \mathbb {F}_2^P.$$

Finally, by applying the recomputed ${m_2} = \left [ {R_2^P , T_2^P} \right ]$ to $P_2$, the adjacent complete point clouds can be initially fused. As shown at the bottom of Fig. 6, the coarse map $M = {\cal M}\left ( {{P_1},{m_1};{P_2},{m_2}} \right )$ is initially constructed from $P_1$ and $P_2$.

2.2.3 Refining based on dense geometric consistency

In practice, due to the effect of random noise and different viewpoint illumination [58], the appearance of wrong 3D features is inevitable, resulting in an inaccurate 3D transformation matrix. To improve the mapping accuracy, the coarse matrix $\left [ {R_2^P , T_2^P} \right ]$ is refined by using the dense geometric structure given by the complete 3D data. The transformation matrix refinement can be described by

$$J(R_2^P,T_2^P) = \arg \min \sum {{{\left\| {P_1^{} - (R_2^PP_2^{} + T_2^P)} \right\|}^2}}.$$

Taking the coordinate of $P_1$ as the map coordinate, $P_2$ is transformed into the same world coordinate by using the refined $\left [ {R_2^P , T_2^P} \right ]$, thus constructing $M = {\cal M}\left ( {{P_1},{m_1};{P_2},{m_2}} \right )$. To transform the subsequent frame of point clouds into the same map, we repeat the above map initialization and refinement processes to construct the final map $M = {\cal M}\left ( {{P_1},{m_1};{P_2},{m_2};\cdots ;{P_N},{m_N}} \right )$.
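A minimal sketch of this dense refinement is given below: a point-to-point ICP-style loop that builds dense correspondences with a k-d tree and updates $[R_2^P, T_2^P]$ with the closed-form SVD solution of Appendix A. The random subsampling and the fixed iteration count are our simplifications for brevity, not the authors' settings.

```python
import numpy as np
from scipy.spatial import cKDTree

def refine_transform(P1, P2, R, T, iterations=10, sample=5000):
    """Refine a coarse rigid transform [R | T] (mapping P2 onto P1) with the
    dense geometric consistency of Eq. (6).

    P1, P2: (M1, 3) and (M2, 3) point clouds of adjacent frames.
    """
    tree = cKDTree(P1)
    rng = np.random.default_rng(0)
    for _ in range(iterations):
        idx2 = rng.choice(len(P2), size=min(sample, len(P2)), replace=False)
        Q = P2[idx2] @ R.T + T                  # current guess in map coordinates
        _, idx1 = tree.query(Q)                 # dense nearest-neighbor correspondences
        src, dst = P2[idx2], P1[idx1]
        # Closed-form update via SVD (Appendix A): align src to dst.
        c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
        H = (src - c_src).T @ (dst - c_dst)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                # guard against a reflection
            Vt[-1] *= -1
            R = Vt.T @ U.T
        T = c_dst - R @ c_src
    return R, T
```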

2.3 Local-to-global localization

The sensor localization is implemented during the incremental mapping. We map the current frame into the same world coordinate system and simultaneously locate the current sensor in the constructed map. The sensor localization estimates the matrices that describe the sensor state of each perceived frame in the map. In essence, this state matrix is the extrinsic matrix of the camera, which can be obtained by

$$L = {\cal L}\left( {\mathbb {I},\mathbb {E}} \right),$$
where ${\cal L}$ describes the localization operation, and $\mathbb {E} = \{ {E_1};{E_2};\cdots ;{E_N}\}$ denotes the desired state matrices of $\mathbb {I} = \{ {I_1};{I_2};\cdots ;{I_N}\}$. The $n$-th state matrix can be described by ${E_n} = [r_n^{},t_n^{}]$, where $r_n$ and $t_n$ are the orientation and location matrices of the camera, respectively [42].

To solve the desired state matrices, the 2D and 3D features of each frame after incremental mapping are reused to simultaneously perform the single-view sensor localization. However, for the incremental calculation processes, the accumulation of errors between adjacent frames is inevitable [59]. To eliminate these errors, the obtained state matrices are locally optimized by analyzing common features of adjacent frames, and synchronously globally optimized to obtain globally consistent results by loop closing correction of all frames. We emphasize that the obtained state matrices can also be used to improve the mapping results, thus enabling high-accuracy SLAM.

2.3.1 Single-view sensor localization

The single-view localization calculates the sensor state (extrinsic) matrix from partial 2D points and the corresponding 3D points by using the pre-calibrated system parameters [60]. The 2D and 3D features extracted in the mapping process are reused.

Taking the features extracted from $f_2^{}$ as an example, the single-view localization process is provided in Fig. 7. As shown in Fig. 5, we have six 2D features $\mathbb {F}_2^I = \{ F_{2\_1}^I;F_{2\_2}^I;F_{2\_3}^I;F_{2\_4}^I;F_{2\_5}^I;F_{2\_6}^I\}$ and six 3D features $\mathbb {F}_2^P = \{ F_{2\_1}^P;F_{2\_2}^P;F_{2\_3}^P;F_{2\_4}^P;F_{2\_5}^P;F_{2\_6}^P\}$. Because the sensor state needs to be estimated in the constructed map, these 3D features $\mathbb {F}_2^P$ need to be transformed into the map coordinate system by ${m_2} \cdot \mathbb {F}_2^P$.

Fig. 7. Schematic of single-view sensor localization.

In Fig. 7, the used 2D and transformed 3D features are connected by gray dashed lines. These 2D features are located in camera pixel coordinate ($u$, $v$) and these 3D features are located in the map coordinate $({x_w},{y_w},{z_w})$. These gray dashed lines in Fig. 7 plot the imaging process based on the pinhole model. By substituting $\mathbb {F}_2^P$ with ${m_2} \cdot \mathbb {F}_2^P$ in Eq. (2), the desired single-view sensor state matrix ${E_2} = [r_2^{},t_2^{}]$ can be calculated by following Appendix B. The sensor state is also visualized as blue "camera" graphics. The calculated $r_2$ and $t_2$ determine the orientation and location of $f_2$, respectively.

2.3.2 Local localization optimization

As we know, each feature is perceived many times from different perspectives. Thus there are usually many common features between adjacent frames. The state matrices can be locally optimized to reduce the accumulated errors by using the spatial constraints of these common features.

Selecting three frames $f_1$, $f_2$ and $f_3$ from Fig. 5, the local optimization process is illustrated in Fig. 8. The corresponding initial sensor states are also visualized by blue "camera" graphics for clarity. The selected three frames contain two common features, $F_3$ and $F_4$, which are connected by colored dashed lines. Because any spatial geometric structure (e.g., spatial location and topology) is consistent across different frames, the state matrices $E_1$, $E_2$ and $E_3$ can be simply optimized by minimizing the re-projection errors [61]. Taking $F_3$ in $f_1$ as an example, the re-projection error can be calculated by

$${\chi _{1\_3}} = F_{1\_3}^I - \frac{1}{{\rm{s}}} \cdot A \cdot {E_1} \cdot {m_1} \cdot F_{1\_3}^P.$$
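Reading $E_1$ and $m_1$ as 3 × 4 matrices applied to homogeneous coordinates, the re-projection error of Eq. (8) can be evaluated as in the following sketch (this reading of the notation is our assumption):

```python
import numpy as np

def reprojection_error(feat_2d, feat_3d, A, E, m):
    """Re-projection error chi of one feature, following Eq. (8).

    feat_2d: observed pixel (u, v); feat_3d: the matching 3D feature in its
    frame's coordinates; A: 3x3 intrinsic matrix; E, m: 3x4 state and mapping
    matrices applied to homogeneous coordinates.
    """
    X_map = m @ np.append(feat_3d, 1.0)       # feature expressed in the map coordinate
    x_cam = A @ (E @ np.append(X_map, 1.0))   # projected with the current sensor state
    s = x_cam[2]                              # the scaling factor of Eq. (8)
    return np.asarray(feat_2d) - x_cam[:2] / s
```

Stacking these residuals over the common features and minimizing them over the state matrices yields the local optimization of Eq. (9) below.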

Fig. 8. Schematic of local optimization.

Considering $f_1$, $f_2$ and $f_3$ constrained by $F_3$ and $F_4$, three state matrices, $E_1$, $E_2$ and $E_3$, can be optimized by

$$\begin{aligned} J({E_1},{E_2},{E_3}) &= \arg \min \sum_{n = 1}^3 {\sum_{k = 3}^4 {{{\left\| {{\chi _{n\_k}}} \right\|}^2}} }\\ &= \arg \min \sum_{n = 1}^3 {\sum_{k = 3}^4 {{{\left\| {F_{n\_k}^I - \frac{1}{s} \cdot A \cdot {E_n} \cdot {m_n} \cdot F_{n\_k}^P} \right\|}^2}}} . \end{aligned}$$

The sensor state can be adjusted by refreshing $E_1$, $E_2$ and $E_3$ for $f_1$, $f_2$ and $f_3$, respectively. In Fig. 8, the sensor state before and after the localization optimization are plotted by blue and red “camera” graphics, respectively.

To guarantee a sufficient number of common features, the number of frames used for local optimization should be chosen flexibly based on the device movement speed and capture frame rate. In practice, local optimization is usually executed for every five adjacent frames, because in this way the views of adjacent frames do not deviate too much.

2.3.3 Global localization optimization

The global optimization further optimizes the state matrices to obtain consistent localization results by using the loop closing correction [62]. The loop closing point is first detected by similar features search and then used for global optimization by following the loop consistency.

The global optimization process is illustrated in Fig. 9. The yellow and blue lines denote the true and the measured trajectories, respectively. The cumulative error increases as the sensor moves forward. When the sensor reaches the closing points, the graph-based method [63] is used to revise all sensor state matrices. The red dashed line in Fig. 9 denotes the revised trajectory.

Fig. 9. Schematic of global optimization.

The details of the global optimization process are as follows. As shown in Fig. 9, the incremental state matrices are used as the nodes to construct the graph. Features searched during the construction process are reused to detect the closing points of each loop. Assume the sensor starts from the first frame and returns to the initial position at the $N$-th frame. Thus, a loop is constructed from $E_1$ to $E_N$, where $E_1$ and $E_N$ are a pair of detected loop closing points. These two state matrices should be equal but cannot close the loop because of the accumulated errors. Based on the global loop consistency, all state matrices can be optimized by minimizing the loop closing error. Taking $E_1$ and $E_N$ as an example, their loop closing error can be calculated by

$${e_N} = {\gamma _{N - 1}} - {\widehat{\gamma} _{N - 1}},$$
where ${\gamma _{N - 1}}$ and ${\widehat {\gamma } _{N - 1}}$ denote the relative motion and the expected motion between $E_{N-1}$ and $E_{N}$, respectively.

To achieve global optimization, the whole state matrices can be revised by

$$J(\mathbb {E}) = \mathop {\arg \min }_{}^{} C(\mathbb {E}),$$
where $C(\mathbb {E}) = \sum \nolimits _{n = 1}^N {e_{n}^T{\Omega _n}{e_n}}$ denotes the global loop closing error and ${\Omega _{n}}$ denotes the information matrix, which defaults to the identity matrix. The Gauss-Newton algorithm [64] is used to solve Eq. (11).
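For illustration, the loop-closing errors and the global cost $C(\mathbb{E})$ of Eq. (11) can be evaluated as in the following translation-only sketch; the full method optimizes the complete state matrices with a graph-based Gauss-Newton solver [63,64], which is omitted here for brevity.

```python
import numpy as np

def loop_closing_cost(locations, expected_motions, Omega=None):
    """Global loop-closing cost C(E) of Eq. (11), translational part only.

    locations: (N, 3) estimated sensor locations t_1..t_N along the loop.
    expected_motions: (N-1, 3) expected relative motions between adjacent
    states (the rotational part is omitted in this simplified sketch).
    """
    relative = np.diff(locations, axis=0)     # gamma_{n-1}: measured relative motion
    errors = relative - expected_motions      # e_n = gamma_{n-1} - gamma_hat_{n-1}
    if Omega is None:                         # information matrix defaults to identity
        Omega = [np.eye(3)] * len(errors)
    return sum(e @ O @ e for e, O in zip(errors, Omega))
```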

For simplicity, the entire process of our FPP-SLAM is summarized as follows:

Step 1: sequentially perceive 3D point clouds and 2D texture by using an FPP system;

Step 2: incrementally transform each frame into the map and simultaneously determine the corresponding sensor state for each frame;

Step 3: repeat steps 1 and 2, but if the number of captured frames reaches the threshold, execute localization optimization to eliminate the accumulated errors in the sensor states and revise the map;

Step 4: return to step 1 and repeat the above processes.

3. Experiments

The proposed FPP-SLAM is evaluated in both controlled and room-sized indoor scenes. The commonly used ToF-SLAM (i.e., Kinect plus ORB-SLAM3 [65]) is selected and compared with FPP-SLAM. All SLAM algorithms are implemented on an Ubuntu 16.04 platform, using a laptop with a 3.2 GHz AMD CPU and 16 GB memory.

Figure 10 shows the labeled diagram of the used FPP and Kinect sensors. The developed FPP system shown in Fig. 10(a) mainly includes a Basler acA1920-155uc camera with a resolution of 1920 $\times$ 1200 pixels, a LightCrafter 4500 projector with a resolution of 912 $\times$ 1140 pixels, and an interface expansion board. The fringe period of the designed sinusoidal fringe patterns is 16 pixels. The numbers of phase-shifting and Gray-code patterns are three and seven, respectively. The binary defocusing technique and a relatively low sensor movement speed (<0.05 m/s) are adopted to avoid motion errors caused by sensor movement [25,41]. Figure 10(b) shows the selected commercial Azure Kinect DK [66], which includes an RGB camera, a depth camera and infrared emitters. The corresponding parameters are provided in Table 1.

Fig. 10. Labeled diagram of (a) FPP sensor and (b) Kinect sensor.

Table 1. Parameters comparison of FPP sensor and Kinect

3.1 Accuracy evaluation of FPP-SLAM

A controlled scene is first constructed to evaluate FPP-SLAM. The experimental setup is shown on the left side of Fig. 11, which consists of an FPP sensor, a doll, and a turntable. The FPP sensor is placed at a fixed location and the doll is placed on the turntable, which is equivalent to the sensor rotating around the object. The turntable is spun through a full rotation to capture point clouds from different perspectives of the doll for mapping, and the estimated sensor state trajectory should therefore be a circle that returns to the origin. The obtained full 3D shape and trajectory can be used to evaluate the quality of localization and mapping. As shown on the right side of Fig. 11, the captured single-view point clouds are well transformed into the map and the estimated sensor state trajectory is globally consistent.

Fig. 11. Experimental setup and the results of FPP-SLAM.

To assess the repeatability, the turntable is rotated three times. The performance is evaluated by the deviation of the state coordinates determined when the turntable is rotated back to its original position. Figures 12(a) and (b) show the experimental results of our FPP-SLAM and the traditional ToF-SLAM, respectively. All these sensor states form a circular trajectory around the doll and achieve a closing loop. The sensor state coordinates of the three rotations for FPP-SLAM and ToF-SLAM are provided in Tables 2 and 3. The Root Mean Square Error (RMSE) between the sensor state coordinates determined by the first and third rotations is used to evaluate the localization accuracy. The RMSE values of FPP-SLAM and ToF-SLAM are 1.3 mm and 8.1 mm, respectively, i.e., the accuracy is improved by approximately 83.9$\%$. The mapping results of FPP-SLAM and ToF-SLAM are shown at the bottom of Figs. 12(a) and (b), respectively. FPP-SLAM achieves more accurate mapping results than the traditional ToF-SLAM.

Fig. 12. Comparison of the performance of (a) FPP-SLAM and (b) ToF-SLAM.

Table 2. State coordinates of FPP-SLAM.

Table 3. State coordinates of ToF-SLAM.

3.2 FPP-SLAM for actual indoor scenes

To investigate the performance of our method in real indoor scenes, we conducted another experiment in a corridor with a size of 2.37 $\times$ 15.94 m. The measured corridor is shown in Fig. 13(a). The system setup of this indoor experiment is shown in Fig. 13(b), which includes a computer, an FPP sensor, a battery and a mobile cart. The front-end FPP sensor is fixed to the mobile cart and scans the indoor scene in real time according to a pre-defined trajectory, working continuously to perceive the indoor scene data. The back-end algorithm simultaneously performs both mapping and localization along with the scanning of the indoor scene. ToF-SLAM is also implemented and used for comparison.

Fig. 13. (a) The measured indoor scene; (b) the system setup.

The results of FPP-SLAM and ToF-SLAM are shown in Figs. 14(a) and (b), respectively. Notably, to avoid interference from redundant frames, a keyframe strategy [67] is adopted to improve efficiency: not all frames participate in mapping and localization. Only when the current frame has changed significantly from the previous frame is it built into the map as a keyframe, and the final results are made up of keyframes. Therefore, data redundancy does not put memory pressure on the mapping and localization algorithm.

Fig. 14. Experimental results of the indoor scene by (a) FPP-SLAM and (b) ToF-SLAM.

As shown in Figs. 14(a) and (b), we provide a side view and a top view for these two methods. As shown in the enlarged regions, the map obtained by our FPP-SLAM has higher quality and accuracy and is obviously superior to that of ToF-SLAM. In addition, the map obtained by ToF-SLAM shows considerable layering, which is caused by the lack of robustness of the algorithm. In contrast, FPP-SLAM obtains ideal results. Next, to evaluate the sensor localization accuracy, the measured trajectory is compared with the actual pre-defined trajectory of the device movement, where the red "camera" graphics and yellow dashed lines denote the measured and the pre-defined trajectories, respectively. As we can see, the trajectory obtained by FPP-SLAM is closer to the actual device movement path. The RMSEs of the localization deviations of FPP-SLAM and ToF-SLAM are 7.4 mm and 68.6 mm, respectively. This experiment demonstrates that FPP-SLAM outperforms conventional ToF-SLAM in an actual indoor scene.

4. Conclusions

The availability of accurate 3D metrology technologies such as FPP has the potential to revolutionize the fields of indoor localization and mapping, but methods to date have not fully leveraged the accuracy and speed of perceiving that FPP offers. In this paper, FPP is first introduced into indoor SLAM to improve the front-end indoor perceiving quality, and the corresponding back-end localization and mapping algorithms are then designed to make full use of FPP, which enables an accurate and efficient FPP-SLAM. The capabilities of the proposed method have been demonstrated in the room-sized experiment. We believe that this paper will open a new door for the application of FPP in indoor SLAM.

Accurate indoor 3D perceiving remains the foremost prerequisite for indoor SLAM. In the future, to make our FPP-SLAM more practical, we will consider combining error compensation algorithms to completely solve the dynamic measurement issue. In addition, since the motion of the measurement device is controllable, this motion can be predicted and compensated for by adjusting the projection patterns of FPP, which is more favorable for solving the dynamic problem of FPP-SLAM. Furthermore, we will also explore complex scene perceiving to achieve higher-quality indoor 3D perceiving.

Appendix A: 3D features matching

Taking two sets of 3D features $\mathbb {F}_1^P = \{ F_{1\_1}^P;\cdots ;F_{1\_K}^P\}$ and $\mathbb {F}_2^P = \{ F_{2\_1}^P;\cdots ;F_{2\_K}^P\}$ as an example, they satisfy the transformation relationship $\mathbb{F}_1^P = [R_2^P,T_2^P] \cdot \mathbb{F}_2^P$. To solve the unknowns $R_2^P$ and $T_2^P$, the Singular Value Decomposition [57] is introduced to construct the iterative minimization function Eq. (4); the detailed solving process is as follows.

First, we calculate the centroids and the de-centered coordinates of the two sets of 3D features:

$$\bar{\mathbb {F}}_1^P = \frac{1}{K}\sum_{k = 1}^K {F_{1\_k}^P}, \bar{\mathbb {F}}_2^P = \frac{1}{K}\sum_{k = 1}^K {F_{2\_k}^P}.$$
$$\widehat{F}_{1\_k}^P = F_{1\_k}^P-\bar{\mathbb {F}}_1^P, \quad \widehat{F}_{2\_k}^P = F_{2\_k}^P-\bar{\mathbb {F}}_2^P.$$

Then, Eq. (4) can be rewritten as

$$\begin{aligned}J(R_2^P,T_2^P) &= \arg \min \sum_{k = 1}^K {{{\left\| {F_{1\_k}^P - (R_2^PF_{2\_k}^P + T_2^P)} \right\|}^2}}\\ &= \arg \min \sum_{k = 1}^K {{{\left\| {F_{1\_k}^P - R_2^PF_{2\_k}^P - T_2^P - \bar{\mathbb {F}}_1^P + R_2^P\bar{\mathbb {F}}_2^P + \bar{\mathbb {F}}_1^P - R_2^P\bar{\mathbb {F}}_2^P} \right\|}^2}}\\ &= \arg \min \sum_{k = 1}^K {{{\left\| \widehat{F}_{1\_k}^P - R_2^P\widehat{F}_{2\_k}^P \right\|}^2}}+{{{\left\| \bar{\mathbb {F}}_1^P - R_2^P\bar{\mathbb {F}}_2^P - T_2^P \right\|}^2}}. \end{aligned}$$

Letting $T_2^P =\bar {\mathbb {F}}_1^P - R_2^P\bar {\mathbb {F}}_2^P$, we have

$$\begin{aligned}J(R_2^P) &= \arg \min \sum_{k = 1}^K {{{\left\| \widehat{F}_{1\_k}^P - R_2^P\widehat{F}_{2\_k}^P \right\|}^2}}\\ &= \arg \min \sum_{k = 1}^K {\left[ {\left( \widehat{F}_{1\_k}^P \right)}^T \widehat{F}_{1\_k}^P + {\left( \widehat{F}_{2\_k}^P \right)}^T {\left( R_2^P \right)}^T R_2^P \widehat{F}_{2\_k}^P - 2{\left( \widehat{F}_{1\_k}^P \right)}^T R_2^P \widehat{F}_{2\_k}^P \right]}\\ &= \arg \max \sum_{k = 1}^K {\rm Trace}\left[ R_2^P \widehat{F}_{2\_k}^P {\left( \widehat{F}_{1\_k}^P \right)}^T \right]\\ &= \arg \max \, {\rm Trace}\left( R_2^P H \right), \quad H = \sum_{k = 1}^K \widehat{F}_{2\_k}^P {\left( \widehat{F}_{1\_k}^P \right)}^T. \end{aligned}$$

The $H$ matrix is then decomposed by SVD as

$${\rm{H}} = U\Lambda {V^T}.$$

The desired $R_2^P$ can be solved from $U$ and $V$ as

$$R_2^P = V{U^T}.$$

Finally, the $T_2^P$ can be solved by

$$T_2^P =\bar{\mathbb {F}}_1^P - R_2^P\bar{\mathbb {F}}_2^P.$$
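The derivation above reduces to a few lines of linear algebra. A minimal sketch is given below; the determinant check is a standard safeguard against reflections that the derivation does not spell out.

```python
import numpy as np

def best_fit_transform(F1, F2):
    """Closed-form solution of Eq. (4): the rigid transform [R_2^P, T_2^P]
    mapping the matched 3D features F2 onto F1 (both (K, 3) arrays).
    """
    c1, c2 = F1.mean(axis=0), F2.mean(axis=0)   # centroids, Eq. (12)
    F1c, F2c = F1 - c1, F2 - c2                 # de-centered features, Eq. (13)
    H = F2c.T @ F1c                             # H = sum_k F2c_k (F1c_k)^T, Eq. (15)
    U, _, Vt = np.linalg.svd(H)                 # Eq. (16)
    R = Vt.T @ U.T                              # Eq. (17)
    if np.linalg.det(R) < 0:                    # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = c1 - R @ c2                             # Eq. (18)
    return R, T
```

For example, `R, T = best_fit_transform(feats1_3d, feats2_3d)` yields the sparse-mapping matrix $m_2 = [R_2^P, T_2^P]$ of Eq. (5).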

Appendix B: Single-view sensor localization

As illustrated above, the single-view localization can be considered as the problem of solving the sensor extrinsic matrix from the pre-calibrated system parameters and partial 2D points with their corresponding 3D points [1]. Here, the extracted 2D features $\mathbb {F}_2^I$ and the transformed 3D features ${m_2} \cdot \mathbb {F}_2^P$ are used as the desired 2D and 3D points, which can be further described as

$$\mathbb {F}_2^I = \{ F_{2\_1}^I;\cdots;F_{2\_K}^I\} = \{ ({u_1},{v_1});\cdots;({u_K},{v_K})\}.$$
$${m_2} \cdot \mathbb {F}_2^P = {m_2} \cdot \{ F_{2\_1}^P;\cdots;F_{2\_K}^P\} = \{ ({x_1},{y_1},{z_1});\cdots;({x_K},{y_K},{z_K})\} .$$

According to the pinhole imaging model [42], the relationship between the 2D features $\mathbb {F}_2^I$ and 3D features ${m_2} \cdot \mathbb {F}_2^P$ can be described by

$$s\left[ \begin{array}{l} {u_k}\\ {v_k}\\ 1 \end{array} \right] = A \cdot E \cdot \left[ \begin{array}{l} {x_k}\\ {y_k}\\ {z_k}\\ 1 \end{array} \right]{\rm{ = }}A \cdot [r,t] \cdot \left[ \begin{array}{l} {x_k}\\ {y_k}\\ {z_k}\\ 1 \end{array} \right] = \left[ \begin{array}{llll} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24}\\ a_{31} & a_{32} & a_{33} & a_{34} \end{array} \right]\left[ \begin{array}{l} {x_k}\\ {y_k}\\ {z_k}\\ 1 \end{array} \right],$$
where $A$, $E$ and $s$ are the sensor intrinsic matrix, extrinsic matrix and scaling factor, respectively. Here, $E=[r,t]$ also denotes the orientation and location of the sensor state, which is the basis for the sensor localization [19].

Equation (21) can be expanded into the following form.

$$\begin{aligned}&s{u_k} = a_{11}^{}{x_k} + a_{12}^{}{y_k} + a_{13}^{}{z_k} + a_{14}^{};\\ &s{v_k} = a_{21}^{}{x_k} + a_{22}^{}{y_k} + a_{23}^{}{z_k} + a_{24}^{};\\ &s = a_{31}^{}{x_k} + a_{32}^{}{y_k} + a_{33}^{}{z_k} + a_{34}^{}. \end{aligned}$$

By eliminating $s$, Eq. (22) can be simplified as

$$\begin{aligned}&a_{11}^{}x_{_k}^{} + a_{12}^{}y_{_k}^{} + a_{13}^{}z_{_k}^{} + a_{14}^{} - a_{31}^{}x_{_k}^{}u_{_k}^{} - a_{32}^{}y_{_k}^{}u_{_k}^{} - a_{33}^{}z_{_k}^{}u_{_k}^{} - a_{34}^{}u_{_k}^{} = 0;\\ &a_{21}^{}x_{_k}^{} + a_{22}^{}y_{_k}^{} + a_{23}^{}z_{_k}^{} + a_{24}^{} - a_{31}^{}x_{_k}^{}v_{_k}^{} - a_{32}^{}y_{_k}^{}v_{_k}^{} - a_{33}^{}z_{_k}^{}v_{_k}^{} - a_{34}^{}v_{_k}^{} = 0. \end{aligned}$$

According to the above formulas, each set of 3D-2D matching points provides two equations; since there are 12 unknowns in total, at least 6 sets of matching points are required.

For our $K$ sets of matching features, Eq. (23) can be rewritten as

$$\left[ \begin{array}{llllllllllll} x_1^{} & y_1^{} & z_1^{} & 1 & 0 & 0 & 0 & 0 & - u_1^{}x_1^{} & - u_1^{}y_1^{} & - u_1^{}z_1^{} & - u_1^{}\\ 0 & 0 & 0 & 0 & x_1^{} & y_1^{} & z_1^{} & 1 & - v_1^{}x_1^{} & - v_1^{}y_1^{} & - v_1^{}z_1^{} & - v_1^{}\\ \ldots \;\; \ldots\\ x_K^{} &y_K^{} &z_K^{} & 1 & 0 & 0 & 0 & 0 & - u_K^{}x_K^{} &- u_K^{}y_K^{} & - u_K^{}z_K^{} &- u_K^{}\\ 0 & 0 & 0 & 0 & x_K^{} & y_K^{} & z_K^{} & 1 & - v_K^{}x_K^{} &- v_K^{}y_K^{} & - v_K^{}z_K^{} & - v_K^{} \end{array} \right]\left[ \begin{array}{l} a_{_{11}}^{}\\ a_{_{12}}^{}\\ a_{_{13}}^{}\\ a_{_{14}}^{}\\ a_{_{21}}^{}\\ a_{_{22}}^{}\\ a_{_{23}}^{}\\ a_{_{24}}^{}\\ a_{_{31}}^{}\\ a_{_{32}}^{}\\ a_{_{33}}^{}\\ a_{_{34}}^{} \end{array} \right] = 0.$$

The above formula can be simplified as

$$GF=0,$$

where $G$ denotes the $2K \times 12$ coefficient matrix above and $F$ denotes the vector of the 12 unknowns $a_{ij}$.

When $K=6$, the system of linear equations can be solved directly.

When $K\ge 6$, we can obtain the least-squares solution under the constraint $\left\| F \right\| = 1$ by applying SVD to the coefficient matrix,

$$G = UD{V^T},$$

where the last column of the $V$ matrix is the solution $F$.

Reshaping the solved $F$ into a 3 $\times$ 4 matrix and comparing it with Eq. (21), we can obtain Eq. (27).

$$F = \left[ {A \cdot r,\; A \cdot t} \right].$$

Thus, the orientation matrix $r$ and the location matrix $t$ can be calculated as follows.

$$r = {A^{ - 1}}\left[ \begin{array}{lll} a_{_{11}}^{} & a_{_{12}}^{} & a_{_{13}}^{}\\ a_{_{21}}^{} & a_{_{22}}^{} & a_{_{23}}^{}\\ a_{_{31}}^{} & a_{_{32}}^{} & a_{_{33}}^{} \end{array} \right].$$
$$t = {A^{ - 1}}\left[ \begin{array}{l} {a_{14}}\\ {a_{24}}\\ {a_{34}} \end{array} \right].$$
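A minimal sketch of this single-view (DLT) localization is given below; it assembles the coefficient matrix of Eq. (24), solves it by SVD and recovers $r$ and $t$ with Eqs. (28) and (29). The scale and orthogonality of the recovered rotation are not normalized here, which a practical implementation would add.

```python
import numpy as np

def single_view_localization(feats_2d, feats_3d_map, A):
    """DLT estimate of the sensor state E = [r, t] from K >= 6 matched 2D
    pixels (u, v) and 3D map points (x, y, z), following Eqs. (23)-(29).

    feats_2d: (K, 2); feats_3d_map: (K, 3); A: 3x3 intrinsic matrix.
    """
    K = feats_2d.shape[0]
    G = np.zeros((2 * K, 12))                        # coefficient matrix of Eq. (24)
    X = np.hstack([feats_3d_map, np.ones((K, 1))])   # homogeneous 3D map points
    u, v = feats_2d[:, 0:1], feats_2d[:, 1:2]
    G[0::2, 0:4] = X
    G[0::2, 8:12] = -u * X
    G[1::2, 4:8] = X
    G[1::2, 8:12] = -v * X
    _, _, Vt = np.linalg.svd(G)                      # least squares with ||F|| = 1
    F = Vt[-1].reshape(3, 4)                         # the stacked a_ij of Eq. (21)
    A_inv = np.linalg.inv(A)
    r = A_inv @ F[:, :3]                             # Eq. (28)
    t = A_inv @ F[:, 3]                              # Eq. (29)
    return r, t
```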

Funding

National Postdoctoral Program for Innovative Talents (2022ZB256); China Postdoctoral Science Foundation (2022M721620); Jiangsu Provincial Key Research and Development Program (BE2022391); National Natural Science Foundation of China (61727802, 62201261).

Acknowledgments

We thank Chenxing Wang and Rongbiao Zhu for technical supports and experimental discussion. We also thank Xiaohuan Wang for the grammar suggestions.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. A. Singandhupe and H. M. La, “A review of slam techniques and security in autonomous driving,” in 2019 third IEEE international conference on robotic computing (IRC), (IEEE, 2019), pp. 602–607.

2. H. Lategahn, A. Geiger, and B. Kitt, “Visual slam for autonomous ground vehicles,” in 2011 IEEE International Conference on Robotics and Automation, (IEEE, 2011), pp. 1732–1737.

3. T. Mill, A. Alt, and R. Liias, “Combined 3d building surveying techniques–terrestrial laser scanning (tls) and total station surveying for bim data management purposes,” J. Civ. Eng. Manag. 19(Supplement_1), S23–S32 (2013). [CrossRef]  

4. V. Krauß, A. Boden, L. Oppermann, and R. Reiners, “Current practices, challenges, and design implications for collaborative ar/vr application development,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, (2021), pp. 1–15.

5. P. Newman, D. Cole, and K. Ho, “Outdoor slam using visual appearance and laser ranging,” in Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., (IEEE, 2006), pp. 1180–1187.

6. S. Zlatanova, G. Sithole, M. Nakagawa, and Q. Zhu, “Problems in indoor mapping and modelling,” Acquisition and Modelling of Indoor and Enclosed Environments 2013, Cape Town, South Africa, 11-13 December 2013, ISPRS Archives Volume XL-4/W4, 2013 (2013).

7. J. D. Blower, “Gis in the cloud: implementing a web map service on google app engine,” in Proceedings of the 1st International Conference and Exhibition on Computing for Geospatial Research & Application, (2010), pp. 1–4.

8. T. Yoshida, K. Nagatani, S. Tadokoro, T. Nishimura, and E. Koyanagi, “Improvements to the rescue robot quince toward future indoor surveillance missions in the fukushima daiichi nuclear power plant,” in Field and service robotics, (Springer, 2014), pp. 19–32.

9. H. Durrant-Whyte and T. Bailey, “Simultaneous localization and mapping: part i,” IEEE Robot. Automat. Mag. 13(2), 99–110 (2006). [CrossRef]  

10. Z. Ren, L. Wang, and L. Bi, “Robust gicp-based 3d lidar slam for underground mining environment,” Sensors 19(13), 2915 (2019). [CrossRef]  

11. R. Mur-Artal and J. D. Tardós, “Visual-inertial monocular slam with map reuse,” IEEE Robot. Autom. Lett. 2(2), 796–803 (2017). [CrossRef]  

12. R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in 2011 10th IEEE international symposium on mixed and augmented reality, (Ieee, 2011), pp. 127–136.

13. M. Zeybek, “Indoor mapping and positioning applications of hand-held lidar simultaneous localization and mapping (slam) systems,” Türkiye Lidar Dergisi 3(1), 7–16 (2021). [CrossRef]  

14. J. Smisek, M. Jancosek, and T. Pajdla, “3d with kinect,” in Consumer depth cameras for computer vision, (Springer, 2013), pp. 3–25.

15. Y. M. Mustafah, A. W. Azman, and F. Akbar, “Indoor uav positioning using stereo vision sensor,” Procedia Eng. 41, 575–579 (2012). [CrossRef]  

16. S. Zhang, “Recent progresses on real-time 3d shape measurement using digital fringe projection techniques,” Opt. Lasers Eng. 48(2), 149–158 (2010). [CrossRef]  

17. Z. Zhang, D. Zhang, and X. Peng, “Performance analysis of a 3d full-field sensor based on fringe projection,” Opt. Lasers Eng. 42(3), 341–353 (2004). [CrossRef]  

18. Q. Zhang, X. Su, L. Xiang, and X. Sun, “3-d shape measurement based on complementary gray-code light,” Opt. Lasers Eng. 50(4), 574–579 (2012). [CrossRef]  

19. R. Azzam, T. Taha, S. Huang, and Y. Zweiri, “Feature-based visual simultaneous localization and mapping: A survey,” SN Appl. Sci. 2(2), 224 (2020). [CrossRef]  

20. J. Li, R. Zhong, Q. Hu, and M. Ai, “Feature-based laser scan matching and its application for indoor mapping,” Sensors 16(8), 1265 (2016). [CrossRef]  

21. A. Segal, D. Haehnel, and S. Thrun, “Generalized-icp,” in Robotics: science and systems, vol. 2 (Seattle, WA, 2009), p. 435.

22. Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey, “Pointnetlk: Robust & efficient point cloud registration using pointnet,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2019), pp. 7163–7172.

23. J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2013), pp. 2930–2937.

24. J. Yang, H. Li, D. Campbell, and Y. Jia, “Go-icp: A globally optimal solution to 3d icp point-set registration,” IEEE Trans. Pattern Anal. Mach. Intell. 38(11), 2241–2254 (2016). [CrossRef]  

25. D. Zheng, Q. Kemao, J. Han, J. Wang, H. Yu, and L. Bai, “High-speed phase-shifting profilometry under fluorescent light,” Opt. Lasers Eng. 128, 106033 (2020). [CrossRef]  

26. Z. Zhang, “Review of single-shot 3d shape measurement by phase calculation-based fringe projection techniques,” Opt. Lasers Eng. 50(8), 1097–1106 (2012). [CrossRef]  

27. R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Trans. Robot. 31(5), 1147–1163 (2015). [CrossRef]  

28. R. Otero, S. Lagüela, I. Garrido, and P. Arias, “Mobile indoor mapping technologies: A review,” Autom. Constr. 120, 103399 (2020). [CrossRef]  

29. Q. Zhang and X. Su, “High-speed optical measurement for the drumhead vibration,” Opt. Express 13(8), 3110–3116 (2005). [CrossRef]  

30. W. Guo, Z. Wu, Q. Zhang, and Y. Wang, “Real-time motion-induced error compensation for 4-step phase-shifting profilometry,” Opt. Express 29(15), 23822–23834 (2021). [CrossRef]  

31. Y. Wang, Z. Liu, C. Jiang, and S. Zhang, “Motion induced phase error reduction using a hilbert transform,” Opt. Express 26(26), 34224–34235 (2018). [CrossRef]  

32. S. Feng, L. Zhang, C. Zuo, T. Tao, Q. Chen, and G. Gu, “High dynamic range 3d measurements with fringe projection profilometry: a review,” Meas. Sci. Technol. 29(12), 122001 (2018). [CrossRef]  

33. S. S. Gorthi and P. Rastogi, “Fringe projection techniques: whither we are?” Opt. Lasers Eng. 48(2), 133–140 (2010). [CrossRef]  

34. S. Lv, D. Tang, X. Zhang, D. Yang, W. Deng, and Q. Kemao, “Fringe projection profilometry method with high efficiency, precision, and convenience: theoretical analysis and development,” Opt. Express 30(19), 33515–33537 (2022). [CrossRef]  

35. C. Zuo, S. Feng, L. Huang, T. Tao, W. Yin, and Q. Chen, “Phase shifting algorithms for fringe projection profilometry: A review,” Opt. Lasers Eng. 109, 23–59 (2018). [CrossRef]  

36. Z. Wu, W. Guo, Y. Li, Y. Liu, and Q. Zhang, “High-speed and high-efficiency three-dimensional shape measurement based on gray-coded light,” Photonics Res. 8(6), 819–829 (2020). [CrossRef]  

37. X. He and Q. Kemao, “A comparison of n-ary simple code and n-ary gray code phase unwrapping in high-speed fringe projection profilometry,” Opt. Lasers Eng. 128, 106046 (2020). [CrossRef]  

38. G. Sansoni, S. Corini, S. Lazzari, R. Rodella, and F. Docchio, “Three-dimensional imaging based on gray-code light projection: characterization of the measuring algorithm and development of a measuring system for industrial applications,” Appl. Opt. 36(19), 4463–4472 (1997). [CrossRef]  

39. Z. Wu, W. Guo, L. Lu, and Q. Zhang, “Generalized phase unwrapping method that avoids jump errors for fringe projection profilometry,” Opt. Express 29(17), 27181–27192 (2021). [CrossRef]  

40. D. Zheng, F. Da, Q. Kemao, and H. S. Seah, “Phase-shifting profilometry combined with gray-code patterns projection: unwrapping error removal by an adaptive median filter,” Opt. Express 25(5), 4700–4713 (2017). [CrossRef]  

41. Z. Wu, W. Guo, and Q. Zhang, “High-speed three-dimensional shape measurement based on shifting gray-code light,” Opt. Express 27(16), 22631–22644 (2019). [CrossRef]  

42. S. Zhang and P. S. Huang, “Novel method for structured light system calibration,” Opt. Eng. 45(8), 083601 (2006). [CrossRef]  

43. Z. Zhang, S. Huang, S. Meng, F. Gao, and X. Jiang, “A simple, flexible and automatic 3d calibration method for a phase calculation-based fringe projection imaging system,” Opt. Express 21(10), 12218–12227 (2013). [CrossRef]  

44. S. Zhang and S.-T. Yau, “Simultaneous geometry and color texture acquisition using a single-chip color camera,” in Interferometry XIV: Techniques and Analysis, vol. 7063 (SPIE, 2008), pp. 210–217.

45. C. Zuo, T. Tao, S. Feng, L. Huang, A. Asundi, and Q. Chen, “Micro fourier transform profilometry (μftp): 3d shape measurement at 10,000 frames per second,” Opt. Lasers Eng. 102, 70–91 (2018). [CrossRef]  

46. Z. Wu, C. Zuo, W. Guo, T. Tao, and Q. Zhang, “High-speed three-dimensional shape measurement based on cyclic complementary gray-code light,” Opt. Express 27(2), 1283–1297 (2019). [CrossRef]  

47. S. Feng, Q. Chen, G. Gu, T. Tao, L. Zhang, Y. Hu, W. Yin, and C. Zuo, “Fringe pattern analysis using deep learning,” Adv. Photonics 1(02), 1 (2019). [CrossRef]  

48. H. Yu, X. Chen, Z. Zhang, C. Zuo, Y. Zhang, D. Zheng, and J. Han, “Dynamic 3-d measurement based on fringe-to-fringe transformation using deep learning,” Opt. Express 28(7), 9405–9418 (2020). [CrossRef]  

49. Y. Zhao, H. Yu, R. Zhu, K. Zhang, X. Chen, Y. Zhang, D. Zheng, and J. Han, “Novel optical-markers-assisted point clouds registration for panoramic 3d shape measurement,” Opt. Lasers Eng. 161, 107319 (2023). [CrossRef]  

50. P. Dong and N. P. Galatsanos, “Affine transformation resistant watermarking based on image normalization,” in Proceedings. International Conference on Image Processing, vol. 3 (IEEE, 2002), pp. 489–492.

51. K. G. Derpanis, “The Harris corner detector,” 2, 1–2 (York University, 2004).

52. F. Bellavia, D. Tegolo, and C. Valenti, “Improving harris corner selection strategy,” IET Comput. Vis. 5(2), 87–96 (2011). [CrossRef]  

53. S. Feng, C. Zuo, L. Zhang, T. Tao, Y. Hu, W. Yin, J. Qian, and Q. Chen, “Calibration of fringe projection profilometry: A comparative review,” Opt. Lasers Eng. 143, 106622 (2021). [CrossRef]  

54. H. Bay, T. Tuytelaars, and L. V. Gool, “Surf: Speeded up robust features,” in European conference on computer vision, (Springer, 2006), pp. 404–417.

55. J. M. Phillips, R. Liu, and C. Tomasi, “Outlier robust icp for minimizing fractional rmsd,” in Sixth International Conference on 3-D Digital Imaging and Modeling (3DIM 2007), (IEEE, 2007), pp. 427–434.

56. P. Wei, L. Yan, H. Xie, and M. Huang, “Automatic coarse registration of point clouds using plane contour shape descriptor and topological graph voting,” Autom. Constr. 134, 104055 (2022). [CrossRef]  

57. H. Andrews and C. Patterson, “Singular value decomposition (svd) image coding,” IEEE Trans. Commun. 24(4), 425–432 (1976). [CrossRef]  

58. Y. Zhao, R. Zhu, K. Zhang, H. Yu, L. Bai, D. Zheng, and J. Han, “Accurate dynamic 3-d shape measurement based on the fringe pattern super-reconstruction technique,” Measurement 200, 111575 (2022). [CrossRef]  

59. M. J. Westoby, J. Brasington, N. F. Glasser, M. J. Hambrey, and J. M. Reynolds, “’structure-from-motion’photogrammetry: A low-cost, effective tool for geoscience applications,” Geomorphology 179, 300–314 (2012). [CrossRef]  

60. J. Wang, R. K. Ghosh, and S. K. Das, “A survey on sensor localization,” J. Control Theory Appl. 8(1), 2–11 (2010). [CrossRef]  

61. B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” in International workshop on vision algorithms, (Springer, 1999), pp. 298–372.

62. B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid, and J. Tardós, “A comparison of loop closing techniques in monocular slam,” Robotics Auton. Syst. 57(12), 1188–1197 (2009). [CrossRef]  

63. G. Grisetti, R. Kümmerle, C. Stachniss, and W. Burgard, “A tutorial on graph-based slam,” IEEE Intell. Transport. Syst. Mag. 2(4), 31–43 (2010). [CrossRef]  

64. M. H. Loke and T. Dahlin, “A comparison of the gauss–newton and quasi-newton methods in resistivity imaging inversion,” J. Appl. Geophys. 49(3), 149–162 (2002). [CrossRef]  

65. R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Trans. Robot. 33(5), 1255–1262 (2017). [CrossRef]  

66. M. Antico, N. Balletti, G. Laudato, A. Lazich, M. Notarantonio, R. Oliveto, S. Ricciardi, S. Scalabrino, and J. Simeone, “Postural control assessment via microsoft azure kinect dk: An evaluation study,” Comput. Methods Programs Biomed. 209, 106324 (2021). [CrossRef]  

67. G. Klein and D. Murray, “Improving the agility of keyframe-based slam,” in European conference on computer vision, (Springer, 2008), pp. 802–815.
