
DSCNet: lightweight and efficient self-supervised network via depthwise separable cross convolution blocks for speckle image matching

Open Access

Abstract

In recent years, speckle structured light has become a research hotspot due to its ability to acquire target three-dimensional information from a single projected image. To address the challenges of a low number of extracted speckle feature points, a high mismatch rate and poor real-time performance in traditional algorithms, as well as the obstacle of requiring expensive annotated data in deep learning-based methods, a lightweight and efficient self-supervised convolutional neural network (CNN) is proposed to achieve high-precision and rapid matching of speckle images. First, to efficiently utilize the speckle projection information, a feature extraction backbone based on depthwise separable cross convolution blocks is proposed. Second, in the feature detection module, a softargmax detection head is designed to refine the coordinates of speckle feature points to sub-pixel accuracy. In the feature description module, a coarse-to-fine module is presented to further refine matching accuracy. Third, we adopt strategies of transfer learning and self-supervised learning to improve the generalization and feature representation capabilities of the model. Data augmentation and real-time training techniques are used to improve the robustness of the model. The experimental results show that the proposed method achieves a mean matching accuracy of 91.62% for speckle feature points on the pilot’s helmet, with a mere 0.95% mismatch rate. The full model runs in 42 ms for a speckle image pair on an RTX 3060.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Image matching [1] is one of the fundamental research topics in computer vision and image processing, and is widely used in visual 3D reconstruction [2], pose estimation [3], object recognition [4] and object tracking [5]. Its purpose is to obtain one or more transformations in the transformation space so that two or more images of the same scene, acquired at different times, with different sensors or from different perspectives, become spatially consistent. Typically, image matching results are influenced by various factors, which fall into two primary aspects: the first concerns the reliability of the data source, especially the quality of the collected source images; the second relates to the effectiveness of the feature extraction and matching methods.

With the advantages of simple hardware configuration and high measurement accuracy, the combined optical measurement approach of vision and structured light has become a prevalent non-contact measurement technique [6–9]. Compared with purely vision-based methods, structured light-based measurement methods can easily overcome the problem of low matching accuracy caused by weak-texture regions and improve the quality of the captured images. The two most commonly used structured light projection patterns are fringe projection profilometry (FPP) and speckle projection profilometry (SPP) [10]. Different from FPP, since SPP can acquire target information by projecting a single frame [11], it can mitigate the impact of disturbances caused by system vibrations on the dynamic displacement or shape changes of the target surface [12]. Moreover, SPP exhibits excellent resistance to ambient light interference [13]. Therefore, measurement methods based on SPP have significant advantages in real-time acquisition of three-dimensional motion information of objects [14].

Slow measurement speed and low accuracy are two challenges that traditional SPP-based measurement methods need to address. To solve the issue of slow measurement speed in SPP, methods such as GPU acceleration [15] and the utilization of laser speckle [16] for rapid projection were proposed. To handle the problem of low measurement accuracy, Khan et al. [17] utilized a modified Harris corner detection algorithm to extract feature points and used the KLT tracking method to search for corresponding points in the matching images. Yang et al. [18] proposed an improved blob detection algorithm based on LoG operators. The algorithm could efficiently extract blobs of various scales and brightness at once and suppress the influence of edges and corners; however, it extracts few small-sized speckles. Guo et al. [19] employed optimized projective rectification and simplified subpixel matching techniques to search for corresponding point pairs rapidly. Yeh et al. [20] utilized Gabor filters, SIFT (Scale-Invariant Feature Transform) and projection to extract features of laser speckle images, and used the K-means algorithm to build an indexing structure for the database to accelerate the matching process. In summary, since the efficacy of traditional methods typically relies on predefined or handcrafted features, expert knowledge has a significant impact on such methods [21]. If the manually selected features are not suitable, the matching accuracy will be seriously affected. Deep learning-based models, by contrast, can autonomously learn and extract relevant features directly from the input data, which proves advantageous when dealing with diverse input images [22].

In recent years, deep learning has demonstrated powerful feature learning and representation capabilities in various visual tasks, enabling it to handle many challenges that are difficult for traditional methods to solve [23]. In the field of computer vision, many network architectures based on convolutional neural networks (CNNs) achieve significant improvements in precision, matching accuracy and calculation speed compared with traditional binocular matching methods [24]. However, there are currently few speckle feature point detection algorithms based on deep learning. Yin et al. [25] leveraged a multi-scale residual subnetwork to synchronously extract compact feature tensors from speckle images and proposed a lightweight U-net network to achieve higher matching performance. Wang et al. [26] presented a densely connected stereo matching network that adopts densely connected feature extraction and incorporates attention weight volume construction. However, both of the mentioned networks merely take speckle images as the network input, and their datasets are constructed by FPP with multiple fringe projections, which increases the amount of data collection and the device complexity. In addition, there is room for optimizing and simplifying the network architecture. Dong et al. [27] introduced a lightweight and edge-preserving speckle matching network based on digital speckle correlation. However, the network requires epipolar rectification of the two input images and still relies on traditional stereo matching methods to construct the speckle datasets.

To tackle the above challenges and further improve accuracy and speed, a lightweight and efficient self-supervised CNN called depthwise separable cross network (DSCNet) is developed. We construct a novel backbone by introducing depthwise separable cross convolution blocks to reduce computational complexity and enhance feature extraction capability. At the head network, we use a softargmax detector head to refine the coordinates of speckle feature points to the sub-pixel level. In addition, a coarse-to-fine module is adopted to further improve the matching accuracy. Self-supervised learning and transfer learning are introduced to train DSCNet for the extraction and matching of speckle feature points, which effectively enhances the generalization and feature representation capabilities of the model. Data augmentation and real-time training techniques enable DSCNet to comprehensively learn the various features and patterns present in rich data from real-world scenarios, thereby improving the model’s robustness. Compared to other traditional algorithms and deep learning-based methods, DSCNet demonstrates significant advantages in matching accuracy, mismatch rate and matching speed. DSCNet contains only 1.78 million parameters, most of which are shared between the two branches. The experimental system based on DSCNet runs in 42 ms for a speckle image pair. To the best of our knowledge, our network is the first deep learning-based network capable of end-to-end self-supervised learning for the extraction and matching of speckle feature points.

2. Proposed method

2.1 Measurement system

As shown in Fig. 1, the binocular measurement method based on laser speckle projection combines speckle structured light projection with binocular stereo vision. A laser speckle projector projects a speckle pattern onto the surface of the measured object, and the corresponding left and right images are captured by the binocular cameras. Subsequently, a deep learning-based network processes the captured images to extract the coordinates of each speckle feature point and match corresponding points between the left and right views.

Fig. 1. Schematic diagram of measurement system.

2.2 Overall architecture of DSCNet

Considering the deployment requirements of the deep learning-based measurement system in a complex cockpit environment, the accuracy and efficiency of the speckle feature point extraction and matching method should be coordinated and balanced. Inspired by SuperPoint [28] and LoFTR [29], we propose a lightweight and efficient CNN called DSCNet. As shown in Fig. 2, the overall architecture consists of a shared backbone network and two branches. DSCNet takes a gray image pair $I^{{A}}$ and $I^{{B}}$ as input and outputs feature point heatmaps $H^{{A}}$ and $H^{{B}}$, as well as the matching results $M_{{f}}$.

Fig. 2. Overall architecture of the proposed DSCNet.

The backbone network is composed of three down blocks and one cross convolution block, which reduce the input image dimensionality and extract image feature information. Following the backbone network, the architecture splits into two branches for feature point detection and description. Most of the network’s parameters are shared between the two tasks. Compared to traditional systems that first detect feature points and then compute descriptors [30,31], the architecture of DSCNet not only reduces the number of learned parameters but also improves the shared computation and representation capabilities between the two tasks.

2.3 Backbone

The speckle image captured by the measurement system is shown in Fig. 3. For a speckle feature point, its image area ranges from 6 to 10 pixels, exhibiting a gray distribution pattern with a bright center pixel and dark edge pixels. To extract the gray information of speckle feature points more efficiently and accurately, we propose a novel lightweight and efficient feature extraction backbone, as shown in Fig. 4. The backbone comprises depthwise separable cross convolutional layers, cross convolutional layers, residual layers, normalization layers, activation layers and max-pooling layers.

Fig. 3. Grayscale distribution pattern of speckle feature points.

Fig. 4. Feature extraction backbone.

In the backbone, considering the distribution pattern of speckles, we design a novel depthwise separable cross convolution block, which is shown in Fig. 5.

Fig. 5. Depthwise separable cross convolution block.

Depthwise separable cross convolution is a form of factorized convolution that factorizes a standard convolution into a depthwise convolution and a pointwise convolution [32]. Considering the distribution pattern of speckle feature points, in the depthwise convolution we introduce 1$\times$k and k$\times$1 asymmetric convolutional kernels alongside the standard k$\times$k convolutional kernel, forming a cross-shaped convolutional kernel. Therefore, depthwise separable cross convolution not only reduces computational cost and network parameters through depthwise separable convolution [33], but also enhances the network’s feature extraction capability through cross convolution.
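A minimal PyTorch sketch of such a block is given below. The class name, channel counts, normalization choice, padding and the summation of the three depthwise branches are our assumptions; the paper only specifies the cross-shaped depthwise kernels (k$\times$k, 1$\times$k and k$\times$1) followed by a 1$\times$1 pointwise convolution, normalization and ELU activation, and the exact layout of the block in Fig. 5 may differ.

```python
# Sketch of a depthwise separable cross convolution block (assumptions noted above).
import torch
import torch.nn as nn

class DepthwiseSeparableCrossConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, k: int = 3):
        super().__init__()
        pad = k // 2
        # Depthwise branches: each input channel is filtered independently (groups=in_channels).
        self.dw_square = nn.Conv2d(in_channels, in_channels, (k, k), padding=(pad, pad),
                                   groups=in_channels, bias=False)
        self.dw_row = nn.Conv2d(in_channels, in_channels, (1, k), padding=(0, pad),
                                groups=in_channels, bias=False)
        self.dw_col = nn.Conv2d(in_channels, in_channels, (k, 1), padding=(pad, 0),
                                groups=in_channels, bias=False)
        # Pointwise 1x1 convolution mixes channels and sets the number of output channels.
        self.pw = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.norm = nn.BatchNorm2d(out_channels)
        self.act = nn.ELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing the square and the two asymmetric depthwise responses is equivalent to
        # convolving with a single cross-shaped kernel, since convolution is linear.
        y = self.dw_square(x) + self.dw_row(x) + self.dw_col(x)
        return self.act(self.norm(self.pw(y)))

# Example: map a 240 x 320 grayscale speckle image to 64 feature channels.
features = DepthwiseSeparableCrossConv(1, 64, k=3)(torch.randn(1, 1, 240, 320))
```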

To process input feature maps with width $D_{{iw}}$, height $D_{{ih}}$ and $M$ channels into output feature maps with dimensions $D_{{ow}} \times D_{{oh}} \times \textit{N}$ using a k $\times$ k kernel, the computational cost of the standard convolution is

$${C_{{sc}}} = \textit{k} \times \textit{k} \times \textit{M} \times D_{{ow}} \times D_{{oh}} \times \textit{N}.$$

In depthwise cross convolution, each input channel is processed by a single convolutional kernel. Thus, there will be M sets of convolutional kernels to process the input image. The computational cost of the depthwise cross convolution is

$${C_{{dc}}} = {\rm{ }}( {\textit{k} \times \textit{k} + 1 \times \textit{k} + \textit{k} \times 1} ) \times \textit{M} \times D_{{ow}} \times D_{{oh}}.$$

Then, the generated feature map will be processed by simple 1$\times$1 convolution. The pointwise convolution has a computational cost of

$${C_{{pc}}} = {\rm{ }}1 \times 1 \times \textit{M} \times D_{{ow}} \times D_{{oh}} \times \textit{N}.$$

The total computational cost of the depthwise separable cross convolution can be represented by the following equation:

$$C_{{dsc}} = ( \textit{k} \times \textit{k} + 1 \times \textit{k} + \textit{k} \times 1 ) \times \textit{M} \times D_{{ow}} \times D_{{oh}} + 1 \times 1 \times \textit{M} \times D_{{ow}} \times D_{{oh}} \times \textit{N}.$$

The comparison of the computational costs for the depthwise separable cross convolution and standard convolution can be expressed by the following equation:

$$\begin{aligned} C_{{r}} &= \frac{{(\textit{k} \times \textit{k} + 1 \times \textit{k} + \textit{k} \times 1) \times \textit{M} \times {D_{ow}} \times {D_{oh}}}}{{\textit{k} \times \textit{k} \times \textit{M} \times {D_{ow}} \times {D_{oh}} \times \textit{N}}} + \frac{{1 \times 1 \times \textit{M} \times {D_{ow}} \times {D_{oh}} \times \textit{N}}}{{\textit{k} \times \textit{k} \times \textit{M} \times {D_{ow}} \times {D_{oh}} \times \textit{N}}}\\ &= \frac{{1}}{{\textit{N}}} + \frac{{2}}{{\textit{k} \times \textit{N}}} + \frac{1}{{{\textit{k}^2}}}. \end{aligned}$$

The above ratio shows that the depthwise separable cross convolution remains computationally efficient as the kernel size and the number of output channels increase.
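As a concrete illustration (the kernel size and channel count here are chosen only for the sake of example), with k = 3 and N = 64 the ratio becomes

$$C_{{r}} = \frac{1}{64} + \frac{2}{3 \times 64} + \frac{1}{3^2} \approx 0.137,$$

i.e., the depthwise separable cross convolution requires roughly one seventh of the multiply–accumulate operations of a standard 3$\times$3 convolution.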

We choose the exponential linear unit (ELU) [34] as the activation function. The expression for ELU is

$$f\left( x \right) = \left\{ \begin{array}{ll} x{,} & x \ge 0 \\ {\alpha} ({{e^x} - 1} ){,} & x < 0. \end{array} \right.$$

ELU alleviates the vanishing gradient problem by using a linear (identity) part for non-negative inputs. For negative inputs, ELU controls the saturation level through the hyperparameter $\alpha$, thereby improving robustness to input variations and noise. Unlike ReLU, whose outputs are non-negative, ELU produces negative values that push the mean unit activations closer to zero, which accelerates training and improves the stability of the training process.

2.4 Modules

2.4.1 Detection module

The front-end convolution blocks of the feature point detection module process the tensor maps output by the network backbone and produce tensor maps $X^{{A}}$ and $X^{{B}}$ ($X^{{A}}=H/8{\times }W/8{\times }65$, $X^{{B}}=H/8{\times }W/8{\times }65$). After a channel-wise softmax and subpixel convolutional upsampling, the original-resolution heatmaps $H^{{A}}$ and $H^{{B}}$, which give the distribution of feature point response values with pixel-wise accuracy, are obtained. Unlike SuperPoint [28], which obtains sparse feature points with only pixel-wise accuracy by directly applying non-maximum suppression (NMS) to the heatmaps, we present a detector head with softargmax [35] to overcome the challenge of end-to-end training. We apply softargmax on the $w{\times }w$ patches extracted from the neighborhood of each feature point. The softargmax output allows gradients to flow from the later blocks back to the earlier ones, so that the coordinates can be refined to sub-pixel accuracy. The final coordinates of each feature point can be expressed as:

$$( u ^ { \prime } {,}\ v ^ { \prime } ) = ( u _ { 0 } {,}\ v _ { 0 } ) + ( \delta u {,}\ \delta v )$$
where in a given patch,
$$\delta u = \frac{{\sum {_j\sum {_i{e^{f({u_i}{,}{v_j})}}i} } }}{{\sum {_j\sum {_i{e^{f({u_i}{,}{v_j})}}} } }}{,}{\rm{ }}\quad \delta v = \frac{{\sum {_j\sum {_i{e^{f({u_i}{,}{v_j})}}j} } }}{{\sum {_j\sum {_i{e^{f({u_i}{,}{v_j})}}} } }}$$
$f(u{,}\ v)$ denotes the pixel value of the heatmap at position $(u$, $v)$, and $i$, $j$ denote the relative offsets along the $x$ and $y$ axes with respect to the center pixel $(u_0$, $v_0)$. The integer-level feature point $(u_0$, $v_0)$ is thus updated to $(u'$, $v')$ with sub-pixel accuracy.
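A minimal sketch of this softargmax refinement is given below. The helper name `softargmax_refine`, the tensor layout and the assumption that keypoints lie at least $w/2$ pixels from the image border are ours, not the paper's.

```python
# Sketch of softargmax sub-pixel refinement over w x w heatmap patches (assumptions noted above).
import torch

def softargmax_refine(heatmap: torch.Tensor, keypoints: torch.Tensor, w: int = 5) -> torch.Tensor:
    """heatmap: (H, W); keypoints: (N, 2) integer (u0, v0); returns (N, 2) sub-pixel (u', v')."""
    r = w // 2
    offsets = torch.arange(-r, r + 1, dtype=torch.float32)            # relative offsets i (or j)
    refined = []
    for u0, v0 in keypoints.long():
        # Keypoints are assumed to lie at least r pixels away from the image border.
        patch = heatmap[v0 - r:v0 + r + 1, u0 - r:u0 + r + 1]          # f(u_i, v_j) neighborhood
        weights = torch.softmax(patch.flatten(), dim=0).view(w, w)     # e^f / sum(e^f)
        du = (weights.sum(dim=0) * offsets).sum()                      # expectation along x
        dv = (weights.sum(dim=1) * offsets).sum()                      # expectation along y
        refined.append(torch.stack([u0 + du, v0 + dv]))
    return torch.stack(refined)
```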

2.4.2 Description module

The front-end convolution blocks of the feature point description module process the tensor maps output by the network backbone and produce tensor maps $Y^{{A}}$ and $Y^{{B}}$ ($Y^{{A}}=H/8{\times }W/8{\times }256$, $Y^{{B}}=H/8{\times }W/8{\times }256$). The back-end blocks then bilinearly interpolate the descriptors and L2-normalize the activations to unit length, yielding the features $D^{{A}}$ and $D^{{B}}$.
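One possible implementation of this interpolation and normalization step is sketched below; the helper name `sample_descriptors`, the tensor shapes and the use of `F.grid_sample`/`F.normalize` are our assumptions about how the back-end could be realized.

```python
# Sketch of descriptor sampling: bilinear interpolation of the coarse descriptor map at the
# detected keypoints followed by L2 normalization (assumptions noted above).
import torch
import torch.nn.functional as F

def sample_descriptors(desc_map: torch.Tensor, keypoints: torch.Tensor,
                       img_h: int, img_w: int) -> torch.Tensor:
    """desc_map: (1, 256, H/8, W/8); keypoints: (N, 2) pixel coords (u, v); returns (N, 256)."""
    grid = keypoints.float()
    grid[:, 0] = 2.0 * grid[:, 0] / (img_w - 1) - 1.0                  # map u to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (img_h - 1) - 1.0                  # map v to [-1, 1]
    grid = grid.view(1, 1, -1, 2)                                      # (1, 1, N, 2)
    desc = F.grid_sample(desc_map, grid, mode='bilinear', align_corners=True)
    desc = desc.squeeze(0).squeeze(1).t()                              # (N, 256)
    return F.normalize(desc, p=2, dim=1)                               # unit-length descriptors
```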

2.4.3 Coarse-to-Fine module

The coarse-to-fine module meticulously refines the matching results based on the contextual information and the descriptive details of feature points.

The score matrix $S$ between the feature points is first calculated by

$$S(i{,} \ j) = \frac{1}{\tau } \cdot \left\langle {{D^A}(i){,} \ {\rm{ }}{D^B}(j)} \right\rangle.$$

Then, we apply softmax on both dimensions of $S$ [36] to obtain the probability of soft mutual nearest neighbor matching. Formally, when using dual-softmax, the matching probability $P_c$ is obtained by

$${P_c}(i{,} \ j) = \mathrm{softmax}{(S(i{,} \ \cdot))_j} \cdot \mathrm{softmax}{(S({\cdot}{,} \ j))_i}.$$

For every coarse match ($\hat i{,}\ \hat j$), we first locate its position in the original image pair $I^{{A}}$ and $I^{{B}}$, and then crop two sets of local windows of size $w{\times }w$. We directly compare the gray value of the point ($\hat i{,}\ \hat j$) with the gray values of the pixels $p_i$ and $q_j$ ($i=1{,} 2{,} {\ldots }{,} 4w-4$, $j=1{,} 2{,} {\ldots }{,} 4w-4$) along the window boundaries, using gray thresholds $T_1$ and $T_2$. Based on the confidence matrix $P_c$, we select matches with confidence higher than the threshold $\theta _c$ and further enforce the mutual nearest neighbor (MNN) criterion, which filters possible outlier coarse matches. The gray-value constraints and coarse-level match predictions are denoted as:

$$I(\hat i) > {T_1}\quad {\rm{ }}I(\hat j) > {T_1}$$
$$I(\hat i) - I(p_i) > {T_2}\quad {\rm{ }}I(\hat j) - I(q_j) > {T_2}$$
$${M_c} = \{ (\hat i{,}\ \hat j)|\forall (\hat i{,}\ \hat j) \in MNN(P_c){,}\ {P_c}(\hat i{,}\ \hat j) \ge {\theta _c}\}.$$
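A minimal sketch of this coarse matching stage (temperature-scaled similarity, dual-softmax, confidence thresholding and the MNN criterion) is given below. The helper name `coarse_match` is hypothetical, the gray-value checks are omitted, $\theta_c$ follows the value given in Section 4.1, and the temperature value is our assumption.

```python
# Sketch of dual-softmax coarse matching with confidence thresholding and the mutual nearest
# neighbor criterion (assumptions noted above; the gray-value checks are omitted).
import torch

def coarse_match(desc_a: torch.Tensor, desc_b: torch.Tensor,
                 tau: float = 0.1, theta_c: float = 0.2) -> torch.Tensor:
    """desc_a: (Na, 256), desc_b: (Nb, 256) L2-normalized; returns (M, 2) index pairs (i, j)."""
    s = desc_a @ desc_b.t() / tau                                      # score matrix S(i, j)
    p = torch.softmax(s, dim=1) * torch.softmax(s, dim=0)              # dual-softmax P_c(i, j)
    mask = p >= theta_c                                                # confidence threshold
    mask &= (p == p.max(dim=1, keepdim=True).values)                   # best match along rows
    mask &= (p == p.max(dim=0, keepdim=True).values)                   # best match along columns
    return mask.nonzero(as_tuple=False)                                # coarse matches M_c
```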

We use a correlation-based approach to meticulously refine the matching results. In heatmaps $H^{{A}}$ and $H^{{B}}$, we crop two sets of local windows of size $w{\times }w$ centered at $\hat i$ and $\hat j$, and denote them as $F^{{A}}(\hat i)$ and $F^{{B}}(\hat j)$. We correlate the center vector of $F^{{A}}(\hat i)$ with all vectors in $F^{{B}}(\hat j)$ and thus produce a heatmap that represents the matching probability of each pixel in the neighborhood of $\hat j$ with $\hat i$. By computing the expectation over the probability distribution, we derive the ultimate matching coordinates with sub-pixel accuracy. Gathering all matches yields the final fine-level matches $M_f$.
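One way to realize this correlation-plus-expectation step is sketched below. The helper name `refine_match`, the window handling and the softmax normalization of the correlation map are our assumptions about a possible implementation.

```python
# Sketch of the correlation-based fine refinement: correlate the center vector of the window
# around i_hat with all vectors of the window around j_hat, normalize with a softmax and take
# the expectation of the coordinates (assumptions noted above).
import torch

def refine_match(window_a: torch.Tensor, window_b: torch.Tensor,
                 center_b: torch.Tensor) -> torch.Tensor:
    """window_a, window_b: (w, w, C) local windows; center_b: (2,) coords of j_hat in image B."""
    w = window_a.shape[0]
    center_vec = window_a[w // 2, w // 2]                              # center vector of F_A(i_hat)
    corr = (window_b * center_vec).sum(dim=-1)                         # (w, w) correlation map
    prob = torch.softmax(corr.flatten(), dim=0).view(w, w)             # matching probability
    offsets = torch.arange(w, dtype=torch.float32) - w // 2
    du = (prob.sum(dim=0) * offsets).sum()                             # expectation along x
    dv = (prob.sum(dim=1) * offsets).sum()                             # expectation along y
    return center_b.float() + torch.stack([du, dv])                    # sub-pixel coordinates
```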

2.5 Loss function

To reduce the training memory footprint and improve computational efficiency, we replace the dense loss with a sparse descriptor loss. $M_p$ positive pairs are sampled sparsely from the total of $(H_c \times W_c)^2$ positive and negative pairs. For each positive pair, we gather $M_n$ negative pairs, forming $M_p{\times }M_n$ pairs of sampled correspondences; a sketch of this sampling is given below.
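The helper name `sample_pairs`, the index-based storage of correspondences and the uniform negative sampling are our assumptions.

```python
# Sketch of the sparse pair sampling: draw M_p positive correspondences and M_n random
# negatives per positive instead of evaluating all (Hc x Wc)^2 pairs (assumptions noted above).
import torch

def sample_pairs(pos_idx_a: torch.Tensor, pos_idx_b: torch.Tensor, num_cells_b: int,
                 m_p: int = 600, m_n: int = 100):
    """pos_idx_a/b: (P,) matching cell indices in views A and B; returns sampled indices."""
    sel = torch.randperm(pos_idx_a.shape[0])[:m_p]                     # M_p positive pairs
    a, b = pos_idx_a[sel], pos_idx_b[sel]
    # For each sampled positive, draw M_n random cells of view B as negatives (collisions with
    # the true match are ignored in this sketch).
    neg = torch.randint(0, num_cells_b, (a.shape[0], m_n))
    return a, b, neg
```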

The final loss of DSCNet is the sum of the feature point detection loss $L_p$ and the descriptor loss $L_d$:

$${L} = {L_p}(X, Y) + {L_p}(X', Y') + {L_d}.$$

Two types of feature point detector loss functions can be applied in DSCNet. The first computes the fully convolutional cross-entropy loss over the cells ${x_{ij}} \in X$. We denote the set of corresponding ground-truth feature point labels by $Y$ and its individual entries by ${y_{ij}}$. The loss is

$${L_p}(X{,}\ Y) = \frac{1}{{{H_c}{W_c}}}\mathop \sum _{i = 1{,}j = 1}^{{H_c}{W_c}} {l_p}({x_{ij}}{;}\ {y_{ij}})$$
where
$${l_p}({x_{ij}}{;}\ y) ={-} \log\left(\frac{e^{x_{ijy}}}{\sum\limits_{k = 1}^{65} e^{x_{ijk}}}\right).$$
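For this first option, a minimal PyTorch sketch is shown below; it assumes the labels are stored as the index of the true position (or the dustbin channel) within each 8$\times$8 cell, and the helper name `detector_loss_ce` is ours.

```python
# Sketch of the first detector loss: a per-cell softmax cross-entropy over the 65 channels
# (64 positions in an 8 x 8 cell plus a "no interest point" dustbin), assuming the labels give
# the index of the true position in each cell.
import torch
import torch.nn.functional as F

def detector_loss_ce(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x: (B, 65, Hc, Wc) raw detector logits X; y: (B, Hc, Wc) integer labels in [0, 64]."""
    # cross_entropy applies log-softmax over the channel dimension and averages over all cells.
    return F.cross_entropy(x, y)
```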

The second loss function for the feature point detector involves convolving the ground-truth detection heatmap with a Gaussian kernel of standard deviation ${\sigma _{fe}}$. We denote the coordinates of the ground-truth feature points by ($u$, $v$) and the predicted feature point coordinates by ($x$, $y$). The $l_p$ term is expressed as:

$${l_p} ={-} \log (\frac{{{e^{ - \frac{{{{({x} - {u})}^2} + {{({y} - {v})}^2}}}{{2{\sigma ^2}}}}}}}{{\mathop \sum\limits _{k = 1}^{65} {e^{ - \frac{{{{({x_k} - {u})}^2} + {{({y_k} - {v})}^2}}}{{2{\sigma ^2}}}}}}}).$$

To improve the robustness of the feature point descriptor, enabling it to adapt to variations in environmental illumination and geometric differences while preserving the uniqueness of the feature vectors used to identify corresponding points, the descriptor loss is defined as:

$${L_d}(D{,}\ D'{,}\ S) = \frac{1}{{{{({H_c}{W_c})}^2}}}\mathop \sum _{i = 1{,}j = 1}^{{H_c}{W_c}} \mathop \sum _{i' = 1{,} j' = 1}^{{H_c}{W_c}} {l_d}(d{,}\ d'{,}\ s)$$
where
$$\begin{aligned} &{l_d}(d{,}\ d'{,}\ s)\\ &= {\lambda} \cdot s \cdot \max (0 {,}\ {d^T}d' - \max ((d_n^Td') {,}\ ({d^T}{{d'_n}})) - {m_p})\\ &\ \ \ + {\lambda} \cdot s \cdot \max (0 {,}\ {d^T}d' - {m_n}), \end{aligned}$$
$${s_{iji'j'}} = \left\{ \begin{array}{ll} 1{,} & \left\| {H \cdot p_{ij} - p_{i'j'}} \right\| \le 2 \\ 0{,} & \text{otherwise} \end{array} \right.$$
where ${d^T}d'$ represents the cosine similarity of positive sample descriptors, and $d_n^Td'$ and ${d^T}{d'_n}$ denote the cosine similarities of negative sample descriptors. ${d_n}$ and ${d'_n}$ are the non-matching descriptors closest to $d'$ and $d$, respectively. ${p_{ij}}$ represents the feature point at coordinates ($i$, $j$) in the image, and $H$ is the homography matrix.

3. Datasets

This section describes the training datasets of DSCNet in detail. We perform real-time transformations on the input images, such as translation, rotation, scaling and warping, to simulate the images captured by cameras in real driving scenarios. DSCNet utilizes self-supervised learning to generate pseudo ground truth for the unlabeled images.

3.1 Synthetic datasets for pre-training

Currently, there is no large-scale annotated image database of speckle structured light. Annotating datasets is a labor-intensive task and often involves high costs. However, self-supervised learning can leverage the rich information inherent in the data itself to construct auxiliary tasks. This enables obtaining supervisory signals without using any labels and training neural networks to extract discriminative features, ultimately improving the detection accuracy of the network.

Thus, to bootstrap our network for extracting and matching speckle feature points, we first create a large-scale synthetic dataset, called the synthetic speckle datasets, as shown in Fig. 6. In these datasets, we remove label ambiguity by modeling feature points that mimic the distribution patterns of structured light.

Fig. 6. Synthetic speckle datasets.

It is worth mentioning that, as shown in Fig. 6, Fig. 7 and Fig. 8, the network can be trained on various synthesized structured light patterns, such as laser speckle, stripe grating, line structured light and grid structured light. Therefore, the proposed self-supervised training method can recognize various structured light patterns in different application scenarios, demonstrating very high versatility.

Fig. 7. Synthetic grid structured light.

Fig. 8. Synthetic line structured light.

3.2 Real speckle datasets

To simulate changes in illumination, we use the binocular cameras to capture speckle images projected onto the helmet under different light intensities. We control the helmet’s pose with a high-precision turntable and apply operations such as translation, rotation, scaling, warping and compound transformations to the captured images to simulate the complex head movements of the pilot. This step increases the diversity and balance of the samples. The final real training datasets consist of 14,204 gray images, and the validation datasets comprise 5,690 gray images, all with a resolution of 320 $\times$ 240, as shown in Fig. 9.

Fig. 9. Real speckle datasets.

4. Experiments and analysis

4.1 Experimental details

The experimental system includes an Osela SL-830-S-A-RPP017ES laser speckle projector with a working wavelength of 830 nm, two HIKVISION monochrome industrial cameras with a resolution of 1280 $\times$ 1024 and matched lenses with a focal length of 8 mm, a high-precision turntable, and a helmet, as shown in Fig. 10. The experiments are conducted under indoor lighting conditions, and the cameras’ spectral response intensity to the laser speckle projector is 22${\%}$. The measured helmet is placed in front of the cameras at a distance of approximately 700 mm.

Fig. 10. Experimental system.

All training is implemented in PyTorch and conducted on an Intel Xeon W-2123 CPU and an NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. The model is trained using Adam with an initial learning rate of $1 \times 10^{-4}$ and a batch size of 32. $\theta _c$ is set to 0.2. For the sparse descriptor loss, we have $H_c = H/8$, $W_c = W/8$, $M_p = 600$, $M_n = 100$. The window size $w$ is equal to 5. $m_p$ is set to 1, $m_n$ is 0.2 and $\lambda$ is 0.0001. We opt for the second feature detection loss function, with ${\sigma _{fe}}$ = 0.2.
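For reference, the configuration above can be summarized as the following PyTorch-style setup. The small stand-in module only makes the snippet self-contained and is not the DSCNet architecture of Section 2; the `hparams` key names are our shorthand for the quantities listed in the text.

```python
# Summary of the training configuration (numeric values are the ones quoted above).
import torch
import torch.nn as nn

model = nn.Conv2d(1, 65, 3)                    # stand-in for DSCNet (see Section 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
hparams = dict(batch_size=32, theta_c=0.2, M_p=600, M_n=100,
               window_w=5, m_p=1.0, m_n=0.2, lam=1e-4, sigma_fe=0.2)
```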

4.2 Training

We adopt an integrated training approach involving real-time training, self-supervised learning and transfer learning. Specifically, we first pre-train DSCNet on the synthetic speckle datasets. Then, speckle images are captured by the experimental system. After applying preprocessing operations, including compound transformations, data augmentation techniques and image annotation, we feed the processed images and annotation results into the network for training. For each training image, DSCNet must satisfy the requirements on the number of extracted speckle feature points, precision and matching accuracy. Throughout the entire training process, DSCNet takes input images and labels and produces predictions. Subsequently, the loss function is used to compute the difference between the predictions and the annotation results, yielding the model’s loss. Through backpropagation, the gradient of the loss with respect to the model parameters is calculated. Finally, the optimizer updates the model parameters based on the gradient to minimize the loss. It is worth noting that in the first training round the labels are annotated by the pre-trained model, and in the second round they are annotated by the first trained model. The entire process is real-time and continuous. This training method enables DSCNet to comprehensively learn the various features and patterns present in rich data from real scenarios, thus improving the robustness of the model.

DSCNet is trained with 100k iterations on synthetic speckle datasets and two rounds of 100k iterations on real speckle datasets. Figure 11 shows the changes in loss, precision and recall during training process.

Fig. 11. Training process. (a) Changes in loss. (b) Changes in precision. (c) Changes in recall.

In Fig. 11, we observe a clear optimization trend of the model during training process. As the training iterations increase, the loss of DSCNet on both the training and validation sets continues to decrease, while the precision and recall continue to increase and eventually stabilize. The increase in precision indicates a significant improvement in the model’s accuracy in positive prediction, and the rise in recall suggests an effective enhancement in the model’s ability to identify positives. DSCNet demonstrates excellent generalization ability on the validation sets, confirming its robustness and adaptability in the face of unknown data. These results validate the effectiveness and reliability of DSCNet in the task of matching images of pilots’ helmets, providing strong support for its wide application in real-world scenarios.

4.3 Verification experiments

We utilize commonly used performance evaluation metrics including repeatability (Rep.), mean localization error (MLE), mean average precision (MAP), and matching score (M. S) [37], to assess the experimental results.

To evaluate the training effectiveness of DSCNet, we compare the model’s performance in extracting and matching speckle feature points in both low-texture and multi-texture regions on the helmet for each training round. The threshold $\epsilon$ of correctness for speckle feature point detection is set to 2. The results are presented in Table 1 and Fig. 12.

Fig. 12. Speckle feature point extraction and matching results for each training model. The top images are low-texture areas, and the bottom images are multi-texture areas. The left images show the feature point detection results, and the right images show the matching results.

Table 1. Performance evaluation of training models for extracting and matching speckle feature points ($\epsilon = 2 \, \mathrm {pixels}$).

As shown in Fig. 12, DSCNet demonstrates a certain proficiency in extracting speckle feature points after pre-training, with an extraction accuracy of approximately 0.83. However, the extraction accuracy is still insufficient, and errors in point extraction persist. After training on the real speckle datasets, the ability of DSCNet to extract and match speckle feature points improves significantly, with almost no errors in point extraction and matching. Similar conclusions can be drawn from Table 1. With the increase in training rounds, the repeatability of the feature points extracted by the feature point detector rises from 0.0997 to 0.3807, and the mean localization error decreases from 0.9046 to 0.7038. Meanwhile, the mean average precision of feature point matching for the feature point descriptor grows from 0.4682 to 0.6572, and the matching score increases from 0.5362 to 0.7553. We use the left-right consistency check method [27] to calculate the mismatch rate, and the final mismatch rate for DSCNet is 0.95%.
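A minimal sketch of a left-right consistency check is given below; it follows the general idea of such checks (run matching in both directions and flag pairs whose round trip does not return to the starting point) rather than the exact procedure of Ref. [27], and the helper name `mismatch_rate` is ours.

```python
# Sketch of a left-right consistency check for estimating the mismatch rate (assumptions noted above).
import torch

def mismatch_rate(ab: torch.Tensor, ba: torch.Tensor) -> float:
    """ab[i]: index in B matched to point i of A; ba[j]: index in A matched to point j of B."""
    idx = torch.arange(ab.shape[0])
    consistent = ba[ab] == idx                                         # round trip A -> B -> A returns home
    return 1.0 - consistent.float().mean().item()                      # fraction of inconsistent matches
```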

4.4 Comparative experiments

To evaluate the performance of the DSCNet feature point detector and descriptor, we conduct comparative experiments of DSCNet with two well-known detector and descriptor systems: SIFT [38] and ORB [39], as well as four state-of-the-art deep learning networks: DeDoDe [40], ALIKE [41], DISK [42] and GlueStick [43]. The results are shown in Fig. 13 and Table 2.

Fig. 13. Matching results of different methods on real speckle datasets. The second row of images are captured during the day, the third and fifth rows of images are acquired under bright lighting conditions at night, and the first and fourth rows of images are taken under dark conditions at night.

Table 2. Comparison of feature detection methods for speckle point matching ($\epsilon = 2 \, \mathrm {pixels}$).

The results show that both ORB and GlueStick lack the ability to extract and match speckle feature points. In terms of quantity, DSCNet clearly extracts and matches more speckle feature points than SIFT, DeDoDe, ALIKE and DISK, while maintaining a higher mean matching accuracy. In terms of matching performance, DSCNet exhibits significant robustness to changes in illumination and viewing angle. For a 480 $\times$ 360 speckle image pair, although the matching speed of DSCNet is lower than that of DISK, it still outperforms the other methods.

4.5 Ablation experiments

In this subsection, we investigate DSCNet through a series of ablation studies. Specifically, we focus on the following three design decisions: the first model is the complete DSCNet, the second model removes the depthwise separable cross convolution blocks, and the third model removes both the depthwise separable cross convolution blocks and the coarse-to-fine module. For comparison, we pre-train all three models on the synthetic speckle datasets with the same training configuration. Subsequently, we transfer them to the real speckle datasets for a single round of training; the final results are shown in Fig. 14.

Fig. 14. Comparison of training results of three ablation models. (a) Comparison of loss. (b) Comparison of precision. (c) Comparison of recall.

As shown in Fig. 14, removing either of the two designs degrades DSCNet. The introduction of the coarse-to-fine module accelerates the convergence of the loss function. Compared to the third model, the second model exhibits precision improvements of 28.33${\%}$ and 6.61${\%}$ on the synthetic and real speckle training datasets, respectively. The introduction of the depthwise separable cross convolution blocks not only further reduces the training loss but also improves the precision of the second model by 18.81${\%}$ and 8.27${\%}$ on the synthetic and real speckle training datasets, respectively. In addition, the recall on the synthetic and real speckle training datasets increases by 3.7${\%}$ and 6.1${\%}$, respectively.

The superiority of the coarse-to-fine module is easily explained as it performs sub-pixel adjustments to the speckle feature points, thereby improving the precision and recall of DSCNet on the datasets. As for the superiority of the depthwise separable cross convolution blocks, we explain it in detail by considering the distribution pattern of speckle feature points. Due to the randomness of the speckle feature points projected by the laser speckle projector, not all speckle points conform to the grayscale distribution pattern with a bright center pixel and dark edge pixels as shown in Fig. 3. It is the presence of these irregularly distributed speckle feature points that affects the accuracy of feature point extraction and matching algorithms.

For speckle feature points in Fig. 15, when the reference images are rotated and input into the network, the original 3$\times$3 convolution blocks as shown by the blue line produce meaningless results. This prevents establishing meaningful correlations with the reference images. In contrast, the part shown by the green line in the depthwise separable cross convolution blocks can still output the same information as the reference images by convolving the rotated images, which strengthens the connection between rotated images and reference images. Therefore, the depthwise separable cross convolution blocks improve the robustness of DSCNet to image rotation. This means that even with significant movement of the pilot’s head during the aircraft flight or noticeable body movement during combat, DSCNet can still maintain excellent image matching accuracy. This is crucial for accurate measurement of the pilot’s head pose, ensuring precise perception of the battlefield environment during flight.

Fig. 15. Comparison of original 3$\times$3 convolution and depthwise separable cross convolution. The top row is the original 3$\times$3 convolution blocks, the bottom row is the depthwise separable cross convolution blocks.

5. Conclusion

In this article, we propose DSCNet to address the image matching problem based on laser speckle pattern projection. In the backbone of the network, we design the depthwise separable cross convolution blocks, improving the robustness of DSCNet to image rotation. To improve the precision of speckle feature point extraction and matching, we incorporate a softargmax detection head at the end of the feature detection module and a coarse-to-fine module in the feature description module. The integrated approach of real-time training, self-supervised learning, and transfer learning enables DSCNet to comprehensively learn various features and patterns present in rich data, improving the generalization and feature representation capabilities of DSCNet. Compared with other classic methods and deep learning-based algorithms, DSCNet exhibits a significant advantage in terms of speckle feature point matching accuracy, mismatch rate and matching speed, with a mean matching accuracy of up to 91.62${\%}$ on the helmet, a mismatch rate of only 0.95${\%}$, and a matching speed that can reach 24 frames per second. The ablation experiments confirm that the depthwise separable cross convolution blocks improve the match rate between images, while the coarse-to-fine module improves the detection accuracy of speckle feature points.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. L. Kou, K. Yang, L. Luo, et al., “Binocular stereo matching of real scenes based on a convolutional neural network and computer graphics,” Opt. Express 29(17), 26876–26893 (2021). [CrossRef]  

2. Y. Li, J. Qian, S. Feng, et al., “Composite fringe projection deep learning profilometry for single-shot absolute 3D shape measurement,” Opt. Express 30(3), 3424–3442 (2022). [CrossRef]  

3. T. Liu, Y. Wang, X. Niu, et al., “LiDAR odometry by deep learning-based feature points with two-step pose estimation,” Remote Sens. 14(12), 2764 (2022). [CrossRef]  

4. G. Krishnan, R. Joshi, T. O’Connor, et al., “Human gesture recognition under degraded environments using 3D-integral imaging and deep learning,” Opt. Express 28(13), 19711–19725 (2020). [CrossRef]  

5. R. Yao, X. Zhu, Y. Zhou, et al., “Unsupervised cycle-consistent adversarial attacks for visual object tracking,” Opt. Lasers Eng. 80, 102532 (2023). [CrossRef]  

6. A. G. Leal-Junior, A. Frizera, C. Marques, et al., “Optical fiber specklegram sensors for mechanical measurements: a review,” IEEE Sens. J. 20(2), 569–576 (2020). [CrossRef]  

7. A. G. Leal-Junior, H. Rocha, P. L. Almeida, et al., “Force estimation with sustainable hydroxypropyl cellulose sensor using convolutional neural network,” IEEE Sens. J. 24(2), 1366–1373 (2024). [CrossRef]  

8. P. Gorai, S. Kumar, C. Marques, et al., “Imprinted polymer functionalized concatenated optical microfiber: hypersensitive and selective,” IEEE Sens. J. 23(1), 329–336 (2023). [CrossRef]  

9. E. Csencsics, T. Wolf, and G. Schitter, “Efficient framework for the simulation of translational and rotational laser speckle displacement in optical sensor assemblies,” Opt. Eng. 61(06), 061410 (2022). [CrossRef]  

10. P. Etchepareborda, M.-H. Moulet, and M. Melon, “Random laser speckle pattern projection for non-contact vibration measurements using a single high-speed camera,” Mech. Syst. Signal Proc. 30, 7025–7037 (2023).

11. C. Liu, L. Chen, X. He, et al., “Coaxial projection profilometry based on speckle and fringe projection,” Opt. Commun. 341, 228–236 (2015). [CrossRef]  

12. X. Yuan, C. Teng, X. Xu, et al., “High-accuracy and real-time 3D positioning, tracking system for medical imaging applications based on 3D digital image correlation,” Opt. Lasers Eng. 88, 82–90 (2017). [CrossRef]  

13. Y. Yin, Z. Cai, H. Jiang, et al., “High dynamic range imaging for fringe projection profilometry with single-shot raw data of the color camera,” Opt. Lasers Eng. 89, 138–144 (2017). [CrossRef]  

14. A. W. Stark, E. Wong, D. Weigel, et al., “Subjective speckle suppression in laser-based stereo photogrammetry,” Opt. Eng. 55(12), 121713 (2016). [CrossRef]  

15. X. Liu, H. Zhao, G. Zhan, et al., “Rapid and automatic 3D body measurement system based on a GPU-Steger line detector,” Appl. Opt. 55(21), 5539–5547 (2016). [CrossRef]  

16. M. Schaffer, M. Grosse, B. Harendt, et al., “High-speed three-dimensional shape measurements of objects with laser speckles and acousto-optical deflection,” Opt. Lett. 36(16), 3097–3099 (2011). [CrossRef]  

17. D. Khan, M. A. Shirazi, and M. Y. Kim, “Single shot laser speckle based 3D acquisition system for medical applications,” Opt. Lasers Eng. 105, 43–53 (2018). [CrossRef]  

18. F. Yang and S. Fu, “Research on feature extraction and matching algorithm based on speckle structured light binocular vision system,” Proc. SPIE 11338, 1133839 (2019). [CrossRef]  

19. J. Guo, X. Peng, A. Li, et al., “Automatic and rapid whole-body 3D shape measurement based on multinode 3D sensing and speckle projection,” Appl. Opt. 56(31), 8759–8768 (2017). [CrossRef]  

20. C.-H. Yeh, P.-Y. Sung, C.-H. Kuo, et al., “Robust laser speckle recognition system for authenticity identification,” Opt. Express 20(22), 24382–24393 (2012). [CrossRef]  

21. C. He, Y. Cao, Y. Yang, et al., “Fault diagnosis of rotating machinery based on the improved multidimensional normalization resNet,” IEEE Trans. Instrum. Meas. 72, 1–11 (2023). [CrossRef]  

22. J. Zhao and H. Zhu, “CBPH-Net: A small object detector for behavior recognition in classroom scenarios,” IEEE Trans. Instrum. Meas. 72, 1–12 (2023). [CrossRef]  

23. J. Tan, W. Su, Z. He, et al., “Deep learning-based method for non-uniform motion-induced error reduction in dynamic microscopic 3D shape measurement,” Opt. Express 30(14), 24245–24260 (2022). [CrossRef]  

24. Z. Ma, B. Wang, L. Huang, et al., “Dimension-expanded-based matching method with siamese convolutional neural networks for gravity-aided navigation,” IEEE Trans. Ind. Electron. 70(10), 10496–10505 (2023). [CrossRef]  

25. W. Yin, Y. Hu, S. Feng, et al., “Single-shot 3D shape measurement using an end-to-end stereo matching network for speckle projection profilometry,” Opt. Express 29(9), 13388–13407 (2021). [CrossRef]  

26. R. Wang, P. Zhou, and J. Zhu, “Accurate 3D reconstruction of single-frame speckle-encoded textureless surfaces based on densely connected stereo matching network,” Opt. Express 31(9), 14048–14067 (2023). [CrossRef]  

27. Y. Dong, X. Yang, H. Wu, et al., “Lightweight and edge-preserving speckle matching network for precise single-shot 3D shape measurement,” Measurement 210, 112549 (2023). [CrossRef]  

28. D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoint: Self-supervised interest point detection and description,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 337–33712 (2018).

29. J. Sun, Z. Shen, Y. Wang, et al., “LoFTR: Detector-free local feature matching with transformers,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8922–8931 (2021).

30. Y. Ono, E. Trulls, P. Fua, et al., “LF-Net: Learning local features from images,” in 32nd Conference on Neural Information Processing Systems (NIPS), Advances in Neural Information Processing Systems (2018), 6234–6244.

31. I. Rocco, R. Arandjelovic, and J. Sivic, “Convolutional neural network architecture for geometric matching,” 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 39–48 (2017).

32. A. G. Howard, M. Zhu, B. Chen, et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv, arXiv:1704.04861 (2017). [CrossRef]  

33. L. Yu, E. Yang, B. Yang, et al., “A robust learned feature-based visual odometry system for UAV pose estimation in challenging indoor environments,” IEEE Trans. Instrum. Meas. 72, 1–11 (2023). [CrossRef]  

34. D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv, arXiv:1511.07289 (2015). [CrossRef]  

35. Y. Jau, R. Zhu, H. Su, et al., “Deep keypoint-based camera pose estimation with geometric constraints,” 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4950–4957 (2020).

36. I. Melekhov, G. Brostow, J. Kannala, et al., “Image stylization for robust features,” arXiv, arXiv:2008.06959 (2020). [CrossRef]  

37. K. M. Yi, E. Trulls, V. Lepetit, et al., “LIFT: Learned invariant feature transform,” in 14th European Conference on Computer Vision (ECCV) (Springer, Cham, 2015), Vol. 9910, p. 467.

38. D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis. 60(2), 91–110 (2004). [CrossRef]  

39. E. Rublee, V. Rabaud, K. Konolige, et al., “ORB: An efficient alternative to SIFT or SURF,” 2011 IEEE International Conference on Computer Vision (ICCV), 2564–2571 (2011).

40. J. Edstedt, G. Bökman, M. Wadenbäck, et al., “DeDoDe: Detect, don’t describe – describe, don’t detect for local feature matching,” arXiv, arXiv:2308.08479 (2023). [CrossRef]  

41. X. Zhao, X. Wu, J. Miao, et al., “ALIKE: Accurate and lightweight keypoint detection and descriptor extraction,” IEEE Trans. Multimedia 25, 3101–3112 (2023). [CrossRef]  

42. J. Tyszkiewicz, P. Fua, and E. Trulls, “Disk: Learning local features with policy gradient,” arXiv, arXiv:2006.13566 (2020). [CrossRef]  

43. R. Pautrat, I. Suárez, Y. Yu, et al., “GlueStick: Robust image matching by sticking points and lines together,” arXiv, arXiv:2304.02008 (2023).
