
Registration of multi-modal images under a complex background combining multiscale features extraction and semantic segmentation

Open Access

Abstract

Multi-modal imaging technology has broad application value in target recognition and other fields, and image registration is one of its key technologies. In this paper, a multi-modal image registration algorithm that combines multiscale feature extraction and semantic segmentation is proposed to achieve accurate registration of polarized images and near-infrared images under complex backgrounds. A classical convolutional neural network, ResNet, is employed to capture robust feature descriptors, and a convolutional neural network with an attention mechanism is trained to filter out irrelevant feature points. The two multi-modal images can then be accurately registered. The experimental results show the feasibility and effectiveness of the proposed method.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Imaging with visible light can capture rich information about an object, including its contrast, color, and shape. However, the image quality is very susceptible to environmental influences such as rain, snow, fog, and low light. Polarization imaging technology can perform long-distance image acquisition in harsh environments and has clear advantages in suppressing background noise, improving detection distance, acquiring detailed features, and identifying target camouflage. Its applications are therefore very extensive, such as detecting hidden or camouflaged targets and enabling navigation in severe weather. Infrared imaging technology is likewise unaffected by rain, snow, or illumination and can make up for the shortcomings of visible-band imaging, but the cost of the imaging system is high and the image resolution is relatively low. A multi-modal imaging system, which combines visible-light polarization imaging with infrared imaging, can overcome the disadvantages of single-band imaging, and more complete and precise information about the object can be obtained through image fusion. Nowadays, multi-modal imaging systems have many significant applications in video surveillance, target tracking, binocular robot vision, etc. However, there is usually a certain angular deviation when the visible-light polarization camera and the infrared camera shoot the same target in a binocular arrangement. In this case, directly fusing the infrared image and the polarization image yields a very poor result. Therefore, image registration technology is very important for multi-modal imaging [1–4].

Image registration aligns two images of the same scene acquired by different sensors, from different angles, or at different times, and has been widely used in remote sensing, optics, medical imaging, material mechanics, and other fields [5–10]. Traditional image registration methods are mainly divided into region-based methods and feature-based methods [11]. Region-based methods work by maximizing the similarity between two images. They do not rely on complex image preprocessing but use the grayscale statistics of the image itself to measure similarity, and then find the optimal transformation through similarity metrics such as cross-correlation, normalized cross-correlation, and mutual information. The implementation is relatively simple, but because of the large amount of computation required to search for the optimal transformation, this approach is time-consuming and has a limited application range [12]. Feature-based methods usually select descriptors that capture salient features of the images, such as points, lines, and edges. The most classic is the scale-invariant feature transform (SIFT) proposed by Lowe [13], which is robust to rotation, scaling, and noise and can be matched quickly in massive data. However, the SIFT algorithm is still time-consuming due to its large amount of computation and high time complexity. On the basis of SIFT, Ke proposed the improved feature extraction algorithm PCA-SIFT, Mikolajczyk proposed GLOH for image registration, Bay proposed speeded-up robust features (SURF), and Rublee proposed Oriented FAST and Rotated BRIEF (ORB) [14–17]. Among them, PCA-SIFT and GLOH both use PCA to reduce dimensionality, while SURF improves feature point detection, feature point description, and feature matching. These methods substantially improve on traditional methods in both computational speed and accuracy. However, the extracted features are mostly low-level features, and a large number of middle- and high-level features are lost. For multi-modal images, whose appearance differs greatly, the limited expressive power of low-level features makes registration easily disturbed by factors such as brightness, rotation angle, and texture, which can cause registration to fail. Moreover, the feature points extracted by traditional feature-based methods are sometimes concentrated in one part of the image, so other areas of the two images cannot be aligned, which affects the subsequent fusion step.

With the rapid development of deep learning, convolutional neural networks (CNNs) have been widely and successfully applied to image classification, target detection, and other fields [18–20]. For feature extraction, the higher-level image features extracted by a CNN pre-trained on the large-scale ImageNet dataset are more expressive than the low-level features extracted by traditional methods, which increases the robustness of the feature descriptors and greatly improves the accuracy and stability of image registration [21,22]. However, for multi-modal images with complex backgrounds, after feature point extraction, screening, and matching, most of the remaining feature point pairs lie on the background, which is unrelated to the registration target. Because of the strong interference from the complex background, there are few feature point pairs on the registration target, the correct transformation parameters required for registration cannot be generated, and the final registration result is seriously degraded.

To register multi-modal images under complex backgrounds, a multi-modal image registration algorithm that combines multiscale feature extraction and semantic segmentation is proposed. The multi-modal image is fed into a CNN that fuses deep and shallow information to obtain highly robust feature descriptors. In particular, to exclude the interference of the complex background, a CNN with an attention mechanism is trained to filter out irrelevant feature points so that the correct transformation parameters can be obtained. The registered images are then fused with equal weights to compare the differences between the two images. The experimental results demonstrate the effectiveness of the proposed method.

2. Method

The proposed multi-modal image registration method mainly consists of five parts: image preprocessing, feature descriptor generation, background feature point screening, interior feature point matching, and elimination of mismatched points together with estimation of the transformation parameters. Figure 1 shows the flowchart of the proposed method.


Fig. 1. Flowchart of the proposed multi-modal image registration method.


2.1. Image preprocessing

Multi-modal images are often acquired with different detectors, such as infrared cameras and visible-light polarization cameras, so the image resolutions often vary widely. Therefore, a series of preprocessing steps is applied to the captured images. First, the deep residual network ResNet18 is introduced to generate the feature descriptors of the multi-modal images. Since the input resolution of ResNet18 is 224 × 224, the two multi-modal images are resampled to this common resolution. Second, each image is divided into non-overlapping 8 × 8 square areas; each area produces a different feature vector, so a feature point is placed at the center of each 8 × 8 area to represent the corresponding feature vectors. Finally, local equalization with non-overlapping sub-blocks is performed on each image, that is, independent histogram equalization on each 8 × 8 area, which makes the local details of the image more obvious and benefits feature extraction and descriptor generation.
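As a concrete illustration of this preprocessing step, the minimal sketch below resizes a single-channel image to 224 × 224 and equalizes each non-overlapping 8 × 8 block; it assumes OpenCV and NumPy are available, and the function and variable names are illustrative rather than the authors' code.

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray, size: int = 224, block: int = 8) -> np.ndarray:
    """Resample an 8-bit image to size x size and equalize each block x block area."""
    gray = image if image.ndim == 2 else cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (size, size), interpolation=cv2.INTER_AREA)
    out = gray.copy()
    for r in range(0, size, block):
        for c in range(0, size, block):
            out[r:r + block, c:c + block] = cv2.equalizeHist(gray[r:r + block, c:c + block])
    return out

# One feature point at the center of every 8 x 8 block: a 28 x 28 grid of points.
ys, xs = np.meshgrid(np.arange(4, 224, 8), np.arange(4, 224, 8), indexing="ij")
feature_points = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (784, 2) array of (x, y)
```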

2.2. Feature descriptor generation

In the proposed method, the feature descriptors of the feature points are represented by the outputs of specific layers of a CNN, because a neural network can effectively learn various image features from a large number of samples, which greatly simplifies complex feature extraction. ResNet is one of the most typical CNNs and is mostly used for image classification [23]. In our experiment, ResNet is selected as the feature descriptor generator. One reason is that ResNet achieves excellent results on image classification with the ImageNet dataset of more than 14 million images, which indicates its particularly strong ability in image feature extraction and object recognition. Another reason is that the shortcut connections in ResNet superimpose the feature maps of the upper and lower layers, combining deep image feature information with shallow information. Because the feature information is superimposed and fused in this way, the generated feature descriptors are more accurate and robust and can better characterize the feature points of the image. The pre-trained CNN we use is part of a ResNet18 trained on the ImageNet dataset, which consists of convolutional layers, residual blocks, and a fully connected layer: the convolutional layers extract features, the residual blocks fuse and superimpose deep and shallow information, and the fully connected layer performs classification. Here we only use the convolutional layers and residual blocks for feature descriptor generation. The architecture of ResNet18 is shown in Fig. 2.


Fig. 2. The architecture of ResNet18.


In image preprocessing, the multi-modal images are resampled to 224 × 224 because the input size of ResNet18 is 224 × 224 × 3. Each image is then divided into 28 × 28 small areas, each occupying 8 × 8 pixels, and a feature point is generated at the center of each small area. The output of the conv3_x layer of ResNet18 directly forms the first feature map F1, with a size of 28 × 28 × 128, so each 8 × 8 area generates a 128-dimensional feature vector, that is, each feature point has a 128-dimensional feature vector. The output R4 of the conv4_x layer has a size of 14 × 14 × 256, so each 16 × 16 area generates a 256-dimensional feature vector, that is, 4 feature points share a 256-dimensional feature vector, and the second feature map F2 is generated from R4. The output R5 of the conv5_x layer has a size of 7 × 7 × 512, so each 32 × 32 area generates a 512-dimensional feature vector, that is, 16 feature points share a 512-dimensional feature vector, and the third feature map F3 is generated from R5. The feature maps can be expressed as

$${F_2} = {R_4} \otimes {E_{2 \times 2 \times 1}},$$
$${F_3} = {R_5} \otimes {E_{4 \times 4 \times 1}},$$
where $\otimes$ denotes the Kronecker product, and ${E_{2 \times 2 \times 1}}$ and ${E_{4 \times 4 \times 1}}$ are tensors of size 2 × 2 × 1 and 4 × 4 × 1, respectively, filled with ones.

After obtaining F1, F2, and F3, the feature maps are normalized to unit variance:

$${F_i} \leftarrow \frac{{{F_i}}}{{\sigma ({{F_i}} )}},i = 1,\textrm{ }2,\textrm{ }3,$$
where σ(Fi) stands for the standard deviation of Fi. The feature vectors of three different dimensions for each feature point are generated from F1, F2, and F3, which completes the generation of the feature descriptors.
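A minimal sketch of this descriptor construction is given below, assuming the torchvision implementation of ResNet18 pretrained on ImageNet (its layer2/layer3/layer4 modules correspond to conv3_x/conv4_x/conv5_x); the Kronecker products with the all-ones tensors in Eqs. (1) and (2) amount to replicating each cell of R4 and R5 so that every one of the 28 × 28 feature points also receives a 256- and a 512-dimensional vector.

```python
import torch
import torchvision

# Older torchvision API; newer versions use the `weights=` argument instead.
resnet = torchvision.models.resnet18(pretrained=True).eval()

def multiscale_descriptors(img: torch.Tensor):
    """img: (1, 3, 224, 224) ImageNet-normalized tensor -> three 28 x 28 feature maps."""
    with torch.no_grad():
        x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(img))))
        x = resnet.layer1(x)
        f1 = resnet.layer2(x)    # (1, 128, 28, 28): one 128-d vector per 8 x 8 patch
        r4 = resnet.layer3(f1)   # (1, 256, 14, 14)
        r5 = resnet.layer4(r4)   # (1, 512, 7, 7)

    # F2 = R4 (x) E_{2x2x1}, F3 = R5 (x) E_{4x4x1}: nearest-neighbour replication.
    f2 = r4.repeat_interleave(2, dim=2).repeat_interleave(2, dim=3)
    f3 = r5.repeat_interleave(4, dim=2).repeat_interleave(4, dim=3)

    # Normalize each feature map to unit variance, Eq. (3).
    return [f / f.std() for f in (f1, f2, f3)]
```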

2.3. Screening of background feature points

For multi-modal images with complex backgrounds, we mainly care about whether the target in the foreground can be registered, but the complex background often contains many feature points that strongly affect the registration result. Therefore, in the preliminary screening of feature points, the background feature points should be removed as much as possible, leaving only the feature points on the registration subject, so that the registration result is based entirely on them.

To screen out background feature points, a U-net is trained on a large batch of data for semantic segmentation, which avoids the slow speed, low efficiency, and high cost of manual segmentation and dramatically improves the efficiency of the entire registration process [24,25]. A U-net with a channel attention mechanism is chosen in our experiment, which has some advantages over the traditional fully convolutional network [26,27]. The traditional fully convolutional network integrates deep and shallow information by adding corresponding pixels, whereas U-net does so by channel concatenation [28]. Pixel-wise addition keeps the dimension of the feature map unchanged and packs more information into each dimension, but in a binary classification problem such as semantic segmentation, concatenation retains more dimensional and location information. This key skip-connection step combines the location information of the shallow features with the semantic information of the deep features, so that subsequent layers can choose between shallow and deep features, which is more advantageous for multi-modal images. The attention mechanism is added to handle images with complex backgrounds. Since the background in multi-modal images strongly interferes with the registration subject, the neural network needs to focus on the target, similar to the human visual system. The attention mechanism dynamically adjusts the weight of each channel through an attention layer so as to recalibrate the features and improve the representation ability of the network. First, a global receptive field is obtained by squeezing the feature map, representing the global distribution of the responses over the feature channels; then a weight is generated for each channel through learnable parameters; together these form the attention layer. After the channel attention block is added to the upsampling layers of the U-net, the network's ability to capture the registration subject in complex backgrounds is greatly increased, thereby improving segmentation accuracy.
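The channel attention described here follows the squeeze-and-excitation idea of Ref. [26]; the sketch below is a generic SE-style block assumed to be inserted into the U-net upsampling path, not the authors' exact module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (after Ref. [26])."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # global receptive field per channel
        self.excite = nn.Sequential(                  # per-channel weight generation
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # recalibrate each feature channel
```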

When the U-net training is completed, the image to be registered is simply fed into the network, and the output is a mask image. By multiplying the mask image and the original image pixel by pixel, the subject in the foreground is obtained: the pixel values of the registration subject remain unchanged, while the pixel values of the background become 0. Because of the shielding effect of the mask image, points located outside the registration subject are masked, so the remaining points all lie on the registration subject, completing the screening of irrelevant background feature points.
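The mask-based screening can then be summarized as in the sketch below, where `mask` is the binary U-net output and `feature_points` is the (x, y) grid from the preprocessing step (the names are illustrative).

```python
import numpy as np

def screen_background(image: np.ndarray, mask: np.ndarray, feature_points: np.ndarray):
    """Keep only the pixels and feature points that fall on the registration subject."""
    masked_image = image * mask                                   # background pixels -> 0
    on_subject = mask[feature_points[:, 1], feature_points[:, 0]] > 0
    return masked_image, feature_points[on_subject]
```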

2.4. Interior feature point matching

For the feature points on the registration subject, each feature point has three feature vectors of different dimensions, so we use the total feature distance of each feature point as the similarity measure between feature points in the two multi-modal images. The total feature distance dall is defined as

$${d_{\textrm{all}}} = 2{d_{\textrm{128}}}({{a_i},{b_j}} )+ \sqrt 2 {d_{256}}({{a_i},{b_j}} )+ {d_{\textrm{512}}}({{a_i},{b_j}} ).$$

Because the feature vectors differ in dimension, weight compensation is required; since the expected Euclidean distance grows roughly with the square root of the vector dimension, the weights 2, √2, and 1 compensate for the 128-, 256-, and 512-dimensional descriptors, respectively. ai and bj stand for the feature vectors of two feature points, and d128(ai,bj), d256(ai,bj), and d512(ai,bj) stand for the Euclidean distances between the 128-, 256-, and 512-dimensional vectors of the two feature points, respectively.

After the total feature distances between a feature point in one image and all feature points in the other image are calculated in this way, a pair of points is accepted as a match if the ratio of the closest distance to the second-closest distance is less than a proportional threshold.
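A sketch of this matching rule is shown below: the three per-scale Euclidean distances are combined with the weights of Eq. (4), and a nearest/second-nearest ratio test accepts a match. The ratio value here is illustrative; the paper only specifies the thresholds used for SIFT (0.8) and ORB (0.7).

```python
import numpy as np

def match_points(desc_a: dict, desc_b: dict, ratio: float = 0.8):
    """desc_a, desc_b: dicts mapping 128/256/512 to (N, d) descriptor arrays."""
    def pairwise(a, b):  # Euclidean distance matrix of shape (N_a, N_b)
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    d_all = (2.0 * pairwise(desc_a[128], desc_b[128])
             + np.sqrt(2.0) * pairwise(desc_a[256], desc_b[256])
             + pairwise(desc_a[512], desc_b[512]))                # Eq. (4)

    matches = []
    for i, row in enumerate(d_all):
        j1, j2 = np.argsort(row)[:2]                              # nearest, second nearest
        if row[j1] < ratio * row[j2]:                             # ratio test
            matches.append((i, j1))
    return matches
```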

2.5. Elimination of mismatched points and estimation of transformation parameters

After the preliminary matching of feature points is completed, the classic RANSAC algorithm is used to eliminate incorrectly matched point pairs, leaving the correctly matched pairs, which are then used to generate the homography matrix required for registration [29]. RANSAC randomly selects 4 pairs of non-collinear matching points and calculates the corresponding homography matrix model. The model is used to compute the projection error of all data points, which is compared with a preset maximum reprojection threshold; if the error is less than the threshold, the point is added to the inlier set. If the current inlier set is larger than the optimal inlier set, the optimal inlier set is updated to the current one, and the iterations are repeated. Ultimately, the optimal model with the minimum projection error on the optimal inlier set is obtained. The number of iterations k is calculated as

$$k = \frac{{log(1 - z)}}{{log(1 - {w^n})}},$$
where z denotes the confidence, w is the proportion of inliers in the data set, and n is the minimum number of samples required per iteration (here n = 4).
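As a worked example of Eq. (5), with an illustrative confidence z = 0.99, inlier ratio w = 0.5, and n = 4 samples per iteration (these values are not from the paper), about 72 iterations are needed:

```python
import math

z, w, n = 0.99, 0.5, 4                       # illustrative values, not from the paper
k = math.log(1 - z) / math.log(1 - w ** n)   # Eq. (5): k ~= 71.4
print(math.ceil(k))                          # -> 72 iterations
```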

After that, the filtered points are used to generate the homography matrix H required for image transformation [30], where H is defined as

$$H = \left( {\begin{array}{{ccc}} {{h_{11}}}&{{h_{12}}}&{{h_{13}}}\\ {{h_{21}}}&{{h_{22}}}&{{h_{23}}}\\ {{h_{31}}}&{{h_{32}}}&1 \end{array}} \right).$$

At last, for the feature point pairs on the image, we have

$$\left( {\begin{array}{{c}} {x^{\prime}}\\ {y^{\prime}}\\ 1 \end{array}} \right) = \left( {\begin{array}{{ccc}} {{h_{11}}}&{{h_{12}}}&{{h_{13}}}\\ {{h_{21}}}&{{h_{22}}}&{{h_{23}}}\\ {{h_{31}}}&{{h_{32}}}&1 \end{array}} \right)\left( {\begin{array}{{c}} x\\ y\\ 1 \end{array}} \right),$$
where (x, y) denotes the position coordinates before the feature point transformation and (x′, y′) denotes the position coordinates after the transformation.

After the homography matrix is obtained, it is multiplied by the homogeneous coordinates of the corresponding points in the moving image according to Eq. (7) to complete the registration between the two multi-modal images. If images of three or more modalities are available at the same time, the method above can be applied repeatedly to calculate the transformation parameters between each moving image and the fixed image so as to register the whole set of multi-modal images.
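Under the assumption that OpenCV is used for this final step (the paper does not name a library), the sketch below estimates the homography from the screened matches with RANSAC and warps the moving image onto the fixed one.

```python
import cv2
import numpy as np

def register(moving: np.ndarray, src_pts: np.ndarray, dst_pts: np.ndarray,
             fixed_shape: tuple, reproj_thresh: float = 3.0):
    """src_pts/dst_pts: (N, 2) matched coordinates in the moving/fixed image, N >= 4."""
    H, inlier_mask = cv2.findHomography(src_pts.astype(np.float32),
                                        dst_pts.astype(np.float32),
                                        cv2.RANSAC, reproj_thresh)
    # Warp the moving image into the fixed image's coordinate frame (Eq. (7)).
    warped = cv2.warpPerspective(moving, H, (fixed_shape[1], fixed_shape[0]))
    return warped, H, inlier_mask
```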

3. Experimental results and discussions

In our experiment, we captured a set of aircraft images under a complex background using both a visible-light polarization camera (BFS-U3-51S5P) and a near-infrared camera (XEN-000139). During the capture of the aircraft images, there is a difference in the viewing angles of the two cameras, as shown in Fig. 3. If target recognition is to be performed on the basis of multi-modal imaging, image registration must be completed first, and in this case the effect of the complex background is very significant.


Fig. 3. Raw data image of the experiment. (a) captured by visible light polarization camera; (b) captured by near-infrared camera.


To perform image registration, we first need to separate the registration subject from the background by manual cropping. The pixel value of the background is 0 and that of the registration subject is 1, and each image is paired with the result of its semantic segmentation. The original images serve as samples and the segmentation results as labels. After data augmentation, a total of 3500 sets of data are obtained, which are divided into training, validation, and test sets in a ratio of 6:2:2, giving 2100 sets in the training set, 700 in the validation set, and 700 in the test set. Then the U-net with the attention mechanism is trained. The loss function is the cross-entropy loss, the optimizer is Adam, the learning rate is set to 10⁻⁵, and training is stopped when the cross-entropy loss falls below 0.0006 [31,32]. The CNN is implemented with PyTorch 1.8.1 and Python 3.7 on a PC with an Intel Core i7-10700 CPU, 16 GB of RAM, and an NVIDIA GeForce RTX 3090 Ti GPU.
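A minimal training sketch matching the configuration stated above (cross-entropy loss, Adam, learning rate 10⁻⁵, stop when the loss falls below 0.0006) is given below; the segmentation model and data loader are supplied by the caller, and the epoch limit is illustrative, so this is not the authors' code.

```python
import torch
import torch.nn as nn

def train_segmentation(model: nn.Module, train_loader, device: str = "cuda",
                       max_epochs: int = 200, loss_threshold: float = 0.0006):
    """Train an attention U-net style segmentation model as described in the text."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    model.to(device).train()
    for epoch in range(max_epochs):          # max_epochs is an assumed safety cap
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
        if loss.item() < loss_threshold:      # stopping criterion from the text
            break
    return model
```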

We tested the proposed method on polarized and near-infrared images to demonstrate the robustness of its feature descriptors, the accuracy of its feature point matching, and its registration accuracy. The matching and registration results are compared with those of the traditional SIFT and ORB methods. Both SIFT and ORB use the ratio of the nearest Euclidean distance to the second-nearest Euclidean distance for feature point matching; the threshold is set to 0.8 for SIFT and 0.7 for ORB.

To evaluate the final experimental results, a combined subjective and objective quality assessment method is adopted. Some objective evaluation indexes are commonly used to evaluate feature-based registration algorithms, such as the number of feature point pairs on the image or the registration subject, the number of mismatched feature point pairs on the registration subject, and the accuracy. The accuracy represents the ratio of the number of correctly matched feature point pairs to the total number of feature point pairs on the registration subject. These objective evaluation indexes are used to illustrate the accuracy of feature matching, thereby reflecting the robustness of feature descriptors. The comparison of feature point matching results obtained by different methods is shown in Fig. 4.


Fig. 4. Comparison of feature point matching results obtained by different methods. (a) SIFT; (b) MASK + SIFT; (c) ORB; (d) MASK + ORB; (e) our method.


Table 1 shows the number of matching feature points obtained by the different methods. Without mask maps, the two traditional feature detection algorithms, SIFT and ORB, have very poor feature extraction capabilities for registration subjects under complex backgrounds, and the number of feature matching point pairs on the registration subjects is only 1 and 0, respectively. After the mask image masks the complex background, both SIFT and ORB can find some feature points on the registration subject. With this preliminary screening of background feature points, the feature matching achieves good results and higher accuracy (62.5% and 80.0%), which illustrates that the semantic segmentation step is necessary when facing images with complex backgrounds. Meanwhile, our method has the highest feature point matching accuracy (91.0%). After the SIFT and ORB algorithms extract the feature points on the aircraft and complete the matching, a certain number of mismatched points remain that cannot be eliminated by the RANSAC algorithm, which prevents a higher matching accuracy and can generate a wrong homography for registration. Therefore, with the help of neural networks, the feature descriptors produced by our method are more robust than those of SIFT and ORB.


Table 1. The number of matching feature points obtained by different methods

Since the other feature detection algorithm, ORB, has a better feature matching accuracy (80.0%) with the help of the mask map, we compared the registration results of the MASK + ORB algorithm with those of our method. Figure 5 shows the superimposed multi-modal images after registration by the MASK + ORB method and by our method. In the superimposed result of the MASK + ORB algorithm, ghosting appears at the head and tail of the aircraft after the registered moving image and the fixed image are fused. This is because the distribution of the feature points is not uniform; they are concentrated in certain parts of the aircraft, so the complete aircraft cannot be aligned during the transformation, and ghosting appears in the superimposed result.


Fig. 5. The superimposed multimodal images after registration. (a) MASK + ORB; (b) our method.


In addition, to further compare the registration accuracy of these two methods, the root mean square distance RMSD, the mean absolute distance MAD, and the median distance MED are used to measure the difference between the real registration results and the registration results generated by the algorithm. The real registration result is calculated by manually annotating 10 correct feature matching point pairs. The calculation formulas of the root mean square distance RMSD and the mean absolute distance MAD are as follows:

$$\textrm{RMSD} = \sqrt {\frac{1}{N}{{\sum\limits_{j = 1}^N {||{{P_{TRUE}}({{S_j}} )- P({{S_j}} )} ||} }^2}} ,$$
$$\textrm{MAD} = \frac{{\sum\limits_{j = 1}^N {|{{P_{TRUE}}({{S_j}} )- P({{S_j}} )} |} }}{N},$$
where N = 10 denotes the number of manually annotated feature point pairs, P(Sj) denotes the coordinate position of the jth feature point in the image to be matched after the perspective transformation with the registration parameters obtained by the algorithm, and PTRUE(Sj) denotes the coordinate position of the jth feature point after the perspective transformation with the real registration parameters. Figure 6 shows the 10 manually labeled sets of feature matching point pairs.


Fig. 6. Manually labeled 10 sets of feature matching point pairs.

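The error measures of Eqs. (8) and (9) can be computed as in the sketch below, where the absolute distance in Eq. (9) is interpreted as the Euclidean distance between transformed points; `transform` applies a 3 × 3 homography to (x, y) coordinates, and the function names are illustrative.

```python
import numpy as np

def transform(H: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply a 3 x 3 homography H to an (N, 2) array of (x, y) points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def registration_errors(H_est: np.ndarray, H_true: np.ndarray, pts: np.ndarray):
    """RMSD, MAD, and MED over the manually annotated point set (N = 10 in the paper)."""
    d = np.linalg.norm(transform(H_true, pts) - transform(H_est, pts), axis=1)
    return np.sqrt(np.mean(d ** 2)), np.mean(d), np.median(d)
```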

The comparison of the registration indexes is shown in Table 2. For both the MASK + ORB method and the proposed method, the errors between the registration results and the real registration results are very small, which shows that the registration accuracy of both methods is already very close to that of the real registration parameters. The registration errors of our method (3.14, 5.90, 2.82) are smaller than those of the MASK + ORB method (3.45, 5.93, 3.27). Because the feature points generated by the MASK + ORB method are more concentrated, a small part of the registration subject cannot be completely aligned, which increases the offset after registration compared with the real registration result. In contrast, the feature points extracted by our method are distributed very uniformly over every part of the aircraft, so the registration accuracy is improved, the registration error is smaller than that of the traditional method, and the superimposed result is closer to the ideal. In terms of the time required for registration, our method (0.86 s) is also faster than the MASK + ORB method (1.41 s), which supports its use where real-time performance is required. The experimental results show its high accuracy and good real-time performance compared with traditional methods.


Table 2. Comparison of registration indexes

Nevertheless, our method still has shortcomings. Since we generate one feature point in every 8 × 8 pixel area and represent it by three feature vectors of different dimensions, pixel-level accuracy cannot be achieved. Therefore, a feature point cannot always find its most accurate corresponding feature point in the other image during matching, which leads to errors in the final registration step. The reason is that the receptive field corresponding to the shallowest output layer used in our method is an 8 × 8 pixel area, and an even shallower output layer of the ResNet18 network cannot be used as a feature descriptor because its ability to represent feature points is limited.

4. Conclusions

A registration method for multi-modal images under complex backgrounds is proposed, so that the multi-modal images captured by multi-spectral cameras can be fused together to obtain complete information about the target, which is of great help to target detection. There are two key steps in our method. First, we generate feature point descriptors through a ResNet18 trained on the massive ImageNet dataset. Because ResNet18 combines deep and shallow image feature information, the information is superimposed and fused; moreover, the feature descriptors come from three output layers of ResNet18, which represent feature information at different depths of the image, so the generated descriptors are more robust, which helps the subsequent feature point matching step. Second, we train a CNN for semantic segmentation to filter out irrelevant feature points in the background and eliminate the interference of the background on the target registration, so as to achieve a better registration effect. Finally, compared with several other feature-based image registration methods, the feature points found by our method are matched more accurately and distributed more uniformly, and the registration performance after eliminating the wrongly matched points is better.

Funding

National Natural Science Foundation of China (62075183, 62175041); Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2019ZTO8X340).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. B. Zitova and J. Flusser, “Image registration methods: a survey,” Image and Vision Computing 21(11), 977–1000 (2003). [CrossRef]  

2. H. Kaur, D. Koundal, and V. Kadyan, “Image fusion techniques: a survey,” Arch Computat Methods Eng 28(7), 4425–4447 (2021). [CrossRef]  

3. S. Zhang, F. Huang, B. Liu, G. Li, Y. Chen, L. Sun, and Y. Zhang, “Robust registration for ultra-field infrared and visible binocular images,” Opt. Express 28(15), 21766–21782 (2020). [CrossRef]  

4. B. Li, W. Wang, and H. Ye, “Multi-sensor image registration based on algebraic projective invariants,” Opt. Express 21(8), 9824–9838 (2013). [CrossRef]  

5. A. R. Wade and F. W. Fitzke, “A fast, robust pattern recognition system for low light level image registration and its application to retinal imaging,” Opt. Express 3(5), 190–197 (1998). [CrossRef]  

6. F. P. Oliveira and J. M. R. Tavares, “Medical image registration: a review,” Computer Methods in Biomechanics and Biomedical Engineering 17(2), 73–93 (2014). [CrossRef]  

7. Z. Song, S. Li, and T. F. George, “Remote sensing image registration approach based on a retrofitted SIFT algorithm and Lissajous-curve trajectories,” Opt. Express 18(2), 513–522 (2010). [CrossRef]  

8. C. Zuo, Q. Chen, G. Gu, and X. Sui, “Registration method for infrared images under conditions of fixed-pattern noise,” Opt. Commun. 285(9), 2293–2302 (2012). [CrossRef]  

9. C. Zuo, Q. Chen, G. Gu, X. Sui, and J. Ren, “Improved interframe registration based nonuniformity correction for focal plane arrays,” Infrared Physics & Technology 55(4), 263–269 (2012). [CrossRef]  

10. Y. J. Wang and Y. H. Lin, “An optical system for augmented reality with electrically tunable optical zoom function and image registration exploiting liquid crystal lenses,” Opt. Express 27(15), 21163–21172 (2019). [CrossRef]  

11. K. Yang, A. Pan, Y. Yang, S. Zhang, S. H. Ong, and H. Tang, “Remote sensing image registration using multiple image features,” Remote Sens. 9(6), 581 (2017). [CrossRef]  

12. J. P. Pluim, J. A. Maintz, and M. A. Viergever, “Mutual-information-based registration of medical images: a survey,” IEEE Trans. Med. Imaging 22(8), 986–1004 (2003). [CrossRef]  

13. D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the seventh IEEE international conference on computer vision, (IEEE, 1999), pp. 1150–1157.

14. Y. Ke and R. Sukthankar, “PCA-SIFT: a more distinctive representation for local image descriptors,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), (2004), pp. II.

15. K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Anal. Machine Intell. 27(10), 1615–1630 (2005). [CrossRef]  

16. H. Bay, T. Tuytelaars, and L. V. Gool, “Surf: Speeded up robust features,” in European conference on computer vision, (Springer, 2006), pp. 417.

17. E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in 2011 International Conference on Computer Vision, (IEEE, 2011), pp. 2564–2571.

18. S. Mallat, “Understanding deep convolutional networks,” Phil. Trans. R. Soc. A. 374(2065), 20150203 (2016). [CrossRef]  

19. Y. Li, H. Zhang, X. Xue, Y. Jiang, and Q. Shen, “Deep learning for remote sensing image classification: A survey,” WIREs Data Mining Knowl Discov 8, e1264 (2018). [CrossRef]  

20. Z. Q. Zhao, P. Zheng, S. Xu, and X. Wu, “Object detection with deep learning: A review,” IEEE Trans. Neural Netw. Learning Syst. 30(11), 3212–3232 (2019). [CrossRef]  

21. Z. Yang, T. Dan, and Y. Yang, “Multi-temporal remote sensing image registration using deep convolutional features,” IEEE Access 6, 38544–38555 (2018). [CrossRef]  

22. F. Ye, Y. Su, H. Xiao, X. Zhao, and W. Min, “Remote sensing image registration using convolutional neural network features,” IEEE Geosci. Remote Sensing Lett. 15(2), 232–236 (2018). [CrossRef]  

23. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 770–778.

24. Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, “A review of semantic segmentation using deep neural networks,” Int J Multimed Info Retr 7(2), 87–93 (2018). [CrossRef]  

25. G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “VoxelMorph: a learning framework for deformable medical image registration,” IEEE Trans. Med. Imaging 38(8), 1788–1800 (2019). [CrossRef]  

26. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 7132–7141.

27. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, (Springer, 2015), pp. 234–241.

28. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2015), pp. 3431–3440.

29. M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM 24(6), 381–395 (1981). [CrossRef]  

30. D. Barath and Z. Kukelova, “Homography from two orientation-and scale-covariant features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 1091–1099.

31. A. Jamin and A. Humeau-Heurtier, “(Multiscale) cross-entropy methods: A review,” Entropy 22(1), 45 (2019). [CrossRef]  

32. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, (2015).



