Real-infraredSR: real-world infrared image super-resolution via thermal imager

Open Access

Abstract

Infrared image super-resolution technology aims to overcome the pixel-size limitation of the infrared focal plane array and obtain higher-resolution images. Because real-world images at different resolutions undergo degradation processes far more complex than simple mathematical models, most existing super-resolution methods trained on synthetic data obtained by bicubic interpolation achieve unsatisfactory reconstruction performance in real-world scenes. To address this, this paper proposes a real-world infrared dataset with different resolutions, built with a refrigerated thermal detector and an infrared zoom lens, enabling the network to learn more realistic details. We obtain images under different fields of view by adjusting the infrared zoom lens and then align the high- and low-resolution (HR-LR) images in scale and luminance. The dataset supports infrared image super-resolution with an up-sampling factor of two. To learn the complex features of infrared images efficiently, an asymmetric residual block structure is proposed that effectively reduces the number of parameters and improves network performance. Finally, to address the slight misalignment introduced in the pre-processing stage, contextual loss and perceptual loss are introduced to improve visual quality. Experiments show that our method achieves superior results in both reconstruction quality and practical value for single infrared image super-resolution in real scenarios.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Infrared image processing technology is increasingly used in a wide range of applications, such as medical imaging [1], detection tasks [2] and biological applications [3], leading to a growing demand for high-resolution infrared images. Effectively improving the resolution of infrared images has therefore become a critical issue. The focal plane detector is the core device that determines infrared image resolution, which is limited by the size of the detector pixels. The most direct way to improve resolution is to reduce the pixel size; however, because of the detector's complex structure, improving its resolution in hardware is very difficult. To address this, infrared image super-resolution methods have been proposed to improve the resolution of infrared images effectively.

Fortunately, recent advances in deep learning for image processing have significantly improved image super-resolution performance. The first application of a convolutional neural network (CNN) to image super-resolution (SRCNN [4]) enabled the emergence of a large number of high-quality super-resolution algorithms, which outperform traditional interpolation methods such as bicubic [5]. However, most existing reconstruction techniques focus on visible-light scenes and rarely explore other wavebands such as the infrared, owing to the difficulty of acquiring such images. To achieve super-resolution of a single infrared image, the excellent reconstruction capability of CNNs can be leveraged.

Inspired by the outstanding performance of visible-image super-resolution, it is tempting to apply visible-image methods directly to infrared images, but two problems arise. First, most image super-resolution methods use synthetic datasets such as DIV2K, obtained by bicubic interpolation. This has the advantage that perfect alignment between high- and low-resolution (HR-LR) images is guaranteed. However, it only accounts for the pixel relationship between images and ignores more complex factors such as environmental effects and the imaging mechanism, which cannot be represented by simple interpolation. As a result, the downsampling kernel used differs considerably from the true degradation kernel. In recent years, researchers have begun to capture data from real scenes [6,7]. Second, a real-world infrared super-resolution dataset is lacking. The image-formation mechanisms of infrared and visible images differ: infrared images are obtained by measuring the heat radiated from objects. Compared with visible images, infrared images therefore exhibit poor resolution, low contrast, low signal-to-noise ratio and a lack of rich details. If the network is trained only on existing real-world visible-image datasets with rich texture details, the infrared images generated at inference may be over-sharpened and distorted. Simply training our network on real-world visible-light datasets is therefore inappropriate.

In the field of infrared super-resolution, a real-world dataset containing HR-LR image pairs is necessary. Since infrared image acquisition and the manufacture of refrigerated infrared lenses are much more complex than for visible imaging, we only use a focal-length change to obtain a 2X-zoomed field of view (FOV). After preliminary screening and alignment, 411 infrared HR-LR image pairs are collected for training and testing. A sample is shown in Fig. 1(a), and a comparison of the detail textures of an HR-LR infrared image pair is shown in Fig. 1(b). In addition, an asymmetric residual block structure is proposed, which effectively reduces the number of model parameters and the complexity of the network. The traditional $L_1$ loss is widely used in image super-resolution and improves the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM); however, for real scenes this single loss function cannot guarantee improved visual quality. Therefore, perceptual loss and contextual loss are introduced alongside the original $L_1$ loss: perceptual loss enhances the visual quality of the final reconstruction, and contextual loss corrects the slight misalignment of HR-LR image pairs remaining from the preprocessing stage. Finally, experimental results demonstrate the effectiveness of the proposed improvements through qualitative and quantitative analysis.

Fig. 1. (a) HR-LR image pairs with different resolutions can be obtained by changing the field of view (FOV). (b) Detail texture comparison of the high- and low-resolution images obtained with the refrigerated thermal imager and infrared zoom lens.

To sum up, the contributions of this paper are as follows:

  • A new real-world infrared super-resolution dataset is proposed, which achieves a 2X magnification. The long-focal-length shots are taken as ground-truth (GT) images. This dataset can be used for training and evaluating single image super-resolution (SISR) in real infrared scenes.
  • An asymmetric residual block structure is proposed, which effectively lightens the network and improves its performance, significantly increasing its practicality.
  • Contextual loss and perceptual loss are introduced to handle the slight misalignment of images remaining from the registration process, with a focus on improving visual quality.

2. Related works

With the wide application of deep learning in computational optics, image super-resolution reconstruction has also benefited. Traditional filter-based methods (such as bicubic) have been replaced by learning-based methods because of their excessive smoothing of textures. In the field of image super-resolution, single image super-resolution (SISR) [8] is more practical than multi-frame image super-resolution (MISR) [9,10] because it requires only a single image to enhance the texture and detail of a low-resolution image, and it has therefore received a great deal of attention.

In 2014, Dong et al. first introduced CNNs for single image super-resolution. Subsequently, FSRCNN [11] improved network performance by replacing the original bicubic interpolation with a learnable deconvolution layer; placing the deconvolution at the end of the network also reduces the number of parameters and improves real-time performance. A series of CNN-based algorithms followed. For example, VDSR [12] uses a deeper network to enlarge the receptive field and exploit more context information; SRResNet introduced residual blocks to obtain a deeper network; and EDSR removed the redundant BN layers in SRResNet, so that more layers can be stacked under limited resources, or more features extracted from each layer, to achieve better performance. These methods undeniably brought image super-resolution to a new level.

However, practical application has shown that images with high PSNR and SSIM [13] do not always meet the needs of real scenes. Scholars attribute this to a common problem: using a known downsampling kernel (e.g., bicubic) to simulate the degradation process. Recent studies such as SRGAN [14] have shown that results on synthetic data overestimate super-resolution performance on real-world images; in SRGAN, the authors demonstrate that high-PSNR images do not necessarily represent good reconstructions. Consequently, some scholars have turned to super-resolution of real-scene images. For example, RealSR [15] used a Canon 5D3 and a Nikon D810 to shoot a real-world super-resolution dataset and developed an image alignment algorithm for HR-LR image pairs. SR-RAW [16] captures image sequences of the same scene at long and short focal lengths, takes the images obtained with the longest focal length as the ground truth, and obtains image pairs at different magnification ratios; it also trains the network on the raw sensor data to preserve the real image information as much as possible. All of the above methods aim to build real-world datasets for super-resolution in real scenarios.

In fact, the degradation process in the infrared domain is more complex than that of visible light, and to our knowledge there is little related work on real-scene infrared super-resolution. Inspired by research on real-world single image super-resolution, this paper investigates the super-resolution of infrared images in real scenes and proposes a real-world infrared super-resolution dataset for 2X scaling.

3. Real-world infrared SISR dataset

To obtain a more complex downsampling kernel that is closer to the actual imaging conditions, we propose using a refrigerated infrared thermal imager with a dual-field-of-view zoom lens to collect infrared data covering multiple indoor and outdoor scenes. All infrared images used in this experiment were captured with this refrigerated thermal imager [17]. In the preprocessing stage, feature-based field-of-view matching and luminance registration are used to align the LR-HR image pairs. The acquisition and registration steps are explained in detail in this section.

3.1 Infrared image acquisition

We use the two focal lengths of the dual-field-of-view lens (25 and 50 $mm$) to collect high- and low-resolution infrared image pairs at different optical zoom levels; the operating waveband is 3-5 $\mathrm{\mu}$m. The scene shot at the 50 mm end serves as the GT image with richer texture details, and the 25 mm end provides the input, forming data pairs for training the 2X model. We collected a total of 486 image pairs of various indoor and outdoor scenes, each of size 640×512. Since the scenes may contain moving objects such as pedestrians and vehicles, the data need to be cleaned; 411 pairs were retained after preprocessing, of which 405 pairs were used for training. The remaining 6 pairs constitute the test set, analogous to Set5, a standard image super-resolution benchmark, for verifying the validity of the model.

Unlike visible-light cameras, an infrared thermal imager uses photoelectric devices to detect and measure radiation and thereby obtains a thermal image corresponding to the heat distribution on object surfaces; anything above absolute zero (−273 $^{\circ }$C) emits infrared radiation. To ensure that the dataset is applicable to a wide range of scenarios, it should comprehensively cover both indoor and outdoor scenes, and, because temperature matters, both day and night. To keep the same scene in view, a tripod is used to fix the thermal imager, and the field of view is changed only by adjusting the focal length. Figure 2 gives some examples of collected HR-LR image pairs. It is also important to note that histogram equalization can cause problems for luminance registration when the scene contains objects that are excessively bright or dark (i.e., too hot or too cold), as shown in Fig. 3; such scenes are therefore avoided as much as possible to ensure good registration. The original (non-equalized) infrared images do not have this luminance problem, but they can appear blurred, making it difficult to discern objects and, in particular, the details that super-resolution seeks to enhance. We therefore use the histogram-equalized images rather than the originals to build the infrared super-resolution dataset.

Fig. 2. Examples of image pairs at different focal lengths. The upper images are captured at the 50 $mm$ focal length (high-resolution scene) and serve as ground truth. The lower images are the low-resolution images at the 25 $mm$ focal length, and the red box marks the region corresponding to the upper images.

Fig. 3. Luminance misalignment in infrared HR-LR image acquisition due to non-uniformity correction.

3.2 Image alignment

Acquiring the infrared image pairs themselves is not especially difficult. However, owing to a series of uncertain factors during shooting, such as operator error and ground vibration, slight deviations of the field of view are hard to avoid, and barrel distortion of the zoom lens, which grows as the focal length decreases, is inherent to the optical design. All of these factors make pixel-aligned images difficult to obtain. We therefore developed an infrared image alignment scheme to align the HR-LR infrared image pairs.

The alignment process is divided into three steps: removal of distortion, perspective transformation and luminance registration, as shown in Fig. 4:

  • Removal of Distortion—As shown in part (1) of Fig. 4, we first import the infrared image into Photoshop to remove most of the lens distortion. However, this operation has little effect near the image borders, far from the optical center. We therefore crop 20 pixels from each edge of the image and simply discard the severely distorted areas.
  • Perspective Transformation—In part (2) of Fig. 4, SIFT [18] and RANSAC [19] are used to locate the long-focal-length scene within the short-focal-length image. The keypoints found by SIFT are highly salient points that are insensitive to illumination, affine transformation, noise and other factors, such as corner points, edge points, bright points in dark areas and dark points in bright areas. Since the luminance is inconsistent because of histogram equalization and SIFT features are robust to this, SIFT registration is preferred. After removing the black border, the low-resolution image corresponding to the short focal length is obtained. Because the object distances differ, the magnification ratio is not exactly constant; measured over multiple image pairs it is roughly 1.95, and we fine-tune it with an interpolation algorithm to achieve exactly double magnification.
  • Luminance Registration—As shown in part (3) of Fig. 4, to avoid large brightness differences between the short- and long-focal-length images, we try to avoid objects with large temperature differences between the two fields of view. Although this filters out most of the excessive brightness differences, small errors remain. We therefore use a least-squares method for brightness registration. Let $I_L$ denote the low-resolution image after distortion removal and perspective transformation, and $I_H$ the high-resolution GT image. The objective function is defined as follows:
    $$\arg \min _{\alpha, \beta}\left\|\alpha I_{L}+\beta-I_{H}\right\|_{p}^{p}$$
    where $\alpha$ and $\beta$ are luminance correction parameters, and $\|\cdot\|_{p}$ is a robust norm ($p\leq1$). The expressions of $\alpha$ and $\beta$ are as follows:
    $$\begin{aligned} & \alpha=\frac{\operatorname{std}\left(\mathrm{I}_{\mathrm{H}}\right)}{\operatorname{std}\left(\mathrm{I}_{\mathrm{L}}\right)} \\ & \beta=\operatorname{mean}\left(\mathrm{I}_{\mathrm{H}}\right)-\alpha \operatorname{mean}\left(\mathrm{I}_{\mathrm{L}}\right) \end{aligned}$$

    In this way, each registered HR-LR infrared image pair is guaranteed to have the same pixel mean and standard deviation. A brief code sketch of the perspective transformation and luminance registration steps is given after this list.
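The following is a minimal Python sketch of these two steps using OpenCV's SIFT and RANSAC implementations. It is illustrative only: the Photoshop distortion removal, the black-border cropping, and the final interpolation to an exact 2X ratio described above are omitted, and in the actual pipeline the registered region is kept at half the GT resolution to preserve the scale factor, whereas this sketch simply warps to the GT frame. The function name align_pair is ours.

import cv2
import numpy as np

def align_pair(lr, hr):
    """Register the short-focal-length (LR) frame to the long-focal-length (HR)
    frame and match their luminance statistics (Eqs. 1-2). lr, hr: uint8 grayscale."""
    sift = cv2.SIFT_create()
    kp_l, des_l = sift.detectAndCompute(lr, None)
    kp_h, des_h = sift.detectAndCompute(hr, None)

    # Descriptor matching with Lowe's ratio test to keep reliable correspondences.
    matches = cv2.BFMatcher().knnMatch(des_l, des_h, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    src = np.float32([kp_l[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_h[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC-estimated homography maps the LR view onto the HR view.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    lr_warp = cv2.warpPerspective(lr, H, (hr.shape[1], hr.shape[0]))

    # Luminance registration: alpha = std(I_H)/std(I_L), beta = mean(I_H) - alpha*mean(I_L).
    lr_f, hr_f = lr_warp.astype(np.float64), hr.astype(np.float64)
    alpha = hr_f.std() / lr_f.std()
    beta = hr_f.mean() - alpha * lr_f.mean()
    return np.clip(alpha * lr_f + beta, 0, 255).astype(np.uint8)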

Fig. 4. Illustration of our image pair registration process.

4. Proposed method

4.1 Network structure

In Section 3, we built an aligned super-resolution dataset of real-world infrared images; we now need a suitable and efficient network for inference. In the past, deeper network structures were often used to improve performance, which increased model complexity and made training and inference more difficult. Notably, the gain in performance is often not proportional to the increase in model complexity. Therefore, for practicality, it is necessary to improve inference efficiency while improving the super-resolution effect.

In the field of single image super-resolution (SISR), considering that complex networks are often difficult to converge, we adopt the structure of EDSR, a state-of-the-art network. The network structure is shown in Fig. 5: a feature extraction module and an up-sampling module together achieve image super-resolution. As shown in Fig. 6(a) and (b), EDSR removes the redundant BN layers and part of the ReLU layers from the original SRResNet residual block, saving the memory consumed by the BN layers. In Inception [21], the authors proposed the asymmetric convolution module, using two one-dimensional convolutions to strengthen the square convolution in the horizontal and vertical directions and thus emphasize locally salient features. Inspired by asymmetric convolution, we propose a new structure, the asymmetric residual block, shown in Fig. 6(c) and (d), where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th residual block. The number of asymmetric residual blocks in the network is set to 16, and the number of feature channels in the convolution operations is 32. Our tests show that the asymmetric residual block structure reduces the number of parameters by about 30%, as shown in Table 1.
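Since the source gives no code, the following PyTorch sketch shows one plausible reading of Fig. 6(c)-(d): the 3x3 convolutions of the EDSR residual block are replaced by paired 1x3 and 3x1 convolutions, which is consistent with the roughly 30% parameter reduction reported in Table 1. The exact layer ordering inside the block is our assumption.

import torch
import torch.nn as nn

class AsymResBlock(nn.Module):
    """Sketch of an asymmetric residual block: each 3x3 convolution of the EDSR
    block is replaced by a 1x3 convolution followed by a 3x1 convolution, cutting
    per-block parameters from 2*(9*C*C) to 4*(3*C*C), i.e. by roughly one third."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
        )

    def forward(self, x):
        # x_{l+1} = x_l + F(x_l), as in Fig. 6.
        return x + self.body(x)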

Fig. 5. Network structure including a group of asymmetric residual blocks and a pixel-shuffle module.

Fig. 6. The basic structure of asymmetric residual block, compared with original res-blocks proposed in SRResNet [14] and EDSR [20].

Table 1. Comparison of parameter counts for EDSR and our network

4.2 Loss function

The full-reference metrics PSNR and SSIM are commonly used as the primary standards for measuring the performance of super-resolution algorithms. One benefit is that such MSE-related metrics directly minimize pixel-wise error. However, these full-reference metrics are not designed around human visual perception; they consider only the image signal and noise, so the generated image may not match the visual characteristics of the human eye. We therefore design a loss function based on the $L_1$ loss, into which perceptual loss $L_{\text {Perceptual}}$ and contextual loss $L_{C X}$ are introduced so that the network achieves better results. The total loss $L_{SR}$ is expressed as follows:

$$L_{S R}=L_1+\alpha L_{\text{Perceptual }}+\beta L_{C X}$$
where $\alpha$ and $\beta$ are adjustment factors. In our experiment, both parameters are set to 0.5.
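For illustration, the combination can be sketched as below; l1_loss, perceptual_loss, and contextual_loss are placeholder names standing in for the terms detailed in the next two subsections, not names from the paper.

def total_loss(sr, hr, alpha=0.5, beta=0.5):
    # L_SR = L_1 + alpha * L_Perceptual + beta * L_CX, with alpha = beta = 0.5.
    return l1_loss(sr, hr) + alpha * perceptual_loss(sr, hr) + beta * contextual_loss(sr, hr)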

4.2.1 Perceptual loss

First, the perceptual loss is introduced. Johnson et al. [23] first proposed perceptual loss, using a pre-trained VGG network to mimic human perception by extracting image features and minimizing the Euclidean distance between them; this achieved better visual performance in SRGAN [14]. Subsequently, Wang et al. noticed in ESRGAN [22] that the original perceptual loss uses features taken after the activation layer, which makes the features sparse, resulting in weak supervision and poor performance; they therefore compute the perceptual loss on the VGG19-54 feature map before the activation layer.

Because infrared images have poor contrast, computing the perceptual loss on pre-activation features preserves more texture details and strengthens the supervision of fine textures. We therefore adopt the variant used by Wang et al. in ESRGAN:

$$L_{\text{Perceptual}}=\frac{1}{W_{i, j} H_{i, j}} \sum_{x=1}^{W_{i, j}} \sum_{y=1}^{H_{i, j}}\left(\phi_{i, j}\left(I^{HR}\right)_{x, y}-\phi_{i, j}\left(G\left(I^{LR}\right)\right)_{x, y}\right)^2$$
where $\phi _{i,j}$ denotes the output of the $j$-th convolution (before activation) preceding the $i$-th max-pooling layer in the VGG19 network; $W_{i,j}$ and $H_{i,j}$ are the dimensions of the feature map; and $G$ denotes the super-resolution network (the generator in GAN-based methods), which maps low-resolution images to high-resolution images.
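A minimal PyTorch sketch of this pre-activation VGG19 perceptual loss follows, using torchvision's pretrained VGG19 sliced just before the ReLU that follows conv5_4. Input normalization and the single-channel handling of infrared images are simplifying assumptions of ours, not details given in the paper.

import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class PerceptualLoss(nn.Module):
    """MSE between VGG19 feature maps taken before activation (the '54' features)."""

    def __init__(self):
        super().__init__()
        # Index 34 of vgg19().features is conv5_4; slicing [:35] stops before its ReLU.
        self.features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:35].eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.mse = nn.MSELoss()

    def forward(self, sr, hr):
        # Single-channel infrared tensors (N, 1, H, W) are repeated to 3 channels for VGG.
        sr3, hr3 = sr.repeat(1, 3, 1, 1), hr.repeat(1, 3, 1, 1)
        return self.mse(self.features(sr3), self.features(hr3))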

4.2.2 Contextual loss

When the proposed infrared dataset is used directly for training, obvious blurriness appears. Our analysis suggests that pixel-level registration cannot be achieved perfectly during image registration: even after the perspective transformation based on SIFT and RANSAC, offsets of a few pixels remain. It is therefore necessary to introduce an offset-tolerant loss in the training stage. Mechrez et al. proposed treating an image as a set of feature points and minimizing the total distance over all matched feature pairs to align image pairs [24]. The expression is as follows:

$$L_{C X}=C X(P, Q)=\frac{1}{N} \sum_{i}^{N} \min _{j=1, \ldots, M}\left(D_{p_i, q_j}\right)$$
where $P$ and $Q$ are the source and target images; $N$ and $M$ are the numbers of feature points in $P$ and $Q$; $p_i$ and $q_j$ are feature points of the source and target images, respectively; and $D_{p_i, q_j}$ denotes the cosine distance between the two image features.

In the contextual loss, the output of the last convolution before the fourth max-pooling layer of the VGG19 network is selected as the feature for this calculation. The advantage of this choice is that it avoids the detail loss introduced by max pooling, which would run contrary to the purpose of super-resolution.
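The following PyTorch sketch implements Eq. (5) literally on a pair of feature maps (for example, the VGG19 features described above). The full contextual loss of Mechrez et al. adds normalization and a softmax-style weighting that we omit here, so this is an approximation under our own assumptions rather than the reference implementation.

import torch
import torch.nn.functional as F

def contextual_loss(feat_p: torch.Tensor, feat_q: torch.Tensor) -> torch.Tensor:
    """Eq. (5): mean over source feature points of the cosine distance to the
    nearest target feature point. feat_p, feat_q: (C, H, W) feature maps."""
    p = feat_p.flatten(1).t()                      # (N, C) feature points of P
    q = feat_q.flatten(1).t()                      # (M, C) feature points of Q
    # Centering on the target mean follows the original contextual-loss formulation.
    p = F.normalize(p - q.mean(0, keepdim=True), dim=1)
    q = F.normalize(q - q.mean(0, keepdim=True), dim=1)
    cos_dist = 1.0 - p @ q.t()                     # D_{p_i, q_j}, shape (N, M)
    return cos_dist.min(dim=1).values.mean()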

5. Experiments

5.1 Experiment setup

We randomly selected 6 HR-LR image pairs from the dataset proposed in Section 3 for testing. As shown in Fig. 8, the test images contain elements with rich texture details such as railings, stairs, guardrails and trees. The remaining 405 pairs are used as the training set; some examples are shown in Fig. 7. To increase the amount of training data, we rotated the training images by 90$^{\circ }$, 180$^{\circ }$ and 270$^{\circ }$, flipped them, and randomly shuffled them. For evaluation, we use the two common full-reference metrics PSNR and SSIM, computed in MATLAB: images are first converted to the YCbCr color space, and PSNR and SSIM are then calculated on the Y channel.
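The paper computes PSNR and SSIM in MATLAB; as an illustration, the following Python sketch shows an equivalent Y-channel evaluation using scikit-image, with an ITU-R BT.601 luma conversion matching MATLAB's rgb2ycbcr for 8-bit inputs. The function names are ours.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Y channel of YCbCr (ITU-R BT.601), for uint8 RGB input in [0, 255]."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def evaluate_pair(sr: np.ndarray, hr: np.ndarray):
    """PSNR and SSIM computed on the Y channel only."""
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    psnr = peak_signal_noise_ratio(y_hr, y_sr, data_range=255)
    ssim = structural_similarity(y_hr, y_sr, data_range=255)
    return psnr, ssim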

Fig. 7. Eight examples of data in the training set

Fig. 8. Presentation of the test set

In training, the batch size is set to 8 and the number of epochs to 300. We use the Adam optimizer instead of SGD, with an initial learning rate of $10^{-4}$. Our hardware platform is an Intel Core i7-7700K CPU @ 4.20 GHz $\times$ 8 with an NVIDIA RTX 2080 Ti graphics card.
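The setup above corresponds to a conventional supervised training loop such as the following sketch; InfraredSRDataset, model, and total_loss are illustrative placeholders rather than names from the paper.

import torch
from torch.utils.data import DataLoader

# Hyperparameters as described above: batch size 8, 300 epochs, Adam with lr = 1e-4.
train_loader = DataLoader(InfraredSRDataset("train"), batch_size=8, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(300):
    for lr_img, hr_img in train_loader:
        sr = model(lr_img)
        loss = total_loss(sr, hr_img)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()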

5.2 Limitation of the visible light dataset

At first, we considered training our network with existing high-quality real-scene visible-light super-resolution datasets such as RealSR [15], DRealSR [25], Imagepairs [26] and SupER [27]. Visible images have rich texture details, which could strengthen the supervision of low-resolution infrared images that lack such details. However, experiments showed that these datasets are not suitable for a super-resolution network targeting low-resolution infrared images: the excessively rich detail textures of visible images make the inference results too sharp and distorted, as shown in Fig. 9.

Fig. 9. The left image shows the result of the network trained on RealSR; the middle one is enlarged by bicubic interpolation; the right one is the ground truth.

In Fig. 9, the left image shows the inference result after training on the RealSR dataset, the middle image shows the result of bicubic interpolation, and the right image shows the ground truth. The region in the red box of the left image is severely distorted at the railing in the middle of the stairs. The model trained on the visible-light super-resolution dataset also fails to achieve satisfactory quantitative results. We believe the degradation process of a camera collecting visible data is inconsistent with that of an infrared thermal imager, so the two types of datasets are not interchangeable. A new infrared super-resolution dataset is therefore needed to improve real-world infrared image super-resolution.

5.3 Training strategy

Although the previous section showed that training on visible images alone cannot reach the desired result, we believe that the strong supervision provided by visible images with rich detail textures can still equip the network with a strong ability to generate details. The resolution of the infrared thermal imager is much lower than that of a DSLR camera: constrained by its structure, an infrared image is 640×512, i.e., 327,680 pixels in total, whereas images taken by ordinary DSLR cameras generally exceed 20 million pixels. In other words, for the same number of data pairs, the network sees roughly 70 times more pixels when trained on visible images than on infrared images.

We therefore use a visible-light dataset for pre-training to take full advantage of its large data volume and compensate for the difficulty of infrared image acquisition. Starting from this pre-trained model, the real-world infrared dataset proposed in Section 3 is then used to fine-tune the network and adapt it to infrared scenes.

As shown in Fig. 10, image (a) is the result of training the network directly on our dataset; the details are very blurry because the training set is small. Image (b) shows that the over-sharpening and distortion of the RealSR-trained model persist. Image (c) shows that fine-tuning after RealSR pre-training yields a clear improvement. Image (d) is the ground truth.

Fig. 10. Results before and after fine-tuning.

5.4 Ablation study

5.4.1 Data sets

To demonstrate the effectiveness of the proposed dataset, we perform the following ablation experiments. Most advanced networks have previously been trained on DIV2K, so to demonstrate the effectiveness of Real-InfraredSR we compare the datasets on different networks. In addition to DIV2K, which represents synthetic data, we also choose the RealSR dataset, representing real-world data, for comparison. For the networks, we use EDSR and our network. For fairness, the amount of training data is kept roughly equal: a single DIV2K image is about 8 times larger than an infrared image, so 202 DIV2K images were randomly selected, giving a total of 1106 MB of training data, versus 1124.6 MB for Real-InfraredSR after augmentation. A total of 6 SISR models were produced in the final experiment.

As shown in Fig. 11, the qualitative results again support the view put forward in Section 5.2 that a real-world visible-light super-resolution dataset is not applicable to real infrared scenes. From the quantitative results in Table 2, its PSNR and SSIM are also the lowest for both EDSR and our method: PSNR is 2-3 dB lower than the DIV2K and Real-InfraredSR training results, and SSIM is 0.03-0.04 lower. The reason DIV2K, a synthetic visible dataset based on interpolation, still performs reasonably well numerically (PSNR and SSIM only about 1 dB lower than the results obtained with Real-InfraredSR) is that bicubic interpolation approximates the degradation of actual images to some extent; nevertheless, for the complex degradation of real scenes it still deviates considerably from the true degradation process. This ablation experiment further demonstrates that the field of infrared super-resolution needs a dedicated dataset to support super-resolution in real scenes.

Fig. 11. Performance of different data sets on the same algorithm

Table 2. Comparison of training results of different datasets at 2X magnification

Table 3. Performance of different loss functions on the same algorithm

5.4.2 Network structure

To validate the effectiveness of the proposed network architecture, we conducted the following ablation experiment. We separately output the feature maps after passing through sixteen original residual blocks and through sixteen of the proposed asymmetric residual blocks. Heatmaps of the activation values are used to display the response intensity of different regions. The comparative results are shown in Fig. 12.

Fig. 12. The upper row shows the results obtained after training with the proposed asymmetric residual block structure; the lower row shows the results obtained with the original residual blocks.

As shown in Table 2, our network improves both PSNR and SSIM compared with the original network. The comparison of the feature maps shows that our proposed architecture yields richer texture details in the feature maps. We believe this contributes greatly to improving image details, which is particularly critical for low-resolution infrared images.

5.4.3 Loss function

As mentioned above, we introduce perceptual loss and contextual loss to improve visual quality and overcome the slight misalignment in the low-resolution images. To demonstrate the effectiveness of this approach, we conduct the following ablation experiments on the loss functions. The training plan consists of four runs, using the original $L_1$ loss, $L_1$ loss combined with contextual loss, $L_1$ loss combined with perceptual loss, and $L_1$ loss combined with both contextual and perceptual losses, respectively, to train our network. We then show the impact of the different loss functions on the test set. The comparison is illustrated in Fig. 13.

From Fig. 13, it can be observed that excluding the contextual loss during the training process leads to relatively blurry image details. This is attributed to the minor misalignment during the registration process, which prevents perfect alignment of edge details between high and low-resolution images. Furthermore, when the training process lacks perceptual loss, there is a slight decrease in image contrast, which hinders the rapid discernment of fine-textured details by the human eye. Finally, when both of these loss functions are incorporated into the training process, significant improvements in image details, especially in the edge regions, are observed.

Fig. 13. Visual comparison of results obtained with different loss functions.

As shown in Table 3, the inclusion of perceptual loss has a significant impact on the improvement of PSNR and SSIM, whereas incorporating contextual loss slightly decreases them. However, since contextual loss effectively enhances the recovery of image details, which aligns with the objective of super-resolution, we ultimately include it.

5.5 Comparison to the state-of-the-art

To show that the proposed scheme performs well for real-scene infrared image super-resolution, the following comparative experiments were also carried out. We compare against past SOTA algorithms, including SRCNN and EDSR, as well as algorithms that have achieved excellent results in recent years: ZSSR [28], DBPI [29], BSRGAN [30] and DualSR [31], which are SOTA methods in blind super-resolution; MST++ [32], which is Transformer-based; and MFDRN [33], which was proposed this year.

Table 4 shows the quantitative performance of the different algorithms on the test set, with the top two results highlighted. Overall, the proposed method consistently outperforms the state-of-the-art (SOTA) algorithms in PSNR and SSIM, surpassing even the recently introduced MFDRN on most metrics. Compared with ZSSR, DBPI, BSRGAN and DualSR, which have achieved strong results in blind super-resolution in recent years, the PSNR improvement is about 4 dB. Compared with MST++, which uses a Transformer structure to achieve strong reconstruction performance, our method also compares favorably.

Table 4. Numerical comparison of different methods on the same test set

From a qualitative perspective, our method achieves the best visual results in reconstructing texture details in Figs. 14-16. Particularly in areas with high texture density, such as railings and stairs, our method excels at recovering fine details without introducing excessive noise.

Fig. 14. Visual comparison of different algorithms.

Fig. 15. Visual comparison of different algorithms.

Fig. 16. Visual comparison of different algorithms.

Moreover, to demonstrate the practicality of the proposed lightweight network, we compared it with MST++, the best-performing algorithm among the comparison methods. Our network has only 246K parameters, while MST++ has 6.93M. On a PC, our algorithm processes a single image in 0.0024 s on average, whereas MST++ takes 0.099 s; thus MST++, unlike our network, cannot meet the lightweight requirement for real-time processing. We also implemented the algorithm on the NVIDIA Jetson TX2 platform, achieving a processing speed of approximately 25 frames per second, which meets the real-time image processing requirements of an embedded platform. Taking all these factors into consideration, we believe the algorithm proposed in this paper has strong practical value.

6. Conclusion

Inspired by work on blind image super-resolution in visible light, we extend the research to infrared images. We use a refrigerated infrared thermal imager with a 2X-zoom lens to shoot hundreds of pairs of infrared images, design a registration algorithm, and create a dataset of 411 pairs of real-world infrared images. Experiments verify that the degradation process of visible images cannot be applied directly to infrared images, although visible-light training results can serve as a pre-training model. We also demonstrate the effectiveness of our dataset with different algorithms. Furthermore, experimental results show that our algorithm improves accuracy while reducing the number of parameters. Finally, compared with a series of previous SOTA algorithms and several excellent blind super-resolution algorithms, our results are closer to the real-world images. In future work, we plan to expand the dataset and increase the scale factor to form a more comprehensive dataset.

Funding

Fundamental Research Funds for the Central Universities (Grant no. 30919011401, 30922010204, 30922010718, JSGP202202); National Natural Science Foundation of China (Grant no. 62105152).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant no. 62105152), Fundamental Research Funds for the Central Universities (Grant no. 30919011401, 30922010204, 30922010718, JSGP202202), the Excellent Member of Youth Innovation Promotion Association CAS (No. Y2021071, No. Y202058), the National Defense Pre-Research Foundation of China during the 14th Five-Year Plan Period (D040107), the 173 Key Projects of Basic Research (2021-JCJO-ZD-025-11), Funds of the Key Laboratory of National Defense Science and Technology (Grant no:6142113210205), Leading Technology of Jiangsu Basic Research Plan (BK20192003).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. P. M. Shakeel, T. E. E. Tobely, H. Al-Feel, G. Manogaran, and S. Baskar, “Neural network based brain tumor detection using wireless infrared imaging sensor,” IEEE Access 7, 5577–5588 (2019). [CrossRef]  

2. M. W. Akram, G. Li, Y. Jin, X. Chen, C. Zhu, and A. Ahmad, “Automatic detection of photovoltaic module defects in infrared images with isolated and develop-model transfer deep learning,” Sol. Energy 198, 175–186 (2020). [CrossRef]  

3. K. B. Beć, J. Grabska, and C. W. Huck, “Near-infrared spectroscopy in bio-applications,” Molecules 25(12), 2948 (2020). [CrossRef]  

4. C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, (Springer, 2014), pp. 184–199.

5. R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Trans. Acoust., Speech, Signal Process. 29(6), 1153–1160 (1981). [CrossRef]  

6. H. Chen, X. He, L. Qing, Y. Wu, C. Ren, R. E. Sheriff, and C. Zhu, “Real-world single image super-resolution: A brief review,” Inf. Fusion 79, 124–145 (2022). [CrossRef]  

7. A. Liu, Y. Liu, J. Gu, Y. Qiao, and C. Dong, “Blind image super-resolution: A survey and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).

8. Y. Zou, L. Zhang, C. Liu, B. Wang, Y. Hu, and Q. Chen, “Super-resolution reconstruction of infrared images based on a convolutional neural network with skip connections,” Opt. Lasers Eng. 146, 106717 (2021). [CrossRef]  

9. R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia, “Video super-resolution via deep draft-ensemble learning,” in Proceedings of the IEEE international conference on computer vision, (2015), pp. 531–539.

10. B. Wronski, I. Garcia-Dorado, M. Ernst, D. Kelly, M. Krainin, C.-K. Liang, M. Levoy, and P. Milanfar, “Handheld multi-frame super-resolution,” ACM Trans. Graph. 38(4), 1–18 (2019). [CrossRef]  

11. C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, (Springer, 2016), pp. 391–407.

12. J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 1646–1654.

13. C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, (Springer, 2014), pp. 372–386.

14. C. Ledig, L. Theis, and F. Huszár, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 4681–4690.

15. J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang, “Toward real-world single image super-resolution: A new benchmark and a new model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 3086–3095.

16. X. Zhang, Q. Chen, R. Ng, and V. Koltun, “Zoom to learn, learn to zoom,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 3762–3770.

17. Y. Zhou, “A dataset for real-world infrared image super-resolution,” figshare, 2023, https://doi.org/10.6084/m9.figshare.22777358.v1

18. D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comp. Vis. 60(2), 91–110 (2004). [CrossRef]  

19. M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM 24(6), 381–395 (1981). [CrossRef]  

20. B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, (2017), pp. 136–144.

21. D. Glasner, S. Bagon, and M. Irani, “Super-resolution from a single image,” in 2009 IEEE 12th international conference on computer vision, (IEEE, 2009), pp. 349–356.

22. X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in Proceedings of the European conference on computer vision (ECCV) workshops, (2018), pp. 0–0.

23. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, (Springer, 2016), pp. 694–711.

24. R. Mechrez, I. Talmi, and L. Zelnik-Manor, “The contextual loss for image transformation with non-aligned data,” in Proceedings of the European conference on computer vision (ECCV), (2018), pp. 768–783.

25. P. Wei, Z. Xie, H. Lu, Z. Zhan, Q. Ye, W. Zuo, and L. Lin, “Component divide-and-conquer for real-world image super-resolution,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, (Springer, 2020), pp. 101–117.

26. H. R. V. Joze, I. Zharkov, K. Powell, C. Ringler, L. Liang, A. Roulston, M. Lutz, and V. Pradeep, “Imagepairs: Realistic super resolution dataset via beam splitter camera rig,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, (2020), pp. 518–519.

27. T. Köhler, M. Bätz, F. Naderi, A. Kaup, A. Maier, and C. Riess, “Toward bridging the simulated-to-real gap: Benchmarking super-resolution on real data,” IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 2944–2959 (2019). [CrossRef]  

28. A. Shocher, N. Cohen, and M. Irani, ““zero-shot” super-resolution using deep internal learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 3118–3126.

29. J. Kim, C. Jung, and C. Kim, “Dual back-projection-based internal learning for blind super-resolution,” IEEE Signal Process. Lett. 27, 1190–1194 (2020). [CrossRef]  

30. K. Zhang, J. Liang, L. Van Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 4791–4800.

31. M. Emad, M. Peemen, and H. Corporaal, “Dualsr: Zero-shot dual learning for real-world super-resolution,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, (2021), pp. 1630–1639.

32. Y. Cai, J. Lin, Z. Lin, H. Wang, Y. Zhang, H. Pfister, R. Timofte, and L. Van Gool, “Mst++: Multi-stage spectral-wise transformer for efficient spectral reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 745–755.

33. H. Wu, X. Hao, J. Wu, H. Xiao, C. He, and S. Yin, “Deep learning-based image super-resolution restoration for mobile infrared imaging system,” Infrared Phys. Technol. 132, 104762 (2023). [CrossRef]  
