PI-NLOS: polarized infrared non-line-of-sight imaging

Hao Liu; Hao Liu; Pengfei Wang; Xin He; Mingyang Chen; Mengge Liu; Ziqin Xu; Xiaoheng Jiang; Xin Peng; Mingliang Xu; Mingliang Xu; Mingliang Xu

doi:10.1364/OE.507875

1. Introduction

NLOS imaging technique is capable to percept the scene hidden behind the wall and objects out of the line-of-sight by computing the light field reflected from the relay wall, so during the latest decade, it attracts dramatic interests in the applications of autonomous driving perception, public security, survivor rescue, remote sensing, etc.

According to whether a controllable light source is used, as shown in Fig. 1, NLOS imaging can be divided into active imaging [1–7] and passive imaging [8–12]. Among them, active NLOS imaging illuminates the surface of the relay wall with an active laser light source, and an ultra-fast, single-photon sensitive detector is used to capture the three-bounce light. Previous works proposed many methods to reconstruct hidden targets, including light-cone transform (LCT) [13], f-k [14], Fermat paths [15], phasor-field virtual wave [1], prior knowledge [16], speckle correlation [17], point spread functions [18], deep learning [19]. However, due to the advantages of low-cost, real-time, portability, and non-invasion, passive non-line of sight imaging (NLOS) imaging is a promising technique to improve the field performance of visual perception for the occluded object hidden behind the wall.

Fig. 1. An overview of active and passive NLOS imaging.

Download Full Size | PDF

Passive NLOS imaging is extremely challenging because it is a highly ill-posed inverse problem. The limited information hardly lead to a high-quality reconstruction, and the uncontrollable illumination from the ambient light source even increase the difficulty of passive NLOS imaging in the real-world application. Several methods are proposed in previous works to reduce the condition number for better recovery quality, including optimizing the optical transmission from the shadows created by the occluder [8,20], using polarization to reduce the condition number of the optical transmission matrix [10], using LWIR to obtain better perception for NLOS targets [9,21,22] and deep learning method [23,24]. Among them, the polarized cue is firstly applied in the field of NLOS imaging in Ref. [10], and improve the imaging quality. Polarization image encodes information about roughness, orientation, and reflection of the objects [25], and this physics-inspired modality is of significant help to solve the ill-posed problem in passive NLOS imaging, enhancing the NLOS scene reconstruction. The superior representation of image features by deep neural networks can greatly improve the reconstruction resolution, imaging distance and reduce the reconstruction time. However, when applying polarization and deep learning to passive NLOS imaging, there are still several challenges in these areas. Firstly, existing polarized NLOS imaging methods [10] utilize information from the visible (VIS) band. Although the reconstruction is enhanced in contrast to the conventional passive NLOS imaging without polarization cue, the PSNR and SSIM still need improvements before pushing the technique to the application. Moreover, previous works only used the single angle polarization, it is effective for the simple scene reconstruction, but fails in the complex scenes with several meters away from the relay wall. Secondly, existing deep learning methods utilize the single-input U-Net [26] as the basic structure, which is a well-established network structure for conventional vision application such as image deblurring. However, the NLOS objects reconstruction is different from conventional tasks.

To address the above challenges, this paper present a cross-modal learning-based method to obtain high-quality scene recovery hidden from the line-of-sight. We build a polarized LWIR NLOS imaging platform, and create a dataset as we show in Dataset 1 [27] with the modality of intensity and polarization information. A polarized infrared NLOS imaging deep neural network (DNN) is proposed here to combines the advantages of polarization intensity image and polarization degree image. The quantitative results show satisfied imaging quality for perception of the people around the corner.

2. Related works

2.1 Passive NLOS imaging

The absence of a controllable illumination source for passive NLOS imaging makes the acquired images susceptible to interference from the external environment, resulting in a dramatic decrease in the effective signals of the reconstructed images.

To solve the problem of high complexity of light transmission during NLOS imaging, many previous attempts have been made. These include the use of prior knowledge from partial occluders [8], polarization [10], and spectral information [12] to enhance the light transmission process and reduce the condition number, thus enhancing the reconstructed image. Because of the excellent representation of features by deep neural networks, deep learning-based passive NLOS imaging methods promise to an alternative approach with high imaging quality. Wang et al. [28] used ResNet network [29] to achieve fast and direct recognition and classification of hidden objects. Chen et al. [11] used NLOS-OT network to convert the reconstruction task to low-dimensional stream mapping by optimal transmission. For LWIR with stronger reflective properties for targets with thermal radiation, Kaga et al. [30] used the object’s own thermal radiation to recover the object’s position and temperature from the angular distribution of specular and diffuse reflections. Maeda et al. [9] re-customized the optical transmission framework of LWIR by using the heat source of the hidden object as a light source to demonstrate real-time 2D shape and 3D localization of the hidden object. It demonstrates that the passive LWIR NLOS imaging quality is better than VIS imaging. The wavelength of visible light is at least 1$\sim$2 orders of magnitude smaller than the surface roughness of the materials in the daily life which can be regarded as the relay wall, so the diffused reflection dominates in the VIS NLOS imaging. While for the LWIR band, the longer wavelength mitigates the percentage of diffused reflection during the NLOS process, the reconstruction therefore is less difficult in theory. Polarization is also an useful modality to be expected to enhance the imaging quality. Tanaka et al. [10] firstly proved that the polarization information indeed improves the reconstruction in the visible band. Although the incident light is non-polarized, the reflection can be sensitive to the polarization, and the polarization cue offers extra features to benefit the high quality reconstruction if appropriate multi-modal fusion method is employed to fully utilize these information.

2.2 NLOS imaging dataset

Deep neural networks have a powerful representation of features, but the networks are extremely data-driven dependent. To solve this problem, many attempts have been made in previous works. Geng et al. [11] created visible passive NLOS imaging datasets using existing datasets MNIST [31], animation data [32], and STL-10 [33]. The ground-truth images of these existing datasets displayed on a screen are regarded as the imaging targets, and a VIS camera records the reflected images from the relay wall around the corner. Zhou et al. [24] generated simulated datasets using a simulated data transfer process. Wang et al. [28] used MNIST [31] to create simulated NLOS data of handwritten digits.

3. Experimental setup and data acquisition

The schematic diagram of the experiments and data acquisition scenario is shown in Fig. 2. The relay wall is placed at the corner and the hidden human is located at $\sim$ 6 meters (m) away from the relay wall. The relay wall is the common-used plywood plank with surface roughness $\sim$ 30 microns, which is larger than the LWIR wavelength. Thus, the LWIR scattering on the relay wall contains both specular and diffused reflection.

Fig. 2. The schematic diagram of PI-NLOS imaging experiments and data acquisition scene, where the blue line represents the light transmission direction of NLOS imaging device and the red line represents the light transmission acquired by the GT camera.

Download Full Size | PDF

The signals representing the target are reflected by the relay wall, and captured by a self-designed LWIR camera coupled with a four-quadrant polarization device which consists of four pieces of LWIR polarizer with polarization degree of $0^{\circ }$, $45^{\circ }$, $90^{\circ }$, and $135^{\circ }$. The LWIR camera is 640*512 pixels in which each pixel size is 17 microns. The NETD is < 25 mK, and the response band is 8 $\sim$ 14 microns.

An LWIR camera as the ground-truth (GT) camera, directly measures the hidden objects as placed in the Fig. 2. The tiny difference due to slight perspective deviation is neglected because the angle between the red line and the blue line between the target and relay wall is approximate 85 mrad.

Our dataset collects data of various types of human poses. Every group contains GT images of target in the field of view (the red line in the Fig. 2), and the raw images of polarized LWIR NLOS imaging (the blue line in the Fig. 2) with polarization degree of $0^{\circ }$, $45^{\circ }$, $90^{\circ }$, and $135^{\circ }$. The dataset now has approximate 2,000 groups images, which are 10,000 gray-scale images, in total. In order to create different scene for the generality of the DNN model trained on the dataset, we also added additional elements such as chairs, water cups, hot water kettles, and other handheld items such as toy guns, violins, and rackets in the hands of the target.

For better image fusion by using cross modalities of both polarization and LWIR intensity, we utilize the raw images from the measurements to generate polarization intensity($\mathit {S} _{0}$) images and polarization degree (DoLP) images according to the definition of Stokes vector [34]. An example of images in the dataset is shown in Fig. 3.

Fig. 3. Demonstration of polarized LWIR NLOS imaging dataset.(a) $0^{\circ }$ polarized NLOS image (b) $45^{\circ }$ polarized NLOS image (c) $90^{\circ }$ polarized NLOS images (d) $135^{\circ }$ polarized NLOS image (e) $\mathit {S} _{0}$ image (f) DoLP image (g) ground-truth image

Download Full Size | PDF

4. Methods

4.1 LWIR polarization of the reflection

Priest-Germer bidirectional reflectance distribution function (BRDF) model which utilizes a 4 $\times$ 4 element Mueller matrix for the Fresnel scattering factor to predict a polarized BRDF takes the following form

(1)$$f_{PG}(\theta_i,\varphi_i,\theta_r,\varphi_r,\lambda) = \frac{1}{2\pi} \frac{1}{4\sigma^2}\frac{1}{cos^4\theta}\frac{exp(-(tan^2\theta/2\sigma^2))}{cos(\theta_i)cos(\theta_r)}M(\theta_i,\varphi_i,\theta_r,\varphi_r)$$

where $\lambda$ is the wavelength, $\sigma$ is the surface roughness parameter, $theta$ is the angle between the micro-facet normal $q$ and macro surface normal $z$ in the Fig. 4. $\theta _i$ and $\varphi _i$ respectively represent zenith and azimuth angle of the incident direction. $\theta _r$ and $\varphi _r$ respectively represent zenith and azimuth angle of the reflection. $M$ is the Mueller scattering matrix based on the $s$ and $p$ Fresnel reflection coefficients.

Fig. 4. Sketch diagram of BRDF model.

Download Full Size | PDF

In the passive NLOS scenario, we assume that the incident LWIR light on the relay wall is unpolarized, so the Stokes vector of incident $E_i$ = $[1,0,0,0]^T$, and the reflection $E_r$ is

(2)$$E_r=\int_{0}^{2\pi} \int_{0}^{\pi/2} f_{PG}(\theta_i,\varphi_i,\theta_r,\varphi_r,\lambda)sin\theta_r cos\theta_r d\theta_r d\varphi_r \cdot E_i$$

The four elements of the Mueller matrix $[m_{00},m_{10},m_{20},m_{30}]^T$ can be obtained by the Fresnel equation. In the NLOS reflection, the amount of circular polarization of LWIR from the target is often so weak that $m_{30}$ can be regarded as approximate zero.

(3)$$\left[ \begin{array}{c} m_{00}\\ m_{10}\\m_{20}\\m_{30} \end{array} \right] = \left[ \begin{array}{c} R_s(\lambda,\theta_i)+R_p(\lambda,\theta_i) \\ cos(2\eta_i)R_s(\lambda,\theta_i)- R_p(\lambda,\theta_i) \\ sin(2\eta_i)R_p(\lambda,\theta_i)- R_s(\lambda,\theta_i) \\ 0 \end{array} \right]$$

where $R_p$ and $R_s$ is the reflectivity of $p$ and $s$ direction, respectively, and can be expressed by the Fresnel equation.

Based on the Eqs. (1) $\sim$ (3),

(4)$$E_r = \left[ \begin{array}{@{}c@{}} S_0\\ S_1\\S_2\\S_3 \end{array} \right] = \frac{1}{8\pi\sigma^2}\int_{0}^{2\pi} \int_{0}^{\pi/2}\frac{sin\theta_r}{cos^4\theta}\frac{exp(-(tan^2\theta/2\sigma^2))}{cos(\theta_i) }\cdot \left[ \begin{array}{@{}c@{}} R_s(\lambda,\theta_i)+R_p(\lambda,\theta_i) \\ cos(2\eta_i)R_s(\lambda,\theta_i)- R_p(\lambda,\theta_i) \\ sin(2\eta_i)R_p(\lambda,\theta_i)- R_s(\lambda,\theta_i) \\ 0 \end{array} \right] d\theta_r d\varphi_r$$

Thus, according to the polarization vector of the reflection $E_r$, after reflection on the relay wall surface, the unpolarized LWIR light becomes to be sensitive to the degree of polarization. The polarization information provides additional modality apart from intensity images to benefit the computational reconstruction with better quality. These improved images with distinct contour and clearer texture details are of significant importance for the recognition, classification, segmentation, etc. of the machine vision tasks in applications including autonomous driving, security, survivor rescue, etc. which expect real time and accurate perception results under the limited computing resources in the compact devices.

4.2 Network architecture

Our proposed DNN makes full use of the intensity features of the NLOS $\mathit {S} _{0}$ image and the gradient features of the DoLP image. Figure 5 shows the architecture of the deep learning based method. The inputs are $\mathit {S} _{0}$ images and DoLP images in the LWIR band. The network is mainly composed of three decoders, encoders and attention fusion modules. According to the coarse-to-fine strategy, the encoder module and the fused attention module are used to continuously deepen the extraction of two branch image features and fused features to increase the perceptual field. Then the output of the decoder module is weighted by the encoder module to recover the sharpness of the image, and finally the reconstructed image is output in an end-to-end manner.

Fig. 5. The network architecture diagram.

Download Full Size | PDF

4.3 Encoder

In the encoding stage, we use different branches to perform the same processing for both polarized NLOS images. The method is a dual-input DNN, so in contrast to perform simple feature extraction and feature concatenation after downsampling of the inputs, We pass each step of feature extraction and downsampling from both branches through a fusion module to fully fuse the features from both branches and output the fused features to the decoder section.

Taking the polarization intensity $\mathit {S} _{0}$ channel feature processing as an example, we first use a 3$\times$3 conv and a residual module (Resblock) to extract shallow features, the Resblock module is shown in Fig. 6(c). Then a 3$\times$3 conv and a feature attention module FA are used to emphasize or suppress the output features from the previous module and highlight the target region of interest, the structure of FA is taken from [35], see Fig. 6(b). A Resblock is then added to perform deep feature extraction. The feature extraction results of both branches are subsequently obtained after an identical feature extraction and downsampling module.

Fig. 6. A detailed description of the main modules in our network, where (a) FRA module is mainly used for feature fusion, and (b) FA module and (c) Resblock are mainly used for feature extraction.

Download Full Size | PDF

In the feature fusion part, instead of simply concatenating the results of the two branches, we fuse the output features of each encoder by our carefully-designed attention fusion module (FRA), the structure of which is shown in Fig. 6(a). Instead of performing feature fusion before inputting to the decoder as in previous work, we perform channel-by-channel feature fusion on the results of each feature extraction during the encoder phase, thereby preserving the feature correlation between different channels and reducing feature degradation. This increases the nonlinearity of the network, enhances the network’s expression ability, increases information utilization, and improves the model’s generalization ability.

The FRA is a channel-by-channel fusion module. For the convenience of explanation, we assume the inputs of the two channels as $A$ and $B$, where $A$ represents the input of the polarization intensity $\mathit {S} _{0}$ channel, and $B$ represents the input of the polarization degree DoLP channel. The feature fusion process is detailed in Fig. 7.

Fig. 7. The process of channel-by-channel fusion of two input features $EB_{sk}^{\text {out}}$ and $EB_{dk}^{\text {out }}$

Download Full Size | PDF

The size of both inputs is 256*256*32. We assume the $i-th$ channel of $A$ and $B$ as $A(i)$ and $B(i)$ respectively, and obtain a matrix with unchanged size and one channel through value segmentation in algorithm loop. We set the result of concatenating $A(i)$ and $B(i)$ as $C(i)$. The output of a single channel in the FRA module is set as $D(i)$. The final fusion result is set as E. The value range of $i$ is 0-31.

As shown in the Fig. 6 and Fig. 7, after obtaining the inputs $A$ and $B$ from two channels, we use algorithm loops to segment the features A and B to obtain $A(i)$ and B(i), with the size of $A(i)$ and $B(i)$ changing to 256*256*1. Then we concatenate $A(i)$ and $B(i)$ to obtain $C(i)$, with the size of $C(i)$ being 256*256*2. Next, we apply three convolutional operations, with the convolutional kernels of the three layers being 3*3*2, 3*3*2, and 3*3*1, respectively. After convolution, we apply an FA module, where the size of the image remains the same, but the number of channels changes from 2 to 1. After the above operations, we obtain $D(i)$ with the size of 256*256*1. This process is repeated until the loop ends, and we concatenate each $D(i)$ to obtain $D$ with the size of 256*256*32, which completes the channel-by-channel fusion operation through channel splitting and combination. After that, we use remote connection to weight the original features to prevent feature degradation, with the specific formula shown as follows:

(5) $$E = A + B + C$$

(6) $$D = For(Cat(D(i)))$$

(7)$$D(i) = Conv1(Cat(A(i),B(i))) \times Conv2(A(i),B(i))$$

where For() is for loop method, Cat() is the cat function in PyTorch, Conv1() represents three layers of convolution, and Conv2() represents the convolution function in the FA module.

4.4 Decoder

In the decoding stage, we perform feature reconstruction on the fused features of the upper input through downlink channel operation, and at the same time, in order to prevent feature degradation and gradient disappearance problems during the training process, we also perform element-wise summation of the multiscale features of the first two encoder modules (EB) with the corresponding scale features of the decoder module through remote connection, and finally end-to-end The reconstructed images are finally output in an end-to-end manner. The image reconstruction at each level can be expressed by the following Eq. (8).

(8)$$\hat{R}_{n}=\left\{\begin{array}{ll} \mathrm{DB}_{n}\left(\mathrm{F}_{3}^{\text{out }} \right), & n=1, \\ \mathrm{DB}_{n}\left(\mathrm{DB}_{n-1}^{\text{out }}\right)+\mathrm{EB}_{s2}^{\text{out }}+\mathrm{EB}_{d2}^{\text{out }}, & n=2,\\ \mathrm{DB}_{n}\left(\mathrm{DB}_{n-1}^{\text{out }}\right)+\mathrm{EB}_{s1}^{\text{out }}+\mathrm{EB}_{d1}^{\text{out }}, & n=3, \end{array}\right.$$

where $F_{3}^{\text {out }}$, $DB_{n}$ are the fusion features of $F_{3}$output and nth DB module respectively. $EB_{s1}^{\text {out }}$, $EB_{s2}^{\text {out }}$, $EB_{d1}^{\text {out }}$, $EB_{d2}^{\text {out }}$ are the output features of the two branch EB encoder modules.

4.5 Loss function

Since the network takes both polarized intensity $\mathit {S} _{0}$ images and polarization degree (Dolp) images as inputs, where $\mathit {S} _{0}$ images can represent the intensity characteristics of the images and Dolp images are more sensitive to gradients, our loss function is designed to take into account the characteristics of the input images. The proposed DNN utilizes the fused features of both mainly for the reconstruction of human targets, so the loss function is mainly composed of two parts.

The first part of the loss function is set to pixel content loss. Since the polarization intensity image mainly provides intensity information but not color information, in our experiments, we are using the L1 loss [36] to calculate the difference between the reconstructed image and the true image because of the L1 loss robustness to abnormal points. The following Eq. (9) is the definition of the $Loss_{pixel}$ .

(9)$$Loss_{pixel} = \frac{1}{H\times W} \sum_{i=1}^{H\times W} \bigg| y_{label} - y_{pred}\bigg|$$

where H and W are the number of pixels of the height and width of the image, and $y_{label}$ and $y_{pred}$ represent the true image and the reconstructed image, respectively.

The second part of the loss function is the image gradient loss [36] calculated using the sobel operator. Since polarization images have relatively prominent image gradient features, the designed gradient loss function can make higher image gradient in the reconstructed image. In other words, we optimize the limit of disparity between the edge features of the reconstructed image and the true image to make the image have a more compliant image structure. $Loss_{grad}$ is defined as follows Eq. (10) .

(10)$$Loss_{grad} = \frac{1}{H\times W} \sum_{i=1}^{H\times W} \bigg| |{\nabla} y_{label}| - |{\nabla} y_{pred}|\bigg|$$

where ${\nabla} y_{label}$ and ${\nabla} y_{pred}$ represent the gradients of the predicted image and the true image obtained by convolution of the sobel operator.

Ultimately, a joint loss function is defined to guide the network training, as shown in Eq. (11) .

(11)$$Loss_{total} = Loss_{pixel} + \lambda Loss_{grad}$$

where $\lambda$ is the hyperparameter to balance the content loss and gradient loss, and after repeated experiments, we set it to 0.1.

5. Results and discussion

We train our DNN model on our PI-NLOS dataset. The training set, test set and validation set use 4800, 900 and 300 pairs of raw images from NLOS imaging measurements and GT images, respectively. For each iteration, we crop our images to 256$\times$256. To make the network fully converge, we set the epoch to 2000 and the learning rate is initially set to $10^{-4}$, decreasing by a factor of 0.5 every 500 epochs. Our experiments are performed on an i7 Intel- Xeno W-2275 CPU and an NVIDIA RTX A4000.

The common-used quantitative metrics of peak signal-to-noise ratio (PSNR) and image structural similarity (SSIM) are used to evaluate the imaging quality of experimental results.

We compared our method with several existing deep learning-based NLOS imaging methods [11,24,37] to validate the superiority of the DNN model proposed in this article. Because all prior NLOS imaging methods are single-input model, during the quantitative comparison, we have changed the input of our DNN to generate three variants: (1) PI-$\mathit {S} _{0}$, where both feature extraction branches have $\mathit {S} _{0}$ single input, (2) PI-Dolp, where both feature extraction branches have Dolp single input, and (3) PI-Double, where both feature extraction branches have dual inputs including $\mathit {S} _{0}$ and Dolp images.

The results are shown in Fig. 8. Both PI-$\mathit {S} _{0}$ with the same input and PI-Double with different inputs achieve visually superior results. The proposed DNN method outputs better qualitative reconstruction quality from the perspective of the contour of target with different pose. The other objects in the surroundings, such as the mobile phone placed on the chair, letters held on the hand, the chair, etc., also achieve satisfied reconstruction.

Fig. 8. (a) represent the $\mathit {S} _{0}$ raw images measured by PI-NLOS imaging device, and (g) represent the ground truth images of the target. Reconstructed images outputted by several existing imaging methods including (b) CA-GAN [37], (c) Phong [24], (d) NLOS-OT [11], and our method (e) PI-Single,(f) PI-Double.

Download Full Size | PDF

The quantitative comparison of recent deep learning-based passive NLOS works in the Fig. 8 and our method are present in Table 1. All prior methods and PI-$\mathit {S} _{0}$ are trained and test by using $\mathit {S} _{0}$ images as single input on our dataset. The PI-double is the results of dual-channel model proposed in this work. The experimental results show our method superior to recent NLOS imaging methods in the metrics of both PSNR and SSIM. The LWIR intensity images contribute the global contour of the target, and the different material surfaces in the NLOS scene lead to distinct degree of polarization, so the polarization modality can provide clearer contour and local details. The proposed dual-input DNN architecture can extract the multi-modal features from each channel, and then fuse them to output high quality reconstruction results. The strategy complements the deficiencies of information in the single modality NLOS imaging methods. In perspective of reconstruction time, although our method did not achieve the best reconstruction efficiency, the reconstruction is only slightly slower than Phong [24] method.

Table 1. Quantitative metrics results of reconstructed images.

View Table | View all tables in this article

The reconstructed images by using the same methods with the alternative input of Dolp raw images are illustrated in Fig. 9, and the evaluation metrics are compared in the Table 2. The qualitative and quantitative results of the method in this article still present better performance. The values of PSNR and SSIM slightly decrease because the individual DoLP images are insensitive to the intensity information and resulting the pixel-wise intensity deviation of the reconstruction.

Fig. 9. (a) represent the Dolp raw images measured by PI-NLOS imaging device, and (g) represent the ground truth images of the target. Reconstructed images outputted by several existing imaging methods including (b) CA-GAN [37],(c) Phong [24],(d) NLOS-OT [11], and our method (e) PI-Single,(f) PI-Double.

Download Full Size | PDF

Table 2. Quantitative metrics results of recent NLOS imaging methods with DoLP as input and our method.

View Table | View all tables in this article

In order to validate the effectiveness of modules in the PI-NLOS DNN, we conduct the ablation study on the dataset. Firstly, we train the DNN model using the FRA, FA, and self-designed loss function, achieving the highest performance in Table 3. Next, we tested PI DNN without CBCF, SKAM, and/or MLoss to compare the PSNR value of the reconstruction.

Table 3. Ablation study for the self-designed modules of FRA, FA and loss function.

View Table | View all tables in this article

Table 3 demonstrates the contributions of each module. The baseline model without all three modules obtain an average PSNR of 19.35. The proposed method with all modules employed outperforms the baseline with significant improvements, and the PSNR value is 24.66. The absence of FRA, FA, and self-designed loss function leads to PSNR decline of 3.81 dB, 1.75 dB, 3.09 dB. These results reveal the effectiveness and superiority of our data-driven method and modules.

6. Conclusion

In this work, we propose a polarized LWIR NLOS method by utilizing a data-driven DNN model to fuse polarization information and intensity distribution of LWIR images. The experimental scene is established to measure and acquire paired polarized NLOS images and the corresponding ground-truth of the imaging target. The polarization intensity $\mathit {S} _{0}$ images and polarization degree Dolp images can be obtained from observed raw data. Compared with the existing NLOS datasets which are in VIS band, the self-build dataset has a total of 10,000 LWIR images containing four different degrees of polarization. Meanwhile, we propose a dual-input attention fusion reconstruction DNN, which extracts polarization intensity and polarization degree image features by two channels from coarse to fine.

Instead of simply contacting the features of the two channels as in many previous DNN models, the features are fused by a designed attention fusion module to highlight the common information of interest in both inputs. Polarized pattern and features provide cross-modality information. The fused features are subjected to feature reconstruction by a carefully designed decoder module, while remote connections are added to prevent feature degradation and gradient disappearance during the training process, and the reconstructed image is finally output. Three common quantitative metrics, which are SSIM and PSNR, are employed to evaluate the quality of reconstructed images. The qualitative and quantitative analysis for the experimental results demonstrate that the proposed data-driven model remarkably improves imaging quality, and therefore this PI-NLOS imaging method holds significance in the potential real-world applications, ranging from self-driving perception and medical imaging to public security.

Funding

National Natural Science Foundation of China (62272421, 62172371, U21B2037).

Acknowledgments

Hao Liu acknowledges the funding project of National Natural Science Foundation of China (No. 62272421). Xiaoheng Jiang thanks the National Natural Science Foundation of China (No. U21B2037 and 62172371)

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but will release within two weeks after the article is formally published upon reasonable request. The dataset link is Dataset 1, Ref. [27]. For more details, please contact the first author Hao Liu: HaoLiu1989@hotmail.com or the corresponding author.

References

1. X. Liu, I. Guillén, and M. La Manna, “Non-line-of-sight imaging using phasor-field virtual wave optics,” Nature 572(7771), 620–623 (2019). [CrossRef]

2. R. Cao, F. de Goumoens, and B. Blochet, “High-resolution non-line-of-sight imaging employing active focusing,” Nat. Photonics 16(6), 462–468 (2022). [CrossRef]

3. X. Feng and L. Gao, “Ultrafast light field tomography for snapshot transient and non-line-of-sight imaging,” Nat. Commun. 12(1), 1–9 (2021). [CrossRef]

4. C. A. Metzler, F. Heide, and P. Rangarajan, “Deep-inverse correlography: towards real-time high-resolution non-line-of-sight imaging,” Optica 7(1), 63–71 (2020). [CrossRef]

5. Y. Li, J. Peng, J. Ye, et al., “Nlost: Non-line-of-sight imaging with transformer,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2023), pp. 13313–13322.

6. C. Wu, J. Liu, and X. Huang, “Non–line-of-sight imaging over 1.43 km,” Proc. Natl. Acad. Sci. 118(10), e2024468118 (2021). [CrossRef]

7. H. Wu, S. Liu, and X. Meng, “Non-line-of-sight imaging based on an untrained deep decoder network,” Opt. Lett. 47(19), 5056–5059 (2022). [CrossRef]

8. C. Saunders, J. Murray-Bruce, and V. K. Goyal, “Computational periscopy with an ordinary digital camera,” Nature 565(7740), 472–475 (2019). [CrossRef]

9. T. Maeda, Y. Wang, R. Raskar, et al., “Thermal non-line-of-sight imaging,” in 2019 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2019), pp. 1–11.

10. K. Tanaka, Y. Mukaigawa, and A. Kadambi, “Polarized non-line-of-sight imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 2136–2145.

11. R. Geng, Y. Hu, Z. Lu, et al., “Passive non-line-of-sight imaging using optimal transport,” IEEE Trans. on Image Process. 31, 110–124 (2022). [CrossRef]

12. C. Hashemi, R. Avelar, and J. Leger, “Isolating signals in passive non-line-of-sight imaging using spectral content,” IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1–12 (2023).

13. M. O’Toole, D. B. Lindell, and G. Wetzstein, “Confocal non-line-of-sight imaging based on the light-cone transform,” Nature 555(7696), 338–341 (2018). [CrossRef]

14. M. Isogawa, D. Chan, Y. Yuan, et al., “Efficient non-line-of-sight imaging from transient sinograms,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, (Springer, 2020), pp. 193–208.

15. S. Xin, S. Nousias, K. N. Kutulakos, et al., “A theory of fermat paths for non-line-of-sight shape reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 6800–6809.

16. F. Heide, M. O’Toole, and K. Zang, “Non-line-of-sight imaging with partial occluders and surface normals,” ACM Trans. Graph. 38(3), 1–10 (2019). [CrossRef]

17. J. Grau Chopite, M. B. Hullin, M. Wand, et al., “Deep non-line-of-sight reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 960–969.

18. C. Pei, A. Zhang, and Y. Deng, “Dynamic non-line-of-sight imaging system based on the optimization of point spread functions,” Opt. Express 29(20), 32349–32364 (2021). [CrossRef]

19. F. Mu, S. Mo, J. Peng, et al., “Physics to the rescue: Deep non-line-of-sight reconstruction for high-speed imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).

20. A. B. Yedidia, M. Baradad, C. Thrampoulidis, et al., “Using unknown occluders to recover hidden scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 12231–12239.

21. T. Sasaki, C. Hashemi, and J. Leger, “Passive 3d location estimation of non-line-of-sight objects from a scattered infrared light field,” Opt. Express 29(26), 43642–43661 (2021). [CrossRef]

22. C. Hashemi, T. Sasaki, and J. Leger, “Parallax-driven denoising of passive non-line-of-sight thermal imagery,” in 2023 IEEE International Conference on Computational Photography (ICCP), (2023), pp. 1–12.

23. M. Aittala, P. Sharma, L. Murmann, et al., “Computational mirrors: Blind inverse light transport by deep matrix factorization,” Advances in Neural Information Processing Systems 32, 1 (2019).

24. C. Zhou, C.-Y. Wang, and Z. Liu, “Non-line-of-sight imaging off a phong surface through deep learning,” arXiv, arXiv:2005.00007 (2020). [CrossRef]

25. L. B. Wolff and A. G. Andreou, “Polarization camera sensors,” Image Vis. Comput. 13(6), 497–510 (1995). [CrossRef]

26. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, (Springer, 2015), pp. 234–241.

27. H. Liu, “Pi-nlos dataset,” github, (2023). https://github.com/Unconventional-Vision-Lab-ZZU/PI-NLOS, Last update on 2023-11-23.

28. Y. Wang, Y. Zhang, and M. Huang, “Accurate but fragile passive non-line-of-sight recognition,” Commun. Phys. 4(1), 88 (2021). [CrossRef]

29. K. He, X. Zhang, S. Ren, et al., “Identity mappings in deep residual networks,” in European conference on computer vision, (Springer, 2016), pp. 630–645.

30. M. Kaga, T. Kushida, and T. Takatani, “Thermal non-line-of-sight imaging from specular and diffuse reflections,” Trans. on Comput. Vis. Appl. 11(1), 8 (2019). [CrossRef]

31. Y. LeCun, “The mnist database of handwritten digits,” http://yann.lecun.com/exdb/mnist/ (1998).

32. G. Branwen and A. Gokaslan, “Danbooru2019: A large-scale crowdsourced and tagged anime illustration dataset,” (2019).

33. A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, (JMLR Workshop and Conference Proceedings, 2011), pp. 215–223.

34. J. S. Tyo, D. L. Goldstein, D. B. Chenault, et al., “Review of passive imaging polarimetry for remote sensing applications,” Appl. Opt. 45(22), 5453–5469 (2006). [CrossRef]

35. S.-J. Cho, S.-W. Ji, J.-P. Hong, et al., “Rethinking coarse-to-fine approach in single image deblurring,” in Proceedings of the IEEE/CVF international conference on computer vision, (2021), pp. 4641–4650.

36. L. Tang, J. Yuan, and J. Ma, “Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network,” Inf. Fusion 82, 28–42 (2022). [CrossRef]

37. O. Kupyn, V. Budzan, M. Mykhailych, et al., “Deblurgan: Blind motion deblurring using conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 8183–8192.

Model and method	PSNR	SSIM	RUNTIME(s)
CA-GAN [37]	18.97	0.8220	0.1476
Phong [24]	12.24	0.6814	0.0075
NLOS-OT [11]	23.34	0.8748	0.0384
PI- $S_{0}$	23.51	0.8934	0.0232
PI-Double	24.66	0.9157	0.0238

Model and method	PSNR	SSIM	RUNTIME(s)
CA-GAN [37]	18.13	0.8127	0.1368
Phong [24]	10.34	0.5707	0.0279
NLOS-OT [11]	21.85	0.8593	0.0442
PI-Dolp	22.31	0.8736	0.0282
PI-Double	24.66	0.9157	0.0238

FRA module	FA module	self-designed Loss	PSNR
			18.35
$\sqrt$	$\sqrt$		21.57
$\sqrt$		$\sqrt$	22.91
	$\sqrt$	$\sqrt$	20.85
$\sqrt$	$\sqrt$	$\sqrt$	24.66

Model and method	PSNR	SSIM	RUNTIME(s)
CA-GAN [37]	18.97	0.8220	0.1476
Phong [24]	12.24	0.6814	0.0075
NLOS-OT [11]	23.34	0.8748	0.0384
PI- $S_{0}$	23.51	0.8934	0.0232
PI-Double	24.66	0.9157	0.0238

Model and method	PSNR	SSIM	RUNTIME(s)
CA-GAN [37]	18.13	0.8127	0.1368
Phong [24]	10.34	0.5707	0.0279
NLOS-OT [11]	21.85	0.8593	0.0442
PI-Dolp	22.31	0.8736	0.0282
PI-Double	24.66	0.9157	0.0238

PI-NLOS: polarized infrared non-line-of-sight imaging

Abstract

1. Introduction

2. Related works

2.1 Passive NLOS imaging

2.2 NLOS imaging dataset

3. Experimental setup and data acquisition

4. Methods

4.1 LWIR polarization of the reflection

4.2 Network architecture

4.3 Encoder

4.4 Decoder

4.5 Loss function

5. Results and discussion

6. Conclusion

Funding

Acknowledgments

Disclosures

Data availability

References

Supplementary Material (1)

Data availability

Cited By

Figures (9)

Tables (3)

Equations (11)

Optics Express