
Video super-resolution for single-photon LIDAR

Open Access

Abstract

3D time-of-flight (ToF) image sensors are used widely in applications such as self-driving cars, augmented reality (AR), and robotics. When implemented with single-photon avalanche diodes (SPADs), compact, array format sensors can be made that offer accurate depth maps over long distances, without the need for mechanical scanning. However, array sizes tend to be small, leading to low lateral resolution, which combined with low signal-to-background ratio (SBR) levels under high ambient illumination, may lead to difficulties in scene interpretation. In this paper, we use synthetic depth sequences to train a 3D convolutional neural network (CNN) for denoising and upscaling (×4) depth data. Experimental results, based on synthetic as well as real ToF data, are used to demonstrate the effectiveness of the scheme. With GPU acceleration, frames are processed at >30 frames per second, making the approach suitable for low-latency imaging, as required for obstacle avoidance.

Published by Optica Publishing Group under the terms of the Creative Commons Attribution 4.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

1. Introduction

Three-Dimensional (3D) imaging captures depth information from a given scene and is used in a wide range of fields such as autonomous driving [1,2], smartphones [3] or industrial environments [4]. Time-of-Flight (ToF) is a common way to measure depth, based on illuminating the scene with a modulated light source and measuring the time for the signal to return [5]. For longer range, outdoor applications, direct ToF (dToF) sensors and pulsed illumination are typically used, with the returning backscattered signal being detected by highly sensitive Avalanche Photo Diodes (APDs) or Single-Photon Avalanche Diodes (SPADs), and timed using an electronic stopwatch. Although the first dToF systems were based on point detectors requiring optical scanning, advances in detector technology have led to SPAD arrays with integrated timing and processing. SPAD dToF receivers are now available in image sensor format, which, used in conjunction with flood illumination, enable high-speed 3D imaging [6].

A drawback of current SPAD dToF image sensors is a relatively low lateral resolution [7]. This low resolution creates challenges in applications such as autonomous driving, where objects need to be classified as well as localised, so that appropriate action is taken [8]. Although there is indication that by applying neural networks to photon timing data, the resolution problem may be overcome [9], these approaches are yet to be generalised to arbitrary scenes. Noise in depth estimates under strong sunlight can lead to a further loss in detail in depth maps, as illustrated in Fig. 1.


Fig. 1. Depth map frame of people walking using AirSim (for more details on the generation process refer to Section 2). The frame has been produced using a 30$^{\circ }$ field-of-view and a signal-to-noise ratio of 1. a) Low-resolution depth map with noisy pixels due to solar background radiation (64$\times$32 pixels). b) Noise-free version of a), which is a nearest-neighbours resampling of the high-resolution depth map. c) High-resolution, noise-free depth map with a $\times$4 increase with respect to a).


We therefore seek a processing scheme to improve the quality of depth maps by increasing the lateral resolution (also known as upscaling or super-resolution), whilst mitigating the effect of photon noise [10]. A wide range of super-resolution methods have been proposed in the literature, and although many of these are primarily devised for RGB or grayscale intensity data, they are applicable to depth too. Interpolation-based schemes such as bicubic or Lanczos resampling are computationally simple but can suffer from poor accuracy [11,12], especially when the input is noisy. Alternatively, reconstruction methods [13,14] use prior knowledge from the scene or sensor fusion (e.g. intensity guidance [15]) to improve spatial detail at the expense of, typically, high computational costs. Learning-based methods can offer high output quality in combination with computational efficiency by learning statistical relationships between the low-resolution (LR) input and its high-resolution (HR) equivalent [16–18]. Recently, there has been increased interest in deep learning methods, which are a sub-branch of machine learning algorithms that aim to directly learn key features of the input to produce an output. These have been demonstrated to outperform previous approaches [19–21].

The bulk of the research on deep-learning super-resolution methods considers the use of a single LR image to produce an HR output [10,22]. In contrast, video super-resolution schemes exploit the temporal correlations in multiple consecutive LR images to produce an improved HR reconstruction in exchange for frame rate [23]. Most of these approaches are based on a combination of two networks: one for inter-frame alignment [24] and another for feature extraction/fusion to produce an HR space [25,26]. Other methods do not use alignment but exploit the spatio-temporal information for feature extraction using 2D or 3D convolutions [27], or recurrent convolutional neural networks (RCNNs) [28]. All of the aforementioned video super-resolution schemes work best in the case of sub-pixel displacements between consecutive frames.

From the perspective of depth upscaling, a key limitation of existing video super-resolution schemes is that they were conceived for RGB data and thus are not optimised for the characteristics of depth frames. Although there has been some research centred on depth, these works assume either a static scene and camera, or a static scene and a moving camera [29,30]. This paper therefore aims to devise an effective video super-resolution and denoising method targeting depth data using an existing neural network, originally designed for RGB. We consider the case of a high-speed dToF sensor [31], such that displacements in consecutive frames are kept at a sub-pixel level, and account for realistic depth noise. We present a methodology for generating diverse datasets of synthetic SPAD dToF data. The resulting datasets are then used to train a super-resolution and denoising neural network. We assess the performance of the scheme using both real and synthetic input data.

2. Generation of synthetic depth data

Obtaining large and diverse datasets of noise-free, high resolution depth frames, to serve as ground truth data for super-resolution networks, is difficult in practice. In recent years, however, there have been significant advances in software packages capable of simulating realistic environments in detail. In particular, AirSim (built on Unreal Engine [32]) is an open-source simulator for drones or cars with the aim of aiding AI research for autonomous vehicles [33]. In AirSim, one can control a vehicle through diverse virtual environments, populated with a range of objects, and retrieve virtual data from multiple sensors. The camera properties of the vehicle (position, field-of-view, lateral resolution and frame rate) can be chosen arbitrarily. In our work, RGB and depth sequences are generated at a lateral resolution of 256$\times$128 and a FoV of 30$^{\circ }$ at different frame rates to ensure small object shifts between frames. A range of outdoor scenarios are simulated to provide a diverse dataset for training the neural network.

The depth maps provided by AirSim do not account for the main noise in single-photon dToF sensors: the inherent Poisson noise in the signal and background photon counts [34]. The high-resolution depth map of size 256$\times$128, along with the grayscale version of the high-resolution RGB frame (intensity map, 256$\times$128), are therefore used to synthesize SPAD data in the form of 16-bin temporal photon histograms. The model, similar to that used in [35], assumes a SPAD with a multi-event histogramming TDC architecture [31,36], with adjustable bin size. The system instrument response function (IRF), representing the combination (convolution) of the temporal profile of the laser pulse, the SPAD jitter and the TDC jitter, is presumed to have a Gaussian profile with a FWHM of the order of ns. An array size of 64$\times$32 pixels is assumed, each pixel being composed of 4$\times$4 SPAD detectors. For each detector, the mean signal and background photon rates depend on the surface observed by the detector. The signal rate is proportional to the surface reflectivity and has an inverse-square relationship with the distance to the surface. The rate of background photons is proportional to the target reflectivity and the ambient level, which can be varied to change the overall signal-to-background ratio (SBR) level in the captured depth data. The SBR is defined as the ratio of signal photons to background photons across all bins in the histogram, averaged over the whole FoV of the scene. The photon timing histograms computed for each pixel are randomised according to Poisson statistics, and then processed using centre-of-mass peak extraction to produce depth maps [36].
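To make the data-generation model concrete, the following is a minimal per-pixel sketch of the synthesis steps described above. The bin width, IRF width and photon-rate scaling are illustrative placeholders, not the exact parameters used in the paper, and the centre-of-mass window is an assumption.

import numpy as np

N_BINS = 16
BIN_WIDTH_NS = 2.0        # assumed TDC bin size
IRF_FWHM_NS = 1.0         # assumed Gaussian IRF width
DEPTH_PER_NS = 0.15       # depth per ns of time-of-flight (c/2, in m/ns)

def synthesize_histogram(depth_m, reflectivity, ambient, rng, signal_scale=5000.0):
    """Return a Poisson-sampled 16-bin photon histogram for one SPAD pixel."""
    # Mean signal rate: proportional to reflectivity, inverse-square with range
    signal = signal_scale * reflectivity / max(depth_m, 1e-3) ** 2
    # Background rate: proportional to reflectivity and the ambient level
    background = ambient * reflectivity

    # Gaussian IRF centred on the bin of the true time-of-flight
    tof_bin = depth_m / (DEPTH_PER_NS * BIN_WIDTH_NS)
    sigma_bins = IRF_FWHM_NS / (2.355 * BIN_WIDTH_NS)
    bins = np.arange(N_BINS)
    irf = np.exp(-0.5 * ((bins - tof_bin) / sigma_bins) ** 2)
    irf /= irf.sum() + 1e-12

    mean_counts = signal * irf + background / N_BINS
    return rng.poisson(mean_counts)

def com_depth(hist):
    """Centre-of-mass peak extraction around the maximum bin."""
    peak = int(np.argmax(hist))
    lo, hi = max(0, peak - 1), min(N_BINS, peak + 2)
    window = hist[lo:hi].astype(float)
    com_bin = np.sum(np.arange(lo, hi) * window) / (window.sum() + 1e-12)
    return com_bin * DEPTH_PER_NS * BIN_WIDTH_NS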

The last pre-processing step is to normalise the depth data between 0 and 1 and concatenate it in groups of 2T$_R$+1 frames, where T$_R$ is the temporal radius. In other words, the input of the network consists of the T$_R$ prior and posterior frames used to super-resolve the central frame. The output of the neural network is compared with the ground truth depth, which is also normalised between 0 and 1, using the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [37,38]. The metrics are calculated using the following equations:

$$\mbox{PSNR} = 20 \log_{10}\left( \frac{1}{\sqrt{\frac{1}{HW}\sum_{i=0}^{H-1}\sum_{j=0}^{W-1}||Y_{GT}-Y_{SR}||^2}} \right)$$
$$\mbox{SSIM}= \frac{(2\mu_{GT}\mu_{SR}+c_1)(2\sigma_{GT-SR}+c_2)}{(\mu_{GT}^2+\mu_{SR}^2+c_1)(\sigma_{GT}^2+\sigma_{SR}^2+c_2)},$$
where the numerator in Eq. (1) refers to the maximum value in the image (1 in our case) and the denominator corresponds to the root of the mean-squared error. H corresponds to the image height (in pixels), W to the image width (in pixels), Y$_{GT}$ to the ground truth frame and Y$_{SR}$ to the super-resolved frame. In Eq. (2), $\mu$ corresponds to the mean of the indicated frame, $\sigma^2$ to its variance, $\sigma_{GT-SR}$ to the covariance between the two frames, and $c_1 = 0.0001$ and $c_2 = 0.0009$ are parameters to stabilize the division for weak denominators [27]. PSNR is bounded between 0 and $\infty$ and SSIM between 0 and 1. In both cases, a higher number suggests a better reconstruction. Figure 2 shows the key steps in the training and assessment of the neural network proposed in this paper, from capturing scenes in AirSim, converting them into SPAD-like data and comparing the output of the network against the ground truth.
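As a sketch, the two metrics can be computed directly from Eqs. (1) and (2) on normalised depth frames; note that this is a global (single-window) SSIM rather than the usual locally windowed variant, so it should be read as an illustration of the formulas rather than the exact evaluation code.

import numpy as np

def psnr(y_gt, y_sr):
    # Eq. (1): peak value is 1 for depths normalised to [0, 1]
    mse = np.mean((y_gt - y_sr) ** 2)
    return 20.0 * np.log10(1.0 / np.sqrt(mse + 1e-12))

def ssim(y_gt, y_sr, c1=0.0001, c2=0.0009):
    # Eq. (2): means, variances and covariance of the two frames
    mu_gt, mu_sr = y_gt.mean(), y_sr.mean()
    var_gt, var_sr = y_gt.var(), y_sr.var()
    cov = np.mean((y_gt - mu_gt) * (y_sr - mu_sr))
    return ((2 * mu_gt * mu_sr + c1) * (2 * cov + c2)) / \
           ((mu_gt ** 2 + mu_sr ** 2 + c1) * (var_gt + var_sr + c2))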


Fig. 2. Workflow diagram showing the process of capturing virtual scenes, converting them into SPAD-like data, using the neural network to super-resolve frames, and analysing the network performance. It is emphasized that the captured scenes are video sequences and not single frames.


3. Super-resolution and denoising network

In this paper, the structure of the VSR-DUF neural network is adapted to perform video super-resolution and denoising, as shown in Fig. 3 [27]. This network has been selected as it offers high super-resolution performance without requiring a frame re-alignment stage, which may be impacted by noise in the depth data. Although 3D convolution-based approaches are typically computationally expensive, in our case this is mitigated by the low transverse resolution of the inputs and by using a single channel (depth) rather than RGB. The network architecture is based on blocks of 3D convolutions and a set of dynamic upsampling filters to extract spatiotemporal features without the need to perform frame realignment [24]. The blocks consist of batch normalisation (BN), ReLU, 1$\times$1$\times$1 convolution, BN, ReLU and 3$\times$3$\times$3 convolution, and there are a total of 3+T$_R$ blocks. The network also extracts a residual map that is added to the central frame to enhance the sharpness of the final output.
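A minimal Keras sketch of one such block is given below. The filter counts and the dense-style concatenation of input and output features are assumptions used for illustration, not the exact VSR-DUF configuration.

import tensorflow as tf
from tensorflow.keras import layers

def block_3d(x, filters_1x1=64, filters_3x3=32):
    """One BN-ReLU-1x1x1 conv-BN-ReLU-3x3x3 conv block on a 5D (B,T,H,W,C) tensor."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv3D(filters_1x1, kernel_size=1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv3D(filters_3x3, kernel_size=3, padding="same")(y)
    # Dense-style connectivity (assumed): concatenate input and new features
    return layers.Concatenate(axis=-1)([x, y])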


Fig. 3. Neural network structure, adapted from [27].


The input to the network has a variable size of 64$\times$32$\times$(2T$_R$+1), depending on the temporal radius that is chosen, and the output is a $\times$4 super-resolved frame in both transverse axes (256$\times$128). When the temporal radius is set to 0, a single depth frame is used as input, whereas a greater value (here 1 to 4) exploits the temporal correlation in multiple successive depth frames. At the start (or end) of a video sequence, there are no prior (or posterior) frames available. The corresponding frames are therefore temporally padded with a copy of the initial (or last) frame to fulfil the input size imposed by $T_R$. Additionally, the data is randomly shuffled to prevent biasing the neural network’s weights towards a specific scenario. Not doing so leads to lower performance since the optimisation settles in local minima rather than the global minimum.
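The padding and grouping step can be sketched as follows; array names and shapes are illustrative, and the edge handling simply repeats the first or last frame as described above.

import numpy as np

def make_input_stacks(frames, t_r):
    """frames: (N, H, W) normalised depth sequence -> (N, H, W, 2*t_r+1) input stacks."""
    padded = np.concatenate([np.repeat(frames[:1], t_r, axis=0),
                             frames,
                             np.repeat(frames[-1:], t_r, axis=0)], axis=0)
    # Window i is centred on original frame i (padded index i + t_r)
    stacks = [padded[i:i + 2 * t_r + 1] for i in range(len(frames))]
    # Move the temporal axis last: (N, 2*t_r+1, H, W) -> (N, H, W, 2*t_r+1)
    return np.stack(stacks).transpose(0, 2, 3, 1)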

The neural network is implemented in Tensorflow using Keras [39]. The training is performed using the Adam optimiser [40] on a desktop computer (HP EliteDesk 800 G5 TWR) with an RTX 2070 GPU to accelerate the task. The loss function for this network is the Huber loss [41], often used to guarantee stable convergence during the optimisation of the model. The Huber loss is defined in Eq. (3) as

$$\mathcal{L} = \begin{cases} \frac{1}{2}||Y_{GT}-Y_{SR}||_2^2, & ||Y_{GT}-Y_{SR}||_1 \leq \delta \\ \delta ||Y_{GT}-Y_{SR}||_1 -\frac{1}{2}\delta^2, & \mbox{otherwise} \end{cases},$$
where $Y_{GT}$ is the HR ground truth, $Y_{SR}$ is the super-resolved HR depth map and $\delta = 0.01$ is a threshold parameter. The performance of the model is tracked at every step by the PSNR. Other parameters of interest are the number of epochs and the batch size, set to 100 and 4, respectively. Early stopping is introduced to avoid overfitting the model, terminating the training process if the loss does not improve for a set number of epochs (here a patience of 5 is used) and saving the weights corresponding to the minimum loss recorded. The learning rate of the network is initially set to 0.001 and decreases by a factor of 10 every 10 epochs.
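A possible Keras sketch of this training configuration is given below; build_network, x_train, y_train, x_val and y_val are hypothetical placeholders for the model of Fig. 3 and the datasets of Section 2, and the callback settings are our assumption of how the stated schedule could be implemented.

import tensorflow as tf

def psnr_metric(y_true, y_pred):
    # PSNR on depths normalised to [0, 1]
    return tf.image.psnr(y_true, y_pred, max_val=1.0)

def lr_schedule(epoch, lr):
    # Divide the learning rate by 10 every 10 epochs (initial value 0.001)
    return lr * 0.1 if epoch > 0 and epoch % 10 == 0 else lr

model = build_network()  # hypothetical constructor for the network in Fig. 3
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.Huber(delta=0.01),
              metrics=[psnr_metric])

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.LearningRateScheduler(lr_schedule),
]

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, batch_size=4, shuffle=True, callbacks=callbacks)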

4. Results

The performance of the video super-resolution network presented here is evaluated for a model trained with 15,500 examples. These include a varied range of depth maps featuring people, vehicles, bicycles, and other common objects, with different noise levels (ranging from SBR 0.1 to 2), different depth ranges (from 0 to 35 m) and different sampling rates, corresponding to objects shifting by a variable number of SPAD pixels between frames (ranging from 0.2 to 5 SPAD pixels). Given the assumed FoV and SPAD pixel resolution, an object moving at 50 km/h at a distance of 35 m captured at 100 FPS shifts by 0.5 SPAD pixels between frames. Different versions of the network are trained for different temporal radii (ranging from 0 to 4, corresponding to 1 to 9 input images). The validation dataset consists of 1500 examples (3 sequences of 500 frames each), providing an unbiased evaluation of the neural network. Similarly, the test dataset also contains 1500 examples from 3 different sequences and is used to perform the evaluation of the network. Table 1 and Fig. 4 summarise the performance of the network for different temporal radii in terms of PSNR, SSIM and processing speed in FPS.
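As a rough check of the quoted shift (our own back-of-the-envelope calculation, assuming the 30$^{\circ }$ FoV spans the 64-pixel width and using small-angle geometry):

import math

fov_deg, n_pix = 30.0, 64              # horizontal FoV and SPAD pixel count
dist_m, speed_kmh, fps = 35.0, 50.0, 100.0

pixel_footprint_m = dist_m * math.tan(math.radians(fov_deg / n_pix))   # ~0.29 m
shift_px = (speed_kmh / 3.6 / fps) / pixel_footprint_m                 # ~0.5 pixels
print(f"shift per frame: {shift_px:.2f} SPAD pixels")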


Fig. 4. FPS, PSNR and SSIM for networks trained with different temporal radii, for the 3 scenes comprising the test dataset. FPS are colour-coded and the TR0 and TR4 plot points are marked for reference.



Table 1. Performance of super-resolution network for different temporal radii in terms of PSNR, SSIM, and FPS.a

Visualization 1, Visualization 2, and Visualization 3 in the supplemental material show the input, the super-resolved depth maps and the ground truth for different temporal radii for the sequences in the test dataset. Generally, the video sequences corresponding to single-image super-resolution show an improvement in lateral resolution but lack temporal coherence and introduce temporal artifacts. On the other hand, using multiple inputs significantly reduces temporal artifacts and improves the overall quality of the output frame. This is reflected by the increase in PSNR and SSIM with an increasing number of inputs. The network is robust to different levels of depth noise, achieving better denoising with increasing temporal radii. In terms of computational effort, the use of multiple inputs slows the processing speed due to the larger network size. We do not consider larger temporal radii, as this would lead to lower processing speeds and there is no significant difference in performance between TR3 and TR4. Additionally, even at the maximum temporal radius considered here (TR4), the network is still able to work at video rate (30 FPS). When using TR4 and capturing (and processing) at 30 FPS, the processing operates with a latency of 150 ms, which is below the visual reaction time of humans (200 ms [42]) and therefore suitable for applications like object detection or obstacle avoidance. Note that the processing is done in a rolling fashion, meaning that each subsequent output requires only a single additional input frame, rather than 9 new frames.

The differences between TR0 and TR4 in temporal coherence are studied in more detail by considering a sequence with a person (Visualization 4). Temporal coherence (Tc) is defined here as a measure of a super-resolution method’s ability to capture changes in a scene without introducing temporal artifacts. To measure it, the difference between consecutive frames (for both ground truth and super-resolved frames) is taken to obtain delta frames. All pixels with motion (according to the ground truth delta frames) are then assigned zero values. Therefore, the number of remaining non-zero pixels gives a measure of the temporal coherence for a given frame, with a larger number indicating more pronounced temporal artifacts, as quantified by Eq. (4)

$$\mathrm{Tc}_t = \sum_{i,j=0}^{H,W} \begin{cases} 1, & ||GT_{i,j,t}-GT_{i,j,t+1}|| = 0 \hspace{2mm} \& \hspace{2mm} ||SR_{i,j,t}-SR_{i,j,t+1}|| > 0\\ 0, & \mbox{otherwise} \end{cases}.$$
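A direct implementation of Eq. (4) for one pair of consecutive frames could look as follows; the frames are assumed to be NumPy arrays of equal size, and the function simply counts pixels that are static in the ground truth but change in the super-resolved output.

import numpy as np

def temporal_coherence(gt_t, gt_t1, sr_t, sr_t1):
    """Count pixels that are static in the ground truth but change in the SR output."""
    static_in_gt = np.abs(gt_t - gt_t1) == 0
    changed_in_sr = np.abs(sr_t - sr_t1) > 0
    return int(np.count_nonzero(static_in_gt & changed_in_sr))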

For the sequence considered, the output for TR0 has roughly twice as many pixels with artifacts compared with TR4, demonstrating that with multiple inputs, motion is more accurately reproduced in the output frames. An additional study has been carried out in the Supplemental material comparing the performance of this network with an upscaling factor of $\times$8. The study suggests that the $\times$8 network is not learning additional features with respect to $\times$4, but is able to produce smoother edges in certain scenarios.

5. Super-resolution at different FPS

A dedicated dataset of people walking at 4.3 km/h is captured at different FPS to explore the use of temporal information in super-resolution. A main sequence is captured at 100 FPS and the lower FPS versions are generated by skipping frames with respect to the original (e.g. skipping every other frame to generate a 50 FPS sequence). Given the assumed FoV and SPAD pixel resolution, people at a distance of 9 m captured at 100 FPS shift by 0.16 SPAD pixels between frames. As with previous studies, Table 2 shows the performance in terms of PSNR and SSIM for different temporal radii; this is illustrated in Fig. 5, where the parameters are compared to a single-image super-resolution approach, displayed as a green plane.
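As an illustrative sketch (function and variable names are our own), the lower-frame-rate sequences can be derived by striding through the 100 FPS capture:

def downsample_fps(frames, base_fps, target_fps):
    # Skip frames to approximate the target frame rate, e.g. stride 2 -> 50 FPS
    stride = max(1, int(round(base_fps / target_fps)))
    return frames[::stride], base_fps / stride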


Fig. 5. Results of a super-resolved sequence at different temporal radii and FPS for a) PSNR and b) SSIM. The green plane denotes the average PSNR or SSIM of a single-image super-resolution technique (TR0).



Table 2. Performance of the super-resolution network on a scene (0.67 average SBR, 298 average signal photons) in terms of PSNR and SSIM for different temporal radii and different capture frame rates.a

The results indicate that when capturing at speeds greater than approximately 7 FPS (which corresponds to shifts of 2.2 SPAD pixels between frames under the conditions above), the movement of people walking is sampled densely enough for the network to exploit the temporal information in multiple consecutive input frames, hence the increase in performance with respect to the single-image approach. The general trend is that the quality of the reconstruction improves with increasing FPS, to a point where it plateaus. At higher frame rates, where the movement of objects is densely sampled, the inputs become too similar to extract additional temporal information from them. While there is then no further improvement in terms of resolving features, the use of multiple inputs still has an advantage in denoising frames over single-image reconstructions, independently of the dynamics of the scene. It is also observed that for a given capturing speed, the performance increases with the use of more inputs, which is consistent with the results in Fig. 4. Conversely, when the frame rate is low and the movement of objects is sparsely sampled, the network no longer recognises the temporal correspondence between frames. This results in lower-quality reconstructions than a single-image approach, with the addition of temporal artifacts.

We note that the above study assumed a constant SBR. In reality, varying the frame rate can impact the exposure time, and in turn the SBR of input frames, and can lead to motion blur effects in the case of long exposures. Although not considered here, motion blur is likely to have a significant effect on the quality of reconstruction (for both single and multiple input approaches) due to the resulting disparity between the input and ground truth. We also note that this study has been carried out for people, but we expect it to be generalisable to arbitrary objects in terms of the optimal shift per frame.

6. Comparison with existing super-resolution techniques

We use the synthetic test dataset to compare the performance of our network with a bicubic interpolation method [11], a state-of-the-art neural network using SPAD histogram data (and higher resolution intensity data) [35], and a video super-resolution network originally designed for RGB images but retrained with our depth data (Visualization 5) [43]. Table 3 and Fig. 6 show the performance of the different techniques in terms of average PSNR, SSIM and FPS.


Fig. 6. Example output frames comparing super-resolution methods, along with the low-resolution input and ground truth.



Table 3. Performance of super-resolution on the test dataset in terms of PSNR, SSIM, and FPS for different methods (bicubic, iSeeBetter, and HistNet, and our approach for TR1 and TR4), with the highest value in bold.a

Bicubic interpolation, despite being the fastest upscaling technique, fails to remove the noise from depth data, leading to poor reconstructions overall. Although PSNR can reach relatively high values, the SSIM reflects the limits of this interpolation technique. The iSeeBetter network is able to super-resolve features in a similar way to our method and removes the majority of depth noise. Additionally, with iSeeBetter, the edges of some objects (such as the bike) are reproduced more sharply in z compared with our method (and HistNet), where these edges are somewhat averaged with the background [44]. However, on the whole, our approach appears to reconstruct objects with higher accuracy and reduced noise on surfaces. We note, for example, that objects such as the furniture (scene 3, centre) and the truck (scene 2, right) show less blurring and distortion in contour and z profile than in the iSeeBetter and HistNet outputs. This is reflected in the higher PSNR and SSIM values for our approach compared with iSeeBetter and HistNet. We also note, with reference to Visualization 5, that iSeeBetter and HistNet do not preserve temporal coherence, similarly to TR0 in our approach. In the case of iSeeBetter, this might be attributed to the use of an additional network for the frame re-alignment step, which is not robust to noisy images like the ones used here [45]. This step can produce undesired alignments that lead to temporal artifacts in the super-resolved frames.

Figure 7 shows the error in the reconstructions, in other words the absolute difference between the super-resolved frame and the ground truth for the different methods. Depths have been normalised so that the error images have underlying values ranging from 0 to 1. The images provide a further illustration of the observations above. Whilst iSeeBetter frames are seen to have lower error values around some of the object edges, our method has a lower error overall across the whole frame. We observe that with our method, flat surfaces are accurately reproduced, whereas iSeeBetter and HistNet introduce offsets and fluctuations in depth.


Fig. 7. Images indicating the error in the output of different super-resolution methods for the example data in Fig. 6. Depth has been normalised between 0 and 1 for displaying error images.


The performance of the network is evaluated for a scene in the test dataset at different levels of SBR to test its robustness, compared against the iSeeBetter super-resolution network. Figure 8 shows the low-resolution input for different average SBR levels (0.06, 0.08, 0.1, 0.15, 0.2 and 2, corresponding to 31, 41, 56, 83, 110 and 1125 signal photons), the super-resolved frames for iSeeBetter and our network (TR4), and the corresponding ground truth. It can be observed that for low SBR, the iSeeBetter network struggles to reconstruct the noisiest parts of the frame. On the other hand, our network appears to be more robust to noise even in the lowest SBR scenario and has overall higher PSNR and SSIM values than iSeeBetter, as can be seen in Fig. 9.


Fig. 8. Example frames comparing our approach with iSeeBetter for different SBR levels along with the low-resolution input and ground truth.



Fig. 9. Evolution of a) PSNR and b) SSIM metrics over SBR (in logarithmic scale) for the iSeeBetter network and our approach.


Two additional studies have been carried out in the Supplemental material: the first compares the performance of this network with iSeeBetter as the frame rate of a synthetic depth sequence is varied. The study confirms the robustness of our approach for large shifts between frames, with the approach offering higher quality reconstructions than iSeeBetter even for inputs at low frame rates. The second compares the performance of this network with a method exploiting the high-speed nature of the SPAD sensor, where multiple histogram frames are summed together prior to depth extraction. The study shows that frames are denoised, but the movement of objects introduces motion blur in the scene. This results in objects appearing wider than in reality and leads to poor reconstruction performance. Therefore, the technique is useful for denoising frames but is not effective in providing super-resolution reconstructions.

7. Super-resolution of experimental data

The network is tested experimentally with a state-of-the-art SPAD dToF sensor which captures depth images at a pixel resolution of 64$\times$32 [46]. The SPAD sensor uses on-chip multi-event histogramming, which enables the high-speed acquisition of depth data without the need to sum binary frames of photon time stamps. The camera system comprises a NIR laser source (850 nm), which is triggered by the SPAD. The compact laser outputs a peak power of 60 W, which is spread by a cylindrical lens to match the FoV of the SPAD (20$\times$5$^{\circ }$). The laser emits pulses of 10 ns with a repetition rate of 1.2 MHz, suitable for mid-range imaging. A 25 mm, f/1.4 lens (Thorlabs MVL25M23) is used in front of the SPAD, together with a 10 nm bandwidth ambient filter (Thorlabs FL850-10). Figure 10(a) shows a schematic representation of the setup used during the acquisition of experimental data and Fig. 10(b) depicts example depth frames, and the corresponding super-resolution output from the network.


Fig. 10. a) Experimental setup comprising a SPAD dToF sensor, imaging lens, ambient filter, NIR laser and cylindrical lens for FoV matching. Objects at mid range (10-20 m), such as people, cars and bicycles, are targeted for super-resolution and are shown in b).


Visualization 6 shows a comparison between the input and the output of the network for different sequences captured at 50 FPS, involving a bicycle, a black model car (scaled 1:4), two people passing a ball to each other and two people running (the latter captured at 100 FPS; note that the person running closest to the camera is wearing a hood). The super-resolved frames show an improvement in the profile of objects and people, and smoother surfaces. Visualization 7 demonstrates the advantage of using multiple input frames as opposed to a single frame, in terms of resolved features, the level of denoising and temporal coherence. We note that, despite being trained purely on virtual data, the network appears to be effective when presented with real experimental data. Additional data from another SPAD sensor was captured featuring highly dynamic indoor scenes using the experimental setup described in [7], with the results of the processing shown in Visualization 8.

8. Conclusion

We have presented the development and application of a neural-network-based video super-resolution scheme tailored for SPAD dToF data, which aims to overcome the transverse resolution limitations of dToF image sensors. A virtual dataset involving people, cars and other common objects is gathered using Unreal Engine and is processed into realistic SPAD data with adjustable levels of SBR.

The results from the test datasets indicate a noticeable improvement in PSNR and SSIM when using multiple inputs with respect to a single-image super-resolution approach. To exploit the temporal information in multiple input frames, the frame rate should be set such that object edges moving at high speed shift by less than 2-3 sensor pixels between frames.

A comparison study with state-of-the-art methods demonstrates the advantages of our method in offering effective reconstruction of features and denoising, whilst running at high frame rates. The network, purely trained on virtual data, also appears to visually enhance real SPAD dToF data in mid-range LIDAR.

We believe that the network could be particularly well suited to autonomous systems requiring accurate depth maps of the surroundings with low latency. These systems often rely on object detection schemes, so a potential extension of the present work would see an assessment of object detection performance when carried out on super-resolution depth frames as opposed to depth or histogram data at the native resolution of the sensor [47].

Funding

Defence Science and Technology Laboratory (DSTLX1000147352, DSTLX1000147844); Engineering and Physical Sciences Research Council (EP/M01326X/1, EP/S001638/1).

Acknowledgments

The authors are grateful to STMicroelectronics for chip fabrication.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 6526–6534.

2. J. Rapp, J. Tachella, Y. Altmann, S. McLaughlin, and V. K. Goyal, “Advances in single-photon lidar for autonomous vehicles: Working principles, challenges, and recent advances,” IEEE Signal Process. Mag. 37(4), 62–71 (2020). [CrossRef]  

3. G. Pan, S. Han, Z. Wu, and Y. Wang, “3d face recognition using mapped depth images,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Workshops, (2005), p. 175.

4. C. Jing, J. Potgieter, F. Noble, and R. Wang, “A comparison and analysis of RGB-D cameras’ depth performance for robotics application,” in 2017 24th International Conference on Mechatronics and Machine Vision in Practice (M2VIP) (2017), pp. 1–6.

5. R. Horaud, M. Hansard, G. Evangelidis, and C. Ménier, “An overview of depth cameras and range scanners based on time-of-flight technologies,” Mach. Vis. Appl. 27(7), 1005–1020 (2016). [CrossRef]  

6. R. K. Henderson, N. Johnston, F. M DellaRocca, H. Chen, D. D. Li, G. Hungerford, R. Hirsch, D. Mcloskey, P. Yip, and D. J. S. Birch, “A 192×128 time correlated SPAD image sensor in 40-nm CMOS technology,” IEEE J. Solid-State Circuits 54(7), 1907–1916 (2019). [CrossRef]  

7. S. W. Hutchings, N. Johnston, I. Gyongy, T. Al Abbas, N. A. W. Dutton, M. Tyler, S. Chan, J. Leach, and R. K. Henderson, “A reconfigurable 3-D-stacked SPAD imager with in-pixel histogramming for flash LIDAR or high-speed time-of-flight imaging,” IEEE J. Solid-State Circuits 54(11), 2947–2956 (2019). [CrossRef]  

8. S. Scholes, A. Ruget, G. M. Martín, F. Zhu, I. Gyongy, and J. Leach, “Dronesense: The identification, segmentation, and orientation detection of drones via neural networks,” IEEE Access 10, 38154–38164 (2022). [CrossRef]  

9. A. Turpin, G. Musarra, V. Kapitany, F. Tonolini, A. Lyons, I. Starshynov, F. Villa, E. Conca, F. Fioranelli, R. M. Smith, and D. Faccio, “Spatial images from temporal data,” Optica 7(8), 900–905 (2020). [CrossRef]  

10. H. Chen, X. He, L. Qing, Y. Wu, C. Ren, R. E. Sheriff, and C. Zhu, “Real-world single image super-resolution: A brief review,” Inf. Fusion 79, 124–145 (2022). [CrossRef]  

11. R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Trans. Acoust., Speech, Signal Process. 29(6), 1153–1160 (1981). [CrossRef]  

12. C. E. Duchon, “Lanczos filtering in one and two dimensions,” J. Appl. Meteorol. Climatol. 18(8), 1016–1022 (1979). [CrossRef]  

13. S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, and A. K. Katsaggelos, “Softcuts: A soft edge smoothness prior for color image super-resolution,” IEEE Trans. on Image Process. 18(5), 969–981 (2009). [CrossRef]  

14. Q. Yan, Y. Xu, X. Yang, and T. Q. Nguyen, “Single image superresolution based on gradient profile sharpness,” IEEE Trans. on Image Process. 24(10), 3187–3202 (2015). [CrossRef]  

15. C. Callenberg, A. Lyons, D. Brok, A. Fatima, A. Turpin, V. Zickus, L. Machesky, J. Whitelaw, D. Faccio, and M. Hullin, “Super-resolution time-resolved imaging using computational sensor fusion,” Sci. Rep. 11(1), 1689 (2021). [CrossRef]  

16. Z. Song, Z. Chen, and R. Shi, “Fast map-based super-resolution image reconstruction on gpu-cuda,” in Geo-Informatics in Resource Management and Sustainable Ecosystem, F. Bian and Y. Xie, eds. (Springer, 2015), pp. 170–178.

17. W. Yang, Y. Tian, F. Zhou, Q. Liao, H. Chen, and C. Zheng, “Consistent coding scheme for single-image super-resolution via independent dictionaries,” IEEE Trans. Multimedia 18(3), 313–325 (2016). [CrossRef]  

18. Y. Kang, R. Xue, X. Wang, T. Zhang, F. Meng, L. Li, and W. Zhao, “High-resolution depth imaging with a small-scale spad array based on the temporal-spatial filter and intensity image guidance,” Opt. Express 30(19), 33994–34011 (2022). [CrossRef]  

19. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Comput. 1(4), 541–551 (1989). [CrossRef]  

20. I. Goodfellow, J. P. Abadie, M. Mirza, B. Xu, D. W. Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, vol. 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, eds. (Curran Associates, Inc., 2014).

21. Q. Tang, R. Cong, R. Sheng, L. He, D. Zhang, Y. Zhao, and S. Kwong, “BridgeNet: A joint learning network of depth map super-resolution and monocular depth estimation,” in Proc. ACM MM (2021).

22. Z. Sun, D. B. Lindell, O. Solgaard, and G. Wetzstein, “Spadnet: deep rgb-spad sensor fusion assisted by monocular depth estimation,” Opt. Express 28(10), 14948–14962 (2020). [CrossRef]  

23. H. Liu, Z. Ruan, P. Zhao, C. Dong, F. Shang, Y. Liu, L. Yang, and R. Timofte, “Video super-resolution based on deep learning: a comprehensive survey,” Artificial Intelligence Review (2022).

24. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

25. B. Bare, B. Yan, C. Ma, and K. Li, “Real-time video super-resolution via motion convolution kernel estimation,” Neurocomput. 367, 236–245 (2019). [CrossRef]  

26. R. Kalarot and F. Porikli, “Multiboot vsr: Multi-stage multi-reference bootstrapping for video super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2019).

27. Y. Li, H. Zhu, Q. Hou, J. Wang, and W. Wu, “Video super-resolution using multi-scale and non-local feature fusion,” Electronics 11(9), 1499 (2022). [CrossRef]  

28. X. Zhu, Z. Li, X.-Y. Zhang, C. Li, Y. Liu, and Z. Xue, “Residual invertible spatio-temporal network for video super-resolution,” in AAAI Conference on Artificial Intelligence (2019).

29. G. M. Martín, A. Halimi, R. K. Henderson, J. Leach, and I. Gyongy, “High-ambient, super-resolution depth imaging with a spad imager via frame re-alignment,” in Proceedings of IISW International Image Sensor Workshop (2021).

30. J. Kim, J. Han, and M. Kang, “Multi-frame depth super-resolution for tof sensor with total variation regularized l1 function,” IEEE Access 8, 165810–165826 (2020). [CrossRef]  

31. I. Gyongy, A. T. Erdogan, N. A. W. Dutton, G. M. Martín, A. Gorman, H. Mai, F. M. Della Rocca, and R. K. Henderson, “A direct time-of-flight image sensor with in-pixel surface detection and dynamic vision,” (2022).

32. Epic Games, “Unreal Engine”.

33. S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, M. Hutter and R. Siegwart, eds. (Springer International Publishing, 2018), pp. 621–635.

34. N. A. W. Dutton, I. Gyongy, L. Parmesan, and R. K. Henderson, “Single photon counting performance and noise analysis of cmos spad-based image sensors,” Sensors 16(7), 1122 (2016). [CrossRef]  

35. A. Ruget, S. McLaughlin, R. K. Henderson, I. Gyongy, A. Halimi, and J. Leach, “Robust super-resolution depth imaging via a multi-feature fusion deep network,” Opt. Express 29(8), 11917–11937 (2021). [CrossRef]  

36. I. Gyongy, S. W. Hutchings, A. Halimi, M. Tyler, S. Chan, F. Zhu, S. McLaughlin, R. K. Henderson, and J. Leach, “High-speed 3D sensing via hybrid-mode imaging and guided upsampling,” Optica 7(10), 1253–1260 (2020). [CrossRef]  

37. F. A. Fardo, V. H. Conforto, F. C. de Oliveira, and P. S. Rodrigues, “A formal evaluation of psnr as quality measurement parameter for image segmentation algorithms,” (2016).

38. J. Nilsson and T. A. Möller, “Understanding ssim,” (2020).

39. F. Chollet, “Keras,” (2015).

40. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” (2017).

41. K. Gokcesu and H. Gokcesu, “Generalized huber loss for robust learning and its efficient minimization for a robust statistics,” (2021).

42. P. Thompson, J. Colebatch, P. Brown, J. Rothwell, B. Day, J. Obeso, and C. Marsden, “Voluntary stimulus-sensitive jerks and jumps mimicking myoclonus or pathological startle syndromes,” Mov. Disorders: Official J. Mov. Disord. Soc. 7(3), 257–262 (1992). [CrossRef]  

43. A. Chadha, J. Britto, and M. M. Roja, “iSeeBetter: Spatio-temporal video super-resolution using recurrent generative back-projection networks,” Comp. Visual Media 6(3), 307–317 (2020). [CrossRef]  

44. X. Song, Y. Dai, and X. Qin, “Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network,” in Computer Vision – ACCV 2016, S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, eds. (Springer International Publishing, 2017), pp. 360–376.

45. P. Ce, “Beyond pixels: exploring new representations and applications for motion analysis,” Ph.D. thesis, Massachusetts Institute of Technology (2009).

46. I. Gyongy, A. T. Erdogan, N. A. W Dutton, H. Mai, F. M. DellaRocca, and R. K. Henderson, “A 200kfps, 256×128 spad dtof sensor with peak tracking and smart readout,” in Proceedings of IISW International Image Sensor Workshop (2021).

47. G. M. Martín, A. Turpin, A. Ruget, A. Halimi, R. Henderson, J. Leach, and I. Gyongy, “High-speed object detection with a single-photon time-of-flight image sensor,” Opt. Express 29(21), 33184–33196 (2021). [CrossRef]  

Supplementary Material (9)

Name / Description
Supplement 1       Supplemental Document
Visualization 1       Test dataset results for different temporal radii (Scene 1)
Visualization 2       Test dataset results for different temporal radii (Scene 2)
Visualization 3       Test dataset results for different temporal radii (Scene 3)
Visualization 4       Temporal coherence comparison for temporal radius 0 and 4.
Visualization 5       Comparison between super-resolution methods.
Visualization 6       Super-resolution from experimental data
Visualization 7       Super-resolution of experimental data for different temporal radii
Visualization 8       Super-resolution from experimental data. BALL= 1kFPS (300us exposure). FAN = 1kFPS (300us exposure). JUGGLING = 600FPS (1ms exposure). BALLOON = 950kFPS (400us exposure).
