
LWIR sensor parameters for deep learning object detectors

Open Access

Abstract

Deep learning has been well studied for its application to image classification, object detection, and other visible spectrum tasks. However, deep learning is only beginning to be considered for applications in the long-wave infrared (LWIR) spectrum. In this work, we attempt to quantify the imaging system parameters required to perform specific deep learning tasks without significant pre-processing of the LWIR images or specialized training. We show the capabilities of uncooled microbolometer sensors for Fast Region-based Convolutional Neural Network (Fast R-CNN) object detectors and the extent to which increased sensitivity and resolution affect a Fast R-CNN object detector’s performance. These results provide design guidelines for uncooled microbolometers in industries, such as commercial autonomous vehicle navigation, that will use deep learning object detectors.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

LWIR imagers are quickly becoming important for autonomous vehicle applications and search operations because of their ability to provide usable data in low-light conditions and when camouflaged targets become indistinguishable from the background [1–3]. An important metric for LWIR imagers is achieving an acceptable signal-to-noise ratio (SNR) at a given range while maintaining a minimum number of pixels on target. For deep learning neural networks, range performance is not the only evaluation criterion for object detection and classification, as these networks have significant training and computational requirements [3,4]. With the cost of microbolometer-based imaging systems falling drastically in recent years, it is becoming commercially viable to integrate these imaging systems with deep learning neural networks.

In the visible spectrum, the reflected light intensity in multiple spectral bands is the primary input. RGB images inherently provide a significant number of feature descriptors for an object detector and classifier. These descriptors, such as edges, color, size, and shape, become harder to detect in low-light scenes, resulting in poor performance. In switching from the visible spectrum to the LWIR, the three-channel RGB input reduces to a single band consisting of thermal radiation reflected by the object, emissive radiation based on its temperature, and radiation scattered by the atmosphere. Combining the thermal radiation from these multiple sources results in scenes that lack texture and may have low contrast.

Several studies have focused on obstacle avoidance using LWIR object detection since hot objects are relatively easy to identify in this spectral region. They have used various datasets and compared the results to those obtained with RGB imagery [5,6]. One such study relied on hot spot classification using Maximally Stable Extremal Regions (MSER). It provided good performance on both high- and low-resolution images to detect people and distinguish people from clutter. MSER, as described in [6], showed better performance for LWIR compared to low-light RGB. It should be noted that the thermal scene contrast between the people and the background was not quantified.

People detection has become popular for evaluating how well different algorithms developed for the visible spectrum function on LWIR imagery [6,7]. In most cases, the structure of the object detector’s algorithm is not altered but trained through transfer learning with an LWIR dataset. By fine-tuning the algorithms and testing on short-range images, high levels of performance can be achieved.

Image degradation in visible imaging systems is well studied, and solutions have been proposed to improve deep learning neural networks for classification and scene segmentation. Some of the approaches taken include pre-processing with denoising and deblurring algorithms and fine-tuning with more robust datasets consisting of degraded images [8–10]. These methods should not be used exclusively, as the optical system and sensor would still impose an upper performance limit. They may also be computationally intensive.

The frame rates of uncooled LWIR systems are slow compared to visible systems because of the high thermal time constants associated with microbolometer sensing. The resulting slower frame rates lead to motion blur. Conversely, increasing the frame rate leads to decreased SNR. CNNs can be designed specifically to remove noise or blur as a pre-processing step before localization and classification are performed [8]. Another technique is to adapt the object detector to handle blurred images by training it on blurred images [9]. When trained on blurred images, a deep learning network exhibits different degrees of activation at each pooling level in response to the blur. Researchers studying this phenomenon postulated that a network could be designed to accommodate blur.

In 2011, the Jet Propulsion Laboratory, California Institute of Technology, performed a detailed analysis of uncooled LWIR cameras for ground vehicle navigation, focusing on using only passive systems to minimize detection [2]. At the time, current technology provided sensors with 25 µm pitch vanadium oxide (VOx) microbolometer detectors, a noise equivalent temperature difference (NETD) of less than 50 mK, and a thermal time constant of 13–14 ms. The NETD is representative of the smallest temperature difference that can be resolved by the system. By today’s standards, primitive computational methods were used for terrain segmentation, object detection, and obstacle avoidance. Blur was a significant challenge for the uncooled LWIR microbolometers and limited vehicle operation to speeds of less than 12 m/s. Time of day and temperature crossovers were also problematic for classification. State-of-the-art detectors now have pitches of 10–12 µm, NETDs better than 20 mK, and thermal time constants of less than 12 ms.

A multifaceted study was performed using a variety of object detectors in difficult weather conditions [11]. This study looked at training dataset size and different weather conditions, including clear, fog, and rain, along with various ranges. The results showed that the multiple detectors used, including R-CNN, Cascade R-CNN, and YOLOv3, provided similar performance on LWIR images and that processing speed was the only significant difference. There was also a threshold for the minimum number of training images needed before the performance plateaued.

In this paper, we concentrate on sensor sensitivity in terms of the NETD for cases of reduced SNR and the angular resolution resulting from changing the system's modulation transfer function (MTF) for the combination of different optics and sensors. This research aims to relate these parameters to the performance of a deep learning neural network used for object detection without applying image enhancement algorithms or extensive training data manipulation.

In most cases, past research has focused on improving the computer vision algorithms without consideration of the optical system and sensor as the input source. The main contributions of this paper are a) describing the relative dependence of an object detector’s performance on LWIR optical system and sensor parameters, b) analysis of the effect of thermal scene contrast on an object detector’s performance, and c) an original framework for understanding hardware effects on the performance of deep learning object detectors in the LWIR spectrum independent of software.

2. Imaging system model

This section describes the imaging system model used for this work and explains the process by which blur and noise were added to the image dataset. The camera used to create the FLIR Advanced Driver Assist Systems (ADAS) Thermal Dataset was a FLIR Tau2 640 × 512 with a detector pitch of 17 µm. The optics had an equivalent focal length of 13 mm, providing an instantaneous field of view (IFOV) of 1.31 mrad. The IFOV is the angular size of the scene seen by a single detector or pixel. The camera used for the dataset collection had a specified sensitivity of less than 50 mK at F/1.0 [12]. The object detector, based on the Fast R-CNN with a ResNet-50 classifier, is detailed along with the pre-processing and training steps used.

2.1 Imaging system MTF

In a real imaging system, noise will be present before interaction with the optics, and additional noise will be generated at the detector array. Since we consider uncooled detectors in this research, we assume that noise originating at the scene is much smaller than the detector noise. This assumption allows us to treat noise generation and blur as separate processes: we first consider the optics’ effects and then add noise. For an ideal uncooled microbolometer system, the system MTF can be described as the product of the MTF of the optics and the MTF of the detector array [13]. Since the object detector works on the digital output, the display and eye contributions are not considered in this work. For this research, we treat the camera as having circular optics with square detectors, and both the camera and objects are static with no motion blur effects.

For an ideal system, the resulting irradiance, $I$, given in W/cm², at each detector is the result of optical diffraction and the detector’s geometry, where “∗” is the convolution operation and $h_{optics}$ and $h_{detector}$ are the respective optical transfer functions (OTFs),

$$I_{detector}(x,y) = I_{source}(x,y) \ast h_{optics}(x,y) \ast h_{detector}(x,y) \tag{1}$$

For the irradiance on the array, the detector irradiance is multiplied by an array function, where $\delta\delta$ represents a 2-D comb function for the sampling associated with the detector array,

$$I_{array}(x,y) = I_{detector}(x,y)\,\frac{1}{IFOV^2}\,\delta\delta\!\left(\frac{x}{IFOV},\frac{y}{IFOV}\right) \tag{2}$$

The OTFs, represented by h(x,y) in spatial coordinates, are converted by Fourier transform to MTFs, represented by H(ξ,η) and expressed in the angular frequencies $\xi$ and $\eta$ [cycles/rad],

$$I_{array}(\xi,\eta) = \left[ I_{source}(\xi,\eta)\,H_{optics}(\xi,\eta)\,H_{detector}(\xi,\eta) \right] \ast \delta\delta(IFOV\times\xi,\ IFOV\times\eta) \tag{3}$$

We consider the microbolometer’s detectors as having a square shape and 100% fill factor. At 100% fill factor, the detector’s IFOV is the same as the detector angular subtense (DAS), where DAS = d/f, d is the detector size, and f is the focal length. The MTF of a single detector is then

$$H_{detector}(\xi,\eta) = \mathrm{sinc}(\pi\times IFOV\times\xi)\,\mathrm{sinc}(\pi\times IFOV\times\eta) \tag{4}$$

The optics of the camera cause diffraction and yield an MTF contribution given by Eq. (5), where the optical cutoff frequency is $\rho_c$ = D/λ, ξ and η are the 2-D angular frequencies, and D is the diameter of the aperture of the ideal single-lens system.

$$H_{optics}(\xi,\eta) = \frac{2}{\pi}\left[\cos^{-1}\!\left(\frac{\sqrt{\xi^2+\eta^2}}{\rho_c}\right) - \left(\frac{\sqrt{\xi^2+\eta^2}}{\rho_c}\right)\sqrt{1-\frac{\xi^2+\eta^2}{\rho_c^2}}\right] \tag{5}$$

The resulting MTF of the system is the product of the optics MTF, Hoptics(ξ,η), and the detector MTF, Hdetector(ξ,η). It may also include replicas of the signal due to aliasing, resulting from convolution with the 2-D comb function at multiples of the sampling frequency ($\xi_s$ = 1/IFOV), as shown in Fig. 1.
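To make this MTF chain concrete, the following minimal sketch (not the authors' code) evaluates Eqs. (4) and (5) and their product for Tau2-like parameters; the 10 µm wavelength is our assumption for a representative LWIR value.

```python
import numpy as np

# Assumed FLIR Tau2-like parameters: 17 um pitch, 13 mm focal length, f/1.0 optics.
# The 10 um wavelength is our assumption for a representative LWIR value.
pitch = 17e-6            # detector pitch [m]
focal = 13e-3            # focal length [m]
fnum = 1.0               # f-number
wavelength = 10e-6       # [m]

IFOV = pitch / focal                 # ~1.31 mrad (DAS = d/f for 100% fill factor)
D = focal / fnum                     # aperture diameter [m]
rho_c = D / wavelength               # optical cutoff frequency [cycles/rad]

xi = np.linspace(0.0, 1.0 / IFOV, 500)   # angular frequencies up to the sampling frequency

# Detector MTF, Eq. (4): sinc(pi * IFOV * xi); numpy's sinc(x) = sin(pi*x)/(pi*x)
H_det = np.sinc(IFOV * xi)

# Diffraction-limited optics MTF, Eq. (5), zero beyond the cutoff
r = np.clip(xi / rho_c, 0.0, 1.0)
H_opt = (2.0 / np.pi) * (np.arccos(r) - r * np.sqrt(1.0 - r**2))

H_sys = H_opt * H_det                # ideal system MTF (product of the two)
print(f"IFOV = {IFOV*1e3:.2f} mrad, Nyquist = {0.5/IFOV:.0f} cycles/rad")
```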

Fig. 1. MTFs of ideal LWIR imaging system.

As we are interested in altering the FLIR ADAS Dataset to simulate imagery taken with different camera configurations, a Gaussian blur is added. The blur represents a different IFOV resulting from changes to the focal length, the detector pitch, or both. The Gaussian blur is implemented with a diameter of 2b in image space, measured between the $e^{-\pi}$ points. The OTF representing the applied Gaussian blur is defined as

$$h_{Gaussian}(x,y) = \frac{1}{b^2}\exp\!\left[-\pi\left(\frac{x^2+y^2}{b^2}\right)\right] \tag{6}$$

When included in the system MTF, the Gaussian blur shifts the system MTF to the left while also reducing any spurious aliased signals. Figure 2 shows the 6-pixel Gaussian blur MTF in comparison to the detector and optics MTFs and its inclusion in the overall system MTF. While the application of Gaussian blur suppresses high-frequency artifacts, some of the real image's high-frequency components are also affected. The amount of blur added will affect the Fast R-CNN because an edge box algorithm is used for the region proposals, and it is degraded by the reduction of the higher frequencies [14].
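As an illustration of how such a blur could be applied, the sketch below builds a discrete kernel from Eq. (6) and convolves it with an image. The helper names are ours, we read an "n-pixel blur" as a blur diameter 2b of n pixels, and the random frame merely stands in for a FLIR ADAS TIFF.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(b, half_width=None):
    """Discrete kernel from Eq. (6): h(x,y) = (1/b^2) exp[-pi (x^2+y^2)/b^2],
    with b in pixels. Normalized so the weights sum to 1 (the continuous
    kernel integrates to 1)."""
    if half_width is None:
        half_width = int(np.ceil(3 * b))          # wide enough to capture the tails
    x = np.arange(-half_width, half_width + 1)
    xx, yy = np.meshgrid(x, x)
    h = np.exp(-np.pi * (xx**2 + yy**2) / b**2) / b**2
    return h / h.sum()

def blur_image(img, blur_pixels):
    """Apply the Gaussian blur; blur_pixels = 2b is the blur diameter in pixels."""
    b = blur_pixels / 2.0
    return convolve2d(img, gaussian_kernel(b), mode='same', boundary='symm')

# Illustrative use on a random 14-bit-like frame (stand-in for a FLIR ADAS TIFF)
frame = np.random.randint(0, 2**14, size=(512, 640)).astype(float)
blurred = blur_image(frame, blur_pixels=6)        # the 6-pixel blur case of Fig. 2
```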

Fig. 2. MTFs of LWIR imaging system with 6-pixel Gaussian blur.

To compare the blurred images to those notionally obtained from an equivalent system, we match an ideal system by changing its IFOV until its MTF resembles the blurred-system MTF, as shown in Fig. 3 and represented by Eq. (7),

$$H_{blurred\,system}(\xi,\eta) \approx H_{optics}(\xi,\eta,IFOV)\,H_{detector}(\xi,\eta,IFOV) \tag{7}$$

Fig. 3. Comparison of MTFs for a blurred system and equivalent ideal system.

The IFOV of the ideal system MTF is varied until the best fit to the blurred-system MTF is obtained using a least-mean-square method. Note that uncooled LWIR systems are usually sampling limited, and the FLIR Tau2 used for the dataset collection can be described by Fλ/d = 1.04; for this reason, we can characterize these systems in terms of IFOV [13]. With six pixels of blur applied to the FLIR Tau2 images, similar performance would be expected if the optics or detector were changed to provide an increased IFOV of 1.99 mrad.
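One possible implementation of this matching step is sketched below: the ideal-system MTF is recomputed for trial IFOVs, with the diffraction cutoff tied to the IFOV through the stated Fλ/d = 1.04, and the least-squares best fit to the blurred-system MTF is returned. The fitting routine, the frequency range, and the cutoff scaling are our assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.optimize import minimize_scalar

IFOV0 = 17e-6 / 13e-3      # baseline FLIR Tau2 IFOV [rad], ~1.31 mrad

def ideal_mtf(xi, ifov, f_lambda_over_d=1.04):
    """Optics (Eq. 5) x detector (Eq. 4) MTF, with the diffraction cutoff tied to
    the IFOV through F*lambda/d = 1.04, as stated for the sampling-limited Tau2."""
    rho_c = 1.0 / (f_lambda_over_d * ifov)
    r = np.clip(xi / rho_c, 0.0, 1.0)
    h_opt = (2.0 / np.pi) * (np.arccos(r) - r * np.sqrt(1.0 - r**2))
    return h_opt * np.abs(np.sinc(ifov * xi))

def equivalent_ifov(blur_pixels):
    """Fit of Eq. (7): vary the ideal system's IFOV until its MTF best matches
    (least mean square) the baseline MTF multiplied by the Gaussian blur MTF."""
    xi = np.linspace(1.0, 0.5 / IFOV0, 400)          # cycles/rad, up to baseline Nyquist
    b_ang = (blur_pixels / 2.0) * IFOV0              # blur radius b in angular units
    target = ideal_mtf(xi, IFOV0) * np.exp(-np.pi * (b_ang * xi) ** 2)  # FT of Eq. (6)
    cost = lambda ifov: np.mean((ideal_mtf(xi, ifov) - target) ** 2)
    return minimize_scalar(cost, bounds=(IFOV0, 10 * IFOV0), method='bounded').x

print(f"Equivalent IFOV for a 6-pixel blur: {equivalent_ifov(6) * 1e3:.2f} mrad")
```

The same routine, swept over blur diameters from 0 to 20 pixels, would produce a blur-to-IFOV conversion of the kind summarized in Table 2.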

2.2. System noise

Applying a Gaussian blur also reduces the image's inherent noise, making the image appear as if the NETD of the imaging system were lower. Each blurred pixel is the weighted sum resulting from convolving the image with a Gaussian kernel of dimensions m-by-n,

$$I_{out}(x,y) = \sum_{i=-m}^{m}\sum_{j=-n}^{n}\frac{1}{b^2}\exp\!\left[-\pi\left(\frac{i^2+j^2}{b^2}\right)\right] I_{in}(x+i,\,y+j) \tag{8}$$

From [15], if the noise associated with each pixel is independent and random, then the variance of their sum is equal to the sum of their variances. Therefore, blurring changes the input noise variance, σ², to a corresponding output variance given by

$$\sigma_{out}^2 = \sigma_{in}^2\sum_{i=-m}^{m}\sum_{j=-n}^{n}\left\{\frac{1}{b^2}\exp\!\left[-\pi\left(\frac{i^2+j^2}{b^2}\right)\right]\right\}^2 \tag{9}$$

If the kernel is large enough, Eq. (9) can be approximated with an integral, and the new noise variance can be calculated analytically,

$$\sigma_{out}^2 = \sigma_{in}^2\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left\{\frac{1}{b^2}\exp\!\left[-\pi\left(\frac{i^2+j^2}{b^2}\right)\right]\right\}^2 di\,dj = \frac{\sigma_{in}^2}{2b^2} \tag{10}$$

By Eq. (10), increasing the blur radius, b, decreases the noise variance of the blurred image by a factor of 2b².
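The approximation in Eq. (10) can be checked numerically. The short sketch below (ours, not the authors') compares the discrete sum of squared weights from Eq. (9) with the analytic 1/(2b²); the agreement improves as b, and hence the kernel, grows, which is the "kernel large enough" condition.

```python
import numpy as np

def noise_gain(b, half_width=None):
    """Sum of squared Gaussian weights from Eq. (9); approaches 1/(2*b**2) for large b."""
    if half_width is None:
        half_width = int(np.ceil(4 * b))
    i = np.arange(-half_width, half_width + 1)
    ii, jj = np.meshgrid(i, i)
    w = np.exp(-np.pi * (ii**2 + jj**2) / b**2) / b**2
    return np.sum(w**2)

for b in (1.0, 2.0, 3.0, 5.0):
    # For b = 1 the discrete sum deviates strongly from the continuous limit.
    print(f"b = {b}: discrete sum = {noise_gain(b):.5f}, 1/(2b^2) = {1/(2*b**2):.5f}")
```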

Increasing the NETD requires an assumption about the contrast in the FLIR ADAS Dataset. Without radiometric images, we assume an average scene contrast, $\overline{\Delta T}$, of 5 K for each image. Based on this assumption and the NETD derivation for an uncooled microbolometer in [16], an SNR for the FLIR Tau2 can be determined from its known NETD of 50 mK. This SNR is then used to determine the noise variance, in gray levels, at a given $\overline{\Delta T}$,

$$\sigma_{noise}^2\!\left(\overline{\Delta T}\right) = \frac{\sigma_{image}^2\times NETD}{\overline{\Delta T}} \tag{11}$$

The resulting noise variance to be added to the image, after accounting for its reduction due to blurring from Eq. (10) and for the increased NETD (ΔNETD > 0) from Eq. (11), is

$$\sigma_{additive}^2 = \sigma_{noise}^2\!\left(\overline{\Delta T}\right)\left(1 - \frac{1}{2b^2} + \frac{\Delta NETD}{NETD}\right) \tag{12}$$
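A minimal sketch of how Eqs. (11) and (12) could be combined to inject noise is given below. The 50 mK NETD and 5 K average scene contrast follow the paper's assumptions, the use of the pre-blur image variance for σ²image follows the description in Section 3.4, and the function names and synthetic frame are ours.

```python
import numpy as np

def additive_noise_sigma(pre_blur_image, blur_pixels, delta_netd_mk,
                         netd_mk=50.0, mean_contrast_k=5.0):
    """Standard deviation of the Gaussian noise to add after blurring, per
    Eqs. (11)-(12). The pre-blur image variance stands in for sigma_image^2
    (gray levels); 5 K contrast and 50 mK NETD follow the paper's assumptions."""
    sigma_noise_sq = pre_blur_image.var() * (netd_mk * 1e-3) / mean_contrast_k  # Eq. (11)
    b = blur_pixels / 2.0                                                       # blur radius
    factor = 1.0 - 1.0 / (2.0 * b**2) + delta_netd_mk / netd_mk                 # Eq. (12)
    return np.sqrt(max(sigma_noise_sq * factor, 0.0))

# Illustrative use on a synthetic frame (a blurred FLIR ADAS TIFF in practice):
rng = np.random.default_rng(0)
frame = rng.integers(0, 2**14, size=(512, 640)).astype(float)
sigma = additive_noise_sigma(frame, blur_pixels=6, delta_netd_mk=15.0)  # 50 -> 65 mK
noisy = frame + rng.normal(0.0, sigma, frame.shape)
```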

3. Experimental procedure

3.1. Image pre-processing

The FLIR ADAS Dataset is organized with approximately 10,000 training images and 4,000 validation images. The dataset was available as either 14-bit TIFF without automatic gain control (AGC) applied or 8-bit JPEG with AGC applied [12]. The AGC images are undesirable for this research because histogram equalization, thresholding, and filtering have been applied, altering the noise and contrast in the images to some extent. Instead, our own pre-processing was applied to each 14-bit TIFF image, since these were taken from the camera pre-AGC. This same pre-processing was used later when blur and noise were added to the images to simulate different camera parameters.

The pre-processing consisted of a saturation step that rescaled the images by discarding 0.5 percent at each end of the histogram and then performing a linear histogram stretch. The resulting images were output at 640 × 512 pixels. For our research, we determined an average range to objects of 32 meters, based on the average height of a person in the database (45.6 pixels) matched to an assumed height of 1.7 meters.
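A sketch of this pre-processing step is shown below; the output is scaled to [0, 1], which is our assumption since the paper does not state the output range.

```python
import numpy as np

def preprocess(tiff_14bit, clip_frac=0.005):
    """Discard 0.5% at each end of the histogram (saturation), then apply a
    linear stretch. Output scaled to [0, 1] (our assumption; the paper does
    not state the output range)."""
    img = tiff_14bit.astype(np.float64)
    lo, hi = np.percentile(img, [100 * clip_frac, 100 * (1 - clip_frac)])
    img = np.clip(img, lo, hi)
    return (img - lo) / max(hi - lo, 1e-12)
```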

3.2. Deep learning network architecture

Our object detector used a Fast R-CNN with ResNet-50 to perform the image classification step [17,18]. The Fast R-CNN deep learning network provides a baseline with one benefit: regions of interest (ROIs) are generated separately from the image classification. Since the training data consists of 14-bit grayscale images, using an edge box function to generate the needed ROI proposals is straightforward [14]. Objects in LWIR images are inherently low contrast and lacking in texture, and the image classifier used must be capable of handling these characteristics. For this reason, we chose ResNet-50, with its residual learning format, as the basis for transfer learning. Based on the approach taken by Microsoft Research in developing this type of ResNet network, it is easier to fit a residual mapping than to fit nonlinear weighted network layers [18]. Figure 4 shows the steps in which the ROI proposal processes an input image, objects are classified in each ROI, and the network then provides performance metrics.

Fig. 4. ROI Proposal to Object Detector workflow.

3.3. Training and validation

The set of 10,000 training images in 14-bit TIFF format was pre-processed as described previously and fed into the Fast R-CNN for training. MATLAB’s trainFastRCNNObjectDetector function was used to train the object detector using the default stochastic gradient descent with momentum (SGDM) solver. The maximum number of epochs was set to 10 with an initial learning rate of 0.005, halved every two epochs. The 10,000-image set was processed in mini-batches of 10 and shuffled between epochs. The training set was not augmented with altered images containing additional noise or blur; other researchers have shown that such augmentation causes a detector to perform better on altered visible spectrum images at the cost of lower performance on unaltered visible spectrum images [9].

Validation was performed on the 4,000 validation images included with the FLIR ADAS Dataset to determine the network’s baseline performance. To evaluate performance, we used the average precision (AP) for each class and the overall mean average precision (mAP). The AP was obtained by integrating under the precision-recall curve for each class. To generate the precision-recall curve, the object detector was evaluated at confidence thresholds from 0.0 to 1.0. Accordingly,

$$\mathrm{Precision} = \frac{\text{Correct Detections}}{\text{All Detections}} \tag{13}$$
$$\mathrm{Recall} = \frac{\text{Detected Objects}}{\text{All Objects}} \tag{14}$$

At a threshold of 0.90, the detector must have at least 90 percent confidence in its prediction for it to be counted as a detection. With a higher threshold, the precision is usually higher, while a lower threshold results in more false detections, lowering the precision. At the same time, a high threshold results in low recall values because fewer of all possible objects are detected. In addition, an intersection over union (IoU) value of 0.50 is imposed for a detection to be correct, meaning at least half of the bounding box must overlap with the ground truth.
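For clarity, the sketch below shows a simplified version of this evaluation for a single class: detections are matched to ground truth at IoU ≥ 0.5, precision and recall are computed at confidence thresholds from 0.0 to 1.0, and AP is taken as the area under the resulting curve. The data structures and the greedy matching are illustrative choices, not the authors' exact implementation, which aggregates over the full validation set.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def average_precision(detections, truths, iou_thresh=0.5):
    """detections: list of (box, confidence); truths: list of boxes (one class).
    Sweeps confidence thresholds, builds the precision-recall curve with an
    IoU >= 0.5 match criterion, and integrates it to obtain AP."""
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, 101):
        kept = [d for d in detections if d[1] >= t]
        matched, correct = set(), 0
        for box, _ in kept:                       # greedily match each detection
            for k, gt in enumerate(truths):
                if k not in matched and iou(box, gt) >= iou_thresh:
                    matched.add(k)
                    correct += 1
                    break
        precisions.append(correct / len(kept) if kept else 1.0)   # Eq. (13)
        recalls.append(correct / len(truths) if truths else 0.0)  # Eq. (14)
    order = np.argsort(recalls)
    return np.trapz(np.array(precisions)[order], np.array(recalls)[order])
```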

During the validation process, the images had no blur or noise added, and the APs shown in Table 1 were obtained. As a comparison, we performed a baseline validation on both the JPEG images with FLIR’s AGC applied and the TIFF images with our pre-processing.

Table 1. Average precision for FLIR ADAS dataset.

It should be noted that the object detector performed slightly better with our pre-processing than with FLIR's AGC method. As mentioned in Section 3.1, FLIR's AGC uses localized histogram equalization, which tends to amplify noise in the images. While localized histogram equalization generally improves contrast for human viewing, the non-uniform distortion caused by AGC slightly hurts this Fast R-CNN object detector. Bicycles were the exception; this may be because they are often partially obscured by other objects and lack the solid structure, characteristics that would benefit from localized enhancement.

3.4. Model testing

The same 4,000 validation images in 14-bit TIFF format were used for each iteration where blur and noise were added. For each image, the image variance was computed and used to determine the amount of noise present before blurring. The image was blurred with a Gaussian filter with a diameter ranging from 0 to 20 pixels. After blurring, noise was added according to Eq. (12) to simulate the desired NETD assuming a 5 K average scene contrast.

At this point, the same pre-processing procedure previously discussed was applied to every image in the 4,000 set for the Fast R-CNN to process, detect, and localize the three categories of interest.

4. Results

Using the process detailed in Section 2.1 and Eq. (7), a set of equivalent IFOVs was obtained, a subset of which is shown in Table 2. The focal lengths shown in Table 2 are calculated with a detector pitch of 17 µm, which is the FLIR Tau2’s specification. As the IFOV is calculated from both focal length and detector pitch, different focal lengths could be derived if the detector pitch were changed.

Table 2. Gaussian blur to equivalent IFOV conversion.

First, we considered the effect of increasing the IFOV for cameras with different NETDs as shown in Fig. 5. A 5 K average scene contrast was used for computing the existing noise in the image with Eq. (11). The resulting mAPs for each case decreased as expected with minor exceptions. When a 2-pixel Gaussian blur was added, there was no decrease in performance. This was most likely caused by a reduction in false detections as the ROI Proposal may have seen too many sharp edges in the image unrelated to the desired categories. Since the ROI proposal used was the edge box detection algorithm, smoothing the image would increase performance slightly. There were also irregularities in the 70 mK curve, which may be smoothed with a more extensive testing image set.

Fig. 5. mAP values with increasing IFOV.

Average scene contrast significantly affects the object detector’s performance, especially since an increase in the system’s NETD then occurs at a much lower overall SNR. As a comparison, Fig. 6 shows the case of the equivalent system with an IFOV = 1.99 mrad (6-pixel Gaussian blur added). Average scene contrasts of 2 K and 8 K are compared in Fig. 6 to the 5 K used for Figs. 5 and 7.

Fig. 6. mAP values with increasing NETD for different average scene contrasts (equivalent IFOV = 1.99 mrad).

Fig. 7. mAP values with increasing NETD.

The difference in performance when the average scene contrast was increased from 5 K to 8 K showed moderate improvement, from 0.236 to 0.265 at NETD = 50 mK. At NETD = 95 mK, the mAP improved slightly more, from 0.160 to 0.212. When the average scene contrast is reduced to 2 K, the performance at NETD = 50 mK is reduced from 0.236 to 0.140, and the object detector fails to detect any objects at NETD = 95 mK.

Replotting the data from Fig. 5 to show mAP vs. NETD for different IFOV showed less overall effect when NETD was increased. In Fig. 7, the data point with no blur (IFOV = 1.31 mrad) and NETD = 95 mK had the same mAP as 4 pixels of blur (IFOV = 1.65 mrad) and NETD = 50 mK.

5. Discussion

To better quantify the effects of changing the IFOV and NETD of an imaging system, fitting the results is useful. Fitting also allows us to predict the performance of a system with different sensor parameters. Figure 8 shows a 3-D fit of the data showing mAP vs. IFOV and NETD.

Fig. 8. mAP values vs. IFOV and SNR. Visible dots represent data points with mAP values greater than the fitted curve.

For this ResNet-50 based object detector, an appropriate characteristic equation (with r² = 0.98) representing the effects of changing the IFOV and NETD is

$$mAP = -0.347 + \frac{8.846\ \text{mK}}{NETD} + \frac{0.5999\ \sqrt{\text{mrad}}}{\sqrt{IFOV}} \tag{15}$$

An average scene contrast must be assumed to interpret the results and compare different uncooled microbolometer-based imaging systems. Using an average scene contrast of 5 K, several inexpensive low-resolution commercial systems at various ranges are shown for comparison in Table 3. The estimated average range of 32 m calculated from the FLIR ADAS dataset is used to demonstrate performance at different ranges. A simple geometric approach can be used in which the ratio of the ranges scales the IFOV to keep the size of the target in the image constant. Therefore, Eq. (15) is amended to include this ratio,

$$mAP(Range) = -0.347 + \frac{8.846\ \text{mK}}{NETD} + \frac{0.5999\ \sqrt{\text{mrad}}}{\sqrt{\frac{Range}{32\ \text{m}}\,IFOV}} \tag{16}$$
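Equations (15) and (16) are simple to evaluate; a small helper such as the one below (the function name and example values are ours, not the Table 3 entries) can be used to predict mAP for a candidate sensor.

```python
def predicted_map(netd_mk, ifov_mrad, range_m=32.0):
    """Characteristic fit of Eqs. (15)-(16): mAP predicted from NETD [mK],
    IFOV [mrad], and range [m]; the range scales the IFOV relative to the
    32 m average range of the FLIR ADAS dataset."""
    return -0.347 + 8.846 / netd_mk + 0.5999 / ((range_m / 32.0) * ifov_mrad) ** 0.5

print(predicted_map(50.0, 1.31))               # Eq. (15) at the Tau2 baseline -> ~0.35
print(predicted_map(50.0, 1.31, range_m=64.0)) # doubling the range -> ~0.20
```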

Table 3. Estimated performance of various short- and long-range systems at 5 K average scene contrast.

The first two cameras are riflescope imaging systems, and the second two are wide-field-of-view systems. Where the NETD or the IFOV is lower than that of the FLIR Tau2, the mAPs are extrapolated using Eq. (16). The riflescopes were compared at three distances: 50, 100, and 200 m. The Senopex Sniper35 Scope at a range of 200 m would provide an mAP of 0.22 against targets of cars, people, and bicycles. This is not to say that there is a 22% chance of detection at an average 200 m range; rather, the mAP is a measure of performance over the entire precision-recall curve. The system may still produce multiple detections at low confidence thresholds, but the number of detections would drop quickly as the threshold increases. In other words, if multiple false alarms are acceptable to the user, the system may still meet requirements. One should also keep in mind that an IoU = 0.50 was imposed for a valid detection. Overfilled bounding boxes could result in rejected detections since the ground truths were very tightly annotated [1].

A Seek Thermal and a FLIR Tau336 were considered for wide-field-of-view applications. These types of systems have applications in security and in obstacle avoidance for vehicles. From Table 3, these low-resolution systems have a maximum usable range of around 50 meters when the average scene contrast is 5 K. At close ranges, 50 m or less, NETD is not a significant factor, as the mAPs are very close for the two cameras. Several cameras with high NETDs and low IFOVs could be integrated to create panoramic views rather than using a single camera with a wide field of view. If these systems are used in a high scene contrast environment ($\overline{\Delta T}$ > 8 K), such as detecting pedestrians or animals on the road during winter months, performance would increase by between 12% and 32%, as shown by Fig. 6. Conversely, low contrast environments, such as temperature crossovers at dawn or dusk ($\overline{\Delta T} \approx 2$ K), would cause over a 40% decrease in performance.

One additional consideration is the effect of supplementing the initial training set. As with the RGB imagery studied in [9], augmenting the training set with images with blur and noise added caused the detector to “tune” itself to perform better at higher NETD and IFOV. We supplemented the training set with duplicate images to which noise and blur were added to simulate NETD = 65 mK and IFOV = 1.99 mrad. The new fit curve (with r² = 0.99) changed to

$$mAP = -0.384 + \frac{8.541\ \text{mK}}{NETD} + \frac{0.5456\ \sqrt{\text{mrad}}}{\sqrt{IFOV}} \tag{17}$$

The overall effect was decreased performance at almost all NETD and IFOV values. At NETD = 50 mK and IFOV = 1.31 mrad, the mAP decreased from 0.382 to 0.358. An even more significant decrease occurred at NETD = 65 mK and IFOV = 1.99 mrad, where the mAP decreased from 0.203 to 0.0852. Compared to Eq. (15), the new coefficients show a downward shift in the mAP and a decrease in sensitivity to both parameters. These results show that LWIR object detectors suffer some of the same negative effects as a visible-spectrum object detector that had degraded images added to its training set, as described in [9].

6. Conclusions and future work

In this paper, we have shown how an R-CNN object detector’s performance is affected by the parameters of an uncooled microbolometer camera system. Initially, we applied increasing amounts of blur to simulate increased IFOV and noise to simulate a higher sensor NETD. These new images were fed through a Fast R-CNN object detector that used the ResNet-50 classifier. While the overall mAP would likely be higher or lower for other object detectors such as Faster R-CNN or YOLOv3, these relative results show how the mAP is affected and provide a framework for evaluating future developments in both hardware and software. This framework can be used to design an LWIR imaging system with appropriate parameters in terms of focal length, detector pitch, and NETD to achieve a desired level of performance. In the future, we plan to quantify other factors, such as atmospheric conditions, that can be represented with additional MTFs and incorporate them into our system model. It would also be valuable to include datasets with known average scene contrasts to better quantify the effect of sensor sensitivity on the overall object detector’s performance.

Funding

General Technical Services, LLC (GTS-S-19-048).

Acknowledgments

This research was supported by the Sensors and Electron Devices Directorate, U.S. Army Combat Capabilities Development Command, Army Research Laboratory.

Disclosures

The authors declare no conflicts of interest.

References

1. N. Pinchon, M. Ibn Khedher, O. Cassignol, A. Nicolas, B. Frederic, P. Leduc, J.-P. Tarel, R. Bremond, E. Bercier, and G. Julien, “All-weather vision for automotive safety: which spectral band?” in Advanced Microsystems for Automotive Applications (2019).

2. A. Rankin, A. Huertas, L. Matthies, M. Bajracharya, C. Assad, S. Brennan, and P. Bellutta, “Unmanned ground vehicle perception using thermal infrared cameras,” Proc. SPIE 8045, 804503 (2011).

3. S. Kim, W.-J. Song, and S.-H. Kim, “Infrared variation optimized deep convolutional neural network for robust automatic ground target recognition,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 195–202 (2017), DOI: 10.1109/CVPRW.2017.30.

4. I. Rodger, B. Connor, and N. M. Robertson, “Classifying objects in LWIR imagery via CNNs,” Proc. SPIE 9987, 99870H (2016).

5. R. Abbott, J. Del Ricon, B. Connor, and N. Robertson, “Deep object classification in low resolution LWIR imagery via transfer learning,” in Proceedings of the 5th IMA Conference on Mathematics in Defence (2017).

6. M. Mueller, T. Teutsch, M. Huber, and J. Beyerer, “Low resolution person detection with a moving thermal infrared camera by hot spot classification,” in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 209–216 (2014), DOI: 10.1109/CVPRW.2014.40.

7. V. John, S. Mita, Z. Liu, and B. Qi, “Pedestrian detection in thermal images using adaptive fuzzy C-means clustering and convolutional neural networks,” in Proceedings of the 14th IAPR International Conference on Machine Vision Applications (MVA), 246–249 (2015), DOI: 10.1109/MVA.2015.7153177.

8. S. Diamond, V. Sitzmann, S. P. Boyd, and F. Heide, “Dirty pixels: optimizing image classification architectures for raw sensor data,” arXiv:1701.06487 (2017).

9. I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich, “Examining the impact of blur on recognition by convolutional networks,” arXiv:1611.05760 (2016).

10. R. Nihei, Y. Tanaka, H. Iizuka, and T. Matsumiya, “Simple correction model for blurred images of uncooled bolometer type infrared cameras,” Proc. SPIE 11001, 40 (2019).

11. M. Kristo, M. Ivasic-Kos, and M. Pobar, “Thermal object detection in difficult weather conditions using YOLO,” IEEE Access 8, 125459–125476 (2020).

12. FLIR, “Thermal Datasets for ADAS Algorithm Training,” https://www.flir.com/oem/adas/dataset/ (2019).

13. G. C. Holst and R. G. Driggers, “Small detectors in infrared system design,” Opt. Eng. 51(9), 096401 (2012).

14. C. Zitnick and P. Dollar, “Edge boxes: locating object proposals from edges,” in Proceedings of the European Conference on Computer Vision, 391–405 (2014).

15. R. C. Gonzalez and R. E. Woods, Digital Image Processing, 4th ed. (Prentice Hall, 2017).

16. R. G. Driggers, P. Cox, and T. Edwards, Introduction to Infrared and Electro-Optical Systems (Artech House, 1999).

17. R. Girshick, “Fast R-CNN,” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 1440–1448 (2015), DOI: 10.1109/ICCV.2015.169.

18. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778 (2016), DOI: 10.1109/CVPR.2016.90.
