Abstract
Multispectral light field acquisition is challenging due to the increased dimensionality of the problem. In this paper, inspired by anaglyph theory, i.e., the ability of human eyes to synthesize colored stereo perception from color-complementary (such as red and cyan) views, we propose to capture the multispectral light field using multiple cameras with different wide-band filters. A convolutional neural network is used to extract the joint information of different spectral channels and to pair the cross-channel images. In our experiments, results on both synthetic data and real data captured by our prototype system validate the effectiveness and accuracy of the proposed method.
© 2017 Optical Society of America
1. Introduction
Both light field and multispectral imaging are active research topics in computational photography because of their potential applications in various computer vision tasks and many other scenarios such as remote sensing. Compared with traditional photography, they provide extra information from either the angular or the spectral dimension of light rays.
Many commercial multispectral cameras capture different channels sequentially with the aid of tunable filters [1] or push-broom imaging frameworks [2]. Many snapshot hyperspectral imaging systems [3], such as the Computed Tomography Imaging Spectrometer (CTIS) [4], the Coded Aperture Snapshot Spectral Imager (CASSI) [5–9] and the Prism-based Multispectral Video Imaging Spectrometer (PMVIS) [10], have been proposed to capture videos [11]. Besides, 4D light fields have been proposed to simplify the 7D plenoptic function [12–15], and several methods have been proposed for capturing the light fields of both static and dynamic scenes [16–19]. However, capturing the multispectral light field is still difficult due to the increased dimensionality of the problem.
Inspired by anaglyph 3D theory, i.e., the ability of humans to synthesize full-color stereoscopic perception by encoding binocular views with chromatically complementary color filters (typically red and cyan), we try to extend binocular stereo sensing to multi-camera cases for multispectral light field imaging. The main challenge for this idea is how to pair the heterogeneous images captured at different views with different color filters. Existing stereo matching algorithms such as Normalized Cross Correlation (NCC) [20, 21] and the hidden Markov tree model [22] cannot handle this problem because they assume that corresponding points in different views share the same intensities, an assumption that underlies the most important fidelity term in the objective function of these matching algorithms. When this assumption does not hold, most existing stereo matching algorithms fail to offer reasonable output.
However, as a natural physiological phenomenon, we human beings can handle two such heterogeneous views easily, even without any pre-training. This implies the solvability of the aforementioned problem, and furthermore indicates that the intensity fidelity constraint does not play an important role in the human visual system.
Recently, with the development of deep neural networks, tasks that have long puzzled computer scientists but are well interpreted by human brains are being solved effectively. For example, Zbontar and LeCun [23, 24] propose a stereo matching method that trains a convolutional neural network (CNN) to simulate the human eye's behavior for image patch comparison and depth information extraction from a rectified image pair. Motivated by this work, we attempt to train a deep network that can handle heterogeneous stereo matching like human brains do. The Siamese network (Bromley et al., 1993), which is composed of two identical or similar sub-networks, is applied to handle the stereo image pairs [25]. Different channels of images in standard stereo matching datasets, such as KITTI [26, 27] and the Middlebury vision benchmark [28, 29], are used to generate the heterogeneous training images.
Meanwhile, we present a prototype array system composed of eight cameras with heterogeneous wide-band color filters to capture the multispectral light field in a single shot. A spectral de-multiplexing algorithm is proposed to extract 24 spectral channels from the eight heterogeneously filtered trichromatic cameras. In particular, we conduct stereo matching among the differently wideband-filtered images captured at different views by training a convolutional neural network, and construct a 24-channel light field with different spectral response curves by warping the images according to the estimated stereo matching. By carefully designing the broad-band spectral filters, the 24-spectral-channel light field can be computed using the demultiplexing algorithm [30].
In all, there are three main contributions of this work: (1) we propose to capture the multispectral light field using a camera array where each camera is coupled with a heterogeneous wide-band color filter; (2) we demonstrate a heterogeneous matching algorithm that uses a convolutional neural network to simulate human eyes; (3) we present a prototype system with eight cameras for high-quality 24-channel spectral light field imaging of both indoor and outdoor, static and dynamic scenes.
2. Camera array system for multispectral light field imaging
In this section, we present our camera array system using heterogeneous wide-band color filters for multispectral light field imaging.
2.1. System overview
As shown in Fig. 1, we reconstruct the multispectral light field by capturing heterogeneous multiview images with our eight-camera array system. Fig. 2 shows the architecture of the proposed system, which consists of three main subsystems: heterogeneous cameras, processing boards (Nvidia Jetson TX1), and a PC. Considering the trade-off between system complexity and the number of captured channels, and without loss of generality, a prototype system composed of eight cameras is used to capture 24 channels per snapshot in this paper. However, the proposed method can be easily extended to capture light fields with more spectral channels and viewpoints. We use eight off-the-shelf RGB cameras, i.e., Point Grey GS3-U3-51S5C-C [31] with 25 mm (F/16) lenses, each of which offers a spatial resolution of 2448×2048 and a temporal resolution of up to 75 frames per second (fps). These eight cameras are mounted in parallel on a printed metal stand as a 2×4 camera array so that we can move them flexibly. Each camera is connected to a Nvidia Jetson TX1 board for image processing (e.g., data compression, ISP and data storage) before sending the data to the PC in either raw/JPEG form or as an MPEG2 video stream. The PC host controls the system configuration (such as initialization and synchronous triggering) and data postprocessing (such as image stitching). Eight plastic filters with different color bands are mounted on the cameras (as shown in Fig. 2) to capture spectrally heterogeneous images. Note that the default white balance must be turned off to avoid unnecessary color manipulation during acquisition.
After capturing the spectrally heterogeneous images/videos, the disparity maps between views are computed using the proposed heterogeneous stereo matching network; rectification is applied beforehand to correct system errors. The heterogeneous measurements from different views are then warped to their correspondences, forming 24 wide-band channels for every view. Finally, by applying the spectral de-multiplexing algorithm to the 24-channel image of each view, the multispectral light field is reconstructed.
2.2. CNN-based heterogeneous stereo matching
A convolutional neural network (CNN) based heterogeneous stereo matching algorithm is developed to compute the correspondences between the heterogeneous multiview images.
2.2.1. Network architecture
We use the same network architecture as presented in [24]. To make the paper self-contained, we briefly introduce the network here. A Siamese network [25], i.e., two shared-weight sub-networks with joint top layers, is applied. The two sub-networks are each composed of four spatial convolutional layers, each followed by a rectified linear unit. Each convolutional layer uses 112 3×3 filters to extract features. Four fully connected layers with 384 units each then estimate the disparity from the features. Each pixel is processed using a 9×9 patch centered on that pixel and filled with its neighbors. The raw outputs of the network still contain some errors, especially in low-texture regions and occluded areas. Thus, a series of post-processing steps, i.e., cross-based cost aggregation, semi-global matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter, are applied to refine the raw disparity maps.
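The patch-comparison idea behind this architecture can be illustrated with a toy NumPy sketch: two shared-weight feature towers followed by a similarity score as the joint layer. This is only a minimal stand-in for the trained network of [24] (one convolutional layer with random weights instead of four trained layers of 112 filters, and a cosine similarity instead of fully connected layers); all names are our own.

```python
import numpy as np

def conv2d(img, kernels):
    """Valid 2D correlation of a single-channel image with a bank of kernels."""
    kh, kw = kernels.shape[1:]
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((len(kernels), h, w))
    for n, k in enumerate(kernels):
        for i in range(h):
            for j in range(w):
                out[n, i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def siamese_similarity(patch_l, patch_r, kernels):
    """Shared-weight towers (conv + ReLU) and a cosine-similarity top layer."""
    f_l = np.maximum(conv2d(patch_l, kernels), 0).ravel()
    f_r = np.maximum(conv2d(patch_r, kernels), 0).ravel()
    denom = np.linalg.norm(f_l) * np.linalg.norm(f_r) + 1e-12
    return float(f_l @ f_r / denom)

rng = np.random.default_rng(0)
kernels = rng.standard_normal((8, 3, 3))   # 8 random 3x3 filters (illustration)
patch = rng.random((9, 9))                 # a 9x9 patch, as in the paper
score_same = siamese_similarity(patch, patch, kernels)   # near 1.0
```

Because the towers share weights, the same features are extracted from both inputs regardless of which spectral channel each patch comes from, which is what makes heterogeneous comparison possible.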
2.2.2. Model training
With the network architecture aforementioned, we train the model using heterogeneous images generated from the training datasets of KITTI 2015 [26, 27] and test the model with the testing dataset of KITTI 2015.
We train the network parameters using image pairs with different color channels. Specifically, we extract a single channel from the left image of an image pair, and for the right image, one of the remaining two channels is selected so that the input image pair has different channels. By traversing all possible combinations, 200×6 image pairs are generated to train our network.
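The pairing scheme above can be written down directly; the function below (our own illustration, not the authors' released code) enumerates the six cross-channel combinations for one rectified RGB stereo pair.

```python
import numpy as np

def heterogeneous_pairs(left_rgb, right_rgb):
    """All pairings of one left channel with a *different* right channel."""
    pairs = []
    for cl in range(3):              # channel taken from the left image
        for cr in range(3):          # channel taken from the right image
            if cr != cl:             # never let the pair share a channel
                pairs.append((left_rgb[..., cl], right_rgb[..., cr]))
    return pairs

left = np.zeros((4, 4, 3))
right = np.zeros((4, 4, 3))
pairs = heterogeneous_pairs(left, right)   # 6 cross-channel pairs
```

With 200 training scenes this yields the 200×6 heterogeneous pairs mentioned above.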
The results of the proposed method with different channel inputs, as well as the results of Zbontar et al. [24] with full-channel inputs, are shown in Fig. 3. It contains three pairs of examples from the KITTI 2015 dataset [26, 27], together with the disparity predictions under two single-color-channel inputs. Similarity is measured as the percentage of pixels where the two disparity maps differ by less than two pixels. Although the inputs are reduced to a single channel, our results are quite similar to those with full-channel inputs, except in some occluded areas. These results are in accordance with our expectations, since the convolutional neural network mimics human visual neurons well and can extract feature vectors more accurately than traditional stereo methods, without needing an intensity fidelity constraint. Therefore, the proposed CNN-based stereo matching method on anaglyph-style inputs can potentially work as well as human eyes, and thus we can obtain the light field information through camera arrays comprising multiple single-channel spectral cameras (realized by placing filters in front of commercial RGB cameras).
2.3. Spectral demultiplexing
Given the camera spectral responses and assuming Lambertian scenes, the imaging model of the proposed heterogeneous camera array system can be expressed as:
$$p_{m,k}(x) = \int_{\Omega} s(\lambda, x)\, c_k(\lambda)\, f_m(\lambda)\, \mathrm{d}\lambda, \qquad (1)$$
where p_{m,k}(x) is the intensity of pixel x, k ∈ {r, g, b} is the channel index of the image, m is the camera/view index, Ω = [400nm, 700nm] is the range of the visible spectrum, s(λ, x) is the spectral reflectance of scene point x, c_k(λ) is the camera response curve of the k-th channel, and f_m(λ) is the transmission curve of the filter at camera m. For a multi-camera system with M trichromatic cameras (M = 8 in our system), we can capture wideband spectrally multiplexed images with 3 × M channels. By selecting N = 3 × M spectral channels from the visible spectrum range, we obtain the spectral sensing matrix C, combining both camera responses and filter transmissions:
$$C_{3(m-1)+k,\,i} = c_k(\lambda_i)\, f_m(\lambda_i), \qquad (2)$$
where C_{3(m−1)+k, i} denotes the spectral sensitivity of the i-th narrowband channel in the k-th channel of camera m. Specifically, each row of C is the combination of the spectral response curves of the camera sensor and one of our wideband filters. For a scene point with spectrum s = [s1, s2, …, sN]^T, we get the discrete version of Eq. (1):
$$p_{3(m-1)+k} = \sum_{i=1}^{N} C_{3(m-1)+k,\,i}\, s_i. \qquad (3)$$
Considering that we have M cameras and each of them has 3 channels, the above equation system can be expressed in matrix form as:
$$P = C\,s, \qquad (4)$$
where P = [p1 p2 ⋯ pN]^T is the vector of heterogeneous wideband measurements of a single pixel (which can be derived by matching the correspondences of the captured heterogeneous images), and the narrowband spectrum s = [s1 s2 ⋯ sN]^T can be computed by solving this linear system given C. Eq. (4) is the final formulation that forms the core of our spectral de-multiplexing reconstruction, and can be solved by minimizing the following objective function in a least-squares sense with respect to s:
$$\hat{s} = \arg\min_{s} \|C s - P\|_2^2, \qquad (5)$$
where ŝ denotes the spectrum estimated from the given measurements. Since illumination and surface spectra are generally continuous with few sharp edges in the real world, and surface spectra must always be non-negative, we introduce smoothness and non-negativity constraints into the objective function, yielding the optimization problem:
$$\hat{s} = \arg\min_{s \ge 0} \|C s - P\|_2^2 + \lambda \|\nabla s\|_2^2, \qquad (6)$$
where ∇ is the differential operator and λ is the weight of the smoothness constraint, set to 0.01 experimentally. A projected-gradient method is applied to minimize this problem: at each step of ordinary gradient descent, a projection keeps the search inside the feasible region. Considering the speed and convergence properties of different optimization methods, as discussed in detail in Sec. 4.1, we finally apply the conjugate gradient method to solve this optimization problem.
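As a hedged NumPy sketch of Eqs. (4)-(6), the snippet below builds a stand-in sensing matrix (random curves in place of the measured responses and transmissions), simulates the 24 wideband measurements of one pixel, and recovers the spectrum by projected gradient descent with the smoothness prior; the conjugate gradient variant used in our experiments differs only in how the search direction is chosen.

```python
import numpy as np

def demultiplex(C, P, lam=0.01, iters=5000):
    """Projected gradient descent for Eq. (6):
    minimize ||C s - P||^2 + lam ||D s||^2  subject to  s >= 0,
    where D is a forward-difference approximation of the gradient."""
    N = C.shape[1]
    D = np.eye(N, k=1)[:-1] - np.eye(N)[:-1]      # (N-1, N) difference operator
    L = 2 * (np.linalg.norm(C, 2) ** 2 + lam * 4)  # Lipschitz bound (||D||^2 <= 4)
    step = 1.0 / L
    s = np.zeros(N)
    for _ in range(iters):
        g = 2 * C.T @ (C @ s - P) + 2 * lam * D.T @ (D @ s)
        s = np.maximum(s - step * g, 0)           # projection onto s >= 0
    return s

rng = np.random.default_rng(1)
M, N = 8, 24                                      # 8 cameras -> 24 channels
cam = rng.uniform(0.1, 1.0, (3, N))               # stand-in RGB response curves
filt = rng.uniform(0.1, 1.0, (M, N))              # stand-in filter transmissions
C = np.vstack([cam[k] * filt[m] for m in range(M) for k in range(3)])

s_true = np.ones(N)                               # a smooth test spectrum
P = C @ s_true                                    # Eq. (4): wideband measurements
s_hat = demultiplex(C, P)                         # non-negative estimate of s
```

The row ordering of the `vstack` matches the 3(m−1)+k indexing of Eq. (2); the step size is derived from a Lipschitz bound so the objective decreases monotonically.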
3. Implementation details
3.1. System configuration
Our proposed prototype system consists of eight cameras in two rows. Each row has four cameras placed without any gap, so the horizontal baseline of the proposed system is exactly the width of a camera body (30 mm). Vertically, the two rows are spaced about 50 mm apart, leading to a 50 mm vertical baseline.
The stereo matching problem assumes all the cameras are placed in parallel. In practice, it is impossible to meet this requirement exactly, so system calibration and rectification are required to correct the system errors and place corresponding points in different cameras on the same epipolar line. In this paper, we use the camera calibration toolbox [32] to calibrate the intrinsic and extrinsic parameters. Knowing the stereo camera projection matrices, the rectifying transformation can be calculated by solving the relationships between the original and rectified projection matrices, with the line through the two camera centers as the baseline. We then rectify the captured images by applying the rectifying transformation to the camera array [33].
3.2. Responses calibration and filter selection
Response calibration
To make the problem solvable, we need to calibrate the spectral sensitivities of the camera array with the filters employed in our system, i.e., calibrate the sensing matrix C in Eq. (6), since the accuracy of C greatly depends on the detection accuracy of the eight wide-band filters' spectra. To estimate the spectral curves of the filters, we measured the spectral signals s0 of the Macbeth color chart using a hyperspectral camera (Prism-Mask Imaging Spectrometer [34]) and s1 of the same color chart with each wideband plastic filter placed in front of the hyperspectral camera. The spectral sensitivity of each filter, c_filter, can then be obtained as the element-wise ratio s1/s0.
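This per-wavelength division is straightforward; the sketch below (with hypothetical sample values, not measured spectra) shows the element-wise ratio, guarded against division by zero outside the chart's support.

```python
import numpy as np

def filter_transmission(s0, s1, eps=1e-8):
    """Transmission curve of one filter from two hyperspectral measurements:
    s0 = chart measured directly, s1 = chart measured through the filter."""
    return s1 / (s0 + eps)

# Hypothetical 4-sample spectra for illustration only.
s0 = np.array([2.0, 4.0, 8.0, 4.0])
s1 = np.array([1.0, 1.0, 2.0, 3.0])
c_filter = filter_transmission(s0, s1)   # -> [0.5, 0.25, 0.25, 0.75]
```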
The total error of sensing matrix C can be calculated as:
$$E_C \approx E_{s_0} + E_{s_1} \le 2E,$$
where E_{s_0} and E_{s_1} are the hyperspectral camera's relative measurement errors for s0 and s1, each bounded above by E, so that E_C will be no larger than 2E. That is to say, the total error E_C of the sensing matrix C is in a controllable range. Thus, the accuracy of the matrix C only depends on the detection accuracy of the hyperspectral camera, which can be further improved by using a high-sensitivity hyperspectral camera sensor. After testing, we chose a group of eight qualified plastic filters with the lowest condition number, whose spectral curves are plotted in Fig. 4. These eight spectral responses are well-conditioned and robust to small changes in the inputs, and thus provide enough variance to solve our problem in Eq. (6). The standard spectral response of the Point Grey GS3-U3-51S5C-C camera sensor (Fig. 4(c)) is provided on the Point Grey website and can be downloaded directly, as shown in Fig. 4(d).
Filter selection
The premise of our work is that correspondences in different cameras provide uncorrelated measurements of the spectrum of a single point, enabling full reconstruction of the spectral curve. Thus, the accuracy of reconstruction depends on the correlation between the spectral sensitivities of different cameras. The best scenario arises when the spectral responses of different filters are completely uncorrelated; the worst case is when the spectral sensitivities of different filters are almost identical. We analyze the spectral sensitivities of different filters to validate that they provide enough independent measurements of the incoming light spectrum. About 18 types of plastic filters were tested, and eight of them were selected by minimizing the condition number of the resulting sensing matrix C, as shown in Fig. 4(a). From Fig. 4(b), we can see that the selected eight filters have different transmission curves and thus can sense the spectrum accurately.
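Exhaustively scoring all C(18, 8) = 43758 subsets by the condition number of the resulting matrix is feasible; the sketch below uses a cheaper greedy variant on hypothetical transmission curves, and for brevity operates on filter rows directly rather than on the full camera-times-filter sensing matrix.

```python
import numpy as np

def select_filters(candidates, n_select=8):
    """Greedily grow a subset of filter rows, at each step adding the
    candidate that keeps the matrix condition number smallest."""
    chosen = [0]                                   # seed with the first filter
    while len(chosen) < n_select:
        best, best_cond = -1, np.inf
        for i in range(len(candidates)):
            if i in chosen:
                continue
            cond = np.linalg.cond(candidates[chosen + [i]])
            if cond < best_cond:
                best, best_cond = i, cond
        chosen.append(best)
    return chosen

rng = np.random.default_rng(2)
candidates = rng.uniform(0.05, 1.0, (18, 24))      # 18 hypothetical curves
selected = select_filters(candidates)              # indices of 8 filters
```

A greedy pass is not guaranteed to find the global optimum, but it mirrors the selection criterion: keep the sensing matrix well-conditioned so Eq. (6) is robust to measurement noise.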
3.3. Image registration
For two images in different rows and different columns, instead of computing the stereo matching directly, we introduce intermediary images to facilitate system calibration and computation. Given an arbitrary image pair in different rows and columns, there exist two intermediary images, each in the same row as one input image and the same column as the other. With the intermediary images, corresponding points in arbitrary image pairs can be matched via their common correspondences in the intermediary images. In this way, any image pair can be easily aligned and warped without rectification between images in different rows and columns, which is difficult in these cases since the epipolar lines are neither horizontal nor vertical.
By applying stereo matching between all image pairs, we derive disparity maps between all image pairs and can thus warp all the images to any of the eight camera views. In fact, to derive the whole multispectral light field, all the images are warped to all the views, so that eight 24-channel images are obtained. Note that if a user is only interested in a certain view, the remaining images need not be warped, since the spectral demultiplexing can be performed independently on a single 24-channel image.
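For rectified horizontal pairs, this warping reduces to a per-pixel horizontal shift by the disparity. The nearest-neighbor sketch below is a simplified stand-in (the sign convention and the rounding are our assumptions; subpixel warping would interpolate instead of rounding).

```python
import numpy as np

def warp_to_reference(src, disparity):
    """Warp `src` into the reference view: each reference pixel (y, x)
    samples the source at (y, x - d(y, x)), clamped to the image border."""
    h, w = src.shape[:2]
    xs = np.arange(w)[None, :] - np.rint(disparity).astype(int)
    xs = np.clip(xs, 0, w - 1)
    ys = np.arange(h)[:, None]
    return src[ys, xs]

src = np.arange(8, dtype=float).reshape(2, 4)
warped = warp_to_reference(src, np.ones((2, 4)))   # content shifts right by 1
```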
4. Experiments
4.1. Experiments on synthetic data
We first perform experiments on synthetic data to validate our algorithm and to analyze the effect of the number of cameras. We synthesized three groups of data using Autodesk 3ds Max [35] with off-the-shelf 3D models, arranging the simulated cameras 2 inches apart and in parallel so that their fields of view overlap completely about 10 feet from the array.
(a) Data synthesis
To derive the spectral images, the training-based algorithm [36] is applied to generate spectral images from RGB images directly. Then, the filter transmission curves and the camera response curves are used to simulate real captured images via Eq. (3). As shown in Fig. 5(a), three examples of multiview images simulated as wideband-filtered RGB images are presented.
(b) Image registration
We then perform image registration for images from different viewpoints. The registration results for the first view (top left) warped from all the other views are shown in Fig. 5(b), with the Peak Signal-to-Noise Ratio (PSNR) of each warped image given in its lower right corner. The registration quality is much better for nearer views than for farther ones: since we show the registration results for the top-left view, the images warped from the nearest views, i.e., Row 1, Column 2 and Row 2, Column 1, give much better results than the images warped from the farthest view, i.e., the bottom right. Besides, we note that the simulated scenes in the bottom row contain complex periodic details and large view changes, and thus yield worse registration results than the other examples.
(c) Optimization
Multispectral reconstruction is performed by minimizing the objective function in Eq. (6). We compared three commonly used optimization methods, namely the gradient descent method, the conjugate gradient method, and the least-squares QR factorization (LSQR) method, on the simulated CAT image. The comparison results are shown in Fig. 6, from which we can see that the three optimization methods converge to the same solution and hence share the same accuracy. Furthermore, the conjugate gradient method runs fastest and gradient descent slowest, so we finally chose the conjugate gradient method to solve this optimization problem, considering the trade-off between efficiency and complexity.
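A minimal sketch of the chosen solver: conjugate gradient applied to the normal equations of Eq. (5). The smoothness and non-negativity terms of Eq. (6) are omitted here for clarity; in the full method they enter through the regularized matrix and a projection step.

```python
import numpy as np

def cg_least_squares(C, P, iters=100, tol=1e-12):
    """Conjugate gradient on the normal equations (C^T C) s = C^T P."""
    A, b = C.T @ C, C.T @ P
    s = np.zeros_like(b)
    r = b - A @ s                 # residual of the normal equations
    d = r.copy()                  # initial search direction
    for _ in range(iters):
        rr = r @ r
        if rr < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)     # exact line search along d
        s = s + alpha * d
        r = r - alpha * Ad
        d = r + (r @ r / rr) * d  # conjugate direction update
    return s

C = np.eye(4) + 0.1                      # small well-conditioned example
s_true = np.array([0.2, 0.5, 0.9, 0.4])
s_hat = cg_least_squares(C, C @ s_true)  # recovers s_true
```

For an N-dimensional problem, CG converges in at most N iterations in exact arithmetic, which is why it outpaces plain gradient descent here.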
(d) Reflectance reconstruction
We reconstruct the 24-channel spectral reflectance for each view using the spectral de-multiplexing algorithm, and apply cubic spline interpolation to the standard trichromatic response curves of the camera sensor to obtain a 24-sample discrete distribution for each RGB channel. We then apply these three 24 × 1 distributions to each single spectral image to obtain the trichromatic channels, and hence convert it into a pseudo-color image. By reconstructing all eight spectral views, the final multispectral light field is recovered. As shown in Fig. 7, six selected spectral channels of the reconstructed results for the first (top-left) view are presented. A quantitative evaluation of the spectral reflectance reconstruction errors, both per channel and on average, in Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) is given in Table 1.
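The pseudo-color step can be sketched as a matrix projection: resample the camera's RGB response curves at the 24 reconstructed wavelengths and weight each spectral channel accordingly. Here np.interp (linear) stands in for the cubic spline interpolation described above, and the curves are synthetic Gaussians, not the real sensor responses.

```python
import numpy as np

def to_pseudo_rgb(cube, wavelengths, curve_wl, rgb_curves):
    """Project an (H, W, N) spectral cube to RGB with resampled responses."""
    weights = np.stack([np.interp(wavelengths, curve_wl, c)
                        for c in rgb_curves], axis=1)      # (N, 3)
    rgb = cube @ weights                                   # (H, W, 3)
    return rgb / (rgb.max() + 1e-12)                       # normalize to [0, 1]

wavelengths = np.linspace(400, 700, 24)                    # 24 channel centers
curve_wl = np.linspace(400, 700, 100)
# Synthetic bell-shaped R, G, B response curves (illustration only).
rgb_curves = [np.exp(-((curve_wl - mu) / 40.0) ** 2) for mu in (600, 540, 460)]
cube = np.ones((2, 2, 24))                                 # flat white spectrum
rgb = to_pseudo_rgb(cube, wavelengths, curve_wl, rgb_curves)
```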
To evaluate our algorithm's dependence on the spectral sensing matrix C, we add Gaussian noise N(µ, σ²) to the filter array's spectra to simulate an inaccurate matrix C. The PSNR of the simulated CAT image for different values of σ is evaluated in Fig. 8(a): the accuracy of the reconstructed multispectral images degrades gradually as the σ of the Gaussian noise increases. In Fig. 8(b), we illustrate both the reconstructed and ground-truth spectral reflectance curves of two selected points in Fig. 8(a), together with pseudo-color images at the spectra marked by dotted lines (586 nm for point A, 618 nm for point B). Although the overall reconstruction performance remains consistent when matrix C is inaccurate, some areas such as the cat's feet are greatly influenced by the noise, as shown in the enlarged rightmost image.
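The evaluation metric is standard; for completeness, a small PSNR helper of the kind used in this experiment (our own utility, not the full noise-perturbation pipeline):

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio (dB) between two images scaled to [0, peak]."""
    mse = np.mean((ref - est) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(3)
ref = rng.random((32, 32))
score_20 = psnr(ref, ref + 0.1)   # uniform 0.1 error -> MSE 0.01 -> 20 dB
```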
4.2. Experiments on light field datasets
We also test the proposed method on the publicly available light field image datasets captured by the Computer Graphics Laboratory at Stanford University using their multi-camera array [17]. As shown in Fig. 9(a), we use the training-based spectral reconstruction algorithm [36] to generate a multispectral light field from the existing RGB light field data. Fig. 9(b) shows the registration results for the first view (top left) warped from all the other views, with the PSNR of each warped image given in its lower right corner. Fig. 9(c) illustrates 100 × 100 reflectance patches in four selected spectral channels (514 nm, 528 nm, 610 nm and 634 nm) of the reconstructed results for all eight views. As can be seen, our method accurately reconstructs the multispectral images for each viewpoint.
4.3. Experiments on real data
For the experiments on real data, we captured color images of several indoor scenes under iodine-tungsten illumination using the prototype camera array system introduced in Section 2.1; the examples in this paper have a resolution of 1920 × 1080 and cover nonplanar objects.
We first compare the multispectral reconstruction results of our method with the ground-truth curves of the standard Macbeth color checker to verify our algorithms; part of the comparison is shown in Fig. 10. As can be seen, the proposed method reliably reconstructs 24 multispectral images of the classic color checker. It is worth noting that the illumination interference, which can be obtained by capturing a standard white board, should be removed before recovering the spectral reflectance of the scene. Fig. 11 illustrates various scenes under indoor iodine-tungsten illumination captured by our camera system. We select several typical points (such as red, blue and green points) from the images and illustrate their 24-channel spectral reflectance curves in the rightmost column of Fig. 11. Meanwhile, we also obtain the light field of the same scene simultaneously using these eight commercial digital cameras. Fig. 12 shows several reconstructed single-spectral reflectance patches chosen from the 24 reconstructed channels for all eight viewpoints, from which we can see that images obtained by different cameras are registered well, except for some planar regions where the disparity map may not be accurate enough. Thus, we have demonstrated that the light field and multispectral information can be obtained simultaneously using our heterogeneous camera array system and algorithms.
5. Conclusion
We have introduced a framework for affordable and easy-to-use multispectral light field imaging using a heterogeneous camera array system. By exploiting anaglyph theory, multispectral images can be reconstructed through spectral demultiplexing. The proposed system can flexibly increase the number of spectral channels by adding more cameras to the array. We have demonstrated the effectiveness and accuracy of our system on various synthetic and real examples.
A few issues deserve further exploration. For example, the accuracy of the estimated multispectral images depends heavily on the CNN-based disparity estimation used in this paper, and we hope to further optimize the post-processing of the stereo matching to improve performance. Accelerating the reconstruction is also on our list of future work.
Funding
National Natural Science Foundation of China (NSFC) (61422107, 61571215, 61627804, 61671236); National Science Foundation for Young Scholars of Jiangsu Province, China (BK20160634, BK20140610).
References and links
1. N. Gat, “Imaging spectroscopy using tunable filters: a review,” Proc. SPIE 4056(1), 50–64 (2000) [CrossRef] .
2. K. C. Lawrence, B. Park, W. R. Windham, and C. Mao, “Calibration of a pushbroom hyperspectral imaging system for agricultural inspection,” Trans. Ame. Soc. of Agril. Engg. 46(2), 513 (2003).
3. X. Cao, T. Yue, X. Lin, S. Lin, X. Yuan, Q. Dai, L. Carin, and D. J. Brady, “Computational Snapshot Multispectral Cameras: Toward dynamic capture of the spectral world,” IEEE Sig. Proc. Mag. 33(5), 95–108 (2016) [CrossRef] .
4. M. Descour and E. Dereniak, “Computed-tomography imaging spectrometer: Experimental calibration and reconstruction results,” Appl. Opt. 34(22), 4817 (1995) [CrossRef] [PubMed] .
5. D. J. Brady and M. E. Gehm, “Compressive imaging spectrometers using coded apertures,” Proc. SPIE 6246, 62460A (2006) [CrossRef] .
6. A. Mrozack, D. L. Marks, and D. J. Brady, “Coded aperture spectroscopy with denoising through sparsity,” Opt. Express 20(3), 2297–2309 (2012) [CrossRef] [PubMed] .
7. D. J. Brady, K. Choi, D. L. Marks, R. Horisaki, and S. Lim, “Compressive Holography,” Opt. Express 17(15), 13040–13049 (2009) [CrossRef] [PubMed] .
8. X. Lin, G. Wetzstein, Y. Liu, and Q. Dai, “Dual-coded compressive hyperspectral imaging,” Opt. Lett. 39(7), 2044–2047 (2014) [CrossRef] [PubMed] .
9. H. Rueda, H. Arguello, and G. R. Arce, “Dual-ARM VIS/NIR compressive spectral imager,” in Proceedings of IEEE International Conference on Image Processing (IEEE, 2006), pp. 2572–2576.
10. X. Cao, H. Du, X. Tong, et al., "A Prism-Mask System for Multispectral Video Acquisition," IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2423–2435 (2011) [CrossRef] .
11. J. Jia, K. J. Barnard, and K. Hirakawa, “Fourier Spectral Filter Array for Optimal Multispectral Imaging,” IEEE Trans. Image Process. 25(4), 1 (2016) [CrossRef] .
12. L. McMillan and G. Bishop, "Plenoptic modeling: an image-based rendering system," in Proceedings of Conference on Computer Graphics and Interactive Techniques (ACM, 1995), pp. 39–46.
13. M. Landy and J. Movshon, “The plenoptic function and the elements of early vision,” MIT Press1, 3–20 (1997).
14. M. Levoy, “Light fields and computational imaging,” Computer 39(8) 46–55 (2006) [CrossRef] .
15. S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen, "The Lumigraph," in Proceedings of Conference on Computer Graphics and Interactive Techniques (ACM, 1996), pp. 43–54.
16. D. Wood, D. Azuma, K. Aldinger, B. Curless, T. Duchamp, D. H. Salesin, and W. Stuetzle, "Surface light fields for 3D photography," in Proceedings of the Conference on Computer Graphics and Interactive Techniques (ACM, 2000), pp. 287–296.
17. B. Wilburn, N. Joshi, V. Vaish, E. V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, “High performance imaging using large camera arrays,” ACM Trans. on Graphics 24(3), 765–776 (2005) [CrossRef] .
18. E. H. Adelson and J. Y. Wang, “Single lens stereo with a plenoptic camera,” IEEE Trans. on Pattern Analysis and Machine Intelligence 14(2), 99–106 (1992) [CrossRef] .
19. L. Marc, N. Ren, A. Andrew, F. Matthew, and H. Mark, “Light field microscopy,” ACM Trans. on Graphics 25(3), 924–934 (2006) [CrossRef] .
20. J. P. Lewis, “Fast normalized cross-correlation,” Vision Interface 10(1), 120–123 (1995).
21. X. Shen, L. Xu, Q. Zhang, and J. Jia, “Multi-modal and multi-spectral registration for natural images,” Euro. Conf. Comput. Vision 2014, pp. 309–324.
22. E. T. Psota, J. Kowalczuk, M. Mittek, and L. C. Perez, “MAP Disparity Estimation Using Hidden Markov Trees,” in Proceedings of the IEEE International Conference on Computer Vision (IEEE, 2015), pp. 2219–2227.
23. J. Zbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 1592–1599.
24. J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” J. Mach. Learn. Res. 17, 1–32 (2016).
25. Y. Lecun, “Learning Invariant Feature Hierarchies,” Euro. Conf. Comput. Vision 2012, pp. 496–505.
26. M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 3061–3070.
27. M. Menze, C. Heipke, and A. Geiger, "Joint 3D estimation of vehicles and scene flow," ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. II-3/W5, 427–434 (2015).
28. H. Hirschmüller and D. Scharstein, “Evaluation of cost functions for stereo matching,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 2007), pp. 1–8.
29. D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling, "High-resolution stereo datasets with subpixel-accurate ground truth," German Conference on Pattern Recognition 8753, 31–42 (2014).
30. J. I. Park, M. H. Lee, M. D. Grossberg, and S. K. Nayar, “Multispectral imaging using multiplexed illumination,” in Proceedings of IEEE International Conference on Computer Vision (IEEE, 2007), pp. 1–8.
31. PointGrey, “Grasshopper3 5.0 MP Color USB3 Vision,” https://www.ptgrey.com/grasshopper3-50-mp-color-usb3-vision-sony-pregius-imx250.
32. J. Heikkilä and O. Silvén, "A four-step camera calibration procedure with implicit image correction," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 1997), pp. 1106–1112.
33. Y. S. Kang, C. Lee, and Y. S. Ho, “An efficient rectification algorithm for multi-view images in parallel camera array,” 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video (IEEE, 2008), pp. 61–64.
34. C. Ma, X. Cao, X. Tong, Q. Dai, and S. Lin, “Acquisition of high spatial and spectral resolution video with a hybrid camera system,” Int. J. Comput. Vis. 110(2), 141–155 (2014) [CrossRef] .
35. Autodesk 3ds Max, “3D computer graphics program”, http://www.autodesk.com/products/3ds-max/overview.
36. R. M. H. Nguyen, D. K. Prasad, and M. S. Brown, "Training-Based Spectral Reconstruction from a Single RGB Image," Euro. Conf. Comput. Vision 2014, pp. 186–201.