
Learning based compressive snapshot spectral light field imaging with RGB sensors

Open Access

Abstract

The application of multidimensional optical sensing technologies, such as the spectral light field (SLF) imager, has become increasingly common in recent years. SLF sensors provide one-dimensional spectral, two-dimensional spatial, and two-dimensional angular measurements. Spatial-spectral and angular data are essential in a variety of fields, from computer vision to microscopy. However, conventional SLF sensors require beam splitters or expensive camera arrays. This paper describes a low-cost compressive snapshot SLF imaging method based on an RGB light field camera. Inspired by the compressive sensing paradigm, the SLF is reconstructed from a single measurement of an RGB light field camera via a proposed U-shaped neural network with multi-head self-attention and unparameterized Fourier transform modules. The method captures images with a spectral resolution of 10 nm, an angular resolution of 9 × 9, and a spatial resolution of 622 × 432 within the spectral range of 400 to 700 nm. It offers an alternative, low-cost approach to SLF imaging.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Over the last two decades, significant strides have been made in the field of multidimensional optical imaging. Conventional digital imaging models project a three-dimensional (3D) real-world scene onto a two-dimensional (2D) sensor, achieving high spatial sampling in the 2D spatial dimensions and RGB color imaging in the one-dimensional (1D) spectral dimension; the sampling capability for other dimensions is greatly limited. Multidimensional optical imaging often requires capturing all seven dimensions of the plenoptic function [1]: the 2D spatial intensity distribution ($x,y$), 1D depth ($z$), 2D propagation angles ($u,v$), 1D spectral wavelength ($\lambda$), and 1D time ($t$). With the rapid development of optical instruments and computational capabilities, there has been extensive exploration beyond traditional digital imaging in dimensions such as depth, spectrum, and angle, including depth/3D imaging [2–4], hyperspectral imaging [5–8], and light field (LF) imaging [9]. The trend in computational imaging is to integrate higher-dimensional optical information while maintaining the respective resolutions. Ongoing research covers areas such as LF super-resolution [10], hyperspectral 3D imaging [11], and SLF imaging [12,13]. Commercial LF cameras, such as the Lytro camera, have become increasingly convenient in daily use, opening up new opportunities and applications in solving complex computer vision tasks [14–16]. LF cameras capture spatial (2D) and angular (2D) information of the entire field of view, allowing for further applications such as depth estimation and geometric reconstruction. At the same time, hyperspectral imaging technology is also rapidly advancing. Numerous methods have been proposed to directly reconstruct hyperspectral information from widely available RGB images [6,17,18]. These methods do not require cumbersome imaging systems or expensive spectral cameras, while still maintaining high spectral resolution and imaging performance. Integrating LF and spectral information is a new research direction in computational imaging. The prevailing techniques for acquiring high-dimensional LF and spectral information rely predominantly on scanning and snapshot strategies. Scanning approaches use microlens arrays combined with tunable filters to scan the spectral dimension [19]. Snapshot strategies, which have grown in popularity, admit various implementations. For example, Hua et al. achieved ultra-compact SLF imaging by using a horizontally dispersed microlens array and recorded the 5D SLF with a single snapshot using a monochrome sensor [13]. Xiong et al. designed a hybrid camera system that directs incident light from the scene into a Lytro camera and a coded aperture snapshot spectral imaging (CASSI) system through a beam splitter, described the reconstruction of the high-spectral plenoptic function as a sparsity-constrained optimization problem, and combined undersampled measurements with a self-learning dictionary [12]. These methods can suffer from intricate designs, high costs, and limited portability, inhibiting their widespread application. Consequently, multidimensional techniques are often realized by combining conventional imaging modalities, which achieve excellent performance but require the integration of multiple optical components, such as lenses, gratings, masks, and prisms. Overcoming these limitations is the primary challenge facing the practical application of multidimensional imaging systems.

To overcome the above limitations, a compressive snapshot SLF imaging strategy with a commercial RGB LF camera is developed in this paper by incorporating a U-shaped transformer Fourier network (UTFNet). The structure of this paper is as follows. In Sect. 3.1, we present the SLF imaging model in detail. Sect. 3.2 analyzes the sub-aperture images (SAIs): we add noise consistent with the SAI noise distribution and apply vignetting effects to the NTIRE [20] and ICVL [21] benchmark datasets to train UTFNet for the SLF reconstruction task. In Sect. 3.3, the UTFNet architecture and training process are described, and the multi-head self-attention and unparameterized Fourier transform modules are elaborated. In Sect. 4, quantitative results and visual comparisons are provided. The performance of the network is also tested on real scenes; the results show that the model generalizes well to LF data and outperforms other deep learning methods.

2. Related work

RGB to hyperspectral imaging. Deep learning approaches have become a valuable tool for reconstructing hyperspectral data from RGB images. Because recovering hyperspectral information from such images is ill-posed, these models frequently exploit prior knowledge, often by training on high-dimensional datasets, which constrains the possible solutions and yields more accurate reconstructions. By utilizing complex neural network architectures, deep learning approaches can model the nonlinear relationships between RGB images and the corresponding hyperspectral images, effectively learning mappings from a lower-dimensional RGB space to a higher-dimensional hyperspectral space. This not only mitigates the ill-posed nature of the problem but also enables good performance in challenging scenarios. Convolutional neural networks (CNNs) [17], generative adversarial networks (GANs) [22], and vision transformers [18] are widely used for these tasks.

Light field imaging. Sensors can only measure two dimensions of a scene at a given moment. To obtain a 4D LF, multiple images need to be captured along the angular dimension. Existing LF acquisition methods can be categorized into three basic types [14,23,24]: multi-sensor capture, time-sequential capture and multiplexed imaging. Now, LF imaging has been successfully commercialized and is widely used by both general consumers and in laboratories.

Spectral light field imaging. The trend in computational imaging is to integrate higher-dimensional optical information, and the correlations between different optical dimensions make this development possible. SLF imaging is a new direction; current methods include scanning the spectral dimension using a micro-lens array combined with tunable filters [19]. Xiong et al. used a beam splitter to evenly divide the incident light from the scene between a LF camera and a CASSI system, where the LF camera records the 4D LF information and the CASSI system encodes a 2D image with 27 spectral bands [12]. To reconstruct the final SLF information, a self-learning dictionary was trained to leverage the strong correlation between the angular and spectral dimensions.

3. Method

3.1 Spectral light field imaging model

The 5D SLF is represented as $f(x, y, u, v,\lambda )$, where $(x, y)$ are the spatial coordinates, $(u, v)$ are the angular coordinates, and $\lambda$ is the spectral wavelength. The RGB LF can be expressed as

$$\mathcal{L}_{k}(u, v, x, y) = \int_{\lambda_{m}}^{\lambda_{M}} f(u, v, x, y, \lambda)\omega_k(\lambda) d{\lambda},$$
where $k = 1, 2, 3$ corresponds to blue, green, and red channels, respectively. $\omega _k(\lambda )$ represents the spectral response function to each channel of RGB LF camera. When the spectral response functions are unknown, CIE color matching functions are used for simulation [25]. $\lambda _{m}$ and $\lambda _{M}$ are the minimal and maximal central wavelengths of the discretized wavebands, respectively. The discretization form of Eq. (1) can be expressed as
$$\mathcal{L}_{k}(u, v, x, y) = \Sigma_{l=1}^{l=N_{\lambda}} f(u, v, x, y, l)\omega_{k,l},$$
where $l$ and $N_{\lambda }$ are the index and total number of the wavebands, respectively. For a 5D SLF, an SAI can be obtained by sampling at a fixed coordinate $(u^{\ast }, v^{\ast })$ and defined as
$$\mathcal{L}_{k}(u^{{\ast}}, v^{{\ast}}, x, y) = \Sigma_{l=1}^{l=N_{\lambda}} f(u^{{\ast}}, v^{{\ast}}, x, y, l)\omega_{k,l}.$$

The simplest discrete model of Eq. (3) is given by

$$\mathcal{L}_{k,\gamma, \beta}(m, n) = \Sigma_{l=1}^{l=N_{\lambda}} f(\gamma, \beta, m, n, l)\omega_{k,l},$$
where $\gamma =1,2,\ldots,S_u$ and $\beta =1,2,\ldots,S_v$ are the horizontal and vertical angular indices of the LF with respect to $(u^{\ast }, v^{\ast })$. $S_u$ and $S_v$ are the numbers of horizontal and vertical views, respectively. $m=1,2,\ldots,N_x$ and $n=1,2,\ldots,N_y$ are the horizontal and vertical indices of the spatial location with respect to $(x, y)$. $N_x$ and $N_y$ are the column and row numbers of the digital image of a given SAI. The matrix form of the spectral sensitivity, $\Omega _\lambda \in \mathbb {R}^{3 \times N_{\lambda } }$, is given by $[\omega _{1,1},\omega _{1,2},\ldots,\omega _{1,N_\lambda };\omega _{2,1},\omega _{2,2},\ldots,\omega _{2,N_\lambda };\omega _{3,1},\omega _{3,2},\ldots,\omega _{3,N_\lambda }]$. The vector form of the spectral information of a particular point $(m,n)$ is denoted as $\mathbf {F}_{\gamma,\beta,m,n}\in \mathbb {R}^{N_{\lambda }\times 1}$ and expressed as $[f(\gamma, \beta, m, n, 1),f(\gamma, \beta, m, n, 2),\ldots, f(\gamma, \beta, m, n, N_\lambda )]^{T}$. The superscript $T$ represents the transpose operation. Thereby, the matrix form of Eq. (4) is
$$\mathbf{L}_{\gamma, \beta,m,n} = \Omega_\lambda\mathbf{F}_{\gamma,\beta,m,n},$$
where $\mathbf {L}_{\gamma, \beta,m,n}\in \mathbb {R}^{3 \times 1}$ is the RGB value of pixel $(m,n)$. For all pixels of the $(\gamma,\beta )^{th}$ LF, the 3D spectral datacube, $\mathbf {F}_{\gamma,\beta }\in \mathbb {R}^{N_x \times N_y \times N_{\lambda }}$, can be vectorized into $\mathbf {F}_{\gamma,\beta }^{\star }\in \mathbb {R}^{N_xN_yN_{\lambda }\times 1}$, which is given by $[\mathbf {F}_{\gamma,\beta,1,1}^T, \mathbf {F}_{\gamma,\beta,1,2}^T, \ldots, \mathbf {F}_{\gamma,\beta,1,N_x}^T, \ldots, \mathbf {F}_{\gamma,\beta,N_y,1}^T, \mathbf {F}_{\gamma,\beta,N_y,2}^T, \ldots, \mathbf {F}_{\gamma,\beta,N_y,N_x}^T]^T$. Similarly, the RGB measurement of the $(\gamma,\beta )^{th}$ LF is denoted as $\mathbf {L}_{\gamma, \beta }\in \mathbb {R}^{N_x \times N_y \times 3}$ and vectorized into $\mathbf {L}_{\gamma, \beta }^{\star }\in \mathbb {R}^{3N_xN_y \times 1}$. Thereby, the imaging model of the field can be determined as
$$\mathbf{L}_{\gamma, \beta}^{{\star}} = \Omega\mathbf{F}_{\gamma,\beta}^{{\star}},$$
where $\Omega \in \mathbb {R}^{3N_xN_y \times N_xN_yN_{\lambda }}$ can be calculated by $\mathbf {I} {\scriptstyle \bigotimes } \Omega _\lambda$. $\mathbf {I}$ is an identity matrix of size $N_xN_y$. ${\scriptstyle \bigotimes }$ denotes Kronecker product. In the actual imaging process, measurement noise is present, and the final imaging model is given by
$$\mathbf{L}_{\gamma, \beta}^{{\star}} = \Omega\mathbf{F}_{\gamma,\beta}^{{\star}} + E,$$
where $E \in \mathbb {R}^{3N_xN_y \times 1}$ represents the noise vector. Because Eq. (7) is an under-determined problem, it is difficult to obtain an accurate estimate of the SLF $\mathbf {F}_{\gamma,\beta }^{\star }$ via traditional methods such as the pseudo-inverse. It is known that most natural signals and images can be sparsely represented on some bases [26]. According to compressive sensing theory [27], $\mathbf {F}_{\gamma,\beta }^{\star }$ can instead be recovered by solving the following optimization problem
$$\mathbf{\tilde{F}}_{\gamma,\beta}^{{\star}}= \mathrm{argmin}\{\Vert \mathbf{L}_{\gamma, \beta}^{{\star}} - \Omega\mathbf{\tilde{F}}_{\gamma,\beta}^{{\star}}\Vert^2_2+\tau \Vert \mathbf{\tilde{F}}_{\gamma,\beta}^{{\star}} \Vert_1\},$$
where the $\Vert \cdot \Vert _2$ term is $l_2$ norm of ($\mathbf {L}_{\gamma, \beta }^{\star } - \Omega \mathbf {\tilde {F}}_{\gamma,\beta }^{\star }$), the $\Vert \cdot \Vert _1$ term is $l_1$ norm of estimation value $\mathbf {\tilde {F}}_{\gamma,\beta }^{\star }$, and $\tau$ is a regularization parameter; $E$ is omitted for simplification. Several methods have been suggested to address this optimization problem, including the Two-step Iterative Shrinkage/Thresholding (TwIST) algorithm [28], the Gradient Projection for Sparse Reconstruction (GPSR) algorithm [29], and various learning based approaches [17,18].
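As an illustration of Eqs. (6)–(8), the following minimal Python sketch builds the sensing matrix $\Omega$ from a placeholder spectral sensitivity $\Omega_\lambda$, simulates a noisy RGB measurement on a tiny patch, and recovers the spectral cube with a plain ISTA iteration (soft thresholding under an assumed identity sparsifying basis) rather than TwIST or GPSR. All sizes and values are illustrative and do not correspond to the calibrated Lytro response.

```python
import numpy as np

# Sketch of Eqs. (6)-(8): forward model L = Omega F and an ISTA-style l1
# reconstruction. Sizes are kept tiny for illustration; the sensitivity
# matrix is a random placeholder, not the measured Lytro Illum response.
N_x, N_y, N_lam = 8, 8, 31          # small patch, 31 bands (400-700 nm)
rng = np.random.default_rng(0)

Omega_lam = np.abs(rng.normal(size=(3, N_lam)))      # placeholder for Omega_lambda
Omega_lam /= Omega_lam.sum(axis=1, keepdims=True)    # normalize each channel response

Omega = np.kron(np.eye(N_x * N_y), Omega_lam)        # Eq. (6): Omega = I (x) Omega_lambda
F_true = np.abs(rng.normal(size=(N_x * N_y * N_lam, 1)))          # vectorized spectral cube
L = Omega @ F_true + 0.01 * rng.normal(size=(3 * N_x * N_y, 1))   # Eq. (7) with noise

# ISTA with soft thresholding (identity sparsifying basis assumed); the
# recovery here is only illustrative of the formulation, not of achievable accuracy.
tau, step = 1e-3, 1.0 / np.linalg.norm(Omega, 2) ** 2
F_hat = np.zeros_like(F_true)
for _ in range(200):
    grad = Omega.T @ (Omega @ F_hat - L)
    z = F_hat - step * grad
    F_hat = np.sign(z) * np.maximum(np.abs(z) - tau * step, 0.0)   # prox of tau*||.||_1

print("relative error:", np.linalg.norm(F_hat - F_true) / np.linalg.norm(F_true))
```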

3.2 Sub-aperture image analysis

Figure 1 provides examples of LF images captured by the Lytro camera, including a set of SAIs with slightly different views extracted from the raw image. Compared to traditional images with only 2D spatial coordinates, the latest advancements in micro-lens arrays enable the instantaneous capture of the LF, leveraging both the spatial and angular information of the scene (Fig. 1(b)). In a LF camera, the LF is acquired by incorporating a micro-lens array between the sensor and the main lens. This concept was initially described by Lippmann [30] and modernized by Ng et al. [24], leading to the successful commercialization of plenoptic cameras such as the Lytro camera, which has gained significant popularity. As shown in Fig. 2, the main lens focuses the target scene onto the micro-lens array, and the micro-lens array separates the converging light rays onto the sensor. However, the micro-lens array also obstructs light transmission, leading to increased photon noise. In addition, the vignetting effect causes a radial decrease in the brightness of each micro-lens, attributed to the foreshortening of light rays inclined with respect to the optical axis and the obstruction of light by the aperture or the lens edges.

Fig. 1. Lytro camera and a measurement: (a) Lytro camera; (b) the raw measurement; (c) SAIs.

Fig. 2. Optical configurations for plenoptic cameras.

Noise. Previous studies on denoising LF images [31,32] have assumed additive white Gaussian noise (AWGN) with a signal-independent standard deviation as the noise model. AWGN arises primarily during image acquisition and transmission, and lens-based LF cameras use conventional imaging sensors, making AWGN an appropriate noise model for LF images. Additionally, the micro-lens array hinders light transmission, resulting in insufficient photon counts. In such cases, the dominant noise is Poisson noise, which, as signal-dependent noise, conforms to the noise characteristics of the Lytro camera. The noisy SAI of the $(\gamma, \beta )^{th}$ light field is therefore modeled as

$$\mathbf{L}_{\gamma, \beta} = \mathcal{\hat{L}}_{\gamma, \beta} + \alpha \mathcal{N}_g + \mu \mathcal{N}_p,$$
where $\mathbf {L}_{\gamma, \beta }$, $\mathcal {\hat {L}}_{\gamma, \beta }$, $\mathcal {N}_g \in \mathbb {R}^{N_x \times N_y\times 3}$, and $\mathcal {N}_p \in \mathbb {R}^{N_x \times N_y\times 3}$ are the noisy SAI, the noise-free SAI, the additive white Gaussian noise, and the Poisson noise, respectively. $\alpha$ and $\mu$ are quantitative parameters. Accordingly, we chose additive white Gaussian and Poisson noises as the noise models to generate the training dataset in Sect. 4.1.
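A minimal sketch of Eq. (9) is given below: it corrupts a clean SAI with additive white Gaussian noise and a zero-mean, signal-dependent Poisson term. The parameter values and the photon scale are illustrative assumptions, not calibrated Lytro values.

```python
import numpy as np

def add_lytro_like_noise(sai, alpha=0.01, mu=0.02, peak=255.0, rng=None):
    """Sketch of Eq. (9): corrupt a clean SAI (H x W x 3, values in [0, 1]) with
    additive white Gaussian noise and a signal-dependent Poisson term.
    alpha, mu, and peak are illustrative values, not calibrated Lytro parameters."""
    rng = np.random.default_rng() if rng is None else rng
    gaussian = rng.normal(0.0, 1.0, sai.shape)            # N_g
    poisson = rng.poisson(sai * peak) / peak - sai        # zero-mean Poisson term N_p
    noisy = sai + alpha * gaussian + mu * poisson
    return np.clip(noisy, 0.0, 1.0)
```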

Vignetting. To assess vignetting, we extracted SAIs from 10 sets of diffuse-reflectance whiteboard photos under varied illuminations. As shown in Fig. 3, we averaged the 10 sets of SAIs and fitted the intensity of each SAI to a convex surface, so that each SAI has its own intensity distribution. Based on these findings, we can simulate the Lytro camera's vignetting effect when constructing the training dataset in Sect. 4.1. We apply different intensity distributions to different scenes, aiming to better simulate the actual imaging behavior of the Lytro camera.
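A simple way to emulate this effect is to modulate each SAI with a smooth, radially decreasing intensity surface, as in the sketch below. The quadratic falloff and its strength are hypothetical stand-ins; in practice the surface would be the one fitted to the whiteboard SAIs of Fig. 3.

```python
import numpy as np

def vignetting_mask(h, w, strength=0.35, cx=0.5, cy=0.5):
    """Sketch of a per-SAI vignetting model: a convex intensity surface that falls
    off radially. The quadratic form and 'strength' are illustrative assumptions."""
    ys, xs = np.mgrid[0:h, 0:w]
    r2 = (xs / (w - 1) - cx) ** 2 + (ys / (h - 1) - cy) ** 2
    return 1.0 - strength * r2 / r2.max()

def apply_vignetting(sai, strength=0.35):
    """Apply the vignetting mask to an SAI of shape (H, W, 3)."""
    mask = vignetting_mask(sai.shape[0], sai.shape[1], strength)
    return sai * mask[..., None]      # broadcast over the RGB channels
```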

Fig. 3. Vignetting effect of the diffusing whiteboard: (a) SAIs of the diffuse whiteboard; (b) its intensity distribution maps.

3.3 Spectral reconstruction from SAIs

According to Eq. (8), recovering hyperspectral information from LF SAIs is a many-to-one problem. The key step is to obtain the mapping from the LF image to the SLF image. By solving the compressive imaging inverse problem, it is possible to fully reconstruct the 5D SLF from each channel of the LF SAIs. The block diagram of the proposed method is depicted in Fig. 4. The LF SAIs are composed of RGB images from multiple viewpoints. Because of the significant dimensional gap, recovering the complete 5D SLF from the under-sampled RGB LF SAIs is a severely ill-posed problem that has received little attention in previous research. Inspired by the success of deep learning in natural image restoration, we propose a Transformer-based architecture to learn an end-to-end mapping from RGB images to HSIs. Specifically, the 5D SLF can be regarded as a concatenation of per-view spectral datacubes, so we only need to recover the corresponding HSI from the RGB image of each sub-view to obtain the final 5D SLF.

Fig. 4. Overview of the proposed system. The LF camera (Lytro) captures RGB LF SAIs, which are used as inputs to the pre-trained recovery network to reconstruct the 5D SLF.

Network architecture. First, we introduce the overall pipeline and hierarchical structure for spectral reconstruction; then we describe its fundamental components in detail. As shown in Fig. 5, the proposed network follows a U-shaped architecture, similar to the original U-Net. It incorporates skip connections between the encoder and decoder, with the Transformer module serving as the basic building block. Specifically, the network takes an RGB image $\mathbf {L} \in \mathbb {R}^{N_x \times N_y \times 3}$, i.e., an SAI, as input. The image first passes through a $3 \times 3$ convolutional layer for initial feature extraction, producing a low-level feature map $\mathbf {X}_{0} \in \mathbb {R}^{N_x \times N_y \times N_{\lambda }}$. Following the U-shaped structure, the features $\mathbf {X}_{0}$ are passed through three encoder stages, each consisting of a spectral-wise attention block (SAB) and a Fourier transform block (FTB). The decoder adopts a symmetric structure, with upsampling performed by PixelShuffle operations. To prevent information loss during downsampling, skip connections are used between the encoder and decoder, following the design philosophy of the U-Net framework.
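For concreteness, the PyTorch sketch below mirrors the U-shaped layout just described (3 × 3 stem, encoder stages, PixelShuffle upsampling, and skip connections). The stage internals are simple residual placeholders for the SAB and FTB blocks defined below, the number of stages is reduced for brevity, and the channel widths and module names are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class StageBlock(nn.Module):
    """Placeholder for one encoder/decoder stage (SAB followed by FTB in the paper).
    Here it is just a residual 3x3 convolution so the skeleton runs end to end."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(dim, dim, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class UShapedSketch(nn.Module):
    """U-shaped layout with skip connections, loosely following Fig. 5.
    Channel widths and stage count are assumptions."""
    def __init__(self, in_ch=3, out_ch=31, dim=31):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, dim, 3, padding=1)          # initial 3x3 feature extraction
        self.enc1, self.enc2 = StageBlock(dim), StageBlock(dim * 2)
        self.down1 = nn.Conv2d(dim, dim * 2, 4, stride=2, padding=1)
        self.down2 = nn.Conv2d(dim * 2, dim * 4, 4, stride=2, padding=1)
        self.bottleneck = StageBlock(dim * 4)
        self.up2 = nn.Sequential(nn.Conv2d(dim * 4, dim * 2 * 4, 1), nn.PixelShuffle(2))
        self.dec2 = StageBlock(dim * 2)
        self.up1 = nn.Sequential(nn.Conv2d(dim * 2, dim * 4, 1), nn.PixelShuffle(2))
        self.dec1 = StageBlock(dim)
        self.head = nn.Conv2d(dim, out_ch, 3, padding=1)

    def forward(self, x):
        x0 = self.stem(x)
        e1 = self.enc1(x0)
        e2 = self.enc2(self.down1(e1))
        b = self.bottleneck(self.down2(e2))
        d2 = self.dec2(self.up2(b) + e2)       # skip connection
        d1 = self.dec1(self.up1(d2) + e1)      # skip connection
        return self.head(d1)

# rgb = torch.rand(1, 3, 128, 128); hsi = UShapedSketch()(rgb)   # -> (1, 31, 128, 128)
```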

Fig. 5. Network structure of UTFNet with a base network and three modules: SAB, C-MSA, and FTB.

Fourier transform block. Replacing the self-attention sublayers with a simple linear transformation that mixes input tokens can significantly speed up the Transformer encoder architecture at a limited cost in precision [33]. In our model, we use a standard unparameterized Fourier transform as a substitute for the self-attention sublayer in the Transformer encoder. Lee-Thorp et al. have demonstrated that this approach effectively improves the pre-training and inference speed of the model on GPUs while maintaining accuracy [34]. This allows us to use a deeper model without increasing the number of parameters and without impacting the model's inference speed.

The Fourier transform decomposes a function into its constituent frequencies. There are two main ways to compute the discrete Fourier transform (DFT): the fast Fourier transform (FFT) and direct matrix multiplication. Research has shown that, for all sequence lengths, the FFT is faster than DFT matrix multiplication on GPUs [34]. The FTB is constructed by substituting FFT computations for the conventional multi-head self-attention layer in the Transformer block while keeping the other layers unchanged. As depicted in Fig. 5, this module consists of an FFT sublayer and a feed-forward sublayer. A LayerNorm sublayer is applied between the FFT sublayer and the feed-forward sublayer, and a residual connection is applied after each module. This structural pattern is frequently employed in vision transformer networks. The FFT sublayer is defined as

$$y = {\rm Real}(\mathcal{F}_{seq}(\mathcal{F}_{hidden}(\mathbf{X}_0))),$$
where $\mathcal {F}_{hidden}$ and $\mathcal {F}_{seq}$ are the Fourier transforms along the hidden and sequence dimensions, and ${\rm Real}(\ast )$ denotes retaining only the real part. As indicated by Eq. (10), the FFT sublayer keeps only the real part of the result; therefore, we do not need to modify the feed-forward sublayer or the output layer to handle complex numbers, since the imaginary part is discarded. The standard FFT algorithm used here is the Cooley-Tukey algorithm [35], which recursively re-expresses the DFT of a sequence of length $N = N_{1}N_{2}$ as $N_{1}$ smaller DFTs of size $N_{2}$, reducing the computational complexity to $\mathcal {O}(N {\rm log} N)$.
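In PyTorch, the FNet-style mixing of Eq. (10) reduces to two FFT calls and a real-part projection; a minimal sketch follows. The placement of the LayerNorm and residual connection follows the common pattern described above and is an assumption about the exact implementation.

```python
import torch
import torch.nn as nn

class FFTSublayer(nn.Module):
    """Eq. (10): Fourier mixing along the hidden and sequence dimensions, as in
    FNet [34]. The mixing itself has no learnable parameters; only the real part
    is kept so downstream layers stay real-valued."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                 # x: (batch, sequence, hidden)
        mixed = torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
        return x + self.norm(mixed)                       # residual + LayerNorm (assumed ordering)
```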

Channel multi-head self-attention. The SAIs obtained by the LF camera are a set of RGB images, so our model aims to reconstruct the HSI from the information captured by three color channels. RGB is not simply the output of three monochrome channels: because there are no infinitely narrow-band filters, the three channel filters are broadband by design. RGB images therefore map the information from three broadband filters onto a three-channel image, which inherently contains broadband spectral information. Based on this principle, we introduce channel multi-head self-attention (C-MSA) into our spectral transformer. C-MSA differs from the original multi-head self-attention (MSA) [36]. By exploiting the inter-dependencies between channel maps, we can emphasize interdependent feature maps and improve the semantic representation of features along the channel dimension. Therefore, we establish a channel attention module to learn long-range dependencies between channels.

As shown in Fig. 5, let $\mathbf {X} \in \mathbb {R}^{N_x \times N_y \times D}$ be the input to the C-MSA module, which is reshaped into tokens $\mathbf {X} \in \mathbb {R}^{N_xN_y \times D}$, where $D$ is the number of channels. Then, $\mathbf {X}$ is subjected to an affine transformation to obtain the query $\mathbf {Q} \in \mathbb {R}^{N_xN_y \times D}$, key $\mathbf {K} \in \mathbb {R}^{N_xN_y \times D}$, and value $\mathbf {V} \in \mathbb {R}^{N_xN_y \times D}$. The affine transformation is defined as

$$\mathbf{Q} = \mathbf{X}\mathbf{W}_{\mathbf{q}}, \mathbf{K} = \mathbf{X}\mathbf{W}_{\mathbf{k}}, \mathbf{V} = \mathbf{X}\mathbf{W}_{\mathbf{v}},$$
where $\mathbf {W}_{\mathbf {q}}$, $\mathbf {W}_{\mathbf {k}}$, and $\mathbf {W}_{\mathbf {v}}$ $\in \mathbb {R}^{D \times D}$ are learnable parameters. We ignore their biases to simplify the computational process. Subsequently, we feed $\mathbf {Q}$, $\mathbf {K}$, and $\mathbf {V}$ into $J$ individual heads, each with a dimension of $d_{h} = D/J$. In contrast to the original MSAs, the design along the spectral dimension treats each spectral representation as a token and computes self-attention for each head
$$\mathbf{A}_{j} = {\rm softmax}(\sigma _{j} \mathbf{K}^{T}_{j}\mathbf{Q}_{j}), head_{j} = \mathbf{V}_{j}\mathbf{A}_{j}.$$

Due to the significant variation of spectral intensity with wavelength, we introduce a learnable scaling parameter $\sigma_{j} \in \mathbb {R}^{1}$ for each head. Finally, we concatenate the outputs of the $J$ heads and perform a linear projection

$$\mathrm{C\text{-}MSA}(\mathbf{X}) = \mathop{\text{Concat}}_{j=1}^{J} (head_{j})\, \mathbf{W},$$
where $\mathbf {W} \in \mathbb {R}^{D \times D}$ is a learnable parameter matrix. The Concat operation concatenates the $head_j$ outputs along the spectral dimension.
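The sketch below is one plausible PyTorch realization of Eqs. (11)–(13): each flattened channel map acts as a token, the per-head attention matrix has size $d_h \times d_h$, and $\sigma_j$ is a learnable scale. The softmax normalization axis and the absence of biases are assumptions.

```python
import torch
import torch.nn as nn

class ChannelMSA(nn.Module):
    """Sketch of Eqs. (11)-(13): self-attention along the channel (spectral)
    dimension. Each channel map, flattened over space, acts as a token; the
    attention matrix is d_h x d_h per head, scaled by a learnable sigma_j."""
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_h = heads, dim // heads
        self.to_q = nn.Linear(dim, dim, bias=False)   # W_q
        self.to_k = nn.Linear(dim, dim, bias=False)   # W_k
        self.to_v = nn.Linear(dim, dim, bias=False)   # W_v
        self.sigma = nn.Parameter(torch.ones(heads))  # sigma_j in Eq. (12)
        self.proj = nn.Linear(dim, dim, bias=False)   # W in Eq. (13)

    def forward(self, x):                             # x: (B, N_x*N_y, D) tokens
        b, n, d = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # split into heads: (B, heads, n, d_h)
        q, k, v = (t.reshape(b, n, self.heads, self.d_h).transpose(1, 2) for t in (q, k, v))
        # A_j = softmax(sigma_j * K_j^T Q_j), a d_h x d_h channel attention map per head;
        # normalizing over the key axis is an assumption.
        attn = torch.softmax(self.sigma.view(1, -1, 1, 1) * (k.transpose(-2, -1) @ q), dim=-2)
        out = v @ attn                                # head_j = V_j A_j, shape (B, heads, n, d_h)
        out = out.transpose(1, 2).reshape(b, n, d)    # concatenate heads along the channel axis
        return self.proj(out)
```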

Feed-forward module. As highlighted in previous works [37,38], the standard feed-forward module in the Transformer has a limited ability to exploit local context. To overcome this issue, we follow recent work [37] and incorporate a depth-wise convolutional block into the feed-forward module of the Transformer-based architecture, adopting the approach proposed in [39], as illustrated in Fig. 6. The entire feed-forward module consists of a depth-wise convolutional block, two GELU activation layers, and reshape operations. The feed-forward module is used to encode positional information from different spectra.
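A minimal PyTorch sketch of this feed-forward design is given below: point-wise expansion, a 3 × 3 depth-wise convolution for local context, and point-wise projection, with reshaping between the token and image layouts. The expansion factor and the exact activation placement are assumptions.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Sketch of the feed-forward module (Fig. 6): 1x1 expansion, a 3x3 depth-wise
    convolution for local context, and 1x1 projection. Expansion factor assumed."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depth-wise
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )

    def forward(self, x, h, w):                       # x: (B, h*w, dim) tokens
        b, n, d = x.shape
        x = x.transpose(1, 2).reshape(b, d, h, w)     # tokens -> feature map
        x = self.net(x)
        return x.reshape(b, d, n).transpose(1, 2)     # feature map -> tokens
```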

Fig. 6. Feed forward module.

We use the L1 loss in the training process, which is a PSNR-oriented optimization of the system. Let $\mathbf {X}_{in} \in \mathbb {R}^{B \times N_x \times N_y \times 3}$ denote the input of the model, and let $\mathbf {\hat {X}} \in \mathbb {R}^{B \times N_x \times N_y \times N_\lambda }$ be the ground truth, where $B$ is the batch size. The L1 loss is defined as:

$$Loss ={\parallel} {\rm G}(\mathbf{X}_{in}) - \mathbf{\hat{X}} \parallel _{1},$$
where ${\rm G}(\ast )$ is the proposed network.

4. Experiment

4.1 Dataset

Due to the limited availability of publicly accessible SLF datasets, our spectral reconstruction model was trained on the large-scale dataset provided by the NTIRE 2022 Spectral Reconstruction Challenge [20]. This dataset consists of 1000 RGB-HSI pairs, where each HSI has a resolution of $482 \times 512$ and contains 31 wavelength bands ranging from 400 nm to 700 nm. Additionally, we utilized the ICVL HSI dataset [21], which comprises 201 natural scene images with dimensions of $1390 \times 1300 \times 31$; the spectral bands are evenly spaced within the range of 400 to 700 nm. The entire dataset was randomly shuffled and divided into training, validation, and testing sets in an 18:1:1 ratio. The corresponding RGB image $\mathbf {L_M} \in \mathbb {R}^{N_x \times N_y \times 3}$ is obtained by applying a transformation matrix $\mathbf {M} \in \mathbb {R}^{N_{\lambda } \times 3}$ to the ground-truth HSI cube $\mathbf {Y} \in \mathbb {R}^{N_x \times N_y \times N_{\lambda }}$ as:

$$\mathbf{L_M} = \mathbf{Y} \times \mathbf{M}.$$

Here, the $\times$ operation represents matrix multiplication with broadcasting, where the smaller matrix is broadcast over the spatial dimensions of the larger one. The transformation matrix $\mathbf {M}$ encodes the spectral response curves of the Lytro Illum.
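Eq. (15) amounts to projecting each $N_\lambda$-band pixel through $\mathbf{M}$; a minimal NumPy sketch is shown below, with a random placeholder in place of the measured Lytro Illum response.

```python
import numpy as np

# Eq. (15): project the ground-truth HSI cube Y (N_x x N_y x N_lambda) onto RGB
# through the response matrix M (N_lambda x 3). M below is a random placeholder
# for the measured Lytro Illum spectral response.
N_x, N_y, N_lam = 482, 512, 31
Y = np.random.rand(N_x, N_y, N_lam)
M = np.random.rand(N_lam, 3)
M /= M.sum(axis=0, keepdims=True)          # normalize each channel's response

L_M = np.einsum('xyl,lk->xyk', Y, M)       # broadcasted matrix multiplication
assert L_M.shape == (N_x, N_y, 3)
```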

The spectral sensitivity curves of the Lytro Illum used in our method were measured to generate the RGB images. To simulate realistic LF SAIs, we also analyzed the noise distribution, which is comparable to that of standard image sensors but incorporates signal-dependent Poisson noise in addition to additive white noise: because the micro-lens array blocks light transmission, photon noise increases and varies spatially, so the Lytro camera exhibits signal-dependent noise. We therefore added additive Gaussian noise and signal-dependent Poisson noise to approximate the Lytro camera's noise distribution when generating the training data. Due to the unusual optical architecture of LF cameras, the decoded LF SAIs also exhibit a certain degree of vignetting. We addressed this by applying global intensity modifications to each RGB input image in the dataset, based on the varying intensity distributions of the vignetting effect obtained from LF SAIs taken at different angles. Following these findings, we generated eight images with varying intensity levels from the clean hyperspectral data of each scene to simulate real-world images.

4.2 Ablation study

To verify the effect of the FTB in the designed network, we carried out ablation studies; the results are presented in Table 1. Because the FTB includes the feed-forward module, which consists of two 1$\times$1 convolution layers and a 3$\times$3 depth-wise separable convolution layer, the number of parameters decreases when all FTBs are removed, whereas the Fourier transform itself is parameter-free. Meanwhile, UTFNet shows improvements in every metric. This demonstrates that the combination of FTB and SAB is important for further improving accuracy.

Table 1. Ablation study on the effect of the FTB.

4.3 Implementation details

We train the UTFNet model using RGB patches of size $128 \times 128$ and their corresponding hyperspectral cubes. The RGB patches are scaled to the range $[0, 1]$. The batch size is set to 10, and the optimizer is Adam [40] with $\beta _{1} = 0.9$ and $\beta _{2} = 0.999$. We set the initial learning rate to 0.0001 and use a cosine annealing schedule to periodically change the learning rate. We implemented the proposed network in the PyTorch framework and trained it on an RTX A30 GPU with 24 GB of memory; the training process took approximately 40 hours. During the testing phase, RGB images are also scaled to the range $[0, 1]$ and fed into the network. The UTFNet model takes 80.2 ms to reconstruct a single image ($622 \times 432 \times 3$) on an RTX 3090 GPU.
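The setup above corresponds to a standard supervised training loop; a minimal PyTorch sketch is shown below. The dataset loader, epoch count, and scheduler period are placeholders and assumptions.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=100, device="cuda"):
    """Minimal sketch of the training setup in Sect. 4.3. 'model' is the
    reconstruction network and 'loader' yields (rgb, hsi) patch pairs in [0, 1];
    the epoch count and scheduler period are assumptions."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.L1Loss()                               # Eq. (14)
    for _ in range(epochs):
        for rgb, hsi in loader:                           # rgb: (B,3,128,128), hsi: (B,31,128,128)
            optimizer.zero_grad()
            loss = criterion(model(rgb.to(device)), hsi.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```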

4.4 Results and discussion

Quantitative results on the validation set. To validate the performance of our model, we compared it with four existing algorithms: MST++ [18], DRCR [6], Restormer [41], and HSCNN+ [17]. These methods were retrained on our modified dataset. We evaluated performance using the peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM), mean relative absolute error (MRAE), and root mean square error (RMSE); Params and FLOPs measure the complexity and computational requirements of each model. The results are presented in Table 2, with the top-performing approach in bold. Our algorithm achieved first place in terms of PSNR, MRAE, RMSE, and SAM, showing improvements in multiple metrics on the validation set.

Table 2. PSNR, MRAE, RMSE, SAM, FLOPs and Params by different algorithms.

Compared to the second-best algorithm, MST++, our network achieved a PSNR improvement of 0.41, a decrease in MRAE of 0.0046, a decrease in SAM of 0.0068, and a decrease in RMSE of 0.0013 on the validation set. The improvement can be attributed to the adoption of the parameter-free Fourier transform module, which not only replaces the attention sublayers but also transforms the input feature maps into the frequency domain, effectively suppressing noise.
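For reference, the sketch below gives one common way to compute the four quality metrics on a reconstructed cube; the exact normalization and averaging conventions used in Table 2 are not specified in the text, so these definitions are assumptions.

```python
import numpy as np

def psnr(gt, rec, data_range=1.0):
    """Peak signal-to-noise ratio over the whole cube (dB)."""
    mse = np.mean((gt - rec) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def rmse(gt, rec):
    """Root mean square error."""
    return np.sqrt(np.mean((gt - rec) ** 2))

def mrae(gt, rec, eps=1e-8):
    """Mean relative absolute error."""
    return np.mean(np.abs(gt - rec) / (gt + eps))

def sam(gt, rec, eps=1e-8):
    """Spectral angle mapper averaged over pixels; gt, rec: (H, W, N_lambda)."""
    dot = np.sum(gt * rec, axis=-1)
    norms = np.linalg.norm(gt, axis=-1) * np.linalg.norm(rec, axis=-1)
    return np.mean(np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0)))
```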

Qualitative results on real scenes. To validate the generalization of our model to real scenes captured by the Lytro camera, we prepared several sets of LF data and used a spectrometer to obtain ground-truth spectral density curves for evaluating our spectral reconstruction algorithm. This dataset was captured with a LF camera (Lytro Illum) and consists of 5 scenes containing materials with different geometric shapes and radiance properties. Each scene's original data comprises a $9 \times 9$ array of angular views with a spatial resolution of $622 \times 432$. To acquire the spectral information, we used a spectrometer (AvaSpec-UKS2048CL-EVO-ES-UA, spectral range 200-1100 nm) to sample surfaces of different materials and colors, resulting in a total of 20 groups of data.

Because accurate spectral estimation is crucial in hyperspectral imaging, we evaluate the quality of spectral reconstruction by extracting recovered spectral profiles from different spatial positions, as shown in Table 3 and Fig. 7. Our UTFNet exhibits spectral profiles that closely match the ground-truth values, validating its spectroscopic learning and generalization ability on real scenes.

Fig. 7. Spectral profiles of different points: (a) Central view of SAIs; (b)-(e) Spectral signatures of selected points from different hyperspectral recovery methods.

Table 3. SAM of the spectral signatures at each point in Fig. 7.

From Fig. 8, it can be observed that despite the presence of unique noise distribution and vignetting effects in Lytro’s SAIs, our model is able to effectively recover the scene content with lower errors in local regions, although some noise is still visible.

Fig. 8. Visual comparison of five selected bands for hyperspectral reconstruction from real scene.

Our method effectively addresses the issue of vignetting effects in the LF, which leads to higher intensity and signal-to-noise ratio (SNR) in the central view compared to the edge views, even when SAIs are captured at different angles. The reconstructed LF spectra in two spectral bands are depicted in Fig. 9. It is evident that this approach achieves a uniform intensity distribution regardless of the direction of the SAI.

Fig. 9. Reconstructed SLF. On the left side are five selected bands from the central view, and on the right side are two bands of the LF with nine angular views. For better visualization, the complete SLF is also provided (see Visualization 1).

5. Conclusion

In conclusion, this paper proposed a snapshot SLF imaging strategy that integrates deep learning with traditional RGB LF imaging. A typical hand-held LF camera provides decoded SAIs as RGB input. The SLF reconstruction approach builds on previous RGB-to-HSI models with considerable improvements. Using current spectral reconstruction datasets, we identified a key problem in LF image spectral reconstruction and improved the datasets based on the noise distribution and vignetting effects of LF images. Existing spectral light field compressive sensing algorithms have advantages in terms of interpretability, but they require careful design of sparse transforms and measurement matrices and are sensitive to noise. The proposed learning-based RGB-to-HSI reconstruction model improves the evaluation metrics and has been optimized for LF image datasets, enhancing its generalizability. This imaging approach is more cost-effective, simpler, and more versatile than other methods because it does not demand strict experimental conditions. Its limitation is that the reconstructed spectral range is confined to the visible spectrum (400-700 nm), and the reconstruction accuracy is significantly influenced by the training dataset.

Due to the high-performance parallel batch processing capability of deep learning models, the computation can be accelerated on GPUs, and the entire estimation process can be completed without expensive or redundant hardware setups. The snapshot SLF imaging method allows simultaneous acquisition of spectral and angular information under limited photon budgets, removing many application barriers. For example, it facilitates the rendering of surface textures in dynamic scenes. It also supports a wide range of computer vision applications, including but not limited to 3D reconstruction, segmentation and matting, saliency detection, and object detection and recognition. The ability to effectively capture higher optical dimensions significantly amplifies the overall performance of these tasks.

Funding

National Key Research and Development Program of China (2022YFD1900802); Chinese Universities Scientific Fund (2452022382); National Natural Science Foundation of China (12204380).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. E. H. Adelson and J. R. Bergen, "The plenoptic function and the elements of early vision," in Computational Models of Visual Processing, M. S. Landy and J. A. Movshon, eds. (The MIT Press, 1991), pp. 3–20.

2. A. Bhandari and R. Raskar, “Signal processing for time-of-flight imaging sensors: An introduction to inverse problems in computational 3-d imaging,” IEEE Signal Process. Mag. 33(5), 45–58 (2016). [CrossRef]  

3. Y. Wang, L. Wang, Z. Liang, et al., “Occlusion-aware cost constructor for light field depth estimation,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), pp. 19809–19818.

4. D. Kim, W. Ka, P. Ahn, et al., "Global-local path networks for monocular depth estimation with vertical cutdepth," arXiv preprint arXiv:2201.07436 (2022).

5. C. Yu, J. Yang, M. Wang, et al., “Research on spectral reconstruction algorithm for snapshot microlens array micro-hyperspectral imaging system,” Opt. Express 29(17), 26713–26723 (2021). [CrossRef]  

6. J. Li, S. Du, C. Wu, et al., “Drcr net: Dense residual channel re-calibration network with non-local purification for spectral super resolution,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), pp. 1259–1268.

7. Z. Yu, D. Liu, L. Cheng, et al., “Deep learning enabled reflective coded aperture snapshot spectral imaging,” Opt. Express 30(26), 46822–46837 (2022). [CrossRef]  

8. Z. Meng, Z. Yu, K. Xu, et al., “Self-supervised neural networks for spectral snapshot compressive imaging,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), pp. 2622–2631.

9. G. Wu, B. Masia, A. Jarabo, et al., “Light field image processing: An overview,” IEEE Signal Process. Mag. 11(7), 926–954 (2017). [CrossRef]  

10. S. Wang, T. Zhou, Y. Lu, et al., “Detail-preserving transformer for light field image super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol.36, (2022), pp. 2522–2530.

11. L. Wang, Z. Xiong, G. Shi, et al., “Simultaneous depth and spectral imaging with a cross-modal stereo system,” IEEE Trans. Circuits Syst. Video Technol. 28(3), 812–817 (2018). [CrossRef]  

12. Z. Xiong, L. Wang, H. Li, et al., “Snapshot hyperspectral light field imaging,” in 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2017), pp. 3270–3278.

13. X. Hua, Y. Wang, S. Wang, et al., “Ultra-compact snapshot spectral light-field imaging,” Nat. Commun. 13(1), 2732 (2022). [CrossRef]  

14. M. Levoy, R. Ng, A. Adams, et al., “Light field microscopy,” ACM Trans. Graph. 25(3), 924–934 (2006). [CrossRef]  

15. C. Hahne and A. Aggoun, "Plenopticam v1.0: A light-field imaging framework," IEEE Trans. on Image Process. 30, 6757–6771 (2021). [CrossRef]

16. P. Gao and C. Yuan, “Resolution enhancement of digital holographic microscopy via synthetic aperture: a review,” Light: Advanced Manufacturing 3, 105–120 (2022). [CrossRef]  

17. Z. Shi, C. Chen, Z. Xiong, et al., “Hscnn+: Advanced cnn-based hyperspectral recovery from rgb images,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (2018), pp. 939–947.

18. Y. Cai, J. Lin, Z. Lin, et al., “Mst++: Multi-stage spectral-wise transformer for efficient spectral reconstruction,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), pp. 745–755.

19. R. Leitner, A. Kenda, and A. Tortschanoff, "Hyperspectral light field imaging," in Proc. SPIE, Vol. 9506 (2015), pp. 20–27.

20. B. Arad, R. Timofte, R. Yahel, et al., “Ntire 2022 spectral recovery challenge and data set,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), pp. 863–881.

21. B. Arad and O. Ben-Shahar, “Sparse recovery of hyperspectral signal from natural rgb images,” in 2016 European Conference on Computer Vision (ECCV), (2016), pp. 19–34.

22. A. Alvarez-Gila, J. Van De Weijer, and E. Garrote, “Adversarial networks for spatial context-aware spectral image reconstruction from rgb,” in 2017 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), (2017), pp. 480–490.

23. I. Ihrke, J. Restrepo, and L. Mignard-Debise, “Principles of light field imaging: Briefly revisiting 25 years of research,” IEEE Signal Process. Mag. 33(5), 59–69 (2016). [CrossRef]  

24. R. Ng, M. Levoy, M. Brédif, et al., “Light field photography with a hand-held plenoptic camera”, Ph.D. thesis, Stanford University (2005).

25. J. Schanda, M. Levoy, M. Brédif, et al., “Cie 1931 and 1964 standard colorimetric observers: history, data, and recent assessments,” Encyclopedia of Color Science and Technology (ed. Luo, MR) pp. 125–129 (2016).

26. R. M. Willett, M. F. Duarte, M. A. Davenport, et al., “Sparsity and structure in hyperspectral imaging: Sensing, reconstruction, and target detection,” IEEE Signal Process. Mag. 31(1), 116–126 (2014). [CrossRef]  

27. R. G. Baraniuk, “Compressive sensing [lecture notes],” IEEE Signal Process. Mag. 24(4), 118–121 (2007). [CrossRef]  

28. J. M. Bioucas-Dias and M. A. Figueiredo, “A new twist: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. on Image Process. 16(12), 2992–3004 (2007). [CrossRef]  

29. M. A. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems,” IEEE J. Sel. Top. Signal Process. 1(4), 586–597 (2007). [CrossRef]  

30. G. Lippmann, “Epreuves reversibles donnant la sensation du relief,” J. Phys. Theor. Appl. 7(1), 821–825 (1908). [CrossRef]  

31. B. Goldluecke and S. Wanner, "The variational structure of disparity and regularization of 4d light fields," in 2013 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2013), pp. 1003–1010.

32. J. Chen, J. Hou, and L.-P. Chau, “Light field denoising via anisotropic parallax analysis in a cnn framework,” IEEE Signal Process. Lett. 25(9), 1403–1407 (2018). [CrossRef]  

33. Y. Tay, D. Bahri, D. Metzler, et al., “Synthesizer: Rethinking self-attention for transformer models,” in International Conference on Machine Learning, (PMLR, 2021), pp. 10183–10192.

34. J. Lee-Thorp, J. Ainslie, I. Eckstein, et al., "Fnet: Mixing tokens with fourier transforms," arXiv preprint arXiv:2105.03824 (2021).

35. M. Frigo and S. G. Johnson, “The design and implementation of fftw3,” Proc. IEEE 93(2), 216–231 (2005). [CrossRef]  

36. J. Fu, J. Liu, H. Tian, et al., “Dual attention network for scene segmentation,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), pp. 3146–3154.

37. K. Yuan, S. Guo, Z. Liu, et al., “Incorporating convolution designs into visual transformers,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), pp. 579–588.

38. M. Sandler, A. Howard, M. Zhu, et al., “Mobilenetv2: Inverted residuals and linear bottlenecks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2018), pp. 4510–4520.

39. Z. Liu, Y. Lin, Y. Cao, et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), pp. 10012–10022.

40. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, arXiv preprint arXiv:1412.6980 (2014).

41. S. W. Zamir, A. Arora, S. Khan, et al., “Restormer: Efficient transformer for high-resolution image restoration,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), pp. 5728–5739.

Supplementary Material (2)

Supplement 1: Supplemental document.
Visualization 1: Visualization video of the 5D spectral light field image data obtained by our imaging model, presented as a supplementary display to Fig. 9.
