
Wavelength encoding spectral imaging based on the combination of deeply learned filters and an RGB camera

Open Access

Abstract

Hyperspectral imaging is a critical tool for gathering spatial-spectral information in many fields of scientific research. Owing to advances in spectral reconstruction algorithms, significant progress has been made in reconstructing hyperspectral images from commonly acquired RGB images. However, because the three-channel input carries limited information, reconstructing spectral information from RGB images is an ill-posed problem. Furthermore, conventional camera color filter arrays (CFAs) are designed for human perception and are not optimal for spectral reconstruction. To increase the diversity of wavelength encoding, we propose placing broadband encoding filters in front of the RGB camera. In this configuration, the spectral sensitivity of the imaging system is determined jointly by the filters and the camera itself. To achieve an optimal encoding scheme, we use an end-to-end optimization framework to automatically design the filters’ transmittance functions and optimize the weights of the spectral reconstruction network. Simulation experiments show that our proposed spectral reconstruction network has excellent spectral mapping capability and that our joint wavelength encoding imaging framework is superior to traditional RGB imaging systems. We fabricate the deeply learned filter and conduct real-world imaging experiments. The spectral reconstruction results achieve attractive spatial resolution and spectral accuracy.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

A hyperspectral image (HSI) can be described as a collection of 1D spectral curves sampled at each spatial position, or equivalently as a stack of 2D spatial images across many spectral channels. Because different objects have distinct spectral properties, spectral imaging techniques have been widely used in agriculture [1], remote sensing [2], biomedicine [3], target tracking [4], and so on. Most currently used spectral imaging systems are based on the scanning principle. However, the scanning process is time-consuming and requires the assistance of sophisticated, complex, and bulky mechanical systems, which severely limits their application in scenarios where portability and imaging speed are desired. With the development of computational spectral imaging, numerous encoding- and reconstruction-based spectral imaging systems have emerged in recent years [5]. These systems enable snapshot imaging while substantially reducing system complexity and volume.

Many scholars have focused their research on reconstructing 3D spectral data cubes from easily accessible RGB images. Spectral reconstruction from RGB is a popular topic in the New Trends in Image Restoration and Enhancement (NTIRE) workshop, which attracts a large number of participants every two years, and the flow of related algorithms has been growing steadily [6,7]. However, most of these algorithms treat RGB-to-spectrum reconstruction as a channel-dimensional super-resolution task, focusing mainly on the mapping capacity of convolutional neural networks (CNNs). Different RGB cameras have different spectral sensitivities and correspondingly different image signal processing (ISP) pipelines [8], so the generalization of such methods to complex scenes remains limited. In contrast, some researchers regard RGB images as the result of wavelength encoding and thus treat spectral reconstruction as a decompression process. More specifically, the CFA of an RGB camera can be regarded as a wavelength encoding element that modulates the spectrum of the incident light field.

In this work, we start from the principle of wavelength-encoded spectral imaging and propose to place broadband filters in front of an RGB camera. By switching filters between shots, both the encoding diversity and the initial data volume are increased. In this system, the CFA of the camera and the broadband filters work together as the wavelength encoding element. Since the CFA of the camera is immutable, we aim to design the transmittance functions of the filters through an end-to-end optimization framework to achieve the optimal wavelength encoding scheme. Within this framework, we first propose a CNN-based algorithm for RGB-to-spectrum reconstruction; experimental results show that our proposed spatial-channel attention module effectively improves network performance. We then set the transmittance functions of the filters as trainable parameters and jointly optimize the network weights and the filters’ transmittance functions through data-driven learning. To facilitate fabrication, we impose non-negative and smoothness constraints on the filter parameters during optimization. Finally, we fabricate a broadband filter and conduct imaging experiments.

Our contribution can be summarized as follows:

  • We reconsider RGB imaging from the perspective of wavelength encoding, and we argue that all wavelength-encoding spectral reconstruction tasks should be performed in the RAW domain of RGB images to avoid the influence of nonlinear transformations in the ISP.
  • We propose to place a broadband filter in front of the RGB camera so that wavelength encoding is realized through the combination of the filter and the camera’s CFA. Our proposed system avoids the challenge of integrating specifically designed filters on a single chip and improves the acquisition efficiency of wavelength-encoded images by acquiring a three-channel image in a single shot.
  • We propose an end-to-end optimization framework that simultaneously realizes the design of broadband encoding filters and the optimization of network parameters, enabling the optimal wavelength encoding scheme for the current scenario. Within this framework, we propose a neural network for spectral reconstruction based on a spatial-channel attention block. Our experimental results demonstrate that the network has excellent spectral reconstruction ability and is quite competitive compared with other methods.
  • We fabricate a broadband filter based on the designed parameters and construct an experimental system to perform spectral reconstruction of real scenes. Our reconstruction results have higher spatial resolution and similar spectral curves compared with the results of a commercial spectral imager.

The rest of this paper is organized as follows. In Sec. 2, we briefly introduce and discuss the work related to spectral imaging. Sec. 3 illustrates our imaging model and the implementation details of the end-to-end optimization framework. Detailed numerical simulations and practical experiments are given in Sec. 4. Finally, we discuss and conclude in Sec. 5.

2. Related work

Traditional scanning-based spectral imaging systems usually scan along the spectral or spatial dimension and stitch together multiple measurements to generate a 3D spectral data cube. A typical push-broom imaging system scans along the spatial dimension [9,10] and requires the coordination of slits, dispersive elements (such as gratings or prisms), mechanically moving parts, etc.; one-dimensional spatial information and the corresponding one-dimensional spectral information are obtained in each shot. An imaging system that scans along the spectral dimension [11,12] usually places a narrow-band filter in the imaging optical path and obtains the two-dimensional spatial information of the current spectral channel in a single shot. There are also dispersion-based systems, such as integral field units (IFUs) [13], that enable snapshot acquisition of hyperspectral images. However, IFUs face challenges of system complexity and a trade-off between spatial resolution and spectral accuracy.

Deep learning has been widely applied in various image restoration and enhancement tasks in recent years, and it has also given rise to several computational spectral imaging systems that do not require a scanning process, also known as encoding and reconstruction-based spectral imaging systems [5]. According to the encoding methods, the existing reconstruction-based spectral imaging systems can be divided into three categories: amplitude encoding-based, phase encoding-based, and wavelength encoding-based.

Amplitude encoding-based systems build on compressive sensing theory and employ binary-coded masks and dispersive elements to achieve snapshot spectral imaging. A typical amplitude encoding system is the coded aperture snapshot spectral imaging (CASSI) system proposed by Gehm et al. [14]. In CASSI, the incident 3D data cube is first modulated by a 0/1 mask, the encoded data cube is then spatially shifted by a dispersion element, and finally the shifted data cube is integrated along the spectral dimension on the image sensor to obtain a 2D measurement. Since the CASSI system was proposed, numerous improvements have been made to both the system structure and the reconstruction algorithm. Wu et al. [15] proposed using a digital micromirror device (DMD) as the encoding element to achieve dynamic amplitude modulation. To improve reconstruction results, Parada-Mayorga et al. [16] proposed using color-coded apertures rather than random black-and-white coded apertures. A dual optical path system based on a beam splitter was proposed [17,18], with one path capturing a clear image and the other performing CASSI; high-quality HSIs can be obtained by fusing and reconstructing the two images. With the growth of computing resources, reconstruction algorithms based on deep learning have boosted the performance of CASSI systems in recent years [19–21]. However, due to its elaborate optical system, the CASSI system can only achieve relatively stable spectral imaging under laboratory conditions [22]. In comparison, our system has a more compact optical setup and better portability.

Phase encoding-based systems encode the phase term of the incident light, which in turn modulates the imaging system’s point spread function (PSF). The typical phase encoding elements are diffractive optical elements (DOEs). The optical path difference caused by the surface profile of the DOE and the refractive index difference introduces an additional phase term, which leads to wavelength-dependent PSFs on the sensor plane. Jeon et al. [23] proposed using a single DOE to achieve dispersion and imaging simultaneously; the system’s PSFs exhibit an anisotropic shape with a rotation angle, and a 3D data cube can be obtained through a single shot and image restoration. Improvements to the imaging system and reconstruction algorithm based on this work have been proposed in recent years [24–26]. Toivonen et al. [27] proposed using a conventional camera with a diffraction grating element to achieve snapshot spectral imaging, in which the grating projects different spectral components to different spatial positions. Baek et al. [28] proposed using a deeply learned DOE to achieve spectral and depth imaging at the same time. These systems usually have the advantages of small size, compactness, and snapshot acquisition, but their PSFs are usually large and the initial images are severely degraded, which makes image restoration difficult. In contrast, our method encodes in the wavelength dimension with less loss of image detail.

Unlike the previously described systems, wavelength encoding-based systems primarily focus on encoding and integrating the 3D data cube in the wavelength dimension without sacrificing spatial resolution. Owing to the unique spectral transmittance characteristics of different materials, there are various ways to realize wavelength encoding. Zhang et al. [29] proposed using thin-film filters as wavelength encoding elements and a deep neural network (DNN) to realize both filter design and spectrum reconstruction. Similarly, Zhu et al. [30] proposed covering CMOS sensors with silicon nitride-based photonic crystal (PC) slabs as wavelength encoding elements. Xiong et al. [31] proposed using metasurfaces with different patterns to encode the incident spectra and integrate multiple pixels into a micro-spectrometer. Nie et al. [32] argued that the CFA of a camera is designed for human perception and is not optimal for spectral reconstruction tasks, and therefore proposed optimizing the design of the camera’s response function. Oh et al. [33] proposed utilizing multiple cameras with different spectral sensitivities to capture the same scene and reconstruct the spectrum from the different RGB measurements. Fu et al. [34] proposed reconstructing a 3D data cube from a single RAW mosaic image and investigated the performance of three different network structures on the spectral reconstruction task. Fu et al. [35] proposed a method to select cameras with different spectral responses for the best spectral recovery. In parallel, a large body of work has contributed various spectral reconstruction networks [6,7].

3. Methods

In this section, we first present the image formation model based on wavelength encoding, together with the image degradation pipeline used to simulate measurements in the subsequent experiments. We then present the end-to-end framework for jointly optimizing the filter and network parameters, with details of the network structure and the constraints imposed.

3.1 Image formation model

In our proposed system, assume that we have ${m}$ single-chip filters. Selecting ${n}$ (${n \in \{0,1,\ldots,m\}}$) of the ${m}$ filters to form a combined filter, the transmittance function of the combined filter can be expressed as

$$T_i(\lambda) = \prod_{q=0}^{n} t_q(\lambda) ,$$
where the subscript ${i}$ denotes the ${i^{th}}$ combination scheme and ${t_q(\lambda )}$ is the transmittance function of the ${q^{th}}$ single-chip filter. When taking images, a combined filter is placed in front of the imaging lens, and the corresponding wavelength-encoded image is acquired. We can infer that the maximum number of shots is
$$N = \sum_{n=0}^{m} C_{m}^{n} = 2^m ,$$
where ${C_{m}^{n} = \frac{m!}{n!(m-n)!}}$ is the number of ways of choosing ${n}$ filters from the ${m}$ available, and ’${!}$’ indicates the factorial operation. As the RGB camera obtains three color channels (R, G, and B) in each shot, the spectral reconstruction network can receive image inputs with up to ${3N}$ channels. The sketch below illustrates this combination count.
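In the following snippet the filter transmittances are random placeholders rather than designed values; only the counting logic of Eqs. (1) and (2) is illustrated.

```python
# Illustrative only: random stand-in transmittances, not the deeply learned filters.
from itertools import combinations
import numpy as np

m = 3                                                                  # number of single-chip filters
t_single = np.random.default_rng(0).uniform(0.2, 0.9, size=(m, 25))   # m filters, 25 bands

# Eq. (1): the combined transmittance is the product of the selected filters;
# the empty selection (n = 0) corresponds to shooting without any filter.
combined = [np.prod(t_single[list(idx)], axis=0)
            for n in range(m + 1)
            for idx in combinations(range(m), n)]

assert len(combined) == 2 ** m                          # Eq. (2): at most N = 2^m shots
print(f"up to {3 * len(combined)} input channels for the spectral reconstruction network")
```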

Consider the imaging scene shown in Fig. 1: an object is illuminated by a light source with a spectral distribution ${ E(\lambda ) }$. The light reflected by the object passes through a filter, an imaging lens, and a color filter array before forming an image on the image sensor. Assuming that the reflectance of the object’s surface is ${R(x,y,\lambda )}$, the transmittance function of the filter is ${T_i(\lambda )}$, and the response function of the camera is ${S_c(\lambda )}$, the formation model of the RGB image can be described as

$$J_c(x,y) = \int_{\lambda_1}^{\lambda_2} E(\lambda) R(x,y,\lambda) T_i(\lambda) S_c(\lambda) d\lambda ,$$
where ${(x,y)}$ are the spatial coordinates, the subscript ${c \in \{r,g,b\}}$ denotes the color channel, ${[\lambda _1, \lambda _2]}$ indicates the active bandwidth of the system, and ${S_c(\lambda )}$ contains the integrated effect of the imaging lens’ transmittance, the CFA’s transmittance, and the quantum efficiency of the CMOS sensor. For simplicity, the spectrum of the light source ${E(\lambda )}$ and the reflectance of the object ${R(x,y,\lambda )}$ can be combined into the spectral radiance ${I(x,y,\lambda )}$ of the scene. The spectral reconstruction then yields the spectral radiance of the scene, and the relative reflectance of the object can be obtained by spectral calibration; the principle and procedure of spectral calibration are detailed in Section 1 of Supplement 1. Substituting ${I(x,y,\lambda )}$ into Eq. (3) and rewriting it in discrete matrix form gives
$$\begin{bmatrix} J_{1c}\\ J_{2c}\\ \vdots \\ J_{Nc} \end{bmatrix} = \begin{bmatrix} T_{11}S_{c1} & T_{12}S_{c2} & \cdots & T_{1o}S_{co} \\ T_{21}S_{c1} & T_{22}S_{c2} & \cdots & T_{2o}S_{co} \\ \vdots & \vdots & \ddots & \vdots \\ T_{N1}S_{c1} & T_{N2}S_{c2} & \cdots & T_{No}S_{co} \end{bmatrix} \cdot \begin{bmatrix} I_{1}\\ I_{2}\\ \vdots \\ I_{o} \end{bmatrix} .$$

Furthermore, Eq. (4) can be written compactly as

$$\mathbf{ J = HI } ,$$
where ${\mathbf {H}}$ is the encoding matrix formed jointly by the encoding filter and the RGB camera.
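For clarity, a minimal numerical sketch of this discrete forward model is given below (not the authors' code; the array names and shapes are assumptions):

```python
# Discrete form of Eqs. (3)-(5): integrate the scene radiance against the
# combined encoding T_i(lambda) * S_c(lambda) to obtain one encoded RGB shot.
import numpy as np

def encode_rgb(radiance: np.ndarray, t_comb: np.ndarray, s_rgb: np.ndarray) -> np.ndarray:
    """radiance: (H, W, O) spectral radiance cube with O channels,
    t_comb: (O,) combined filter transmittance, s_rgb: (3, O) camera response."""
    h = s_rgb * t_comb[None, :]                   # per-channel rows of the encoding matrix H
    return np.einsum('hwo,co->hwc', radiance, h)  # Riemann sum over the spectral dimension
```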

Fig. 1. Schematic diagram of imaging scene. The scene is illuminated by a light source, and the reflected light is then encoded in the wavelength dimension by a combined filter and an RGB camera.

Since the spectral reconstruction network is a data-driven algorithm, we propose a simulation image degradation pipeline as shown in Fig. 2.

Fig. 2. Schematic diagram of the wavelength-encoded image degradation pipeline. To ensure that it closely resembles the actual image formation process, we first integrate the wavelength-encoded HSI along the three channels of R, G, and B, then degrade the three-channel image into a RAW image via three 0/1 masks and inject noise into the RAW image. Demosaicing the noisy RAW image yields the degraded RGB image.

The input HSI is first modulated by a combined filter and then integrated by the imaging system to form a wavelength-encoded RGB image. Next, we imitate the effect of the camera’s Bayer array and use spatial mosaicing to convert the three-channel image into a single-channel RAW image. We inject noise into the RAW image and subsequently use bilinear interpolation to convert it back to a degraded full-resolution RGB image. Because the imaging system is designed for multiple shots, the transmittance function of the filter is changed to generate each new RGB image. The multiple RGB images are stitched together along the channel dimension to obtain a multi-channel image. The dataset is made up of many data pairs, each consisting of a multi-channel image and an HSI, and these pairs are used to optimize the end-to-end framework. A minimal sketch of this degradation pipeline is given below.
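The sketch assumes an RGGB Bayer pattern and a normalized-convolution interpolation standing in for bilinear demosaicing; the authors' implementation may differ in these details.

```python
import numpy as np
from scipy.ndimage import convolve

def bayer_masks(h, w):
    """Three 0/1 masks of an RGGB Bayer pattern, stacked as shape (h, w, 3)."""
    m = np.zeros((h, w, 3))
    m[0::2, 0::2, 0] = 1                          # R at even rows / even cols
    m[0::2, 1::2, 1] = m[1::2, 0::2, 1] = 1       # G at the two remaining diagonals
    m[1::2, 1::2, 2] = 1                          # B at odd rows / odd cols
    return m

def degrade(rgb, sigma=0.008, rng=np.random.default_rng(0)):
    """Encoded RGB -> mosaiced RAW -> noisy RAW -> demosaiced degraded RGB."""
    masks = bayer_masks(*rgb.shape[:2])
    raw = (rgb * masks).sum(axis=-1)                               # single-channel RAW
    raw = np.clip(raw + rng.normal(0.0, sigma, raw.shape), 0, 1)   # noise injection
    kernel = np.array([[0.25, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25]])
    out = np.zeros_like(rgb)
    for c in range(3):                                             # per-channel interpolation
        num = convolve(raw * masks[..., c], kernel, mode='mirror')
        den = convolve(masks[..., c], kernel, mode='mirror')
        out[..., c] = num / np.maximum(den, 1e-8)
    return out
```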

3.2 End-to-end optimization framework

According to the imaging model above, the overall wavelength encoding process is implemented jointly by a combined filter and a CFA. In consumer RGB cameras, the CFA is integrated in front of the CMOS sensor, and it is difficult to modify its transmittance function. Therefore, we focus on designing the transmittance function of the broadband filter. To achieve optimal wavelength encoding, we propose an end-to-end optimization framework, shown in Fig. 3, that jointly optimizes the filter parameters and the weights of the reconstruction network.

Fig. 3. Schematic diagram of our end-to-end optimization framework. In forward propagation, multi-channel wavelength-encoded images are generated and used to reconstruct HSIs. In the back propagation process, the optimization of the spectral reconstruction network weights and the automatic design of the filter transmittance function are achieved by gradient back propagation.

In forward propagation, the optimized parameters are first turned into the transmittance functions of the filters through a non-negative mapping function, and the filters are then selected and combined to form the combined filters. With the HSI and the transmittance functions of the combined filters as inputs, a degraded multi-channel image is generated by the image degradation pipeline. This degraded image is used as the input of the neural network, which predicts and reconstructs the HSI. Since the entire forward propagation process is differentiable, the weights of the neural network and the optimization parameters of the filters are updated simultaneously during gradient back propagation, as sketched below.
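This minimal, self-contained PyTorch sketch illustrates the flow only; the camera response, dummy data, stand-in network, and learning-rate grouping are assumptions, not the authors' implementation.

```python
import torch

o_bands, m_filters = 25, 1
s_rgb = torch.rand(3, o_bands)                               # assumed camera response S_c(lambda)
omega = torch.nn.Parameter(torch.zeros(m_filters, o_bands))  # trainable filter parameters
net = torch.nn.Sequential(                                   # stand-in for the SCA net (Sec. 3.2.2)
    torch.nn.Conv2d(3 * (m_filters + 1), 64, 1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, o_bands, 1))
opt = torch.optim.Adam([{'params': net.parameters(), 'lr': 4e-4},
                        {'params': [omega], 'lr': 6e-4}])

def encode(hsi, t):
    """Wavelength-encode the HSI once without a filter and once per filter."""
    shots = [torch.einsum('co,bohw->bchw', s_rgb, hsi)]
    shots += [torch.einsum('co,bohw->bchw', s_rgb * tq, hsi) for tq in t]
    return torch.cat(shots, dim=1)                           # (B, 3*(m+1), H, W)

for hsi in [torch.rand(4, o_bands, 64, 64) for _ in range(8)]:  # dummy HSI batches
    t = torch.sigmoid(omega)                                 # non-negative mapping, Eq. (6)
    pred = net(encode(hsi, t))
    loss = torch.mean(torch.abs(pred - hsi))                 # fidelity term only, for brevity
    opt.zero_grad(); loss.backward(); opt.step()             # updates both net and omega
```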

3.2.1 Deeply learned filter

The transmittance function of the designed broadband filter needs to satisfy certain physical restrictions to be meaningful; specifically, it should satisfy ${t_q (\lambda ) \in (0,1)}$. However, back propagation imposes no physical constraints, so the freely optimized parameters of the neural network do not automatically meet this requirement. We therefore use a non-negative mapping function to convert the network optimization parameters into a transmittance function with physical meaning. The specific transformation is as follows

$$t_q (\lambda) = \frac{1}{1+e^{-{\omega}_q (\lambda)}},$$
where ${\omega _q(\lambda ) \in (-\infty, +\infty )}$ are the optimization parameters shown in Fig. 3, obtained directly from the joint optimization network, and ${t_q (\lambda )}$ is the filter’s transmittance function. This formula guarantees that parameters ranging over the entire real line are mapped to ${(0,1)}$; the mapping relationship between the two variables is shown in Fig. 4. Furthermore, the transmittance curve of the filter is usually expected to be relatively smooth for design and manufacture, so a corresponding constraint is imposed on the transmittance function, as detailed in Sec. 3.2.3.

Fig. 4. The mapping relationship between the filter’s network optimization parameters ${\omega _q(\lambda )}$ and transmittance function ${t_q (\lambda )}$.

3.2.2 Network architecture

To reconstruct HSIs from degraded multi-channel images, we propose a spectral reconstruction algorithm, specifically a convolutional neural network based on a spatial-channel attention mechanism, which we call SCA net; the network structure is illustrated in Fig. 5(a). The input multi-channel image is split into two branches, each processed by a convolution layer with a ${1\times 1}$ kernel. This produces two features with 64 channels: one serves as a spectral upsampling feature, and the other passes through eight SCA blocks in turn. The output of the last SCA block is added to the spectral upsampling feature and then passed through a convolutional layer with a ${1\times 1}$ kernel to obtain a hyperspectral image with 25 channels.

Fig. 5. Our proposed network architecture. (a) is the main structure of the network. (b)-(d) illustrate the detailed components of the SCA Block, Spatial Attention Block and Channel Attention Block.

A well-accepted view is that the spectral reflectance of an object can be regarded as a combination of a small set of spectral basis functions [36]. Since the initially acquired RGB image is obtained by encoding in the wavelength dimension with little loss of spatial resolution, we argue that the reconstruction process should concentrate mainly on the channel dimension. A convolutional layer with a ${1\times 1}$ kernel can flexibly increase or reduce the dimension of the input feature: its output is a linear combination of the input features along the channel dimension, and it consumes few computational resources (see the minimal check below). Based on these considerations, we use a large number of ${1\times 1}$ convolutional layers in our spectral reconstruction network. Furthermore, we employ a channel attention mechanism in the SCA block to assign adaptive weights to different feature channels, increasing the sensitivity to specific feature layers, and we use a spatial attention mechanism and a convolutional layer with a ${3\times 3}$ kernel to enhance the network’s resistance to noise and other interference.
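The check below verifies that a ${1\times 1}$ convolution acts per pixel as a linear combination of the input channels (illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 6, 4, 4)                      # 6 encoded input channels
conv1x1 = nn.Conv2d(6, 64, kernel_size=1, bias=False)
y = conv1x1(x)                                   # (1, 64, 4, 4)

# Equivalent per-pixel matrix multiply with the 64x6 weight matrix:
w = conv1x1.weight.view(64, 6)
y_ref = torch.einsum('oc,bchw->bohw', w, x)
assert torch.allclose(y, y_ref, atol=1e-6)
```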

The schematic of our proposed SCA block is shown in Figs. 5(b)-(d). The SCA block mainly comprises a channel attention block, a spatial attention block, and two convolutional layers with ${3\times 3}$ kernels, where an activation layer follows each convolutional layer. These three branches are combined in parallel and summed with the input features to form a residual connection. A structural sketch of the full network follows.
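The PyTorch sketch below captures the structure described above and in Fig. 5. Layer widths and the internals of the attention branches are assumptions where the text does not specify them; it is a structural sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x)                     # adaptive per-channel weights

class SpatialAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, x):
        return x * self.conv(x)                   # adaptive per-pixel weights

class SCABlock(nn.Module):
    """Channel attention, spatial attention, and a 3x3 conv path in parallel,
    summed with the input to form a residual connection."""
    def __init__(self, ch=64):
        super().__init__()
        self.ca, self.sa = ChannelAttention(ch), SpatialAttention(ch)
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
    def forward(self, x):
        return x + self.ca(x) + self.sa(x) + self.conv(x)

class SCANet(nn.Module):
    def __init__(self, in_ch=6, out_ch=25, width=64, n_blocks=8):
        super().__init__()
        self.up = nn.Conv2d(in_ch, width, 1)      # spectral upsampling branch (1x1 conv)
        self.head = nn.Conv2d(in_ch, width, 1)
        self.body = nn.Sequential(*[SCABlock(width) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(width, out_ch, 1)
    def forward(self, x):
        return self.tail(self.body(self.head(x)) + self.up(x))
```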

3.2.3 Loss function

Optimization of the filter parameters and network weights is achieved by minimizing the loss function between the reconstruction result and the Ground Truth. In this work, we use a mixture of the mean absolute error (MAE), the structural similarity index (SSIM), and a gradient difference as part of the loss function; in addition, we impose a smoothness constraint on the filter as the remaining part of the loss function. Assume that the input multi-channel image is denoted as ${J_{in}^{k}}$ and the corresponding Ground Truth is denoted as ${I_{GT}^{k}}$.

We use MAE loss as the fidelity loss to guide the model to learn sharp image details and ensure the consistency between the reconstruction result and the Ground Truth. The calculation formula can be described as

$$\mathcal{L}_{MAE} = \frac{1}{K} \sum_{k=1}^{K} {\lvert \mathcal{N}(J_{in}^{k}) - I_{GT}^{k} \rvert} ,$$
where ${K}$ is the number of image patches, ${\mathcal {N}(\cdot )}$ denotes the mapping of the network, and ${{\vert \cdot \vert }}$ denotes the absolute value operation.

To ensure the consistency of visual perception and obtain better visual effects, we use SSIM as an item in the hybrid loss function, and the specific calculation method can be expressed as

$$\mathcal{L}_{SSIM}=\frac{1}{K} \sum_{k=1}^K\left\{1-\operatorname{SSIM}\left[\mathcal{N}\left(J_{i n}^k\right), I_{GT}^k\right]\right\},$$
where ${\operatorname {SSIM}[\cdot, \cdot ]}$ means to calculate the structural similarity index of two input images. Since the SSIM between the prediction result and the Ground Truth will eventually tend to 1, we use 1 minus SSIM as the final SSIM loss function.

To preserve the image’s edges and avoid unwanted pseudo-textures, we calculate the gradients of the predicted image and the Ground Truth independently and then take the difference of the gradient maps as the gradient loss function. The loss function and the gradient operator are defined as follows

$$\mathcal{L}_{Grad}=\frac{1}{K} \sum_{k=1}^K \lvert \operatorname{Grad} [\mathcal{N}\left(J_{i n}^k\right)]-\operatorname{Grad}\left(I_{GT}^k\right) \rvert ,$$
$$\begin{aligned} & \operatorname{Grad}(I)=\sqrt{I_{\mathbf{x}}(x, y)^2+I_{\mathbf{y}}(x, y)^2}, \\ & I_{\mathbf{x}}(x, y)=I(x+1, y)-I(x-1, y), \\ & I_{\mathbf{y}}(x, y)=I(x, y+1)-I(x, y-1). \end{aligned}$$

As for the smooth design of the filters’ transmittance function, we calculate the first-order forward difference, take the absolute value and accumulate it, and the detailed calculation method can be expressed as

$$\mathcal{L}_{filter} = \sum_{q=1}^{m}\sum_{p=2}^{o} \lvert t_q(p) - t_q(p-1) \rvert ,$$
where ${m}$ is the number of filters, ${o}$ is the number of spectral channels, and ${p}$ is the current spectral channel index. Finally, the hybrid loss function composed of the above four loss functions can be described as
$$\mathcal{L}_{total} = \alpha\mathcal{L}_{MAE} + \beta\mathcal{L}_{SSIM} + \gamma\mathcal{L}_{Grad} + \eta\mathcal{L}_{filter} ,$$
where ${\alpha, \beta, \gamma, \eta }$ are weight coefficients.
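A hedged PyTorch sketch of this hybrid loss is given below. The weight coefficients are placeholders, and `ssim` is assumed to be any differentiable SSIM implementation (for example, the `ssim` function of the pytorch_msssim package); the gradient operator follows Eq. (10).

```python
import torch
from pytorch_msssim import ssim   # assumed third-party differentiable SSIM

def grad_map(img):
    """Gradient magnitude of Eq. (10) using the same central differences."""
    ix = img[..., 2:, :] - img[..., :-2, :]        # I(x+1, y) - I(x-1, y)
    iy = img[..., :, 2:] - img[..., :, :-2]        # I(x, y+1) - I(x, y-1)
    return torch.sqrt(ix[..., :, 1:-1] ** 2 + iy[..., 1:-1, :] ** 2 + 1e-12)

def hybrid_loss(pred, gt, t, alpha=1.0, beta=0.1, gamma=0.1, eta=0.01):
    """Eqs. (7)-(12); pred/gt: (B, 25, H, W), t: (m, o) filter transmittances."""
    l_mae = torch.mean(torch.abs(pred - gt))                                # Eq. (7)
    l_ssim = 1.0 - ssim(pred, gt, data_range=1.0)                           # Eq. (8)
    l_grad = torch.mean(torch.abs(grad_map(pred) - grad_map(gt)))           # Eq. (9)
    l_filter = torch.sum(torch.abs(t[:, 1:] - t[:, :-1]))                   # Eq. (11)
    return alpha * l_mae + beta * l_ssim + gamma * l_grad + eta * l_filter  # Eq. (12)
```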

4. Experiments and results

In this section, we first introduce the training configuration of our end-to-end framework in Sec. 4.1. In Sec. 4.2, we discuss the performance of our spectral reconstruction network on the task of reconstructing HSIs from 3-channel RGB images and demonstrate its superiority over other commonly used spectral reconstruction networks. After finalizing the structure of the spectral reconstruction network, we incorporate the design of the filters into the joint optimization framework in Sec. 4.3 and investigate the improvement associated with the number of filters. Based on these findings, we set the number of filters to one and complete the design of the deeply learned filter. In Sec. 4.4, we conduct practical experiments and provide a detailed showcase of the spatial details and spectral accuracy of the reconstruction results.

4.1 Implementation details

In this subsection, we present how to obtain the spectral response function of the camera, as well as the scheme of dataset allocation, hyperparameters for network training, and metrics for quantitative evaluation.

In this work, we used a Sony A7M3 camera and a Tamron 28-75mm F/2.8 Di III VXD G2 lens to form an RGB imaging system. We used a double monochromator to calibrate the RGB imaging system’s spectral response function, and the detailed experimental procedures are described in the Section 2 of Supplement 1.

To drive the training of the end-to-end framework, we select two publicly available hyperspectral datasets, ICVL [37] and KAIST [38]. Both datasets cover the visible spectrum with a spectral interval of 10 nm. The hyperspectral image cubes in the ICVL dataset have a shape of ${\sim 1392\times 1300\times 31}$, where ${31}$ is the number of spectral channels covering 400 nm to 700 nm, and the scenes are mostly recorded outdoors. The hyperspectral image cubes in the KAIST dataset have a shape of ${2704\times 3376\times 31}$ over the same 400 nm to 700 nm range; all scenes are captured indoors with narrow-band filters and a grayscale camera. We select data slices from 420 nm to 660 nm (25 channels) from the two datasets as our hyperspectral images. The KAIST dataset contains 30 scenes and the ICVL dataset contains 201 scenes; 8 and 40 scenes, respectively, are randomly selected for testing, and the remaining scenes are used for training. We sample 9184 image patches with a shape of ${256\times 256\times 25}$ from the hyperspectral datasets for network training.

We use the PyTorch framework to implement our network design. The Adam optimizer is employed to optimize the parameters of the filters and the network weights. During training, we set the mini-batch size to 16; the initial learning rates of the filters and the network are ${6\times 10^{-4}}$ and ${4\times 10^{-4}}$, respectively, and both learning rates are halved every 15 epochs. Gaussian noise with a mean of 0 and a standard deviation of 0.008 is added to the input multi-channel images. Training for 80 epochs takes about 7.5 hours on a workstation equipped with an AMD EPYC 7543 CPU and an NVIDIA RTX A6000 GPU.

To quantitatively evaluate the quality of the reconstruction results, we select PSNR, SSIM, SAM, and RMSE, which are widely used evaluation indicators. We compute each metric in every spectral channel and then average across all channels to obtain the final evaluation metrics (an illustrative implementation is sketched below); the detailed calculation formulas are given in Section 3 of Supplement 1.
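For reference, a simple NumPy sketch of such channel-averaged metrics follows (illustrative; the exact formulas used in the paper are those in Supplement 1):

```python
import numpy as np

def psnr(pred, gt, data_range=1.0):
    """Per-channel PSNR averaged over spectral channels; inputs are (H, W, O)."""
    mse = np.mean((pred - gt) ** 2, axis=(0, 1))
    return np.mean(10 * np.log10(data_range ** 2 / np.maximum(mse, 1e-12)))

def rmse(pred, gt):
    return np.sqrt(np.mean((pred - gt) ** 2))

def sam(pred, gt):
    """Mean spectral angle (radians) between predicted and true per-pixel spectra."""
    num = np.sum(pred * gt, axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1)
    return np.mean(np.arccos(np.clip(num / np.maximum(den, 1e-12), -1.0, 1.0)))
```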

To first examine the performance of our network alone, we degrade the HSIs using the spectral response function of the RGB camera and feed the resulting RGB images into the network; the reconstruction results are shown in Fig. 6. The reconstructed HSIs not only retain rich spatial details but also exhibit spectral information that is consistent with the Ground Truth.

Fig. 6. The reconstruction results of our method. We demonstrate the spatial details and spectral curves of our reconstructed results compared to the Ground Truth. For better visual perception, we visualize the Ground Truth as an RGB image.

4.2 Validation on spectral reconstruction network

To validate the effectiveness of our improvements to the network structure, we conducted ablation experiments; for detailed results, please refer to Section 4 of Supplement 1. To validate the superiority of our proposed spectral reconstruction network, we compare it with other commonly used methods, including EDSR [39] for single-image super-resolution, Unet [40] and Hinet [41], which have been widely used for image restoration tasks, HSCNN+ [42] for RGB-to-spectra tasks, and the spatial channel attention method proposed by Liu et al. [43]. To balance computational resource consumption, we set the number of dense blocks to 24 in our implementation of HSCNN-D and the number of residual blocks (with 256 filters in each layer) to 8 in HSCNN-R; these two HSCNN+ variants are denoted as HSCNN-D24 and HSCNN-R8 in Table 1.


Table 1. Performance of different reconstruction methods on the testing set.

In this comparison, the number of filters is set to 0, and only the RGB camera is used to obtain a single RGB image for reconstruction. The reconstruction network outputs an HSI with 25 channels. The results are shown in Table 1.

As can be seen from Table 1, our method is optimal in almost all evaluation metrics. To demonstrate that our method is also computationally competitive, we quantitatively evaluate the complexity of each network using floating point operations (FLOPs) and the number of parameters; we feed an image block with a shape of ${256\times 256\times 3}$ into the networks, and the results are shown in Table 2. In addition, we use the various networks to reconstruct 100 RGB images with a spatial resolution of ${1024\times 1024}$, record the time taken to reconstruct each image, and report the average time in Table 2.
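Parameter counts of the compared PyTorch models can be obtained as sketched below (illustrative; FLOPs require a profiler such as thop or fvcore and are not reproduced here):

```python
def count_params(model):
    """Number of trainable parameters of a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example with the SCANet sketch from Sec. 3.2.2, fed with 3-channel RGB input:
print(f"{count_params(SCANet(in_ch=3)) / 1e6:.2f} M parameters")
```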


Table 2. Computing resource overhead for different networks.

Combining the data in Table 1 and Table 2, we can conclude that our method has the advantage of less computational resource consumption while performing well. In addition to the previous quantitative evaluation, we also visualize the spectral accuracy and spatial details of the reconstruction results of different methods. We pick two scenes and select four regions of interest from them. Then we calculate the average values of the regions of interest in each spectral channel and plot the spectral curves of Ground Truth and reconstruction results of different methods, respectively, and the results are shown in Fig. 7. Table 3 illustrates the root mean square errors between the various curves in Fig. 7 and Ground Truth.

Fig. 7. Spectral accuracy comparison of different reconstruction methods. We selected four regions of interest from the two scenes for analysis and selected five better-performing methods from Table 1 for comparison. The root mean square errors (RMSEs) between the spectral curves of different methods and the Ground Truth are shown in Table 3.


Table 3. The root mean square errors (RMSEs) between the spectral curves of different methods and Ground truth in Fig. 7. ($\times 10^{-3}$)

It can be seen from Fig. 7 that the spectral curves of our method are the closest to Ground Truth and have lower root mean square errors compared with other methods.

To compare the details of different methods’ reconstruction results more intuitively, we calculate the difference between the reconstruction results and Ground Truth at 500 nm and take the absolute value to obtain the errormaps shown in Fig. 8, where the mean absolute errors (MAE) of different methods are labeled.

Fig. 8. Errormaps of different reconstruction methods. At 500 nm, we calculate the absolute value of the difference between the reconstruction result and the Ground Truth, as well as the mean absolute error (MAE) of the whole errormap. The smaller the value in the errormap, the closer to the Ground Truth.

As shown in Fig. 8, our method’s errormap is closest to 0, indicating that the difference between our reconstruction result and the Ground Truth is the smallest. The mean absolute errors also demonstrate that our method performs best among these methods.

4.3 Inverse designed filters

In general, more filters for wavelength encoding means more input images for the reconstruction algorithm, which should make it easier to obtain high-quality hyperspectral images; a detailed analysis and discussion can be found in Section 5 of Supplement 1. However, as the number of filters increases, so do the cost and the number of shots. It is therefore important to quantify the improvement brought by each deeply learned filter and identify solutions applicable to practical scenes. We conducted simulation experiments based on the combination scheme in Eq. (2) and obtained the results shown in Table 4.


Table 4. The number of shots and the metrics of the reconstruction results for different numbers of deeply learned filters. ${m}$ indicates the number of deeply learned filters; ${N}$ indicates the number of shots.

As Table 4 shows, without an additional filter the reconstruction result of the single-shot scheme is inferior. As the number of deeply learned filters increases, the reconstruction metrics improve, but the number of shots also increases dramatically. To investigate the improvement from each additional shot more intuitively, we plot the metrics against the number of shots in Fig. 9.

Fig. 9. The accuracy of reconstruction results varies with the number of filters and the number of shots. ${m}$ represents the number of filters.

As can be observed in Fig. 9, the slope of the curve is largest when only one filter is used, indicating that the first additional shot is the most effective for improving the metrics. Although increasing the number of shots helps, the efficiency of the improvement decreases as the number of shots grows. Considering cost, shooting time, and reconstruction quality, we use a single deeply learned filter in the actual experiment. The designed filter’s transmittance function is shown as the Ideal curve in Fig. 10(a).

Fig. 10. (a) Ideal curve indicates the optimization result of the end-to-end framework. Designed curve indicates the transmittance curve calculated by the film design software (incident angle is 0). Manufactured result indicates the calibration value of the actual filter. (b) is the relative spectral response function of the RGB camera. (c) is the wavelength encoding scheme composed of the deeply learned filter and RGB imaging system.

4.4 Results on practical experiments

We set the number of filters to 1 and continue to use the combination of a Sony A7M3 camera and a Tamron 28-75mm F/2.8 Di III VXD G2 lens as an RGB imaging system and carry out practical experiments.

Firstly, we design a thin-film filter based on the transmittance function automatically optimized by the end-to-end framework; the thin-film filter consists of periodically stacked ${Ti_3O_5}$ and ${SiO_2}$ layers and is fabricated by electron beam evaporation coating. For the detailed design and processing parameters of the filter, please refer to Section 6 of Supplement 1. We calibrate the transmittance function of the manufactured filter; the deeply learned filter’s ideal value, design value, and actual manufactured result are shown in Fig. 10(a). The calibrated transmittance is substituted back into the spectral reconstruction network, and the negative effect of manufacturing error is mitigated by fine-tuning the network weights.

During the imaging process, two shots are required. The first shot makes use of only the RGB camera, with the system’s wavelength encoding scheme being the spectral response function of the camera itself, as seen in Fig. 10(b). For the second shot, the broadband filter needs to be placed in front of the RGB camera. Here, the wavelength encoding scheme of the system is jointly realized by the deeply learned encoding filter and the spectral response function of the camera. The corresponding wavelength encoding curve is shown in Fig. 10(c).

We construct the experimental system shown in Fig. 11. A standard white block is placed in the scene for converting the radiance information into reflectance information. We use the spectral information captured by a GaiaField-V10 HR spectral imager produced by Dualix as the Ground Truth.
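As a hedged illustration of the white-reference calibration (the authors' exact procedure is described in Section 1 of Supplement 1), relative reflectance can be obtained by dividing the reconstructed radiance by the spectrum of the standard white block:

```python
import numpy as np

def to_reflectance(scene_cube, white_cube):
    """scene_cube: (H, W, O) reconstructed radiance; white_cube: pixels of the
    white block, (Hw, Ww, O). Returns relative reflectance per pixel."""
    white_spectrum = white_cube.reshape(-1, white_cube.shape[-1]).mean(axis=0)
    return scene_cube / np.maximum(white_spectrum, 1e-8)
```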

Fig. 11. (a) Schematic of our experimental setup. (b) The deeply learned filter that has been manufactured. Its dimensions are ${72mm \times 72mm \times 1mm}$.

The demosaiced RAW images of the two shots and the reconstruction results are shown in Fig. 12. The first shot is captured by the RGB camera, while the second shot is derived from the combined encoding of the deeply learned filter and the RGB camera. A multi-channel image is obtained by splicing the two captured images along the channel dimension and is used as the input of the spectral reconstruction network. The reconstructed spectral data cube is shown in Fig. 12. It can be observed that the data cube contains rich spatial detail information with almost no loss of spatial resolution.

Fig. 12. (a) is a RAW image captured with an RGB camera alone. (b) is a RAW image captured using a deeply learned filter and an RGB camera. The right side shows the spatial details of the reconstructed HSI in specific spectral channels.

Similarly, we select regions of interest from the captured scene to evaluate the spectral accuracy. The spectral curves of these regions of interest are shown in Fig. 13, where the Ground Truth is derived from the hyperspectral imager in the same region. It can be seen that our reconstruction results not only have high spatial resolution, but also have spectral curves that are highly consistent with those of the hyperspectral imager.

Fig. 13. (a) is the visualization result of the reconstructed hyperspectral image. The spectral curves of the 4 ROIs are shown on the lower side of the image, the red curves are the predictions while the black curves are the results obtained by the commercial spectral imager. (b)-(h) are the reconstruction results in various spectral channels.

More generally, we perform spectral evaluations for the 24 painted squares of a Macbeth ColorChecker, and the corresponding results are shown in Fig. 14: the blue curves represent the reconstruction results using only the 3-channel RGB image (without a filter), while the red curves represent the reconstruction results when using the deeply learned filter. The mean absolute errors between the blue curves, the red curves, and the Ground Truth are listed in Table 5. As seen in Fig. 14 and Table 5, our reconstruction results exhibit high consistency with the Ground Truth, and the reconstruction with the deeply learned filter surpasses the case without the filter on the majority of blocks. However, due to calibration and processing errors, as well as factors such as the wavelength sampling interval, the spectral accuracy of the reconstruction results can still be improved. For more experimental results, please refer to Section 7 of Supplement 1.

Fig. 14. Spectral estimation curves for the ColorChecker. The position of each subplot corresponds to the painted block when the ColorChecker is placed upright. The red curves are the results with the customized filter, the blue curves are the results using only an RGB camera, and the black curves are the results measured by the commercial spectral imager.


Table 5. The mean absolute error (${\times 10^{-2}}$) corresponding to the spectral curves in Fig. 14. "Position" represents the location of the color block, where the first number indicates the row and the second number indicates the column. For example, "1,1" corresponds to the block in the first row and first column of the ColorChecker.

Besides, to demonstrate more intuitively the advantages of our method over hyperspectral imagers in terms of imaging speed and image resolution, we capture the ISO 12233 resolution test chart using both our system and a commercial hyperspectral imager. The hyperspectral imager takes about 80 seconds to obtain a spectral data cube with a spatial resolution of ${960\times 936}$. Our method takes only about 2 seconds to complete the filter switch and the two shots; after reconstruction, a 3D data cube with a spatial resolution of ${6000\times 4000}$ is obtained. The two imaging results at 500 nm and the corresponding MTF curves are shown in Fig. 15.

Fig. 15. Quantitative comparison of the spatial resolution of our method and the commercial hyperspectral imager. (a) Image of our reconstruction result at 500 nm. (b) Image from the commercial hyperspectral imager at 500 nm. (c) MTF curves corresponding to the two methods.

It can be seen from the above experiments that our method can obtain very attractive spatial detail information while acquiring spectra with high fidelity. Moreover, our imaging system also has the advantages of low cost, lightweight, fast focusing speed, and excellent portability.

5. Discussion and conclusion

The growing demand for hyperspectral imaging and the development of deep learning in recent years have given rise to computational spectral imaging techniques, which typically surpass traditional spectral imaging techniques in the temporal or spatial dimension. Although the spectral data obtained by computational spectral imaging systems cannot match the ultrafine spectral resolution and wide spectral ranges of traditional spectral imaging techniques, these systems still perform acceptably in application scenarios with less stringent spectral accuracy requirements, such as identifying characteristic spectral peaks. With this background, we choose the wavelength encoding system to conduct our research.

In this paper, we propose combining broadband filters with the RGB camera’s CFA to achieve wavelength encoding. By combining filters and taking multiple shots, we can improve the accuracy of the spectral reconstruction results. We further propose an end-to-end optimization framework to jointly optimize the filters’ transmittance functions and the spectral reconstruction network, thus realizing the optimal design of the wavelength encoding scheme. In our spectral reconstruction network, we propose a spatial-channel attention (SCA) mechanism, which effectively improves the network’s performance; the network is highly competitive, with low computational complexity and high accuracy. We set the number of filters to one, design and fabricate the filter, and carry out real-world experiments. The reconstruction results have high spatial resolution and attractive spectral accuracy. Compared with other similar methods, our system has the following advantages: (1) a competitive spectral reconstruction network; (2) new transmittance functions can be designed for the application scenario rather than combining existing filters; (3) flexible filter configuration options. We look forward to improving our system and applying it to consumer devices such as smartphones.

However, our system still has some defects that need to be improved. For example, the design and manufacture of the deeply learned filter are challenging, and it is difficult to ensure that the fabricated result is consistent with the output of the end-to-end framework. To address this issue, in subsequent research we can attempt to express the transformation from the structural parameters of a thin-film filter to its transmittance function in differentiable form. By integrating this differentiable chain into the end-to-end optimization framework, the structural parameters of the filter can be optimized directly, avoiding the errors introduced during the structural design of the thin-film filter.

Furthermore, our hardware system requires two shots during the imaging procedure, so it cannot provide true snapshot imaging even though the imaging time is short. In this regard, we can improve the hardware of the imaging system, for example by using a dual-camera system with a common optical axis or a filter wheel, to further reduce the imaging time. Alternatively, enhancing the spectral reconstruction algorithm can further extend the capabilities of the hardware system.

Funding

Ningbo Science and Technology Plan Project (No. 2023Z053); Civil Aerospace Pre-Research Project (No. D040104); National Natural Science Foundation of China (No. 62275229); Key Research Project of Zhejiang Lab (No. K2022MH5TW01); Major Scientific Project of Zhejiang Laboratory (K2022MH5TW01).

Acknowledgment

We thank Meijuan Bian from the facility platform of optical engineering of Zhejiang University for instrument support. We appreciate the efforts of Menghao Li and Jiajian He in the camera calibration process.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. F. Wang, Q. Yi, J. Hu, et al., “Combining spectral and textural information in uav hyperspectral images to estimate rice grain yield,” Int. J. Appl. Earth Obs. Geoinformation 102, 102397 (2021). [CrossRef]  

2. J. He and I. Barton, “Hyperspectral remote sensing for detecting geotechnical problems at ray mine,” Eng. Geol. 292, 106261 (2021). [CrossRef]  

3. P. Daukantas, “Hyperspectral imaging meets biomedicine,” Opt. Photonics News 31(4), 32–39 (2020). [CrossRef]  

4. N. M. Nasrabadi, “Hyperspectral target detection : An overview of current and future challenges,” IEEE Signal Process. Mag. 31(1), 34–44 (2014). [CrossRef]  

5. L. Huang, R. Luo, X. Liu, et al., “Spectral imaging with deep learning,” Light: Sci. Appl. 11(1), 61 (2022). [CrossRef]  

6. B. Arad, R. Timofte, O. Ben-Shahar, et al., “Ntire 2020 challenge on spectral reconstruction from an rgb image,”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, (2020), pp. 446–447.

7. B. Arad, R. Timofte, R. Yahel, et al., “Ntire 2022 spectral recovery challenge and data set,”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 863–881.

8. S. Chen, H. Feng, D. Pan, et al., “Optical aberrations correction in postprocessing using imaging simulation,” ACM Trans. Graph. 40(5), 1–15 (2021). [CrossRef]  

9. T. H. Kim, H. J. Kong, T. H. Kim, et al., “Design and fabrication of a 900–1700nm hyper-spectral imaging spectrometer,” Opt. Commun. 283(3), 355–361 (2010). [CrossRef]  

10. P. Mouroulis, R. O. Green, and T. G. Chrien, “Design of pushbroom imaging spectrometers for optimum recovery of spectroscopic and spatial information,” Appl. Opt. 39(13), 2210–2220 (2000). [CrossRef]  

11. J. Brauers, N. Schulte, and T. Aach, “Multispectral filter-wheel cameras: Geometric distortion model and compensation algorithms,” IEEE Trans. on Image Process. 17(12), 2368–2380 (2008). [CrossRef]  

12. Z. Xu, H. Zhao, G. Jia, et al., “Optical schemes of super-angular aotf-based imagers and system response analysis,” Opt. Commun. 498, 127204 (2021). [CrossRef]  

13. N. A. Hagen and M. W. Kudenov, “Review of snapshot spectral imaging technologies,” Opt. Eng. 52(9), 090901 (2013). [CrossRef]  

14. M. E. Gehm, R. John, D. J. Brady, et al., “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013–14027 (2007). [CrossRef]  

15. Y. Wu, I. O. Mirza, G. R. Arce, et al., “Development of a digital-micromirror-device-based multishot snapshot spectral imaging system,” Opt. Lett. 36(14), 2692–2694 (2011). [CrossRef]  

16. A. Parada-Mayorga and G. R. Arce, “Colored coded aperture design in compressive spectral imaging via minimum coherence,” IEEE Trans. Comput. Imaging 3(2), 202–216 (2017). [CrossRef]  

17. L. Wang, Z. Xiong, D. Gao, et al., “Dual-camera design for coded aperture snapshot spectral imaging,” Appl. Opt. 54(4), 848–858 (2015). [CrossRef]  

18. C. Tao, H. Zhu, P. Sun, et al., “Hyperspectral image recovery based on fusion of coded aperture snapshot spectral imaging and rgb images by guided filtering,” Opt. Commun. 458, 124804 (2020). [CrossRef]  

19. Z. Meng, J. Ma, and X. Yuan, “End-to-end low cost compressive spectral imaging with spatial-spectral self-attention,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, eds. (Springer International Publishing, Cham, 2020), pp. 187–204.

20. Y. Cai, J. Lin, X. Hu, et al., “Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction,”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), pp. 17502–17511.

21. L. Wang, Z. Wu, Y. Zhong, et al., “Snapshot spectral compressive imaging reconstruction using convolution and contextual transformer,” Photonics Res. 10(8), 1848–1858 (2022). [CrossRef]  

22. Z. Yang, T. Albrow-Owen, W. Cai, et al., “Miniaturization of optical spectrometers,” Science 371(6528), eabe0722 (2021). [CrossRef]  

23. D. S. Jeon, S.-H. Baek, S. Yi, et al., “Compact snapshot hyperspectral imaging with diffracted rotation,” ACM Trans. Graph. 38(4), 1–13 (2019). [CrossRef]  

24. H. Xu, H. Hu, S. Chen, et al., “Hyperspectral image reconstruction based on the fusion of diffracted rotation blurred and clear images,” Opt. Lasers in Eng. 160, 107274 (2023). [CrossRef]  

25. H. Hu, H. Zhou, Z. Xu, et al., “Practical snapshot hyperspectral imaging with doe,” Opt. Lasers in Eng. 156, 107098 (2022). [CrossRef]  

26. N. Xu, H. Xu, S. Chen, et al., “Snapshot hyperspectral imaging based on equalization designed doe,” Opt. Express 31(12), 20489–20504 (2023). [CrossRef]  

27. M. E. Toivonen, C. Rajani, and A. Klami, “Snapshot hyperspectral imaging using wide dilation networks,” Mach. Vis. Appl. 32(1), 9 (2021). [CrossRef]  

28. S.-H. Baek, H. Ikoma, D. S. Jeon, et al., “Single-shot hyperspectral-depth imaging with learned diffractive optics,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), (2021), pp. 2651–2660.

29. W. Zhang, H. Song, X. He, et al., “Deeply learned broadband encoding stochastic hyperspectral imaging,” Light: Sci. Appl. 10(1), 108 (2021). [CrossRef]  

30. Y. Zhu, X. Lei, K. X. Wang, et al., “Compact cmos spectral sensor for the visible spectrum,” Photonics Res. 7(9), 961–966 (2019). [CrossRef]  

31. J. Xiong, X. Cai, K. Cui, et al., “Dynamic brain spectrum acquired by a real-time ultraspectral imaging chip with reconfigurable metasurfaces,” Optica 9(5), 461–468 (2022). [CrossRef]  

32. S. Nie, L. Gu, Y. Zheng, et al., “Deeply learned filter response functions for hyperspectral reconstruction,”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

33. S. W. Oh, M. S. Brown, M. Pollefeys, et al., “Do it yourself hyperspectral imaging with everyday digital cameras,”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016).

34. H. Fu, L. Bian, X. Cao, et al., “Hyperspectral imaging from a raw mosaic image with end-to-end learning,” Opt. Express 28(1), 314–324 (2020). [CrossRef]  

35. Y. Fu, T. Zhang, Y. Zheng, et al., “Joint camera spectral response selection and hyperspectral image recovery,” IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 256–272 (2022). [CrossRef]  

36. W. He, N. Yokoya, and X. Yuan, “Fast hyperspectral image recovery of dual-camera compressive hyperspectral imaging via non-iterative subspace-based fusion,” IEEE Trans. on Image Process. 30, 7170–7183 (2021). [CrossRef]  

37. B. Arad and O. Ben-Shahar, “Sparse recovery of hyperspectral signal from natural rgb images,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, eds. (Springer International Publishing, Cham, 2016), pp. 19–34.

38. I. Choi, D. S. Jeon, G. Nam, et al., “High-quality hyperspectral reconstruction using a spectral prior,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]  

39. B. Lim, S. Son, H. Kim, et al., “Enhanced deep residual networks for single image super-resolution,”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2017).

40. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds. (Springer International Publishing, Cham, 2015), pp. 234–241.

41. L. Chen, X. Lu, J. Zhang, et al., “HINet: Half instance normalization network for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, (2021), pp. 182–192.

42. Z. Shi, C. Chen, Z. Xiong, et al., “HSCNN+: Advanced CNN-based hyperspectral recovery from RGB images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, (2018).

43. T. Liu, R. Luo, L. Xu, et al., “Spatial channel attention for deep convolutional neural networks,” Mathematics 10(10), 1750 (2022). [CrossRef]  

Supplementary Material (1)

Supplement 1: Revised supplemental document.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.



Figures (15)

Fig. 1. Schematic diagram of the imaging scene. The scene is illuminated by a light source, and the reflected light is then encoded in the wavelength dimension by a combined filter and an RGB camera.
Fig. 2. Schematic diagram of the wavelength-encoded image degradation pipeline. To closely resemble the actual image formation process, we first integrate the wavelength-encoded HSI along the R, G, and B channels, then degrade the three-channel image into a RAW image via three 0/1 masks and inject noise into the RAW image. Demosaicing the noisy RAW image yields the degraded RGB image. (A minimal code sketch of this pipeline is given after this figure list.)
Fig. 3. Schematic diagram of our end-to-end optimization framework. In forward propagation, multi-channel wavelength-encoded images are generated and used to reconstruct HSIs. In back propagation, the weights of the spectral reconstruction network are optimized and the filter transmittance functions are designed automatically via gradient back propagation.
Fig. 4. The mapping relationship between the filter’s network optimization parameters ${\omega_q(\lambda)}$ and transmittance function ${t_q(\lambda)}$.
Fig. 5. Our proposed network architecture. (a) is the main structure of the network. (b)–(d) illustrate the detailed components of the SCA Block, Spatial Attention Block, and Channel Attention Block.
Fig. 6. The reconstruction results of our method. We demonstrate the spatial details and spectral curves of our reconstructed results compared to the Ground Truth. For better visual perception, we visualize the Ground Truth as an RGB image.
Fig. 7. Spectral accuracy comparison of different reconstruction methods. We selected four regions of interest from the two scenes for analysis and selected five better-performing methods from Table 1 for comparison. The root mean square errors (RMSEs) between the spectral curves of the different methods and the Ground Truth are shown in Table 3.
Fig. 8. Error maps of different reconstruction methods. At 500 nm, we calculate the absolute value of the difference between the reconstruction result and the Ground Truth, as well as the mean absolute error (MAE) of the whole error map. The smaller the value in the error map, the closer to the Ground Truth.
Fig. 9. The accuracy of the reconstruction results varies with the number of filters and the number of shots. ${m}$ represents the number of filters.
Fig. 10. (a) The ideal curve indicates the optimization result of the end-to-end framework; the designed curve indicates the transmittance curve calculated by the film design software (incident angle is 0); the manufactured result indicates the calibration value of the actual filter. (b) The relative spectral response function of the RGB camera. (c) The wavelength encoding scheme composed of the deeply learned filter and the RGB imaging system.
Fig. 11. (a) Schematic of our experimental setup. (b) The manufactured deeply learned filter. Its dimensions are ${72\,\textrm{mm} \times 72\,\textrm{mm} \times 1\,\textrm{mm}}$.
Fig. 12. (a) A RAW image captured with an RGB camera alone. (b) A RAW image captured using a deeply learned filter and an RGB camera. The right side shows the spatial details of the reconstructed HSI in specific spectral channels.
Fig. 13. (a) The visualization result of the reconstructed hyperspectral image. The spectral curves of the 4 ROIs are shown on the lower side of the image; the red curves are the predictions, while the black curves are the results obtained by the commercial spectral imager. (b)–(h) The reconstruction results in various spectral channels.
Fig. 14. Spectral estimation curves for the ColorChecker. The position of each subplot corresponds to the painted blocks when the ColorChecker is placed upright. The red curves are the results with a customized filter, the blue curves are the results using only an RGB camera, and the black curves are the results measured by the commercial spectral imager.
Fig. 15. Quantitative comparison of the spatial resolution of our method and the commercial hyperspectral imager. (a) The image of our reconstruction result at 500 nm. (b) The image of the commercial hyperspectral imager at 500 nm. (c) The MTF curves corresponding to the two methods.
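
For readers who want to prototype the degradation pipeline summarized in the Fig. 2 caption, below is a minimal NumPy sketch. The RGGB Bayer layout, the Gaussian noise model, and the box-filter "demosaicing" are our own illustrative assumptions rather than the exact implementation used in the paper.

```python
import numpy as np

def box_filter(img, r=1):
    """(2r+1)x(2r+1) box filter via edge padding and shifted sums."""
    padded = np.pad(img, r, mode="edge")
    out = np.zeros_like(img)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += padded[r + dy : r + dy + img.shape[0],
                          r + dx : r + dx + img.shape[1]]
    return out

def encode_and_degrade(hsi, sensitivity, noise_std=0.01, seed=0):
    """Turn a wavelength-encoded HSI into a noisy RAW image and a degraded RGB image.

    hsi:         (H, W, L) cube, already multiplied by the filter transmittance T_i(lambda).
    sensitivity: (L, 3) relative spectral responses S_c(lambda) of the R, G, B channels.
    """
    H, W, L = hsi.shape
    # 1) Integrate along wavelength to obtain the three-channel image.
    rgb = (hsi.reshape(-1, L) @ sensitivity).reshape(H, W, 3)

    # 2) Degrade to a RAW mosaic with three 0/1 masks (assumed RGGB layout).
    masks = np.zeros((H, W, 3))
    masks[0::2, 0::2, 0] = 1          # R
    masks[0::2, 1::2, 1] = 1          # G
    masks[1::2, 0::2, 1] = 1          # G
    masks[1::2, 1::2, 2] = 1          # B
    raw = (rgb * masks).sum(axis=-1)

    # 3) Inject noise into the RAW image (Gaussian, purely illustrative).
    rng = np.random.default_rng(seed)
    raw_noisy = raw + rng.normal(0.0, noise_std, raw.shape)

    # 4) Demosaic by normalized local averaging of each masked channel
    #    (a crude stand-in for a real demosaicing algorithm).
    degraded = np.empty_like(rgb)
    for c in range(3):
        degraded[..., c] = box_filter(raw_noisy * masks[..., c]) / \
                           np.maximum(box_filter(masks[..., c]), 1e-8)
    return raw_noisy, degraded
```
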

Tables (5)

Table 1. Performance of different reconstruction methods on the testing set.

Table 2. Computing resource overhead of different networks.

Table 3. The root mean square errors (RMSEs, ${\times 10^{3}}$) between the spectral curves of different methods and the Ground Truth in Fig. 7.

Table 4. The number of shots and the metrics of the reconstruction results when using different numbers of deeply learned filters. m indicates the number of deeply learned filters; N indicates the number of shots.

Table 5. The mean absolute error (${\times 10^{2}}$) corresponding to the spectral curves in Fig. 14. "Position" represents the location of the color block, where the first number indicates the row and the second number indicates the column. For example, "1,1" corresponds to the block in the first row and first column of the ColorChecker.

Equations (12)


$$T_i(\lambda) = \prod_{q=0}^{n} t_q(\lambda),$$
$$N = \sum_{n=0}^{m} C_m^n = 2^m,$$
$$J_c(x,y) = \int_{\lambda_1}^{\lambda_2} E(\lambda)\, R(x,y,\lambda)\, T_i(\lambda)\, S_c(\lambda)\, \mathrm{d}\lambda,$$
$$\begin{bmatrix} J_1^c \\ J_2^c \\ \vdots \\ J_N^c \end{bmatrix} = \begin{bmatrix} T_{11}S_{c1} & T_{12}S_{c2} & \cdots & T_{1o}S_{co} \\ T_{21}S_{c1} & T_{22}S_{c2} & \cdots & T_{2o}S_{co} \\ \vdots & \vdots & \ddots & \vdots \\ T_{N1}S_{c1} & T_{N2}S_{c2} & \cdots & T_{No}S_{co} \end{bmatrix} \begin{bmatrix} I_1 \\ I_2 \\ \vdots \\ I_o \end{bmatrix}.$$
$$J = HI,$$
$$t_q(\lambda) = \frac{1}{1 + e^{-\omega_q(\lambda)}},$$
$$L_{MAE} = \frac{1}{K} \sum_{k=1}^{K} \left| \mathcal{N}(J_{in}^{k}) - I_{GT}^{k} \right|,$$
$$L_{SSIM} = \frac{1}{K} \sum_{k=1}^{K} \left\{ 1 - \mathrm{SSIM}\!\left[ \mathcal{N}(J_{in}^{k}), I_{GT}^{k} \right] \right\},$$
$$L_{Grad} = \frac{1}{K} \sum_{k=1}^{K} \left| \mathrm{Grad}\!\left[ \mathcal{N}(J_{in}^{k}) \right] - \mathrm{Grad}\!\left( I_{GT}^{k} \right) \right|,$$
$$\mathrm{Grad}(I) = \sqrt{I_x(x,y)^2 + I_y(x,y)^2}, \quad I_x(x,y) = I(x+1,y) - I(x-1,y), \quad I_y(x,y) = I(x,y+1) - I(x,y-1).$$
$$L_{filter} = \sum_{q=1}^{m} \sum_{p=2}^{o} \left| t_q(p) - t_q(p-1) \right|,$$
$$L_{total} = \alpha L_{MAE} + \beta L_{SSIM} + \gamma L_{Grad} + \eta L_{filter}.$$
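
To make the encoding model above concrete, the following sketch evaluates the sigmoid mapping from the unconstrained parameters ${\omega_q(\lambda)}$ to the transmittances ${t_q(\lambda)}$ and assembles the encoding matrix H for one color channel, so that the measurements follow J = HI. The 31-band sampling, the random placeholder responses, and the assumption that stacked filters combine multiplicatively are ours, not values taken from the paper.

```python
import numpy as np

def transmittance(omega):
    """Sigmoid mapping t_q(lambda) = 1 / (1 + exp(-omega_q(lambda))),
    which keeps every learned transmittance value inside (0, 1)."""
    return 1.0 / (1.0 + np.exp(-omega))

# Illustrative sampling: o = 31 spectral bands (e.g. 400-700 nm in 10 nm steps).
o, m = 31, 2                                   # bands, number of designed filters
rng = np.random.default_rng(0)
omega = rng.normal(size=(m, o))                # unconstrained parameters omega_q(lambda)
t = transmittance(omega)                       # (m, o) filter transmittances t_q(lambda)

# Camera responses S_c(lambda) for c in {R, G, B}; random placeholders here.
S = rng.uniform(0.0, 1.0, size=(3, o))

# Build the encoding matrix H for one colour channel c and all N = 2^m filter
# combinations; row i uses the combined transmittance of filter subset i
# (we assume stacked filters multiply, and the empty subset is the bare camera).
c = 1                                          # green channel, for example
rows = []
for subset in range(2 ** m):
    T_i = np.ones(o)
    for q in range(m):
        if subset & (1 << q):
            T_i = T_i * t[q]
    rows.append(T_i * S[c])                    # T_i(lambda) * S_c(lambda)
H = np.stack(rows)                             # (N, o) encoding matrix

I = rng.uniform(0.0, 1.0, size=o)              # spectrum of one pixel
J = H @ I                                      # encoded measurements, J = HI
```
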
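The loss terms can likewise be written out directly. The sketch below evaluates the MAE, gradient, and filter-smoothness terms and combines them with placeholder weights; the SSIM term is omitted for brevity, and the batch layout is our own assumption.

```python
import numpy as np

def grad_magnitude(img):
    """Gradient magnitude via central differences along both image axes."""
    ix = np.zeros_like(img)
    iy = np.zeros_like(img)
    ix[1:-1, :] = img[2:, :] - img[:-2, :]
    iy[:, 1:-1] = img[:, 2:] - img[:, :-2]
    return np.sqrt(ix ** 2 + iy ** 2)

def total_loss(pred, gt, t, alpha=1.0, gamma=0.1, eta=0.01):
    """Composite loss with placeholder weights; the SSIM term is left out.

    pred, gt: (K, H, W) reconstructed and ground-truth band images.
    t:        (m, o) filter transmittance curves sampled at o wavelengths.
    """
    l_mae = np.mean(np.abs(pred - gt))                               # MAE term
    l_grad = np.mean([np.abs(grad_magnitude(p) - grad_magnitude(g)).mean()
                      for p, g in zip(pred, gt)])                    # gradient term
    l_filter = np.sum(np.abs(np.diff(t, axis=1)))                    # smoothness of t_q
    return alpha * l_mae + gamma * l_grad + eta * l_filter
```
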