
Wavelength encoding spectral imaging based on the combination of deeply learned filters and an RGB camera

Open Access

Abstract

Hyperspectral imaging is a critical tool for gathering spatial-spectral information in many fields of scientific research. Owing to advances in spectral reconstruction algorithms, significant progress has been made in reconstructing hyperspectral images from commonly acquired RGB images. However, because the three-channel input carries limited information, reconstructing spectral information from RGB images is an ill-posed problem. Furthermore, conventional camera color filter arrays (CFAs) are designed for human perception and are not optimal for spectral reconstruction. To increase the diversity of wavelength encoding, we propose placing broadband encoding filters in front of the RGB camera. In this configuration, the spectral sensitivity of the imaging system is determined jointly by the filters and the camera itself. To achieve an optimal encoding scheme, we use an end-to-end optimization framework to automatically design the filters’ transmittance functions and optimize the weights of the spectral reconstruction network. Simulation experiments show that our proposed spectral reconstruction network has excellent spectral mapping capability and that our joint wavelength encoding imaging framework is superior to traditional RGB imaging systems. We fabricate the deeply learned filter and conduct real-world imaging experiments. The spectral reconstruction results achieve attractive spatial resolution and spectral accuracy.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

A hyperspectral image (HSI) can be described as a collection of 1D spectral curves sampled at each spatial position, or equivalently as a stack of 2D spatial images across many spectral channels. Because different objects have distinct spectral properties, spectral imaging techniques have been widely used in agriculture [1], remote sensing [2], biomedicine [3], target tracking [4], and so on. Most currently used spectral imaging systems are based on the scanning principle. However, the scanning process is time-consuming and requires the assistance of sophisticated, complex, and bulky mechanical systems, which severely limits their application in scenarios where portability and imaging speed are desired. With the development of computational spectral imaging, numerous encoding- and reconstruction-based spectral imaging systems have emerged in recent years [5]. These systems enable snapshot imaging while substantially reducing system complexity and volume.

Many scholars have focused their research on reconstructing 3D spectral data cubes from easily accessible RGB images. Spectral reconstruction from RGB is a popular topic in the New Trends in Image Restoration and Enhancement (NTIRE) workshop, which attracts a large number of participants every two years, and the flow of related algorithms has been growing steadily [6,7]. However, most of these algorithms treat RGB-to-spectrum reconstruction as a channel-dimensional super-resolution task, focusing mainly on the mapping capacity of convolutional neural networks (CNNs). Different RGB cameras have different spectral sensitivities and correspondingly different image signal processing (ISP) pipelines [8], so the generalization of such methods to complex scenes remains limited. In contrast, some researchers regard RGB images as the result of wavelength encoding and thus treat spectral reconstruction as a decompression process. More specifically, the CFA of an RGB camera can be regarded as a wavelength encoding element that modulates the spectrum of the incident light field.

In this work, we start from the principle of wavelength-encoded spectral imaging and propose to place broadband filters in front of an RGB camera. By switching filters between shots, both the encoding diversity and the initial data volume are increased. In this system, the CFA of the camera and the broadband filters work together as the wavelength encoding element. Since the CFA of the camera is immutable, we aim to design the transmittance functions of the filters through an end-to-end optimization framework to achieve the optimal wavelength encoding scheme. Within this framework, we first propose a CNN-based algorithm for RGB-to-spectrum reconstruction; experimental results show that our proposed spatial-channel attention module effectively improves network performance. We then set the transmittance functions of the filters as trainable parameters and jointly optimize the network weights and the filters’ transmittance functions through data-driven learning. To facilitate fabrication, we impose non-negative and smoothness constraints on the filter parameters during optimization. Finally, we fabricate a broadband filter and conduct imaging experiments.

Our contribution can be summarized as follows:

  • We reconsider RGB imaging from the perspective of wavelength encoding, and we argue that all wavelength-encoding spectral reconstruction tasks should be performed in the RAW domain of RGB images to avoid the influence of nonlinear transformations in the ISP.
  • We propose to place a broadband filter in front of the RGB camera so that wavelength encoding is realized through the combination of the filter and the camera’s CFA. Our proposed system avoids the challenge of integrating specifically designed filters on a single chip and improves the acquisition efficiency of wavelength-encoded images by acquiring a three-channel image in a single shot.
  • We propose an end-to-end optimization framework that simultaneously realizes the design of broadband encoding filters and the optimization of network parameters, enabling the optimal wavelength encoding scheme for the current scenario. Within this framework, we propose a neural network for spectral reconstruction based on a spatial-channel attention block. Our experimental results demonstrate that the network has excellent spectral reconstruction ability and is quite competitive compared with other methods.
  • We fabricate a broadband filter based on the designed parameters and construct an experimental system to perform spectral reconstruction of real scenes. Our reconstruction results have higher spatial resolution and similar spectral curves compared with the results of a commercial spectral imager.

The rest of this paper is organized as follows. In Sec. 2, we briefly introduce and discuss the work related to spectral imaging. Sec. 3 illustrates our imaging model and the implementation details of the end-to-end optimization framework. Detailed numerical simulations and practical experiments are given in Sec. 4. Finally, we discuss and conclude in Sec. 5.

2. Related work

Traditional scanning-based spectral imaging systems usually scan along the spectral or spatial dimension and stitch together multiple measurements to generate a 3D spectral data cube. A typical push-broom imaging system scans along the spatial dimension [9,10] and requires the coordination of slits, dispersive elements (such as gratings or prisms), mechanically moving parts, etc.; one-dimensional spatial information and the corresponding one-dimensional spectral information are obtained in each shot. An imaging system that scans along the spectral dimension [11,12] usually places a narrow-band filter in the imaging optical path and obtains the two-dimensional spatial information of the current spectral channel in a single shot. There are also dispersion-based systems, such as integral field units (IFUs) [13], that enable snapshot acquisition of hyperspectral images. However, IFUs face challenges of system complexity and a trade-off between spatial resolution and spectral accuracy.

Deep learning has been widely applied in various image restoration and enhancement tasks in recent years, and it has also given rise to several computational spectral imaging systems that do not require a scanning process, also known as encoding and reconstruction-based spectral imaging systems [5]. According to the encoding methods, the existing reconstruction-based spectral imaging systems can be divided into three categories: amplitude encoding-based, phase encoding-based, and wavelength encoding-based.

Amplitude encoding-based systems build on compressive sensing theory and employ binary-coded masks and dispersive elements to achieve snapshot spectral imaging. A typical amplitude encoding system is the coded aperture snapshot spectral imaging (CASSI) system proposed by Gehm et al. [14]. In CASSI, the incident 3D data cube is first modulated by a 0/1 mask, the encoded data cube is then spatially shifted by a dispersion element, and finally the shifted data cube is integrated along the spectral dimension on the image sensor to obtain a 2D measurement. Since the CASSI system was proposed, numerous improvements have been made to both the system structure and the reconstruction algorithm. Wu et al. [15] proposed using a digital micromirror device (DMD) as the encoding element to achieve dynamic amplitude modulation. To improve reconstruction results, Parada-Mayorga et al. [16] proposed using color-coded apertures rather than random black-and-white coded apertures. A dual optical path system based on a beam splitter was proposed [17,18], with one path capturing a clear image and the other performing CASSI; high-quality HSIs can be obtained by fusing and reconstructing the two images. With the growth of computing resources, reconstruction algorithms based on deep learning have boosted the performance of CASSI systems in recent years [19–21]. However, due to its elaborate optical system, the CASSI system can only achieve relatively stable spectral imaging under laboratory conditions [22]. In comparison, our system has a more compact optical setup and better portability.

Phase encoding-based systems encode the phase term of the incident light, which in turn modulates the imaging system’s point spread function (PSF). The typical phase encoding elements are diffractive optical elements (DOEs). The optical path difference caused by the surface profile of the DOE and the refractive index difference introduces an additional phase term, which leads to wavelength-dependent PSFs on the sensor plane. Jeon et al. [23] proposed using a single DOE to achieve dispersion and imaging simultaneously; the system’s PSFs exhibit an anisotropic shape with a rotation angle, and a 3D data cube can be obtained through a single shot and image restoration. Improvements to the imaging system and reconstruction algorithm based on this work have been proposed in recent years [24–26]. Toivonen et al. [27] proposed using a conventional camera with a diffraction grating element to achieve snapshot spectral imaging, in which the grating projects different spectral components to different spatial positions. Baek et al. [28] proposed using a deeply learned DOE to achieve spectral and depth imaging at the same time. These systems usually have the advantages of small size, compactness, and snapshot acquisition, but their PSFs are usually large and the initial images are severely degraded, which makes image restoration difficult. In contrast, our method encodes in the wavelength dimension with less loss of image detail.

Unlike the previously described systems, wavelength encoding-based systems primarily focus on encoding and integrating the 3D data cube in the wavelength dimension without sacrificing spatial resolution. Owing to the unique spectral transmittance characteristics of different materials, there are various ways to realize wavelength encoding. Zhang et al. [29] proposed using thin-film filters as wavelength encoding elements and a deep neural network (DNN) to realize both filter design and spectrum reconstruction. Similarly, Zhu et al. [30] proposed covering CMOS sensors with silicon nitride-based photonic crystal (PC) slabs as wavelength encoding elements. Xiong et al. [31] proposed using metasurfaces with different patterns to encode the incident spectra and integrate multiple pixels into a micro-spectrometer. Nie et al. [32] argued that the CFA of a camera is designed for human perception and is not optimal for spectral reconstruction tasks, and therefore proposed optimizing the design of the camera’s response function. Oh et al. [33] proposed utilizing multiple cameras with different spectral sensitivities to capture the same scene and reconstruct the spectrum from the different RGB measurements. Fu et al. [34] proposed reconstructing a 3D data cube from a single RAW mosaic image and investigated the performance of three different network structures on the spectral reconstruction task. Fu et al. [35] proposed a method to select cameras with different spectral responses for the best spectral recovery. In parallel, a large body of work has contributed various spectral reconstruction networks [6,7].

3. Methods

In this section, we first present the image formation model based on wavelength encoding, together with the image degradation pipeline used to simulate measurements in the subsequent experiments. We then present the end-to-end framework for jointly optimizing the filter and network parameters, with details of the network structure and the constraints imposed.

3.1 Image formation model

In our proposed system, assume that we have ${m}$ single-chip filters. Selecting ${n}$ (${n \in \{0,1,\ldots,m\}}$) of the ${m}$ filters to form a combined filter, the transmittance function of the combined filter can be expressed as

$$T_i(\lambda) = \prod_{q=0}^{n} t_q(\lambda) ,$$
where the subscript ${i}$ denotes the ${i^{th}}$ combination scheme and ${t_q(\lambda )}$ is the transmittance function of the ${q^{th}}$ single-chip filter. When taking images, a combined filter is placed in front of the imaging lens, and the corresponding wavelength-encoded image is acquired. We can infer that the maximum number of shots is
$$N = \sum_{n=0}^{m} C_{m}^{n} = 2^m ,$$
where ${C_{m}^{n} = \frac{m!}{n!(m-n)!}}$ is the number of ways of choosing ${n}$ filters from the ${m}$ available, and ’${!}$’ indicates the factorial operation. As the RGB camera obtains three color channels (R, G, and B) in each shot, the spectral reconstruction network can receive image inputs with up to ${3N}$ channels. The sketch below illustrates this combination count.
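In the following snippet the filter transmittances are random placeholders rather than designed values; only the counting logic of Eqs. (1) and (2) is illustrated.

```python
# Illustrative only: random stand-in transmittances, not the deeply learned filters.
from itertools import combinations
import numpy as np

m = 3                                                                  # number of single-chip filters
t_single = np.random.default_rng(0).uniform(0.2, 0.9, size=(m, 25))   # m filters, 25 bands

# Eq. (1): the combined transmittance is the product of the selected filters;
# the empty selection (n = 0) corresponds to shooting without any filter.
combined = [np.prod(t_single[list(idx)], axis=0)
            for n in range(m + 1)
            for idx in combinations(range(m), n)]

assert len(combined) == 2 ** m                          # Eq. (2): at most N = 2^m shots
print(f"up to {3 * len(combined)} input channels for the spectral reconstruction network")
```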

Consider the imaging scene shown in Fig. 1: an object is illuminated by a light source with a spectral distribution ${ E(\lambda ) }$. The light reflected by the object passes through a filter, an imaging lens, and a color filter array before forming an image on the image sensor. Assuming that the reflectance of the object’s surface is ${R(x,y,\lambda )}$, the transmittance function of the filter is ${T_i(\lambda )}$, and the response function of the camera is ${S_c(\lambda )}$, the formation model of the RGB image can be described as

$$J_c(x,y) = \int_{\lambda_1}^{\lambda_2} E(\lambda) R(x,y,\lambda) T_i(\lambda) S_c(\lambda) d\lambda ,$$
where ${(x,y)}$ are the spatial coordinates, the subscript ${c \in \{r,g,b\}}$ denotes the color channel, ${[\lambda _1, \lambda _2]}$ indicates the active bandwidth of the system, and ${S_c(\lambda )}$ contains the integrated effect of the imaging lens’ transmittance, the CFA’s transmittance, and the quantum efficiency of the CMOS sensor. For simplicity, the spectrum of the light source ${E(\lambda )}$ and the reflectance of the object ${R(x,y,\lambda )}$ can be combined into the spectral radiance ${I(x,y,\lambda )}$ of the scene. The spectral reconstruction then yields the spectral radiance of the scene, and the relative reflectance of the object can be obtained by spectral calibration; the principle and procedure of spectral calibration are detailed in Section 1 of Supplement 1. Substituting ${I(x,y,\lambda )}$ into Eq. (3) and rewriting it in discrete matrix form gives
$$\begin{bmatrix} J_{1c}\\ J_{2c}\\ \vdots \\ J_{Nc} \end{bmatrix} = \begin{bmatrix} T_{11}S_{c1} & T_{12}S_{c2} & \cdots & T_{1o}S_{co} \\ T_{21}S_{c1} & T_{22}S_{c2} & \cdots & T_{2o}S_{co} \\ \vdots & \vdots & \ddots & \vdots \\ T_{N1}S_{c1} & T_{N2}S_{c2} & \cdots & T_{No}S_{co} \end{bmatrix} \cdot \begin{bmatrix} I_{1}\\ I_{2}\\ \vdots \\ I_{o} \end{bmatrix} .$$

Furthermore, Eq. (4) can be written compactly as

$$\mathbf{ J = HI } ,$$
where ${\mathbf {H}}$ is the encoding matrix formed jointly by the encoding filter and the RGB camera.
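For clarity, a minimal numerical sketch of this discrete forward model is given below (not the authors' code; the array names and shapes are assumptions):

```python
# Discrete form of Eqs. (3)-(5): integrate the scene radiance against the
# combined encoding T_i(lambda) * S_c(lambda) to obtain one encoded RGB shot.
import numpy as np

def encode_rgb(radiance: np.ndarray, t_comb: np.ndarray, s_rgb: np.ndarray) -> np.ndarray:
    """radiance: (H, W, O) spectral radiance cube with O channels,
    t_comb: (O,) combined filter transmittance, s_rgb: (3, O) camera response."""
    h = s_rgb * t_comb[None, :]                   # per-channel rows of the encoding matrix H
    return np.einsum('hwo,co->hwc', radiance, h)  # Riemann sum over the spectral dimension
```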

Fig. 1. Schematic diagram of imaging scene. The scene is illuminated by a light source, and the reflected light is then encoded in the wavelength dimension by a combined filter and an RGB camera.

Since the spectral reconstruction network is a data-driven algorithm, we propose a simulation image degradation pipeline as shown in Fig. 2.

Fig. 2. Schematic diagram of the wavelength-encoded image degradation pipeline. To ensure that it closely resembles the actual image formation process, we first integrate the wavelength-encoded HSI along the three channels of R, G, and B, then degrade the three-channel image into a RAW image via three 0/1 masks and inject noise into the RAW image. Demosaicing the noisy RAW image yields the degraded RGB image.

The input HSI is first modulated by a combined filter and then integrated by the imaging system to form a wavelength-encoded RGB image. Next, we imitate the effect of the camera’s Bayer array and use spatial mosaicing to convert the three-channel image into a single-channel RAW image. We inject noise into the RAW image and subsequently use bilinear interpolation to convert it back to a degraded full-resolution RGB image. Because the imaging system is designed for multiple shots, the transmittance function of the filter is changed to generate each new RGB image. The multiple RGB images are stitched together along the channel dimension to obtain a multi-channel image. The dataset is made up of many data pairs, each consisting of a multi-channel image and an HSI, and these pairs are used to optimize the end-to-end framework. A minimal sketch of this degradation pipeline is given below.
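The sketch assumes an RGGB Bayer pattern and a normalized-convolution interpolation standing in for bilinear demosaicing; the authors' implementation may differ in these details.

```python
import numpy as np
from scipy.ndimage import convolve

def bayer_masks(h, w):
    """Three 0/1 masks of an RGGB Bayer pattern, stacked as shape (h, w, 3)."""
    m = np.zeros((h, w, 3))
    m[0::2, 0::2, 0] = 1                          # R at even rows / even cols
    m[0::2, 1::2, 1] = m[1::2, 0::2, 1] = 1       # G at the two remaining diagonals
    m[1::2, 1::2, 2] = 1                          # B at odd rows / odd cols
    return m

def degrade(rgb, sigma=0.008, rng=np.random.default_rng(0)):
    """Encoded RGB -> mosaiced RAW -> noisy RAW -> demosaiced degraded RGB."""
    masks = bayer_masks(*rgb.shape[:2])
    raw = (rgb * masks).sum(axis=-1)                               # single-channel RAW
    raw = np.clip(raw + rng.normal(0.0, sigma, raw.shape), 0, 1)   # noise injection
    kernel = np.array([[0.25, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25]])
    out = np.zeros_like(rgb)
    for c in range(3):                                             # per-channel interpolation
        num = convolve(raw * masks[..., c], kernel, mode='mirror')
        den = convolve(masks[..., c], kernel, mode='mirror')
        out[..., c] = num / np.maximum(den, 1e-8)
    return out
```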

3.2 End-to-end optimization framework

According to the imaging model above, the overall wavelength encoding process is implemented jointly by a combined filter and a CFA. In consumer RGB cameras, the CFA is integrated in front of the CMOS sensor, and it is difficult to modify its transmittance function. Therefore, we focus on designing the transmittance function of the broadband filter. To achieve optimal wavelength encoding, we propose an end-to-end optimization framework, shown in Fig. 3, that jointly optimizes the filter parameters and the weights of the reconstruction network.

Fig. 3. Schematic diagram of our end-to-end optimization framework. In forward propagation, multi-channel wavelength-encoded images are generated and used to reconstruct HSIs. In the back propagation process, the optimization of the spectral reconstruction network weights and the automatic design of the filter transmittance function are achieved by gradient back propagation.

In forward propagation, the optimized parameters are first turned into the transmittance functions of the filters through a non-negative mapping function, and the filters are then selected and combined to form the combined filters. With the HSI and the transmittance functions of the combined filters as inputs, a degraded multi-channel image is generated by the image degradation pipeline. This degraded image is used as the input of the neural network, which predicts and reconstructs the HSI. Since the entire forward propagation process is differentiable, the weights of the neural network and the optimization parameters of the filters are updated simultaneously during gradient back propagation, as sketched below.
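This minimal, self-contained PyTorch sketch illustrates the flow only; the camera response, dummy data, stand-in network, and learning-rate grouping are assumptions, not the authors' implementation.

```python
import torch

o_bands, m_filters = 25, 1
s_rgb = torch.rand(3, o_bands)                               # assumed camera response S_c(lambda)
omega = torch.nn.Parameter(torch.zeros(m_filters, o_bands))  # trainable filter parameters
net = torch.nn.Sequential(                                   # stand-in for the SCA net (Sec. 3.2.2)
    torch.nn.Conv2d(3 * (m_filters + 1), 64, 1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, o_bands, 1))
opt = torch.optim.Adam([{'params': net.parameters(), 'lr': 4e-4},
                        {'params': [omega], 'lr': 6e-4}])

def encode(hsi, t):
    """Wavelength-encode the HSI once without a filter and once per filter."""
    shots = [torch.einsum('co,bohw->bchw', s_rgb, hsi)]
    shots += [torch.einsum('co,bohw->bchw', s_rgb * tq, hsi) for tq in t]
    return torch.cat(shots, dim=1)                           # (B, 3*(m+1), H, W)

for hsi in [torch.rand(4, o_bands, 64, 64) for _ in range(8)]:  # dummy HSI batches
    t = torch.sigmoid(omega)                                 # non-negative mapping, Eq. (6)
    pred = net(encode(hsi, t))
    loss = torch.mean(torch.abs(pred - hsi))                 # fidelity term only, for brevity
    opt.zero_grad(); loss.backward(); opt.step()             # updates both net and omega
```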

3.2.1 Deeply learned filter

The transmittance function of the designed broadband filter needs to satisfy certain physical restrictions to be meaningful; specifically, it should satisfy ${t_q (\lambda ) \in (0,1)}$. However, back propagation imposes no physical constraints, so the freely optimized parameters of the neural network do not automatically meet this requirement. We therefore use a non-negative mapping function to convert the network optimization parameters into a transmittance function with physical meaning. The specific transformation is as follows

$$t_q (\lambda) = \frac{1}{1+e^{-{\omega}_q (\lambda)}},$$
where ${\omega _q(\lambda ) \in (-\infty, +\infty )}$ are the optimization parameters shown in Fig. 3, obtained directly from the joint optimization network, and ${t_q (\lambda )}$ is the filter’s transmittance function. This formula guarantees that parameters ranging over the entire real line are mapped to ${(0,1)}$; the mapping relationship between the two variables is shown in Fig. 4. Furthermore, the transmittance curve of the filter is usually expected to be relatively smooth for design and manufacture, so a corresponding constraint is imposed on the transmittance function, as detailed in Sec. 3.2.3.

Fig. 4. The mapping relationship between the filter’s network optimization parameters ${\omega _q(\lambda )}$ and transmittance function ${t_q (\lambda )}$.

3.2.2 Network architecture

To reconstruct HSIs from degraded multi-channel images, we propose a spectral reconstruction algorithm, specifically a convolutional neural network based on a spatial-channel attention mechanism, which we call SCA net; the network structure is illustrated in Fig. 5(a). The input multi-channel image is split into two branches, each processed by a convolution layer with a ${1\times 1}$ kernel. This produces two features with 64 channels: one serves as a spectral upsampling feature, and the other passes through eight SCA blocks in turn. The output of the last SCA block is added to the spectral upsampling feature and then passed through a convolutional layer with a ${1\times 1}$ kernel to obtain a hyperspectral image with 25 channels.

Fig. 5. Our proposed network architecture. (a) is the main structure of the network. (b)-(d) illustrate the detailed components of the SCA Block, Spatial Attention Block and Channel Attention Block.

A well-accepted view is that the spectral reflectance of an object can be regarded as a combination of a small set of spectral basis functions [36]. Since the initially acquired RGB image is obtained by encoding in the wavelength dimension with little loss of spatial resolution, we argue that the reconstruction process should concentrate mainly on the channel dimension. A convolutional layer with a ${1\times 1}$ kernel can flexibly increase or reduce the dimension of the input feature: its output is a linear combination of the input features along the channel dimension, and it consumes few computational resources (see the minimal check below). Based on these considerations, we use a large number of ${1\times 1}$ convolutional layers in our spectral reconstruction network. Furthermore, we employ a channel attention mechanism in the SCA block to assign adaptive weights to different feature channels, increasing the sensitivity to specific feature layers, and we use a spatial attention mechanism and a convolutional layer with a ${3\times 3}$ kernel to enhance the network’s resistance to noise and other interference.
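The check below verifies that a ${1\times 1}$ convolution acts per pixel as a linear combination of the input channels (illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 6, 4, 4)                      # 6 encoded input channels
conv1x1 = nn.Conv2d(6, 64, kernel_size=1, bias=False)
y = conv1x1(x)                                   # (1, 64, 4, 4)

# Equivalent per-pixel matrix multiply with the 64x6 weight matrix:
w = conv1x1.weight.view(64, 6)
y_ref = torch.einsum('oc,bchw->bohw', w, x)
assert torch.allclose(y, y_ref, atol=1e-6)
```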

The schematic of our proposed SCA block is shown in Figs. 5(b)-(d). The SCA block mainly comprises a channel attention block, a spatial attention block, and two convolutional layers with ${3\times 3}$ kernels, where an activation layer follows each convolutional layer. These three branches are combined in parallel and summed with the input features to form a residual connection. A structural sketch of the full network follows.
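The PyTorch sketch below captures the structure described above and in Fig. 5. Layer widths and the internals of the attention branches are assumptions where the text does not specify them; it is a structural sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x)                     # adaptive per-channel weights

class SpatialAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, x):
        return x * self.conv(x)                   # adaptive per-pixel weights

class SCABlock(nn.Module):
    """Channel attention, spatial attention, and a 3x3 conv path in parallel,
    summed with the input to form a residual connection."""
    def __init__(self, ch=64):
        super().__init__()
        self.ca, self.sa = ChannelAttention(ch), SpatialAttention(ch)
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
    def forward(self, x):
        return x + self.ca(x) + self.sa(x) + self.conv(x)

class SCANet(nn.Module):
    def __init__(self, in_ch=6, out_ch=25, width=64, n_blocks=8):
        super().__init__()
        self.up = nn.Conv2d(in_ch, width, 1)      # spectral upsampling branch (1x1 conv)
        self.head = nn.Conv2d(in_ch, width, 1)
        self.body = nn.Sequential(*[SCABlock(width) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(width, out_ch, 1)
    def forward(self, x):
        return self.tail(self.body(self.head(x)) + self.up(x))
```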

3.2.3 Loss function

Optimization of the filter parameters and network weights is achieved by minimizing the loss function between the reconstruction result and the Ground Truth. In this work, we use a mixture of the mean absolute error (MAE), the structural similarity index (SSIM), and a gradient difference as part of the loss function; in addition, we impose a smoothness constraint on the filter as the remaining part of the loss function. Assume that the input multi-channel image is denoted as ${J_{in}^{k}}$ and the corresponding Ground Truth is denoted as ${I_{GT}^{k}}$.

We use MAE loss as the fidelity loss to guide the model to learn sharp image details and ensure the consistency between the reconstruction result and the Ground Truth. The calculation formula can be described as

$$\mathcal{L}_{MAE} = \frac{1}{K} \sum_{k=1}^{K} {\lvert \mathcal{N}(J_{in}^{k}) - I_{GT}^{k} \rvert} ,$$
where ${K}$ is the number of image patches, ${\mathcal {N}(\cdot )}$ denotes the mapping of the network, and ${{\vert \cdot \vert }}$ denotes the absolute value operation.

To ensure the consistency of visual perception and obtain better visual effects, we use SSIM as an item in the hybrid loss function, and the specific calculation method can be expressed as

$$\mathcal{L}_{SSIM}=\frac{1}{K} \sum_{k=1}^K\left\{1-\operatorname{SSIM}\left[\mathcal{N}\left(J_{i n}^k\right), I_{GT}^k\right]\right\},$$
where ${\operatorname {SSIM}[\cdot, \cdot ]}$ means to calculate the structural similarity index of two input images. Since the SSIM between the prediction result and the Ground Truth will eventually tend to 1, we use 1 minus SSIM as the final SSIM loss function.

To preserve the image’s edges and avoid unwanted pseudo-textures, we calculate the gradients of the predicted image and the Ground Truth independently and then take the difference of the gradient maps as the gradient loss function. The loss function and the gradient operator are defined as follows

$$\mathcal{L}_{Grad}=\frac{1}{K} \sum_{k=1}^K \lvert \operatorname{Grad} [\mathcal{N}\left(J_{i n}^k\right)]-\operatorname{Grad}\left(I_{GT}^k\right) \rvert ,$$
$$\begin{aligned} & \operatorname{Grad}(I)=\sqrt{I_{\mathbf{x}}(x, y)^2+I_{\mathbf{y}}(x, y)^2}, \\ & I_{\mathbf{x}}(x, y)=I(x+1, y)-I(x-1, y), \\ & I_{\mathbf{y}}(x, y)=I(x, y+1)-I(x, y-1). \end{aligned}$$

As for the smooth design of the filters’ transmittance function, we calculate the first-order forward difference, take the absolute value and accumulate it, and the detailed calculation method can be expressed as

$$\mathcal{L}_{filter} = \sum_{q=1}^{m}\sum_{p=2}^{o} \lvert t_q(p) - t_q(p-1) \rvert ,$$
where ${m}$ is the number of filters, ${o}$ is the number of spectral channels, and ${p}$ is the current spectral channel index. Finally, the hybrid loss function composed of the above four loss functions can be described as
$$\mathcal{L}_{total} = \alpha\mathcal{L}_{MAE} + \beta\mathcal{L}_{SSIM} + \gamma\mathcal{L}_{Grad} + \eta\mathcal{L}_{filter} ,$$
where ${\alpha, \beta, \gamma, \eta }$ are weight coefficients.
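A hedged PyTorch sketch of this hybrid loss is given below. The weight coefficients are placeholders, and `ssim` is assumed to be any differentiable SSIM implementation (for example, the `ssim` function of the pytorch_msssim package); the gradient operator follows Eq. (10).

```python
import torch
from pytorch_msssim import ssim   # assumed third-party differentiable SSIM

def grad_map(img):
    """Gradient magnitude of Eq. (10) using the same central differences."""
    ix = img[..., 2:, :] - img[..., :-2, :]        # I(x+1, y) - I(x-1, y)
    iy = img[..., :, 2:] - img[..., :, :-2]        # I(x, y+1) - I(x, y-1)
    return torch.sqrt(ix[..., :, 1:-1] ** 2 + iy[..., 1:-1, :] ** 2 + 1e-12)

def hybrid_loss(pred, gt, t, alpha=1.0, beta=0.1, gamma=0.1, eta=0.01):
    """Eqs. (7)-(12); pred/gt: (B, 25, H, W), t: (m, o) filter transmittances."""
    l_mae = torch.mean(torch.abs(pred - gt))                                # Eq. (7)
    l_ssim = 1.0 - ssim(pred, gt, data_range=1.0)                           # Eq. (8)
    l_grad = torch.mean(torch.abs(grad_map(pred) - grad_map(gt)))           # Eq. (9)
    l_filter = torch.sum(torch.abs(t[:, 1:] - t[:, :-1]))                   # Eq. (11)
    return alpha * l_mae + beta * l_ssim + gamma * l_grad + eta * l_filter  # Eq. (12)
```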

4. Experiments and results

In this section, we first introduce the training configuration of our end-to-end framework in Sec. 4.1. In Sec. 4.2, we discuss the performance of our spectral reconstruction network on the task of reconstructing HSIs from 3-channel RGB images and demonstrate its superiority over other commonly used spectral reconstruction networks. After finalizing the structure of the spectral reconstruction network, we incorporate the design of the filters into the joint optimization framework in Sec. 4.3 and investigate the improvement associated with the number of filters. Based on these findings, we set the number of filters to one and complete the design of the deeply learned filter. In Sec. 4.4, we conduct practical experiments and provide a detailed showcase of the spatial details and spectral accuracy of the reconstruction results.

4.1 Implementation details

In this subsection, we present how to obtain the spectral response function of the camera, as well as the scheme of dataset allocation, hyperparameters for network training, and metrics for quantitative evaluation.

In this work, we used a Sony A7M3 camera and a Tamron 28-75mm F/2.8 Di III VXD G2 lens to form an RGB imaging system. We used a double monochromator to calibrate the RGB imaging system’s spectral response function, and the detailed experimental procedures are described in the Section 2 of Supplement 1.

To drive the training of the end-to-end framework, we select two publicly available hyperspectral datasets, ICVL [37] and KAIST [38]. Both datasets cover the visible spectrum with a spectral interval of 10 nm. The hyperspectral image cubes in the ICVL dataset have a shape of ${\sim 1392\times 1300\times 31}$, where ${31}$ is the number of spectral channels covering 400 nm to 700 nm, and the scenes are mostly recorded outdoors. The hyperspectral image cubes in the KAIST dataset have a shape of ${2704\times 3376\times 31}$ over the same 400 nm to 700 nm range; all scenes are captured indoors with narrow-band filters and a grayscale camera. We select data slices from 420 nm to 660 nm (25 channels) from the two datasets as our hyperspectral images. The KAIST dataset contains 30 scenes and the ICVL dataset contains 201 scenes; 8 and 40 scenes, respectively, are randomly selected for testing, and the remaining scenes are used for training. We sample 9184 image patches with a shape of ${256\times 256\times 25}$ from the hyperspectral datasets for network training.

We use the PyTorch framework to implement our network design. The Adam optimizer is employed to optimize the parameters of the filters and the network weights. During training, we set the mini-batch size to 16; the initial learning rates of the filters and the network are ${6\times 10^{-4}}$ and ${4\times 10^{-4}}$, respectively, and both learning rates are halved every 15 epochs. Gaussian noise with a mean of 0 and a standard deviation of 0.008 is added to the input multi-channel images. Training for 80 epochs takes about 7.5 hours on a workstation equipped with an AMD EPYC 7543 CPU and an NVIDIA RTX A6000 GPU.

To quantitatively evaluate the quality of the reconstruction results, we select PSNR, SSIM, SAM, and RMSE, which are widely used evaluation indicators. We compute each metric in every spectral channel and then average across all channels to obtain the final evaluation metrics (an illustrative implementation is sketched below); the detailed calculation formulas are given in Section 3 of Supplement 1.
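For reference, a simple NumPy sketch of such channel-averaged metrics follows (illustrative; the exact formulas used in the paper are those in Supplement 1):

```python
import numpy as np

def psnr(pred, gt, data_range=1.0):
    """Per-channel PSNR averaged over spectral channels; inputs are (H, W, O)."""
    mse = np.mean((pred - gt) ** 2, axis=(0, 1))
    return np.mean(10 * np.log10(data_range ** 2 / np.maximum(mse, 1e-12)))

def rmse(pred, gt):
    return np.sqrt(np.mean((pred - gt) ** 2))

def sam(pred, gt):
    """Mean spectral angle (radians) between predicted and true per-pixel spectra."""
    num = np.sum(pred * gt, axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1)
    return np.mean(np.arccos(np.clip(num / np.maximum(den, 1e-12), -1.0, 1.0)))
```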

To first examine the performance of our network alone, we degrade the HSIs using the spectral response function of the RGB camera and feed the resulting RGB images into the network; the reconstruction results are shown in Fig. 6. The reconstructed HSIs not only retain rich spatial details but also exhibit spectral information that is consistent with the Ground Truth.

Fig. 6. The reconstruction results of our method. We demonstrate the spatial details and spectral curves of our reconstructed results compared to the Ground Truth. For better visual perception, we visualize the Ground Truth as an RGB image.

4.2 Validation on spectral reconstruction network

To validate the effectiveness of our improvements to the network structure, we conducted ablation experiments; for detailed results, please refer to Section 4 of Supplement 1. To validate the superiority of our proposed spectral reconstruction network, we compare it with other commonly used methods, including EDSR [39] for single-image super-resolution, Unet [40] and Hinet [41], which have been widely used for image restoration tasks, HSCNN+ [42] for RGB-to-spectra tasks, and the spatial channel attention method proposed by Liu et al. [43]. To balance computational resource consumption, we set the number of dense blocks to 24 in our implementation of HSCNN-D and the number of residual blocks (with 256 filters in each layer) to 8 in HSCNN-R; these two HSCNN+ variants are denoted as HSCNN-D24 and HSCNN-R8 in Table 1.


Table 1. Performance of different reconstruction methods on the testing set.

In this comparison, the number of filters is set to 0, and only the RGB camera is used to obtain a single RGB image for reconstruction. The reconstruction network outputs an HSI with 25 channels. The results are shown in Table 1.

As can be seen from Table 1, our method is optimal in almost all evaluation metrics. To demonstrate that our method is also computationally competitive, we quantitatively evaluate the complexity of each network using floating point operations (FLOPs) and the number of parameters; we feed an image block with a shape of ${256\times 256\times 3}$ into the networks, and the results are shown in Table 2. In addition, we use the various networks to reconstruct 100 RGB images with a spatial resolution of ${1024\times 1024}$, record the time taken to reconstruct each image, and report the average time in Table 2.
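Parameter counts of the compared PyTorch models can be obtained as sketched below (illustrative; FLOPs require a profiler such as thop or fvcore and are not reproduced here):

```python
def count_params(model):
    """Number of trainable parameters of a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example with the SCANet sketch from Sec. 3.2.2, fed with 3-channel RGB input:
print(f"{count_params(SCANet(in_ch=3)) / 1e6:.2f} M parameters")
```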


Table 2. Computing resource overhead for different networks.

Combining the data in Table 1 and Table 2, we can conclude that our method has the advantage of less computational resource consumption while performing well. In addition to the previous quantitative evaluation, we also visualize the spectral accuracy and spatial details of the reconstruction results of different methods. We pick two scenes and select four regions of interest from them. Then we calculate the average values of the regions of interest in each spectral channel and plot the spectral curves of Ground Truth and reconstruction results of different methods, respectively, and the results are shown in Fig. 7. Table 3 illustrates the root mean square errors between the various curves in Fig. 7 and Ground Truth.

Fig. 7. Spectral accuracy comparison of different reconstruction methods. We selected four regions of interest from the two scenes for analysis and selected five better-performing methods from Table 1 for comparison. The root mean square errors (RMSEs) between the spectral curves of different methods and the Ground Truth are shown in Table 3.


Table 3. The root mean square errors (RMSEs) between the spectral curves of different methods and Ground truth in Fig. 7. ($\times 10^{-3}$)

It can be seen from Fig. 7 that the spectral curves of our method are the closest to Ground Truth and have lower root mean square errors compared with other methods.

To compare the details of different methods’ reconstruction results more intuitively, we calculate the difference between the reconstruction results and Ground Truth at 500 nm and take the absolute value to obtain the errormaps shown in Fig. 8, where the mean absolute errors (MAE) of different methods are labeled.

Fig. 8. Errormaps of different reconstruction methods. At 500 nm, we calculate the absolute value of the difference between the reconstruction result and the Ground Truth, as well as the mean absolute error (MAE) of the whole errormap. The smaller the value in the errormap, the closer to the Ground Truth.

As shown in Fig. 8, our method’s errormap is closest to 0, indicating that the difference between our reconstruction result and the Ground Truth is the smallest. The mean absolute errors also demonstrate that our method performs best among these methods.

4.3 Inverse designed filters

In general, more filters for wavelength encoding means more input images for the reconstruction algorithm, which should make it easier to obtain high-quality hyperspectral images; a detailed analysis and discussion can be found in Section 5 of Supplement 1. However, as the number of filters increases, so do the cost and the number of shots. It is therefore important to quantify the improvement brought by each deeply learned filter and identify solutions applicable to practical scenes. We conducted simulation experiments based on the combination scheme in Eq. (2) and obtained the results shown in Table 4.


Table 4. The number of shots and the metrics of the reconstruction results for different numbers of deeply learned filters. ${m}$ indicates the number of deeply learned filters; ${N}$ indicates the number of shots.

As Table 4 shows, without an additional filter the reconstruction result of the single-shot scheme is inferior. As the number of deeply learned filters increases, the reconstruction metrics improve, but the number of shots also increases dramatically. To investigate the improvement from each additional shot more intuitively, we plot the metrics against the number of shots in Fig. 9.

Fig. 9. The accuracy of reconstruction results varies with the number of filters and the number of shots. ${m}$ represents the number of filters.

As can be observed in Fig. 9, the slope of the curve is largest when only one filter is used, indicating that the first additional shot is the most effective for improving the metrics. Although increasing the number of shots helps, the efficiency of the improvement decreases as the number of shots grows. Considering cost, shooting time, and reconstruction quality, we use a single deeply learned filter in the actual experiment. The designed filter’s transmittance function is shown as the Ideal curve in Fig. 10(a).

Fig. 10. (a) Ideal curve indicates the optimization result of the end-to-end framework. Designed curve indicates the transmittance curve calculated by the film design software (incident angle is 0). Manufactured result indicates the calibration value of the actual filter. (b) is the relative spectral response function of the RGB camera. (c) is the wavelength encoding scheme composed of the deeply learned filter and RGB imaging system.

4.4 Results on practical experiments

We set the number of filters to 1 and continue to use the combination of a Sony A7M3 camera and a Tamron 28-75mm F/2.8 Di III VXD G2 lens as an RGB imaging system and carry out practical experiments.

Firstly, we design a thin-film filter based on the transmittance function automatically optimized by the end-to-end framework; the thin-film filter consists of periodically stacked ${Ti_3O_5}$ and ${SiO_2}$ layers and is fabricated by electron beam evaporation coating. For the detailed design and processing parameters of the filter, please refer to Section 6 of Supplement 1. We calibrate the transmittance function of the manufactured filter; the deeply learned filter’s ideal value, design value, and actual manufactured result are shown in Fig. 10(a). The calibrated transmittance is substituted back into the spectral reconstruction network, and the negative effect of manufacturing error is mitigated by fine-tuning the network weights.

During the imaging process, two shots are required. The first shot makes use of only the RGB camera, with the system’s wavelength encoding scheme being the spectral response function of the camera itself, as seen in Fig. 10(b). For the second shot, the broadband filter needs to be placed in front of the RGB camera. Here, the wavelength encoding scheme of the system is jointly realized by the deeply learned encoding filter and the spectral response function of the camera. The corresponding wavelength encoding curve is shown in Fig. 10(c).

We construct the experimental system shown in Fig. 11. A standard white block is placed in the scene for converting the radiance information into reflectance information. We use the spectral information captured by a GaiaField-V10 HR spectral imager produced by Dualix as the Ground Truth.
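As a hedged illustration of the white-reference calibration (the authors' exact procedure is described in Section 1 of Supplement 1), relative reflectance can be obtained by dividing the reconstructed radiance by the spectrum of the standard white block:

```python
import numpy as np

def to_reflectance(scene_cube, white_cube):
    """scene_cube: (H, W, O) reconstructed radiance; white_cube: pixels of the
    white block, (Hw, Ww, O). Returns relative reflectance per pixel."""
    white_spectrum = white_cube.reshape(-1, white_cube.shape[-1]).mean(axis=0)
    return scene_cube / np.maximum(white_spectrum, 1e-8)
```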

Fig. 11. (a) Schematic of our experimental setup. (b) The deeply learned filter that has been manufactured. Its dimensions are ${72mm \times 72mm \times 1mm}$.

The demosaiced RAW images of the two shots and the reconstruction results are shown in Fig. 12. The first shot is captured by the RGB camera, while the second shot is derived from the combined encoding of the deeply learned filter and the RGB camera. A multi-channel image is obtained by splicing the two captured images along the channel dimension and is used as the input of the spectral reconstruction network. The reconstructed spectral data cube is shown in Fig. 12. It can be observed that the data cube contains rich spatial detail information with almost no loss of spatial resolution.

Fig. 12. (a) is a RAW image captured with an RGB camera alone. (b) is a RAW image captured using a deeply learned filter and an RGB camera. The right side shows the spatial details of the reconstructed HSI in specific spectral channels.

Similarly, we select regions of interest from the captured scene to evaluate the spectral accuracy. The spectral curves of these regions of interest are shown in Fig. 13, where the Ground Truth is derived from the hyperspectral imager in the same region. It can be seen that our reconstruction results not only have high spatial resolution, but also have spectral curves that are highly consistent with those of the hyperspectral imager.

Fig. 13. (a) is the visualization result of the reconstructed hyperspectral image. The spectral curves of the 4 ROIs are shown on the lower side of the image, the red curves are the predictions while the black curves are the results obtained by the commercial spectral imager. (b)-(h) are the reconstruction results in various spectral channels.

More generally, we perform spectral evaluations for the 24 painted squares of a Macbeth ColorChecker, and the corresponding results are shown in Fig. 14: the blue curves represent the reconstruction results using only the 3-channel RGB image (without a filter), while the red curves represent the reconstruction results when using the deeply learned filter. The mean absolute errors between the blue curves, the red curves, and the Ground Truth are listed in Table 5. As seen in Fig. 14 and Table 5, our reconstruction results exhibit high consistency with the Ground Truth, and the reconstruction with the deeply learned filter surpasses the case without the filter on the majority of blocks. However, due to calibration and processing errors, as well as factors such as the wavelength sampling interval, the spectral accuracy of the reconstruction results can still be improved. For more experimental results, please refer to Section 7 of Supplement 1.

Fig. 14. Spectral estimation curves for the ColorChecker. The position of each subplot corresponds to the painted block when the ColorChecker is placed upright. The red curves are the results with the customized filter, the blue curves are the results using only an RGB camera, and the black curves are the results measured by the commercial spectral imager.


Table 5. The mean absolute error (${\times 10^{-2}}$) corresponding to the spectral curves in Fig. 14. "Position" represents the location of the color block, where the first number indicates the row and the second number indicates the column. For example, "1,1" corresponds to the block in the first row and first column of the ColorChecker.

Besides, to demonstrate more intuitively the advantages of our method over hyperspectral imagers in terms of imaging speed and image resolution, we capture the ISO 12233 resolution test chart using both our system and a commercial hyperspectral imager. The hyperspectral imager takes about 80 seconds to obtain a spectral data cube with a spatial resolution of ${960\times 936}$. Our method takes only about 2 seconds to complete the filter switch and the two shots; after reconstruction, a 3D data cube with a spatial resolution of ${6000\times 4000}$ is obtained. The two imaging results at 500 nm and the corresponding MTF curves are shown in Fig. 15.

Fig. 15. Quantitative comparison of the spatial resolution of our method and the commercial hyperspectral imager. (a) Image of our reconstruction result at 500 nm. (b) Image from the commercial hyperspectral imager at 500 nm. (c) MTF curves corresponding to the two methods.

It can be seen from the above experiments that our method can obtain very attractive spatial detail information while acquiring spectra with high fidelity. Moreover, our imaging system also has the advantages of low cost, lightweight, fast focusing speed, and excellent portability.

5. Discussion and conclusion

The growing demand for hyperspectral imaging and the development of deep learning in recent years have given rise to computational spectral imaging techniques, which typically surpass traditional spectral imaging techniques in the temporal or spatial dimension. Although the spectral data obtained by computational spectral imaging systems cannot match the ultrafine spectral resolution and wide spectral ranges of traditional spectral imaging techniques, these systems still perform acceptably in application scenarios with less stringent spectral accuracy requirements, such as identifying characteristic spectral peaks. With this background, we choose the wavelength encoding system to conduct our research.

In this paper, we propose combining broadband filters with the RGB camera’s CFA to achieve wavelength encoding. By combining filters and taking multiple shots, we can improve the accuracy of the spectral reconstruction results. We further propose an end-to-end optimization framework to jointly optimize the filters’ transmittance functions and the spectral reconstruction network, thus realizing the optimal design of the wavelength encoding scheme. In our spectral reconstruction network, we propose a spatial-channel attention (SCA) mechanism, which effectively improves the network’s performance; the network is highly competitive, with low computational complexity and high accuracy. We set the number of filters to one, design and fabricate the filter, and carry out real-world experiments. The reconstruction results have high spatial resolution and attractive spectral accuracy. Compared with other similar methods, our system has the following advantages: (1) a competitive spectral reconstruction network; (2) new transmittance functions can be designed for the application scenario rather than combining existing filters; (3) flexible filter configuration options. We look forward to improving our system and applying it to consumer devices such as smartphones.

However, our system still has some defects that need to be improved. For example, the design and manufacture of the deeply learned filter are challenging, and it is difficult to ensure that the fabricated result is consistent with the output of the end-to-end framework. To address this issue, in subsequent research we can attempt to express the transformation from the structural parameters of a thin-film filter to its transmittance function in differentiable form. By integrating this differentiable chain into the end-to-end optimization framework, the structural parameters of the filter can be optimized directly, avoiding the errors introduced during the structural design of the thin-film filter.

Furthermore, our hardware system requires two shots during the imaging procedure, so it cannot provide true snapshot imaging even though the imaging time is short. In this regard, we can improve the hardware of the imaging system, for example by using a dual-camera system with a common optical axis or a filter wheel, to further reduce the imaging time. Alternatively, enhancing the spectral reconstruction algorithm can further extend the capabilities of the hardware system.

Funding

Ningbo Science and Technology Plan Project (No. 2023Z053); Civil Aerospace Pre-Research Project (No. D040104); National Natural Science Foundation of China (No. 62275229); Key Research Project of Zhejiang Lab (No. K2022MH5TW01); Major Scientific Project of Zhejiang Laboratory (K2022MH5TW01).

Acknowledgment

We thank Meijuan Bian from the facility platform of optical engineering of Zhejiang University for instrument support. We appreciate the efforts of Menghao Li and Jiajian He in the camera calibration process.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. F. Wang, Q. Yi, J. Hu, et al., “Combining spectral and textural information in uav hyperspectral images to estimate rice grain yield,” Int. J. Appl. Earth Obs. Geoinformation 102, 102397 (2021). [CrossRef]  

2. J. He and I. Barton, “Hyperspectral remote sensing for detecting geotechnical problems at ray mine,” Eng. Geol. 292, 106261 (2021). [CrossRef]  

3. P. Daukantas, “Hyperspectral imaging meets biomedicine,” Opt. Photonics News 31(4), 32–39 (2020). [CrossRef]  

4. N. M. Nasrabadi, “Hyperspectral target detection : An overview of current and future challenges,” IEEE Signal Process. Mag. 31(1), 34–44 (2014). [CrossRef]  

5. L. Huang, R. Luo, X. Liu, et al., “Spectral imaging with deep learning,” Light: Sci. Appl. 11(1), 61 (2022). [CrossRef]  

6. B. Arad, R. Timofte, O. Ben-Shahar, et al., “Ntire 2020 challenge on spectral reconstruction from an rgb image,”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, (2020), pp. 446–447.

7. B. Arad, R. Timofte, R. Yahel, et al., “Ntire 2022 spectral recovery challenge and data set,”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 863–881.

8. S. Chen, H. Feng, D. Pan, et al., “Optical aberrations correction in postprocessing using imaging simulation,” ACM Trans. Graph. 40(5), 1–15 (2021). [CrossRef]  

9. T. H. Kim, H. J. Kong, T. H. Kim, et al., “Design and fabrication of a 900–1700nm hyper-spectral imaging spectrometer,” Opt. Commun. 283(3), 355–361 (2010). [CrossRef]  

10. P. Mouroulis, R. O. Green, and T. G. Chrien, “Design of pushbroom imaging spectrometers for optimum recovery of spectroscopic and spatial information,” Appl. Opt. 39(13), 2210–2220 (2000). [CrossRef]  

11. J. Brauers, N. Schulte, and T. Aach, “Multispectral filter-wheel cameras: Geometric distortion model and compensation algorithms,” IEEE Trans. on Image Process. 17(12), 2368–2380 (2008). [CrossRef]  

12. Z. Xu, H. Zhao, G. Jia, et al., “Optical schemes of super-angular aotf-based imagers and system response analysis,” Opt. Commun. 498, 127204 (2021). [CrossRef]  

13. N. A. Hagen and M. W. Kudenov, “Review of snapshot spectral imaging technologies,” Opt. Eng. 52(9), 090901 (2013). [CrossRef]  

14. M. E. Gehm, R. John, D. J. Brady, et al., “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013–14027 (2007). [CrossRef]  

15. Y. Wu, I. O. Mirza, G. R. Arce, et al., “Development of a digital-micromirror-device-based multishot snapshot spectral imaging system,” Opt. Lett. 36(14), 2692–2694 (2011). [CrossRef]  

16. A. Parada-Mayorga and G. R. Arce, “Colored coded aperture design in compressive spectral imaging via minimum coherence,” IEEE Trans. Comput. Imaging 3(2), 202–216 (2017). [CrossRef]  

17. L. Wang, Z. Xiong, D. Gao, et al., “Dual-camera design for coded aperture snapshot spectral imaging,” Appl. Opt. 54(4), 848–858 (2015). [CrossRef]  

18. C. Tao, H. Zhu, P. Sun, et al., “Hyperspectral image recovery based on fusion of coded aperture snapshot spectral imaging and rgb images by guided filtering,” Opt. Commun. 458, 124804 (2020). [CrossRef]  

19. Z. Meng, J. Ma, and X. Yuan, “End-to-end low cost compressive spectral imaging with spatial-spectral self-attention,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, eds. (Springer International Publishing, Cham, 2020), pp. 187–204.

20. Y. Cai, J. Lin, X. Hu, et al., “Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction,”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), pp. 17502–17511.

21. L. Wang, Z. Wu, Y. Zhong, et al., “Snapshot spectral compressive imaging reconstruction using convolution and contextual transformer,” Photonics Res. 10(8), 1848–1858 (2022). [CrossRef]  

22. Z. Yang, T. Albrow-Owen, W. Cai, et al., “Miniaturization of optical spectrometers,” Science 371(6528), eabe0722 (2021). [CrossRef]  

23. D. S. Jeon, S.-H. Baek, S. Yi, et al., “Compact snapshot hyperspectral imaging with diffracted rotation,” ACM Trans. Graph. 38(4), 1–13 (2019). [CrossRef]  

24. H. Xu, H. Hu, S. Chen, et al., “Hyperspectral image reconstruction based on the fusion of diffracted rotation blurred and clear images,” Opt. Lasers in Eng. 160, 107274 (2023). [CrossRef]  

25. H. Hu, H. Zhou, Z. Xu, et al., “Practical snapshot hyperspectral imaging with doe,” Opt. Lasers in Eng. 156, 107098 (2022). [CrossRef]  

26. N. Xu, H. Xu, S. Chen, et al., “Snapshot hyperspectral imaging based on equalization designed doe,” Opt. Express 31(12), 20489–20504 (2023). [CrossRef]  

27. M. E. Toivonen, C. Rajani, and A. Klami, “Snapshot hyperspectral imaging using wide dilation networks,” Mach. Vis. Appl. 32(1), 9 (2021). [CrossRef]  

28. S.-H. Baek, H. Ikoma, D. S. Jeon, et al., “Single-shot hyperspectral-depth imaging with learned diffractive optics,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), (2021), pp. 2651–2660.

29. W. Zhang, H. Song, X. He, et al., “Deeply learned broadband encoding stochastic hyperspectral imaging,” Light: Sci. Appl. 10(1), 108 (2021). [CrossRef]  

30. Y. Zhu, X. Lei, K. X. Wang, et al., “Compact cmos spectral sensor for the visible spectrum,” Photonics Res. 7(9), 961–966 (2019). [CrossRef]  

31. J. Xiong, X. Cai, K. Cui, et al., “Dynamic brain spectrum acquired by a real-time ultraspectral imaging chip with reconfigurable metasurfaces,” Optica 9(5), 461–468 (2022). [CrossRef]  

32. S. Nie, L. Gu, Y. Zheng, et al., “Deeply learned filter response functions for hyperspectral reconstruction,”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

33. S. W. Oh, M. S. Brown, M. Pollefeys, et al., “Do it yourself hyperspectral imaging with everyday digital cameras,”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016).

34. H. Fu, L. Bian, X. Cao, et al., “Hyperspectral imaging from a raw mosaic image with end-to-end learning,” Opt. Express 28(1), 314–324 (2020). [CrossRef]  

35. Y. Fu, T. Zhang, Y. Zheng, et al., “Joint camera spectral response selection and hyperspectral image recovery,” IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 256–272 (2022). [CrossRef]  

36. W. He, N. Yokoya, and X. Yuan, “Fast hyperspectral image recovery of dual-camera compressive hyperspectral imaging via non-iterative subspace-based fusion,” IEEE Trans. on Image Process. 30, 7170–7183 (2021). [CrossRef]  

37. B. Arad and O. Ben-Shahar, “Sparse recovery of hyperspectral signal from natural rgb images,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, eds. (Springer International Publishing, Cham, 2016), pp. 19–34.

38. I. Choi, D. S. Jeon, G. Nam, et al., “High-quality hyperspectral reconstruction using a spectral prior,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]  

39. B. Lim, S. Son, H. Kim, et al., “Enhanced deep residual networks for single image super-resolution,”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2017).

40. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds. (Springer International Publishing, Cham, 2015), pp. 234–241.

41. L. Chen, X. Lu, J. Zhang, et al., “HINet: Half instance normalization network for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, (2021), pp. 182–192.

42. Z. Shi, C. Chen, Z. Xiong, et al., “HSCNN+: Advanced CNN-based hyperspectral recovery from RGB images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, (2018).

43. T. Liu, R. Luo, L. Xu, et al., “Spatial channel attention for deep convolutional neural networks,” Mathematics 10(10), 1750 (2022). [CrossRef]  

Supplementary Material (1)

Supplement 1: Revised supplemental document.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.



Figures (15)

Fig. 1. Schematic diagram of the imaging scene. The scene is illuminated by a light source, and the reflected light is then encoded in the wavelength dimension by a combined filter and an RGB camera.
Fig. 2. Schematic diagram of the wavelength-encoded image degradation pipeline. To closely resemble the actual image formation process, we first integrate the wavelength-encoded HSI along the R, G, and B channels, then degrade the three-channel image into a RAW image via three 0/1 masks and inject noise into the RAW image. Demosaicing the noisy RAW image yields the degraded RGB image. (A minimal code sketch of this pipeline is given after this figure list.)
Fig. 3. Schematic diagram of our end-to-end optimization framework. In forward propagation, multi-channel wavelength-encoded images are generated and used to reconstruct HSIs. In back propagation, the weights of the spectral reconstruction network are optimized and the filter transmittance functions are designed automatically via gradient back propagation.
Fig. 4. The mapping relationship between the filter’s network optimization parameters ${\omega_q(\lambda)}$ and transmittance function ${t_q(\lambda)}$.
Fig. 5. Our proposed network architecture. (a) is the main structure of the network. (b)–(d) illustrate the detailed components of the SCA Block, Spatial Attention Block, and Channel Attention Block.
Fig. 6. The reconstruction results of our method. We demonstrate the spatial details and spectral curves of our reconstructed results compared to the Ground Truth. For better visual perception, we visualize the Ground Truth as an RGB image.
Fig. 7. Spectral accuracy comparison of different reconstruction methods. We selected four regions of interest from the two scenes for analysis and selected five better-performing methods from Table 1 for comparison. The root mean square errors (RMSEs) between the spectral curves of the different methods and the Ground Truth are shown in Table 3.
Fig. 8. Error maps of different reconstruction methods. At 500 nm, we calculate the absolute value of the difference between the reconstruction result and the Ground Truth, as well as the mean absolute error (MAE) of the whole error map. The smaller the value in the error map, the closer to the Ground Truth.
Fig. 9. The accuracy of the reconstruction results varies with the number of filters and the number of shots. ${m}$ represents the number of filters.
Fig. 10. (a) The ideal curve indicates the optimization result of the end-to-end framework; the designed curve indicates the transmittance curve calculated by the film design software (incident angle is 0); the manufactured result indicates the calibration value of the actual filter. (b) The relative spectral response function of the RGB camera. (c) The wavelength encoding scheme composed of the deeply learned filter and the RGB imaging system.
Fig. 11. (a) Schematic of our experimental setup. (b) The manufactured deeply learned filter. Its dimensions are ${72\,\textrm{mm} \times 72\,\textrm{mm} \times 1\,\textrm{mm}}$.
Fig. 12. (a) A RAW image captured with an RGB camera alone. (b) A RAW image captured using a deeply learned filter and an RGB camera. The right side shows the spatial details of the reconstructed HSI in specific spectral channels.
Fig. 13. (a) The visualization result of the reconstructed hyperspectral image. The spectral curves of the 4 ROIs are shown on the lower side of the image; the red curves are the predictions, while the black curves are the results obtained by the commercial spectral imager. (b)–(h) The reconstruction results in various spectral channels.
Fig. 14. Spectral estimation curves for the ColorChecker. The position of each subplot corresponds to the painted blocks when the ColorChecker is placed upright. The red curves are the results with a customized filter, the blue curves are the results using only an RGB camera, and the black curves are the results measured by the commercial spectral imager.
Fig. 15. Quantitative comparison of the spatial resolution of our method and the commercial hyperspectral imager. (a) The image of our reconstruction result at 500 nm. (b) The image of the commercial hyperspectral imager at 500 nm. (c) The MTF curves corresponding to the two methods.
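
For readers who want to prototype the degradation pipeline summarized in the Fig. 2 caption, below is a minimal NumPy sketch. The RGGB Bayer layout, the Gaussian noise model, and the box-filter "demosaicing" are our own illustrative assumptions rather than the exact implementation used in the paper.

```python
import numpy as np

def box_filter(img, r=1):
    """(2r+1)x(2r+1) box filter via edge padding and shifted sums."""
    padded = np.pad(img, r, mode="edge")
    out = np.zeros_like(img)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += padded[r + dy : r + dy + img.shape[0],
                          r + dx : r + dx + img.shape[1]]
    return out

def encode_and_degrade(hsi, sensitivity, noise_std=0.01, seed=0):
    """Turn a wavelength-encoded HSI into a noisy RAW image and a degraded RGB image.

    hsi:         (H, W, L) cube, already multiplied by the filter transmittance T_i(lambda).
    sensitivity: (L, 3) relative spectral responses S_c(lambda) of the R, G, B channels.
    """
    H, W, L = hsi.shape
    # 1) Integrate along wavelength to obtain the three-channel image.
    rgb = (hsi.reshape(-1, L) @ sensitivity).reshape(H, W, 3)

    # 2) Degrade to a RAW mosaic with three 0/1 masks (assumed RGGB layout).
    masks = np.zeros((H, W, 3))
    masks[0::2, 0::2, 0] = 1          # R
    masks[0::2, 1::2, 1] = 1          # G
    masks[1::2, 0::2, 1] = 1          # G
    masks[1::2, 1::2, 2] = 1          # B
    raw = (rgb * masks).sum(axis=-1)

    # 3) Inject noise into the RAW image (Gaussian, purely illustrative).
    rng = np.random.default_rng(seed)
    raw_noisy = raw + rng.normal(0.0, noise_std, raw.shape)

    # 4) Demosaic by normalized local averaging of each masked channel
    #    (a crude stand-in for a real demosaicing algorithm).
    degraded = np.empty_like(rgb)
    for c in range(3):
        degraded[..., c] = box_filter(raw_noisy * masks[..., c]) / \
                           np.maximum(box_filter(masks[..., c]), 1e-8)
    return raw_noisy, degraded
```
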

Tables (5)

Table 1. Performance of different reconstruction methods on the testing set.

Table 2. Computing resource overhead of different networks.

Table 3. The root mean square errors (RMSEs, ${\times 10^{3}}$) between the spectral curves of different methods and the Ground Truth in Fig. 7.

Table 4. The number of shots and the metrics of the reconstruction results when using different numbers of deeply learned filters. m indicates the number of deeply learned filters; N indicates the number of shots.

Table 5. The mean absolute error (${\times 10^{2}}$) corresponding to the spectral curves in Fig. 14. "Position" represents the location of the color block, where the first number indicates the row and the second number indicates the column. For example, "1,1" corresponds to the block in the first row and first column of the ColorChecker.

Equations (12)


$$T_i(\lambda) = \prod_{q=0}^{n} t_q(\lambda),$$
$$N = \sum_{n=0}^{m} C_m^n = 2^m,$$
$$J_c(x,y) = \int_{\lambda_1}^{\lambda_2} E(\lambda)\, R(x,y,\lambda)\, T_i(\lambda)\, S_c(\lambda)\, \mathrm{d}\lambda,$$
$$\begin{bmatrix} J_1^c \\ J_2^c \\ \vdots \\ J_N^c \end{bmatrix} = \begin{bmatrix} T_{11}S_{c1} & T_{12}S_{c2} & \cdots & T_{1o}S_{co} \\ T_{21}S_{c1} & T_{22}S_{c2} & \cdots & T_{2o}S_{co} \\ \vdots & \vdots & \ddots & \vdots \\ T_{N1}S_{c1} & T_{N2}S_{c2} & \cdots & T_{No}S_{co} \end{bmatrix} \begin{bmatrix} I_1 \\ I_2 \\ \vdots \\ I_o \end{bmatrix}.$$
$$J = HI,$$
$$t_q(\lambda) = \frac{1}{1 + e^{-\omega_q(\lambda)}},$$
$$L_{MAE} = \frac{1}{K} \sum_{k=1}^{K} \left| \mathcal{N}(J_{in}^{k}) - I_{GT}^{k} \right|,$$
$$L_{SSIM} = \frac{1}{K} \sum_{k=1}^{K} \left\{ 1 - \mathrm{SSIM}\!\left[ \mathcal{N}(J_{in}^{k}), I_{GT}^{k} \right] \right\},$$
$$L_{Grad} = \frac{1}{K} \sum_{k=1}^{K} \left| \mathrm{Grad}\!\left[ \mathcal{N}(J_{in}^{k}) \right] - \mathrm{Grad}\!\left( I_{GT}^{k} \right) \right|,$$
$$\mathrm{Grad}(I) = \sqrt{I_x(x,y)^2 + I_y(x,y)^2}, \quad I_x(x,y) = I(x+1,y) - I(x-1,y), \quad I_y(x,y) = I(x,y+1) - I(x,y-1).$$
$$L_{filter} = \sum_{q=1}^{m} \sum_{p=2}^{o} \left| t_q(p) - t_q(p-1) \right|,$$
$$L_{total} = \alpha L_{MAE} + \beta L_{SSIM} + \gamma L_{Grad} + \eta L_{filter}.$$
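
To make the encoding model above concrete, the following sketch evaluates the sigmoid mapping from the unconstrained parameters ${\omega_q(\lambda)}$ to the transmittances ${t_q(\lambda)}$ and assembles the encoding matrix H for one color channel, so that the measurements follow J = HI. The 31-band sampling, the random placeholder responses, and the assumption that stacked filters combine multiplicatively are ours, not values taken from the paper.

```python
import numpy as np

def transmittance(omega):
    """Sigmoid mapping t_q(lambda) = 1 / (1 + exp(-omega_q(lambda))),
    which keeps every learned transmittance value inside (0, 1)."""
    return 1.0 / (1.0 + np.exp(-omega))

# Illustrative sampling: o = 31 spectral bands (e.g. 400-700 nm in 10 nm steps).
o, m = 31, 2                                   # bands, number of designed filters
rng = np.random.default_rng(0)
omega = rng.normal(size=(m, o))                # unconstrained parameters omega_q(lambda)
t = transmittance(omega)                       # (m, o) filter transmittances t_q(lambda)

# Camera responses S_c(lambda) for c in {R, G, B}; random placeholders here.
S = rng.uniform(0.0, 1.0, size=(3, o))

# Build the encoding matrix H for one colour channel c and all N = 2^m filter
# combinations; row i uses the combined transmittance of filter subset i
# (we assume stacked filters multiply, and the empty subset is the bare camera).
c = 1                                          # green channel, for example
rows = []
for subset in range(2 ** m):
    T_i = np.ones(o)
    for q in range(m):
        if subset & (1 << q):
            T_i = T_i * t[q]
    rows.append(T_i * S[c])                    # T_i(lambda) * S_c(lambda)
H = np.stack(rows)                             # (N, o) encoding matrix

I = rng.uniform(0.0, 1.0, size=o)              # spectrum of one pixel
J = H @ I                                      # encoded measurements, J = HI
```
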
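The loss terms can likewise be written out directly. The sketch below evaluates the MAE, gradient, and filter-smoothness terms and combines them with placeholder weights; the SSIM term is omitted for brevity, and the batch layout is our own assumption.

```python
import numpy as np

def grad_magnitude(img):
    """Gradient magnitude via central differences along both image axes."""
    ix = np.zeros_like(img)
    iy = np.zeros_like(img)
    ix[1:-1, :] = img[2:, :] - img[:-2, :]
    iy[:, 1:-1] = img[:, 2:] - img[:, :-2]
    return np.sqrt(ix ** 2 + iy ** 2)

def total_loss(pred, gt, t, alpha=1.0, gamma=0.1, eta=0.01):
    """Composite loss with placeholder weights; the SSIM term is left out.

    pred, gt: (K, H, W) reconstructed and ground-truth band images.
    t:        (m, o) filter transmittance curves sampled at o wavelengths.
    """
    l_mae = np.mean(np.abs(pred - gt))                               # MAE term
    l_grad = np.mean([np.abs(grad_magnitude(p) - grad_magnitude(g)).mean()
                      for p, g in zip(pred, gt)])                    # gradient term
    l_filter = np.sum(np.abs(np.diff(t, axis=1)))                    # smoothness of t_q
    return alpha * l_mae + gamma * l_grad + eta * l_filter
```
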