Single image relighting based on illumination field reconstruction

Jingyuan Zhang; Xiaoyu Chen; Weining Tang; Haotian Yu; Lianfa Bai; Jing Han

doi:10.1364/OE.495858

1. Introduction

Reconstructing images under new lighting conditions from a single low-light image can improve the visual quality of low-light images without any re-shooting. The primary goal of this method is to reconstruct photorealistic images with good brightness, which has various potential applications in fields such as AR/VR and computational imaging. For example, it can reconstruct new lighting scenes in AR through fast generation without the need for slowly re-rendering, and solve the issue of insufficient lighting during camera capture in computational photography.

Currently, the commonly used method for improving the quality of low-light images is Low-light Image Enhancement (LLIE), which can be categorized into two main categories: data-driven [1–9] and physics-informed methods [10–17]. Data-driven methods primarily rely on the dataset used for supervised training and elaborate-designed neural networks to learn the mapping between low-light and normal-light images. LLNet employs a variant of stacked-sparse denoising autoencoder to brighten low-light images [1]; Chen et al. collected the SID dataset and used a U-net to train a mapping from the low-light images to long-exposure images [3]; Xu et al. decomposed low-light images into low-frequency and high-frequency layers, recovering the image using the low-frequency layer and inferring the high-frequency details to enhance the image [8]. On the other hand, physics-informed methods use physical models as prior knowledge to guide the design of neural networks, which can accelerate the convergence of networks and make the results more robust. Retinex theory [18,19] is a common approach that decomposes an image into light-invariant reflectance and light-dependent illumination components. Wei et al. proposed RetinexNet which firstly combines Retinex decomposition with deep learning for LLIE [10]; Zhang et al. proposed KinD [12] and KinD++ [13] for LLIE through layer decomposition, reflectance recovery, and illumination adjustment; Wang et al. developed the DeepUPE network to extract global and local image features for learning the image-to-illumination mapping relationship [16].

Although the above enhancement methods can improve image brightness and enhance image detail from darkness, they still have several limitations. Firstly, the problem at hand is ill-posed as a single low-light image may correspond to multiple normal-light images. Current methods mainly use paired images with high and low exposure for training, and the exposure in different scenes is not constant. This makes the enhancement target ambiguous and poses significant challenges to the robustness of enhancement algorithms. Next, using high-exposure images as the ground truth (GT) in training destroys the scene’s lighting and shadow variation characteristics, resulting in artifacts and perceptual distortions. Moreover, current methods neglect information such as the distribution of light sources, the spatial distribution and the optical properties of targets. As a result, they cannot perceive the distribution pattern of the illumination, potentially leading to unsatisfied enhancement effects.

Next, we introduce the concept of illumination field and revisit the issue of relighting from the perspective of the forward imaging model of the camera. The illumination field we defined describes the outgoing radiance at each point in space, which is affected by factors such as the properties of light sources, the spatial location and the optical properties of objects. Light emitted from a light source undergoes multiple reflections, forming a complex distribution of the illumination field that can be modeled by the rendering equation. The solutions of the rendering equation can be used to obtain the irradiance reaching the camera plane, which can be associated with image pixels through the camera response function (CRF) [20]. Therefore, given a 3D scene and the illumination gain, the new illumination field can be reconstructed through iterative solving based on the rendering equation. Then the illumination field generates images under the new lighting conditions with the help of the CRF. We refer to this process as the illumination field reconstruction (IFR) algorithm. However, since the parameters of the scene and CRF are often difficult to obtain in the real world, estimation methods [21,22] are required to approximate these parameters in order to implement the IFR algorithm.

Based on the above analysis, we further investigated the physical relationship between the illumination gain and the pixel value variation due to the gain, then derived the illumination field modulation equation based on the rendering equation. We utilize the physically based rendering (PBR) approach to create models of photorealistic 3D scenes and build a dataset consisting of image sequences with different illumination levels, each image has a corresponding illumination gain label. During training, image sequences instead of single images were used for supervised learning, which enabled the network to better capture the lighting and shadow variation of the illumination field. More importantly, the labels provided a clear target on the illumination level of the reconstruction, thus avoiding the presence of unclear enhancement targets as in LLIE. Under the guidance of the physical prior and structured datasets, we proposed a convolutional neural network based on a dual-branch fusion architecture. The proposed illumination field reconstruction network (IFRNet) takes a single low-illumination image and illumination gain labels as input; one branch is used to mine the deep features of the scene, and the other branch is used to estimate the gain features. Finally, the two branches are fused to generate the reconstructed images. Extensive experiments demonstrate that our method achieves high-quality illumination field reconstruction on both simulated and real-world datasets. Notably, IFRNet is trained solely on the simulated datasets but has good performance of generalization in real-world scenarios.

2. Method

2.1 Solutions design

We propose a solution for low-light image relighting based on the IFR algorithm, as illustrated in Fig. 1. Firstly, let’s think about this question of relighting from the perspective of forward imaging of the camera. The rendering equation describes the distribution and transmission of light through a scene, providing a theoretical foundation for surface shading and illumination calculations in computer graphics. Given a 3D scene with known parameters, the rendering equation can be used to calculate the outgoing radiance at any position. Subsequently, we can compute the irradiance that reaches the camera imaging plane, which can be linked to the image’s pixel value through the CRF. In Section 2.2, we presented the mathematical formulation of the rendering equation. We analyzed the relighting task by utilizing the rendering equation and derived the illumination field modulation equation. This equation describes the relationship between the illumination gain and the alteration in pixel values of the image caused by the gain.

Fig. 1. Schematic diagram of the single image relighting based on IFR algorithm. (a) The diagram of the rendering equation. (b) Dataset construction by sampling from the 3D model. (c) The proposed IFRNet takes a single low-illumination image as input and generates the corresponding image after relighting according to the desired illumination gain labels.

Download Full Size | PDF

To better improve the accuracy of the reconstruction, it is necessary to explore the variation pattern of the illumination field. Our goal is to create multiple sets of structured image sequences with varying illumination levels and to summarize the illumination field’s variation patterns through these discrete sequences. Existing datasets [23–26] used for low-light image enhancement and high-dynamic range imaging tasks cannot meet our requirements, and adjusting the illumination in real-world scenes to capture the desired dataset can be extremely really difficult, especially for outdoor scenes. With the rapid advancements in rendering theory and technology, current rendering techniques can reconstruct photorealistic scenes. Therefore, we built 3D models using the PBR method and collected image sequences containing different illumination levels. Each set of image sequences was captured in a fixed scene by adjusting the illumination. To better learn the variation pattern of the illumination field, we record the differences in illumination between each image and its corresponding low-light image as a label, which we refer to as illumination gain labels. This structured dataset helps to learn the illumination field variation process and extract the illumination field variation pattern. See section 3.1 for more details on the dataset.

To fully exploit the advantages of structured datasets, the IFRNet was proposed for low-light image relighting. We utilize the illumination field modulation equation as a physical prior to guide the network design and take the original low-light image and the illumination gain labels as inputs for image reconstruction. Our objective is to find a mapping function F that takes a low-light image $I_{L}$ with an illumination gain label $\Delta L$ as inputs and accurately estimates the new image $I_{L+\Delta L}$ captured after boosting the intensity of the light sources by $\Delta L$. This process can be formulated as follows:

(1)$$I _ { L + \Delta L } = F _ { \theta } ( I _ { L } , \Delta L ),$$

where $\Delta L$ represents the illumination gain label based on the current illumination level $L$, so that a continuous, arbitrary enhancement of the current illumination level can be achieved by adjusting $\Delta L$; F represents the network with trainable parameters $\theta$, whose purpose is to find optimal network parameters $\hat {\theta }$ that minimizes the error:

(2)$$\hat{\theta}=\operatorname{argmin} Loss\left(I_{L+\Delta L}, \hat{I}_{L+\Delta L}\right),$$

where $Loss$ is the loss function that drives the optimization of the network, which is described in detail in Section 2.4, and $\hat {I}_{L+\Delta L}$ is the GT of the predicted reconstructed image. By incorporating physical priors and leveraging a vast amount of structured data, the network is capable of fully mining the crucial features of the images, effectively perceiving the variation pattern of the illumination field, and more accurately reconstructing images under new lighting conditions.

2.2 Physical mechanism

In computer graphics, the rendering equation globally describes the light transport in a scene based on the physical law of energy conservation [27]. It effectively captures the process of light propagation in the scene and can be expressed mathematically as:

(3)$$L_{o}\left ( x,\omega _{o}\right ) =L_{e}\left ( x,\omega _{o}\right )+\int\limits_{\Omega}f_{r}\left ( x,\omega _{i},\omega _{o} \right )L_{i}\left ( x,\omega _{i} \right ) \left ( n\cdot\omega _{i} \right ) d\omega _{i},$$

where $L_{o}\left ( x,\omega _{o}\right )$ is the outgoing radiance leaving the point $x$ along direction $\omega _{o}$, $L_{e}\left ( x,\omega _{o}\right )$ is the self-emitted radiance at point $x$ along direction $\omega _{o}$, $L_{i}\left ( x,\omega _{i}\right )$ describes the amount of light reaching point $x$ from direction $\omega _{i}$, the bidirectional reflectance distribution function (BRDF) $f_{r}$ is the reflectance coefficient of the material at point $x$ for incident light direction $\omega _{i}$ and viewing direction $\omega _{o}$, the integral item is the surface integral over all the directions $\omega _{i}$ of the unit sphere $\Omega$ with center in $x$ and $n$ corresponds to the direction of the surface normal at point $x$.

Assuming that the radiance of all self-emitting light sources in the current scene is increased by $\Delta L$, the rewritten equation is:

(4)$$L_{o}^{\prime} \left ( x,\omega _{o}\right ) =L_{e}^{\prime}\left ( x,\omega _{o}\right )+\int\limits_{\Omega}f_{r}\left ( x,\omega _{i},\omega _{o} \right )L_{i}^{\prime}\left ( x,\omega _{i} \right ) \left ( n\cdot\omega _{i} \right ) d\omega _{i}.$$

For convenience, we will omit the variables inside the parentheses in the subsequent formulas, then subtract Eq. (4) from Eq. (3) to obtain:

(5)$$L^{\prime}_{o}-L_{o}=L_{e}^{\prime}-L_{e}+\int\limits_{\Omega}f_{r}( L_{i}^{\prime} -L_{i}) \left ( n\cdot\omega _{i} \right ) d\omega _{i},$$

where $L_{e}^{\prime }-L_{e}$ reflects the difference before and after the change in self-emitting radiance, i.e. $\Delta L$ as mentioned above. The variable inside the integral term represents the difference in the light intensity of the incident light. Simplifying the above equation, we obtain:

(6)$$L^{\prime}_{o}-L_{o}=\Delta L +\int\limits_{\Omega}f_{r}( L_{i}^{\prime} -L_{i}) \left ( n\cdot\omega _{i} \right ) d\omega _{i}.$$

Next, calculating the difference in irradiance reaching the camera plane $E^{\prime }_{o}-E_{o}$ by the integral relationship $F_{E}$ between irradiance and radiance:

(7)$$E^{\prime}_{o}-E_{o}=F_{E}(\Delta L+\int\limits_{\Omega}f_{r}( L_{i}^{\prime} -L_{i}) \left ( n\cdot\omega _{i} \right ) d\omega _{i}).$$

The irradiance received by the camera plane can be transformed into raw pixel values using the camera response function $F_{CRF}$, so we apply the $F_{CRF}$ function to both sides of the equation above then obtain:

(8)$$I^{\prime}_{o}-I_{o}=F_{CRF}(F_{E}(\Delta L+\int\limits_{\Omega}f_{r}( L_{i}^{\prime} -L_{i}) \left ( n\cdot\omega _{i} \right ) d\omega _{i})),$$

where the $F_{CRF}$ is linear for raw pixel values [28,29], the terms on the left side represent the difference in image pixels before and after the illumination gain. The variable on the right side is a highly complex integral with infinite integrals over infinite integration domains, which is related to the illumination gain $\Delta L$ and scene parameters $\theta$, and can only be approximately estimated through infinite recursions, such as Monte Carlo ray tracing [30]. We use $F_{CRF}^{\prime }$ to represent the complex integration terms:

(9)$$I^{\prime}-I=F_{crf}^{\prime}(\Delta L,\theta).$$

The above equation establishes a relationship between the variation of image pixel values and the illumination gain, which we refer to as the illumination field modulation equation. Considering that the complex scene parameters and CRF may be unavailable and the image pixels $I$ are obtained through forward calculation from the complex scene parameters, a more complex function $R$ can be used to achieve inverse estimating from the image $I$ to the scene parameters and complete the calculation of radiance to pixel values. Therefore, the illumination field modulation equation can be rewritten as:

(10)$$I^{\prime}=I+R(\Delta L,I).$$

In the aforementioned imaging process, relighting is achieved by altering the intensity of the self-emitting light sources in the scene. This approach is fundamentally different from enhancing the camera’s response to irradiance by adjusting exposure. By varying the $\Delta L$, the illumination field modulation equation can realize relighting at multiple levels.

2.3 Proposed network architecture

By utilizing the Illumination field modulation equation as the physics prior, the potential of the neural network can be fully exploited for low-light image relighting. We proposed the IFRNet based on the above-mentioned physical prior and utilizes the powerful fitting ability of neural networks to model the relighting process with the help of PBR datasets. Figure 2 illustrates the general framework of the IFRNet, which is an end-to-end network that incorporates a two-branch fusion design architecture.

Fig. 2. The network structure of the proposed IFRNet based on a two-branch fusion mechanism.

Download Full Size | PDF

Fig. 3. The structure and key modules of the illumination gain block. (a) Dilated Layers. (b) The structure of the illumination gain block. (c) Attention Layers.

Download Full Size | PDF

The illumination field modulation equation provides a way for computing the pixel values of an image in a new illumination field by summing the low-light image $I$ and the variation of pixel values $R(\Delta L,I)$ influenced by illumination gain. Similar to the formulation of the illumination field modulation equation, the input of IFRNet is divided into two branches for separate feature calculation. The first branch takes a low-light image as input and implements image-to-feature mapping by stacking a feature mapping network module with multiple 3$\times$3 convolutional layers. The other branch encodes the low-light image using the illumination gain label $\Delta L$ via point multiplication and feeds it into the illumination gain module to extract the illumination gain feature. The two feature maps are then fused using pixel summation and mapped back to the image using two convolutional layers, resulting in the reconstructed image.

The illumination gain block illustrated in Fig. 3 is the most crucial part of the IFRNet network, and it adopts the classic encoder-decoder architecture [31,32]. Leveraging the symmetric property of the encoder-decoder architecture, the high-dimensional latent features extracted by the encoder are transformed back to the shape of the original input through the decoder, generating the desired illumination gain features. To overcome the gradient-vanishing issue, skip connections [33] are introduced from the encoder to its corresponding mirrored decoder, utilizing abundant connections of higher-order residuals in a multi-scale manner. Simply stacking convolutional layers in the Encoder-Decoder architecture cannot achieve high-fidelity reconstruction, as it may lead to uneven brightness distribution and distorted images. To address this issue, we employ an elaborate design for the encoder and the decoder.

The encoder should extract features of various scales and dimensions, with a focus on capturing texture and local detail information. Therefore, we designed an encoder consisting of four stacked dilated layers. The input to the dilated layer is first downsampled by a 3$\times$3 convolutional layer with a stride of 2, and then processed by three dilated convolutional layers [34] with different dilation rates (1, 2, and 4) to generate features with different receptive fields. The output of the dilated convolutional layer with a dilation rate of 1 is passed to the next layer, while the results from all three layers are concatenated and sent to the decoder via the skip connection. By extracting features from dilated convolutional layers with different dilation rates, we can capture more diverse features and local detail information, which improves the network’s convergence.

The decoder, like the encoder, consists of four attention layers. Each layer upsamples the output features from the previous layer and concatenates them with the features obtained from the skip connection. To enhance feature adaptability and effectively merge features from different levels, allowing the network to focus on critical features, we utilize channel-wise and spatial-wise attention guidance strategies on the concatenated features. This attention-guided approach can help addresses artifacts caused by uneven spatial brightness distribution, improves the visual smoothness of the reconstructed image, and facilitates the generation of high-quality images.

2.4 Loss function

Taking advantage of paired data, we utilize a supervised learning approach to train the network. Given an original low-light image, we use IFRNet to generate corresponding images $I_{\text {out }}^{i}$ with different illumination gain labels $\Delta L$, where $i$ represents the serial number of the modulation level. The commonly used $L_1$ loss and SSIM loss are employed as the reconstruction loss $L_{con}$ to constrain the network output $I_{out}^{i}$ and the corresponding true reconstruction image $I_{GT}^{i}$.

(11)$$L_{con}=L_{1}(I_ {out }^{i}, I_{G T}^{i})+L_{ssim}(I_ {out }^{i}, I_{G T}^{i}),$$

where $L_1$ loss is used to guarantee the image’s color and brightness fidelity and SSIM loss enables the generated image to retain structural information without losing details.

IFRNet is primarily designed to predict the image gain of the original image under the modulation of $\Delta L$. To facilitate efficient learning of this image gain, we propose a differential reconstruction loss that constrains the image pixel difference $\Delta I_{out}^{i}$ between reconstructed images, where $\Delta I_{out}^{i}$ is calculated as $I_{out}^{i+1}-I_{out}^{i}$. Similarly, we obtain the image pixel difference $\Delta I_{GT}^{i}$ for the real reconstructed images. Since pixel difference reflects factors such as luminance and color, we use the $L_1$ loss as the differential reconstruction loss, which is denoted as $L_{diff}$:

(12)$$L_ {diff }=L_{1}\left(\Delta I_ {out }^{i}, \Delta I_{G T}^{i}\right).$$

The differential reconstruction loss helps IFRNet to establish a more accurate relationship between the illumination gain labels and the reconstructed image, better understand the pattern of illumination field changes, and reconstruct the new image more precisely. The total loss function, denoted as $L_{total}$, is the sum of the reconstruction loss and the differential reconstruction loss:

(13)$$L_ {total }=L_ {con }+L_ {diff }.$$

3. Experiments and results

3.1 Data acquisition and experiments setting

To ensure effective training and evaluation of our network, we require a substantial dataset that encompasses a wide range of scenes, weather conditions, and light source conditions. Furthermore, each image set in the dataset should consist of a sequence of images sampled from the same scene, but with varying illumination levels, as per our task requirements. Acquiring image sequences with consistent lighting conditions often requires careful control of the positioning and intensity of the light sources in a scene, which can be challenging to achieve in the real-world, especially in outdoor environments. However, leveraging advanced rendering technology enables us to model the real world through 3D modeling and adjust the lighting power of light sources to achieve various lighting conditions.

We have created a simulation dataset of 200 sets of image sequences using Blender software, half of which are collected in indoor scenes while the others are collected outdoors. The indoor scenes consist of various apartment settings, utilizing "point light" and "area light" as the light sources, while the outdoor scenes comprise gardens, snow, streets, countryside, and other environments, predominantly illuminated by "sunlight". Each image sequence in the dataset includes one original low-light image and six new images after relighting, all with a spatial resolution of 1536$\times$1536. To generate the new images, we incrementally increased the lighting intensity or power of the light sources based on the original low-light environment. It is worth noting that the lighting intensity units employed to represent "point light", "surface light" and "daylight" differ in the 3D scene model. To standardize $\Delta L$, each unit of $\Delta L$ represents 10W of "energy" for "point light" and "area light", and also represents 0.1 unit of "strength" for "daylight" lighting. The $\Delta L$ range in the dataset is a floating-point number ranging from 0 to 120, representing the scope of illumination gain labels from low-light images to normal-light images.

To train IFRNet, we randomly selected 180 image sets from the dataset for training and reserved 20 sets for testing. The network was implemented on Nvidia GeForce RTX 3090 GPU using the Pytorch framework and trained for 200 epochs. We utilized the Adam optimizer to optimize our model, setting the initial learning rate to $10^{-4}$, which was halved after 100 epochs. To augment the data, we randomly cropped 512$\times$512 patches, followed by random mirror, resize, and rotation for all patches. The pixel spatial resolution of the reconstructed image is consistent with the input low-light image.

Objective evaluation metrics are crucial for effectively evaluating reconstruction results. To evaluate the learning effectiveness and generalization capability of our network, we quantitatively compared its performance with other methods using the PSNR and SSIM [35] metrics. Higher values of PSNR and SSIM indicate better reconstruction performance and are closer to the true comparison results.

To evaluate the color accuracy of our results, the CIEDE2000 formula is introduced to calculate the color difference between the reconstructed images and the GT [36]. The formula is as follows:

(14)$$\Delta E_{00}=\sqrt{\left(\frac{\Delta L^{\prime}}{k_LS_L}\right)^2+\left(\frac{\Delta C^{\prime}}{k_CS_C}\right)^2+\left(\frac{\Delta H^{\prime}}{k_HS_H}\right)^2+\left(R_T\left(\frac{\Delta C^{\prime}}{k_CS_C}\right)\left(\frac{\Delta H^{\prime}}{k_HS_H}\right)\right)},$$

where $\Delta L^{\prime }$, $\Delta C^{\prime }$ and $\Delta H^{\prime }$ represent the differences in lightness, chroma, and hue, respectively. These differences are weighted by the functions ($S_L$, $S_C$, $S_H$), parametric weighting factors ($k_L$, $k_C$, $k_H$), and rotation term ($R_T$). All parametric weighting factors were set to $k_L=k_C=k_H=1$, as specified in reference [37]. A smaller value of CIEDE2000 indicates a lower color difference and higher color fidelity.

3.2 Results and analysis

To demonstrate the effectiveness of our proposed network, we compare it with six classic algorithms, including Gamma correction, RetinexNet [10], EnlightenGAN [5], Zero-DCE [7], and RetinexDIP [11]. Gamma correction is a traditional method that enhances an image’s dynamic range by adjusting the gamma coefficient, providing adjustable enhancement. In contrast, the remaining four algorithms are all learning-based methods that can only achieve one-to-one enhancement. For fair comparisons, we retrained these methods according to their respective characteristics and training styles. To train RetinexNet using supervised learning with paired images, we reorganized the training dataset by selecting the brightest and darkest images from each image sequence and combining them to form new paired images. Next, we reorganized the paired dataset into an unpaired dataset for the purpose of facilitating EnlightenGAN to learn the brightness mapping from dark to bright. Considering that Zero-DCE performs zero-shot learning through a carefully defined loss function, and RetinexDIP utilizes untrained directions based on the Retinex theory, they do not require reference images as supervision. Therefore, we only provide the original low-light images for retraining both methods.

We evaluated the results of our method and Gamma correction qualitatively using a complete sequence of reference images. For the RetinexNet and EnlightenGAN methods, we organize the test dataset in a similar way to their training set, using the brightest image in each reference image sequence as a reference for the evaluation matrix calculation. On the other hand, as Zero-DCE and RetinexDIP are zero-reference training methods, we calculated the evaluation matrix separately for each enhanced image and its corresponding multiple reference images, then selected the best result for comparisons. Table 1 summarises the quantitative results in the testing dataset, where our approach achieves the best numerical scores while also enabling the reconstruction under dynamically adjustable illuminations.

Table 1. Quantitative results of the simulated dataset.

View Table | View all tables in this article

The visualization results, as shown in Figs. 4 and 5, demonstrate that our method can effectively reconstruct photorealistic images under different illumination conditions while preserving high-frequency details and suppressing artifacts better than other methods. As the illumination increases, the color will also change accordingly. Since low-light images are shot with weak illumination, the color in these images suffers severe degradation and cannot provide enough information for color correction during relighting. However, both quantitative and visualization results show that, despite slight color deviations, our method can reproduce true color properties as accurately as possible compared to other approaches. This suggests that the proposed model has learned the color trends with illumination from the image sequences. In contrast, other methods suffer from over-or-under enhancement problems. Gamma correction cannot perceive semantic information in the image through simple pixel mapping and may cause serious color distortion and noise. RetinexNet and EnlightenGAN exhibit strong color deviations, possibly due to the imperfect physical mechanism of RetinexNet and the instability of the adversarial generation mechanism in EnlightenGAN. Zero-DCE and RetinexDIP are self-supervised methods that cannot realize sufficient brightness enhancement due to the absence of a clearly defined supervisory objective for constraint.

Fig. 4. Reconstruction results with different $\Delta L$. Please see Fig. 5 for the input images. The result of the gamma transform is achieved by adjusting the gamma factor for each level, which is obtained by logarithmic operation between the GT and the average brightness value of the input image.

Download Full Size | PDF

Fig. 5. Visual comparison with low-light image enhancement methods. Our method shows the reconstructed images of the highest $\Delta L$ level and the corresponding GT.

Download Full Size | PDF

During the training phase, IFRNet learns the pattern of illumination variation from the varying illumination gain labels $\Delta L$. Consequently, during the testing phase, IFRNet can reconstruct images with higher illumination levels beyond the range limited by the training set. Figure 6 shows the results beyond the illumination level range of the training phase, and we found that IFRNet can achieve well reconstruction. However, since there is a lack of supervision from real images, the reconstructed results beyond the range may have issues. The reconstruction result improves as $\Delta L$ approaches the upper bound of the training parameters, but excessively large $\Delta L$ values beyond the training range may cause color saturation distortion and insufficient brightness in some areas, as observed in the comparison with reference images.

Fig. 6. Reconstruction results beyond the illumination gain labels range of the training phase, where $\Delta L = 115$ is within the range and the others are outside the range.

Download Full Size | PDF

3.3 Ablation studies

In this section, we conducted ablation studies to evaluate the individual contributions of three key components: physical mechanism guidance, physical-based supervision, and data-based modeling, toward achieving our final results. Firstly, physical mechanism guidance serves as the basis for designing the dual-branch network structure. For the initial set of ablation experiments, we excluded the first branch and solely focused on the other branch containing the illumination gain block. Secondly, physical-based supervision provides structured image sequences with varying levels of illumination. This enables us to investigate the variation pattern of the illumination field effectively. In each set of image sequences, we retained only low-light images and images with the highest illumination level as paired data, which were used to retrain our network for the second set of ablation experiments. Lastly, data-based modeling incorporates Dilated Layers and Attention Layers to facilitate network convergence and enhance image quality. Therefore, in the third set of ablation experiments, we replaced both Dilated Layers and Attention Layers with regular convolutional layers.

Table 2 presents the quantitative results of the ablation experiments. In the table, "baseline" represents the final proposed method, "w/o PMG" and "w/o DBM" represent the first and third set of ablation experiments, respectively. The aforementioned results were tested on the complete testing dataset. "w/o PBS(a)" and "w/o PBS(b)" represent the second set of ablation experiments, where the former was tested on the complete testing dataset and the latter was tested only on the paired testing dataset. According to the quantitative results, "w/o PBS(a)" performed the worst in the complete testing dataset. This indicates that the absence of structured image sequences makes it difficult for the model to learn the evolution process of the illumination field through the paired dataset alone. Furthermore, the metric values of "w/o PMG" are lower than "w/o DBM," suggesting that physical mechanism guidance for network architecture design is more important than the Dilated Layers and Attention Layers proposed by data-based modeling. The results of "w/o PBS(b)" indicate that even without the involvement of image sequences, our method performs significantly better than RetinexNet and EnlightenGAN under the same experiments setting. Comparisons with the baseline further demonstrate the importance of structured image sequences in the relighting process.

Table 2. Qualitative results of the ablation experiments.

View Table | View all tables in this article

The visualization results of the ablation experiments shown in Fig. 7 further validate the aforementioned conclusions. It is evident that the absence of physical mechanism guidance weakens the capability to enhance image brightness during reconstruction, resulting in a reduction in overall brightness levels. Moreover, the absence of image sequences prevents achieving multi-level brightness adjustments, resulting in a mismatch between the brightness levels of the enhanced results and their corresponding labels. Additionally, removing dilated layers and attention layers introduces serious artifacts, as seen in the upper part of each image in the third row of the figure. In summary, only by combining these three key components can the best relighting effect be achieved.

Fig. 7. Reconstruction results of ablation experiments.

Download Full Size | PDF

3.4 Performance in the real world

While our model has shown promising results on the simulated dataset, further validation is necessary to confirm its performance in real-world settings. To quantitatively evaluate our model in real-world data, we built a simple image acquisition system on an optical platform and collected image sequences with multiple illumination levels. Our system used a power-adjustable incandescent lamp as the light source and a Fuji XT30II camera for image capture. We controlled the illumination by adjusting the power of the incandescent lamp while ensuring the object, light source, and camera remained stationary. We collected a total of 10 image sequences, with each sequence consisting of five images captured at different illuminations.

Table 3 summarises the quantitative results in the real-world dataset, where our approach outperforms other methods in terms of PSNR and SSIM. Figure 8 illustrates that our method is able to gradually increases the brightness while preserving image details and avoiding over-enhancement and artifacts, outperforming other methods. However, in terms of color, RetinexDIP outperforms all other methods in the CIEDE2000 metric because it only applies minimal enhancement to the input image. The CIEDE2000 value of our method is slightly lower than Gamma correction and Zero-DCE, and our visualization results shown in Fig. 8 also have a slight color deviation compared to the GT. We believe this is related to the change in the wavelength range between the training dataset and real-world dataset. The images in the simulated dataset were rendered using Blender software and utilized an ideal white light source with nearly identical responses from 380 nm to 750 nm. However, the energy of incandescent lamps used in the real-world dataset is mainly concentrated in the wavelength range from 550nm to 750nm. Our model defaults to relighting the real-world images based on the spectral characteristics of the training dataset, which leads to a wavelength range change before and after relighting. As a result, the reconstructed real-world images have a slight color deviation compared to the GT. It’s worth noting that both Gamma correction and Zero-DCE are unsupervised methods that don’t rely on the GT from the training dataset for supervision. Consequently, there is no change in the wavelength range between the training and testing datasets, leading to better color restoration.

Fig. 8. Reconstruction results of the images sampled from the optical platform.

Download Full Size | PDF

Table 3. Quantitative results in the real-world dataset.

View Table | View all tables in this article

To evaluate the reconstruction results at different illumination levels, we individually calculated the CIEDE2000 values at five different illumination levels, with $\Delta L$ uniformly sampled from 20 to 100. As presented in Table 4, our method maintains good color fidelity at lower illumination levels. However, as the illumination level increases, the color difference of our method gradually intensifies, leading to a decrease in color fidelity.

Table 4. Qualitative results of the CIEDE2000 metric at different illumination levels.

View Table | View all tables in this article

Considering that images captured on the optical platform may not be universal, so we captured images in common scenes using a Fuji XT30 II camera for comparison. Unlike obtaining paired reference images by adjusting exposure, the reference images we need require adjusting the light source, which is difficult to achieve in real-world scenarios. Therefore, we only collected low-light images and performed visual analysis and comparison of their reconstruction results. Figure 9 further demonstrates that our proposed method produces significantly clearer and more natural-looking images, compared to other methods, showcasing its effectiveness and superior generalization ability in real-world scenarios.

Fig. 9. Reconstruction results of the images sampled from rooms.

Download Full Size | PDF

3.5 Noise level evaluation

In this section, we further evaluated the performance of our method at different levels of noise. Since the training images are obtained through rendering, they are hardly affected by any noise. However, real low-light images often suffer from various forms of noise. Therefore, during the training phase, we randomly added varying levels of Gaussian noise to the input low-light images to simulate noise effects. The Gaussian noise was generated by randomly selecting a mean of 0 and a variance ranging from 0 to 0.02. We applied Gaussian noise to the low-light images in our testing dataset, with a mean of 0 and a variance ranging from 0.002 to 0.02, in intervals of 0.002. Subsequently, we conducted testing on these modified images and presented the results in Fig. 10.

Fig. 10. Curve for qualitative results in the testing dataset with different levels of Gaussian noise.

Download Full Size | PDF

It can be observed that as the noise variance increases gradually, the performance of the results decreases accordingly. When the noise variance reaches 0.01, the SSIM metric of the test results falls below 0.6. This indicates that the noise in the input images has a significant impact on the quality of the reconstructed images. The two rows of images shown in Fig. 11(a). represent the input images and the reconstructed images, respectively. It is noticeable that due to the presence of noise, the reconstructed images exhibit slightly lower overall brightness compared to the GT. However, the overall contrast level has improved, and the color deviation is relatively low. As we did not employ any specific noise suppression methods during the design process of our approach, some noise is still present in the reconstructed image. Moreover, as the noise variance increases, the level of noise in the reconstructed image also increases.

Fig. 11. Visualization results for noise level evaluation. (a) Performance on the testing dataset. (b) Performance on the real-world image.

Download Full Size | PDF

FIgure 11(b) showcases the performance of our method in handling real-world images with high levels of noise. We captured the images in the first row under extremely low-light conditions using a Fuji XT30II camera. The images in the second row represent a zoomed-in view of the green rectangular boxes located at the top of the images in the first row. Due to the presence of strong noise and low illumination, the object inside the rectangular box in the input image is obscured by darkness, making it difficult to discern with the human eye. Table 5 provides the entropy and standard deviation values for the image within the rectangular boxes. The two metrics are used to evaluate the amount of information and the degree of brightness variation in low-light images. The higher the metrics, the better the quality of the image. One can see that the input image performs poorly in these two metrics, suggesting that the details in the image are almost unrecognizable. Though relighting does not improve the pixel spatial resolution, but can optimize the spatial contrast of the images and improve the discernment of hidden targets and details in the dark. By applying various levels of relighting, the image’s contrast is enhanced, resulting in improved visibility of the structural information and details inside the boxes. This implies that our method can also be effectively applied to relighting in real-world images with extremely low illumination and strong noise. Essentially, this discerning ability is learned from the continuous illumination variation pattern manifested in structured image sequences. The pattern aids our network in understanding the variation of lighting and shadow within the images, thereby enabling more effective extraction of details in dark conditions.

Table 5. Qualitative results of the image quality inside the rectangular boxes.

View Table | View all tables in this article

4. Conclusion

In this paper, a novel solution is introduced to realize low-light relighting, it can reconstruct images under new illumination conditions from a single low-light image. The proposed approach provides a new perspective on relighting from the forward imaging model of the camera and derived the physical relationship between the illumination gain and pixel value variation as the physical prior. We utilize a PBR approach to create photorealistic 3D scenes and build structured image sequences for the reference of supervised learning. Under the guidance of the physical prior, the proposed IFRNet based on a dual-branch fusion architecture achieves high-quality illumination field reconstruction on both simulated and real-world datasets and allows users to adjust illumination gain labels to achieve results in multiple illuminations, which is a significant advantage. In the future, we plan to explore the extension of our approach to more complex scenes and further improve the accuracy and efficiency of illumination field reconstruction.

Funding

National Natural Science Foundation of China (62101256); China Postdoctoral Science Foundation (2021M691591); Jiangsu Provincial Key Research and Development Program (BE2022391).

Acknowledgments

We thank Yueming Zhu, Yingqi Liu and Chao Qu for technical supports and experimental discussion.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper may be obtained at [38].

References

1. K. G. Lore, A. Akintayo, and S. Sarkar, “Llnet: A deep autoencoder approach to natural low-light image enhancement,” Pattern Recognition 61, 650–662 (2017). [CrossRef]

2. F. Lv, F. Lu, J. Wu, and C. Lim, “Mbllen: Low-light image/video enhancement using cnns,” in BMVC, vol. 220 (2018), p. 4.

3. C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 3291–3300.

4. M. Zhu, P. Pan, W. Chen, and Y. Yang, “Eemefn: Low-light image enhancement via edge-enhanced multi-exposure fusion network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34 (2020), pp. 13106–13113.

5. Y. Jiang, X. Gong, D. Liu, Y. Cheng, C. Fang, X. Shen, J. Yang, P. Zhou, and Z. Wang, “Enlightengan: Deep light enhancement without paired supervision,” IEEE Trans. on Image Process. 30, 2340–2349 (2021). [CrossRef]

6. K. Lu and L. Zhang, “Tbefn: A two-branch exposure-fusion network for low-light image enhancement,” IEEE Trans. Multimedia 23, 4093–4105 (2021). [CrossRef]

7. C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong, “Zero-reference deep curve estimation for low-light image enhancement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 1780–1789.

8. K. Xu, X. Yang, B. Yin, and R. W. Lau, “Learning to restore low-light images via decomposition-and-enhancement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 2281–2290.

9. J. Li, X. Feng, and Z. Hua, “Low-light image enhancement via progressive-recursive network,” IEEE Trans. Circuits Syst. Video Technol. 31(11), 4227–4240 (2021). [CrossRef]

10. C. Wei, W. Wang, W. Yang, and J. Liu, “Deep retinex decomposition for low-light enhancement,” arXiv, arXiv:1808.04560 (2018). [CrossRef]

11. Z. Zhao, B. Xiong, L. Wang, Q. Ou, L. Yu, and F. Kuang, “Retinexdip: A unified deep framework for low-light image enhancement,” IEEE Trans. Circuits Syst. Video Technol. 32(3), 1076–1088 (2022). [CrossRef]

12. Y. Zhang, J. Zhang, and X. Guo, “Kindling the darkness: A practical low-light image enhancer,” in Proceedings of the 27th ACM International Conference on multimedia, (2019), pp. 1632–1640.

13. Y. Zhang, X. Guo, J. Ma, W. Liu, and J. Zhang, “Beyond brightening low-light images,” Int. J. Comput. Vis. 129(4), 1013–1037 (2021). [CrossRef]

14. R. Khan, A. Mehmood, and Z. Zheng, “Robust contrast enhancement method using a retinex model with adaptive brightness for detection applications,” Opt. Express 30(21), 37736–37752 (2022). [CrossRef]

15. S. Ahn, J. Shin, H. Lim, J. Lee, and J. Paik, “Coden: combined optimization-based decomposition and learning-based enhancement network for retinex-based brightness and contrast enhancement,” Opt. Express 30(13), 23608–23621 (2022). [CrossRef]

16. R. Wang, Q. Zhang, C.-W. Fu, X. Shen, W.-S. Zheng, and J. Jia, “Underexposed photo enhancement using deep illumination estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 6849–6857.

17. Y. Wang, Y. Cao, Z.-J. Zha, J. Zhang, Z. Xiong, W. Zhang, and F. Wu, “Progressive retinex: Mutually reinforced illumination-noise perception network for low-light image enhancement,” in Proceedings of the 27th ACM International Conference on multimedia, (2019), pp. 2015–2023.

18. E. H. Land, “An alternative technique for the computation of the designator in the retinex theory of color vision,” Proc. Natl. Acad. Sci. U. S. A. 83(10), 3078–3080 (1986). [CrossRef]

19. D. J. Jobson, Z.-u. Rahman, and G. A. Woodell, “Properties and performance of a center/surround retinex,” IEEE Trans. on Image Process. 6(3), 451–462 (1997). [CrossRef]

20. M. D. Grossberg and S. K. Nayar, “Modeling the space of camera response functions,” IEEE Trans. Pattern Anal. Machine Intell. 26(10), 1272–1282 (2004). [CrossRef]

21. Z. Li, M. Shafiei, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker, “Inverse rendering for complex indoor scenes: Shape spatially-varying lighting and svbrdf from a single image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 2475–2484.

22. P. Kán and H. Kafumann, “Deeplight: light source estimation for augmented reality using deep learning,” Vis. Comput. 35(6-8), 873–883 (2019). [CrossRef]

23. V. Bychkovsky, S. Paris, E. Chan, and F. Durand, “Learning photographic global tonal adjustment with a database of input/output image pairs,” in CVPR 2011, (IEEE, 2011), pp. 97–104.

24. J. Liu, D. Xu, W. Yang, M. Fan, and H. Huang, “Benchmarking low-light image enhancement and beyond,” Int. J. Comput. Vis. 129(4), 1153–1184 (2021). [CrossRef]

25. C. Chen, Q. Chen, M. N. Do, and V. Koltun, “Seeing motion in the dark,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 3185–3194.

26. H. Jiang and Y. Zheng, “Learning to see moving objects in the dark,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 7324–7333.

27. J. T. Kajiya, “The rendering equation,” in Proceedings of the 13th annual Conference on Computer Graphics and Interactive Techniques, (1986), pp. 143–150.

28. C. Aguerrebere, J. Delon, Y. Gousseau, and P. Musé, “Best algorithms for hdr image generation. a study of performance bounds,” SIAM J. Imaging Sci. 7(1), 1–34 (2014). [CrossRef]

29. M. A. Robertson, S. Borman, and R. L. Stevenson, “Estimation-theoretic approach to dynamic range enhancement using multiple exposures,” J. Electron. Imaging 12(2), 219–228 (2003). [CrossRef]

30. P. Shirley, C. Wang, and K. Zimmerman, “Monte carlo techniques for direct lighting calculations,” ACM Trans. Graph. 15(1), 1–36 (1996). [CrossRef]

31. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, (Springer, 2015), pp. 234–241.

32. W. Ren, S. Liu, L. Ma, Q. Xu, X. Xu, X. Cao, J. Du, and M.-H. Yang, “Low-light image enhancement via a deep hybrid network,” IEEE Trans. on Image Process. 28(9), 4364–4375 (2019). [CrossRef]

33. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 770–778.

34. F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv, arXiv:1511.07122 (2015). [CrossRef]

35. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]

36. M. R. Luo, G. Cui, and B. Rigg, “The development of the cie 2000 colour-difference formula: Ciede2000,” Color Res. Appl. 26(24), 340–350 (2001). [CrossRef]

37. S. Zhu, E. Guo, J. Gu, Q. Cui, C. Zhou, L. Bai, and J. Han, “Efficient color imaging through unknown opaque scattering layers via physics-aware learning,” Opt. Express 29(24), 40024–40037 (2021). [CrossRef]

38. J. Zhang, X. Chen, W. Tang, H. Yu, L. Bai, and J. Han, “Dataset for illumination field reconstruction network,” Github (2023). https://github.com/JetsonKarl/ifrnet_data.

Metrics	Gamma	RetinexNet	EnlightenGAN	Zero-DCE	RetinexDIP	IFRNet (ours)
PSNR(dB)	22.108	12.704	14.249	22.010	20.510	25.470
SSIM	0.844	0.698	0.662	0.861	0.801	0.885
CIEDE2000	7.250	19.479	15.370	8.308	7.837	5.248

Metrics	w/o PMG	w/o PBS(a)	w/o PBS(b)	w/o DBM	Baseline
PSNR(dB)	22.849	16.476	24.423	23.222	25.470
SSIM	0.871	0.769	0.885	0.872	0.885
CIEDE2000	7.036	14.941	5.497	6.239	5.248

Metrics	Gamma	RetinexNet	EnlightenGAN	Zero-DCE	RetinexDIP	IFRNet (ours)
PSNR(dB)	21.333	13.775	16.617	24.603	20.878	24.726
SSIM	0.722	0.404	0.420	0.598	0.722	0.782
CIEDE2000	5.101	18.302	10.720	4.646	1.364	5.691

Metrics	Input	$Δ L_{1}$	$Δ L_{2}$	$Δ L_{3}$	$Δ L_{4}$	$Δ L_{5}$
Entropy	1.219	1.854	2.662	3.133	3.485	3.725
Standard Deviation	1.094	1.908	3.928	6.087	8.358	10.448

Metrics	Gamma	RetinexNet	EnlightenGAN	Zero-DCE	RetinexDIP	IFRNet (ours)
PSNR(dB)	22.108	12.704	14.249	22.010	20.510	25.470
SSIM	0.844	0.698	0.662	0.861	0.801	0.885
CIEDE2000	7.250	19.479	15.370	8.308	7.837	5.248

Single image relighting based on illumination field reconstruction

Abstract

1. Introduction

2. Method

2.1 Solutions design

2.2 Physical mechanism

2.3 Proposed network architecture

2.4 Loss function

3. Experiments and results

3.1 Data acquisition and experiments setting

3.2 Results and analysis

3.3 Ablation studies

3.4 Performance in the real world

3.5 Noise level evaluation

4. Conclusion

Funding

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (11)

Tables (5)

Equations (14)

Optics Express