Light field reconstruction in angular domain with multi-models fusion through representation alternate convolution

Fengyin Cao; Fengyin Cao; Xinpeng Huang; Xinpeng Huang; Ping An; Ping An; Chao Yang; Chao Yang; Liquan Shen; Liquan Shen

doi:10.1364/OE.475117

1. Introduction

Owing to its richer interpretation of the scene, light field (LF) has received increasing attention and grown to be an active research area in recent years. The extra directional information in LF allows many applications in computer vision and imaging. LF has been studied in depth estimation [1–3], compression [4–6], quality assessment [7–10], display [11–13], to name but a few. By putting a micro-lens array in front of the imaging sensor, both angular and spatial information of light rays can be recorded within a single shot by a LF camera, e.g., Lytro [14] or Raytrix [15]. However, multiplexing the imaging sensor to capture both angular and spatial information of light rays introduces an inherent spatial-angular trade-off in the resolution [16]. As it has become the barrier of pushing the development of LF, alleviating such trade-off has already turned into imperative needs accordingly. Increasing the resolution of LF would be a blunt solution, LF super-resolution (SR) has become into being consequently [17]. It can reconstruct high resolution (HR) LF from its low resolution (LR) version in angular domain [6,18–20], spatial domain [21,22], and in both angular and spatial domains [23,24]. However, the insufficient angular resolution has become the bottleneck towards multi-view applications. Besides, the angular SR can be well embedded in LF related compression scheme [6]. Moreover, from the perspective of practical use, increasing the resolution of angular domain is the paramount goal of those who wish to make efficient use of LF cameras. In order to thoroughly benefit the extra directional information in LF, we focus on SR in angular domain in this paper.

Most of current LFSR methods only concern limited relations in LFs and construct single SR model towards specific aspect [6,18,24], which suggests that the multi-dimensional information in LFs is exploited insufficiently. To address this issue, we present a multi-models framework for LFSR in angular domain in this paper, so as to fully utilize the multi-dimensional information in LFs. Models which embody LF from distinct aspects are integrated to constitute the fusion framework. Since the number and the arrangement of these models together with the depth of each model determine the performance of the framework, we analyze these factors to reach the best SR result. However, models in the framework are isolated to each other, as they should be fed with the unique inputs. In this case, the fusion cannot be conducted and the end-to-end fashion cannot be adopted in the framework, which not only weakens SR performance but also makes SR process more complex. To tackle this issue, the representation alternate convolution (RAC) is introduced. Because of the RAC, data stream in the framework can be processed on-line. Since the fusion is conducted successfully through the RAC, the multi-dimensional information in LFs can be fully exploited. Besides, the RAC makes the framework easy to expand.

In our early work [25], we had proposed a framework for multi-models fusion in angular domain. The step-by-step manner was employed to meet the requirement of the inputs of different models. Therefore, the intermediate data was processed offline, which resulted a very time consuming process. Moreover, the step-by-step manner makes the framework very hard to expand, the workload substantially increases as the degree of complex of the framework depends on the number of models in it. Besides, the step-by-step manner sacrifices too much SR performance.

In what follows is the contributions of this paper:

1. We present a multi-models fusion framework for LFSR in angular domain, so as to fully utilize the multi-dimensional information in LFs.
2. Models which embody LF from distinct aspects are integrated to constitute the fusion framework. As the number and the arrangement of these models together with the depth of each model determine the performance of the framework, we analyze on these factors to reach the best SR result.
3. Models in the framework are isolated to each other as the unique inputs are required. In this case, the fusion cannot be conducted and the end-to-end fashion cannot be adopted in the framework, therefore, the representation alternate convolution (RAC) is introduced.
4. Comprehensive experiments and ablation studies are implemented, which demonstrate the effectiveness and the stability of the proposed method quantitatively and qualitatively.

2. Related work

In this section, a brief review on the single image SR methods is made firstly, other work related to LFSR is divided as follows: the depth-based (i.e., novel view synthesis) and SR-based methods. We regard LFSR in angular domain, LF reconstruction in angular domain and LF novel view synthesis interchangeable in the rest of this paper.

2.1 Single image super-resolution

The single image super-resolution (SISR) has been well studied over last decades. To super-resolve the HR image from its LR version, it is essential to enlarge the resolution of the LR image at some point to match the size of the HR image. Some methods [26,27] enlarge the resolution at the beginning of the network. Each SR model in the proposed framework follows this practice. Moreover, the resolution can be increased gradually in the middle of the network [28]. On the contrary, methods in [29–32] increase the resolution at the last of the network. The up-sampling network in our framework is with the same practice.

2.2 Novel views synthesis

Kalantari et al. [19] used two sequential neural networks in which one was employed to estimate depth and another was adopted to predict color to reconstruct the missing views in LF. Srinivasan et al. [20] proposed a pipeline to synthesize LF using the central view. They factored the problem into several tasks: estimated scene depth, rendered a Lambertian LF, and predicted occluded rays and non-Lambertian effects. Recently, Jin et al. [33] presented an end-to-end approach for super-resolving LF with a large baseline. They modeled the geometry of the scene explicitly and explored the angular relations for LF blending efficiently. In addition, they also introduced a novel loss function which could be used to preserve the parallax structure of LF. Most depth-based methods take the depth estimation as the starting point. However, inaccurate depth estimation leads to the error accumulation, which severely reduces the accuracy of the reconstruction.

2.3 LFSR in angular domain

The epipolar plane image (EPI) [17,34] is commonly employed in angular SR-based methods. Wu et al. [18] treated LF reconstruction as angular details restoration on 2D EPIs. The "blur-restoration-deblur" scheme was introduced to balance the information asymmetry. After that, the non-blind deblur was employed to recover the suppressed information. Zhao et al. [6] proposed a pipeline to synthesize novel views on the decoder side in a compression framework, they fed EPIs to their network without any pre-process. Liu et al. [35] introduced a multi-angular epipolar geometry structure, in which four sub-networks were employed to learn LF angular consistency and spatial geometry information in different directional epipolar geometry implicitly. The lenslet image is also exploited for SR in angular domain, as it is constituted by elementary images, the number of pixels in an elementary image corresponds to the resolution of angular domain. Gul et al. [24] proposed a lenslet-based SR method, the fully connected layer was employed to increase pixels in each elementary image. Despite these methods achieve the convincing performance, there is still much space for improvement, such as the blurriness of high texture regions and the ghost artifact appearing at the surroundings of objects. Furthermore, most of current SR methods establish models that embody partial relations in the LFs, which implies that the multi-dimensional information in the LFs is exploited insufficiently.

3. Methodology

In this section, each part of the overall framework depicted in Fig. 1 is introduced. Moreover, there is a thorough report on the loss functions at the last of this section.

Fig. 1. The flowchart of the framework in which the sequence "E_R-L-E_C" is conducted. The overall framework contains two parts, one is the learnable up-sampling network $F_{UP}$ that is employed to predict HR lenslet image from its LR version, and another is the fusion framework $F_{fu}$ that can be treated as the quality enhancement on the predicted HR lenslet image. To simplify the expression, we exploit the sequence to indicate the overall framework, unless the extra declaration is made.

Download Full Size | PDF

3.1 LF representations and their corresponding models

The 4D LF is normally modeled using two parallel planes i.e., L(x, y, u, v), where (x, y) and (u, v) are projections of the spatial and angular domains respectively [17,36–38]. Therefore, the 4D LF can be considered as a set of views, each of which is called sub-aperture image (SAI) $L_{SAI}(x, y, u, v)$. A slice of continuous SAIs lies on the horizontal or vertical direction is regarded as the sub-views of LF, and these sub-views facilitate the extraction of EPIs. EPI can be acquired by fixing one of the coordinates in each domain. For instance, $E_{y^*,v^*}(x, u)$ is the horizontal EPI which can be obtained by keeping the spatial coordinate $y$ and the angular coordinate $v$ unchanged. A vertical $E_{x^*,u^*}(y, v)$ can be attained in the similar means. The lenslet image $L_{lenslet}(X, Y)$ where X and Y are the products of x and u, y and v separately, can be achieved via fusing corresponding axis in the spatial and angular domains. Fig. 2 illustrates SAIs, EPIs and the lenslet image, and theirs relationships.

Fig. 2. The relationships among SAIs, EPIs and the lenslet image. (a) The lenslet image. (b) Lenslet image is constituted of elementary images, the number of pixels in each elementary image corresponds to the resolution of angular domain. (c) SAIs and EPIs.

Download Full Size | PDF

The lenslet image $L_{lenslet}(X, Y)$, the horizontal EPIs $E_{y^*,v^*}(x, u)$ and the vertical EPIs $E_{x^*,u^*}(y, v)$ are regarded as distinct LF representations in this paper, as they contain the relevant information in angular domain, which can be utilized for the reconstruction of the missing views in the LFs. LF representations in this paper are illustrated in Eq. (1):

(1)$$\Omega_{LF}=\left\{L_{{lenslet}}(X, Y), E_{y^*,v^*}(x, u), E_{x^*,u^*}(y, v)\right\}.$$

However, SAIs are good at voicing the spatial details rather than the angular relevance. Thus, SAIs are not treated as one of LF representations. We train corresponding models for these representations using the same SR network. The objective for each model is to reconstruct the high angular resolution LF $LF^{hr}$ from its low angular resolution version $LF^{lr}$:

(2)$$f=\arg \min _{f}\left\|L F^{h r}-f\left(L F^{l r} \uparrow\right)\right\|,$$

where $f$ is the function that recover high frequency details in each model, $\uparrow$ is the bicubic interpolation. $LF$ refers to the lenslet image $L_{lenslet}(X, Y)$ (denoted as $L$ in the following), the horizontal EPIs $E_{y^*,v^*}(x, u)$ (rows in SAIs, depicted as $E_R$ in the following), and the vertical EPIs $E_{x^*,u^*}(y, v)$ (columns in SAIs, denoted as $E_C$ in the following). These models are employed to form the multi-models fusion framework.

3.2 Forming the fusion framework

The SR models are combined to form the framework, there is no bias in these models so as to reduce the complexity. The framework can be treated as the sequence in which nodes are weights. Therefore, the fusion framework is formed as follows:

(3)$$\begin{array}{c} F_{fu}: \rightarrow f_{\alpha}-f_{\beta}-\dots-f_{0}=f_{\alpha}\left(f_{\beta}\left(\dots\left(f_{0}\right)\right)\right) \\ =w_{\alpha} *\left(w_{\beta} *\left(\dots\left(w_{0} * I_{0}\right)\right)\right), \\ \forall w_{\alpha}, w_{\beta}, \dots \in w_{\Omega}, w_{\Omega} \subseteq w_{L F}, w_{L F}=\left\{w_{L}, w_{E_{R}}, w_{E_{C}}\right\}, \end{array}$$

where $F_{fu}$ is the fusion framework, $f_{\alpha }-f_{\beta }-\dots -f_{0}$ is the sequence that represents $F_{fu}$, $f_{\alpha },f_{\beta },\ldots$ and $w_{\alpha },w_{\beta },\ldots$ stands for each particular model and its corresponding weight respectively, $*$ denotes the convolution, and $I_0$ is the input of the framework. Besides, $w_{\Omega }$ is the subset of $w_{LF}$, it can be any subset except for the empty one.

We first consider the number and the arrangement of these models. Suppose that $w_a$ is an arbitrary weight taken from $W_\Omega$. $W_{A}^T$, which is an $A \times 1$ weighting matrix, is constructed by $w_a$. In order to control the arrangement of these models, a sparse matrix $e_{N \times A}$ is introduced, where each row is the impose function. Therefore, the arrangement can be decided by $M(N;A)$ which is the multiplexing of $e_{N \times A}$ and $W_{A}^T$:

(4)$$\begin{array}{c} w_{a} \in W_{\Omega}, a = 0,1 {\dots} A-1,\\ W_{A}^{T}=\begin{pmatrix} w_{0} \\ w_{1} \\ \cdots \\ w_{A-1} \end{pmatrix}, e_{N \times A}=\begin{pmatrix} 0 & \cdots & 1 & \cdots & 0 \\ \vdots & & \vdots & & \vdots \\ 1 & \cdots & 0 & \cdots & 0 \\ \vdots & & \vdots & & \vdots \\ 0 & \cdots & 0 & \cdots & 1 \end{pmatrix}, \\ M(N ; A)=e_{N \times A} \cdot W_{A}^{T}, \end{array}$$

where $N$ is the number of models. Therefore, the process in Eq. (3) can be reformulated using Eq. (4) as:

(5)$$\begin{array}{c} F_{f u}(N ;M(N ; A))=f_{N-1}\left(f_{N-2}\left(\dots\left(f_{0}\right)\right)\right) \\ = w_{N-1}^{M(N ; A)} *\left(w_{N-2}^{M(N ; A)} * \dots\left(w_{0}^{M(N ; A)} * I_{0}\right)\right), \end{array}$$

where $I_0$ is the input of the framework, $w_n^{M(N ; A)}, n= 0,1 {\dots } N-1$ is the selected weight via $M(N ; A)$ for the $n$th model $f_n$. Next, we consider the depth of each SR model. Therefore, Eq. (5) can be reformulated as:

(6)$$\begin{array}{c} F_{f u}(N ; M(N ; A) ; d)=f_{N-1}^{d}\left(f_{N-2}^{d}\left(\dots\left(f_{0}^{d}\right)\right)\right) \\ =w_{N-1}^{M(N ; A);d} *\left(w_{N-2}^{M(N ; A);d} * \dots\left(w_{0}^{M(N ; A);d} * I_{0}\right) ,\right. \end{array}$$

where $w_n^{M(N ; A);d}, n= 0,1 {\dots } N-1$ indicates the selected weight for the $n$th model $f_n^d$ that has the depth of $d$. To this end, LFSR in angular domain can be formulated as follows:

(7)$$LFSR_{AR} = \arg \max F_{f u}(N ; M(N ; A) ; d),$$

where $LFSR_{AR}$ indicates the objective of this paper, which is to accomplish SR task in angular domain. Models embodying LF from distinct aspects are integrated to constitute the fusion framework $F_{fu}$. The number $N$ and the arrangement $M(N ; A)$ of these models together with the depth $d$ of each model determine the performance of $F_{fu}$.

3.3 Representation alternate convolution

The relationships among LF representations, as what has already mentioned in section 3.1, can be described using four equations as follows:

(8)$$\mathcal{I}_{L \rightarrow E_{R}} =\varPhi_{L \rightarrow E_{R}}\left(\mathcal{I}_{L}\right)=\boldsymbol{T}\left(\boldsymbol{R}\left(\mathcal{I}_{L}\right), \boldsymbol{t}^{[0,2,1,3]}\right),$$

(9)$$\mathcal{I}_{E_{R} \rightarrow L}=\varPhi_{E_{R} \rightarrow L}\left(\mathcal{I}_{E_{R}}\right) =\boldsymbol{R}\left(\boldsymbol{T}\left(\mathcal{I}_{E_{R}}, \boldsymbol{t}^{[0,2,1,3]}\right)\right),$$

(10)$$\mathcal{I}_{L \rightarrow E_{C}} =\varPhi_{L \rightarrow E_{C}}\left(\mathcal{I}_{L}\right) =\boldsymbol{R}\left(\boldsymbol{T}\left(\mathcal{I}_{L}, \boldsymbol{t}^{[0,2,1,3]}\right)\right),$$

(11)$$\mathcal{I}_{E_{C} \rightarrow L} =\varPhi_{E_{C} \rightarrow L}\left(\mathcal{I}_{E_{C}}\right) =\boldsymbol{T}\left(\boldsymbol{R}\left(\mathcal{I}_{E_{C}}\right), \boldsymbol{t}^{[0,2,1,3]}\right),$$

where $L$, $E_R$ and $E_C$ represent the lenslet image, the horizontal EPIs and the vertical EPIs respectively. R stands for the reshape, T denotes the transpose towards the axis $\boldsymbol{t}^{[0,2,1,3]}$. Eq. (8) alternates $L$ to $E_R$, while the reverse conversion is illustrated in Eq. (9). Furthermore, the process that alternates $L$ to $E_C$ and the reverse process are illustrated in Eq. (10) and Eq. (11) separately. It should be noticed that the lenslet image is treated as the starting point of the framework no matter which node lies at the first of the sequence. Besides, the intermediate data in between two consecutive models is also in the form of the lenslet image.

In order to illustrate the representation alternate convolution (RAC), let us consider the case of three different models that are arranged in the order of ${E_R}\mbox{-}{L}\mbox{-}{E_C}$. Since the lenslet image is treated as the starting point of the framework, the input of the RAC is $\mathcal {I}_L^{in}=L^{(B,h \times V_h,w \times V_w,C)}$ that has the shape of $[B, h \times V_h, w \times V_w, C]$, where $B$ is the batch size, $h \times V_h$ and $w \times V_w$ are the height and the width of the lenslet image respectively, $C$ is the number of channels.

step 1: The first step of the RAC with the given order is to convert $\mathcal {I}_L^{in}$ to $\mathcal {I}_{E_{R}}$ using Eq. (8), where $\varPhi _{L \rightarrow E_R }(\mathcal {I}_L^{in})$ alternates the input $\mathcal {I}_L^{in}$ that has the shape of $[B, h \times V_h, w \times V_w, C]$ to the horizontal EPIs $\mathcal {I}_{E_R}$ with the shape $[B \times h \times V_h, V_w, w, C]$. The model $f_{E_R}$ is fed with $\mathcal {I}_{E_R}$: $\mathcal {F}_{E_{R}}=f_{E_{R}}\left (\mathcal {I}_{E_{R}}\right )=w_{E_{R}} * \mathcal {I}_{E_{R}}$, and the output feature map $\mathcal {F}_{E_R}$ is alternated to $\mathcal {F}_{L}^1$ via Eq. (9).

step 2: As the feature map $\mathcal {F}_{L}^1$ has the shape of $[B, h \times V_h, w \times V_w, C]$, which is in the right form to the requirement of the model $f_L$. Therefore, the model $f_L$ is directly fed with $\mathcal {F}_{L}^1$: $\mathcal {F}_{L}^2=f_{L}\left (\mathcal {F}_{L}^1\right )=w_{L} * \mathcal {F}_{L}^1$, and the output feature map $\mathcal {F}_L^2$ is converted to $\mathcal {F}_{E_C}$ via Eq. (10), where $\varPhi _{L \rightarrow E_C}(\mathcal {F}_L^2)$ alternates the feature map $\mathcal {F}_L^2$ with the shape $[B, h \times V_h, w \times V_w, C]$ to the feature map $\mathcal {F}_{E_C}$ that has the shape $[B \times w \times V_w, h, V_h, C]$.

step 3: At last, the model $f_{E_C}$ that lies at the last of the given sequence is fed with $\mathcal {F}_{E_C}$: $\mathcal {I}_{E_C}=f_{E_C}\left (\mathcal {F}_{E_C}\right )=w_{E_{C}} * \mathcal {F}_{E_C}$. To ensure the expansibility of the RAC, the output of the given framework $\mathcal {I}_{E_C}$ is further converted to $\mathcal {I}_L^{out}$ using Eq. (11), where $\varPhi _{E_C \rightarrow L}(\mathcal {I}_{E_C})$ alternates the vertical EPIs $\mathcal {I}_{E_C}$ with the shape $[B \times w \times V_w, h, V_h, C]$ to the lenslet image $\mathcal {I}_L^{out}=L^{(B,h \times V_h,w \times V_w,C)}$ that has the shape $[B, h \times V_h, w \times V_w, C]$.

More specifically, the workflow of the RAC with the given order ${E_R}\mbox{-}{L}\mbox{-}{E_C}$ is illustrated using pseudo-code in Algorithm 1. It should be noticed that $\mathcal {I}_L^{in}$ and $\mathcal {I}_L^{out}$ are the input and output lenslet images separately, $\mathcal {I}_{E_{R}}$ and $\mathcal {I}_{E_C}$ are the horizontal EPIs and vertical EPIs respectively. $f_{E_C}$, $f_{E_R}$ and $f_L$ are the corresponding models for $E_C$, $E_R$ and $L$ respectively. $\mathcal {F}_{E_R}$, $\mathcal {F}_{L}^1$, $\mathcal {F}_L^2$ and $\mathcal {F}_{E_C}$ are the feature maps generated from convolutions. Other sequences can be implemented in the similar means. Since the structure of the current framework can be expand, the performance will be improved to some extent, by adding new nodes at somewhere of the current structure. Besides, we can also eliminate nodes from the current structure.

Algorithm 1. the RAC with the given order E_R-L-E_C

View Table | View all tables in this article

3.4 Up-sampling network

The learnable up-sampling network $F_{UP}$ is exploited to predict the HR intermediate lenlset image from its LR version. Similar to Gul et al. [24], we increase the resolution of each elementary image. $F_{UP}$ is a three-layer fully convolutional network, each layer in it can be described as follows:

(12)$$\begin{array}{c} F_{U P}^{1}\left(L^{L R}\right)=\sigma\left(w_{U P}^{1} * L^{L R}+b_{1}\right), \\ F_{U P}^{2}\left(F_{U P}^{1}\right)=\sigma\left(w_{U P}^{2} * F_{U P}^{1}+b_{2}\right), \\ L^{H R}=F_{U P}^{3}\left(F_{U P}^{2}\right)=\boldsymbol{P S}\left(w_{U P}^{3} * F_{U P}^{2}+b_{3}\right), \end{array}$$

where $w_{UP}^i$, $b_i$, $i = 1, 2, 3$ are weights and biases respectively, $\sigma ( \cdot )$ stands for the activation function, $L^{HR}$ is the high angular resolution ground truth lenslet image and $L^{LR}$ is its LR version. PS is the pixel shuffler [32], which rearranges the elements in the tensor with the shape $[B, H, W, r^2 \times C]$ to the new tensor of shape $[B, r \times H, r \times W, C]$, where r is the upscale factor. $F_{UP}$ together with $F_{fu}$ comprise the overall framework $F_{overall}$. However, the method in [39] exploits the similar network architecture. It establishes models for R, G and B channels separately, and takes $3 \times 3$ lenslet regions as input. On the contrary, $F_{UP}$ only needs the information in Y channel, and takes the single lenslet region to feed the network. Moreover, unlike $F_{UP}$, the method in [39] ignores to preserve the intrinsic structure in the parallax. Besides, $F_{UP}$ is only the part of $F_{overall}$.

3.5 Loss function

3.5.1 Loss function for the up-sampling network

The learnable up-sampling network $F_{UP}$ is employed at the beginning of the overall framework $F_{overall}$. The loss function for $F_{UP}$ is composed of two parts:

(13)$$\begin{array}{c} loss_{UP}=\lambda_{pre}loss_{pre}+\lambda_{EPI}loss_{EPI}, \end{array}$$

where $\lambda _{pre}$ and $\lambda _{EPI}$ are the corresponding coefficients of the predicting loss $loss_{pre}$ and the EPI loss $loss_{EPI}$ respectively. The predicting loss $loss_{pre}$ is used to minimize L2 distance between the predicted lenslet image $F_{UP}(L_i^{lr})$ and the corresponding HR ground truth $L_i^{hr}$ :

(14)$$loss_{pre}= \sum_{i=1}^{M}\left\|L_{i}^{hr}-F_{UP}\left(L_{i}^{lr}\right)\right\|_{2}^{2},$$

where $M$ stands for the number of training samples, $L_i^{lr}$ is the LR input. However, as the lenslet image is treated as a 2D image directly in $F_{UP}$, the ignored intrinsic structure consistency in the parallax should be maintained. Therefore, we employ the EPI loss $loss_{EPI}$, which is based on the gradient of EPIs [33], to keep this consistency:

(15)$$\begin{array}{c} loss_{EPI}= \sum_{y, v}(|\nabla_{x} E_{y^{*}, v^{*}}(x, u)-\nabla_{x} \widehat{E}_{y^{*}, v^{*}}(x, u)| \\ +|\nabla_{u} E_{y^{*}, v^{*}}(x, u)-\nabla_{u} \widehat{E}_{y^{*}, v^{*}}(x, u)|) \\ + \sum_{x, u}(|\nabla_{y} E_{x^{*}, u^{*}}(y, v)-\nabla_{y^{*}} \widehat{E}_{x^{*}, u^{*}}(y, v)\| \\ +|\nabla_{v} E_{x^{*}, u^{*}}(y, v)-\nabla_{v} \hat{E}_{x^{*}, u^{*}}(y, v)|), \end{array}$$

where $E_{y^{*},v^{*}}(x,u)$ and $\widehat {E}_{y^{*}, v^{*}}(x, u)$) are the horizontal EPIs that are constructed from the ground truth and the predicted LF respectively. The same to the vertical EPIs $E_{x^{*},u^{*}}(y,v)$. $\nabla _*$ represents the first order gradient along one of the axis in $x$, $y$, $u$, and $v$.

3.5.2 SR loss

The SR loss contains two parts: one for training the corresponding models of distinct LF representations, another is for the fusion framework $F_{fu}$. L2 distance is employed to optimize each SR model. Besides, the penalty term is also exploited to avoid overfitting. Therefore, the loss function for a corresponding model is:

(16)$$ \textit{loss}_{L F}=\sum_{i=1}^N\left\|L F_i^{h r}-f\left(L F_i^{l r}\right)\right\|_2^2+\lambda \sum_{i=1}^P\left\|p_t\right\|_2^2, $$

where $N$ is the number of training samples, $LF_i^{lr}$ and $LF_i^{hr}$ are the particular LR LF representation and its HR ground truth respectively, $p_t$ and $P$ are the weights and the number of them in a single SR network separately. The cascade structure is employed to implement $F_{fu}$, for a framework with $S$ models, the loss function is:

(17)$$ \textit{loss}_{F_{f u}}=\sum_{j=1}^S \sum_{i=1}^N\left\|\left(M_i^{h r}\right)_j-f\left(\left(M_i^{l r}\right)_j\right)\right\|_2^2+\lambda \sum_{t=1}^{S \times P}\left\|p_t\right\|_2^2, $$

where $(M_i^*)_j$ represents the $j$th models in $F_{fu}$, $M_i^*$ is the specific LF representation.

3.5.3 Overall loss

The objective of $F_{overall}$ is to minimize the distances between the predicted and the ground truth lenslet images:

(18)$$loss_{overall}=\sum_{i=1}^{N}\left\|L_{i}^{h r}-f\left(L_{i}^{l r}\right)\right\|_{2}^{2}+\lambda \sum_{t=1}^{S \times P}\left\|p_{t}\right\|_{2}^{2},$$

where $f$ represents $F_{overall}$, $N$ is the number of training samples, $S$ is the number of SR models, $L^{lr}$ and $L^{hr}$ are the LR lenslet image and its HR ground truth separately, $p_t$ and $P$ are the parameters and the number of them respectively, $\lambda$ is the coefficient of the penalty term.

4. Experiments

In this section, we evaluate our method and make comparison with state-of-the-art SR-based methods. Peak signal to noise ratio (PSNR) and structural similarity index measure (SSIM) [40] are the criteria of the performance. For each LF, numerical results are computed over all reconstructed views. Besides, we conduct the ablation study to demonstrate the importance of each part in the overall framework $F_{overall}$.

4.1 Datasets

We exploit both real-world and synthetic LFs to train and evaluate our method. We utilize the real-world LFs provided by Kalantari et al. [19], in which 100 LFs for training and 30 LFs (i.e., 30scenes) for the evaluation. Besides, the extra LFs are taken from the Stanford Lytro LF Archive [41] to make further comparison in terms of the generalization. We take 10 from occlusions, and 10 from reflective, these LFs involve the challenging scenes. To avoid the overexposure and the vignette effect, we use the central 9${\times }$9 grid. Moreover, we crop SAIs in each LF to 375${\times }$540. The real-world LFs are used to distinguish the performance of different methods that are deployed under the natural illumination and the practical camera distortion. The HCI benchmark [42] is exploited for the synthetic LFs. We use 20 LFs that are taken from "training" and "additional" to train our method. The LFs from "test" are used for the evaluation. As it is generated by the software, the total 81 SAIs in a synthetic LF can be used. Moreover, the spatial resolution is 512${\times }$512 in all synthetic LFs. The synthetic LFs contain HR textures that can be used to measure the capability of maintaining details. Besides, there exists large disparities between consecutive SAIs, which can be used to measure the robustness of different methods. For the LFs, their LR version can be obtained via sampling on them using different sample rates, as shown in Fig. 3. We make comparison for 2${\times }$ and 4${\times }$ SR respectively. For more details, please read the Supplement 1. We convert the LFs to YCbCr color space and only use the information in Y channel.

Fig. 3. Views under different sample rates. (a) Views under sample rate $r = 2$ (i.e. 2${\times }$ SR). (b) Views under sample rate $r = 4$ (i.e. 4${\times }$ SR). The views in green are inputs, the rest is the views to be reconstructed.

Download Full Size | PDF

4.2 Training configurations

The overall framework $F_{overall}$ shown in Fig. 1 is constituted by two parts, one of which is the up-sampling network $F_{UP}$, it is trained using the lenslet images. Another is the fusion framework $F_{fu}$ , where multi-models are combined. Each model in $F_{fu}$ is built upon the SR model [26], there is no bias in the implementation. For the depth of the SR model, we set it to 5 and 10 empirically, to balance the computational burden and the performance. The number of SR models is set at most 3 in practice, and $F_{fu}$ is comprised with non-repetitive SR models. Otherwise, the computational burden is increased to the great extent, and the time consumption is enlarged rapidly. Besides, it may undergo the vanishing gradient in the training phase.

The weights of convolution filters in each SR model and $F_{UP}$ are initialized based on the method in [43], and the biases in $F_{UP}$ are initialized via zero constant initializer. The batch size for $F_{UP}$ is 16, and it is 32 and 64 for SR models which are trained using the lenslet images and EPIs respectively. After $F_{UP}$ and $F_{fu}$ are trained, $F_{overall}$ can be comprised. We adopt parameters in the already trained $F_{UP}$ and $F_{fu}$ as the initial values for $F_{overall}$. The batch size for it is 16. For SR models, $F_{fu}$ and $F_{UP}$ are implemented using the Tensorflow [44] 1.12, which are deployed on the Tesla P100. The Adam optimizer [45] that has parameters of $\beta _1$ = 0.9, $\beta _2$ = 0.999, and $\epsilon = 10^{-8}$ is adopted.

To simplify the expression, we exploit the sequence to indicate $F_{overall}$ which contains $F_{UP}$ and $F_{fu}$, unless the extra declaration is made. We give a brief illustration for the meaning of the sequence, for example, the sequence "$L$10-$E_C$10-$E_R$10" is comprised by three different SR models in the order of ${L}\mbox{-}{E_C}\mbox{-}{E_R}$, and each SR model in it has the depth of 10.

4.3 Comparison with state-of-the-art methods

The bicubic interpolation is treated as the base line, state-of-the-art SR-based methods, namely Wu et al. [18] (i.e, EPICNN), Zhao et al. [6], and Gul et al. [24] are employed to make comparison. However, these methods are implemented with the Caffe [46] and utilize different datasets, we re-implement them using the Tensorflow 1.12 with the same datasets as ours. We use the same training parameters in their paper, and tune them to achieve the best performance. For the method in [24], as it is used to super-resolve both angular and spatial domains, we keep the spatial resolution unchanged so as to implement SR in angular domain, and use LFSR-AR to represent it in the following. Moreover, as MSDRN in [6] is used for both EPI SR and decoder-side quality enhancement, therefore, we only implement EPI SR from their method. However, we cannot make comparison with the depth-based methods, such as [19,33], as they are designed to handle the large disparity.

4.3.1 Comparison for 2${\times }$ SR

Table 1 lists the SR performance of different candidates. The results are used to make the intra-comparison among them at first. The sequence "${L10}\mbox{-}{E_C10}\mbox{-}{E_R10}$" achieves outstanding result in the real-world LFs, while the sequence "${L10}\mbox{-}{E_R10}\mbox{-}{E_C10}$" obtains the best SR result in the synthetic LFs, and we use them to make further comparison with state-of-the-art methods. It can also be observed form Table 2 that:

• LFSR-AR [24] is inferior to others. While $F_{UP}$ performs better than LFSR-AR and the base line. As both of LFSR-AR and $F_{UP}$ have simple network architecture, it implies that the pixel shuffler is much stronger than the fully connected layer in the capability of prediction.
• EPICNN [18] and Zhao et al. [6] achieve the considerable results in the real-world LFs. However, they meet the performance decay when it comes to the synthetic LFs. These methods cannot maintain the details of HR textures. The pre-process for inputs that are interpolated using the bicubic to match the size of the ground truths, is the possible explanation. Besides, they only concern partial relations in the LFs, which is the vital limitation.
• Our approach outperforms state-of-the-art techniques among all the listed datasets, which indicates that the effectiveness of our approach can be confirmed. Besides, our approach is not troubled by the performance degradation when it comes to the synthetic LFs, which demonstrates the stability of the proposed method.

Table 1. PSNR/SSIM values achieved by different candidates for 2${\times }$ SR. The best results are highlighted in bold.

View Table | View all tables in this article

Table 2. PSNR/SSIM values achieved by different methods for 2${\times }$ SR. The best results are highlighted in bold.

View Table | View all tables in this article

We also visually compare the results of different techniques, which are shown in Fig. 4 and Fig. 5. In both figures, the most left is the ground truth together with two zoomed-in area, the error map lies on the bottom. We illustrate "${L10}\mbox{-}{E_R10}\mbox{-}{E_C10}$" for the synthetic LF "herbs", and "${L10}\mbox{-}{E_C10}\mbox{-}{E_R10}$" for the real-world LF "IMG_1528_eslf".

Fig. 4. The visual comparison of the synthetic LF "herbs".

Download Full Size | PDF

Fig. 5. The visual comparison of the real-world LF "IMG_1528_eslf".

Download Full Size | PDF

In Fig. 4, is the synthetic LF "herbs", we can observe that except for "${L10}\mbox{-}{E_R10}\mbox{-}{E_C10}$", the black dots on the wall and the edge of the leaves are blurred in different degrees in each visual result. Besides, there exists clearly ghost artifacts around leaves, which is shown in the blue rectangle of LFSR-AR, bicubic, $F_{UP}$, and EPICNN. The more intuitive results can be obtained from the error maps. The real-world LF "IMG_1528_eslf" in Fig. 5 is taken from 30scenes. The billboard as well as the surrounding branches in the orange rectangle in LFSR-AR, bicubic, $F_{UP}$, and EPICNN are blurred in different degrees respectively, while it still sharp in Zhao et al. [5] and "${L10}\mbox{-}{E_C10}\mbox{-}{E_R10}$". Moreover, the latten and the surrounding branches are also blurred in LFSR-AR, bicubic, and $F_{UP}$. It can be observed from Fig. 4 and Fig. 5 that our approach achieves the better visual performance than state-of-the-art techniques. The blurriness of high texture regions and the ghost artifact appearing at the surroundings of objects, emerge in these methods, on the contrary, our method produces sharp and clean results.

4.3.2 Comparison for 4${\times }$ SR

We list the SR performance of different candidates in Table 3, and make the intra-comparison among them at first. The sequence "${L10}\mbox{-}{E_C10}\mbox{-}{E_R10}$" achieves superior results in the real-world LFs, while the sequence "${E_R10}\mbox{-}{L10}\mbox{-}{E_C10}$" obtains the best SR result in the synthetic LFs, and we use them to make further comparison with state-of-the-art methods. It can be observed from Table 4 that:

• In the real-world LFs, LFSR-AR [24] achieves the similar performance with the base line. However, it fails to reconstruct the missing views when it comes to the synthetic LFs. While, $F_{UP}$ still performs better than the base line and LFSR-AR.
• Zhao et al. [5] achieves the considerable performance in both real-world and synthetic LFs. EPICNN [15] behaves not so well, which is possibly influenced by the "blur-restoration-deblur" scheme. The effect of this scheme is not so prominent with the smaller disparities, while, it is magnified under the larger disparities. Furthermore, for both methods, they cannot get rid of the trouble of the performance decay when it comes to the synthetic LFs.
• Our approach still performs better than others among all the listed datasets. The effectiveness and the stability of our approach can be further confirmed.

Table 3. PSNR/SSIM values achieved by different candidates for 4${\times }$ SR. The best results are highlighted in bold.

View Table | View all tables in this article

Table 4. PSNR/SSIM values achieved by different methods for 4${\times }$ SR. The best results are highlighted in bold.

View Table | View all tables in this article

The visual comparison are illustrated in Fig. 6 and Fig. 7. For the visual results, we illustrate "${E_R10}\mbox{-}{L10}\mbox{-}{E_C10}$" for the synthetic LF "bicycle", and "${L10}\mbox{-}{E_C10}\mbox{-}{E_R10}$" for the real-world LF "reflective_3_eslf".

Fig. 6. The visual comparison of the synthetic LF "bicycle".

Download Full Size | PDF

Fig. 7. The visual comparison of the real-world LF "reflective_3_eslf".

Download Full Size | PDF

In Fig. 6 is the synthetic LF "bicycle". We can observe that the result of LFSR-AR is bad, there exists distortion and severely deviation in the color. The reason for this is the fully connected layer lacks the ability to predict the intensities of the missing pixels under large disparities. Besides, except for "${E_R10}\mbox{-}{L10}\mbox{-}{E_C10}$", other methods cannot maintain the details of HR textures. The chips on the bamboo basket and the patterns on the floor, are blurred in different degrees. There even exists the ghost artifacts around the edge of leaves in most results. The "reflective_3_eslf" which is taken from reflective is shown in Fig. 7. The wiper exists severely ghost artifacts and the front of the car is blurred in LFSR-AR, bicubic and $F_{UP}$. Except for "${L10}\mbox{-}{E_C10}\mbox{-}{E_R10}$", the lantern in the orange rectangle has the trailing in different degrees in each result. From the visual comparison in Fig. 6 and Fig. 7, we can draw the conclusion that our approach still obtains the better performance than state-of-the-art techniques.

4.4 Ablation study

In this section, we conduct the ablation study to further verify the importance and the effectiveness of each part in our approach. We assort the ablation study into two aspects, one is the factors that determine the performance, and another is the variants of the overall framework.

4.4.1 Factors that determine the performance

To evaluate the SR performance of the relevant variants, the real-world (i.e., 30scenes) and the synthetic LFs are employed. Different up-sampling rates are also utilized. We set two different depths for the SR model, thus it can illustrate the impact of depth to the performance. However, there is a restriction that we cannot combine models with different depths. Moreover, we select at most two different models, and arrange them in different orders. Table 5 and Table 6 illustrate the SR performance of these variants.

Table 5. PSNR/SSIM values achieved by different variants for 2${\times }$ and 4${\times }$ SR. Each variant combines at most two SR models, each of which has the depth of 5. The best results are highlighted in bold.

View Table | View all tables in this article

Table 6. PSNR/SSIM values achieved by different variants for 2${\times }$ and 4${\times }$ SR. Each variant combines at most two SR models, each of which has the depth of 10. The best results are highlighted in bold.

View Table | View all tables in this article

SR models and frameworks with deeper depth achieve the better performance, which indicates the influence of the depth. However, "$L5$" and "$L10$" show opposite behavior. From Table 5 and Table 6, we can observe that "$L5$" reaches to the lowest performance among SR models, while the situation goes to the opposite when it comes to "$L10$" that ranks at the top position. The possible reason for this behavior is that a shallow network cannot easily handle such high dimensional data (i.e., the lenslet image). Once the architecture of the network is enough to cope with such high dimensional data, it can yield a lot of performance gain.

Furthermore, the effectiveness of the compensation should be demonstrated, and the influence that is brought by the arrangement should be illustrated:

• With the same depth, frameworks achieve the better performance than SR models, which validates the effectiveness of the compensation. For example, in Table 5 and Table 6, "$E_C10$" which can be treated as cascading two identical models "${E_C5}\mbox{-}{E_C5}$", achieves 43.5043dB in 30scenes. However, replacing "$E_C5$" with "$E_R5$" or "$L5$", results in 44.4041dB (i.e., "${E_R5}\mbox{-}{E_C5}$"), and 43.5385dB (i.e., "${L5}\mbox{-}{E_C5}$") respectively.
• Two identical models that are arranged in distinct orders obtain different results, which verifies the importance of the arrangement. We take "$E_C5$" and "$L5$" to form the framework as an example using 30scenes, the sequence "${E_C5}\mbox{-}{L5}$" in Table 5 obtains 43.4495dB, which is lower than "$E_C10$" (i.e., 43.5043dB in Table 6). It implies that "$L5$" yields negative impact to the performance. However, the sequence "${L5}\mbox{-}{E_C5}$" achieves 43.5385dB, which indicates that "$L5$" produces positive impact to the performance.

In addition, we make a comprehensive analysis on the various performance caused by different arrangements in the Supplement 1, please read it for more information.

4.4.2 Variants of the overall framework

In order to validate the performance gain that is brought by the learnable up-sampling network $F_{UP}$ and the end-to-end training manner, we also illustrate the variants of them. To implement the step-by-step fine tune strategy, we prepare training data for each stage offline [41]. For the variants without $F_{UP}$, we directly interpolate on the LR inputs using the bicubic interpolation to match the size of the HR ground truths before feeding them to the networks. We conduct these variants for 2${\times }$ SR in 30scenes. The results are listed in Table 7–9.

• The importance of the end-to-end fashion in the training phased is validated. We take the sequence "${E_R10}\mbox{-}{L10}\mbox{-}{E_C10}$" as an example. When the step-by-step strategy is employed and $F_{UP}$ is removed (i.e., "w/o end2end$+F_{UP}$"), it achieves 44.0061dB. However, it obtains 44.6767dB in the case that only $F_{UP}$ is removed (i.e., "w/o $F_{UP}$"). Furthermore, by using the step-by-step strategy, a lot of time has been spent on the offline data processing, which is not expected in the practice. The end-to-end training fashion brings the better results and reduces the complexity of the process. By producing the sequence in order, the SR in angular domain can be done within the single one step.
• It can be validated that $F_{UP}$ brings the performance gain in most cases. Taking the sequence "${L10}\mbox{-}{E_R10}\mbox{-}{E_C10}$" as an example, the full version of it achieves 44.9166dB (i.e., in Table 1). While, it only obtains 44.2575dB when $F_{UP}$ is removed (i.e., "w/o $F_{UP}$"), as shown in Table 9. However, the opposite is true when "${E_C10}\mbox{-}{L10}\mbox{-}{E_R10}$" is taken. The full version of it obtains 44.8344dB and its "w/o $F_{UP}$" version achieves 44.8598dB, which are listed in Table 1 and Table 9 respectively. It implies that $F_{UP}$ produces negative effect to the performance. A possible reason for this is the framework cannot appropriately fit the predicted HR lenslet images that is produced by $F_{UP}$. Although $F_{UP}$ brings negative impact to the SR performance in several cases, it still yields performance gain in general.

Table 7. PSNR/SSIM values achieved by different variants for 2${\times }$ SR in 30scenes. These variants combine two distinct SR models, each of which has the depth of 5. The best results are highlighted in bold.

View Table | View all tables in this article

Table 8. PSNR/SSIM values achieved by different variants for 2${\times }$ SR in 30scenes. These variants combine two distinct SR models, each of which has the depth of 10. The best results are highlighted in bold.

View Table | View all tables in this article

Table 9. PSNR/SSIM values achieved by different variants for 2${\times }$ SR in 30scenes. These variants combine three distinct SR models, each of which has the depth of 10. The best results are highlighted in bold.

View Table | View all tables in this article

4.5 Limitations

Our method behaves not so well for 4${\times }$ SR. Generally speaking, when it comes to the large disparity in LFs, current SR-based methods normally achieve inferior performance, which is the major drawback of these methods. In the future, we are going to settle this issue. Besides, the capability of the SR network itself is another drawback. An elaborate-designed network dedicated to the SR in LF should be introduced.

5. Conclusion

In this paper, we presented a multi-models fusion framework for LFSR in angular domain. Models embodied LF from distinct aspects were integrated to form the fusion framework. Since the number and the arrangement of these models together with the depth of each model determined the performance of the framework, we analyzed these factors to reach the best SR result. However, as they should be fed with the unique inputs, models in the framework were isolated to each other, in this case, the RAC was introduced. Through the RAC, the fusion was conducted successfully, which lead to the fully exploitation of the multi-dimensional information in LFs. Experimental results confirmed the effectiveness of our approach. Besides, through the comprehensive ablation studies, the importance of each part in our framework can be verified.

Funding

National Natural Science Foundation of China (62020106011, 62001279, 62071287, 62171002); Science and Technology Commission of Shanghai Municipality (20DZ2290100); China Postdoctoral Science Foundation (2021T140442).

Acknowledgments

The authors thank the anonymous reviewers for their valuable comments.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, and I. So Kweon, “Accurate depth map estimation from a lenslet light field camera,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2015), pp. 1547–1555.

2. C. Shin, H.-G. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim, “Epinet: A fully-convolutional neural network using epipolar geometry for depth from light field images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 4748–4757.

3. B. Liu, J. Chen, Z. Leng, Y. Tong, and Y. Wang, “Cascade light field disparity estimation network based on unsupervised deep learning,” Opt. Express 30(14), 25130–25146 (2022). [CrossRef]

4. X. Huang, P. An, Y. Chen, D. Liu, and L. Shen, “Low bitrate light field compression with geometry and content consistency,” IEEE Trans. Multimedia 24, 152–165 (2022). [CrossRef]

5. X. Huang, P. An, C. Yang, and L. Shen, “A novel light field compression framework with hybrid residue transform mechanism,” Electron. Lett. 58(5), 207–209 (2022). [CrossRef]

6. J. Zhao, P. An, X. Huang, C. Yang, and L. Shen, “Light field image compression via cnn-based epi super-resolution and decoder-side quality enhancement,” IEEE Access 7, 135982–135998 (2019). [CrossRef]

7. W. Zhou, L. Shi, Z. Chen, and J. Zhang, “Tensor oriented no-reference light field image quality assessment,” IEEE Trans. on Image Process. 29, 4070–4084 (2020). [CrossRef]

8. X. Min, J. Zhou, G. Zhai, P. Le Callet, X. Yang, and X. Guan, “A metric for light field reconstruction, compression, and display quality evaluation,” IEEE Trans. on Image Process. 29, 3790–3804 (2020). [CrossRef]

9. C. Meng, P. An, X. Huang, C. Yang, and D. Liu, “Full reference light field image quality evaluation based on angular-spatial characteristic,” IEEE Signal Process. Lett. 27, 525–529 (2020). [CrossRef]

10. C. Meng, P. An, X. Huang, C. Yang, L. Shen, and B. Wang, “Objective quality assessment of lenslet light field image based on focus stack,” IEEE Trans. Multimedia 24, 3193–3207 (2022). [CrossRef]

11. Y. Momonoi, K. Yamamoto, Y. Yokote, A. Sato, and Y. Takaki, “Light field mirage using multiple flat-panel light field displays,” Opt. Express 29(7), 10406–10423 (2021). [CrossRef]

12. L. Zhu, G. Lv, L. Xv, Z. Wang, and Q. Feng, “Performance improvement for compressive light field display based on the depth distribution feature,” Opt. Express 29(14), 22403–22416 (2021). [CrossRef]

13. F. Zhou, F. Zhou, Y. Chen, J. Hua, W. Qiao, and L. Chen, “Vector light field display based on an intertwined flat lens with large depth of focus,” Optica 9(3), 288–294 (2022). [CrossRef]

14. “Lytro illum,” Available: https://www.lytro.com/, [Online].

15. L. Chen, “Raytrix|3d light field camera technology,” Available: https://www.raytrix.de/, [Online].

16. T. G. Georgiev, K. C. Zheng, B. Curless, D. Salesin, S. K. Nayar, and C. Intwala, “Spatio-angular resolution tradeoffs in integral photography,” Rendering Techniques 2006, 21 (2006).

17. G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, and Y. Liu, “Light field image processing: An overview,” IEEE J. Sel. Top. Signal Process. 11(7), 926–954 (2017). [CrossRef]

18. G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai, “Light field reconstruction using convolutional network on epi and extended applications,” IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1681–1694 (2019). [CrossRef]

19. N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM Trans. Graph. 35(6), 1–10 (2016). [CrossRef]

20. P. P. Srinivasan, T. Wang, A. Sreelal, R. Ramamoorthi, and R. Ng, “Learning to synthesize a 4d rgbd light field from a single image,” in Proceedings of the IEEE International Conference on Computer Vision, (2017), pp. 2243–2251.

21. Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan, “Lfnet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution,” IEEE Trans. on Image Process. 27(9), 4274–4286 (2018). [CrossRef]

22. Y. Yuan, Z. Cao, and L. Su, “Light-field image superresolution using a combined deep cnn based on epi,” IEEE Signal Process. Lett. 25(9), 1359–1363 (2018). [CrossRef]

23. Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. S. Kweon, “Light-field image super-resolution using convolutional neural network,” IEEE Signal Process. Lett. 24(6), 848–852 (2017). [CrossRef]

24. M. S. K. Gul and B. K. Gunturk, “Spatial and angular resolution enhancement of light fields using convolutional neural networks,” IEEE Trans. on Image Process. 27(5), 2146–2159 (2018). [CrossRef]

25. F. Cao, P. An, X. Huang, C. Yang, and Q. Wu, “Multi-models fusion for light field angular super-resolution,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2021), pp. 2365–2369.

26. J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 1646–1654.

27. C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016). [CrossRef]

28. C. Osendorfer, H. Soyer, and P. v. d. Smagt, “Image super-resolution with fast approximate convolutional sparse coding,” in International Conference on Neural Information Processing, (Springer, 2014), pp. 250–257.

29. B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, (2017), pp. 136–144.

30. Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 2472–2481.

31. W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 624–632.

32. W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 1874–1883.

33. J. Jin, J. Hou, H. Yuan, and S. Kwong, “Learning light field angular super-resolution via a geometry-aware network,” in Proceedings of the AAAI conference on artificial intelligence, (2020), pp. 11141–11148.

34. R. C. Bolles, H. H. Baker, and D. H. Marimont, “Epipolar-plane image analysis: An approach to determining structure from motion,” Int. J. Comput. Vision 1(1), 7–55 (1987). [CrossRef]

35. D. Liu, Y. Huang, Q. Wu, R. Ma, and P. An, “Multi-angular epipolar geometry based light field angular reconstruction network,” IEEE Trans. Comput. Imaging 6, 1507–1522 (2020). [CrossRef]

36. M. Levoy and P. Hanrahan, “Light field rendering,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, (1996), pp. 31–42.

37. S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, “The lumigraph,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, (1996), pp. 43–54.

38. O. Johannsen, K. Honauer, B. Goldluecke, et al., “A taxonomy and evaluation of dense light field depth estimation algorithms,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, (2017), pp. 82–99.

39. X. Wang, S. You, Y. Zan, and Y. Deng, “Fast light field angular resolution enhancement using convolutional neural network,” IEEE Access 9, 30216–30224 (2021). [CrossRef]

40. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]

41. A. S. Raj, M. Lowney, R. Shah, and G. Wetzstein, “Stanford lytro light field archive,” Available: http://lightfields.stanford.edu/LF2016.html, [Online].

42. K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A dataset and evaluation methodology for depth estimation on 4d light fields,” in Asian conference on computer vision, (Springer, 2016), pp. 19–34.

43. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, (2015), pp. 1026–1034.

44. M. Abadi, A. Agarwal, P. Barham, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467 (2016).

45. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

46. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia, (2014), pp. 675–678.

Candidate	30scenes	HCI	occlusions & reflective
$E_{R} 10 - L 10 - E_{C} 10$	44.7100/0.9912	41.3526/0.9751	42.4973/0.9828
$E_{R} 10 - E_{C} 10 - L 10$	44.7563/0.9914	41.1028/0.9730	42.5099/0.9830
$E_{C} 10 - E_{R} 10 - L 10$	44.7866/0.9914	41.3838/0.9769	42.4919/0.9829
$E_{C} 10 - L 10 - E_{R} 10$	44.8344/0.9914	41.3719/0.9745	42.4826/0.9829
$L 10 - E_{R} 10 - E_{C} 10$	44.9166/0.9913	41.5266/0.9772	42.5971/0.9829
$L 10 - E_{C} 10 - E_{R} 10$	44.9708/0.9914	41.3358/0.9735	42.6545/0.9830

Test set	LFSR-AR [24]	bicubic	EPICNN [18]	$F_{U P}$	Zhao et al. [6]	$L 10 - E_{C} 10 - E_{R} 10$	$L 10 - E_{R} 10 - E_{C} 10$
30scenes	37.3452/0.9602	40.7284/0.9791	41.2280/0.9809	42.3383/0.9838	43.9836/0.9902	44.9708/0.9914	44.9166/0.9913
HCI	30.3626/0.9375	32.5696/0.8619	36.3874/0.9310	33.0526/0.8649	37.5811/0.9401	41.3358/0.9735	41.5266/0.9772
occlusions & reflective	36.6967/0.9567	39.3124/0.9734	39.4964/0.9690	41.1993/0.9784	41.8416/0.9816	42.6545/0.9830	42.5971/0.9829

Candidate	30scenes	HCI	occlusions & reflective
$L 10 - E_{R} 10 - E_{C} 10$	42.0759/0.9862	36.1509/0.9291	39.2967/0.9735
$E_{C} 10 - E_{R} 10 - L 10$	42.1035/0.9863	36.5144/0.9349	39.2764/0.9735
$E_{R} 10 - E_{C} 10 - L 10$	42.1403/0.9859	36.7905/0.9370	39.2666/0.9732
$E_{R} 10 - L 10 - E_{C} 10$	42.1879/0.9864	36.9765/0.9368	39.2769/0.9734
$E_{C} 10 - L 10 - E_{R} 10$	42.1980/0.9863	36.6500/0.9373	39.3014/0.9731
$L 10 - E_{C} 10 - E_{R} 10$	42.4329/0.9866	36.6073/0.9363	39.3699/0.9734

Test set	LFSR-AR [24]	bicubic	$F_{U P}$	EPICNN [18]	Zhao et al. [6]	$E_{R} 10 - L 10 - E_{C} 10$	$L 10 - E_{C} 10 - E_{R} 10$
30scenes	34.6111/0.9328	34.9561/0.9345	36.7140/0.9441	37.9891/0.9638	41.2192/0.9837	42.1879/0.9864	42.4329/0.9866
HCI	15.3002/0.5506	28.9026/0.7754	29.5341/0.7838	31.9515/0.8610	33.4592/0.8867	36.9765/0.9368	36.6073/0.9363
occlusions & reflective	33.8701/0.9398	33.5165/0.9359	36.0122/0.9498	36.0299/0.9460	38.4027/0.9699	39.2769/0.9734	39.3699/0.9734

Sample rate	Test set	$L 5$	$E_{C} 5$	$E_{R} 5$	$E_{C} 5 - L 5$	$E_{R} 5 - L 5$	$L 5 - E_{R} 5$	$L 5 - E_{C} 5$	$E_{C} 5 - E_{R} 5$	$E_{R} 5 - E_{C} 5$
2 $\times$	30scenes	42.2800/0.9846	43.2008/0.9881	43.2560/0.9868	43.4495/0.9885	43.4568/0.9872	43.4988/0.9872	43.5385/0.9885	44.0097/0.9904	44.4041/0.9906
2 $\times$	HCI	32.1077/0.8640	35.7183/0.9103	35.8850/0.9040	36.0506/0.9144	36.0616/0.9059	36.1272/0.9086	35.9039/0.9117	39.1464/0.9528	39.2096/0.9525
4 $\times$	30scenes	36.9448/0.9534	38.6626/0.9641	38.4000/0.9608	39.1587/0.9690	38.8578/0.9684	39.4640/0.9699	39.0422/0.9699	40.9243/0.9818	41.1036/0.9822
4 $\times$	HCI	30.1913/0.8067	31.5947/0.8426	31.5326/0.8352	31.9042/0.8508	31.7963/0.8397	32.1196/0.8518	32.1723/0.8594	34.3056/0.9016	34.2622/0.8989

Light field reconstruction in angular domain with multi-models fusion through representation alternate convolution

Abstract

1. Introduction

2. Related work

2.1 Single image super-resolution

2.2 Novel views synthesis

2.3 LFSR in angular domain

3. Methodology

3.1 LF representations and their corresponding models

3.2 Forming the fusion framework

3.3 Representation alternate convolution

3.4 Up-sampling network

3.5 Loss function

3.5.1 Loss function for the up-sampling network

3.5.2 SR loss

3.5.3 Overall loss

4. Experiments

4.1 Datasets

4.2 Training configurations

4.3 Comparison with state-of-the-art methods

4.3.1 Comparison for 2${\times }$ SR

4.3.2 Comparison for 4${\times }$ SR

4.4 Ablation study

4.4.1 Factors that determine the performance

4.4.2 Variants of the overall framework

4.5 Limitations

5. Conclusion

Funding

Acknowledgments

Disclosures

Data availability

Supplemental document

References

Supplementary Material (1)

Data availability

Cited By

Figures (7)

Tables (10)

Equations (18)

Optics Express

Sample rate	Test set	$E_{R} 10$	$E_{C} 10$	$L 10$	$E_{R} 10 - L 10$	$L 10 - E_{R} 10$	$E_{C} 10 - L 10$	$L 10 - E_{C} 10$	$E_{C} 10 - E_{R} 10$	$E_{R} 10 - E_{C} 10$
2 $\times$	30scenes	43.4405/0.9871	43.5043/0.9884	43.9505/0.9895	44.2551/0.9903	44.3415/0.9905	44.3416/0.9906	44.3627/0.9905	44.6105/0.9911	44.6551/0.9911
2 $\times$	HCI	36.2260/0.9100	36.1310/0.9160	36.9937/0.9241	39.7228/0.9558	39.2520/0.9505	39.6665/0.9571	38.6571/0.9485	40.2300/0.9687	40.9076/0.9719
4 $\times$	30scenes	38.7651/0.9626	38.7238/0.9660	39.8939/0.9719	41.2364/0.9814	41.3488/0.9826	41.2898/0.9824	41.3230/0.9830	41.7528/0.9851	41.9848/0.9852
4 $\times$	HCI	32.0117/0.8448	32.0030/0.8522	31.9476/0.8465	34.8297/0.9076	33.8414/0.8879	34.4979/0.9073	33.4356/0.8849	35.7055/0.9297	36.1903/0.9297

Method	w/o end2end $+ F_{U P}$	w/o $F_{U P}$
$E_{R} 5 - L 5$	42.8580/0.9864	43.3770/0.9871
$L 5 - E_{R} 5$	43.1669/0.9865	43.3800/0.9871
$L 5 - E_{C} 5$	43.2627/0.9878	43.4918/0.9884
$E_{C} 5 - L 5$	43.3192/0.9877	43.4940/0.9884
$E_{R} 5 - E_{C} 5$	43.8440/0.9897	44.2917/0.9904
$E_{C} 5 - E_{R} 5$	44.0165/0.9898	44.2696/0.9904

Method	w/o end2end $+ F_{U P}$	w/o $F_{U P}$
$L 10 - E_{R} 10$	43.4000/0.9878	44.3654/0.9905
$L 10 - E_{C} 10$	43.4720/0.9887	44.3425/0.9905
$E_{R} 10 - L 10$	43.5150/0.9884	44.4496/0.9903
$E_{C} 10 - L 10$	43.6161/0.9890	44.2790/0.9905
$E_{R} 10 - E_{C} 10$	44.0542/0.9900	44.5232/0.9908
$E_{C} 10 - E_{R} 10$	44.1181/0.9901	44.5359/0.9909

Method	w/o end2end $+ F_{U P}$	w/o $F_{U P}$
$E_{C} 10 - L 10 - E_{R} 10$	43.9661/0.9901	44.8598/0.9912
$E_{R} 10 - L 10 - E_{C} 10$	44.0061/0.9900	44.6767/0.9911
$E_{C} 10 - E_{R} 10 - L 10$	44.1597/0.9902	44.6613/0.9911
$L 10 - E_{C} 10 - E_{R} 10$	44.1997/0.9904	44.7659/0.9912
$E_{R} 10 - E_{C} 10 - L 10$	44.2275/0.9902	44.7311/0.9911
$L 10 - E_{R} 10 - E_{C} 10$	44.2575/0.9904	44.7421/0.9912