
Matching entropy based disparity estimation from light field data

Open Access

Abstract

A major challenge for matching-based disparity estimation from light field data is to prevent mismatches in occlusion and smooth regions. An effective matching window, one satisfying the three characteristics of texture richness, disparity consistency, and anti-occlusion, can prevent mismatches to some extent. According to these characteristics, we propose matching entropy in the spatial domain of the light field to measure the amount of correct information in a matching window, which provides the criterion for matching window selection. Based on matching entropy regularization, we establish an optimization model for disparity estimation with a matching cost fidelity term. To find the optimum, we propose a two-step adaptive matching algorithm. First, the region type is adaptively determined to identify occluding, occluded, smooth, and textured regions. Then, the matching entropy criterion is used to adaptively select the size and shape of matching windows, as well as the visible viewpoints. The two-step process can reduce mismatches and redundant calculations by selecting effective matching windows. The experimental results on synthetic and real data show that the proposed method can effectively improve the accuracy of disparity estimation in occlusion and smooth regions and has strong robustness for different noise levels. Therefore, high-precision disparity estimation from 4D light field data is achieved.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

A light field [1,2] records the spatial and angular information of a set of light rays in the scene space and has been widely used in scene depth estimation and three-dimensional (3D) imaging [3–5]. From the perspective of data acquisition, light field data can be acquired directly by imaging devices or reconstructed indirectly from focal stacks or encoded masks. Integral imaging and camera arrays are two basic types of direct acquisition imaging systems. Gabriel Lippmann proposed integral photography in 1908 and captured the spatial-angular information of 3D scenes for the first time [6–8]. Integral imaging [5,6,9,10] can be regarded as a 3D imaging technique that captures and reproduces a light field by using a two-dimensional (2D) array of microlenses (or lenslets). In the light field capture mode, in which the detector is coupled to the microlens array, each microlens acquires an image of the subject as seen from the viewpoint of that lens's location. The display manner using integral imaging can be regarded as a type of light field display. In the reproduction mode, in which an object or source array is coupled to the microlens array, each microlens allows each observing eye to see only the area of the associated micro-image containing the portion of the subject.

In terms of theoretical modeling, E. H. Adelson et al. proposed the seven-dimensional (7D) plenoptic function $L(V_x, V_y, V_z,\phi,\varphi,\lambda,t)$ to describe the irradiance information of a light ray with any wavelength in space at any time [11]. Then, the 4D light field, a simplified two-plane representation suitable for optical imaging systems, was developed in many integral imaging systems [12] and camera array systems. The optical geometry of integral imaging systems can be implemented and visualized by substituting pinholes for the microlenses, as has been done for some demonstrations and special applications. The fundamentals, related techniques, and emerging applications of light field data and integral imaging techniques for 3D imaging and displays have been extensively studied and comprehensively summarized [5–10].

Scene disparity estimation from light field data is an essential problem in light field computational imaging, especially in 3D digital imaging. There are four categories of methods used to estimate disparity information from a 4D light field: matching-based, epipolar-geometry-based, focus-measure-based, and deep-learning-based methods. Matching-based methods [13–19] are an extension of stereo matching and can reduce the influence of light field spectrum aliasing and angular artifacts. However, matching often fails in smooth and occlusion regions. Epipolar plane images (EPIs) reveal the epipolar geometry of light fields [20–23]; therefore, the depth can be obtained by calculating the slope of the epipolar line in an EPI. Epipolar-geometry-based methods can achieve good results in occlusion regions, but they require a large amount of calculation and are sensitive to noise. Focus-measure-based methods obtain the depth by applying a focus measure to the focal stack [24–27]. Since the focal stack is the projection of the light field onto preset depth layers, the estimation accuracy depends on the sampling of the depth layers. Deep-learning-based methods replace complex depth estimation pipelines with neural networks [28–31], which require a large amount of training data and lack generalization ability.

Area matching is a commonly used technique in matching-based methods and makes use of window matching instead of pixel matching to improve the robustness. A unified matching window leads to calculation redundancy in textured regions and mismatches in occlusion and smooth regions. If we can determine the occluding, occluded, smooth, and textured regions, we will be able to improve the estimation accuracy and efficiency by selecting the effective window size and shape for different regions. For textured and smooth regions, selecting a matching window that covers enough textures is the key task. For occluding regions, selecting the shapes according to the occlusion geometry is the key task. For occluded regions, selecting the shape of the matching windows and the visible viewpoints are the key tasks.

To accomplish the key tasks, an effective matching window should satisfy three characteristics: texture richness, disparity consistency, and anti-occlusion, and thus provide enough valid matching information and as little invalid or incorrect information as possible. We propose matching entropy, corresponding to these characteristics, to measure the effectiveness of a matching window. With matching entropy acting as the regularization term, we establish an optimization model for disparity estimation and propose a two-step adaptive window matching method to solve the optimization model. In the first step, the region type is adaptively determined based on the segmentation and the texture information. In the second step, matching entropy is used as a criterion for the adaptive selection of the matching windows' shape and size, and the visible viewpoints. Figure 1 visualizes the scheme of the proposed matching entropy based disparity estimation.

2. Related work

The main implementation of light field computational imaging and computational display is integral imaging. The resolution and field of view (FOV) of light field data depend on the capability of integral imaging systems. The optimum lenslet size in the lenticular screen and the resolution limitation for integral imaging were derived in [32]. To improve the real-time performance of integral imaging systems, a real-time integral imaging method [33] was proposed to provide 3D autostereoscopic images of moving objects in real time by using microlens arrays. B. Javidi et al. proposed synthetic aperture integral imaging [34], in which an effectively extended FOV is obtained by moving a small integral imaging system, which greatly increases the FOV and resolution. Synchronously moving micro-optics (lenslet arrays) were utilized in an integral imaging system for image capture and display in order to overcome the resolution limitation imposed by Nyquist sampling [35]. F. Jin et al. clarified the effects of a finite number of pixels in elemental images on the resolution and the depth of focus in three-dimensional integral imaging [36].

Since light field data underpin the sensing, visualization, and 3D display of scene information, integral imaging systems are practical in many fields. S. H. Hong et al. proposed a 3D imaging technique based on integral imaging [37], which can perceive 3D scenes and reconstruct them into 3D volumetric images. The reconstruction of scene volume pixels is implemented by simulating optical reconstruction based on ray optics calculations. H. Arimoto et al. reconstructed 3D images by numerically processing an array of observed images formed by a microlens array [3]. The algorithms for reconstructing 3D images are robust and can obtain images viewed from arbitrary directions. A. Stern et al. proposed a computational synthetic aperture integral imaging technique [38], which can increase the FOV. The synthetic aperture is obtained by the relative motion of the imaging system and the object in a plane perpendicular to the optical axis. C. G. Luo et al. analyzed the depth of field (DOF) of integral imaging displays based on wave optics [39]. Considering the diffraction effect, the intensity distribution of light with multiple microlenses is analyzed, and the formula for calculating the DOF of the integral imaging display system is derived.

As a middle-level vision process in light field imaging, disparity estimation is an essential topic for high-precision 3D visual perception and high-fidelity 3D content generation. The applications of light field imaging, such as light field super-resolution, digital refocusing, light field compression, and light field editing, largely depend on the accurate estimation of disparity (or depth). In recent years, researchers in the field of optics have also been focusing on disparity estimation from light field data. The disparity resolution properties of light field data were analyzed in [40], where the epipolar analysis is limited to a small range to reduce runtime and is combined with regression testing to reduce the estimation error. An iterative scheme was proposed for fidelity reconstruction of scene depth from 4D light field data [41]. A novel active disparity estimation method [42] was proposed that directly uses the corresponding cues in structured light fields to search for the unambiguous disparity. A geometric model based on epipolar space [43] was proposed to determine the relationship between 3D points in a scene and the 4D light field, leading to a closed-form solution for geometric-model-based 3D shape measurement. The influence of plenoptic imaging distortion on light field disparity estimation was clarified in [44], where a light field disparity estimation method considering plenoptic imaging distortion is proposed; in addition, an accuracy analysis of light field depth estimation is performed using standard phantoms. To handle different types of occlusion, S. Ma et al. proposed side window subsets for angular coherence [45] and theoretically analyzed the ability of the proposed method to resist occlusions. Deep learning methods have also been explored for predicting scene disparity. X. Wang et al. proposed a convolutional neural network based on epipolar geometry and image segmentation for light field disparity estimation [46]. Multi-directional epipolar images are chosen as input data, and convolutional blocks are employed according to the disparity of the different directional epipolar images. B. Liu et al. proposed a light field disparity estimation network [47], which employs a cascaded cost volume architecture that can predict disparity maps in a coarse-to-fine manner by fully exploring the geometric features of sub-aperture images.

The scene disparity estimation approach that matches sub-aperture image arrays comes from area matching in stereo matching [48], because the sub-aperture images can be regarded as multiview images. The designs of matching windows and the matching costs are the key problems of area matching. Typical matching windows include weighted windows [49], reliable multi-scale and multi-windows (MSMWs) [50], and cross-based local windows [51]. A weighted window is a fixed shape window with radiance- or distance-based weights for pixels. An MSMW is selected from a window dictionary by minimizing the matching cost. A cross-based window is generated by a crisscross expansion of the anchor pixel according to the color consistency. The commonly used matching costs include the sum of absolute differences (SAD), the sum of squared differences (SSD), the normalized cross-correlation (NCC), and census [52].

Sub-aperture images can be regarded as a dense uniform sampling of the viewpoint plane with a small baseline. The small baseline leads to subpixel disparities, which can hardly be detected using conventional matching methods. Spatial interpolation can be used to solve this problem to a certain extent. However, the blur caused by the interpolation increases the possibility of mismatches. H. G. Jeon et al. [14] applied the phase shift theorem to estimate the subpixel offset between sub-aperture images. To reduce the mismatches in occlusion regions, J. Navarro et al. [13] used an MSMW [50] to estimate the disparity between the central view and the rest of the views in the same row and column and then used the median operator to extract the reliable disparity value. C. Chen et al. [15] proposed a bilateral metric considering the color consistency and the pixel distance in the reference window to improve the robustness in occlusion regions, but this method is sensitive to noise. W. Williem et al. [25] proposed an analysis of angular patches that forms a matching cost by combining an angular entropy metric and an adaptive defocus response. The angular entropy metric is more robust to occlusion but sensitive to noise, and the balance between the angular entropy and the adaptive defocus response is intractable. T. C. Wang et al. [26] proposed an occlusion-aware disparity estimation method based on occlusion edge cues; the accuracy of its disparity estimation results is highly dependent on edge detection. Using occlusion-noise-aware data costs, a constrained entropy cost in the angular domain of the light field was proposed to reduce the effects of the dominant occluder and noise in the angular patch, resulting in a low cost [53]. For super-resolution and disparity estimation, a generic mechanism was proposed to disentangle the coupled information in LF image processing, and a class of domain-specific convolutions was designed to disentangle LFs from different dimensions [54].

To reduce the mismatches in occluded and smooth regions, we propose matching entropy in the spatial domain of the light field to measure how well a matching window in different regions meets the three characteristics. The optimization model based on matching entropy regularization is utilized for disparity estimation in occlusion, smooth, and textured regions.

3. Optimization model based on matching entropy regularization

A fixed window for region matching may lead to mismatches in occlusion and smooth regions. An effective way to prevent such mismatches is to eliminate the part of the window that generates the mismatch and to increase the amount of information that can be matched correctly. In this paper, the shape of the matching window is used to eliminate the mismatched part, and the size of the matching window is used to increase the amount of correctly matchable information. We propose matching entropy to measure the amount of correct information in a matching window; hence, it becomes a criterion for matching window selection.

3.1 Matching entropy

To estimate the depth map accurately, every matching window $w(x,y)$ needs to contain a sufficient amount of effective matching information. The ideal matching window should satisfy three characteristics: texture richness, disparity consistency, and anti-occlusion. Texture richness is fundamental for area matching. Disparity consistency is the basic assumption of area matching, which ensures that the area remains invariant in different view images. Anti-occlusion is essential for accurate and robust matching in occlusion regions. According to these characteristics, we define the matching entropy of a window $w(x,y)$ to measure the amount of effective matching information.

Definition 1 For a light field $L(u,v,x,y)$, the matching entropy of a window $w(x,y)$ in the central view image $L_{\left (u_{0}, v_{0}\right )}(x, y)$ is defined as

$$E^{entropy}[w(x, y)]={-}\sum_{k=1}^{K} p_{k} \cdot \log \left(p_{k}\right)+\alpha_{1} \sum_{k=1}^{K^{\prime}} p_{k}^{\prime} \cdot \log \left(p_{k}^{\prime}\right)+\alpha_{2} \sum_{k=1}^{K^{\prime \prime}} p_{k}^{\prime \prime} \cdot \log \left(p_{k}^{\prime \prime}\right),$$
where $p_k$ and $p_k^{\prime}$ stand for the probabilities of the gray value and the disparity value of the $k$th pixel in $w(x,y)$, respectively, while $p_k^{\prime\prime}$ is the probability of the gray value of the $k$th pixel among the mismatched pixels in $w(x,y)$. $\alpha_1 \geq 0$ and $\alpha_2 \geq 0$ denote the weight coefficients, and $\alpha_2 = 0$ when there is no occlusion in $w(x,y)$.

The three terms of the matching entropy function refer to texture richness, disparity consistency, and anti-occlusion respectively. $p_k$ and $p_k^{\prime }$ are calculated from the gray histogram and the disparity histogram of $w(x,y)$, respectively, and ${p}_{k}^{\prime \prime }$ is obtained from the gray histogram of the mismatched pixels in $w(x,y)$. In the anti-occlusion term, the mismatched pixels are the occluded pixels in $w(x,y)$ if Pixel $(x,y)$ occludes other pixels, and are the occluding pixels if Pixel $(x,y)$ is occluded.
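As an illustration, the following is a minimal Python sketch of Definition 1. The histogram bin counts, default weights, and function names are our own choices for the example, not values prescribed by the paper; the mismatch mask is assumed to mark the occluded (or occluding) pixels described above, as obtained from the region analysis of Section 4.

```python
import numpy as np

def hist_entropy(values, bins):
    """Shannon entropy of the normalized histogram of `values` (0 for an empty set)."""
    if values.size == 0:
        return 0.0
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / values.size
    return float(-np.sum(p * np.log(p)))

def matching_entropy(gray_win, disp_win, mismatch_mask,
                     alpha1=1.0, alpha2=1.0, gray_bins=32, disp_bins=16):
    """Matching entropy of one window, Eq. (1): the texture term rewards gray-level
    diversity, while the disparity and anti-occlusion terms penalize it."""
    if not mismatch_mask.any():
        alpha2 = 0.0                                               # alpha_2 = 0 when no occlusion
    e_texture = hist_entropy(gray_win.ravel(), gray_bins)          # -sum p_k log p_k
    e_disp = hist_entropy(disp_win.ravel(), disp_bins)             # disparity consistency term
    e_occ = hist_entropy(gray_win[mismatch_mask], gray_bins)       # mismatched pixels
    return e_texture - alpha1 * e_disp - alpha2 * e_occ
```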

3.2 Optimization model

In the light field data, the scene point is projected in different views [41,55,56] as shown in Fig. 2. Under the perspective projection, the scene points projected onto the $(x,y)$ plane form a scene surface in 3D space. The depth $d(x,y)$, disparity $disp(x,y)$, and scene surface $\vec {S}(x,y)$ are all represented and defined in the same coordinate system $(x,y)$.

Fig. 1. Scheme of the proposed matching entropy based disparity estimation.

Fig. 2. (a) Two-plane parameterized light field. (b) Diagram of a scene point projected in different views. $(u,v)$ and $(x,y)$ represent the viewpoint plane and the image plane, respectively. $F$ is the distance between the two parameterization planes, and $(u_0, v_0)$ is the central viewpoint. $d(x,y)$ and $disp(x,y)$ are the depth and the disparity of the scene point ${S}$, respectively.

The relationship between a scene point $\vec {S}(x,y)$ and its depth map $d(x,y)$ is

$$\vec{S}(x, y)=\left(-\frac{F x}{d(x, y)}+u_{0},-\frac{F y}{d(x, y)}+v_{0}, d(x, y)\right)^{T} .$$

The relationship between a scene point $\vec {S}\left ({x},{y}\right )$ and its disparity ${disp}\left ({x},{y}\right )$ is

$$\vec{S}(x, y)=\left(-x({disp}(x, y)-1)+u_{0},\,-y({disp}(x, y)-1)+v_{0},\, \frac{F}{{disp}(x, y)-1}\right)^{T} .$$

The coordinates of $\vec{S}(x,y)$ from viewpoint $(u,v)$ are denoted as $(x,y)_{u,v}$. Then, the relationship between the image coordinates under viewpoints $(u_0,v_0)$ and $(u,v)$ is

$$(x, y)_{u, v}=(x, y)_{u_{0}, v_{0}}+\left(\left(u-u_{0}\right) \cdot {disp}(x, y),\left(v-v_{0}\right) \cdot {disp}(x, y)\right) .$$

According to Eq. (4), the matching term acting as the fidelity term of the optimization model can be defined as

$$\begin{aligned} E^{\text{match}}[disp(x, y), w(x, y)] &=\sum_{(u, v) \in \Phi} \sum_{(m, n) \in w(x, y)} \beta_{(m, n)}\left\|L_{u_{0}, v_{0}}\left((x, y)_{u_{0}, v_{0}}+(m, n)\right)-L_{u, v}\left((x, y)_{u, v}+(m, n)\right)\right\| \\ &=\sum_{(u, v) \in \Phi} \sum_{(m, n) \in w(x, y)} \beta_{(m, n)}\left\|L_{u_{0}, v_{0}}\left((x, y)_{u_{0}, v_{0}}+(m, n)\right)-L_{u, v}\left((x, y)_{u_{0}, v_{0}}+\left(\left(u-u_{0}\right) \cdot disp(x, y),\left(v-v_{0}\right) \cdot disp(x, y)\right)+(m, n)\right)\right\| , \end{aligned}$$
where ${\Phi }$ represents the set of visible viewpoints in the matching window ${w}\left ({x},{y}\right )$, ${\beta }_{(m, n)}=1/\#\left \{w(x,y) \right \}$ is the weight coefficient, and $\#\left \{\cdot \right \}$ denotes the number of pixels in the window.
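A minimal sketch of the fidelity term in Eq. (5) is given below. It assumes a grayscale light field indexed as lf[u, v, x, y] and rounds the view-to-view shift of Eq. (4) to the nearest pixel; a practical implementation would use sub-pixel interpolation (e.g., the phase shift of [14]). The function name and argument layout are our own.

```python
def matching_cost(lf, x, y, disp, offsets, viewpoints, u0, v0):
    """Fidelity term E^match of Eq. (5) for one candidate disparity at pixel (x, y).

    offsets    : list of (m, n) offsets defining the matching window w(x, y)
    viewpoints : visible viewpoint set Phi, a list of (u, v) pairs
    """
    beta = 1.0 / len(offsets)                  # uniform weight 1 / #{w(x, y)}
    cost = 0.0
    for (u, v) in viewpoints:
        dx = (u - u0) * disp                   # Eq. (4): shift from view (u0, v0) to (u, v)
        dy = (v - v0) * disp
        xs, ys = int(round(x + dx)), int(round(y + dy))
        for (m, n) in offsets:
            cost += beta * abs(float(lf[u0, v0, x + m, y + n])
                               - float(lf[u, v, xs + m, ys + n]))
    return cost
```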

Combining the matching term with the matching entropy term, we establish the objective functional for disparity estimation:

$$E[disp(x, y), w(x, y)]=E^{\text{match }}[disp(x, y), w(x, y)]-\lambda E^{\text{entropy}}[w(x, y)] ,$$
where ${E}^{\text {entropy}}[w(x,y)]$ is the matching entropy term acting as the regularization term and ${\lambda }$ is the regularization parameter.

By solving the following optimization problem, the effective matching windows are selected according to the matching entropy, and the disparity map is estimated at the same time:

$$\left[disp^{*}(x, y), w^{*}(x, y)\right]=\arg\min_{disp, w}(E(disp(x, y), w(x, y))) .$$

Since the optimal matching windows contain sufficient effective information without mismatching information, minimizing the fidelity term with the optimal windows can realize accurate and robust disparity estimation.
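In practice, Eq. (7) is solved in two steps, as detailed in Section 4: the window (with its visible viewpoint set) is chosen by maximizing the matching entropy term, and the disparity is then found by a line search over candidate values. A compact sketch, reusing the hypothetical matching_cost helper above and an entropy_of callable that evaluates Eq. (1) for a candidate window:

```python
def estimate_disparity(lf, x, y, candidate_windows, disp_candidates, entropy_of, u0, v0):
    """Two-step solution of Eq. (7) for a single pixel (x, y).

    candidate_windows : iterable of (offsets, viewpoints) pairs for this region type
    entropy_of        : callable mapping (offsets, viewpoints) to its matching entropy
    """
    offsets, viewpoints = max(candidate_windows, key=entropy_of)     # maximize E^entropy
    return min(disp_candidates,                                      # minimize E^match
               key=lambda d: matching_cost(lf, x, y, d, offsets, viewpoints, u0, v0))
```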

4. Implementation of disparity estimation by adaptive region matching

4.1 Adaptive identification of region types

Since the selection of the matching window depends on the region type, the adaptive identification of the region type is the prerequisite for maximizing the matching entropy term.

4.1.1 Indicator for occluding and occluded regions

Since one pixel of the central view image corresponds to one scene point, we classify the pixels of the central view image into four types. Consequently, the central view image can be divided into occluding, occluded, textured, and smooth regions, which are labeled as $\Omega _\text {occluding}$, $\Omega _\text {occluded}$, $\Omega _\text {texture}$, and $\Omega _\text {smooth}$, respectively. The region indicator function is denoted as $I(x,y)$:

$$I(x, y)=\left\{\begin{array}{ll} 0, & (x, y) \in \Omega_{\text{occluding}} \\ 1, & (x, y) \in \Omega_{\text{occluded}} \\ 2, & (x, y) \in \Omega_{\text{texture}} \\ 3, & (x, y) \in \Omega_{\text{smooth}} \end{array}\right. .$$

A sub-aperture image array is a visualization of 4D light field data, as shown in Fig. 3. As the viewpoint varies, the occluding and occluded regions in the images change. The occluding region is the edge area of the object that causes occlusion; therefore, the light it emits is visible from all viewpoints. The occluded region is the edge area of the occluded object, and the light it emits is not visible from some viewpoints.

Fig. 3. (a) $9 \times 9$ sub-aperture image array of the Greek scene's light field. (b) Close-ups of sub-aperture images visually displaying occluding and occluded pixels.

Since occlusion only exists around the edges of objects, occluding and occluded regions can be identified by the differences between the segmentations of the central view image and other sub-aperture images. Considering the farthest sub-aperture images from the central view in eight directions, the occluding and occluded regions can be determined according to the differences. Let $\tilde {\phi }$ be the viewpoint index set of the nine sub-aperture images used to indicate occlusion.

Applying alpha matting [57] to the sub-aperture images $L_{(u_i,v_j)}(x,y)$, $(u_i,v_j)\in\tilde{\phi}$, we obtain the segmentations $M_{(u_i,v_j)}(x,y)$, as shown in Fig. 4(b). If a scene point is visible in the central view but not visible in some views, then the scene point is occluded, and its corresponding pixel is an occluded pixel. Conversely, for the scene point in front that occludes the occluded scene point, the corresponding pixel is an occluding pixel. In other words, occluding pixels occlude from the front, while occluded pixels are occluded in the back. For the segmented image $M_{(u_0,v_0)}(x,y)$ of the central view and the segmented image $M_{(u_i,v_j)}(x,y)$ of the $(u_i,v_j)$ view, the closer the object is, the greater the pixel value in the segmentation image. Occlusion occurs at the edges of objects; therefore, for occluding pixels $(x,y) \in \Omega _{\text {occluding}}$, $M_{(u_0,v_0)}(x,y) > M_{(u_i,v_j)}(x,y)$. For occluded pixels $(x,y) \in \Omega _{\text {occluded}}$, $M_{(u_0,v_0)}(x,y) < M_{(u_i,v_j)}(x,y)$. For non-occlusion cases, $M_{(u_0,v_0)}(x,y) = M_{(u_i,v_j)}(x,y)$.

Fig. 4. Results of adaptively identifying the region types for the Greek scene and the Platonic scene. (a) The central view image. (b) The segmentation of the central view. (c) The region types. The occluding, occluded, textured, and smooth regions are marked as red, blue, yellow, and green pixels, respectively, in (c).

The difference between the central view segmentation $M_{(u_0, v_0)}(x, y)$ and the other segmentations is calculated as

$$diff(x, y)=\sum_{(i, j) \in \tilde{\phi}}\left(M_{\left(u_{0}, v_{0}\right)}(x, y)-M_{\left(u_{i}, v_{j}\right)}(x+\Delta u, y+\Delta v)\right) ,$$
where $\Delta {u}=u_i-u_0$, $\Delta {v}=v_j-v_0$.

As a result, the occluded regions ${\Omega }_{\text {occluded}}$ and the occluding regions ${\Omega }_{\text {occluding}}$ can be identified.

$$\Omega_{\text{occluded }}=\{(x, y) \mid diff(x, y)<0\} ,$$
$$\Omega_{\text{occluding }}=\{(x, y) \mid diff(x, y)>0, (x, y) \notin \Omega_{\text{occluded}}\} .$$

From Eq. (11), the occluding pixels are defined to be visible under all viewpoints. Therefore, as long as a scene point is occluded at some viewpoint, the corresponding pixel is classified as an occluded pixel. An object in a middle depth layer may block the object points behind it at some viewpoints and may itself be blocked by object points in front of it at other viewpoints.
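A NumPy sketch of Eqs. (9)-(11) is shown below. It assumes that x indexes rows and y indexes columns of the segmentation maps, and it uses a wrap-around shift for the $(\Delta u, \Delta v)$ offset for brevity; a real implementation would handle the image borders explicitly.

```python
import numpy as np

def occlusion_regions(M_center, M_views, view_offsets):
    """Identify occluded and occluding regions from segmentation differences.

    M_center     : segmentation M_(u0, v0) of the central view, shape (H, W)
    M_views      : segmentations M_(ui, vj) for the viewpoints in phi~
    view_offsets : matching list of (du, dv) = (ui - u0, vj - v0)
    """
    diff = np.zeros_like(M_center, dtype=float)
    for M, (du, dv) in zip(M_views, view_offsets):
        shifted = np.roll(np.roll(M, -du, axis=0), -dv, axis=1)   # M(x + du, y + dv)
        diff += M_center - shifted                                 # Eq. (9)
    occluded = diff < 0                                            # Eq. (10)
    occluding = (diff > 0) & ~occluded                             # Eq. (11)
    return occluded, occluding
```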

4.1.2 Indicator for textured and smooth regions

In smooth regions, the intensities of pixels in a local neighborhood tend to be similar. Therefore, the statistical intensity characteristics in the neighborhood can be used to measure the smoothness, and then the smooth region ${\Omega }_{\text {smooth}}$ can be identified as

$$\Omega_{\text{smooth }}=\left\{(x, y) \mid \psi(x, y)<\tau,(x, y) \notin \Omega_{\text{occluded }} \cup \Omega_{\text{occluding }}\right\},$$
where ${\psi }\left ({x},{y}\right )$ is the number of pixels within the neighborhood that have different pixel values from $\left ({x},{y}\right )$. In this paper, we choose the empirical parameter $\tau = \frac {N}{2}$, where $N$ is the number of pixels within the neighborhood.
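The smoothness test of Eq. (12) can be sketched as follows; the neighborhood radius and the intensity tolerance used to decide whether two pixels "differ" are our own example parameters, not values fixed by the paper.

```python
import numpy as np

def smooth_region(image, occlusion_mask, radius=2, tol=0):
    """Eq. (12): a pixel is smooth when psi(x, y) < tau = N / 2 and it is not an
    occlusion pixel. N is the number of pixels in the (2*radius+1)^2 neighborhood."""
    H, W = image.shape
    smooth = np.zeros((H, W), dtype=bool)
    for x in range(radius, H - radius):
        for y in range(radius, W - radius):
            patch = image[x - radius:x + radius + 1, y - radius:y + radius + 1]
            psi = np.count_nonzero(np.abs(patch.astype(int) - int(image[x, y])) > tol)
            smooth[x, y] = (psi < patch.size / 2) and (not occlusion_mask[x, y])
    return smooth
```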

After the occluding, occluded, and smooth regions are identified, the remaining regions are the textured regions ${\Omega }_{\text {texture}}$. Both occluding and occluded regions could be further classified into textured or smooth regions. In our disparity estimation process, both the shape and size of the matching windows are carefully selected for occlusion regions, whereas for textured and smooth regions only the size of the matching windows is considered; therefore, occlusion regions do not need to be further classified. In our postprocessing refinement, only smooth regions are refined with TV regularization to deal with "black holes". Since "black holes" rarely appear around edges where occlusion may occur, further classification of the occlusion regions is also unnecessary in postprocessing. As a result, we classify the image into four region types with no overlap. Taking the Greek scene and the Platonic scene as examples, the region identification results are shown in Fig. 4.

4.2 Adaptive selection of matching window by region type

Based on the identified region, optimal matching windows can be designed for different regions. The shape and size of the windows are obtained by maximizing the matching entropy term. In this work, the disparity consistency part of the matching entropy term is calculated by the initial disparity map extracted from the images in the viewpoint set $\tilde {\phi }$.

4.2.1 Matching window selection and visible viewpoint set adoption for occlusion regions

For anti-occlusion in area matching, the matching window of occluding pixels should not contain occluded pixels, while the matching window of occluded pixels should not contain occluding pixels. As a result, the key for selecting a matching window with high matching entropy is to find the effective shape of the matching window and the visible viewpoint set to exclude pixels in the opposite occlusion situation.

To select matching windows for occlusion regions, we should determine the shape and the size of each window. By considering the directions of the occlusion, we preset the eight window shapes $W_i (i=1,\ldots,8)$ as shown in Fig. 5. We also preset the size range of the window from ${3}\times {3}$ to ${15}\times {15}$. Then the optimal shape and size can be searched from the preset shapes and sizes to reach the maximum matching entropy.
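The search over the preset shapes and sizes can be sketched as follows. Here shapes is assumed to map each shape label $W_1,\ldots,W_8$ to a function that returns the $(m,n)$ offsets of that shape at a given size, and entropy_of evaluates Eq. (1) on a window; both are placeholders for the paper's actual presets.

```python
def select_occlusion_window(x, y, shapes, entropy_of, sizes=range(3, 16, 2)):
    """Pick the (shape, size) pair with maximum matching entropy at pixel (x, y)."""
    best = None
    for label, make_offsets in shapes.items():           # W_1 .. W_8
        for size in sizes:                                # 3x3 up to 15x15
            offsets = make_offsets(size)
            e = entropy_of(x, y, offsets)
            if best is None or e > best[0]:
                best = (e, label, size, offsets)
    return best                                           # (entropy, label, size, offsets)
```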

Fig. 5. The preset matching window shapes. The 8 preset shapes are marked as blue pixels.

Taking the occluding Pixel1 in Fig. 6(a) as an example, Fig. 6(b) shows the relationship between the matching entropy values and the preset matching windows. The $X$-axis gives the window shape labels, and the $Y$-axis gives the matching entropy value. Different line colors represent different window sizes. For Pixel1, the maximum matching entropy value lies on the purple line at shape label 7, which means the optimal window is a $W_{7}$ window with a size of ${9}\times {9}$.

Fig. 6. (a) Pixel1 and Pixel2 are the representative pixels for the occluding pixel and the occluded pixel, respectively. (b) and (c) show the change curves of matching entropy as the window changes for Pixel1 and Pixel2, respectively. The colors of the curves represent windows of different sizes.

For smooth, textured, and occluding regions, the windows are visible from all viewpoints; thus, $\Phi$ in these regions should be the complete viewpoint set. For occluded regions, some pixels in the window are not visible from some viewpoints, so it is necessary to eliminate these viewpoints from the complete viewpoint set. For each window shape, there is a corresponding visible viewpoint set, as shown in Fig. 7. For instance, the pixels in the window $W_1$ are visible under the viewpoints in Fig. 7(1).

Fig. 7. The set of preset visible viewpoints. The blue viewpoints represent the visible viewpoints.

To verify the effectiveness of selecting the visible viewpoint set for occluded pixels, we calculate the matching cost of Pixel2 in Fig. 6 under different disparity values with the traditional fixed matching window, the adaptive matching window, and the adaptive matching window with the visible viewpoint set. The relationships between the disparity values and the matching costs are shown as the curves in Fig. 8. The ground truth is $-1.42$, and the minimum point in (c) leads to the most accurate disparity value.

Fig. 8. The relationships between the matching costs and the disparity values via different matching windows and viewpoint sets. (a) The fixed square matching window with the complete viewpoint set. (b) The adaptively selected window with the complete viewpoint set. (c) The adaptively selected window with the visible viewpoint set.

4.2.2 Matching window selection for smooth and textured regions

In smooth and textured regions, the main consideration of the matching window design is to make the window size cover effective texture richness. Taking the matching entropy and computational cost into consideration, the optimal window size is searched from $3\times 3$ to $15\times 15$.

In Fig. 9, we select four pixels with different disparity consistencies from a neighborhood to verify the effectiveness of the designed matching window. The second row of Fig. 9 shows the relationship between the matching entropy value and the matching window size. The $X$-axis is the window size, and the $Y$-axis is the matching entropy value. Pixel1 is the least consistent pixel, and Pixel4 is the most consistent pixel. From the changing curve, the optimal window size for Pixel1 is ${3}\times {3}$, and the optimal window size for Pixel4 is ${11}\times {11}$.

Fig. 9. Four typical matched pixels with different disparity consistencies and the relationship between the matching entropy value and the window size. The typical matched pixels in the local region and the corresponding disparity map are shown as close-ups. Pixel4 is on a flat background, and there is no difference in the disparity values in its neighborhood, so Pixel4 has the best disparity consistency. Pixel1 and Pixel2 are on the same wall in the scene; Pixel1 is near, while Pixel2 is far away, so Pixel2 has better disparity consistency than Pixel1.

4.3 Disparity estimation and postprocessing

4.3.1 Disparity estimation and refinement

After adaptively selecting the optimal matching window and determining the visible viewpoint sets, a disparity map can be estimated by minimizing the objective functional. Since the effective information of the matching windows in smooth regions may be insufficient, the smooth regions appear as "black holes" in the disparity map. The total variation (TV) model [58] is used in the smooth regions to eliminate the "black holes" and refine the disparity value $disp^\ast \left (x,y\right )$:

$$\mu(x, y)=\arg \min _{\mu}\left\{\iint_{\Omega_{\text{smooth}}} \left(\mu(x, y)-{disp}^{*}(x, y)\right)^{2} \,dx\, dy+\gamma \iint_{\Omega_{\text{smooth}}} |\nabla \mu(x, y)| \,dx\, dy\right\},$$
where $\mu \left (x,y\right )$ is the disparity map refined by the TV model, ${\nabla \mu }\left (x,y\right )$ is the gradient of the disparity $\mu \left (x,y\right )$, and $\gamma$ is the regularization parameter.

In addition, the line search method produces jagged estimation results. The TV model can reduce the jaggedness but blurs the edges at the same time. Therefore, we apply TV only in smooth regions.
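Equation (13) is a ROF-type total variation denoising restricted to the smooth regions. A minimal sketch using the Chambolle solver from scikit-image is given below; this solver is our stand-in, not necessarily the one used in the paper, and its weight parameter only roughly plays the role of $\gamma$. Denoising the whole map and copying back only the smooth pixels approximates restricting the functional to $\Omega_{\text{smooth}}$.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def refine_smooth_regions(disp, smooth_mask, gamma=0.2):
    """TV-refine the disparity map as in Eq. (13) inside the smooth regions only,
    leaving edges in occlusion and textured regions untouched."""
    refined = denoise_tv_chambolle(disp.astype(np.float64), weight=gamma)
    out = disp.astype(np.float64).copy()
    out[smooth_mask] = refined[smooth_mask]
    return out
```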

5. Experimental results

In this section, experiments on both synthetic and real data are performed to evaluate the effectiveness of the proposed method. We compare the results of our method with those of five other methods on the 4D Light Field Benchmark Dataset [59]. The real data are acquired by a camera mounted on a three-axis translation platform. Our method was implemented in MATLAB R2020b. We set $\alpha _1=1$, $\lambda =1$, and $\gamma =0.2$; $\alpha _2=1$ in occluding and occluded regions, while $\alpha _2=0$ in textured and smooth regions.

5.1 Evaluation and comparison of the algorithms

In the 4D Light Field Benchmark dataset, the light field data provided for each scene are a $9\times 9$ sub-aperture image array with a spatial resolution of $512\times 512$. We compare the performance of our method with that of five state-of-the-art methods: LF [14], epi1 [22], LF_OCC [26], MV [59], and mvcmv0 [59].

Four different scenarios are selected from the dataset to evaluate the performance of the proposed method, as shown in Fig. 10. The Backgammon scene is designed to assess the interplay of fine structures, occlusion boundaries, and disparity differences. The Dots scene is designed to assess the effect of camera noise on the estimation of objects of varying sizes. The Pyramid scene is used to evaluate the performance of the algorithm in convex, concave, circular, and planar geometries. The Cotton scene is closer to a real scene with less artificial design and is used to evaluate the estimation accuracy in smooth and textured regions.

Fig. 10. Backgammon scene (1st row), Dots scene (2nd row), Pyramid scene (3rd row), and Cotton scene (4th row). (a) Central view image, (b) Our result, (c) LF, (d) LF_OCC, (e) MV, (f) mvcmv0, (g) epi1. The results of (c)-(g) come from [59].

Fig. 11. The radar chart on the evaluation scenes. Our proposed matching entropy based method has advantages on several indicators.

In Fig. 10, the disparity estimation results of the Backgammon scene show that LF_OCC and mvcmv0 produce noise at edges, epi1 loses edge details, and LF produces a certain degree of blur at the gaps between jagged areas. In contrast, our method and MV can maintain fine edge information and gap structures. The results of the Dots scene show that our method is robust to noise; our method, MV, and epi1 can estimate more dot structures when the noise level is high. The results of the Pyramid scene show that our method and epi1 obtain smoother results on convex and concave inclined planes. For the Cotton scene, the estimation result of our method is blurred to a certain extent, but more details are maintained at the boundary of the foreground.

We thoroughly assess and compare the six methods by summarizing all scores computed for each scene and the associated metrics into a radar chart, as shown in Fig. 11. Each radar axis represents one metric, where the zero in the center represents perfect performance. Backgammon Thinning and Backgammon Fattening are used to evaluate the preservation of fine structures in the Backgammon scene. The Dots Missed Dots and Dots Background mean square error (MSE) metrics are used to evaluate the anti-noise performance in the Dots scene. The Pyramids Bump Slanted and Pyramids Bump Parallel metrics are used to evaluate the smoothness of the slanted and parallel planes in the Pyramid scene. The MSE is the median MSE of the four scenarios and indicates the comprehensive performance of each method. The comprehensive performance of our method is at the middle level according to the MSE score. In terms of Pyramids Bump Slanted, Pyramids Bump Parallel, Dots Missed Dots, and Backgammon Fattening, our method performs best. In terms of Backgammon Thinning, our method is mediocre. In terms of Dots Background MSE, the performance is weak. The overall performance indicates that our method can estimate more accurate disparity maps in smooth and textured regions and can handle occlusion regions well when the noise levels are high. Furthermore, with increases in the noise and fineness levels, the estimation accuracy of the background is more likely to decrease than that of the foreground structure.

5.2 Error analysis in occlusion regions

Occlusion exists around the edges of the scene, and the blurriness of the edges can reflect the estimation accuracy of the occlusion regions. To further evaluate the effectiveness of our method in occlusion and smooth regions, we analyze the Platonic scene in detail by drawing profiles to compare the ground truth disparity map and the calculated disparity map, as shown in Fig. 12. The profile positions are selected to include as many occlusion regions as possible, and we focus on the accuracy of the estimation results in the occlusion regions.

Fig. 12. The results of the Platonic scene. (a) Ground truth of the disparity. (b) Initial disparity map. (c) Refined disparity map. The positions of the profiles, i.e., Line 100, Line 230, and Line 430, are marked.

Gaps in the disparity values occur across object edges. In Fig. 13(a)-(c), it can be seen that the initial and refined disparity values jump at the positions where the ground truth disparity values jump. This shows that our method maintains edges well and can accurately estimate disparity values in occlusion areas. In addition, after TV refinement, the profile is closer to the ground truth, with fewer jagged areas and false jumps.

Fig. 13. Profiles of the disparity maps. (a) Profiles of Line 100. (b) Profiles of Line 230. (c) Profiles of Line 430. The red line represents the ground truth disparity, the green line represents the initial disparity, and the blue line represents the refined disparity.

To further analyze the effectiveness in complex occlusion scenarios, we adopt the Pillows scene, as shown in Fig. 14. We use the profiles of the disparity map (shown in Fig. 15) to analyze the occlusion processing results of the proposed method. In Fig. 14, both Line 120 and Line 240 pass through complex occlusion regions. From Fig. 15(a) and (b), it can be seen that the jump positions of the blue line and the red line coincide, indicating that the proposed method is effective in complex occlusion situations.

Fig. 14. The results of the Pillows scene. (a) Ground truth of the disparity. (b) Initial disparity map. (c) Refined disparity map. The positions of the profiles are marked at Line 120, Line 240, and Line 430.

Fig. 15. Profiles of the disparity maps. (a) Profiles of Line 120. (b) Profiles of Line 240. (c) Profiles of Line 430. The red line represents the ground truth disparity, the green line represents the initial disparity, and the blue line represents the refined disparity.

5.3 Experiments on real data

In the experiment, a camera controlled by a three-axis translation platform is used to collect 2D images from $9\times 9$ different viewpoints uniformly spaced in a plane at an interval of 0.5 mm to obtain 4D light field data. The resolution of the detector is $1280\times 980$, and the focal length of the lens is 35 mm.

The effectiveness of our method at disparity gaps is verified in the first real data experiment, where four standard cubes are placed at four different disparities in a range of $[90, 100]$ cm. The effectiveness of our method when the disparity changes continuously is verified in the second real data experiment, where regular-shaped blocks (a standard pyramid, cuboid, cone, hemisphere, and cylinder) are placed in the disparity range of $[90, 100]$ cm. The effectiveness of our method and the benefit of refinement in smooth and occlusion regions in real scenarios are verified in the third real data experiment, where a Tiger Piran plant is placed at a disparity range of $[85, 100]$ cm.

Figure 16(b) shows that the edges of every cube in the estimated disparity map are clear and sharp, which indicates that our method is able to handle disparity gaps well. Figure 17(b) shows that when the disparity of the objects changes continuously, our method can maintain continuity in the estimation result.

Fig. 16. The results of the first real data experiment. (a) Central view image of Cubes scene. (b) Disparity map.

Fig. 17. The results of the second real data experiment. (a) Central view image of Regular-shaped Blocks scene. (b) Disparity map.

Figure 18 shows the result of the third real data experiment. In the real data experiments, the Tiger Piran scene contains complex occlusion relationships between plant leaves. We can see that our method can produce a disparity map with quite high accuracy for real scenes with smooth, textured, and occlusion regions. To further evaluate the benefit of TV refinement, we focus on the smooth and occlusion regions shown in Fig. 19 and Fig. 20. In Fig. 19, we notice that there are "black holes" in the initial disparity map due to the smoothness. The refinement can repair these "black holes" and improve the disparity estimation quality. In Fig. 20, we use the segmentation map to mark the edges where occlusion occurs. Then, we draw a profile (Line 480) and mark the edge positions on the profile as blue stars. Figure 20(c) shows that the jumping positions of the profile coincide with the edge positions. It indicates that the refinement is able to preserve the edge information. The experiments show that the proposed method can reconstruct the disparity of the occlusion regions in complex situations with multiple layers of occlusion.

Fig. 18. The results of the third real data experiment. (a) Central view image of Tiger Piran scene. (b) Disparity map.

Fig. 19. Evaluations in the smooth regions. (a) Initial disparity map. (b) Refined disparity map by the TV model. (c) Smooth region labeled map. The close-ups are shown in the right column.

Fig. 20. Evaluations in the occlusion regions. (a) The segmentation image of the central view image. (b) The estimated disparity map. (c) Profile and edge positions. The position of the profile is marked as the red line in (a) and (b), and the positions of the edges where occlusion occurs are marked as the blue stars.

6. Discussion

Region matching can effectively estimate disparity information only if the matching windows contain sufficiently rich texture information, meet the consistency condition of the initial disparity, and retain only correct matching information when occlusion occurs. To measure the effectiveness of each matching window, the concept of matching entropy is proposed to form the constraint for matching window selection and visible viewpoint set adoption. Considering the segmentation and the local consistency, the region type identification function is constructed. Then, the optimal matching windows and the visible viewpoints in different regions are selected according to the matching entropy value. Finally, the objective functional is minimized by the line search method, and high-precision disparity information is estimated.

To verify the effectiveness of our method, we conduct experiments with both synthetic and real light field data. For the synthetic experiments, we compare our method with five other state-of-the-art disparity estimation methods. From the experimental results, we conclude that our method can produce fairly accurate estimation results in scenes with different geometric structures and is robust to noise. When fine structures and occlusion exist in scenes, which often lead to severe mistakes for other disparity estimation methods, our method performs quite well. The state-of-the-art disparity estimation results are obtained without using a guided filter, using only TV model optimization. With TV refinement, the estimation quality in smooth regions improves greatly, and the accuracy in occlusion regions is maintained.

Compared with deep-learning-based disparity estimation, the proposed matching entropy method is applicable to all kinds of light field data and is not affected by the acquisition method or scenario type, whereas deep-learning-based methods rely on the training data set and their application scenarios are limited. Especially for light field data of actual scenes, with their diverse acquisition methods and practical factors, the advantages of deep-learning-based methods cannot be fully realized. On the other hand, deep-learning-based methods usually integrate light field data processing (such as super-resolution and denoising) into the disparity estimation network, while the proposed matching entropy method directly processes the original light field data to generate the disparity estimation result.

The disparity estimation performance in occluded regions was preliminarily verified. The real data experiment on the Tiger Piran scene and the simulation experiment on the Pillows scene show that the proposed method can reconstruct the disparity of occlusion regions in complex situations with multiple layers of occlusion. However, the prerequisite for region type identification is segmentation. If the scene is too complicated to segment, the reconstruction results of the occlusion regions will be affected. The limitation of the proposed matching entropy method is thus that scene segmentation is an important premise for our adaptive region identification. We will try to establish a region identification method without segmentation in future work.

7. Conclusion

To accurately estimate disparity information from light field data, we propose an adaptive region matching method to match sub-aperture images. Our main contributions are introducing the concept of matching entropy to measure the amount of correct matching information and designing a two-step adaptive process to select optimal matching windows in different regions. From the synthetic and real experiments, we verify that the proposed method can achieve high-precision disparity estimation of light field data, especially in occlusion and smooth regions, and is robust to noise. The core idea of defining matching entropy and selecting optimal matching windows adaptively is to treat regions differently according to their characteristics. This idea is not limited to light field data and can also be applied to area matching in more general stereo matching problems.

Funding

National Natural Science Foundation of China (No. 61827809, No. 61931003, No. 62001036, No. 62171044); National Key Research and Development Program of China (No. 2020YFA0712200); Natural Science Foundation of Beijing (Grant No. 4222004); QinXin Talents Cultivation Program (Beijing Information Science and Technology University).

Acknowledgments

This work was jointly supported by the National Natural Science Foundation of China (Grants No. 61827809, 61931003, 62171044, 62001036), the National Key Research and Development Program of China (No. 2020YFA0712200), the Natural Science Foundation of Beijing (Grant No. 4222004), and the QinXin Talents Cultivation Program (Beijing Information Science and Technology University).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. M. Levoy and P. Hanrahan, “Light field rendering,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques (1996), pp. 31–42.

2. L. McMillan and G. Bishop, “Plenoptic modeling: An image-based rendering system,” in Proceedings of the 22nd annual conference on computer graphics and interactive techniques (1995), pp. 39–46.

3. H. Arimoto and B. Javidi, “Integral three-dimensional imaging with digital reconstruction,” Opt. Lett. 26(3), 157 (2001). [CrossRef]  

4. X. Xiao, B. Javidi, M. Martinez-Corral, and A. Stern, “Advances in three-dimensional integral imaging: sensing, display, and applications,” Appl. Opt. 52(4), 546–560 (2013). [CrossRef]  

5. M. Martínez-Corral and B. Javidi, “Fundamentals of 3d imaging and displays: a tutorial on integral imaging, light-field, and plenoptic systems,” Adv. Opt. Photonics 10(3), 512–566 (2018). [CrossRef]  

6. G. Lippmann, “Epreuves reversibles donnant la sensation du relief,” J. Phys. Theor. Appl. 7(1), 821–825 (1908). [CrossRef]  

7. A. Sokolov, “Autostereoscopy and integral photography by Professor Lippmann’s method,” Izd. MGU (1911).

8. H. E. Ives, “Optical properties of a lippmann lenticulated sheet,” J. Opt. Soc. Am. 21(3), 171–176 (1931). [CrossRef]  

9. M. Martínez-Corral, A. Dorado, J. C. Barreiro, G. Saavedra, and B. Javidi, “Recent advances in the capture and display of macroscopic and microscopic 3-d scenes by integral imaging,” Proc. IEEE 105(5), 825–836 (2017). [CrossRef]  

10. B. Javidi, A. Carnicer, J. Arai, T. Fujii, H. Hua, H. Liao, M. Martínez-Corral, F. Pla, A. Stern, L. Waller, Q.-H. Wang, G. Wetzstein, M. Yamaguchi, and H. Yamamoto, “Roadmap on 3d integral imaging: sensing, processing, and display,” Opt. Express 28(22), 32266–32293 (2020). [CrossRef]  

11. E. H. Adelson and J. R. Bergen, “The plenoptic function and the elements of early vision,” in Computational Models of Visual Processing (MIT, 1991), pp. 3–20.

12. R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, “Light field photography with a hand-held plenoptic camera,” Ph.D. thesis, Stanford University (2005).

13. J. Navarro and A. Buades, “Reliable light field multiwindow disparity estimation,” in International Conference on Image Processing (IEEE, 2016), pp. 1449–1453.

14. H. G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. W. Tai, and I. S. Kweon, “Accurate depth map estimation from a lenslet light field camera,” in Computer Vision and Pattern Recognition (2015), pp. 1547–1555.

15. C. Chen, H. Lin, Z. Yu, S. Bing Kang, and J. Yu, “Light field stereo matching using bilateral statistics of surface cameras,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (IEEE, 2014), pp. 1518–1525.

16. K. Mishiba, “Fast depth estimation for light field cameras,” IEEE Trans. on Image Process. 29, 4232–4242 (2020). [CrossRef]  

17. S. Heber and T. Pock, “Shape from light field meets robust pca,” in European Conference on Computer Vision (Springer, 2014), pp. 751–767.

18. S. Heber, R. Ranftl, and T. Pock, “Variational shape from light field,” in International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (Springer, 2013), pp. 66–79.

19. G. P. Fickel, C. R. Jung, T. Malzbender, R. Samadani, and B. Culbertson, “Stereo matching and view interpolation based on image domain triangulation,” IEEE Trans. on Image Process. 22(9), 3353–3365 (2013). [CrossRef]  

20. S. Wanner, C. Straehle, and B. Goldluecke, “Globally consistent multi-label assignment on the ray space of 4d light fields,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (IEEE, 2013), pp. 1011–1018.

21. J. Li, M. Lu, and Z. N. Li, “Continuous depth map reconstruction from light fields,” IEEE Trans. on Image Process. 24(11), 3257–3265 (2015). [CrossRef]  

22. O. Johannsen, A. Sulc, and B. Goldluecke, “What sparse light field coding reveals about scene structure,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 3262–3270.

23. S. Wanner and B. Goldluecke, “Globally consistent depth labeling of 4d light fields,” in Conference on Computer Vision and Pattern Recognition (2012), pp. 41–48.

24. M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi, “Depth from combining defocus and correspondence using light-field cameras,” in Proceedings of the International Conference on Computer Vision (IEEE, 2013), pp. 673–680.

25. W. Williem and I. K. Park, “Robust light field depth estimation for noisy scene with occlusion,” in Computer Vision and Pattern Recognition (2016), pp. 4396–4404.

26. T.-C. Wang, A. A. Efros, and R. Ramamoorthi, “Occlusion-aware depth estimation using light-field cameras,” in Proceedings of the international conference on computer vision (IEEE, 2015), pp. 3487–3495.

27. H. Zhu, Q. Wang, and J. Yu, “Occlusion-model guided antiocclusion depth estimation in light field,” IEEE Journal of Selected Topics in Signal Processing 11(7), 965–978 (2017). [CrossRef]  

28. H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, and I. S. Kweon, “Depth from a light field image with learning-based matching costs,” IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 297–310 (2019). [CrossRef]

29. C. Shin, H.-G. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim, “Epinet: A fully-convolutional neural network using epipolar geometry for depth from light field images,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 4748–4757.

30. W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proceedings of the conference on computer vision and pattern recognition (IEEE, 2016), pp. 5695–5703.

31. Y. J. Tsai, Y. L. Liu, M. Ouhyoung, and Y. Y. Chuang, “Attention-based view selection networks for light-field disparity estimation,” Proc. AAAI Conf. on Artif. Intell. 34(07), 12095–12103 (2020). [CrossRef]  

32. C. Burckhardt, “Optimum parameters and resolution limitation of integral photography,” J. Opt. Soc. Am. 58(1), 71–76 (1968). [CrossRef]  

33. J. Arai, F. Okano, H. Hoshino, and I. Yuyama, “Gradient-index lens-array method based on real-time integral photography for three-dimensional images,” Appl. Opt. 37(11), 2034–2045 (1998). [CrossRef]  

34. J.-S. Jang and B. Javidi, “Three-dimensional synthetic aperture integral imaging,” Opt. Lett. 27(13), 1144–1146 (2002). [CrossRef]  

35. J.-S. Jang and B. Javidi, “Improved viewing resolution of three-dimensional integral imaging by use of nonstationary micro-optics,” Opt. Lett. 27(5), 324–326 (2002). [CrossRef]  

36. F. Jin, J.-S. Jang, and B. Javidi, “Effects of device resolution on three-dimensional integral imaging,” Opt. Lett. 29(12), 1345–1347 (2004). [CrossRef]  

37. S.-H. Hong, J.-S. Jang, and B. Javidi, “Three-dimensional volumetric object reconstruction using computational integral imaging,” Opt. Express 12(3), 483–491 (2004). [CrossRef]  

38. A. Stern and B. Javidi, “3-d computational synthetic aperture integral imaging (compsaii),” Opt. Express 11(19), 2446–2451 (2003). [CrossRef]  

39. C.-G. Luo, X. Xiao, M. Martínez-Corral, C.-W. Chen, B. Javidi, and Q.-H. Wang, “Analysis of the depth of field of integral imaging displays based on wave optics,” Opt. Express 21(25), 31263–31273 (2013). [CrossRef]  

40. Z. Ma, Z. Cen, and X. Li, “Depth estimation algorithm for light field data by epipolar image analysis and region interpolation,” Appl. Opt. 56(23), 6603–6610 (2017). [CrossRef]  

41. C. Liu, J. Qiu, and S. Zhao, “Iterative reconstruction of scene depth with fidelity based on light field data,” Appl. Opt. 56(11), 3185–3192 (2017). [CrossRef]  

42. Z. Cai, X. Liu, G. Pedrini, W. Osten, and X. Peng, “Accurate depth estimation in structured light fields,” Opt. Express 27(9), 13532–13546 (2019). [CrossRef]  

43. P. Zhou, Z. Yang, W. Cai, Y. Yu, and G. Zhou, “Light field calibration and 3d shape measurement based on epipolar-space,” Opt. Express 27(7), 10171–10184 (2019). [CrossRef]  

44. Z. Cai, X. Liu, G. Pedrini, W. Osten, and X. Peng, “Light-field depth estimation considering plenoptic imaging distortion,” Opt. Express 28(3), 4156–4168 (2020). [CrossRef]  

45. S. Ma, Z. Guo, J. Wu, X. Yan, L. Zhu, P. Yang, S. Wang, L. Wen, and B. Xu, “Occlusion-aware light field depth estimation using side window angular coherence,” Appl. Opt. 60(2), 392–404 (2021). [CrossRef]  

46. X. Wang, C. Tao, R. Wu, X. Tao, P. Sun, Y. Li, and Z. Zheng, “Light-field-depth-estimation network based on epipolar geometry and image segmentation,” J. Opt. Soc. Am. A 37(7), 1236–1243 (2020). [CrossRef]  

47. B. Liu, J. Chen, Z. Leng, Y. Tong, and Y. Wang, “Cascade light field disparity estimation network based on unsupervised deep learning,” Opt. Express 30(14), 25130–25146 (2022). [CrossRef]  

48. V. Kolmogorov and R. Zabih, “Multi-camera scene reconstruction via graph cuts,” in European Conference on Computer Vision (Springer, 2002), pp. 82–96.

49. Q. Yang, “Hardware-efficient bilateral filtering for stereo matching,” IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 1026–1032 (2014). [CrossRef]  

50. A. Buades and G. Facciolo, “Reliable multiscale and multiwindow stereo matching,” SIAM J. Imaging Sci. 8(2), 888–915 (2015). [CrossRef]

51. K. Zhang, J. Lu, and G. Lafruit, “Cross-based local stereo matching using orthogonal integral images,” IEEE Trans. Circuits Syst. Video Technol. 19(7), 1073–1079 (2009). [CrossRef]

52. R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” in European Conference on Computer Vision (Springer, 1994), pp. 151–158.

53. Williem, I. K. Park, and K. M. Lee, “Robust light field depth estimation using occlusion-noise aware data costs,” IEEE Trans. Pattern Anal. Mach. Intell. 40(10), 2484–2497 (2017). [CrossRef]

54. Y. Wang, L. Wang, G. Wu, J. Yang, W. An, J. Yu, and Y. Guo, “Disentangling light fields for super-resolution and disparity estimation,” IEEE Trans. Pattern Anal. Mach. Intell. 45(1), 425–443 (2022). [CrossRef]  

55. S. Wanner and B. Goldluecke, “Variational light field analysis for disparity estimation and super-resolution,” IEEE Trans. Pattern Anal. Mach. Intell. 36(3), 606–619 (2013). [CrossRef]  

56. H. Sheng, P. Zhao, S. Zhang, J. Zhang, and D. Yang, “Occlusion-aware depth estimation for light field using multi-orientation EPIs,” Pattern Recognition 74, 587–599 (2018). [CrossRef]

57. A. Levin, D. Lischinski, and Y. Weiss, “A closed-form solution to natural image matting,” IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 228–242 (2007). [CrossRef]  

58. L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D Nonlinear Phenom. 60(1-4), 259–268 (1992). [CrossRef]  

59. O. Johannsen, K. Honauer, B. Goldluecke, et al., “A taxonomy and evaluation of dense light field depth estimation algorithms,” in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (IEEE, 2017), pp. 82–99.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.



Figures (20)

Fig. 1. Scheme of the proposed matching entropy based disparity estimation.
Fig. 2. (a) Two-plane parameterized light field. (b) Diagram of the scene point projected in different views. $(u,v)$ and $(x,y)$ represent the viewpoint plane and the image plane, respectively. $F$ is the distance between the two parameterization planes, and $(u_0, v_0)$ is the central viewpoint. $d(x,y)$ and $disp(x,y)$ are the depth and the disparity of the scene point ${S}$, respectively.
Fig. 3. (a) $9 \times 9$ sub-aperture image array of the Greek scene’s light field. (b) Close-ups of sub-aperture images visually displaying occluding and occluded pixels.
Fig. 4. Results of adaptively identifying the region types for the Greek scene and the Platonic scene. (a) The central view image. (b) The segmentation of the central view. (c) The region types. The occluding, occluded, textured, and smooth regions are marked as red, blue, yellow, and green pixels, respectively, in (c).
Fig. 5. The preset matching window shapes. The 8 preset shapes are marked as blue pixels.
Fig. 6. (a) Pixel1 and Pixel2 are representative occluding and occluded pixels, respectively. (b) and (c) show how the matching entropy changes with the window for Pixel1 and Pixel2, respectively. The colors of the curves represent windows of different sizes.
Fig. 7. The set of preset visible viewpoints. The blue viewpoints represent the visible viewpoints.
Fig. 8. The relationships between the matching costs and the disparity values for different matching windows and viewpoint sets. (a) The fixed square matching window with the complete viewpoint set. (b) The adaptively selected window with the complete viewpoint set. (c) The adaptively selected window with the visible viewpoint set.
Fig. 9. Four typical matched pixels with different disparity consistencies and the relationship between the matching entropy value and the window size. The typical matched pixels in the local region and the corresponding disparity map are shown as close-ups. Pixel4 is on a flat background, and there is no difference in the disparity values in its neighborhood, so Pixel4 has the best disparity consistency. Pixel1 and Pixel2 are on the same wall in the scene; Pixel1 is near, while Pixel2 is far away, so Pixel2 has better disparity consistency than Pixel1.
Fig. 10. Backgammon scene (1st row), Dots scene (2nd row), Pyramid scene (3rd row), and Cotton scene (4th row). (a) Central view image, (b) Our result, (c) LF, (d) LF_OCC, (e) MV, (f) mvcmv0, (g) epi1. The results of (c)-(g) come from [59].
Fig. 11. The radar chart on the evaluation scenes. Our proposed matching entropy based method has advantages on several indicators.
Fig. 12. The results of the Platonic scene. (a) Ground truth of the disparity. (b) Initial disparity map. (c) Refined disparity map. The positions of the profiles are marked at Line 100, Line 230, and Line 430.
Fig. 13. Profiles of the disparity maps. (a) Profiles of Line 100. (b) Profiles of Line 230. (c) Profiles of Line 430. The red line represents the ground truth disparity, the green line represents the initial disparity, and the blue line represents the refined disparity.
Fig. 14. The results of the Pillows scene. (a) Ground truth of the disparity. (b) Initial disparity map. (c) Refined disparity map. The positions of the profiles are marked at Line 120, Line 240, and Line 430.
Fig. 15. Profiles of the disparity maps. (a) Profiles of Line 120. (b) Profiles of Line 240. (c) Profiles of Line 430. The red line represents the ground truth disparity, the green line represents the initial disparity, and the blue line represents the refined disparity.
Fig. 16. The results of the first real data experiment. (a) Central view image of the Cubes scene. (b) Disparity map.
Fig. 17. The results of the second real data experiment. (a) Central view image of the Regular-shaped Blocks scene. (b) Disparity map.
Fig. 18. The results of the third real data experiment. (a) Central view image of the Tiger Piran scene. (b) Disparity map.
Fig. 19. Evaluations in the smooth regions. (a) Initial disparity map. (b) Refined disparity map by the TV model. (c) Smooth-region label map. Close-ups are shown in the right column.
Fig. 20. Evaluations in the occlusion regions. (a) The segmentation image of the central view image. (b) The estimated disparity map. (c) Profile and edge positions. The position of the profile is marked as the red line in (a) and (b), and the positions of the edges where occlusion occurs are marked as blue stars.

Equations (13)


(1) $E_{\text{entropy}}[w(x,y)] = -\sum_{k=1}^{K} p_k \log(p_k) + \alpha_1 \sum_{k=1}^{K} p_k \log(p_k) + \alpha_2 \sum_{k=1}^{K} p_k \log(p_k),$
(2) $S(x,y) = \left( \dfrac{F x}{d(x,y)} + u_0,\ \dfrac{F y}{d(x,y)} + v_0,\ d(x,y) \right)^{T}.$
(3) $S(x,y) = \left( x\,(disp(x,y) - 1) + u_0,\ y\,(disp(x,y) - 1) + v_0,\ \dfrac{F}{disp(x,y) - 1} \right)^{T}.$
(4) $(x,y)_{u,v} = (x,y)_{u_0,v_0} + \big( (u - u_0)\, disp(x,y),\ (v - v_0)\, disp(x,y) \big).$
(5) $E_{\text{match}}[disp(x,y), w(x,y)] = \sum_{(u,v)\in\Phi} \sum_{(m,n)\in w(x,y)} \beta(m,n) \left\| L_{u_0,v_0}\big((x,y)_{u_0,v_0} + (m,n)\big) - L_{u,v}\big((x,y)_{u,v} + (m,n)\big) \right\| = \sum_{(u,v)\in\Phi} \sum_{(m,n)\in w(x,y)} \beta(m,n) \left\| L_{u_0,v_0}\big((x,y)_{u_0,v_0} + (m,n)\big) - L_{u,v}\big((x,y)_{u_0,v_0} + ((u-u_0)\,disp(x,y),\ (v-v_0)\,disp(x,y)) + (m,n)\big) \right\|,$
(6) $E[disp(x,y), w(x,y)] = E_{\text{match}}[disp(x,y), w(x,y)] - \lambda\, E_{\text{entropy}}[w(x,y)],$
(7) $[disp(x,y), w(x,y)] = \arg\min_{disp,\, w} E\big(disp(x,y), w(x,y)\big).$
(8) $I(x,y) = \begin{cases} 0, & (x,y) \in \Omega_{\text{occluding}} \\ 1, & (x,y) \in \Omega_{\text{occluded}} \\ 2, & (x,y) \in \Omega_{\text{texture}} \\ 3, & (x,y) \in \Omega_{\text{smooth}}. \end{cases}$
(9) $diff(x,y) = \sum_{(i,j)\in\tilde{\phi}} \big( M_{(u_0,v_0)}(x,y) - M_{(u_i,v_j)}(x + \Delta u,\ y + \Delta v) \big),$
(10) $\Omega_{\text{occluded}} = \{ (x,y) \mid diff(x,y) < 0 \},$
(11) $\Omega_{\text{occluding}} = \{ (x,y) \mid diff(x,y) > 0,\ (x,y) \notin \Omega_{\text{occluded}} \},$
(12) $\Omega_{\text{smooth}} = \{ (x,y) \mid \psi(x,y) < \tau,\ (x,y) \notin \Omega_{\text{occluded}} \cup \Omega_{\text{occluding}} \},$
(13) $\mu(x,y) = \arg\min_{\mu} \left\{ \iint_{\Omega_{\text{smooth}}} \big(\mu(x,y) - disp(x,y)\big)^2\, dx\, dy + \gamma \iint \left|\nabla \mu(x,y)\right|\, dx\, dy \right\}.$
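To make the roles of Eqs. (1) and (5)-(7) concrete, the following minimal Python sketch (not the authors' implementation) evaluates a histogram-based entropy for one candidate window, a matching cost over a viewpoint set, and a brute-force search over discrete disparity candidates. Only the first term of Eq. (1) is modeled (alpha_1 = alpha_2 = 0), the weights beta(m, n) are taken as uniform, grey values are assumed normalized to [0, 1], and views, offsets, and phi are hypothetical inputs (a dict of sub-aperture images keyed by (u, v), the window offsets (m, n), and the viewpoint set).

import numpy as np

def window_entropy(img, x, y, offsets, bins=16):
    # Histogram-based Shannon entropy of the grey values inside the window
    # around (x, y) -- a stand-in for the first (texture) term of Eq. (1).
    vals = np.array([img[y + n, x + m] for (m, n) in offsets])
    hist, _ = np.histogram(vals, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def matching_cost(views, center, x, y, disp, offsets, phi):
    # Absolute-difference cost of Eq. (5) with uniform weights beta(m, n) = 1.
    u0, v0 = center
    cost = 0.0
    for (u, v) in phi:
        du = (u - u0) * disp
        dv = (v - v0) * disp
        for (m, n) in offsets:
            ref = views[(u0, v0)][y + n, x + m]
            tgt = views[(u, v)][int(round(y + dv)) + n, int(round(x + du)) + m]
            cost += abs(ref - tgt)
    return cost

def best_disparity(views, center, x, y, offsets, phi, candidates, lam=0.1):
    # Brute-force minimization of Eqs. (6)-(7) over a discrete disparity range
    # for one fixed window w(x, y) and viewpoint set phi.
    ent = window_entropy(views[center], x, y, offsets)
    energy = [matching_cost(views, center, x, y, d, offsets, phi) - lam * ent
              for d in candidates]
    return candidates[int(np.argmin(energy))]

Note that, for a fixed window, the entropy term does not change with the disparity candidate, so it drives the selection of the window size, shape, and visible viewpoints rather than the disparity search itself, which is consistent with its use here as a window-selection criterion.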
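Eqs. (8)-(12) amount to thresholding once diff(x, y) and the smoothness measure psi(x, y) are available. The sketch below assumes diff has already been computed per Eq. (9) from the per-view maps M (not defined in this excerpt) and, purely as a stand-in, uses the local intensity variance of the central view for psi; the window size and the threshold tau are hypothetical values.

import numpy as np
from scipy.ndimage import uniform_filter

def classify_regions(diff, center_img, tau=1e-3, win=5):
    # Label pixels as occluding (0), occluded (1), textured (2), or smooth (3),
    # following Eqs. (8)-(12).
    mean = uniform_filter(center_img, size=win)
    mean_sq = uniform_filter(center_img ** 2, size=win)
    psi = mean_sq - mean ** 2                           # local variance as psi(x, y)

    occluded = diff < 0                                 # Eq. (10)
    occluding = (diff > 0) & ~occluded                  # Eq. (11)
    smooth = (psi < tau) & ~occluded & ~occluding       # Eq. (12)

    labels = np.full(diff.shape, 2, dtype=np.uint8)     # textured by default, Eq. (8)
    labels[occluding] = 0
    labels[occluded] = 1
    labels[smooth] = 3
    return labels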
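Eq. (13) is a total-variation refinement in the spirit of Ref. [58], with the data-fidelity term restricted to the smooth regions. Below is a minimal gradient-descent sketch of a smoothed version of that functional; the step size, gamma, the smoothing constant eps, and the wrap-around boundary handling are illustrative choices rather than the paper's solver, and smooth_mask is the boolean map of the smooth regions (cf. Fig. 19(c)).

import numpy as np

def refine_smooth_regions(disp, smooth_mask, gamma=0.1, step=0.1, iters=200, eps=1e-3):
    # Gradient descent on a smoothed version of Eq. (13):
    # fidelity (mu - disp)^2 on the smooth regions plus gamma * |grad mu|,
    # with |grad mu| replaced by sqrt(|grad mu|^2 + eps^2) to keep it differentiable.
    mu = disp.astype(np.float64)
    m = smooth_mask.astype(np.float64)
    for _ in range(iters):
        gx = np.diff(mu, axis=1, append=mu[:, -1:])     # forward differences
        gy = np.diff(mu, axis=0, append=mu[-1:, :])
        mag = np.sqrt(gx ** 2 + gy ** 2 + eps ** 2)
        nx, ny = gx / mag, gy / mag
        # divergence of the normalized gradient (backward differences;
        # wrap-around boundaries kept for brevity)
        div = (nx - np.roll(nx, 1, axis=1)) + (ny - np.roll(ny, 1, axis=0))
        grad = 2.0 * m * (mu - disp) - gamma * div
        mu -= step * grad
    return mu

Restricting the fidelity term to the smooth mask lets the TV term flatten noise inside those regions while the disparity elsewhere is anchored only weakly, which matches the intent of the refinement step evaluated in Fig. 19.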