Conventional stereo matching systems generate a depth map using two or more digital imaging sensors. However, such systems are difficult to integrate into small cameras because of their high cost and bulky size. To solve this problem, this paper presents a stereo matching system using a single image sensor with phase masks for phase-difference auto-focusing. A novel phase mask array pattern is proposed to acquire two pairs of stereo images simultaneously. Furthermore, a noise-invariant depth map is generated from the raw-format sensor output. The proposed method computes the depth map in four steps: (i) acquisition of stereo images using the proposed mask array, (ii) variational segmentation with merging criteria to simplify the input image, (iii) disparity map generation using hierarchical block matching, and (iv) image matting to fill holes and generate the dense depth map. The proposed system can be used in small digital cameras without additional lenses or sensors.
© 2016 Optical Society of America
Three-dimensional (3D) information is acquired by measuring the distance between a lens and an object selected as a region of interest (ROI). In particular, a depth map is generated by analyzing all ROIs in an image based on recognizable features. For this reason, the stereo matching technique is widely used in 3D applications [1–3].
There are three main approaches to obtaining a depth map. The first performs stereo matching using two or more cameras [4,5]. Because each image has a different viewing angle, the depth map is generated by estimating the disparity between the images. This approach provides high accuracy at the cost of increased complexity, cost, and volume of the system. The second approach estimates the disparity using a multiple color-filtered aperture (MCA) in a lens [6,7]. The MCA-based approach is comparable to the proposed method since it estimates the depth map from a single image. The acquired image consists of three color channels, each of which has differently located ROIs. Although the MCA-based approach costs less than conventional stereo matching, the input image suffers from color distortion and reduced brightness. The last approach uses an infrared sensor-based camera to estimate the distance using the time-of-flight (ToF) method [8–10]. It computes disparities quickly and accurately, but the range of estimated distances is restricted.
Several methods use a gradient map derived from a vanishing point, which is generated by projecting light sources from the scene onto the imaging sensor [11–13]. Zhuo et al. computed a defocus map by estimating point spread functions (PSFs) from a near- or far-focused image [14]. He et al. proposed a haze removal method and a depth map application using the dark channel prior [15]. These methods can accurately generate a depth map from a single image.
This paper presents a novel depth map generation system using a dual pixel-type imaging sensor. This sensor acquires a set of stereo images using phase photo-diodes with different black masks. Next, variational segmentation-based stereo matching is performed using the acquired stereo images. After hierarchical block matching is performed to measure sub-pixel disparities, this system generates a dense depth map using the segmented image and the disparity edge map.
This paper is organized as follows. Theoretical background is introduced in section 2, and the proposed depth map generation method is presented in section 3. After summarizing experimental results in section 4, section 5 concludes the paper.
2. Image acquisition model of hybrid auto-focusing
Phase detection auto-focusing (PDAF) is a technique that automatically finds the best focusing position of a lens using phase-difference information from a specially designed image sensor. This approach falls into the passive auto-focusing category and relies on an elaborate optical path generated by the relationship between the axis of the light source and the separated light rays [16]. However, PDAF requires space for additional devices such as line sensors and half mirrors in the camera. Therefore, it is not appropriate for small, portable imaging systems despite its fast, accurate performance. Digital AF can solve this problem using contrast detection [17] or PSF estimation [18,19], but it has high computational complexity.
A hybrid AF system is a variant of the existing AF systems that solves this problem by computing phase differences using a dual pixel-type complementary metal oxide semiconductor (CMOS) sensor [20,21]. This sensor significantly reduces camera cost since it replaces half mirrors, separating lenses, and line sensors. There are several types of sensors for hybrid AF. One is equipped with black masks on the color filters of some pixels, as shown in Fig. 1(a) [22]. Phase masks with two directions are installed to block light rays with a specific viewing angle. Another type uses an imaging sensor with special pixels containing two sub-lenses and photo-diodes, as shown in Fig. 1(b) [23]. This sensor acquires two sub-images by absorbing the left- and right-sided light. The third type has isolation barriers between cells, as shown in Fig. 1(c) [24]. By placing a barrier between two adjacent pixels, each photo-diode absorbs the light passing through its own micro lens without interference from neighboring lenses. These sensors all generate two images with different viewing angles, in which each pixel has a different disparity.
The proposed system can use any type of image sensor with phase masks. As shown in Fig. 2(a), each photo-diode has a phase mask installed on the right side in even columns and on the left side in odd columns. Because a pair of phase pixels produces different disparities for objects at different distances, as shown in Fig. 2(b), the amount of phase difference indicates how to move the lens to the correct focusing position. Furthermore, sub-pixel phase differences arise because the phase pixels have different viewing angles for each object at a different distance. When the left- and right-phase images are acquired using photo-diodes with right- and left-sided masks, respectively, the image acquisition model for hybrid AF is defined as
3. Dense depth map generation using phase pixels
The proposed system generates a dense depth map using the dual pixel-type CMOS sensor with black masks. After sensing the raw data, appropriate pre-processing steps, such as demosaicing and denoising, are needed to reduce the sensing noise. Sophisticated, accurate motion estimation is also needed to measure the disparities with sub-pixel precision.
To meet these requirements, the proposed method performs four steps, as shown in Fig. 3: i) image separation from g(x,y) using the modified imaging sensor array, ii) image segmentation using an improved variational optimization method, iii) disparity map generation using hierarchical block matching, and iv) dense depth map generation using image matting to fill holes. A detailed description of each step is given in the following subsections.
3.1. Separation of stereo images using a single sensor
The first step of the proposed system is to separate a pair of stereo images from the input image. The phase difference is equivalent to the disparity between the left image gL(x,y) and the right image gR(x,y) because objects have different phase differences, as shown in Fig. 2(b). For this reason, the phase images can be used as a pair of stereo images. Also, the distance between the left- and right-side phase pixels is fixed at a constant during the manufacturing process of the sensor. This means that rectification is very simple in the proposed system because the vertical disparity is constant for all pixels. As shown in Fig. 2(a), a left phase pixel is one pixel apart from the corresponding right pixel in both the vertical and horizontal directions. When gL(x,y) and gR(x,y) are acquired from the phase pixels, the images have the disparities shown in Fig. 4(a). Because the vertical disparity is assumed constant, disparity measurement errors arise along horizontal edges, as shown in Fig. 4(b). Although the vertical disparity must be computed to obtain an accurate depth map along horizontal edges, existing CMOS sensors cannot provide vertical disparity data.
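As a sketch of this separation, the left and right phase images can be pulled out of the raw frame by column slicing, with rectification reduced to a constant integer shift. The odd/even-column layout and the one-pixel offset below are illustrative assumptions following Fig. 2(a); the exact pattern is sensor-specific.

```python
import numpy as np

def separate_and_rectify(raw):
    """Extract left/right phase images from alternating columns and
    compensate the fixed offset between the two phase-pixel grids.

    Assumes right-sided masks sit in even columns (producing the left
    view) and left-sided masks in odd columns, by analogy with
    Fig. 2(a); the actual layout depends on the sensor.
    """
    g_left = raw[:, 0::2]    # right-sided masks -> left phase image
    g_right = raw[:, 1::2]   # left-sided masks -> right phase image
    # the two grids are offset by a known constant, so rectification
    # reduces to a fixed integer shift rather than a full homography
    # (np.roll wraps at the border; a real pipeline would crop instead)
    g_right = np.roll(g_right, 1, axis=0)
    return g_left, g_right
```

Because the offset is fixed by manufacturing, no feature-based rectification is needed; the shift can be hard-coded per sensor model.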
To measure both horizontal and vertical disparities, the proposed sensor has one of four directional black masks in each pixel of a 2 × 2 pixel array, as shown in Fig. 5. From (1) and (2), the image acquisition model is modified as
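Assuming a hypothetical raster layout in which the four mask directions occupy fixed positions within each 2 × 2 cell, the four phase sub-images could be extracted as follows; the actual positions depend on the mask pattern of Fig. 5.

```python
import numpy as np

def split_phase_images(raw):
    """Split a raw frame into four phase sub-images.

    The 2x2 cell layout used here (left, right / top, bottom) is an
    illustrative assumption, not the published sensor pattern.
    """
    g_left   = raw[0::2, 0::2]  # left-view pixels
    g_right  = raw[0::2, 1::2]  # right-view pixels
    g_top    = raw[1::2, 0::2]  # top-view pixels
    g_bottom = raw[1::2, 1::2]  # bottom-view pixels
    return g_left, g_right, g_top, g_bottom
```

Each sub-image has half the resolution of the raw frame in each direction; the two pairs (left/right and top/bottom) then supply horizontal and vertical disparities, respectively.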
3.2. Segmentation using the improved variational merging criteria
Given the set of four stereo images, the left phase image gL(x,y) is simplified by image segmentation under the assumption that an object has the same disparity within a segment. The segmented image provides the boundaries of the depth map.
The proposed segmentation method is based on the variational optimization originally proposed by Mumford and Shah using piecewise-constant smoothness [26]. The segmentation is performed by minimizing the energy functional
For a simple implementation, the proposed segmentation method uses the variational merging criterion proposed by Koepfler et al. [27]. This method defines the merging criterion using the expansion property of boundaries. If gSi(x) expands, the energy of the current boundary B decreases, as expressed in (7). From (7), the merging criterion is defined as
This implies that Koepfler's segmentation method is easily implemented and generates the optimal gS(x,y) without noise, as shown in Fig. 6(c). However, this method is very sensitive to the brightness of gS(x,y), and it merges weak edges even when they form the boundary between two regions.
To solve these problems, an additional brightness-invariant criterion is used in the proposed method. Before initializing the segmentation regions, an edge image is generated from gL(x,y) using Laplacian-of-Gaussian filtering with a proper threshold to clearly delineate objects, as shown in Fig. 6(b). Next, two regions are merged under the condition of (8). If the first merging criterion is not satisfied, the regions are merged when the additional edge-based condition holds. The proposed method repeats the merging process to obtain the simplified image gS(x,y), as shown in Fig. 6(d), and the final edge image e(x,y) is used to compute the disparities quickly.
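The piecewise-constant merging test at the core of this step can be sketched as follows. The formula implements Koepfler-style region merging for the piecewise-constant Mumford–Shah model; the exact forms of (7) and (8) and the additional edge-based test of the proposed method are not reproduced here, so treat the expression as a generic illustration.

```python
import numpy as np

def merge_criterion(region_i, region_j, boundary_len, lam):
    """Decide whether two regions should merge under the
    piecewise-constant model: merging is energetically favorable when
    the increase in the data-fit term (which depends on the region
    sizes and the gap between their mean intensities) is outweighed by
    the boundary-length saving, weighted by the scale parameter lam."""
    ni, nj = len(region_i), len(region_j)
    mi, mj = np.mean(region_i), np.mean(region_j)
    # increase in squared-error energy caused by merging the regions
    energy_increase = (ni * nj) / (ni + nj) * (mi - mj) ** 2
    # merge if the removed boundary pays for the fit degradation
    return energy_increase <= lam * boundary_len
```

For example, two regions with means 10.5 and 11 merge at lam = 5, while regions with means 0 and 100 do not; sweeping lam from small to large produces the coarse-to-fine segmentation hierarchy the variational method relies on.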
3.3. Depth map generation using motion estimation and matting
Given the segmentation result, the proposed system generates a depth map using fine motion estimation and image matting. The system needs floating-point disparity estimation from the set of stereo images because of the masks shown in Fig. 2. Jang et al. presented a hierarchical phase correlation method to detect floating-point phase differences [20]. However, phase correlation matching is not efficient because of the multiple Fourier transforms and the corresponding search for the peak point. Moreover, vertical and horizontal disparity maps must be generated from gL(x,y) and gR(x,y), and from gT(x,y) and gB(x,y), respectively.
To obtain vertical and horizontal disparities, e(x,y) is split into ever(x,y) and ehor(x,y). Next, the proposed system measures the disparity using a multi-scale, hierarchical block matching method based on the sum of absolute differences (SAD) to reduce the computation. SAD-based block matching performs only simple per-pixel operations without transforms or convolutions, and it tolerates small brightness changes. If ever(x,y) > 0 and ehor(x,y) > 0, each motion value is defined as
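A minimal sketch of coarse-to-fine SAD block matching is given below. It illustrates only the multi-scale integer search along one axis; the sub-pixel refinement and edge-guided evaluation of the proposed method are omitted, and the block size, search range, and level count are placeholder values.

```python
import numpy as np

def sad_match(left, right, y, x, block, search, center=0):
    """Return the integer horizontal disparity near `center` that
    minimizes the SAD between a block in the left image and shifted
    blocks in the right image."""
    ref = left[y:y + block, x:x + block].astype(np.int64)
    best_d, best_sad = center, None
    for d in range(center - search, center + search + 1):
        xs = x + d
        if xs < 0 or xs + block > right.shape[1]:
            continue  # candidate block would fall outside the image
        cand = right[y:y + block, xs:xs + block].astype(np.int64)
        sad = int(np.abs(ref - cand).sum())
        if best_sad is None or sad < best_sad:
            best_sad, best_d = sad, d
    return best_d

def hierarchical_disparity(left, right, y, x, levels=1, block=4, search=2):
    """Coarse-to-fine matching: estimate the disparity on subsampled
    images, double it, and refine around that estimate on the next
    finer level, so each level only needs a small search window."""
    d = 0
    for lvl in range(levels, -1, -1):
        s = 2 ** lvl  # subsampling factor of this pyramid level
        d = sad_match(left[::s, ::s], right[::s, ::s],
                      y // s, x // s, block, search, center=2 * d)
    return d
```

Restricting each level to a small window around the propagated estimate is what keeps the cost low compared with an exhaustive full-range search at full resolution.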
Finally, a dense depth map is generated using image matting from gS(x,y) and dE(x,y). To fill the holes of dE(x,y) using gS(x,y), the proposed system uses the map interpolation method of Zhuo et al. [14]. This method generates the depth map using the matting Laplacian matrix computed from a reference image and a sparse edge image [28]. In the proposed method, these images are replaced by gS(x,y) and dE(x,y), respectively. Thus, the following linear system generates the depth map
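In the spirit of this matting-based interpolation, the sketch below fills a sparse depth map by solving a linear system in which neighboring pixels with similar guide values are coupled, while known samples are softly constrained to their depths. It uses a simple weighted graph Laplacian with hypothetical exponential weights rather than the actual matting Laplacian of Levin et al., so it only illustrates the structure of the linear system, not the published method.

```python
import numpy as np

def propagate_depth(guide, sparse_depth, known, lam=100.0, sigma=0.1):
    """Solve (L + lam*K) d = lam*K*d_sparse on the pixel grid, where L
    is a guide-weighted 4-neighbor graph Laplacian and K marks known
    samples. A dense solver is used for clarity; a real implementation
    would use a sparse solver."""
    h, w = guide.shape
    n = h * w
    A = np.zeros((n, n))
    b = np.zeros(n)
    idx = lambda yy, xx: yy * w + xx
    for y in range(h):
        for x in range(w):
            i = idx(y, x)
            for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    # strong coupling where the guide image is smooth,
                    # weak coupling across guide edges
                    wgt = np.exp(-((guide[y, x] - guide[yy, xx]) ** 2) / sigma)
                    A[i, i] += wgt
                    A[i, idx(yy, xx)] -= wgt
            if known[y, x]:
                # soft constraint toward the sparse depth sample
                A[i, i] += lam
                b[i] = lam * sparse_depth[y, x]
    return np.linalg.solve(A, b).reshape(h, w)
```

Because coupling is weak across guide edges, depth values spread within each segment but not across segment boundaries, which is exactly the behavior needed to densify an edge-only disparity map.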
4. Experimental results
To evaluate the performance of the proposed system, a set of stereo images was acquired using a camera equipped with an F1.8 lens and the sensor array shown in Fig. 2(a). The test camera module is shown in Fig. 7. One set of stereo images with vertical disparity was captured under a normal illumination condition, and another set was acquired by rotating the camera by 90°.
Using the camera shown in Fig. 7, four stereo images of size 1024 × 1024 were acquired, as shown in Figs. 8(a)–8(d). The same scene was captured to generate the four stereo images. The nearest and farthest objects are located at 20 and 110 cm, respectively, with a 10 cm interval between adjacent objects. As shown in Fig. 8(e), the disparity in Figs. 8(a)–8(d) tends to decrease as the distance between the imaging sensor and the object increases. Since the x and y disparity curves are almost identical, the proposed sensor can estimate the disparity over the entire range between 20 and 100 cm.
Segmentation results for Fig. 8(a) using four different algorithms are shown in Fig. 9. In this experiment, the proposed method was compared with the k-means clustering, mean-shift, and variational segmentation [27] methods. As shown in Figs. 9(a)–9(c), the three existing methods produce incorrectly segmented results because of their sensitivity to brightness changes. In particular, the existing methods fail to segment the ruler on the right side of Fig. 8(a). On the other hand, the proposed method segments the ruler accurately, as shown in Fig. 9(d).
To evaluate the depth map generation performance of the proposed system with the sensor array shown in Fig. 5, five existing stereo matching methods were compared with the proposed method: the PSF-based method of Zhuo et al. [14], region-based stereo matching of Mukherjee et al. [29], the enhanced normalized cross correlation of Psarakis et al. [30], recursive edge-aware filters of Çığla [31], and hierarchical phase correlation using the sensor of Fig. 2(a), as shown in Figs. 10(a)–10(e). The proposed system generates a more reliable depth map than the other methods, as shown in Fig. 10(f).
Depth maps were estimated from three sets of stereo images using the proposed system, as shown in Fig. 11. The input images of Figs. 11(a)–11(c) have backgrounds at distances of 100, 230, and 350 cm, respectively, and all objects lie within 100 cm. Although some incorrect disparities are observed in the background regions, as shown in Figs. 11(b) and 11(c), the proposed system generates a dense depth map with acceptable accuracy.
In this paper, a novel single-sensor depth map generation method has been presented. Existing stereo matching systems require bulky, expensive hardware to measure disparities because of the additional camera. The proposed system uses a dual pixel-type imaging sensor with black masks that generates stereo images for disparity estimation. Since the proposed sensor array does not produce the horizontal-edge error, it can accurately obtain a depth map using multiple disparities. The proposed variational segmentation method is simple to implement and generates an acceptable segmentation result without noise amplification. Moreover, the proposed system generates the depth map by computing disparities using hierarchical block matching, and the proposed motion estimation method measures sub-pixel disparities corresponding to a very fine angle of view. Consequently, the proposed system can generate a depth map of a scene with multiple objects and additive noise using a single image sensor.
This work was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (B0101-1-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis), by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2016-H8501-16-1018) supervised by the IITP, and by the Technology Innovation Program (Development of Smart Video/Audio Surveillance SoC & Core Component for Onsite Decision Security System) under Grant 10047788.
References and links
1. N. Yang, J. Lee, and R. Park, “Depth map generation from a single image using local depth hypothesis,” in Proceedings of IEEE International Conference on Consumer Electronics, (IEEE, 2012), pp. 311–312.
2. A. Farooq and C. Won, “A survey of human action recognition approaches that use an RGB-D sensor,” IEIE Trans. Smart Processing and Computing 4, 281–290 (2015). [CrossRef]
3. S. Kang, A. Roh, C. Eem, and H. Hong, “Using real-time stereo matching for human gesture detection and tracking,” TechArt: Journal of Arts and Imaging Science 1, 60–66 (2014).
4. H. Kim, J. Kang, and B. Song, “Depth-adaptive sharpness adjustments for stereoscopic perception improvement and hardware implementation,” IEIE Trans. Smart Processing and Computing 3, 110–117 (2014). [CrossRef]
5. K. Denker and G. Umlauf, “Accurate real-time multi-camera stereo matching on the GPU for 3D reconstruction,” Journal of WSCG 19, 9–16 (2011).
7. S. Kim, E. Lee, M. H. Hayes, and J. Paik, “Multifocusing and depth estimation using a color shift model-based computational camera,” IEEE Trans. Image Processing 21, 4152–4166 (2012). [CrossRef]
8. S. Zhang, C. Wang, and S. C. Chan, “A new high resolution depth map estimation system using stereo vision and depth sensing device,” in Proceedings of IEEE 9th International Colloquium on Signal Processing and its Applications (CSPA), (IEEE, 2013), pp. 49–53.
9. Y. Kang and Y. Ho, “High-quality multi-view depth generation using multiple color and depth cameras,” in Proceedings of IEEE 9th International Conference on Multimedia and Expo (ICME), (IEEE, 2010), pp. 1405–1410.
10. B. Yoon, K. Choi, M. Ra, and W. Kim, “Real-time full-view 3D human reconstruction using multiple RGB-D cameras,” IEIE Trans. Smart Processing and Computing 4, 224–230 (2015). [CrossRef]
11. N. Yang, J. Lee, and R. Park, “Depth map generation using local depth hypothesis for 2D-to-3D conversion,” International Journal of Computer Graphics & Animation (IJCGA) 3, 1–15 (2013). [CrossRef]
12. S. Battiato, S. Curti, M. La Cascia, M. Tortora, and E. Scordato, “Depth map generation by image classification,” in Electronic Imaging 2004, (International Society for Optics and Photonics, 2004), pp. 95–104.
13. F. Yu, J. Liu, Y. Ren, J. Sun, Y. Gao, and W. Liu, “Depth generation method for 2D to 3D conversion,” in 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON), (IEEE, 2011), pp. 1–4.
14. S. Zhuo and T. Sim, “Defocus map estimation from a single image,” Pattern Recognition 44, 1852–1858 (2011). [CrossRef]
15. K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Trans. Pattern Analysis and Machine Intelligence 33, 2341–2353 (2011). [CrossRef]
16. L. Spinoulas, A. Katsaggelos, J. Jang, Y. Yoo, J. Im, and J. Paik, “Defocus-invariant image registration for phase-difference detection auto focusing,” in Proceedings of IEEE International Symposium on Consumer Electronics, (IEEE, 2014), pp. 83–84.
17. J. Jeon, J. Lee, and J. Paik, “Robust focus measure for unsupervised auto-focusing based on optimum discrete cosine transform coefficients,” IEEE Trans. Consumer Electronics 57, 1–5 (2011). [CrossRef]
18. Y. Yoo, J. Jang, J. Shin, and J. Paik, “Optimal PSF selection using second-order frequency analysis for digital autofocusing,” TechArt: Journal of Arts and Imaging Science 2, 81–86 (2015). [CrossRef]
19. D. Kim, J. Shin, and J. Paik, “Real-time digital auto-focusing using prior PSF estimation,” TechArt: Journal of Arts and Imaging Science 1, 39–41 (2014). [CrossRef]
20. J. Jang, Y. Yoo, J. Kim, and J. Paik, “Sensor-based auto-focusing system using multi-scale feature extraction and phase correlation matching,” Sensors 16, 5747–5762 (2015). [CrossRef]
21. P. Śliwiński and P. Wachel, “A simple model for on-sensor phase-detection autofocusing algorithm,” Journal of Computer and Communication 1, 11–17 (2013). [CrossRef]
22. R. Butler, “Exclusive: Fujifilm’s phase detection system explained,” http://www.dpreview.com/articles/2151234617/fujifilmpd.
23. R. Fontaine, “Innovative technology elements for large and small pixel CIS devices,” in Proceedings on International Image Sensor Workshop, (IISS, 2013), pp. 1–4.
24. J. Ahn, K. Lee, Y. Kim, H. Jeong, B. Kim, H. Kim, J. Park, T. Jung, W. Park, T. Lee, E. Park, S. Choi, G. Choi, H. Park, Y. Choi, S. Lee, Y. Kim, Y. J. Jung, D. Park, S. Nah, Y. Oh, M. Kim, Y. Lee, Y. Chung, I. Hisanori, J. Im, D. K. Lee, B. Yim, G. Lee, H. Kown, S. Choi, J. Lee, D. Jang, Y. Kim, T. Kim, G. Hiroshige, C. Choi, D. Lee, and G. Han, “7.1 A 1/4-inch 8Mpixel CMOS image sensor with 3D backside-illuminated 1.12 µm pixel with front-side deep-trench isolation and vertical transfer gate,” in Proceedings on IEEE International Solid-State Circuits Conference Digest of Technical Papers, (IEEE, 2014), pp. 124–125.
25. J. Jang and J. Paik, “Dense depth map generation using a single camera with hybrid auto-focusing,” in Proceedings of IEEE International Conference on Consumer Electronics Berlin, (IEEE, 2015), pp. 277–278.
26. D. Mumford and J. Shah, “Optimal approximations by piecewise smooth functions and associated variational problems,” Communications on Pure and Applied Mathematics 42, 577–685 (1989). [CrossRef]
27. G. Koepfler, C. Lopez, and J.-M. Morel, “A multiscale algorithm for image segmentation by variational method,” SIAM J. Numer. Anal. 31, 282–299 (1994). [CrossRef]
28. A. Levin, D. Lischinski, and Y. Weiss, “A closed-form solution to natural image matting,” IEEE Trans. Pattern Analysis and Machine Intelligence 30, 228–242 (2008). [CrossRef]
29. S. Mukherjee and R. M. R. Guddeti, “A hybrid algorithm for disparity calculation from sparse disparity estimates based on stereo vision,” in Proceedings of IEEE International Conference on Signal Processing and Communications (SPCOM), (IEEE, 2014), pp. 1–6.
30. E. Z. Psarakis and G. D. Evangelidis, “An enhanced correlation-based method for stereo correspondence with subpixel accuracy,” in Proceedings of IEEE International Conference on Computer Vision, (IEEE, 2005), pp. 907–912.
31. C. Çıgla, “Recursive edge-aware filters for stereo matching,” in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition Workshops, (IEEE, 2015), pp. 27–34.