## Abstract

Passive 3D sensing using integral imaging techniques has been well studied in the literature. It has been shown that a scene can be reconstructed at various depths using several 2D elemental images. This provides the ability to reconstruct objects in the presence of occlusions and to passively estimate their 3D profile. However, high resolution 2D elemental images are required for high quality 3D reconstruction. Compressive Sensing (CS) provides a way to dramatically reduce the amount of data that needs to be collected to form the elemental images, which in turn can reduce the storage and bandwidth requirements. In this paper, we explore the effects of CS on the acquisition of the elemental images, and ultimately on passive 3D scene reconstruction and object recognition. Our experiments show that the performance of passive 3D sensing systems remains robust even when elemental images are recovered from very few compressive measurements.

©2012 Optical Society of America

## 1. Introduction and background review

Three-dimensional (3D) imaging and visualization techniques have been the subject of great interest. Integral imaging (II) is a promising technology among 3D imaging techniques [1–8]. As proposed by G. Lippmann, II can provide full parallax and continuous viewpoints. Essentially, these systems use a microlens array to capture light rays emanating from 3D objects in such a way that the rays passing through each pickup microlens are recorded on a two-dimensional (2D) image sensor. The captured 2D image arrays are referred to as elemental images. The elemental images are 2D images, flipped in both the *x* and *y* directions, each with a different perspective of the 3D scene. To reconstruct the 3D scene optically from the captured 2D elemental images, the rays are propagated in reverse from the elemental images through a display microlens array similar to the pickup microlens array. No special glasses are required to observe the 3D images. In synthetic aperture integral imaging (SAII), as shown in Fig. 1(a), the 2D elemental images with different perspectives can be obtained with a camera array. For 3D computational reconstruction, back-projections of the elemental images through a virtual pinhole array are used, as depicted in Fig. 1(b), and averaged at the desired depth to reconstruct the image at that depth. Thus 3D objects can be reconstructed at different depths in this manner.

Although research to date has assumed the elemental images to be acquired with conventional imagers, we seek to understand the impact of compressive sensing on an SAII system. Therefore, in our concept, each camera in Fig. 1 is replaced with a compressive imager. A number of CS approaches are presented in [9–20]. The seminal work on compressed sensing by Candes, Romberg, and Tao [9] and Donoho [10] focused on the problem of perfectly reconstructing the scene using relatively few measurements. Specifically, Donoho notes that it is possible to design *m* = *O*(*k* log(*N*/*k*)) non-adaptive measurements which contain the information necessary to reconstruct any object with *N* pixels at an accuracy comparable to that which would be possible if the *k* most important coefficients of that object were directly observable. Thus one can have perfect (or near-perfect) image/signal reconstruction with far fewer samples than the Shannon sampling theorem implies, provided the following assumptions are true:

- 1) There is structure in the image (i.e., the image is not random) ↔ there exists a basis in which the representation of the image of interest is sparse ↔ the image is compressible, and
- 2) The notion of “observing a sample” is generalized to include linear projections of the image in addition to the instantaneous image pixel intensity.

Assumption 1 is easily satisfied by almost all relevant images one wishes to exploit, while assumption 2 is satisfied by assuming simple optical measurement architectures such as those proposed by Wakin et al. [11] and Duarte et al. [12]. The algorithms for perfect reconstruction are generally formulated as a constrained optimization problem or with greedy search strategies [13]. For the wide field of view imaging application, Muise [14] has designed a compressive imaging algorithm with associated measurement kernels and has simulated results based upon a field-of-view multiplexing sensor. These works show a viable concept for wide area imaging at high resolution by multiplexing many distinct fields-of-view (FOVs) onto a single focal plane array (FPA). Bottisti and Muise [15] present a compressive imaging measurement system and corresponding reconstruction algorithms designed for increased-resolution sensing. This concept can be viewed as a local pixel multiplexing sensor, a technique that we will use for compressively sensing elemental images.

The rest of the paper is organized as follows. Section 2 provides a review of the mathematics of compressive sensing as it pertains to the acquisition and reconstruction of the elemental images. The results of compressively sensing the elemental images and then reconstructing the scene at varying depths are presented in Section 3. We also address the impact of varying levels of compression, and discuss the results in terms of the reconstruction signal-to-noise ratio and in terms of the impact of the process on object recognition and discrimination in Section 4. Our conclusions and directions for future research are in Section 5.

## 2. A review of compressive sensing

Here we briefly describe a local multiplexing concept in terms of how one compressively senses information to reconstruct the elemental images, following the work reported in [17]. Specifically, the sensing and reconstruction are done independently on each 8 × 8 block of pixels. Let **i** be a 64-element vector representing an 8 × 8 grayscale image *I* ordered lexicographically. Furthermore, let *A* be an *n* × 64, *n* < 64 encoding matrix used to subsample the input image. Then the goal of compressive sensing is to reconstruct **i** given an *n*-element vector **y** by solving **y** = *A***i**. As *A* has fewer rows than columns, this is an underdetermined system and therefore has infinitely many solutions.

Many algorithms exist for solving this underdetermined system when **i** is sparse. This is generally not the case for typical imagery; however, we can transform the system into an equivalent sparse system by assuming we can represent the desired image **i** as **i** = *ϕ***c**, where the matrix *ϕ* represents a compression basis (e.g., wavelet, DCT, or an over-complete dictionary).

Thus, our sensing equation becomes **y** = *Aϕ***c** = *ψ***c**, where *ψ* = *Aϕ*.

Using the mathematics described above, we will now detail an approach that compressively senses and reconstructs an image *I*. We consider the image *I* as split into 8 × 8 regions and reconstruct each one independently. Once each patch has been reconstructed, the regions are reassembled to form the final image.

Consider an 8 × 8 matrix consisting of only 1s and 0s where every non-overlapping 2 × 2 region contains one 1 and three 0s, randomly chosen [see Fig. 2(a)]. We refer to this matrix as our mask. This mask selects the pixels from the original image that are passed on to the FPA to create **y**. Each 2 × 2 region represents a single element of the sensor. Therefore this mask allows us to reconstruct an 8 × 8 region from only 16 pixels, resulting in 4x compression.

We now create the matrix *A* as a 16 × 64 matrix. For each small 2 × 2 block in the mask, we form the 8 × 8 matrix *A′* containing all 0s except where the corresponding pixel in that block is on. This matrix *A′* is then ordered lexicographically, transposed, and set as a row of the resulting matrix *A*. The *A* matrix corresponding to the mask in Fig. 2(a) is shown in Fig. 2(b).
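The mask and sensing-matrix construction described above can be sketched in a few lines of NumPy (a minimal sketch; the function names and random-number seeding are ours, not part of the original implementation). Setting `cell = 4` or `cell = 8` yields the higher-compression variants discussed later in this section.

```python
import numpy as np

def make_mask(block=8, cell=2, rng=None):
    """Random binary mask: one active pixel per non-overlapping cell x cell region."""
    rng = np.random.default_rng(rng)
    mask = np.zeros((block, block), dtype=int)
    for r in range(0, block, cell):
        for c in range(0, block, cell):
            # pick the one pixel in this cell that is passed on to the FPA
            dr, dc = rng.integers(cell), rng.integers(cell)
            mask[r + dr, c + dc] = 1
    return mask

def mask_to_A(mask, cell=2):
    """Each cell x cell region of the mask contributes one row of A."""
    block = mask.shape[0]
    rows = []
    for r in range(0, block, cell):
        for c in range(0, block, cell):
            a = np.zeros_like(mask)
            a[r:r + cell, c:c + cell] = mask[r:r + cell, c:c + cell]
            rows.append(a.flatten())        # lexicographic ordering
    return np.array(rows)

mask = make_mask(rng=0)
A = mask_to_A(mask)
print(A.shape)  # (16, 64): 16 measurements of a 64-pixel block, i.e. 4x compression
```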

Once the data is sensed with this sampling matrix *A*, we must solve the underdetermined system **y** = *ψ***c**. Since this is an underdetermined system, there are infinitely many solutions (i.e., there are many 8 × 8 “images” which fit the data we collected). If all of the underlying assumptions are met, the theory of compressive imaging states that the sparsest solution will be the correct image. Our 8 × 8 image is then approximated by **i** = *ϕ***c**. Two details remain: how do we choose our basis matrix *ϕ*, and how do we find the sparsest solution?

We chose to train a dictionary, *ϕ*, empirically on real-world data to generate an overcomplete representation of 8 × 8 image patches that will guarantee the sparsity required by compressive imaging theory.

To train this dictionary, we chose non-overlapping 8 × 8 chips from imagery collected from the same camera but of a different scene. A total of 409,600 chips were collected and then passed into the K-SVD [16] algorithm to create the dictionary. Similar to the *k*-means algorithm, the K-SVD algorithm, by Aharon, Elad, and Bruckstein, clusters the input chips and then creates basis vectors from them using singular-value decomposition (SVD). Each input chip is then reassigned to a cluster based upon its similarity to the set of basis vectors. The process iterates until the set of basis vectors can reconstruct the input chips within a specified amount of error.
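The K-SVD atom-update step described above can be sketched as follows (a single-pass NumPy illustration of the SVD-based atom update only; the sparse-coding stage that produces the codes `S` is assumed to be done separately, and the function name and structure are our simplification, not the authors' implementation):

```python
import numpy as np

def ksvd_update(D, X, S):
    """One K-SVD dictionary-update pass.
    D: (n, K) dictionary, X: (n, P) training patches, S: (K, P) sparse codes.
    Each atom is replaced by the leading singular vector of the residual
    restricted to the patches that actually use that atom."""
    for k in range(D.shape[1]):
        users = np.nonzero(S[k])[0]
        if users.size == 0:
            continue
        # residual without atom k's contribution, on its users only
        E = X[:, users] - D @ S[:, users] + np.outer(D[:, k], S[k, users])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]              # updated (unit-norm) atom
        S[k, users] = s[0] * Vt[0]     # updated coefficients, same support
    return D, S
```

Because each rank-1 update is optimal for its atom, the reconstruction error is non-increasing across the pass; iterating this update together with re-coding is the K-SVD loop.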

To solve for the sparsest solution we chose the Orthogonal Matching Pursuit (OMP) algorithm of Tropp and Gilbert [17]. This algorithm greedily searches for the nonzero coefficients, one at a time, until the model describes the sensed data to a prescribed tolerance.
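A minimal NumPy sketch of OMP follows (the stopping rule and names are our choices; in our setting the columns of *ψ* = *Aϕ* play the role of the atoms):

```python
import numpy as np

def omp(psi, y, tol=1e-6, max_atoms=None):
    """Orthogonal Matching Pursuit: greedily add the atom most correlated
    with the residual, re-fit all chosen atoms by least squares, and stop
    once the residual is within tolerance."""
    max_atoms = max_atoms or psi.shape[1]
    idx, residual = [], y.astype(float).copy()
    c = np.zeros(psi.shape[1])
    for _ in range(max_atoms):
        if np.linalg.norm(residual) <= tol:
            break
        k = int(np.argmax(np.abs(psi.T @ residual)))
        if k in idx:                      # numerical stagnation guard
            break
        idx.append(k)
        coef, *_ = np.linalg.lstsq(psi[:, idx], y, rcond=None)
        residual = y - psi[:, idx] @ coef
    if idx:
        c[idx] = coef
    return c
```

For example, with an orthonormal `psi` and a 2-sparse coefficient vector, `omp` recovers the coefficients exactly.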

Figure 3 shows results of this model on a typical grayscale image. While the performance of the previously described algorithm is very good, one can enhance the described sensing model to reduce the number of samples and therefore increase its compression ratio. Note that as described above, each 8 × 8 block was constructed from 16 samples (one for every 2 × 2 region). To increase the compression ratio, we now gather one measurement for each 4 × 4 region, for a total of 4 pixels for every 8 × 8 block. Pushing this further, one can model the entire 8 × 8 block as a single measurement. This results in a compression ratio of 64x.

## 3. 3D integral imaging with compressive sensing

Our goal is to examine the effects of compressive imaging as described in the previous section on passive 3D sensing. Toward this end, we apply the compressive sensing and subsequent reconstruction to elemental images gathered from a passive 3D sensor, and compare the resulting 3D imagery to the results obtained with conventionally sensed elemental images.

Given ground truth elemental images, we train a K-SVD dictionary based upon 8 × 8 image patches. The result is a 64 × 384 dictionary matrix *ϕ*. The sensing matrix was generated as in Fig. 2 but with the higher compression sampling previously described. At the 1/16 compression ratio, 4 measurements are collected for each 8 × 8 block, and at the 1/64 compression ratio, only one measurement is collected for each 8 × 8 block. Each color channel in the elemental image was sensed and reconstructed separately and independently. Figure 4 shows the results for a typical elemental image using OMP as the image reconstruction algorithm.

Figure 5 illustrates the procedure for 3D scene reconstruction where each elemental image has been compressively sensed. First, elemental images $I_{kl}$ are recorded by SAII and sensed at different compression ratios, yielding ${\tilde{I}}_{kl}^{16\text{x}}$ and ${\tilde{I}}_{kl}^{64\text{x}}$. Using the K-SVD dictionary and the OMP algorithm, the elemental images are recovered from these compressive measurements, from which the 3D scene is reconstructed using the following reconstruction equation [8]:

$$I(x,y;z) = \frac{1}{O(x,y)}\sum_{k}\sum_{l} I_{kl}\!\left(x - k\,S_x,\; y - l\,S_y\right),\qquad S_x = \frac{N_x\, p_x\, f}{c_x\, z},\quad S_y = \frac{N_y\, p_y\, f}{c_y\, z},$$

where $f$ is the focal length of the camera lens; $p_x$, $p_y$ are the distances between cameras; $N_x$, $N_y$ are the numbers of pixels of each elemental image in the $x$ and $y$ directions; $c_x$, $c_y$ are the image sensor sizes; $z$ is the reconstruction depth; $S_x$, $S_y$ are the shifting pixel numbers for 3D reconstruction in the $x$ and $y$ directions; $k$, $l$ are the indices of each elemental image; and $O(x,y)$ is the superposition matrix at the reconstruction plane, respectively.
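The reconstruction described above amounts to a shift-and-average back-projection, which can be sketched in NumPy as follows (a minimal sketch; the function name and the integer rounding of the shifts are our simplifications):

```python
import numpy as np

def reconstruct_depth(elemental, f, px, py, cx, cy, z):
    """Computational reconstruction at depth z: shift each elemental image
    I_kl by (k*Sy, l*Sx) pixels and average the overlap.
    elemental: array of shape (K, L, Ny, Nx)."""
    K, L, Ny, Nx = elemental.shape
    Sx = int(round(Nx * px * f / (cx * z)))   # shifting pixel number in x
    Sy = int(round(Ny * py * f / (cy * z)))   # shifting pixel number in y
    H, W = Ny + (K - 1) * Sy, Nx + (L - 1) * Sx
    acc = np.zeros((H, W))
    count = np.zeros((H, W))                  # O(x, y): superposition matrix
    for k in range(K):
        for l in range(L):
            y0, x0 = k * Sy, l * Sx
            acc[y0:y0 + Ny, x0:x0 + Nx] += elemental[k, l]
            count[y0:y0 + Ny, x0:x0 + Nx] += 1
    return acc / np.maximum(count, 1)
```

Objects at depth $z$ line up across the shifted elemental images and appear in focus, while objects at other depths are averaged out.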

## 4. Experimental results

To show the visual performance of 3D integral imaging with compressive sensing, optical experiments were performed. To obtain the 2D elemental images, SAII is used. The focal length of the camera lens is *f* = 50mm, the distance between cameras is *p* = 2mm, 10 × 10 elemental images are used to reconstruct the 3D images, and each elemental image has 2184 × 1456 pixels. Figures 6(a) and 6(b) show the original elemental images with and without the background. The yellow car and the green car are located at *z* = 340mm and *z* = 440mm. The tree and the background are positioned at *z* = 610mm and *z* = 800mm, respectively.

Figures 6(c)-6(f) show the compressively (16x and 64x) sensed elemental images with and without the background, obtained using Eqs. (1)-(3). Using the K-SVD dictionary and the OMP algorithm, the original elemental images may be reconstructed from the compressively sensed elemental images, as shown in Figs. 7(c)-7(f). Then, by using Eqs. (4)-(6) and these reconstructed elemental images, 3D images can be reconstructed as shown in Fig. 8.

The 3D images using 16x CS images with and without background shown in Figs. 8(c) and 8(d) are very comparable to the original reconstructed 3D images in Figs. 8(a) and 8(b). On the other hand, due to the high compression ratio (64x), the 3D images in Figs. 8(e) and 8(f) differ more from the original ones.

To compare the performance of our method with conventional integral imaging, the peak signal-to-noise ratio (PSNR) is used as a performance metric [22]. The PSNR is defined as:

$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{I_{\max}^{2}}{MSE}\right),$$

where $I_{\max}$ is the maximum possible pixel value of the reconstructed image and $MSE$ is the mean squared error between the images reconstructed with conventional and compressive sensing. We used the conventionally reconstructed 3D images in Figs. 8(a) and 8(b) as the ideal reference images. Figure 9 shows the PSNR of the reconstructed 3D images versus reconstruction depth. In Fig. 9, the high PSNR values occur when the scene is reconstructed at the range of each object; i.e., the yellow car is at a range of 340mm – 390mm, the green car at 420mm – 470mm, and the tree at 610mm – 640mm. On the other hand, the PSNR is worse at reconstruction depths of 400mm, 500mm, and 650mm, since none of the objects in the scene are in focus at these depths.
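For reference, the PSNR metric can be computed with a small NumPy helper (the function name and default peak value are ours):

```python
import numpy as np

def psnr(reference, reconstructed, i_max=255.0):
    """PSNR = 10*log10(I_max^2 / MSE) between a reference reconstruction
    and a compressively sensed reconstruction."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstructed, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(i_max ** 2 / mse)
```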

To show that it is possible to recognize the objects in the compressively sensed and reconstructed scenes, we create a maximum average correlation height (MACH) filter [21] using the ideal green car in Fig. 8(b) as the training image. We then correlate it with the ideal and reconstructed scenes, without and with the background, at the desired depth. Here, we provide a brief overview of the MACH filter. This filter has been found to be robust over a broad range of conditions, such as variations in the object’s scale, its orientation, background clutter, and noise. In the Fourier domain, the MACH filter is expressed by [21]:

$$h(u,v) = \frac{M^{*}(u,v)}{\alpha\, D(u,v) + \beta\, S(u,v) + \gamma\, C(u,v)},$$

where *M*(*u*, *v*), *D*(*u*, *v*), and *S*(*u*, *v*) are an object’s average image, power spectrum, and spectral variance, respectively. We assume that the image is corrupted by additive noise with power spectrum *C*(*u*, *v*). The parameters can be chosen to provide an optimal trade-off between the various performance metrics, such as the sharpness of the correlation peak (controlled by *α* on *D*), the stability of the peak value in the presence of distortions (controlled by *β* on *S*), and the response to additive noise (determined by *γ* on *C*). Since it is difficult to minimize all of these quantities simultaneously, the parameters *α*, *β*, and *γ* allow us to arrive at the best overall tradeoff between the performance metrics. By carefully selecting *α* and *β* we can realize the smallest possible increase in correlation energy while minimizing the peak variance. In addition, *γ* can be selected to minimize the variance due to noise. To identify the 3D target, we correlate the MACH filter with the reconstructed image and measure the performance metrics.
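A minimal Fourier-domain sketch of the MACH filter construction and its application follows (NumPy; the function names, the flat noise-spectrum assumption, and the default parameter values are our own illustrative choices):

```python
import numpy as np

def mach_filter(train_images, alpha=1.0, beta=1.0, gamma=1e-3, noise_psd=1.0):
    """MACH filter h = conj(M) / (alpha*D + beta*S + gamma*C), where
    M is the mean spectrum, D the mean power spectrum, and S the spectral
    variance of the training images; C is the assumed noise power spectrum."""
    F = np.fft.fft2(np.asarray(train_images, dtype=float), axes=(-2, -1))
    M = F.mean(axis=0)                       # average image spectrum
    D = (np.abs(F) ** 2).mean(axis=0)        # average power spectrum
    S = (np.abs(F - M) ** 2).mean(axis=0)    # spectral variance
    C = noise_psd * np.ones_like(D)          # flat additive-noise spectrum
    return np.conj(M) / (alpha * D + beta * S + gamma * C)

def correlate(filter_h, scene):
    """Apply the filter by pointwise multiplication in the Fourier domain."""
    return np.real(np.fft.ifft2(filter_h * np.fft.fft2(scene)))
```

Correlating the filter with a scene containing the trained object produces a sharp peak at the object's (circularly shifted) location.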

Figure 10 shows the peak-to-sidelobe ratio (PSR) for each level of compressive sensing. The PSR is defined as the number of standard deviations by which the peak exceeds the mean value of the correlation surface. The figure shows that pattern recognition performance is not degraded by our process at the correct reconstruction depth (420mm – 470mm) at 16x compression. The blue and red lines are the ideal PSR when no compression is present (for the cases with and without background, respectively), and achieve their maximum when the reconstruction depth reaches 450mm (i.e., the green car is in focus). The 16x compression results (the green and black lines) show some drop in performance, but still achieve a distinct maximum PSR at the right depth. Therefore the correct car is recognizable in the 3D reconstruction, even after 16x compressive sensing. It should be noted that the cross-correlation PSR between the green car and the yellow car (at a depth of 340mm – 390mm) does not increase due to the compression/reconstruction process, and no false alarms occur due to the other objects. At 64x compression (brown line), there is marginal separation between the green car at the correct depth and the other objects in the scene. However, in the presence of the background, discrimination between the green car and other objects in the scene is no longer possible.
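The PSR metric used in Fig. 10 can be computed as follows (a small sketch; excluding the peak sample from the sidelobe statistics is our choice, made to avoid a degenerate standard deviation on noise-free surfaces):

```python
import numpy as np

def psr(corr):
    """Peak-to-sidelobe ratio: number of standard deviations by which the
    correlation peak exceeds the mean of the rest of the surface."""
    flat = np.asarray(corr, dtype=float).ravel()
    peak_idx = int(np.argmax(flat))
    side = np.delete(flat, peak_idx)         # sidelobe samples
    return (flat[peak_idx] - side.mean()) / side.std()
```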

We would like to point out that conventional compression algorithms have been applied to elemental images [see, for example, reference 22]. The captured 2D elemental images exhibit a high degree of correlation between the pixels of each elemental image and between adjacent elemental images. The approach in [22] used a 4D transform combining the discrete wavelet transform (DWT) and the discrete cosine transform (DCT) to reduce the dimensionality of elemental images and provide substantial compression. However, conventional compression uses a sample-then-compress approach, which requires an image sensor array with a large number of pixels for full capture of each elemental image, followed by numerical compression algorithms such as the DCT and DWT. In contrast, compressive sensing uses a coded acquisition approach, which allows the capture of elemental images with far fewer image sensor pixels than conventional imaging.

## 5. Conclusion

In conclusion, we have presented a 3D passive integral imaging system using compressive sensing. High resolution 2D elemental images are required for high quality 3D scene reconstruction using integral imaging systems. Compressive sensing provides a way to reduce the amount of data recorded in the elemental images, while reconstruction algorithms such as the K-SVD dictionary and OMP allow us to recover the 2D images with high resolution. Although such reconstructed elemental images may have artifacts due to the high image compression ratio, good quality 3D reconstructed images may be obtained using the computational volumetric reconstruction algorithm of integral imaging. Our experimental results have shown that the proposed 3D imaging system with compressive sensing produces results that are comparable to conventional 3D sensing, but with smaller elemental images. For the images used in our experiments, we were able to demonstrate as much as 64x compression for image reconstruction purposes, whereas recognition remained acceptable up to 16x compression. In addition, unlike 2D imaging, integral imaging can provide 3D information about the scene. Therefore, we expect that the proposed system may be utilized in many applications of 3D imaging and 3D object recognition while alleviating the need to capture high-resolution elemental images. Future work will include examination and comparison of the computation time and memory usage of 3D scene reconstruction and 3D object recognition using compressive sensing and conventional integral imaging to evaluate computing performance.

## References and links

**1. **G. Lippmann, “La Photographie Integrale,” C.-R. Acad. des Sci. **146**, 446–451 (1908).

**2. **L. Yang, M. McCormick, and N. Davies, “Discussion of the optics of a new 3-D imaging system,” Appl. Opt. **27**(21), 4529–4534 (1988). [CrossRef] [PubMed]

**3. **F. Okano, J. Arai, K. Mitani, and M. Okui, “Real-time integral imaging based on extremely high resolution video system,” Proc. IEEE **94**(3), 490–501 (2006). [CrossRef]

**4. **A. Stern and B. Javidi, “Three-dimensional image sensing, visualization, and processing using integral imaging,” Proc. IEEE **94**(3), 591–607 (2006). [CrossRef]

**5. **R. Martinez-Cuenca, G. Saavedra, M. Martinez-Corral, and B. Javidi, “Progress in 3-D multiperspective display by integral imaging,” Proc. IEEE **97**(6), 1067–1077 (2009). [CrossRef]

**6. **J.-H. Park, K. Hong, and B. Lee, “Recent progress in three-dimensional information processing based on integral imaging,” Appl. Opt. **48**(34), H77–H94 (2009). [CrossRef] [PubMed]

**7. **J.-S. Jang and B. Javidi, “Three-dimensional synthetic aperture integral imaging,” Opt. Lett. **27**(13), 1144–1146 (2002). [CrossRef] [PubMed]

**8. **S.-H. Hong, J.-S. Jang, and B. Javidi, “Three-dimensional volumetric object reconstruction using computational integral imaging,” Opt. Express **12**(3), 483–491 (2004). [CrossRef] [PubMed]

**9. **E. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory **52**(2), 489–509 (2006). [CrossRef]

**10. **D. Donoho, “Compressed Sensing,” IEEE Trans. Inf. Theory **52**(4), 1289–1306 (2006). [CrossRef]

**11. **M. Wakin, J. Laska, M. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. Kelly, and R. Baraniuk, “An Architecture for Compressive Imaging,” Proc. Int. Conf. on Image Processing (2006).

**12. **M. Duarte, M. Wakin, S. Sarvotham, D. Baron, and R. Baraniuk, “Distributed Compressed Sensing of Jointly Sparse Signals,” Asilomar Conf. on Signals, Systems, and Computers 1537–1541 (2005).

**13. **D. L. Donoho, Y. Tsaig, I. Drori, and J. L. Starck, “Sparse Solution of Underdetermined Linear Equations by Stagewise Orthogonal Matching Pursuit,” Dep. of Stat., Stanford Univ., Technical Report 2006–2, April (2006).

**14. **R. Muise, “Compressive imaging: An application,” SIAM J. Imaging Science **2**(4), 1255–1276 (2009). [CrossRef]

**15. **D. Bottisti and R. Muise, “Image exploitation from encoded measurements,” Proc. SPIE **8165**, 816518 (2011).

**16. **M. Aharon, M. Elad, and A. M. Bruckstein, “The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process. **54**(11), 4311–4322 (2006). [CrossRef]

**17. **J. Tropp and A. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Trans. Inf. Theory **53**(12), 4655–4666 (2007). [CrossRef]

**18. **Y. Rivenson and A. Stern, “Compressed imaging with a separable sensing operator,” IEEE Signal Process. Lett. **16**(6), 449–452 (2009). [CrossRef]

**19. **A. Stern, “Compressed imaging system with linear sensors,” Opt. Lett. **32**(21), 3077–3079 (2007). [CrossRef] [PubMed]

**20. **Y. Rivenson and A. Stern, “Conditions for practicing compressive Fresnel holography,” Opt. Lett. **36**(17), 3365–3367 (2011). [CrossRef] [PubMed]

**21. **A. Van Nevel and A. Mahalanobis, “Comparative study of MACH filter variants using LADAR imagery,” Opt. Eng. **42**, 541–550 (2002). [CrossRef]

**22. **E. Elhara, A. Stern, O. Hadar, and B. Javidi, “A hybrid compression method for integral images using discrete wavelet transform and discrete cosine transform,” J. Display Technol. **3**(3), 321–325 (2007).