
Needle-based deep-neural-network camera

Open Access

Abstract

We experimentally demonstrate a camera whose primary optic is a cannula/needle (${\rm{diameter}} = {0.22}\;{\rm{mm}}$ and ${\rm{length}} = {12.5}\;{\rm{mm}}$) that acts as a light pipe transporting light intensity from an object plane (35 cm away) to its opposite end. Deep neural networks (DNNs) are used to reconstruct color and grayscale images with a field of view of 18° and an angular resolution of ${\sim}{0.4}^\circ$. We show a large effective demagnification of ${{127}} \times$. Most interestingly, we show that such a camera can achieve close to diffraction-limited performance with an effective numerical aperture of 0.045, depth of focus of ${\sim}{{16}}\;{{\unicode{x00B5}{\rm m}}}$, and resolution close to the sensor pixel size (3.2 µm). When trained on images with depth information, the DNN can create depth maps. Finally, we show DNN-based classification of the EMNIST dataset before and after image reconstruction. The former could be useful for imaging with enhanced privacy.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. INTRODUCTION

Photography without lenses has a long and interesting history, starting with pinhole cameras and, more recently, with what have been referred to as accidental cameras [1]. Optics-free systems for wide-field imaging have also been studied, with an emphasis on large-field-of-view microscopy [2]. We first acknowledge that incoherent imaging without any optics (i.e., with only free-space propagation) has been studied for a long time [3], and it was recognized as belonging to a class of severely ill-posed problems that are uniquely challenging for regularized inversion. Nevertheless, this inverse problem can be solved with regularized matrix inversion in some cases [4,5] as well as with machine-learning techniques [5–7]. Machine learning has been applied to perform classification tasks on raw images [6,7] as well as on anthropocentric images [8,9]. A key advantage of these approaches is that they do not rely on coherent illumination (such as approaches that exploit speckle) [10,11]. Although utilizing coherence can lead to high spatial resolution, in this work we explicitly avoid comparison to such approaches, as they are not readily applicable to general-purpose photography.

Imaging is the process of information recovery from an imperfectly recorded image. Information from an object is lost along the way to the image sensor, where the image is recorded. This loss of information depends strongly on the optical system. Conventional lenses have shown their supremacy in minimizing such information losses. However, there are many situations where conventional lenses cannot be used. One example is minimally invasive deep-brain imaging, where the requirement of minimal brain trauma and low information loss in the imaging channel can be mutually exclusive. Microendoscopes such as surgical cannulas have been used to transport light from inside the brain to an outside sensor, and computational methods can be used to reconstruct the details of the image at the distal end of the cannula, a technique we named computational-cannula microscopy (CCM) [12–14]. CCM can be fast, as no scanning is required. It has already been demonstrated for in situ [15] and machine-learning-enabled 3D microscopy [16,17]. Here we extend this approach to full-color macro photography.

The basic principle relies on the premise that as long as an optical system has a linear space-variant transfer function (point-spread function), it may be possible to invert this transfer function to recover the intensity information of the object being imaged, limited primarily by noise [2]. Machine learning has been shown to be an effective approach for solving such inverse problems [18]. By collecting sufficient data, one can train machine-learning algorithms to implicitly extract the space-variant point-spread functions and thereby recover the object information. By definition, the training data will limit the nature of the objects that can be imaged in this fashion, which is analogous to restricting the solution space (regularizing) with a priori information. Here we demonstrate a camera comprised of a cannula coupled with a deep neural network (DNN) that is trained to recover object information in an anthropocentric format. We further show that it is possible to skip image reconstruction and directly perform classification on the raw recorded images, which could enable cameras with enhanced privacy as well as low power consumption. We experimentally characterized the resolution, field of view (FOV), and depth of focus (DOF) of such a camera with a variety of monochrome and color images.
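As a simple illustration of this inversion principle (separate from the DNN approach used in this work), the forward model of a linear, space-variant, incoherent system can be written as a matrix that maps vectorized object intensities to sensor intensities; once that matrix is calibrated, a Tikhonov-regularized least-squares inversion recovers the object up to noise. The sketch below is a minimal, hypothetical example with made-up sizes and a random transfer matrix, not the calibration of our actual system.

```python
import numpy as np

# Hypothetical sizes: a tiny 16x16 object mapped onto a 16x16 sensor.
n_obj, n_sen = 16 * 16, 16 * 16
rng = np.random.default_rng(0)

# A: calibrated space-variant intensity transfer matrix (sensor = A @ object).
A = rng.random((n_sen, n_obj))

# Simulate a noisy measurement of an unknown object.
x_true = rng.random(n_obj)
y = A @ x_true + 0.01 * rng.standard_normal(n_sen)

# Tikhonov-regularized inversion: x = (A^T A + lam*I)^(-1) A^T y.
lam = 1e-2
x_rec = np.linalg.solve(A.T @ A + lam * np.eye(n_obj), A.T @ y)

print("relative error:", np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))
```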

Fig. 1. (a) Schematic and (b) photograph of our experimental setup. (c) Example images. The input to the neural network is the raw sensor image, and the output reconstructed and ground truth images are shown. Details of our neural network are included in Supplement 1.

Fig. 2. Color image reconstructions. (a) Color EMNIST images (${\rm{SSIM}} = {0.77}$, ${\rm{MAE}} = {0.05}$); (b) color emojis (${\rm{SSIM}} = {0.67}$, ${\rm{MAE}} = {0.1}$); (c) gray fashion MNIST (${\rm{SSIM}} = {0.75}$, ${\rm{MAE}} = {0.06}$). Each image is ${6.5}\;{\rm{cm}} \times {6.5}\;{\rm{cm}}$ (white ${\rm{scale}}\;{\rm{bar}} = {{1}}\;{\rm{cm}}$). Left column: raw sensor image. Center column: ground truth. Right column: DNN output. Object distance is 35 cm.

2. EXPERIMENTS AND NETWORK ARCHITECTURE

Our experimental setup, illustrated in Fig. 1(a), was comprised of a cannula (${\rm{diameter}} = {0.22}\;{\rm{mm}}$, ${\rm{length}} = {12.5}\;{\rm{mm}}$, Thorlabs CFMC52L02). The distal end of the cannula faced the object, a liquid-crystal-display (LCD) monitor (ViewSonic VG910b, size: 19 inch, resolution: ${{1280}} \times {{1024}}$ pixels), placed at a nominal object distance of 35 cm from the distal end of the cannula. The intensity distribution on the proximal end of the cannula was relayed onto a CMOS image sensor (Amscope MU300, ${\rm{pixel}}\;{\rm{size}} = {3.2}\;{{\unicode{x00B5}{\rm m}}}$) via a conventional lens (NAVITAR NMV-12, ${\rm{focal}}\;{\rm{length}} = {{12}}\;{\rm{mm}}$). Different “objects” were displayed on the LCD, and the corresponding images were recorded on the image sensor. We modified the well-known EMNIST dataset [19] by applying random color to the nonzero pixels to create a color EMNIST dataset. Forty thousand images from this modified EMNIST dataset, 20,890 images from a color emoji dataset [20], and 60,000 images from the grayscale fashion MNIST dataset were displayed on the LCD and recorded on the sensor [see the photograph of the system in Fig. 1(b)]. The physical size of each displayed image was 6.5 cm square. The size of each recorded image was ${{160}} \times {{160}}$ pixels (0.512 mm square on the sensor), resulting in an effective demagnification of ${{127}} \times$ along each side.
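The quoted demagnification and the FOV of the 6.5 cm objects follow directly from this geometry; the short check below reproduces them using only the numbers stated above (a sanity check, not part of the reconstruction pipeline).

```python
import math

object_size_mm = 65.0                                 # 6.5 cm square displayed on the LCD
sensor_pixels, pixel_um = 160, 3.2
sensor_size_mm = sensor_pixels * pixel_um / 1000.0    # recorded image footprint on the sensor
print(sensor_size_mm)                                 # 0.512 mm
print(object_size_mm / sensor_size_mm)                # ~127x demagnification per side

object_distance_mm = 350.0
fov_deg = 2 * math.degrees(math.atan((object_size_mm / 2) / object_distance_mm))
print(fov_deg)                                        # ~10.6 deg for the 6.5 cm objects
```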

Figure S1 (Supplement 1) shows the architecture of our DNN (a modified U-net) that was used for image reconstructions. The raw sensor image (${{160}} \times {{160}}\;{\rm{pixels}} \times {{3}}\;{\rm{color}}$ channels) was first downsampled to ${{128}} \times {{128}}\;{\rm{pixels}} \times {{3}}\;{\rm{color}}$ channels as a compromise to keep the DNN small. The sensor pixel size is 3.2 µm, and the reconstruction pixel size is 4 µm due to this downsampling. The U-net is comprised of dense blocks that include two convolutional layers with ReLU activation functions and a batch-normalization layer. Pixel-wise cross entropy was used as the loss function during training. We randomly chose 1000 images from the EMNIST dataset, 489 images from the emoji dataset, and 2000 images from the fashion MNIST dataset, and we used these exclusively for testing. The structural similarity index (SSIM) and the mean absolute error (MAE) were (0.77, 0.04), (0.67, 0.1), and (0.75, 0.06) for the color EMNIST, the color emoji, and the fashion MNIST datasets, respectively. Exemplary images are shown in Fig. 2; the EMNIST images are reconstructed with higher fidelity than the emojis. The larger number of training images in the fashion MNIST dataset results in high-quality reconstructions as well. We note that this experiment corresponds to a FOV of ${{2}} \times {\tan}^{- 1}({{6.5/2/35}})= {{10.6}}^\circ$.
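For concreteness, the sketch below shows the kind of block described above (two convolutional layers, each with ReLU activation and batch normalization), written in PyTorch. It is a minimal illustration only: the channel widths, kernel size, and the full U-net topology (skip connections, down/upsampling) are our assumptions; the actual architecture is given in Supplement 1.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU and batch normalization."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

# Hypothetical input: a batch of raw sensor images downsampled to 128x128x3.
x = torch.randn(4, 3, 128, 128)
print(ConvBlock(3, 32)(x).shape)   # torch.Size([4, 32, 128, 128])
```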

Fig. 3. (a) Estimating resolution by reconstructing squares as small as 2.6 mm, corresponding to an angular resolution of ${\sim}{0.4}^\circ$. The inset shows the cross section through the yellow line. (b) The FOV is estimated as 18°, represented by the clear reconstructions within the red circle. The object distance was 35 cm, and similar results were obtained with color images as well.

Fig. 4. Depth of focus. Average MAE and SSIM of (a) grayscale and (b) color images as functions of object distance. (c) Exemplary images. Left column: ground truth. DNN trained at 35 cm.

3. IMAGING PERFORMANCE

In order to explore the resolution of our camera, we created a new dataset (20,000 images, 1000 of which were used exclusively for testing) comprised of randomly positioned squares, each of size 2.6 mm. After demagnification of ${{127}} \times$, this corresponds to ${\sim}{{5}} \times {{5}}$ DNN pixels (each of size 4 µm), which can be seen in the right panel of Fig. 3(a). We also tried smaller squares, but the sensor signal was too weak. The DNN was able to achieve an average SSIM and MAE of 0.91 and 0.01, respectively [example in Fig. 3(a)]. At an object distance of 35 cm, this corresponds to an angular resolution of ${\sim}{0.4}^\circ$. In addition, we reconstructed a slanted-edge image with a model trained on a grayscale EMNIST dataset and calculated the modulation transfer function (MTF) [21], as shown in Fig. S7. At 10% contrast, we estimate the sensor-side resolution at 178 lp/mm (5.6 µm), which is close to the sensor pixel size of 3.2 µm. The effective numerical aperture (NA) is then ${\sim}{0.078}$. In order to estimate the FOV of the camera, we created a new dataset comprised of objects of total size 21 cm on the LCD [randomly positioned individual squares of size 3.6 mm; see Fig. 3(b); 30,000 total images and 800 used exclusively for testing], and we confirmed that objects within a circle of radius 5.5 cm [red circle in Fig. 3(b)] were reconstructed with good fidelity. Since the object distance was 35 cm, this corresponds to a FOV of 18°. Similar results were obtained with color images as well.
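The angular resolution, slanted-edge resolution, and FOV quoted above follow from simple geometry; the snippet below reproduces them from the numbers in this paragraph (a sanity check only; the NA estimate additionally depends on an assumed wavelength and is therefore not recomputed here).

```python
import math

obj_dist_mm = 350.0

# Smallest reconstructed square: 2.6 mm at a 35 cm object distance.
print(math.degrees(math.atan(2.6 / obj_dist_mm)))        # ~0.43 deg angular resolution
print(2.6 / 127 * 1000 / 4.0)                            # ~5.1 reconstruction pixels (4 um each)

# Slanted-edge MTF: 10% contrast at 178 lp/mm on the sensor side.
print(1000.0 / 178)                                      # ~5.6 um period, vs. 3.2 um pixels

# FOV: objects inside a 5.5 cm radius circle are reconstructed well.
print(2 * math.degrees(math.atan(55.0 / obj_dist_mm)))   # ~17.9 deg, i.e. ~18 deg
```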

In order to estimate the DOF of our cannula-based camera, we created a new dataset with monochrome and color EMNIST and KANJI [22] characters (40,000 total and 1000 used exclusively for testing) at five distances from 29 cm to 41 cm. The network was trained on the data at an object distance of 35 cm. The averaged SSIM and MAE of the reconstructed images are plotted as functions of object distance in Figs. 4(a) and 4(b) for grayscale and color images, respectively. These data and the exemplary images shown in Fig. 4(c) confirm the best performance at 35 cm, and we estimate the depth of field to be ${\sim}{{2}}\;{\rm{mm}}$ in the object plane. With ${{127}} \times$ demagnification, the DOF is estimated as 16 µm (which is close to the diffraction-limited DOF of a lens with ${\rm{NA}} \sim {0.078}$). Therefore, our resolution and DOF results are mutually consistent.
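Scaling the object-side depth of field to the image side is a single division by the demagnification; the one-line check below reproduces the quoted value of ~16 µm.

```python
dof_object_mm = 2.0                    # estimated depth of field in the object plane
demag = 127.0
print(dof_object_mm / demag * 1000)    # ~15.7 um image-side DOF, i.e. ~16 um
```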

We have previously shown that an ancillary DNN can be used to predict the image depth [17]. Following the same approach, here we trained a depth-classification network with 97,500 images captured at five object distances (29, 32, 35, 38, and 41 cm) and used 2500 images exclusively for testing. A total of 20,000 images were recorded for each object distance. The depth-classification network consists of 2D convolution blocks (one 2D convolution layer with a ReLU activation function followed by a batch-normalization layer) with a max-pooling operation between every two blocks and a final classifier [17]. The same reconstruction network as before was also trained. Exemplary results for the grayscale and color images are summarized in Fig. 5. The grayscale and color images achieved (SSIM, MAE) of (0.8, 0.06) and (0.77, 0.05), respectively. The depth-prediction accuracies for the grayscale and color images were 0.9996 and 0.9988, respectively. In the supplementary document, we include additional depth-map experiments with synthetic 3D images (Fig. S5) as well as with images from a tilted plane (Fig. S6). Although these results are preliminary and require improvement, they suggest that with sufficient training data, it should be feasible to generate depth maps and 3D images of objects from a single frame.
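A minimal PyTorch sketch of the depth-classifier structure described above follows (convolution + ReLU + batch-normalization blocks, max pooling between every two blocks, and a final classifier over the five object distances). The channel widths, block count, and pooling/classifier heads below are hypothetical placeholders; the exact architecture is described in [17].

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One 2D convolution with ReLU activation followed by batch normalization.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
    )

class DepthClassifier(nn.Module):
    def __init__(self, n_depths=5):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 16), conv_block(16, 16), nn.MaxPool2d(2),
            conv_block(16, 32), conv_block(32, 32), nn.MaxPool2d(2),
            conv_block(32, 64), conv_block(64, 64), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_depths)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Hypothetical batch of raw 128x128 RGB sensor images, five candidate depths.
logits = DepthClassifier()(torch.randn(2, 3, 128, 128))
print(logits.shape)   # torch.Size([2, 5])
```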

Fig. 5. Depth prediction. A separate DNN can be trained to predict the object distance. Exemplary (a) grayscale and (b) color images with predicted depths (right columns) are shown (${\rm{accuracy}} \gt {{99}}\%$).

Fig. 6. Examples of classification (a) without and (b) with image reconstructions of the EMNIST datasets. Schematics at the bottom describe the data flow. (c) Summary of classification accuracies with 47 classes.

Fig. 7. Classification test accuracy on the EMNIST dataset when trained on a portion of the (a) gray and (b) color data, and (c) when trained on the raw images with fewer classes. Classification confusion matrix for the raw (d) monochrome, and (e) color EMNIST datasets.

Fig. 8. (a) Confusion matrix for classification of the raw fashion MNIST dataset. Examples of classification on the fashion MNIST: (b) raw images and (c) reconstructed images. Top: monochrome and bottom: color.

4. IMAGE CLASSIFICATION USING DEEP NEURAL NETWORKS

Lastly, we show that a DNN can perform classification on the raw data without necessarily performing the reconstructions described above. We initially used two datasets, the modified EMNIST dataset from above and 20,000 monochrome EMNIST images, each of which contains 47 classes. In both cases, 1000 images were used exclusively for testing. For classification, we used the SimpNet [23] architecture, which has been shown to outperform other state-of-the-art networks on many benchmarks while having far fewer parameters, thanks to strategic architecture choices such as minimal max pooling, many similar thin layers, and dropout on all convolutional layers. For reference, we used this architecture to classify the ground truth images of the above datasets. While using only ${\sim}{{1/3}}$ of the total EMNIST training data (${\sim}{{1/6}}$ for the monochrome dataset), the SimpNet model comes within 3% (9% for monochrome) of the state of the art [24]. For all models, we used the Adam optimizer and trained for 35 epochs. The results show that the network performs better when the images are reconstructed prior to classification than when classifying from the raw sensor images (Fig. 6).
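The training protocol above corresponds to a standard supervised classification loop; a hedged PyTorch sketch is shown below, where `model` stands in for the SimpNet-style classifier and `train_loader` for a DataLoader over raw or reconstructed images (both are placeholders, and the learning rate is our assumption, as it is not specified in the text).

```python
import torch
import torch.nn as nn

def train_classifier(model, train_loader, n_epochs=35, lr=1e-3, device="cpu"):
    """Supervised training loop: Adam optimizer and cross-entropy loss, 35 epochs."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(n_epochs):
        model.train()
        for images, labels in train_loader:   # raw or reconstructed images, class labels
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```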

When the amount of training data is reduced (in only the classification stage, not the reconstruction), this gap widens (Fig. 7). Thus, the network trained on the raw images requires more data to produce similar results. We examined the influence of the number of classes by using the raw datasets with images from only 10 classes (digits 0–9) instead of 47. The final training-set sizes were 8628 and 4361 images for the color and monochrome datasets, respectively. These models were trained in the same manner as before. As expected, the classification accuracy increased. In all experiments, the networks were more successful with the monochrome datasets than with the color sets, even with half the number of training images.

We then tested the classification system on the fashion MNIST dataset, which has more data (52,000 training, 6000 validation, 2000 testing images) and fewer classes (10). Using the same network architecture and training scheme as before, the model achieved 84% classification accuracy on the raw images and 85% on the reconstructed images (Fig. 8). For reference, the model achieved 93% classification accuracy on the ground truth images. Again, the results indicate that classification after reconstruction produces better results than classification on the raw data. However, here the difference in performance is much smaller than with the EMNIST dataset, which is likely due in part to the larger amount of training data. Additionally, the larger dataset and the smaller number of classes, as explored above, presumably account for much of the accuracy improvement on all fashion MNIST experiments relative to EMNIST.

5. CONCLUSION

In this paper, we demonstrated a novel camera whose primary imaging element is a needle/cannula of length 12.5 mm and ${\rm{diameter}} = {0.22}\;{\rm{mm}}$, placed 35 cm away from an object. Images were reconstructed using trained DNNs. We confirmed that this computational cannula camera can image with close to sensor-limited resolution and diffraction-limited DOF, and we demonstrated RGB color imaging, a field of view of 18°, and an angular resolution of ${\sim}{0.4}^\circ$. Preliminary experiments with a depth-classifier network showed the potential for 3D or depth imaging. Finally, we also showed that image classification is possible not only from the reconstructed images but also from the raw images without reconstruction. The latter is particularly interesting for application-specific imaging, where power efficiency is important, and for enhanced privacy (since raw images are not directly readable by humans). We further showed experimentally that, with sufficient training images, the accuracy of classification from raw sensor data can be almost as good as classifying from reconstructed images.

Funding

National Institutes of Health (1R21EY030717); National Science Foundation (1533611); University of Utah Undergraduate Research Opportunities Program (UROP).

Acknowledgment

We thank Zhimeng Pan for discussions.

Disclosures

RM: University of Utah (P).

Supplemental document

See Supplement 1 for supporting content.

REFERENCES

1. A. Torralba and W. T. Freeman, “Accidental pinhole and pinspeck cameras,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012).

2. A. Ozcan and E. McLeod, “Lensless imaging and sensing,” Annu. Rev. Biomed. Eng. 18, 77–102 (2016).

3. N. George, “Lensless electronic imaging,” Opt. Commun. 133, 22–26 (1997).

4. G. Kim, K. Isaacson, R. Palmer, and R. Menon, “Lensless photography with only an image sensor,” Appl. Opt. 56, 6450–6456 (2017).

5. G. Kim and R. Menon, “Computational imaging enables a ‘see-through’ lensless camera,” Opt. Express 26, 22826–22836 (2018).

6. S. Nelson, E. Scullion, and R. Menon, “Optics-free imaging of complex, non-sparse QR-codes with deep neural networks,” OSA Contin. 3, 2423–2428 (2020).

7. G. Kim, S. Kapetanovic, R. Palmer, and R. Menon, “Lensless-camera based machine learning for image classification,” arXiv:1709.00408 (2017).

8. Z. Pan, B. Rodriguez, and R. Menon, “Machine-learning enables image reconstruction and classification in a ‘see-through’ camera,” OSA Contin. 3, 401–409 (2020).

9. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica 6, 921–943 (2019).

10. I. Papadopoulos, S. Farahi, C. Moser, and D. Psaltis, “Focusing and scanning light through a multimode optical fiber using digital phase conjugation,” Opt. Express 20, 10583–10590 (2012).

11. O. Katz, P. Heidmann, M. Fink, and S. Gigan, “Non-invasive single-shot imaging through scattering layers and around corners via speckle correlations,” Nat. Photonics 8, 784–790 (2014).

12. G. Kim and R. Menon, “An ultra-small 3D computational microscope,” Appl. Phys. Lett. 105, 061114 (2014).

13. G. Kim, N. Nagarajan, M. Capecchi, and R. Menon, “Cannula-based computational fluorescence microscopy,” Appl. Phys. Lett. 106, 261111 (2015).

14. G. Kim and R. Menon, “Numerical analysis of computational cannula microscopy,” Appl. Opt. 56, D1–D7 (2017).

15. G. Kim, N. Nagarajan, E. Pastuzyn, K. Jenks, M. Capecchi, J. Shepherd, and R. Menon, “Deep-brain imaging via epi-fluorescence computational cannula microscopy,” Sci. Rep. 7, 44791 (2017).

16. R. Guo, Z. Pan, A. Taibi, J. Shepherd, and R. Menon, “Computational cannula microscopy of neurons using neural networks,” Opt. Lett. 45, 2111–2114 (2020).

17. R. Guo, Z. Pan, A. Taibi, J. Shepherd, and R. Menon, “3D computational cannula fluorescence microscopy enabled by artificial neural networks,” Opt. Express 28, 32342–32348 (2020).

18. H. Lin and S. Jegelka, “ResNet with one-neuron hidden layers is a universal approximator,” Adv. Neural Inf. Process. Syst. 31, 6169–6178 (2018).

19. G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” in International Joint Conference on Neural Networks (IJCNN), Anchorage, Alaska, USA (2017), pp. 2921–2926.

20. “Emoji dataset,” https://emojipedia.org/.

21. P. Granton, “Slant edge script,” 2021, https://www.mathworks.com/matlabcentral/fileexchange/28631-slant-edge-script.

22. T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha, “Deep learning for classical Japanese literature,” arXiv:1812.01718 (2018).

23. S. H. Hasanpour, M. Rouhani, M. Fayyaz, M. Sabokrou, and E. Adeli, “Towards principled design of deep convolutional networks: introducing SimpNet,” arXiv:1802.06205 (2018).

24. H. M. Kabir, M. Abdar, S. M. J. Jalali, A. Khosravi, A. F. Atiya, S. Nahavandi, and D. Srinivasan, “SpinalNet: deep neural network with gradual input,” arXiv:2007.03347 (2020).
