Towards practical single-shot phase retrieval with physics-driven deep neural network

Qiuliang Ye; Qiuliang Ye; Li-Wen Wang; Li-Wen Wang; Daniel P. K. Lun

doi:10.1364/OE.496418

1. Introduction

Phase retrieval (PR) intends to reconstruct a complex-valued signal only from its Fourier intensity measurements. It is a key problem in crystallography [1], optical imaging [2,3], inverse scattering [4], diffraction imaging [5], etc. PR is also a crucial component of holographic imaging [6]. The investigation of PR methods was initiated in the 1970s. Numerous reconstruction approaches were developed by the optics research community [1,2,7]. Recently, the developments in modern optimization theories [8,9] and computational imaging [10,11] provided further understanding of the problem. From the mathematical perspective, the PR problem can be expressed as follows [9]:

(1)$$\textrm{Find }\; {\textbf x} \in {\mathrm{\mathbb{C}}^N}\; \; \; \; {{\bf {\cal X}}_m}\; = \; {|{\mathrm{{\cal F}}({{{\textbf h}_m} \circ {\textbf x}} )} |^2}\; ,m = 1,\ldots ,\; \mathrm{{\cal M}},$$

where ${\textbf x}$ is the complex-valued signal of interest; ${\bf {\cal X}}$ denotes the Fourier intensity measurements; ◦ and $\mathrm{{\cal F}}$ stand for elementwise multiplication and Fourier transform, respectively. The pre-determined optical masks ${\textbf h}$ are optional; they are for providing the constraints to lessen the problem’s ill-posedness. Note that the symbols ${{\bf {\cal X}}_m}$ and ${{\textbf h}_m}$ represent the $m$-th measurement and mask, respectively, where $m = 1, \ldots ,\; \mathrm{{\cal M}}$. The total number of masks needed, i.e., $\mathrm{{\cal M}}$, depends on the applications. There are various methods for the implementation of the masks. For example, early PR approaches considered the non-zero signal support as the optical mask [2]. However, these traditional methods cannot ensure globally optimal solutions (uniqueness condition). Their performance will also substantially degrade if the signal support cannot be clearly defined. In recent years, random masks were employed as powerful constraints for the optimization process [8,12]. They can be implemented using a digital micromirror device (DMD) or spatial light modulator (SLM) [13,14]. Although the use of random masks can improve reconstruction performance, the high cost and inaccuracy of the DMD and SLM devices deter the general application of the method. Besides, it is empirically shown in [8,9] that around 4 − 6 measurements are needed for accurate reconstruction with random binary masks. It increases the data acquisition time and is thus undesirable for dynamic applications. For solving (1), different iterative optimization methods, such as the alternating direction method of multipliers (ADMM) [15], Wirtinger Flow [8], etc., are generally used. These approaches are extremely time-consuming. Thus, the resulting PR methods are not suitable for real-time applications.

In the last three decades, deep neural networks (DNN) have been widely studied and successfully applied to different applications [16]. They were also used in the PR problem [17–26]. The deep learning-based PR methods can work with only a single Fourier intensity measurement (which is the so-called single-shot PR). These methods can fit in some feedforward DNN structures to achieve real-time performance when processing with GPU. Nevertheless, the accuracy of these methods still has much room to improve. It is still a challenge to directly infer a complex-valued signal from its Fourier intensity due to the enormous discrepancy between a signal in the Fourier and spatial domains. To improve the accuracy, researchers also suggested the physics-driven method [21] in which the intensity measurement was used to inform a plug-and-play optimization process. However, the method is iterative and as time-consuming as the traditional optimization-based methods. Besides using the plug-and-play structure, researchers also introduced the physics information to the PR process by directly plugging in the HIO algorithm [2] running alternately with a DNN in an iterative procedure [26]. These physics-driven approaches have a common characteristic in that they apply physics information to a traditional optimization method and let it work iteratively with a network model. However, in this case, the system cannot be end-to-end trained. The estimation error of the optimization algorithm may have a special distribution that is unknown to the network model. The iterative process can thus be trapped at a local minimum and fails to give the best solution.

One common problem of these deep learning-based PR methods is that most of them are trained and tested with simulation data. Except for noises, they often ignored the other artifacts in practical Fourier intensity measurements. It makes their reported results unreliable. Most PR applications involve structured images, which have energy concentrated at low frequencies (in particular, at zero frequency). The dynamic range of the data in an intensity measurement is thus extremely large. Most general imaging devices nowadays have only a 12 to 16 bits dynamic range. It makes the low-frequency part of the intensity image severely saturated. As we have shown in Section 4.6, the saturation problem can significantly affect the performance of PR methods.

To rectify the abovementioned problems, we propose in this paper a novel physics-driven deep learning-based PR method dubbed PPRNet (PPRNet stands for Physics-driven Phase Retrieval Network). Similar to other deep learning-based PR methods, PPRNet requires only a single Fourier intensity measurement for each PR reconstruction. It, however, gives a much higher accuracy by using a physics-driven method that guides the network to follow the Fourier intensity measurement at different scales to reconstruct the images. The resulting network structure is still of feedforward type. Thus, it is much faster than the iterative physics-driven approaches. The whole network can be end-to-end trained to give the best result. PPRNet is enabled by a novel Hybrid Unwinding Block (HUB) embedded in a multi-scale convolutional neural network (CNN) structure. It directs the input feature map into two paths such that the global and local information of the feature map at different scales are separately processed with physics informed. The data are then combined with a channel attention method so that the significant features are collected for reconstructing the images. To evaluate the generality of the proposed PPRNet, we have conducted a series of simulations with 2 different datasets that contain complex-valued images with linearly correlated and uncorrelated magnitude and phase. All data in the intensity image were capped and quantized as 12-bit integers to simulate the saturation problem with quantization errors. We then adopted the defocusing method in [27–29] to mitigate the saturation problem in the intensity images. As shown in the simulation results, the proposed PPRNet significantly outperforms state-of-the-art (SOTA) deep learning-based PR methods. To understand the performance of the proposed PPRNet in practical applications, we constructed an optical setup for generating the intensity measurements of phase-only images obtained from 3 datasets. Similar to the simulation, the defocusing method was used to mitigate the saturation problem when preparing the datasets. They were then used for the training and testing of the proposed PPRNet. Based on these experimental data, we compared the proposed PPRNet with SOTA deep learning-based PR methods. A significant improvement in accuracy is noted. It also has lower complexity than other physics-driven PR methods.

To summarize, the contribution of this work is three-fold:

1. A physics-driven deep learning-based PR method, namely PPRNet, is developed. It requires only a single Fourier intensity measurement for each PR reconstruction without the need for any masking scheme to constrain the measurement. It allows the Fourier intensity measurement to inform the training and inferencing processes so that the model is guided to give the right solution. Experimental results have demonstrated the effectiveness of this approach and the improvement it brings over the traditional deep learning-based PR methods. PPRNet has a non-iterative feedforward structure and is end-to-end trained. It has lower complexity than the existing physics-driven approaches while having higher accuracy.
2. A novel Hybrid Unwinding Block (HUB) is proposed. While the proposed PPRNet has a multi-scale structure, HUBs are embedded at different scales of the network to facilitate the utilization of physics information to guide the training and inferencing of the network. It separately processes the global and local information of the feature maps with the aid of the Fourier intensity measurement and combines them with a channel attention method.
3. Different from the traditional deep learning-based PR approaches that are trained and tested with simulation data, we construct an optical platform to evaluate the proposed PPRNet and compare it with SOTA deep learning-based PR approaches. The results are thus more reliable to reflect the true performances in practical applications.

This paper is organized as follows: Section 2 provides a review of the traditional optimization-based algorithms and deep learning-based PR approaches. Section 3 introduces the proposed PPRNet. The simulation and experiment results with comparisons are shown in Sections 4 and 5, respectively. Finally, we conclude the paper in Section 6.

2. Related works

2.1 Optimization-based PR algorithms

In the early days, PR methods were mainly based on the iterative alternating minimization (AM) framework [1,2,7]. The estimated image $\tilde{{\textbf x}} \in {\mathrm{\mathbb{C}}^{N \times N}}$ is updated iteratively between the spatial and Fourier domain. Although the AM framework offers a portable solution, the AM-based algorithms are prone to stagnation and slow to converge (usually, more than 1000 iterations are needed). Besides, they are sensitive to initialization.

In recent years, the Wirtinger flow (WF) PR algorithm was developed with the advancement of modern optimization theories [8]. Different from the AM framework, WF solves the phase retrieval problem through gradient descent:

(2)$${\tilde{{\textbf x}}^{k + 1}}:{\tilde{{\textbf x}}^k} - {\mu ^{k + 1}}\nabla f({{{\tilde{{\textbf x}}}^k}} ),$$

where $\nabla f({\textbf x} )$ represents the first-order gradient descent of the loss function (i.e., MSE loss), and ${\mu ^{k + 1}}$ is the step size at the current iteration. Empirically, 4 to 8 measurements are needed for a globally optimal solution [8,9]. Although WF provides a theoretical guarantee for convergence to the globally minimal solution, it often fails to converge to a satisfactory result if only one intensity measurement is given.

2.2 Deep Learning-based PR methods

In recent years, deep learning-based PR approaches have been widely studied since they provide non-iterative inferences compared with time-consuming optimization-based algorithms. Most of these approaches can work with only a single Fourier intensity measurement for each PR reconstruction without the need for any additional constraints on the measurement. This seemingly impossible task has a theoretical basis [30]. It is known that if the Fourier intensity measurement is oversampled by two or more times in each dimension, the original complex-valued signal can be uniquely reconstructed, except for trivial ambiguities. Although such a reconstruction problem is highly non-convex (it is the reason why the traditional optimization methods fail to perform), it is particularly suitable for deep learning-based methods due to their non-linear nature. Besides, they can also make use of the statistics acquired from a huge dataset to infer the solution. In general, deep learning-based PR methods can be split into two categories depending on whether the underlying physics is adopted in the networks.

For the first category, a feedforward network is used to estimate the target images directly from a Fourier intensity measurement [17,22–24,29,31–33]. Specifically, [24] proposed a two-branch CNN to reconstruct the magnitude and phase part from an oversampled Fourier intensity measurement, and [31] applied this network to the 3D crystal PR problem. The approach shows reasonably good performance for simple images with brief details. The performance when dealing with complex images is unknown. [22] applied the conditional generative adversarial network to reconstruct the images. The network is extremely large since multiple multi-layer perceptrons (MLP) are used for all stages of the network. [32] adopted the ResNet [34] structure for Fourier PR tasks. Only a simple dataset (MNIST) was used for testing, and the performance was only barely satisfactory. The error in the details was still rather large. [17] implemented a UNet structure [35] to reconstruct the phase-only images from Fresnel diffraction patterns. Its performance with Fraunhofer diffraction patterns (Fourier intensity) is unknown. As for [33], the authors adopted a multiple-resolution UNet structure and connected the hidden layers in the decoder to additional convolution layers to produce coarse outputs in an attempt to match the low-frequency components. Only the result of using 2 measurements is shown in the paper, and the blurring effect is quite significant, as shown in their result. Recently, our team also developed a feedforward DNN structure to tackle the PR problem [29]. It has an MLP front end for feature extraction and a residual attention-based reconstruction unit to generate the phase images. Although it outperforms most of the existing methods, its performance when dealing with more complex images still has room for further improvement.

To improve the quality of the reconstructed images, the second category of deep learning-based PR methods implicitly or explicitly utilizes the underlying physics in the models [19,21,25,26,36,37]. For instance, [36] made use of physics information to perform a learnable spectral initialization [8]. It is followed by a double branch UNet for reconstruction. The approach requires an additional masking scheme to impose constraints on the measurement. The reconstructed images are rather noisy, even for simple images. [23] proposed to use MLP of different sizes in a cascaded network. The intensity measurement is applied to each MLP to assist the training and inferencing. The network size is very large due to the use of multiple MLPs. And the approach fails to reconstruct the details in the images. There are other iterative approaches similar to traditional optimization-based methods. For instance, [26] suggested an iterative method with a 3-step structure: HIO initialization, iterative update between UNet and HIO, and final refinement by UNet. [21] proposed to embed a pre-trained DnCNN [38] into a plug-and-play iterative algorithm for refining the estimated images at each iteration. These iterative physics-driven approaches are usually quite time-consuming. Besides, they cannot be end-to-end trained, which often affects the overall performance.

3. Proposed approach

3.1 Defocus-based Fourier phase retrieval system

A Fourier phase retrieval system reconstructs an image ${\textbf x} \in {\mathrm{\mathbb{C}}^{N \times N}}$ from its Fourier intensity measurement ${\bf {\cal X}} \in {\mathrm{\mathbb{R}}^{M \times M}}.$ However, as mentioned in Section 2, the saturation problem often happens when using standard imaging devices to directly capture Fourier intensity images. It is due to the large dynamic range of Fourier intensity data. Besides, we can also find many zero-valued data in an intensity measurement. Most of them come from the high-frequency parts of the measurement. They have originally very small values but are quantized to zero, resulting in the so-called dead pixels. As a result, a Fourier intensity measurement often contains many errors in both the low-frequency and high-frequency regions.

Researchers have suggested a few solutions to the problem. One of them is the defocusing method [27–29] which is adopted in this study. Specifically, we reduce the dynamic range of the intensity measurement by convolving it with a defocus kernel ${\textbf D},$ described mathematically as follows:

(3)$${X_L}({u^{\prime},v^{\prime}} )= C\mathrm{\int\!\!\!\int }X({u,v} )D\left( {u - \frac{{u^{\prime}}}{{\lambda L}},v - \frac{{v^{\prime}}}{{\lambda L}}} \right)dudv,$$

where ${X_L}$ denotes the optical field on the defocus plane, $\lambda $ denotes the wavelength, L represents the distance between the camera and the Fourier plane, and C is a constant. In practice, the convolution operation can be easily implemented by moving the camera beyond the Fourier plane. Figure 1 shows the optical path of a defocus-based PR system. As shown in the figure, the object of interest is illuminated with a coherent light generated by a laser beam. The camera is placed beyond the focal plane such that a defocused Fourier intensity is captured. On the other hand, the convolution in (3) can also be implemented by an element-wise multiplication of the original image x and a defocus kernel ${\textbf d}$ in the spatial domain (where ${\textbf d}$ is the inverse Fourier transform of ${\textbf D}$). In our simulation and experiment, we directly implemented ${\textbf d}$ together with the testing images on an SLM. More details will be given in Section 5.

Fig. 1. The optical path of the defocus-based PR system.

Download Full Size | PDF

3.2 Overall architecture

Figure 2 illustrates the overall architecture of the proposed PPRNet, which is designed to take a Fourier intensity measurement ${\bf {\cal X}} \in {\mathrm{\mathbb{R}}^{M \times M \times 1}}$ as input and the estimated real and imaginary images ${\textbf x} \in {\mathrm{\mathbb{R}}^{N \times N \times 2}}$ as its output. As mentioned above, the oversampling requirement needs to be fulfilled for unique reconstruction. So, in our experiment, M and N are set as 762 and 128, respectively. For the proposed structure, an initial guess $\hat{{\textbf x}} \in {\mathrm{\mathbb{R}}^{N \times N \times C}}$ is generated through the Init Block, where C is the number of channels ($C$ is set as 64 in our experiment). In the Init Block, a random image is first generated and refined by a HUB, which is a learnable physics-informed network structure (will be described in Section 3.4). The multi-scale structure has a contracting and expanding path, similar to [35]. To extract the essential features, the size of the feature map is gradually reduced by a number of DS Blocks in the contracting path. Each DS Block has three Convolution Blocks (ConvB) and each ConvB contains a convolution layer with $3 \times 3$ filters, plus an Instance Norm layer and a LeakyReLU activation function. The convolutional layers of the 3 ConvBs have strides 2, 1, and 1 such that the dimension of a feature vector is downsampled to half after passing through a DS Block.

Fig. 2. Architecture of the proposed PPRNet. The most significant output feature map after each HUB is shown at the bottom of the figure. The sizes of the feature maps are adjusted to be the same for display.

Download Full Size | PDF

In the expanding path, the US Blocks gradually enlarge the feature maps to the original size using the nearest neighbor interpolation function. Most importantly, the introduction of HUBs at various scales in the expanding path allows the reconstruction to be guided by the underlying physics. Note that we only have HUBs in the expanding path but not the contracting path since the role of HUB is to guide the reconstruction but not the encoding process. Finally, the PP Block, which contains two convolutional layers, reconstructs the real and imaginary images from the obtained feature maps. The proposed multi-scale structure contains skip connections to allow the expanding path to employ the fine-grained features discovered in the contracting path, much like the conventional multiscale structure does. Examples of the feature maps at various scales are shown in Fig. 2. Note that only the most important feature map for each scale is displayed. As the scale increases, it can be observed that the quality of the feature maps gradually improves.

3.3 Initialization

In classic optimization-based algorithms, initialization is an important step as it gives the optimization process a proper starting point. The optimization can avoid becoming stuck in unwanted saddle points with proper initialization [8]. Deep learning models are typically sensitive to the quality of the inputs. When initialization is done incorrectly, more noise is introduced, which lowers the performance of the deep learning model. To enhance the initialization process, we suggest incorporating it into the learnable network structure. First, we randomly generate a spatial image ${\tilde{{\textbf x}}_{init}} \sim \mathrm{{\cal N}}({0,1} )\in {\mathrm{\mathbb{R}}^{N \times N}}$ and feed it into a network that contains a 1 × 1 convolutional layer to convert the data into $C = 64$ channels. Note that the 1 × 1 convolution is commonly used in deep neural networks for feature size conversion. After that, they are delivered to a HUB to produce the first guess for the multi-scale network. The HUB provides a superior initial estimation of the target image than the conventional techniques that just depend on the existing physics knowledge. It is because it is physics-informed and trained end-to-end with the other components of the network. Figure 2 presents a case in point. It can be observed that the Init Block reconstructs the object’s overall shape. It is up to the following multi-scale structure to improve the estimation further.

3.4 Hybrid unwinding block (HUB)

HUB is the core building block of PPRNet. Figure 3 shows its structure. The input feature maps $\tilde{{\textbf x}} \in {\mathrm{\mathbb{R}}^{H \times W \times C}}$ are first divided into two processing branches (where H and W are set as N in our experiments). ${\textbf u} \in {\mathrm{\mathbb{R}}^{H \times W \times 2}}$ are the feature maps’ first two channels and are used by the Physics-driven Unwinding Block (PUB) as input. The Feature Refinement Block (FRB) then processes the remaining channels ${\textbf v} \in {\mathrm{\mathbb{R}}^{H \times W \times ({C - 2} )}}$ to give the local features. The outputs of PUB and FRB, designated as $\tilde{{\textbf u}}$ and $\tilde{{\textbf v}}$, respectively, concatenate with the input feature maps and are sent to the Feature Fusion Block (FFB). FFB makes use of the channel attention method to re-weigh these feature maps to extract the significant features for sending to the next stage.

Fig. 3. Structure of the Hybrid Unwinding Block (HUB).

Download Full Size | PDF

3.4.1 Physics-driven unwinding block (PUB)

Utilizing the intensity measurement as prior knowledge to constrain the optimization is a common strategy of conventional physics-driven PR approaches. However, since the constraint must be repeatedly applied to produce the desired effect, it frequently results in an iterative process. We suggest PUB, which converts the iterative procedure into a feedforward network operation, as a solution to the issue. The structure of PUB is depicted in Fig. 4. It is seen that a PUB consists of K unwinding cascading layers, where $K = 5$ in our experiment. For the $({k + 1} )$ tnwinding layer, ${{\textbf u}_k} \in {\mathrm{\mathbb{R}}^{H \times W \times 2}}$ forms the real and imaginary parts of the input feature map. They are first transformed to the Fourier domain as follows:

(4)$${{\textbf U}_k} = |{{{\textbf U}_k}} |{e^{j{\phi _k}}} = \mathrm{{\cal F}}({{{\textbf u}_k}} ).$$

Then, we update $|{{{\textbf U}_k}} |$ with the intensity measurement in order to constrain the estimation. Note that HUB is utilized at various scales throughout the expanding path. We need to first convert the intensity measurement to the appropriate scale to apply it to the PUB of that scale. The updated ${\textbf U}_k^{\prime}$ is then transformed back to the spatial domain, creating an updated image ${\textbf u}_k^{\prime}$. The entire process can be stated as:

(5)$${\textbf u}_k^{\prime} = {\mathrm{{\cal F}}^{ - 1}}\left( {\sqrt {S({\bf {\cal X}} )} {e^{j{\phi_k}}}} \right),$$

where $S({\bf {\cal X}} )$ refers to the filtering and folding of ${\bf {\cal X}}$ to convert it to the required scale without aliasing. It serves as the magnitude constraint that brings domain-specific prior knowledge to the reconstruction process. It informs the network of the reconstruction target at every scale of the expanding path. ${\textbf u}_k^{\prime}$ is then fed to an eight-layer convolutional CNN structure ${g_k}(. )$ to learn the missing phase information. ${g_k}$ is trained to produce the residue of the required output rather than directly generating the refined image. It lessens the difficulty of the training. The following is an expression for the operation:

(6)$${{\textbf u}_{k + 1}} = {g_k}({{\textbf u}_k^{\prime}} )+ {\beta _k}{{\textbf u}_k},$$

where ${\beta _k}$ is a learnable parameter. The resulting image ${{\textbf u}_{k + 1}}$ then serves as the input for the following PUB layer. The amount of magnitude constrain that is incorporated in ${{\textbf u}_{k + 1}}$ is controlled by the learnable parameters ${\beta _k}$ and those in the CNN structure ${g_k}$. Together with FFB, they give the network the freedom to choose, through end-to-end training, how much physics information should be used in the reconstruction process.

Fig. 4. Structure of the Physics-driven Unwinding Block (PUB).

Download Full Size | PDF

3.4.2 Feature refinement block (FRB) and feature fusion block (FFB)

Both global and local elements can be found in images. The PUB shown above seeks to constrain the reconstruction of the image using the Fourier intensity measurement. However, the Fourier transform can only reveal an image's overall frequency information. The local features in the image cannot be properly informed by the magnitude constraint alone. To improve the local processing power, we suggest a Feature Refinement Block (FRB) in addition to PUB to enrich the detailed structures corresponding to the local features. FRB has a shallow structure that contains 3 ConvBs. They can learn detailed representations such as edges, corners, etc., from the input feature maps. They supplement the global information from PUB to provide comprehensive representations for the reconstruction process. The output feature maps of PUB and FRB are concatenated with the input feature maps to form the resulting feature maps of size $({H,W,2C} ).$ However, these 2C channels of feature maps obviously have different significance to the reconstruction. To adaptively re-weight these channels based on their significance, we use a Feature Fusion Block (FFB), which is effectively a channel attention network [39] as illustrated in Fig. 5. We refer to [39] for the details of the operation of the channel attention network. After that, the weighted feature maps are integrated to form the output of HUB.

Fig. 5. Structure of the Feature Fusion Block (FFB).

Download Full Size | PDF

3.4.3 Visualize the effect of hybrid unwinding block

To gain a deeper insight into the operation of HUB, we show in Fig. 6 the feature maps generated by different components of the HUBs of the first two scales when inferencing an example image. The ground truth and the final estimated image are shown at the top. Let us start with the input feature maps for the HUB at Scale 2. The feature maps contain 256 channels, with the first two fed to PUB and the rest fed to FRB. The outputs of PUB and FRB are shown. For better visualization of the FRB branch, the average feature maps of all channels are presented. It can be seen that PUB tends to give the global structure of the image, and FRB tends to give the details. These feature maps are then combined by FFB using the channel attention method. For FFB, we show the two output channels with the largest weights (denoted as Max) and two with the smallest weights (denoted as Min). The feature maps given by the channels with the largest weights have already contained the basic structure of the ground truth. They are upsampled by the US Block and sent to Scale 1 for further enhancement. At Scale 1, the feature maps given by PUB are already very close to the ground truth. They are further processed to give the final output. To summarize, it can be seen in Fig. 6 that the FFB outputs continuously improve from the lower scale to the upper scale. The physics information applied to the PUBs at different scales has played an important role in this enhancement process. It guides the network to reconstruct the image in the right direction not only at the training stage but also at the inferencing stage.

Fig. 6. Visualization of the feature maps generated by different components of the HUBs of the first two scales. The example image is from the COCO dataset.

Download Full Size | PDF

3.5 Loss function

The phase retrieval challenge is treated as a supervised learning problem in this study. Given a Fourier intensity measurement ${\bf {\cal X}}$, there is a corresponding complex-valued image x that serves as the ground-truth reference. As a result, we can calculate the difference between the estimated image $\hat{{\textbf x}}$ and the ground truth image x. The loss function that we employed consists of two components: the pixel-wise loss ${\mathrm{{\cal L}}_{pixel}}$, which reduces the pixel-wise difference between the estimated and ground-truth images; and the total variation (TV) loss ${\mathrm{{\cal L}}_{TV}},$ which employs a smoothness prior to regularize the estimated image while preserving its edges and textures. The formulation of the total loss function $\mathrm{{\cal L}}$ is:

(7)$$\mathrm{{\cal L}}({\hat{{\textbf x}},{\textbf x}} )= {\mathrm{{\cal L}}_{pixel}}({\hat{{\textbf x}},{\textbf x}} )+ \gamma {\mathrm{{\cal L}}_{TV}}({\hat{{\textbf x}}} ),$$

where $\gamma \in \mathrm{\mathbb{R}}$ denotes the coefficient to balance different loss terms. ${\mathrm{{\cal L}}_{pixel}}$ is defined as the ${l_1}$-norm distance between the estimation and ground-truth images, defined as follows:

(8)$${\mathrm{{\cal L}}_{pixel}}({\hat{{\textbf x}},{\textbf x}} )= \frac{1}{{2{N^2}}}\mathop \sum \limits_{i = 1}^N \mathop \sum \limits_{j = 1}^N \mathop \sum \limits_{k = 1}^2 ({|{x_{i,j,k}^{Re} - \hat{x}_{i,j,k}^{Re}} |+ |{x_{i,j,k}^{Im} - \hat{x}_{i,j,k}^{Im}} |} ),$$

where ${(. )^{Re}}$ and ${(. )^{Im}}$ represent the real and imaginary parts of the image, respectively. While preserving the image details, ${\mathrm{{\cal L}}_{pixel}}$ promotes the fidelity of the estimated images.

The TV norm sums all local horizontal and vertical gradients. By promoting spatial smoothness in the estimated image, TV regularization can help reduce noise and improve the consistency of the reconstructed image. ${\mathrm{{\cal L}}_{TV}}$ is defined as:

(9)$${\mathrm{{\cal L}}_{TV}}({\hat{{\textbf x}}} )= \frac{1}{{2{N^2}}}\mathop \sum \limits_{i = 1}^N \mathop \sum \limits_{j = 1}^N \mathop \sum \limits_{k = 1}^2 ({{{({{{\hat{x}}_{i,j,k}} - {{\hat{x}}_{i + 1,j,k}}} )}^2} + {{({{{\hat{x}}_{i,j,k}} - {{\hat{x}}_{i,j + 1,k}}} )}^2}} ),$$

4. Simulation

4.1 Datasets

We performed extensive simulations to test the efficacy of the proposed PPRNet. To prepare the training and testing samples, we first gathered images from two publicly accessible datasets: the Real-world Affective Faces (RAF) dataset [41] and the Fashion-MNIST dataset [42]. The images were resized to $128 \times 128$ pixels and made grayscale. Then, we created two complex-valued image datasets (i.e., each target image has both the magnitude and phase parts) using two different combinations of these images. The Fashion-MNIST dataset's images were used to create the first dataset, of which the images have linearly correlated magnitude and phase components. Each image ${{\boldsymbol x}_{raw}}$ that was taken from the dataset was scaled to [0, 1] and used as the final image's magnitude component ${{\boldsymbol x}_{mag}}$. We then used an exponential function to obtain the phase component, i.e., ${{\boldsymbol x}_{phase}} = exp({2\pi i{{\boldsymbol x}_{raw}}} )$. Finally, we combined the magnitude and phase components by ${\boldsymbol x} = {{\boldsymbol x}_{mag}} \circ {{\boldsymbol x}_{phase}}$, where ◦ denotes the element-wise multiplication. To build our training and test set, we used the first 25,000 and 1,000 images from the Fashion-MNIST's training and testing datasets, respectively. The second dataset we constructed contains images with uncorrelated magnitude and phase parts. We utilized the same method as before to rescale the data, but to produce the magnitude and phase components using images from two distinct datasets, RAF and Fashion-MNIST. The training set contained 12,771 images, whereas the testing set contained 1,000 images.

4.2 Defocused intensity measurements

To simulate the defocused intensity measurements as discussed in Section 3.1, we multiplied all testing images with a defocus kernel, which was generated by a built-in function of Holoeye SLM control software corresponding to the defocusing distance of 30 mm. The same defocus kernel was used in the experiments on the optical platform, as will be discussed later. Note that other defocusing distance is also possible. Our experiment shows that a similar performance can be obtained with the defocusing distance set from 15 mm to 50 mm. Then, 2-D FFT was performed on these images, and the magnitudes of the resulting images were extracted to become the intensity measurements. The resolution of the measurements is 762 × 762 pixels. To simulate the saturation and quantization errors, we capped the measurements with the maximum limit of 4,095 (12-bit) and quantized the numbers to the integer format.

4.3 Training details and evaluation metrics

We used the Adam optimizer [43] to train our network. The batch size was set to 24, and the learning rate was set to ${10^{ - 4}}$. On a computer with two NVIDIA RTX3090 GPUs and the PyTorch deep learning platform, the network was trained for 160 epochs. The weighting factor $\gamma $ in the loss function (7) was empirically fixed to 0.1.

We evaluate the quality of the reconstructed images using the Peak Signal-to-Noise Ratio (PSNR, higher the better), Structural Similarity (SSIM, higher the better), and Mean Absolute Error (MAE, lower the better). They are commonly used in the field of image restoration to gauge the accuracy of estimates. To make sure that no negative values are used in the PSNR calculation, the phase parts of all images are shifted by π. For comparison, the average PSNR, SSIM, and MAE for the 1,000 testing images of both datasets are used.

4.4 Compared methods

Since the application domains of the optimization-based and learning-based approaches can be quite different, we only compare the performance of PPRNet with SOTA learning-based PR methods, namely PRCGAN [22], HIO-UNet [26], LenlessNet [17], NNPhase [24], and MCNN [33]. Note that some of these approaches are designed under much trivial input and output requirements as compared with those in this paper. For instance, NNPhase and PRCGAN assume the input intensity measurement is very small (only $64 \times 64$ and $28 \times 28$ pixels, respectively, as compared to $762 \times 762$ in the simulation environment). And PRCGAN assumes the target image only has real-valued data (as compared to having both real and imaginary data in the simulation environment). To allow these networks to adapt to the simulation environment, a small modification to these networks is required. Specifically, we modified the PRCGAN and NNPhase networks such that they could accept input of 128 × 128 pixels. The input data's dimension was then changed from $762 \times 762$ to $128 \times 128$ pixels by placing a pre-processing block in front of the original network. It has two convolutional layers with a 5 × 5 kernel and strides 3 and 4, respectively. Additionally, we modified HIO-UNet and LenlessNet's output to let them have two channels for real-imaginary representation, and PRCGAN's output to have two channels for magnitude-phase representation. These adjustments enabled these networks to function in the simulation environment.

4.5 Simulation results

Although the defocused intensity measurement is not the exact intensity measurement of the image, PR methods using the defocused intensity measurements usually perform better than using the original measurements with saturated and dead pixels. Some examples are given in Fig. 7. In the figure, we first show the performance of HIO [2] and prDeep [21], which represent the optimization-based and deep learning-based PR methods. They were implemented following the setting in their original papers (i.e., without saturation and dead pixels). The results are similar to those reported in their original papers, as shown in the second column of Fig. 7 and Table 1. Then, we introduced the saturation problem and dead pixels to the intensity measurements by capping the image data and quantizing them as 12-bit integers. It can be seen that the performances of these methods drop substantially when the imperfections in the measurements are taken into account. Finally, we used the defocusing method as mentioned above to mitigate the problem. As shown in the fourth column of Fig. 7 and Table 1, both approaches can significantly benefit from using the defocusing method. An improvement of up to 15 dB can be achieved. For this reason, we followed [27–29] so that all comparing methods were trained and tested with the defocused intensity measurements.

Fig. 7. Qualitative evaluation results of HIO [2] and prDeep [21] with and without using the defocused kernel.

Download Full Size | PDF

Table 1. Quantitative evaluation results of HIO [2] and prDeep [21] with and without using the defocused kernel. The results are the averages of 6 natural images used in [21].

View Table | View all tables in this article

The qualitative and quantitative comparison results are shown in Fig. 8 and Table 2. To save space, we do not include the qualitative results of LenslessNet in the comparison due to its subpar performance. Besides, we also include the result of HIO [2] just as a reference. As shown in Fig. 8(b), the proposed PPRNet outperforms all compared learning-based approaches for both datasets. Specifically, only PPRNet can reconstruct the details in the images, such as the “Lee” characters and the plaid on the shirt. The images given by other methods are of poor quality. On the other hand, as shown in Fig. 8(a), all competing algorithms struggle with the dataset containing uncorrelated magnitude and phase components. It is particularly the case for PRCGAN and HIO-UNet since they were originally intended to only output real-valued images. As seen in Fig. 8(a), there is a clear distortion in their magnitude images. The output images for techniques like NNPhase and MCNN are somewhat hazy and different from the intended images. These approaches also cannot reconstruct the details in the phase images. When compared to the aforementioned methods, the proposed PPRNet performs the best. The reconstructed magnitude and phase images closely follow the target images, especially the phase images. Occasionally, we can spot some artifacts in the magnitude images, but they are not serious.

Fig. 8. Qualitative simulation results of different phase retrieval methods on two datasets with images having (a) uncorrelated and (b) linearly correlated magnitude and phase components. For each image, the Fourier intensity measurements with pixel values ranging from 0 to 4,095 are shown in the first column. The ground truth of each target image is shown in the second column, with the magnitude part shown on top and the phase part shown right below it. The colormaps of the images can also be found. The third to eighth columns show the reconstructed magnitude and phase images through different methods. Zoom in for a better view.

Download Full Size | PDF

Table 2. Quantitative simulation results (average PSNR/SSIM/MAE) of different phase retrieval methods on two datasets.

View Table | View all tables in this article

In Table 2, the quantitative comparison is displayed. The proposed PPRNet outperforms all compared learning-based approaches measured using various metrics. Specifically, the proposed PPRNet achieves average PSNR and SSIM improvements of at least 5.818 dB and 0.146, respectively, for the magnitude portion. The increases in PSNR and SSIM gains for the phase component can be as high as 9.192 dB and 0.106, respectively. The aforementioned simulation results demonstrate that the proposed PPRNet outperforms SOTA learning-based PR techniques. We will show in Section 5 that the same conclusion can be drawn when testing these methods on an optical platform.

An ablation analysis of various design parameters of PPRNet is given in the supplementary document.

5. Experiment on optical platform

To understand the performance of the proposed PPRNet when applied to practical applications, we constructed an optical platform for collecting realistic intensity measurements for training and testing the network. The platform largely follows the optical path in Fig. 1. A snapshot of it is shown in Fig. 9. It comprises a Thorlabs 10 mW HeNe laser with wavelength $\lambda $ = 632.8 nm and a 12-bit 1920 × 1200 Kiralux CMOS camera with a pixel pitch 5.86 µm. We utilized a 1920 × 1080 Holoeye Pluto phase-only SLM with a pixel pitch ${\delta _{SLM}}$= 8 µm to generate the target images. The SLM can impose different phase shifts to the coherent light that gets through it. It effectively synthesizes the objects in phase imaging applications [12,17]. Since the ground truth is known, we can easily evaluate the accuracy of the reconstructed phase images given by different approaches. Another advantage of using the SLM is, similar to that in the simulation, we can directly multiply the defocus kernel with the images to generate the defocusing effect.

Fig. 9. Optical setup largely follows the optical path in Fig. 1.

Download Full Size | PDF

For the training and testing datasets, we used the same RAF and Fashion-MNIST datasets as in the simulation. Besides, we added the COCO dataset [43] (used the first 50,000 and 2,000 images of the training and testing dataset, respectively), which contains more complex images for challenging the PR approaches. We used the same approach as in the simulation to generate the phase part of the images based on the images collected from these datasets. The magnitude part was assumed as constant. Then these phase images were multiplied with the defocus kernel and loaded to the SLM. Following the optical path, the images captured by the camera were the defocused Fourier intensity measurements of these phase images. They were used for the training and testing of different PR approaches. The resolutions of these intensity measurements and phase images were the same as in the simulation, i.e., $762 \times 762$ and $128 \times 128$ pixels, respectively. When training the models, we first initialized the models with the ones we used in the simulation, as discussed in Section 4. Then we fine-tuned all the pre-trained models by re-training them for 120 epochs with the experimental measurements as the new training inputs.

Similar to the simulation, we compare the proposed PPRNet with the same set of learning-based PR methods. These methods were modified as in the simulation to suit the experiment environment. We evaluate the performance of all competing methods using the same metrics as in the simulation, i.e., PSNR, SSIM, and MAE. The quantitative and qualitative evaluation results are shown in Table 3 and Fig. 10, respectively. To save space, we do not include the qualitative results of ResNet, UNet, LearnInitNet, and CPR-FS in Fig. 10 due to their inferior performances. On the other hand, we include the result of HIO [2] just as a reference. Compared with SOTA deep learning-based PR methods, the proposed PPRNet can achieve PSNR and SSIM gains of at least 7.45 dB and 0.08, respectively, on the Fashion-MNIST dataset. It also performs the best on the RAF and COCO datasets. It can be seen in Fig. 10 that the reconstructed images of PPRNet can provide more details than other methods with fewer artifacts. It is particularly the case for the COCO dataset. Since the images are more complex, other approaches even cannot reconstruct the contours of the objects in the images. PPRNet can recover not only the contours but provide more textures of the original images. It is benefited from the physics information of the underlying image that guides the reconstruction process. Compared with other physics-driven methods, such as prDeep [21] and HIO-UNet [26], PPRNet not only provides better-quality images but also has a lower complexity, measured through the required GMAC operations. GMAC stands for Giga Multiply-Accumulation, which is a commonly used evaluation metric of the complexity of deep learning models. Specifically, the proposed PPRNet requires to perform 35.27 GMACs, while [21] and [26] require 1828.01 and 142.87 GMACs, respectively. The proposed PPRNet obviously has a much lower complexity. It is expected since PPRNet has a feedforward architecture that does not require iterative estimation. Besides, PPRNet is end-to-end trained. It is different from [21] and [26] which have an optimizer operated outside the network model. Thus, it does not have the potential mismatch problem between the optimizer and the network. Besides, prDeep is only successful for real-valued images, as claimed by the authors [21]. The result is unsatisfactory when it is used to retrieve phase images. The above experimental results have verified the performance of the proposed PPRNet when used in a practical phase retrieval system. It consistently outperforms SOTA deep learning-based PR methods both quantitatively and qualitatively. The proposed PPRNet also has a modest network size. In terms of the number of parameters, the proposed PPRNet is higher than only 3 out of 11 approaches in the list in Table 3.

Fig. 10. Experimental results of different phase retrieval methods on (a) RAF dataset [40], (b) Fashion-MNIST dataset [41], and (c) COCO dataset [43]. The first column and the second column show the Fourier intensity measurements (pixel values: 0–4,095) and the corresponding phase parts of ground-truth images, respectively. The values of the Fourier measurements are scaled for better visualization. The other columns show the reconstruction images through different methods. Except for the Fourier intensity measurements (the first column), the colormap of the rest columns ranges from 0 to 2π, with color bars shown in the second column.

Download Full Size | PDF

Table 3. Quantitative comparison (average PSNR/SSIM/MAE/Parameters/Complexity) with SOTA learning-based methods for phase retrieval on three datasets. The training and testing samples are collected with the optical system shown in Fig. 9. Best performances are marked in bold. The second and third-best performances are colored in blue and green, respectively.

View Table | View all tables in this article

6. Conclusion

In this paper, we proposed a deep learning-based PR method, namely PPRNet. Similar to other deep learning-based PR approaches, PPRNet requires only a single Fourier intensity measurement as its input and does not need an extra masking scheme for the PR system to constrain the measurement data. The main novelty of PPRNet is the introduction of physics information to the PR process. Unlike the traditional physics-driven PR methods that often end up in a time-consuming iterative procedure, the proposed PPRNet has a non-iterative feedforward structure but can still effectively utilize the intensity measurement to guide the image reconstruction process. It is enabled by the novel HUB embedded in the network’s input and expanding path. It separately processes the global and local information of the feature maps with the aid of the intensity measurement and combines them with a channel attention method. Our simulation and experiment results have verified the effectiveness of PPRNet. In particular, our experiment results were obtained from an optical platform. They demonstrate the performance of PPRNet when applied to practical phase retrieval systems. From the simulation and experiment results, it can be concluded that the proposed PPRNet consistently outperforms SOTA deep learning-based PR methods, proving it a promising solution to practical PR applications. Nevertheless, at the moment, there is still room for PPRNet to improve further when handling images with complex scenes. It is one of the ongoing works in our group.

Funding

Hong Kong Research Grant Council (PolyU 15225321).

Disclosures

The authors declare no conflicts of interest.

Data availability

The datasets used in this paper are available in [40,41]. Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

CRediT authorship contribution statement

Qiuliang Ye: Conceptualization, Methodology, Simulation, Experiment, Writing - original draft, Writing – review and editing. Li-Wen Wang: Conceptualization, Methodology, Writing - original draft. Daniel P.K. Lun: Supervision, Project administration, Funding acquisition, Conceptualization, Methodology, Writing - review & editing.

Supplemental document

See Supplement 1 for supporting content.

References

1. R. W. Gerchberg, “A practical algorithm for the determination of phase from image and diffraction plane pictures,” Optik 35, 237–246 (1972).

2. J. R. Fienup, “Phase retrieval algorithms: a comparison,” Appl. Opt. 21(15), 2758–2769 (1982). [CrossRef]

3. J. Zhong, L. Tian, P. Varma, and L. Waller, “Nonlinear optimization algorithm for partially coherent phase retrieval and source recovery,” IEEE Trans. Comput. Imaging 2(3), 310–322 (2016). [CrossRef]

4. M. K. Sharma, C. A. Metzler, S. Nagesh, R. G. Baraniuk, O. Cossairt, and A. Veeraraghavan, “Inverse scattering via transmission matrices: Broadband illumination and fast phase retrieval algorithms,” IEEE Trans. Comput. Imaging 6, 95–108 (2020). [CrossRef]

5. J. M. Rodenburg, “Ptychography and related diffractive imaging methods,” Adv. Imaging Electron Phys. 150, 87–184 (2008). [CrossRef]

6. P. Chakravarthula, E. Tseng, T. Srivastava, H. Fuchs, and F. Heide, “Learned hardware-in-the-loop phase retrieval for holographic near-eye displays,” ACM Trans. Graph. 39(6), 1–18 (2020). [CrossRef]

7. D. R. Luke, “Relaxed averaged alternating reflections for diffraction imaging,” Inverse Problems 21(1), 37–50 (2005). [CrossRef]

8. E. J. Candes, X. Li, and M. Soltanolkotabi, “Phase retrieval via wirtinger flow: Theory and algorithms,” IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015). [CrossRef]

9. E. J. Candes, X. Li, and M. Soltanolkotabi, “Phase retrieval from coded diffraction patterns,” Applied and Computational Harmonic Analysis 39(2), 277–299 (2015). [CrossRef]

10. O. Katz, P. Heidmann, M. Fink, and S. Gigan, “Non-invasive singleshot imaging through scattering layers and around corners via speckle correlations,” Nat. Photonics 8(10), 784–790 (2014). [CrossRef]

11. Y. Shechtman, Y. C. Eldar, O. Cohen, H. Chapman, J. Miao, and M. Segev, “Phase retrieval with application to optical imaging: A contemporary overview,” IEEE Signal Process. Mag. 32(3), 87–109 (2015). [CrossRef]

12. Q. Ye, Y.-H. Chan, M. G. Somekh, and D. P. Lun, “Robust phase retrieval with green noise binary masks,” Optics and Lasers in Engineering 149, 106808 (2022). [CrossRef]

13. R. Horisaki, Y. Ogura, M. Aino, and J. Tanida, “Single-shot phase imaging with a coded aperture,” Opt. Lett. 39(22), 6466 (2014). [CrossRef]

14. C. Zheng, R. Zhou, C. Kuang, G. Zhao, Z. Yaqoob, and P. So, “Digital micromirror device-based common-path quantitative phase imaging,” Opt. Lett. 42(7), 1448–1451 (2017). [CrossRef]

15. H. Chang, Y. Lou, Y. Duan, and S. Marchesini, “Total variation–based phase retrieval for poisson noise removal,” SIAM J. Imaging Sci. 11(1), 24–55 (2018). [CrossRef]

16. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]

17. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017). [CrossRef]

18. J. Shi, X. Zhu, H. Wang, L. Song, and Q. Guo, “Label enhanced and patch based deep learning for phase retrieval from single frame fringe pattern in fringe projection 3d measurement,” Opt. Express 27(20), 28929–28943 (2019). [CrossRef]

19. Y. Zhang, M. A. Noack, P. Vagovic, K. Fezzaa, F. GarciaMoreno, T. Ritschel, and P. Villanueva-Perez, “Phasegan: a deeplearning phase-retrieval approach for unpaired datasets,” Opt. Express 29(13), 19593–19604 (2021). [CrossRef]

20. M. R. Kellman, E. Bostan, N. A. Repina, and L. Waller, “Physicsbased learned design: Optimized coded-illumination for quantitative phase imaging,” IEEE Trans. Comput. Imaging 5(3), 344–353 (2019). [CrossRef]

21. C. Metzler, P. Schniter, A. Veeraraghavan, and Richard Baraniuk, “prDeep: Robust phase retrieval with a flexible deep network, in: J. Dy and A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, Vol. 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 3501–3510.

22. T. Uelwer, A. Oberstraß, and S. Harmeling, “Phase retrieval using conditional generative adversarial networks,” in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 731–738.

23. T. Uelwer, T. Hoffmann, and S. Harmeling, “Non-iterative phase retrieval with cascaded neural networks,” in: I. Farkaš, P. Masulli, S. Otte, and S. Wermter (Eds.), Artificial Neural Networks and Machine Learning – ICANN 2021, Springer, Cham, 2021, pp. 295–306.

24. L. Wu, P. Juhas, S. Yoo, and I. Robinson, “Complex imaging of phase domains by deep neural networks,” IUCrJ 8(1), 12–21 (2021). [CrossRef]

25. E. Cha, C. Lee, M. Jang, and J. C. Ye, “Deepphasecut: Deep relaxation in phase for unsupervised fourier phase retrieval,” IEEE Trans. Pattern Anal. Mach. Intell. 44, 9931–9943 (2022). [CrossRef]

26. Ç. Işıl, F. S. Oktem, and A. Koç, “Deep iterative reconstruction for phase retrieval,” Appl. Opt. 58(20), 5422 (2019). [CrossRef]

27. T.W.K. Chow, D.P.K. Lun, S. Pechprasarn, and M.G. Somekh, “Defocus leakage radiation microscopy for single shot surface plasmon measurement,” Meas. Sci. Technol. 31(7), 075401 (2020). [CrossRef]

28. H. Plank, C. Gspan, M. Dienstleder, G. Kothleitner, and F. Hofer, “The influence of beam defocus on volume growth rates for electron beam induced platinum deposition,” Nanotechnology 19(48), 485302 (2008). [CrossRef]

29. Q. Ye, L.-W. Wang, and D. P. K. Lun, “SiSPRNet: end-to-end learning for single-shot phase retrieval,” Opt. Express 30(18), 31937–31958 (2022). [CrossRef]

30. M. Hayes, “The reconstruction of a multidimensional sequence from the phase or magnitude of its fourier transform,” IEEE Trans. Acoust., Speech, Signal Process. 30(2), 140–154 (1982). [CrossRef]

31. L. Wu, S. Yoo, A. F. Suzana, T. A. Assefa, J. Diao, R. J. Harder, W. Cha, and I. K. Robinson, “Three-dimensional coherent x-ray diffraction imaging via deep convolutional neural networks,” npj Comput Mater 7(1), 175 (2021). [CrossRef]

32. Y. Nishizaki, R. Horisaki, K. Kitaguchi, M. Saito, and J. Tanida, “Analysis of non-iterative phase retrieval based on machine learning,” Opt. Rev. 27(1), 136–141 (2020). [CrossRef]

33. F. Wang, A. Eljarrat, J. Müller, T. R. Henninen, R. Erni, and C. T. Koch, “Multi-resolution convolutional neural networks for inverse problems,” Sci. Rep. 10(1), 1 (2020). [CrossRef]

34. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [CrossRef]

35. O. Ronneberger and P. Fischer, “T. Brox, U-net: Convolutional networks for biomedical image segmentation,” in: International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.

36. D. Morales, A. Jerez, and H. Arguello, “Learning spectral initialization for phase retrieval via deep neural networks,” Appl. Opt. 61(9), F25–F33 (2022). [CrossRef]

37. C.-J. Wang, C.-K. Wen, S.-H. Tsai, S. Jin, and G. Y. Li, “Phase retrieval using expectation consistent signal recovery algorithm based on hypernetwork,” IEEE Trans. Signal Process. 69, 5770–5783 (2021). [CrossRef]

38. K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Trans. on Image Process. 26(7), 3142–3155 (2017). [CrossRef]

39. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [CrossRef]

40. S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep localitypreserving learning for expression recognition in the wild,” in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 2584–2593.

41. H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv, arXiv:cs.LG/1708.07747 (2017). [CrossRef]

42. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, arXiv:1412.6980 (2017). [CrossRef]

43. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in: D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Computer Vision – ECCV 2014, Springer, Cham, 2014, pp. 740–755.

Methods	Noisy (original paper)		Non-Defocused		Defocused
Methods	PSNR ↑	SSIM ↑	PSNR ↑	SSIM ↑	PSNR ↑	SSIM ↑
HIO [2]	20.61	0.40	−5.65	0.03	11.31	0.34
prDeep [21]	27.43	0.68	3.37	0.14	12.08	0.21

Methods	Uncorrelated						Linear Correlated
	PSNR $↑$	PSNR $↑$	SSIM $↑$	SSIM $↑$	MAE $↓$	MAE $↓$	PSNR $↑$	PSNR $↑$	SSIM $↑$	SSIM $↑$	MAE $↓$	MAE $↓$
	Mag.	Phase	Mag.	Phase	Mag.	Phase	Mag.	Phase	Mag.	Phase	Mag.	Phase
Traditional Method
HIO [1]	-20.44	7.73	0.0002	0.075	9.329	1.760	7.709	8.317	0.329	0.043	0.298	1.898
Learning-based Methods
LenlessNet [17]	17.394	20.365	0.412	0.748	0.106	0.359	17.216	17.807	0.624	0.432	0.079	0.594
PRCGAN [22]	15.525	20.163	0.342	0.725	0.130	0.343	21.452	23.502	0.719	0.738	0.047	0.256
NNPhase [24]	17.191	18.876	0.460	0.613	0.108	0.458	21.697	24.193	0.690	0.794	0.049	0.229
MCNN [31]	17.880	18.641	0.483	0.692	0.098	0.424	22.454	24.986	0.774	0.819	0.042	0.201
HIO-UNet [26]	15.597	14.285	0.294	0.538	0.132	0.612	22.464	12.830	0.541	0.410	0.054	0.853
PPRNet (Ours)	23.698	31.698	0.694	0.933	0.053	0.113	32.991	34.178	0.920	0.925	0.017	0.101

Methods	Fashion-MNIST			RAF			COCO			Para. (× {10^6})	Complexity (GMAC)
Methods	PSNR $↑$	SSIM $↑$	MAE $↓$	PSNR $↑$	SSIM $↑$	MAE $↓$	PSNR $↑$	SSIM $↑$	MAE $↓$	Para. (× {10^6})	Complexity (GMAC)
Traditional Method
HIO [2]	9.082	0.052	1.709	12.553	0.127	1.185	10.820	0.069	1.460	NA	NA
Learning-based Methods
ResNet [32]	26.167	0.839	0.171	20.322	0.605	0.461	14.998	0.292	0.872	4.72	77.47
UNet [35]	20.201	0.736	0.343	17.309	0.527	0.688	13.894	0.267	1.062	21.43	27.09
LenslessNet [17]	27.152	0.865	0.160	20.979	0.645	0.433	14.810	0.295	0.898	32.74	4.01
SiSPRNet [29]	28.682	0.877	0.141	20.347	0.667	0.458	14.443	0.277	0.980	19.51	1.78
MCNN [33]	27.594	0.862	0.155	20.611	0.619	0.457	15.170	0.283	0.887	28.55	52.40
PRCGAN [22]	26.619	0.846	0.167	20.150	0.603	0.474	13.971	0.267	0.991	111.38	3.45
NNPhase [24]	24.637	0.807	0.222	19.315	0.567	0.539	14.598	0.272	0.978	18.85	6.49
LearnInitNet [36]	23.771	0.790	0.241	19.024	0.562	0.560	14.379	0.271	0.996	7.43	1.27
CPR-FS [23]	18.965	0.770	0.536	16.426	0.528	0.767	13.148	0.261	1.204	895.32	0.12
prDeep [21]	9.422	0.057	1.478	12.993	0.130	1.053	12.274	0.092	1.115	0.56	1828.01
HIO-UNet [26]	30.033	0.896	0.132	22.527	0.701	0.366	16.314	0.346	0.748	62.16	142.87
PPRNet (Ours)	37.483	0.975	0.052	26.473	0.845	0.252	17.528	0.457	0.671	19.78	35.27

Methods	Noisy (original paper)		Non-Defocused		Defocused
Methods	PSNR ↑	SSIM ↑	PSNR ↑	SSIM ↑	PSNR ↑	SSIM ↑
HIO [2]	20.61	0.40	−5.65	0.03	11.31	0.34
prDeep [21]	27.43	0.68	3.37	0.14	12.08	0.21

Methods	Uncorrelated						Linear Correlated
	PSNR $↑$	PSNR $↑$	SSIM $↑$	SSIM $↑$	MAE $↓$	MAE $↓$	PSNR $↑$	PSNR $↑$	SSIM $↑$	SSIM $↑$	MAE $↓$	MAE $↓$
	Mag.	Phase	Mag.	Phase	Mag.	Phase	Mag.	Phase	Mag.	Phase	Mag.	Phase
Traditional Method
HIO [1]	-20.44	7.73	0.0002	0.075	9.329	1.760	7.709	8.317	0.329	0.043	0.298	1.898
Learning-based Methods
LenlessNet [17]	17.394	20.365	0.412	0.748	0.106	0.359	17.216	17.807	0.624	0.432	0.079	0.594
PRCGAN [22]	15.525	20.163	0.342	0.725	0.130	0.343	21.452	23.502	0.719	0.738	0.047	0.256
NNPhase [24]	17.191	18.876	0.460	0.613	0.108	0.458	21.697	24.193	0.690	0.794	0.049	0.229
MCNN [31]	17.880	18.641	0.483	0.692	0.098	0.424	22.454	24.986	0.774	0.819	0.042	0.201
HIO-UNet [26]	15.597	14.285	0.294	0.538	0.132	0.612	22.464	12.830	0.541	0.410	0.054	0.853
PPRNet (Ours)	23.698	31.698	0.694	0.933	0.053	0.113	32.991	34.178	0.920	0.925	0.017	0.101

Towards practical single-shot phase retrieval with physics-driven deep neural network

Abstract

1. Introduction

2. Related works

2.1 Optimization-based PR algorithms

2.2 Deep Learning-based PR methods

3. Proposed approach

3.1 Defocus-based Fourier phase retrieval system

3.2 Overall architecture

3.3 Initialization

3.4 Hybrid unwinding block (HUB)

3.4.1 Physics-driven unwinding block (PUB)

3.4.2 Feature refinement block (FRB) and feature fusion block (FFB)

3.4.3 Visualize the effect of hybrid unwinding block

3.5 Loss function

4. Simulation

4.1 Datasets

4.2 Defocused intensity measurements

4.3 Training details and evaluation metrics

4.4 Compared methods

4.5 Simulation results

5. Experiment on optical platform

6. Conclusion

Funding

Disclosures

Data availability

Supplemental document

References

Supplementary Material (1)

Data availability

Cited By

Figures (10)

Tables (3)

Equations (9)

Optics Express