Self-supervised speckle noise reduction of optical coherence tomography without clean data

Yangxi Li; Yingwei Fan; Hongen Liao

doi:10.1364/BOE.471497

1. Introduction

For biomedical imaging, optical coherence tomography (OCT) is widely utilized. It was first developed as a non-invasive, high-resolution imaging method for the retina in 1991 [1], and has since demonstrated significant promise for the accurate localization of lesion regions in vascular lumens [2], colorectal tissue [3], brain tissue [4], and other tissues. However, the OCT images suffer from inherent speckle noise due to the low coherence interferometry, which can degrade the image quality, and obscure or destroy fine structures in OCT, thus affecting the diagnosis results, restricting the clinical applications and the development of intelligent theranostics based on OCT [5]. Improving imaging system hardware can reduce speckle noise in theory, such as utilizing a ground glass diffuser [6], or an optical chopper [7] to obtain several uncorrelated speckle patterns and averaging them. These approaches rely on specially designed hardware systems and acquisition software and are hard to be widely used or commercialized. Researchers have also attempted to reduce speckle noise of OCT by using or improving conventional denoising methods, such as block-matching and 3-dimensional (BM3D) filter [8] and non-local means (NLM) filter [9], which are based on the similarity between image patches, have been tried for despeckling OCT images but with limited effectiveness. A Hough-transform-based fixed-pattern noise reduction method has been proposed to remove the high-intensity reflection noises [10]. Sparsity-based methods like MSBTD [11], SBSDI [12], and NWSR [13] have also achieve good despeckling performance, but structural changes or newly appeared artifacts could be observed in processed images.

Denoising performance is significantly improved by deep learning (DL) techniques using convolutional neural networks. DnCNN impressed us with its blind Gaussian denoising performance achieved through deep residual learning and batch normalization [14]. Researchers have also developed generative adversarial network (GAN)-based models for speckle noise reduction since the speckle noise on OCT is more complex than additive noise, such as edge-sensitive cGAN [15], Caps-cGAN [16], SiameseGAN [17], OCT-GAN [18], DRGAN [19,20] and nonlocal-GAN [21]. These methods usually have good speckle noise suppression performance, but their robustness remains to be verified. For example, blurs or artifacts may occur in the processed image. Another significant issue is that it is challenging to meet the requirement for paired noisy-clean images during supervised training because ground truth image is generated by registering and averaging dozens of repeated scans, which takes longer and requires the least amount of movement or deformation of the scanned tissue. Even though the DRGAN [19,20] does not require paired data, it still cannot get rid of the dependence on unpaired clean images. In nonlocal-GAN presented in [21], traditional similar patch search is also needed to complete speckle reduction after network processing, which is not end-to-end despeckling in the complete sense. Recently, some training strategies that only require noisy images have been proposed for natural image denoising. Noise2Noise only requires two noisy observations of the same scene for training [22]. Researchers have conducted a comparative study on the feasibility of Noise2Noise on OCT images [23] and explored the feasibility of using this training strategy to construct a real-time noise reduction model [24]. Nevertheless, Noise2Noise still relies on paired data, even if only noisy observations are needed, and thus we must acquire at least two OCT images at the same tissue location. With only single noisy images, Noise2Void [25] and Noise2Self [26] use the blind-spot strategy to perform self-supervised learning for noise reduction of natural images. Self2Self [27] aims to train a denoiser on a single image by Bernoulli sampling and dropout. Neighbor2Neighbor [28], on the other hand, obtains image pairs for training through adjacent sub-sampling. These methods greatly inspired us to develop a self-supervision-based despeckling method for OCT without any clean data or paired noisy data.

In this paper, we have proposed a self-supervised training strategy requiring only single speckled OCT images, attempting to make the network learn a Mapping between Adjacent Pixels and eventually work for Speckle Noise Reduction (MAP-SNR). We first prove the feasibility of training a despeckling model using single speckled OCT images by a mathematical model and then present the multi-scale pixel patch sampler utilized for image pair generation and further training. We have carried out quantitative and qualitative experiments for the comprehensive evaluation of MAP-SNR. To the best of our knowledge, MAP-SNR is the only speckle reduction method that can be trained with only single noisy OCT images.

2. Methods

2.1 Fundamental theory

We use $\boldsymbol y$ and $\boldsymbol x$ to denote the OCT image with speckle noise and the corresponding speckle-free OCT image, respectively. According to the previous studies [20,29], the speckle noise model can be formulated as:

(1)$$\boldsymbol{y}=\boldsymbol{x}\odot \boldsymbol{n},$$

where ${\odot }$ denotes pixel-wise multiplication and $\boldsymbol {n}$ represents the spatially-independent random noise with unit mean. In supervised training with mean square error (MSE or $\mathcal {L}_2$) loss, we need a ground truth image $\boldsymbol {x}$ and try to minimize ${\mathbb{E}_{\boldsymbol { x},\boldsymbol {y}}=\Vert f_\theta (\boldsymbol {y})-\boldsymbol {x} \Vert _2^2,}$ where ${f_\theta }$ is the neural network parameterized by $\theta$ and $\mathbb{E}$ denotes the expectation operator. As for Noise2Noise, another noisy observation $\boldsymbol {z}$ of the speckle-free image $\boldsymbol {x}$ is needed:

(2)$$\boldsymbol{z}=\boldsymbol{x} \odot \boldsymbol{n}^{'},$$

and the network targets to minimize $\mathbb{E}_{\boldsymbol {x},\boldsymbol {y},\boldsymbol {z}} \Vert f_ \theta (\boldsymbol {y})-\boldsymbol {z}\Vert _2^2$, which has been well proved in [22] and adopted in [23]. Our approach is to sample a pair of images from one noisy OCT image $\boldsymbol {y}$, and use them as the input and target of a neural network since it is difficult to acquire two noisy OCT images with the same pristine image in clinical settings.

Let $(u_{1},u_{2})$ denote a pair of sub-samplers. To keep the trained network from approaching an identity transformation and to ensure that the two sub-sampled images are independent, we choose to sample the adjacent pixels patches from $\boldsymbol {y}$ into two images separately. Utilizing $\boldsymbol {y}_1$ and $\boldsymbol {y}_2$ as simple indicators of $u_{1}\left (\boldsymbol {y}\right )$ and $u_{2}\left (\boldsymbol {y}\right )$, and take note that $\boldsymbol {n}$ is spatially-independent random noise with a unit mean, which results in the formation of

(3)$$\mathbb{E}(\boldsymbol y_1)=\mathbb{E}\left(u_{1}\left(\boldsymbol x \odot \boldsymbol n\right)\right) =\mathbb{E}\left(u_{1}\left(\boldsymbol x \right)\right)$$

(4)$$\mathbb{E}(\boldsymbol y_2)=\mathbb{E}\left(u_{2}\left(\boldsymbol x \odot \boldsymbol n\right)\right) =\mathbb{E}\left(u_{2}\left(\boldsymbol x \right)\right)$$

We will also use $\boldsymbol {x}_1$ and $\boldsymbol {x}_2$ to denote $u_{1}\left (\boldsymbol x \right )$ and $u_{2}\left (\boldsymbol x \right )$ respectively in the following paragraphs. Following the Noise2Noise strategy and mean squared error (MSE) loss, the training process of $f_\theta$ is actually trying to find the minimum of

(5)$$\mathcal{L}_{\text MSE} = \mathbb{E}\Vert f_\theta \left(\boldsymbol{y}_1\right)-\boldsymbol{y}_2\Vert _2^2 ,$$

Refer to the derivation in supplementary materials of [28] but with the multiplicative speckle noise model, we have

(6)$$\begin{aligned} \mathbb{E}\Vert f_\theta \left(\boldsymbol{y}_1\right)-\boldsymbol{y}_2\Vert _2^2 & = \mathbb{E}\left(f_\theta \left(\boldsymbol{y}_1\right)^2\right)+\mathbb{E}\left(\boldsymbol{y}_2^2\right)-2\mathbb{E}\left(f_\theta \left(\boldsymbol{y}_1\right)\boldsymbol{y}_2\right) \\ & =\mathbb{E}\left(f_\theta \left(\boldsymbol{y}_1\right)^2\right)+\sigma_{\boldsymbol{y}_2}^2+\mathbb{E}^2\left(\boldsymbol{y}_2\right)-2\mathbb{E}\left(f_\theta \left(\boldsymbol{y}_1\right)\right)\mathbb{E}\left(\boldsymbol{y}_2\right) \\ & =\mathbb{E}\left(f_\theta \left(\boldsymbol{y}_1\right)^2\right)+\sigma_{\boldsymbol{y}_2}^2+\mathbb{E}^2\left(\boldsymbol{x}_2\right)-2\mathbb{E}\left(f_\theta \left(\boldsymbol{y}_1\right)\right)\mathbb{E}\left(\boldsymbol{x}_2\right) \\ & =\mathbb{E}\left(f_\theta \left(\boldsymbol{y}_1\right)^2\right) + \mathbb{E}\left(\boldsymbol{x}_2^2\right) - 2\mathbb{E}\left(f_\theta \left(\boldsymbol{y}_1\right)\right)\mathbb{E}\left(\boldsymbol{x}_2\right) + \sigma_{\boldsymbol{y}_2}^2 - \sigma_{\boldsymbol{x}_2}^2 \\ & =\mathbb{E}\Vert f_\theta \left(\boldsymbol{y}_1\right)-\boldsymbol{x}_2\Vert _2^2 + \sigma_{\boldsymbol{y}_2}^2 - \sigma_{\boldsymbol{x}_2}^2, \end{aligned}$$

in which $\sigma _{\boldsymbol {y}_2}^2$ and $\sigma _{\boldsymbol {x}_2}^2$ are variances of $\boldsymbol {y}_2$ and $\boldsymbol {x}_2$, respectively. These two terms are constants, thus minimizing Eq. (5) results in minimizing $\mathbb{E}\Vert f_\theta \left (\boldsymbol {y}_1\right )-\boldsymbol {x}_2\Vert _2^2$. And this term could be converted into

(7)$$\begin{aligned} \mathbb{E}\Vert f_\theta \left(\boldsymbol{y}_1\right)-\boldsymbol{x}_2\Vert_2^2 & = \mathbb{E}\Vert f_\theta \left(\boldsymbol{y}_1\right)-\boldsymbol{x}_1 + \boldsymbol{x}_1-\boldsymbol{x}_2\Vert_2^2 \\ & =\mathbb{E}\Vert f_\theta \left(\boldsymbol{y}_1\right)-\boldsymbol{x}_1\Vert_2^2 + \mathbb{E}\Vert \boldsymbol{x}_1-\boldsymbol{x}_2\Vert_2^2+ 2\mathbb{E}\left(f_\theta\left(\boldsymbol{y}_{1}\right)-\boldsymbol{x}_1\right)^{\text T}\left(\boldsymbol{x}_1-\boldsymbol{x}_2\right) \end{aligned}$$

Assuming that the gap between $\boldsymbol {x}_1$ and $\boldsymbol {x}_2$ is $\Delta _{\boldsymbol {x}}$. After we optimize the sub-samplers to make the resulting sub-images very similar, i.e., $\Delta _{\boldsymbol {x}} \to 0$, the network trained by minimizing Eq. (5) would converge to a despeckling model close to the one trained with supervision (minimizing $\mathbb{E}\Vert f_\theta \left (\boldsymbol {y}_1\right )-\boldsymbol {x}_1\Vert _2^2$). Note that both $\boldsymbol {y}_1$ and $\boldsymbol {x}_1$ are sub-images obtained by the same sampler $u_1$, so it is convincing that the model trained with loss Eq. (5) is also valid for the original image $\boldsymbol y$. These inferences also indicate that Noise2Noise theory with $\mathcal {L}_2$ loss is also useful for multiplicative speckle noise reduction, which has previously been proven to work for additive noise only.

2.2 Training with multi-scale pixel patch sampler (PPS)

To address the requirement of minimizing the differences between the two sub-sampled images and improving the despeckling effect as much as possible, we propose a multi-scale adjacent pixel patch sampler to generate quasi-paired images for training. The sampling and training process are shown in Fig. 1. During a sampling subprocess with patchsize=$k\times k$, the original OCT image is divided into non-overlapping cells with size $2k\times 2k$, and each cell has $4$ patches. Two adjacent patches P1 and P2 were randomly chosen and assigned to the cells’ original position as the yellow arrows in upper Fig. 1 point to. After finishing the subsampling process, we can get quasi-paired images $u_{k, 1}\left (\boldsymbol {y}\right )$ and $u_{k, 2}\left (\boldsymbol {y}\right )$. Since the sampled patches in the corresponding position are adjacent to each other, the similarity of the quasi-paired images is guaranteed to the maximum extent. $k$ could be set to 1, 2, 4 or larger exponential power of 2. When $k$ becomes too large, the similarity between two sub-sampled images will decrease, so we need to determine the value of $k$ by comparative experiments. Besides, no matter what $k$ is, the resulting sub-images are half the size of the original image.

As we mainly put forward this self-supervised training strategy, more specifically, it is a method to generate data pairs for training. The most widely used networks, such as U-net [30] or its enhanced versions, can be selected as the neural network model, however the loss function needs to be compatible with our data generation strategy. During training, all sub-sampled image pairs with different patch sizes $k$ are fed into the network but with loss functions calculated separately, and the total loss function can be expressed as follows:

(8)$$\mathcal{L}=\sum_{k}\left[\lambda_{k}\mathcal{L}_{MSE}\left(f_{\theta}\left(u_{k,1}\left(\boldsymbol{y}\right)\right), u_{k,2}\left(\boldsymbol{y}\right)\right)\right],$$

where $u_{k,i}$ represents the $i$-th PPS with patch size $k (i=1,2,4,\ldots )$ and $\lambda _{k}$ is the weight for each loss. The proposed loss function can integrate the different effects of multi-scale images and balance the speckle reduction effect and structure preservation capability.

Fig. 1. Schematic diagram of MAP-SNR, including sampling, training, and testing processes.

Download Full Size | PDF

3. Experiments and results

3.1 Implementation details

We utilized two publicly available spectral-domain OCT (SD-OCT) datasets (Dataset 1 [31] and Dataset 2 [32]) containing paired noisy and quasi-clean retina images of human retinas to facilitate the evaluations of despeckling. Ground truth images in these datasets were acquired by registering and averaging repeated B-scans. After discarding several images with inaccurate registration or over-smoothing, we selected 26 well-registered pairs of OCT images from them, 14 for training and 12 for testing. Before training, the original 14 noisy images and their ground truths with size $900\ast 450$ (width$\ast$height) were cropped into partially overlapped patches using a sliding window with the size of $256*256$ and a stride of 8. Note that only patches with an average pixel value over $80$ (range:$0-255$) would be selected, thus avoiding patches with large areas of background, as shown in Fig. 2(a). We have obtained 27578 training images after the pre-processing steps above. Images in the test set were also cropped and thresholded in the same manner for the measurement of PSNR and SSIM after each training epoch. Note that the final test is performed on the uncropped test images.

Another low-quality OCT dataset (D3) was acquired from mouse skin using a homemade swept-source OCT (SS-OCT) system with a central wavelength at 1300 nm. This set of experiments was approved by Beijing Institute of Radiation Medicine Experiment Animal Center (Animal Protocols No. IACUC-DWZX-2019-502). 400 B-scan images from two individual mice were used for training, and another C-scan from a different mouse containing 200 B-scans was used for testing. The size of these images is $500*460$, and we cropped them to $256*256$ in the same way as above. We have randomly discarded 90% of them in training set to reduce the amount of data, leaving 21753 images for training.

Fig. 2. (a) Schematic view of data pre-processing. The patches in the green square are retained, while the patch in the yellow dotted square is discarded because it does not contain structural information; (b-c) Examples of regions-of-interest (ROI) delineation for quantitative evaluation. Purple rectangles: background noise regions; green: homogeneous regions; red: nonhomogeneous regions.

Download Full Size | PDF

During training, the cropped images with size of $256*256$ are then sub-sampled with multi-scale PPS as mentioned above. We first carried out a set of ablation experiments to determine the optimal sampling scales and corresponding loss weights. In subsequent evaluations, we kept the combination of two scales with the best performance: $k=1$ and $k=2$, setting $\lambda _{1}=1$ and $\lambda _{2}=10$. We followed [28] to use a modified U-Net as the despeckling model. During training, the batch size was set to 32, and Adam optimizer was adopted, with the learning rate initialized to $10^{-5}$ and half-decayed every 20 epochs. The training process ends at 100 epochs. Network training was done on a server with Ubuntu 16.04 OS and two Nvidia RTX 8000 GPUs. Other evaluation experiments were conducted on a personal computer with 32GB memory and an Intel Core i9-9900K CPU.

3.2 Compared methods and evaluation metrics

We compared the proposed MAP-SNR with five traditional non-learning methods (filters), Median, Bilateral, Wavelet, NLM, and BM3D, one sparsity-based method, MSBTD [11]; and six state-of-the-art deep-learning-based methods, DnCNN [14], Noise2Void [25], Self2Self [27], Neighbor2Neighbor [28], SiameseGAN [17] and DRGAN [19,20]. We all used the official source code or program for experiments with default settings on the same 12 test images. For learning-based methods, the same 14 noisy images (with their ground truths if needed) were used for training. Two ground-truth-required evaluation metrics: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), and four ground-truth-free metrics: contrast-to-noise ratio (CNR), equivalent number of looks (ENL), edge preservation index (EPI), and signal-to-noise ratio (SNR), were used for quantitative analysis.

(9)$$\text{PSNR} = 10\log_{10}\left(\frac{\max\left(G\right)^2}{\frac{1}{MN}\sum_{m}^{M}\sum_{n}^{N}\left(G(m,n)-I(m,n)\right)}\right)$$

(10)$$\text{SSIM} = \frac{(2\mu_{G}\mu_{I}+c_{1})(2\sigma_{GI}+c_2)}{(\mu_{G}^{2}+\mu_{I}^{2}+c_{1})(\sigma_{G}^{2}+\sigma_{I}^{2}+c_{2})}$$

where $M$ and $N$ are the width and height of the image. $G(m,n)$ is the pixel intensity located at $(m,n)$ of the referenced ground truth image $G$, and $I$ is the resulted image after despeckling; $\mu _{G}$ and $\sigma _{G}$ are the mean and the standard deviation of the ground truth image $G$, while $\mu _{I}$ and $\sigma _{I}$ are the mean and the standard deviation of the despeckled image $I$, respectively. $\sigma _{GI}$ is the covariance of $G$ and $I$. $c_{1}$ and $c_{2}$ are set to $(0.01\times 255)^2$ and $(0.03\times 255)^2$, respectively.

(11)$$\text{SNR} = 10\log_{10}\frac{\max(I)^{2}}{\sigma_{b}^{2}}$$

(12)$$\text{CNR} = \frac{1}{R}\sum_{r=1}^{R}10\log_{10}\frac{\vert\mu_{r}-\mu_{b}\vert}{\sqrt{\sigma_{r}^{2}+\sigma_{b}^2}}$$

(13)$$\text{ENL} = \frac{1}{H}\sum_{h=1}^{H}\frac{\mu_{h}^{2}}{\sigma_{h}^{2}}$$

(14)$$\text{EPI} = \frac{1}{N}\sum_{n=1}^{N}\frac{\Gamma\left(\nabla A_{n}-\overline{\nabla A_{n}}, \nabla A_{n}^{'}-\overline{\nabla A_{n}^{'}}\right)}{\Gamma\left(\nabla A_{n}-\overline{\nabla A_{n}}, \nabla A_{n}-\overline{\nabla A_{n}} \right)\Gamma\left(\nabla A_{n}^{'}-\overline{\nabla A_{n}^{'}}, \nabla A_{n}^{'}-\overline{\nabla A_{n}^{'}} \right)}$$

where $\mu _{b}, \mu _{r}, \mu _{h}$ and $\sigma _{b}^2, \sigma _{r}^2, \sigma _{h}^2$ are the means and variance of the background noise area, $r$-th ROI, and $h$-th homogeneous ROI, respectively. $R$, $H$, and $N$ are the number of all types of ROIs (except the background noise region), homogeneous ROIs, and nonhomogeneous ROIs. $\Gamma$ denotes the operation of pixel-by-pixel multiplication and accumulation:

(15)$$\Gamma\left(I_{1},I_{2}\right)=\sum_{i,j\in I}I_{1}\left(i,j\right)I_{2}\left(i,j\right)$$

$\nabla$ is the $3\times 3$ Laplacian operator. $A_n$ and $A_{n}^{'}$ represent the $n$-th nonhomogeneous ROI before and after despeckling, respectively, and $\overline {\nabla A}$ is the mean value of $A$. In our case, one background noise area, four homogeneous ROIs, and six nonhomogeneous ROIs (mainly the layered structures of the retina) are selected manually on each image for calculation of SNR, CNR, ENL, and EPI, and examples are shown in Fig. 2(b-c).

Higher PSNR and SNR values indicate a better speckle reduction effect. CNR measures the contrast level of the image, and a higher value means that we can distinguish the ROI out of the background more easily. Besides, a higher SSIM value indicates a better similarity between the despeckled image and the ground truth; a higher ENL value indicates better smoothness. The closer EPI is to 1, the better it is at indicating how well edge structures can be preserved.

3.3 Results

Figure 3 shows the training process converges quickly both for retinal OCT images and for low-quality skin OCT images collected by ourselves, and smoothing effects can be seen obviously after ten epochs. Structural details are improved in the later training process. The rapid convergence of the model can also be seen from the curves of training loss function (Fig. 3(b)) and validating PSNR (Fig. 3(c)) and SSIM indices (Fig. 3(d)).

Fig. 3. (a) Image quality improvements during the training process of retina (upper) and low-quality skin OCT images (lower); (b) Training loss curve. The horizontal axis is the number of iterations; (c) The average PSNR value of the cropped test set increases by training epochs; (d) The average SSIM value of the cropped test set increased first and then decreased, and rapidly increased after about 5 epochs.

Download Full Size | PDF

Fig. 4. Qualitative comparison with different loss combinations.

Download Full Size | PDF

3.3.1 Ablation study of different scale loss

We conducted a set of experiments to illustrate the impact of different multi-scale loss combinations. Quantitative evaluation results are shown in Table 1 while qualitative comparisons between those different scales are shown in Fig. 4. MAP-SNR using single-scale PPS with $k=1$ achieves the highest EPI, while the one using PPS only with $k=4$ has the lowest. With the increase of the PPS scale, the PSNR, CNR, and ENL of the image are improved, indicating that the despeckled image will be smoother, but this may lead to the damage of edge information and excessive blurring, such as the image with $k=4$ in Fig. 4. When sampling at multiple scales and setting corresponding weighted loss functions, the speckle reduction effect and fine structure preservation can be balanced to a certain extent. Among those metrics, MAP-SNR with $\lambda _{1}=1$ and $\lambda _{2}=10$ has a moderate and acceptable EPI value, and other metrics are in the first or second place. And with this configuration, the despeckled image has the optimum visual balance. Considering both quantitative and qualitative results, we selected $k=1$ and $2$ with loss weights of 1 and 10, respectively, in the final training scheme.

Table 1. Quantitative evaluation of different loss combinations.

View Table | View all tables in this article

3.3.2 A-line analysis

An SD-OCT image is composed of several A-lines, and each A-line is a set of intensity data derived from Fourier Transform of frequency domain data in a single acquisition. The recovery of A-line data is crucial, which not only involves the preservation of the original tissue structure but also relates to the analysis of tissue optical parameters such as backscattering coefficient, which significantly influences the judgment of pathological tissue. As shown in Fig. 5(c-d), MAP-SNR could effectively remove the disturbance caused by speckles at different depths and keep the original information of A-lines consistent with the ground truth. We also compared the absolute error between the original speckled / MAP-SNR despeckled A-line and the ground truth, as shown in Fig. 5(e-f), which respectively correspond to the positions of the two dotted lines in Fig. 5(b). We can see that MAP-SNR behaves well in despeckling and smoothing tasks, and the quality of resulted images (or A-lines) is much closer to the ground truth.

Fig. 5. Despeckling effects on A-lines. (a) The noisy OCT image and its ground truth; (b) Despeckled OCT image using MAP-SNR; (c-d) Curves of noisy (red), clean (blue), and despeckled (green) A-lines located at the left and right green dotted line in (b), respectively; (e-f) Absolute error between despeckled A-line and ground truth (green), as well as original A-line and ground truth (red) corresponding to (c) and (d).

Download Full Size | PDF

3.3.3 Quantitative evaluation

Quantitative evaluation results of MAP-SNR and the compared methods are shown in Table 2. These values were calculated and averaged across the 12 test images. From this table, we can easily find that the proposed MAP-SNR achieves the highest value in almost all metrics except EPI. In fact, EPI growth is not always a good phenomenon, which might mean that the despeckling effect is weak since EPI will become 1 if calculated on the original ROIs. Generally, methods that achieve higher EPI than MAP-SNR have other obvious shortcomings, e.g., observed low CNR and SSIM of median, bilateral, wavelet filters, and MSBTD, but the parameters of MAP-SNR are relatively balanced well. As a self-supervised method, the MAP-SNR performs better than Noise2Void and Neighbor2Neighbor measured by all metrics and has the edge preservation ability comparable to Self2Self. Besides, MAP-SNR outperforms these latest GAN-based methods in terms of speckle reduction and smoothing abilities (e.g., much higher PSNR, SNR, CNR, and ENL), and has the best structure similarity, which indicates that it has great robustness and will not damage the structures on the original images.

Table 2. Quantitative evaluation of MAP-SNR and the state-of-the-art methods.

View Table | View all tables in this article

3.3.4 Visual effect comparison

Figure 6 and Fig. 7 list the despeckled results of all methods on the same test images. The regions inside four red rectangular are magnified and shown below each image, from top to bottom and left to right. The visual effects of the proposed images are significantly improved compared with the original images. We can observe that MAP-SNR shows great superiority in background areas and has an apparent despeckling effect in the layered tissue area. The edge of the image despeckled by MAP-SNR is still sharp enough and details obscured by speckle noise in the original image are also restored, as yellow arrows in Fig. 6 and Fig. 7 point to. The red arrows in Fig. 6 indicate structural losses on images despeckled by GAN-based methods, and similar artifacts can be observed more clearly on another test image shown in Fig. 8, which is not clinically acceptable. By contrast, MAP-SNR achieves a good balance between speckle suppression and structure preservation, improving the readability and ensuring that the image processing algorithm will not affect the clinical diagnosis.

Fig. 6. Qualitative visual comparison between MAP-SNR and the state-of-the-art methods on example image 1.

Download Full Size | PDF

Fig. 7. Qualitative visual comparison between MAP-SNR and the state-of-the-art methods on example image 2.

Download Full Size | PDF

3.3.5 Results for low-quality skin OCT

As shown in Fig. 9, OCT images of mouse skin processed by MAP-SNR have better smoothness. Due to the low resolution of the original image, some details of the structure are not clear enough, but the proposed method has made the maximum reservation and restoration (Fig. 9(b)(e)). We also utilized the trained model for despeckling an entire C-scan and rendered the sectional images into a 3-D volume (Fig. 9(c)(f)). Under the same rendering settings, the despeckled one (Fig. 9(f)) has better surface textures, less background noise, and better readability. We also extracted the tissue surface from C-scan and compared the changes of en-face maps in different depths before and after despeckling, as shown in Fig. 9(g-i). MAP-SNR works well for those low-resolution images, and the layered structure of mouse skin is preserved as much as possible, indicating that MAP-SNR has strong robustness and extensibility in practical use for images of various tissues and qualities.

Fig. 8. Comparison about structure preservation between SiameseGAN, DRGAN, and the proposed MAP-SNR. All images were cropped from the same location on the same test image.

Download Full Size | PDF

4. Discussions

Preserving image details while suppressing noise effectively has always been a complex problem in the quality-improving task of OCT images. MAP-SNR achieved competitive visual quality among state-of-the-art algorithms and performed well on those evaluation metrics. The proposed method has the highest PSNR and SNR value, indicating that it has the best speckle suppression effect. Similarly, MAP-SNR also has the highest CNR value, which means that this method greatly improves the image contrast after despeckling. The higher the ENL value, the much smoother the homogeneous region. Therefore, in general, the proposed method achieves excellent despeckling and smoothing performance in the case of using only speckled images, even better than other methods with supervised training.

Fig. 9. Despeckling of skin OCT images. (a) Original speckled OCT image; (b) Patches corresponding to the colored squares in (a); (c) 3-D rendered volume of the original C-scan; (d) Despeckled OCT image; (e) Patches corresponding to the colored squares in (d), and their positions are the same as in (a); (f)3-D rendered volume of the original C-scan, same settings as (c); (g-i) En-face map and local zoom-in view at the depth of 0/30/60 pixels beneath the tissue surface, respectively. Left: processed from original images; Right: processed from despeckled images.

Download Full Size | PDF

SSIM and EPI reflect the performance of detail structure preservation. The noisy image and paired ground truth in the dataset are not perfectly aligned, as seen in the region that the green arrows in Fig. 6 and Fig. 7; as a result, the total SSIM value is very low, but our result is still the best. As for the problem of low EPI found in the quantitative evaluation, we did not find that it brought a severe negative impact on the despeckled images during visual analysis. In fact, from the perspective of visual effect, the edge information on the image is intact, and there is no significant difference from other methods.

As in Fig. 6 and Fig. 7, we found that several methods could not deal with the heavy speckles in OCT images, such as median, bilateral, wavelet filtering, MSBTD, and Noise2Void. In the images processed by NLM and BM3D, we could observe the presence of artifacts in the background region. Among those learning-based methods, Neighbor2Neighbor, Noise2Void, and Self2Self could not reduce speckles to an optimal extent, and Self2Self has to take a badly long time to train a single image 150,000 times according to the default settings, which is unacceptable to use in practical OCT imaging scenarios. With a trained MAP-SNR model based on U-Net, despeckling of an OCT image with the size of 256*256 only costs 28.6 milliseconds on average, which could be optimized faster by parallel processing and simplifying network architecture. Images denoised by DnCNN are over-smoothed, and the edges are blurred. Structural changes or artifacts occur in images processed by SiameseGAN and DRGAN, while MAP-SNR is more robust, as shown in Fig. 8. However, in those layered tissue areas, the MAP-SNR behaves worse than GAN-based methods in speckle suppression and restoration of smooth structures. We will consider developing better PPS and optimizing the loss function to improve our results further.

From a comprehensive view, the model trained by the proposed method can preserve the original structure information of the OCT and achieve an excellent despeckling effect. In addition, it is also capable of despeckling tasks for other types of OCT images, as described in section 3.3.5 and Fig. 9. This method can be used to train a very effective despeckling model for low-quality OCT images of mouse skin without clean ground truth. Our approach simplifies training and deployment of despeckling model and is effective for a wide range of clinical OCT applications. In the future, we will also conduct experiments on more types of OCT images and try to integrate our approach into the existing OCT imaging software.

5. Conclusion

In this paper, we have proposed a novel self-supervised training strategy for speckle noise reduction of OCT images based on the mapping between adjacent pixels (MAP-SNR). We have seen competitive, excellent despeckling performance compared to the state-of-the-art methods, whether it’s supervised or not, through quantitative and qualitative evaluations. MAP-SNR worked superiorly not only in speckle suppression and image smoothing but also in detail structure preservation of OCT images. With MAP-SNR, one can train an effective despeckling model without any clean OCT images, which significantly simplifies the process of collecting training data, making it more suitable for practical deployment and clinical application.

Funding

Beijing Municipal Natural Science Foundation (7212202, L192013); National Natural Science Foundation of China (82027807).

Acknowledgments

The authors would like to thank Prof. Hongxiang Kang and his research team from Beijing Institute of Radiation Medicine for their help with animal experiments.

Disclosures

The authors declare that there is no conflict of interest.

Data Availability

Data underlying the results presented in this paper are partially available: Dataset 1 [31] and Dataset 2 [32], are publicly available; Dataset 3 is not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. D. Huang, E. A. Swanson, C. P. Lin, J. S. Schuman, W. G. Stinson, W. Chang, M. R. Hee, T. Flotte, K. Gregory, C. A. Puliafito, and J. G. Fujimoto, “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]

2. R. Zhang, Y. Fan, W. Qi, A. Wang, X. Tang, and T. Gao, “Current research and future prospects of ivoct imaging-based detection of the vascular lumen and vulnerable plaque,” J. Biophotonics 15(5), e202100376 (2022). [CrossRef]

3. Y. Zeng, S. Xu, W. C. Chapman, S. Li, Z. Alipour, H. Abdelal, D. Chatterjee, M. Mutch, and Q. Zhu, “Real-time colorectal cancer diagnosis using pr-oct with deep learning,” in Optical Coherence Tomography, (Optical Society of America, 2020), pp. OW2E-5.

4. C. Kut, K. L. Chaichana, J. Xi, S. M. Raza, X. Ye, E. R. McVeigh, F. J. Rodriguez, A. Quiñones-Hinojosa, and X. Li, “Detection of human brain cancer infiltration ex vivo and in vivo using quantitative optical coherence tomography,” Sci. Transl. Med. 7(292), 292ra100 (2015). [CrossRef]

5. Y. Li, Y. Fan, C. Hu, F. Mao, X. Zhang, and H. Liao, “Intelligent optical diagnosis and treatment system for automated image-guided laser ablation of tumors,” Int. J. Comput. Assist. Radiol. Surg. 16(12), 2147–2157 (2021). [CrossRef]

6. O. Liba, M. D. Lew, E. D. SoRelle, R. Dutta, D. Sen, D. M. Moshfeghi, S. Chu, and A. de La Zerda, “Speckle-modulating optical coherence tomography in living mice and humans,” Nat. Commun. 8(1), 1–13 (2017). [CrossRef]

7. R. Li, H. Yin, J. Hong, C. Wang, B. He, Z. Chen, Q. Li, P. Xue, and X. Zhang, “Speckle reducing oct using optical chopper,” Opt. Express 28(3), 4021 (2020). [CrossRef]

8. B. Chong and Y.-K. Zhu, “Speckle reduction in optical coherence tomography images of human finger skin by wavelet modified bm3d filter,” Opt. Commun. 291, 461–469 (2013). [CrossRef]

9. H. Yu, J. Gao, and A. Li, “Probability-based non-local means filter for speckle noise suppression in optical coherence tomography images,” Opt. Lett. 41(5), 994–997 (2016). [CrossRef]

10. Y. Fan, L. Ma, W. Chang, W. Jiang, S. Luo, X. Zhang, and H. Liao, “Optimized optical coherence tomography imaging with hough transform-based fixed-pattern noise reduction,” IEEE Access 6, 32087–32096 (2018). [CrossRef]

11. L. Fang, S. Li, Q. Nie, J. A. Izatt, C. A. Toth, and S. Farsiu, “Sparsity based denoising of spectral domain optical coherence tomography images,” Biomed. Opt. Express 3(5), 927–942 (2012). [CrossRef]

12. L. Fang, S. Li, R. P. McNabb, Q. Nie, A. N. Kuo, C. A. Toth, J. A. Izatt, and S. Farsiu, “Fast acquisition and reconstruction of optical coherence tomography images via sparse representation,” IEEE Trans. Med. Imaging 32(11), 2034–2049 (2013). [CrossRef]

13. A. Abbasi, A. Monadjemi, L. Fang, and H. Rabbani, “Optical coherence tomography retinal image reconstruction via nonlocal weighted sparse representation,” J. Biomed. Opt. 23(03), 1 (2018). [CrossRef]

14. K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Trans. on Image Process. 26(7), 3142–3155 (2017). [CrossRef]

15. Y. Ma, X. Chen, W. Zhu, X. Cheng, D. Xiang, and F. Shi, “Speckle noise reduction in optical coherence tomography images based on edge-sensitive cgan,” Biomed. Opt. Express 9(11), 5129–5146 (2018). [CrossRef]

16. M. Wang, W. Zhu, K. Yu, Z. Chen, F. Shi, Y. Zhou, Y. Ma, Y. Peng, D. Bao, S. Feng, L. Ye, D. Xiang, and X. Chen, “Semi-supervised capsule cgan for speckle noise reduction in retinal oct images,” IEEE Trans. Med. Imaging 40(4), 1168–1183 (2021). [CrossRef]

17. N. A. Kande, R. Dakhane, A. Dukkipati, and P. K. Yalavarthy, “Siamesegan: a generative model for denoising of spectral domain optical coherence tomography images,” IEEE Trans. Med. Imaging 40(1), 180–192 (2021). [CrossRef]

18. H. Cheong, S. K. Devalla, T. Chuangsuwanich, T. A. Tun, X. Wang, T. Aung, L. Schmetterer, M. L. Buist, C. Boote, A. H. Thiéry, and M. J. A. Girard, “Oct-gan: single step shadow and noise removal from optical coherence tomography images of the human optic nerve head,” Biomed. Opt. Express 12(3), 1482–1498 (2021). [CrossRef]

19. Y. Huang, W. Xia, Z. Lu, Y. Liu, J. Zhou, L. Fang, and Y. Zhang, “Disentanglement network for unsupervised speckle reduction of optical coherence tomography images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2020), pp. 675–684.

20. Y. Huang, W. Xia, Z. Lu, Y. Liu, H. Chen, J. Zhou, L. Fang, and Y. Zhang, “Noise-powered disentangled representation for unsupervised speckle reduction of optical coherence tomography images,” IEEE Trans. Med. Imaging 40(10), 2600–2614 (2021). [CrossRef]

21. A. Guo, L. Fang, M. Qi, and S. Li, “Unsupervised denoising of optical coherence tomography images with nonlocal-generative adversarial network,” IEEE Trans. Instrum. Meas. 70, 1 (2020). [CrossRef]

22. J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” arXiv, arXiv: 1803.04189 (2018). [CrossRef]

23. B. Qiu, S. Zeng, X. Meng, Z. Jiang, Y. You, M. Geng, Z. Li, Y. Hu, Z. Huang, C. Zhou, Q. Ren, and Y. Lu, “Comparative study of deep neural networks with unsupervised noise2noise strategy for noise reduction of optical coherence tomography images,” J. Biophotonics 14(11), e202100151 (2021). [CrossRef]

24. Y. Huang, N. Zhang, and Q. Hao, “Real-time noise reduction based on ground truth free deep learning for optical coherence tomography,” Biomed. Opt. Express 12(4), 2027–2040 (2021). [CrossRef]

25. A. Krull, T.-O. Buchholz, and F. Jug, “Noise2void-learning denoising from single noisy images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 2129–2137.

26. J. Batson and L. Royer, “Noise2self: Blind denoising by self-supervision,” in International Conference on Machine Learning, (PMLR, 2019), pp. 524–533.

27. Y. Quan, M. Chen, T. Pang, and H. Ji, “Self2self with dropout: Learning self-supervised denoising from single image,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 1890–1898.

28. T. Huang, S. Li, X. Jia, H. Lu, and J. Liu, “Neighbor2neighbor: Self-supervised denoising from single noisy images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 14781–14790.

29. M. Li, R. Idoughi, B. Choudhury, and W. Heidrich, “Statistical model for oct image denoising,” Biomed. Opt. Express 8(9), 3903–3917 (2017). [CrossRef]

30. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-assisted Intervention, (Springer, 2015), pp. 234–241.

31. L. Fang, S. Li, Q. Nie, J. A. Izatt, C. A. Toth, and S. Farsiu, “Sparsity based denoising of spectral domain optical coherence tomography images: data,” Duke University, 2012), https://people.duke.edu/~sf59/Fang_BOE_2012.htm.

32. L. Fang, S. Li, R. P. McNabb, Q. Nie, A. N. Kuo, C. A. Toth, J. A. Izatt, and S. Farsiu, “Fast acquisition and reconstruction of optical coherence tomography images via sparse representation: data,” Duke, 2013, https://people.duke.edu/~sf59/Fang_TMI_2013.htm.

PPS Scales	Loss Weights	PSNR	SSIM	CNR	ENL	EPI	SNR
$k = 1$	$λ_{1} = 1$	27.8563	0.6822	12.0338	329.5305	0.9441	39.2491
$k = 2$	$λ_{2} = 1$	28.3172	0.7267	13.4267	626.7354	0.9285	42.5336
$k = 4$	$λ_{4} = 1$	28.7338	0.7286	14.5484	792.0502	0.9259	41.4726
$k = 1, 2$	$λ_{1} = 1$ , $λ_{2} = 1$	27.8281	0.7101	13.0049	541.2520	0.9372	42.0412
$k = 1, 2, 4$	$λ_{1} = λ_{2} = λ_{4} = 1$	28.5225	0.7229	13.2884	588.4016	0.9344	41.6201
$k = 1, 2$	$λ_{1} = 1$ , $λ_{2} = 10$	28.5891	0.7338	14.0752	789.4689	0.9305	43.5311

Method	PSNR	SSIM	CNR	ENL	EPI	SNR
Median	25.9139	0.5285	10.5221	206.3732	0.9198	30.3420
Bilateral	24.4600	0.3713	8.9282	120.3475	0.9554	28.2369
Wavelet	23.9226	0.3309	7.6111	100.9201	0.9609	27.4568
NLM	26.4557	0.5358	11.3362	356.5452	0.9600	36.3285
BM3D	28.3337	0.7078	13.0989	681.0441	0.9305	38.7498
MSBTD	23.8262	0.3267	8.1336	159.6328	0.9765	30.8820
DnCNN	26.5856	0.6368	13.8684	417.6871	0.9125	35.0974
Neighbor2Neighbor	26.9162	0.5460	10.0581	168.0785	0.9606	33.2448
Noise2Void	23.2027	0.2914	6.4233	72.5384	0.9084	29.0641
Self2Self	27.8068	0.5845	11.9541	451.5220	0.9415	33.6716
SiameseGAN	25.0919	0.6556	11.2943	455.3130	0.9326	37.1161
DRGAN	23.1444	0.5753	11.4429	273.6177	0.9427	34.2980
MAP-SNR	28.5891	0.7338	14.0752	789.4689	0.9305	43.5311

PPS Scales	Loss Weights	PSNR	SSIM	CNR	ENL	EPI	SNR
$k = 1$	$λ_{1} = 1$	27.8563	0.6822	12.0338	329.5305	0.9441	39.2491
$k = 2$	$λ_{2} = 1$	28.3172	0.7267	13.4267	626.7354	0.9285	42.5336
$k = 4$	$λ_{4} = 1$	28.7338	0.7286	14.5484	792.0502	0.9259	41.4726
$k = 1, 2$	$λ_{1} = 1$ , $λ_{2} = 1$	27.8281	0.7101	13.0049	541.2520	0.9372	42.0412
$k = 1, 2, 4$	$λ_{1} = λ_{2} = λ_{4} = 1$	28.5225	0.7229	13.2884	588.4016	0.9344	41.6201
$k = 1, 2$	$λ_{1} = 1$ , $λ_{2} = 10$	28.5891	0.7338	14.0752	789.4689	0.9305	43.5311

Method	PSNR	SSIM	CNR	ENL	EPI	SNR
Median	25.9139	0.5285	10.5221	206.3732	0.9198	30.3420
Bilateral	24.4600	0.3713	8.9282	120.3475	0.9554	28.2369
Wavelet	23.9226	0.3309	7.6111	100.9201	0.9609	27.4568
NLM	26.4557	0.5358	11.3362	356.5452	0.9600	36.3285
BM3D	28.3337	0.7078	13.0989	681.0441	0.9305	38.7498
MSBTD	23.8262	0.3267	8.1336	159.6328	0.9765	30.8820
DnCNN	26.5856	0.6368	13.8684	417.6871	0.9125	35.0974
Neighbor2Neighbor	26.9162	0.5460	10.0581	168.0785	0.9606	33.2448
Noise2Void	23.2027	0.2914	6.4233	72.5384	0.9084	29.0641
Self2Self	27.8068	0.5845	11.9541	451.5220	0.9415	33.6716
SiameseGAN	25.0919	0.6556	11.2943	455.3130	0.9326	37.1161
DRGAN	23.1444	0.5753	11.4429	273.6177	0.9427	34.2980
MAP-SNR	28.5891	0.7338	14.0752	789.4689	0.9305	43.5311

Self-supervised speckle noise reduction of optical coherence tomography without clean data

Abstract

1. Introduction

2. Methods

2.1 Fundamental theory

2.2 Training with multi-scale pixel patch sampler (PPS)

3. Experiments and results

3.1 Implementation details

3.2 Compared methods and evaluation metrics

3.3 Results

3.3.1 Ablation study of different scale loss

3.3.2 A-line analysis

3.3.3 Quantitative evaluation

3.3.4 Visual effect comparison

3.3.5 Results for low-quality skin OCT

4. Discussions

5. Conclusion

Funding

Acknowledgments

Disclosures

Data Availability

References

Supplementary Material (2)

Data Availability

Cited By

Figures (9)

Tables (2)

Equations (15)

Biomedical Optics Express

Name	Description
Dataset 1	spectral-domain OCT (SD-OCT) datasets
Dataset 2	spectral-domain OCT (SD-OCT) datasets