Unsupervised despeckling of optical coherence tomography images by combining cross-scale CNN with an intra-patch and inter-patch based transformer

Quan Zhou; Quan Zhou; Mingwei Wen; Mingwei Wen; Mingyue Ding; Xuming Zhang

doi:10.1364/OE.459477

1. Introduction

Optical coherence tomography (OCT) is a non-invasive, non-contact and micrometer level cross-sectional imaging technology. This technology has found wide application in clinical ophthalmology [1,2]. However, speckle noise will inevitably be generated during imaging due to coherent imaging mode and detector noise [3,4] and it will influence the subsequent image processing and analysis tasks. To address the OCT image despeckling, numerous denoising methods have been presented. These methods mainly include the spatial domain denoising methods such as frame averaging method, anisotropic filtering [5], total variation [6] and non-local means (NLM) [7] as well as the transform domain denoising methods based on curvelets [8], wavelet [9,10], dual tree complex wavelet [11] and dictionary learning [12,13]. As for the frame averaging method, it works by registering a series of repeated B-scan images acquired at the adjacent or same locations, then averaging the registered images to obtain an image with reduced noise and improved signal-to-noise ratio (SNR) [14]. However, this method will prolong OCT image acquisition time so that it is impractical for the clinical application. As for the dictionary learning based methods, they train a sparse dictionary to characterize high-SNR images and use it to despeckle the adjacent low-SNR images. However, these methods need to capture multiple high SNR B-scan images, which limits their clinical application. Another method based on K-SVD dictionary learning [15,16] and curvelet transform [17] has been presented for OCT image despeckling with no need of high SNR B-scan images. A multiscale sparsity-based tomographic denoising (MSBTD) [12] algorithm has also been designed for despeckling the OCT images. As for the NLM methods, they search for pixels that are similar to the considered pixels by computing the intensity differences of image patches, and implement the weighted averaging on these similar pixels to estimate the intensities of the considered pixels. The probabilistic NLM (PNLM) [18] calculates the corrupted probability of each pixel and updates the weights of the NLM in an iterative manner. The optimized Bayesian NLM (OBNLM) [19] realizes patch-based image despeckling by using an optimized Bayesian framework to calculate the weights based on the characteristics of speckle noise. The block matching and 3-D filtering (BM3D) algorithm [20] realizes image denoising by means of such operations as grouping, collaborative filtering and aggregation. However, the dictionary learning based methods and the NLM methods are very expensive in terms of computational overhead and they may damage some pathological details in the noisy images that are meaningful for clinical diagnosis [21].

To address the issues of the above traditional despeckling methods, the deep learning (DL) has been introduced as a feasible solution to deliver improved denoising performance. The DL based image despeckling methods generally use such networks as the convolutional neural network (CNN) [22–25] and the generative adversarial network (GAN) [26–28] for noise removal in a supervised or unsupervised way. As for the supervised image denoising, the denoising convolutional neural network (DnCNN) [29] has been presented to improve the denoising performance of natural images by combining batch normalization operation and residual learning with the CNN. Cozzolino et al. [30] have designed a synthetic aperture radar image denoising algorithm which integrates the CNN with the NLM (CNN-NLM). Shi et al. [31] have utilized the CNN for speckle reduction in OCT images by adopting a series of strategies including leaky rectified linear units and shortcut connection. Devalla et al. [32] have designed a dilated residual UNet for despeckling the OCT images of the optic nerve head. Kande et al. [33] have presented a siamese GAN to despeckle the low-SNR OCT images. The above DL-based denoising algorithms generally require numerous clean images as the ground truth for network training. However, it is difficult to provide the clean reference image for each noisy image in the clinical applications. Although the multi-frame averaging can be implemented to produce a high-SNR denoised image as a reference image for network training, it will greatly increase the acquisition time and the images may suffer from the artifacts produced by the unavoidable movements during multi-frame acquisition. Moreover, because there exists certain position deviation between multiple OCT images, image registration must be implemented before averaging. These disadvantageous factors render these supervised denoising algorithms difficult to despeckle the clinical OCT images effectively.

Recently, the unsupervised despeckling methods have been presented. Huang et al. [34] have utilized the CNN combined with a method named Noise2Noise for OCT image despeckling without relying on the clean images. Disadvantageously, this method requires to collect multiple images of the same scene with independent noise. Guo et al. [21] have proposed to utilize the discriminator of the GAN to distinguish the real noisy samples from the fake ones and then use the NLM method to improve the denoising performance. Although these unsupervised methods do not require the clean images as the ground truth, they still need multiple adjacent similar B-scan images. Meanwhile, some unsupervised methods require multiple similar images after network training to obtain the correlation information for image post-processing. Moreover, the networks used in these unsupervised methods are based on the CNN. Although the CNN can extract the local image features well due to its local perception and weight sharing characteristics, it is difficult to capture the global context information which is highly significant for ensuring the OCT image despeckling performance. Therefore, it will be highly significant to explore the effective OCT image despeckling method based on the global and local features without requiring the clean images and the multiple similar adjacent images.

In recent years, the transformer has been popular in the natural language processing (NLP) field, and it has been extended to the field of computer vision (CV) because of its ability to represent the global image features by exploring the long-range dependance. In [35], the vision transformer (ViT) has been proposed for the image classification task by dividing the image into multiple image patches, using the linear embedding sequence of these image patches as the input and processing the image patches in the way of processing tokens in the NLP field. The ViT has been used for denoising the natural images [36,37] and the microscopy image [38].

Although the transformer-based algorithms have made great breakthroughs in the CV field, the transformers have some inherent problems including high computational complexity, high hardware requirement, lack of inductive biases of the CNN such as equivariance and locality and insufficient generalization ability after training with insufficient data. The combination of the CNN and the transformer will provide an effective solution to address the problems with the CNN-based and transformer-based image denoising methods. Liang et al. [39] have proposed a powerful baseline model based on Swin Transformer for image restoration by utilizing the CNN and transformer blocks to extract the low-level and high-level features, respectively. This algorithm has achieved significantly better performance than the existing denoising algorithms. However, this method works in a supervised way and thus it cannot be directly used for the OCT image despeckling because the noise-free OCT images are generally inaccessible.

In this paper, we have presented an unsupervised OCT image despeckling method by combining the CNN with the transformer. Different from the existing unsupervised methods that require multiple similar images for network training, the proposed method only uses the input noisy images for training. This method firstly extracts the multi-scale local features using the CNN with different convolution kernel sizes, and then uses the transformer to explore the correlation between image patches and between individual pixels in each patch to produce the global and local features. Finally, an image reconstruction network is used for generating the final despeckled image. The proposed model is trained using a new unsupervised loss function based on the Neighbor2Neighbor [40] which can be used for training any denoising network without post-processing of the output image.

The remainder of this paper is structured as follows. The related works are introduced in Section 2. Section 3 describes the proposed network model and the involved loss function for OCT image denoising. Section 4 presents the experimental results on the clinical OCT images. The conclusion is given in Section 5.

2. Related works

2.1 Neighbor2Neighbor

Some existing DL based image denoising methods require numerous noisy-clean image pairs as the training dataset. The image pairs are produced by adding the simulated noise to the noise-free images or averaging multiple images acquired at the same position of the patient. For the scheme based on the simulated noise, there may exist a big difference between real noise and simulated one in real clinical scenarios. For the multi-frame averaging scheme, the acquisition time of multi-frame images is long and the registration of image pairs is also very troublesome. Besides, the other denoising algorithms that do not require the noisy-clean image pairs often require the indirect clean images or multiple adjacent images for post processing of denoised results. To address the problems of the above schemes, the Noise2Noise [41] has been proposed to train the network based on multiple independent noisy images captured at the same scene. The optimization goal of this method is given by:

(1)$$\mathop {\arg \min }\limits_\vartheta {{\rm E}_{x,y,z}}||{{f_\vartheta }(y) - z} ||_2^2$$

where E denotes the expectation; y, z and x mean two independent noisy images and their ground truth, respectively; ${f_\vartheta }$ is the denoising network with parameter $\vartheta $. In reality, it is difficult to achieve the above situation. Assuming that the clean images x and $x + \varepsilon $ corresponding to the similar noisy images y and z are not equal, we have:

(2)$${{\rm E}_{x,y}}||{{f_\vartheta }(y) - x} ||_2^2 = {{\rm E}_{x,y,z}}||{{f_\vartheta }(y) - z} ||_2^2 - \sigma _z^2 + 2\varepsilon \begin{array}{c} {{{\rm E}_{x,y}}({f_\vartheta }(y) - x)} \end{array}$$

However, utilizing (y, z) to construct the training image pairs directly and using the Noise2Noise to train the denoising network cannot produce the same result as (y, z). When $\varepsilon \to 0$, (y, z) can be regarded as an approximation of the Noise2Noise. The disadvantage of this method lies in that it requires to collect at least two images of the same scene with noise independence for network training. In other words, the denoising network can be trained effectively only when a suitable (y, z) that satisfies the condition of ``being similar but not identical” can be found. Such a requirement is not ideal practice in the clinical scenario.

In order to avoid the requirement of at least two similar but not identical image pairs, the sampling operations is introduced into Neighbor2neighbor method. This method uses the sampling to construct two similar but not identical image pairs and trains the denoising network in the manner of Noise2Noise. The process can be defined as:

(3)$$\mathop {\arg \min {{\rm E}_{x,y}}}\limits_\vartheta {||{{f_\vartheta }({g_1}(y)) - {g_2}(y)} ||^2}$$

where ${g_1}$ and ${g_2}$ are a pair of nearest neighbor samplers. If the above training method is applied directly, it may result in image over-smoothing because of the differences between the sampling positions of similar noisy images. To address this issue, Eq. (3) is corrected by adding a regular term:

(4)$$\begin{array}{l} {\min _\vartheta }{{\rm E}_{y|x}}||{{f_\vartheta }({g_1}(y)) - {g_2}(y)} ||_2^2,\textrm{s}\textrm{.t}.\\ {{\rm E}_{y|x}}\{ {f_\vartheta }({g_1}(y)) - {g_2}(y) - {g_1}({f_\vartheta }(y)) + {g_2}({f_\vartheta }(y))\} = 0 \end{array}$$

2.2 PNLM

For the NLM method, some pixels in OCT images are often severely corrupted so that the similarity between patches cannot be computed accurately. To address this issue, the PNLM has been proposed by reducing the weights of pixels that are severely corrupted in image patches. Based on the observation that the corrupted pixels have certain similarity to impulse noise, the PNLM method will implement noise suppression by identifying impulse noise pixels using the rank-ordered absolute differences (ROAD) statistic [42]. The core idea of the ROAD is to judge whether a pixel is corrupted by measuring the differences between the pixel and its neighbors. Because impulse noise pixels generally will have much higher intensity values than the uncorrupted pixels, the probability value p computed between the ROAD and the uncorrupted pixel will be used to calculate the Euclidean distance $L_{i,j}^2(B(i,r),B(j,r))$ between different patches as:

(5)$$L_{i,j}^2(B(i,r),B(j,r)) = \frac{1}{{{{(2r + 1)}^2}}}\sum\limits_{k \in B(0,r)} {{{({p_{i + k,j + k}}({y_{i + k}} - {y_{j + k}}))}^2}} $$

(6)$${p_{i + k,j + k}} = \min ({p_{i + k}},{p_{j + k}})$$

where $B(i,r)$ denotes the pixels in a search window of the radius r centered at i. The probability values of the uncorrupted pixels not only can be used to determine the similarity weight ${w_{i,j}}$ accurately, but also can be used to estimate the intensity value ${\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over y} _i}$ as:

(7)$${\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over y} _i} = \frac{{\sum\limits_{j \in B(i,f)} {{w_{i,j}}{p_j}{u_j}} }}{{\sum\limits_{j \in B(i,f)} {{w_{i,j}}{p_j}} }}$$

(8)$${w_{i,j}} = \left\{ {\begin{array}{cc} {1,}&{L_{i,j}^2 < \alpha \sigma_n^2}\\ {0,}&{\textrm{otherwise}} \end{array}} \right.$$

where $\sigma _n^2$ denotes the background noise variance, and $\alpha $ is utilized to control the smoothing degree.

Among the above two denoising schemes, the Neighbor2neighbor can preserve some image details well, but it will produce the low-contrast denoised image with some residual noise. By contrast, the PNLM can deliver sufficient noise suppression by minimizing the effect of corrupted pixels on the denoised result, but it may result in image over-smoothing. Furthermore, the robustness of two methods is not ideal for the datasets with large differences in noise distribution.

3. Method

3.1 Overview

In this work, we have designed an unsupervised OCT image despeckling algorithm as shown in Fig. 1. Here, the two pairs of noisy images and denoised images are sub-sampled from the original noisy image and its despeckled result by a neighbor sub-sampler. The sub-sampled image pairs are utilized to compute the Neighbor2Neighbor loss (L_N2N). The PNLM loss (L_PNLM) is calculated on the structural similarity (SSIM) between the despeckled images by the presented method and the PNLM and the mean squared error (MSE) between the features of the two denoised images extracted by the VGG network. Fig. 2 presents the architecture of our denoising network including the cross-scale CNN and the transformer. In the proposed method, the stem module composed of three 3×3 convolutions is used to produce the preliminary feature maps ($F_{fea}^1$). Then three repeated feature extraction modules combined with the multi-scale convolution module and the inter-patch and intra-patch based transformer (Transformer_IP2) are utilized for feature detection and filtering. Different from the encoder-decoder structure, the multi-scale convolution module is designed as a cascaded network with the unchanged resolution to facilitate maintaining image details and removing noise better. For the Transformer_IP2, we have proposed to reduce the patch size through convolutional down-sampling in the interaction of the global and local features, thereby reducing the computational complexity of the transformer. In order to maintain the integrity of the information, the resolution of the feature maps will not be changed during the entire process. In addition, we will use a residual structure for feature extraction to maintain the features of the noisy images. Finally, the reconstruction network utilizes three convolutional layers with 3×3 kernels which can extract the fine structures better to restore the final denoised images.

Fig. 1. The proposed OCT image despeckling method.

Download Full Size | PDF

Fig. 2. The architecture of the CNN and Transformer_IP2 based despeckling network.

Download Full Size | PDF

3.2 CNN and Transformer_IP2 based denoising network

(1) Cross-scale convolution
There exist details of different scales in each input image. If a single convolution kernel is used, the extracted features may not be accurate enough. Therefore, we choose to use a convolution module to fully characterize image features at different scales such as the local details and edge structures, thereby ensuring that the useful information can be retained and speckle noise is removed as much as possible. Fig. 3 shows the designed cross-scale convolution module. Here, a 1×1 convolution is used to adjust the number of channels of input feature maps to 64. Then the feature maps are divided into four equal components in the channel dimension and they are used to learn the features of different scales where the size of the multi-scale convolution kernels is 3×3, 5×5, 7×7 and 9×9, respectively. For the features with different receptive fields, they can improve the feature representation performance through information interaction. The correlation between the output features extracted by different sizes of convolution kernels is further explored by means of convolution and skipping connection. The 3×3 convolution is used for extracting small-scale features by filtering the output from the convolution operation using the first set of kernels to realize the unification of receptive fields at different scales. The obtained feature map and the output of the higher-level convolution are added pixel by pixel for fusion. The above operations are repeated until the fusion with the final output feature is implemented. Finally, the obtained feature maps of different scales are concatenated together in the channel dimension and fused through a 1×1 convolution. In this way, a multi-level fused feature map containing information of different scales is generated.
(2) Transformer_IP2
The Transformer_IP2 is proposed to extract the local and global features to produce the integrated visual priors. The purpose of using the self-attention module is to conduct information mining and filtering of features generated by the feature extraction module to achieve superior detail preservation and noise removal performance.
The proposed transformer_IP2 is shown in Fig. 4. Here, the input feature map is divided into multiple patches which are denoted by red boxes. The patch size varies with the number of patches and the pixels in each patch can also be regarded as a small patch. To reduce computational complexity, the 3×3 convolution with a stride 2 is implemented on each patch to achieve patch encoding. Because not all the vectors formed by the pixels in each patch are critical to the denoising task, they will be filtered through the operation of patch encoding. In this way, the number of vectors in each patch will be reduced and the computational complexity will decrease to 1/256 that of self-attention before convolution downsampling. Then, the self-attention operation is performed on the different vectors for inter-patch and intra-patch, respectively, to realize the correlation mining of their own features.
The transformer module marked by the dashed box on the right side of Fig. 4 includes two components. For the first component, the input feature maps ${F_{fea}}$ are feed into the multi-head self-attention network with the encoder-decoder structure (E-D-MSA). The self-attention is computed as:
$(9)$$Attention(Q, K, V) = softmax(\frac{{Q{K^T}}}{{\sqrt d }})V$$$ where Q, K and V denote the query, key and value matrices, respectively; d means the number of channels for ${F_{fea}}$ and it will act as the scaling factor for avoiding too large matrix values.
In the decoder, we use two deconvolutions with a stride of 2 for each patch to achieve patch decoding. The pixels in the corresponding patch are formed as a vector to implement the self-attention operation. Following this step, each patch is also formed as a whole vector, and the self-attention operation between the patches is performed to characterize the global correlation between different patches. To maintain the integrity of information in the feature maps, a skipping connection operation is introduced to merge the output feature maps with the input ones. The output feature map is normalized by a LayerNorm (LN) layer. The features ${F_{att}}$ processed by the E-D-MSA are added to ${F_{fea}}$ through a residual connection.
For the second component, the output features of the first component are processed through a feedforward network (FFN), and then added to themselves to form a residual connection. The final output feature map ${Z_{fea}}$ is normalized through a LN layer again.
$(10)$$\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over F} = {\textrm{LN}} ({F_{att}} + {F_{fea}})$$$ $(11)$${Z_{fea}} = {\textrm{LN}} ({\textrm{FFN}} (\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over F} ) + \mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over F} )$$$

Fig. 3. The cross-scale convolution module.

Download Full Size | PDF

Fig. 4. The architecture of the Transformer_IP2.

Download Full Size | PDF

3.3 Loss function

We have incorporated two loss functions as the total loss. The first one is the loss produced from Nerighbor2Neighbor. The second one combines the SSIM with the MSE between the features which are extracted by the VGG network [43].

(1) Neighbor2Neighbor loss
The Neighbor2Neighor only requires one noisy image for constructing the loss, and it avoids the disadvantage of Noise2Noise that multiple sampling of the same scene is required. The steps to calculate the Neighbor2Neighbor loss are as follows. Firstly, ${g_1}(y)$ and ${g_2}(y)$ are obtained by means of neighbor sampling of the original speckled image twice. Then, the MSE between ${f_\vartheta }({g_1}(y))$ denoised by the proposed method and ${g_2}(y)$ is calculated. The regularization term loss is computed by the MSE between ${f_\vartheta }({g_1}(y)) - {g_2}(y)$ and ${g_1}({f_\vartheta }(y)) - {g_2}({f_\vartheta }(y))$. The Neighbor2Neighbor loss ${L_{N2N}}$ can be expressed as:
$(12)$${L_{N2N}} = ||{{f_\vartheta }({g_1}(y)) - {g_2}(y)} ||_2^2 + \gamma ||{{f_\vartheta }({g_1}(y)) - {g_2}(y) - ({g_1}({f_\vartheta }(y)) - {g_2}({f_\vartheta }(y)))} ||_2^2$$$ where the parameter $\gamma $ is utilized to achieve a trade-off between the reconstruction loss and the regularization term, and it is set to 1 according to [45].
(2) PNLM loss
The introduction of the PNLM loss is mainly to balance structural edge preservation and image smoothness. Here, the VGGNet is used for extracting the features of the denoised images by our method and the PNLM and the MSE between the extracted feature maps is calculated. Moreover, the SSIM is introduced to constrain contrast, brightness and structure of the despeckled image. The PNLM loss is given by:
$(13)$${L_{PNLM}} = 0.2(||{{\textrm{VGG}} ({f_\vartheta }(y)) - {\textrm{VGG}} ({f_{PNLM}}(y))} ||_2^2) + 2{\textrm{SSIM}} ({f_\vartheta }(y),{f_{PNLM}}(y))$$$ where ${f_{PNLM}}$ means the image denoised by the PNLM method.
(3) Total unsupervised loss
To preserve the structural details and ensure the denoising performance, the unsupervised loss of the proposed network is designed to combine ${L_{N2N}}$ and ${L_{PNLM}}$. The total unsupervised loss ${L_{total}}$ is given by:
$(14)$${L_{total}} = \lambda {L_{PNLM}} + (1 - \lambda ){L_{N2N}}. $$$ where $\lambda $ is used to balance ${L_{PNLM}}$ and ${L_{N2N}}$.

4. Experiments

4.1 Dataset

The data used in this paper are two OCT (Spectralis OCT, Heidelberg Engineering, Germany) retina datasets. The first dataset is provided by Shanghai First People's Hospital, Beijing Tongren Eye Center, Shiley Eye Institute of University of California San Diego, California Retinal Research Foundation, Medical Center Ophthalmology Associates. All images in this dataset are chosen from retrospective cohorts of adult patients [44]. This dataset includes diabetic macular edema (DME) with retinal-thickening-associated intraretinal fluid, choroidal neovascularization (CNV) with neovascular membrane and associated subretinal fluid, multiple drusen present in early age-related macular degeneration (AMD) and normal retina. We have randomly selected 5000 images as the training and test sets for our proposed method. The second dataset includes the OCT images with age-related macular degeneration which have been registered at ClinicalTrials.gov (Identifier: NCT00734487) and approved by the institutional review boards of the 4A2A SD-OCT clinics (Devers Eye Institute, Duke Eye Center, Emory Eye Center, and National Eye Institute) [45]. With adherence to the tenets of the Declaration of Helsinki, informed consent has been obtained from all subjects. Similarly, 5000 OCT images have been randomly selected from this dataset as the second training and test sets. For the two OCT image datasets, all randomly selected images will be cropped to 256×256 to ensure computational efficiency of our method and the number of the training set, the validation set and the test set is 4500, 480, and 20, respectively.

4.2 Evaluation metrics

To appreciate the restoration performance of all evaluated methods quantitatively, we have adopted the four metrics including contrast-to-noise ratio (CNR) [46], signal-to-noise ratio (SNR) [46], equivalent number of looks (ENL) [46] and image sharpness index $\beta $ [46]. Here, we will compute the average CNR on the six regions of interest (ROIs) including three homogeneous ROIs and three non-homogeneous ROIs. The SNR and ENL are computed on the background area and $\beta $ is computed on the entire image.

(1) CNR
The CNR is used for appreciating the speckle suppression performance of a despeckling method based on the contrast between multiple ROIs and the background area in the denoised image. This metric is calculated as:
$(15)$$\textrm{CN}{\textrm{R}_m} = 10\log \frac{{{\mu _m} - {\mu _B}}}{{\sqrt {\sigma _m^2 + \sigma _B^2} }}$$$ $(16)$$\textrm{CNR} = \frac{1}{M}\sum\limits_{m = 1}^M {\textrm{CN}{\textrm{R}_m}} $$$ where ${\mu _B}$ and ${\mu _m}$ mean the average pixel intensities of the background region and the m-th ROI respectively; ${\sigma _B}$ and ${\sigma _m}$ mean the standard deviations of the background region and the m-th ROI, respectively; M denotes the number of ROIs involved in the calculation.

(2) SNR
The SNR is used for evaluating the global denoising performance and it is given by:
$(17)$$\textrm{SNR} = 20\log \frac{{{I_{\max }}}}{{{\sigma _B}}}$$$ where ${I_{\max }}$ represents the maximum pixel intensity in the denoised image.

(3) ENL
The ENL is used for evaluating the denoising performance by measuring the smoothness of areas which appear to be homogeneous. A larger ENL value means better denoising performance in the corresponding area. The ENL is given by:
$(18)$${\textrm{ENL}} = \frac{{\mu _B^2}}{{\sigma _B^2}}$$$

(4) $\beta $
$\beta $ is used to measure how much the edge sharpness has degraded as a result of the denosing process. The higher $\beta $ means that the edge sharpness has degraded less. It is calculated as:
$(19)$$\beta = \frac{{\Gamma (\Delta y - \overline {\Delta y} ,\Delta I - \overline {\Delta I} )}}{{\sqrt {\Gamma (\Delta y - \overline {\Delta y} ,\Delta y - \overline {\Delta y} ) \cdot \Gamma (\Delta I - \overline {\Delta I} ,\Delta I - \overline {\Delta I} )} }}$$$ $(20)$$\Gamma ({I_1},{I_2}) = \sum {{I_1}} \cdot {I_2}$$$ where I denotes the despeckled OCT image; $\Gamma $ means that the corresponding pixels of the two images are multiplied and summed. $\Delta I$ and $\Delta y$ are the processed results of $\Delta I$ and $\Delta y$ by a 3×3 filter which approximates the Laplacian operator; $\overline {\Delta I} $ and $\overline {\Delta y} $ denote the mean of $\Delta I$ and $\Delta y$, respectively.

4.3 Compared methods

The compared despeckling algorithms include the traditional algorithms such as BM3D [20], PNLM [18], nonlinear complex diffusion filter (NCDF) [6] and OBNLM [19] as well as the DL based methods such as DnCNN [29], CNN-NLM [30] and Neighbor2Neighbor [40]. Here, we have not chosen the existing unsupervised despeckling algorithms for comparison as they require a certain number of adjacent images for network training or post-processing. However, our method will focus on the DL based OCT image despeckling which will not rely on the adjacent images at all. The compared traditional algorithms are realized using Matlab on a computer (CPU: Inter I5-10400, 2.9 GHz, 16GB of RAM). The compared DL based denoising algorithms and our method are realized using Pytorch on the Ubuntu system, where the NVIDIA GeForce RTX 2080Ti GPU with 11GB memory is used for acceleration. We have chosen Adam as the network optimizer, which absorbs the advantages of the adaptive learning rate gradient descent algorithm (Adagrad) and the momentum gradient descent algorithm. It can not only adapt to sparse gradients, but also alleviate the problem of gradient oscillations. We will fix the initial learning rate to 0.0001, and fix the batch size and the total number of training epochs to 16 and 30, respectively. For the fair comparison, all compared DL based algorithms use the same loss to that we have proposed in this paper.

4.4 Ablation study

(1) Effect of different $\lambda $
Here, we will make an analysis of the influence of $\lambda $ in the loss function on the despeckling performance of our method. The range of $\lambda $ is 0 to 1 with a step of 0.1. When $\lambda $ is 0, it means that Neighbor2Neighbor is utilized as the loss function. When $\lambda $ is 1, it means that the image denoised by the PNLM algorithm is used as a pseudo label to calculate the loss. Fig. 5 shows the performance of the proposed Transformer_IP2 performed on 20 test images using the various $\lambda $. Obviously, there exists obvious noise in the despeckled image, but the structure is preserved relatively well when $\lambda $ is small. When $\lambda $ is large, the noise can be smoothed out while some image details are lost. The average quantitative results for the different $\lambda $ values are listed in Table 1. Here, the red and green boxes in Fig. 5(a) are selected to calculate SNR and ENL while the green and blue boxes are used to calculate CNR. Clearly, with the increase of $\lambda $, CNR and $\beta $ will generally increase and decrease with certain fluctuations, respectively. The SNR and ENL will first increase and then decrease. The comprehensive consideration of the above four metrics indicates that the relatively satisfactory restoration results can be obtained when $\lambda $ is in the range of 0.4 to 0.5. Therefore, we will choose $\lambda = 0.4$ for the ablation study.
(2) Effect of different patch size
In order to explore the effect of the patch size on the performance of the proposed method, we have chosen different patch sizes 64×64, 32×32 and 16×16 for comparison. The corresponding average quantitative results are listed in Table 2. It can be seen from Table 2 that when the patch size is set to 32×32, the proposed method can achieve the best CNR, SNR and ENL and the second highest value of $\beta $. Based on the comprehensive consideration, we will set the patch size to 32×32 for the proposed method.
(3) The effectiveness of the proposed network structure
To further demonstrate the effectiveness of our network structure, we will compare two models including our method and its variant named Transformer_NO. Here, the Transformer_NO model is produced by removing the Transformer_IP2 module from our method. Fig. 6 shows the denoised results of the two models with the same loss function operating on the first dataset. We can see from Fig. 6 that the proposed method can remove noise more effectively and preserve details better the Transformer_NO. The average quantitative results of the proposed method and the Transformer_NO are listed in Table 3. Table 3 shows that the proposed method outperforms the Transformer_NO for all metrics.

Fig. 5. The denoised results of the CNN-Transformer_IP2 method for the different $\beta $ values. (a) The speckled OCT image; (b)-(l) The despeckled images for $\beta $ in the range of [0, 1] with a step of 0.1, respectively.

Download Full Size | PDF

Table 1. The average CNR, SNR, ENL and $\beta $ for the proposed method using different $\lambda $.

View Table | View all tables in this article

Table 2. The average CNR, SNR, ENL and $\beta $ for the proposed method using different patch size.

View Table | View all tables in this article

Fig. 6. The denoised results of our method and the Transformer_NO. (a) The noisy OCT image; (b) Transformer_NO; (c) Our method.

Download Full Size | PDF

Table 3. The average CNR, SNR, ENL and $\beta $ of the proposed method and the Transformer_NO.

View Table | View all tables in this article

4.5 Comparison of denoised results of different algorithms

To verify the superiority of our algorithm, we will make visual and quantitative comparisons among the despeckled results of our algorithm and the BM3D, OBNLM, NCDF, PNLM, DnCNN, CNN-NLM and Neighbor2Neighbor algorithms operating on the two OCT test sets, where we will set $\lambda = 0.45$ and $\lambda = 0.4$ for the proposed method operating on the first OCT test set and the second one, respectively. Fig. 7 presents the denoised results of the chosen OCT image in the first test set for the different algorithms. Obviously, all traditional algorithms cannot suppress noise in the background area of the image effectively. Meanwhile, Figs. 7 (b-d) show that the BM3D, OBNLM and PNLM methods can suppress noise in the object areas relatively well, but they damage image details to different extent. The NCDF not only cannot remove noise effectively in the object areas, but also damages the details seriously as indicated by the green and blue boxes in Fig. 7 (e). For the DL based algorithms, the Neighbor2Neighbor method produces discontinuous edges and maintains much residual noise in the denoised result as indicated by the red and green boxes in Fig. 7 (f). The DnCNN cannot maintain the sharpness of some edges well (see the blue and red boxes in Fig. 7 (g)). The CNN-NLM demonstrates outstanding despeckling performance, but it damages the layer structures in the OCT image very seriously (see the blue and red boxes in Fig. 7 (h)). By contrast, our method can preserve the fine details and layer structures better while suppressing noise more effectively. Fig. 8 further shows the corresponding method noise, which is computed as the difference between the noisy image and each despeckled result in Fig. 7. For the BM3D, OBNLM, PNLM and NCDF algorithms, the noise in their method noise results seem to be greatly different from that in the original noisy image, which means that these algorithms cannot suppress noise effectively. The Neighbor2Neighbor performs similarly to these traditional algorithms in terms of noise removal, and it remains some image details in its method noise. The comparison with the DnCNN and the CNN-NLM demonstrates that our algorithm can maintain higher similarity between the noise in its method noise and that in the noisy image.

Fig. 7. The first test OCT image chosen from the first dataset and its denoised results for the various algorithms. (a) The noisy image, (b) BM3D, (c) OBNLM, (d) PNLM, (e) NCDF, (f) Neighbor2Neighbor, (g) DnCNN, (h) CNN-NLM, (i) Our method.

Download Full Size | PDF

Fig. 8. The method noise computed between the noisy image in Fig. 7(a) and the denoised images by (a) BM3D, (b) OBNLM, (c) PNLM, (d) NCDF, (e) Neighbor2Neighbor, (f) DnCNN, (g) CNN-NLM, (h) Our method.

Download Full Size | PDF

Fig. 9 presents the denoised results of the different despeckling algorithms on another OCT image from the first test set. Likewise, our algorithm outperforms other evaluated algorithms in that it can enhance the edge sharpness and deliver noise reduction better as indicated by the green, blue and red boxes.

Fig. 9. The second test OCT image chosen from the first dataset and its denoised results for the various methods. (a) The noisy image, (b) BM3D, (c) OBNLM, (d) PNLM, (e) NCDF, (f) Neighbor2Neighbor, (g) DnCNN, (h) CNN-NLM, (i) Our method.

Download Full Size | PDF

The average quantitative results of all evaluated methods on the first test set are listed in Table 4. Obviously, our method can provide the best values for such evaluation metrics as SNR, ENL and $\beta $. The higher SNR and ENL values mean that our method can generate higher-quality despeckled image and can remove background noise more effectively, respectively. Although our algorithm performs slightly poorer than several other algorithms in terms of CNR, it outperforms all other algorithms in detail preservation due to its higher $\beta $ value. The quantitative comparison further proves that our algorithm has superior performance in background noise suppression and edge preservation.

Table 4. The average SNR, ENL, CNR, $\beta $ and running time for the different methods performed on the first dataset.

View Table | View all tables in this article

Moreover, the average running time of different denoising methods operating on 20 test images is also listed in Table 4. Clearly, our method has significantly higher implementation efficiency than the traditional algorithms. Although the proposed method has slightly more running time than other DL based despeckling methods, it only takes 6.5 ms to despeckle one OCT image, which benefits from the end-to-end network architecture and the utilization of GPU and multithread.

Besides, Fig. 10 visually shows the denoised results of the different despeckling algorithms performed on the first test OCT image chosen from the second dataset. Fig. 11 shows the method noise for Fig. 10. Fig. 12 further presents the despeckled results for the second test OCT image from this dataset.

Fig. 10. The first test OCT image chosen from the second dataset and the corresponding denoised images. (a) The noisy image, (b) BM3D, (c) OBNLM, (d) PNLM, (e) NCDF, (f) Neighbor2Neighbor, (g) DnCNN, (h) CNN-NLM, (i) Our method.

Download Full Size | PDF

Fig. 11. The method noise computed between the noisy image in Fig. 10(a) and the denoised images by (a) BM3D, (b) OBNLM, (c) PNLM, (d) NCDF, (e) Neighbor2Neighbor, (f) DnCNN, (g) CNN-NLM, (h) Our method.

Download Full Size | PDF

Fig. 12. The second test OCT image chosen from the second dataset and the corresponding denoised images. (a) The noisy image, (b) BM3D, (c) OBNLM, (d) PNLM, (e) NCDF, (f) Neighbor2Neighbor, (g) DnCNN, (h) CNN-NLM, (i) Our method.

Download Full Size | PDF

We can see from Fig. 10 and Fig. 12 that the denoised result of our method has clearer layer structure than those of other compared methods as shown by the red, blue and green boxes. The observation from Fig. 11 shows that as regards the noise in the method noise results, the proposed method provides the most similar noise to that in the original image than other methods. Meanwhile, the proposed method maintains fewer image details in the method noise result than the most competitive DnCNN, which indeed demonstrates that our method has superior image detail preservation performance than the DnCNN.

Table 5 further lists the average quantitative metric results of different algorithms for the second dataset. Likewise, the SNR, ENL and $\beta $ of our algorithm are superior to those of the other evaluated algorithms.

Table 5. The average SNR, ENL, CNR and $\beta $ for the different methods performed on the second dataset.

View Table | View all tables in this article

5. Conclusion

This work has proposed a novel unsupervised OCT image denoising algorithm which combines the cross-scale CNN and the transformer. The cross-scale convolution uses the convolution kernels of different sizes to extract the low-level local feature information at different scales.

The proposed encoder-decoder based Transformer-IP2 can explore the inter-patch and intra-patch correlation of features extracted by the multi-scale CNN to capture the global and local features effectively. The proposed method is trained using an unsupervised loss function including the detail preservation-oriented term derived from Neighbor2Neighbor and the image-smoothing oriented term derived from the despeckled result of the PNLM algorithm. The experimental results on the two OCT datasets show that our algorithm performs better than several popular denoising algorithms in that it not only removes noise and maintains the layer structure more effectively but also provides higher quantitative metrics such as SNR, ENL and image sharpness index. Indeed, the presented method has the great potential to be applied to the real OCT imaging systems.

In the future work, we will improve the efficiency and restoration performance of the proposed method. The efficiency improvement can be made by transferring the code to C++, which will dramatically shorten the time to facilitate the application of the proposed method to the real-time OCT image despeckling. Meanwhile, the higher performance GPU and multi-GPU can also be used to further accelerate the proposed algorithm. For the despeckling performance improvement, we can further explore the multiscale feature representations of query-key pairs, and optimize the patch division scheme to improve the accuracy of the global information representation, thereby delivering better restoration performance in terms of noise reduction and detail preservation.

Funding

National Natural Science Foundation of China (61871440).

Acknowledgments

We will thank the medical ultrasound lab at Huazhong University of Science and Technology for providing GPU computation platform.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. M. Choma, M. Sarunic, C. Yang, and J. Izatt, “Sensitivity advantage of swept source and Fourier domain optical coherence tomography,” Opt. Express 11(18), 2183–2189 (2003). [CrossRef]

2. D. Huang, E. A. Swanson, C. P. Lin, J. S. Schuman, W. G. Stinson, W. Chang, M. R. Hee, T. Flotte, K. Gregory, C. A. Puliafito, and J. G. Fujimoto, “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]

3. J. M. Schmitt, S. H. Xiang, and K. M. Yung, “Speckle in optical coherence tomography,” J. Biomed. Opt. 4(1), 95–105 (1999). [CrossRef]

4. B. Karamata, K. Hassler, M. Laubscher, and T. Lasser, “Speckle statistics in optical coherence tomography,” J. Opt. Soc. Am. A 22(4), 593–596 (2005). [CrossRef]

5. R. Bernardes, C. Maduro, P. Serranho, A. Araújo, S. Barbeiro, and J. Cunha-Vaz, “Improved adaptive complex diffusion despeckling filter,” Opt. Express 18(23), 24048 (2010). [CrossRef]

6. W. Zhao and H. Lu, “Medical image fusion and denoising with alternating sequential filter and adaptive fractional order total variation,” IEEE Trans. Instrum. Meas. 66(9), 2283–2294 (2017). [CrossRef]

7. A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR, 60–65 (2005).

8. Z. Jian, L. Yu, B. Rao, B. J. Tromberg, and Z. Chen, “Three-dimensional speckle suppression in optical coherence tomography based on the curvelet transform,” Opt. Express 18(2), 1024–1032 (2010). [CrossRef]

9. H. Rabbani, R. Nezafat, and S. Gazor, “Wavelet-domain medical image denoising using bivariate laplacian mixture model,” IEEE Trans. Biomed. Eng. 56(12), 2826–2837 (2009). [CrossRef]

10. H. Rabbani, M. Vafadust, P. Abolmaesumi, and S. Gazor, “Speckle denoising of medical ultrasound images in complex wavelet domain using mixture priors,” IEEE Trans. Biomed. Eng. 55(9), 2152–2160 (2008). [CrossRef]

11. S. Chitchian, M. A. Fiddy, and N. M. Fried, “Denoising during optical coherence tomography of the prostate nerves via wavelet shrinkage using dual-tree complex wavelet transform,” J. Biomed. Opt. 14(1), 014031 (2009). [CrossRef]

12. L. Fang, S. Li, Q. Nie, J. A. Izatt, C. A. Toth, and S. Farsiu, “Sparsity based denoising of spectral domain optical coherence tomography images,” Biomed. Opt. Express 3(5), 927–942 (2012). [CrossRef]

13. M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. on Image Process. 15(12), 3736–3745 (2006). [CrossRef]

14. A. W. Scott, S. Farsiu, L. B. Enyedi, D. K. Wallace, and C. A. Toth, “Imaging the infant retina with a hand-held spectral-domain optical coherence tomography device,” Am. J. Ophthalmol. 147(2), 364–373.e2 (2009). [CrossRef]

15. L. Fang, S. Li, R. P. McNabb, Q. Nie, A. N. Kuo, C. A. Toth, J. A. Izatt, and S. Farsiu, “Fast acquisition and reconstruction of optical coherence tomography images via sparse representation,” IEEE Trans. Med. Imaging 32(11), 2034–2049 (2013). [CrossRef]

16. L. Fang, S. Li, D. Cunefare, and S. Farsiu, “Segmentation based sparse reconstruction of optical coherence tomography images,” IEEE Trans. Med. Imaging 36(2), 407–421 (2017). [CrossRef]

17. M. Esmaeili, A. Dehnavi, H. Rabbani, and F. Hajizadeh, “Speckle denoising in optical coherence tomography using two-dimensional curvelet-based dictionary learning,” J Med Signals Sens 7(2), 86–91 (2017). [CrossRef]

18. H. Yu, J. Gao, and A. Li, “Probability-based non-local means filter for speckle noise suppression in optical coherence tomography images,” Opt. Lett. 41(5), 994–997 (2016). [CrossRef]

19. P. Coupe, P. Yger, S. Prima, P. Hellier, C. Kervrann, and C. Barillot, “An optimized blockwise nonlocal means denoising filter for 3-D magnetic resonance images,” IEEE Trans. Med. Imaging 27(4), 425–441 (2008). [CrossRef]

20. K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Trans. on Image Process. 16(8), 2080–2095 (2007). [CrossRef]

21. A. Guo, L. Fang, M. Qi, and S. Li, “unsupervised denoising of optical coherence tomography images with nonlocal-generative adversarial network,” IEEE Trans. Instrum. Meas. 70, 1–12 (2021). [CrossRef]

22. H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2392–2399 (2012).

23. K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN-based image denoising,” IEEE Trans. on Image Process. 27(9), 4608–4622 (2018). [CrossRef]

24. H. Yao, H. Ma, Y. Li, and Q. Feng, “DnResNeXt network for desert seismic data denoising,” IEEE Geosci. Remote Sensing Lett. 19, 1–5 (2022). [CrossRef]

25. B. Qiu, Z. Huang, X. Liu, X. Meng, Y. You, G. Liu, K. Yang, A. Maier, Q. Ren, and Y. Lu, “Noise reduction in optical coherence tomography images using a deep neural network with perceptually-sensitive loss function,” Biomed. Opt. Express 11(2), 817 (2020). [CrossRef]

26. J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Išgum, “Generative adversarial networks for denoising in low-dose CT,” IEEE Trans. Med. Imaging 36(12), 2536–2545 (2017). [CrossRef]

27. Y. Zhou, K. Yu, M. Wang, Y. Ma, Y. Peng, Z. Chen, W. Zhu, F. Shi, and X. Chen, “Speckle noise reduction for OCT images based on image style transfer and conditional GAN,” IEEE J. Biomed. Health Inform. 26(1), 139–150 (2022). [CrossRef]

28. Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kaira, Y. Zhang, L. Sun, and G. Wang, “Low-dose CT image denoising using a generative adversarial network with wasserstein distance and perceptual loss,” IEEE Trans. Med. Imaging 37(6), 1348–1357 (2018). [CrossRef]

29. K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. on Image Process. 26(7), 3142–3155 (2017). [CrossRef]

30. D. Cozzolino, L. Verdoliva, G. Scarpa, and G. Poggi, “Nonlocal CNN SAR image despeckling,” Remote Sens. 12(6), 1006 (2020). [CrossRef]

31. F. Shi, N. Cai, Y. Gu, D. Hu, Y. Ma, Y. Chen, and X. Chen, “DeSpecNet: a CNN-based method for speckle reduction in retinal optical coherence tomography images,” Phys. Med. Biol. 64(17), 175010 (2019). [CrossRef]

32. S. K. Devalla, G. Subramanian, T. H. Pham, X. Wang, S. Perera, T. Tun, T. Aung, L. Schmetterer, and A. H. Thiery, “A deep learning approach to denoise optical coherence tomography images of the optic nerve head,” Sci. Rep. 9(1), 14454 (2019). [CrossRef]

33. N. A. Kande, R. Dakhane, A. Dukkipati, and P. K. Yalavarthy, “SiameseGAN: A generative model for denoising of spectral domain optical coherence tomography images,” IEEE Trans. Med. Imaging 40(1), 180–192 (2021). [CrossRef]

34. Y. Huang, N. Zhang, and Q. Hao, “Real-time noise reduction based on ground truth free deep learning for optical coherence tomography,” Biomed. Opt. Express 12(4), 2027–2040 (2021). [CrossRef]

35. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16×16 words: transformers for image recognition at scale,” 2021, Available: https://arxiv.org/abs/2010.11929.

36. H. Chen, Y. Wang, T. Guo, C. Xu, Y. Peng, Z. Liu, S. Ma, C. Xu, and W. Gao, “Pre-trained image processing transformer,” in Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR, 12294–12305 (2021).

37. W. Li, X. Lu, J. Lu, X. Zhang, and J. Jia, “On efficient transformer and image pre-training for low-level vision,” 2021, Available: https://arxiv.org/abs/2112.10175.

38. Z. Wang, Y. Xie, and S. Ji, “Global voxel transformer networks for augmented microscopy,” Nat Mach Intell 3(2), 161–171 (2021). [CrossRef]

39. J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using swin transformer,” in Proceedings of IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 1833–1844 (2021).

40. T. Huang, S. Li, X. Jia, H. Lu, and J. Liu, “Neighbor2Neighbor: Self-supervised denoising from single noisy images,” in Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR, 14776–14785 (2021).

41. J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2Noise: Learning image restoration without clean data,” in Proceedings of 35th Int. Conf. Mach. Learn. (ICML), 4620–4631 (2018).

42. R. Garnett, T. Huegerich, C. Chui, and W. He, “A universal noise removal algorithm with an impulse detector,” IEEE Trans. on Image Process. 14(11), 1747–1754 (2005). [CrossRef]

43. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, Available: http://arxiv.org/abs/1409.1556.

44. P. Mooney, “Retinal OCT Images (optical coherence tomography),” OCT2017 (2018), https://www.kaggle.com/paultimothymooney/kermany2018.

45. S. Farsiu, S. J. Chiu, R. V. O’Connell, F. A. Folgar, E. Yuan, J. A. Izatt, and C. A. Toth, “Quantitative classification of eyes with and without intermediate age-related macular degeneration using optical coherence tomography,” Ophthalmology 121(1), 162–172 (2014). [CrossRef]

46. A. Pizurica, L. Jovanov, B. Huysmans, V. Zlokolica, P. De Keyser, F. Dhaenens, and W. Philips, “Multiresolution denoising for optical coherence tomography: A review and evaluation,” Curr. Med. Imaging Rev. 4(4), 270–284 (2008). [CrossRef]

$λ$	CNR	SNR	ENL	$β$
0.0	1.262	28.279	3.524	0.644
0.1	4.776	41.439	130.329	0.772
0.2	4.592	39.540	79.612	0.779
0.3	5.603	40.927	164.566	0.732
0.4	8.532	48.984	1045.717	0.725
0.5	7.564	43.398	273.866	0.732
0.6	7.145	62.263	197.705	0.704
0.7	9.096	39.681	95.280	0.698
0.8	8.580	40.274	68.765	0.679
0.9	9.783	40.065	143.546	0.681
1.0	10.206	40.459	224.373	0.675

Patch size	CNR	SNR	ENL	$β$
64×64	8.450	40.134	589.075	0.701
32×32	8.532	48.984	1045.717	0.725
16×16	8.117	42.216	736.598	0.727

Methods	CNR	SNR	ENL	$β$
Transformer-NO	6.714	39.110	585.684	0.724
Our method	8.532	48.984	1045.717	0.725

Methods	CNR	SNR	ENL	$β$	Time(ms)
BM3D	8.260	37.360	112.196	0.639	560
OBNLM	11.101	41.020	260.594	0.685	2698
PNLM	11.016	40.979	259.861	0.687	784
NCDF	5.670	34.786	60.453	0.706	76
Neighbor2Neighbor	1.182	25.167	3.753	0.584	5.75
DnCNN	9.889	41.308	55.918	0.722	3.02
CNN-NLM	9.942	35.761	45.310	0.611	2.33
Proposed	9.121	49.243	987.715	0.737	6.43

Method	CNR	SNR	ENL	$β$	Time(ms)
BM3D	4.346	35.932	498.979	0.721	532
OBNLM	7.866	37.701	666.197	0.591	2712
PNLM	9.025	45.553	4228.853	0.509	766
NCDF	3.550	31.054	118.065	0.653	73
Neighbor2Neighbor	1.332	110.134	9820.196	0.572	5.38
DnCNN	10.920	43.764	29.626	0.433	3.21
CNN-NLM	6.786	28.269	10.929	0.739	2.65
Proposed	7.937	154.619	13160.263	0.741	6.14

Unsupervised despeckling of optical coherence tomography images by combining cross-scale CNN with an intra-patch and inter-patch based transformer

Abstract

1. Introduction

2. Related works

2.1 Neighbor2Neighbor

2.2 PNLM

3. Method

3.1 Overview

3.2 CNN and Transformer_IP2 based denoising network

3.3 Loss function

4. Experiments

4.1 Dataset

4.2 Evaluation metrics

4.3 Compared methods

4.4 Ablation study

4.5 Comparison of denoised results of different algorithms

5. Conclusion

Funding

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (12)

Tables (5)

Equations (20)

Optics Express