Semi-supervised OCT lesion segmentation via transformation-consistent with uncertainty and self-deep supervision

Hailan Shen; Qiao Yang; Zailiang Chen; Ziyu Ye; Peishan Dai; Xuanchu Duan

doi:10.1364/BOE.492680

1. Introduction

Optical coherence tomography (OCT) is a rapidly advancing ocular imaging technique that provides non-invasive and high-resolution imaging of the retina. OCT imaging has become an essential tool for diagnosing and managing retinal diseases, including choroidal neovascularization (CNV), central serous chorioretinopathy (CSC), diabetic macular edema (DME), and drusen. Accurate segmentation of lesions in OCT images is critical for evaluating disease progression and treatment outcomes.

Most existing studies have focused on OCT lesion segmentation [1–9]. Traditional OCT lesion segmentation methods [1–3] require manual feature extraction and demonstrate poor generalization performance. In recent years, deep learning-based approaches [4–9] have shown promising results in OCT lesion segmentation. For example, Hu et al. [4] developed a deep neural network with modified atrous spatial pyramid pooling (ASPP) to automatically segment subretinal fluid (SRF) and pigment epithelial detachment (PED) lesions in OCT images. Similarly, Hassan et al. [5] proposed the RFS-Net, a network that integrates ASPP, residual, and inception modules to segment multi-class retinal fluid (MRF), including intraretinal fluid (IRF), SRF, and PED. Liu et al. [6] introduced an improved U-Net segmentation method that automatically locates the fluid region by introducing an attention mechanism and utilizes dense skip connections to combine high-level and low-level features of the image. Xing et al. [7] proposed a network architecture based on FCN to simultaneously segment three types of pathological fluid lesions in OCT images. Wang et al. [8] proposed a novel multi-scale wavelet-enhanced transformation network for segmenting small lesions in OCT images. Parra et al. [9] proposed LOCTSeg for end-to-end automated segmentation of diagnostic markers in OCT B-scans.

Despite recent advances in OCT lesion segmentation, variances in lesion shape, variations in lesion shape, blurred boundaries, and interference from scattered plate noise cause it to remain challenging. Additionally, fully supervised deep learning methods typically rely on a large number of pixel-wise annotations, which are time-consuming and labor-intensive to obtain. In contrast, semi-supervised segmentation-based methods offer a promising alternative by effectively reducing the requirement for pixel-wise annotations while maintaining good performance.

In recent years, numerous studies have attempted to reduce the cost of pixel-wise annotations required for medical image segmentation tasks through semi-supervised learning. These approaches can be broadly categorized into several categories, such as extensions of the Mean-Teacher framework [10], including pixel-wise consistency [11], uncertainty assessment [12,13], and transformation-consistent [14] methods. Co-training [15–17] has also yielded good results in many semi-supervised medical image segmentation tasks. Deep adversarial training [18,19] is another approach that utilizes unlabeled data by employing discriminators to adjust the distribution of labeled and unlabeled data. Additionally, cross-task consistency methods encourage different tasks to achieve similar representations in predefined spaces, such as image reconstruction [20], segmentation and size regression [21], and level set regression [22]. Despite the success of semi-supervised learning in reducing the cost of pixel-wise annotations in medical image segmentation tasks, there are currently only a limited number of semi-supervised methods available for OCT lesion segmentation [23]. As a result, the accuracy of OCT lesion semi-supervised segmentation needs further improvement, especially for lesions with faint regions and various shapes and sizes.

To this end, we propose a novel semi-supervised framework for OCT lesion segmentation, termed Transformation-Consistent with Uncertainty and Self-deep Supervision (TCUS), to tackle the problems above. Firstly, TCUS introduces a transformation-consistent to enhance the regularization effect of the semi-supervised framework. It aims to ensure that the student network learns the transformation-invariant representation as the teacher network. Secondly, an uncertainty-guided transformation-consistent strategy is designed to address the issue of hazy lesion borders. The strategy utilizes uncertainty information from the teacher network to gradually guide the student network to learn from meaningful and reliable targets. Thirdly, self-deep supervision is employed by adding a segmentation head at different resolutions of the student network decoder to generate a set of segmentation results at different scales. It allows the framework to learn about lesions of various sizes and shapes by utilizing multi-scale information and improving the segmentation performance. We evaluate TCUS on the Kermany dataset [24] and private OCT dataset. The results show that TCUS improves significantly compared with the supervised baseline and outperforms existing semi-supervised segmentation methods. This paper’s main contributions are as follows:

1) The proposed TCUS framework for semi-supervised lesion segmentation in OCT images can be applied to various diseases, effectively reducing the reliance on pixel-wise annotations. The experiment results on two OCT datasets show that TCUS is more effective than other semi-supervised segmentation methods.
2) A uncertainty-guided transformation-consistent strategy is proposed for mitigating performance degradation caused by potential errors in the prediction results of the teacher network.
3) Self-deep supervision is designed to capture multi-scale information in OCT images, enabling the framework to learn about lesions of various sizes and shapes and generate more accurate segmentation predictions.

2. Method

In this section, we present our proposed TCUS framework for OCT lesion segmentation, as illustrated in Fig. 1. The framework consists of a student network and a teacher network. The student network is trained with loss functions, while the teacher network weights are updated through exponential moving average (EMA) of the student network weights. The labeled data is fed into the student network and the unlabeled data is fed into both the student and teacher network. To enhance the effect of unsupervised regularization, we employ transformation-consistent. To mitigate the impact of unreliable predictions on the student network optimization and improve the reliability of the teacher network’s results, we estimate the uncertainty of the teacher network predictions. Moreover, we design self-deep supervision to capture the multi-scale information of the OCT images. In self-deep supervision, robust pseudo-labels are generated by the reliable teacher network, which combines all models from previous iterations, to improve the accuracy of the segmentation predictions.

Fig. 1. Overview of the proposed semi-supervised framework TCUS. $x_i$ represents the original image, and $y_i$ represents the pixel-wise annotation of the labeled image. $\psi _i$ refers to the transformation-consistent regularization, including rotation, flipping, and scaling operations. TCUS leverages the mean-teacher framework, where the student network is trained with a total loss that is a weighted combination of $\mathcal {L}_{sdp}$ (loss on labeled data) and $\mathcal {L}_{rdp}$ (loss on unlabeled data).

Download Full Size | PDF

2.1 Transformation-consistent semi-supervised framework

In the semi-supervised segmentation task, we consider a framework with a training set of $M + N$ input OCT images containing $M$ labeled OCT images and $N$ unlabeled OCT images. The labeled and unlabeled sets are defined as $\mathcal {D}_{L}=\left \{ x_i,y_i\right \}_{i=1}^{M}$, and $\mathcal {D}_{U}=\left \{ x_i\right \}_{i=1}^{N}$, respectively, where $x_i\in \mathbb {R}^{H \times W \times 3}$ represents the input OCT image and $y_i\in \left \{ 0,1\right \}^{H \times W}$ represents the corresponding pixel-wise annotation. Our objective is to minimize the following loss function:

(1)$$\sum_{i=1}^{M}\mathcal{L}_s\left ( f\left ( x_i;\theta \right ),y_i \right ) + \omega \sum_{i=1}^{N}\mathcal{L}_r\left ( f\left ( x_i;\theta,\pi \right ), f\left ( x_i;{\theta}',\pi' \right ) \right )$$

where $\mathcal {L}_s$ represents the supervised loss (e.g., cross-entropy) and $\mathcal {L}_r$ represents the regularization loss, which measures the consistency of the predictions of the student and teacher network for inputs $x_i$ with different perturbations; $f\left ( \cdot \right )$ denotes the segmentation network; $\left ( \theta,\pi \right )$ and $\left ( {\theta }',\pi ' \right )$ represent the student and teacher network weights and the different perturbations applied to the inputs $x_i$ (e.g. dropout); $\omega$ is a weighting factor to balance the supervised loss and the regularization loss. During training, the weights of the teacher network $\theta '$ are updated using the EMA of the student network weights, as $\theta '_t=\alpha \theta '_{t-1}+\left ( 1-\alpha \right )\theta _t$, where $t$ is the current training step and $\alpha$ is a smoothing factor. According to the empirical evidence in [10], setting $\alpha = 0$ makes the model as a variation of the $\Pi$ model. We obtain the best performance when setting $\alpha = 0.999$. Hence, we follow this empirical experience and set $\alpha$ to 0.999 in our experiments.

The transformation-consistent [14] is introduced to enhance the regularization effect and make better use of the unlabeled data. TCUS involves three transformation operations, as shown in Fig. 1. For a given input $x_i$, the student network and teacher network are evaluated twice, resulting in two outputs: $o_i$ and $o'_i$, respectively. In the first evaluation, the transformation operation $\psi _i$ is applied to the prediction results of the student network. In the second evaluation, the transformation operation $\psi _i$ is applied to the input image of the teacher network. Additionally, to fully utilize the labeled data, the same transformation operation $\psi _i$ is applied to $y_i$. The overall loss of the framework is a combination of the supervised and unsupervised losses, defined as $\mathcal {L} = \mathcal {L}_s + \omega \mathcal {L}_r$, where $\mathcal {L}_s$ is the supervised loss, and $\mathcal {L}_r$ is the unsupervised regularization loss. The supervised loss $\mathcal {L}_s$ is computed using the labeled data and is typically a cross-entropy loss:

(2)$$\mathcal{L}_s=\mathcal{L}_{ce}\left ( o_i,\psi_i\left ( y_i \right ) \right )$$

where $o_i=\psi _i\left ( f_\theta \left ( x_i \right ) \right )$ is the output of the student network. The unsupervised regularization loss $\mathcal {L}_r$ aims to promote consistent predictions under different transformations. It is computed as the mean squared error between the outputs $o_i$ and $o'_i$:

(3)$$\mathcal{L}_r=\mathcal{L}_{mse}\left ( o_i, o'_i \right )$$

where $o'_i=f_{\theta '}\left ( \psi _i\left ( x_i \right ) \right )$ represents the output of the teacher network, respectively. By incorporating transformation-consistent into the training process, the TCUS framework effectively leverages the unlabeled data to improve its performance and achieve better generalization.

2.2 Uncertainty-guided transformation-consistent

One limitation of the framework discussed in Section 2.1 is that the input $x_i \in \mathcal {D}_U$ to the teacher network is unlabeled, which may result in predictions containing both correct and unreliable pixels, such as lesion boundaries in OCT images. These unreliable predictions can negatively impact the optimization of the student network. To address this issue, we draw inspiration from uncertainty theory [25–27] and estimate the epistemic uncertainty of the teacher network, which allows the framework to guide the student network to learn from the predictions of a more reliable teacher network.

The epistemic uncertainty is estimated by passing the input image through the teacher network $K$ times, with random Gaussian noise added to the input image and dropout applied during evaluation. This process generates different outputs for the same input, which can be used to approximate the uncertainty. The uncertainty $U_i$ for a given input $x_i$ is formulated as follows:

(4)$$U_i\approx \frac{1}{K}\sum_{k=1}^{K}\left ( u'_{i,k} \right )^2-\left ( \frac{1}{K}\sum_{k=1}^{K}u'_{i,k} \right )^2$$

where $u'_{i,k}$ represents the output of the teacher network for the $k$-th perturbed input. Note that the uncertainty is estimated at the pixel level and the resulting uncertainty map $U_i \in \mathbb {R}^{H \times W}$ is normalized to [0, 1] before further operations. In our experiments, we set $K$ to 10, which is further detailed in Section 4.

$\mathcal {L}_r$ is masked mean squared error loss and rewritten as Eq. (5):

(5)$$\mathcal{L}_r=\frac{\sum_{v=1}^{H \times W}\delta \left ( U_{i,v}<th \right )\left ( o_{i,v}-o'_{i,v} \right )^2}{\sum_{v=1}^{H \times W}\delta \left ( U_{i,v}<th \right )}$$

where $\delta$ represents the indicator function; $o_{i,v}$ and $o'_{i,v}$ are the predictions of the student and teacher networks at the $v$-th pixel; $H \times W$ is the number of pixels; $th$ is a threshold that restricts the network to select the most deterministic pixel for optimization. In all experiments, we set $th$ to 0.15, as further discussed in Section 4. The uncertainty-guided transformation-consistent strategy effectively guides the student network to learn from the reliable predictions of the teacher network, which reduces the influence of unreliable predictions and improves the accuracy of the segmentation results.

2.3 Self-deep supervision

Inspired by the role of pyramid networks [28] and deep supervision [29] in fully supervised learning and its application in semi-supervised learning [30], we design self-deep supervision with multi-scale prediction to improve the performance of the framework. The decoder of student network is modified by adding a segmentation head that produces segmentation results at different scales (see Fig. 1), allowing the framework to learn about lesion areas of different sizes and shapes. The labeled data are supervised using pixel-wise annotations. Since there are no annotations available for the unlabeled data and upsampling processing may lead to inaccurate predictions, encouraging consistency between the student and teacher network predictions at each pixel directly may be affected by outliers and lead to a performance drop [12,19,27]. To alleviate this issue, the framework infers robust pseudo-labels by a reliable teacher network, and minimizes the difference between pseudo-labels and segmentation results at different scales using mean squared error loss.

More specifically, for an input image $x_i$, the student network $f_\theta \left ( x_i \right )$ generates a set of segmentation results $o_i^s$, where $o_i^s$ represents the prediction result from the $s$-th scale. Note that smaller $s$ means higher resolution and we use $S$ to represent the total number of scales. $o_i^s$ is upsampled to match the size of the input image and then applied the same transformation operation $\psi _i$, which is denoted as $p_i^s$. For the labeled data, the loss function is defined as Eq. (6):

(6)$$\mathcal{L}_{sdp}=\frac{1}{S}\sum_{s=1}^{S}\mathcal{L}_{CE}\left ( p_i^s,\psi_i\left ( y_i \right ) \right )$$

The unlabeled data utilize the predictions of the teacher network as pseudo-labels to minimize the differences between $p_{i,v}^s$ and $o'_{i,v}$ by mean squared error loss. Directly employing the predictions of the teacher network as pseudo-labels may be problematic because it may contain pixels with unreliable predictions. To solve this problem, we incorporate self-deep supervision with uncertainty estimation:

(7)$$\mathcal{L}_{rdp}=\frac{\sum_{s=1}^{S}\sum_{v=1}^{H \times W}\delta \left ( U_{i,v}<th \right )\left ( p_{i,v}^s-o'_{i,v} \right )^2}{S \times \sum_{v=1}^{H \times W}\delta \left ( U_{i,v}<th \right )}$$

It is worth noting that when $S=1$, TCUS is not able to leverage the multi-scale information of the image. In our experiments, we set $S = 3$, as discussed in more detail in Section 4.

2.4 Loss function

For the OCT lesion segmentation task, the overall loss function of the proposed TCUS is defined as follows:

(8)$$\mathcal{L}_{total}=\mathcal{L}_{sdp} + \omega \left ( t \right )\mathcal{L}_{rdp}$$

where $\omega \left ( t \right )=w_{max}e^{-5\left ( 1-\frac {t}{t_{max}} \right )^{2}}$ is a Gaussian ramp-up curve weighting function that determines the contribution of $\mathcal {L}_{rdp}$ to the total loss, $t$ is the current training step, $t_{max}$ is the maximum training step, and $w_{max}$ controls the maximum value of the weighting function. We set $w_{max}$ as 1.0. Initially, the total loss is mainly influenced by $\mathcal {L}_{sdp}$, and as $t$ increases, the contribution of $\mathcal {L}_{sdp}$ gradually increases in $\mathcal {L}_{total}$.

3. Experiment

3.1 Experimental setup

3.1.1 Dataset

In this paper, our proposed TCUS is evaluated on two OCT datasets: the Kermany [24] dataset and the private OCT dataset.

The Kermany dataset is an OCT disease classification dataset with four OCT image types: CNV, DME, DRUSEN, and normal. We randomly select 600 DME and 600 DRUSEN images, and an experienced ophthalmologist labeled the lesions of each image. The training set contains 1000 images, and the test set consists of 200 images.

The private OCT dataset contains a total of 760 images of 34 diseased eyes from Changsha Aier Eye Hospital. It includes CNV and CSC, consisting of 458 and 302 images, respectively. All images have a resolution of 1020$\times$960. An experienced ophthalmologist annotates the ground truth of each image. We split the dataset into a training set of 650 images and a test set of 110 images for the purpose of training and evaluation.

3.1.2 Implementation details

We use the 2-D DenseUNet [31] as both the student and teacher network. The TCUS framework is implemented in PyTorch and trained on an NVIDIA GTX 1080Ti GPU. We employ the Adam optimizer with an initial learning rate of 1e-4 and a batch size of 12 during training. The Kermany dataset is trained for 300 epochs, while the private OCT dataset is trained for 500 epochs. Moreover, we follow the scaling, cropping, and flipping operations used in [14] for transformation operations $\psi _i$. To balance the segmentation performance and computational cost, all images are resized to 248$\times$248.

3.1.3 Evaluation metrics

For the proposed TCUS framework, dice coefficient (DI), pixel accuracy (ACC), sensitivity (SE), and specificity (SP) are used to measure the segmentation performance. These evaluation metrics are shown in Eq. (9).

(9)$$\begin{aligned}DI&=\frac{2 \times TP}{FP+FN+TP+TN} \\ ACC&=\frac{TP+TN}{FP+FN+TP+TN} \\ SE&=\frac{TP}{TP+FN},SP= \frac{TN}{TN+FP}\end{aligned}$$

where TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives, respectively.

3.2 Performance evaluation and analysis

3.2.1 Experiments on the Kermany dataset

Ablation study. We conduct ablation experiments on the Kermany dataset using 10${\%}$ labeled data to validate the effectiveness of our TCUS framework. Table 1 shows the results, and we observe that: (1) TCUS significantly improves the DI by 3.06${\%}$ compared to the supervised baseline, demonstrating that our method can fully utilize the unlabeled data. (2) The introduction of uncertainty is beneficial for the OCT lesion segmentation task, resulting in a 0.5${\%}$ improvement in DI. Uncertainty ensures the relative reliability of the pseudo-labels obtained from the teacher network. (3) Self-deep supervision is crucial in our proposed framework. Capturing multi-scale information effectively improves segmentation performance.

Table 1. Ablation study on the Kermany dataset with 10${\%}$ labeled data. "TC" denotes Transformation-consistent, "UG" denotes Uncertainty-guided, and "SDP" denotes Self-deep supervision, respectively.

View Table | View all tables in this article

To further illustrate the role of uncertainty and self-deep supervision in the TCUS, we also visualize the segmentation results of the ablation study. Figure 2 shows the visualization results, and two cases are given. When trained with only 100 labeled images, the supervised baseline obtained poor segmentation results with many false positives (yellow regions in Fig. 2). When using unlabeled data as additional training data, the framework incorporating transformation-consistent gets more precise segmentation results than the supervised baseline. By introducing uncertainty into the framework, the isolated FP regions can be effectively removed (orange arrow in Fig. 2). Furthermore, with the integration of self-deep supervision, TCUS obtains more accurate prediction results for lesions of different sizes and shapes (green box in Fig. 2).

Fig. 2. Visualization results of ablation studies on the Kermany dataset with 10${\%}$ labeled data. Red regions represent lesions that are segmented correctly, yellow regions represent lesions that are incorrectly, and blue regions represent those that are not segmented, respectively.

Download Full Size | PDF

Results with different amounts of labeled data. Table 2 presents the effect of the TCUS framework on the performance of OCT lesion segmentation with different amounts of labeled data. The results show that regardless of the ratio of labeled data in the training set, TCUS consistently outperforms the supervised baseline (which is trained with labeled data only), indicating that TCUS can effectively leverage the unlabeled data. As the number of labeled data increases, both the supervised baseline and TCUS achieve higher segmentation performance. Notably, when trained with only 5${\%}$ of labeled data (50 labeled images), TCUS achieves the largest performance improvement over the supervised baseline (DI = 3.96${\%}$). However, the performance improvement of TCUS becomes limited as the number of labeled data increases.

Comparison with other semi-segmentation methods. We compare TCUS with some classical and state-of-the-art semi-supervised segmentation methods, including Mean Teacher (MT) [10], Uncertainty Aware Mean Teacher (UAMT) [12], TCSM_v2 [14], Cross Pseudo Supervision (CPS) [32], and Uncertainty Rectified Pyramid Consistency (URPC) [30]. For a fair comparison, 2-D DenseUNet is used for the backbone of all aforementioned methods. Table 3 shows the quantitative results of our TCUS with these semi-supervised segmentation methods. When using 10${\%}$ labeled data for training, all semi-supervised methods improve in DI compared with the supervised baseline. Among the existing methods, TCSM_v2 demonstrates the best segmentation performance. However, TCUS outperforms TCSM_v2 by achieving the best segmentation performance with a 1.77${\%}$ improvement in DI, highlighting the ability of TCUS to effectively utilize information from unlabeled data. Figure 3 shows the visualization results of the supervised baseline and semi-supervised methods using 10${\%}$ labeled data. Note that TCUS outperforms other semi-supervised methods in segmenting lesions in OCT images. Specifically, TCUS achieves higher segmentation accuracy, particularly when the lesion area is blurred (line 2 in Fig. 3). Furthermore, when using 20${\%}$ labeled data for training, TCUS also outperformed the existing semi-supervised segmentation methods, as shown in Table 3.

Fig. 3. Lesion segmentation results of different methods on the Kermany dataset. The first and second lines are DME, and the third and fourth lines are DRUSEN. Red regions represent lesions that are segmented correctly, yellow regions represent lesions that are segmented incorrectly, and blue regions represent lesions that are not segmented.

Download Full Size | PDF

Table 2. Segmentation performance of Kermany dataset trained with different amounts of labeled data.

View Table | View all tables in this article

Table 3. Performance comparison on the Kermany dataset. "10${\%}$" and "20${\%}$" denote training with "10${\%}$" and "20${\%}$" labeled data, respectively.

View Table | View all tables in this article

3.2.2 Experiments on the private OCT dataset

We also conduct experiments on the private OCT dataset. Table 4 shows the experimental results. TCUS achieves state-of-the-art performance compared to other semi-supervised segmentation methods with only 10${\%}$ and 20${\%}$ labeled data. Specifically, when trained with 10${\%}$ and 20${\%}$ labeled data (65 and 130 labeled images, respectively), TCUS outperforms the supervised baseline by 6.59${\%}$ and 5.0${\%}$ in DI. This demonstrates the robustness of TCUS in significantly improving the performance of the supervised baseline for different diseases. Additionally, Table 4 shows that TCUS effectively utilizes a large amount of information from unlabeled data to achieve the greatest improvement in segmentation performance when only 10${\%}$ labeled data is available for training.

Table 4. Performance comparison on the private OCT dataset. "10${\%}$" and "20${\%}$" denote training with "10${\%}$" and "20${\%}$" labeled data, respectively.

View Table | View all tables in this article

4. Discussion

We also conduct a discussion on three parameters involved in TCUS to validate its robustness across different parameter settings.

Table 5 shows the effects of using different values of $K$ (Eq. (4)) on segmentation performance. We conduct experiments on the Kermany dataset with 10${\%}$ labeled data. It can be observed that regardless of the value of $K$, TCUS outperforms the supervised baseline in terms of segmentation performance. Additionally, TCUS demonstrates relatively roubles performance with minimal fluctuations across different $K$ values. However, it is worth noting that as K gradually increases, the performance of TCUS also improves. Theoretically, as the value of $K$ increases, the estimated uncertainty becomes closer to the actual value. It aligns with our experimental results. However, when using larger values of $K$, training time is also longer [27]. Thus, we set $K$ to 10 in all experiments.

Table 5. Testing DI under different $K$ values

View Table | View all tables in this article

Table 6 shows the influence of using different $th$ values (Eq. (5)) on the segmentation performance. It can be seen that the best segmentation performance is achieved when $th$ is set to 0.15. When $th$ is too small ($th=0.05$), TCUS selects too few pixels for training, resulting in insufficient training. When $th$ is too large ($th=0.25$), TCUS selects pixels with high uncertainty for training, which adversely affects the optimization of the parameters of the student network. Both of these cases lead to a decrease in lesion segmentation performance. Therefore, it is crucial to select an appropriate value of $th$ for training. In all experiments, we set $th$ to 0.15.

Table 6. Testing DI under different $th$ values

View Table | View all tables in this article

Table 7 illustrates the impact of different values of $S$. It can be observed that increasing $S$ from 1 to 3 gradually improves the lesion segmentation performance. However, when $S$ is set to 4, the segmentation performance is decrease compared to $S=3$. It can be attributed to the lower resolution of the obtained $p_i^4$, which leads to the loss of boundary details. Therefore, in the experiments, we set $S$ to 3.

Table 7. Testing DI under different $S$ values

View Table | View all tables in this article

In summary, TCUS offers several advantages in the field of semi-supervised OCT lesion segmentation. Firstly, it effectively segments OCT image lesions even with limited labeled data. Secondly, it improves segmentation performance by leveraging unlabeled data. Thirdly, it prevents negative impacts on training performance through the implementation of an uncertainty-guided transformation-consistent strategy. Lastly, TCUS provides self-supervision, enabling the learning of lesion information with varying shapes and sizes.

However, it is important to acknowledge certain limitations of TCUS. Firstly, TCUS relies on a set of random transformation operations applied to input images, which may lead to suboptimal transformations and performance. Additionally, uncertainty estimation necessitates multiple forwards passes, resulting in longer training times. Moreover, TCUS has not been tested for multi-class segmentation as binary masks were the only available data. Lastly, TCUS is designed based on the assumption that the data originates from a single domain with a shared data distribution, potentially impacting performance when incorporating data outside the distribution range.

5. Conclusion

In this paper, we introduce TCUS, a semi-supervised segmentation framework designed specifically for OCT lesion segmentation. We propose the uncertainty-guided transformation-consistent strategy in TCUS to address the issue of unreliable predictions made by the teacher network on unlabeled images. By evaluating uncertainty in the teacher network and enforcing transformation-consistent, TCUS enhances the efficacy of unsupervised regularization and guides the student network to learn from meaningful and reliable targets, thereby improving lesion segmentation performance. Additionally, TCUS incorporates self-deep supervision, which involves generating multi-scale predictions on OCT images to capture the characteristics of lesion regions with diverse shapes and sizes, resulting in more accurate segmentation outcomes.

Through extensive evaluation on two OCT datasets, TCUS demonstrates a significant performance improvement over the supervised baseline and outperforms other existing state-of-the-art semi-supervised segmentation methods.

Funding

National Natural Science Foundation of China (61972419); Natural Science Foundation of Hunan Province (2020JJ4120, 2021JJ30865).

Acknowledgments

This work was supported in part by the High Performance Computing Center of Central South University.

Disclosures

The authors declare no conflicts of interest.

Data Availability

Kermany datasets underlying the results presented in this paper are available at [24]. The private dataset underlying the results presented in this paper is not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. J. Chiu, M. J. Allingham, P. S. Mettu, S. W. Cousins, J. A. Izatt, and S. Farsiu, “Kernel regression based segmentation of optical coherence tomography images with diabetic macular edema,” Biomed. Opt. Express 6(4), 1172–1194 (2015). [CrossRef]

2. M. Wu, W. Fan, Q. Chen, Z. Du, X. Li, S. Yuan, and H. Park, “Three-dimensional continuous max flow optimization-based serous retinal detachment segmentation in SD-OCT for central serous chorioretinopathy,” Biomed. Opt. Express 8(9), 4257–4274 (2017). [CrossRef]

3. J. Wang, M. Zhang, A. D. Pechauer, L. Liu, T. S. Hwang, D. J. Wilson, D. Li, and Y. Jia, “Automated volumetric segmentation of retinal fluid on optical coherence tomography,” Biomed. Opt. Express 7(4), 1577–1589 (2016). [CrossRef]

4. J. Hu, Y. Chen, and Z. Yi, “Automated segmentation of macular edema in OCT using deep neural networks,” Med. Image Anal. 55, 216–227 (2019). [CrossRef]

5. B. Hassan, S. Qin, R. Ahmed, T. Hassan, A. H. Taguri, S. Hashmi, and N. Werghi, “Deep learning based joint segmentation and characterization of multi-class retinal fluid lesions on OCT scans for clinical use in anti-vegf therapy,” Comput. Biol. Med. 136, 104727 (2021). [CrossRef]

6. X. Liu, S. Wang, Y. Zhang, D. Liu, and W. Hu, “Automatic fluid segmentation in retinal optical coherence tomography images using attention based deep learning,” Neurocomputing 452, 576–591 (2021). [CrossRef]

7. G. Xing, L. Chen, H. Wang, J. Zhang, D. Sun, F. Xu, J. Lei, and X. Xu, “Multi-scale pathological fluid segmentation in oct with a novel curvature loss in convolutional neural network,” IEEE Trans. Med. Imaging 41(6), 1547–1559 (2022). [CrossRef]

8. M. Wang, K. Yu, X. Xu, Y. Zhou, Y. Peng, Y. Xu, R. S. M. Goh, Y. Liu, and H. Fu, “Tiny-lesion segmentation in oct via multi-scale wavelet enhanced transformer,” in Ophthalmic Medical Image Analysis: 9th International Workshop, OMIA 2022, Held in Conjunction with MICCAI 2022, Singapore, Singapore, September 22, 2022, Proceedings, (Springer, 2022), pp. 125–134.

9. E. Parra-Mora and L. A. da Silva Cruz, “LOCTseg: a lightweight fully convolutional network for end-to-end optical coherence tomography segmentation,” Comput. Biol. Med. 150, 106174 (2022). [CrossRef]

10. A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in NeurIPS, (2017), pp. 1195–1204.

11. W. Cui, Y. Liu, Y. Li, M. Guo, Y. Li, X. Li, T. Wang, X. Zeng, and C. Ye, “Semi-supervised brain lesion segmentation with an adapted mean teacher model,” in IPMI, (2019), pp. 554–565.

12. L. Yu, S. Wang, X. Li, C. W. Fu, and P. A. Heng, “Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation,” in MICCAI, (2019), pp. 605–613.

13. G. Wang, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren, and S. Zhang, “Uncertainty-guided efficient interactive refinement of fetal brain segmentation from stacks of MRI slices,” in MICCAI, (2020), pp. 279–288.

14. X. Li, L. Yu, H. Chen, C. W. Fu, L. Xing, and P. A. Heng, “Transformation-consistent self-ensembling model for semisupervised medical image segmentation,” IEEE Trans. Neural Netw. Learning Syst. 32(2), 523–534 (2021). [CrossRef]

15. X. Luo, M. Hu, T. Song, G. Wang, and S. Zhang, “Semi-supervised medical image segmentation via cross teaching between CNN and transformer,” in Medical Imaging with Deep Learning, (2021).

16. P. Wang, J. Peng, M. Pedersoli, Y. Zhou, C. Zhang, and C. Desrosiers, “Self-paced and self-consistent co-training for semi-supervised image segmentation,” Med. Image Anal. 73, 102146 (2021). [CrossRef]

17. S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille, “Deep co-training for semi-supervised image recognition,” in ECCV, (2018), pp. 135–152.

18. S. Li, C. Zhang, and X. He, “Shape-aware semi-supervised 3d semantic segmentation for medical images,” in MICCAI, (2020), pp. 552–561.

19. Y. Zhang, L. Yang, J. Chen, M. Fredericksen, D. P. Hughes, and D. Z. Chen, “Deep adversarial networks for biomedical image segmentation utilizing unannotated images,” in MICCAI, (Springer, 2017), pp. 408–416.

20. S. Chen, G. Bortsova, A. García-Uceda Juárez, G. Van Tulder, and M. De Bruijne, “Multi-task attention-based semi-supervised learning for medical image segmentation,” in MICCAI, (Springer, 2019), pp. 457–465.

21. H. Kervadec, J. Dolz, É. Granger, and I. Ben Ayed, “Curriculum semi-supervised segmentation,” in MICCAI, (Springer, 2019), pp. 568–576.

22. X. Luo, J. Chen, T. Song, and G. Wang, “Semi-supervised medical image segmentation through dual-task consistency,” in AAAI, vol. 35 (2021), pp. 8801–8809.

23. M. Wang, W. Zhu, F. Shi, J. Su, H. Chen, K. Yu, Y. Zhou, Y. Peng, Z. Chen, and X. Chen, “Mstganet: Automatic drusen segmentation from retinal oct images,” IEEE Trans. Med. Imaging 41(2), 394–406 (2022). [CrossRef]

24. D. S. Kermany, M. Goldbaum, W. Cai, et al., “Identifying medical diagnoses and treatable diseases by image-based deep learning,” Cell 172(5), 1122–1131 (2018). [CrossRef]

25. Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in ICML, (2016), pp. 1050–1059.

26. A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” in NeurIPS, (2017), pp. 5580–5590.

27. X. Cao, H. Chen, Y. Li, Y. Peng, S. Wang, and L. Cheng, “Uncertainty aware temporal-ensembling model for semi-supervised abus mass segmentation,” IEEE Trans. Med. Imaging 40(1), 431–443 (2021). [CrossRef]

28. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in CVPR, (2017), pp. 2117–2125.

29. C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, (PMLR, 2015), pp. 562–570.

30. X. Luo, G. Wang, W. Liao, J. Chen, T. Song, Y. Chen, S. Zhang, D. N. Metaxas, and S. Zhang, “Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency,” Med. Image Anal. 80, 102517 (2022). [CrossRef]

31. X. Li, H. Chen, X. Qi, Q. Dou, C. W. Fu, and P. A. Heng, “H-denseunet: hybrid densely connected unet for liver and tumor segmentation from CT volumes,” IEEE Trans. Med. Imaging 37(12), 2663–2674 (2018). [CrossRef]

32. X. Chen, Y. Yuan, G. Zeng, and J. Wang, “Semi-supervised semantic segmentation with cross pseudo supervision,” in CVPR, (2021), pp. 2613–2622.

Setting	ACC( $%$ )	DI( $%$ )	SE( $%$ )	SP( $%$ )
supervised [31]	$98.61 \pm 0.12$	$67.72 \pm 1.37$	$65.95 \pm 2.25$	$99.57 \pm 0.04$
+TC	$98.14 \pm 0.29$	$69.01 \pm 0.42$	$71.83 \pm 1.24$	$98.90 \pm 0.23$
+TC+UG	$98.65 \pm 0.47$	$69.51 \pm 0.26$	$72.35 \pm 0.98$	$99.35 \pm 0.06$
+TC+UG+SDP (TCUS)	$98.63 \pm 0.05$	$70.78 \pm 0.44$	$75.75 \pm 0.76$	$99.27 \pm 0.09$

Ratio	Metric	supervised	TCUS
5 $%$	DI	$64.82 \pm 0.99$	$68.76 \pm 0.58$
	ACC	$97.56 \pm 0.11$	$97.87 \pm 0.08$
	SE	$71.05 \pm 1.99$	$73.30 \pm 1.08$
	SP	$98.27 \pm 0.12$	$98.62 \pm 0.04$
10 $%$	DI	$67.72 \pm 1.37$	$70.78 \pm 0.44$
	ACC	$98.61 \pm 0.12$	$98.63 \pm 0.05$
	SE	$65.95 \pm 2.25$	$75.75 \pm 0.76$
	SP	$99.57 \pm 0.04$	$99.27 \pm 0.09$
20 $%$	DI	$68.94 \pm 0.53$	$72.23 \pm 0.39$
	ACC	$98.57 \pm 0.06$	$98.67 \pm 0.02$
	SE	$73.97 \pm 2.02$	$75.99 \pm 0.32$
	SP	$99.36 \pm 0.08$	$99.28 \pm 0.11$

Ratio	Method	DI( $%$ )	ACC( $%$ )	SE( $%$ )	SP( $%$ )
10 $%$	supervised [31]	$67.72 \pm 1.37$	$98.61 \pm 0.12$	$65.95 \pm 2.25$	$99.57 \pm 0.04$
	MT [10]	$67.89 \pm 0.95$	$98.66 \pm 0.13$	$68.48 \pm 1.32$	$99.51 \pm 0.10$
	UAMT [12]	$68.63 \pm 0.87$	$98.03 \pm 0.34$	$75.82 \pm 1.99$	$98.68 \pm 0.11$
	TCSM_v2 [14]	$69.01 \pm 0.42$	$98.14 \pm 0.29$	$71.83 \pm 1.24$	$98.90 \pm 0.23$
	CPS [32]	$68.74 \pm 1.14$	$98.15 \pm 0.21$	$66.90 \pm 2.23$	$99.03 \pm 0.27$
	URPC [30]	$68.08 \pm 0.67$	$98.57 \pm 0.17$	$69.64 \pm 0.98$	$99.37 \pm 0.17$
	TCUS (ours)	$70.78 \pm 0.44$	$98.63 \pm 0.05$	$75.75 \pm 0.76$	$99.27 \pm 0.09$
20 $%$	supervised [31]	$68.94 \pm 0.53$	$98.57 \pm 0.06$	$73.97 \pm 2.02$	$99.36 \pm 0.08$
	MT [10]	$69.19 \pm 0.45$	$98.63 \pm 0.02$	$71.37 \pm 1.33$	$99.43 \pm 0.13$
	UAMT [12]	$70.94 \pm 0.37$	$98.67 \pm 0.01$	$72.35 \pm 1.47$	$99.47 \pm 0.08$
	TCSM_v2 [14]	$71.03 \pm 0.46$	$98.69 \pm 0.03$	$71.32 \pm 0.76$	$99.42 \pm 0.12$
	CPS [32]	$70.68 \pm 0.61$	$98.64 \pm 0.04$	$72.98 \pm 0.99$	$99.38 \pm 0.12$
	URPC [30]	$70.93 \pm 0.52$	$98.66 \pm 0.01$	$73.60 \pm 1.96$	$99.32 \pm 0.04$
	TCUS (ours)	$72.23 \pm 0.39$	$98.67 \pm 0.02$	$75.99 \pm 0.32$	$99.28 \pm 0.11$

Ratio	Method	DI( $%$ )	ACC( $%$ )	SE( $%$ )	SP( $%$ )
10 $%$	supervised [31]	$54.22 \pm 1.37$	$94.99 \pm 1.11$	$54.59 \pm 4.24$	$95.73 \pm 0.79$
	MT [10]	$57.25 \pm 2.08$	$98.56 \pm 0.98$	$55.43 \pm 5.32$	$99.21 \pm 0.09$
	UAMT [12]	$57.12 \pm 1.76$	$96.93 \pm 1.24$	$53.52 \pm 5.91$	$97.62 \pm 0.58$
	TCSM_v2 [14]	$59.43 \pm 1.23$	$97.13 \pm 2.09$	$62.66 \pm 5.54$	$97.61 \pm 0.24$
	CPS [32]	$56.29 \pm 2.09$	$97.01 \pm 1.94$	$53.93 \pm 6.09$	$97.72 \pm 0.38$
	TCUS (ours)	$60.81 \pm 1.15$	$96.62 \pm 1.97$	$67.24 \pm 4.23$	$96.96 \pm 1.14$
20 $%$	supervised [31]	$70.74 \pm 1.33$	$95.64 \pm 2.47$	$67.47 \pm 3.39$	$96.09 \pm 1.25$
	MT [10]	$71.22 \pm 1.15$	$96.54 \pm 2.32$	$68.54 \pm 3.01$	$97.03 \pm 1.24$
	UAMT [12]	$71.98 \pm 2.01$	$99.13 \pm 0.43$	$70.58 \pm 2.34$	$99.82 \pm 0.02$
	TCSM_v2 [14]	$71.81 \pm 1.87$	$98.29 \pm 0.98$	$71.19 \pm 1.98$	$98.79 \pm 0.98$
	CPS [32]	$74.25 \pm 0.99$	$97.55 \pm 1.34$	$69.70 \pm 3.45$	$97.97 \pm 0.13$
	TCUS (ours)	$75.74 \pm 0.92$	$99.31 \pm 0.46$	$77.53 \pm 1.01$	$99.67 \pm 0.04$

Setting	ACC( $%$ )	DI( $%$ )	SE( $%$ )	SP( $%$ )
supervised [31]	$98.61 \pm 0.12$	$67.72 \pm 1.37$	$65.95 \pm 2.25$	$99.57 \pm 0.04$
+TC	$98.14 \pm 0.29$	$69.01 \pm 0.42$	$71.83 \pm 1.24$	$98.90 \pm 0.23$
+TC+UG	$98.65 \pm 0.47$	$69.51 \pm 0.26$	$72.35 \pm 0.98$	$99.35 \pm 0.06$
+TC+UG+SDP (TCUS)	$98.63 \pm 0.05$	$70.78 \pm 0.44$	$75.75 \pm 0.76$	$99.27 \pm 0.09$

Semi-supervised OCT lesion segmentation via transformation-consistent with uncertainty and self-deep supervision

Abstract

1. Introduction

2. Method

2.1 Transformation-consistent semi-supervised framework

2.2 Uncertainty-guided transformation-consistent

2.3 Self-deep supervision

2.4 Loss function

3. Experiment

3.1 Experimental setup

3.1.1 Dataset

3.1.2 Implementation details

3.1.3 Evaluation metrics

3.2 Performance evaluation and analysis

3.2.1 Experiments on the Kermany dataset

3.2.2 Experiments on the private OCT dataset

4. Discussion

5. Conclusion

Funding

Acknowledgments

Disclosures

Data Availability

References

Data Availability

Cited By

Figures (3)

Tables (7)

Equations (9)

Biomedical Optics Express