Self-attention CNN for retinal layer segmentation in OCT

Guogang Cao; Yan Wu; Zeyu Peng; Zhilin Zhou; Cuixia Dai

doi:10.1364/BOE.510464

1. Introduction

The structure of the retinal layer provides crucial diagnostic information for ophthalmologists. In most typical retinal diseases, such as age-related macular degeneration (AMD) [1], the central macula, which is originally smooth and slightly concave, is elevated, and the normal retinal layer structure is disrupted. In addition, retinal diseases, such as detachment of the retinal pigment epithelium, retinal pigmentary changes, and diabetic macular edema, can cause degrees of deformation in the retinal layers [2]. Ophthalmologists diagnose the condition of patients by assessing the deformations in the retinal layers.

Optical Coherence Tomography (OCT) is a non-invasive imaging modality. It utilizes low-coherence light to analyze the internal structure of biological tissues, acquiring high-resolution cross-sectional images of tissues with sufficient depth of penetration [3–5]. Now, OCT is widely used to observe the layered structure of the retina and the pathological fluids within it. By observing the biological properties of the retinal layers, such as layer thickness, analyzing the morphology of each layer, and comparing this layer information with normal layers, it is possible to diagnose retinal diseases such as diabetic macular edema and AMD. Due to the manual annotation of retinal layer boundaries relying on the subjective judgment of annotators and being time-consuming and labor-intensive, the issue of automatic segmentation of OCT retinal images is gaining attention from researchers.

In recent years, the role of deep learning in image segmentation has become increasingly important, leading to the emergence of many retinal layer segmentation methods based on convolutional neural networks (CNNs). CNNs have dominated various biomedical image segmentation tasks, with U-Net being one of the most used models [6]. U-Net features an encoder-decoder structure and skip connections, enabling it to preserve details. Consequently, many retinal layer segmentation models are built upon the foundation of U-Net. However, due to the inherent limitations of convolutional operations, these models struggle to capture long-range information in tasks involving distant relationships effectively.

Due to the global receptive field of the self-attention mechanism within the Transformer, many studies have begun to integrate Transformers with Convolutional Neural Networks (CNNs) to analyze medical images. For instance, Chen et al. proposed TransUnet, a model based on U-Net, where the encoder part employs a Transformer mechanism. It takes feature maps from the CNN output as input, passes them through the Transformer, and then feeds the encoded tokenized image blocks to the decoder for up-sampling, merging them with the high-resolution feature maps from CNN. This process allows the model to acquire advanced learning information from CNN and global contextual information from the Transformer [7]. Cao et al. introduced Swin-Unet, a pure Transformer U-shaped structure resembling U-Net, which utilizes SwinTransformer blocks for global and local feature learning with identical skip connections between the encoder and decoder [8]. Gao et al. proposed UTNet, featuring a novel self-attention decoder in the model. It extracts locally enhanced features using convolutional layering and captures long-range information through a self-attention mechanism, achieving accurate segmentation while reducing computational complexity [9].

Various segmentation techniques for OCT retinal layers continue to emerge with the development of medical image processing and analysis methods. Classical approaches involve using adaptive mathematical models to analyze the anatomical structure of the retina and leverage clinical prior knowledge to detect typical layered structures [10]. For instance, Monemian et al. utilized a model based on Laplace distribution to calculate the probability of neighboring pixels becoming boundary pixels, completing the segmentation of retinal layers [11]. Sun et al. proposed a level-set method based on the Bayesian theorem, combining anatomical prior information and adaptive details to generate boundary probability maps iteratively, enhancing the sub-pixel accuracy of the boundaries [12]. Chiu et al. introduced a fully automated layering method based on graph theory and dynamic programming, achieving precise segmentation of the boundaries of eight retinal layers [13]. However, due to the inherent limitations of each mathematical model, these methods exhibit shortcomings in terms of robustness and computational complexity.

Most recent methods are based on deep neural networks. For example, Roy et al. introduced a fully convolutional neural network named Relaynet, which employs an encoder-decoder structure and incorporates skip connections with anti-pooling. This model can segment seven retinal layers, fluid, and background [14]. Wang et al. extract boundaries and retinal layers simultaneously through two U-shaped networks and subsequently fuse the two results to enhance the correctness of layer segmentation [15]. He et al. proposed a retinal layering method based on a fully convolutional regression network. This method takes the original B-SCAN map and the normalized spatial position of each pixel as input, outputs the pixel segmentation result of the retina as well as a structured surface, and combines the two results to achieve layering [16]. Kumar et al. presented a multi-layer, multi-scale encoder-decoder architecture. By stacking two different encoder-decoder networks multiple times, they iteratively performed feature extraction and denoising, ultimately obtaining the layer segmentation results of the retina [17].

These newly developed methods have shown significant improvements in both robustness and accuracy. However, they ultimately extract information within a limited receptive field and still face challenges in long-range modeling. Therefore, some researchers have introduced various attention mechanisms into CNNs to expand the receptive field and enhance the network's ability to learn more effectively. For instance, Moradi et al. proposed a semantic segmentation model based on the Residual-Attention-UNET model, designed to segment ten layers of the retina. The model forms a U-shaped structure using residual blocks and incorporates an attention gate in the up-sampling and skip-connection processes. This enhances strong-feature correlation while suppressing weaker-feature correlations, leading to improved segmentation accuracy [18]. Tan et al. introduced a model that combines CNN and a lightweight Transformer. The model processes image inputs through two main frameworks, one based on Transformer and the other on Cross-Convolution, to extract global and local features. They also designed and introduced a boundary regression loss function and feature polarization to improve boundary accuracy and maximize feature distance between different layers, reducing mutual interference during segmentation [19]. Cao et al. proposed an enhanced Transformer-based single-step regression method, incorporating convolution to improve the multi-head self-attention of the Transformer. A-SCAN was used as training data for retinal layering, achieving the segmentation of nine retinal boundaries [20]. These methods combine attention mechanisms with CNNs, resulting in higher precision. In terms of global feature extraction, the self-attention mechanism of the Transformer holds an advantage due to its extensive receptive field. However, this large receptive field is not always necessary at every stage of feature extraction, especially considering the presence of noise in low-level feature maps, which is unfriendly to global feature extraction.

Therefore, we propose a retinal OCT image layer segmentation model based on self-attention mechanisms and CNNs. The model combines an encoder-decoder structure with self-attention. As the network's maximum receptive field occurs at the bottom of the encoder, a transformer block is added to the bottom of the network, receiving deep feature maps from the encoder. Before passing the feature maps to the transformer, a transformation is applied, allowing the transformer to compute attention only in the vertical direction. This approach reduces the model's computational complexity and improves training speed. The main contributions of the proposed method are as follows:

1. A new retinal layer segmentation framework is proposed based on the encoder-decoder structure. In the shallow layers, a convolutional neural network is employed to extract fine and low-level features, while in the deep layers, a self-attention mechanism is used to capture global semantic information. Combining these two components enhanced the model's performance, resulting in better segmentation results than existing methods.
2. A one-dimensional Transformer is added to the bottom of the encoder-decoder structure, calculating self-attention only in the vertical direction of the feature map. This reduces the computational complexity of the model while enhancing its performance. This combination enables the model to exhibit better generalization, even on small datasets, showcasing excellent performance.
3. An attention mechanism has been introduced in the up-sampling and skip connection process, combining the up-sampled feature map with the one from the same encoder level after channel attention. Different weights are assigned to these features through linear and non-linear transformations, which are then applied to the original image, amplifying or suppressing the importance of features. Adding this module can reduce the model's parameters and enhance its performance.

2. Methodology

The retinal layer segmentation involves assigning each pixel on a cross-sectional image of the retina to different classes using a network model, thereby accomplishing the segmentation. Inspired by Attention-UNet [21] and TransUnet, we proposed a new network to accomplish the retinal OCT image layering. The framework consists of an encoder and a decoder, as shown in Fig. 1.

Fig. 1. The framework of the proposed method.

Download Full Size | PDF

A one-dimensional Transformer block is added after the encoders’ last convolution block to process the convolution blocks’ feature map output, which obtains a larger perceptual field and more spatial location information. In addition, channel and spatial attention mechanisms are added to the skip connection and up-sampling processes, where the channel attention mechanism focuses more on the correlation between channels and improves the representation of features on each channel. The spatial attention mechanism allows the model to focus more on critical local areas in the image, improving the spatial localization of features. Combining these two attention mechanisms enhances the model's ability to extract local features. The following details the encoder improved by the one-dimension transformer, the decoder improved by the Attention Gate, and the combined loss function.

2.1 Encoder improved by one-dimension transformer

The left part of the framework is the encoding branch, which consists of several encoder blocks and a Transformer block.

To enable the network to learn more comprehensive features, each encoder’s block consists of two convolutional layers, batch normalization layers, and ReLU activation layers. The first set of convolutional layers is designed to capture relatively low-level features, while the second captures higher-level features. By continuously stacking layers, the network learns increasingly complex and abstract features. The convolutional layers utilize a 3 × 3 convolutional kernel, a stride of 1, and the zero-padding on the feature map to ensure consistency in output and input sizes.

In providing the same receptive field, combining two 3 × 3 convolutional kernels, as opposed to a larger kernel, not only introduces fewer parameters but also facilitates the extraction of more affluent and more complex features. This aids in the learning process of the network.

The ReLU layer introduces non-linear factors to enhance the model's expressive ability. The maximum pooling layer between convolutional blocks alleviates the feature invariance of the convolutional layers and reduces the redundant information introduced by the convolutional layers.

The output of the last convolution block is serialized before being fed into the transformer block. First, the feature map from convolution blocks, a two-dimensional feature map of size (B, C, $\frac{H}{{16}},\; \; \frac{W}{{16}}$), is reshaped to size (B×$\frac{W}{{16}}$, $\frac{H}{{16}},\; $C) before being passed into the Transformer. Then, the self-attention is only calculated for the reshaped feature map, i.e., in the vertical direction of convolution block outputs.

The Transformer block consists of multiple Transformer layers, each of which consists of a multi-headed self-attentive (MSA) as well as a multilayer perceptron (MLP) block, as shown in Fig. 2, and the output of layer i is represented as follows:

(1)$$Z_i^{\prime} = MSA({LN({{Z_{i - 1}}} )} )+ {Z_{i - 1}}$$

(2)$${Z_i} = MLP({LN({Z_i^{\prime}} )} )+ Z_i^{\prime}$$

where LN (·) represents layer normalization and ${Z_i}$ represents the encoded map.

Fig. 2. The framework of the Transformer layer.

Download Full Size | PDF

To facilitate subsequent convolution in the decoder, the result after the Transformer module is reshaped from (B×$\frac{W}{{16}}$, $\frac{H}{{16}},\; $C) to (B, C, $\frac{H}{{16}},\; \; \frac{W}{{16}}$).

2.2 Decoder improved by attention gate

The decoder branch of the network mainly includes an up-sampling block, an improved Attention Gate (AG) block, and a convolution block. We improved the AG block based on the literature [21], as shown in Fig. 3, which incorporates channel attention to assign weights to each feature channel. These weights amplify or suppress the importance of different features, reducing the parameter count and enhancing the model's performance. Additionally, this module adopts the Exponential Linear Unit (ELU) activation function to alleviate the gradient explosion problem and improve model accuracy.

Fig. 3. Attention Gate block.

Download Full Size | PDF

In Fig. 3, y represents the result of up-sampling on the previous layer of feature maps, and x represents the feature maps from the same level of encoder that is fed to the channel attention first to enhance features. The next two feature maps are convolved by conv 1 × 1 to get outputs with the same size and number of channels, which are summated to highlight important features. Then, the result is processed by an ELU (Exponential Linear Unit). A weight α is generated by a conv 1 × 1 and a sigmoid function, which is multiplied with the original input x, and obtained the result $\hat{x}$.

The final decoder block is a convolutional layer with a convolutional kernel size of 1 × 1 and a SoftMax layer for the output classification result.

3. Loss functions

In this paper, the network is trained using a combination function of the multi-classification cross-entropy loss and the Dice loss. The loss function formula is as follows:

(3)$${L_{seg}} = {\lambda _1}{L_{\textrm {dice}}} + {\lambda _2}{L_{\textrm {ce}}}\; $$

where ${\lambda _1}$ and ${\lambda _2}$ are the weights of the two loss functions, and the sum of ${\lambda _1}$ and ${\lambda _1}$ is 1.

The Dice score is used to assess the similarity between two samples and takes values in the range [0, 1], which is expressed by the following formula:

(4)$$Dice = \frac{{2|{X \cap Y} |}}{{|X |+ |Y |}}\; $$

where X represents the probability map of the ground truth labels, Y represents the probability map obtained from the model's predictions, and |X∩Y| represents the overlapping regions between the two maps. It is calculated by element-wise multiplication and summation of the pixels in both maps. |X| and |Y| represent the pixels’ summation in each respective map.

The Dice loss function is expressed as:

(5)$${L_{dice}} = 1 - Dice$$

The multi-classification cross-entropy loss function measures the similarity between the actual and predicted probability maps. A smaller loss value indicates a smaller discrepancy and helps prevent gradient vanishing. The formula for this loss function is expressed as follows:

(6)$$H({p,q} )={-} \mathop \sum \limits_{i = 1}^M p({{x_i}} )\log ({q({{x_i}} )} )$$

where M represents the number of categories in the classification, $p({{x_i}} )$ represents the true distribution of the sample of category i, if it is a sample of that category then it is 1, otherwise, it is 0, and $q({{x_i}} )$ represents the probability that the sample is predicted to be of category i.

For image segmentation, the dice loss accesses the images globally. In contrast, the multi-classification cross-entropy loss accesses the images pixel by pixel, and the two complement each other to some extent. To highlight the advantages of combining these two loss functions, in our experiments, we also combined the Dice loss function with the other loss functions MultiLabelSoftMaginLoss [14] and Focal Loss [22], which are commonly used in medical image segmentation, and compared their results.

4. Experiments and analysis

4.1 Datasets and preprocessing

The proposed model was evaluated on two datasets, DUKE DME [23] and the optic disc retina dataset from Shanghai Jiao Tong University [24].

DUKE DME is publicly released by Chiu et al. at Duke University Eye Center, which contains 110 (11 B-SCAN per patient, 496 × 768) hand-annotated B-SCAN images from 10 patients with diabetic macular edema. These 110 images are labeled by experts for retinal fluid and seven retinal layers, which are represented by 10 labels in the masked image, namely RNFL, GCL-IPL, INL, OPL, ONL-ISM, ISE, OS-RPE, cumulus, and upper and lower background. Our experiments are divided into independent training, validation, and test sets with a ratio of 6:2:2.

The optic disc retina dataset is a collection of 61 subjects (12 B-SCANs per patient, 1024 × 992) of peripapillary images of the retinal disc collected by Li et al. at Shanghai Jiao Tong University. The subjects include highly myopic patients, those with peripapillary atrophy, and cataract patients. For each subject, two B-SCAN images are randomly selected as the dataset, which is manually annotated by an expert with nine retinal layers and the optic disc. There are 11 labels in the masked image, including background, RNFL, GCL, IPL, INL, OPL, ONL, IS/OS, RPE, Choroid and disc. Our experiments are divided into independent training, validation, and test sets with a ratio of 6:2:2.

In our experiments, both datasets were done with data enhancement by horizontal flipping and the images were resized by nearest neighbor interpolation. The first dataset is fed to the network with the size of 224 × 224 and the second one with the size of 512 × 496 to ensure a better segmentation result with as little loss of image detail features as possible.

4.2 Experimental settings

The neural network framework used in our experiment was PyTorch. The optimizer used Adam with the initial learning rate of 0.001, the linear warm-up of 10 epochs, and the learning rate schedule of cosine annealing. The first dataset used a cosine annealing of 200 epochs to set the decay of the learning rate, and the second used 50 epochs. The experiments ran on a graphic workstation, of which the CPU is i5-11400F with 16 G of RAM and the GPU is NVIDIA TITAN X (Pascal) with 16 G of video memory.

4.3 Results and analysis

4.3.1 Evaluation indicators

This article utilizes the Dice score and Pixel Accuracy (PA) [25] as evaluation metrics. As explained earlier, the Dice score is employed to assess the similarity between the segmentation results and the ground truth images. PA evaluates the percentage of accurately classified pixels in the image, considering the overall segmentation accuracy. The formula of PA is as follows:

(7)$$PA = \frac{{\mathop \sum \nolimits_{i = 0}^n {P_{ii}}}}{{\mathop \sum \nolimits_{i = 0}^n \mathop \sum \nolimits_{j = 0}^n {P_{ij}}}}$$

where n represents the total number of categories, ${P_{ii}}$ denotes the total number of pixels with real pixels i predicted to be in the category i, and ${P_{ij}}$ represents the total number of pixels with true pixels i predicted to be in the category j.

4.3.2 Experiments on DUKE DME

The proposed method was trained and tested on the first dataset, using the dice score as the evaluation metric, and the results are shown in Table 1 and Fig. 4. It can be seen from Fig. 4 that the proposed method gives more accurate delamination compared to other methods, without any mixing between each layer, and is with higher accuracy in the segmentation position of the cumulus, without misidentifying the cumulus as another layer. From Table 1, the proposed method has the best Dice for most layers and the best average, while the INL layer’s Dice is similar to the Relaynet’s, and the ONL-ISM layer’s is slightly worse than the other three methods’, probably due to the effect of fluid accumulation. The inclusion of the self-attention mechanism, as well as the spatial channel attention mechanism, has improved the model with a higher accuracy. Overall, our proposed method Outperforms other methods.

Fig. 4. Segmentation results of OCT retina with diabetic macular edema. (a) Original image. (b) Ground truth. (c) Results of U-net. (d)Results of Attention-Unet. (e)Results of TransUnet. (f)Results of our method.

Download Full Size | PDF

Table 1. Dice scores on the DUKE DME dataset.^a

View Table | View all tables in this article

Here, it is worth noting that during our research, we came across a method based on Residual Attention-UNET from [18]. This method achieves a Mean Dice coefficient of 91.5% and MIOU of 93% in the layer segmentation on their private dataset. However, since this method does not provide Dice coefficients for each layer, it has not been included in the table for comparison.

4.3.3 Experiments on optic disc retina dataset

The proposed method was trained and tested on the optic disc retina dataset and compared with other methods, evaluating the metrics Dice score and PA, as shown in Table 2, Table 3 and Fig. 5. It can be seen from Fig. 5 that the results of other methods are not accurate for some layers, such as pixels between two layers are misclassified or pixels of one layer are wrapped by another layer, whereas our method gets rid of this problem. Out method gets better segmentation possibly because the self-attention mechanism enhances the model's ability to extract global features, which also includes positional information on the image, allowing the model to rely not only on pixel values for classification, but also its position.

Fig. 5. Segmentation results of optic disc retinal OCT images. (a) Original image. (b) Ground truth. (c) Results of Attention-Unet. (d) Results of TransUnet. (e) Results of our method.

Download Full Size | PDF

Table 2. Dice scores on the optic disc retina dataset.^a

View Table | View all tables in this article

Table 3. PA on the optic disc retina dataset.^a

View Table | View all tables in this article

As can be seen from Table 2 and Table 3, our method gets the best metrics on the layers of RNFL, IPL, INL, OPL, ONL, and IS/OS, while its layers’ average metrics are also best, the same as another method, MGU-Net. In addition, the U-net gets the best metrics on Disc but not good on retina layers, probably because it is more suitable for general objects than layer structure. Our method achieves the best segmentation results for the retina layers, not weakened by the presence of the optic disc, even though the results for the Disc are slightly worse, probably because the model has some bias in selecting features at the boundary between the optic disc and the retinal layers, resulting in poor segmentation results for the Disc.

4.3.4 Experiments with the number of transformer layers

On a small-scale medical dataset, the excessive number of layers may cause over-fitting due to the lack of training data, while too few layers do not fully reflect the role of the Transformer. In our experiments, we tried the number of Transformer layers in the Transformer module to achieve the best results. According to the current standards of the Transformer, five layer numbers of 2, 3, 4, 5 and 6 were chosen for comparison. Figure 6 shows the best results for both datasets when the Transformer layer is 3.

Fig. 6. Experiments with the number of Transformer layers.

Download Full Size | PDF

4.3.5 Experiments with combined loss functions

In the experiment, a combined loss function that integrates multi-class cross-entropy loss and Dice loss was used, and the results were compared with different weights. Additionally, we compared this combined loss function with other commonly used loss functions in medical image segmentation, such as MultiLabelSoftMarginLoss and Focal Loss. According to the results shown in Fig. 7, it can be observed that the combined loss function improved the Dice values by 0.011 and 0.012 compared to these two loss functions separately in the DUKE DME dataset. Similarly, in the optic disc and retina dataset, the Dice scores were increased by 0.009 and 0.042, respectively. These indicate that the combined loss function significantly enhances the model's performance.

Fig. 7. Experiments with combined loss functions.

Download Full Size | PDF

4.4 Ablation experiments

We implemented ablation experiments to validate the Transformer's self-attention mechanism at the bottom of the network, and the attention mechanism in the decoder.

In the experiment, the Transformer's self-attention mechanism was incorporated with the regular CNN and the self-attention mechanism was improved too. The results in Fig. 8 showed that the model with the enhanced one-dimensional Transformer module had better results than the original one, with Dice scores increasing by 0.01 and 0.1 in the two datasets, respectively. The improved capability of the model to extract global features has significantly contributed to the enhancement of segmentation accuracy. Considering the distinctive attributes of retinal data, the one-dimensional Transformer selectively employed the self-attention mechanism solely along the vertical direction of the feature map. By specifying the orientation of the self-attention mechanism within the Transformer, the utilization of image information became more effective. This approach maximized the utilization of samples within the same dataset size and reduced computational complexity while enhancing the model's segmentation accuracy.

Fig. 8. Ablation experiments. The values in the Figure are the corresponding average Dice score.

Download Full Size | PDF

In addition, an attention mechanism module has been added to the decoder of our method to enhance the features. As can be seen in Fig. 8, after the enhancement of the features in the channel dimension and spatial dimension, the Dice scores of the model improved by 0.009 and 0.005 in the two Datasets, respectively.

This indicated that the addition of the attention mechanism module enhanced the important features, which play a greater role in subsequent operations and improve the model's performance.

Ultimately, through examination of Fig. 8, it is apparent that the introduction of the enhanced one-dimensional Transformer alongside the attention mechanism module in the decoder yields discernible improvements in the model's performance. Specifically, the Dice coefficients experience notable enhancements of 0.041 and 0.145 in the respective datasets, providing empirical evidence for the efficacy of the incorporated modules.

5. Discussion and conclusion

In this study, a new network architecture combining Transformer and CNN is proposed for OCT retinal layer segmentation, which not only acquires local features by the CNN, but also extracts global features because of the Transformer's powerful global perceptual field, accomplishing the layer segmentation of the retina. According to the characteristics of the retinal data, we use the Transformer's self-attention mechanism only for the vertical direction of the image, and convert the two-dimension feature map into a one-dimension one before being processed by the Transformer, which not only increases the number of samples for the Transformer module, but also reduces the computation cost, improving the performance. Compared to other methods, the proposed method significantly improves the performance in segmenting the retinal layers. In summary, the proposed method can perform the task of retinal segmentation well and assist professional ophthalmologists in diagnosing retinal diseases.

Funding

National Natural Science Foundation of China (62175156, 81827807, 61675134); Science and Technology Commission of Shanghai Municipality (19441905800); Collaborative Innovation Fund of Shanghai Institute of Technology (XTCX2022-04).

Disclosures

The authors declare that they have no conflict of interest.

Data availability

Data underlying the results presented in this paper includes two datasets. The first dataset can be found in Ref. [23] and the second dataset in Ref. [24].

References

1. P.A. Keane, S. Liakopoulos, R. V. Jivrajka, et al., “Evaluation of optical coherence tomography retinal thickness parameters for use in clinical trials for neovascular age-related macular degeneration,” Invest. Ophthalmol. Visual Sci. 50(7), 3378–3385 (2009). [CrossRef]

2. S. Saidha, S. B. Syc, M. A. Ibrahim, et al., “Primary retinal pathology in multiple sclerosis as detected by optical coherence tomography,” Brain 134(2), 518–533 (2011). [CrossRef]

3. D. Huang, E. A. Swanson, C. P. Lin, et al., “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]

4. N. Nassif, B. Cense, B. H. Park, et al., “In vivo human retinal imaging by ultrahigh-speed spectral domain optical coherence tomography,” Opt. Lett. 29(5), 480–482 (2004). [CrossRef]

5. E. M. Anger, A. Unterhuber, B. Hermann, et al., “Ultrahigh resolution optical coherence tomography of the monkey fovea. identification of retinal sublayers by correlation with semithin histology sections,” Exp. Eye Res. 78(6), 1117–1125 (2004). [CrossRef]

6. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” Medical image computing and computer-assisted intervention 1, 234–241 (2015). [CrossRef]

7. J. Chen, Y. Y. Lu, Q. H. Yu, et al., “TransUNet: Transformers make strong encoders for medical image segmentation,” arXiv, arXiv:2102.04306 (2022). [CrossRef] .

8. H. Cao, Y. Y. Wang, J. Chen, et al., “Swin-Unet: Unet-like pure transformer for medical image segmentation,” arXiv, arXiv:2105.05537 (2022). [CrossRef] .

9. Y. H. Gao, M. Zhou, and D. Metaxas, “UTNet: a hybrid transformer architecture for medical image segmentation,” arXiv, arXiv:2107.00781 (2022). [CrossRef] .

10. K. Hu, B. W. Shen, Y. Zhang, et al., “Automatic segmentation of retinal layer boundaries in OCT images using multiscale convolutional neural network and graph search,” Neurocomputing 365, 302–313 (2019). [CrossRef]

11. M. Monemian and H. Rabbani, “Mathematical analysis of texture indicators for the segmentation of optical coherence tomography images,” Optik 219(5), 165227 (2020). [CrossRef]

12. Y. Sun, S. Niu, X. Gao, et al., “Adaptive-guided-coupling-probability level set for retinal layer segmentation,” IEEE J Biomed Health Inform 24(11), 3236–3247 (2020). [CrossRef]

13. S.J. Chiu, X. T. Li, P. Nicholas, et al., “Automatic segmentation of seven retinal layers in sdoct images congruent with expert manual segmentation,” Opt. Express 18(18), 19413–19428 (2010). [CrossRef]

14. A. G. Roy, S. Conjeti, S. P. K. Karri, et al., “Relaynet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks,” Biomed. Opt. Express 8(8), 3627–3642 (2017). [CrossRef]

15. J. Wang, Z. Wang, F. Li, et al., “Joint retina segmentation and classification for early glaucoma diagnosis,” Biomed. Opt. Express 10(5), 2639–2656 (2019). [CrossRef]

16. Y. He, A. Carass, Y. Liu, et al., “Structured layer surface segmentation for retina OCT using fully convolutional regression networks,” Med Image Anal. 68, 101856 (2021). [CrossRef]

17. A. S. Kumar, T. Schlosser, H. Langner, et al., “Improving OCT image segmentation of retinal layers by utilizing a machine learning based multistage system of stacked multiscale encoders and decoders,” Bioengineering 10(10), 1177 (2023). [CrossRef]

18. M. Moradi, Y. Chen, X. Du, et al., “Deep ensemble learning for automated non-advanced AMD classification using optimized retinal layer segmentation and SD-OCT scans,” Comput. Biol. Med. 154, 106512 (2023). [CrossRef]

19. Y. Tan, W. D. Shen, M. Y. Wu, et al., “Retinal layer segmentation in OCT images with boundary regression and feature polarization,” IEEE Trans Med Imaging 43(2), 686–700 (2024). [CrossRef]

20. G. G. Cao, S. Zhang, H. D. Mao, et al., “A single-step regression method based on transformer for retinal layer segmentation,” Phys. Med. Biol. 67(14), 145008 (2022). [CrossRef]

21. O. Oktay, J. Schlemper, L. L. Folgoc, et al., “Attention U-Net: learning where to look for the pancreas,” arXiv, arXiv:1804.03999 (2022). [CrossRef] .

22. T. TY. Lin, P. Goyal, R. Girshick, et al., “Focal Loss for Dense Object Detection,” IEEE Trans Pattern Anal Mach Intell. 42(2), 318–327 (2020). [CrossRef]

23. S. J. Chiu, M. J. Allingham, P. S. Mettu, et al., “Kernel regression based segmentation of optical coherence tomography images with diabetic macular edema,” Biomed. Opt. Express 6(4), 1172–1194 (2015). [CrossRef]

24. J. Li, P. Jin, J. Zhu, et al., “Multi-scale GCN-assisted two-stage network for joint segmentation of retinal layers and discs in peripapillary OCT images,” Biomed. Opt. Express 12(4), 2204–2220 (2021). [CrossRef]

25. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans Pattern Anal Mach Intell. 39(4), 640–651 (2015).

26. A. Chakravarty and J. Sivaswamy, “A supervised joint multi-layer segmentation framework for retinal optical coherence tomography images using conditional random field,” Comput. Methods Programs Biomed. 165, 235–250 (2018). [CrossRef]

Methods	Tissue
Methods	RNFL	GCL-IPL	INL	OPL	ONL-ISM	ISE	OS-RPE	Average
Manual expert 2 [26]	0.86	0.89	0.80	0.72	0.88	0.86	0.84	0.84
Chiu [23]	0.86	0.88	0.73	0.73	0.86	0.86	0.80	0.82
Chakravarty [26]	0.86	0.89	0.80	0.72	0.88	0.86	0.84	0.84
Wang [15]	0.86	0.90	0.78	0.78	0.94*	0.90	0.86	0.86
DRUnet [24]	0.851	0.881	0.720	0.726	0.919	0.887	0.847	0.833
Relaynet [24]	0.872**	0.905**	0.807*	0.782**	0.939**	0.898	0.863	0.867**
U-net	0.863	0.900	0.782	0.765	0.896	0.908**	0.886**	0.857
Attention-Unet	0.844	0.876	0.764	0.764	0.898	0.900	0.885	0.847
Transunet	0.834	0.891	0.761	0.734	0.883	0.895	0.872	0.838
Ours	0.877*	0.913*	0.801**	0.789*	0.910	0.914*	0.892*	0.871*

Methods	Tissue
Methods	RNFL	GCL	IPL	INL	OPL	ONL	IS/OS	RPE	Choroid	Layers’ Average	Disc
U-Net [24]	0.820	0.668	0.720	0.751	0.792	0.905	0.857	0.822	0.883	0.802	0.830*
Relaynet [24]	0.816	0.686	0.730	0.773	0.808	0.904	0.862**	0.824**	0.877	0.808	0.759
DRUnet [24]	0.824	0.683	0.727	0.769	0.805	0.906	0.858	0.822	0.882	0.808	0.785
MGU-Net [24]	0.831**	0.699*	0.749**	0.789**	0.820**	0.909**	0.862**	0.826*	0.892**	0.820*	0.821**
Attention-Unet	0.829	0.682	0.750	0.773	0.807	0.906	0.860	0.823	0.898*	0.814**	0.819
Transunet	0.769	0.600	0.667	0.685	0.743	0.882	0.829	0.749	0.842	0.752	0.662
Ours	0.835*	0.696**	0.751*	0.792*	0.822*	0.910*	0.863*	0.823	0.887	0.820*	0.806

Methods	Tissue
Methods	RNFL	GCL	IPL	INL	OPL	ONL	IS/OS	RPE	Choroid	Layers’ Average	Disc
U-Net [24]	0.831**	0.669	0.721	0.760	0.783	0.914**	0.862	0.831	0.882	0.806	0.868*
Relaynet [24]	0.818	0.687	0.709	0.779	0.797	0.903	0.866**	0.823	0.866	0.805	0.775
DRUnet [24]	0.821	0.679	0.714	0.778	0.795	0.919*	0.857	0.819	0.874	0.806	0.856
MGU-Net [24]	0.846*	0.703	0.747*	0.810*	0.820*	0.919*	0.870*	0.830	0.892	0.826*	0.861**
Attention-Unet	0.771	0.725*	0.736	0.777	0.812	0.912	0.852	0.843**	0.904**	0.814**	0.842
Transunet	0.803	0.638	0.641	0.728	0.806	0.902	0.843	0.814	0.829	0.778	0.769
Ours	0.827	0.721**	0.745**	0.788**	0.815**	0.906	0.854	0.865*	0.912*	0.826*	0.795

Methods	Tissue
Methods	RNFL	GCL-IPL	INL	OPL	ONL-ISM	ISE	OS-RPE	Average
Manual expert 2 [26]	0.86	0.89	0.80	0.72	0.88	0.86	0.84	0.84
Chiu [23]	0.86	0.88	0.73	0.73	0.86	0.86	0.80	0.82
Chakravarty [26]	0.86	0.89	0.80	0.72	0.88	0.86	0.84	0.84
Wang [15]	0.86	0.90	0.78	0.78	0.94*	0.90	0.86	0.86
DRUnet [24]	0.851	0.881	0.720	0.726	0.919	0.887	0.847	0.833
Relaynet [24]	0.872**	0.905**	0.807*	0.782**	0.939**	0.898	0.863	0.867**
U-net	0.863	0.900	0.782	0.765	0.896	0.908**	0.886**	0.857
Attention-Unet	0.844	0.876	0.764	0.764	0.898	0.900	0.885	0.847
Transunet	0.834	0.891	0.761	0.734	0.883	0.895	0.872	0.838
Ours	0.877*	0.913*	0.801**	0.789*	0.910	0.914*	0.892*	0.871*

Methods	Tissue
Methods	RNFL	GCL	IPL	INL	OPL	ONL	IS/OS	RPE	Choroid	Layers’ Average	Disc
U-Net [24]	0.820	0.668	0.720	0.751	0.792	0.905	0.857	0.822	0.883	0.802	0.830*
Relaynet [24]	0.816	0.686	0.730	0.773	0.808	0.904	0.862**	0.824**	0.877	0.808	0.759
DRUnet [24]	0.824	0.683	0.727	0.769	0.805	0.906	0.858	0.822	0.882	0.808	0.785
MGU-Net [24]	0.831**	0.699*	0.749**	0.789**	0.820**	0.909**	0.862**	0.826*	0.892**	0.820*	0.821**
Attention-Unet	0.829	0.682	0.750	0.773	0.807	0.906	0.860	0.823	0.898*	0.814**	0.819
Transunet	0.769	0.600	0.667	0.685	0.743	0.882	0.829	0.749	0.842	0.752	0.662
Ours	0.835*	0.696**	0.751*	0.792*	0.822*	0.910*	0.863*	0.823	0.887	0.820*	0.806

Self-attention CNN for retinal layer segmentation in OCT

Abstract

1. Introduction

2. Methodology

2.1 Encoder improved by one-dimension transformer

2.2 Decoder improved by attention gate

3. Loss functions

4. Experiments and analysis

4.1 Datasets and preprocessing

4.2 Experimental settings

4.3 Results and analysis

4.3.1 Evaluation indicators

4.3.2 Experiments on DUKE DME

4.3.3 Experiments on optic disc retina dataset

4.3.4 Experiments with the number of transformer layers

4.3.5 Experiments with combined loss functions

4.4 Ablation experiments

5. Discussion and conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (8)

Tables (3)

Equations (7)

Biomedical Optics Express