Glaucoma detection model by exploiting multi-region and multi-scan-pattern OCT images with dynamical region score

Kai Liu; Kai Liu; Kai Liu; Jicong Zhang; Jicong Zhang; Jicong Zhang

doi:10.1364/BOE.512138

1. Introduction

Glaucoma, a prevalent and chronic fundus disease, is one of the leading causes of global blindness, primarily result in the degeneration of visual cells. Without timely and effective intervention, it will lead to significant visual loss and ultimately blindness. Furthermore, due to its inherent irreversibility, early detection is crucial to mitigate its development [1–3]. However, accurate detection of glaucoma remains a challenge, the indicators are subtle alterations in deformation and a reduction in thickness within glaucoma-related retinal layers. While these indicators are situated within the deep fundus region, posing a challenge to observation [4–6].

Currently, several imaging methods are available for tracking the condition of the fundus region. Color fundus imaging, one of the most common imaging methods, is widely adopted for screening fundus pathologies, especially in economically disadvantaged regions. However, it’s restricted by only providing a global and superficial overview of the fundus region [7]. Optical coherence tomography (OCT), with the capacity to provide comprehensive and non-invasive insights into the deep and localized fundus region, is popular in the diagnosis of glaucoma. Moreover, as shown in Fig. 1, it offers detailed information by various scan patterns (circular, single-line, volume, and radial) across various fundus regions (optical nerve head (ONH), middle, and macular), enabling a comprehensive presentation of fundus condition, especially facilitating deriving glaucoma-related markers (e.g., thickness map of retinal nerve fiber layer) [8,9]. However, the emphasis on a local region restricts the field of view within the fundus region by a single OCT image.

Fig. 1. The illustration of our motivation. The models solely rely on OCT images captured by a single scan pattern within a region and are not the optimal choice for detecting glaucoma since they hold a high risk of losing valuable features in remaining regions. We exploit OCT images from multi-regions and multi-scan patterns to establish a comprehensive model for alleviating the issue.

Download Full Size | PDF

Deep learning-based models have achieved promising performance in glaucoma diagnosis, exhibiting capability in extracting glaucoma-related features [10,11]. However, it is imperative that the establishment of these models relies on OCT images captured by a single scan pattern within a given region (i.e., the single circular scan pattern in the ONH region) [12,13]. The utilization of images by a single scan pattern (e.g., one image per case) has demonstrated a highly efficient and low computation burden. However, it’s evident that focusing on a single scan pattern or a single region falls short of capturing fully glaucoma-related markers, resulting in misdiagnosis, due to various valuable lesions being distributed across all fundus regions. Consequently, these models are susceptible to failure when deployed in the clinic practice. Therefore, as shown in Fig. 1, we employ various scan patterns to capture comprehensive features from three fundus regions, dedicated to improving the accuracy of glaucoma detection.

In order to address the above limitations, we proposed a novel glaucoma detection model by exploiting comprehensive OCT images from multiple regions and multiple scan patterns. In addition, to integrate various features from large sizes of images for each sample (over 20 OCT images per case), we proposed two fusion modules, one for integrating features extracted from the same region but obtained by various scan patterns, and another for incorporating features derived from multiple regions. In addition, we employ the self-attention strategy, enabling automatic fine-tuning of the impact of features from different scan patterns within one region and different regions, ultimately improving the overall performance of glaucoma diagnosis. Moreover, the attention multi-region fusion module can generate region scores to decipher the contributions to the decision-making process by different regions.

To alleviate the absence of available datasets and obtain a thorough representation of the anatomical fundus, we have collected a comprehensive dataset from three available fundus regions, named MRMSG-OCT. In order to fully exploit resources in a given region, we employ multiple scan patterns to obtain images with the same region. Therefore, each sample in the dataset comprises OCT images from three distinct regions and is captured by four scan patterns.

The contributions of this work are summarized as follows:

1. We proposed a novel multi-region and multi-scan-pattern glaucoma detection model to comprehensively exploit glaucoma-related features from OCT images within ONH, middle, and macular regions by circular, single-line, volume, and radial scan patterns, which alleviates the risk of losing critical glaucoma-related features in single region-based or single scan-pattern-based models.
2. We proposed two novel self-attention feature fusion modules: the attention multi-scan-pattern module, designed to automatically optimize and fuse features from various scan patterns within a given region; the attention multi-region fusion module, designed to automatically optimize the fusion of features from all available regions, simultaneously generating region contribution scores to augment interpretability of the decision-making process. These modules collaboratively contribute to the high capacity of glaucoma diagnosis for the proposed model.
3. We collected a novel multi-scan-pattern and multi-region glaucoma dataset comprising OCT images captured by four scan patterns from three fundus regions for each sample. We have conducted extensive experiments on the dataset. The experimental results demonstrate that our proposed model yields superior performance than single scan-pattern-based models, single region-based models, and the models with the average fusion strategy.

1.2 Related works

1.2.1 OCT-based glaucoma detection models

Most glaucoma detection models rely on the OCT image captured by a single scan pattern, particularly the circular scan pattern within the ONH region. This scan pattern efficiently covers the majority of pathological areas with a single image, exhibiting a high information density and low computational burden [14,15]. Illustratively, utilizing the manually extracted features from these images, followed by employing traditional machine learning algorithms like the support vector machine [16,17] or logistic regression [18,19] to categorize these features, achieving high accuracy. While it requires prior knowledge to design extracted features. Recent advances in deep learning have revolutionized glaucoma detection tasks, with automatic extraction of task-specific discriminant features [20], significantly improving diagnostic accuracy and eliminating the necessity of prior knowledge [21]. Additionally, various novel networks, such as the incorporation of graph neural network [22], and transformers [23,24], have been employed with circular scan pattern images. Another commonly used scan pattern in the ONH region is the radial scan pattern, which captures images by performing a cross-section scan systematically at every setting degree along the center axis of ONH, yielding a sequence of consecutive images [25,26]. Compared with the single image by the circular scan pattern, the radial scan pattern provides a more information-rich dataset. Therefore, several glaucoma detection models have been developed on radial scan images [27].

Except for images from ONH, various scan patterns have been applied in the macular region [28]. Notably, the radial scan pattern, similar to those employed in ONH, scans the macular region at each setting degree to obtain a sequence of images for each sample. These images have exhibited utility in diagnosing glaucoma, achieving comparable results to models using images from the ONH region, demonstrating the radial scan pattern can afford detailed features over the circular scan pattern [29]. In addition, the volume scan pattern confers the advantage of the number of images for each sample than other scan patterns, as it comprises a large size of images captured at specified intervals within a given region [30,31]. Therefore, using extensive images to build the model, holds the potential of achieving promising performance [32,33] However, the substantial number of images for each sample (up to 31 images) increases the expenses with the acquisition, reducing the diagnosis efficiency and imposing a high computational burden.

In clinical practice, not all patients will undergo multi-region scanning with various patterns, while for those necessitating such procedures, ophthalmologists encounter significant challenges in efficiently diagnosing extensive images, greatly reducing diagnostic efficacy. Moreover, the auto-assisted detection models developed by deep learning are restricted by computational burden, limiting their ability to make full use of these images. However, we argue that it’s appealing to establish a model by using multi-region and multi-scan-pattern OCT images to avoid missing valuable features in various scan patterns and multiple fundus regions.

1.2.2 Attention mechanism-based models

In recent years, inspired by human visual mechanisms, with a selective focus on task-related regions, attention mechanism-based models have been proposed as a pivotal component in enhancing feature learning [34]. These modules are conceived to facilitate the automatic acquisition of task-related features and the establishment of inter-relationships among different features [35]. Thus, various attention-based approaches have been proposed to address tasks. In the classification tasks, attention-based modules enable the models to capture discriminant features for target classes [36]. Meanwhile, within segmentation tasks, attention mechanism modules yield promising results by enhancing attention on challenging pixels like boundaries, facilitating the understanding of contextual information [37,38]. Moreover, the spatial-attention module directs concentration toward the salient inputs, saving computation requirements and enhancing processing efficiency, even adapting the variable-sized inputs. As evidenced by the utility of structural attention modules and graph attention models, yielding promising performance by establishing inter-relationships within structural data [39–41]. Moreover, attention mechanisms-based models have been shown to yield contributions to other tasks, such as image inpainting [42,43], speech recognition [44,45], and object detection [46,47].

The attention-based models have been extended to medical images, serving analogous utilities in general tasks, such as facilitating the identification of discriminant features to improve diagnostic efficiency and accuracy [48–51]. In addition, adapted attention-based models offer distinct medical advantages. They have proven to be valuable tools for the precise localization of interested regions, particularly those associated with pathological regions like cancers and lesions, achieved by assigning elevated attention weights to these interested regions, while concurrently diminishing weights to non-relevant regions and backgrounds [52–54] Furthermore, the attention mechanism-based models have demonstrated potential in efficiently processing extensive multi-modal medical data (MRI, CT, PET, and OCT), with a targeted focus on the key modal data and integrating various modal features, yielding a comprehensive representation to subsequent diseases diagnosis [55–57]. Moreover, another clinical advantage resides in providing interpretable results behind the model’s decisions, achieved by visualizing the focused regions with high attention weights, thus the reliable visualization evidence assists physicians to make a comprehensive diagnosis and formulate precise treatment plans [58–60].

Inspired by the advantages of attention mechanism, particularly effectiveness in integrating multi-modal data and capturing salient features, we propose a novel attention multi-scan-pattern fusion module and an attention multi-region fusion module, designed to fuse features captured by various scan patterns within a region dedicated to fuse features extracted from multiple regions, respectively.

2. Methods

2.1 Overview of the proposed model

To fully exploit images from three distinct regions, as well as various scan patterns simultaneously, we proposed a novel multi-region and multi-scan-pattern fusion model. The overview of the proposed model, as shown in Fig. 2, comprises two primary paths: the single-region features fusion pathways, which focuses on integrating features captured by various scan patterns within a given region, and the multi-region features fusion pathway, which is dedicated to integrating features extracted from three distinct regions. Consequently, the former parts yield comprehensive multi-region fusion features. These features are subsequently utilized as inputs for the classifier, serving the purpose of predicting the probability of being glaucoma or normal, with outputting the region scores corresponding to the contribution of the model’s decision-making by various regions.

Fig. 2. The illustration of our proposed model. The proposed model comprises a feature extraction part and a classification part. The feature extraction part, responsible for extracting features from various OCT images (captured by various scan patterns within three fundus regions), comprises three region-based paths (macular, middle, and ONH region). Moreover, attention multi-scan-pattern fusion modules and an attention multi-region fusion module are employed to integrate the corresponding scan-pattern features within the same region and to combine the features from various regions, respectively. The feature classifier part serves as a classifier to categorize the former multi-region fusion features, with outputting region contribution scores by three regions.

Download Full Size | PDF

The whole procedure can be summarized as follows: We first perform feature fusion by combining features extracted from various scan patterns within a given region. Subsequently, we extend this fusion process to integrate features derived from different regions. Ultimately, we derive diagnostic results based on the fusion features obtained in the previous step.

2.2 Attention multi-scan-pattern fusion module (AMSFM)

We are dedicated to obtaining extensive information from a target fundus region by various scan patterns. It’s acknowledged that the strategy employed to fuse these features affects the performance of the proposed model. Fundamentally, regarding characteristics inherent to various scan methods, different scan patterns offer varying information with specific focus. For example, the images captured by circular scan pattern mainly focus on the retinal layers surrounding the optic nerve head, efficiently covering retinal layers within a single image. In contrast, images captured by the radial scan pattern emphasize detailed and continuous fundus conditions, obtaining a wider scope of information with six images. Consequently, a notable divergence emerges in terms of diagnostic efficiency and intrinsic informational value inherent in various scan patterns. In addition, the performance of the deep learning-based model will be dominated by the large dimensions of features (the circular scan pattern only yields 1 image, while the radial and volume scan pattern generate 6 and 9 images, respectively).

Therefore, the strategy of simple averaging features is inadequate to avoid the above limitations. To effectively combine these features and exploit images by various scan patterns within a target region, we proposed an attention multi-scan-pattern fusion module (AMSFM) that focuses on integrating features from various scan patterns in the same region.

The proposed AMSFM module, as shown in Fig. 3, involves detailed steps, outlined as follows: 1) First, we concatenate feature maps from various scan patterns along the channel axis, resulting in a feature matrix with three-dimension. 2) Subsequently, we perform a reshaping on the concatenated matrix along the dimensions of height and width, converting the three-dimensional matrix into a two-dimensional matrix. 3) Following by, we conduct a matrix multiplication operation between the reshaped two-dimensional matrix and its transposed matrix, yielding a two-dimensional matrix with identical channel dimensions in both length and width. 4) Then, we normalize the former feature matrix by a soft-max process to avoid biased issues. The resulting normalized matrix is then subjected to a multiplied operation with the original reshaped matrix. This product is then added to the original concatenated feature maps. These operations yield the ultimate attention scan pattern fusion features matrix. The AMSFM feature maps are calculated as follows.

(1)$$MSF({{S_i},{S_j}} )= softmax({c{{({{S_i},{S_j}} )}^R}\mathrm{\ast }{{({c{{({{S_i},{S_j}} )}^R}} )}^T}} )\ast \; c{{({{S_i},{S_j}} )}^R} + \; c({{S_i},{S_j}} )$$

where $\textrm{M}SF({{S_i},{S_j}} )$ denotes the fusion features from multi-scan-pattern, $c({\bullet} )$ denotes the concatenate operation, R and T denote the reshape and transpose operation, respectively.

Fig. 3. The illustration of the attention multi-scan-pattern fusion module, shows the detailed steps of fusing the multiple scan patterns features within a given region.

Download Full Size | PDF

2.3 Attention multi-region fusion module (AMRFM)

To mitigate the loss of region-related information, our collected images cover the majority of the fundus, resulting in obtaining features from ONH, macula, and intermediate regions. However, the methods adopted to integrate these regional features significantly influence the performance of the model. This phenomenon stems from the fact that the features covered by different region exhibits heterogeneity. As a notable example, the clinical significance of the ONH region is prominently evident. Moreover, an additional concern is to mitigate the bias issue arising from dimensional inequality inherent in the different regional paths. Furthermore, enhancing model interpretability holds importance in gaining insight into the decision-making process, which involves providing quantified assessments of contributions from distinct regions.

Therefore, we propose an attention multi-region fusion module (AMRFM), as shown in Fig. 4, that aims to integrate features from three regions. We employ an adapted fusion strategy, leveraging the attention mechanism to auto-derive contribution by each region with a self-learning strategy, denoted as region contribution score. These scores hold a dual purpose: firstly, they auto-modulate the role of different regions on the model’s predictions. Secondly, they explicitly offer a metric to directly assess the contribution disparities among different regions.

Fig. 4. The illustration of the attention multi-region fusion module, shows the detailed steps of integrating multiple regions features, with generating the region contribution scores.

Download Full Size | PDF

To achieve this goal, we establish a shallow neural network with several fully connection layers to autonomously learn scores, effectively auto-assigning the weights to each region. Moreover, to prevent bias caused by variations in the regions and to facilitate explicit results analysis, we normalize the adapted region scores by summing them to one.

The detailed transform steps are as follows: Initially, the feature maps from various regions are concatenated together into a new multi-dimensional representation. Subsequently, we perform a flattened operation with the newly formed representation, transforming it into two dimensions. Followed by they are fed into a neural network with three fully connection layers, with each layer subjected to a nonlinear activation operation, resulting in a vector of region contribution coefficients. This vector is then applied to the original feature representation with element-wise multiplication, yielding the fusion feature representation. These steps are computed according to the following equation, which is defined as follows:

(2)$$\textrm{M}RF({{R_{mac}},{R_{mid}},{R_{ONH}}} )= {F_3}({F_2}({F_1}({c({{R_{mac}},{R_{mid}},{R_{ONH}}} )} )))\mathrm{\ast }c({{R_{mac}},{R_{mid}},{R_{ONH}}} )$$

Where $\textrm{M}RF({{R_{mac}},{R_{mid}},{R_{ONH}}} )$ denotes the fusion features from multiple regional features, c represents the concatenate operation and ${F_i}$ denotes the i layer of the neural network. All the region contribution scores are computed according to the following equation, which is defined as follows:

(3)$$R{S_k} = \; \frac{{\exp \{{{w^\dagger }\ast \tanh ({V\ast R_k^\dagger } )} \}}}{{\mathop \sum \nolimits_{j = 1}^3 \exp \{{{w^\dagger }\ast \tanh ({V\ast R_k^\dagger } )} \}}}$$

Where w and V are parameters of the neural network. $\tanh ({\bullet} )$ denotes the nonlinear activation operation. Where $R{S_k}$ denotes the region contribution score associated with the representation of region k, and ${R_k}$ represents the feature representation of region k.

2.4 Details of the proposed model

The details of our proposed model, as shown in Fig. 2, which comprise two main parts. The former part serves as an extractor, utilizing some base frameworks to extract features from three regions. The latter part serves as a classifier, employing several fully connection layers to predict categories by using the previously extracted features. Additionally, two essential modules (multi-scan-pattern fusion module and multi-region fusion module) are linked with two parts, each with its distinct methodology, responsible for integrating features captured by various scan patterns within the same region and for fusing multiple regions features, respectively. In the following section, we will provide deep insights into each part of the proposed model.

The former part holds three single-region-based paths, each corresponding to a specific fundus region: the macular, middle, and ONH regions. Notably, both the ONH and macular region-based pathways have two separate inputs for dual scan pattern images, while the middle region-based pathway has a unique input for single-line scan pattern images. Consequently, the proposed model encompasses a total of five input pathways. The input pathways share a similar framework with the utilization of the ResNet18 network, which can be easily replaced by any feature extractors and offers flexibility for the integration of more advanced models when needed. Each residual network is composed of a convolutional layer coupled with an activation function, followed by a residual connection that enables an element-wise addition with previous feature representations.

Subsequently, the integration of multi-input features within ONH and macular region-based pathways is accomplished by using the attention multi-scan-pattern fusion modules, specifically designed to integrate features from various scan patterns that correspond to the same anatomical region. Furthermore, an attention multi-region fusion module is employed to integrate the features from three regions, crafted to auto-assign regional contribution scores to their respective regional features. The detailed and comprehensive elaboration of these two fusion modules is available in the corresponding section of our method.

The latter part comprises a classification path. The path is tailored to classify the multi-region-fusion features. The output pathway within the framework incorporates several fully connection layers. Ultimately, generating a prediction likelihood of being glaucoma based on the multi-regional fusion features, determined by a weighted-focus loss function.

2.5 Training steps and objection function

The training steps can be summarized as follows: We first perform feature fusion by combining features captured by various scan patterns within a given region. Subsequently, we extend this fusion process to integrate features originating from all three regions. Followed by, the network optimization is pursued by backpropagating the loss while leveraging the fusion features obtained in the previous step. We will not stop these iterations until a stable and small loss is achieved. During the inference phase, we derive diagnostic results by inputting test images into the optimized model. Simultaneously, the model computes region contribution scores, augmenting the diagnostic interpretability.

Therefore, we formulate the task by an objection function with a weighted focus loss, which serves to capture the multi-region and multi-scan-pattern features. The raw focus loss [61], initially designed for the detection task, has been adapted to better address the imbalanced classification task. Unlike the original detection task, which relied on an empirical coefficient to obtain optimal performance without the precise class information, our task received such information, which is readily available within the designated dataset. Therefore, this varied focus loss enables us to directly address the complexity of the task. The overall function can be expressed as follows:

(4)$${\mathrm{{\cal L}}_{fusion}} = \; - \frac{1}{n}\mathop \sum \limits_i^n ({{C_i}\mathrm{\ast }{{({1 - \widehat {{Y_i}}} )}^r}\mathrm{\ast }{Y_i}\mathrm{\ast }log\widehat {{Y_i}} + {C_j}\mathrm{\ast }{{\widehat {{Y_i}}}^r}\mathrm{\ast }({1 - {Y_i}} )\mathrm{\ast log}({1 - \,\widehat {{Y_i}}} )\,} )$$

Where ${\mathrm{{\cal L}}_{fusion}}$ denotes the fusion loss, a balanced variant of the focus loss, supervised by the ground truth of multi-region fusion features. $\widehat {{Y_i}}$ denotes the output soft-max multi-region fusion features, while ${Y_i}$ represents the corresponding ground truth annotations. The variable n is the total size of the training dataset. Furthermore, ${C_i}$ and ${C_j}$ stand for the class weights associated with class i and class j, respectively.

3. Experiments

3.1 Dataset

3.1.1 Data acquisition and data approval

All samples utilized in this study originated from an annual physical examination project carried out at a Kailuan Meikuang community, a Han Chinese population located in Tangshan, PR China, spanning a decade (2008-2018). Additionally, the OCT images were acquired utilizing the Heidelberg Engineering Spectralis SD-OCT device (Heidelberg, Germany) and subsequently processed by the software (HSF-OCT-103). These images covered a majority view of the fundus, including the macular, middle, and ONH regions. In addition, various scan patterns were employed to capture detailed features within these regions. As shown in Fig. 5, circular and radial scan patterns were specifically utilized to acquire images of the ONH region, while radial and volume scan patterns were applied for the macular region. Furthermore, a single-line scan pattern was adopted for the middle region. The dataset is established at eye-level, with one eye per individual to save cost. The primary focus on OD, utilizing OS serves as an alternative when data collection from the OD eye isn’t feasible due to abnormal or other fundus diseases that may impact our experiments. In addition, we have eliminated incomplete samples, lacking OCT images from all three regions, or failing to be captured by the required scan patterns, thus the OCT images captured by various scan-patterns from multiple regions should be paired for each eye. Importantly, we obtained written informed consent from all individuals for using their images in our study.

Fig. 5. The illustration of OCT images in our MRMSG-OCT dataset, which are from three fundus regions and captured by various scan patterns.

Download Full Size | PDF

3.1.2 Image quality control and data annotation

The performance of models is susceptible to the quality of images, leading to failure in low-quality images. To adapt to the varying quality of images in clinical practice, we have not deliberately set a predefined Quality threshold or imposed a strong constraint that excludes images with low quality. Nevertheless, we conducted a manual view of all collected images to ensure the images were without deficiencies and with full retinal layers. Moreover, we have taken measures to mitigate the potential impact on experimental results by removing the case with other pathological conditions. To align with real-world clinical practice, we haven’t imposed strict constraints on the manual criteria. Only the instances involving the loss of retinal layers, masked by anomalous factors, as well as lesions of extremely large fluids, have been removed from our dataset. It’s noteworthy these exclusions are guided that can be easily found at a quick glance.

Subsequently, we annotated samples in the dataset. The annotation procedure involved categorizing samples as glaucoma or normal, performed by a highly experienced expert (with over a decade of diagnostic experience) from Tongren Eye Hospital. To ensure accurate diagnostic results for individual eyes, the expert relied on comprehensive clinical evidence, including visual field assessments, retinal layer thickness map, and other various clinical indicators, to annotate samples. Consequently, these expert-derived classifications were assigned as class labels for our dataset.

3.1.3 Overview of dataset

Eventually, we collected a comprehensive dataset, as shown in Table 1, comprising a total of 1561 samples. It contains 136 glaucoma samples and 1425 normal samples, each including OCT images from all three anatomical regions and required scan patterns. Notable, various scan patterns generate images with varying sizes and dimensions. Specifically, the single-line scan pattern yields a single image, while the radial scan pattern generates 6 distinct images, and the volume scan pattern produces 31 individual images. In addition, all the samples belong to the adult demographic, with a mean age of 66.94 years. Glaucoma samples exhibit a higher mean age of 73.15, compared to normal samples at 65.91 age. The gender distribution for all samples is approximately equilibrium with a ratio of 1:1.17 (720 males to 841 females). Other information, including the mean thickness of the retinal nerve fiber layer, Quality metric, and ART values, can be found in Table 1.

Table 1. The demographic characteristics of the MRMSG-OCT dataset

View Table | View all tables in this article

3.1.4 Data preprocess

Before being fed into the experimental models, the images underwent a two-step preprocess procedure to improve their overall quality: 1) Denoised via BM3D Algorithm [62]: The BM3D algorithm was employed to eliminate spike artifacts in images, performed by fine-tuning parameters to yield high-quality images with low noise artifacts; 2) Contrast Enhancement through Adaptive Histogram Equalization [63]: Addressing the inherent challenge of a significant variation of contrast degrees in raw images, we employed adaptive histogram equalization algorithm. This strategy effectively improves image contrast degree, approximating a similar distribution of contrast degree for all images, thereby mitigating blur boundaries and prompting uniformity in our dataset.

3.2 Experimental configuration

To accurately and comprehensively evaluate our proposed model, we conduct comparative analyses with various models, including the single scan-pattern-based models, the single region-based models, the two-region-based models, and our proposed multi-region and multi-scan-pattern fusion model. Notably, an equal training dataset size is employed for all models. Specifically, for the single scan-pattern-based models, the training dataset consists of images captured by a single scan pattern. In contrast, the single region-based methods relied on the training dataset, comprised of images captured by various scan patterns within a given region. Furthermore, the two region-based methods relied on the training dataset, comprised of images from any two distinct regions. Our proposed method is established by the training dataset, using detailed images captured by various scan patterns from all three regions. We employ a five-cross-validation method for all experiments. All these measures ensure a consistent and reliable basis for comparative analysis.

All experiments were implemented using PyTorch (Version 1.8) [64], and executed on NVIDIA RTX 4080 GPU with 16GB of memory. All the models in this study employed the same input image size of 256 × 256 × 1, denoting the dimensions of height, width, and channel, respectively. For the scan-patterns (more than one image), we concatenate them directly along the channel dimension, these concatenated images serve as inputs for the corresponding branch. A batch size of 5 was employed across all models. Stochastic gradient descent (SGD) serves as the optimization algorithm to optimize parameters for all models, with initialized settings comprising a learning rate of 2.5 × 10⁻⁴, a momentum of 0.9, and a weight decay factor of 5 × 10⁻⁴. During the optimization produce, the parameters were updated until reaching a fixed training epoch of 100 for all models.

3.3 Evaluation metrics

To assess the performance of our proposed model, we initially utilize the metric of the area under the receiver operating curve (AUC), which is commonly used to assess the performance of the glaucoma detection model. The AUC coefficient provides an overall assessment of the discriminative power of the model, computed by constructing a receiver operating curve (ROC) across different thresholds, followed by obtaining the area under ROC. The formula for obtaining AUC is defined as follows:

(5)$$AUC = \; \frac{{\sum [{(FP{R^{i + 1}} - \,FP{R^i}} )\times ({TP{R^{i + 1}} + TP{R^i}} )]}}{2}$$

Where FPR and TPR denote false positive rate and true positive rate, respectively. i denote each point in the ROC curve and obtain the AUC by performing the sum operation over all the points.

Moreover, we introduce the F1 coefficient, since it concurrently considers both precision and recall, providing a comprehensive evaluation of the model’s performance. The formula for computing the F1 coefficient is defined as follows:

(6)$$F1 = \; \frac{{2 \times precision \times recall}}{{precision + recall}}$$

Due to the imbalanced class within our dataset, assessing our proposed model solely by a single metric, like recall or precision, is limited. Therefore, we introduce additional two metrics Matthews Correlation Coefficient (MCC) and G-Mean, mitigating the potential bias stemming from the class imbalance issue. The MCC metric provides a balanced assessment of the model’s effectiveness, considering both true positives, true negatives, false positives, and false negatives. The formula for computing the MCC coefficient is defined as follows:

(7)$$MCC = \; \frac{{TP \times TN - FP \times FN}}{{\sqrt {({({TP + FP} )\times ({TP + FN} )\times ({TN + FP} )\times ({TN + FN} )} )} }}$$

Where TP, FP, TN, and FN denote true positive, false positive, true negative, false negative prediction results, respectively.

The G-Mean serves as another robust metric to assess the performance of the model in the imbalanced dataset, as it takes into account both sensitivity and specificity. A higher G-Mean signifies superior model performance in correctly classifying both positive and negative samples. The formula for computing the G-Mean coefficient is defined as follows:

(8)$${G_{Mean}} = \; \sqrt {Sens \times Spec} $$

Where Sens and Spec denote sensitivity and specificity, respectively.

4. Experimental results

4.1 Results of the single scan-pattern-based models

To facilitate comprehensive comparisons and explore the potential of OCT images captured by various scan patterns, we have established five baseline models, each focused on a specific scan pattern. These baselines adopt the same framework as our proposed model, while explicitly assigning zero contribution to the remaining scan patterns. Furthermore, the experimental configurations are consistent, ensuring a fair comparison across all experiments.

The experimental results, as shown in Table 2, reveal that the model, employing images captured by the radial scan pattern within the ONH region, yields the optimal performance among all the single scan-pattern-based models, achieving impressive performance (AUC: 0.7695, F1: 0.3039, MCC: 0.2407, and G-Mean: 0.6691) across all evaluation metrics, while the single-line scan-pattern-based model exhibits the lowest performance (AUC: 0.6932, F1: 0.2141, MCC: 0.1397, G-Mean: 0.6004). Moreover, the volume scan-pattern-based model demonstrates a comparable performance (AUC: 0.7458), benefiting from the incorporation of comprehensive scan data with 9 images per sample. In contrast, the circular scan-pattern-based model, which relies on a single image for each sample, also yields comparable performance (AUC: 0.7358). These findings align with clinical observations, where circular images within the ONH region are widely acknowledged as a cost-efficient source for detecting glaucoma, mitigating computation burden, and enhancing efficiency, contributing to their widespread utilization in published models. While the models employing the remaining scan patterns exhibit a lower level of performance, still yielding favorable results (e.g., the model using radial in macular and middle region achieves 0.6990 in AUC), which implies that all scan pattern images encapsulate significant and indispensable features for glaucoma diagnosis. Therefore, it’s profitable to establish an integrated model with incorporating all available images from various scan patterns.

Table 2. Results of the single scan-pattern-based models for 5-fold cross-validation experiments^a

View Table | View all tables in this article

4.2 Results of single region-based models

To address the limitations inherent in the single scan-pattern-based models and enhance the efficacy of glaucoma diagnosis, we proposed the single region-based models, that utilize all available OCT images within a target region by various scan patterns. Therefore, we established three distinct region-based models, each focusing on a given region. The framework and settings of these models are consistent with the single scan-pattern-based models, ensuring a fair comparison across all models.

4.2.1 Results of single region-based models employing average fusion strategy

We initially employ an average fusion strategy to integrate features from multiple scan patterns within a given region. The experimental results, as shown in Table 3, indicate the superior performance of the single region-based models compared with the single scan-pattern-based models. Notably, the ONH region-based model exhibits remarkable enhancements of 7.34% and 2.64% in AUC compared to the single circular and radial scan-pattern-based models within the same region, respectively. Furthermore, it achieves the highest performance among all single region-based models, yielding an increase of 13.94% than the lowest performance by the middle region-based model (AUC: 0.6932), consistent with evidence observed in clinical practice. Nevertheless, an unexpected trend emerges in the analysis of the single macular region-based model (AUC: 0.7474), where it yields a superior performance over the single radial scan-pattern-based model (AUC: 0.6990) but falls in matching the single volume scan-pattern-based model in terms of F1 and MCC metrics. These unstable results reveal limitations inherent in the fusion strategy, employing fixed weights by assigning all scan-pattern features with identical weights, failing to effectively adapt to diverse scenarios, and occasionally leading to suboptimal performance.

Table 3. Results of the single region-based models for 5-fold cross-validation experiments^a

View Table | View all tables in this article

4.2.2 Results of single region-based models employing AMSFM module

To avoid the limitation herein the average fusion strategy and optimize the integration of multi-scan-pattern-based features, we proposed a novel attention multi-scan-pattern fusion module known as AMSFM, automatically assigning weights according to their diagnostic significance. The comprehensive framework of AMSFM is presented in Fig. 3 and exhaustively detailed in the methodology section. The experimental results, as shown in Table 3, exhibit a consistent improvement in the performance of both the single ONH and macular region-based models, outperforming the models utilizing the average fusion strategy with enhancements of 0.75% and 1.36% in terms of AUC. Significantly, the implementation of the AMSFM module effectively mitigates the declining trend observed in the single macular region-based model when employing the average fusion strategy, yielding comparable results in terms of F1 and MCC metrics and surpassing the single volume scan-pattern-based model with an increase of 1.58% in AUC, 3.68% in G-Mean. Furthermore, the AMSFM module yields a significant increment of 8.38% in AUC compared to the single radial scan-pattern model in the macular region, which surpasses the marginal 5.92% enhancement achieved by the average fusion method. Therefore, an appropriate fusion strategy, as exemplified by our proposed self-learning fusion module, emerges as an indispensable tool for optimizing their respective influences in accordance with the contributions of distinct scan patterns in glaucoma diagnosis.

4.3 Results of multi-region and multi-scan-pattern-based models

While single region-based models exhibit proficiency in optimizing various features within a targeted region, they are susceptible to missing the features in the non-targeted regions. To address this limitation and potentially enhance the efficacy of glaucoma diagnosis, we proposed multi-region and multi-scan-pattern-based models, exploiting features from multiple fundus regions, involving two distinct regions and all three available regions. To ensure fair comparisons, the framework and experimental settings are maintained consistently across all models. Moreover, to facilitate comparison and analysis, we give a summary of the optimal results for different classes of models in Table 4.

Table 4. A summary of various classes of the optimal models for 5-fold cross-validation experiments^a

View Table | View all tables in this article

4.3.1 Results of the proposed models employing average fusion strategy

Extensive experiments were conducted to evaluate the efficacy of utilizing multiple regional features. Initially, a simple average fusion strategy was employed to integrate multiple regional features. The experimental results, as depicted in Table 5, reveal that the two-region-based model, which integrates the multiple regional features from macular and ONH regions, yields superior performance (AUC: 0.8112) compared to the other variants of two-region-based models, which integrate features from the ONH and middle regions or from the macular and middle regions. Moreover, compared with the single region-based models, especially the optimal performance achieved by utilizing features from the ONH region, the optimal two-region-based model of ONH and macular yields superior performance with a significant enhancement of 1.95% and 1.57% in terms of AUC and G-Mean, respectively.

Table 5. Results of the two-region-based models for 5-fold cross-validation experiments^a

View Table | View all tables in this article

However, as shown in Table 6, it’s noteworthy that the multi-region-based model, leveraging features from whole three available regions, has not achieved the expected optimal performance across all multi-region-based models. Instead, it has merely achieved a comparable performance (AUC: 0.7939) compared to the optimal single region-based model (AUC: 0.7957), while exhibiting inferior results than the optimal two-region-based models (AUC: 0.8112) in terms of AUC metric, even exhibiting a decline in remaining metrics. The inconsistent trend in results can be attributed to the deficient fusion strategy, assigning a fixed weight to all regions, which fails to adapt to different scenarios, resulting in a performance decline when confronted with variable scenarios.

Table 6. The separate 5-fold cross-validation results of the proposed all three-region-based model^a

View Table | View all tables in this article

4.3.2 Results of the proposed models employing AMRFM module

To mitigate the limitations of the average fusion strategy, we proposed an attention multi-region fusion module, denoted as AMRFM, that aims at effectively exploiting features from multiple regions while simultaneously exploring their roles in the task of glaucoma diagnosis. All experimental settings are consistent with the models using the average fusion strategy. The detailed framework of our AMRFM module is shown in Fig. 4 and detailed descriptions are in the methodology section.

The experimental results, as shown in Table 6 and Fig. 6, reveal a promising performance achieved by our proposed AMRFM model. Compared to the optimal single scan-pattern-based model (radial scan pattern), our proposed multi-region-based model yields a remarkable enhancement at all evaluation metrics, with increases of 4.43%, 15.10%, 27.42%, and 9.42% in AUC, F1, MCC, and G-Mean, respectively. Specifically, in comparison to the optimal single region-based model (ONH), our proposed model demonstrates a slight enhancement of 0.99% in AUC, accompanied by significant improvements in the remaining key evaluation metrics, including F1, MCC, and G-mean, reaching 11.19%, 15.87%, and 3.23%, respectively. Moreover, compared with the optimal two-region-based model (ONH-macular), our proposed multi-region model exhibits a minor decrease in terms of AUC. However, this reduction is offset by notable gains of 9.35% and 9.22% in F1 and MCC, respectively, accompanied by a modest enhancement of 0.70% in the G-mean metric.

Fig. 6. The ROC curves of different models for 5-fold cross-validation experiments.

Download Full Size | PDF

Compared with the corresponding models employing the average fusion strategy, it’s evident that both the two-region-based and three-region-based models exhibit superior performance across various evaluation metrics. Specifically, the ONH-macular two-region-based model yields an increment of 1.05%, 5.30%, 4.89%, and 0.93% in AUC, F1, MCC, and G-means, respectively. While the three-regions-based model with the AMRFM module yields a notable enhancement of 1.22%, 22.18%, 22.68%, and 2.71% in terms of AUC, F1, MCC, and G-means. Notably, the three-region-based model, augmented with the AMRFM module, effectively alleviates the performance degradation observed in the model employing the average fusion strategy. It achieves results that are comparable in terms of AUC while demonstrating a remarkable boost of 9.22% and 9.35% in terms of both MCC and F1. This achievement also results in a margin increase of 0.70% in G-means.

These promising results can be attributed to our AMRFM module, with its inherent capability to automatically optimize the weight coefficients individually for distinct regions. This optimization facilitates the precise integration of various regional features, thereby effectively mitigating limitations associated with the average fusion strategy. Consequently, the integration of features from multiple anatomical regions holds significance in comprehensively capturing valuable information, boosting the model’s capacity for glaucoma detection. In addition, the computed p-values provide evidence of statistical differences across all models. A detailed description is available in the Supplement 1.

5. Visualization

5.1 Visualization of feature maps for glaucomatous optic neuropathy

To explicitly compare the diagnostic capacity of various models, as well as to augment the interpretability of the proposed model, we implement a visual representation approach [65,66]. In detail, the method involves the generation of feature maps originating from an identical layer across various models, followed by mapping these derived features back to the original OCT images. The generated feature maps, as illustrated in Fig. 7, offer a deep insight into the heat distribution patterns, which are associated with discriminator features employed by various models in their decision-making processes. Remarkably, it is of significance to observe that areas characterized by a heightened heat intensity are clustered around glaucoma-related layers, particularly the retinal nerve fiber layer. The presented visualization results underscore the adeptness of our proposed model in successfully capturing salient features from anatomical regions directly relevant to glaucoma, thus aligning our findings with the clinical observations.

Fig. 7. The heat-maps of various scan-pattern images from the proposed model. The heighted regions in the heat-maps make significant contribution to the decision-making process.

Download Full Size | PDF

In addition, as opposed to the model employing the average fusion strategy, as shown in Fig. S1, the feature maps provide visual validation of the heightened diagnostic capability achieved by the models equipped with our proposed attention fusion modules. Specifically, the feature maps generated by the models with self-attention fusion modules display notably intensified colors in the regions affected by glaucomatous pathology, surpassing their counterparts employing the average fusion strategy. The regions of heightened activation depicted on the heatmaps serve as compelling evidence that affirms a positive contribution of proposed modules on the model’s decision-making process, ultimately resulting in enhanced diagnostic results.

5.2 Visualization of region contribution score

By incorporating the attention multi-region-fusion module into our proposed model, we enable the model to automatically derive weighted coefficients associated with individual regions, referred to as ‘region score’. The coefficient serves a dual purpose of optimizing the roles of distinct regional features in decision-making while simultaneously encoding the relative disparities across various regions. The distribution map of regional scores for all the samples, as shown in Fig. 8, illustrates a visual representation of the relative score across the three delineated regions. Notably, the ONH region emerges as the highest score, signifying its pivotal role in the target task, proved by the mean values of ONH (0.5046), middle (0.1360), and macular (0.3594) for glaucoma samples, as well as ONH (0.5541), middle (0.2051) and macular (0.2408) for normal samples. In addition, it is noteworthy that each of these three regions has been assigned significant scores, denoting the presence of glaucoma-related features within each distinct region. Therefore, it’s recommended to leverage multiple regional features when establishing glaucoma detection models. Nonetheless, the regional score within a given region exhibits variability across different samples, as shown in the bottom violin plots in Fig. 8, indicating the impracticality of manually assigning fixed values as the scores. Consequently, a self-learning strategy within our proposed model yields superior performance and benefits from dynamical regional score adaption according to the distinctive characteristics of various scenarios.

Fig. 8. The top thumbnail displays a scatter diagram showcasing the three region scores for both normal and glaucoma samples, with each axis corresponding to the region score for the macular, middle, and ONH regions, respectively. In further detail, the bottom two thumbnails depict the distribution of individual region scores for three regions within glaucoma and normal samples, providing a comprehensive comparative view of three distinct regions and two associated classes.

Download Full Size | PDF

In addition, the region score holds the potential for application in clinical practice. Since they can serve as a quantitative metric to decipher results, offering an interpretative and comparable value that enables ophthalmologists to gain meaningful insights from the results. Consequently, the region score offers an overview of the model’s decision-making across various regions, assisting ophthalmologists in prioritizing regions with elevated region scores, thus increasing diagnostic precision and overall efficiency in clinical practice.

6. Discussion

Currently, deep learning-based models have achieved promising performance in glaucoma detection by utilizing OCT images captured by a single scan pattern from a given region, notably the circular scan pattern with the advantages of computational efficiency. However, these models hold a high risk of misdiagnosis due to the potential for omitting valuable features in the remaining regions or scan patterns. To make full use of features from all available fundus regions, we proposed a multi-region and multi-scan-pattern fusion model to comprehensively exploit glaucoma-related features from various regions (macular, middle, and ONH), captured by diverse scan patterns (volume, circular, single-line, and radial scan patterns). Moreover, the proposed model employs two attention multi-scan-pattern fusion modules to fuse various scan-pattern features, one for integrating features from circular and radial scan patterns in the ONH region, and one for fusing features from radial and volume scan patterns in the macular region. In addition, the proposed model integrates multiple region-based features from three different regions by an attention multi-region fusion module, along with generating region contribution scores for distinct regions. Furthermore, our proposed model can be applied to different scenarios, including a single region and a single scan pattern. This transformation is achieved by directly assigning zero to missing images. Consequently, our proposed model is easily degenerated into a single-scan-pattern model or a single-region-based model.

Various features fusion modules exhibit efficacy across various tasks, yielding promising fusion results. For example, in autonomous driving tasks, Zhang et al proposed the AFTR model to fuse 3D point cloud and multi-view images features captured by LiDAR and cameras, ensuring an increasing consistency on multi-sensor data representations, and matching of dynamic scenes by cross-attention and self-attention mechanisms [67]. Moreover, in medical tasks, Liu et al proposed the DBMF model to segment colorectal polyps, by strategically fusing multi-scale local and global features with combined advantages of CNN and transformer, respectively, effectively addressing the challenges posed by complex scenes and tiny polyps [68]. Furthermore, in ophthalmic tasks, Wang et al proposed a two-modal fused model that integrates features from OCT images and fundus images, yielding a promising performance in prediction of multiple fundus diseases [69]. However, a notable limitation is the lack of interpretability for individual modalities, specifically the distinct contributions of different modal features to predictions. This issue is critical in medical tasks, where interpretability stands as a pivotal consideration.

In pursuit of this goal, instead of using various convolution or advanced feature extractor modules to obtain complex fused features that may be challenging for human comprehension, our AMRFM module only incorporates fully connection layers. This strategic design aims to discern and encapsulate the importance of different regional features by self-attention mechanism during predictive processes. In addition, compared to most methods employing channel-attention or spatial-attention with a feature map or local region within a feature map [35,55], our model is founded upon multiple feature maps, wherein features extracted from OCT images captured by multi-scan-patterns within the same fundus region are regarded as an integrated whole. In this way, we effectively mitigate information loss and concurrently promote the acquisition of interpretability weights, which serve as a mapping that delineates the importance of predictions associated with the corresponding fundus regions.

To address the issue of the lack of specific datasets, we have collected a comprehensive dataset for this task, called MRMSG-OCT, with OCT images from multiple regions and various scan patterns. We conducted extensive experiences on the dataset, the experimental results demonstrate the proposed multi-region and multi-scan-pattern fusion model achieves superior performance than all the single scan-pattern-based models, the single region-based models, and the two-region-based models, which clearly indicates the effectiveness of our proposed model and highlights the significance of all available fundus regions and various scan patterns in the diagnosis of glaucoma. Furthermore, the two-region-based model, utilizing region-based features from ONH and macular regions, yields the optimal performance among all the two-region-based models, achieving significant enhancement over the single region-based model and single scan-based model. Moreover, the single-region-based model, utilizing images captured by circular and radial scan patterns from the ONH region, outperforms other single-region models and all the single-scan-pattern models, affirming ONH’s role in glaucoma diagnosis. In addition, single-region-based models, using OCT images from the remaining regions, demonstrate discriminate value in glaucoma diagnosis over the single-scan-pattern-based models, proving the effectiveness of OCT images captured by various scan patterns within a given region. Specifically, compared with the models using the average fusion strategy, the models employing the novel attention fusion modules consistently yield superior performance, particularly reversing the declining trend observed in the performance of the models using the average fusion models, demonstrating the effectiveness of the attention fusion modules.

Moreover, our proposed model exhibits enhanced interpretability by generating region scores, which correspond to the contributions of various regions in the decision-making process, aiding ophthalmologists in the prioritization of the region with an elevated region score, and enhancing diagnostic precision and efficiency. Furthermore, we implement the visual representation of feature maps for various models by heat-maps, which explicitly underscores the effectiveness of our proposed model in capturing glaucoma-related features over the single-scan-pattern models, the single-region-pattern models, and the models using average fusion strategy.

While our proposed model offers several advantages, it is hindered by several inherent limitations. The first constraint stems from the substantial cost of collecting and annotating images from multiple regions for each sample. Moreover, the presence of the imbalanced OD-OS within our dataset may introduce the potential risk of biased issue. Only one expert gives the annotations for our dataset, there exists a potential susceptibility to errors in our annotations, although the expert conducted a comprehensive double-checked of all the samples relying on clinical evidence. Additionally, the large amount of data for each sample imposes a significant computation burden to establish a model, the small size and distorted ratio of height-width of images may lead to loss some information. Moreover, the absence of detailed annotations of glaucoma stages, as roughly categorized as normal or glaucoma in our dataset, represents another constraint. Regrettably, our proposed model shows limited capacity to offer comprehensive diagnostic insights, especially with regard to pathology, thus diminishing its clinical value. Unfortunately, no publicly available external data to verify the generalization of our proposed model remains a limitation.

In response to the above limitations, our future endeavor aims to enhance the interpretability of the model, involving improving its visualizations to facilitate the generation of clinical explanations for deciphering predicted results. In addition, we intend to concentrate our efforts on the enhancement of annotations quality for our dataset by incorporating detailed glaucoma stage annotations and collecting OS samples to obtain an OD-OS balanced dataset, enabling a further comprehensive assessment of our proposed model. Furthermore, we hope the proposed model can be extended to various fundus diseases, thereby enhancing its clinical value.

7. Conclusion

In this paper, to avoid the risk of missing features by utilizing OCT images from a single scan pattern, we proposed a novel multi-region multi-scan-pattern fusion model for glaucoma detection to exploit features from all available regions. Moreover, we proposed an attention multi-scan-pattern fusion module and attention multi-region fusion module to integrate comprehensive features captured by four scan patterns from ONH, middle, and macular regions, auto-assigning dynamic weights to different features and regional features according to their roles in target diagnosis, respectively. Moreover, we have collected a specific dataset composed of OCT images captured by four scan patterns from three fundus regions. The experimental results and visualized figures both demonstrate the superior performance of our proposed model compared with the single scan-pattern-based models, the single region-based models, the two-region-based models, and the models using the average fusion strategy. Therefore, the proposed model exhibits promising suitability for deployment in clinical practice to mitigate the risk of misdiagnosis. We hope our proposed model can be extended to other fundus-related pathologies.

Funding

National Natural Science Foundation of China (Grant 61301005); in part by the University Synergy Innovation Program of Anhui Province (Grant GXXT-2019-044); Beijing Municipal Natural Science Foundation (Z200024)

Disclosures

Kai Liu and Jicong Zhang declare no conflict of interest.

Data availability

The data underlying the results presented in this paper are not publicly at this time but can be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. Y.-C. Tham, X. Li, T.Y. Wong, et al., “Global prevalence of glaucoma and projections of glaucoma burden through 2040 a systematic review and meta-analysis,” Ophthalmology 121(11), 2081–2090 (2014). [CrossRef]

2. D.C. Hood, A.S. Raza, C.G.V. de Moraes, et al., “Glaucomatous damage of the macula,” Prog. Retin. Eye Res. 32, 1–21 (2013). [CrossRef]

3. R.N. Weinreb, T. Aung, and F.A. Medeiros, “The pathophysiology and treatment of glaucoma: a review,” JAMA 311(18), 1901–1911 (2014). [CrossRef]

4. A. Azuara-Blanco, A. Tuulonen, and A. King, “Glaucoma,” BMJ 346, 3518 (2013). [CrossRef]

5. I.I. Bussel, G. Wollstein, and J. S. Schuman, “OCT for glaucoma diagnosis, screening and detection of glaucoma progression,” Br. J. Ophthalmol. 98, ii15–ii19 (2014). [CrossRef]

6. A. Grzybowski and A. Harris, “Primary open-angle glaucoma,” N. Engl. J. Med. 360, 2679–2680 (2009). [CrossRef]

7. M.D. Abràmoff, M.K. Garvin, and M. Sonka, “Retinal imaging and image analysis,” IEEE Rev. Biomed. Eng. 3, 169–208 (2010). [CrossRef]

8. D. Huang, E.A. Swanson, C.P. Lin, et al., “Optical coherence tomography,” Science 254(5035), 1178 (1991). [CrossRef]

9. B. Zitová and J. Flusser, “Image registration methods: a survey,” Image Vis. Comput. 21(11), 977–1000 (2003). [CrossRef]

10. H. Fu, M. Baskaran, Y. Xu, et al., “A deep learning system for automated angle-closure detection in anterior segment optical coherence tomography images,” Am. J. Ophthalmol. 203, 37–45 (2019). [CrossRef]

11. L. Fang, D. Cunefare, C. Wang, et al., “Automatic segmentation of nine retinal layer boundaries in OCT images of non-exudative AMD patients using deep learning and graph search,” Biomed. Opt. Express 8(5), 2732 (2017). [CrossRef]

12. T. Li, W. Bo, C. Hu, et al., “Applications of deep learning in fundus images: A review,” Méd. Image Anal. 69, 101971 (2021). [CrossRef]

13. A.R. Ran, C.C. Tham, P.P. Chan, et al., “Deep learning in glaucoma with optical coherence tomography: a review,” Eye. 35(1), 188–201 (2021). [CrossRef]

14. A.J. Tatham and F.A. Medeiros, “Detecting structural progression in glaucoma with optical coherence tomography,” Ophthalmology. 124(12), S57–S65 (2017). [CrossRef]

15. M.T. Leite, L.M. Zangwill, R.N. Weinreb, et al., “Structure-function relationships using the cirrus spectral domain optical coherence tomograph and standard automated perimetry,” J. Glaucoma. 21(1), 49–54 (2012). [CrossRef]

16. N. Anantrasirichai, A. Achim, J. Morgan, et al., “SVM-based texture classification in optical coherence tomography,” 2013 IEEE 10th International Symposium On Biomedical Imaging (ISBI) (2013) 1332–1335.

17. C.-W. Wu, H.-Y. Chen, J.-Y. Chen, et al., “Glaucoma detection using support vector machine method based on spectralis OCT,” Diagnostics 12(2), 391 (2022). [CrossRef]

18. H. Chen, M. Huang, and P. Hung, “Logistic regression analysis for glaucoma diagnosis using stratus optical coherence tomography,” Opt. Vis. Sci. 83(7), 527–534 (2006). [CrossRef]

19. G.M. Richter, X. Zhang, O. Tan, et al., A. I. for G. S. Group, “Regression analysis of optical coherence tomography disc variables for glaucoma diagnosis,” J. Glaucoma 25(8), 634–642 (2016). [CrossRef]

20. G. García, R. del Amor, A. Colomer, et al., “Glaucoma detection from raw circumpapillary OCT images using fully convolutional neural networks,” 2020 IEEE International Conference On Image Processing (ICIP). (2020) 2526–2530.

21. G. García, R. del Amor, A. Colomer, et al., “Circumpapillary OCT-focused hybrid learning for glaucoma grading using tailored prototypical neural networks,” Artif. Intell. Med. 118, 102132 (2021). [CrossRef]

22. S. Hashemabad, M. Eslami, M. Shi, et al., “A graph neural network-based clustering method for glaucoma detection from OCT scans considering uncertainties in the number of clusters,” Invest. Ophthal. Vis. Sci. 64(9), PB001 (2023).

23. D. Song, B. Fu, F. Li, et al., “Deep relation transformer for diagnosing glaucoma with optical coherence tomography and visual field function,” IEEE Trans. Med. Imaging. 40(9), 2392–2402 (2021). [CrossRef]

24. T. Spaide, Y. Bagdasarova, C. Lee, et al., “Leveraging unlabeled OCT data for training better deep learning vision-transformer models,” Invest. Ophthal. Vis. Sci. 63(7), 2327 (2022).

25. E. Lebed, S. Lee, M.V. Sarunic, et al., “Rapid radial optical coherence tomography image acquisition,” J. Biomed. Opt. 18(3), 036004 (2013). [CrossRef]

26. M. Loureiro, J. Vianna, V. Danthurebandara, et al., “Visibility of optic nerve head structures with spectral-domain and swept-source optical coherence tomography,” Journal of glaucoma 26(9), 792–797 (2017). [CrossRef]

27. L. Mendoza, M. Christopher, N. Brye, et al., “Deep learning predicts demographic and clinical characteristics from optic nerve head OCT circle and radial scans,” Invest. Ophthal. Vis. Sci. 62(8), 2120 (2021).

28. A. Geevarghese, G. Wollstein, H. Ishikawa, et al., “Optical coherence tomography and glaucoma,” Annual Review of Vision Science 7(1), 693–726 (2021). [CrossRef]

29. A.G. Roy, S. Conjeti, S.P.K. Karri, et al., “ReLayNet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks,” Biomed. Opt. Express 8(8), 3627 (2017). [CrossRef]

30. M. Young, E. Lebed, Y. Jian, et al., “Real-time high-speed volumetric imaging using compressive sampling optical coherence tomography,” Biomed. Opt. Express 2(9), 2690–2697 (2011). [CrossRef]

31. S. Pi, T. Hormel, B. Wang, et al., “Volume-based, layer-independent, disease-agnostic detection of abnormal retinal reflectivity, nonperfusion, and neovascularization using structural and angiographic OCT,” Biomed. Opt. Express 13(9), 4889–4906 (2022). [CrossRef]

32. Y. George, B.J. Antony, H. Ishikawa, et al., “Attention-guided 3D-CNN framework for glaucoma detection and structural-functional association using volumetric images,” IEEE J. Biomed. Health Inform. 24(12), 3421–3430 (2019). [CrossRef]

33. Y. George, B. Antony, H. Ishikawa, et al., “Understanding deep learning decision for glaucoma detection using 3D volumes,” Invest. Ophthal. Vis. Sci. 61(7), 2022 (2020).

34. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances In Neural Information Processing Systems30, 30 (2017).

35. M. Guo, T. Xu, J. Liu, et al., “Attention mechanisms in computer vision: A survey,” Computational Visual Media. 8(3), 331–368 (2022). [CrossRef]

36. F. Wang, M. Jiang, C. Qian, et al., “Residual attention network for image classification,” 2017 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) (2017) 6450–6458.

37. W. Wang, S. Zhao, J. Shen, et al., “Salient object detection with pyramid attention and salient edges,” 2019 IEEECVF Conf. Comput. Vis. Pattern Recognit. (CVPR) (2019) 1448–1457.

38. J. Fu, J. Liu, H. Tian, et al., “Dual attention network for scene segmentation,” 2019 IEEECVF Conf. Comput. Vis. Pattern Recognit. (CVPR). (2019) 3141–3149.

39. X. Zhu, D. Chen, Z. Zhang, et al., “an empirical study of spatial attention mechanisms in deep networks,” 2019 IEEECVF Int. Conf. Comput. Vis. (ICCV) (2019) 6687–6696.

40. X. Liu, L. Li, F. Liu, et al., “GAFnet: group attention fusion network for PAN and MS image high-resolution classification,” IEEE Trans. Cybern. 52(10), 10556–10569 (2022). [CrossRef]

41. H. Chen and Z. Shi, “A spatial-temporal attention-based method and a new dataset for remote sensing image change detection,” Remote Sens. 12(10), 1662 (2020). [CrossRef]

42. J. Yu, Z. Lin, J. Yang, et al., “Generative image inpainting with contextual attention,” 2018 IEEECVF Conf. Comput. Vis. Pattern Recognit. (2018) 5505–5514.

43. C. Cao, Q. Dong, and Y. Fu, “Learning prior feature and attention enhanced image inpainting,” Computer Vision - ECCV 2022, PT XV. 13675, (2022) 306–322.

44. Z. Chen, J. Li, H. Liu, et al., “Learning multi-scale features for speech emotion recognition with connection attention mechanism,” Expert Syst. Appl. 214, 118943 (2023). [CrossRef]

45. J. Chorowski, D. Bahdanau, D. Serdyuk, et al., “Attention-based models for speech recognition,” Advances In Neural Information Processing Systems 28, 577–585 (2015). [CrossRef]

46. I. Misra, R. Girdhar, and A. Joulin, “An end-to-end transformer model for 3D object detection,” 2021 IEEE/CVF International Conference On Computer Vision (ICCV 2021). (2021) 2886–2897.

47. K. Min, G. Lee, and S. Lee, “Attentional feature pyramid network for small object detection,” Neural Networks. 155, 439–450 (2022). [CrossRef]

48. A. Lin, B. Chen, J. Xu, et al., “DS-TransUNet: Dual swin transformer U-net for medical image segmentation,” IEEE Trans. Instrum. Meas. 71, 4005615 (2022). [CrossRef]

49. R. Gu, G. Wang, T. Song, et al., “CA-Net: comprehensive attention convolutional neural networks for explainable medical image segmentation,” IEEE Trans. Med. Imaging 40(2), 699–711 (2021). [CrossRef]

50. Z. Han, B. Wei, Y. Hong, et al., “Accurate screening of COVID-19 using attention-based deep 3D multiple instance learning,” IEEE Trans. Med. Imaging 39(8), 2584–2594 (2020). [CrossRef]

51. J. Schlemper, O. Oktay, M. Schaap, et al., “Attention gated networks: Learning to leverage salient regions in medical images,” Medical Image Analysis 53, 197–207 (2019). [CrossRef]

52. Y. Tong, Y. Liu, M. Zhao, et al., “Improved U-net MALF model for lesion segmentation in breast ultrasound images,” Biomedical Signal Processing and Control 68, 102721 (2021). [CrossRef]

53. P. Muruganantham and S. Balakrishnan, “Attention aware deep learning model for wireless capsule endoscopy lesion classification and localization,” J. Med. Biol. Eng. 42(2), 157–168 (2022). [CrossRef]

54. S. Ding, Z. Wu, Y. Zheng, et al., “Deep attention branch networks for skin lesion classification,” Computer Methods and Programs in Biomedicine 212, 106447 (2021). [CrossRef]

55. A. Nagrani, S. Yang, A. Arnab, et al., “Attention bottlenecks for multimodal fusion,” Advances In Neural Information Processing Systems 34 (2021).

56. Z. Liu, H. Yin, Y. Chai, et al., “A novel approach for multimodal medical image fusion,” Expert Systems with Applications 41(16), 7425–7435 (2014). [CrossRef]

57. C. Wang, R. Nie, J. Cao, et al., “IGNFusion: an unsupervised information gate network for multimodal medical image fusion,” IEEE J. Sel. Top. Signal Process. 16(4), 854–868 (2022). [CrossRef]

58. C. Playout, R. Duval, M.C. Boucher, et al., “Focused Attention in Transformers for interpretable classification of retinal images,” Méd. Image Anal. 82, 102608 (2022). [CrossRef]

59. Y. Jin, J. Liu, Y. Liu, et al., “A novel interpretable method based on dual-level attentional deep neural network for actual multilabel arrhythmia detection,” IEEE Trans. Instrum. Meas. 71, 1–11 (2022). [CrossRef]

60. H. Lee, S. Yune, M. Mansouri, et al., “An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets,” Nat. Biomed. Eng. 3(3), 173–182 (2019). [CrossRef]

61. T. Lin, P. Goyal, R. Girshick, et al., “Focal loss for dense object detection,” 2017 IEEE International Conference On Computer Vision (ICCV) (2017) 2999–3007.

62. A. Danielyan, V. Katkovnik, and K. Egiazarian, “BM3D Frames and variational image deblurring,” IEEE Trans. on Image Process. 21(4), 1715–1728 (2012). [CrossRef]

63. A.M. Reza, “Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement,” J. VLSI Signal Process. Syst. Signal Image Video Technol. 38(1), 35–44 (2004). [CrossRef]

64. A. Paszke, S. Gross, F. Massa, et al., PyTorch: An Imperative Style, High-Performance Deep Learning Library, in: 2019: p. 8024.

65. R.R. Selvaraju, M. Cogswell, A. Das, et al., “Grad-CAM: visual explanations from deep networks via gradient-based localization,” 2017 IEEE Int. Conf. Comput. Vis. (ICCV) (2017) 618–626.

66. B. Zhou, A. Khosla, A. Lapedriza, et al., “Learning deep features for discriminative localization,” 2016 IEEE Conference On Computer Vision And Pattern Recognition (CVPR) (2016) 2921–2929.

67. Y. Zhang, K. Liu, H. Bao, et al., “AFTR: a robustness multi-sensor fusion model for 3D object detection based on adaptive fusion transformer,” Sensors 23(20), 8400 (2023). [CrossRef]

68. F. Liu, Z. Hua, J. Li, et al., “DBMF: dual branch multiscale feature fusion network for polyp segmentation,” Comput. Biol. Med. 151, 106304 (2022). [CrossRef]

69. X. He, Y. Deng, L. Fang, et al., “Multi-modal retinal image classification with modality-specific attention network,” IEEE Trans. Med. Imaging 40(6), 1591–1602 (2021). [CrossRef]

	Glaucoma	Normal	All
No.	136	1425	1561
Age	73.15 ± 8.71	65.91 ± 9.40	66.94 ± 9.64
Male:Female	75:61	645:780	720:841
OD:OS	122:14	1401:24	1523:38
RNFL	89.67 ± 30.04	108.82 ± 33.31	107.32 ± 33.47
Quality	22.97 ± 4.08	24.86 ± 3.93	24.60 ± 4.01
ART	53.14 ± 22.85	43.92 ± 18.88	44.65 ± 19.38
Region	Scan pattern	Image size (amounts × width × height)
ONH	Circular	1 × 768 × 496
ONH	Radial	6 × 512 × 496
Middle	Single-Line	1 × 1536 × 496
Macular	Radial	6 × 512 × 496
Macular	Volume	9 × 768 × 496

Single scan-pattern-based models	AUC	F1	MCC	G-Mean
Circular-ONH	0.7358 ± 0.0742	0.2511 ± 0.0575	0.1851 ± 0.0836	0.6583 ± 0.0677
Radial-ONH	0.7695 ± 0.0265	0.3039 ± 0.0449	0.2407 ± 0.0517	0.6691 ± 0.0366
Single-line-middle	0.6932 ± 0.0591	0.2141 ± 0.0199	0.1397 ± 0.0660	0.6004 ± 0.0549
Radial-macular	0.6990 ± 0.0861	0.2323 ± 0.0527	0.1518 ± 0.0920	0.6172 ± 0.0932
Volume-macular	0.7458 ± 0.0845	0.2944 ± 0.0965	0.2308 ± 0.1347	0.6650 ± 0.1237

Single region-based models	Fusion Strategy	AUC	F1	MCC	G-Mean
ONH-region-based model	Average	0.7898 ± 0.0231	0.2969 ± 0.0282	0.2563 ± 0.0477	0.7117 ± 0.0383
ONH-region-based model	AMSFM	0.7957 ± 0.0318	0.3146 ± 0.0513	0.2647 ± 0.0579	0.7083 ± 0.0344
Middle-region-based model	\	0.6932 ± 0.0591	0.2141 ± 0.0199	0.1397 ± 0.0660	0.6004 ± 0.0549
Macular-region-based model	Average	0.7474 ± 0.0668	0.2610 ± 0.0456	0.2099 ± 0.0718	0.6647 ± 0.0840
Macular-region-based model	AMSFM	0.7576 ± 0.0638	0.2820 ± 0.0580	0.2329 ± 0.0806	0.6895 ± 0.0714

Different Models	AUC	F1	MCC	G-Mean
Single scan-pattern-based model (Optimal)	0.7695 ± 0.0265	0.3039 ± 0.0449	0.2407 ± 0.0517	0.6691 ± 0.0366
Single region-based model (optimal)	0.7957 ± 0.0318	0.3146 ± 0.0513	0.2647 ± 0.0579	0.7083 ± 0.0344
Two-region-based model (optimal)	0.8197 ± 0.0626	0.3199 ± 0.0624	0.2808 ± 0.0842	0.7261 ± 0.0650
All-three-region-based model	0.8036 ± 0.0282	0.3498 ± 0.0684	0.3067 ± 0.0713	0.7312 ± 0.0326

Two-region-based models	Fusion Strategy	AUC	F1	MCC	G-Mean
Macular-ONH	Average	0.8112 ± 0.0380	0.3038 ± 0.0448	0.2677 ± 0.0562	0.7194 ± 0.0390
Macular-ONH	AMRFM	0.8197 ± 0.0626	0.3199 ± 0.0624	0.2808 ± 0.0842	0.7261 ± 0.0650
Macular-Middle	Average	0.7529 ± 0.0724	0.2765 ± 0.0765	0.2205 ± 0.1002	0.6074 ± 0.2055
Macular-Middle	AMRFM	0.7735 ± 0.0526	0.2976 ± 0.0499	0.2414 ± 0.0645	0.6960 ± 0.1086
ONH-Middle	Average	0.7896 ± 0.0480	0.2948 ± 0.0795	0.2395 ± 0.0847	0.6826 ± 0.0399
ONH-Middle	AMRFM	0.8118 ± 0.0351	0.3339 ± 0.0634	0.2902 ± 0.0684	0.7225 ± 0.0384

Metrics	Fusion Strategy	Fold 1	Fold 2	Fold 3	Fold 4	Fold 5	Mean ± std
AUC	Average	0.8163	0.7909	0.8196	0.7932	0.7496	0.7939 ± 0.0250
AUC	AMRFM	0.8148	0.7806	0.8493	0.8055	0.7682	0.8036 ± 0.0282
F1	Average	0.3129	0.2667	0.3129	0.2695	0.2699	0.2863 ± 0.0216
F1	AMRFM	0.3333	0.2734	0.4750	0.3579	0.3099	0.3498 ± 0.0684
MCC	Average	0.2849	0.2182	0.2956	0.2162	0.2352	0.2500 ± 0.0336
MCC	AMRFM	0.2835	0.2212	0.4376	0.3069	0.2847	0.3067 ± 0.0713
G-Mean	Average	0.7381	0.6878	0.7496	0.6849	0.6992	0.7119 ± 0.0267
G-Mean	AMRFM	0.7204	0.6885	0.7872	0.7190	0.7409	0.7312 ± 0.0326

Abstract

1. Introduction

1.2 Related works

1.2.1 OCT-based glaucoma detection models

1.2.2 Attention mechanism-based models

2. Methods

2.1 Overview of the proposed model

2.2 Attention multi-scan-pattern fusion module (AMSFM)

2.3 Attention multi-region fusion module (AMRFM)

2.4 Details of the proposed model

2.5 Training steps and objection function

3. Experiments

3.1 Dataset

3.1.1 Data acquisition and data approval

3.1.2 Image quality control and data annotation

3.1.3 Overview of dataset

3.1.4 Data preprocess

3.2 Experimental configuration

3.3 Evaluation metrics

4. Experimental results

4.1 Results of the single scan-pattern-based models

4.2 Results of single region-based models

4.2.1 Results of single region-based models employing average fusion strategy

4.2.2 Results of single region-based models employing AMSFM module

4.3 Results of multi-region and multi-scan-pattern-based models

4.3.1 Results of the proposed models employing average fusion strategy

4.3.2 Results of the proposed models employing AMRFM module

5. Visualization

5.1 Visualization of feature maps for glaucomatous optic neuropathy

5.2 Visualization of region contribution score

6. Discussion

7. Conclusion

Funding

Disclosures

Data availability

Supplemental document

References

Supplementary Material (1)

Data availability

Cited By

Figures (8)

Tables (6)

Equations (8)

Biomedical Optics Express