
GSNet: a multi-class 3D attention-based hybrid glioma segmentation network

Open Access

Abstract

In modern neuro-oncology, computer-aided biomedical image retrieval (CBIR) tools have gained significant popularity because they are quick and easy to use and offer high performance. However, designing such an automated tool remains challenging because of the lack of balanced resources and inconsistent spatial texture. As in many other fields of diagnosis, brain tumor (glioma) extraction has posed a challenge to the research community. In this article, we propose a robust segmentation network called GSNet for glioma segmentation. Unlike conventional 2-dimensional structures, GSNet deals directly with 3-dimensional (3D) data while utilizing attention-based skip links. The network is trained and validated using the BraTS 2020 dataset and further trained with the BraTS 2019 and BraTS 2018 datasets for comparison. On the BraTS 2020 dataset, our 3D network achieved overall dice similarity coefficients of 0.9239, 0.9103, and 0.8139 for the whole tumor, tumor core, and enhancing tumor classes, respectively. Our model produces consistently high scores and can cope with new data despite being trained on imbalanced datasets. In comparison with other articles, our model matches or surpasses several state-of-the-art scores, making it suitable as a reliable CBIR tool for medical use.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

1.1 Problem presentation

Glioma, a form of primary brain cancer, is caused by an abnormal proliferation of glial cells located in the cerebrum or the cerebellum. It is a malignant tumor and can be life-threatening if left undiagnosed and untreated. Over a 5-year span, the survival rate for brain tumor patients under the age of 15 is around 75%, whereas for people aged 15 to 39 it is 72% [1]. In 2020, around 308,102 individuals were identified as having a primary central nervous system (CNS) tumor [1]. This year, an estimated 24,810 individuals in the US will be diagnosed with primary malignant CNS tumors [2].

Segmentation is a preliminary technique for early detection and enhanced treatment options. It involves dividing an image into regions and assigning each of them a label or category based on its semantic meaning [3]. Glioma segmentation, like that of many other tumors, is a critical step in treatment planning which makes it possible to precisely assess and quantify tumor shape, size, and location, as well as to differentiate between various types of tissue. Despite the benefits of glioma segmentation, the procedure can pose challenges due to the following reasons [4]:

  • On medical images, glioma can appear heterogeneous, with a variety of intensities and textures across separate territories of the tumor. This makes discriminating the tumor from normal brain tissue difficult.
  • Glioma can differ in location and size, making accurate segmentation difficult. Large tumors might be difficult to distinguish from surrounding healthy brain tissue, whereas small tumors may be obscured by image noise or overlap with other structures in the brain.
  • Glioma segmentation can be a time-consuming and labor-intensive process, particularly when carried out manually by a radiologist or medical expert. Access to annotated medical images is critical to the precision of glioma segmentation algorithms.

Automated methods based on computer-aided biomedical image retrieval (CBIR) tools can be used to speed up and increase the efficiency of the segmentation process. The construction of these tools depends on computational capability. However, once properly constructed, the CBIR tool can aid in general usage irrespective of consumer hardware specifications. Section 1.2 discusses some of these computerized methods in more detail.

1.2 Literature review

Computerized segmentation strategies have gained increasing attention in recent years. Despite the difficulties and restrictions associated with traditional segmentation techniques, researchers have persisted in creating new strategies that are intended to generate precise and trustworthy results [5]. This section highlights some of the recent articles that demonstrated different types of computerized segmentation techniques.

A combination of the whale optimization algorithm and fuzzy c-means clustering to achieve optimal segmentation results is proposed by [6]. The authors, along with magnetic resonance imaging (MRI) images, used synthetic samples to validate their proposed system’s performance. They showed the effectiveness of their proposed strategy by introducing artificial noise in their image samples. Their approach gained a minimum mean squared error of $28.36$, $27.26$, and $25.09$ at $7{\% }$, $5{\% }$, and $3{\% }$ exposure to salt and pepper noise. An alternate hyper-heuristic-based approach that combines the benefits of various algorithms and techniques to deliver improved results is proposed by [7]. This method takes into account multiple objectives and constraints to produce a more accurate and efficient segmentation outcome. The authors addressed the limitations of meta-heuristic-based approaches to segmentation and divided their work into two parts. Firstly, they used the genetic algorithm (GA) to define the order of the meta-heuristic approach. Then they deployed the meta-heuristic-based approach based on the sequence achieved from the GA. The authors compared their approach based on various performance metrics, namely, the structural similarity index, CPU time(s), peak signal-to-noise ratio (PSNR), and fitness functions. They conducted their experiment against a collection of original and hybrid meta-heuristic approaches and found promising scores. A modified watershed segmentation (MWS) algorithm was proposed by the authors of [8] to improve upon traditional watershed segmentation. They incorporated modifications that enhanced the algorithm’s ability to accurately segment brain regions of interest (ROIs). Before applying MWS, with the help of a Xilinx Virtex-5 FPGA, they processed their corresponding MRI images through a high-pass filter. This process removed the low-pitched noises and retained the high-pitched details. They also applied the enhanced canny edge detection algorithm to identify the boundaries of the brain regions, gaining $99.31{\% }$ accuracy. Another wavelet transform-based image segmentation method is proposed by [9], where the authors identified the threshold level for segmentation via valley point. The valley point threshold level selection aids in locating the best value for segmentation, which most accurately distinguishes the target object from the background. The authors compared their experiments with conventional segmentation methods namely, the maximum variance segmentation method, the bimodal segmentation method, and the valley threshold segmentation method. In addition to SNR and PSNR, they included the processing time and the value of threshold as evaluation metrics and showed their model’s overall performance accuracy.

In recent years, a significant number of neural network (NN)-based segmentation approaches were seen in various articles [3]. These encoder-decoder-based segmentation networks have gained popularity due to their capacity to handle complicated structures, process massive amounts of data, and shorten the time needed for pre-processing and attribute extraction. Segmentation networks are also capable of learning from the input and extracting features, which makes them perfect for coping with novel datasets and ambiguous structures.

One of the oldest and most common segmentation networks is UNet, first published as a research article in [10]. It contains a contracting path as a feature extractor and an expansive path as a feature generator, in which the information is downscaled and upscaled, respectively. The downsampler, usually referred to as the encoding path, uses several convolution and max-pooling layers to increase the number of feature maps while decreasing the overall spatial resolution of the input image. The upsampler, often referred to as the decoding path, is made up of several transposed convolution layers that reduce the number of feature maps while increasing the resolution. The upsampler’s fine details and the downsampler’s contextual information are combined to create a precise segmentation using the UNet architecture [10]. Variations of the UNet-like structure have been implemented by various researchers in different articles, where they proposed their own versions of the model modified for specific tasks. One such example is seen in [11], where the authors presented a 2-dimensional (2D) convolutional neural network (CNN)-based ensemble of networks for brain tumor segmentation. Initially, the authors extracted the whole tumor from the background using three segmentation networks. Then they processed their results using Growcut, a cellular automaton-based seed-growing technique. Lastly, they extracted the sub-regions using further ensembles of networks. They obtained dice similarity coefficient (DSC) scores in the range of $0.74-0.85$, which was comparable to other proposed approaches. They were unable to conduct experiments on 3-dimensional (3D) networks due to the high computational cost.

Several articles used various types of NN blocks to improve segmentation accuracy. One such example can be found in [12], where the authors used the pre-trained ResNet model [13] to modify their UNet’s encoder. The authors proposed a shared encoder-based structure, followed by separate decoders for separate classes. They reported favorable DSC scores but noted some complications throughout their experiment. They also stated that their model appeared unsuitable for practical use due to its large 3D size. To address some of the limitations of the conventional UNet, the authors of [14] proposed Sharp UNet, a novel UNet-type segmentation network. Sharp UNet improves on the traditional skip connection-based architecture by incorporating a depthwise convolution of the extracted feature map with a sharpening kernel filter, making the network more spatially sensitive and capable of recognizing minute details in the input image. Having tested their architecture on six different datasets, the authors obtained excellent validation results, and their model outperformed some of the most advanced baseline structures. Another version of the UNet-like model is seen in [15]. In it, the authors used bottleneck residual blocks and provided significant detail regarding the fine-tuning of their proposed network. They compared their work with the well-known segmentation network DeepLabV3+. Despite its 2D nature, their network achieved mean DSC scores of 0.8673, 0.7514, and 0.7983 on three separate classes of their utilized dataset. Owing to a greater number of enhancing areas, the authors faced some difficulties and the model did not perform equally well across all target classes. A similar but 3D approach was seen in [16] and [17], where the authors proposed UNet-type networks capable of dealing with 3D segments. Each voxel in the volume is classified as belonging to the object of interest or the background using the features that the 3D CNN layers of the networks learn to extract from the given data. In the case of [17], the topology of the segmentation network was more hybridized, with numerous learning modules cascaded one after the other to boost overall efficiency. A fundamentally distinct yet simple version of the segmentation network was proposed by [18], where the network itself was an improvement of the well-known VNet model [19].

The studies discussed above show that deep learning-based segmentation models are critical for tumor segmentation. When applied to the dataset utilized in this article, CNN-based segmentation architectures have produced promising results for glioma segmentation [20]. Furthermore, purely mathematical or algorithmic methods frequently lack the ability to re-train or adapt to newer data, making validation for the segmentation task less reliable. We therefore aim to contribute by proposing a new CNN-based network for extracting ROIs from brain segments. Our focus is glioma segmentation, which accounts for a significant portion (33%) of all brain tumors [21].

2. Our contribution

We have developed a hybrid end-to-end multi-class 3D CNN-based segmentation network named GSNet (glioma segmentation network) for automating the segmentation process. Our network consists of 5 levels in the encoder and 5 levels in the decoder, forming a 5-stage segmentation network incorporating attention-based skip connections. Our simulation results demonstrate a significant improvement, with GSNet performing on par with, if not surpassing, some of the top-performing segmentation networks developed for glioma segmentation. The overall contribution of our work is stated below:

  • Creating an end-to-end segmentation network (GSNet) capable of segmenting glioma regions with high efficiency.
  • Developing a robust 3D network capable of pooling both low-level and high-level attributes from MRIs.
  • Producing an overall lightweight model that can perform quick segmentation while dealing directly with 3D images, despite training with significantly imbalanced data.
  • Achieving high accuracy without utilizing extensive image pre-processing, and obtaining performance scores high enough to establish GSNet as a state-of-the-art model.
  • Compressing the entire pipeline into a user-friendly GUI application, which can serve as an automated tool enabling practitioners and medical personnel to efficiently segment glioma ROIs.

The remaining sections are presented as follows:

Section 3 deals with the datasets and the methodologies utilized to carry out the experimentations. The simulation results are mentioned in section 4. This section is elaborated into multiple segments representing each dataset and also contains a comparison with other well-known methods involving the primary dataset. The article is finalized in section 5 with the conclusion.

3. Materials and methods

3.1 Datasets

We trained our model, GSNet, using the multimodal brain tumor image segmentation benchmark (BraTS) datasets [22]. We primarily used BraTS 2020 for both training and validation. We also evaluated GSNet on the BraTS 2019 and BraTS 2018 datasets, on which we further trained and validated the model for comparison. Each dataset comprises MRI scans of brain segments (glioma) with a voxel size of $1\times 1\times 1 \ \text {mm}^{3}$ and an original volume shape of $240\times 240\times 155$. The datasets include T1-weighted (t1), T1-weighted contrast-enhanced (t1ce), and T2-weighted (t2) scans, as well as fluid attenuation inversion recovery (FLAIR) images. The datasets also contain annotated ground truth volumes, labeled using four values (0, 1, 2, 4), where 0 denotes no tumor and the remaining values denote different tumor regions. For segmentation, our target classes were tumor core (TC) (labels 1 and 4), whole tumor (WT) (labels 1, 2, and 4), and enhancing tumor core (ET) (label 4) [23]. Here, the WT regions represent the entire infected portion including the peritumoral edema, the TC regions represent the more aggressive parts, and the ET regions highlight a subset of the core indicating increased cell density [22]. However, the class distribution of the datasets is imbalanced. Each sample contains four images and a segmentation mask file, stored in the “.nii” file format. The datasets were collected under various clinical protocols covering 19 institutions, and the images were segmented manually and approved by experts [23]. The process was laborious, and the slices were carefully revised for proper intensity, texture, and various other morphological parameters [22]. For training GSNet, we used 176 samples from the “BraTS2020_TrainingData” folder, which is less than the total of 369 samples in the BraTS 2020 dataset. The train-test split was initially selected at random using a 5-fold stratified k-fold approach [24]. One of the five folds, containing 44 samples, was kept separate for validation. This distribution was then kept fixed for the remainder of our experiments. The BraTS 2019 dataset consists of 259 samples of high-grade glioma (HGG) patients’ MRI scans and 76 samples of low-grade glioma (LGG) patients’ MRI scans. The BraTS 2018 dataset contains 210 HGG and 75 LGG patients’ MRI scans. We only used the HGG patients’ MRI samples from the 2019 and 2018 datasets.
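As a concrete illustration of this label grouping, the short sketch below derives the three binary target masks from a BraTS-style label volume; the helper name and the use of NumPy are illustrative assumptions rather than part of the original pipeline.

```python
import numpy as np

def brats_labels_to_masks(seg: np.ndarray):
    """Convert a BraTS label volume (values 0, 1, 2, 4) into the three
    binary masks used as segmentation targets in this work."""
    wt = np.isin(seg, [1, 2, 4])   # whole tumor: every tumorous label
    tc = np.isin(seg, [1, 4])      # tumor core: necrotic core + enhancing tumor
    et = (seg == 4)                # enhancing tumor only
    return wt.astype(np.uint8), tc.astype(np.uint8), et.astype(np.uint8)
```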

3.2 Data pre-processing

To enhance the learning capability of our network and reduce computational costs, we applied center cropping to all images. Figure 1 illustrates the center cropping procedure, which involves selecting the middle region of each image as the training feed while discarding the outlying portions. This method focuses on the ROIs within the image, as the areas surrounding the object of interest often contain irrelevant information that could negatively impact the network’s performance. In our data, the MRI scans along with the tumor and its corresponding regions are all centered, so with the help of cropping the network learns only the essential features, leading to improved segmentation accuracy [25]. We cropped the images from their original size to $128\times 128\times 128$ resolution, resulting in a total of 2,097,152 voxels in each volume. This size strikes a balance between reducing the computational burden and preserving important image details. Additionally, the process helped us to train our network with limited GPU resources. The right-most column in Fig. 1 displays the ground truth for all three classes (WT: cyan, TC: navy blue, ET: magenta). Both the images and their masks are cropped in the same way.
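A minimal sketch of such a center crop is shown below, assuming the volumes are loaded as NumPy arrays with nibabel; the helper name and file path are hypothetical.

```python
import numpy as np
import nibabel as nib

def center_crop(volume: np.ndarray, target=(128, 128, 128)) -> np.ndarray:
    """Crop a 3D volume symmetrically around its center to the target shape."""
    slices = []
    for dim, tgt in zip(volume.shape, target):
        start = (dim - tgt) // 2
        slices.append(slice(start, start + tgt))
    return volume[tuple(slices)]

# Example: crop one modality of a BraTS sample from 240x240x155 to 128x128x128.
flair = nib.load("BraTS20_Training_140_flair.nii").get_fdata()  # hypothetical path
cropped = center_crop(flair)   # shape (128, 128, 128), i.e. 2,097,152 voxels
```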

Fig. 1. The $100^{th}$ slice of id ‘BraTS20_Training_140’. The top row displays the original slices and the bottom row displays the center-cropped versions.

3.3 Methodology

3.3.1 Network fundamentals

The feature learning process of our model, GSNet, is expanded using 3D-CNN, which consists of both depth and spatial dimensions. Equation (1) gives the general formula for 3D convolution, representing the calculation of the value at position $(x, y, z)$ in the $j^{th}$ feature cube of the $i^{th}$ convolutional layer [26].

$$Y_{i, j}^{x, y, z}=f\left(\sum_m \sum_{l=0}^{L_i-1} \sum_{w=0}^{W_i-1} \sum_{c=0}^{C_i-1} k_{i, j, m}^{l, w, c} Y_{(i-1), m}^{(x+l),(y+w),(z+c)}+b_{i, j}\right)$$

Here, let the number of kernels, the data shape, and the convolution kernel size in the 3D-CNN-based operation be $n$, $D\times L\times N\times C$, and $q\times q\times q$, respectively, where $D$ and $L$ represent the width and height of the tensor, $q$ denotes the coverage of the convolution kernel over the spectral dimension in each convolution operation, $N$ is the band number, and $C$ is the channel number. The feature map created by the 3D convolution then has dimension $(D - q + 1)\times(L - q + 1)\times(N - q + 1)\times n$ if padding is not used and the stride is kept at $1$. Because of the extra dimension, the attribute map generated by the equation contains more spectral information. In addition, 3D convolutional layers have more parameters than 2D ones, enabling more intricate and nuanced reconstructions of the input data [27]. We used padding to keep the input and output dimensions constant while using multi-scale 3D convolution kernels. The procedural outputs along the spectral dimension are concatenated to create the output attribute map of the first layer. The attributes are then subjected to a number of additional convolutional layers for further processing in order to integrate and abstract them. We have employed a particular pooling procedure called 3D max pooling (MaxPool3d) [28]. In the simplest case, if the function’s input size is $(B, C, D, L, W)$, the output size is $(B, C, D_{out}, L_{out}, W_{out})$, and the kernel size is $(pD, pL, pW)$, then the output at $(B_i, C_j, d, l, w)$ can be precisely calculated from Eq. (2).

$$\operatorname{out}\left(B_i, C_j, d, l, w\right)= \max _{p=0, \ldots, pD-1} \; \max _{m=0, \ldots, pL-1} \; \max _{n=0, \ldots, pW-1} \operatorname{input}\left(B_i, C_j, \text{stride}[0] \times d + p, \text{stride}[1] \times l + m, \text{stride}[2] \times w + n\right)$$

This layer helps our network to separate the input feature maps from a series of non-overlapping regions and their corresponding maximum value from each zone. It also decreases the geometrical size of the attribute maps and lowers the number of involved parameters and associated computational expenses.

We utilized 3D upsampling which is a type of layer used in CNN-based models for upsampling an input tensor [29]. The nearest-neighbor resampling technique is at the heart of the upsample formula used in our work which involves increasing the size of the input tensor by a specified factor along each dimension, by duplicating the values of the nearest neighboring pixel (Eq. (3)). We employed bilinear interpolation in the upsampling layer.

For an input of shape $\left(N, C, P_{in}\right)$, $\left(N, C, L_{in}, P_{in}\right)$, or $\left(N, C, D_{in}, L_{in}, P_{in}\right)$, the corresponding output has shape $\left(N, C, P_{\text{out}}\right)$, $\left(N, C, L_{\text{out}}, P_{\text{out}}\right)$, or $\left(N, C, D_{\text{out}}, L_{\text{out}}, P_{\text{out}}\right)$, where

$$\begin{aligned} D_{\text{out}} &= \left\lfloor D_{\text{in}} \times \text{scale\_factor} \right\rfloor, \\ L_{\text{out}} &= \left\lfloor L_{\text{in}} \times \text{scale\_factor} \right\rfloor, \\ P_{\text{out}} &= \left\lfloor P_{\text{in}} \times \text{scale\_factor} \right\rfloor \end{aligned}$$
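To make these shape rules concrete, the short check below (a minimal sketch using the PyTorch layers named in this section; the example tensor size is arbitrary) confirms that MaxPool3d with a $2\times 2\times 2$ kernel and stride halves each spatial dimension, while Upsample with scale_factor=2 in 'nearest' mode doubles it.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64, 64)                       # (B, C, D, L, W)

pooled = nn.MaxPool3d(kernel_size=2, stride=2)(x)
print(pooled.shape)                                      # torch.Size([1, 16, 32, 32, 32])

upsampled = nn.Upsample(scale_factor=2, mode="nearest")(pooled)
print(upsampled.shape)                                   # torch.Size([1, 16, 64, 64, 64])
```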
We also used instance normalization (InstanceNorm3d), a deep-learning normalization approach, for regularization. InstanceNorm3d normalizes the activations so that each example has zero mean and unit variance (Eq. (4)) [30]. In contrast to batch normalization, which normalizes across the entire batch, the InstanceNorm3d layer normalizes each instance in a mini-batch separately, so every instance has its own mean and variance; this aids in preventing covariate shift, reduces overfitting, and allows for more stable internal activations [30].
$$y=\gamma * \frac{x-Me[x]}{\sqrt{\operatorname{\textit{Var}}[x]+\epsilon}}+\beta$$

Here $x$ is the input tensor, $Me[x]$ is the mean and $Var[x]$ the variance computed over the instance, $y$ is the output tensor, $\gamma$ and $\beta$ are learnable scale and shift parameters, and $\epsilon$ is a small constant added to prevent division by zero.

Skip-connection is a design pattern used in deep neural networks to address the vanishing gradient problem [31]. The objective behind such a pattern is to cut through some of the network’s layers, giving the model more direct access to previous features and enabling gradients to flow back into the input through backpropagation more quickly and effectively [10]. In other words, the fine-grained attributes learned in prior layers are kept in the skip connections that are established between non-adjacent levels, allowing data to be transmitted from one layer to the other with ease.

We have employed an attention mechanism to construct our skip connections [32,33], originally introduced in [34]. The mechanism maps a query and a set of key-value pairs to an output, which is calculated as a weighted sum of the values. The weighting factors are computed from a compatibility function between the query and its corresponding key. Our implementation of an attention-based skip link is visualized in brief in Fig. 2. In our network, the inputs from both the encoder side and the decoder side are fed directly into the attention block which, as shown in Fig. 2, processes the inputs and creates an output that is later concatenated with the next convolution block of the decoder. The following equations describe the working principle of the attention layers. If the encoder creates $N$ hidden state vectors, each of dimension $r$, then the input to the feedforward layer has shape $(N, 2r)$. This input $I$ is multiplied by a matrix $W$ of shape $(2r, 1)$ and a bias term $B$ is added, producing the score $S$ of dimension $(N, 1)$ (Eq. (5)).

$$S = I_{[N \times 2r]} \, W_{[2r \times 1]} + B_{[N \times 1]}$$

The score $(S)$, is then fed through a tanh function which is later followed by a softmax activation function (Eq. (6)) to get the normalized alignment scores.

$$A = softmax(tanh(S))$$

Finally, the result A is multiplied by one of the input terms (IT) (Eq. (7)).

$$C= IT * A$$
We follow this mechanism to connect our encoder to our decoder through skip links. The feature maps from each level are weighted by the attention mechanism according to their importance, and the weighted features are then passed through the skip connection from an earlier layer of the network to a later one. This allows our network to focus on the most crucial elements of the feature map and helps improve the overall segmentation accuracy.
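The module below is a minimal PyTorch sketch of how such an attention-weighted skip link can be realized in the spirit of Eqs. (5)-(7) and Fig. 2. The 1×1×1 projection convolutions, the intermediate channel width, the interpolation of the decoder features to the encoder's spatial size, and the softmax taken over spatial positions are our assumptions for illustration; this is not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSkip(nn.Module):
    """Weights encoder features by a score built from encoder and decoder inputs,
    loosely following S = IW + B, A = softmax(tanh(S)), C = IT * A (Eqs. 5-7)."""
    def __init__(self, enc_ch: int, dec_ch: int, mid_ch: int):
        super().__init__()
        self.enc_proj = nn.Conv3d(enc_ch, mid_ch, kernel_size=1)
        self.dec_proj = nn.Conv3d(dec_ch, mid_ch, kernel_size=1)
        self.score = nn.Conv3d(mid_ch, 1, kernel_size=1)   # bias plays the role of B

    def forward(self, enc_feat, dec_feat):
        # Bring decoder features to the encoder's spatial size before mixing.
        dec_feat = F.interpolate(dec_feat, size=enc_feat.shape[2:], mode="nearest")
        s = self.score(torch.tanh(self.enc_proj(enc_feat) + self.dec_proj(dec_feat)))
        a = torch.softmax(s.flatten(2), dim=-1).reshape_as(s)   # normalized alignment map
        return enc_feat * a                                     # weighted skip features
```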

Fig. 2. Attention mechanism used in GSNet for the implementation of skip connection. The figure highlights the internals of the main network’s attention block.

To carry out the construction of the segmentation model, two different activation functions, namely the rectified linear unit (ReLU) and the softmax functions were used in addition to the Adamax optimizer. The ReLU activation function is a piecewise linear process that induces non-linearity into the network [35]. The ReLU function is defined as follows:

$$f(q)=max(0,q)$$

Here, q is the input to the ReLU activation function, which returns the given value if it is positive and zero if it is negative. Its use is supported by its simplicity, computational efficiency, and good performance on a wide range of tasks. The softmax activation function [36] is a popular function used in machine learning, especially in multi-class classification problems. It takes in a vector of real numbers and produces a probability distribution over Q classes. The softmax function is defined mathematically as follows:

$$y_i=\frac{e^{z_i}}{\sum_{j=1}^{Q} e^{z_j}}$$
where $z_i$ is the $i^{th}$ element of the input vector $z$, $e$ is the base of the natural logarithm, and the denominator guarantees that the sum of the probabilities is $1$.

3.3.2 Evaluation metrics

The dice similarity coefficient (DSC) is a similarity metric used to evaluate the performance of image segmentation algorithms. As shown in Eq. (10), the DSC is computed by multiplying the intersection of the ground truth and the segmented image by $2$ and dividing the result by the sum of their sizes [37].

$$DSC(p, \hat{p})=\frac{2|p \cap \hat{p}|}{|p|+|\hat{p}|}$$
$$IoU(p, \hat{p})=\frac{|p \cap \hat{p}|}{|p \cup \hat{p}|}$$

The Jaccard index or intersection over union (IoU) score, is also a similarity metric used to compare the diversity of sample sets. Like DSC, the intersection of the aforementioned sets is used, but in this case it is divided directly by their union (Eq. (11)) [37]. For both Eqs. (10) and (11), the ground truth segmentation is p, and the predicted segmentation is $\hat {p}$. Both the DSC and IoU scores range from 0 (complete mismatch) to 1 (perfect match).

The dice loss (DL) is a loss function calculated based on the DSC formula and is widely used in machine learning, particularly in image segmentation tasks, to measure the dissimilarity between predicted segmentation maps and true segmentation maps [38]. The formula for calculating the DL is given below (Eq. (12)).

$$D L(p, \hat{p})=1-\frac{2\, p\, \hat{p}+1}{p+\hat{p}+1}$$
$$FL(\hat{p}) ={-}\alpha(1 - \hat{p})^\gamma * log(\hat{p})$$
where the ground truth segmentation is p and the predicted segmentation is $\hat {p}$. Here, 1 is added to both the numerator and the denominator to ensure that the formula is not undefined in edge case scenarios such as when p = $\hat {p}$ = 0. The focal loss (FL) is a loss function that was initially introduced for classification tasks but can also be applied to segmentation tasks, where class imbalance is a common issue [39]. In the context of segmentation tasks, the primary objective is to label each pixel or voxel in an image with a corresponding class label, although some classes may have a small number of examples, leading to class imbalance. FL addresses this issue by assigning higher weights to the minority class examples and reducing the contribution of well-classified examples to the loss function. Equation (13) highlights the formula for calculating FL, where $\hat {p}$ is the predicted probability of the correct class, ${\alpha }$ is a weighting factor that gives more importance to the minority class, and $\gamma$ is a focusing parameter [39]. Sensitivity (Sen), as shown in Eq. (14), is a metric commonly used to evaluate the performance of image segmentation algorithms; it measures the proportion of true positive pixels that are correctly identified. Sen is defined as the ratio of true positives (TP) to the sum of TP and false negatives (FN) [40].
$$Sen =\frac{TP}{TP+FN}$$

A high Sen score indicates that the algorithm is good at detecting objects in the image, while a low Sen score means that many object pixels were missed or incorrectly identified as background.
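For reference, the snippet below gives a minimal tensor-based sketch of the quantities defined above (DSC, IoU, DL, FL, and Sen) for binary masks; the smoothing constants and the default values of $\alpha$ and $\gamma$ are our assumptions, not values reported in this article.

```python
import torch

def dice_score(p, p_hat, eps=1.0):
    inter = (p * p_hat).sum()
    return (2 * inter + eps) / (p.sum() + p_hat.sum() + eps)

def iou_score(p, p_hat, eps=1.0):
    inter = (p * p_hat).sum()
    union = p.sum() + p_hat.sum() - inter
    return (inter + eps) / (union + eps)

def dice_loss(p, p_hat):
    return 1.0 - dice_score(p, p_hat)

def focal_loss(p, p_hat, alpha=0.25, gamma=2.0, eps=1e-7):
    # p: binary ground truth, p_hat: predicted probability of the positive class
    p_hat = p_hat.clamp(eps, 1 - eps)
    pt = torch.where(p == 1, p_hat, 1 - p_hat)   # probability assigned to the true class
    return (-alpha * (1 - pt) ** gamma * torch.log(pt)).mean()

def sensitivity(p, p_hat_bin, eps=1e-7):
    tp = ((p == 1) & (p_hat_bin == 1)).sum().float()
    fn = ((p == 1) & (p_hat_bin == 0)).sum().float()
    return tp / (tp + fn + eps)
```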

3.3.3 Proposed GSNet model

Throughout this section, we highlight the constructional features of our segmentation network. We have devised a straightforward input-to-output workflow (see Fig. 3) that incorporates consistent pre-processing techniques, outlined in section 3.2. At the core of this pipeline is our proposed GSNet model, which manipulates the inputs and generates masks for ROI extraction. The network comprises several sequential convolutional blocks that have been carefully trained and optimized for heightened precision. Additionally, attention-based skip links have been integrated into the model to enhance contextual understanding between the internal layers.

Fig. 3. An end-to-end encoder-decoder-based concept which takes the entire 3D image for segmentation.

Traditionally, a segmentation network contains two separate parts, namely, the encoder and the decoder where the encoder obtains high-level attributes from the given images. It typically consists of multiple convolutional layers with pooling techniques to reduce the spatial range of the input data while increasing the number of attribute channels. The output of each layer can then be passed through an activation function to induce non-linearity in the model. The final output of the encoder is a set of activation maps that learns high-level semantic information about the input image [41]. The decoder is a part of the network that upsamples the feature maps with low resolution from the encoder to the original input resolution by utilizing a series of up-convolutions or transposed convolutions that gradually increase the spatial resolution of the feature maps. The final output of the decoder is a probability map that assigns a class label to each pixel in the input data, representing the predicted segmentation mask [41].

For our GSNet model, we have proposed a 5-level encoder and a corresponding 5-level decoder, resulting in a 5-stage segmentation network. Our encoder’s core is made up of several repeating convolutional blocks, each of which contains a 3D convolutional layer (Conv3d), a 3D instance normalization layer (InstanceNorm3d), a 3D dropout layer (Dropout3d), and a ReLU activation layer, in that order (Fig. 4(left)). Because this sequence is repeated twice in each of these blocks, we simply refer to it as Conv 2.

Fig. 4. (left) Unit convolutional block of the encoder (Conv 2). (right) An example of a sequential 2-level encoder network comprised of Conv 2 unit and MaxPool3d layer.

Each level of the encoder is separated by a MaxPool3d layer with a kernel size of $2\times 2\times 2$ and a stride of $2\times 2\times 2$ (Fig. 4(right)). This layer downsamples each feature map by a factor of $2$, and the entire arrangement of blocks is built in a way that facilitates the extraction of higher-level features and increases the network’s overall accuracy. The overall sequential multilevel combination shown in Fig. 4(right) is maintained throughout GSNet’s entire encoder. The topological parameters of the first encoder block are summarized in Table 1.

Table 1. The properties of the first encoder block of GSNet’s downsampler

The table highlights some of the key parameters: for instance, the convolution layers use a kernel of size $3\times 3\times 3$ and a stride of $1\times 1\times 1$, with the bias term $B$ ignored throughout the learning phase. This slight adjustment helps keep our network unbiased toward the training data. The entire table represents one encoder level; at each level the spatial size is scaled down by a factor of $2$ and the channel number is multiplied by $2$. For example, the final layer of this block creates a 16-channel output, so the next block takes a 16-channel input and produces a 32-channel output. The final output of the encoder at the corresponding level (5th block) is a 256-channel output.
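The sketch below assembles one encoder level along these lines (Conv3d with a $3\times 3\times 3$ kernel, stride 1, padding 1 and no bias, followed by InstanceNorm3d, Dropout3d, and ReLU, repeated twice, then $2\times 2\times 2$ max pooling). The dropout probability and the assumption of 4 input channels (one per MRI modality) are ours, not values stated in the article.

```python
import torch.nn as nn

def conv2_block(in_ch: int, out_ch: int, p_drop: float = 0.1) -> nn.Sequential:
    """One 'Conv 2' unit: (Conv3d -> InstanceNorm3d -> Dropout3d -> ReLU) x 2,
    with 3x3x3 kernels, stride 1, padding 1 and no bias, as in Table 1."""
    def unit(ic, oc):
        return [
            nn.Conv3d(ic, oc, kernel_size=3, stride=1, padding=1, bias=False),
            nn.InstanceNorm3d(oc),
            nn.Dropout3d(p_drop),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*unit(in_ch, out_ch), *unit(out_ch, out_ch))

# First encoder level: 4 input modalities -> 16 feature channels, then downsample by 2.
enc1 = nn.Sequential(conv2_block(4, 16), nn.MaxPool3d(kernel_size=2, stride=2))
```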

The decoder unit (Up Conv Block) of GSNet features an upsampling layer, followed by a Conv3d layer, an InstanceNorm3d layer, and a ReLU activation function in that order (Fig. 5(left)). To enable an upsampling procedure that adds context from higher levels, these blocks are stacked one on top of the other where the upsampling layer scales the data by a factor of $2$, and the ‘Nearest Neighbour’ mode is maintained in all of them. The functionality of these layers is mentioned in section 3.3.1.

Fig. 5. (left) Decoder unit (Up Conv Block) for the purpose of upsampling. (right) An example of a sequential 2-level decoder network with Conv 2 blocks.

The Up Conv Block is followed by a Conv 2 unit, and this whole sequence is repeated multiple times (around five times) to obtain higher-level features and reconstruct the corresponding ROIs; a sketch of one such step is given below. The sequence of the decoder was settled through repeated training, testing, and continuous error analysis, and it follows the structure displayed in Fig. 5(right), where the additional Conv 2 blocks regularize the attribute maps. The green arrows come from the encoder: they carry extra features that are concatenated with the upcoming Conv 2 blocks on the decoder side. The first decoder block takes 256 input channels, as its input comes directly from the encoder, and the channel count is continuously downscaled by a factor of 2. Similar to the encoder, Table 2 represents the properties of the last decoder block of our proposed GSNet model.
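A minimal sketch of one such decoder step follows: an Up Conv Block, channel-wise concatenation with the attention-weighted skip features, and a refining Conv 2 unit (reusing the conv2_block helper sketched for the encoder). The exact channel bookkeeping shown here is our reading of the text, not the published configuration.

```python
import torch
import torch.nn as nn

def up_conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Up Conv Block: nearest-neighbour upsampling by 2, then Conv3d,
    InstanceNorm3d and ReLU, as in Fig. 5 (left)."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.InstanceNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

# One decoder step: upsample 256-channel features to 128 channels, concatenate the
# attention-weighted 128-channel skip features, then refine with a Conv 2 unit.
up = up_conv_block(256, 128)
refine = conv2_block(256, 128)            # conv2_block from the encoder sketch above

def decoder_step(dec_feat, skip_feat):
    x = up(dec_feat)
    x = torch.cat([x, skip_feat], dim=1)  # channel-wise concatenation with the skip link
    return refine(x)
```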

Table 2. The properties of the final decoder block of the GSNet’s upsampler

The absolute encoder is serially connected to the decoder, and the connection between the levels is accomplished via attention-based skip connections (section 3.3.1 and Fig. 2). The final structure of GSNet is shown in Fig. 6.

Fig. 6. The proposed GSNet structure for glioma segmentation.

GSNet, with the help of skip connections, learns both low-level and high-level attributes. Traditional skip-connection methods often suffer from a lack of optimization, and some over-scaled skip-connection methods can also introduce gradient explosion [42]. The attention mechanism allows the decoder network to access information from the encoder network that is relevant to the upsampled features, resulting in a more accurate and detailed segmentation mask. Besides, the use of skip links helps to mitigate the issue of vanishing gradients, as it enables gradients to flow directly from the decoder to the encoder network [43]. The network contains a total of 6,493,291 parameters. The modification and building procedures were guided primarily by the DSC score, along with the IoU score, DL, and FL, and the outputs were continuously visualized after each iteration. We have provided the source code for our segmentation model at [44].

3.4 Hardware and training protocol

The training procedure uses the Python (version 3.7.6) programming language and the PyTorch framework (version 1.6.0). More specifically, the protocol involves the ‘torch’ module and runs on a cloud-based Ubuntu 16.04 operating system with 12GB of RAM. The kernel utilizes 2 CPU cores and an Nvidia P100 GPU with 16GB of GPU memory. GSNet is trained with a 0.0005 learning rate and the Adamax optimizer [45]. As with the construction of the network, the optimizer was chosen from a few trials, in which Adamax performed best. The remaining optimizer parameters are kept at their default values and no further fine-tuning is done. We also chose this low learning rate to reduce the risk of getting stuck in poor local minima [46]. The model is trained and validated for 250 epochs; beyond this point the model tends to overfit significantly, which is why training was limited to 250 epochs. The following sections highlight the simulation results.
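The function below sketches this training configuration (Adamax at a 0.0005 learning rate, dice plus focal supervision, 250 epochs); the model, data loader, and loss helpers are placeholders, and summing the two losses with equal weight is our assumption rather than a detail stated in the article.

```python
import torch

def train_gsnet(model, train_loader, dice_loss, focal_loss, epochs=250, lr=0.0005):
    """Training sketch: Adamax at lr=0.0005 with dice + focal supervision."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.Adamax(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        for images, masks in train_loader:       # images: (B, 4, 128, 128, 128)
            images, masks = images.to(device), masks.to(device)
            preds = model(images)                # per-class probability maps
            loss = dice_loss(masks, preds) + focal_loss(masks, preds)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```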

4. Results and discussion

4.1 BraTS 2020

As mentioned earlier in section 3.1, we used the BraTS 2020 dataset to construct our network, GSNet. Although only about half of it was utilized for training, the dataset also served for validation. Due to the imbalance among segment classes, avoiding a high degree of overfitting was a prevailing challenge. Based on extensive experimentation, we determined that the best UNet-type structure is produced by matching the number of decoder units to the number of encoder units. The skip connections were also modified based on the contextual levels of the encoder-to-decoder relationship, with the most accurate results obtained from a symmetrical encoder-to-decoder connection in which, at each level of the encoder, an output was generated and fed directly to an attention unit. An example of some of our earlier experiments is shown, where a simple 3-stage encoder-decoder structure (3-stage skeleton) (Fig. 7) was trained for 50 epochs.

Fig. 7. A comparison of the original ground truth and the predicted segmentation masks (id BraTS20_Training_029); the masks are predicted using a basic 3-stage encoder-decoder skeleton. Starting from the left, the first image batch (left) shows the original ground truth, the batch in the middle (middle) shows the predicted outcomes using the 3-stage skeleton without skip connections, and the batch at the right (right) shows the predicted outcomes using the skeleton with skip connections. (WT: cyan, TC: navy blue, ET: magenta)

As we can see from Fig. 7, each batch of 20 images contains multiple slices of one single 3D MRI, and the usage of attention-based skip connections provides a much better segmentation outcome. Analyzing the slices, we can observe that the outcome using the 3-stage skeleton with skip connections (Fig. 7(right)) is much more similar to the original ground truths (Fig. 7(left)). The batch produced using the 3-stage skeleton without skip connections is shown in Fig. 7(middle). It is obvious that the results are poorly produced and do not reflect the original slices, as in most of them the classes are only partially formed. Furthermore, the segmentation results exhibit an incomplete rendering of the target regions, with areas that have not been fully or accurately delineated. Figure 8 shows the DSC calculated for all 3 classes (WT, TC, ET) separately after utilizing the 3-stage skeleton. Although the scores for the WT class are quite similar, the usage of skip connections shows a significant difference for the TC and ET classes. A $6{\% }$ improvement in producing the TC regions and a $5{\% }$ improvement in producing the ET regions can be observed due to the usage of skip connections. It is therefore safe to say that our attention-based skip connections make GSNet more accurate.

Fig. 8. Coefficient score comparison between the usage of a 3-stage skeleton network with skip connection and without skip connection.

We further experimented on the size of our network’s structure, a highlight of which is summarized in Table 3, where we measured the effectiveness of different contextual pathway levels and found that a 3-stage structure lacked the ability to learn detailed attribute maps. Increasing the overall number of levels (7-stage structure) improved the learning capability, but the results were prone to overfitting due to the excessive use of Conv3d layers. As seen from the table, the gap between training and validation metrics is significantly high, with the training DSC almost $19{\% }$ higher than the validation DSC. After multiple modifications, we determined that a 5-stage contextual pathway structure is the most suitable, as it does not overfit to a significant extent and greatly improves validation accuracy. The final version of our model initially produces a slightly under-fitted score, as shown further in Fig. 9 and Fig. 10.

Fig. 9. The loss values obtained from training our proposed GSNet model for 250 epochs. (left: DL, right: FL)

Fig. 10. The evaluating scores obtained from training our proposed GSNet model for 250 epochs. (left: DSC, right: IoU)

Table 3. Comparison of loss values and coefficient scores between 3-stage, 5-stage, and 7-stage networks after 50 epochs of training

Figure 9 presents the loss curves from training our model with the BraTS 2020 dataset for 250 epochs, and the corresponding coefficient scores are visualized in Fig. 10. By utilizing DSC, we can effectively handle the discrepancy between the foreground and background [47], and with the help of FL it is easier to focus on subsets of hard examples, as it applies an additional modulating term to deal with class imbalance [39]. As observed from Fig. 9(left), the network produces a slightly lower validation loss compared to the training loss, suggesting that the model may be slightly underfitting, which is a positive indication. The focal loss for training and validation is almost identical, reflecting a proper learning pattern.

The coefficient scores reflect the robustness of the GSNet in accurately segmenting the target areas, demonstrating its superiority in this task. In the case of both DSC and IoU scores, the results convey a similar outcome where in some epochs, the training scores are lower and the validation scores are higher indicating a case of underfit. Again, throughout the entire training period, both the training and validation curves are very similar where the underfit is more visible in the case of the DSC curve.

After training for 250 epochs, our proposed GSNet scores around 0.90 in terms of DSC and over 0.80 in terms of IoU, both for validation. For a more detailed analysis, we calculated the individual performance scores for all 3 classes, visualized in Fig. 11. Generally, the class with the largest number of samples tends to create a biased network, but in our case, although WT does perform better than the rest, the scores for both TC and ET are also promising. Besides the DSC and IoU, we also measured the Sen scores for all three classes. As we can observe from Fig. 11, despite training with an imbalanced dataset, both the DSC and IoU scores are significantly high. The first three bars from the left represent the WT region segmentation scores, one of which is around $0.9239$ DSC, reflecting a good segmentation performance. Additionally, the $0.86$ IoU and the $0.91$ Sen score indicate a good overlap with a low false negative rate, meaning that GSNet is unlikely to miss a pixel belonging to the WT region. For both TC and ET, the Sen scores are quite high, while the IoU scores are comparatively average. Regardless, the DSC scores across all three classes are impressive, as GSNet produces $0.9103$ for TC and $0.8139$ for ET region segmentation. The Sen scores across all 3 regions (WT, TC, ET) indicate a high recall, ensuring the model’s capacity to identify the relevant pixels. These coefficient values indicate our network’s strong performance.

Fig. 11. Validation coefficient scores obtained for each separate class after 250 epochs of training (BraTS 2020 dataset).

Here, the BraTS 2020 dataset contains a very low number of ET regions among its MRI images, which is the likely reason for the ET performance score being relatively lower than that of the other two regions. Even so, the $0.8139$ score is significant and offers decent reliability. We note that all of these bar plots are measured from the portion of the dataset kept separate from training (as mentioned in section 3.1), so all these coefficient scores are validation scores, which again indicates that our proposed network is capable of dealing with new, unseen data.

A few examples from the BraTS 2020 dataset are shown below (Figs. 12, 13, and 14), where the masks are obtained using our proposed GSNet model. Here each batch of $20$ images represents multiple slices of one single MRI image, the corresponding Id of which is given on the left side of each comparison. The true segmentation masks are given in the dataset, and we compare them against the predicted outcomes from the GSNet model. The similarity between the two batches is noteworthy, as the predicted outcome is almost a replica of the original segmentation ground truth. The shapes of the corresponding classes are well constructed. In terms of the edges, especially in the WT region, all are well bounded, with few mismatches or fragmented artifacts. This is a very desirable outcome, conveying a very accurate segmentation.

Fig. 12. The predicted mask obtained through the GSNet model in comparison with the given ground truth. (WT: cyan, TC: navy blue, ET: magenta), (Patient Id: BraTS20_Training_108, BraTS20_Training_084, BraTS20_Training_056)

Fig. 13. The predicted mask obtained through the GSNet model in comparison with the given ground truth. (WT: cyan, TC: navy blue, ET: magenta), (Patient Id: BraTS20_Training_035, BraTS20_Training_055, BraTS20_Training_019)

Fig. 14. The predicted mask obtained through the GSNet model in comparison with the given ground truth. (WT: cyan, TC: navy blue, ET: magenta), (Patient Id: BraTS20_Training_012, BraTS20_Training_001)

We further fine-tuned and optimized the parameters of the network to achieve efficient glioma segmentation, forming a lightweight structure that is reliable and well suited for CBIR implementation. To validate the capability of our network, we conducted comparative tests with other datasets while maintaining a constant learning rate and optimizer (Adamax) (as mentioned in section 3.4) throughout the entire training process. Our results demonstrate consistently high efficiency across all 3 datasets, and the model produces an adequate validation score even in the presence of imbalanced data [48] (mentioned in section 3.1). The network performs well for all classes and produces an accurate semantic segmentation outcome that closely matches the original ground truth data provided by the datasets.

4.2 BraTS 2019 and BraTS 2018

Figures 15 and 16 present the overall learning performance of GSNet while training with the BraTS 2019 and BraTS 2018 datasets, respectively. As seen from both of these figures, the training and validation curves reflect a similar scenario to the BraTS 2020 curves discussed in the previous section.

Fig. 15. The results obtained from training our proposed GSNet model for 250 epochs using the BraTS 2019 dataset. (Training curve: blue, Validation curve: orange) (top left: DL, top right: FL, bottom left: DSC, bottom right: IoU)

Fig. 16. The results obtained from training our proposed GSNet model for 250 epochs using the BraTS 2018 dataset. (Training curve: blue, Validation curve: orange) (top left: DL, top right: FL, bottom left: DSC, bottom right: IoU)

Despite constructing our model with BraTS 2020, GSNet achieved significantly good results for both BraTS 2019 and BraTS 2018. For both of these datasets, around $0.90$ DSC scores are produced using GSNet, and the loss values are also reasonably low. So even though these are different datasets, GSNet can learn the proper attributes when trained on them, and, as with BraTS 2020, the validation scores are also significantly high. Table 4 presents the individual coefficient scores for all three classes and highlights the overall validation accuracy. As seen from the table, similar to BraTS 2020, the DSC scores of WT, TC, and ET in BraTS 2019 and BraTS 2018 are respectively $0.8977$, $0.8698$, $0.7907$, and $0.9048$, $0.8759$, $0.7956$.

Table 4. Coefficient scores obtained for each separate class after training with BraTS 2019 and BraTS 2018 datasets

Compared to WT and TC, the ET segmentation scores are low as expected from the imbalanced datasets, a similar case to that of BraTS 2020. These scores validate that our network is very capable of training with newer data and obtaining very high accuracy. In terms of Sen, the BraTS 2019 dataset shows greater performance whereas in terms of DSC, the BraTS 2018 dataset carries greater scores.

4.3 State-of-the-art comparison

Table 5 presents a comparison between the efficiency of our network and other well-known articles that trained their segmentation networks using the BraTS 2020 dataset. The articles highlighted in the table utilize almost the entire dataset and incorporate their own versions of NN-based algorithms. Most of these articles enhance their accuracy with the help of elaborate pre-processing and multi-level image-capturing techniques. However, our simulation results are mostly higher in comparison, since GSNet scores the highest among all the articles in terms of WT and TC class segmentation. In terms of ET, it is the second best result, after article [51], with only a $0.38{\% }$ difference. Some of these articles used multi-stage and multi-stacked UNet-type structures, for instance, articles [17] and [18]. Some of them even pre-processed the 3D images into 2D patches [50]. However, our network uses around half the dataset for training and still produces better scores without extensive pre-processing. Based on the DSC values obtained, it is safe to designate GSNet as a state-of-the-art segmentation network.

Table 5. DSC score comparison with various articles utilizing their versions of UNet with BraTS 2020 dataset

4.4 GSNet-based web app

To apply our model as a medical imaging tool, we have compressed the entire pipeline into an easy-to-use web-based application (web app) [52] which takes 3D MRI images in the ‘.nii’ file format as input. Given the FLAIR, t1, t1ce, and t2 MRI scans, our web app produces the corresponding segmentation masks for the WT, TC, and ET regions and saves them locally in the ‘.nii’ file format in under 20 seconds. Due to its lightweight size, the web app creates the segmentation masks very quickly. The pre-trained weight file used by this web app, created from training with the BraTS 2020 dataset, is around 24.7MB. The total size of the web app including the weight file is 36.8MB. The outputs, saved as ‘.nii’ files, are around 8MB each, or 24MB in total (3 outputs for 3 separate regions). We have deployed the web app on our local system, where it runs through a web browser at “http://127.0.0.1:5000/”. An example of using our web app is shown in Fig. 17. Figure 17(left) displays the input page where the FLAIR, t1, t1ce, and t2 MRI scans are provided. The web app creates the outputs and saves them in any local folder; the corresponding indication is shown (the red-marked lines at the bottom) on the next page of the web app (Fig. 17(right)). We have visualized the saved outputs using the software ITK-SNAP; an example of the saved outputs is shown below (Fig. 18) [53]. The full installation process along with the necessary files is provided at the GitHub link [54].
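Behind the interface, the app essentially performs the inference steps sketched below: load the four modalities, center-crop them, run GSNet, and save each predicted mask as a ‘.nii’ file. The file-naming scheme, the 0.5 threshold, and the WT/TC/ET channel order are assumptions made for illustration, not details taken from the released application.

```python
import numpy as np
import nibabel as nib
import torch

def segment_case(model, case_prefix, center_crop, out_dir="."):
    """Load four modalities, center-crop, run the model, and save WT/TC/ET masks."""
    modalities = ["flair", "t1", "t1ce", "t2"]
    ref = nib.load(f"{case_prefix}_flair.nii")                 # reference for the affine
    vols = [center_crop(nib.load(f"{case_prefix}_{m}.nii").get_fdata()) for m in modalities]
    x = torch.from_numpy(np.stack(vols)).unsqueeze(0).float()  # (1, 4, 128, 128, 128)
    with torch.no_grad():
        probs = model(x)                                       # (1, 3, 128, 128, 128): WT, TC, ET
    for i, name in enumerate(["WT", "TC", "ET"]):
        mask = (probs[0, i].cpu().numpy() > 0.5).astype(np.uint8)
        nib.save(nib.Nifti1Image(mask, ref.affine), f"{out_dir}/{name}_mask.nii")
```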

Fig. 17. Our GSNet-based web app. ((left) opening page, (right) result page)

Fig. 18. The regions for each class segmented using the masks saved through the web app. ((left) WT, (middle) TC, and (right) ET)

5. Conclusion

The process of segmentation, if carried out manually, can be very tiresome. Fortunately, CBIR tools have started to play a significant role in helping medical personnel. In a similar manner, our network GSNet can play a beneficial role in extracting glioma segments. The network shows very high efficiency despite dealing with 3D imbalanced multi-class data. With $0.9239$, $0.9103$, and $0.8139$ DSC scores respectively for the WT, TC, and ET regions, GSNet carries out a very accurate multi-class segmentation procedure, and the entire pipeline works directly on the given volumetric MRI images, as it does not require intermediate data manipulation. The current results, especially compared to the state-of-the-art articles, establish GSNet as a reliable segmentation model. The network outperforms existing conventional approaches with ease and produces multi-class glioma segments. The overall lightweight structure of the network makes the model quick to use, and the GSNet-based web app further verifies this by creating the segmentation masks in under 20 seconds.

In future work regarding this research, we would like to tackle some barriers. Despite being very efficient, our model has its limits. The network is constructed primarily for segmentation. However, given that our training data was well-suited, we cannot guarantee its performance on corrupted images. Perhaps, we can explore image restoration strategies to deal with degraded samples. In terms of class imbalance, although the network performed well, we purposefully did not integrate any extensive pre-processing methods. In the future, we can combine class weighting or image augmentation techniques to maintain the balance before training. Furthermore, due to computational limitations, we could not apply higher image dimensions for training. With the addition of more GPUs with higher RAM, we can explore higher resolutions in the future.

In the case of the web app, its primary purpose was to demonstrate GSNet’s capability. However, it is not yet suited for production-level tasks, since it lacks the necessary security measures. In the future, we will explore further privacy tools to construct the web app thoroughly. The GSNet model could also be integrated into a brain tumor patient survival model, which would first extract the tumors using the GSNet structure and then classify the ROIs into survived and non-survived patients. The necessary resources are also available from the BraTS 2020 dataset and can be utilized further for this survival prediction. Both machine learning and traditional mathematical approaches can be applied to predict the survival rate. Apart from this, the GSNet structure can be tested in segmenting other types of tumors and can be further fine-tuned for modified tumor-related diagnosis.

Disclosures

The authors declare no conflict of interest.

Data Availability

The data underlying the results presented in this paper are not publicly available at this time but may be obtained from authors upon reasonable request.

References

1. Brain tumor: Statistics. https://www.cancer.net/cancer-types/brain-tumor/statistics.

2. Key statistics for brain and spinal cord tumors. https://www.cancer.org/cancer/brain-spinal-cord-tumors-adults/about/key-statistics.html.

3. S. Minaee, Y. Boykov, F. Porikli, et al., “Image segmentation using deep learning: A survey,” IEEE Trans. Pattern Anal. Mach. Intell. 44(7), 3523–3542 (2021). [CrossRef]  

4. M. Havaei, A. Davy, D. Warde-Farley, et al., “Brain tumor segmentation with deep neural networks,” Med. Image Anal. 35, 18–31 (2017). [CrossRef]  

5. A. Işın, C. Direkoğlu, and M. Şah, “Review of MRI-based brain tumor image segmentation using deep learning methods,” Procedia Comput. Sci. 102, 317–324 (2016). [CrossRef]  

6. S. Tongbram, B. A. Shimray, L. S. Singh, et al., “A novel image segmentation approach using FCM and whale optimization algorithm,” J. Ambient Intell. Humanized Comput., 1–15 (2021).

7. M. Abd Elaziz, A. A. Ewees, and D. Oliva, “Hyper-heuristic method for multilevel thresholding image segmentation,” Expert Syst. Appl. 146, 113201 (2020). [CrossRef]  

8. V. Sivakumar and N. Janakiraman, “A novel method for segmenting brain tumor using modified watershed algorithm in MRI image with FPGA,” BioSystems 198, 104226 (2020). [CrossRef]  

9. J. Gao, B. Wang, Z. Wang, et al., “A wavelet transform-based image segmentation method,” Optik 208, 164123 (2020). [CrossRef]  

10. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, (Springer, 2015), pp. 234–241.

11. M. Lyksborg, O. Puonti, M. Agn, et al., “An ensemble of 2D convolutional neural networks for tumor segmentation,” in Image Analysis: 19th Scandinavian Conference, SCIA 2015, Copenhagen, Denmark, June 15-17, 2015. Proceedings 19, (Springer, 2015), pp. 201–211.

12. A. Saha, Y.-D. Zhang, and S. C. Satapathy, “Brain tumour segmentation with a multi-pathway ResNet based UNet,” J. Grid Computing 19(4), 43 (2021). [CrossRef]  

13. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), pp. 770–778.

14. H. Zunair and A. B. Hamza, “Sharp U-Net: Depthwise convolutional network for biomedical image segmentation,” Comput. Biol. Med. 136, 104699 (2021). [CrossRef]  

15. J. Colman, L. Zhang, W. Duan, et al., “DR-Unet104 for multimodal MRI brain tumor segmentation,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised Selected Papers, Part II 6, (Springer, 2021), pp. 410–419.

16. T. Henry, A. Carré, M. Lerousseau, et al., “Brain tumor segmentation with self-ensembled, deeply-supervised 3D U-net neural networks: A BraTS 2020 challenge solution,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised Selected Papers, Part I 6, (Springer, 2021), pp. 327–339.

17. H. Jia, W. Cai, H. Huang, et al., “H2NF-Net for brain tumor segmentation using multimodal MR imaging: 2nd place solution to brats challenge 2020 segmentation task,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised Selected Papers, Part II 6, (Springer, 2021), pp. 58–68.

18. W. Zhang, G. Yang, H. Huang, et al., “ME-Net: Multi-encoder net framework for brain tumor segmentation,” Int. J. Imaging Syst. Technol. 31(4), 1834–1848 (2021). [CrossRef]  

19. F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 2016 fourth international conference on 3D vision (3DV), (IEEE, 2016), pp. 565–571.

20. W. Wang, C. Chen, M. Ding, et al., “TransBTS: Multimodal brain tumor segmentation using transformer,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, (Springer, 2021), pp. 109–119.

21. What is a glioma? https://www.hopkinsmedicine.org/health/conditions-and-diseases/gliomas.

22. S. Bakas, H. Akbari, A. Sotiras, et al., “Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features,” Sci. Data 4(1), 170117 (2017). [CrossRef]  

23. Imaging data description. https://www.med.upenn.edu/cbica/brats2020/data.html.

24. M. Bhagat and B. Bakariya, “Implementation of logistic regression on diabetic dataset using train-test-split, k-fold and stratified k-fold approach,” Natl. Acad. Sci. Lett. 45(5), 401–404 (2022). [CrossRef]  

25. T.-W. Ke, A. S. Brewster, S. X. Yu, et al., “A convolutional neural network-based screening tool for X-ray serial crystallography,” J. Synchrotron Radiat. 25(3), 655–670 (2018). [CrossRef]  

26. F. Feng, S. Wang, C. Wang, et al., “Learning deep hierarchical spatial–spectral features for hyperspectral image classification based on residual 3D-2D CNN,” Sensors 19(23), 5276 (2019). [CrossRef]  

27. S. Ji, W. Xu, M. Yang, et al., “3D convolutional neural networks for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013). [CrossRef]  

28. R. Haldar, L. Wu, J. Xiong, et al., “A multi-perspective architecture for semantic code search,” arXiv:2005.06980 (2020). [CrossRef]  

29. I. Colbert, K. Kreutz-Delgado, and S. Das, “An energy-efficient edge computing paradigm for convolution-based image upsampling,” IEEE Access 9, 147967 (2021). [CrossRef]  

30. D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv:1607.08022 (2016). [CrossRef]  

31. M. Drozdzal, E. Vorontsov, G. Chartrand, et al., “The importance of skip connections in biomedical image segmentation,” in International Workshop on Deep Learning in Medical Image Analysis, International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, (Springer, 2016), pp. 179–187.

32. C. Li, Y. Tan, W. Chen, et al., “ANU-Net: Attention-based nested U-Net to exploit full resolution features for medical image segmentation,” Comput. & Graph. 90, 11–20 (2020). [CrossRef]  

33. W. Zhang, J. Li, and Z. Hua, “Attention-based tri-UNet for remote sensing image pan-sharpening,” IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing 14, 3719–3732 (2021). [CrossRef]  

34. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances in Neural Information Processing Systems 30, (2017).

35. H. Ide and T. Kurita, “Improvement of learning for CNN with ReLU activation by sparse regularization,” in 2017 International Joint Conference on Neural Networks (IJCNN), (IEEE, 2017), pp. 2684–2691.

36. S. Sharma, S. Sharma, and A. Athaiya, “Activation functions in neural networks,” Towards Data Science 6, 310–316 (2017).

37. T. Eelbode, J. Bertels, M. Berman, et al., “Optimization for medical image segmentation: Theory and practice when evaluating with dice score or Jaccard index,” IEEE Trans. Med. Imaging 39(11), 3679–3690 (2020). [CrossRef]  

38. S. Jadon, “A survey of loss functions for semantic segmentation,” in 2020 IEEE conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), (IEEE, 2020), pp. 1–7.

39. T.-Y. Lin, P. Goyal, R. Girshick, et al., “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, (IEEE, 2017), pp. 2980–2988.

40. A. W. Setiawan, “Image segmentation metrics in skin lesion: Accuracy, sensitivity, specificity, dice coefficient, Jaccard index, and Matthews correlation coefficient,” in 2020 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM), (IEEE, 2020), pp. 97–102.

41. H. Cao, Y. Wang, J. Chen, et al., “Swin-unet: Unet-like pure transformer for medical image segmentation,” in Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, (Springer, 2023), pp. 205–218.

42. F. Liu, X. Ren, Z. Zhang, et al., “Rethinking skip connection with layer normalization in transformers and ResNets,” arXiv:2105.07205 (2021). [CrossRef]  

43. T. Tong, G. Li, X. Liu, et al., “Image super-resolution using dense skip connections,” in Proceedings of the IEEE International Conference on Computer Vision, (IEEE, 2017), pp. 4799–4807.

44. GSNet structure GitHub link. https://github.com/006jawad/GSNet_/blob/main/GSNet.py.

45. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 (2014). [CrossRef]  

46. M. Gori and A. Tesi, “On the problem of local minima in backpropagation,” IEEE Trans. Pattern Anal. Machine Intell. 14(1), 76–86 (1992). [CrossRef]  

47. R. Zhao, B. Qian, X. Zhang, et al., “Rethinking dice loss for medical image segmentation,” in 2020 IEEE International Conference on Data Mining (ICDM), (IEEE, 2020), pp. 851–860.

48. A. M. Carrington, P. W. Fieguth, H. Qazi, et al., “A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms,” BMC Med. Inf. Decis. Making 20(1), 4–12 (2020). [CrossRef]  

49. P. Ahmad, S. Qamar, L. Shen, et al., “Context aware 3D UNet for brain tumor segmentation,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised Selected Papers, Part I 6, (Springer, 2021), pp. 207–218.

50. C. A. Silva, A. Pinto, S. Pereira, et al., “Multi-stage deep layer aggregation for brain tumor segmentation,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised Selected Papers, Part II 6, (Springer, 2021), pp. 179–188.

51. Y. Yuan, “Automatic brain tumor segmentation with scale attention network,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised Selected Papers, Part I 6, (Springer, 2021), pp. 285–294.

52. Our GSNet-based Web App. https://youtu.be/5vl5Yezn6C0.

53. Web App output. https://youtu.be/U09Ur23ldjM.

54. Web App installation. https://github.com/006jawad/GSNet_/tree/main/WebApp.





