Wavelet attention network for the segmentation of layer structures on OCT images

Cong Wang; Cong Wang; Meng Gan; Meng Gan

doi:10.1364/BOE.475272

1. Introduction

Optical coherence tomography (OCT) imaging offers a non-invasive way to detect lesions on the microscopic scale [1,2], which is of clinical potential in the early diagnosis and treatment of diseases on several organs, such as retina, esophagus and airway [3–7]. In clinical, diagnosis is made by analyzing certain tissues, which often show a layered structure. Therefore, accurate identification of layered structures on OCT images is crucial for related disease diagnosis. However, since OCT devices generate a large volume of images and are sometimes subject to specific confounders such as artifacts and mucus, analysis of OCT images requires experienced ophthalmologists or gastroenterologists and the process is time-consuming and subjective. As a result, automatical segmentation of OCT images is needed to quantify characteristics such as tissue thickness and shape, which is of great significance for the clinical application of OCT devices.

In the early work of OCT image segmentation, traditional image processing techniques like graph search were employed [8–12]. These methods rely on reasonable prior knowledge or hand-crafted features, which makes them less reliable in dealing with variation among images. In contrast, deep learning-based algorithms are more general since they can automatically learn complex features from the input in an end-to-end fashion. Such algorithms are gradually applied to image segmentation tasks, such as UNet [13], UNet++ [14] and ResNet [15]. In OCT image segmentation, deep learning based methods are also considered state-of-the-art. For example, Fang et al. segmented nine tissue layers in retinal OCT images based on the patch classification result of a convolutional neural network [16]. Devalla designed the DRUNET for optic nerve head tissue segmentation in OCT image [17]. Our group proposed an adversarial learned network to segment esophageal images from guinea pigs [18]. Despite the inspiring segmentation performance, they do not take into account the strong dependence between the near-pixel features and the weak dependence between the far-pixel features of the same layers. Consequently, the result often suffers from topological errors such as outlier predictions and disconnected regions [19].

Establishing effective feature representation to reveal the long-range dependence of different pixels is the key to preventing topological errors. Attention mechanism is a newly developed approach to solve the problem with several variants, such as spatial attention, channel attention and self-attention [20–22]. In our previous work, self-attention was employed for esophageal segmentation [19]. Despite achieving satisfactory segmentations, we found that the self-attention mechanism, especially the embedded position-attention module, requires large computational memory. On the contrary, the channel attention that directly attaches weights to different channels seems to be more suitable for OCT layer segmentation due to its simplicity and efficiency in feature modeling. In the widely-used channel attention network SENet [23], the authors employed the global average pooling (GAP) to calculate the importance weight. Nevertheless, the GAP only calculates the lowest frequency components of the input, which makes it hard to well capture complex information for various inputs.

Extracting information of different frequency bands from an image is a classic problem in the field of image processing. The specific implementation methods include Fourier transform, discrete cosine transform (DCT) and wavelet transform. In recent years, several studies have begun to introduce these frequency-domain processing methods into deep learning. The research of Ehrlich and Davis designed a convolutional neural network based on discrete cosine transform to directly classify JPEG compressed images, improving efficiency with little impact on accuracy [24]. Alijamaat et al. used Haar wavelet transform to extract the frequency domain features of MRI images, and based on this, they improved the diagnostic accuracy of specific brain diseases [25]. Qin et al. proposed the FcaNet that employs DCT to improve the performance of SENet and shows the results of insufficient utilization of image information of the conventional channel attention module [26]. Inspired by this work, Su et al. developed the CFCANet that alters the original attention structure and utilizes the complete frequency information of the low-frequency regions to enhance the features [27]. Using discrete wavelet transform (DWT) [28], Li et al. embedded the wavelet transform into the UNet architecture to achieve the purpose of down-sampling and up-sampling [29]. Zhao et al. proposed a multi-scale wavelet network combined with a bidirectional feature fusion and a wavelet-UNet for end-to-end pediatric echocardiographic segmentation [30]. It can be found that the DCT-based network generates numerous frequency bands, of which only a small part contributes to the final task, thus requiring specific-designed selection strategies for different tasks [26,27]. As for existing DWT-based networks, the wavelet transform was applied to replace the down-sampling and up-sampling operation [29], rather than exploring information on different frequency bands.

In summary, challenges for the OCT image segmentation task in this study can be listed as follows: (1) The unique interference such as tissue fluid and tissue folds in OCT images requires high generalization ability of the algorithm, and tissues with insignificant spatial brightness characteristics also make the automatic segmentation algorithm difficult to deal with; (2) The traditional encoding-decoding segmentation network only focuses on the image brightness feature itself and can not leverage the relationship between objects or stuff in a global view [31], often resulting in topological errors such as label discontinuity and layer confusion; (3) The existing attention network is limited by its insufficient frequency domain expression ability, which makes it can be further improved for a better segmentation performance.

In this study, we use the DWT to extract information in different frequencies. The DWT is employed for the following reasons. Firstly, DWT has only four components in one decomposition (the approximation coefficient and details coefficients in horizontal, vertical and diagonal), which makes it much easier to select proper components for a specific task. Secondly, a multi-layer DWT is implemented by downsampling and filtering, which is consistent with the structure of UNet, indicating the wavelet filters can be directly embedded in each stage of UNet to achieve multi-spectral analysis without redundancy. Thirdly, various wavelet bases make DWT more flexible to adapt to different tasks. On the basis of DWT, we proposed the wavelet attention network (WATNet) for layered structure segmentation on OCT images. The new network introduces the wavelet-based attention mechanism to capture long-range dependencies of the pixels in an image, resulting in powerful feature maps. In this case, representative features make the network robust to disturbances such as tissue fluid and folds in OCT images. Moreover, compared to our previous self-attention network [19] that only contains a single attention module, the new wavelet-based attention mechanism can be embedded in each stage of the backbone for more significant improvements. As a result, the network is of the potential to identify target tissues from various disturbances with fewer topological errors. Our main contributions can be summarized as follows:

• We analyzed the theory of DWT and the structure of UNet and generalized the channel attention in frequency domain using wavelet-based attention mechanism.
• We designed the WATNet that embeds wavelet attention module in a simple yet effective way with only a few additional matrix multiplication operations compared with traditional UNet.
• We evaluated the proposed WATNet on a self-collected dataset and two public OCT datasets, and segmentation improvements over several popular deep networks are observed.

The rest of this study is organized as follows. Section 2. describes the related theory and detailed architecture of the proposed WATNet. Section 3. describes the experiment, which shows the segmentation performance of the WATNet and comparisons with other deep networks. Discussions and conclusions are given in Sections 4. and 5, respectively.

2. Methods

2.1 Overview of the WATNet

The overall framework of the proposed WATNet is shown in Fig. 1. We use the UNet combined with residual block as the backbone network, which is embedded with the WAT module to construct the WATNet. The network is consist of the encoding path and the decoding path. The encoding path is designed using a ResNet34 backbone [15] with WAT modules embedded, and the decoding path is composed of the upsampling function and the WAT module. In Fig. 1, “BasicBL” represents the basic block of ResNet34 [15], “C” indicates the concatenate connection [13]. The WAT module is the key component of the proposed method, which is intended to perform the attention mechanism in the frequency domain. With the WAT module, the network is supposed to analyze features in different frequencies and extract more meaningful deep features, thus improving the model’s ability in OCT image segmentation.

Fig. 1. Framework of the proposed WATNet.

Download Full Size | PDF

2.2 WAT module

2.2.1 Channel attention

The channel attention implemented by SENet [23] can be described by Eq. (1),

(1)$$\boldsymbol{a} = \text{sigmoid} (\text{MAP}(\text{GAP}(X)))$$

where $\textbf {X} \in \mathbb {R}^{C \times H \times W}$ is the feature tensor, $\boldsymbol {a} \in \mathbb {R}^C$ is the attention vector, sigmoid is the sigmoid function, MAP represents the mapping operation (such as the fully connection and $1\times 1$ convolution), and GAP denotes the global average pooling as formulated in Eq. (2).

(2)$$\text{GAP} (\textbf{X}) = \frac{1}{WH} \sum_{i=1}^{W}\sum_{j=1}^{H} X_{ij}$$

The feature tensor $\tilde {\textbf {X}}$ optimized by the attention mechanism can be expressed in Eq. (3), where $C$ is the channel number.

(3)$$\tilde{\textbf{X}}_{i, :, :} = \boldsymbol{a}_i \textbf{X}_{i,:,:}, \quad i \in \{1, 2, \ldots, C\}$$

2.2.2 Analysis of the WAT module

The discrete wavelet transform of a 2-D function $f(x, y)$ with size $M \times N$ can be expressed as

(4)$$\begin{aligned} & W_{\phi}(j_0, m, n) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \phi_{j_0, m, n} (x, y)\\ & W_{\psi}^i(j, m, n) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \psi_{j, m, n}^i (x, y) \end{aligned}$$

where $j_0$ is the beginning scale, $W_\phi$ is the approximate coefficients of $f(x, y)$, $W_{\psi }^i$, $i \in \{H, V, D\}$ denotes the details of three directions (horizontal, vertical and diagonal). $\phi _{j_0, m, n}$ is the 2-D scaling function and $\psi _(j, m, n)^i$, $i \in \{H, V, D\}$ are 2-D wavelet functions in three directions. In general, $j_0 = 0$, $N = M = 2^J$, $j = 0, 1, 2, \ldots, J-1$, $m, n = 0, 1, 2, \ldots, 2^j - 1$, $\psi _{j_0, m, n}$ and $\psi _{j, m, n}^i$ are set according to the type of wavelet.

The scaling coefficients approximate the original input, which represent the low-frequency component. Compared Eq. (4) with the Eq. (2), we can find that they are of the same format with different scaling parameters. As a result, the GAP can be regarded as the low-frequency component of the DWT, and the proposed WAT module is a qualified attention mechanism.

2.2.3 Architecture of the WAT module

The core architecture of the WAT module is shown in Fig. 2. In this figure, GAP is the global average pooling defined in Eq. (2), MAP is the mapping operation implemented by the $1 \times 1$ convolution. Each feature map was decomposed by the DWT bases, which includes the approximation basis (DWT-A) and details for horizontal (DWT-H), vertical (DWT-V) and diagonal (DWT-D), thus generating four feature vectors representing different frequency components (denoted as cA, cH, cV and cD in Fig. 2). The frequency selection operator in Fig. 2 is a pre-defined 0/1 value to control if the corresponding frequency component will be transposed to the subsequent process. The value is determined based on the target tasks, which ensures the frequency combination with the best performance to be used to construct the frequency vector. This frequency vector is employed to generate the attention vector, which is supposed to improve the model’s ability in OCT image segmentation. As shown in Fig. 2, the WAT module is able to extract features in different frequencies, while conventional channel attention networks based on GAP only focus on the lowest frequency component. As a result, the proposed method is of the potential to learn more representative features, thus achieving more accurate segmentation. It is also worth mentioning that since the wavelet function can be pre-defined, the whole process needs only several extra matrix multiplication operations compared with the backbone net, which indicates the high efficiency of the proposed method.

Fig. 2. Architecture of the channel attention in SENet [23] and the proposed WAT.

Download Full Size | PDF

2.2.4 Loss function

The overall loss function of the WATNet can be expressed as Eq. (5).

(5)$$L = L_{\text{CE}} + L_{\text{Dice}}$$

In this equation, $L_{\text {CE}}$ denotes the cross entropy loss which is a measurement for classification accuracy as described by Eq. (6), where $N$ is the pixel number, $g_l(x)$ is the target probability that pixel $x$ belongs to class $l$ with one for the true label and zero entries for the others. $p_l(x)$ is the estimated probability of pixel $x$ belongs to class $l$.

(6)$$L_{\text{CE}} ={-}\sum_{i=1}^{N}g_{l}(x)\log{p_{l}(x)}$$

$L_{\text {Dice}}$ represents the dice loss, which is intended to evaluate the spatial overlap between the predicted mask and the ground truth as defined in Eq. (7),

(7)$$L_{\text{Dice}} = 1-\frac{2\sum_{i=1}^{N}p_l(x)g_l(x)}{\sum_{i=1}^{N}p^2_l(x)+\sum_{i=1}^{N}g^2_l(x)}$$

where the parameters are defined in the same way as those in Eq. (6).

3. Experiments

3.1 Dataset

In order to verify the effectiveness of the proposed algorithm, this paper includes a self-collection esophageal OCT dataset and two publicly available retinal datasets in experiments. An overview of them is listed in Table 1.

Table 1. Overview of datasets.

View Table | View all tables in this article

Dataset #1: The first dataset is a self-collected esophageal OCT dataset. The experiment was approved by the animal science center of Suzhou Institute of Biomedical Engineering and Technology. The dataset includes 100 OCT images from five C57BL mice. These images were collected in vivo using an 800 nm endoscopic OCT system. The annotated labels were generated by two experienced graders using ITK-SNAP. The labels of Grader #1 were used for network training and those of Grader #2 were used for comparison.

Dataset #2: The second is a public retinal OCT image set of 10 OCT volumes from 10 healthy patients acquired with an SD-OCT Spectralis device (Heidelberg Engineering, Heidelberg, Germany) [32]. The volumes in this set contain 10 B-scans, 496 pixels in height, and variable-width ranging from 543 pixels to 644 pixels.

Dataset #3: The third dataset is the HCMS retinal dataset from Johns Hopkins University [33], which includes 14 healthy controls (HC) and 21 patients with multiple sclerosis (MS), where each subject consists of 49 B-scans ($496 \times 1024$) with the annotations of 9 layer boundaries.

3.2 Implementation details

For Datasets #1 and #2, we perform five-fold cross validation for the experiment. For Dataset #1, we choose eight cases (88 images) as the training set, and the remaining two cases (22 images) are the test set for each fold. For Dataset #2, eight cases (80 images) are used for training, and the remaining two cases (20 images) are selected as the test set. Since Dataset #3 has relatively large number of images, the cross validation was not applied. The training set is composed of 7 HC cases (343 images) and 13 MS cases (637 images); the validation set includes 3 HC cases (147 images) and 4 MS cases (196 images); the remaining cases were used for testing, including 4 HC cases (196 images) and 4 MS cases (196 images).

In this study, we employed the patch-based strategy to make the network applicable for various image sizes as shown in Fig. 1. For all the three datasets, we randomly extracted five slices with width 128 from each B-scan. When predicting, the whole image can be fed directly into the network.

All experiments are carried out under the PyTorch framework using an 11 GB Nvidia GeForce RTX 2080Ti GPU with CUDA 9.2 and cuDNN v7. The Haar wavelet bases [34] were used to perform wavelet attention. The network was trained end-to-end using the Adam optimizer with a learning rate of $1 \times 10^{-4}$. The batch size is set at 15, and 100 epochs are needed to accomplish the training.

During the training process, data augmentation is used to improve the network robustness. The data augmentation techniques used in this study include random noise, random flip, random affine and random elastic deformation. The random noise operation adds Gaussian noise with random parameters; the random flip randomly reverses the order of elements in an image along the given axes; random affine applies a random affine transformation and resamples the image; in random elastic deformation, a random displacement is assigned to a coarse grid of control points around and inside the image [35].

3.3 Frequency component selection and ablation studies

To achieve the final WATNet, we need to select useful frequency bands for the OCT segmentation tasks in this study. For an intuitive demonstration, we show the four DWT coefficients (using Haar basis) of an OCT image from Dataset #1 in Fig. 3 [32]. It can be noticed that the approximation image is similar to the original image, which reflects the major structure of the input. The horizontal coefficients detected boundaries of the horizontal-structured tissue layers. As for the vertical and diagonal details, the coefficients can not be intuitively interpreted as the other two components. The results are not surprising, as retinal or esophageal tissue is usually located horizontally on OCT images, which makes the horizontal coefficient of DWT more sensitive to intensity variations in different layers.

Fig. 3. Demonstration of DWT coefficients of an OCT image.

Download Full Size | PDF

Ablation experiments were conducted on Dataset #1 and Dataset #2 to quantitatively verify the performance of the proposed wavelet attention module and select proper frequency components. Instead of using five-fold cross validation, we select one of the five training-testing datasets for more efficient ablation experiments, and the training epoch number is set to 40. The UNet combined with residual structure was used as the backbone network. Then, we added wavelet attention module with different frequency components to observe the segmentation performance. The mean and standard deviation values of accuracy and Dice similarity coefficient (DSC) are listed in were listed in Table 2.

Table 2. Segmentation performance of the WAT module with different frequency components.

View Table | View all tables in this article

It can be found that almost all networks with WAT modules (except backbone+WAT_D for Dataset #2) achieved significantly higher accuracy and DSC ($p < 0.001$) than the backbone, which indicates the advantages of the proposed WAT module. Besides, the backbone+SENet has a similar performance with the backbone+WAT_A. The $p$ values of the $t$-test between these two networks on the listed four metrics in Table 2 were 0.207, 0.301, 0.172, 0.013, respectively, which indicates the metric values can be regarded as equal under a 0.05 significant level except for the last metrics (DSC for Dataset #2). The results further confirm that the original SENet calculates the low frequency component of the image, which is a special case of the proposed WATNet (WAT_A).

For both the two cases, the backbone+WAT_AH that uses the combination of approximation coefficients and the horizontal details achieved the best segmentation result, which achieved significantly higher accuracy and dice coefficient than the second best network. This is in line with the above analysis that the approximation coefficients reflect the major structures of the image, and the horizontal details can detect boundaries of the tissue layers. As a result, a combination of these two wavelet components performed best in segmenting layer structures on OCT images.

The ablation experiment of WAT modules embedded in different positions was also conducted in this study. On the basis of the backbone, we respectively embedded the WAT modules in Block1 (indicated in Fig. 1), Block4 (indicated in Fig. 1) and both to evaluate the segmentation performance. Results were listed in Table 3, where the WAT module embedded in Block1 performs better than Block4, and the network with WAT module in both blocks achieved the best performance. The result showed that the WAT module closer to the input generates more significant improvement, and more WAT modules embedded in the network resulted in a more powerful network. Moreover it can also be found that the accuracies and DSCs of the three networks in Table 3 are higher than the backbone (Table 2) lower than the backbone+WAT_AH (Table 2). As a result, the backbone+WAT_AH with WAT modules embedded in all stages of the network was selected as the final architecture.

Table 3. Segmentation performance of the WAT module embedded in different positions.^a

View Table | View all tables in this article

3.4 Comparisons with the state-of-the-arts

We compared the proposed WATNet with the widely used UNet [13], UNet++ [14], DeepLabV3 [36] and the backbone that combined UNet with residual block [15]. Besides, the ACN [18] and TSANet [19] designed for OCT image segmentation and the CPFNet [37] for general medical image segmentation were also employed in the experiment. Moreover, to clarify the effectiveness of the proposed wavelet attention, the channel-attention model SENet [23] and DCT-based frequency attention model FcaNet [26] with the same backbone network were included in the comparison.

Figure 4 shows the segmentation results of different networks on an OCT image from Dataset #1. The selected image was affected by non-uniform intensities along the horizontal direction, making some areas of tissue difficult to identify. In this case, most networks failed to segment complete tissue layers. The networks with attention mechanism (TSANet, backbone+SENet, backbone+FcaNet, WATNet) generate more consecutive tissue layers. Furthermore, the WATNet exhibits the best performance, which confirms the advantages of the proposed wavelet attention mechanism.

Fig. 4. Visualization of segmentation results of different methods on Dataset #1.

Download Full Size | PDF

In the case of Dataset #2 (Fig. 5), it can be noticed that all the networks are able to identify the major structures of the target tissues. Some of the networks (UNet, UNet++, ACN, TSANet, backbone) were affected by the artifacts and generated outlier predictions. In comparison, the channel attention-based networks on the bottom row of the figure and the CPFNet avoided these errors, and the result can be directly used to locate different tissues without further post-processing.

Fig. 5. Visualization of segmentation results of different methods on Dataset #2.

Download Full Size | PDF

Figure 6 shows the segmentation result of different networks for the case from Dataset #3. The major structure of the image is also successfully identified by all the networks. However, the UNet generates obvious mistakes in the segmentation mask; the DeepLabV3 result shows discontinuity on the top layer and the backbone and backbone+FcaNet generate sharp changes on the segmentation mask. The other networks achieved segmentation results quite close to the ground truth.

Fig. 6. Visualization of segmentation results of different methods on Dataset #3.

Download Full Size | PDF

To quantitatively evaluate the segmentation performance, the mean and standard deviation values of accuracy and DSC are listed in Table 4. It can be found that the overall performance of ACN and TSANet designed for OCT image segmentation and the CPFNet that explores context information is better than the general UNet, UNet++, and DeepLabV3 in the tested cases. Besides, the three channel attention-based models (backbone+SENet, backbone+FCANet, and WATNet) achieved better results than the backbone ($p<0.001$), which confirmed the effectiveness of attention mechanism. Moreover, the proposed WATNet generated the best segmentation result in all cases, which achieved significantly higher accuracy and DSC than the second best network ($p < 0.001$), indicating the proposed method is quite competitive in esophageal OCT image segmentation.

Table 4. Segmentation performance of different networks on three datasets.

View Table | View all tables in this article

4. Discussion

Automatical segmentation of clinical-relevant tissues is a critical technique in OCT image processing. However, the process is often affected by speckle noise, motion artifacts or unfavorable image qualities. An effective solution to this problem is extracting more powerful feature maps for the deep networks. In this study, we proposed the WATNet, which introduces the wavelet-based attention mechanism to capture long-range dependencies of the pixels from the image. Comparisons with other popular segmentation networks confirmed the advantages of the proposed WATNet with higher accuracy and DSC.

Topological errors are common in OCT image segmentation because disturbances such as speckle noise and artifacts often make target tissues difficult to identify. This study focuses on using attention mechanisms to capture intra- and inter-class relationships to improve the segmentation performance. Researchers also developed other strategies to address these issues. For example, DeepLab and PSPNet [36,38] aggregate multi-scale contexts via combining feature maps generated by different dilated convolutions and pooling operations. GAN-based methods train the segmentation network in an adversarial way to generate more “real” masks [18,39]. The recent Bicon-Net effectively models inter-pixel relations and object saliency using connection masks and saliency masks as labels [6,40]. Experiments on five salient object detection benchmark datasets and an esophagus OCT image dataset showed the Bicon-CE outperformed several widely used neural networks and reduced common topological prediction issues. Since our WAT module can be easily plugged into existing frameworks, in our future work, we will try to combine our method with different strategies to further improve the automatic segmentation performance on OCT images.

The good performance of the WATNet leverages on the following advantages of the proposed wavelet attention mechanism. Firstly, it covers more frequency bands, indicating a more comprehensive feature representation than the conventional channel attention structure. Secondly, the DWT used in this network has only four frequency components in one decomposition, which makes it easier to select useful frequency bands. Thirdly, the selected DWT components focus more on vertical intensity variation, which is beneficial for detecting boundaries of the horizontal structural tissue layers. Finally, the lightweight architecture enables the WAT module to be embedded in different stages of the network for better segmentation performance. As a result, the WATNet achieved the best performance among the test networks on segmenting layer structured tissues from OCT images.

This study used three datasets to verify the effectiveness of the proposed method. In Fig. 4 selected from Dataset #1, most of the methods failed in the blurry or shadowing region. In this case, we further employed a simple median filter to denoise the image and use the backbone without WAT module for segmentation. The achieved DSC is $86.87 \pm 4.72$, which is equal to the backbone result without denoising (Table 4, $86.89 \pm 4.63$) under a 0.05 significant level ($p = 0.32$). The segmentation performance was not improved by simple denoising methods for this case. In Datasets #2 and #3, the images have more clear boundaries and less noise. Most deep networks segmented the major structures of the layered tissues as shown in Figs. 5 and 6. However, outlier predictions can be found in the results of UNet, UNet++, and backbone in Fig. 5 for Dataset #2. Discontinuous prediction and tissue sharp changes can be found in the results of UNet, DeepLabV3, and backbone in Fig. 6 for Dataset #3. In comparison, the proposed WATNet is able to alleviate these errors and generate segmentation masks closer to the ground truth.

In the current study, the DWT is implemented using the Haar basis [34]. There are various other bases that can be used for this framework, such as the Daubechies wavelets, Coiflets and Symlets [41]. Moreover, the discrete wavelet transform is not limited to extracting information from horizontal, vertical, and diagonal. Directional wavelets such as curvelets and contourlet [42] are able to provide information at different angles, which may be useful for more complex segmentation tasks.

5. Conclusion

In this study, we proposed the WATNet for layer-structured tissue segmentation on OCT images. Segmenting OCT images often suffers from outlier prediction or label disconnection problems due to the specific imaging disturbance such as tissue fluid and tissue folds. To detect layer structures in a global view and explore useful information in different frequencies, the WATNet introduces the wavelet attention mechanism to capture multi-spectral features, which has the advantages of simple frequency selection process, no redundant frequency components and various wavelet bases. Moreover, the network is easy to implement with only a few additional matrix multiplications compared to UNet. Experiments on a self-collected esophageal dataset and two public retinal OCT datasets show that the WATNet achieved segmentation results superior to several popular segmentation networks, which confirms the effectiveness and practical significance of the proposed method.

Funding

Natural Science Foundation of Shandong Province (ZR2021QF068, ZR2021QF105); Natural Science Foundation of Jiangsu Province (BK20200216).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results related to Dataset #1 presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request. Data underlying the results related to Dataset #2 presented in this paper are available in Dataset 1, Ref. [32]. Data underlying the results related to Dataset #3 presented in this paper are available in Dataset 2, Ref. [33].

References

1. D. Huang, E. A. Swanson, C. P. Lin, J. S. Schuman, W. G. Stinson, W. Chang, M. R. Hee, T. Flotte, K. Gregory, C. A. Puliafito, and J. G. Fujimoto, “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]

2. G. J. Tearney, M. E. Brezinski, B. E. Bouma, S. A. Boppart, C. Pitris, J. F. Southern, and J. G. Fujimoto, “In vivo endoscopic optical biopsy with optical coherence tomography,” Science 276(5321), 2037–2039 (1997). [CrossRef]

3. L. Qi, K. B. Zheng, X. P. Li, Q. J. Feng, Z. P. Chen, and W. F. Chen, “Automatic three-dimensional segmentation of endoscopic airway oct images,” Biomed. Opt. Express 10(2), 642–656 (2019). [CrossRef]

4. R. Rasti, M. J. Allingham, P. S. Mettu, S. Kavusi, K. Govind, S. W. Cousins, and S. Farsiu, “Deep learning-based single-shot prediction of differential effects of anti-vegf treatment in patients with diabetic macular edema,” Biomed. Opt. Express 11(2), 1139–1152 (2020). [CrossRef]

5. H. Stegmann, R. M. Werkmeister, M. Pfister, G. Garhofer, L. Schmetterer, and V. A. Dos Santos, “Deep learning segmentation for optical coherence tomography measurements of the lower tear meniscus,” Biomed. Opt. Express 11(3), 1539–1554 (2020). [CrossRef]

6. Z. Y. Yang, S. Soltanian-Zadeh, K. K. Chu, H. R. Zhang, L. Moussa, A. E. Watts, N. J. Shaheen, A. Wax, and S. Farsiu, “Connectivity-based deep learning approach for segmentation of the epithelium in in vivo human esophageal oct images,” Biomed. Opt. Express 12(10), 6326–6340 (2021). [CrossRef]

7. I. Cabeza-Gil, M. Ruggeri, Y. C. Chang, B. Calvo, and F. Manns, “Automated segmentation of the ciliary muscle in oct images using fully convolutional networks,” Biomed. Opt. Express 13(5), 2810–2823 (2022). [CrossRef]

8. Y. Boykov and G. Funka-Lea, “Graph cuts and efficient n-d image segmentation,” Int. J. Comput. Vis. 70(2), 109–131 (2006). [CrossRef]

9. S. J. Chiu, X. T. Li, P. Nicholas, C. A. Toth, J. A. Izatt, and S. Farsiu, “Automatic segmentation of seven retinal layers in sdoct images congruent with expert manual segmentation,” Opt. Express 18(18), 19413–19428 (2010). [CrossRef]

10. J. L. Zhang, W. Yuan, W. X. Liang, S. Y. Yu, Y. M. Liang, Z. Y. Xu, Y. X. Wei, and X. D. Li, “Automatic and robust segmentation of endoscopic oct images and optical staining,” Biomed. Opt. Express 8(5), 2697–2708 (2017). [CrossRef]

11. M. Gan, C. Wang, T. Yang, N. Yang, M. Zhang, W. Yuan, X. D. Li, and L. R. Wang, “Robust layer segmentation of esophageal oct images based on graph search using edge-enhanced weights,” Biomed. Opt. Express 9(9), 4481–4495 (2018). [CrossRef]

12. C. Wang, M. Gan, N. Yang, T. Yang, M. Zhang, S. H. Nao, J. Zhu, H. Y. Ge, and L. R. Wang, “Fast esophageal layer segmentation in oct images of guinea pigs based on sparse bayesian classification and graph search,” Biomed. Opt. Express 10(2), 978–994 (2019). [CrossRef]

13. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, Pt Iii, vol. 9351 (2015), pp. 234–241.

14. Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, D. Stoyanov, Z. Taylor, G. Carneiro, T. Syeda-Mahmood, A. Martel, L. Maier-Hein, J. M. R. Tavares, A. Bradley, J. P. Papa, V. Belagiannis, J. C. Nascimento, Z. Lu, S. Conjeti, M. Moradi, H. Greenspan, and A. Madabhushi, eds. (Springer International Publishing, Cham, 2018), pp. 3–11.

15. K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 Ieee Conference on Computer Vision and Pattern Recognition (Cvpr), (2016), pp. 770–778.

16. L. Y. Fang, D. Cunefare, C. Wang, R. H. Guymer, S. T. Li, and S. Farsiu, “Automatic segmentation of nine retinal layer boundaries in oct images of non-exudative amd patients using deep learning and graph search,” Biomed. Opt. Express 8(5), 2732–2744 (2017). [CrossRef]

17. S. K. Devalla, P. K. Renukanand, B. K. Sreedhar, G. Subramanian, L. Zhang, S. Perera, J. M. Mari, K. S. Chin, T. A. Tun, N. G. Strouthidis, T. Aung, A. H. Thiery, and M. J. A. Girard, “Drunet: a dilated-residual u-net deep learning network to segment optic nerve head tissues in optical coherence tomography images,” Biomed. Opt. Express 9(7), 3244–3265 (2018). [CrossRef]

18. C. Wang, M. Gan, M. Zhang, and D. Y. Li, “Adversarial convolutional network for esophageal tissue segmentation on oct images,” Biomed. Opt. Express 11(6), 3095–3110 (2020). [CrossRef]

19. C. Wang and M. Gan, “Tissue self-attention network for the segmentation of optical coherence tomography images on the esophagus,” Biomed. Opt. Express 12(5), 2631–2646 (2021). [CrossRef]

20. K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in 32nd International Conference on Machine Learning, vol. 37 (2015), pp. 2048–2057.

21. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30 (Nips 2017), vol. 30 (2017).

22. L. Mou, Y. Zhao, L. Chen, J. Cheng, Z. Gu, H. Hao, H. Qi, Y. Zheng, A. Frangi, and J. Liu, “Cs-net: Channel and spatial attention network for curvilinear structure segmentation,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, and A. Khan, eds. (Springer International Publishing, Cham, 2019), pp. 721–730.

23. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in 2018 Ieee/Cvf Conference on Computer Vision and Pattern Recognition (Cvpr), (2018), pp. 7132–7141.

24. M. Ehrlich and L. Davis, “Deep residual learning in the jpeg transform domain,” in 2019 Ieee/Cvf International Conference on Computer Vision (Iccv 2019), (2019), pp. 3483–3492.

25. A. Alijamaat, A. NikravanShalmani, and P. Bayat, “Multiple sclerosis identification in brain mri images using wavelet convolutional neural networks,” Int. J. Imaging Syst. Technol. 31(2), 778–785 (2021). [CrossRef]

26. Z. Qin, P. Zhang, F. Wu, and X. Li, “Fcanet: Frequency channel attention networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 783–792.

27. B. Su, J. Liu, X. Su, B. Luo, and Q. Wang, “Cfcanet: A complete frequency channel attention network for sar image scene classification,” IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 14, 11750–11763 (2021). [CrossRef]

28. S. G. Mallat, “A theory for multiresolution signal decomposition - the wavelet representation,” IEEE Trans. Pattern Anal. Machine Intell. 11(7), 674–693 (1989). [CrossRef]

29. Y. Li, Y. Wang, T. Leng, and W. Zhijie, “Wavelet u-net for medical image segmentation,” in Artificial Neural Networks and Machine Learning – ICANN 2020, I. Farkaš, P. Masulli, and S. Wermter, eds. (Springer International Publishing, Cham, 2020), pp. 800–810.

30. C. Zhao, B. Xia, W. L. Chen, L. B. Guo, J. Du, T. F. Wang, and B. Y. Lei, “Multi-scale wavelet network algorithm for pediatric echocardiographic segmentation via hierarchical feature guided fusion,” Appl. Soft Comput. 107, 107386 (2021). [CrossRef]

31. J. Fu, J. Liu, H. J. Tian, Y. Li, Y. J. Bao, Z. W. Fang, and H. Q. Lu, “Dual attention network for scene segmentation,” in 2019 Ieee/Cvf Conference on Computer Vision and Pattern Recognition (Cvpr 2019), (2019), pp. 3141–3149.

32. J. Tian, B. Varga, G. M. Somfai, W. H. Lee, W. E. Smiddy, and D. C. DeBuc, “Real-time automatic segmentation of optical coherence tomography volume data of the macular region,” PLoS One 10(8), e0133908 (2015). [CrossRef]

33. Y. F. He, A. Carass, S. D. Solomon, S. Saidha, P. A. Calabresi, and J. L. Prince, “Retinal layer parcellation of optical coherence tomography images: Data resource for multiple sclerosis and healthy controls,” Data Brief 22, 601–604 (2019). [CrossRef]

34. M. G. Albanesi, I. Delotto, and L. Carrioli, “Image compression by the wavelet decomposition,” Eur. Trans. Telecomm. 3(3), 265–274 (1992). [CrossRef]

35. C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” J. Big Data 6(1), 60 (2019). [CrossRef]

36. L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” CoRRabs/1706.05587 (2017).

37. S. Feng, H. Zhao, F. Shi, X. Cheng, M. Wang, Y. Ma, D. Xiang, W. Zhu, and X. Chen, “Cpfnet: Context pyramid fusion network for medical image segmentation,” IEEE Trans. Med. Imaging 39(10), 3008–3018 (2020). [CrossRef]

38. H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017), pp. 6230–6239.

39. P. Isola, J. Y. Zhu, T. H. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in 30th Ieee Conference on Computer Vision and Pattern Recognition (Cvpr 2017), (2017), pp. 5967–5976.

40. Z. Yang, S. Soltanian-Zadeh, and S. Farsiu, “Biconnet: An edge-preserved connectivity-based approach for salient object detection,” Pattern Recognit. 121, 108231 (2022). [CrossRef]

41. S. Sahu and N. Rayavarapu, “Performance comparison of sparsifying basis functions for compressive speech enhancement,” Int. J. Speech Technol. 22(3), 769–783 (2019). [CrossRef]

42. V. M. Kamble, P. Parlewar, A. G. Keskar, and K. M. Bhurchandi, “Performance evaluation of wavelet, ridgelet, curvelet and contourlet transforms based techniques for digital image denoising,” Artif. Intell. Rev. 45(4), 509–533 (2016). [CrossRef]

Datasets	Description	Cross validation	Resolution	Characteristic
Dataset #1	110 in total	Five-fold	$512 \times 1000$	Healthy Esophageal image
	88 for training			4 tissue layers
	22 for validation
Dataset #2 [32]	100 in total	Five-fold	$496 \times (543 \sim 644)$	Healthy retinal image
	80 for training			7 tissue layers
	20 for validation
Dataset #3 [33]	1715 in total	N/A	$496 \times 1000$	Healthy retinal image
	980 for training			Patient with multiple sclerosis
	343 for validation			8 tissue layers
	392 for testing

Methods	Dataset #1		Dataset # 2
	Accuracy	DSC	Accuracy	DSC
Grader #2	$0.9716 \pm 0.0152$	$0.8395 \pm 0.0924$	$0.9733 \pm 0.0030$	$0.8721 \pm 0.0161$
Backbone [15]	$0.9793 \pm 0.0126$	$0.8463 \pm 0.0790$	$0.9821 \pm 0.0015$	$0.9156 \pm 0.0086$
Backbone+SENet [23]	$0.9819 \pm 0.0053$	$0.8679 \pm 0.0383$	$0.9820 \pm 0.0038$	$0.9192 \pm 0.0077$
Backbone+WAT_A	$0.9818 \pm 0.0065$	$0.8670 \pm 0.0376$	$0.9827 \pm 0.0025$	$0.9219 \pm 0.0093$
Backbone+WAT_H	$0.9829 \pm 0.0057$ ^b	$0.8735 \pm 0.0263$ ^b	$0.9828 \pm 0.0028$	$0.9222 \pm 0.0080$
Backbone+WAT_V	$0.9787 \pm 0.0137$	$0.8414 \pm 0.0879$	$0.9828 \pm 0.0016$	$0.9202 \pm 0.0073$
Backbone+WAT_D	$0.9813 \pm 0.0085$	$0.8640 \pm 0.0433$	$0.9715 \pm 0.0264$	$0.8994 \pm 0.0542$
Backbone+WAT_AH	$0.9832 \pm 0.0033$ ^a	$0.8762 \pm 0.0191$ ^a	$0.9840 \pm 0.0019$ ^a	$0.9268 \pm 0.0085$ ^a
Backbone+WAT_AV	$0.9808 \pm 0.0099$	$0.8610 \pm 0.0568$	$0.9831 \pm 0.0020$	$0.9219 \pm 0.0101$
Backbone+WAT_AD	$0.9819 \pm 0.0084$	$0.8695 \pm 0.0443$	$0.9835 \pm 0.0018$	$0.9241 \pm 0.0088$
Backbone+WAT_AHV	$0.9804 \pm 0.0109$	$0.8569 \pm 0.0629$	$0.9838 \pm 0.0025$	$0.9263 \pm 0.0091$
Backbone+WAT_AHVD	$0.9813 \pm 0.0058$	$0.8597 \pm 0.0319$	$0.9838 \pm 0.0021$ ^b	$0.9264 \pm 0.0091$ ^b

Position	Dataset #1		Dataset #2
	Accuracy	DSC	Accuracy	DSC
Block1	$0.9808 \pm 0.0121$	$0.8514 \pm 0.0421$	$0.9832 \pm 0.0031$	$0.9189 \pm 0.0102$
Block4	$0.9797 \pm 0.0119$	$0.8470 \pm 0.0685$	$0.9828 \pm 0.0029$	$0.9162 \pm 0.0123$
Block1+Block4	$0.9816 \pm 0.0052$	$0.8639 \pm 0.0232$	$0.9835 \pm 0.0023$	$0.9226 \pm 0.0091$

Network	Dataset #1		Dataset #2		Dataset #3
	Accuracy (%)	Dice (%)	Accuracy (%)	Dice (%)	Accuracy (%)	Dice (%)
UNet [13]	$98.19 \pm 0.87$	$86.65 \pm 3.08$	$98.20 \pm 0.32$	$92.10 \pm 1.06$	$98.45 \pm 0.69$	$89.89 \pm 4.23$
UNet++ [14]	$98.18 \pm 0.45$	$86.09 \pm 2.09$	$98.05 \pm 0.19$	$91.20 \pm 0.98$	$98.18 \pm 0.69$	$88.43 \pm 4.33$
DeepLabV3 [36]	$97.95 \pm 1.34$	$85.15 \pm 7.74$	$98.01 \pm 0.58$	$91.04 \pm 1.35$	$97.80 \pm 0.67$	$86.38 \pm 4.37$
ACN [18]	$97.34 \pm 1.62$	$85.65 \pm 6.41$	$98.13 \pm 0.49$	$92.07 \pm 1.29$	$97.94 \pm 0.71$	$87.51 \pm 4.15$
TSANet [19]	$98.20 \pm 0.56$	$86.94 \pm 3.37$	$98.25 \pm 0.63$	$92.21 \pm 1.20$	$98.40 \pm 0.66$	$89.71 \pm 3.48$
CPFNet [37]	$98.34 \pm 0.28$	$87.32 \pm 1.66$	$98.14 \pm 0.83$	$92.10 \pm 1.53$	$98.41 \pm 0.74$	$89.69 \pm 4.44$
Backbone [15]	$98.19 \pm 0.65$	$86.89 \pm 4.63$	$98.36 \pm 0.23$	$92.62 \pm 0.90$	$98.38 \pm 0.59$	$89.62 \pm 3.63$
Backbone+SENet [23]	$98.22 \pm 0.72$	$86.97 \pm 2.94$	$98.44 \pm 0.16$ ^b	$92.65 \pm 0.83$	$98.44 \pm 0.39$ ^b	$89.79 \pm 2.62$ ^b
Backbone+FCANet [26]	$98.39 \pm 0.27$ ^b	$87.84 \pm 1.48$ ^b	$98.42 \pm 0.20$	$92.77 \pm 0.96$ ^b	$98.41 \pm 0.69$	$89.63 \pm 4.16$
WATNet	$98.40 \pm 0.27$ ^a	$88.04 \pm 1.51$ ^a	$98.49 \pm 0.18$ ^a	$93.09 \pm 0.88$ ^a	$98.51 \pm 0.50$ ^a	$90.30 \pm 3.21$ ^a

Datasets	Description	Cross validation	Resolution	Characteristic
Dataset #1	110 in total	Five-fold	$512 \times 1000$	Healthy Esophageal image
	88 for training			4 tissue layers
	22 for validation
Dataset #2 [32]	100 in total	Five-fold	$496 \times (543 \sim 644)$	Healthy retinal image
	80 for training			7 tissue layers
	20 for validation
Dataset #3 [33]	1715 in total	N/A	$496 \times 1000$	Healthy retinal image
	980 for training			Patient with multiple sclerosis
	343 for validation			8 tissue layers
	392 for testing

Wavelet attention network for the segmentation of layer structures on OCT images

Abstract

1. Introduction

2. Methods

2.1 Overview of the WATNet

2.2 WAT module

2.2.1 Channel attention

2.2.2 Analysis of the WAT module

2.2.3 Architecture of the WAT module

2.2.4 Loss function

3. Experiments

3.1 Dataset

3.2 Implementation details

3.3 Frequency component selection and ablation studies

3.4 Comparisons with the state-of-the-arts

4. Discussion

5. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (6)

Tables (4)

Equations (7)

Biomedical Optics Express