LDADN: a local discriminant auxiliary disentangled network for key-region-guided chest X-ray image synthesis augmented in pneumoconiosis detection

Open Access

Abstract

Pneumoconiosis is deemed one of China’s most common and serious occupational diseases. Its high prevalence and treatment cost create enormous pressure on socio-economic development. However, due to the scarcity of labeled data and class-imbalanced training sets, computer-aided diagnosis of pneumoconiosis based on chest X-ray (CXR) images remains a challenging task. Current CXR data augmentation solutions cannot sufficiently extract small-scale features in lesion areas or synthesize high-quality images, which may cause detection errors in the diagnosis phase. In this paper, we propose a local discriminant auxiliary disentangled network (LDADN) to synthesize CXR images and augment pneumoconiosis detection. This model enables the transfer of high-frequency details by leveraging batches of mutually independent local discriminators. Cooperating with local adversarial learning and the Laplacian filter, the features in the lesion area can be disentangled by a single network. The results show that LDADN is superior to the other compared models in the quantitative assessment metrics. When used for data augmentation, the synthesized images significantly boost the detection accuracy to 99.31%. Furthermore, this study offers a useful reference for the analysis of medical image data with insufficient labels or class imbalance.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Pneumoconiosis is a dominant occupational lung disease triggered by long-term inhalation of mineral dust, and it has drawn increasing attention with its rising prevalence and severity. It typically occurs in workplaces where workers are excessively exposed to dust (e.g., asbestos, silica, coal, and mixed fine dust); the dust retained in the lungs gradually causes diffuse fibrosis of lung tissue [1,2]. Because the disease is irreversible and has no cure, the high treatment costs of pneumoconiosis result in substantial economic losses. According to the national occupational disease report in China, by the end of 2019 more than 990,000 cases of occupational disease had been reported, of which more than 885,000 were pneumoconiosis, accounting for about 90% of the total. From 2017 to 2019, over 20,000 new cases were reported each year. A relevant investigation suggests that for coal workers’ pneumoconiosis in China, the direct economic burden was 24,108.05 yuan per capita and the indirect burden was about 35,977.36 yuan per capita [3]. Meanwhile, with increasing age, the medical costs, indirect costs, and health risks grow year by year [4]. Regular screening of potentially at-risk populations is the key to early intervention and prevention of pneumoconiosis, because its progression shows few apparent signs.

At present, chest X-ray (CXR) radiographs are used as the primary evidence for the clinical diagnosis of pneumoconiosis. In China, the diagnosis of pneumoconiosis still relies on radiologists in clinical practice. However, manual radiograph reading has drawbacks in several respects. First, limited accuracy. CXR diagnosis of pneumoconiosis requires experienced and well-trained radiologists to recognize imperceptible graphic patterns and features, and this process is unstable owing to substantial inter- and intra-observer variation. In the United States, for instance, diagnostic concordance for pneumoconiosis is between 85% and 90% among professional radiologists and around 80% among general medical practitioners [5]. The consistency of screening programs is even lower, especially in remote and rural regions. Second, limited stability. Because the work is laborious, radiologists may overlook subtle lesions, such as small pulmonary nodules and inconspicuous microcalcifications. Third, high workload. According to relevant surveys, one-third of industrial workers in China are exposed to dust to various degrees, and occupational health examinations (physical examinations and CXR) are carried out on more than 10 million of these workers every year. After massive initial screening of suspected cases, these CXRs are referred to occupational diagnosis agencies for further diagnosis and subtype confirmation. Therefore, a fast, efficient, and accurate automated computer-aided diagnostic system to assist radiologists in decision-making is needed.

Relying on high-performance computing and large datasets, deep learning approaches have demonstrated enormous potential in various image processing applications. The best-known deep learning model is the convolutional neural network (CNN); it offers end-to-end training for prominent feature extraction and performs better than traditional benchmarks in almost all image tasks. Since the idea of the CNN took root, researchers have proposed VGGNet [6], GoogleNet [7], ResNet [8], MobileNet [9], DenseNet [10], and many other CNN-based architectures, which are frequently referred to as deep learning models. Recent studies have shown that these models achieve great success in computer-aided diagnosis (CAD) for human disease detection [11–13]. Bharati et al. [14] established a hybrid deep learning framework based on VGG and a spatial transformer network for lung disease detection, which outperformed existing methods in terms of assessment metrics. Abhange et al. [15] investigated the capability of the GoogleNet Inception-V3 architecture to accelerate COVID-19 detection and the classification of other lung diseases from chest X-rays; their approach demonstrated the reliability of Inception-V3 and its extensibility to other lung complications. Yan et al. [16] combined LSTM and DenseNet to automatically annotate unlabeled images and classify abnormalities in chest X-rays, and the results showed that their framework improved the performance of CXR-based disease diagnosis.

Although these approaches have produced promising results, the performance of deep learning networks often depends heavily on vast amounts of annotated data, which are scarce in the medical field. More importantly, CXR imaging data, especially for positive patients, remain absent due to technical, legal, data ownership, and privacy challenges. The shortage of positive data leads to the class imbalance problem in classification tasks and may result in sub-optimal performance on the minority class (i.e., the positive class). Under this condition, researchers began using synthesized CXR images for data augmentation, but several problems remain in CXR image synthesis. One of the most significant challenges is that some approaches take an input image and produce a single output, so the outputs lack diversity, and the image resolution generated by these methods can hardly meet growing research requirements. Moreover, as we aim to extract features in the lesion areas and generate images with fine local details, existing approaches with a single discriminator cannot transfer high-frequency details of a particular area. Besides, generative adversarial networks usually have a highly under-constrained and unstable training process; how to make a large, complex network converge to a relatively good state is also a crucial problem.

This study aims to tackle the issues described above by presenting a local discriminant auxiliary disentangled network (LDADN), an advanced generative adversarial network architecture based on disentangled representation that uses feature disentanglement and gated feature fusion to synthesize diverse and high-quality CXR images. The CXR images generated by LDADN are then added to the original dataset to train the classifiers and improve detection performance.

The main contributions of this study are as follows:

  • 1. For the synthesis of CXR images, a local discriminant auxiliary disentangled network is proposed as a new framework for unpaired medical image translation tasks. LDADN combines disentangled representation and adversarial learning. By adding skip connections between encoders and generators, the model can capture high- and low-frequency components of the desired target modality and significantly reduce mode collapse. The generator can effectively synthesize realistic pneumoconiosis CXR images with diversity and high quality from given attributes or random noise vectors.
  • 2. For the network architecture, mutually independent local discriminators cooperating with an adversarial loss and a Laplacian filter loss are employed. Inspired by multi-scale discriminators [17], each local discriminator differentiates images at 3 scales with an identical network structure. As the local image patches are fed into the discriminators, the generators are encouraged to transfer high-frequency details and synthesize high-resolution, crisp output. This is mainly used to improve the synthetic image quality in the lesion area.
  • 3. For pneumoconiosis detection, transfer learning is adopted in the construction of the classification models, since it yields better detection performance than models trained from scratch. By adding the synthesized pneumoconiosis CXR images to the training set, the detection result can be further improved.
  • 4. For the whole methodology shown in Fig. 1, we demonstrate that LDADN as a data generation procedure can synthesize visually satisfying results and achieve satisfactory performance in pneumoconiosis detection. The framework is also suitable for the analysis of other medical image data with insufficient labels or class imbalance.

Fig. 1. Module diagram of the proposed process and methodology.

2. Related work

To alleviate the data imbalance problem, researchers have attempted to use data augmentation techniques to help the classifier improve accuracy and avoid overfitting. The most common data augmentation methods involve simple modifications of the dataset images such as shifting, rotation, flipping, and rescaling. However, slight modifications cannot introduce much additional information, and they may cause overfitting on the minority class that is being oversampled [18]. In addition, in many applications such as medical image analysis, because images contain a great deal of high-level semantic information, changes in position or added noise may alter the label of the image. Even the most widely used data augmentation method, mirror flipping, changes the distribution of the original data [19]. By contrast, high-quality sample synthesis is a newer and more sophisticated data augmentation method. Synthetic samples learned by generative models dramatically promote image diversity and enable the augmented dataset to improve model generalization. One of the most promising approaches is the generative adversarial network (GAN) [20]. Inspired by game theory, GANs synthesize high-quality natural images by optimizing an adversarial loss. This kind of network has shown excellent performance and attracted broad attention in medical imaging, and many works on medical image synthesis for data augmentation have been proposed [21–26].

However, in the training process, the original GANs require paired data from the source and target domains; in other words, it is hard to obtain both healthy and anomalous data from the same patient, which has greatly restrained their application in the medical imaging field. To avoid the paired-data issue, unpaired image-to-image translation approaches were presented. These methods exploit the inter-domain variation of the data distribution for image synthesis by learning inter-domain mappings and synthesizing samples in the underrepresented domain from ones in the overrepresented domain. Standard models of this kind used for data augmentation include CycleGAN [27], DualGAN [28], UNIT, MUNIT [29], and DRIT [30]. They combine class conditioning with cycle consistency and adversarial training to learn a mapping between a source and a target domain. These unpaired image-to-image translation frameworks have also been applied to medical image analysis [31]. Chartsias et al. [32] demonstrated the potential of unpaired image-to-image translation to synthesize pairs of cardiac MR and CT images using the CycleGAN architecture; the authors showed that segmentation performance increased by 16% when the model was trained with synthetic data. Han et al. [33] leveraged a two-step data augmentation method that combined PGGAN with MUNIT to generate and further refine MR images with and without tumors; with this model, sensitivity increased from 93.67% to 97.48% compared with the classic tumor detection method. Tang et al. [34] introduced a disentangled generative network that simultaneously generates a normal CXR image and a disease residue map from an abnormal one; by comparing the abnormal image with the synthetic normal CXR, they decomposed the disease regions, and the framework improved disease classification and detection performance. Nevertheless, based on previous literature, there is still a lack of systematic study on X-ray image synthesis for pneumoconiosis.

3. Materials and methods

3.1 Raw data

The raw data collected for this task contain two parts: Health and Pneumoconiosis. All the cases were acquired from 2015 to 2018 using digital radiography (DR), with a GMM CALYPSO multifunctional DR ceiling system as the X-ray unit and the following parameters: 115 kVp tube voltage, 2.2 mAs tube current, and 14 ms exposure time. Six experienced radiologists were invited to read all the raw CXR data and diagnose the cases as data annotation. To enhance diagnostic accuracy, we divided them into three groups: two in group A, two in group B, and the other two in group C. Groups A and B read all the CXR images independently; if their annotation results differed, the images were handed to group C for an independent review and final ruling, and all the abnormal images were discussed to achieve consensus categorization. These radiologists come from three different national institutions and have more than ten years’ experience in pneumoconiosis; in particular, the two radiologists in group C are directors of departments in first-class hospitals in China. The diagnoses followed “Diagnosis of occupational pneumoconiosis GBZ70-2015”, the newest diagnostic standard for pneumoconiosis based on CXR images in China. This annotation procedure was applied to all the data, and all the pneumoconiosis CXR images used in this study were hospital-confirmed cases. No pre-processing or refining was applied to the raw data, because extra processing may lead to image distortion and color deviation in the synthesis procedure, which can seriously affect classification accuracy.

3.2 Datasets

The datasets consist of 2432 CXR images (3056 × 3056 pixels), including 904 pneumoconiosis and 1528 health cases. First, the CXR image files were converted from DICOM to jpg format, which made the images easier to display and process by the algorithm. After that, we removed low-quality or uncertain images and eliminated images with missing or incorrect annotations from the dataset.

Second, we center-cropped the images to a 1:1 aspect ratio and downsized them to 512 × 512 pixels. This procedure reduces the graphical memory footprint and increases training speed while retaining the details of the original image to the maximum extent.
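As a reference point, the following is a minimal sketch of this pre-processing step, assuming the DICOM files are read with pydicom and exported with Pillow; the file paths and the simple min-max intensity rescaling are illustrative assumptions rather than the exact pipeline used in this study.

```python
import numpy as np
import pydicom
from PIL import Image

def preprocess_dicom(dicom_path, out_path, size=512):
    """Center-crop a DICOM CXR to a 1:1 aspect ratio and downsize it to size x size pixels."""
    ds = pydicom.dcmread(dicom_path)
    arr = ds.pixel_array.astype(np.float32)

    # Simple min-max rescaling to 8-bit for jpg export (an illustrative assumption).
    arr = (arr - arr.min()) / (arr.max() - arr.min() + 1e-8) * 255.0
    img = Image.fromarray(arr.astype(np.uint8))

    # Center crop to the largest square region.
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))

    # Downsample to the training resolution and save as jpg.
    img.resize((size, size), Image.BILINEAR).save(out_path, quality=95)

# Hypothetical usage:
# preprocess_dicom("case_0001.dcm", "case_0001.jpg")
```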

Finally, all the images from the pneumoconiosis and health classes were randomly separated into training and test datasets. The training set included 1000 health (negative) samples and 574 pneumoconiosis (positive) samples, while the test set consisted of 217 health (negative) and 217 pneumoconiosis (positive) samples. All the data were collected with the informed consent of the radiologists involved in the study; patients’ informed consent was waived owing to the retrospective study design.

3.3 Problem formulation

Let image domains $X \subset {\mathbb{R}^{H \times W \times 3}}$ and $Y \subset {\mathbb{R}^{H \times W \times 3}}$ represent images from the health and pneumoconiosis domains, respectively. We use ${\{{{x_i}} \}_{i = 1, \cdots ,M}},{x_i} \in X$ to denote health examples and ${\{{{y_j}} \}_{j = 1, \cdots ,N}},{y_j} \in Y$ to denote pneumoconiosis examples, where $i, j$ stand for the identities of CXR images. The health-to-pneumoconiosis transfer problem amounts to learning a mapping function ${\mathrm{\Phi }_Y}:{x_i},{y_j} \to {\tilde{y}_i}$, where ${\tilde{y}_i}$ contains the pneumoconiosis attribute from ${y_j}$ and the health content from ${x_i}$. Likewise, the pneumoconiosis-to-health transfer problem can be formulated as ${\mathrm{\Phi }_X}:{x_i},{y_j} \to {\tilde{x}_j}$. Both problems can be summarized as a conditioned cross-domain image translation task, modeling the factors of data variation by learning disentangled representations.

3.4 Network architecture

Recently, researchers have paid more attention to the output variations of cross-domain image translation. Studies have shown that learning disentangled representations of latent variables has achieved considerable success in medical imaging [35]. In the context of transferring or removing potential disease regions, our goal is first to separate the latent variable describing the pneumoconiosis lesion style from the health features. For instance, infiltrates or consolidation usually appear as areas of white lung on the CXR image, because healthy tissue is replaced by pathological structures that are more radio-opaque. Under this assumption, we can synthesize new CXR images by recombining the disentangled latent codes, and the disentanglement architecture helps eliminate false correlations between pneumoconiosis and health features. Accordingly, the attribute space is defined as space A, and the content space C consists of the content features. The basic framework therefore contains content encoders $\{{E_X^c,E_Y^c} \}$, attribute encoders $\{{E_X^a,E_Y^a} \}$, generators $\{{{G_X},{G_Y}} \}$, global domain discriminators $\{{D_X^{global},D_Y^{global}} \}$ for both domains, and a content discriminator ${D^{\textrm{content}}}$, similar to DRIT.

As illustrated in Fig. 2, we encode $E_X^c({{x_i}} )= {C_i}$, $E_Y^c({{y_j}} )= {C_j}$, and $E_X^a({{x_i}} )= {A_i}$, $E_Y^a({{y_j}} )= {A_j}$ and extract the content and attribute from a health image and a pneumoconiosis image, after that, the content and attribute encodings from each domain are recombined and fed into the generators to synthesize the health output ${\tilde{x}_j}$ and pneumoconiosis output ${\tilde{y}_i}$. This process is shown as:

$${G_X}({{A_i},{C_j}} )= {\tilde{x}_j}\quad \textrm{and}\quad {G_Y}({{A_j},{C_i}} )= {\tilde{y}_i}.$$

The generators are constructed as an approximate U-Net architecture. We concatenate the latent codes from the A and C spaces at the bottleneck, and skip connections are added between the content encoders and generators to transfer more low-frequency information and preserve more realistic components from the source domain in the synthesized image. In addition, we use two global domain discriminators $\{{D_X^{global},D_Y^{global}} \}$, one per domain, to distinguish synthesized images from real samples and encourage the generators to produce images with correct global structure and authentic details. We also introduce a content discriminator ${D^{\textrm{content}}}$, which encourages the content encoders of both domains to produce representations that it cannot differentiate.
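To make this architecture concrete, the following is a minimal PyTorch sketch of one content encoder and one generator with skip connections and bottleneck fusion of the attribute code. The layer counts, channel widths, and the 8-dimensional attribute code are illustrative assumptions; the actual LDADN modules may differ.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Stride-2 downsampling block used by the content encoder (sizes are illustrative).
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.InstanceNorm2d(cout), nn.LeakyReLU(0.2))

class ContentEncoder(nn.Module):
    """E^c: keeps intermediate feature maps so the generator can use skip connections."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.stages = nn.ModuleList([conv_block(in_ch, base),
                                     conv_block(base, base * 2),
                                     conv_block(base * 2, base * 4)])
    def forward(self, x):
        skips = []
        for stage in self.stages[:-1]:
            x = stage(x)
            skips.append(x)           # features kept for the skip connections
        content = self.stages[-1](x)  # bottleneck content code C
        return content, skips

class Generator(nn.Module):
    """G: fuses the attribute code A with the content code C at the bottleneck,
    then decodes with U-Net-style skip connections from the content encoder."""
    def __init__(self, base=32, attr_dim=8, out_ch=3):
        super().__init__()
        self.fuse = nn.Conv2d(base * 4 + attr_dim, base * 4, 1)
        self.up1 = nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(base * 2 + base * 2, base, 4, stride=2, padding=1)  # + skip
        self.up3 = nn.ConvTranspose2d(base + base, out_ch, 4, stride=2, padding=1)        # + skip
    def forward(self, content, skips, attr):
        # Broadcast the attribute vector spatially and fuse it with the content code.
        a = attr.view(attr.size(0), -1, 1, 1).expand(-1, -1, *content.shape[2:])
        h = torch.relu(self.up1(self.fuse(torch.cat([content, a], dim=1))))
        h = torch.relu(self.up2(torch.cat([h, skips[1]], dim=1)))
        return torch.tanh(self.up3(torch.cat([h, skips[0]], dim=1)))
```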

Fig. 2. The training framework of LDADN. The model is able to learn the mappings between domain X and Y with unpaired data. Note that the red lines denote the skip connections which are added between content encoders and generators.

3.5 Local discriminator

In this task, multiple overlapping local discriminators are applied to encourage the transfer of features in the lesion area, mainly realistic structures and high-frequency details. In contrast to the global discriminators, this measure helps retain the contextual information of the local patch as much as possible and transfer the lesion details without missing or blurring them.

To synthesize high-resolution local patches at a low memory cost, we employ a multi-scale discriminator consisting of 3 discriminators that share an identical network architecture but act on different image scales. The structure of our local multi-scale discriminator is shown in Fig. 3. To enlarge the receptive field of the convolutions, we build a 3-scale image pyramid by downsampling the real image by factors of 2 and 4. In this way, real and synthesized images are distinguished at three scales by an identical structure, making it easier to improve the generator’s synthetic quality. Given corresponding patches in the pneumoconiosis example ${y_j}$ and the synthesized pneumoconiosis image ${\tilde{y}_i}$, with a whole-image resolution of 512 × 512 pixels, each local discriminator takes a local image patch of 80 × 80 pixels. Owing to the overlapping design of the discriminators and the pre-trained image registration, the exact position of the local discriminators is relatively unimportant.

Fig. 3. Local patches and the structure of the local discriminators. The local patches $p_k^Y$ and $\tilde{p}_k^Y$ are cropped separately from the pneumoconiosis reference CXR image and the synthesized image as inputs to the multi-scale discriminator. Each multi-scale discriminator contains three sub-discriminators, and each sub-discriminator contains four 3 × 3 convolutional layers, each with matched spectral normalization and a leaky ReLU layer. After the last spectral normalization and a 1 × 1 convolutional layer, the sub-discriminators produce outputs of size 5 × 5 × 1, 2 × 2 × 1, and 1 × 1 × 1 for discrimination.

For the local discriminators ${\{{D_k^{\textrm{local}}} \}_{k = 1, \cdots ,K}}$ placed at landmarks inside the lung, a local patch $p_k^Y$ from the pneumoconiosis example and the corresponding local patch $\tilde{p}_k^Y$ of the synthesized image are cropped from the whole image and fed into the local discriminator $D_k^{local}$, which learns to differentiate $p_k^Y$ from $\tilde{p}_k^Y$. In this task, we train the network with K = 10 to provide maximal coverage of the lung areas.
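A minimal PyTorch sketch of one local multi-scale discriminator is given below, following the structure described in Fig. 3 (three sub-discriminators sharing an architecture, spectral normalization, leaky ReLU, and an image pyramid built by downsampling by factors of 2 and 4). The strides, channel widths, and the patch-cropping helper are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class PatchSubDiscriminator(nn.Module):
    """One sub-discriminator: four 3x3 convs with spectral norm + LeakyReLU, then a 1x1 conv."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(4):
            layers += [spectral_norm(nn.Conv2d(ch, base * 2 ** i, 3, stride=2, padding=1)),
                       nn.LeakyReLU(0.2)]
            ch = base * 2 ** i
        layers += [spectral_norm(nn.Conv2d(ch, 1, 1))]
        self.net = nn.Sequential(*layers)
    def forward(self, patch):
        return self.net(patch)

class LocalMultiScaleDiscriminator(nn.Module):
    """D_k^local: scores an 80x80 local patch at full, 1/2, and 1/4 resolution."""
    def __init__(self):
        super().__init__()
        self.subs = nn.ModuleList([PatchSubDiscriminator() for _ in range(3)])
    def forward(self, patch):
        outs = []
        for i, sub in enumerate(self.subs):
            scaled = F.avg_pool2d(patch, 2 ** i) if i > 0 else patch  # image pyramid
            outs.append(sub(scaled))
        return outs   # per-scale score maps

def crop_patch(img, center, size=80):
    """Crop a size x size patch around a landmark (row, col); landmarks are assumed
    to lie far enough from the image borders."""
    cy, cx = center
    half = size // 2
    return img[:, :, cy - half:cy + half, cx - half:cx + half]
```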

3.6 Training

The training process starts from a GAN-based architecture. The basic GAN framework contains two deep CNNs: the discriminator D aims to distinguish whether an image sample is real or synthetic, while the generator G attempts to synthesize realistic results. To achieve this goal, our network is trained by optimizing the following objective functions:

Global adversarial loss. This loss is designed to help the generator synthesize CXR images that are indistinguishable from real ones. The least squares loss is adopted to facilitate quick convergence of the model and has demonstrated stronger robustness than binary cross entropy [36]. The global adversarial loss is defined as $L_{adv}^{global} = L_X^{adv} + L_Y^{adv}$, where

$$L_X^{adv} = \frac{1}{2}{\mathbb{E}_{x \sim {P_X}}}\left[ {{{({D_X}(x) - 1)}^2}} \right] + \frac{1}{2}{\mathbb{E}_{\tilde{x} \sim {G_X}}}\left[ {{{({D_X}(\tilde{x}))}^2}} \right],$$
$$L_Y^{adv} = \frac{1}{2}{\mathbb{E}_{y \sim {P_Y}}}\left[ {{{({D_Y}(y) - 1)}^2}} \right] + \frac{1}{2}{\mathbb{E}_{\tilde{y} \sim {G_Y}}}\left[ {{{({D_Y}(\tilde{y}))}^2}} \right].$$
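In code, this least-squares objective reduces to a few lines; the generator-side term below is the standard LSGAN counterpart and is included here as an assumption for completeness.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    """Least-squares loss for a global discriminator: real samples toward 1, fakes toward 0."""
    return 0.5 * torch.mean((d_real - 1.0) ** 2) + 0.5 * torch.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Least-squares loss for the generator: push the discriminator's fake scores toward 1."""
    return 0.5 * torch.mean((d_fake - 1.0) ** 2)
```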

Content adversarial loss. This loss aims to encode the content features into a shared common latent space. Under this constraint, the content encoders of both domains learn to encode the same content variables with the same information, so that the content discriminator cannot differentiate them.

$$\begin{aligned} L_{adv}^{content} = \ & {\mathbb{E}_{x \sim {P_X}}}\left[ {\frac{1}{2}\log {D^{content}}(E_X^c(x)) + \frac{1}{2}\log \left( {1 - {D^{content}}(E_X^c(x))} \right)} \right]\\ & + {\mathbb{E}_{y \sim {P_Y}}}\left[ {\frac{1}{2}\log {D^{content}}(E_Y^c(y)) + \frac{1}{2}\log \left( {1 - {D^{content}}(E_Y^c(y))} \right)} \right] \end{aligned}$$

Reconstruction loss. The reconstruction loss of our model consists of two parts. First, ${A_i}$ and ${C_i}$ are fed into ${G_X}$ to synthesize $\tilde{x}_i^{\textrm{self}}$, and similarly ${A_j}$ and ${C_j}$ are fed into ${G_Y}$ to synthesize $\tilde{y}_j^{\textrm{self}}$; both synthesized images at this step are expected to be the same as ${x_i}$ and ${y_j}$, respectively, and this procedure is defined as self-reconstruction. Second, if we extract the attribute and content representations from the synthesized results ${\tilde{x}_j}$ and ${\tilde{y}_i}$ and swap them again to generate $\tilde{x}_i^{\textrm{cross}}$ and $\tilde{y}_j^{\textrm{cross}}$, these are also expected to be the same as ${x_i}$ and ${y_j}$; this procedure is defined as cross-cycle reconstruction. We therefore formulate the reconstruction loss with the L1 norm to enforce self- and cycle-consistency, and the reconstruction loss ${L^{recon}}$ is defined as:

$$L^{recon} = \parallel {x_i} - \tilde{x}_i^{\textrm{self}}\parallel_1 + \parallel {x_i} - \tilde{x}_i^{\textrm{cross}}\parallel_1 + \parallel {y_j} - \tilde{y}_j^{\textrm{self}}\parallel_1 + \parallel {y_j} - \tilde{y}_j^{\textrm{cross}}\parallel_1$$

KL loss. To enable stochastic sampling at test time, the attribute representations $\{{{A_i},{A_j}} \}$ encoded by $\{{E_X^a,E_Y^a} \}$ are encouraged to approximate a prior Gaussian distribution. Hence, we adopt the KL loss ${L^{KL}} = L_i^{KL} + L_j^{KL}$, where ${D_{KL}}({p\parallel q} )= \int p(x)\log \left( {\frac{{p(x)}}{{q(x)}}} \right)dx$, and

$$L_i^{KL} = \mathbb{E}\left[ {{D_{KL}}({A_i}\parallel N(0,1))} \right],$$
$$L_j^{KL} = \mathbb{E}\left[ {{D_{KL}}({A_j}\parallel N(0,1))} \right].$$
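Assuming the attribute encoders output a mean and log-variance in the usual VAE style (an assumption, as the parameterization is not spelled out above), the KL term has the familiar closed form and can be sketched as:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over the attribute dimensions
    and averaged over the batch."""
    return torch.mean(0.5 * torch.sum(mu ** 2 + logvar.exp() - logvar - 1.0, dim=1))

def sample_attribute(mu, logvar):
    """Reparameterized sample A = mu + sigma * eps, keeping the KL term differentiable."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```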

Latent regression loss. ${L^{latent}}$ is used to establish an invertible mapping between the image and the latent space, similar to BicycleGAN [37]. A random latent vector z, which stands for the attribute representation, is drawn from the Gaussian prior. We then aim to recover the latent vectors through ${\tilde{z}_i} = E_X^a({{G_X}({{C_j},{z_i}} )} )$ and ${\tilde{z}_j} = E_Y^a({{G_Y}({{C_i},{z_j}} )} )$. The latent regression loss ${L^{latent}}$ is therefore:

$${L^{latent}} = \parallel{z_i} - {\tilde{z}_i}\parallel_1 + \parallel{z_j} - {\tilde{z}_j}\parallel_1$$

Local adversarial loss. This loss is designed to help the local discriminators distinguish whether the local patches are real or synthesized. Accordingly, the generator is guided to generate a more realistic result ${\tilde{y}_i}$, consistent with the reference sample ${y_j}$ in the pneumoconiosis domain. The local adversarial learning process is constructed with the local adversarial loss defined as $L_{adv}^{local} = \sum\nolimits_k {\eta _k}L_k^{local}$, where ${\eta _k}$ denotes the weights for local patches and $L_k^{local}$ is given by:

$$L_k^{local} = {\mathbb{E}_{{y_j} \sim {P_Y}}}\left[ {\log D_k^{local}(p_k^Y)} \right] + {\mathbb{E}_{{{\tilde{y}}_i} \sim {G_Y}}}\left[ {\log ({1 - D_k^{local}(\tilde{p}_k^Y)} )} \right]$$

Local Laplacian filter loss. As we aim to transfer the lesion feature from the abnormal image to the normal one, using only the local adversarial loss still leaves problems in this task. One of the most notable is that the lesion area contains many small-scale high-frequency components, such as infiltrates, nodules, and fibrosis, which differ from the surrounding contextual information; this kind of detailed information can hardly be transferred and may even be lost. We therefore employ Laplacian filters to encourage the transfer of high-frequency details and help disentangle the lesion style.

As shown in Fig. 4, we apply Laplacian filters to ${y_j}$ and ${\tilde{y}_i}$ to extract high-frequency information. After this, we crop the filtered images into local patches $f_k^Y$ and $\tilde{f}_k^Y$ at the same locations as the local patches used in the local discriminators. The local Laplacian filter loss is defined as:

$$L_{lap}^{local} = \sum\limits_k {\eta _k}\parallel f_k^Y - \tilde{f}_k^Y\parallel_1$$
where ${\eta _k}$ stands for the weight of local patches. In this way, we can stably capture high-frequency texture around the key regions. This loss works in tandem with local adversarial loss and other losses to facilitate disentanglement of the pneumoconiosis latent code.
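A possible implementation of this loss applies a discrete 3 × 3 Laplacian kernel channel-wise and compares the filtered patches with an L1 distance; the exact kernel and the patch-cropping convention below are assumptions.

```python
import torch
import torch.nn.functional as F

# Discrete 3x3 Laplacian kernel (one common approximation; the exact kernel is an assumption).
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_filter(img):
    """Apply the Laplacian filter channel-wise to extract high-frequency components."""
    c = img.size(1)
    kernel = _LAPLACIAN.to(img.device, img.dtype).repeat(c, 1, 1, 1)
    return F.conv2d(img, kernel, padding=1, groups=c)

def local_laplacian_loss(real, fake, centers, weights, patch=80):
    """Weighted L1 distance between Laplacian-filtered local patches of the real and
    synthesized images, as in Eq. (10); `centers` holds the (row, col) landmark positions."""
    f_real, f_fake = laplacian_filter(real), laplacian_filter(fake)
    h = patch // 2
    loss = 0.0
    for (cy, cx), eta in zip(centers, weights):
        loss = loss + eta * F.l1_loss(f_real[:, :, cy - h:cy + h, cx - h:cx + h],
                                      f_fake[:, :, cy - h:cy + h, cx - h:cx + h])
    return loss
```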

Fig. 4. Flow diagram of the Laplacian filter loss. The pneumoconiosis image ${y_j}$ and the synthesized image ${\tilde{y}_i}$ are passed through the Laplacian filter f, and local patches $f_k^Y$ and $\tilde{f}_k^Y$ are then obtained from the filtered images ${f_j}$ and ${\tilde{f}_i}$ for calculating the L1 loss.

Total loss. The final objective loss is

$$\begin{aligned} {L^{total}} = &{\lambda _{global}}L_{adv}^{global} + {\lambda _{content}}L_{\textrm{adv}}^{content} + {\lambda _{recon}}{L^{recon}} + {\lambda _{KL}}{L^{KL}}\\& + {\lambda _{latent}}{L^{latent}} + {\lambda _{local}}L_{adv}^{local} + {\lambda _{lap}}L_{lap}^{local} \end{aligned}$$
where the hyper-parameters ${\lambda _{global}}$, ${\lambda _{content\textrm{ }}}$, ${\lambda _{recon\textrm{ }}}$, ${\lambda _{KL}}$, ${\lambda _{latent}}$, ${\lambda _{local}}$, ${\lambda _{lap}}$ are the weights to adjust the balance of the objectives.

For the training details, we set ${\lambda _{global}}$ = 1, ${\lambda _{content}}$ = 1, ${\lambda _{recon}}$ = 10, ${\lambda _{KL}}$ = 0.01, ${\lambda _{latent}}$ = 10, ${\lambda _{local}}$ = 1, and ${\lambda _{lap}}$ = 3. We use K = 10 local discriminators and use $\eta$ to adjust the importance of each patch. Patches at mirrored positions of the left and right lungs share the same weight; hence the 10 local patches are constrained by 5 values of $\eta$: ${\eta _1}$ = 1.5, ${\eta _2}$ = 3, ${\eta _3}$ = 6, ${\eta _4}$ = 3, ${\eta _5}$ = 1.5. We trained the network for 1000 epochs with the Adam optimizer. The learning rate was set to 0.001 and the hyper-parameters (${\beta _1}$, ${\beta _2}$) = (0.5, 0.999). The input and output image size is 512 × 512 pixels, with a batch size of 1 owing to the limit of GPU memory.
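For reference, the weighting of Eq. (11) with the values above can be sketched as follows; the ordering of the 10 patch weights and the grouping of the generator-side parameters into one optimizer are assumptions.

```python
import torch

# Objective weights and local-patch weights from the training details above.
lambdas = {'global': 1.0, 'content': 1.0, 'recon': 10.0, 'kl': 0.01,
           'latent': 10.0, 'local': 1.0, 'lap': 3.0}
eta = [1.5, 3.0, 6.0, 3.0, 1.5]     # one value per mirrored left/right patch pair
eta_k = eta + eta[::-1]             # 10 patch weights (the pairing order is an assumption)

def total_loss(terms):
    """Weighted sum of the objective terms in Eq. (11); `terms` maps names to scalar losses."""
    return (lambdas['global'] * terms['adv_global'] + lambdas['content'] * terms['adv_content']
            + lambdas['recon'] * terms['recon'] + lambdas['kl'] * terms['kl']
            + lambdas['latent'] * terms['latent'] + lambdas['local'] * terms['adv_local']
            + lambdas['lap'] * terms['lap'])

# Adam optimizer with the stated hyper-parameters; `generator_modules` is a placeholder
# for the encoders and generators (the discriminators get their own optimizer in practice).
# opt_g = torch.optim.Adam((p for m in generator_modules for p in m.parameters()),
#                          lr=1e-3, betas=(0.5, 0.999))
```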

Network training was implemented with the PyTorch 1.1.0 backend on a server equipped with an NVIDIA Quadro GV100 graphics processing unit (GPU) with 32 GB of memory.

4. Results

In this section, we present the experimental studies, including visualization results of the network outputs, the Grad-CAM heat maps of the discriminators, the t-SNE embeddings, ablation studies on the optimizations, the image generation assessment, and the data augmentation results and evaluation. These studies aim to demonstrate the performance of the proposed LDADN in CXR image synthesis and the resulting improvement in pneumoconiosis detection.

4.1 Visualization results

Because the disentangled representation and the KL loss, together with the latent regression loss, regularize the attribute vectors, we are able to synthesize images based on either randomly sampled vectors from the attribute space or attribute vectors extracted from reference images. The two testing modes are displayed in Fig. 5.

Fig. 5. Testing method. In testing, the network can synthesize images based on random attributes (left) or reference attributes (right).

We present the attribute transfer outputs in Fig. 6. As the content and attribute representations have been disentangled, we can transfer attributes using images with the desired attributes. In addition, since the content space is shared, we can also synthesize images based on the content feature extracted from images in either domain.

Fig. 6. Visualization results of LDADN. The left column shows the network inputs from the health and pneumoconiosis domains, respectively. The right part exhibits the outputs of the network, which interchange the attributes of the two input CXR images and are synthesized with a random vector and the given reference, respectively.

4.2 Grad-CAM heat map of discriminators

The internal mechanisms and the reasoning behind the discriminators’ decision-making can be interpreted with visualization techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) [38]. Grad-CAM uses the gradients of the target class flowing into the network’s last convolutional layer to create an activation map that highlights the regions crucial to the network’s prediction. A CXR image example with Grad-CAM visualization is presented in Fig. 7.
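A compact sketch of Grad-CAM applied to a discriminator is shown below, assuming a recent PyTorch hook API and using the mean realness score as the scalar that drives the gradients; the choice of target layer is left to the caller and is an assumption.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image):
    """Grad-CAM for a (global or local) discriminator: weight the target layer's feature
    maps by their average gradients and keep the positive part as a heat map."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(image).mean()          # scalar realness score driving the explanation
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)            # global-average-pooled gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
    cam = F.interpolate(cam, size=image.shape[2:], mode='bilinear', align_corners=False)
    return cam / (cam.max() + 1e-8)                              # normalized heat map
```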

Fig. 7. The Grad-CAM heat maps on the proposed local discriminators and global discriminator. Original CXR image (left), Grad-CAM heat maps on local patches (middle), and Grad-CAM heat map on the global image (right). The total Grad-CAM provided by all the discriminators can be viewed as the weighted sum of the local discriminators plus the Grad-CAM from the global discriminator. It can be regarded as a visualization of the discriminators’ total gradient for decision-making.

In contrast to the common use on classifiers, we apply Grad-CAM to our global and local discriminators to obtain both global and local heat maps. The heat map of the global discriminator indicates that it is well trained and that the components in the lung regions play a dominant role in its decision-making. The heat maps from the local discriminators further reveal what they add to the global discriminator, especially for subtle high-frequency details, and each heat map displays independent decision-making information. This method helps interpret how the discriminators are trained to make decisions and ultimately guide the generator to synthesize realistic, high-quality CXR images.

4.3 t-SNE embedding

We further confirmed the results by applying t-SNE [39] visualization, as it plays a vital role in data analysis and interpretation. The t-SNE algorithm embeds high-dimensional data into a low-dimensional space (2D or 3D) through dimension reduction while retaining the overall structure of the data; when initialized with an appropriate embedding, t-SNE preserves most of the global and local structure. The two-dimensional t-SNE visualizations of the features, exhibited in Fig. 8, show the difference between the original and synthesized data.
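Embeddings of this kind can be reproduced in principle with scikit-learn’s t-SNE; the sketch below assumes the CXR images have already been reduced to feature vectors (e.g., penultimate-layer CNN activations) with integer class labels, which is an assumption about the feature extraction step.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, names=("health", "pneumoconiosis", "synthetic")):
    """Embed high-dimensional image features into 2-D with t-SNE and scatter-plot by class."""
    emb = TSNE(n_components=2, init="pca", perplexity=30, random_state=0).fit_transform(features)
    for c, name in enumerate(names):
        pts = emb[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
    plt.legend()
    plt.axis("off")
    plt.show()
```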

Fig. 8. Two-dimensional t-SNE embeddings: (a) health vs. pneumoconiosis; (b) health vs. pneumoconiosis + Random; (c) health vs. pneumoconiosis + Reference.

Note that the synthetic samples all fall within the pneumoconiosis distribution, which reveals a clear decision boundary between the two classes. Figure 8(a) shows the original data distributions for health and pneumoconiosis CXRs, which contain some outliers. When we embed the synthetic samples from + Random and + Reference into the feature space, in Fig. 8(b) and Fig. 8(c) respectively, the distributions of the synthetic images from both + Random and + Reference resemble the pneumoconiosis distribution and maintain an appropriate decision boundary. Moreover, the distribution of + Reference exhibits higher intra-class diversity and fewer outliers than + Random. In general, the t-SNE embedding shows good dimension reduction and separation capability; used for data visualization, it provides new evidence and intuition for the increase in synthetic quality and disease detection performance.

4.4 Ablation study

We also demonstrate the efficacy of the optimization process through an ablation study in Table 1. The optimization of the network takes DRIT as the baseline and comprises three aspects: replacing the binary cross entropy global adversarial loss with the least squares loss (LSL), adding local discriminators (LD) with their corresponding losses, and adding skip connections (SC) between the content encoders and generators; -Random and -Reference stand for the outputs conditioned on random attributes and reference attributes, respectively. In this study, we successively remove the optimization steps from the network and train it on the same dataset to evaluate how each step affects network performance. From the Fréchet Inception Distance (FID) scores in Table 1, it is clear that the least squares loss dramatically improves the quality of the output by enabling better convergence. The local discriminators and skip connections also positively impact the overall network outputs. All the evaluation scores improve steadily except for the Inception Score (IS). As a result, all these optimization steps are worth preserving.


Table 1. Results of the ablation experiment. Values in the table are formatted as mean ± std. dev.a

4.5 Image generation assessment

Evaluating the image generation quality of generative models has always been a high priority in current research, mainly because of the lack of unified and precise evaluation metrics. The evaluation metrics for generative models like GANs are broadly divided into two categories, qualitative and quantitative. Qualitative methods mainly depend on the visual perception of human observers and are therefore often limited to the evaluation of a single image; from this perspective, they overlook the output’s intra-domain diversity, overfitting, and mode collapse. In this task, we quantitatively evaluate the proposed model with several chosen metrics: Inception Score (IS) [40], Fréchet Inception Distance (FID) [41], Kernel Inception Distance (KID) [42], and Learned Perceptual Image Patch Similarity (LPIPS) [43]. With these metrics, the image generation quality of the models can be compared and evaluated objectively.

We trained the LDADN model on the dataset described in the previous section. For comparison, several standard data augmentation methods reported in the literature, namely CycleGAN, DualGAN, UNIT, MUNIT, and DRIT, were also trained on the same dataset to synthesize CXR images. The quantitative assessment results in Table 2 show that the proposed LDADN outperforms all the benchmark approaches, with an FID of 0.650, KID of 5.168 ± 1.000, and LPIPS of 0.375 ± 0.050, whereas the FID scores of the other models all exceed 0.776 and their KID scores exceed 6.29. Compared with those methods, the LDADN model greatly improves the two key evaluation metrics, which indicates the significance of our method. On the other hand, our model does not achieve the highest IS. A possible reason is that IS measures samples against natural images from ImageNet: a high IS often indicates that the sample image resembles a specific ImageNet category. As a result, although IS is still a useful indicator of synthetic image quality, it does not accurately measure the distribution similarity between generated samples and the target domain. By contrast, FID, KID, and LPIPS serve as better measures for generative models because of their robustness, discriminability, and capability to detect mode collapse. In general, the quantitative results in Table 2 show that LDADN outperforms the other advanced models in FID, KID, and LPIPS, revealing that LDADN can synthesize images with higher quality and finer details; more importantly, the distribution of the synthesized images is constrained to the pneumoconiosis domain.
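As an illustration of how such scores can be computed, the sketch below uses the torchmetrics implementations of FID and KID on uint8 image tensors; LPIPS can be computed analogously with the lpips package. The feature setting and subset size are assumptions and may differ from the configuration used for Table 2.

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

def evaluate_synthesis(real_uint8, fake_uint8, kid_subset=50):
    """Compute FID and KID between batches of real and synthesized CXR images,
    given as uint8 tensors of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    kid = KernelInceptionDistance(subset_size=kid_subset)
    for metric in (fid, kid):
        metric.update(real_uint8, real=True)
        metric.update(fake_uint8, real=False)
    kid_mean, kid_std = kid.compute()
    return {"FID": fid.compute().item(), "KID": (kid_mean.item(), kid_std.item())}
```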


Table 2. The comparison results of LDADN and other standard methods on CXR image synthesis. Values in the table are formatted with mean ± std. dev.a

4.6 Data augmentation results and evaluation


When LDADN is applied to pneumoconiosis detection or classification as a data augmentation method, 426 pneumoconiosis CXR images were synthesized using random attributes and reference attributes separately; these images are denoted + Random and + Reference. We demonstrate the effectiveness of the augmentation sets using five classification models, including Inception V3, VGG16, ResNet50, MobileNet, and DenseNet121. The assessment metrics include accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC).

According to previous studies, pretraining boosts CXR classification [44]. We therefore applied transfer learning by first loading parameter weights from each of the ImageNet-pretrained models and then freezing the first three layers of the pre-trained networks; in this way, both the number of trainable parameters and the training time are reduced. Finally, we fine-tuned the classifiers on our dataset. In the experiments, the classifiers were first trained with the original imbalanced dataset, which contains 1000 health and 574 pneumoconiosis CXR images. We then demonstrated data augmentation by adding the 426 generated pneumoconiosis CXR images from set + Random and set + Reference to the original training set separately, bringing the numbers of health and pneumoconiosis images to a 1:1 ratio. The test set comprised 217 pneumoconiosis and 217 health images. Training was conducted three times in total, each for 50 epochs with a learning rate of 0.001. After training finished, the results of the chosen metrics for the different classifiers on the test set were recorded for comparison.
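A minimal sketch of this transfer-learning setup for DenseNet121 is shown below; the interpretation of "the first three layers" as the first three child modules of the feature extractor is an assumption, and the other four classifiers are handled analogously.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_densenet121_classifier(num_classes=2, n_frozen=3):
    """ImageNet-pretrained DenseNet121 with the earliest feature blocks frozen and a new
    classification head for health vs. pneumoconiosis."""
    model = models.densenet121(pretrained=True)
    # Freeze the first few child modules of the feature extractor.
    for child in list(model.features.children())[:n_frozen]:
        for p in child.parameters():
            p.requires_grad = False
    # Replace the ImageNet head with a 2-class head.
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

# Fine-tuning setup mirroring the paper: 50 epochs, learning rate 1e-3.
# model = build_densenet121_classifier()
# optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
```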

Based on Table 3, it is apparent that: 1) All the evaluated models trained with augmented CXR image data from either + Random or + Reference perform better than those trained with the original imbalanced dataset. For example, the classification accuracy of DenseNet121 increases from 92.40% to 99.31% after the 426 generated images are added to the training set, and the average accuracy across these methods rises by 3.5% with + Random and 6.45% with + Reference. 2) Compared to + Random, +Reference provides a more significant increase in classification performance; we attribute this to the given reference from the desired domain enabling the model to learn better attribute representations for pneumoconiosis, which are more constrained than random vectors. 3) The relative improvement of the classification models tends to decrease as model complexity increases, because of over-parametrization on a comparatively small training dataset: adding synthetic data to the training set increases performance, but the improvement diminishes as the model architecture becomes more complex. 4) DenseNet121 outperforms all the other baseline methods, suggesting that the densely connected feature maps preserve discriminative information and allow the network to extract features for classification more efficiently. Overall, the results shown in Table 3 demonstrate the effectiveness of the data augmentation method with our proposed LDADN for pneumoconiosis detection.


Table 3. Pneumoconiosis detection performance on the health vs. pneumoconiosis test set when training the following models with the original dataset, +Random synthesized data, and +Reference synthesized data.a

Following the same process as for pneumoconiosis detection, the original CXRs and the synthetic images generated by CycleGAN, DualGAN, UNIT, MUNIT, and DRIT were fed into DenseNet121 for health vs. pneumoconiosis classification. Table 4 reports the classification results for the different data augmentation input configurations in terms of accuracy, sensitivity, specificity, and AUC. As shown in Table 4, LDADN outperforms the other methods in all assessment metrics; for example, its accuracy is 3.92% higher than that of the other methods. When the synthetic images produced by some of the other standard methods (i.e., CycleGAN, DualGAN) are used, the detection performance even deteriorates because of their generation quality. The results listed in Table 4 clearly demonstrate the effectiveness of LDADN for augmenting pneumoconiosis detection and implicitly indicate the higher image generation quality of our method.


Table 4. Accuracy, sensitivity, specificity and AUC results of DenseNet121 models using different data augmentation methods for pneumoconiosis detection

5. Conclusion

This study established a framework for CXR image analysis using our proposed LDADN augmentation method to improve pneumoconiosis detection. For the synthesis of CXR images, a local discriminant auxiliary disentangled network is proposed based on disentangled representation and adversarial learning. By utilizing local multi-scale discriminators and a Laplacian filter with the corresponding objective loss functions in the lesion area, the proposed model is able to disentangle the lesion style from abnormal CXR images and transfer it to normal ones, generating new pneumoconiosis images with high quality and diversity. The CXR image generation quality outperforms other advanced methods on several quantitative assessment metrics. Moreover, the synthetic images can be further used to augment pneumoconiosis detection: with the transfer-learning-aided classification models, the accuracy of pneumoconiosis detection is significantly improved to 99.31%. In the future, the whole process of this study can also be applied to analyze other types of medical images, such as data with insufficient labels or class imbalance.

Funding

State Key Laboratory of Industrial Control Technology (ICT1806); Key Research and Development Program Projects of Zhejiang Province (2018C03G2011156).

Acknowledgements

We would like to extend our sincere gratitude to Shandong Academy of Medical Sciences and Shandong Provincial Chest Hospital for the medical data and expert guidance.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Y. Shekarian, E. Rahimi, N. Shekarian, M. Rezaee, and P. Roghanchi, “An analysis of contributing mining factors in coal workers’ pneumoconiosis prevalence in the United States coal mines, 1986-2018,” Int. J. Coal Sci. Technol. 8, 1227 (2021). [CrossRef]

2. D. Mandrioli, V. Schlünssen, B. Ádám, R. A. Cohen, C. Colosio, W. Chen, A. Fischer, L. Godderis, T. Göen, I. D. Ivanov, N. Leppink, S. Mandic-Rajcevic, F. Masci, B. Nemery, F. Pega, A. Prüss-Üstün, D. Sgargi, Y. Ujita, S. van der Mierden, M. Zungu, and P. T. J. Scheepers, “WHO/ILO work-related burden of disease and injury: Protocol for systematic reviews of occupational exposure to dusts and/or fibres and of the effect of occupational exposure to dusts and/or fibres on pneumoconiosis,” Environ. Int. 119, 174–185 (2018). [CrossRef]  

3. L. Zhang, L. Zhu, Z. Li, J. Li, H. Pan, S. Zhang, W. Qin, and L. He, “[Analysis on the disease burden and its impact factors of coal worker's pneumoconiosis inpatients],” Journal of Peking University Health Sciences 46, 226–231 (2014).

4. J. M. Mazurek, J. Wood, D. J. Blackley, and D. N. Weissman, “Coal workers’ pneumoconiosis-attributable years of potential life lost to life expectancy and potential life lost before age 65 years: United States, 1999-2016,” MMWR Morb. Mortal. Wkly. Rep. 67(30), 819–824 (2018). [CrossRef]  

5. S. Binay, P. Arbak, A. A. Safak, E. G. Balbay, C. Bilgin, and N. Karatas, “Does periodic lung screening of films meets standards?” Pak J Med Sci. 32(6), 1506–1511 (2016). [CrossRef]  

6. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 (2014).

7. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2826 (2016).

8. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385 (2015).

9. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861 (2017).

10. G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” arXiv:1608.06993 (2016).

11. D. S. Kermany, M. Goldbaum, W. Cai, C. C. S. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, J. Dong, M. K. Prasadha, J. Pei, M. Y. L. Ting, J. Zhu, C. Li, S. Hewett, J. Dong, I. Ziyar, A. Shi, R. Zhang, L. Zheng, R. Hou, W. Shi, X. Fu, Y. Duan, V. A. N. Huu, C. Wen, E. D. Zhang, C. L. Zhang, O. Li, X. Wang, M. A. Singer, X. Sun, J. Xu, A. Tafreshi, M. A. Lewis, H. Xia, and K. Zhang, “Identifying medical diagnoses and treatable diseases by image-based deep learning,” Cell (Cambridge, MA, U. S.) 172(5), 1122–1131 (2018). [CrossRef]  

12. E. Okumura, I. Kawashita, and T. Ishida, “Computerized classification of pneumoconiosis on digital chest radiography artificial neural network with three stages,” J Digit Imaging 30(4), 413–426 (2017). [CrossRef]  

13. Y. Luo, Q. Xu, R. Jin, M. Wu, and L. Liu, “Automatic detection of retinopathy with optical coherence tomography images via a semi-supervised deep learning method,” Biomed. Opt. Express 12(5), 2684 (2021). [CrossRef]  

14. S. Bharati, P. Podder, and M. R. H. Mondal, “Hybrid deep learning for detecting lung diseases from X-ray images,” Informatics in Medicine Unlocked 20(100391), 100391 (2020). [CrossRef]  

15. N. Abhange, S. Gat, and S. Paygude, “COVID-19 detection using convolutional neural networks and InceptionV3,” 2021 2nd Global Conference for Advancement in Technology (GCAT) (2021), 1–5.

16. F. Yan, X. Huang, Y. Yao, M. Lu, and M. Li, “Combining LSTM and DenseNet for automatic annotation and classification of chest x-ray images,” IEEE Access 7, 74181–74189 (2019). [CrossRef]  

17. T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional GANs,” 2018 IEEE/CVF Conference On Computer Vision And Pattern Recognition (CVPR) (2018), pp. 8798–8807.

18. C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” J. Big Data 6(1), 60 (2019). [CrossRef]  

19. Z. Lin, J. Sun, A. Davis, and N. Snavely, Visual Chirality (IEEE, 2020).

20. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Commun. ACM 63(11), 139–144 (2020). [CrossRef]  

21. X. Yi, E. Walia, and P. Babyn, “Generative adversarial network in medical imaging: a review,” Med. Image Anal. 58, 101552 (2019). [CrossRef]  

22. S. Kazeminia, C. Baur, A. Kuijper, B. V. Ginneken, N. Navab, S. Albarqouni, and A. Mukhopadhyay, “GANs for Medical Image Analysis,” Artif. Intell. Med. 109, 101938 (2018). [CrossRef]  

23. S. Kora Venu and S. Ravula, “Evaluation of deep convolutional generative adversarial networks for data augmentation of chest x-ray images,” Future Internet 13(1), 8 (2020). [CrossRef]  

24. M. Gan and C. Wang, “Esophageal optical coherence tomography image synthesis using an adversarially learned variational autoencoder,” Biomed. Opt. Express 13(3), 1188 (2022). [CrossRef]  

25. Y. He, J. Li, S. Shen, K. Liu, K. K. Wong, T. He, and S. T. C. Wong, “Image-to-image translation of label-free molecular vibrational images for a histopathological review using the UNet+/seg-cGAN model,” Biomed. Opt. Express 13(4), 1924 (2022). [CrossRef]  

26. M. Sommersperger, A. Martin-Gomez, K. Mach, P. L. Gehlbach, M. Ali Nasseri, I. Iordachita, and N. Navab, “Surgical scene generation and adversarial networks for physics-based iOCT synthesis,” Biomed. Opt. Express 13(4), 2414 (2022). [CrossRef]  

27. .J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” 2017 IEEE International Conference On Computer Vision (ICCV), 2242–2251 (2017).

28. Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: unsupervised dual learning for image-to-image translation,” 2017 IEEE International Conference On Computer Vision (ICCV), 2868–2876 (2017).

29. X. Huang, M. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” Computer Vision - ECCV 2018, Part III, LNCS 11207, 179–196 (2018).

30. H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang, “Diverse image-to-image translation via disentangled representations,” Computer Vision - ECCV 2018, Part I, LNCS 11205, 36–52 (2018).

31. C. Qin, B. Shi, R. Liao, T. Mansi, D. Rueckert, and A. Kamen, Unsupervised Deformable Registration for Multi-Modal Images via Disentangled Representations (Springer, 2019).

32. A. Chartsias, T. Joyce, R. Dharmakumar, and S. A. Tsaftaris, Adversarial Image Synthesis for Unpaired Multi-modal Cardiac Data (Springer, 2017).

33. C. Han, L. Rundo, R. Araki, Y. Nagano, Y. Furukawa, G. Mauri, H. Nakayama, and H. Hayashi, “Combining Noise-to-Image and Image-to-Image GANs: Brain MR Image Augmentation for Tumor Detection,” IEEE Access 7, 156966–156977 (2019). [CrossRef]  

34. Y. Tang, Y. Tang, Y. Zhu, J. Xiao, and R. M. Summers, “A disentangled generative model for disease decomposition in chest X-rays via normal image synthesis,” Medical Image Analysis 67(101839), 101839 (2021). [CrossRef]  

35. C. Chen, Q. Dou, Y. Jin, H. Chen, J. Qin, and P. A. Heng, “Robust Multimodal Brain Tumor Segmentation via Feature Disentanglement and Gated Fusion,” arXiv:2002.09708 (2020).

36. X. Mao, Q. Li, H. Xie, R. Lau, and S. P. Smolley, “Least Squares Generative Adversarial Networks,” (IEEE, 2017).

37. J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “Toward Multimodal Image-to-Image Translation,” Advances in Neural Information Processing Systems 30 (NIPS 2017).

38. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization,” 2017 IEEE International Conference On Computer Vision (ICCV), 618–626 (2017).

39. L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res. 9, 2579–2605 (2008).

40. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” arXiv:1606.03498 (2016).

41. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” Advances in Neural Information Processing Systems 30 (NIPS 2017).

42. M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, “Demystifying MMD GANs,” arXiv:1801.01401 (2018).

43. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 586–595 (2018).

44. A. Ke, W. Ellsworth, O. Banerjee, A. Y. Ng, and P. Rajpurkar, “CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation,” arXiv:2101.06871 (2021).
