
Polarization-driven semantic segmentation via efficient attention-bridged fusion

Open Access

Abstract

Semantic segmentation (SS) is promising for outdoor scene perception in safety-critical applications such as autonomous vehicles and assisted navigation. However, traditional SS is primarily based on RGB images, which limits its reliability in complex outdoor scenes, where RGB images lack the necessary information dimensions to fully perceive unconstrained environments. As a preliminary investigation, we examine SS in an unexpected obstacle detection scenario, which demonstrates the necessity of multimodal fusion. Thereby, in this work, we present EAFNet, an Efficient Attention-bridged Fusion Network, to exploit complementary information coming from different optical sensors. Specifically, we incorporate polarization sensing to obtain supplementary information, considering its optical characteristics for robust representation of diverse materials. Using a single-shot polarization sensor, we build the first RGB-P dataset, which consists of 394 annotated pixel-aligned RGB-polarization images. A comprehensive set of experiments shows the effectiveness of EAFNet in fusing polarization and RGB information, as well as its flexibility to be adapted to other sensor combination scenarios.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

With the development of deep learning, outdoor scene perception and understanding have become a popular topic in the areas of autonomous vehicles, navigation assistance systems for vulnerable road users like visually impaired pedestrians, and mobile robotics [1]. Semantic Segmentation (SS) is the task of assigning a semantic label to each pixel of an image, i.e., object classification at the pixel level, which is promising for outdoor perception applications [2]. A multitude of SS neural networks have been proposed following the trend of deep learning, such as FCN [3], U-Net [4], ERFNet [5] and SwiftNet [6].

However, the networks mentioned above mainly focus on the segmentation of RGB images, which makes it hard to fully perceive complex surrounding scenes because of the limited color information. Many works on domain adaptation have been presented to cope with SS under conditions without sufficient optical information [7,8]. Yet, a high level of safety needs to be guaranteed for outdoor scene perception to support safety-critical applications like autonomous vehicles, where algorithmic advancement alone is insufficient. Multimodal semantic segmentation, which incorporates heterogeneous imaging techniques and can leverage various optical information like depth, infrared and event-based data [9,10], therefore needs to be researched. In this paper, we employ polarization as the supplementary sensor information to advance the performance of RGB-based SS, considering its optical characteristics for robust representation of diverse materials. Polarization information is promising for advancing the segmentation of outdoor objects that possess polarization features. With this rationale, this work advocates polarization-driven multimodal SS, which is rarely explored in the literature.

To better explain the necessity of multimodal semantic segmentation, i.e., that RGB sensors alone cannot cope with complex outdoor scene perception, we conduct a preliminary investigation in an unexpected obstacle detection scenario. In outdoor scenes, many unexpected obstacles like tiny animals and boxes are risk factors for safe driving. We choose the Lost and Found dataset [11] for this experiment. The dataset is acquired by a pair of cameras with a baseline distance of 23 cm in 13 challenging outdoor traffic scenes, with 37 different categories of tiny obstacles set up, and provides three types of data, as shown in Fig. 1, i.e., RGB images, disparity images and ground-truth labels. The annotations contain 3 categories, i.e., coarse annotations of passable areas, fine-grained annotations of unexpected tiny obstacles, and background, at a resolution of 1024$\times$2048. Among the images, 1036 are selected as the training set, while the remaining 1068 are selected as the validation set. Considering that outdoor scene perception applications demand high efficiency, we select a real-time network, SwiftNet [6], to conduct the experiment. We only take the RGB images as the input to train the network; the other training settings are described in Section 4.1. The detailed results are shown in Table 3. The precision of obstacle and passable-area segmentation is 26.4% and 85.4%, their recall is 49.8% and 63.9%, and their Intersection over Union (IoU) is 20.9% and 56.2%, respectively. In addition, the qualitative results of the experiment are illustrated in Fig. 2. The results show that severe over-fitting occurs, and the model trained merely with RGB images cannot satisfactorily detect small, unexpected obstacles.

Fig. 1. An example of Lost and Found dataset: (a) is the RGB image. (b) is the disparity image. (c) is the label, where the blue area denotes the obstacles, the purple area denotes the passable area and the black area denotes the background.

Fig. 2. The effect of SwiftNet [6] trained merely with RGB images.

According to the toy experiment above, the model’s performance is unacceptable when trained only with RGB images. Thereby, we consider it necessary to incorporate additional sensor information for semantic segmentation to perceive outdoor traffic scenes. As mentioned above, we select polarization as the complementary information, whose potential has been shown in our previous works [12,13] for water hazard detection. In this work, we leverage a novel single-shot RGB-P imaging sensor and investigate polarization-driven semantic segmentation. To sufficiently fuse RGB-P information, we propose the Efficient Attention-bridged Fusion Network (EAFNet), enabling adaptive interaction of cross-modal features. In summary, we deliver the following contributions:

  • Addressing polarization-driven semantic segmentation, we propose EAFNet, an efficient attention-bridged fusion network, which fuses multimodal sensor information with a lightweight fusion module, advancing many categories’ accuracy, especially categories with polarization characteristics like glass, whose IoU is lifted to 79.3% from 73.4%. The implementations and codes will be made available at https://github.com/Katexiang/EAFNet.
  • With a single-shot polarization imaging sensor, we present an RGB-P outdoor semantic segmentation dataset. To the best of our knowledge, this is the first RGB-P outdoor semantic segmentation dataset, which will be made publicly accessible at http://www.wangkaiwei.org/download.html.
  • We conduct a series of experiments to demonstrate the effectiveness of EAFNet with comprehensive analysis, along with a supplementary experiment that verifies EAFNet’s generalization capability for fusing other sensing data besides polarization information.

2. Related work

2.1 From accurate to efficient semantic segmentation

Convolutional Neural Networks (CNNs) have been the mainstream solution to semantic segmentation since Fully Convolutional Networks (FCNs) [3] approached the dense recognition task in an end-to-end way. SegNet [14] and U-Net [4] presented encoder-decoder architectures, which are widely used in the following networks. Benefiting from deep classification models like ResNets [15], PSPNet [16] and DeepLab [17] constructed multi-scale representations and achieved significant accuracy improvements. Inspired by the channel attention method proposed in SENet [18], EncNet [19] encoded global image statistics, while HANet [20] explored height-driven contextual priors. ACNet [21] leveraged attention connections and bridged multi-branch ResNets to exploit complementary features. In another line, DANet [22] and OCNet [23] aggregated dense pixel-pair associations. These works have pushed the boundary of segmentation accuracy and attained excellent performances on existing benchmarks.

In addition to accuracy, the efficiency of segmentation CNNs is crucial for real-time applications. Efficient networks such as ERFNet [5] and SwiftNet [6] were designed, built on techniques including early downsampling, filter factorization, multi-branch setups and ladder-style upsampling. Some efficient CNNs [24,25] also leveraged attention connections, trying to improve the trade-off between segmentation accuracy and computation complexity. With these advances, semantic segmentation can be performed both swiftly and accurately, and has thereby been incorporated into many optical sensing applications such as semantic cognition systems [26] and semantic visual odometry [24,27].

2.2 From RGB-based to multimodal semantic segmentation

While ground-breaking network architectural advances have been achieved in single RGB-based semantic segmentation on existing RGB image segmentation benchmarks such as Cityscapes [28], BDD [29] and Mapillary Vistas [30], in some complex environments or under challenging conditions, it is necessary to employ multiple sensing modalities that provide complementary information of the same scene. Comprehensive surveys on multimodal semantic segmentation were presented in [2,9]. In the literature, researchers explored RGB-Depth [21,25], RGB-Infrared [31,32], RGB-Thermal [33,34], GRAY-Polarization [35,36] and Event-based [10,37] semantic segmentation to improve the reliability of surrounding sensing and the applicability towards real-world applications. For example, RFNet [25] fused RGB-D information on heterogeneous datasets, improving the robustness of SS in road-driving scenes with small-scale, unexpected obstacles.

In this work, we focus on RGB-P semantic segmentation by using a single-shot polarization camera. Traditional polarization-driven dense prediction frameworks were mainly dedicated to the detection of water hazards [38,39] or perception in indoor scenes [40,41]. In our previous works, we investigated the impact of loss functions on water hazard segmentation [42], followed by a comparative study on high-recall semantic segmentation [43]. Inspired by [40], dense polarization maps were predicted from RGB images through deep learning [1]. Instead, the current polarization imaging technique makes it possible to sense pixel-wise polarimetric information in a single shot, and it has been integrated on perception platforms for autonomous vehicles [13]. Following this line, we present a multimodal semantic segmentation system with single-shot polarization sensing. Notably, we found that previous collections [35,36,44] of polarization images were mainly gray images without RGB information, which is critical for segmentation tasks. Besides, they were limited in terms of data diversity and entailed careful calibration between different cameras. In contrast, we are able to bypass the complex calibration and naturally obtain multimodal data with single-shot polarization imaging. As an important contribution of this work, a novel outdoor traffic scene RGB-P dataset is collected and densely annotated, which covers not only specular scenes but also diverse unstructured surroundings. The dataset will be made publicly available to the community to foster polarimetry-based semantic segmentation. Moreover, our work is related to transparent object segmentation [45,46].

In addition, some polarization-driven object detection methods have been explored in [47,48]. The designed modules re-encode raw polarization images for a better representation of polarization information. This may be beneficial when the input images are all of the same type, like raw polarization images, which are essentially RGB images with similar data distributions. However, such approaches cannot be easily generalized to other sensor combinations, e.g., RGB and disparity images, which have distinctly different data distributions. Unlike them, our work builds a network architecture that can be flexibly adapted to different sensors besides RGB-polarization information.

3. Methodology

In this section, we derive the polarization image formation process and explain why polarization images contain rich information to complement RGB images for semantic segmentation. Then, we make a brief introduction of our integrated multimodal sensor and the novel RGB-P dataset. Finally, we present the Efficient Attention-bridged Fusion Network (EAFNet) for polarimetry-based multimodal scene perception.

3.1 Polarization image formation

Polarization is a significant characteristic of electromagnetic waves. When optical flux is incident upon a surface or medium, three processes occur: reflection, absorption and transmission. By analyzing the polarization of the reflected light, it is possible to determine the optical properties of a given surface or medium. We illustrate the importance of polarization according to the Fresnel equations:

$$\begin{aligned} &r_{s}=\frac{n_{1} \cos \theta_{i}-n_{2} \cos \theta_{t}}{n_{1} \cos \theta_{i}+n_{2} \cos \theta_{t}} \, ,\quad t_{s}=\frac{2 n_{1} \cos \theta_{i}}{n_{1} \cos \theta_{i}+n_{2} \cos \theta_{t}} \, , \\ &r_{p}=\frac{n_{2} \cos \theta_{i}-n_{1} \cos \theta_{t}}{n_{2} \cos \theta_{i}+n_{1} \cos \theta_{t}} \, ,\quad t_{p}=\frac{2 n_{1} \cos \theta_{i}}{n_{2} \cos \theta_{i}+n_{1} \cos \theta_{t}} \, , \end{aligned}$$
where $r_{s}$ and $t_{s}$ ($r_{p}$ and $t_{p}$) are the reflected and refracted portions of the incoming light, the subscripts s and p represent perpendicular and parallel polarization, $n_{1}$ and $n_{2}$ are the refractive indices of the two media, and $\theta _{i}$ and $\theta _{t}$ are the angles of the incident and refracted light, respectively. Inferred from Eq. (1), we find that the surface material’s optical characteristics affect the intensities of the two orthogonally polarized components. Therefore, the orthogonally polarized light partially reflects the surface material.
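To make the link between Eq. (1) and material-dependent polarization concrete, the following sketch computes the degree of linear polarization of the reflected light for unpolarized incident light, combining Eq. (1) with Snell's law. The function name and the numerical examples are ours and purely illustrative.

```python
import numpy as np

def fresnel_dolp(n1, n2, theta_i_deg):
    """Degree of linear polarization of light reflected by a dielectric
    surface under unpolarized illumination, from Eq. (1) plus Snell's law."""
    theta_i = np.radians(theta_i_deg)
    # Snell's law: n1*sin(theta_i) = n2*sin(theta_t)
    theta_t = np.arcsin(np.clip(n1 * np.sin(theta_i) / n2, -1.0, 1.0))
    r_s = (n1 * np.cos(theta_i) - n2 * np.cos(theta_t)) / \
          (n1 * np.cos(theta_i) + n2 * np.cos(theta_t))
    r_p = (n2 * np.cos(theta_i) - n1 * np.cos(theta_t)) / \
          (n2 * np.cos(theta_i) + n1 * np.cos(theta_t))
    R_s, R_p = r_s ** 2, r_p ** 2            # reflectance of each component
    return abs(R_s - R_p) / (R_s + R_p)      # DoLP of the reflected light

# Glass (n ~ 1.5) viewed near Brewster's angle reflects almost fully polarized
# light, whereas a lower-index surface at a smaller angle polarizes far less.
print(fresnel_dolp(1.0, 1.5, 56.0))   # close to 1
print(fresnel_dolp(1.0, 1.33, 30.0))  # noticeably lower
```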

The polarization image formation can be reduced to the model shown in Fig. 3. In outdoor scenes, the light source is mainly sunlight. When sunlight shines on an object such as a car, polarized reflection occurs. The reflected light, with its orthogonally polarized portions, then enters the camera with a polarization sensor, and the optical information with polarized characteristics is recorded by the sensor. The photoelectric sensor can record the polarization information because its surface is covered by a polarization mask layer with four different polarization directions, and only light with the matching polarization direction can pass through the layer.

Fig. 3. Polarization image formation model.

Here, we make a brief introduction of the polarization parameters, namely the Degree of Linear Polarization (DoLP) and the Angle of Linear Polarization (AoLP). They are the key elements that contribute to the advancement of multimodal semantic segmentation. They are derived from the Stokes vector S, which is composed of four parameters, i.e., $S_{0}$, $S_{1}$, $S_{2}$ and $S_{3}$. More precisely, $S_{0}$ stands for the total light intensity, $S_{1}$ stands for the excess of the parallel polarized portion over the perpendicular polarized portion, and $S_{2}$ stands for the excess of the 45$^{\circ }$ polarized portion over the 135$^{\circ }$ polarized portion. $S_{3}$, associated with circularly polarized light, is not involved in our work on multimodal semantic segmentation. The parameters can be derived by:

$$\left\{\begin{array}{l} S_{0}=I_{0}+I_{90}=I_{45}+I_{135} \, , \\ S_{1}=I_{0}-I_{90} \, , \\ S_{2}=I_{45}-I_{135} \, , \end{array}\right.$$
where $I_{0}$, $I_{45}$, $I_{90}$ and $I_{135}$ are the optical intensity values at the corresponding polarization directions, i.e., 0$^{\circ }$, 45$^{\circ }$, 90$^{\circ }$ and 135$^{\circ }$. Here, DoLP and AoLP can be formulated as:
$$D o L P=\frac{\sqrt{S_{1}^{2}+S_{2}^{2}}}{S_{0}} \, ,$$
$$A o L P=\frac{1}{2} \times \arctan \left(\frac{S_{1}}{S_{2}}\right) .$$
According to Eq. (3), the range of DoLP is from 0 to 1: for partially polarized light, DoLP $\in (0,1)$, and for completely polarized light, DoLP $=1$. AoLP ranges from 0$^{\circ }$ to 180$^{\circ }$. AoLP can reflect an object’s silhouette information, because objects of the same category or with the same material normally possess similar AoLP; in this sense, AoLP acts as a natural scene segmentation mask. Different from RGB-based sensors, whose output can be influenced by various outdoor conditions like foggy weather or dust, the polarization information from the RGB-P sensor stays stable according to Eqs. (2), (3) and (4), because the RGB images at the four polarization directions suffer similar degradation on their way to the polarization filter of the RGB sensor, and this degradation cancels out in the derivation of DoLP and AoLP. Therefore, the stability of polarization against varying outdoor environments is beneficial for scene perception. We generate a visualization of a set of DoLP and AoLP polarization images, as shown in Fig. 4. We find that the glass and vegetation areas are of high DoLP, while other areas are of low DoLP, which offers limited information, focused merely on the areas with polarized characteristics. Besides, the left part of the vegetation and the sky cannot be distinguished depending merely on DoLP. On the contrary, for AoLP, areas of the same category show proper continuity of polarization information, which indicates great spatial priors for SS. AoLP offers a better representation of spatial information, keeping a consistent distribution within the same category or material like vegetation, sky, road and glass. In Section 4, we will further analyze AoLP’s great potential over DoLP in providing extra spatial information for SS with extensive experiments.
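As a minimal illustration of Eqs. (2)–(4), the sketch below derives $S_{0}$, DoLP and AoLP from the four intensity images. We assume the four inputs are float arrays of the same shape; the small eps guard and the use of arctan2 (instead of a plain arctangent of the ratio) are our own implementation choices for numerical robustness and are not part of the paper's formulation.

```python
import numpy as np

def polarization_maps(i0, i45, i90, i135, eps=1e-8):
    """Compute S0, DoLP and AoLP from intensity images captured at
    0/45/90/135 degrees, following Eqs. (2)-(4)."""
    s0 = i0 + i90                # total intensity (also i45 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / (s0 + eps)        # in [0, 1]
    aolp = 0.5 * np.degrees(np.arctan2(s1, s2)) % 180.0   # in [0, 180) degrees
    return s0, dolp, aolp
```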

Fig. 4. Display of polarization characteristics: (a) RGB images at four polarization directions; (b) Polarization image.

3.2 Integrated multimodal sensor and ZJU-RGB-P dataset

The RGB-P outdoor scene dataset is captured by an integrated multimodal sensor for autonomous driving [13], as shown in Fig. 5. The sensor is a highly integrated system combining a polarization sensor, an RGB sensor, an infrared sensor and a depth sensor. It captures polarization information with an RGB-based imaging sensor, the LUCID_PHX050S. The difference between the LUCID_PHX050S and gray-polarization sensors is that the former is covered with an extra Bayer array besides the polarization mask. Intensity and luminance are important for the detection quality of image sensors, but the use of polarization (e.g., linear polarization) filters naturally results in the loss of no less than half of the input intensity reaching the RGB sensor. The collected RGB-polarization images keep their consistency against changing outdoor environments because we utilize the data acquisition and processing program, i.e., Arena, which can dynamically adjust the gain factor and exposure time according to the outdoor environment. In addition, the multimodal sensor integrates an embedded system combining hardware and software, with which we can attain various types of information like semantic information, infrared information, stereo depth information [49], monocular depth information [50] and surface normal information [51] by utilizing the relevant estimation algorithms. While the sensor provides diverse modalities, this work focuses on using RGB and polarization information. Figure 5(b) shows some examples of the sensor’s output. The highly integrated sensor can broaden the application scenarios of RGB-based sensors [13]: the infrared information can assist nighttime semantic segmentation, and the polarization-RGB-infrared multimodal sensor can offer precise depth information by pairing the sensors with different baselines. We leverage the multimodal sensor to attain pixel-aligned polarization and RGB images, and the main purpose of this work is to adapt RGB-based SS to polarization-driven multimodal SS.

Fig. 5. Our integrated multimodal vision sensor for capturing polarization information. (a) The integrated multimodal sensor; (b) The output of the sensor.

RGB-Polarization outdoor scene SS datasets are scarce in the literature. Some research groups have realized the importance of polarization information for outdoor perception. The Polabot dataset [35,36], a Gray-Polarization outdoor scene SS dataset, consists of around 180 pairs of images at a low resolution of 230$\times$320. The limited number of images and the low resolution make it hard to train a robust SS network for outdoor scenes. In addition, the dataset lacks RGB information, which provides important texture features for classification tasks.

Addressing this scarcity, we build the first RGB-P outdoor scene dataset, which consists of 394 annotated pixel-aligned RGB-Polarization images. We collect images of abundant and complex scenes at Yuquan Campus, Zhejiang University, as shown in Fig. 6. The scenes of the dataset cover road scenes around the teaching building area, the canteen area, the library and so on, providing diverse scenes to reduce the risk of over-fitting when training SS models.

Fig. 6. Diverse scenes in our ZJU-RGB-P dataset.

The resolution of our dataset is 1024$\times$1224, which makes it possible to apply data augmentation like random cropping and random rescaling, which are crucial for improving data diversity and attaining robust segmentation [26]. We label the dataset with 9 classes at the pixel level, i.e., Building, Glass, Car, Road, Vegetation, Sky, Pedestrian, Bicycle and Background, using LabelMe [52]. Here, we compute statistics of the category distribution at the pixel level and draw a histogram, as shown in Fig. 7. We can learn from the histogram that the dataset has a diversity of categories, and that the categories with a low pixel proportion, like glass and pedestrian, will be the difficult categories for SS. An example of the dataset is shown in Fig. 8, which consists of four pixel-aligned RGB images at four polarization directions and an SS label. Since AoLP and DoLP are the ultimate polarization representations integrated into SS, the four polarized RGB images need to be processed according to Eq. (3) and Eq. (4). We utilize the average of the four polarized RGB images as the RGB image fed into EAFNet; in fact, the average of the RGB images at any two orthogonal directions can represent an image captured by a conventional RGB sensor. Finally, we select 344 images as the training set and the other 50 images as the validation set. We name it the ZJU-RGB-P dataset.
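As a small sketch of how the RGB input mentioned above could be formed, the helper below (a hypothetical name of ours) averages the four polarized captures; by Eq. (2), averaging any two orthogonal directions yields the same $S_{0}/2$ image, which is why the average stands in for a conventional RGB capture.

```python
import numpy as np

def rgb_input(i0, i45, i90, i135):
    """Pseudo-RGB frame for the RGB branch: the mean of the four polarized
    RGB captures, equivalent to (I0 + I90) / 2 or (I45 + I135) / 2."""
    return (i0 + i45 + i90 + i135) / 4.0
```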

Fig. 7. Histogram of ZJU-RGB-P dataset’s category distribution.

Fig. 8. An example of the ZJU-RGB-P dataset including RGB images at different polarization directions and the pixel-wise semantic segmentation label.

3.3 Efficient Attention-bridged Fusion Network

In order to combine RGB and polarization features, we present EAFNet, an Efficient Attention-bridged Fusion Network that exploits multimodal complementary information, whose architecture is shown in Fig. 9. Inspired by SwiftNet [6] and our previous SFN [43] with a U-shaped encoder-decoder structure, EAFNet keeps a similar architecture, with downsampling paths to extract features and an upsampling module to restore the resolution, together with EAC modules to fuse features from the RGB and polarization images. Here, we give a brief overview of EAFNet according to Fig. 9. EAFNet has a three-branch structure with downsampling paths of the same type: the RGB branch, the polarization branch and the fusion branch. To advance computation efficiency, we employ ResNet-18 [15], a lightweight encoder, to extract and fuse features. After obtaining the downsampled and fused features, an SPP module, i.e., a spatial pyramid pooling module [6,53], is leveraged to enlarge the valid receptive field. Then, a series of upsampling modules is leveraged to restore the feature resolution. Like SwiftNet, EAFNet employs a series of convolution layers with a kernel size of 1$\times$1 to connect features between shallow layers and deep layers. The key innovation here lies in the carefully designed fusion module, namely the EAC module, with inspiration gathered from the Efficient Channel Attention Network [54]. With this architecture, EAFNet is a real-time network whose inference speed on a GTX 1080Ti reaches 24 FPS (frames per second) at a resolution of 512$\times$1024.

Fig. 9. Overview of EAFNet. RGB and polarization images are input to the network for extracting features separately. The EAC modules adaptively fuse the features.

The EAC module is an efficient attention complementary module designed for extracting informative RGB features and polarization features, as shown in Fig. 10. It is an efficient version of the Attention Complementary Module (ACM) [21], replacing the fully connected layers with convolution layers whose kernel sizes are adaptively determined according to the channel number of the corresponding feature maps. On the one hand, this structure reduces computation complexity compared with ACM due to the use of local cross-channel interactions rather than all channel-pair interactions. On the other hand, the local cross-channel interaction effectively avoids the loss of information caused by dimension reduction when learning channel attention.

Fig. 10. The Efficient Attention Complementary module (EAC module). A is the input feature map, B is the average global feature vector, C is the feature vector after a convolution layer with an adaptive kernel size. D is the attention weights, i.e. the vector after activation function of C, and E is the adjusted feature map.

Assuming the input feature map is $\boldsymbol {A} \in \mathbb {R}^{H \times W \times C}$, where $H$, $W$ and $C$ are the height, width and channel number of the input feature map, respectively, we first apply a global average pooling layer to process $\boldsymbol {A}$. We then obtain a feature vector $B=\left [B_{1}, B_{2}, \ldots , B_{C}\right ] \in \mathbb {R}^{1 \times C}$, where the subscript represents the channel index. The k-th ($k \in [1, C]$) element of $\boldsymbol {B}$ can be expressed as:

$$B_{k}=\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} A_{(i, j)}^{k} .$$
Then, the vector $\boldsymbol {B}$ is reorganized by a convolution layer with an adaptive kernel size K to obtain a more meaningful vector $C=\left [C_{1}, C_{2}, \ldots , C_{C}\right ] \in \mathbb {R}^{1 \times C}$. K is the key to attaining the local cross-channel interaction attention weights, and it is acquired from the intermediate value t:
$$\mathrm{t}=\operatorname{int}\left(\frac{\mathrm{abs}\left(\log _{2}(\mathrm{C})+\mathrm{b}\right)}{\gamma}\right) \, ,$$
where b and $\gamma$ are hyper-parameters set to 1 and 2 in our experiments, respectively. If t is divisible by 2, K is equal to t; otherwise, K is equal to t plus 1. As the channel depth grows, the EAC module can attain interaction among more channels. To limit the range of $\boldsymbol {C}$, the sigmoid activation function $\sigma (\cdot )$ is applied to it, which can be expressed as:
$$\sigma(x)=\frac{1}{1+e^{{-}x}}.$$
Then, we get the final attention weights $D=\left [D_{1}, D_{2}, \ldots , D_{C}\right ] \in \mathbb {R}^{1 \times C}$. All elements of $\boldsymbol {D}$ lie between 0 and 1; in other words, each element of $\boldsymbol {D}$ can be viewed as the importance weight of the corresponding channel of the input feature map. Finally, we multiply $\boldsymbol {A}$ by $\boldsymbol {D}$ channel-wise to get the adjusted feature map $\boldsymbol {E} \in \mathbb {R}^{H \times W \times C}$. Thereby, RGB features and polarization features can be adjusted dynamically by the EAC module.
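The following is a minimal TensorFlow sketch of the EAC module as described by Eqs. (5)–(7) and Fig. 10. The class and variable names are ours, and details such as weight initialization are left at framework defaults, so this is an illustration of the mechanism rather than the authors' exact implementation.

```python
import math
import tensorflow as tf

class EACModule(tf.keras.layers.Layer):
    """Efficient Attention Complementary (EAC) module: global average pooling
    (Eq. (5)), a 1-D convolution across channels with the adaptive kernel size
    of Eq. (6), and a sigmoid (Eq. (7)) producing per-channel weights D."""

    def __init__(self, channels, b=1, gamma=2, **kwargs):
        super().__init__(**kwargs)
        t = int(abs(math.log2(channels) + b) / gamma)
        k = t if t % 2 == 0 else t + 1          # Eq. (6): round t up to an even K
        self.conv = tf.keras.layers.Conv1D(filters=1, kernel_size=max(k, 1),
                                           padding="same", use_bias=False)

    def call(self, a):                           # a: (N, H, W, C), channels-last
        b_vec = tf.reduce_mean(a, axis=[1, 2])             # Eq. (5): (N, C)
        c_vec = self.conv(tf.expand_dims(b_vec, -1))       # local channel mixing
        return tf.sigmoid(tf.squeeze(c_vec, -1))           # Eq. (7): D in (0, 1)

# The adjusted map E is the input re-weighted channel-wise by D:
#   e = a * eac(a)[:, None, None, :]
```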

The fusion module is leveraged to fuse the adjusted feature maps from the RGB branch and the polarization branch following the EAC modules. As mentioned above, the fusion branch has the same structure as the RGB branch and the polarization branch; the main difference lies in the input feature flow. Assume that at the i-th downsampling stage, the RGB branch’s feature map is $y_{R G B}^{i} \in \mathbb {R}^{H_{i} \times W_{i} \times C_{i}}$ and the polarization branch’s feature map is $y_{P}^{i} \in \mathbb {R}^{H_{i} \times W_{i} \times C_{i}}$. Figure 11 illustrates one layer of the fusion branch for the fusion process. The left part is the RGB feature and the right part is the polarization feature, while the feature $m^{i}$ flowing through the center arrow is the fused feature from the previous fusion stage. Then, the fused feature $m^{i+1}$ at the current stage can be expressed as:

$$m^{i+1}=y_{R G B}^{i} * E A C_{R G B}^{i}\left(y_{R G B}^{i}\right)+y_{P}^{i} * E A C_{P}^{i}\left(y_{P}^{i}\right)+m^{i}\, ,$$
where $m^{i+1}$, working as the fused feature, is passed into the fusion branch to extract higher-level features. It should be noted that at the first fusion stage of our EAFNet architecture, only the RGB feature and the polarization feature are available as inputs.
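A sketch of one fusion stage built on the module above; the function name and the handling of the missing $m^{i}$ at the first stage are our own conventions.

```python
import tensorflow as tf

def fuse_stage(y_rgb, y_p, m_prev, eac_rgb, eac_p):
    """One fusion stage following Eq. (8): each branch's feature map is
    re-weighted by its EAC channel attention and summed with the fused
    feature from the previous stage (absent at the first stage)."""
    e_rgb = y_rgb * eac_rgb(y_rgb)[:, None, None, :]
    e_p = y_p * eac_p(y_p)[:, None, None, :]
    m_next = e_rgb + e_p
    if m_prev is not None:
        m_next = m_next + m_prev
    return m_next
```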

Fig. 11. The structure of Fusion Module.

4. Experiments and analysis

In this section, the implementation details and a series of experiments with comprehensive analysis are presented.

4.1 Implementation details

The experiments concerning polarization fusion are performed on the ZJU-RGB-P dataset, while the preliminary experiment detailed in Section 1 and the supplementary experiment detailed in Section 4.5 are performed on the Lost and Found dataset [11]. The remaining implementation details are the same for all experiments.

For data augmentation, we first scale the images with random factors between 0.75 and 1.25, then randomly crop them with a crop size of 768$\times$768, followed by a random horizontal flip. It is worth noting that the horizontal flipping of AoLP has a critical difference from that of DoLP and RGB images. According to Eq. (4), when the RGB images at the four polarization directions are horizontally flipped, the AoLP becomes:

$$A o L P^{'}=180^{{\circ}}-A o L P\, ,$$
where $A o L P^{'}$ is the final horizontally flipped AoLP image, and $A o L P$ is merely the spatially flipped version of the initial AoLP image. After all the data augmentation, the processed images are normalized to the range between 0 and 1.
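A minimal sketch of this flipping rule, assuming AoLP is stored in degrees in $[0, 180)$ and the arrays are ordered (H, W, ...); the helper name is ours.

```python
import numpy as np

def hflip_rgb_p(rgb, dolp, aolp_deg):
    """Horizontal flip of one RGB-P training sample. RGB and DoLP are only
    flipped spatially; AoLP additionally follows Eq. (9): AoLP' = 180 - AoLP."""
    rgb_f = rgb[:, ::-1]                            # (H, W, 3)
    dolp_f = dolp[:, ::-1]                          # (H, W)
    aolp_f = (180.0 - aolp_deg[:, ::-1]) % 180.0    # keep values in [0, 180)
    return rgb_f, dolp_f, aolp_f
```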

We use TensorFlow and an NVIDIA GeForce GTX 1080Ti GPU to implement EAFNet and perform training. We use the Adam optimizer [55] with an initial learning rate of 4$\times$10$^{-4}$, decayed with cosine annealing down to 2.5$\times$10$^{-3}$ of the initial learning rate at the final epoch. To combat over-fitting, we use L2 weight regularization with a weight decay of 1$\times$10$^{-4}$. Unlike prior works [6,25], we do not adopt any pre-trained weights, in order to investigate the effectiveness of multimodal SS fairly and to reach high performance even with a limited number of polarization image pairs. We utilize the cross-entropy loss to train all models with a batch size of 8, and we evaluate with the standard Intersection over Union (IoU) metric.
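A minimal TensorFlow sketch of this optimizer configuration is shown below. The number of epochs is a placeholder we chose, since the schedule length is not stated here, and the L2 term would in practice be attached to each convolution via a kernel regularizer.

```python
import tensorflow as tf

EPOCHS, STEPS_PER_EPOCH = 200, 344 // 8      # 344 training images, batch size 8
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=4e-4,
    decay_steps=EPOCHS * STEPS_PER_EPOCH,
    alpha=2.5e-3)                            # floor: 2.5e-3 of the initial rate
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
l2_reg = tf.keras.regularizers.l2(1e-4)      # passed as kernel_regularizer
```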

4.2 Results and analysis

Both AoLP and DoLP can represent the polarization information of a scene, but which one is better to fuse into polarization-driven SS remains an open question.

We gave a brief, intuitive analysis of the superiority of AoLP over DoLP for polarization-driven SS in Section 3.1. As a preliminary investigation, according to Fig. 4, we find that AoLP’s distribution differs remarkably from that of DoLP. Further, we present statistics of the value distributions of DoLP and AoLP on the ZJU-RGB-P training set, as shown in Fig. 12. The majority of pixels in the training set have a small DoLP ranging from 0 to 0.4, while the portion of pixels whose DoLP values are larger than 0.4 is rather low, indicating that DoLP offers limited information, focused merely on categories with highly polarized characteristics. Different from DoLP, AoLP exhibits a uniform distribution; in other words, nearly all pixels of the AoLP images possess meaningful features that are useful for SS. Given the different distributions of DoLP and AoLP, it can be expected that their information emphasizes different polarization characteristics. Inspired by this observation, we perform a series of experiments to investigate the effectiveness of EAFNet in fusing polarization and RGB information, and whether AoLP is superior to DoLP in offering complementary information for RGB-P image segmentation.

Fig. 12. The value distribution of ZJU-RGB-P training set’s DoLP and AoLP: (a) The DoLP distribution; (b) The AoLP distribution. All the values are normalized to the range between 0 and 1.

For the basic control experiment, we first train the RGB-only SwiftNet on the ZJU-RGB-P dataset as our Baseline. Then, four training settings are compared. As shown in Fig. 9, EAFNet is a two-path network, where RGB images and polarization information are fed into different paths. To explore the better polarization feature, we select AoLP (marked as EAF-A) and DoLP (marked as EAF-D) as the input polarization information, respectively. Considering that both AoLP and DoLP can offer polarization information, we also concatenate AoLP and DoLP images along the channel dimension to build a polarization representation for training a variant of EAFNet (marked as EAF-A/D).

Finally, we build a three-path version of EAFNet, where one path is the RGB path and the other two are polarization paths; AoLP and DoLP are passed into the two polarization paths (marked as EAF-3Path).

All quantitative results of the experiments are shown in Table 1. It can be seen that the models combined with polarization information advance the segmentation of objects with polarization characteristics like glass (73.4% to 79.3%), car (91.6% to 93.7%) and bicycle (82.5% to 86.0%). In addition, we observe that not only the IoU of classes with polarization characteristics is advanced, but the IoU of other classes is also improved to a large extent when combined with polarization information, especially pedestrian (36.1% to 63.8%). Meanwhile, the mIoU is lifted to 85.7% from 80.3%.

Table 1. Accuracy analysis on ZJU-RGB-P including per-class accuracy in IoU (%).

Further, we compare and discuss the groups that fuse polarization-based features. As shown in Table 1, EAF-A is the optimal setting, while EAF-3Path is the worst group. Here, we analyze this from the perspective of data distribution and model complexity. Our main focus is on classes like glass and car, as the initial motivation of this study is to lift the segmentation performance of objects with polarization characteristics. Comparing EAF-A and EAF-D, we find that the former advances the IoU of glass and car more than the latter. It is the data distribution that counts: the analysis of the different distributions in the previous sections shows that AoLP offers a better spatial representation, such as contour information, than DoLP, while DoLP only offers meaningful information in areas with high polarization. In this sense, AoLP provides richer priors and complementary information for RGB-P segmentation. EAF-A/D attains higher IoU values on glass and car than EAF-D because AoLP complements the spatial features of DoLP. However, it reaches a lower IoU on glass than EAF-A, because interference between the DoLP and AoLP features occurs, bringing side effects and losing some useful information. EAF-3Path is the worst group, as the complex architecture prevents the model from exploiting the most informative features; besides, the three-path structure impairs the capacity of the RGB features, which are critical for outdoor scene perception. Eventually, we conclude from the quantitative analysis that feeding AoLP images into the polarization path greatly advances polarization-driven segmentation performance.

For qualitative results, we use the Baseline RGB-only model and the EAF-A polarization-driven model, i.e., SwiftNet and our EAFNet fed with AoLP, on the ZJU-RGB-P validation set to produce a series of visualization examples, as illustrated in Fig. 13. We find that SwiftNet wrongly segments the pedestrians as cars in the first row of Fig. 13, where EAFNet detects them correctly. In the second row, EAFNet successfully distinguishes the glass from the car, while SwiftNet cannot segment the full glass area. Moreover, SwiftNet even segments part of the car as road and part of the pedestrian as vegetation in the last row of Fig. 13. Such wrong segmentation results in outdoor traffic scenes can lead to dangerous situations and even accidents once the model is used to guide autonomous vehicles or assisted navigation [10,26]. In addition, we can learn from the first row that EAFNet keeps a good performance under low luminance. The area in the foreground is in the shade with low luminance, where the limited RGB information makes it hard even for our eyes to perceive the scene, while AoLP offers sufficient spatial information to complement it, thereby advancing the performance of EAFNet. Conversely, the area in the distance, annotated as background, is illuminated by the sun with high intensity, where the complementary AoLP also aids the segmentation of EAFNet, leading to accurate perception in such challenging scenarios. It is obvious that polarization-driven SS can complement the information that is missing when relying merely on RGB images. Therefore, multimodal SS is beneficial for semantic understanding in pursuit of robust outdoor scene perception.

Fig. 13. Qualitative result comparison between the RGB-only baseline and our EAFNet.

4.3 Analysis of EAC module

The EAC module is the key module of EAFNet, which extracts the attention weights of the RGB branch and the AoLP branch. To better demonstrate the effect of the EAC module, we visualize the feature maps of the fourth downsampling block of the RGB and AoLP branches, together with their EAC attention weights, as shown in Fig. 14. We only visualize the feature maps of the first 16 channels. Here, (i, j) denotes the position at the i-th row and j-th column of the feature-map grid, which corresponds one-to-one with the attention weights, and some insightful results can be found. In the RGB branch, we find that the car and glass areas have low responses in the feature maps. On the contrary, the corresponding areas of the AoLP branch have high responses, especially at (2, 4), (3, 2) and (4, 4). The respective EAC modules then extract their attention weights. Taking (3, 2) as an example to illustrate the complementing process, this channel’s attention weights are 0.5244 and 0.4214 for the RGB and AoLP branches, respectively. The corresponding feature maps are multiplied by their attention weights, and the adjusted feature maps are added up to build the final feature map; it can be clearly seen that the fused feature maps spotlight the area of the car perfectly.

As can be seen in Fig. 14, the attention weights of RGB are higher than those of AoLP in most cases. Here, we evaluate the weights generated by the EAC modules at all levels and illustrate their averages, as shown in Fig. 15. According to the curves, it can be easily observed that the RGB branch possesses higher weights than the AoLP branch at Layer1, Layer2, Layer3 and Layer4. On the contrary, the AoLP branch has a higher weight than the RGB branch at the first downsampling block, i.e., conv0. As mentioned before, AoLP offers a representation of spatial information and rich priors; therefore, at the beginning of EAFNet, AoLP offers more distinguishing features than RGB. As the features flow into deeper layers, the RGB branch becomes dominant. In addition, both weight curves follow a similar variation trend and reach their highest values at Layer3.

Fig. 14. The EAC module’s role in fusing features: An RGB image and an AoLP image are fed into EAFNet, then we visualize the fourth downsampling block’s feature maps. Following that, EAC module extracts RGB and polarization branches’ attention weights. Finally, the feature maps are adjusted by the attention weights and fused.

Fig. 15. The average attention weights of EAFNet at all levels.

4.4 Ablation study

To better illustrate that fusing polarization information is beneficial and to verify EAFNet’s strong fusion capacity, we perform three extra training runs. First, we directly utilize AoLP images to train SwiftNet, denoted as SN-AoLP. Second, we utilize the concatenation of RGB and AoLP images to train SwiftNet, denoted as SN-RGB/A. Finally, we remove the EAC module from EAFNet and utilize element-wise addition to fuse RGB features and AoLP features instead, denoted as EAF-wo-A. We also include the Baseline and EAF-A of the previous experiment in the ablation study, as shown in Table 2.

Table 2. Accuracy analysis of the ablation study on ZJU-RGB-P (%).

We find that SN-AoLP reaches a decent performance, which benefits from the spatial priors of AoLP. It can be learned from the comparison between the Baseline and SN-AoLP that RGB possesses more distinguishing features than AoLP, while AoLP offers meaningful information as well. Besides, SN-RGB/A can indeed advance the segmentation of classes with polarization characteristics like glass (73.4% to 75.6%), but it does not yield remarkable benefits for all classes, in contrast to our EAF-A. In addition, SN-RGB/A even causes a slight mIoU degradation from 80.3% to 80.2%. It is the difference between the RGB and AoLP distributions that accounts for this degradation: the interference between RGB and AoLP has an adverse impact on the extraction of distinguishable features. We can learn from the comparison between EAF-wo-A and EAF-A that the attention mechanism of EAFNet is highly effective. Combined with the EAC module, all classes see a remarkable elevation, like glass (75.4% to 79.3%) and pedestrian (40.6% to 60.4%). Comparing all the groups, we conclude that EAF-A reaches the highest accuracy on all classes, indicating the effectiveness of our EAFNet and the designed polarization fusion strategy.

4.5 Generalization to other sensor fusions

To prove the flexibility of EAFNet to be adapted to other sensor combination scenarios besides polarization, we utilize disparity images with EAFNet in the unexpected obstacle detection scenario, i.e., the preliminary investigation mentioned in Section 1.

According to Fig. 1, we can find that disparity images reflect the contours of the tiny obstacles. Thereby, combining RGB and disparity images with EAFNet is promising for addressing the devastating results of using merely RGB data. Considering the similar distribution between disparity and AoLP images, we train EAFNet fed with pixel-aligned disparity and RGB images (marked as EAFNet-RGBD) and mark the baseline group as SwiftNet-RGB. All the training strategies are set according to Section 4.1. From the results in Table 3, we observe a remarkable elevation in performance with the aid of EAFNet and complementary disparity images, where the precision and IoU of obstacle segmentation are lifted from 26.5% to 76.2% and from 20.9% to 52.7%, respectively. This indicates that combining disparity images with EAFNet bears fruit. To realize EAFNet-RGBD’s effects more intuitively, we visualize an example in Fig. 16, where EAFNet-RGBD segments the obstacles and most of the road area successfully, but SwiftNet-RGB misses most of the road and obstacles. Therefore, it is essential to perform multimodal semantic segmentation with complementary sensing information like polarization-driven and depth-aware features to attain a reliable and holistic understanding of outdoor traffic scenes.

Fig. 16. Qualitative comparison results of SwiftNet-RGB and EAFNet-RGBD.

Table 3. Accuracy analysis of the supplement experiment on Lost and Found (%).

5. Conclusion and future work

In this paper, we propose EAFNet, which fuses the features of RGB and polarization images. We build ZJU-RGB-P with our integrated multimodal vision sensor, which, to the best of our knowledge, is the first RGB-polarization semantic segmentation dataset. EAFNet dynamically extracts the attention weights of the RGB and polarization branches, then adjusts and fuses the multimodal features, significantly advancing the segmentation performance, especially on classes with highly polarized characteristics like glass and car. Extensive experiments prove the effectiveness of EAFNet in incorporating features from different sensing modalities and its flexibility to be adapted to other sensor combination scenarios like RGB-D perception. Therefore, EAFNet is a multimodal SS model that can be utilized in diverse real-world applications.

In the future, two research paths can be explored. One is to build more kinds of multimodal datasets based on the integrated multimodal vision sensor, like an RGB-Infrared dataset to address nighttime scene understanding. The other is to enlarge the categories of ZJU-RGB-P to cope with the detection of transparent objects, ice and water hazards.

Funding

ZJU-Sunny Photonics Innovation Center (2020-03); Bundesministerium für Arbeit und Soziales (01KM151112).

Acknowledgments

This research was supported in part by Hangzhou SurImage Technology Company Ltd.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

References

1. K. Yang, L. M. Bergasa, E. Romera, X. Huang, and K. Wang, “Predicting polarization beyond semantics for wearable robotics,” in 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), (IEEE, 2018), pp. 96–103.

2. D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Trans. Intelligent Trans. Syst. (to be published). [CrossRef]  

3. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2015), pp. 3431–3440.

4. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, (Springer, 2015), pp. 234–241.

5. E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,” IEEE Trans. Intell. Transport. Syst. 19(1), 263–272 (2018). [CrossRef]  

6. M. Oršic, I. Krešo, P. Bevandic, and S. Šegvic, “In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2019), pp. 12599–12608.

7. E. Romera, L. M. Bergasa, K. Yang, J. M. Alvarez, and R. Barea, “Bridging the day and night domain gap for semantic segmentation,” in 2019 IEEE Intelligent Vehicles Symposium (IV), (IEEE, 2019), pp. 1312–1318.

8. L. Sun, K. Wang, K. Yang, and K. Xiang, “See clearer at night: towards robust nighttime semantic segmentation through day-night image conversion,” in Artificial Intelligence and Machine Learning in Defense Applications, vol. 11169 (International Society for Optics and Photonics, 2019), p. 111690A.

9. Y. Zhang, D. Sidibé, O. Morel, and F. Mériaudeau, “Deep multimodal fusion for semantic image segmentation: A survey,” Image Vision Comp. 105, 104042 (2020). [CrossRef]  

10. J. Zhang, K. Yang, and R. Stiefelhagen, “Issafe: Improving semantic segmentation in accidents by fusing event-based data,” arXiv preprint arXiv:2008.08974 (2020).

11. P. Pinggera, S. Ramos, S. Gehrig, U. Franke, C. Rother, and R. Mester, “Lost and found: detecting small road hazards for self-driving vehicles,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (IEEE, 2016), pp. 1099–1106.

12. K. Yang, K. Wang, R. Cheng, W. Hu, X. Huang, and J. Bai, “Detecting traversable area and water hazards for the visually impaired with a prgb-d sensor,” Sensors 17(8), 1890 (2017). [CrossRef]  

13. D. Sun, X. Huang, and K. Yang, “A multimodal vision sensor for autonomous driving,” arXiv preprint arXiv:1908.05649 (2019).

14. V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017). [CrossRef]  

15. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2016), pp. 770–778.

16. H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2017), pp. 6230–6239.

17. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018). [CrossRef]  

18. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 7132–7141.

19. H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 7151–7160.

20. S. Choi, J. T. Kim, and J. Choo, “Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2020), pp. 9373–9383.

21. X. Hu, K. Yang, L. Fei, and K. Wang, “Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation,” in 2019 IEEE International Conference on Image Processing (ICIP), (IEEE, 2019), pp. 1440–1444.

22. J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2019), pp. 3141–3149.

23. Y. Yuan and J. Wang, “Ocnet: Object context network for scene parsing,” arXiv preprint arXiv:1809.00916 (2018).

24. K. Yang, X. Hu, H. Chen, K. Xiang, K. Wang, and R. Stiefelhagen, “Ds-pass: Detail-sensitive panoramic annular semantic segmentation through swaftnet for surrounding sensing,” in 2020 IEEE Intelligent Vehicles Symposium (IV), (IEEE, 2020), pp. 457–464.

25. L. Sun, K. Yang, X. Hu, W. Hu, and K. Wang, “Real-time fusion network for rgb-d semantic segmentation incorporating unexpected obstacle detection for road-driving images,” IEEE Robotics Autom. Lett. 5(4), 5558–5565 (2020). [CrossRef]  

26. K. Yang, L. M. Bergasa, E. Romera, and K. Wang, “Robustifying semantic cognition of traversability across wearable rgb-depth cameras,” Appl. Opt. 58(12), 3141–3155 (2019). [CrossRef]  

27. H. Chen, K. Wang, W. Hu, K. Yang, R. Cheng, X. Huang, and J. Bai, “Palvo: visual odometry based on panoramic annular lens,” Opt. Express 27(17), 24481–24497 (2019). [CrossRef]  

28. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2016), pp. 3213–3223.

29. F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2020), pp. 2636–2645.

30. G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The mapillary vistas dataset for semantic understanding of street scenes,” in 2017 IEEE International Conference on Computer Vision (ICCV), (IEEE, 2017), pp. 5000–5009.

31. A. Valada, G. L. Oliveira, T. Brox, and W. Burgard, “Deep multispectral semantic scene understanding of forested environments using multimodal fusion,” in International Symposium on Experimental Robotics, (Springer, 2016), pp. 465–477.

32. G. Choe, S.-H. Kim, S. Im, J.-Y. Lee, S. G. Narasimhan, and I. S. Kweon, “Ranus: Rgb and nir urban scene dataset for deep scene parsing,” IEEE Robotics Autom. Lett. 3(9), 1808–1815 (2018). [CrossRef]  

33. C. Li, W. Xia, Y. Yan, B. Luo, and J. Tang, “Segmenting objects in day and night: Edge-conditioned cnn for thermal image semantic segmentation,” IEEE Trans. Neural Networks Learning Sys (to be published). [CrossRef]  

34. J. Vertens, J. Zürn, and W. Burgard, “Heatnet: Bridging the day-night domain gap in semantic segmentation with thermal images,” arXiv preprint arXiv:2003.04645 (2020).

35. Y. Zhang, O. Morel, M. Blanchon, R. Seulin, M. Rastgoo, and D. Sidibé, “Exploration of deep learning-based multimodal fusion for semantic road scene segmentation,” in Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, (SciTePress, 2019), pp. 336–343.

36. M. Blanchon, O. Morel, Y. Zhang, R. Seulin, N. Crombez, and D. Sidibé, “Outdoor scenes pixel-wise semantic segmentation using polarimetry and fully convolutional network,” in Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, (SciTePress, 2019), pp. 328–335.

37. I. Alonso and A. C. Murillo, “Ev-segnet: Semantic segmentation for event-based cameras,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (IEEE, 2019), pp. 1624–1633.

38. K. Yang, L. M. Bergasa, E. Romera, J. Wang, K. Wang, and E. López, “Perception framework of water hazards beyond traversability for real-world navigation assistance systems,” in 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO), (IEEE, 2018), pp. 186–191.

39. X. Han, C. Nguyen, S. You, and J. Lu, “Single image water hazard detection using fcn with reflection attention units,” in Proceedings of the European Conference on Computer Vision (ECCV), (Springer, 2018), pp. 105–120.

40. X. Huang, J. Bai, K. Wang, Q. Liu, Y. Luo, K. Yang, and X. Zhang, “Target enhanced 3d reconstruction based on polarization-coded structured light,” Opt. Express 25(2), 1173–1184 (2017). [CrossRef]  

41. K. Berger, R. Voorhies, and L. H. Matthies, “Depth from stereo polarization in specular scenes for urban robotics,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), (IEEE, 2017), pp. 1966–1973.

42. K. Xiang, K. Wang, and K. Yang, “Importance-aware semantic segmentation with efficient pyramidal context network for navigational assistant systems,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC), (IEEE, 2019), pp. 3412–3418.

43. K. Xiang, K. Wang, and K. Yang, “A comparative study of high-recall real-time semantic segmentation based on swift factorized network,” in Artificial Intelligence and Machine Learning in Defense Applications, vol. 11169 (International Society for Optics and Photonics, 2019), p. 111690C.

44. F. Wang, S. Ainouz, C. Lian, and A. Bensrhair, “Multimodality semantic segmentation based on polarization and color images,” Neurocomputing 253, 193–200 (2017). [CrossRef]  

45. E. Xie, W. Wang, W. Wang, M. Ding, C. Shen, and P. Luo, “Segmenting transparent objects in the wild,” arXiv preprint arXiv:2003.13948 (2020).

46. A. Kalra, V. Taamazyan, S. K. Rao, K. Venkataraman, R. Raskar, and A. Kadambi, “Deep polarization cues for transparent object segmentation,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2020), pp. 8602–8611.

47. R. Blin, S. Ainouz, S. Canu, and F. Meriaudeau, “Road scenes analysis in adverse weather conditions by polarization-encoded images and adapted deep learning,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC), (IEEE, 2019), pp. 27–32.

48. Y. Wang, Q. Liu, H. Zu, X. Liu, R. Xie, and F. Wang, “An end-to-end cnn framework for polarimetric vision tasks based on polarization-parameter-constructing network,” arXiv preprint arXiv:2004.08740 (2020).

49. H. Hirschmuller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2 (IEEE, 2005), pp. 807–814.

50. K. Zhou, K. Wang, and K. Yang, “A robust monocular depth estimation framework based on light-weight erf-pspnet for day-night driving scenes,” in Journal of Physics: Conference Series, vol. 1518 (IOP Publishing, 2020), p. 012051.

51. B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2015), pp. 1119–1127.

52. B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “Labelme: a database and web-based tool for image annotation,” Int. J. Comput. Vis. 77(1-3), 157–173 (2008). [CrossRef]  

53. K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015). [CrossRef]  

54. Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2020), pp. 11534–11542.

55. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).
