Attention-based fusion network for human eye-fixation prediction in 3D images

Ying Lv; Wujie Zhou; Wujie Zhou; Jingsheng Lei; Lv Ye; Ting Luo

doi:10.1364/OE.27.034056

1. Introduction

When facing natural scenes, the human visual system has the ability to search for and locate objects of interest quickly and to process areas of interest while ignoring other areas [1]. This visual attention mechanism is important for people to process visual information in daily life. It utilizes two strategies: bottom-up and top-down [2]. The bottom-up strategy represents the visual attention caused by the essential features of images, which is driven by the underlying perception data, for example, a series of image features such as color, brightness, and orientation. According to the underlying image data, there are strong feature differences in different regions; the significance of the image region can be calculated by judging the difference between the target region and its surrounding pixels. The top-down strategy is based on the task-driven attentional saliency mechanism, which drives visual attention based on task experience and anticipates the target saliency area of the current image based on knowledge. For example, in a certain area, when you are searching for your friend who is wearing a black hat, you will first notice the prominent feature of the black hat.

With the growth of the Internet resulting in the availability of large volumes of data, quickly obtaining specific information from vast amounts of image and video data has become a key problem in the field of computer vision. Visual detection of significance through object recognition [3], 3D display [4], visual comfort evaluation [5], and 3D visual quality measurement [6] has important application value in this regard.

Numerous deep neural networks (DNNs) have been proposed to predict human visual attention in natural 2D scenes. For instance, SALICON [7] uses a weight-sharing DNN to extract features from multi-scale input; pretrained parameters are loaded onto ImageNet for the image classification task, and the multi-scale features are integrated in a later stage. Deepfix [8] uses deep architectures of VGG [9] to extract complex context information and the inception block of GoogleNet [10] to simultaneously carry out the different size operations, and contains dilated convolutions [11] in location-based convolution layers with a central bias. Pan et al. [12] proposed a network named SalGAN, which takes advantage of an encoder-decoder generative adversarial network (GAN) [13] to generate pixel level saliency maps and introduces the binary cross-entropy loss function. Wang et al. [14] adopted VGG as the backbone network to carry out deconvolution operations on multiple scales and conduct multi-scale joint supervision, which benefited from this supervision method and demonstrated excellent prediction performance. Cornia et al. [15] used a dilated convolutional network to expand the reception field of the network so that the network can identify more contextual information. An attentive convolutional LSTM sequentially enhances saliency features because of its attentive recurrent mechanism. In order to obtain multiple scale features simultaneously, Kroner et al. [16] used a module with multiple convolutional layers at different dilation rates, called the ASPP module, to enhance contextual information containing advanced visual features at multiple spatial scales. Che et al. [17] introduced a GAN based on U-Net as a GazeGAN generator that combines the classic “skip connection” with the “center surround connection (CSC)” to exploit multilevel features.

Although the 2D DNNs mentioned above have achieved great success, most models are not applicable in 3D scenarios, where the spatial location of the matrix is more emphasized. Recent advances in depth sensors, such as Microsoft's Kinect, Intel's RealSense, and the iPhone, illustrate how to overcome these challenges by detecting depth data pairs in salient objects that are easy to capture, light independent, and can provide geometric clues that improve significance predictions. Because there is a complementary relationship between RGB and depth, it is of great help to effectively integrate the two for visual significance detection. However, there are also considerable challenges in the fusion of the two, and several methods of combining RGB and depth images for saliency detection have been proposed previously [18–21]. Zhang et al. [18] used a convolutional neural network (CNN) to extract RGB and depth features, respectively, and linear fusion strategy was used to integrate the color salient and depth salient graphs. Although this allows for the incorporation of depth information, the fusion approach is too simple to consider the differences between cross-modal characteristics. Li et al. [19] manually produced the features of four modes, not only RGB and depth, to compensate for the shortcomings of CNN and conducted reasoning and fusion through the most advanced technical graph neural network. Wang et al. [20] proposed a novel visual attention-driven model, where a module mimicking human attention behavior in a dynamic setting is used as a supervised neural attention module to guide the subsequent module for fine-grained video object segmentation. For other 3D-induced tasks conducted with CNNs, Jiang et al. [21] introduced an attention mechanism to allocate weights on multi-level RGB and depth features, obtaining the final fusion feature graph by linear combination. In this study, based on the above model features, rather than a simple linear fusion, an integration strategy based on attention mechanism was adopted, making full use of multi-scale and multi-level features. The final coarse saliency map was obtained using the enhanced saliency map after VGG-16 gradual fusion, and the final refined salient graph was obtained using a refinement block.

Compared to the relevant previous works, our architecture has the following three major advantages:

1) A three-stream network is proposed to solve the problem of cross-modal information fusion to a large extent, as well as to extract multi-level and multi-scale features of RGB and depth via two bottom-top streams, which are favorable conditions for network reasoning.
2) The channel attention mechanism is proposed to enhance these layered features and suppress the inconsistency of feature space, and thus make the network focus more on important targets.
3) The refinement block makes further empirical inferences for each coarse map and corrects the network's own identification errors based on the acquired knowledge, so as to convert the prediction map from coarse to fine.

2. Proposed approach

Our proposed model is composed of four parts: the first two parts are divided into RGB bottom-up stream and depth bottom-up stream, the third part is the cross-modal feature fusion stream embedded with the attention mechanism, and the fourth part is the refinement stream from the top-down. Figure 1 shows a pictorial description of the proposed network architecture. We designed model blocks that could be integrated into any basic network. To compete fairly with other state-of-the-art models, we used the ResNet-18 network [22] as the backbone network of the first part and the bottom-up stream of the second part and the VGG-16 network [9] as the backbone network of the feature fusion stream of the third part, where the embedded attention mechanism is our separate design in order to extract the RGB and depth clear details of the channel features in the process of bottom-up under two types of cross-modal, in particular, we remove the ResNet-18 and VGG-16 network average pooling layer and full connection layer and retain their five convolution blocks. In addition, the fourth part of the refining distillation is designed to gradually refine the coarse saliency map after fusion and restore the resolution of the predicted saliency map, so making it a top-down process.

Fig. 1. Architecture of the proposed human attention prediction network.

Download Full Size | PDF

2.1 Multi-scale and multi-level feature extraction

For the bottom-up RGB stream and depth stream, we adjusted the input pair of the image resolution to 448×448 and the input of the pre-trained ResNet-18 model on ImageNet [23]. In the reasoning stage, the input images were resized to 244×244 in the VGG-16 network, and the top-down fusion path embedded attention mechanism was used to adaptively select and combine cross-modal cross-level information supplements and realize multi-modal and multi-level prediction. The entire network can be trained end-to-end. Previous work has proved that multi-scale features can be used to obtain better significance detection results. This encourages us to select the M layers from the RGB and depth bottom-up, respectively, learn the multi-scale and multi-level prominent features Fm (m = 1, 2, 3, 4), and then input both to the attention mechanism to generate the attention mask Am, achieving cross-modal fusion is through the fusion layer.

2.2 Attention mechanism

The attention mechanism mentioned above is applied to the existing features (both low-level and high-level) of the remodeling process, reducing the traditional convolution in the feature extraction of the inconsistencies on the channel dimension. In order to improve smoothness, it contains a 1 × 1 convolution layer, as well as some arithmetic operations, as shown in Fig. 2. For a given a set of feature vectors F_m∈R^{C × H × W}, its H × W dimension is reshaped into N dimensions, and then the matrix multiplication operation is utilized with its transpose to direct channel attention to X_m∈R^{C × C}, defined as

(1)$${\textbf{X}_m} = {R_e}({\textbf{F}_m}) \otimes {R_e}{({\textbf{F}_m})^T}$$

where ⊗ denotes the matrix multiplication operation, R_e represents the reshape operation by which F_m∈R^{C × H × W} is reshaped into R^{C × N} (N = H × W), and T is the transpose operation to transpose an input matrix.

Fig. 2. Illustration of channel-wise attention mechanism

Download Full Size | PDF

In order to enhance the saliency features, a threshold function Max(, −1) is used to subtract itself, after which the softmax layer is used to obtain the channel-wise weight map W_m. The detailed operation process is as follows:

(2)$${\textbf{W}_m} = \delta [Max({\textbf{X}_m}, - 1) - {\textbf{X}_m}]$$

where Max(X_m, −1) can find significant pixels and suppress insignificant ones, such as background and local details, and δ denotes the softmax function. Then, the generated channel-wise weight map W_m is used for adaptive screening of the original input features, to allow the network to learn more prominent targets and enhances the original features. The enhanced feature graph A_m formula is as follows:

(3)$${\textbf{A}_m} = ({\textbf{W}_m} \otimes {\textbf{F}_m}) + \gamma \ast {\textbf{F}_m}$$

where ⊗ denotes the matrix multiplication operation, and γ is a learnable parameter that is initialized to zero here. When training the model, manual supervision need not be added; the network will automatically learn this parameter to obtain an optimal value. During the test, the optimal parameters are loaded directly for later calculation. Element-wise multiplication is denoted by *. Further details are given in the lower left corner of Fig. 1.

As shown at the top of Fig. 1, the intermediate channel in our proposed network is the fusion layer, whose function is to effectively fuse the obtained multi-level and multi-scale salient features A_m to obtain the final set of fusion eigenvectors V. Although this has led to considerable improvement in prediction performance and the ability to pinpoint exactly what the human eye is looking at, it is clearly not sufficiently sophisticated, and our goal is to obtain more subtle and predominant predictions.

2.3 Refinement

Therefore, the refinement block designed by us, called the recurrent convolutional upsample layer (RCUL), can correct the network’s own identification errors based on the acquired knowledge and adopt the recurrent neural network structure to compensate for the inherent lack of feature extraction in the ordinary CNN. The RCUL consists of an ordinary convolution layer, a recurrent convolution layer, and two up-sampling layers. All the convolution layers are followed by a ReLU and Batchnormal layer, as shown in Fig. 3. The ordinary convolution layer is used to reduce the dimension of the salient features A_m; there are 32 convolution kernels, with size set to 3×3, and stride set to 1. We define the reduced dimension features as

(4)$$Z = \beta ({\sigma ({Conv({{\textbf{A}_m}} )} )} )$$

where Z∈R^{32 × 14 × 14}, Conv denotes an ordinary convolution layer, σ denotes a nonlinear activation function ReLU [24], and β denotes a Batchnormal [25] layer.

Fig. 3. Structure of recurrent convolutional upsample layer (RCUL).

Download Full Size | PDF

An RCUL with T time steps can be expanded into a T subnetwork, and we set T = 3; as shown at the bottom right of Fig. 1, the blue convolution layers represent the unfolded recurrent convolution layer. In order to obtain a resolution of the same size as the input image, the upper sampling layer is embedded into the RCUL in a hierarchical manner. The refinement salient features S^t+1 are defined as follows:

(5)$${\textbf{S}^{t + 1}} = \beta ({\sigma ({\mu (\textbf{Z} )+ \mu ({{R_e}({\textbf{S}^t})} )} )} )$$

where t is the state of the time steps and t ∈ T, indicating that the output at this time is related to the input at the previous time; R denotes the recurrent convolution layer, and μ denotes the upsample operation, where the scale factor is set to 2; σ denotes the nonlinear activation function ReLU, and β denotes the Batchnormal layer.

In summation, the refined feature map passes through our simple output layer to produce the final saliency prediction map M. A 1×1 convolution layer is introduced, followed by a Sigmoid activation layer. The calculation process can be represented as

(6)$$\textbf{M} = sig({Conv({\textbf{S}^3})} )$$

where Conv denotes the convolution layer and sig denotes the Sigmoid activation layer.

2.4 Hybrid loss

In this study, we used MSE combined with the correlation coefficient (CC) metric as the loss function between the final saliency prediction map and the ground truth. However, the CC metric is slightly modified into the representation of dissimilarity without requiring empirical coefficients. The modified loss mimics the behavior of cross-entropy, which is widely used in image classification, approaching zero when there are no errors.

The CC computes the linear correlation between two distributions. The range of the CC is [−1,1], where the positive value indicates that the two distributions are completely correlated, and vice versa. To apply root mean square prop (RMSprop) more efficiently, the CC metric was simply modified as follows:

(7)$$CC^{\prime}({P,Q} )= 1 - \frac{{\sigma ({P,Q} )}}{{\sigma (P )\times \sigma (Q )}}$$

CC’ was used to represent the modified metric; it converts the similarity metric into dissimilarity in the range of [0, 2]. To formulate our final loss function, we simply take the summation of MSE and CC’.

(8)$$L = \frac{1}{N}{\sum\limits_{i = 1}^n {||{P - Q} ||} ^2} + 1 - \frac{{\sigma ({P,Q} )}}{{\sigma (P )\times \sigma (Q )}}$$

2.5 Implementation

For data processing, we randomly sampled 420 images from the NUS [26] and 332 images from the NCTU [27] datasets as training sets, and 60 images from the NUS and 48 images from the NCTU as verification sets; and 95 images from the NUS and 120 images from the NCTU as test sets. These two datasets contain a large number of images, some from real scenes and some from 3D movie scenes, with rich visual content, including semantic content and complex backgrounds. In order to accelerate the convergence speed and improve the computing performance of the network, pre-calculated parameters were used before input into the model to normalize each image area with the mean value as the center along the RGB channel into unit variance.

Our model was trained by loading the pre-trained VGG-16 model onto the ImageNet dataset. A batch size of one image was used in each iteration, and the learning rate was initialized as 1×10⁻⁴. The network parameters were learned by back-propagating the loss function using RMSprop. An early termination algorithm was employed to prevent overfitting of the network. The performance of our model on the training set was exceptionally good, while it was poor on the verification sets; this indicates that our model has undergone overfitting. In order to improve the generalization ability of our model, early termination [28], which (early stopping) is a simple and effective method, can be used. When the performance of the model on the verification set fails to be comparable to that of the that on training set, training should be stopped and the parameters in the previous iteration result should be taken as the final parameters of the model. Note that it takes approximately 80 epochs to train our model. The experiments were performed on the publicly available Pytorch 1.1.0 [29] framework using a workstation equipped with a TITAN V GPU with 12 GB memory.

3. Results and discussion

3.1 Datasets

We conducted experiments on two public representative salient object detection datasets: NUS3D-Saliency [26] and NCTU-3DFixation [27]. The NUS3D-Saliency dataset (denoted NUS) contains 600 2D versus 3D image pairs viewed by 80 participants. This dataset provides color stimuli, depth maps, smooth depth maps, and 2D and 3D fixation maps. The depth information was directly obtained from Kinect depth camera, and eye-tracking data were collected in 2D and 3D free viewing experiments, respectively. The NCTU-3DFixation dataset (denoted NCTU) consists of 475 3D images and their depth maps with a resolution of 1920×1080. The images in this dataset are mainly from various scenes in existing 3D movies or video, including left and right view maps, disparity maps, and fixation maps.

3.2 Evaluation metrics

We used the following metrics to evaluate the eye-fixation prediction results: linear CC, AUC Borji, normalized scanpath saliency (NSS), and Kullback-Leibler divergence (KL-Div). CC and KL-Div metrics are used to compare the predicted saliency map with the ground truth map generated by fixation point, while the other two indexes need to compare the predicted saliency map with the ground truth graph after binarization.

1) CC: The CC is a statistical method that generally measures the dependence or correlation between two variables. CC can be used to interpret fixation and saliency maps. It can be expressed as the following equation, where G and S are the two random variables, the relationship between which is estimated using CC: $(9)$$CC = \frac{{{\mathop{\rm cov}} ({S,G} )}}{{\sigma (S )\times \sigma (G )}}$$$ where cov(S, G) denotes the covariance of S and G. It ranges between −1 and + 1.
2) NSS: NSS is a metric specifically designed for saliency map evaluation. Given S and Q, NSS can be calculated as follows: $(10)$$NSS = \frac{1}{N}\sum\limits_{\textrm{i} = 1}^N {\bar{S}(i )} \times Q(i )$$$ $where \mathop {}\limits^{} N = \sum\nolimits_i {Q(i )\mathop {}\limits^{} }\, and \,\mathop {}\limits^{} \bar{S} = \frac{{S - \mu (S )}}{{\sigma (S )}}$ where N denotes the total number of human eye positions, and σ(·) stands for standard deviation.
3) KLDiv: The KLDiv evaluates the loss of information when the distributions utilized to approximate the distribution, therefore taking a probabilistic interpretation of S and Q. It can be expressed as follows: $(11)$$D({S,Q} )= \sum\limits_{\textrm{i} = 1}^\textrm{n} {{P_S}} (x )\log \frac{{{P_S}(x )}}{{{P_Q}(x )}}$$$ where i indexes the i^th pixel and is a regularization constant.
4) AUC: The AUC metric, defined as the area under the receiver operating characteristic (ROC) curve, is widely used to evaluate the maps by saliency models.

3.3 Comparison against state-of-the-art methods

We compared our proposed model on the two benchmark datasets above to six other state-of-the-art models: Fang [30], DeepFix [8], ML-net [31], DVA [14], iSEEL [32], and SAM [15]. For fair comparison, all models use the same training sets, verification sets, and test sets. We used their publicly available code to obtain saliency maps with the parameters recommended by their authors. Table 1 presents the quantitative results obtained on the NUS and NCTU datasets and Fig. 4 shows sample images of these models, including our model. The table and figure show that, compared to the other methods, our approach has strong robustness and is rarely distracted by high-contrast edges and complex backgrounds, and it can thus generate more accurate saliency prediction results.

Fig. 4. Qualitative results: we compare our results with 6 saliency prediction models. Column 1: original left images; Column 2: the ground truth; Column 3: the proposed model; Columns 4–9: saliency map from [29,8,30,14,31,15].

Download Full Size | PDF

Table 1. Comparison of quantitative scores on NUS [26] and NCTU [27] datasets

View Table | View all tables in this article

The proposed method can detect single and complex bottom-up saliency patterns with different scales and contrasts (see rows 1 and 2), and can effectively handle local and global contrast features (particularly rows 7 and 12; an image contains many complex, densely arranged, tiny objects). More importantly, the face is a top-down factor that our network can detect very well (see row 8), along with people in a complex background (see rows 3 and 9); pedestrians on the street are often salient objects, but, influenced by the ambient light, brighter places (see row 10) and long and short distance objects (see rows 4, 5, and 11) receive more visual attention from people. Our network can also handle the integration of contrasts and top-down factors; in particular, in rows 6, 11, and 12, although the images contain a large number of eye-catching objects or complex backgrounds, our model can still detect the regions that draw human attention.

3.4 Ablation experiments and analyses

In this section, we explore the effect of different components in the proposed method on the NCTU dataset. To prove the effectiveness of the proposed attention mechanism. We compared the results of using the backbone (obtained by removing the attention mechanism and RCUL from the original network; denoted B) with the results obtained by adding the attention mechanism to the backbone (denoted B + A). As shown in Table 2, comparing the first and second rows, it is clear that the proposed attention mechanism brings obvious improvement. Obviously, comparing second and third columns in Fig. 5, the attention mechanism can make prominent areas more prominent and highlighted and the contrast between salient and non-salient regions is promoted.

Fig. 5. Visual comparison with different modules. The meaning of indexes could be seen in the caption of Table 2

Download Full Size | PDF

Table 2. Ablation studies of different modules.

View Table | View all tables in this article

We also evaluated the results using the backbone with RCUL (denoted B + R; the third row in Table 2), which shows that B + R also has positive effects. Comparing the results generated by the backbone (B, the second column in Fig. 5) and the backbone with RCUL (B + R, the fourth column in Fig. 5), it can be seen that RCUL is indeed able to refine the saliency areas detected by the backbone network, and repair the mistakes made by the backbone network in saliency detection. Then, both attention mechanism and RCUL are integrated into the backbone network, which is the model we proposed in Sec. 2 (defined here as B + A + R, and show the performance in the fourth row in Table 2.).

Visually, as shown in Fig. 5, comparing the results between the fifth (B + A+ R) and the above column, it can be seen that it combines the advantages of using attention mechanism or RCUL alone. Although all the indicators are not the best, the results are the best in a comprehensive manner. Visually it is clear that the results of the proposed method are the closest to the ground truth.

4. Summary

In this study, the multi-modal fusion of 3D data was investigated and a new three-stream architecture proposed to enhance the bottom-up and top-down representation capability of the approach. In order to achieve effective cross-modal and cross-level integration, the attention mechanism of cross-modal cross-level fusion domain channel intelligence were introduced in the bottom-up reasoning path. The proposed networks effectively extracted and combined each other from different modes and levels. Then, the coarse saliency map was obtained, which was further refined by refinement blocks in a top-down structure. The significance graph prediction accuracy of the model was also improved.

This model demonstrates superior performance, largely because of its attention mechanism design. It has potential to imitate the human visual system even more closely, and we expect to achieve this in future work by introducing this technology directly to the convolution kernel to develop adaptive convolution kernels which can quickly detect or identify the target, allow significant compression of the model parameters, and be applied to a variety of tasks.

Funding

National Natural Science Foundation of China (61502429, 61672337, 61971247, 61972357); Natural Science Foundation of Zhejiang Province (LY18F020012); Primary Research and Development Plan of Zhejiang Province (2019C03135); Open Project of Provincial Key Laboratory of Information Processing, Communication and Networking; China Postdoctoral Science Foundation (2015M581932).

Acknowledgments

Writing assistance was provided by Editage, Inc.

Disclosures

The authors declare no conflicts of interest.

References

1. M. Liu, C. Lu, H. Li, and X. Liu, “Near eye light field display based on human visual features,” Opt. Express 25(9), 9886–9900 (2017). [CrossRef]

2. C. E. Connor, H. E. Egeth, and S. Yantis, “Visual attention: bottom-up versus top-down,” Curr. Biol. 14(19), R850–R852 (2004). [CrossRef]

3. M. DaneshPanah, B. Javidi, and E. A. Watson, “Three dimensional object recognition with photon counting imagery in the presence of noise,” Opt. Express 18(25), 26450–26460 (2010). [CrossRef]

4. K. H. Yoon, M. K. Kang, H. Lee, and S. K. Kim, “Autostereoscopic 3D display system with dynamic fusion of the viewing zone under eye tracking: principles, setup, and evaluation,” Appl. Opt. 57(1), A101–A117 (2018). [CrossRef]

5. B. Kan, Y. Zhao, and S. Wang, “Objective visual comfort evaluation method based on disparity information and motion for stereoscopic video,” Opt. Express 26(9), 11418–11437 (2018). [CrossRef]

6. W. Zhou, L. Yu, W. Qiu, Y. Zhou, and M. Wu, “Local gradient patterns (LGP): An effective local-statistical-feature extraction scheme for no-reference image quality assessment,” Inf. Sci. 397–398, 1–14 (2017). [CrossRef]

7. X. Huang, C. Shen, X. Boix, and Q. Zhao, “Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks,” in International Conference on Computer Vision (ICCV), (IEEE, 2015), pp. 262–270.

8. S. S. Kruthiventi, K. Ayush, and R. Babu, “Deepfix: A fully convolutional neural network for predicting human eye fixations,” IEEE Trans. Image Process. 26(9), 4446–4456 (2017). [CrossRef]

9. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 (2014).

10. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition(CVPR), (IEEE, 2015), pp. 1–9.

11. F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122 (2015).

12. J. Pan, C. C. Ferrer, K. McGuinness, N. E. O’Connor, J. Torres, E. Sayrol, and X. Giro-i Nieto, “Salgan: Visual saliency prediction with generative adversarial networks,” arXiv preprint arXiv:1701.01081 (2017).

13. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, 2672–2680 (2014).

14. W. Wang and J. Shen, “Deep visual attention prediction,” IEEE Trans. Image Process. 27(5), 2368–2378 (2018). [CrossRef]

15. M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Predicting human eye fixations via an lstm-based saliency attentive model,” IEEE Trans. Image Process. 27(10), 5142–5154 (2018). [CrossRef]

16. A. Kroner, M. Senden, K. Driessens, and R. Goebel, “Contextual Encoder-Decoder Network for Visual Saliency Prediction,” arXiv preprint arXiv:1902.06634 (2019).

17. Z. Che, A. Borji, G. Zhai, X. Min, G. Guo, and P. Callet, “Leverage eye-movement data for saliency modeling: Invariance Analysis and a Robust New Model,” arXiv preprint arXiv:1905.06803, May. (2019).

18. Q. Zhang, X. Wang, J. Jiang, and L. Ma, “Deep learning features inspired saliency detection of 3D images,” in Pacific rim conference on multimedia, 580–589 (2016).

19. B. Li, Q. Liu, X. Shi, and Y. Yang, “Graph-Based Saliency Fusion with Superpixel-Level Belief Propagation for 3D Fixation Prediction,” in 2018 25th IEEE International Conference on Image Processing (ICIP), (IEEE, 2018), pp. 2321–2325.

20. W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. Hoi, and H. Ling, “Learning unsupervised video object segmentation through visual attention,” in Computer Vision and Pattern Recognition(CVPR), (IEEE, 2019), pp. 3064–3074.

21. M.-X. Jiang, C. Deng, J.-S. Shan, Y.-Y. Wang, Y.-J. Jia, and X. Sun, “Hierarchical multi-modal fusion FCN with attention model for RGB-D tracking,” Information Fusion 50, 1–8 (2019). [CrossRef]

22. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Computer Vision and Pattern Recognition(CVPR), (IEEE, 2016), pp.770–778.

23. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” In Advances in neural information processing systems, 1097–1105 (2012).

24. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.

25. B. Normalization, “Accelerating deep network training by reducing internal covariate shift.” CoRR.–2015.–Vol. abs/1502.03167.–URL: http://arxiv.org/abs/1502.03167 (2015).

26. C. Lang, T. Nguyen, H. Katti, K. Yadati, M. Kankanhalli, and S. Yan, “Depth matters: Influence of depth cues on visual saliency,” in European Conference on Computer Vision (ECCV), (Springer, 2012), pp. 101–115.

27. C.-Y. Ma and H. Hang, “Learning-based saliency model with depth information,” J. Vis. 15(6), 19 (2015). [CrossRef]

28. L. Prechelt, “Early Stopping-But When?” Neural Networks Tricks of the Trade 1524, 55–69 (1998). [CrossRef]

29. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” NIPS-W, Oct. (2017).

30. Y. Fang, J. Wang, M. Narwaria, P. Le Callet, and W. Lin, “Saliency detection for stereoscopic images,” IEEE Trans. Image Process. 23(6), 2625–2636 (2014). [CrossRef]

31. M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “A deep multi-level network for saliency prediction,” in International Conference on Pattern Recognition (ICPR), 3488–3493(2016).

32. H. R. Tavakoli, A. Borji, J. Laaksonen, and E. J. N. Rahtu, “Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features,” Neurocomputing 244, 10–18 (2017). [CrossRef]

Datasets	Criteria	Fang	DeepFix	ML-net	DVA	iSEEL	SAM	Proposed
NUS	CC	0.333	0.4322	0.446	0.4549	0.5195	0.5013	0.5579
	KLDiv	1.560	1.8138	1.780	2.4349	1.2479	2.9059	1.0903
	AUC	0.795	0.7699	0.766	0.7236	0.8273	0.7461	0.8339
	NSS	1.209	1.6608	1.821	1.7962	2.1250	2.1259	2.3373
NCTU	CC	0.542	0.7974	0.696	0.6834	0.7578	0.7115	0.8614
	KLDiv	0.674	1.3083	0.900	1.1045	0.3985	1.8482	0.2681
	AUC	0.806	0.8650	0.835	0.8035	0.8315	0.7250	0.9143
	NSS	1.264	1.8575	1.588	1.5546	1.7187	1.5386	2.3795

		CC	KLDiv	AUC	NSS
NCTU	B	0.8405	0.3013	0.9069	2.3171
	B + A	0.8453	0.2841	0.9159	2.3773
	B + R	0.8589	0.2678	0.9134	2.3626
	B + A+R	0.8614	0.2681	0.9143	2.3795

Datasets	Criteria	Fang	DeepFix	ML-net	DVA	iSEEL	SAM	Proposed
NUS	CC	0.333	0.4322	0.446	0.4549	0.5195	0.5013	0.5579
	KLDiv	1.560	1.8138	1.780	2.4349	1.2479	2.9059	1.0903
	AUC	0.795	0.7699	0.766	0.7236	0.8273	0.7461	0.8339
	NSS	1.209	1.6608	1.821	1.7962	2.1250	2.1259	2.3373
NCTU	CC	0.542	0.7974	0.696	0.6834	0.7578	0.7115	0.8614
	KLDiv	0.674	1.3083	0.900	1.1045	0.3985	1.8482	0.2681
	AUC	0.806	0.8650	0.835	0.8035	0.8315	0.7250	0.9143
	NSS	1.264	1.8575	1.588	1.5546	1.7187	1.5386	2.3795

		CC	KLDiv	AUC	NSS
NCTU	B	0.8405	0.3013	0.9069	2.3171
	B + A	0.8453	0.2841	0.9159	2.3773
	B + R	0.8589	0.2678	0.9134	2.3626
	B + A+R	0.8614	0.2681	0.9143	2.3795

Attention-based fusion network for human eye-fixation prediction in 3D images

Abstract

1. Introduction

2. Proposed approach

2.1 Multi-scale and multi-level feature extraction

2.2 Attention mechanism

2.3 Refinement

2.4 Hybrid loss

2.5 Implementation

3. Results and discussion

3.1 Datasets

3.2 Evaluation metrics

3.3 Comparison against state-of-the-art methods

3.4 Ablation experiments and analyses

4. Summary

Funding

Acknowledgments

Disclosures

References

Cited By

Figures (5)

Tables (2)

Equations (12)

Optics Express