Abstract
360° videos have become widely used with the development of virtual reality technology, creating a demand for identifying the most visually attractive objects in them, a task known as 360° video saliency prediction (VSP). While generative models such as variational autoencoders and autoregressive models have proved effective at handling spatio-temporal data, applying them to 360° VSP remains challenging because of severe distortion and inconsistent feature alignment. In this study, we propose a novel spatio-temporal consistency generative network for 360° VSP. A dual-stream encoder-decoder architecture is adopted to process the forward and backward frame sequences of 360° videos simultaneously. Moreover, a deep autoregressive module, termed axial-attention based spherical ConvLSTM, is designed in the encoder to memorize features with global-range spatial and temporal dependencies. Finally, motivated by the bias phenomenon in human viewing behavior, a temporal-convolutional Gaussian prior module is introduced to further improve the accuracy of saliency prediction. Extensive experiments comparing our model with state-of-the-art competitors demonstrate that it achieves the best performance on the PVS-HM and VR-Eyetracking databases.
| Original language | English |
|---|---|
| Pages (from-to) | 311-322 |
| Number of pages | 12 |
| Journal | IEEE Journal on Emerging and Selected Topics in Circuits and Systems |
| Volume | 14 |
| Issue number | 2 |
| State | Published - 2024 |
| Externally published | Yes |
Keywords
- 360° videos
- Gaussian priors
- Saliency prediction
- Spatio-temporal features
Title
Predicting 360° Video Saliency: A ConvLSTM Encoder-Decoder Network With Spatio-Temporal Consistency