Abstract
Human feelings are integral to Video Quality Assessment (VQA) under the Human Visual System (HVS) paradigm. However, most existing approaches predominantly model spatio-temporal distortions and insufficiently account for perceptual effects driven by human feelings. This work proposes CLiF-VQA+, which pioneers the incorporation of human feelings in VQA to achieve a superior simulation of the HVS. To effectively extract human feelings from videos, we first validate that the CLIP model exhibits high consistency with human objective and subjective feelings. Building on this, we investigate prompt strategies and discover that mixing objective and subjective prompts leads to significant feature suppression, which degrades performance. To address this limitation, we introduce a decoupled strategy that utilizes distinct prompts for separate feature extraction. Objective feelings are captured using Multi-Region Sliding-Window Sampling (MRSWS) at native resolution to preserve local distortion cues. Concurrently, subjective feelings are modeled from downscaled full frames to efficiently retain global semantics. These distinct feeling representations are fused with spatio-temporal features to predict video quality. Extensive experiments demonstrate that CLiF-VQA+ achieves superior performance with high computational efficiency.
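The two sampling paths described in the abstract can be illustrated with a minimal sketch. This is not the authors' MRSWS implementation: it uses a plain single-region sliding window, and the function names, the 224-pixel window, the stride, and the nearest-neighbour downscaling are all assumptions chosen for illustration. It only shows the contrast between local crops taken at native resolution and a downscaled view of the full frame.

```python
import numpy as np

def sliding_window_crops(frame, window, stride):
    """Collect window x window crops at native resolution (no resizing).

    Hypothetical helper: a plain sliding window, not the paper's MRSWS.
    """
    h, w = frame.shape[:2]
    crops = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            crops.append(frame[top:top + window, left:left + window])
    return crops

def downscale_full_frame(frame, size):
    """Nearest-neighbour downscale of the whole frame to size x size.

    Hypothetical helper: the interpolation method is an assumption.
    """
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return frame[rows][:, cols]

# A dummy 720p RGB frame stands in for a decoded video frame.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)

# Local path: crops keep native pixels, so local distortion cues are not resampled.
local_crops = sliding_window_crops(frame, window=224, stride=224)

# Global path: one small view of the entire frame, keeping overall semantics.
global_view = downscale_full_frame(frame, size=224)
```

In a full pipeline, each view would presumably be passed to a vision-language encoder with its own prompt set (objective prompts for the crops, subjective prompts for the downscaled frame); that step is omitted here because the abstract does not specify it.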
| Original language | English |
|---|---|
| Article number | 133940 |
| Journal | Neurocomputing |
| Volume | 694 |
| State | Published - 14 Sep 2026 |
| Externally published | Yes |
Keywords
- Human feelings
- Video quality assessment
- Vision-language models
Fingerprint
Dive into the research topics of 'CLiF-VQA+: Enhancing video quality assessment by incorporating human objective and subjective feelings'. Together they form a unique fingerprint.