Visual and Textual Prompts in VLLMs for Enhancing Emotion Recognition

  • Zhifeng Wang*
  • Qixuan Zhang
  • Peter Zhang
  • Wenjia Niu
  • Kaihao Zhang
  • Ramesh Sankaranarayana
  • Sabrina Caldwell
  • Tom Gedeon
  • *Corresponding author for this work
  • Australian National University
  • Quriosity Pty Ltd.
  • Curtin University

Research output: Contribution to journal › Article › peer-review

Abstract

Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, leading to reduced robustness in real-world scenarios. To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, others’ emotions) into a unified prompting strategy. SoVTP preserves holistic scene information while enabling fine-grained analysis of facial muscle movements and interpersonal dynamics. Extensive experiments show that SoVTP achieves substantial improvements over existing visual prompting methods, demonstrating its effectiveness in enhancing VLLMs’ video emotion recognition capabilities.
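
As a rough illustration of what a unified prompt of this kind might look like on the textual side, the sketch below assembles spatial annotations, facial action units, and contextual cues into a single query string. It is not the authors' implementation: the `PersonAnnotation` class, the `build_prompt` function, the prompt wording, and the `EMOTIONS` label set are all hypothetical, and the visual half of the prompt (overlays drawn on the frames) is not covered here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed label set for illustration only; the abstract does not specify one.
EMOTIONS = ["happy", "sad", "angry", "surprised", "fearful", "disgusted", "neutral"]


@dataclass
class PersonAnnotation:
    """Hypothetical per-person annotation bundle for one video frame."""
    person_id: int
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    action_units: List[str] = field(default_factory=list)  # e.g. ["AU6", "AU12"]
    posture: str = ""  # free-text body-posture description


def build_prompt(target_id: int,
                 people: List[PersonAnnotation],
                 scene: str,
                 labels: List[str] = EMOTIONS) -> str:
    """Combine scene context and per-person cues into one text prompt."""
    lines = [f"Scene: {scene}"]
    for p in people:
        # Mark which person the question is about; everyone else is context.
        role = "target" if p.person_id == target_id else "other"
        aus = ", ".join(p.action_units) if p.action_units else "none detected"
        posture = p.posture or "not described"
        lines.append(
            f"Person {p.person_id} ({role}): box={p.bbox}; "
            f"action units: {aus}; posture: {posture}"
        )
    lines.append(
        f"Question: Considering the whole scene, which emotion does person "
        f"{target_id} show? Answer with one of: {', '.join(labels)}."
    )
    return "\n".join(lines)


people = [
    PersonAnnotation(1, (10, 20, 110, 220), ["AU6", "AU12"], "leaning forward"),
    PersonAnnotation(2, (200, 30, 300, 240)),
]
prompt = build_prompt(1, people, "two people talking at a table")
print(prompt)
```

Keeping the non-target people in the prompt, rather than cropping to the target face, reflects the abstract's emphasis on preserving holistic scene information alongside fine-grained facial cues.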

Original language: English
Pages (from-to): 12355-12368
Number of pages: 14
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 12
State: Published - 2025
Externally published: Yes

Keywords

  • Vision large language models
  • multi-modal prompting
  • zero-shot emotion recognition
