
Discovering Syntactic Interaction Clues for Human-Object Interaction Detection

  • Jinguo Luo
  • Weihong Ren*
  • Weibo Jiang
  • Xi'ai Chen
  • Qiang Wang
  • Zhi Han
  • Honghai Liu
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen
  • CAS - Shenyang Institute of Automation
  • Chinese Academy of Sciences
  • Shenyang University

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Recently, Vision-Language Models (VLMs) have greatly advanced Human-Object Interaction (HOI) detection. Existing VLM-based HOI detectors typically adopt a hand-crafted template (e.g., "a photo of a person [action] a/an [object]") to acquire text knowledge through the VLM text encoder. However, such approaches, which encode action-specific text prompts only at the vocabulary level, may suffer from learning ambiguity because they do not explore fine-grained clues from the perspective of interaction context. In this paper, we propose a novel method to discover Syntactic Interaction Clues for HOI detection (SICHOI) using a VLM. Specifically, we first investigate the essential elements of an interaction context, and then establish a syntactic interaction bank at three levels: spatial relationship, action-oriented posture, and situational condition. Further, to align visual features with the syntactic interaction bank, we adopt a multi-view extractor to jointly aggregate visual features from the instance, interaction, and image levels. In addition, we introduce a dual cross-attention decoder to perform context propagation between text knowledge and visual features, thereby enhancing HOI detection. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on HICO-DET and V-COCO.
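The hand-crafted template quoted in the abstract can be illustrated with a minimal sketch. It covers only the baseline prompt construction that the abstract attributes to existing VLM-based detectors, not the SICHOI method itself. The function names `article_for` and `build_prompt` and the example action/object pairs are hypothetical, and the step of passing the prompts to a VLM text encoder is omitted.

```python
# Illustrative sketch only: fills the template
# "a photo of a person [action] a/an [object]" from the abstract.
# All names here are hypothetical and not taken from the paper's code.

VOWELS = {"a", "e", "i", "o", "u"}


def article_for(noun: str) -> str:
    """Pick "a" or "an" with a simple first-letter heuristic."""
    return "an" if noun[:1].lower() in VOWELS else "a"


def build_prompt(action: str, obj: str) -> str:
    """Fill the hand-crafted template for one (action, object) pair."""
    return f"a photo of a person {action} {article_for(obj)} {obj}"


# Hypothetical action/object pairs, chosen only for illustration.
pairs = [("riding", "bicycle"), ("eating", "apple")]
prompts = [build_prompt(action, obj) for action, obj in pairs]

for prompt in prompts:
    print(prompt)
```

Each resulting string carries only the action and object words, which is the vocabulary-level encoding the abstract identifies as a source of ambiguity.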

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 28212-28222
Number of pages: 11
ISBN (Electronic): 9798350353006
ISBN (Print): 9798350353006
State: Published - 2024
Externally published: Yes
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 16 Jun 2024 to 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 16/06/24 to 22/06/24
