An Empirical Study of Frame Selection for Text-to-Video Retrieval

  • Mengxia Wu
  • , Min Cao*
  • , Yang Bai
  • , Ziyin Zeng
  • , Chen Chen
  • , Liqiang Nie
  • , Min Zhang
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of the video challenges the performance and efficiency of TVR. To handle the serialized video contexts, existing methods typically select a subset of frames within a video to represent the video content for TVR. How to select the most representative frames is a crucial issue, whereby the selected frames are required to not only retain the semantic information of the video but also promote retrieval efficiency by excluding temporally redundant frames. In this paper, we make the first empirical study of frame selection for TVR. We systemically classify existing frame selection methods into text-free and text-guided ones, under which we detailedly analyze six different frame selections in terms of effectiveness and efficiency. Among them, two frame selections are first developed in this paper. According to the comprehensive analysis on multiple TVR benchmarks, we empirically conclude that the TVR with proper frame selections can significantly improve the retrieval efficiency without sacrificing the retrieval performance.

Original languageEnglish
Title of host publicationFindings of the Association for Computational Linguistics
Subtitle of host publicationEMNLP 2023
PublisherAssociation for Computational Linguistics (ACL)
Pages6821-6832
Number of pages12
ISBN (Electronic)9798891760615
DOIs
StatePublished - 2023
Externally publishedYes
Event2023 Findings of the Association for Computational Linguistics: EMNLP 2023 - Hybrid, Singapore
Duration: 6 Dec 202310 Dec 2023

Publication series

NameFindings of the Association for Computational Linguistics: EMNLP 2023

Conference

Conference2023 Findings of the Association for Computational Linguistics: EMNLP 2023
Country/TerritorySingapore
CityHybrid
Period6/12/2310/12/23

Fingerprint

Dive into the research topics of 'An Empirical Study of Frame Selection for Text-to-Video Retrieval'. Together they form a unique fingerprint.

Cite this