TY - GEN
T1 - Prototype-Guided Dual-Transformer Reasoning for Video Individual Counting
AU - Li, Rui
AU - Liu, Yishu
AU - Li, Huafeng
AU - Li, Jinxing
AU - Lu, Guangming
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - Video Individual Counting (VIC), which focuses on accurately tallying the total number of individuals in a video without duplication, is crucial for urban public space management and densely-populated areas planning. Existing methods suffer from limitations in terms of expensive manual annotation, and the efficiency of location or detection algorithms. In this work, we contribute a novel Prototype-guided Dual-Transformer Reasoning framework, termed PDTR, which takes both similarity and difference of adjacent frames into account to achieve accurate counting in an end-to-end regression manner. Specifically, we first design a multi-receptive field feature fusion module to acquire initial comprehensive representations. Subsequently, the dynamic prototype generation module memorizes consistent representations of similar information to generate prototypes. Additionally, to further dig out the shared and private features from different frames, a prototype cross-guided decoder and a privacy-decoupling module are designed. Extensive experiments conducted on two existing VIC datasets, consistently demonstrate the superiority of PDTR over state-of-the-art baselines.
AB - Video Individual Counting (VIC), which focuses on accurately tallying the total number of individuals in a video without duplication, is crucial for urban public space management and densely-populated areas planning. Existing methods suffer from limitations in terms of expensive manual annotation, and the efficiency of location or detection algorithms. In this work, we contribute a novel Prototype-guided Dual-Transformer Reasoning framework, termed PDTR, which takes both similarity and difference of adjacent frames into account to achieve accurate counting in an end-to-end regression manner. Specifically, we first design a multi-receptive field feature fusion module to acquire initial comprehensive representations. Subsequently, the dynamic prototype generation module memorizes consistent representations of similar information to generate prototypes. Additionally, to further dig out the shared and private features from different frames, a prototype cross-guided decoder and a privacy-decoupling module are designed. Extensive experiments conducted on two existing VIC datasets, consistently demonstrate the superiority of PDTR over state-of-the-art baselines.
KW - dual-transformer
KW - prototype learning
KW - video individual counting
UR - https://www.scopus.com/pages/publications/85209806333
U2 - 10.1145/3664647.3681211
DO - 10.1145/3664647.3681211
M3 - 会议稿件
AN - SCOPUS:85209806333
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 10258
EP - 10267
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 32nd ACM International Conference on Multimedia, MM 2024
Y2 - 28 October 2024 through 1 November 2024
ER -