
Exploring the complementarity between convolution and transformer matching for visual tracking

  • School of Computer Science and Technology, Harbin Institute of Technology
  • National University of Defense Technology

Research output: Contribution to journal › Article › peer-review

Abstract

The essence of Siamese trackers is similarity matching between the deep feature of a target template and the deep feature of a search region. With the successful application of the Transformer in the vision community, similarity matching is moving from convolution matching to Transformer matching. While this transition yields a performance boost, we find that convolution matching and Transformer matching are intuitively complementary. Employing only one of the two is therefore suboptimal for trackers, and exploiting their complementarity holds great potential. To this end, we present a Matching Knowledge Fusion (MKF) module that efficiently integrates convolution matching with an enhanced Transformer matching to exploit this complementarity. Furthermore, because the noisy and ambiguous attention weights of Transformer matching degrade the matching results, we propose a novel mechanism that uses complementary matching knowledge to correct the attention weights. Based on the MKF module, we build a simple but effective tracker, dubbed MKFTrack. Extensive experiments demonstrate the favorable performance of our tracker against state-of-the-art trackers.
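To make the two matching styles named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' MKF module. It assumes random toy features, a naive sliding-window cross-correlation as the convolution matching, and a single-head cross-attention without learned projections as the Transformer matching. The names `conv_matching` and `attention_matching` and all tensor sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def conv_matching(template, search):
    """Convolution matching (illustrative): slide the template (C, h, w)
    over the search feature (C, H, W) and sum the element-wise products."""
    C, h, w = template.shape
    _, H, W = search.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            out[i, j] = np.sum(template * search[:, i:i + h, j:j + w])
    return out


def attention_matching(template, search):
    """Transformer-style matching (illustrative): every search token attends
    to every template token. No learned projections are used here, which is
    a simplifying assumption; values are taken equal to keys."""
    C = template.shape[0]
    q = search.reshape(C, -1).T      # (H*W, C) queries from the search region
    k = template.reshape(C, -1).T    # (h*w, C) keys from the template
    logits = q @ k.T / np.sqrt(C)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    fused = weights @ k              # (H*W, C) template info per search token
    return weights, fused


# Toy data: plant the template inside the search feature map.
rng = np.random.default_rng(0)
template = rng.standard_normal((8, 3, 3))
search = rng.standard_normal((8, 7, 7))
search[:, 2:5, 3:6] = template

response = conv_matching(template, search)           # one score per location
peak = np.unravel_index(response.argmax(), response.shape)
weights, fused = attention_matching(template, search)
print("response map:", response.shape, "peak at", peak)
print("attention weights:", weights.shape, "fused tokens:", fused.shape)
```

The sketch shows the structural difference the abstract relies on: convolution matching compares the template as a whole, spatially ordered window and returns one score per location, while attention matching compares individual tokens pairwise and returns a weight matrix. How MKFTrack fuses the two, and how it corrects the attention weights, is not specified in the abstract and is not modeled here.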

Original language: English
Article number: 112184
Journal: Knowledge-Based Systems
Volume: 300
State: Published - 27 Sep 2024
Externally published: Yes

Keywords

  • Convolution matching
  • Matching knowledge fusion
  • Transformer matching
  • Visual object tracking
