EFT6D: An efficient fusion transformer network for 6D object pose estimation

  • Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Recently, leveraging complementary multimodal information to estimate object poses from RGB-D images has gained widespread attention and yielded significant performance improvements. Early methods primarily focus on the correspondence between RGB and depth images, extracting texture and geometric features separately and then simply concatenating them for fusion, which exacerbates the negative impact of redundant background and noise during multimodal fusion. To address these challenges, we introduce EFT6D, which harnesses semantic similarity across modalities to more effectively integrate globally augmented fused features. We introduce an additive attention mechanism that eliminates the need for pairwise key-value interactions, reducing computational complexity and significantly improving model efficiency and performance. In addition, we adopt an augmented shortcut connection to further improve model performance. Experimental results on the LineMOD, Occlusion-LineMOD, and YCB-Video datasets show that EFT6D markedly improves pose estimation accuracy while maintaining real-time inference. Finally, we apply EFT6D to real-world object pose estimation and grasping experiments, demonstrating the effectiveness of our method.
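The abstract does not give EFT6D's exact formulation, but additive attention without pairwise key-value interactions is commonly realized in the style of Fastformer: tokens interact only through global summary vectors, so the cost is linear in sequence length rather than quadratic. The sketch below is a hypothetical, single-head numpy illustration of that idea (all weight names `Wq`, `Wk`, `Wv`, `wq`, `wk` are assumptions, not the paper's parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(x, Wq, Wk, Wv, wq, wk):
    """Additive attention sketch (Fastformer-style assumption).

    x: (n, d) token features. No n x n attention matrix is formed:
    tokens interact only via global summary vectors, giving O(n*d)
    cost instead of the O(n^2 * d) of standard self-attention.
    """
    d = x.shape[-1]
    q = x @ Wq                                     # (n, d) queries
    k = x @ Wk                                     # (n, d) keys
    v = x @ Wv                                     # (n, d) values
    # Global query: attention-weighted sum of query vectors.
    alpha = softmax(q @ wq / np.sqrt(d))           # (n,) token weights
    g_q = (alpha[:, None] * q).sum(axis=0)         # (d,) global query
    # Modulate keys element-wise by the global query (no QK^T matrix).
    p = k * g_q                                    # (n, d)
    beta = softmax(p @ wk / np.sqrt(d))            # (n,) token weights
    g_k = (beta[:, None] * p).sum(axis=0)          # (d,) global key
    # Modulate values by the global key; output keeps shape (n, d).
    return v * g_k
```

Because the two softmaxes are over per-token scalars, memory stays linear in `n`, which is consistent with the abstract's claim of real-time inference; the actual EFT6D layer may differ in head structure and normalization.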

Original language: English
Article number: 129035
Journal: Expert Systems with Applications
Volume: 296
DOIs
State: Published - 15 Jan 2026

Keywords

  • Feature fusion
  • Object pose estimation
  • Transformer
