Abstract
Recently, leveraging complementary multimodal information to estimate object poses from RGB-D images has gained widespread attention, demonstrating significant performance improvements. Early methods primarily focus on the correspondence between RGB and depth images, extracting texture and geometric features separately and then simply concatenating them for fusion, which amplifies the negative impact of redundant background regions and noise during multimodal fusion. To address these challenges, we introduce EFT6D, which harnesses semantic similarity across modalities to more effectively integrate globally augmented fused features. The additive attention mechanism we introduce eliminates the need for pairwise key-value interactions, reducing computational complexity and significantly improving model efficiency and performance. In addition, we adopt an augmented shortcut connection to further improve model performance. Experimental results on the LineMOD, Occlusion-LineMOD, and YCB-Video datasets show that EFT6D markedly improves pose estimation accuracy while maintaining real-time inference. Finally, we apply EFT6D to real-world object pose estimation and grasping experiments, demonstrating the effectiveness of our method.
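The abstract states that the additive attention mechanism avoids pairwise key-value interactions, which is what brings the complexity down from quadratic to linear in the number of tokens. EFT6D's exact formulation is not given here, so the following is only a generic sketch of Fastformer-style additive attention (global query and key vectors formed by learned attention-weighted pooling); all weight names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(x, Wq, Wk, Wv, wq, wk):
    """Fastformer-style additive attention sketch: O(N) in sequence
    length, no pairwise query-key dot products.
    x: (N, d) token features; Wq/Wk/Wv: (d, d); wq/wk: (d,)."""
    d = x.shape[1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Pool queries into one global query vector via learned attention.
    alpha = softmax(q @ wq / np.sqrt(d))   # (N,) position weights
    g = alpha @ q                          # (d,) global query
    # Modulate each key element-wise by the global query.
    p = g * k                              # (N, d)
    # Pool the modulated keys into one global key vector.
    beta = softmax(p @ wk / np.sqrt(d))    # (N,)
    u = beta @ p                           # (d,) global key
    # Modulate values by the global key; residual keeps per-token info.
    return u * v + q                       # (N, d)

rng = np.random.default_rng(0)
N, d = 8, 16
x = rng.standard_normal((N, d))
out = additive_attention(
    x,
    *(rng.standard_normal((d, d)) for _ in range(3)),
    *(rng.standard_normal(d) for _ in range(2)),
)
print(out.shape)  # → (8, 16)
```

Because each token only interacts with two pooled global vectors rather than every other token, the cost grows linearly with the number of fused RGB-D tokens, consistent with the real-time claim in the abstract.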
| Original language | English |
|---|---|
| Article number | 129035 |
| Journal | Expert Systems with Applications |
| Volume | 296 |
| DOIs | |
| State | Published - 15 Jan 2026 |
Keywords
- Feature fusion
- Object pose estimation
- Transformer