
Difference-Aware Fusion Network for Efficient RGB-D Semantic Segmentation in Indoor Robots

  • Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Incorporating both RGB and depth images has proven effective for enhancing the performance of semantic segmentation. However, current RGB-D semantic segmentation methods tend to overlook the critical role of cross-modal difference information during fusion, leading to the undesired suppression of discriminative cues and a failure to achieve potent cross-modal complementary fusion. In this article, a novel RGB-D semantic segmentation approach that efficiently exploits multimodal information is proposed. To address the suppression of cross-modal difference information, we propose a dynamic frequency-spatial difference-aware fusion module adept at explicitly emphasizing cross-modal differences, capturing vital features in the frequency domain, and using them to aggregate the spatial context information of multimodal features. We also present a novel soft-edge loss that handles complex scenes meticulously by supervising different regions separately. In addition, a progressive calibration context module is designed to enhance global contextual information by capturing multiscale multimodal representations. Extensive experiments on two public RGB-D datasets demonstrate that the proposed DFNet achieves highly competitive performance compared to state-of-the-art methods, making it well-suited for assisting indoor robots.
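The core idea of difference-aware fusion — explicitly emphasizing where the RGB and depth features disagree, weighting that difference in the frequency domain, and folding it back into the fused representation — can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' module: the function name, the spectral-magnitude gating, and the additive aggregation are all illustrative choices, not details taken from the paper.

```python
import numpy as np

def difference_aware_fusion(rgb_feat, depth_feat):
    """Toy sketch of cross-modal difference-aware fusion (illustrative only).

    rgb_feat, depth_feat: arrays of shape (channels, height, width).
    """
    # Explicit cross-modal difference map: where the two modalities disagree.
    diff = rgb_feat - depth_feat

    # Frequency-domain emphasis: transform the difference map and gate each
    # frequency component by its normalized spectral magnitude, so dominant
    # frequency cues in the difference are amplified relative to weak ones.
    spec = np.fft.fft2(diff, axes=(-2, -1))
    mag = np.abs(spec)
    gate = mag / (mag.max() + 1e-8)  # illustrative stand-in for a learned gate
    enhanced = np.real(np.fft.ifft2(spec * gate, axes=(-2, -1)))

    # Aggregate: complementary cues from both modalities plus the
    # frequency-emphasized difference information.
    return rgb_feat + depth_feat + enhanced

# Example usage with random stand-in feature maps.
rgb = np.random.rand(8, 16, 16)
depth = np.random.rand(8, 16, 16)
fused = difference_aware_fusion(rgb, depth)
```

In the paper the gating would be learned and dynamic rather than a fixed magnitude normalization, but the sketch shows the structural point: the difference signal is treated as a first-class input to fusion instead of being averaged away.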

Original language: English
Pages (from-to): 7424-7434
Number of pages: 11
Journal: IEEE Transactions on Industrial Informatics
Volume: 21
Issue number: 10
DOIs
State: Published - 2025

Keywords

  • Cross-modal difference
  • RGB-D fusion
  • indoor scene
  • multimodal semantic segmentation
