
多模态文本视觉大模型机器人地形感知算法研究

Translated title of the contribution: Research on multimodal text-visual large model for robotic terrain perception algorithm
  • Hao Sun
  • Tao Xie
  • Long He
  • Wenzhong Guo
  • Yongfang Yu
  • Qijun Wu
  • Jianwei Wang
  • Hui Dong
  • Fuzhou University
  • Ltd.
  • School of Mechatronics Engineering, Harbin Institute of Technology

Research output: Contribution to journal, Article, peer-review

Abstract

A terrain segmentation algorithm based on the fusion of information from multimodal text-visual large models was proposed to enhance the intelligent perception capability of robots in dynamic and complex environments. The algorithm integrated simple linear iterative clustering (SLIC) for image data preprocessing, contrastive language-image pre-training (CLIP) and the segment anything model (SAM) for mask generation, and the Dice coefficient for post-processing. First, the original input image was preprocessed using SLIC to obtain image segmentation blocks, and the quality of subsequent masks was improved by adding prompt points, which significantly enhanced terrain classification accuracy. Next, the CLIP large model, pre-trained on text-image data, was used to match the input visual images with predefined terrain text information, leveraging its interpretability and zero-shot learning capabilities to generate sets of terrain prompt points. The SAM large model then generated masks with semantic labels based on these sets, and the Dice coefficient was applied in post-processing to select usable masks. Using the Cityscapes dataset as a terrain segmentation sample, the superiority of the proposed algorithm over mainstream segmentation algorithms under both supervised and unsupervised learning frameworks was validated. Without the need for labeled data, the algorithm achieved a mask generation rate of 76.58% and an intersection over union (IoU) of 90.14%. For the terrain perception task of a quadruped robot, a U-net encoder/decoder network quantification validation module was added. Using the generated masks as a dataset, a lightweight terrain segmentation model was constructed and deployed on the edge computing device of the quadruped robot, and terrain segmentation experiments were conducted in a real-world environment.
The experimental results demonstrated that the two mask optimization methods proposed in this paper improved the model's mean IoU (MIoU) by 2.36% and 2.56%, respectively, with the final lightweight model achieving an MIoU of 96.34%, indicating reliable terrain segmentation accuracy. The segmentation algorithm guided the robot to navigate quickly and safely from the starting point to the target location while avoiding non-geometric obstacles such as grasslands.
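
The abstract names the Dice coefficient as the mask-selection criterion and IoU as the evaluation metric, but does not state what each mask is compared against or which cut-off is used. The following is a minimal sketch of both measures in plain Python, assuming each candidate mask is scored against a reference region (for example, the SLIC block that supplied its prompt point) and kept when its Dice score reaches a threshold. The function names, the `reference` argument, and the `threshold=0.5` default are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# A binary mask is a grid of 0/1 pixels; both masks must have the same shape.
Mask = Sequence[Sequence[int]]


def _overlap_counts(a: Mask, b: Mask) -> Tuple[int, int, int]:
    """Return (|A ∩ B|, |A|, |B|) counted over foreground pixels."""
    inter = size_a = size_b = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa:
                size_a += 1
            if pb:
                size_b += 1
            if pa and pb:
                inter += 1
    return inter, size_a, size_b


def dice(a: Mask, b: Mask) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter, size_a, size_b = _overlap_counts(a, b)
    total = size_a + size_b
    # Two empty masks are treated as identical (a convention, not from the paper).
    return 1.0 if total == 0 else 2.0 * inter / total


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union: |A ∩ B| / |A ∪ B|."""
    inter, size_a, size_b = _overlap_counts(a, b)
    union = size_a + size_b - inter
    return 1.0 if union == 0 else inter / union


def select_masks(candidates: List[Mask], reference: Mask,
                 threshold: float = 0.5) -> List[Mask]:
    """Keep candidates whose Dice score against the reference meets the threshold.

    Both the reference region and the threshold value are assumptions made
    for illustration; the abstract does not specify either.
    """
    return [m for m in candidates if dice(m, reference) >= threshold]
```

For a four-pixel reference region and a candidate that covers three of those pixels and nothing else, the Dice coefficient is 2·3 / (4 + 3) = 6/7 ≈ 0.857 and the IoU is 3/4 = 0.75, so the candidate passes a 0.5 threshold; a candidate with no overlap scores 0 on both measures and is discarded.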

Original language: Chinese (Traditional)
Pages (from-to): 558-567
Number of pages: 10
Journal: Journal of Graphics
Volume: 46
Issue number: 3
State: Published - 30 Jun 2025
