VLONS: A Vision-and-Language On-Device Navigation System With Multimodal Fusion and Modular Framework

  • Jianyang Shi
  • Haijun Zhang*
  • Yuhan Zhang
  • Tin Lun Lam
  • Lin Zhang
  • Hu Huang*
  • Yuan Gao
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen
  • Sybmbiosis-X Technology Inc.
  • QianHai International Talent Centre
  • The Chinese University of Hong Kong, Shenzhen
  • Shenzhen Institute of Artificial Intelligence and Robotics for Society
  • University of Science and Technology of China

Research output: Contribution to journal › Article › peer-review

Abstract

A fundamental challenge in embodied intelligence is developing vision-and-language navigation (VLN) systems that operate efficiently on resource-constrained devices while possessing plug-and-play generalization capabilities. Current approaches often struggle with edge-side inference, platform compatibility, and task extensibility. Inspired by the modular architecture of human cognition, we present VLONS (Vision-and-Language On-device Navigation System), which introduces a novel framework for edge-optimized multimodal fusion and inference that achieves precise semantic alignment while eliminating computational redundancies. Our key innovation lies in decoupling the navigation pipeline into hierarchical computational modules that can be fully deployed on-device while enabling the system to maintain state-of-the-art performance. This architecture demonstrates remarkable zero-shot generalization to both dynamic and region navigation tasks. Through extensive experimentation, we show that VLONS achieves an 11.5× improvement in retrieval efficiency and a 1.46× boost in navigation accuracy compared to existing methods on the edge side, while matching the accuracy of cloud-based deployment methods. Compared to the previous best cloud-based method, inference latency improves by a factor of 5.93. The system's generalizability has been further validated through practical deployment across multiple robotic platforms, making it the first edge-side VLN system to support heterogeneous robots and multi-task navigation.

Original language: English
Pages (from-to): 1171-1182
Number of pages: 12
Journal: IEEE Transactions on Consumer Electronics
Volume: 72
Issue number: 1
State: Published - 1 Feb 2026
Externally published: Yes

Keywords

  • Vision-and-language navigation
  • edge intelligence
  • embodied AI
  • heterogeneous robots
