Abstract
A fundamental challenge in embodied intelligence is developing vision-and-language navigation (VLN) systems that operate efficiently on resource-constrained devices while possessing plug-and-play generalization capabilities. Current approaches often struggle with edge-side inference, platform compatibility, and task extensibility. Inspired by the modular architecture of human cognition, we present VLONS (Vision-and-Language On-device Navigation System), which introduces a novel framework for edge-optimized multimodal fusion and inference that achieves precise semantic alignment while eliminating computational redundancies. Our key innovation lies in decoupling the navigation pipeline into hierarchical computational modules that can be fully deployed on-device while maintaining state-of-the-art performance. This architecture demonstrates remarkable zero-shot generalization to both dynamic and region navigation tasks. Through extensive experimentation, we show that VLONS achieves an 11.5× improvement in retrieval efficiency and a 1.46× boost in navigation accuracy compared to existing methods on the edge side, while matching the accuracy of cloud-based deployment methods. Compared to the previous best cloud-based method, inference latency improves by a factor of 5.93. The system's generalizability has been further validated through practical deployment across multiple robotic platforms, making it the first edge-side VLN system to support heterogeneous robots and multi-task navigation.
| Original language | English |
|---|---|
| Pages (from-to) | 1171-1182 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Consumer Electronics |
| Volume | 72 |
| Issue number | 1 |
| State | Published - 1 Feb 2026 |
| Externally published | Yes |
Keywords
- Vision-and-language navigation
- edge intelligence
- embodied AI
- heterogeneous robots
Fingerprint
Dive into the research topics of 'VLONS: A Vision-and-Language On-Device Navigation System With Multimodal Fusion and Modular Framework'. Together they form a unique fingerprint.