Vision-Language Navigation With Beam-Constrained Global Normalization

  • Liang Xie
  • Meishan Zhang
  • You Li
  • Wei Qin
  • Ye Yan
  • Erwei Yin*
  • *Corresponding author for this work
  • Academy of Military Medical Science China
  • Tianjin Artificial Intelligence Innovation Center
  • Institute of Intelligent Ocean Engineering
  • National Key Laboratory of Human Factors Engineering

Research output: Contribution to journal › Article › peer-review

Abstract

Vision-language navigation (VLN) is a challenging task in which an agent must navigate a realistic environment by following natural language instructions. Sequence-to-sequence modeling is one of the most promising architectures for the task: it reaches the navigation goal through a sequence of movement actions, and this line of work has led to state-of-the-art performance. Recently, several studies have shown that beam-search decoding during inference can yield promising performance, as it ranks multiple candidate trajectories by scoring each trajectory as a whole. However, the trajectory-level score can be seriously biased during ranking. The score is a simple average of the individual unit scores of the target-sequence actions, and these unit scores may not be comparable across different trajectories because they are computed by a local discriminant classifier. To address this problem, we propose a global normalization strategy that rescales the scores at the trajectory level. Concretely, we present two global score functions to rerank all candidates in the output beam, resulting in more comparable trajectory scores. In this way, the bias problem can be greatly alleviated. We conduct experiments on the benchmark Room-to-Room (R2R) dataset for VLN to verify our method, and the results show that the proposed global method is effective, providing significant performance improvements over the corresponding baselines. Our final model achieves competitive performance on the VLN leaderboard.
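
The contrast between locally averaged scores and beam-level normalization can be illustrated with a short sketch. The abstract does not define the two global score functions, so the code below is not the authors' method: it assumes, purely for illustration, that each candidate carries a list of unnormalized per-step scores, and that a global score is obtained by a softmax over the summed scores of the candidates in the beam. The function names and the example values are hypothetical.

```python
import math


def local_average_score(step_log_probs):
    """Baseline described in the abstract: the mean of per-step unit scores."""
    return sum(step_log_probs) / len(step_log_probs)


def beam_normalized_scores(beam_step_scores):
    """Hypothetical global score: softmax over summed per-step scores.

    Assumption (not from the paper): each inner list holds the unnormalized
    step scores of one candidate trajectory, and normalization runs over
    the candidates in the beam rather than over actions at each step.
    """
    totals = [sum(steps) for steps in beam_step_scores]
    shift = max(totals)  # subtract the maximum for numerical stability
    exps = [math.exp(total - shift) for total in totals]
    partition = sum(exps)
    return [value / partition for value in exps]


def rerank(candidates, scores):
    """Return the candidates ordered from highest to lowest score."""
    paired = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in paired]


# Illustrative beam of three candidate trajectories (made-up values).
names = ["A", "B", "C"]
beam = [[2.0, 1.0], [0.5, 0.5], [3.0, 1.5]]
global_scores = beam_normalized_scores(beam)
ranking = rerank(names, global_scores)
```

Because the softmax runs over the whole beam, the resulting scores sum to one and are directly comparable between candidates, which is the property the abstract attributes to trajectory-level normalization.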

Original language: English
Article number: 3183287
Pages (from-to): 1352-1363
Number of pages: 12
Journal: IEEE Transactions on Neural Networks and Learning Systems
Volume: 35
Issue number: 1
State: Published - 1 Jan 2024
Externally published: Yes

Keywords

  • Beam search
  • global normalization
  • sequence to sequence
  • vision-language navigation (VLN)
