
Benchmarking and Enhancing Geospatial Visual Reasoning Over Street Maps

  • Wenwen Pan
  • Haiting Zhou
  • Zhenwei Shao
  • Shuai Shao
  • Suguo Zhu
  • Min Tan*
  • Jun Yu
  • Zhou Yu*

*Corresponding author for this work

  • Hangzhou Dianzi University
  • Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Recent advances in large multimodal models (LMMs) have enabled substantial progress on various visual question answering (VQA) benchmarks, including the challenging text-centric ones that require understanding both the visual and textual content of images. Despite their prominence, existing text-centric VQA benchmarks either contain limited textual information or include few questions that require complex reasoning skills beyond basic OCR. To address this gap, we present SMVQA, a novel text-centric VQA benchmark based on street map images. SMVQA contains more than 10K real-world street map images from the open geospatial database OpenStreetMap (OSM). Each image in SMVQA is associated with detailed geospatial annotations, which allow up to 57.5K distinct QA pairs of five representative question types to be generated automatically. In addition to the standard test split, SMVQA introduces an extra test split to verify generalization to out-of-domain (OOD) images and novel reasoning skills. The evaluation of state-of-the-art open-source and commercial LMMs reflects the great challenge posed by SMVQA. The latest LMMs, such as GPT-4o, reach an accuracy of only 49.9%, leaving plenty of room for improvement. To further improve the latest LMMs' performance on SMVQA, we introduce an LMM-based agentic framework, LHR, which consists of localizing, highlighting, and reasoning stages. Specifically, LHR first prompts the LMM to localize the region of interest (RoI) relevant to the question, then highlights the RoI and performs chain-of-thought (CoT) reasoning for answer prediction. By integrating LHR with GPT-4o, we observe a significant improvement over the vanilla counterpart, showing the effectiveness of our framework.
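The three LHR stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the prompts, the RoI format, or the highlighting method, so everything below is an assumption. The `LMM` callable, the comma-separated box reply, the red-border highlight, the "Answer:" convention, and the `fake_lmm` stub are all hypothetical, and the image is modeled as a plain grid of RGB tuples to keep the example dependency-free.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # rows of RGB pixels (hypothetical stand-in for a map image)
Box = Tuple[int, int, int, int]    # (left, top, right, bottom), inclusive pixel coordinates

# Hypothetical model interface: takes an image and a text prompt, returns text.
LMM = Callable[[Image, str], str]


def localize(lmm: LMM, image: Image, question: str) -> Box:
    """Stage 1: ask the model for the region of interest (assumed reply: 'l,t,r,b')."""
    prompt = (
        "Return the region relevant to the question as 'left,top,right,bottom'.\n"
        f"Question: {question}"
    )
    left, top, right, bottom = (int(v) for v in lmm(image, prompt).split(","))
    height, width = len(image), len(image[0])
    # Clamp to the image bounds and order the corners so the box is always valid.
    xs = sorted(max(0, min(x, width - 1)) for x in (left, right))
    ys = sorted(max(0, min(y, height - 1)) for y in (top, bottom))
    return (xs[0], ys[0], xs[1], ys[1])


def highlight(image: Image, box: Box, color: Pixel = (255, 0, 0)) -> Image:
    """Stage 2: draw a one-pixel border around the RoI on a copy of the image."""
    out = [row[:] for row in image]
    left, top, right, bottom = box
    for x in range(left, right + 1):
        out[top][x] = color
        out[bottom][x] = color
    for y in range(top, bottom + 1):
        out[y][left] = color
        out[y][right] = color
    return out


def reason(lmm: LMM, image: Image, question: str) -> str:
    """Stage 3: chain-of-thought prompt; keep only the text after the last 'Answer:'."""
    prompt = (
        "The red box marks the relevant region. Think step by step, "
        "then give the final answer after 'Answer:'.\n"
        f"Question: {question}"
    )
    return lmm(image, prompt).rsplit("Answer:", 1)[-1].strip()


def lhr(lmm: LMM, image: Image, question: str) -> str:
    """Run localize -> highlight -> reason with the same model."""
    box = localize(lmm, image, question)
    return reason(lmm, highlight(image, box), question)


def fake_lmm(image: Image, prompt: str) -> str:
    """Stub model for demonstration only; a real LMM call would go here."""
    if "left,top,right,bottom" in prompt:
        return "1,1,2,2"
    return "The boxed label reads Main St. Answer: Main St"
```

Calling `lhr(fake_lmm, image, "Which street is highlighted?")` with a small white image runs all three stages against the stub. Swapping `fake_lmm` for a wrapper around a real model is the only change needed to try the idea, though the prompts and box format would have to match what that model actually produces.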

Original language: English
Article number: 3001412
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 63
DOIs
State: Published - 2025
Externally published: Yes

Keywords

  • Geospatial reasoning
  • street map
  • visual question answering (VQA)
