Abstract
Recent advances in large multimodal models (LMMs) have enabled substantial progress on various visual question answering (VQA) benchmarks, including the challenging text-centric ones that require a simultaneous understanding of both the visual and textual content in images. Despite their prominence, existing text-centric VQA benchmarks either contain limited textual information or include only a small number of questions requiring complex reasoning skills beyond basic OCR. To address this gap, we present SMVQA, a novel text-centric VQA benchmark based on street map images. SMVQA contains more than 10K real-world street map images from the open geospatial database OpenStreetMap (OSM). Each image in SMVQA is also associated with detailed geospatial annotations, enabling the automatic generation of up to 57.5K distinct QA pairs across five representative question types. In addition to the standard test split, SMVQA introduces an extra test split to verify generalization to out-of-domain (OOD) images and novel reasoning skills. An evaluation of state-of-the-art open-source and commercial LMMs reflects the great challenge posed by SMVQA. Even the latest LMMs, such as GPT-4o, achieve an accuracy of only 49.9%, showing plenty of room for improvement. To further improve the latest LMMs’ performance on SMVQA, we introduce an LMM-based agentic framework, LHR, which consists of localizing, highlighting, and reasoning stages. Specifically, LHR first prompts the LMM to localize the region of interest (RoI) relevant to the question, then highlights the RoI and performs chain-of-thought (CoT) reasoning for answer prediction. By integrating LHR with GPT-4o, we observe a significant improvement over the vanilla counterpart, showing the effectiveness of our framework.
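The three LHR stages described above can be illustrated with a minimal sketch. The abstract does not specify the prompts, the box format, or how the RoI is highlighted, so everything here is an assumption: the image is a toy grid of pixel values, the highlight is a drawn box border, and `localize` and `reason` are hypothetical callables standing in for LMM calls.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive; assumed format
Image = List[List[int]]          # toy grayscale image: rows of pixel values


def highlight(image: Image, box: Box, value: int = 255) -> Image:
    """Return a copy of the image with the border of `box` drawn in `value`."""
    left, top, right, bottom = box
    out = [row[:] for row in image]  # copy so the input image is not modified
    for x in range(left, right + 1):
        out[top][x] = value
        out[bottom][x] = value
    for y in range(top, bottom + 1):
        out[y][left] = value
        out[y][right] = value
    return out


def lhr_answer(
    image: Image,
    question: str,
    localize: Callable[[Image, str], Box],  # hypothetical LMM call: find the RoI
    reason: Callable[[Image, str], str],    # hypothetical LMM call: CoT answer
) -> str:
    """Localize the RoI, highlight it, then reason over the highlighted image."""
    box = localize(image, question)
    marked = highlight(image, box)
    # Illustrative prompt wording only; not taken from the paper.
    prompt = f"{question}\nFocus on the highlighted region and think step by step."
    return reason(marked, prompt)


# Usage with stub callables in place of a real LMM.
blank = [[0] * 5 for _ in range(5)]
answer = lhr_answer(
    blank,
    "Which street is north of the park?",
    localize=lambda img, q: (1, 1, 3, 3),
    reason=lambda img, p: "stub answer",
)
```

Passing the two model calls in as functions keeps the sketch independent of any particular LMM API; a real system would replace the stubs with calls to the chosen model.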
| Original language | English |
|---|---|
| Article number | 3001412 |
| Journal | IEEE Transactions on Geoscience and Remote Sensing |
| Volume | 63 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
Keywords
- Geospatial reasoning
- street map
- visual question answering (VQA)
Fingerprint
Dive into the research topics of 'Benchmarking and Enhancing Geospatial Visual Reasoning Over Street Maps'. Together they form a unique fingerprint.