
VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context

  • Yunxin Li*
  • Baotian Hu*
  • Haoyuan Shi
  • Wei Wang
  • Longyue Wang
  • Min Zhang
  • *Corresponding author for this work
  • School of Computer Science and Technology, Harbin Institute of Technology
  • Sun Yat-Sen University

Research output: Contribution to journal, Conference article, peer-review

Abstract

Large Multimodal Models (LMMs) have achieved impressive success in visual reasoning, particularly in visual mathematics. However, problem-solving capabilities in graph theory remain less explored for LMMs, despite being a crucial aspect of mathematical reasoning that requires accurate understanding of graphical structures and multi-step reasoning on visual graphs. To step forward in this direction, we are the first to design a benchmark named VisionGraph, used to explore the capabilities of advanced LMMs in solving multimodal graph theory problems. It encompasses eight complex graph problem tasks, from connectivity to shortest path problems. Subsequently, we present a Description-Program-Reasoning (DPR) chain to enhance the logical accuracy of reasoning processes through graphical structure description generation and algorithm-aware multi-step reasoning. Our extensive study shows that 1) GPT-4V outperforms Gemini Pro in multi-step graph reasoning; 2) All LMMs exhibit inferior perception accuracy for graphical structures, whether in zero/few-shot settings or with supervised fine-tuning (SFT), which further affects problem-solving performance; 3) DPR significantly improves the multi-step graph reasoning capabilities of LMMs and the GPT-4V (DPR) agent achieves SOTA performance.
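
To make the two endpoints of the task range concrete (connectivity and shortest path), the following is a minimal sketch of what such problems compute once a graph's structure has been written out as an edge list. It is an illustration only: the abstract does not specify the DPR prompts, the generated programs, or the benchmark's data format, so the edge-list representation, the function names (`build_adjacency`, `is_connected`, `shortest_path_length`), and the sample graph are all assumptions made here, and breadth-first search and Dijkstra's algorithm are used simply as standard textbook choices.

```python
import heapq
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v, weight) triples.

    The edge-list input format is an assumption for this sketch.
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj


def is_connected(adj, source, target):
    """Connectivity task: breadth-first search from source to target."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor, _ in adj.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def shortest_path_length(adj, source, target):
    """Shortest-path task: Dijkstra's algorithm (non-negative weights).

    Returns the total weight, or None if target is unreachable.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, w in adj.get(node, ()):
            nd = d + w
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return None


# Hypothetical graph: nodes 0-2 form one component, nodes 3-4 another.
edges = [(0, 1, 4), (1, 2, 1), (0, 2, 7), (3, 4, 2)]
adj = build_adjacency(edges)
print(is_connected(adj, 0, 2))          # True
print(is_connected(adj, 0, 4))          # False
print(shortest_path_length(adj, 0, 2))  # 5 (path 0 -> 1 -> 2)
```

Both functions depend entirely on the edge list being correct, which is consistent with the abstract's observation that errors in perceiving the graphical structure carry over into problem-solving performance.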

Original language: English
Pages (from-to): 27903-27919
Number of pages: 17
Journal: Proceedings of Machine Learning Research
Volume: 235
State: Published - 2024
Externally published: Yes
Event: 41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 2024 to 27 Jul 2024
