
面向生成式视觉感知的细粒度直接偏好对齐框架

Translated title of the contribution: Fine-Grained Direct Preference Alignment Framework for Generative Visual Perception
  • School of Mechatronics Engineering, Harbin Institute of Technology
  • Faculty of Computing, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Generative referring segmentation methods based on multimodal large language models (MLLMs) are limited by the supervised fine-tuning mechanism and have not explored in depth how to improve generation quality. As a result, they suffer from semantic localization bias and coarse mask boundaries in complex scenes. To address these issues, a fine-grained direct preference alignment framework for generative visual perception (FG-DPA) is proposed. The direct preference optimization (DPO) algorithm is transferred from text understanding to the pixel-level segmentation task: pairs of high-quality and low-quality masks are constructed as preference pairs to guide the model toward more accurate visual representations in the latent space. Two types of negative samples are produced by exploiting the interactive prompting of the Segment Anything Model (SAM). To address imprecise edges, adversarial point prompts are placed inside the ground-truth bounding box to generate low-quality masks with local omissions or overflows as negative examples. To address incorrect target localization, non-overlapping masks are randomly sampled from the background region to construct semantic-level negative examples. After training on both types of samples, the model produces accurate segmentation in conjunction with SAM. Experiments on multiple public datasets show that FG-DPA effectively suppresses localization hallucination and significantly improves the completeness and edge accuracy of the generated masks, validating its effectiveness in enhancing multimodal generative visual perception.
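
The two ingredients named in the abstract, a DPO objective over mask preference pairs and background negatives that do not overlap the target, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard DPO loss, and it assumes that the policy model and a frozen reference model each provide a scalar log-likelihood for the preferred and the rejected mask output; the abstract does not say how FG-DPA computes these. The function names, the `beta` value, and the box-based rejection sampling are hypothetical choices made for illustration.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are scalar log-likelihoods under the policy and a frozen
    reference model. How FG-DPA obtains them for masks is an assumption.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


def boxes_overlap(a, b):
    """True if axis-aligned boxes (x0, y0, x1, y1) share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_background_box(gt_box, img_w, img_h, box_w, box_h, rng, max_tries=50):
    """Rejection-sample a box that lies in the image and avoids gt_box.

    Hypothetical stand-in for the semantic-level negatives: such a box
    could serve as a prompt for a mask located on the background.
    Returns None if no valid box is found within max_tries.
    """
    for _ in range(max_tries):
        x0 = rng.randint(0, img_w - box_w)
        y0 = rng.randint(0, img_h - box_h)
        candidate = (x0, y0, x0 + box_w, y0 + box_h)
        if not boxes_overlap(candidate, gt_box):
            return candidate
    return None
```

When the policy and the reference model agree on both masks, the margin is zero and the loss equals log 2. The loss falls below that value as the policy assigns relatively more likelihood to the preferred mask than the reference model does.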

Original language: Chinese (Traditional)
Pages (from-to): 239-249
Number of pages: 11
Journal: Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence
Volume: 39
Issue number: 3
State: Published - 25 Mar 2026
Externally published: Yes
