Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation

  • Fenghe Tang
  • Bingkun Nian
  • Jianrui Ding
  • Wenxin Ma
  • Quan Quan
  • Chengqi Dong
  • Jie Yang
  • Wei Liu*
  • S. Kevin Zhou*
  • *Corresponding author for this work
  • University of Science and Technology of China
  • Robotics
  • Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology
  • Shanghai Jiao Tong University
  • State Grid Corporation of China

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LKLGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsampled skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis. Code is available at: https://github.com/FengheTan9/Mobile-U-ViT.
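
To give a rough sense of why a large-kernel CNN with an inverted bottleneck can be called parameter-efficient, the sketch below counts convolution weights for two generic layouts. It is not taken from the paper: the abstract does not specify the internal layout of ConvUtr, so the depthwise-plus-pointwise factorization, the function names, and the values chosen for channel width, kernel size, and expansion ratio are all illustrative assumptions.

```python
def dense_conv_params(channels: int, kernel: int) -> int:
    """Weights in a standard k x k convolution with equal in/out channels (no bias)."""
    return channels * channels * kernel * kernel


def depthwise_inverted_bottleneck_params(channels: int, kernel: int, expansion: int) -> int:
    """Weights in a depthwise k x k conv followed by 1x1 expand and 1x1 project convs (no bias)."""
    depthwise = channels * kernel * kernel        # one k x k filter per channel
    expand = channels * (channels * expansion)    # 1x1 conv: C -> C * expansion
    project = (channels * expansion) * channels   # 1x1 conv: C * expansion -> C
    return depthwise + expand + project


# Illustrative values only (assumptions, not the paper's configuration).
channels, kernel, expansion = 64, 7, 4

dense = dense_conv_params(channels, kernel)
factored = depthwise_inverted_bottleneck_params(channels, kernel, expansion)

print(f"dense 7x7 conv:                 {dense} weights")
print(f"depthwise 7x7 + inverted bottleneck: {factored} weights")
print(f"reduction factor:               {dense / factored:.2f}x")
```

Under these assumed values the factored layout needs several times fewer weights than a dense convolution with the same kernel size, because the large kernel is applied per channel and channel mixing is left to the 1x1 convolutions. The actual savings in Mobile U-ViT depend on design details given in the paper and the linked repository.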

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 3408-3417
Number of pages: 10
ISBN (Electronic): 9798400720352
State: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25

Keywords

  • large kernel convolutional neural network
  • light-weight network
  • medical image segmentation
  • vision transformer
