SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs

  • Ruibo Fan
  • Xiangrui Yu
  • Peijie Dong
  • Zeyu Li
  • Gu Gong
  • Qiang Wang*
  • Wei Wang
  • Xiaowen Chu*

*Corresponding author for this work

  • Hong Kong University of Science and Technology
  • Harbin Institute of Technology Shenzhen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities, but their immense scale poses significant challenges in terms of both memory and computational costs. While unstructured pruning offers promising solutions by introducing sparsity to reduce resource requirements, realizing its benefits in LLM inference remains elusive. This is primarily due to the storage overhead of indexing non-zero elements and the inefficiency of sparse matrix multiplication (SpMM) kernels at low sparsity levels (around 50%). In this paper, we present SpInfer, a high-performance framework tailored for sparsified LLM inference on GPUs. SpInfer introduces Tensor-Core-Aware Bitmap Encoding (TCA-BME), a novel sparse format that minimizes indexing overhead by leveraging efficient bitmap-based indexing, optimized for GPU Tensor Core architectures. Furthermore, SpInfer integrates an optimized SpMM kernel with Shared Memory Bitmap Decoding (SMBD) and asynchronous pipeline design to enhance computational efficiency. Experimental results show that SpInfer significantly outperforms state-of-the-art SpMM implementations (up to 2.14× and 2.27× over Flash-LLM and SparTA, respectively) across a range of sparsity levels (30% to 70%), with substantial improvements in both memory efficiency and end-to-end inference speed (up to 1.58×). SpInfer outperforms highly optimized cuBLAS at sparsity levels as low as 30%, marking the first effective translation of unstructured pruning’s theoretical advantages into practical performance gains for LLM inference.
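The indexing-overhead argument in the abstract can be made concrete with a small sketch. The code below is not TCA-BME and does not reproduce SpInfer's kernel; it is a generic bitmap encoding of a single tile, written only to show why one bit per element is cheaper than one index per non-zero at around 50% sparsity. The 8×8 tile size, the function names, and the 32-bit index used for comparison are all assumptions made for illustration.

```python
import numpy as np

TILE = 8  # hypothetical tile edge: 8 x 8 = 64 elements, so one 64-bit bitmap per tile


def encode_tile(tile):
    """Return (bitmap, values); bit i is set when flattened element i is non-zero."""
    flat = tile.reshape(-1)
    bitmap = 0
    for i, v in enumerate(flat):
        if v != 0:
            bitmap |= 1 << i
    values = flat[flat != 0].copy()  # non-zeros packed in row-major order
    return bitmap, values


def decode_tile(bitmap, values, dtype=np.float16):
    """Rebuild the dense tile from the bitmap and the packed non-zeros."""
    flat = np.zeros(TILE * TILE, dtype=dtype)
    for i in range(TILE * TILE):
        if (bitmap >> i) & 1:
            # The number of set bits below position i is the offset into `values`.
            offset = bin(bitmap & ((1 << i) - 1)).count("1")
            flat[i] = values[offset]
    return flat.reshape(TILE, TILE)


def bytes_per_element(sparsity, value_bytes=2, index_bytes=4):
    """Average storage per matrix element under three simplified layouts.

    Assumes FP16 values and, for the index-based layout, one 32-bit index
    per non-zero; row pointers and padding are ignored.
    """
    density = 1.0 - sparsity
    return {
        "dense": float(value_bytes),
        "index_per_nonzero": density * (value_bytes + index_bytes),
        "bitmap": 1.0 / 8.0 + density * value_bytes,
    }
```

Under these assumptions, `bytes_per_element(0.5)` gives 2.0 bytes per element for the dense layout, 3.0 for the index-per-non-zero layout, and 1.125 for the bitmap layout. In other words, at 50% sparsity an index-based format can take more space than the dense matrix it replaces, while a bitmap still saves memory. That is the regime the abstract describes as problematic for existing formats.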

Original language: English
Title of host publication: EuroSys 2025 - Proceedings of the 2025 20th European Conference on Computer Systems
Publisher: Association for Computing Machinery, Inc
Pages: 243-260
Number of pages: 18
ISBN (Electronic): 9798400711961
State: Published - 30 Mar 2025
Externally published: Yes
Event: 20th European Conference on Computer Systems, EuroSys 2025, co-located with the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025 - Rotterdam, Netherlands
Duration: 30 Mar 2025 to 3 Apr 2025

Publication series

Name: EuroSys 2025 - Proceedings of the 2025 20th European Conference on Computer Systems

Conference

Conference: 20th European Conference on Computer Systems, EuroSys 2025, co-located with the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025
Country/Territory: Netherlands
City: Rotterdam
Period: 30/03/25 to 3/04/25

Keywords

  • GPU
  • LLM Inference
  • SpMM
  • Sparse
  • Tensor Core
  • Unstructured Pruning
