QER: Quantized Low-Rank Error Reconstructor for LLM Low-Bitwidth Quantization

  • Shoukai Xu
  • Runhao Zeng
  • Zhiyang Zhang
  • Hao Huang
  • Qingfang Zheng*
  • Xiangyuan Lan
  • Yaowei Wang*
  • Mingkui Tan*

*Corresponding author for this work

  • South China University of Technology
  • Shenzhen MSU-BIT University
  • Pengcheng Laboratory

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Large Language Models (LLMs) have achieved remarkable success but face significant deployment challenges in cloud and edge environments due to their massive computational and storage requirements. Model quantization serves as a key solution to enhance the scalability and efficiency of LLMs within distributed cloud platforms. Existing Post-Training Quantization (PTQ) methods often exhibit suboptimal performance in low-bit settings. To further improve their precision, Quantization-Aware Training (QAT) combined with Low-Rank Adaptation (LoRA) has been explored for error correction. However, a critical issue is that the quantized base model and full-precision LoRA parameters suffer from precision mismatch, introducing additional errors during weight merging. To address these challenges, we propose a Quantized Low-rank Error Reconstructor (QER) for LLM low-bitwidth quantization. QER first enables lossless merging in low-bitwidth format by aligning the bitwidth of its low-rank parameters with the quantized base parameters, eliminating dequantization and requantization steps. Through this process, QER reconstructs original errors into two components: the quantization errors of QER parameters (i.e., quantized low-rank parameters) and potential overflow errors during low-bitwidth merging. These two errors are directly related to QER parameters, making them easier to optimize via gradient-based updates within an error-aware training framework. Requiring only 128 samples and 1 training epoch, QER demonstrates superior performance on LLaMA-1/2 families. In 4-bit quantization, compared to QLLM with error correction, QER reduces average perplexity by 13.8% (from 10.97 to 9.45) and improves average accuracy by 3.01 percentage points (from 51.84% to 54.85%) on LLaMA-1-7B. QER bridges the gap between quantization and low-rank adaptation, enabling efficient and accurate low-precision LLM deployment.
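The error decomposition described in the abstract can be illustrated with a small numeric sketch. This is not the paper's implementation: it assumes a single shared uniform scale, signed 4-bit integer codes, and a correction term quantized directly onto the base grid, and every name in it (`quantize`, `merge_low_bit`, `scale`, `w_q`, `delta`) is hypothetical. It only shows that, when the merge is done as a clamped integer addition, the gap between the ideal merged weight and the low-bit merged weight splits into a quantization error of the correction plus an overflow error from clamping.

```python
QMIN, QMAX = -8, 7  # signed 4-bit integer range (assumption for this sketch)


def quantize(x, scale):
    """Map a float onto the integer grid: round x/scale, then clamp to 4 bits.

    Python's round() uses round-half-to-even; no exact ties occur below.
    """
    return max(QMIN, min(QMAX, round(x / scale)))


def merge_low_bit(w_q, d_q):
    """Add two integer codes and clamp the sum back into the 4-bit range.

    Returns the merged code and the overflow (in integer units) lost to clamping.
    """
    raw = w_q + d_q
    merged = max(QMIN, min(QMAX, raw))
    return merged, raw - merged


scale = 0.1                          # hypothetical shared quantization step
w_q = [7, -3, 0, -8]                 # hypothetical 4-bit base weight codes
delta = [0.23, 0.11, -0.36, -0.14]   # hypothetical full-precision low-rank correction

# Quantize the correction onto the same grid as the base weights,
# so merging needs no dequantize/requantize round trip.
d_q = [quantize(d, scale) for d in delta]

merged, overflow = [], []
for w, d in zip(w_q, d_q):
    m, o = merge_low_bit(w, d)
    merged.append(m)
    overflow.append(o)

# Per-element error terms, in real-valued units.
quant_err = [d - scale * dq for d, dq in zip(delta, d_q)]
overflow_err = [scale * o for o in overflow]

# Gap between the ideal merged weight (scale * w_q + delta)
# and the low-bit merged weight (scale * merged).
total_err = [(scale * w + d) - scale * m for w, d, m in zip(w_q, delta, merged)]

for t, q, o in zip(total_err, quant_err, overflow_err):
    print(f"total={t:+.3f}  quantization={q:+.3f}  overflow={o:+.3f}")
```

In this toy setting the first and last elements sit at the edge of the 4-bit range, so their corrections are partly lost to clamping, while the middle two carry only rounding error. Both terms depend only on the quantized correction `d_q`, which is consistent with the abstract's remark that the two errors are directly tied to the low-rank parameters.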

Original language: English
Title of host publication: Proceedings - 2025 IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331566340
State: Published - 2025
Externally published: Yes
Event: 2025 IEEE 16th International Conference on Cloud Computing Technology and Science, IEEE CloudCom 2025 - Shenzhen, China
Duration: 14 Nov 2025 to 16 Nov 2025

Publication series

Name: Proceedings - 2025 IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2025

Conference

Conference: 2025 IEEE 16th International Conference on Cloud Computing Technology and Science, IEEE CloudCom 2025
Country/Territory: China
City: Shenzhen
Period: 14/11/25 to 16/11/25

Keywords

  • Large language model
  • efficient computing
  • error correction
  • quantization-aware training
