TY - GEN
T1 - QER
T2 - 2025 IEEE 16th International Conference on Cloud Computing Technology and Science, IEEE CloudCom 2025
AU - Xu, Shoukai
AU - Zeng, Runhao
AU - Zhang, Zhiyang
AU - Huang, Hao
AU - Zheng, Qingfang
AU - Lan, Xiangyuan
AU - Wang, Yaowei
AU - Tan, Mingkui
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Large Language Models (LLMs) have achieved remarkable success but face significant deployment challenges in cloud and edge environments due to their massive computational and storage requirements. Model quantization serves as a key solution to enhance the scalability and efficiency of LLMs within distributed cloud platforms. Existing Post-Training Quantization (PTQ) methods often exhibit suboptimal performance in low-bit settings. To further improve their precision, Quantization-Aware Training (QAT) combined with Low-Rank Adaptation (LoRA) has been explored for error correction. However, a critical issue is that the quantized base model and full-precision LoRA parameters suffer from precision mismatch, introducing additional errors during weight merging. To address these challenges, we propose a Quantized Low-rank Error Reconstructor (QER) for LLM low-bitwidth quantization. QER first enables lossless merging in low-bitwidth format by aligning the bitwidth of its low-rank parameters with the quantized base parameters, eliminating dequantization and requantization steps. Through this process, QER reconstructs original errors into two components: the quantization errors of QER parameters (i.e., quantized low-rank parameters) and potential overflow errors during low-bitwidth merging. These two errors are directly related to QER parameters, making them easier to optimize via gradient-based updates within an error-aware training framework. Requiring only 128 samples and 1 training epoch, QER demonstrates superior performance on LLaMA-1/2 families. In 4-bit quantization, compared to QLLM with error correction, QER reduces average perplexity by 13.8% (from 10.97 to 9.45) and improves average accuracy by 3.01 percentage points (from 51.84% to 54.85%) on LLaMA-1-7B. QER bridges the gap between quantization and low-rank adaptation, enabling efficient and accurate low-precision LLM deployment.
AB - Large Language Models (LLMs) have achieved remarkable success but face significant deployment challenges in cloud and edge environments due to their massive computational and storage requirements. Model quantization serves as a key solution to enhance the scalability and efficiency of LLMs within distributed cloud platforms. Existing Post-Training Quantization (PTQ) methods often exhibit suboptimal performance in low-bit settings. To further improve their precision, Quantization-Aware Training (QAT) combined with Low-Rank Adaptation (LoRA) has been explored for error correction. However, a critical issue is that the quantized base model and full-precision LoRA parameters suffer from precision mismatch, introducing additional errors during weight merging. To address these challenges, we propose a Quantized Low-rank Error Reconstructor (QER) for LLM low-bitwidth quantization. QER first enables lossless merging in low-bitwidth format by aligning the bitwidth of its low-rank parameters with the quantized base parameters, eliminating dequantization and requantization steps. Through this process, QER reconstructs original errors into two components: the quantization errors of QER parameters (i.e., quantized low-rank parameters) and potential overflow errors during low-bitwidth merging. These two errors are directly related to QER parameters, making them easier to optimize via gradient-based updates within an error-aware training framework. Requiring only 128 samples and 1 training epoch, QER demonstrates superior performance on LLaMA-1/2 families. In 4-bit quantization, compared to QLLM with error correction, QER reduces average perplexity by 13.8% (from 10.97 to 9.45) and improves average accuracy by 3.01 percentage points (from 51.84% to 54.85%) on LLaMA-1-7B. QER bridges the gap between quantization and low-rank adaptation, enabling efficient and accurate low-precision LLM deployment.
KW - Large language model
KW - efficient computing
KW - error correction
KW - quantization-aware training
UR - https://www.scopus.com/pages/publications/105034699466
U2 - 10.1109/CloudCom67567.2025.11331377
DO - 10.1109/CloudCom67567.2025.11331377
M3 - Conference contribution
AN - SCOPUS:105034699466
T3 - Proceedings - 2025 IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2025
BT - Proceedings - 2025 IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 November 2025 through 16 November 2025
ER -