TY - CONF
T1 - BirdMoE
T2 - 62nd ACM/IEEE Design Automation Conference, DAC 2025
AU - Wu, Donglei
AU - Yang, Weihao
AU - Zou, Xiangyu
AU - Jia, Jinda
AU - Tao, Dingwen
AU - Xia, Wen
AU - Tian, Zhihong
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Mixture-of-Experts (MoE) model parallelism is prevalent in training Large Language Models (e.g., ChatGPT). However, the intensive all-to-all collective communication of the MoE layer's intermediate computing results substantially degrades MoE training efficiency. In this paper, we propose BirdMoE, a novel load-aware communication compression technique with Bi-random quantization for MoE training, built on two core modules. Specifically, BirdMoE employs a lightweight Random Quantization (RQ) with an expectation-invariance property to efficiently map the floating-point intermediate computing results into integers while maintaining the MoE training quality. Additionally, BirdMoE utilizes a Mixed Precision (MP) strategy to dynamically balance the communication loads among expert nodes, significantly improving all-to-all communication efficiency for the MoE training system. Experiments on four typical MoE training tasks demonstrate that BirdMoE achieves 4.06× to 10.44× higher total communication compression ratios and 1.18× to 5.27× training speedup compared with state-of-the-art compression techniques while maintaining the MoE training quality.
KW - Mixture-of-Experts training
KW - communication compression
KW - load balance
UR - https://www.scopus.com/pages/publications/105017720816
U2 - 10.1109/DAC63849.2025.11132853
DO - 10.1109/DAC63849.2025.11132853
M3 - Conference contribution
AN - SCOPUS:105017720816
T3 - Proceedings - Design Automation Conference
BT - 2025 62nd ACM/IEEE Design Automation Conference, DAC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 June 2025 through 25 June 2025
ER -