TY - GEN
T1 - Accelerating Distributed K-FAC with smart parallelism of computing and communication tasks
AU - Shi, Shaohuai
AU - Zhang, Lin
AU - Li, Bo
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/7
Y1 - 2021/7
N2 - Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training process of deep models. However, SGD only utilizes the first-order gradient in model parameter updates, which may take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, in which the Kronecker-Factored Approximate Curvature (KFAC) emerges as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed KFAC (D-KFAC), it incurs extensive computation as well as introduces extra communications during each iteration. In this work, we propose D-KFAC (SPD-KFAC) with smart parallelism of computing and communication tasks to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factors computation and communication with dynamic tensor fusion, and 3) we develop a load balancing placement for inverting multiple matrices on GPU clusters. We conduct realworld experiments on a 64-GPU cluster with 100Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme can achieve 10%-35% improvement over state-of-the-art algorithms.
AB - Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training process of deep models. However, SGD only utilizes the first-order gradient in model parameter updates, which may take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, in which the Kronecker-Factored Approximate Curvature (KFAC) emerges as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed KFAC (D-KFAC), it incurs extensive computation as well as introduces extra communications during each iteration. In this work, we propose D-KFAC (SPD-KFAC) with smart parallelism of computing and communication tasks to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factors computation and communication with dynamic tensor fusion, and 3) we develop a load balancing placement for inverting multiple matrices on GPU clusters. We conduct realworld experiments on a 64-GPU cluster with 100Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme can achieve 10%-35% improvement over state-of-the-art algorithms.
KW - Distributed Deep Learning
KW - K-FAC
KW - Load-Balancing
KW - Second-Order
KW - Smart Parallelism
UR - https://www.scopus.com/pages/publications/85117128717
U2 - 10.1109/ICDCS51616.2021.00059
DO - 10.1109/ICDCS51616.2021.00059
M3 - 会议稿件
AN - SCOPUS:85117128717
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 550
EP - 560
BT - Proceedings - 2021 IEEE 41st International Conference on Distributed Computing Systems, ICDCS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 41st IEEE International Conference on Distributed Computing Systems, ICDCS 2021
Y2 - 7 July 2021 through 10 July 2021
ER -