TY - GEN
T1 - Exploiting simultaneous communications to accelerate data parallel distributed deep learning
AU - Shi, Shaohuai
AU - Chu, Xiaowen
AU - Li, Bo
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/5/10
Y1 - 2021/5/10
N2 - Synchronous stochastic gradient descent (S-SGD) with data parallelism is widely used for training deep learning (DL) models in distributed systems. A pipelined schedule of the computing and communication tasks of a DL training job is an effective scheme to hide some of the communication costs. In such pipelined S-SGD, tensor fusion (i.e., merging several consecutive layers' gradients into a single communication) is a key ingredient for improving communication efficiency. However, existing tensor fusion techniques schedule the communication tasks sequentially, overlooking the fact that these tasks are mutually independent. In this paper, we expand the scheduling design space by exploiting simultaneous All-Reduce communications. Through theoretical analysis and experiments, we show that simultaneous All-Reduce communications can effectively improve the communication efficiency of small tensors. We formulate an optimization problem of minimizing the training iteration time, in which both tensor fusion and simultaneous communications are allowed. We develop an efficient optimal scheduling solution and implement the distributed training algorithm, ASC-WFBP, with Horovod and PyTorch. We conduct real-world experiments on an 8-node GPU cluster of 32 GPUs with 10 Gbps Ethernet. Experimental results on four modern DNNs show that ASC-WFBP achieves speedups of about 1.09×–2.48× over the baseline without tensor fusion, and 1.15×–1.35× over the state-of-the-art tensor fusion solution.
KW - Communication-Efficient
KW - Distributed Deep Learning
KW - Simultaneous Communications
UR - https://www.scopus.com/pages/publications/85111915788
U2 - 10.1109/INFOCOM42981.2021.9488803
DO - 10.1109/INFOCOM42981.2021.9488803
M3 - Conference contribution
AN - SCOPUS:85111915788
T3 - Proceedings - IEEE INFOCOM
BT - INFOCOM 2021 - IEEE Conference on Computer Communications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 40th IEEE Conference on Computer Communications, INFOCOM 2021
Y2 - 10 May 2021 through 13 May 2021
ER -