
Ares: Fair and Efficient Scheduling of Deep Learning Jobs with Elastic Fair Queuing

  • Yifei Liu
  • Chen Chen*
  • Qiang Wang
  • Yu Feng
  • Weihao Cui
  • Quan Chen
  • Minyi Guo
  • *Corresponding author for this work
  • Shanghai Jiao Tong University
  • Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Schedulers play a vital role in GPU clusters that serve model training jobs, and an ideal scheduler should perform well in both fairness and efficiency. However, existing cluster schedulers mostly focus on only one of these aspects and fall short in the other. To solve this problem, given that the resource demand of a model training job can often be approximated a priori, our insight is to preferentially serve jobs that would complete earlier under instantaneous fair sharing, which emulates shortest-job-first while avoiding starvation. Following that insight, in this article we propose Ares, an efficient and fair scheduler for deep learning jobs. Ares borrows the concept of virtual finish time from network fair queuing, which supports efficient estimation of the job completion order at job arrival time. Jobs with earlier virtual finish times are allowed to use more resources than they originally demand in order to complete quickly, so that those resources are also released sooner and no job is actually hurt. We keep the global batch size unchanged to preserve model accuracy, and we also ensure that the degradation in resource utilization caused by scaling out is bounded. We call this scheduling method elastic fair queuing; it provides a theoretical fairness guarantee. We evaluate the performance of Ares with both testbed experiments and large-scale simulations. The results show that Ares can reduce the average job completion time by over 20% and the number of unfairly served jobs by over 40%.
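
The core insight can be illustrated with a toy calculation. The sketch below is not the Ares implementation: it is a minimal fluid model under strong assumptions (all jobs arrive at time 0, capacity is split equally among unfinished jobs, and a job can use the whole cluster with perfect scaling). The function names and the example job sizes are hypothetical. It first computes when each job would finish under instantaneous fair sharing, then serves jobs one at a time in that order and compares the finish times.

```python
def fair_share_finish_times(demands, capacity):
    """Finish time of each job when capacity is split equally among
    all unfinished jobs (fluid model, every job arrives at t = 0).

    demands:  dict mapping job name -> total work (e.g. GPU-hours)
    capacity: total work the cluster completes per unit time
    """
    remaining = dict(demands)
    finish = {}
    now = 0.0
    while remaining:
        share = capacity / len(remaining)          # equal instantaneous share
        name = min(remaining, key=remaining.get)   # next job to complete
        dt = remaining[name] / share               # time until it completes
        now += dt
        for other in remaining:                    # everyone progresses by share * dt
            remaining[other] -= share * dt
        finish[name] = now
        del remaining[name]
    return finish


def serve_in_fair_finish_order(demands, capacity):
    """Finish times when jobs run one at a time on the full cluster,
    ordered by their fair-sharing finish time (assumes perfect scaling)."""
    fair = fair_share_finish_times(demands, capacity)
    finish = {}
    now = 0.0
    for name in sorted(fair, key=fair.get):
        now += demands[name] / capacity
        finish[name] = now
    return finish


# Hypothetical workload: three jobs on a cluster of capacity 4.
jobs = {"A": 4, "B": 8, "C": 12}
fair = fair_share_finish_times(jobs, capacity=4)
prioritized = serve_in_fair_finish_order(jobs, capacity=4)
for name in jobs:
    print(name, round(fair[name], 6), round(prioritized[name], 6))
```

In this toy workload, every job finishes no later when served in fair-finish order than it would under fair sharing itself, and the average completion time drops. That is the intuition behind letting early-finishing jobs temporarily use extra resources. Real training jobs do not scale perfectly, which is why the abstract adds a bound on the utilization loss from scaling out.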

Original language: English
Article number: 143
Journal: ACM Transactions on Architecture and Code Optimization
Volume: 22
Issue number: 4
State: Published - 16 Dec 2025
Externally published: Yes

Keywords

  • DNN training
  • GPU cluster
  • Job scheduling
  • scaling
