TY - GEN
T1 - COS
T2 - 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024
AU - Lin, Changyao
AU - Liu, Jie
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Multi-tenant inference, as a prevalent inference paradigm nowadays, requires deploying multiple deep learning models on the hardware platform to concurrently process inference tasks. Modern platforms are typically equipped with various heterogeneous processors, such as CPU-GPU platforms. To reduce resource contention and improve Quality of Service (QoS) in the multi-tenant scenario, existing work has studied cross-processor inference at the model and layer levels. However, such coarse-grained scheduling cannot flexibly account for subtle resource fluctuations, which may lead to task blockages and incur significant processor switching overheads. Such work also usually requires extensive modification and retraining of the models. Therefore, we propose COS, a finer-grained operator-level cross-processor scheduling framework, which can more precisely divide the computational workloads and switching overheads among the tenants, without modifying or retraining the models. We introduce a novel intermediate representation to abstract and simplify the scheduling problem, and propose an efficient two-phase search algorithm. COS is automated and easy to scale. Through experiments on various heterogeneous hardware platforms and models, we demonstrate that COS is more flexible and effective than layer-level scheduling, and achieves higher throughput than single-processor processing in the multi-tenant scenario. Furthermore, COS is an offline optimization method, and its overhead is highly acceptable.
AB - Multi-tenant inference, as a prevalent inference paradigm nowadays, requires deploying multiple deep learning models on the hardware platform to concurrently process inference tasks. Modern platforms are typically equipped with various heterogeneous processors, such as CPU-GPU platforms. To reduce resource contention and improve Quality of Service (QoS) in the multi-tenant scenario, existing work has studied cross-processor inference at the model and layer levels. However, such coarse-grained scheduling cannot flexibly account for subtle resource fluctuations, which may lead to task blockages and incur significant processor switching overheads. Such work also usually requires extensive modification and retraining of the models. Therefore, we propose COS, a finer-grained operator-level cross-processor scheduling framework, which can more precisely divide the computational workloads and switching overheads among the tenants, without modifying or retraining the models. We introduce a novel intermediate representation to abstract and simplify the scheduling problem, and propose an efficient two-phase search algorithm. COS is automated and easy to scale. Through experiments on various heterogeneous hardware platforms and models, we demonstrate that COS is more flexible and effective than layer-level scheduling, and achieves higher throughput than single-processor processing in the multi-tenant scenario. Furthermore, COS is an offline optimization method, and its overhead is highly acceptable.
KW - cross-processor parallelism
KW - multi-tenant deep learning
KW - operator scheduling
KW - reinforcement learning
UR - https://www.scopus.com/pages/publications/85206393584
U2 - 10.1109/IWQoS61813.2024.10682900
DO - 10.1109/IWQoS61813.2024.10682900
M3 - Conference contribution
AN - SCOPUS:85206393584
T3 - IEEE International Workshop on Quality of Service, IWQoS
BT - 2024 IEEE/ACM 32nd International Symposium on Quality of Service, IWQoS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 June 2024 through 21 June 2024
ER -