TY - GEN
T1 - FlexSP:(1 + ß)-Choice based Flexible Stream Partitioning for Stateful Operators
AU - Chen, Siyuan
AU - Zuo, Decheng
AU - Zhang, Zhan
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/8/12
Y1 - 2024/8/12
N2 - Stream partitioning has a fundamental effect on the efficiency of data parallelism in distributed stream processing systems. The skewed and time-varying nature of streaming data makes it challenging to achieve load balancing while minimizing the cost incurred. The requirement of adaptivity further complicates the problem, that the partitioning mechanism should not only be able to capture the changes in workload and adjust itself but also be quite tolerant of the changes because of the lag in statistics. Existing approaches use one-choice or multiple-choice schemes to make tradeoffs between these factors, but they tend to treat them as opposites, which either fails to achieve good load balancing or incurs excessive cost. There is a lack of deeper insight into how partitioning behavior affects load balancing, cost, and adaptivity when the keys have a different number of candidate choices. Also, it requires a flexible partitioning scheme to allow different trade-offs among the three factors for various scenarios. To address the issues mentioned above, we propose a novel (1 + ß)-choice based stream partitioning scheme, which splits ß ?(0, 1) part of keys selectively to have multiple candidate choices. We demonstrate that just splitting ß part of the keys is sufficient to achieve optimal load balancing while minimizing cost and providing the required adaptivity to workload variance. In a new perspective, we analyze the relationship among load balancing, cost, and adaptivity, as the theoretical foundation of getting proper ß and the corresponding number of choices. Experiments on Apache Flink demonstrate that our approach outperforms state-of-the-art solutions, improving throughput by 7.3 × and reducing latency by 85%.
AB - Stream partitioning has a fundamental effect on the efficiency of data parallelism in distributed stream processing systems. The skewed and time-varying nature of streaming data makes it challenging to achieve load balancing while minimizing the cost incurred. The requirement of adaptivity further complicates the problem, that the partitioning mechanism should not only be able to capture the changes in workload and adjust itself but also be quite tolerant of the changes because of the lag in statistics. Existing approaches use one-choice or multiple-choice schemes to make tradeoffs between these factors, but they tend to treat them as opposites, which either fails to achieve good load balancing or incurs excessive cost. There is a lack of deeper insight into how partitioning behavior affects load balancing, cost, and adaptivity when the keys have a different number of candidate choices. Also, it requires a flexible partitioning scheme to allow different trade-offs among the three factors for various scenarios. To address the issues mentioned above, we propose a novel (1 + ß)-choice based stream partitioning scheme, which splits ß ?(0, 1) part of keys selectively to have multiple candidate choices. We demonstrate that just splitting ß part of the keys is sufficient to achieve optimal load balancing while minimizing cost and providing the required adaptivity to workload variance. In a new perspective, we analyze the relationship among load balancing, cost, and adaptivity, as the theoretical foundation of getting proper ß and the corresponding number of choices. Experiments on Apache Flink demonstrate that our approach outperforms state-of-the-art solutions, improving throughput by 7.3 × and reducing latency by 85%.
KW - distributed stream processing
KW - key splitting
KW - stateful operation
KW - stream partitioning
UR - https://www.scopus.com/pages/publications/85202445164
U2 - 10.1145/3673038.3673157
DO - 10.1145/3673038.3673157
M3 - 会议稿件
AN - SCOPUS:85202445164
T3 - ACM International Conference Proceeding Series
SP - 732
EP - 741
BT - 53rd International Conference on Parallel Processing, ICPP 2024 - Main Conference Proceedings
PB - Association for Computing Machinery
T2 - 53rd International Conference on Parallel Processing, ICPP 2024
Y2 - 12 August 2024 through 15 August 2024
ER -