TY - GEN
T1 - Greedy Transfer Planning Search for Improving Repair Throughput of RDP-like Coded Storage Clusters
AU - Chen, Juehao
AU - Li, Shiyi
AU - Xia, Wen
AU - Zhang, Shuaipeng
AU - Lin, Qicong
AU - Hu, Haojun
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - With the increasing scale of data and user demands for low latency, the development of large-scale clusters has become a trend. To ensure high availability of data in such clusters, XOR-based erasure codes are widely used for fault tolerance due to their low storage and computational overhead. Meanwhile, as cluster sizes range from hundreds to thousands of nodes, the probability of multiple node failures is not negligible. Such failures can lead to serious consequences, such as data loss, so the affected data should be recovered as soon as possible. However, codes such as RDP and EVENODD can easily cause network congestion when recovering from concurrent failures, making quick recovery challenging. To address this issue, we propose a novel network transfer plan search algorithm, Greedy Row-Diagonal Parity Search, or GRS for short. GRS optimally allocates the network traffic generated during the repair process by greedily utilizing idle bandwidth and leveraging the commutative property of XOR operations, ensuring a more even distribution of traffic across the cluster network, which improves the repair throughput. We build a prototype in a distributed erasure-coded cluster and conduct an experimental evaluation. The experimental results indicate that, compared to existing repair optimization methods, GRS improves repair throughput by 230%-880%.
AB - With the increasing scale of data and user demands for low latency, the development of large-scale clusters has become a trend. To ensure high availability of data in such clusters, XOR-based erasure codes are widely used for fault tolerance due to their low storage and computational overhead. Meanwhile, as cluster sizes range from hundreds to thousands of nodes, the probability of multiple node failures is not negligible. Such failures can lead to serious consequences, such as data loss, so the affected data should be recovered as soon as possible. However, codes such as RDP and EVENODD can easily cause network congestion when recovering from concurrent failures, making quick recovery challenging. To address this issue, we propose a novel network transfer plan search algorithm, Greedy Row-Diagonal Parity Search, or GRS for short. GRS optimally allocates the network traffic generated during the repair process by greedily utilizing idle bandwidth and leveraging the commutative property of XOR operations, ensuring a more even distribution of traffic across the cluster network, which improves the repair throughput. We build a prototype in a distributed erasure-coded cluster and conduct an experimental evaluation. The experimental results indicate that, compared to existing repair optimization methods, GRS improves repair throughput by 230%-880%.
KW - Availability
KW - Distributed systems
KW - Erasure code
KW - Network transfer
KW - Storage cluster
UR - https://www.scopus.com/pages/publications/85206355487
U2 - 10.1109/IWQoS61813.2024.10682840
DO - 10.1109/IWQoS61813.2024.10682840
M3 - Conference contribution
AN - SCOPUS:85206355487
T3 - IEEE International Workshop on Quality of Service, IWQoS
BT - 2024 IEEE/ACM 32nd International Symposium on Quality of Service, IWQoS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024
Y2 - 19 June 2024 through 21 June 2024
ER -