Abstract
In high-performance, ultra-scale cloud computing, heterogeneous clusters that combine x86 and ARM platforms have become increasingly common as a way to boost performance and energy efficiency. Ensuring high availability in these environments is crucial for meeting service-level agreements. However, DRAM failures, a primary cause of server downtime, pose significant challenges to reliability, availability, and serviceability. This paper provides an in-depth analysis of memory failure characteristics across architectures in large-scale heterogeneous clusters. We introduce MemSeer, an AIOps-integrated tool that uses a multi-grained memory failure prediction approach for x86/ARM heterogeneous clusters. Compared to state-of-the-art methods, MemSeer improves the F1-score by 17.3% and increases recall by an average of 27% across different lead times. These advancements show great promise in reducing memory failures in cluster environments: in real-world implementations, VM interruptions decrease by up to 42.7%, and by 24.2% on average.
| Original language | English |
|---|---|
| Title of host publication | 2025 62nd ACM/IEEE Design Automation Conference, DAC 2025 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| ISBN (Electronic) | 9798331503048 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
| Event | 62nd ACM/IEEE Design Automation Conference, DAC 2025 - San Francisco, United States. Duration: 22 Jun 2025 → 25 Jun 2025 |
Publication series
| Name | Proceedings - Design Automation Conference |
|---|---|
| ISSN (Print) | 0738-100X |
Conference
| Conference | 62nd ACM/IEEE Design Automation Conference, DAC 2025 |
|---|---|
| Country/Territory | United States |
| City | San Francisco |
| Period | 22/06/25 → 25/06/25 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 7 Affordable and Clean Energy
Keywords
- DRAM Reliability
- Heterogeneous Clusters
- Memory Failures
- Prediction
Fingerprint
Dive into the research topics of 'MemSeer: Leverage Memory Failure Distinctions and Multi-Grained Prediction in Ultra-Scale Heterogeneous X86/ARM Clusters'. Together they form a unique fingerprint.