MemSeer: Leverage Memory Failure Distinctions and Multi-Grained Prediction in Ultra-Scale Heterogeneous X86/ARM Clusters

  • Yunfei Gu
  • Yixuan Liu
  • Xinyuan Wu
  • Bo Shao
  • Chentao Wu*
  • Shiyi Li
  • Jieru Zhao
  • Jie Li
  • Minyi Guo
  • Kunlin Yang
  • Wengui Zhang
  • Feilong Lin
  • *Corresponding author for this work
  • Shanghai Jiao Tong University
  • Harbin Institute of Technology Shenzhen
  • Huawei Technologies Co., Ltd.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In high-performance ultra-scale cloud computing, heterogeneous clusters consisting of x86 and ARM architecture platforms have become increasingly common to boost performance and energy efficiency. Ensuring high availability in these environments is crucial for meeting service-level agreements. However, DRAM failures, a primary cause of server downtime, present significant challenges to reliability, availability, and serviceability. This paper provides an in-depth analysis of memory failure characteristics across cross-architecture platforms in large-scale heterogeneous clusters. We introduce MemSeer, an AIOps-integrated tool that utilizes a multi-grained memory failure prediction approach for x86/ARM heterogeneous clusters. MemSeer improves the F1-score by 17.3% and increases recall by an average of 27% across different lead times compared to state-of-the-art methods. These advancements show great promise in reducing memory failures in cluster environments: in real-world implementations, VM interruptions decrease by up to 42.7%, and by 24.2% on average.

Original language: English
Title of host publication: 2025 62nd ACM/IEEE Design Automation Conference, DAC 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331503048
State: Published - 2025
Externally published: Yes
Event: 62nd ACM/IEEE Design Automation Conference, DAC 2025 - San Francisco, United States
Duration: 22 Jun 2025 → 25 Jun 2025

Publication series

Name: Proceedings - Design Automation Conference
ISSN (Print): 0738-100X

Conference

Conference: 62nd ACM/IEEE Design Automation Conference, DAC 2025
Country/Territory: United States
City: San Francisco
Period: 22/06/25 → 25/06/25

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 7 - Affordable and Clean Energy

Keywords

  • DRAM Reliability
  • Heterogeneous Clusters
  • Memory Failures
  • Prediction
