Abstract
Efficient cache management plays a vital role in in-memory data-parallel systems, such as Spark, Tez, Storm and HANA. Recent research, notably on the Least Reference Count (LRC) and Most Reference Distance (MRD) policies, has shown that dependency-aware cache management practices that consider the application's directed acyclic graph (DAG) perform well in Spark. However, these practices ignore further relationships between RDDs and cache some redundant RDDs that share the same child RDDs, which degrades memory performance. Hence, in memory-constrained situations, systems may encounter a performance bottleneck due to frequent data block replacement. In addition, the prefetch mechanisms in some cache management policies, such as MRD, are hard to trigger. In this paper, we propose a new cache management method called RDE (Redundant Data Eviction) that fully utilizes applications' DAG information to optimize cache management. By considering both the RDDs' dependencies and the reference sequence, we effectively evict RDDs with redundant features and free up memory for incoming data blocks. Experiments show that RDE improves performance by an average of 55% compared to LRU and by up to 48% and 20% compared to LRC and MRD, respectively. RDE is also less sensitive to memory bottlenecks, which means better availability in memory-constrained environments.
| Original language | English |
|---|---|
| Pages (from-to) | 727-741 |
| Number of pages | 15 |
| Journal | Computers, Materials and Continua |
| Volume | 68 |
| Issue number | 1 |
| DOIs | |
| State | Published - 22 Mar 2021 |
| Externally published | Yes |
Keywords
- Dependency-aware
- cache management
- in-memory computing
- Spark
Fingerprint
Dive into the research topics of 'Improving Cache Management with Redundant RDDs Eviction in Spark'. Together they form a unique fingerprint.