
OMNIKV: DYNAMIC CONTEXT SELECTION FOR EFFICIENT LONG-CONTEXT LLMS

  • Jitai Hao
  • Yuke Zhu
  • Tian Wang
  • Jun Yu
  • Xin Xin
  • Bo Zheng
  • Zhaochun Ren
  • Sheng Guo*

*Corresponding author for this work

  • Shandong University
  • Ant Group
  • Leiden University

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

During long-context inference with Large Language Models (LLMs), a substantial portion of GPU memory is allocated to the KV cache, and this memory usage increases as the sequence length grows. To reduce the GPU memory footprint associated with the KV cache, some previous studies have discarded less important tokens, relying on the sparsity observed in attention scores in long-context scenarios. However, we argue that attention scores cannot indicate the future importance of tokens in subsequent generation iterations, because they are calculated from the current hidden states. Therefore, we propose OmniKV, a token-dropping-free and training-free inference method that achieves a 1.68x speedup without any loss in performance. It is also well suited to offloading: when combined with offloading, it reduces KV cache memory usage by up to 75%. The core insight of OmniKV is that, within a single generation iteration, the important tokens identified in consecutive layers are highly similar. Extensive experiments demonstrate that OmniKV achieves state-of-the-art performance across multiple benchmarks, with particular advantages in chain-of-thought scenarios. OmniKV extends the maximum context length supported by a single A100 for Llama-3-8B from 128K to 450K. Our code is available at https://github.com/antgroup/OmniKV.git.
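
The core insight can be illustrated with a minimal sketch. It assumes that the important tokens are chosen as the top-k attention scores of the current query at one layer, and that the next layer then attends only over those positions while the full KV cache is kept. The function names, the toy keys, values and queries, and the choice of k = 3 are all hypothetical illustrations and are not taken from the OmniKV code base.

```python
import math

def attention_scores(query, keys):
    """Softmax-normalised dot-product scores of one query against all cached keys."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(query, keys, values, selected):
    """Attend only over the selected positions; no token is removed from the cache."""
    sub_keys = [keys[i] for i in selected]
    sub_values = [values[i] for i in selected]
    weights = attention_scores(query, sub_keys)
    dim = len(sub_values[0])
    return [sum(w * v[d] for w, v in zip(weights, sub_values)) for d in range(dim)]

def overlap_ratio(a, b):
    """Fraction of the indices in a that also appear in b."""
    return len(set(a) & set(b)) / len(a)

# Toy cache (assumed data): 6 tokens with 2-dimensional keys for two consecutive layers.
keys_layer_a = [[4.0, 0.0], [0.1, 0.2], [3.5, 0.5], [0.0, 0.1], [0.2, 0.0], [3.0, 1.0]]
keys_layer_b = [[3.8, 0.2], [0.0, 0.3], [3.6, 0.1], [0.1, 0.0], [0.3, 0.1], [2.9, 0.8]]
values_layer_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0], [2.0, 0.0]]
query_a = [1.0, 0.0]
query_b = [1.0, 0.1]

# Select important tokens once, at the first layer.
selected = top_k_indices(attention_scores(query_a, keys_layer_a), k=3)

# What the second layer would have selected on its own, for comparison only.
reference = top_k_indices(attention_scores(query_b, keys_layer_b), k=3)

# Reuse the first layer's selection in the second layer instead of recomputing it.
output_b = sparse_attention(query_b, keys_layer_b, values_layer_b, selected)

print("selected at layer a:", selected)
print("overlap with layer b:", overlap_ratio(selected, reference))
```

In this toy setup the two layers pick the same three positions, so reusing the selection changes nothing about which tokens the second layer attends to. Because the selection is recomputed for each generation step rather than used to evict tokens, a token that is unimportant now remains available later.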

Original language: English
Title of host publication: 13th International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 85881-85902
Number of pages: 22
ISBN (Electronic): 9798331320850
State: Published - 2025
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: 24 Apr 2025 → 28 Apr 2025

Publication series

Name: 13th International Conference on Learning Representations, ICLR 2025

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
City: Singapore
Period: 24/04/25 → 28/04/25
