LAVa is a KV cache compression method that improves KV cache eviction performance based on theoretical analysis. It supports both dynamic head budget allocation and dynamic layer budget allocation.
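To make the idea concrete, here is a minimal Python sketch of dynamic budget allocation followed by per-layer eviction. The proportional split and the score inputs are illustrative assumptions, not LAVa's actual criterion; see the paper for the real derivation.

```python
def allocate_budgets(layer_scores, total_budget):
    """Split a total KV cache budget across layers in proportion to
    per-layer importance scores (hypothetical scores; LAVa derives
    its allocation criterion analytically in the paper)."""
    total = sum(layer_scores)
    shares = [s / total for s in layer_scores]
    budgets = [int(share * total_budget) for share in shares]
    # Hand out the rounding remainder to the highest-scoring layers first.
    remainder = total_budget - sum(budgets)
    order = sorted(range(len(shares)), key=lambda i: -shares[i])
    for i in order[:remainder]:
        budgets[i] += 1
    return budgets

def evict(position_scores, budget):
    """Keep the `budget` highest-scoring KV positions in a layer;
    return the kept indices in sequence order."""
    order = sorted(range(len(position_scores)),
                   key=lambda i: -position_scores[i])
    return sorted(order[:budget])
```

For example, `allocate_budgets([1, 2, 1], 8)` gives the middle layer half of the budget, and `evict` then drops the lowest-scoring cache positions within each layer's budget.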
## Requirements

```
transformers==4.41.1
flash-attn==2.4.0
datasets
tiktoken
jieba
rouge_score
```
## Installation

```shell
git clone https://github.com/MGDDestiny/Lava/
cd Lava
make i
```
## Inference

```shell
python inference.py -m /path/of/mistral_or_qwen/model
```

## Evaluation

LongBench:

```shell
cd ./experiments/LongBench
bash eval_longbench_lava.sh
```

Needle-in-a-Haystack:

```shell
cd ./experiments/needle_in_haystack
bash eval_needle_lava.sh
```

RULER:

```shell
cd ./experiments/ruler
bash eval_ruler_lava.sh
```

InfiniteBench:

```shell
cd ./experiments/InfiniteBench
bash src/eval_infinite_lava.sh
```

## Citation

If you found our work valuable, please cite:
```bibtex
@misc{shen2025lavalayerwisekvcache,
      title={LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation},
      author={Yiqun Shen and Song Yuan and Zhengze Zhang and Xiaoliang Wang and Daxin Jiang and Nguyen Cam-Tu},
      year={2025},
      eprint={2509.09754},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.09754},
}
```
## Acknowledgements

We are grateful to AdaKV, SnapKV, and PyramidKV for their open-source code, which greatly facilitated this project.