BitStack breaks down large language models into tiny little blocks, which can be sorted and stacked universally, achieving megabyte-level memory-performance tradeoffs while maintaining or surpassing the performance of practical compression methods like GPTQ and AWQ. Check out our paper for more details!
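For a quick intuition before diving into the paper, here is a minimal sketch of the core idea as we understand it: a weight matrix is decomposed iteratively into residual blocks, each holding a 1-bit sign matrix plus a rank-k approximation of the magnitudes, so blocks can be stored, sorted, and stacked incrementally. The function and variable names below are illustrative only, not the repository's actual API, and details such as activation-aware scaling are omitted.

```python
import torch

def decompose_weight(W: torch.Tensor, niter: int = 16, k: int = 16):
    """Illustrative sketch (not BitStack's actual API): iteratively split W into
    residual blocks, each a 1-bit sign matrix plus a rank-k factorization of the
    magnitudes. Stacking more blocks reconstructs W more accurately."""
    blocks, residual = [], W.clone()
    for _ in range(niter):
        sign = torch.sign(residual)                          # 1 bit per element
        U, S, Vh = torch.linalg.svd(residual.abs(), full_matrices=False)
        u = U[:, :k] * S[:k]                                 # keep top-k singular vectors
        v = Vh[:k, :]
        block = sign * (u @ v)                               # this block's contribution
        blocks.append((sign.to(torch.int8), u.half(), v.half()))
        residual = residual - block                          # decompose what is left
    return blocks

def reconstruct(blocks, num_blocks: int):
    """Sum the first `num_blocks` blocks: more blocks -> more memory, higher fidelity."""
    return sum(s.float() * (u.float() @ v.float()) for s, u, v in blocks[:num_blocks])
```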
- [2025-01-22] Our BitStack paper has been accepted to ICLR 2025!
- [2025-01-08] Added support for Mistral and Qwen models!
- [2024-11-06] We've released Triton kernels optimized for fused inference with BitStack models! These kernels deliver an impressive 3x to 10x speedup over the original implementation. Just set the `--fused_level` flag to get started! For more details, check out the speedup visualization here.
- [2024-11-01] Try out this Colab demo and play with BitStack models across various memory budgets using an intuitive slider built with Gradio!
- [2024-11-01] Check out our paper on arXiv!
- [2024-10-31] ✨ Pre-decomposed models are now available on Hugging Face 🤗!
- [2024-10-31] Code release! We have some awesome inference kernels for BitStack models coming soon, stay tuned!
```shell
conda create -yn bitstack python=3.10
conda activate bitstack
pip install -e .
```
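As a quick sanity check (assuming the editable install above succeeded and PyTorch is available in the environment), you can verify that the package imports and that CUDA is visible:

```python
# Quick environment check; `bitstack` is the package installed via `pip install -e .`
import torch
import bitstack

print(torch.__version__, torch.cuda.is_available())
```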
To decompose a model, run this script or the following command:
```shell
# Flag reference:
#   --niter: number of decomposition iterations (decrease for a shorter runtime)
#   --k: number of singular vectors kept
#   --nsamples: number of calibration samples
#   --score_importance: run the sorting process
#   --generate_compression_configs: generate compression configs
python -m bitstack.main \
    --model_name_or_path meta-llama/Meta-Llama-3.1-8B \
    --niter 16 \
    --k 16 \
    --nsamples 256 \
    --output_dir outputs \
    --do_save \
    --score_importance \
    --generate_compression_configs
```
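For a rough sense of how `--niter` and `--k` translate into checkpoint size, the back-of-the-envelope estimate below assumes each residual block stores a 1-bit sign matrix plus `k` fp16 singular-vector pairs; this is a simplification for intuition, not the exact on-disk format:

```python
def approx_block_megabytes(rows: int, cols: int, k: int = 16) -> float:
    """Rough estimate (assumption, not the exact checkpoint format): one residual
    block = 1-bit sign matrix + k fp16 left/right singular vectors."""
    sign_bytes = rows * cols / 8            # 1 bit per weight
    vector_bytes = 2 * k * (rows + cols)    # fp16 = 2 bytes per value
    return (sign_bytes + vector_bytes) / 2**20

# e.g. a 4096x4096 projection (as in Llama-3.1-8B) with k=16 is about 2.25 MB per block,
# so each extra block per weight matrix costs on the order of megabytes.
print(f"{approx_block_megabytes(4096, 4096, k=16):.2f} MB")
```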
To evaluate the decomposed model, run this script or the following command:
```shell
# Flag reference:
#   --max_memory_MB: maximum memory available for the model, in MB
#   --load_bitstack: load the decomposed model
#   --do_eval: perplexity evaluation
#   --lm_eval: zero-shot evaluation
python -m bitstack.main \
    --model_name_or_path /YOUR/CHECKPOINT/PATH \
    --k 16 \
    --max_memory_MB 5541 \
    --load_bitstack \
    --do_eval \
    --lm_eval \
    --output_dir outputs
```
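If you want to derive `--max_memory_MB` from the memory actually free on your GPU, a small helper like the one below can compute a budget. This is our own sketch, not part of BitStack; it assumes a CUDA device and uses PyTorch's `torch.cuda.mem_get_info`, optionally leaving headroom for activations and the KV cache:

```python
import torch

def free_gpu_memory_mb(headroom_mb: int = 1024, device: int = 0) -> int:
    """Free GPU memory in MB minus a safety headroom (sketch, not a BitStack API)."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return max(int(free_bytes / 2**20) - headroom_mb, 0)

print(free_gpu_memory_mb())  # pass this value to --max_memory_MB
```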
We provide pre-decomposed models and compression configs. Currently, the following models are available, with more to come; stay tuned!
| Model | Download |
|---|---|
| Llama-2 | 🤗 7B / 🤗 13B / 🤗 70B |
| Llama-3 | 🤗 8B / 🤗 70B |
| Llama-3.1 | 🤗 8B / 🤗 70B |
| Llama-3.1-Instruct | 🤗 8B / 🤗 70B |
| Llama-3.2 | 🤗 1B / 🤗 3B |
| Mistral-7B-v0.3 | 🤗 7B |
| Qwen-2.5 | 🤗 0.5B / 🤗 1.5B / 🤗 3B / 🤗 7B / 🤗 14B / 🤗 32B / 🤗 72B |
You can download them via the following commands:
```shell
# (Optional) enable hf_transfer for faster downloads
# pip install hf_transfer
# export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download BitStack/BitStack-Llama-3.1-8B --local-dir ./models/BitStack-Llama-3.1-8B
```
Or just download the compression config for a model you have already decomposed:
```shell
huggingface-cli download BitStack/BitStack-Llama-3.1-8B --local-dir /YOUR/CHECKPOINT/PATH/ --include "compression_config.json"
```
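If you prefer to script downloads from Python, the same can be done with `huggingface_hub.snapshot_download` (a standard Hugging Face Hub API, shown here as an optional alternative to the CLI):

```python
from huggingface_hub import snapshot_download

# Full pre-decomposed checkpoint
snapshot_download(repo_id="BitStack/BitStack-Llama-3.1-8B",
                  local_dir="./models/BitStack-Llama-3.1-8B")

# Only the compression config for an already-decomposed local checkpoint
snapshot_download(repo_id="BitStack/BitStack-Llama-3.1-8B",
                  local_dir="/YOUR/CHECKPOINT/PATH/",
                  allow_patterns=["compression_config.json"])
```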
If you find BitStack useful, please cite our paper:

```bibtex
@misc{wang2025bitstackanysizecompressionlarge,
  title={BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments},
  author={Xinghao Wang and Pengyu Wang and Bo Wang and Dong Zhang and Yunhua Zhou and Xipeng Qiu},
  year={2025},
  eprint={2410.23918},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.23918},
}
```
