# SpargeAttn
**Repository Path**: trackcc/SpargeAttn
## Basic Information
- **Project Name**: SpargeAttn
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-16
- **Last Updated**: 2025-12-16
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# SpargeAttention
## Recommended API
We highly recommend using the `spas_sage2_attn_meansim_topk_cuda` and `block_sparse_sage2_attn_cuda` APIs. They are plug-and-play and customizable:
### Plug-and-Play API
```python
from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda
attn_output = spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=0.5, is_causal=False)
```
You can adjust `topk` to balance between attention accuracy (higher `topk` is more accurate) and sparsity (lower `topk` is more sparse).
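As a rough illustration of this trade-off, the sketch below compares the sparse output against dense attention at a few `topk` values. The tensor shapes, dtype, and error metric are illustrative assumptions, not requirements of the API; only the SDPA-style `(batch, heads, seq_len, head_dim)` layout from the usage section below is assumed.
```python
import torch
import torch.nn.functional as F
from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda

# Illustrative shapes only: batch=1, heads=8, seq_len=4096, head_dim=64.
q, k, v = (torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda") for _ in range(3))

ref = F.scaled_dot_product_attention(q, k, v, is_causal=False)  # dense reference
for topk in (0.3, 0.5, 0.7):
    out = spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=topk, is_causal=False)
    # Higher topk keeps more blocks, so the error relative to dense attention shrinks,
    # at the cost of less sparsity (and therefore less speedup).
    err = (out - ref).norm() / ref.norm()
    print(f"topk={topk}: relative L2 error {err.item():.4f}")
```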
### Customize your Block-Sparse Mask
```python
from spas_sage_attn import block_sparse_sage2_attn_cuda
attn_output = block_sparse_sage2_attn_cuda(q, k, v, mask_id=None)
```
This API **supports computing attention with an arbitrary block-sparse mask per attention head**. Specifically, the per-head attention mask `mask_id` has shape `(batch_size, num_heads, seq_len // 128, seq_len // 64)`; the block size is currently 128x64.
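A minimal sketch of building such a per-head mask follows. It assumes `mask_id` is a 0/1 tensor in which nonzero entries mark the 128x64 blocks to compute; the exact dtype and convention expected by the kernel should be checked against the API itself.
```python
import torch
from spas_sage_attn import block_sparse_sage2_attn_cuda

batch, heads, seq_len, head_dim = 1, 8, 4096, 64  # illustrative shapes
q, k, v = (torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device="cuda")
           for _ in range(3))

# One mask entry per (128-row, 64-column) tile of the attention matrix, per head.
mask_id = torch.zeros(batch, heads, seq_len // 128, seq_len // 64,
                      dtype=torch.int32, device="cuda")
mask_id[..., ::2] = 1  # e.g., keep every other column block (assumed: 1 = compute)

attn_output = block_sparse_sage2_attn_cuda(q, k, v, mask_id=mask_id)
```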
---
The official implementation of [SpargeAttn](https://arxiv.org/abs/2502.18137), a universal, training-free sparse attention method that accelerates language, image, and video models.

**SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference**
## Project Updates
- **Please use the `spas_sage2_attn_meansim_topk_cuda` and `block_sparse_sage2_attn_cuda` APIs.**
- [2025-07]: Release a Triton Kernel example.
- [2025-06]: SpargeAttn based on [SageAttention2++](https://arxiv.org/abs/2505.21136) is released.
- [2025-05]: Add a **very simple usage without tuning or calibration**: `o = spas_sage2_attn_meansim_topk_cuda(q, k, v)`.
- [2025-05]: 🎉SpargeAttn and [SageAttention2](https://github.com/thu-ml/SageAttention) are accepted by ICML 2025!
- [2025-03]: Support high acceleration on more GPUs, e.g., H100.
## Installation
### Base environment
- `python>=3.9`, `torch>=2.3.0`
- `CUDA`: `>=12.8` for Blackwell, `>=12.4` for fp8 support on Ada, `>=12.3` for fp8 support on Hopper, `>=12.0` for Ampere (a quick environment check is sketched below)
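The snippet below is one way to sanity-check this environment; the compute-capability-to-architecture mapping in the comment is standard CUDA information, not something exposed by this package.
```python
import torch

print("torch:", torch.__version__)          # needs >= 2.3.0
print("CUDA runtime:", torch.version.cuda)  # compare against the versions listed above
major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")
# 8.0/8.6 = Ampere, 8.9 = Ada, 9.0 = Hopper, 10.x/12.x = Blackwell (standard CUDA mapping)
```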
### Install Package
```bash
pip install ninja # for parallel compilation
python setup.py install # or pip install -e .
```
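After installation, a quick import check (using the API names from this README) confirms the compiled extension loads:
```python
# Should run without errors if the CUDA extension compiled correctly.
from spas_sage_attn import (
    spas_sage2_attn_meansim_topk_cuda,
    block_sparse_sage2_attn_cuda,
)
print("SpargeAttn kernels imported successfully")
```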
## Available API
- `spas_sage2_attn_meansim_topk_cuda`: SpargeAttn based on [SageAttention2](https://github.com/thu-ml/SageAttention) that **we recommend using.**
- `spas_sage2_attn_meansim_cuda`: SpargeAttn based on [SageAttention2](https://github.com/thu-ml/SageAttention) that we do not recommend.
- `spas_sage_attn_meansim_topk_cuda`: SpargeAttn based on [SageAttention](https://github.com/thu-ml/SageAttention) that **we recommend using.**
- `spas_sage_attn_meansim_cuda`: SpargeAttn based on [SageAttention](https://github.com/thu-ml/SageAttention) that we do not recommend.
## Usage
### Plug-and-Play Usage
Just replace the `torch.nn.functional.scaled_dot_product_attention` API with `spas_sage2_attn_meansim_topk_cuda`:
```diff
from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda
- attn_output = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=False) # is_causal can be True
+ attn_output = spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=0.5, is_causal=False) # is_causal can be True
```
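For orientation, here is a minimal toy attention module with that one-line swap applied. The module, its dimensions, and the fp16/CUDA choices are illustrative assumptions; only the `spas_sage2_attn_meansim_topk_cuda` call itself is the API documented above.
```python
import torch
from torch import nn
from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda

class ToySparseAttention(nn.Module):
    """Illustrative attention block whose SDPA call is swapped for SpargeAttn."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):  # x: (batch, seq_len, dim)
        b, n, _ = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.heads, self.head_dim)
        # Rearrange to SDPA's (batch, heads, seq_len, head_dim) layout.
        q, k, v = (t.contiguous() for t in qkv.permute(2, 0, 3, 1, 4).unbind(0))
        # Previously: torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=False)
        out = spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=0.5, is_causal=False)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))

block = ToySparseAttention().half().cuda()
y = block(torch.randn(1, 4096, 512, dtype=torch.float16, device="cuda"))
print(y.shape)  # (1, 4096, 512), same as with dense SDPA
```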
## Citation
```bibtex
@inproceedings{zhang2025spargeattn,
  title={Spargeattn: Accurate sparse attention accelerating any model inference},
  author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}

@inproceedings{zhang2025sageattention,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

@inproceedings{zhang2024sageattention2,
  title={Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},
  author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}
```