This repo contains only the AttnGate weights for the Qwen2.5-14B-Instruct model.

SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, the gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
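To make the mechanism concrete, here is a minimal plain-PyTorch sketch of the two ideas above: 2D max-pooling a dense attention map into the block-level distillation target, and binarizing a gate's soft scores with TopK or a threshold. This is an illustration only, not the repo's optimized implementation; all names, shapes, and the block size are assumptions.

```python
# Illustrative sketch only -- not the official SeerAttention implementation.
# Tensor shapes and names (attn_map, gate_scores, block_size) are assumptions.
import torch
import torch.nn.functional as F

def pooled_attention_target(attn_map: torch.Tensor, block_size: int) -> torch.Tensor:
    """2D max-pool a dense attention map [heads, q_len, kv_len] into
    block-level scores: the pattern the AttnGates are distilled to mimic."""
    return F.max_pool2d(attn_map, kernel_size=block_size, stride=block_size)

def topk_block_mask(gate_scores: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k highest-scoring KV blocks per query block.
    gate_scores: [heads, num_q_blocks, num_kv_blocks] soft gate outputs."""
    idx = gate_scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(gate_scores, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask  # True marks blocks the block-sparse attention kernel computes

def threshold_block_mask(gate_scores: torch.Tensor, tau: float) -> torch.Tensor:
    """Alternative binarization: keep every block whose score exceeds tau."""
    return gate_scores > tau
```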

## Original GitHub Repo

https://github.com/microsoft/SeerAttention
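The gate weights themselves can be fetched with the standard `huggingface_hub` client; wiring them into the base model follows the instructions in the GitHub repo above. The snippet below covers only the download step.

```python
# Download the AttnGate weights from this repo. Integrating them with the
# base model is described in the SeerAttention GitHub repo linked above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="SeerAttention/SeerAttention-Qwen2.5-14B-AttnGates")
print(f"AttnGate weights downloaded to: {local_dir}")
```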

## Evaluation Results

### PG19 PPL

| Density | 8192 tokens (PPL) | 16384 tokens (PPL) | 32768 tokens (PPL) |
|---------|-------------------|--------------------|--------------------|
| 0.10    | 8.62              | 8.23               | 8.17               |
| 0.20    | 8.32              | 8.08               | 8.06               |
| 0.30    | 8.23              | 8.02               | 8.03               |
| 0.40    | 8.19              | 8.00               | 8.01               |
| 0.50    | 8.17              | 7.99               | 8.00               |
| 1.00    | 8.16              | 7.99               | 8.00               |

### LongBench

| Dataset              | 0-4k (Dense / Sparse) | 4-8k (Dense / Sparse) | 8k+ (Dense / Sparse) |
|----------------------|-----------------------|-----------------------|----------------------|
| qasper               | 47.23 / 48.05         | 37.51 / 37.20         | 35.26 / 36.49        |
| multifieldqa_en      | 56.40 / 56.10         | 47.13 / 47.36         | 48.64 / 50.36        |
| lcc                  | 62.32 / 63.25         | 67.48 / 66.58         | 61.47 / 63.53        |
| gov_report           | 34.26 / 34.30         | 34.06 / 33.70         | 33.02 / 32.52        |
| 2wikimqa             | 51.29 / 52.13         | 48.03 / 47.78         | 31.68 / 30.90        |
| multi_news           | 26.46 / 26.21         | 23.71 / 23.55         | 22.42 / 22.58        |
| samsum               | 42.97 / 42.95         | 41.08 / 40.23         | 44.88 / 44.62        |
| passage_count        | 20.00 / 19.00         | 7.00 / 6.00           | 8.00 / 8.00          |
| repobench-p          | 64.17 / 63.63         | 64.87 / 64.61         | 57.85 / 58.60        |
| trec                 | 60.00 / 60.00         | 75.00 / 74.00         | 71.00 / 71.00        |
| hotpotqa             | 58.57 / 57.16         | 56.87 / 55.91         | 56.18 / 56.99        |
| triviaqa             | 87.63 / 87.35         | 88.38 / 90.00         | 88.49 / 90.15        |
| passage_retrieval_en | 99.00 / 99.00         | 100.0 / 100.0         | 100.0 / 100.0        |
| Averaged score       | 54.64 / 54.55         | 53.16 / 52.84         | 50.68 / 51.21        |
| Averaged density     | 0.841                 | 0.624                 | 0.379                |
## Model Tree

SeerAttention/SeerAttention-Qwen2.5-14B-AttnGates is an adapter for the base model Qwen/Qwen2.5-14B.