# SeerAttention-Llama-3.1-8B

This is a reproduction of the SeerAttention paper. The model adds learnable AttnGate modules on top of meta-llama/Meta-Llama-3.1-8B-Instruct. The AttnGates accelerate long-context attention inference by enabling block-level sparsity. During training, the gates are optimized via self-distillation while the original model weights stay frozen: they learn to mimic the 2D-maxpooled outputs of the attention maps. At inference time, the soft scores produced by the gates are converted into binary masks, reducing both the I/O overhead and the computational cost of attention.
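To make the gating concrete, here is a minimal sketch of the two operations described above (paraphrasing the paper; the block size, function names, and exact thresholding are illustrative assumptions, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

def gate_target(attn_map: torch.Tensor, block_size: int = 64) -> torch.Tensor:
    """Self-distillation target: 2D max-pool the dense attention map
    (batch, heads, seq, seq) down to block granularity."""
    return F.max_pool2d(attn_map, kernel_size=block_size, stride=block_size)

def binarize(gate_scores: torch.Tensor, threshold: float = 2e-3):
    """Inference-time conversion of soft gate scores into a binary block
    mask; density is the fraction of K/V blocks kept."""
    mask = gate_scores > threshold
    density = mask.float().mean().item()
    return mask, density
```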

## Original GitHub Repo

https://github.com/microsoft/SeerAttention
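A hypothetical loading sketch with transformers is shown below; the supported entry point and any custom classes live in the repo above, so treat the `trust_remote_code` usage as an assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical usage sketch -- see the repo above for the supported API.
model_id = "SeerAttention/SeerAttention-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,  # assumption: AttnGates ship as custom modeling code
)
```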

## Evaluation Results

### Perplexity on PG19

Columns are evaluation context lengths in tokens.

| Density | 8192 | 16384 | 32768 | 65536 | 131072 |
|---------|------|-------|-------|-------|--------|
| 1.00 | 10.03 | 9.88 | 9.92 | 9.97 | 10.03 |
| 0.50 | 10.04 | 9.89 | 9.92 | 9.99 | 10.05 |
| 0.40 | 10.06 | 9.89 | 9.93 | 9.99 | 10.07 |
| 0.30 | 10.09 | 9.91 | 9.95 | 10.01 | 10.15 |
| 0.20 | 10.19 | 9.94 | 9.97 | 10.04 | 10.37 |
| 0.10 | 10.61 | 10.08 | 10.04 | 10.09 | 10.88 |
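The exact evaluation harness is not specified here; a minimal windowed-perplexity sketch under standard assumptions (non-overlapping windows of each context length, HF-style causal-LM loss) looks like this:

```python
import torch

@torch.no_grad()
def pg19_perplexity(model, input_ids: torch.Tensor, ctx_len: int) -> float:
    """Split a long token stream into non-overlapping ctx_len windows,
    average the causal-LM NLL per predicted token, and exponentiate."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, input_ids.size(1), ctx_len):
        chunk = input_ids[:, start : start + ctx_len]
        if chunk.size(1) < 2:
            continue
        out = model(input_ids=chunk, labels=chunk)  # HF shifts labels internally
        n = chunk.size(1) - 1                       # tokens with a prediction target
        total_nll += out.loss.item() * n
        total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```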

### LongBench

With the gate threshold set to 2e-3.

| Task | 0-4k | 4-8k | 8k+ |
|------|------|------|-----|
| 2wikimqa | 51.1 | 47.85 | 33.36 |
| gov_report | 35.03 | 35.05 | 34.57 |
| hotpotqa | 63.97 | 60.0 | 56.7 |
| lcc | 67.98 | 73.18 | 65.28 |
| multi_news | 28.1 | 25.78 | 24.25 |
| multifieldqa_en | 58.63 | 51.45 | 51.87 |
| passage_count | 18.0 | 10.15 | 11.88 |
| passage_retrieval_en | 100.0 | 99.0 | 98.0 |
| qasper | 47.77 | 44.04 | 39.63 |
| repobench-p | 51.78 | 56.24 | 56.75 |
| samsum | 43.28 | 41.19 | 45.29 |
| trec | 64.0 | 76.0 | 75.0 |
| triviaqa | 90.91 | 88.45 | 92.43 |
| averaged | 55.43 | 54.49 | 52.69 |

### RULER

| Context Length | Dense Baseline | SeerAttn | Avg Density |
|----------------|----------------|----------|-------------|
| 4k | 95.53 | 95.53 | 0.87 |
| 8k | 92.27 | 92.71 | 0.72 |
| 16k | 92.01 | 92.02 | 0.56 |
| 32k | 87.63 | 88.49 | 0.46 |
| 64k | 84.39 | 83.48 | 0.32 |
| 128k | 76.26 | 73.37 | 0.17 |
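"Avg Density" is the fraction of attention blocks retained after binarizing the gate scores. A hedged sketch of that bookkeeping (the real measurement may normalize over causal-valid blocks only; this version pools over everything):

```python
import torch

def average_density(block_masks: list[torch.Tensor]) -> float:
    """block_masks: one boolean (heads, q_blocks, kv_blocks) mask per layer.
    Returns the kept-block fraction pooled over all layers and heads,
    ignoring the causal triangle for simplicity."""
    kept = sum(m.sum().item() for m in block_masks)
    total = sum(m.numel() for m in block_masks)
    return kept / total
```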