MoH: Multi-Head Attention as Mixture-of-Head Attention

Paper or resources for more information: [Paper] [Code]

โšก Overview

We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages:

  • First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters.
  • Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential.

๐Ÿ˜ฎ Highlights

๐Ÿ’ก General Framework

We evaluate our proposed MoH across various popular model frameworks, including Vision Transformers (ViT) for image classification, Diffusion models with Transformers (DiT) for class-conditional image generation, and Large Language Models (LLMs) for language tasks.

Code HuggingFace Model
MoH-ViT ๐Ÿค— MoH-ViT-B-75, MoH-ViT-B-50, MoH-ViT-S-80, MoH-ViT-S-75
MoH-DiT ๐Ÿ˜Š MoH-DiT-90
MoH-LLaMA3-8B ๐Ÿ˜Š MoH-LLaMA3-8B

๐Ÿ”ฅ High Performance

Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%~90% of the attention heads.

๐Ÿค— Support Continue-Tuning Starting from the Multi-Head Attention Models

we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads.

The MoH model quickly recovers to over 95% of the performance of the original model within a training budget of 10B tokens. Then, the performance gradually improves with the increase of the training tokens.

โœ๏ธ Citation

If you find this paper useful, please consider staring ๐ŸŒŸ this repo and citing ๐Ÿ“‘ our paper:

@article{jin2024moh,
  title={MoH: Multi-Head Attention as Mixture-of-Head Attention}, 
  author={Peng Jin and Bo Zhu and Li Yuan and Shuicheng Yan},
  year={2024}
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no library tag.