---
base_model:
- nomic-ai/nomic-embed-text-v2-moe-unsupervised
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- pl
- nl
- tr
- ja
- vi
- ru
- id
- ar
- cs
- ro
- sv
- el
- uk
- zh
- hu
- da
- 'no'
- hi
- fi
- bg
- ko
- sk
- th
- he
- ca
- lt
- fa
- ms
- sl
- lv
- mr
- bn
- sq
- cy
- be
- ml
- kn
- mk
- ur
- fy
- te
- eu
- sw
- so
- sd
- uz
- co
- hr
- gu
- ce
- eo
- jv
- la
- zu
- mn
- si
- ga
- ky
- tg
- my
- km
- mg
- pa
- sn
- ha
- ht
- su
- gd
- ny
- ps
- ku
- am
- ig
- lo
- mi
- nn
- sm
- yi
- st
- tl
- xh
- yo
- af
- ta
- tn
- ug
- az
- ba
- bs
- dv
- et
- gl
- gn
- gv
- hy
---
# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings
## Model Overview
`nomic-embed-text-v2-moe` is a state-of-the-art (SoTA) multilingual Mixture of Experts (MoE) text embedding model that excels at multilingual retrieval:
- **High Performance**: SoTA multilingual performance among ~300M-parameter models, and competitive with models twice its size
- **Multilinguality**: Supports ~100 languages and was trained on over 1.6B pairs
- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147), allowing a 3x reduction in storage cost with minimal performance degradation
- **Fully Open-Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see code repo) released
| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
|-------|------------|----------|------|---------|---------------|---------------|------|
| **Nomic Embed v2** | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌ | ❌ | ❌ |
| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
| |
| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |
## Model Architecture
- **Total Parameters**: 475M
- **Active Parameters During Inference**: 305M
- **Architecture Type**: Mixture of Experts (MoE)
- **MoE Configuration**: 8 experts with top-2 routing
- **Embedding Dimensions**: Supports flexible dimension from 768 to 256 through Matryoshka representation learning
- **Maximum Sequence Length**: 512 tokens
- **Languages**: Supports ~100 languages (see the `language` list in the model card metadata)
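To make "8 experts with top-2 routing" concrete, here is a minimal pure-Python sketch of generic top-k routing for a single token. It is illustrative only and is not taken from the model's implementation: the helper names (`softmax`, `route_token`, `moe_layer`), the toy experts, and the choice to renormalize the two selected weights are all assumptions.

```python
import math

NUM_EXPERTS = 8  # from the model card
TOP_K = 2        # from the model card


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k experts.

    Weights are renormalized over the selected experts so they sum to 1;
    this renormalization is an assumption of the sketch.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_layer(token, experts, router_logits):
    """Mix the outputs of the selected experts for one token vector."""
    output = [0.0] * len(token)
    for index, weight in route_token(router_logits):
        expert_out = experts[index](token)
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


# Toy experts (hypothetical): expert i simply scales its input by (i + 1).
experts = [lambda vec, s=i + 1: [s * v for v in vec] for i in range(NUM_EXPERTS)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
print(route_token(logits))  # experts 1 and 4 are selected
print(moe_layer([1.0, 2.0], experts, logits))
```

Because only two of the eight experts run per token, the parameters touched for any one token are a subset of the total, which is consistent with the gap between 475M total and 305M active parameters.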
## Usage Guide
### Installation
The model can be used through either SentenceTransformers or Transformers.
For best performance on GPU, install the following packages:
```bash
pip install torch transformers einops git+https://github.com/nomic-ai/megablocks.git
```
> [!IMPORTANT]
> **Important!**
> The text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.
Please use `search_query: ` before your queries/questions, and `search_document: ` before your documents.
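A small helper can make the prefix requirement harder to forget. The sketch below is hypothetical (the `add_prefix` function and its `kind` argument are not part of any library); only the two prefix strings come from this model card.

```python
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def add_prefix(texts, kind):
    """Prepend the task instruction prefix to each text.

    `kind` is "query" or "document"; both the function and the `kind`
    values are hypothetical names used only for this sketch.
    """
    prefixes = {"query": QUERY_PREFIX, "document": DOCUMENT_PREFIX}
    if kind not in prefixes:
        raise ValueError(f"kind must be one of {sorted(prefixes)}, got {kind!r}")
    prefix = prefixes[kind]
    # Avoid double-prefixing text that already carries the prefix.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_prefix(["What is a mixture of experts?"], "query")
documents = add_prefix(["Hello!", "search_document: ¡Hola!"], "document")
print(queries)
# ['search_query: What is a mixture of experts?']
print(documents)
# ['search_document: Hello!', 'search_document: ¡Hola!']
```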
### Transformers
If using Transformers, **make sure to prepend the task instruction prefix**.
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Each input carries its task instruction prefix.
sentences = ['search_document: Hello!', 'search_document: ¡Hola!']


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
# torch.Size([2, 768])

similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity)
# tensor(0.9118)
```
### SentenceTransformers
With SentenceTransformers, you can specify the `prompt_name` as either `"query"` or `"passage"`, and the task instruction will be included automatically.
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
sentences = ["Hello!", "¡Hola!"]
embeddings = model.encode(sentences, prompt_name="passage")
print(embeddings.shape)
# (2, 768)
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity)
# tensor([[0.9118]])
```
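For retrieval, queries and documents use different prompts. The sketch below shows one way this might look: `search` embeds a query with `prompt_name="query"` and documents with `prompt_name="passage"`, then ranks by cosine similarity. The helper names (`cosine`, `rank_documents`, `search`) are hypothetical, and the toy vectors at the end stand in for real embeddings so that the ranking logic can be run without downloading the model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def search(query, documents):
    """Embed with the model and rank; downloads weights on first call."""
    # Imported lazily so the ranking helpers work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
    query_vec = model.encode([query], prompt_name="query")[0]
    doc_vecs = model.encode(documents, prompt_name="passage")
    order = rank_documents(query_vec.tolist(), doc_vecs.tolist())
    return [documents[i] for i in order]


# Toy vectors stand in for real embeddings so the ranking logic runs offline.
print(rank_documents([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]))
# [1, 2, 0]
```

Calling `search("¿Qué es un modelo de embeddings?", docs)` with a list of document strings would return those documents ordered by similarity to the query.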
## Performance
nomic-embed-text-v2-moe performance on BEIR and MIRACL compared to other open-weights embedding models:
[Figure: BEIR and MIRACL scores of nomic-embed-text-v2-moe and other open-weights embedding models]
nomic-embed-text-v2-moe performance on BEIR at 768 dimension and truncated to 256 dimensions:
[Figure: BEIR scores of nomic-embed-text-v2-moe at 768 dimensions and truncated to 256 dimensions]
## Best Practices
- Add appropriate prefixes to your text:
  - For queries: `search_query: `
  - For documents: `search_document: `
- Maximum input length is 512 tokens
- For optimal efficiency, consider using the 256-dimension embeddings if storage/compute is a concern
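The 256-dimension option can be sketched as follows, assuming that keeping the first 256 components and rescaling to unit L2 norm is appropriate for this model (the usual way Matryoshka embeddings are shortened). The `truncate_embedding` helper and the stand-in vector are hypothetical; recent SentenceTransformers releases also accept a `truncate_dim` argument, which may be more convenient in practice.

```python
import math


def truncate_embedding(vec, dim=256):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if dim > len(vec):
        raise ValueError("dim cannot exceed the embedding size")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Stand-in for a real 768-dimension embedding (not model output).
full = [math.sin(i) for i in range(768)]
small = truncate_embedding(full, 256)

bytes_full = 768 * 4   # float32 storage per vector
bytes_small = 256 * 4
print(len(small), bytes_full, bytes_small, bytes_full / bytes_small)
# 256 3072 1024 3.0
```

The 3x ratio in the last line is the storage reduction quoted in the overview: each float32 vector shrinks from 3,072 to 1,024 bytes.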
## Limitations
- Performance may vary across different languages
- Resource requirements may be higher than traditional dense models due to MoE architecture
- Must use `trust_remote_code=True` when loading the model to use our custom architecture implementation
## Training Details
- Trained on 1.6 billion high-quality pairs across multiple languages
- Uses consistency filtering to ensure high-quality training data
- Incorporates Matryoshka representation learning for dimension flexibility
- Training includes both weakly-supervised contrastive pretraining and supervised finetuning
For more details, please check out the [blog post](https://www.nomic.ai/blog/posts/nomic-embed-text-v2) and [technical report](https://www.arxiv.org/abs/2502.07972).
## Join the Nomic Community
- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
## Citation
If you find the model, dataset, or training code useful, please cite our work:
```bibtex
@misc{nussbaum2025trainingsparsemixtureexperts,
title={Training Sparse Mixture Of Experts Text Embedding Models},
author={Zach Nussbaum and Brandon Duderstadt},
year={2025},
eprint={2502.07972},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.07972},
}
```