---
base_model:
- nomic-ai/nomic-embed-text-v2-moe-unsupervised
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- pl
- nl
- tr
- ja
- vi
- ru
- id
- ar
- cs
- ro
- sv
- el
- uk
- zh
- hu
- da
- 'no'
- hi
- fi
- bg
- ko
- sk
- th
- he
- ca
- lt
- fa
- ms
- sl
- lv
- mr
- bn
- sq
- cy
- be
- ml
- kn
- mk
- ur
- fy
- te
- eu
- sw
- so
- sd
- uz
- co
- hr
- gu
- ce
- eo
- jv
- la
- zu
- mn
- si
- ga
- ky
- tg
- my
- km
- mg
- pa
- sn
- ha
- ht
- su
- gd
- ny
- ps
- ku
- am
- ig
- lo
- mi
- nn
- sm
- yi
- st
- tl
- xh
- yo
- af
- ta
- tn
- ug
- az
- ba
- bs
- dv
- et
- gl
- gn
- gv
- hy
---

# nomic-embed-text-v2-moe: Multilingual Mixture of Experts Text Embeddings

## Model Overview

`nomic-embed-text-v2-moe` is a SoTA multilingual MoE text embedding model that excels at multilingual retrieval:

- **High Performance**: SoTA multilingual performance compared to ~300M parameter models, competitive with models 2x its size
- **Multilinguality**: Supports ~100 languages and trained on over 1.6B pairs
- **Flexible Embedding Dimension**: Trained with [Matryoshka Embeddings](https://arxiv.org/abs/2205.13147), enabling a 3x reduction in storage cost with minimal performance degradation
- **Fully Open-Source**: Model weights, [code](https://github.com/nomic-ai/contrastors), and training data (see code repo) released

| Model | Params (M) | Emb Dim | BEIR | MIRACL | Pretrain Data | Finetune Data | Code |
|-------|------------|---------|------|--------|---------------|---------------|------|
| **Nomic Embed v2** | 305 | 768 | 52.86 | **65.80** | ✅ | ✅ | ✅ |
| mE5 Base | 278 | 768 | 48.88 | 62.30 | ❌ | ❌ | ❌ |
| mGTE Base | 305 | 768 | 51.10 | 63.40 | ❌ | ❌ | ❌ |
| Arctic Embed v2 Base | 305 | 768 | **55.40** | 59.90 | ❌ | ❌ | ❌ |
| | | | | | | | |
| BGE M3 | 568 | 1024 | 48.80 | **69.20** | ❌ | ✅ | ❌ |
| Arctic Embed v2 Large | 568 | 1024 | **55.65** | 66.00 | ❌ | ❌ | ❌ |
| mE5 Large | 560 | 1024 | 51.40 | 66.50 | ❌ | ❌ | ❌ |

## Model Architecture

- **Total Parameters**: 475M
- **Active Parameters During Inference**: 305M
- **Architecture Type**: Mixture of Experts (MoE)
- **MoE Configuration**: 8 experts with top-2 routing
- **Embedding Dimensions**: Supports flexible dimensions from 768 down to 256 through Matryoshka representation learning
- **Maximum Sequence Length**: 512 tokens
- **Languages**: Supports ~100 languages (see Performance section)

## Usage Guide

### Installation

The model can be used through SentenceTransformers and Transformers. For best performance on GPU, please install:

```bash
pip install torch transformers einops git+https://github.com/nomic-ai/megablocks.git
```

> [!IMPORTANT]
> **Important!**
> The text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed. Please use `search_query: ` before your queries/questions, and `search_document: ` before your documents.

### Transformers

If using Transformers, **make sure to prepend the task instruction prefix**.
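For example, prefixed inputs for a retrieval task would look like the following (a minimal illustration; the query and document strings are made up, only the prefixes matter):

```python
# Hypothetical example strings -- the task prefixes are the required part.
query = "search_query: What is the capital of France?"
documents = [
    "search_document: Paris is the capital and most populous city of France.",
    "search_document: Berlin is the capital of Germany.",
]
```

The complete end-to-end example below encodes two `search_document:` inputs and compares them with cosine similarity: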
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v2-moe")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Each input carries the task instruction prefix (here both are documents)
sentences = ['search_document: Hello!', 'search_document: ¡Hola!']

# Mean pooling over token embeddings, ignoring padding tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)
# torch.Size([2, 768])

similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity)
# tensor(0.9118)
```

### SentenceTransformers

With SentenceTransformers, you can specify the `prompt_name` as either `"query"` or `"passage"`, and the task instruction will be included automatically.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

sentences = ["Hello!", "¡Hola!"]
embeddings = model.encode(sentences, prompt_name="passage")
print(embeddings.shape)
# (2, 768)

similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity)
# tensor([[0.9118]])
```

## Performance

nomic-embed-text-v2-moe performance on BEIR and MIRACL compared to other open-weights embedding models:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/xadjrezEIM0Q1jbgmjqO7.png)

nomic-embed-text-v2-moe performance on BEIR at 768 dimensions and truncated to 256 dimensions:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/8hmhWQ_TTmlrviZFIBSxo.png)

## Best Practices

- Add the appropriate task prefix to your text:
  - For queries: `search_query: `
  - For documents: `search_document: `
- Maximum input length is 512 tokens
- For optimal efficiency, consider using the 256-dimension embeddings if storage/compute is a concern (see the truncation sketch after the Training Details section)

## Limitations

- Performance may vary across different languages
- Resource requirements may be higher than traditional dense models due to the MoE architecture
- Must use `trust_remote_code=True` when loading the model to use our custom architecture implementation

## Training Details

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/F0lyAtV8wXMBmxSbtIgL4.png)

- Trained on 1.6 billion high-quality pairs across multiple languages
- Uses consistency filtering to ensure high-quality training data
- Incorporates Matryoshka representation learning for dimension flexibility
- Training includes both weakly-supervised contrastive pretraining and supervised finetuning

For more details, please check out the [blog post](https://www.nomic.ai/blog/posts/nomic-embed-text-v2) and [technical report](https://www.arxiv.org/abs/2502.07972).
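Because the model is trained with Matryoshka representation learning, the 768-dimension embeddings can simply be truncated to 256 dimensions and re-normalized, as mentioned in Best Practices. Below is a minimal sketch of that truncation; it reuses the SentenceTransformers setup shown earlier, and the variable names are illustrative rather than part of any official API.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

sentences = ["Hello!", "¡Hola!"]
# Full 768-dimension embeddings, returned as a torch tensor for easy slicing
embeddings = model.encode(sentences, prompt_name="passage", convert_to_tensor=True)

# Matryoshka truncation: keep the leading 256 dimensions, then re-normalize
# so cosine similarity remains meaningful on the shortened vectors.
matryoshka_dim = 256
truncated = F.normalize(embeddings[:, :matryoshka_dim], p=2, dim=1)
print(truncated.shape)  # torch.Size([2, 256])

similarity = F.cosine_similarity(truncated[0], truncated[1], dim=0)
print(similarity)
```

Recent sentence-transformers releases also expose a `truncate_dim` argument on the `SentenceTransformer` constructor that applies the same truncation at encode time; the explicit slice above just makes the operation visible.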
## Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)

# Citation

If you find the model, dataset, or training code useful, please cite our work:

```bibtex
@misc{nussbaum2025trainingsparsemixtureexperts,
      title={Training Sparse Mixture Of Experts Text Embedding Models},
      author={Zach Nussbaum and Brandon Duderstadt},
      year={2025},
      eprint={2502.07972},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07972},
}
```