       value: 86.38437691024106
     - type: max_f1
       value: 78.79039565086076
+---
+# nomic-embed-text-v1-ablated: A Reproducible Long Context (8192) Text Embedder
+`nomic-embed-text-v1-ablated` is an 8192-context-length text encoder that surpasses OpenAI text-embedding-ada-002 performance on short- and long-context tasks.
+
+| Name                             | SeqLen | MTEB      | LoCo     | Jina Long Context |  Open Weights | Open Training Code | Open Data   |
+| :-------------------------------:| :----- | :-------- | :------: | :---------------: | :-----------: | :----------------: | :---------- |
+| nomic-embed-text-v1              | 8192   | **62.39** |**85.53** | 54.16             | ✅            | ✅                  | ✅          |
+| jina-embeddings-v2-base-en       | 8192   | 60.39     | 85.45    | 51.90             | ✅            | ❌                  | ❌          |
+| text-embedding-3-small           | 8191   | 62.26     | 82.40    | **58.20**         | ❌            | ❌                  | ❌          |
+| text-embedding-ada-002           | 8191   | 60.99     | 52.7     | 55.25             | ❌            | ❌                  | ❌          |
+
+If you would like to finetune a model on more data, you can use this model as an initialization.
+
+## Training Details
+We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
+the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
+In the second finetuning stage, higher-quality labeled datasets, such as search queries and answers from web searches, are leveraged. Data curation and hard-example mining are crucial in this stage.
+For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf).
+The training data is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
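The card does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice for paired text data: an InfoNCE-style loss with in-batch negatives. The function names (`cosine`, `info_nce_loss`), the temperature of `0.05`, and the toy vectors are all assumptions for illustration; the technical report and the `contrastors` repository describe the actual training setup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss: documents[i] is the positive for queries[i],
    and every other document in the batch acts as a negative.
    The temperature value is an assumed placeholder."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in documents]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D "embeddings": each query points roughly toward its paired document.
queries = [[1.0, 0.1], [0.1, 1.0]]
documents = [[0.9, 0.0], [0.0, 0.8]]

aligned = info_nce_loss(queries, documents)
shuffled = info_nce_loss(queries, documents[::-1])  # pairs deliberately mismatched
print(aligned < shuffled)  # True
```

The loss is small when each query is closest to its own paired document and grows when the pairing is broken, which is the signal both training stages rely on.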
+## Usage
+```python
+import torch
+import torch.nn.functional as F
+from transformers import AutoTokenizer, AutoModel
+
+def mean_pooling(model_output, attention_mask):
+    # Average the token embeddings, ignoring padding positions via the attention mask.
+    token_embeddings = model_output[0]  # first element holds the per-token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']
+tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-ablated', trust_remote_code=True)
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+with torch.no_grad():
+    model_output = model(**encoded_input)
+embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+embeddings = F.normalize(embeddings, p=2, dim=1)
+print(embeddings)
+```
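To make the pooling step concrete, here is a dependency-free sketch of the same arithmetic on a toy sequence. It is not part of the model's API: `masked_mean` and `l2_normalize` are hypothetical helper names, and the token vectors are made up. It shows that a padded position (mask `0`) does not influence the pooled vector and that the normalized result has unit length.

```python
import math

def masked_mean(token_vectors, mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    dim = len(token_vectors[0])
    count = max(sum(mask), 1e-9)  # guard against division by zero, like torch.clamp
    return [
        sum(vec[j] * m for vec, m in zip(token_vectors, mask)) / count
        for j in range(dim)
    ]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy sequence: two real tokens followed by one padding token (mask 0).
tokens = [[2.0, 4.0], [4.0, 4.0], [100.0, -100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)  # the padding row is ignored
unit = l2_normalize(pooled)
print(pooled)  # [3.0, 4.0]
print(unit)    # [0.6, 0.8]
```

Because the embeddings are L2-normalized, cosine similarity between two of them reduces to a dot product; with the tensors from the snippet above, `embeddings @ embeddings.T` would give the pairwise similarity matrix.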
+The model natively supports scaling of the sequence length past 2048 tokens. To do so, make the following changes:
+```diff
+- tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
++ tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
+- model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-ablated', trust_remote_code=True)
++ model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-ablated', trust_remote_code=True, rotary_scaling_factor=2)
+```