Sarashina-Embedding-v2-1B
"Sarashina-Embedding-v2-1B" is a Japanese text embedding model, based on the Japanese LLM "Sarashina2.2-1B". We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score across 28 datasets in JMTEB (Japanese Massive Text Embedding Benchmark).(Benchmarked on July 28, 2025. )
This model maps sentences and paragraphs to a 1,792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Sarashina2.2-1B
- Maximum Sequence Length: 8,192 tokens
- Output Dimensionality: 1,792 dimensions
- Similarity Function: Cosine Similarity
- Language: Japanese
- License: Sarashina Model NonCommercial License Agreement
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
(1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
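For intuition, pooling_mode_lasttoken means the hidden state of each sequence's final real token is used as the sentence embedding. A minimal illustrative sketch (the Sentence Transformers Pooling module implements this internally):

import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states:  (batch, seq_len, 1792) token embeddings from the LlamaModel
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    # Assumes right-padding; with left-padding the last token is simply index -1.
    last_idx = attention_mask.sum(dim=1) - 1          # position of each sequence's last real token
    batch_idx = torch.arange(hidden_states.size(0))
    return hidden_states[batch_idx, last_idx]         # (batch, 1792) sentence embeddings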
Usage
First install the Sentence Transformers library:
pip install sentence-transformers==4.0.2
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v2-1b")
# Run inference
query = [
    'task: クエリを与えるので、与えられたWeb検索クエリに答える関連文章を検索してください。\nquery: Sarashinaのテキスト埋め込みモデルはありますか?'
]
texts = [
    'text: 更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    'text: Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    'text: サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。'
]
query_embedding = model.encode(query)
text_embeddings = model.encode(texts)
# Get the similarity scores between the embeddings
similarities = model.similarity(query_embedding, text_embeddings)
print(similarities)
# tensor([[0.7403, 0.8651, 0.8775]])
How to add instructions and prefixes
The query and document sides use different prefix formats. On the query side, add the prefix task: followed by an instruction; on the document side, use the text: prefix. (For the STS task only, both sentences are treated as queries and should be prefixed with the same instruction.)
- Query Side:
task: {Instruction}\nquery: {Query}
- Document Side:
text: {Document}
Templates for instructions and prefixes
The table below provides instruction and prefix templates for five main tasks.
Task | Query Side | Document Side |
---|---|---|
Retrieval / Reranking | task: 質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。\nquery: | text: |
Clustering | task: 与えられたドキュメントのトピックまたはテーマを特定してください。\nquery: | - |
Classification | task: 与えられたレビューを適切な評価カテゴリに分類してください。\nquery: | - |
STS | task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: | task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: |
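For example, a small helper (the function names here are ours, for illustration only) that applies the Retrieval / Reranking template from the table above:

# Illustrative helpers, not part of sentence-transformers.
RETRIEVAL_INSTRUCTION = "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。"

def format_query(query: str, instruction: str = RETRIEVAL_INSTRUCTION) -> str:
    # Query side: "task: {Instruction}\nquery: {Query}"
    return f"task: {instruction}\nquery: {query}"

def format_document(document: str) -> str:
    # Document side: "text: {Document}"
    return f"text: {document}"

print(format_query("Sarashinaのテキスト埋め込みモデルはありますか?"))
print(format_document("サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。"))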
Training
Sarashina-Embedding-v2-1B is created through the following three-stage learning process:
Stage 1: Weakly-supervised Learning
To build a general-purpose and high-performance embedding model for a wide range of domains, we employed contrastive learning using weak supervision data, which consists of our own web-crawled data and open datasets.
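For intuition, contrastive learning with in-batch negatives pulls each query embedding toward its paired document and away from the other documents in the batch. A minimal sketch of such a loss (InfoNCE); the exact loss and hyperparameters used for this model are not disclosed:

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # The i-th query should match the i-th document; all other documents in the
    # batch act as negatives. temperature=0.05 is a common choice, not a
    # disclosed hyperparameter.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature              # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))            # positives lie on the diagonal
    return F.cross_entropy(logits, labels)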
Stage 2: Supervised Fine-tuning
To further train the model to better capture the similarity between queries and documents, we fine-tuned it on higher-quality data than that used in Stage 1. We also trained multiple models by varying parts of this data.
Stage 3: Model Merging
To enhance performance, we merged the weights of the two models that yielded the highest JMTEB scores in Stage 2 through linear merging.
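Linear merging amounts to a weighted average of the two checkpoints' parameters. A minimal sketch, assuming both models share identical parameter names and shapes (the actual mixing weight is not published):

def linear_merge(state_dict_a: dict, state_dict_b: dict, alpha: float = 0.5) -> dict:
    # alpha * A + (1 - alpha) * B for every parameter tensor.
    # alpha=0.5 is an assumed value for illustration.
    return {k: alpha * state_dict_a[k] + (1.0 - alpha) * state_dict_b[k]
            for k in state_dict_a}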
Evaluation Results (*) with JMTEB
Model | Avg. | Retrieval | STS | Classification | Reranking | Clustering |
---|---|---|---|---|---|---|
Sarashina-Embedding-v2-1B (This model) | 76.38 | 76.48 | 84.22 | 77.14 | 86.28 | 52.56 |
cl-nagoya/ruri-v3-310m | 75.85 | 76.03 | 81.59 | 77.65 | 85.84 | 50.52 |
sbintuitions/sarashina-embedding-v1-1b | 74.87 | 74.53 | 81.71 | 77.20 | 84.36 | 50.30 |
OpenAI/text-embedding-3-large | 73.86 | 71.95 | 82.52 | 77.27 | 83.06 | 51.82 |
(*) Evaluated on July 28, 2025.
License
This model is licensed under Sarashina Model NonCommercial License Agreement.
If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.