Sarashina-Embedding-v2-1B
"Sarashina-Embedding-v2-1B" is a Japanese text embedding model, based on the Japanese LLM "Sarashina2.2-1B". We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score across 28 datasets in JMTEB (Japanese Massive Text Embedding Benchmark).(Benchmarked on July 28, 2025. )
This model maps sentences and paragraphs to a 1,792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Sarashina2.2-1B
- Maximum Sequence Length: 8,192 tokens
- Output Dimensionality: 1,792 dimensions
- Similarity Function: Cosine Similarity
- Language: Japanese
- License: Sarashina Model NonCommercial License Agreement
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
(1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
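For intuition, pooling_mode_lasttoken means the hidden state of each sequence's final real token is used as the sentence embedding. A minimal illustrative sketch (the Sentence Transformers Pooling module implements this internally):

import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states:  (batch, seq_len, 1792) token embeddings from the LlamaModel
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    # Assumes right-padding; with left-padding the last token is simply index -1.
    last_idx = attention_mask.sum(dim=1) - 1          # position of each sequence's last real token
    batch_idx = torch.arange(hidden_states.size(0))
    return hidden_states[batch_idx, last_idx]         # (batch, 1792) sentence embeddings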
Usage
First install the Sentence Transformers library:
pip install sentence-transformers==4.0.2
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v2-1b")
# Run inference
query = [
    'task: クエリを与えるので、与えられたWeb検索クエリに答える関連文章を検索してください。\nquery: Sarashinaのテキスト埋め込みモデルはありますか?'
]
texts = [
    'text: 更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    'text: Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    'text: サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。'
]
query_embedding = model.encode(query)
text_embeddings = model.encode(texts)
# Get the similarity scores between the embeddings
similarities = model.similarity(query_embedding, text_embeddings)
print(similarities)
# tensor([[0.7403, 0.8651, 0.8775]])
How to add instructions and prefixes
The query and document sides use different prefix formats. On the query side, add the prefix task: followed by an instruction; on the document side, use the text: prefix. (For the STS task only, both sentences are treated as queries and should be prefixed with the same instruction.)
- Query Side:
task: {Instruction}\nquery: {Query}
- Document Side:
text: {Document}
Templates for instructions and prefixes
The table below provides instruction and prefix templates for five main tasks.
Task | Query Side | Document Side |
---|---|---|
Retrieval / Reranking | task: 質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。\nquery: | text: |
Clustering | task: 与えられたドキュメントのトピックまたはテーマを特定してください。\nquery: | - |
Classification | task: 与えられたレビューを適切な評価カテゴリに分類してください。\nquery: | - |
STS | task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: | task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: |
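For example, a small helper (the function names here are ours, for illustration only) that applies the Retrieval / Reranking template from the table above:

# Illustrative helpers, not part of sentence-transformers.
RETRIEVAL_INSTRUCTION = "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。"

def format_query(query: str, instruction: str = RETRIEVAL_INSTRUCTION) -> str:
    # Query side: "task: {Instruction}\nquery: {Query}"
    return f"task: {instruction}\nquery: {query}"

def format_document(document: str) -> str:
    # Document side: "text: {Document}"
    return f"text: {document}"

print(format_query("Sarashinaのテキスト埋め込みモデルはありますか?"))
print(format_document("サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。"))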
Training
Sarashina-Embedding-v2-1B is created through the following three-stage learning process:
Stage 1: Weakly-supervised Learning
To build a general-purpose and high-performance embedding model for a wide range of domains, we employed contrastive learning using weak supervision data, which consists of our own web-crawled data and open datasets.
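For intuition, contrastive learning with in-batch negatives pulls each query embedding toward its paired document and away from the other documents in the batch. A minimal sketch of such a loss (InfoNCE); the exact loss and hyperparameters used for this model are not disclosed:

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # The i-th query should match the i-th document; all other documents in the
    # batch act as negatives. temperature=0.05 is a common choice, not a
    # disclosed hyperparameter.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature              # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))            # positives lie on the diagonal
    return F.cross_entropy(logits, labels)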
Stage 2: Supervised Fine-tuning
To further train the model to better capture the similarity between queries and documents, we fine-tuned it on higher-quality data than that used in Stage 1. We also trained multiple models by varying parts of this data.
Stage 3: Model Merging
To enhance performance, we merged the weights of the two models that yielded the highest JMTEB scores in Stage 2 through linear merging.
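Linear merging amounts to a weighted average of the two checkpoints' parameters. A minimal sketch, assuming both models share identical parameter names and shapes (the actual mixing weight is not published):

def linear_merge(state_dict_a: dict, state_dict_b: dict, alpha: float = 0.5) -> dict:
    # alpha * A + (1 - alpha) * B for every parameter tensor.
    # alpha=0.5 is an assumed value for illustration.
    return {k: alpha * state_dict_a[k] + (1.0 - alpha) * state_dict_b[k]
            for k in state_dict_a}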
Evaluation Results (*) with JMTEB
Model | Avg. | Retrieval | STS | Classification | Reranking | Clustering |
---|---|---|---|---|---|---|
Sarashina-Embedding-v2-1B (This model) | 76.38 | 76.48 | 84.22 | 77.14 | 86.28 | 52.56 |
cl-nagoya/ruri-v3-310m | 75.85 | 76.03 | 81.59 | 77.65 | 85.84 | 50.52 |
sbintuitions/sarashina-embedding-v1-1b | 74.87 | 74.53 | 81.71 | 77.20 | 84.36 | 50.30 |
OpenAI/text-embedding-3-large | 73.86 | 71.95 | 82.52 | 77.27 | 83.06 | 51.82 |
(*) Evaluated on July 28, 2025.
License
This model is licensed under Sarashina Model NonCommercial License Agreement.
If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.