https://github.com/BM-K/Sentence-Embedding-is-all-you-need

Korean-Sentence-Embedding

🍭 Korean sentence embedding repository. You can download the pre-trained models and run inference right away, and the repository also provides an environment for training your own models.

Quick tour

import torch
from transformers import AutoModel, AutoTokenizer

def cal_score(a, b):
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)

    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100

model = AutoModel.from_pretrained('BM-K/KoSimCSE-roberta-multitask')
tokenizer = AutoTokenizer.from_pretrained('BM-K/KoSimCSE-roberta-multitask')

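sentences = ['치타가 들판을 가로 질러 먹이를 쫓는다.',    # "A cheetah chases its prey across a field."
             '치타 한 마리가 먹이 뒤에서 달리고 있다.',  # "A cheetah is running behind its prey."
             '원숭이 한 마리가 드럼을 연주한다.']        # "A monkey is playing the drums."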

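# Tokenize the batch; padding and truncation yield equal-length tensors
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# With return_dict=False the model returns a tuple; the first item holds the per-token hidden states
embeddings, _ = model(**inputs, return_dict=False)

# Compare sentences using the first-token vector of each one
score01 = cal_score(embeddings[0][0], embeddings[1][0])
score02 = cal_score(embeddings[0][0], embeddings[2][0])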
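
To make the arithmetic inside cal_score concrete, here is a minimal dependency-free sketch of the same computation: cosine similarity scaled by 100. The helper name cosine_score and the toy vectors v0, v1, v2 are illustrative assumptions, not part of the repository, and the vectors are not real embeddings.

```python
import math

def cosine_score(a, b):
    """Cosine similarity of two equal-length vectors, scaled by 100 like cal_score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) * 100

# Toy vectors (hypothetical, chosen only to show the two extremes)
v0 = [1.0, 2.0, 0.0]
v1 = [2.0, 4.0, 0.0]  # same direction as v0, so the score is close to 100
v2 = [0.0, 0.0, 3.0]  # orthogonal to v0, so the score is 0

print(cosine_score(v0, v1))
print(cosine_score(v0, v2))
```

Because the vectors are normalized before the dot product, only their direction matters, not their length. In the Quick tour, a higher score01 than score02 would indicate that the first two sentences are closer in meaning than the first and third.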

Performance

  • Semantic Textual Similarity test set results
| Model | AVG | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|---|
| KoSBERTSKT | 77.40 | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
| KoSBERT | 80.39 | 82.13 | 82.25 | 80.67 | 80.75 | 80.69 | 80.78 | 77.96 | 77.90 |
| KoSRoBERTa | 81.64 | 81.20 | 82.20 | 81.79 | 82.34 | 81.59 | 82.20 | 80.62 | 81.25 |
| KoSentenceBART | 77.14 | 79.71 | 78.74 | 78.42 | 78.02 | 78.40 | 78.00 | 74.24 | 72.15 |
| KoSentenceT5 | 77.83 | 80.87 | 79.74 | 80.24 | 79.36 | 80.19 | 79.27 | 72.81 | 70.17 |
| KoSimCSE-BERTSKT | 81.32 | 82.12 | 82.56 | 81.84 | 81.63 | 81.99 | 81.74 | 79.55 | 79.19 |
| KoSimCSE-BERT | 83.37 | 83.22 | 83.58 | 83.24 | 83.60 | 83.15 | 83.54 | 83.13 | 83.49 |
| KoSimCSE-RoBERTa | 83.65 | 83.60 | 83.77 | 83.54 | 83.76 | 83.55 | 83.77 | 83.55 | 83.64 |
| KoSimCSE-BERT-multitask | 85.71 | 85.29 | 86.02 | 85.63 | 86.01 | 85.57 | 85.97 | 85.26 | 85.93 |
| KoSimCSE-RoBERTa-multitask | 85.77 | 85.08 | 86.12 | 85.84 | 86.12 | 85.83 | 86.12 | 85.03 | 85.99 |
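
As a rough guide to reading the results, each metric column pairs a similarity function (cosine, Euclidean, Manhattan, dot) with a correlation (Pearson or Spearman) between the model's scores and human similarity labels. The sketch below computes both correlations by hand on made-up numbers. The gold and predicted lists are hypothetical toy data, not values from the benchmark. The multiplication by 100 and the reading of AVG as the mean of the eight metric columns are inferences from the reported numbers, not something the source states.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy data: human labels and model similarity scores for five pairs
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
predicted = [10.0, 35.0, 30.0, 80.0, 95.0]

pearson_x100 = pearson(predicted, gold) * 100
spearman_x100 = spearman(predicted, gold) * 100
print(pearson_x100, spearman_x100)

# Assumption check: AVG as the mean of the eight metric columns (KoSBERT row)
kosbert_row = [82.13, 82.25, 80.67, 80.75, 80.69, 80.78, 77.96, 77.90]
kosbert_avg = sum(kosbert_row) / len(kosbert_row)
print(round(kosbert_avg, 2))
```

The KoSBERT check reproduces the listed AVG of 80.39. Some other rows differ from this mean in the last digit (for example KoSRoBERTa), presumably because the published averages were computed from unrounded values.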

Model size: 111M params (Safetensors; tensor types I64, F32)