korean-neural-sparse-encoder

A Korean-specific SPLADE-max sparse encoder fine-tuned from skt/A.X-Encoder-base (ModernBERT). It maps Korean sentences and paragraphs to a 50,000-dimensional sparse vector space for semantic search and sparse retrieval tasks.

Model Details

Property Value
Model Type SPLADE Sparse Encoder (SPLADE-max)
Base Model skt/A.X-Encoder-base (ModernBERT)
Parameters 149M
Output Dimensionality 50,000
Hidden Size 768
Layers 22
Korean Token Ratio 48.4% of vocabulary
Similarity Function Dot Product
Maximum Sequence Length 8,192 tokens

Architecture

ModernBertForMaskedLM
  → MLM Head (hidden_size → vocab_size)
  → log(1 + ReLU(logits))        # SPLADE activation
  → Max Pooling over sequence     # Position-invariant representation
  → Sparse Vector (50,000-dim)

Usage

Direct Usage (Transformers)

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "sewoong/korean-neural-sparse-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

special_ids = {tokenizer.cls_token_id, tokenizer.sep_token_id,
               tokenizer.pad_token_id, tokenizer.unk_token_id}

@torch.no_grad()
def encode(text: str, max_length: int = 256) -> dict[str, float]:
    inputs = tokenizer(text, return_tensors="pt",
                       max_length=max_length, truncation=True)
    logits = model(**inputs).logits
    sparse = torch.log1p(torch.relu(logits))
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    vec = (sparse * mask).max(dim=1).values.squeeze(0)

    result = {}
    for idx in (vec > 0).nonzero(as_tuple=True)[0].tolist():
        if idx not in special_ids:
            token = tokenizer.convert_ids_to_tokens(idx)
            result[token] = round(vec[idx].item(), 4)
    return result

# Example
vec = encode("한국 전쟁의 원인과 결과")
print(f"Active dimensions: {len(vec)}")
print(sorted(vec.items(), key=lambda x: -x[1])[:10])

Usage with OpenSearch (Client-Side Encoding)

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "sewoong/korean-neural-sparse-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

special_ids = {tokenizer.cls_token_id, tokenizer.sep_token_id,
               tokenizer.pad_token_id, tokenizer.unk_token_id}

@torch.no_grad()
def encode_for_opensearch(text: str, max_length: int = 256) -> dict[str, float]:
    """Encode text to sparse vector with integer token IDs for sparse_vector field."""
    inputs = tokenizer(text, return_tensors="pt",
                       max_length=max_length, truncation=True)
    logits = model(**inputs).logits
    sparse = torch.log1p(torch.relu(logits))
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    vec = (sparse * mask).max(dim=1).values.squeeze(0)

    result = {}
    for idx in (vec > 0).nonzero(as_tuple=True)[0].tolist():
        if idx not in special_ids:
            weight = round(vec[idx].item(), 4)
            if weight > 0:
                result[str(idx)] = weight  # Integer token ID as string key
    return result

Create Index

PUT /my-sparse-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "nori"
      },
      "sparse_embedding": {
        "type": "sparse_vector",
        "index": true,
        "method": {
          "name": "seismic",
          "parameters": {
            "n_postings": 300,
            "cluster_ratio": 0.1,
            "summary_prune_ratio": 0.4
          }
        }
      }
    }
  }
}

Search

GET /my-sparse-index/_search
{
  "query": {
    "neural_sparse": {
      "sparse_embedding": {
        "query_tokens": {
          "31380": 2.5134,
          "32470": 1.8921,
          "15678": 1.2045
        }
      }
    }
  }
}

Note: The sparse_vector field type requires integer token IDs as keys (e.g., "31380"), not string tokens (e.g., "한국"). Use encode_for_opensearch() above for correct format.

Evaluation

Korean Retrieval Benchmarks

Evaluated on standard Korean retrieval benchmarks using OpenSearch with neural_sparse search. All differences vs. BM25 are statistically significant (paired t-test, p < 0.001).

Benchmark Queries Corpus Description
Ko-StrategyQA 592 9,251 Korean multi-hop retrieval (translated from StrategyQA)
MIRACL-ko 213 10,000 Wikipedia-based Korean document retrieval
Mr.TyDi-ko 421 10,000 Wikipedia-based Korean document retrieval

Performance Summary (Recall@1)

Benchmark BM25 Neural Sparse (Ours) Dense (BGE-M3)
Ko-StrategyQA 53.7% 62.2% (+8.5pp) 73.5%
MIRACL-ko 44.1% 62.0% (+17.9pp) 70.9%
Mr.TyDi-ko 55.6% 73.4% (+17.8pp) 84.1%
Average 51.1% 65.9% (+14.7pp) 76.2%

Detailed Metrics

Benchmark Method R@1 R@5 R@10 MRR NDCG@10 P50 Latency
Ko-StrategyQA BM25 53.7% 75.3% 81.9% 0.626 0.673 8.2ms
Ko-StrategyQA Neural Sparse 62.2% 80.6% 83.6% 0.700 0.734 9.4ms
Ko-StrategyQA Dense (BGE-M3) 73.5% 87.3% 89.4% 0.795 0.819 11.8ms
MIRACL-ko BM25 44.1% 80.8% 90.6% 0.589 0.666 7.9ms
MIRACL-ko Neural Sparse 62.0% 89.7% 93.4% 0.733 0.783 9.5ms
MIRACL-ko Dense (BGE-M3) 70.9% 93.9% 97.7% 0.810 0.851 11.8ms
Mr.TyDi-ko BM25 55.6% 79.1% 85.7% 0.656 0.705 8.3ms
Mr.TyDi-ko Neural Sparse 73.4% 92.4% 94.8% 0.816 0.849 9.6ms
Mr.TyDi-ko Dense (BGE-M3) 84.1% 95.7% 96.9% 0.894 0.913 12.0ms

Comparison with Other Sparse Models

Model Parameters Ko-StrategyQA R@1 MIRACL-ko R@1 Mr.TyDi-ko R@1
sewoong/korean-neural-sparse-encoder 149M 62.2% 62.0% 73.4%
opensearch-neural-sparse-encoding-multilingual-v1 110M

Hybrid Search Performance (Ko-StrategyQA)

Combining BM25 + Neural Sparse + Dense retrieval with linear interpolation:

Method R@1 R@5 R@10 MRR NDCG@10
BM25 only 53.7% 75.3% 81.9% 0.626 0.673
Neural Sparse only 62.2% 80.6% 83.6% 0.700 0.734
Dense (BGE-M3) only 73.5% 87.3% 89.4% 0.795 0.819
Hybrid (sparse=0.3, dense=0.7) 72.3% 87.5% 89.2% 0.788 0.814
Hybrid (sparse=0.4, dense=0.6) 71.8% 87.0% 89.4% 0.784 0.811
Hybrid (sparse=0.5, dense=0.5) 70.3% 86.3% 89.0% 0.773 0.802

Sparsity Characteristics

Property Query Document
Avg. active dimensions ~33 ~54
Sparsity rate 99.93% 99.89%
Vocabulary size 50,000 50,000

The model produces ultra-sparse representations where only 0.07%~0.11% of the vocabulary dimensions are activated, enabling efficient inverted index storage and retrieval.

Training Details

Training Data

4.59M Korean triplets (query, positive document, hard negative) from 28 sources:

Dataset Samples Ratio Type
AIHub News QA (#624) 1,325,966 28.9% News question-answering
OPUS-100 (ko-en) 732,044 15.9% Parallel corpus
kor-triplet-v1.0 681,896 14.8% Retrieval triplets
mC4-ko 475,292 10.3% Web passage pairs
Wikipedia-ko 328,733 7.2% Wikipedia passage pairs
Korean NLI 252,063 5.5% Natural language inference
AIHub Dialog QA (#86) 150,771 3.3% Dialog-based QA
ko-wikidata-QA 130,657 2.8% Wikidata QA
OIG-smallchip2-ko 117,887 2.6% Instruction following
KorQuAD 2.0 80,914 1.8% Machine reading comprehension
Others (18 datasets) 317,384 6.9% NLI, STS, classification, dialog
Total 4,593,607 100%

Training Configuration

Parameter Value
Base model skt/A.X-Encoder-base (ModernBERT)
Loss function InfoNCE + FLOPS regularization
Temperature 1.0 (sparse dot-product)
FLOPS lambda (query) 0.01
FLOPS lambda (document) 0.003
FLOPS warmup 20,000 steps (quadratic schedule)
Learning rate 5e-5 (cosine decay)
Warmup ratio 0.06
Weight decay 0.01
Gradient clipping 1.0
Effective batch size 2,048 (64/GPU x 4 grad_accum x 8 GPUs)
Epochs 25
Mixed precision BF16
Query max length 64 tokens
Document max length 256 tokens
Seed 42

Hardware

Component Specification
GPU 8x NVIDIA B200 (183GB VRAM each)
Total VRAM 1,464 GB
Training time ~24 hours
DDP DistributedDataParallel (NCCL)

Framework Versions

Framework Version
Python 3.12
PyTorch 2.6
Transformers 4.48
CUDA 12.8

Limitations

  • Korean-focused: Optimized for Korean text; performance on other languages is not guaranteed.
  • Query length: Best results with queries under 64 tokens. Longer queries are truncated.
  • Term expansion scope: SPLADE expansion is bounded by the 50K vocabulary. Out-of-vocabulary terms fall back to subword tokenization.
  • No built-in reranking: For best results, combine with a cross-encoder reranker.

Citation

@software{korean-neural-sparse-encoder,
  author       = {Sewoong Kim},
  title        = {korean-neural-sparse-encoder},
  subtitle     = {Korean SPLADE-max Sparse Encoder for Neural Sparse Retrieval},
  publisher    = {Hugging Face},
  year         = {2026},
  month        = {2},
  version      = {1.0.0},
  url          = {https://huggingface.co/sewoong/korean-neural-sparse-encoder}
}

References

Author

Sewoong Kim - February 2026

License

Apache 2.0

Downloads last month
38
Safetensors
Model size
0.1B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train sewoong/korean-neural-sparse-encoder

Papers for sewoong/korean-neural-sparse-encoder