Overview

  • ModernBertMultilingual is a multilingual model trained from scratch.
  • Uses the ModernBERT-base architecture.
  • Supports four languages and their variants: Chinese (Simplified and Traditional), English, Japanese, and Korean.
  • Performs well on tasks involving mixed East Asian language text.
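
A minimal inference sketch for a model card like this one, assuming the weights are published as a standard `transformers` masked-LM checkpoint and the tokenizer defines a mask token. The repo id below is a placeholder, not the actual published path:

```python
def fill_mask_demo(text: str, repo_id: str = "your-namespace/modern_bert_multilingual"):
    """Run fill-mask inference on mixed-language text.

    `repo_id` is a hypothetical placeholder; substitute the real Hub path.
    The import is deferred so the function can be defined without the
    transformers dependency being exercised.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=repo_id)
    # Returns a list of candidate completions with `token_str` and `score`.
    return fill(text)
```

In use, the input would contain the tokenizer's mask token (e.g. `fill_mask_demo("今日はいい[MASK]ですね")`, if `[MASK]` is the configured mask token).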

Technical Specifications

  • Uses a slightly adjusted Qwen2.5-series vocabulary to support multiple languages.
  • Trained for approximately 100 hours on 7× NVIDIA L40 GPUs, for a total of about 60B tokens.
  • Key training parameters:
    • Batch Size : 1792
    • Learning Rate : 5e-04
    • Maximum Sequence Length : 512
    • Optimizer : adamw_torch
    • LR Scheduler: warmup_stable_decay
    • Train Precision : bf16 mixed
  • For other technical specifications, please refer to the original release information and paper of ModernBERT-base.
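
The LR Scheduler listed above is warmup_stable_decay (WSD): a linear warmup, a long flat plateau at the peak learning rate, then a decay to zero. The phase fractions below are illustrative assumptions; the card does not state them. A minimal sketch:

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 5e-4,
           warmup_frac: float = 0.1, decay_frac: float = 0.1) -> float:
    """Warmup-stable-decay schedule: linear warmup, constant plateau,
    linear decay to zero. Phase fractions are assumed for illustration."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = total_steps - int(total_steps * decay_frac)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:
        # Stable plateau at the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr down to 0.
    return peak_lr * (total_steps - step) / max(1, total_steps - decay_start)
```

Note that the `nodecay` checkpoint described below corresponds to the end of the stable plateau, i.e. just before the decay (annealing) phase begins, which is why it is the natural starting point for domain-specific annealing.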

Released Versions

  • Provides 3 different weight versions:
    • base - Fully trained on a general corpus; suitable for a wide range of text domains.
    • nodecay - Checkpoint taken before the annealing (decay) stage; you can anneal it on domain-specific data to better adapt it to a target domain.
    • keyword_gacha_multilingual - Version annealed on ACGN-type text (e.g., light novels, game text, manga text).

Model                              Version    Description
modern_bert_multilingual           20250128   base
modern_bert_multilingual_nodecay   20250128   nodecay
keyword_gacha_multilingual_base    20250128   keyword_gacha_multilingual
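
As a sketch, the three released versions could be selected and loaded with `transformers` as below. The weight names come from the table above, but the Hub namespace is a hypothetical placeholder, not the actual published path:

```python
# Released weight names (from the version table); the namespace is a placeholder.
VARIANTS = {
    "base": "modern_bert_multilingual",
    "nodecay": "modern_bert_multilingual_nodecay",
    "keyword_gacha_multilingual": "keyword_gacha_multilingual_base",
}

def load_variant(name: str, namespace: str = "your-namespace"):
    """Load one of the released weight versions by short name.

    Import is deferred so selecting a variant does not require the
    transformers dependency until a download is actually attempted.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    repo_id = f"{namespace}/{VARIANTS[name]}"
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForMaskedLM.from_pretrained(repo_id)
    return tokenizer, model
```

For example, `load_variant("nodecay")` would fetch the pre-annealing checkpoint intended as a base for domain-specific annealing.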

Other


Model size: 228M parameters (Safetensors, F32)