Overview
- ModernBertMultilingual is a multilingual model trained from scratch.
- Uses the ModernBERT-base architecture.
- Supports four languages and their variants: Chinese (Simplified, Traditional), English, Japanese, and Korean.
- Performs well on tasks involving mixed East Asian language text.
Technical Specifications
- Uses a slightly adjusted version of the Qwen2.5 series vocabulary to support multiple languages.
- Trained for approximately 100 hours on 7 L40 devices (L40*7), on about 60B tokens.
- Key training parameters:
  - Batch Size: 1792
  - Learning Rate: 5e-04
  - Maximum Sequence Length: 512
  - Optimizer: adamw_torch
  - LR Scheduler: warmup_stable_decay
  - Training Precision: bf16 mixed
- For other technical specifications, please refer to the original release information and paper of ModernBERT-base.
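As a rough sanity check on the figures above, the sketch below derives the tokens per optimizer step, the approximate step count, and the throughput. It assumes the batch size counts sequences (not tokens) and that every sequence is packed to the full 512 tokens; neither is stated in this card, so treat the results as order-of-magnitude estimates only.

```python
# Back-of-the-envelope numbers derived from the figures quoted in this card.
BATCH_SIZE = 1792      # assumed to be sequences per optimizer step
MAX_SEQ_LEN = 512      # tokens per sequence (upper bound)
TOTAL_TOKENS = 60e9    # "about 60B tokens"
TRAIN_HOURS = 100      # "approximately 100 hours"
NUM_DEVICES = 7        # "L40*7"

# Assumption: every sequence is filled to MAX_SEQ_LEN, so this is an upper bound.
tokens_per_step = BATCH_SIZE * MAX_SEQ_LEN

# With fully packed sequences this is a lower bound on the number of steps.
approx_steps = TOTAL_TOKENS / tokens_per_step

# Average throughput over the whole run, in total and per device.
tokens_per_second = TOTAL_TOKENS / (TRAIN_HOURS * 3600)
tokens_per_second_per_device = tokens_per_second / NUM_DEVICES

print(f"tokens per step:           {tokens_per_step:,}")
print(f"approx. optimizer steps:   {approx_steps:,.0f}")
print(f"tokens/s (all devices):    {tokens_per_second:,.0f}")
print(f"tokens/s (per device):     {tokens_per_second_per_device:,.0f}")
```

Under these assumptions a step covers at most 917,504 tokens, which puts the run at roughly 65,000 optimizer steps. Note that 1792 = 7 × 256, which is consistent with 256 sequences per device per step, possibly reached through gradient accumulation.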
Released Versions
- Provides 3 different weight versions:
  - base - Fully trained on a general corpus; suitable for a wide range of text domains.
  - nodecay - Checkpoint taken before the annealing stage. You can fine-tune (anneal) it on domain-specific data to better adapt it to a target domain.
  - keyword_gacha_multilingual - Version fine-tuned (annealed) on ACGN-type text (e.g., light novels, game text, manga text).
Model | Version | Description |
---|---|---|
modern_bert_multilingual | 20250128 | base |
modern_bert_multilingual_nodecay | 20250128 | nodecay |
keyword_gacha_multilingual_base | 20250128 | keyword_gacha_multilingual |
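To picture how the nodecay and base checkpoints relate, here is a minimal sketch of a warmup-stable-decay learning-rate curve in plain Python. The peak learning rate of 5e-04 comes from the training parameters; the warmup and decay fractions and the linear ramp shapes are illustrative assumptions, not the settings used for this model, and `wsd_lr` is a hypothetical helper name.

```python
def wsd_lr(step, total_steps, peak_lr=5e-4,
           warmup_frac=0.05, decay_frac=0.10, min_lr=0.0):
    """Learning rate at `step` for a warmup-stable-decay schedule.

    warmup_frac and decay_frac are assumed values for illustration only.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    # Phase 1: linear warmup from near zero up to the peak learning rate.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Phase 2: stable plateau at the peak learning rate.
    # A "nodecay"-style checkpoint would sit at the end of this phase.
    if step < stable_end or decay_steps == 0:
        return peak_lr

    # Phase 3: linear decay (annealing) from the peak down to min_lr.
    remaining = max(total_steps - step, 0)
    return min_lr + (peak_lr - min_lr) * remaining / decay_steps


if __name__ == "__main__":
    total = 1000
    for s in (0, 49, 500, 900, 950, 1000):
        print(s, wsd_lr(s, total))
```

In this picture, a checkpoint saved at the end of the plateau can be annealed later on a different corpus by running only the decay phase on the new data, which is the use the nodecay version is described as serving.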
Other
- Training script: Github