---
# makiart/multilingual-ModernBert-base-preview

This model was developed by the [Algomatic](https://algomatic.jp/) team using computational resources provided by the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html).

- **Context Length:** 8192
- **Vocabulary Size:** 151,680
- **Total Training Tokens:** Approximately 250B tokens
- **Parameter Count:** 228M
- **Non-embedding Parameter Count:** 110M
- **Dataset:** fineweb and fineweb2
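
These figures can be checked against the released checkpoint. The following is a minimal sketch (not part of the original card); it assumes the repo ID from this card's title and estimates the non-embedding count by subtracting only the input embedding matrix:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

repo_id = "makiart/multilingual-ModernBert-base-preview"  # assumed: the repo ID from this card's title

config = AutoConfig.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

total_params = sum(p.numel() for p in model.parameters())
embedding_params = model.get_input_embeddings().weight.numel()

print("max context length:", config.max_position_embeddings)   # expected: 8192
print("vocabulary size:", config.vocab_size)                    # expected: 151680
print(f"total parameters: ~{total_params / 1e6:.0f}M")          # expected: ~228M
print(f"non-embedding parameters: ~{(total_params - embedding_params) / 1e6:.0f}M")  # expected: ~110M
```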

## How to Use

Install the required package:

```bash
pip install -U "transformers>=4.48.0"
```

If your GPU supports FlashAttention, you can achieve more efficient inference by installing flash-attn (a loading sketch follows the install command):

```bash
pip install flash-attn --no-build-isolation
```
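
Once flash-attn is installed, FlashAttention 2 can be requested explicitly when loading the model. A minimal sketch (assuming a CUDA device and the repo ID from this card's title; `attn_implementation` is a generic `transformers` loading option, not something specific to this model):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "makiart/multilingual-ModernBert-base-preview"  # assumed: the repo ID from this card's title

# Ask transformers for the FlashAttention 2 kernels; drop attn_implementation if flash-attn is unavailable.
model = AutoModelForMaskedLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```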

## Example Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the model in bfloat16 and build a fill-mask pipeline around it.
model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-base-preview", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base-preview")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

results = fill_mask("We must learn to [MASK] that we can be nothing other than who we are here and now.")

for result in results:
    print(result)

# {'score': 0.2373046875, 'token': 4411, 'token_str': ' believe', 'sequence': 'We must learn to believe that we can be nothing other than who we are here and now.'}
# {'score': 0.09912109375, 'token': 4193, 'token_str': ' accept', 'sequence': 'We must learn to accept that we can be nothing other than who we are here and now.'}
# {'score': 0.09912109375, 'token': 15282, 'token_str': ' recognize', 'sequence': 'We must learn to recognize that we can be nothing other than who we are here and now.'}
# {'score': 0.0771484375, 'token': 13083, 'token_str': ' realize', 'sequence': 'We must learn to realize that we can be nothing other than who we are here and now.'}
# {'score': 0.06005859375, 'token': 13217, 'token_str': ' ourselves', 'sequence': 'We must learn to ourselves that we can be nothing other than who we are here and now.'}
```

## Model Description

- **Training Approach:** The model was trained with a two-stage Masked Language Modeling (MLM) process (a rough sketch of the masking setup follows this list):
  - **Masking Rate:** 30%
  - **Training Data:** Approximately 200B tokens at a context length of 1024, followed by 50B tokens at a context length of 8192.
- **Tokenizer:** Based on the Qwen2.5 tokenizer, it features:
  - A vocabulary of 151,680 tokens.
  - Customizations that distinguish indentation in code, enabling better handling of programming text.
- **Dataset:**
  - Uses the fineweb and fineweb2 datasets.
  - Data volume for high-resource languages has been reduced.
- **Computational Resources:** Training was conducted on a single node (8× H200) provided by ABCI over approximately 3 days.
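
The pretraining code itself is not included in this card. As a rough illustration of the masking setup described above (a sketch under assumptions, not the actual training pipeline), `DataCollatorForLanguageModeling` from `transformers` can reproduce a 30% masking rate; the repo ID is again taken from this card's title:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base-preview")  # assumed repo ID

# Select 30% of the input tokens for the MLM objective, matching the masking rate reported above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

encoding = tokenizer(
    "def add(a, b):\n    return a + b",  # indented code, which this tokenizer is customized to handle
    return_tensors="pt",
)
batch = collator([{key: value[0] for key, value in encoding.items()}])

print(batch["input_ids"])  # selected positions are mostly replaced with the mask token id
print(batch["labels"])     # -100 everywhere except the selected positions, which keep the original ids
```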

## Evaluation

A comprehensive evaluation has not been performed yet 😭.

Given its total training token count, the model may be less competitive than existing models.

---

This model was created by the [Algomatic](https://algomatic.jp/) team using computational resources provided through the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html).

- Context Length: 8192
|