---
# makiart/multilingual-ModernBert-base-preview

This model was developed by the [Algomatic](https://algomatic.jp/) team using computational resources provided by the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html).

- **Context Length:** 8192
- **Vocabulary Size:** 151,680
- **Total Training Tokens:** Approximately 250B tokens
- **Parameter Count:** 228M
- **Non-embedding Parameter Count:** 110M
- **Dataset:** fineweb and fineweb2
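
These figures can be checked against the released checkpoint. The following is a minimal sketch (not part of the original card); it assumes the repo ID from this card's title and estimates the non-embedding count by subtracting only the input embedding matrix:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

repo_id = "makiart/multilingual-ModernBert-base-preview"  # assumed: the repo ID from this card's title

config = AutoConfig.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

total_params = sum(p.numel() for p in model.parameters())
embedding_params = model.get_input_embeddings().weight.numel()

print("max context length:", config.max_position_embeddings)   # expected: 8192
print("vocabulary size:", config.vocab_size)                    # expected: 151680
print(f"total parameters: ~{total_params / 1e6:.0f}M")          # expected: ~228M
print(f"non-embedding parameters: ~{(total_params - embedding_params) / 1e6:.0f}M")  # expected: ~110M
```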

## How to Use

Install the required package:

```bash
pip install -U "transformers>=4.48.0"
```

If your GPU supports FlashAttention, you can achieve more efficient inference by installing flash-attn (a loading sketch follows the install command):

```bash
pip install flash-attn --no-build-isolation
```
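
Once flash-attn is installed, FlashAttention 2 can be requested explicitly when loading the model. A minimal sketch (assuming a CUDA device and the repo ID from this card's title; `attn_implementation` is a generic `transformers` loading option, not something specific to this model):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "makiart/multilingual-ModernBert-base-preview"  # assumed: the repo ID from this card's title

# Ask transformers for the FlashAttention 2 kernels; drop attn_implementation if flash-attn is unavailable.
model = AutoModelForMaskedLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```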

## Example Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the model in bfloat16 and build a fill-mask pipeline around it.
model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-base-preview", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base-preview")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

results = fill_mask("We must learn to [MASK] that we can be nothing other than who we are here and now.")

for result in results:
    print(result)

# {'score': 0.2373046875, 'token': 4411, 'token_str': ' believe', 'sequence': 'We must learn to believe that we can be nothing other than who we are here and now.'}
# {'score': 0.09912109375, 'token': 4193, 'token_str': ' accept', 'sequence': 'We must learn to accept that we can be nothing other than who we are here and now.'}
# {'score': 0.09912109375, 'token': 15282, 'token_str': ' recognize', 'sequence': 'We must learn to recognize that we can be nothing other than who we are here and now.'}
# {'score': 0.0771484375, 'token': 13083, 'token_str': ' realize', 'sequence': 'We must learn to realize that we can be nothing other than who we are here and now.'}
# {'score': 0.06005859375, 'token': 13217, 'token_str': ' ourselves', 'sequence': 'We must learn to ourselves that we can be nothing other than who we are here and now.'}
```

## Model Description

- **Training Approach:** The model was trained with a two-stage Masked Language Modeling (MLM) process (a rough sketch of the masking setup follows this list):
  - **Masking Rate:** 30%
  - **Training Data:** Approximately 200B tokens at a context length of 1024, followed by 50B tokens at a context length of 8192.
- **Tokenizer:** Based on the Qwen2.5 tokenizer, it features:
  - A vocabulary of 151,680 tokens.
  - Customizations that distinguish indentation in code, enabling better handling of programming text.
- **Dataset:**
  - Uses the fineweb and fineweb2 datasets.
  - Data volume for high-resource languages has been reduced.
- **Computational Resources:** Training was conducted on a single node (8× H200) provided by ABCI over approximately 3 days.
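
The pretraining code itself is not included in this card. As a rough illustration of the masking setup described above (a sketch under assumptions, not the actual training pipeline), `DataCollatorForLanguageModeling` from `transformers` can reproduce a 30% masking rate; the repo ID is again taken from this card's title:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base-preview")  # assumed repo ID

# Select 30% of the input tokens for the MLM objective, matching the masking rate reported above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

encoding = tokenizer(
    "def add(a, b):\n    return a + b",  # indented code, which this tokenizer is customized to handle
    return_tensors="pt",
)
batch = collator([{key: value[0] for key, value in encoding.items()}])

print(batch["input_ids"])  # selected positions are mostly replaced with the mask token id
print(batch["labels"])     # -100 everywhere except the selected positions, which keep the original ids
```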

## Evaluation

A comprehensive evaluation has not been performed yet 😭.

Given its total training token count, the model may be less competitive than existing models.

---

This model was created by the [Algomatic](https://algomatic.jp/) team using computational resources provided through the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html).

- Context Length: 8192
|