makiart committed
Commit e003c33 · verified · 1 Parent(s): 2f4ed4c

Update README.md

Files changed (1): README.md (+66 -0)
README.md CHANGED
 
---
# makiart/multilingual-ModernBert-base-preview

This model was developed by the [Algomatic](https://algomatic.jp/) team using computational resources provided by the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html).

- **Context Length:** 8192
- **Vocabulary Size:** 151,680
- **Total Training Tokens:** Approximately 250B tokens
- **Parameter Count:** 228M
- **Non-embedding Parameter Count:** 110M
- **Datasets:** fineweb and fineweb2
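
As a quick sanity check, the advertised vocabulary size and context length can be read back from the published config and tokenizer. This is a minimal sketch; `vocab_size` and `max_position_embeddings` are the standard Hugging Face Transformers config fields assumed here, while the expected values come from the list above.

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("makiart/multilingual-ModernBert-base")
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base")

# Expected: vocab_size = 151680 and max_position_embeddings = 8192 (assumed field names).
print(config.vocab_size, config.max_position_embeddings)
print(len(tokenizer))  # tokenizer vocabulary, including any added special tokens
```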

## How to Use

Install the required package using:

```bash
pip install -U "transformers>=4.48.0"
```

If your GPU supports FlashAttention, you can achieve more efficient inference by installing:

```bash
pip install flash-attn --no-build-isolation
```
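
A minimal sketch of loading the model with the FlashAttention backend, assuming a compatible GPU and a successful `flash-attn` install (`attn_implementation` is the generic Hugging Face Transformers argument, not something specific to this model):

```python
import torch
from transformers import AutoModelForMaskedLM

# Request FlashAttention 2 kernels; requires the flash-attn package and a supported GPU.
model = AutoModelForMaskedLM.from_pretrained(
    "makiart/multilingual-ModernBert-base",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```

Without FlashAttention installed, simply omit `attn_implementation` and the default attention implementation is used.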

## Example Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the checkpoint in bfloat16 and build a fill-mask pipeline.
model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

results = fill_mask("We must learn to [MASK] that we can be nothing other than who we are here and now.")

for result in results:
    print(result)

# {'score': 0.2373046875, 'token': 4411, 'token_str': ' believe', 'sequence': 'We must learn to believe that we can be nothing other than who we are here and now.'}
# {'score': 0.09912109375, 'token': 4193, 'token_str': ' accept', 'sequence': 'We must learn to accept that we can be nothing other than who we are here and now.'}
# {'score': 0.09912109375, 'token': 15282, 'token_str': ' recognize', 'sequence': 'We must learn to recognize that we can be nothing other than who we are here and now.'}
# {'score': 0.0771484375, 'token': 13083, 'token_str': ' realize', 'sequence': 'We must learn to realize that we can be nothing other than who we are here and now.'}
# {'score': 0.06005859375, 'token': 13217, 'token_str': ' ourselves', 'sequence': 'We must learn to ourselves that we can be nothing other than who we are here and now.'}
```
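
Because the model is multilingual, the same pipeline works for non-English input. A small illustrative sketch with a Japanese prompt (the example sentence is an assumption for demonstration; its predictions are not shown because they are not taken from this card):

```python
from transformers import pipeline

# Build the fill-mask pipeline directly from the Hub checkpoint.
fill_mask = pipeline("fill-mask", model="makiart/multilingual-ModernBert-base")

# Hypothetical Japanese prompt: "The capital of Japan is [MASK]."
for result in fill_mask("日本の首都は[MASK]です。"):
    print(result["token_str"], round(result["score"], 3))
```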

## Model Description

- **Training Approach:** The model was trained with a two-stage Masked Language Modeling (MLM) process:
  - **Masking Rate:** 30% (see the sketch after this list)
  - **Training Data:** Approximately 200B tokens at a context length of 1024, followed by 50B tokens at a context length of 8192.
- **Tokenizer:** Based on Qwen2.5, the tokenizer features:
  - A vocabulary size of 151,680 tokens.
  - Customizations that distinguish indentation in code, enabling better handling of programming text.
- **Dataset:**
  - Uses the fineweb and fineweb2 datasets.
  - For high-resource languages, the data volume was downsampled.
- **Computational Resources:** Training ran on a single ABCI node (8× H200) for approximately 3 days.
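
As a rough illustration of the 30% masking rate, the sketch below configures the standard Hugging Face `DataCollatorForLanguageModeling` with `mlm_probability=0.3`. This only approximates the setup described above; it is not the actual pretraining code.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base")

# Standard MLM collator set to the 30% masking probability used for pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

batch = collator([tokenizer("ModernBERT randomly masks tokens during pretraining.")])
print(batch["input_ids"])  # ~30% of tokens are selected; most of those appear as the [MASK] id
print(batch["labels"])     # original ids at masked positions, -100 everywhere else
```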

## Evaluation

A comprehensive evaluation has not been performed yet 😭.

Given the relatively modest total training token count, the model may be less competitive than existing models.

---