---
license: apache-2.0
language:
- bg
- cs
- nl
- en
- fi
- fr
- de
- el
- it
- pl
- pt
- es
- sv
- code
tags:
- multilingual
- base-model
- transformer
- decoder-only
- LLM
- smol
- MiniLingua
---

# MiniLingua-1b

**MiniLingua-1b** is a multilingual base language model with approximately 1 billion parameters, trained from scratch with a custom 128k-token SentencePiece tokenizer that supports the following languages:

Bulgarian, Czech, Dutch, English, Finnish, French, German, Greek, Italian, Polish, Portuguese, Spanish, Swedish, and programming code.
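
Below is a minimal sketch of inspecting the tokenizer with the Hugging Face `transformers` library; the repository id `MiniLingua-1b` is assumed from the model name and may differ from the actual Hub path.

```python
# Minimal sketch: loading and inspecting the MiniLingua tokenizer.
# The repository id below is assumed from the model name and may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MiniLingua-1b")

# Tokenize short samples in a few supported languages (and code)
# to see how the 128k-token vocabulary segments them.
for text in ["Good morning!", "Hyvää huomenta!", "def add(a, b): return a + b"]:
    print(tokenizer.tokenize(text))
```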

### Training Details

MiniLingua-1b was trained on a 1-trillion-token corpus that includes:

- [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)
- Curated high-quality multilingual and code data from public sources

The model was trained for 1.5 epochs over 12 days on the [LUMI supercomputer](https://lumi-supercomputer.eu/), using:

- 256 AMD MI250X GPUs
- bf16 precision
- the Megatron-LM library
- data parallelism

### Intended Use

This model serves as a multilingual base LLM, suitable for instruction tuning, research, and language understanding tasks in low- and high-resource European languages.
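
As a base (not instruction-tuned) model, it performs plain text completion. A minimal generation sketch with the Hugging Face `transformers` library follows; the repository id `MiniLingua-1b` is assumed from the model name, and the sampling settings are illustrative only.

```python
# Minimal sketch: sampling a continuation from the base model with Transformers.
# Repository id and generation settings are assumptions, not official values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniLingua-1b"  # assumed Hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Base models continue text rather than follow instructions.
inputs = tokenizer("La capitale de la France est", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```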

### License

Apache 2.0: free for research and commercial use, subject to the license terms.

---