---
license: apache-2.0
language:
- bg
- cs
- nl
- en
- fi
- fr
- de
- el
- it
- pl
- pt
- es
- sv
- code
tags:
- multilingual
- base-model
- transformer
- decoder-only
- LLM
- smol
- MiniLingua
---

# MiniLingua-1b

**MiniLingua-1b** is a multilingual base language model with approximately 1 billion parameters, trained from scratch with a custom 128k-token SentencePiece tokenizer that supports the following languages:

Bulgarian, Czech, Dutch, English, Finnish, French, German, Greek, Italian, Polish, Portuguese, Spanish, Swedish, and programming code.
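
Below is a minimal sketch of inspecting the tokenizer with the Hugging Face `transformers` library; the repository id `MiniLingua-1b` is assumed from the model name and may differ from the actual Hub path.

```python
# Minimal sketch: loading and inspecting the MiniLingua tokenizer.
# The repository id below is assumed from the model name and may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MiniLingua-1b")

# Tokenize short samples in a few supported languages (and code)
# to see how the 128k-token vocabulary segments them.
for text in ["Good morning!", "Hyvää huomenta!", "def add(a, b): return a + b"]:
    print(tokenizer.tokenize(text))
```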

### Training Details

MiniLingua-1b was trained on a 1-trillion-token corpus that includes:

- [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)
- Curated high-quality multilingual and code data from public sources

The model was trained for 1.5 epochs over 12 days on the [LUMI supercomputer](https://lumi-supercomputer.eu/), using:

- 256 AMD MI250X GPUs
- bf16 precision
- the Megatron-LM library
- data parallelism

### Intended Use

This model serves as a multilingual base LLM, suitable for instruction tuning, research, and language understanding tasks in low- and high-resource European languages.
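
As a base (not instruction-tuned) model, it performs plain text completion. A minimal generation sketch with the Hugging Face `transformers` library follows; the repository id `MiniLingua-1b` is assumed from the model name, and the sampling settings are illustrative only.

```python
# Minimal sketch: sampling a continuation from the base model with Transformers.
# Repository id and generation settings are assumptions, not official values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniLingua-1b"  # assumed Hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Base models continue text rather than follow instructions.
inputs = tokenizer("La capitale de la France est", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```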

### License

Apache 2.0: free for research and commercial use, subject to the license terms.

---