|
--- |
|
library_name: transformers |
|
language: |
|
- rcf |
|
--- |
|
|
|
# Tokenizer for Réunion Creole 🇷🇪 |
|
|
|
This tokenizer is specifically designed for **Réunion Creole**, a language primarily spoken on the island of Réunion. It is based on the **Byte Pair Encoding (BPE)** model and tuned to the lexical and orthographic characteristics of the language. |
|
|
|
## Features |
|
|
|
- Built using the **BPE (Byte Pair Encoding)** model. |
|
- Trained on *LA RIME, Mo i akorde dann bal zakor*, a freely accessible book. |
|
- Supports special tokens for common NLP tasks: |
|
- `[CLS]`: Start-of-sequence token for classification tasks. |
|
- `[SEP]`: Separator token for multi-segment inputs. |
|
- `[PAD]`: Padding token. |
|
- `[MASK]`: Masking token used for training masked language models. |
|
- `[UNK]`: Token for unknown words. |
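To illustrate how these special tokens fit into a BPE vocabulary, here is a minimal sketch using the `tokenizers` library (the same backend `transformers` fast tokenizers use). The corpus below is a hypothetical stand-in, not the actual training data; the real tokenizer was trained on the book cited above.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical miniature corpus, standing in for the real training book
corpus = [
    "Mi aim La Rénion",
    "Zot tout i koz kréol",
    "Nou sa va dann bal",
]

# Build a BPE tokenizer with the same five special tokens
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    special_tokens=["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"],
    vocab_size=200,
)
tokenizer.train_from_iterator(corpus, trainer)

# The special tokens occupy the first IDs of the vocabulary, in order
for token in ["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"]:
    print(token, tokenizer.token_to_id(token))
```

Because the trainer registers special tokens before any learned merges, they always receive the lowest vocabulary IDs and are never split during tokenization.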
|
|
|
## Usage |
|
|
|
### Loading the Tokenizer |
|
|
|
You can easily load this tokenizer using the `transformers` library: |
|
|
|
```python |
|
from transformers import AutoTokenizer |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer") |
|
|
|
# Example: tokenize and encode a sentence
text = "Comment i lé zot tout ?"

tokens = tokenizer.tokenize(text)  # subword strings
ids = tokenizer.encode(text)       # corresponding token IDs

print(tokens)
print(ids)
|
``` |
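The `[PAD]` token from the feature list above is what makes batched inputs possible: shorter sequences are padded up to the length of the longest one. Below is a minimal sketch of that mechanism using a locally trained stand-in tokenizer (the corpus and vocabulary size are hypothetical, chosen only so the example runs offline).

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical tiny corpus; the real tokenizer was trained on the book cited above
corpus = ["out zistoir lé zoli", "zot tout i koz kréol la Rénion"]

tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
tok.train_from_iterator(
    corpus,
    trainers.BpeTrainer(
        special_tokens=["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"],
        vocab_size=100,
    ),
)

# Pad every sequence in a batch to the same length with [PAD]
tok.enable_padding(pad_id=tok.token_to_id("[PAD]"), pad_token="[PAD]")
batch = tok.encode_batch(["out zistoir lé zoli", "zot tout"])

# After padding, all encodings in the batch have equal length
print([len(e.ids) for e in batch])
```

The same behavior is available through `transformers` by passing `padding=True` when calling the tokenizer on a list of sentences.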
|
|
|
|
|
## Author

Hugo How-Choong