---
library_name: transformers
language:
- rcf
---
# Tokenizer for Réunion Creole 🇷🇪
This tokenizer is designed for **Réunion Creole** (language code `rcf`), a language primarily spoken on the island of Réunion. It uses a **Byte Pair Encoding (BPE)** model, so its subword vocabulary is learned from Réunion Creole text and reflects the vocabulary and spelling conventions of the language.
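To make the BPE idea concrete, here is a minimal pure-Python sketch of how merge rules are learned and applied. It is an illustration only, not the code used to build this tokenizer: the function names `learn_bpe_merges` and `apply_merges` and the toy word list are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a list of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from single characters; keep a frequency per distinct word.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # The most frequent pair becomes the next merge rule.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with that pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy word list (hypothetical, NOT the actual training corpus).
toy_words = ["zot", "zot", "zot", "tout", "tout"]
merges = learn_bpe_merges(toy_words, num_merges=2)
print(merges)
print(apply_merges("zot", merges))
print(apply_merges("tout", merges))
```

Frequent words end up as a single symbol after a few merges, while rarer words stay split into smaller pieces. A production tokenizer adds details this sketch leaves out, such as pre-tokenization and a fixed target vocabulary size.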
## Features
- Built using the **BPE (Byte Pair Encoding)** model.
- Trained on "LA RIME, Mo i akorde dann bal zakor", a freely accessible book.
- Supports special tokens for common NLP tasks:
  - `[CLS]`: Start-of-sequence token for classification tasks.
  - `[SEP]`: Separator token for multi-segment inputs.
  - `[PAD]`: Padding token.
  - `[MASK]`: Masking token used for training masked language models.
  - `[UNK]`: Token for unknown words.
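
As a rough illustration of how these special tokens are conventionally arranged, the sketch below frames one or two token segments in the BERT-style layout and pads the result to a fixed length. The helper `frame_and_pad` is hypothetical and written for this example; whether this tokenizer inserts `[CLS]` and `[SEP]` automatically depends on its configuration, which this card does not state.

```python
def frame_and_pad(segment_a, segment_b=None, max_length=8):
    """Build a BERT-style token sequence and its attention mask.

    Hypothetical helper for illustration; not part of this tokenizer's API.
    """
    # [CLS] opens the sequence; [SEP] closes each segment.
    tokens = ["[CLS]"] + list(segment_a) + ["[SEP]"]
    if segment_b is not None:
        tokens += list(segment_b) + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("sequence is longer than max_length")
    # Real tokens get mask 1; [PAD] positions get mask 0.
    n_pad = max_length - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * n_pad
    tokens += ["[PAD]"] * n_pad
    return tokens, attention_mask


tokens, mask = frame_and_pad(["zot"], ["tout"], max_length=6)
print(tokens)  # ['[CLS]', 'zot', '[SEP]', 'tout', '[SEP]', '[PAD]']
print(mask)    # [1, 1, 1, 1, 1, 0]
```

The attention mask lets a model ignore the `[PAD]` positions when sequences of different lengths are batched together.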
## Usage
### Loading the Tokenizer
You can easily load this tokenizer using the `transformers` library:
```python
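from transformers import AutoTokenizer

# Download the tokenizer files from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

# Example of tokenization: encode() returns a list of integer token IDs.
text = "Comment i lé zot tout ?"
token_ids = tokenizer.encode(text)
print(token_ids)
# Map the IDs back to subword strings to inspect the segmentation.
print(tokenizer.convert_ids_to_tokens(token_ids))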
```
Hugo How-Choong