|
--- |
|
library_name: transformers |
|
language: |
|
- rcf |
|
--- |
|
|
|
# Tokenizer for Réunion Creole 🇷🇪 |
|
|
|
This tokenizer is specifically designed for **Réunion Creole**, a language primarily spoken on the island of Réunion. It is based on the **Byte Pair Encoding (BPE)** model and tuned to the lexical and orthographic characteristics of the language. |
|
|
|
## Features |
|
|
|
- Built using the **BPE (Byte Pair Encoding)** model. |
|
- Trained on *LA RIME, Mo i akorde dann bal zakor*, a freely accessible book. |
|
- Supports special tokens for common NLP tasks: |
|
- `[CLS]`: Start-of-sequence token for classification tasks. |
|
- `[SEP]`: Separator token for multi-segment inputs. |
|
- `[PAD]`: Padding token. |
|
- `[MASK]`: Masking token used for training masked language models. |
|
- `[UNK]`: Token for unknown words. |
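To illustrate how these special tokens fit into a BPE vocabulary, here is a minimal sketch using the `tokenizers` library (the same backend `transformers` fast tokenizers use). The corpus below is a hypothetical stand-in, not the actual training data; the real tokenizer was trained on the book cited above.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical miniature corpus, standing in for the real training book
corpus = [
    "Mi aim La Rénion",
    "Zot tout i koz kréol",
    "Nou sa va dann bal",
]

# Build a BPE tokenizer with the same five special tokens
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    special_tokens=["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"],
    vocab_size=200,
)
tokenizer.train_from_iterator(corpus, trainer)

# The special tokens occupy the first IDs of the vocabulary, in order
for token in ["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"]:
    print(token, tokenizer.token_to_id(token))
```

Because the trainer registers special tokens before any learned merges, they always receive the lowest vocabulary IDs and are never split during tokenization.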
|
|
|
## Usage |
|
|
|
### Loading the Tokenizer |
|
|
|
You can easily load this tokenizer using the `transformers` library: |
|
|
|
```python |
|
from transformers import AutoTokenizer |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer") |
|
|
|
# Example: tokenize and encode a sentence
text = "Comment i lé zot tout ?"

tokens = tokenizer.tokenize(text)  # subword strings
ids = tokenizer.encode(text)       # corresponding token IDs

print(tokens)
print(ids)
|
``` |
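The `[PAD]` token from the feature list above is what makes batched inputs possible: shorter sequences are padded up to the length of the longest one. Below is a minimal sketch of that mechanism using a locally trained stand-in tokenizer (the corpus and vocabulary size are hypothetical, chosen only so the example runs offline).

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical tiny corpus; the real tokenizer was trained on the book cited above
corpus = ["out zistoir lé zoli", "zot tout i koz kréol la Rénion"]

tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
tok.train_from_iterator(
    corpus,
    trainers.BpeTrainer(
        special_tokens=["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"],
        vocab_size=100,
    ),
)

# Pad every sequence in a batch to the same length with [PAD]
tok.enable_padding(pad_id=tok.token_to_id("[PAD]"), pad_token="[PAD]")
batch = tok.encode_batch(["out zistoir lé zoli", "zot tout"])

# After padding, all encodings in the batch have equal length
print([len(e.ids) for e in batch])
```

The same behavior is available through `transformers` by passing `padding=True` when calling the tokenizer on a list of sentences.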
|
|
|
|
|
## Author

Hugo How-Choong