tfs-mt
Transformer from scratch for Machine Translation
This project implements the Transformer architecture from scratch, with machine translation as the use case. It is mainly intended as an educational resource, providing a functional implementation of the architecture together with the training and inference logic.
Here you can find the weights of the trained small-size Transformer and the pretrained tokenizers.
Quick Start
pip install tfs-mt
import torch

from tfs_mt.architecture import build_model
from tfs_mt.data_utils import WordTokenizer
from tfs_mt.decoding_utils import greedy_decoding

# Pretrained tokenizers and weights are hosted on the Hugging Face Hub.
base_url = "https://huggingface.co/giovo17/tfs-mt/resolve/main/"
src_tokenizer = WordTokenizer.from_pretrained(base_url + "src_tokenizer_word.json")
tgt_tokenizer = WordTokenizer.from_pretrained(base_url + "tgt_tokenizer_word.json")

# Build the model from the locked training config and load the pretrained weights.
model = build_model(
    config="https://huggingface.co/giovo17/tfs-mt/resolve/main/config-lock.yaml",
    from_pretrained=True,
    model_path="https://huggingface.co/giovo17/tfs-mt/resolve/main/model.safetensors",
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Tokenize the source sentence and translate it with greedy decoding.
input_tokens, input_mask = src_tokenizer.encode("Hi, how are you?")
output = greedy_decoding(model, tgt_tokenizer, input_tokens, input_mask)[0]
print(output)
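For intuition, greedy decoding simply feeds the model its own most likely token at every step until the end-of-sequence token is produced. The sketch below illustrates the idea only; it is not the package's `greedy_decoding`, and the `model(src, tgt, src_mask)` call signature, the `sos_id`/`eos_id` arguments, and the 131-token limit are assumptions made for the example.

```python
import torch

def greedy_decode_sketch(model, src_tokens, src_mask, sos_id, eos_id, max_len=131):
    """Illustrative greedy decoding loop (not the library's implementation)."""
    device = src_tokens.device
    # Start every target sequence with the start-of-sequence token.
    tgt_tokens = torch.full((src_tokens.size(0), 1), sos_id, dtype=torch.long, device=device)
    for _ in range(max_len - 1):
        # Assumed forward signature: logits of shape (batch, tgt_len, vocab_size).
        logits = model(src_tokens, tgt_tokens, src_mask)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        tgt_tokens = torch.cat([tgt_tokens, next_token], dim=1)
        if (next_token == eos_id).all():  # stop once every sequence has emitted </s>
            break
    return tgt_tokens
```

A beam-search or sampling decoder would replace the `argmax` step, but greedy decoding keeps the quick-start example deterministic and fast.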
Model Architecture
Model Size: small
- Encoder Layers: 6
- Decoder Layers: 6
- Model Dimension: 100
- Attention Heads: 6
- FFN Dimension: 400
- Normalization Type: postnorm
- Dropout: 0.1
- Pretrained Embeddings: GloVe
- Positional Embeddings: sinusoidal (illustrated below)
- GloVe Version: glove.2024.wikigiga.100d
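As a reference for the sinusoidal positional embeddings listed above, the snippet below is a minimal sketch of the standard encoding from "Attention Is All You Need", instantiated with this model's dimensions (d_model = 100, max sequence length 131). It is illustrative and not necessarily the exact code in `tfs_mt.architecture`.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int = 131, d_model: int = 100) -> torch.Tensor:
    """Standard sinusoidal positional encodings (Vaswani et al., 2017)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # Frequencies 1 / 10000^(2i / d_model), shared by each sine/cosine pair.
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe  # (max_len, d_model), added to the token embeddings
```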
Tokenizer
- Type: word (vocabulary construction sketched after this list)
- Max Sequence Length: 131
- Max Vocabulary Size: 70000
- Minimum Frequency: 2
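To make the vocabulary settings concrete, here is a hedged sketch of building a word-level vocabulary with a frequency cutoff and a size cap. The special tokens come from config-lock.yaml below; the function itself is illustrative and not necessarily how `tfs_mt.data_utils.WordTokenizer` is implemented.

```python
from collections import Counter

SPECIALS = ["<PAD>", "<UNK>", "<s>", "</s>"]  # special tokens from config-lock.yaml

def build_word_vocab(sentences, max_vocab_size=70000, min_freq=2):
    """Illustrative word-level vocabulary construction (not the package's exact code)."""
    counts = Counter(token for sent in sentences for token in sent.split())
    # Keep the most frequent words seen at least `min_freq` times,
    # reserving the first slots for the special tokens.
    kept = [w for w, c in counts.most_common(max_vocab_size - len(SPECIALS)) if c >= min_freq]
    return {token: idx for idx, token in enumerate(SPECIALS + kept)}
```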
Dataset
- Task: machine-translation
- Dataset ID: Helsinki-NLP/europarl (see the loading example after this list)
- Dataset Name: en-it
- Source Language: en
- Target Language: it
- Train Split: 0.95
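For reference, the corpus can be pulled with the Hugging Face `datasets` library. The snippet below is a sketch of loading and splitting Europarl en-it according to the configuration above (95/5 split, seed 42 from config-lock.yaml); the repository's own data pipeline may differ in the details.

```python
from datasets import load_dataset

# Helsinki-NLP/europarl with the en-it configuration, as in config-lock.yaml.
raw = load_dataset("Helsinki-NLP/europarl", "en-it", split="train")

# 95% train / 5% test split with the config's seed.
splits = raw.train_test_split(test_size=0.05, seed=42)
train_ds, test_ds = splits["train"], splits["test"]

print(train_ds[0]["translation"])  # typically a dict like {"en": "...", "it": "..."}
```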
Full training configuration
The complete config-lock.yaml used for training:
seed: 42
log_every_iters: 1000
save_every_iters: 10000
eval_every_iters: 10000
update_pbar_every_iters: 100
time_limit_sec: -1
checkpoints_retain_n: 5
model_base_name: tfs_mt
model_parameters:
  dropout: 0.1
  model_configs:
    pretrained_word_embeddings: GloVe
    positional_embeddings: sinusoidal
    nano:
      num_encoder_layers: 4
      num_decoder_layers: 4
      d_model: 50
      num_heads: 4
      d_ff: 200
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.50d
      glove_filename: wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined
    small:
      num_encoder_layers: 6
      num_decoder_layers: 6
      d_model: 100
      num_heads: 6
      d_ff: 400
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.100d
      glove_filename: wiki_giga_2024_100_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05.050_combined
    base:
      num_encoder_layers: 8
      num_decoder_layers: 8
      d_model: 300
      num_heads: 8
      d_ff: 800
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.300d
      glove_filename: wiki_giga_2024_300_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05_combined
    original:
      num_encoder_layers: 6
      num_decoder_layers: 6
      d_model: 512
      num_heads: 8
      d_ff: 2048
      norm_type: postnorm
training_hp:
  num_epochs: 5
  distributed_training: false
  use_amp: true
  amp_dtype: bfloat16
  torch_compile_mode: max-autotune
  loss:
    type: KLdiv-labelsmoothing
    label_smoothing: 0.1
  optimizer:
    type: AdamW
    weight_decay: 0.0001
    beta1: 0.9
    beta2: 0.999
    eps: 1.0e-08
  lr_scheduler:
    type: original
    min_lr: 0.0003
    max_lr: 0.001
    warmup_iters: 25000
    stable_iters_prop: 0.7
  max_gradient_norm: 5.0
  early_stopping:
    enabled: false
    patience: 40000
    min_delta: 1.0e-05
tokenizer:
  type: word
  sos_token: <s>
  eos_token: </s>
  pad_token: <PAD>
  unk_token: <UNK>
  max_seq_len: 131
  max_vocab_size: 70000
  vocab_min_freq: 2
dataset:
  dataset_task: machine-translation
  dataset_id: Helsinki-NLP/europarl
  dataset_name: en-it
  train_split: 0.95
  src_lang: en
  tgt_lang: it
  max_len: -1
train_dataloader:
  batch_size: 64
  num_workers: 8
  shuffle: true
  drop_last: true
  prefetch_factor: 2
  pad_all_to_max_len: true
test_dataloader:
  batch_size: 128
  num_workers: 8
  shuffle: false
  drop_last: false
  prefetch_factor: 2
  pad_all_to_max_len: true
backend: none
chosen_model_size: small
model_name: tfs_mt_small_251104-1748
exec_mode: dev
src_tokenizer_vocab_size: 70000
tgt_tokenizer_vocab_size: 70000
num_train_iters_per_epoch: 28889
num_test_iters_per_epoch: 761
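The `loss` entry above (`KLdiv-labelsmoothing`, smoothing 0.1) corresponds to the label-smoothed KL-divergence objective used in the original Transformer paper. The snippet below is a simplified sketch of that objective, not the repository's exact implementation; in particular, it spreads the smoothing mass uniformly over all non-gold classes.

```python
import torch
import torch.nn.functional as F

def label_smoothed_kl_loss(logits, targets, pad_id, smoothing=0.1):
    """Illustrative label-smoothed KL-divergence loss (sketch, not the repo's code).

    logits:  (batch * seq_len, vocab_size) raw decoder outputs
    targets: (batch * seq_len,) gold token ids
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Smoothed target distribution: 1 - smoothing on the gold token,
    # smoothing spread uniformly over the remaining classes.
    true_dist = torch.full_like(log_probs, smoothing / (vocab_size - 1))
    true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)

    # Average over non-padding positions only.
    mask = targets != pad_id
    loss = F.kl_div(log_probs, true_dist, reduction="none").sum(dim=-1)
    return (loss * mask).sum() / mask.sum()
```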
License
These model weights are licensed under the MIT License.
The base embedding weights used for training were sourced from GloVe and are licensed under the ODC Public Domain Dedication and License (PDDL). Since the PDDL allows unrestricted modification and redistribution, this derivative work is released under the MIT License.
Citation
If you use tfs-mt in your research or project, please cite:
@software{Spadaro_tfs-mt,
  author = {Spadaro, Giovanni},
  licenses = {MIT, CC BY-SA 4.0},
  title = {{tfs-mt}},
  url = {https://github.com/Giovo17/tfs-mt}
}