tfs-mt
Transformer from scratch for Machine Translation


This project implements the Transformer architecture from scratch, using machine translation as the use case. It is intended mainly as an educational resource and provides a working implementation of the architecture together with the training and inference logic.

Here you can find the weights of the trained small-size Transformer and the pretrained tokenizers.

Quick Start

pip install tfs-mt

import torch

from tfs_mt.architecture import build_model
from tfs_mt.data_utils import WordTokenizer
from tfs_mt.decoding_utils import greedy_decoding

base_url = "https://huggingface.co/giovo17/tfs-mt/resolve/main/"
src_tokenizer = WordTokenizer.from_pretrained(base_url + "src_tokenizer_word.json")
tgt_tokenizer = WordTokenizer.from_pretrained(base_url + "tgt_tokenizer_word.json")

model = build_model(
    config=base_url + "config-lock.yaml",
    from_pretrained=True,
    model_path=base_url + "model.safetensors",
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

input_tokens, input_mask = src_tokenizer.encode("Hi, how are you?")

output = greedy_decoding(model, tgt_tokenizer, input_tokens, input_mask)[0]
print(output)
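
Conceptually, greedy decoding generates the translation one token at a time, always picking the most probable next token until the end-of-sequence token is produced. A minimal sketch of the idea follows; the model call signature and the tokenizer helper used here are assumptions for illustration, not necessarily the exact API of tfs_mt.decoding_utils.

import torch

def greedy_decode_sketch(model, tgt_tokenizer, src_tokens, src_mask, max_len=131):
    # Hypothetical sketch: assumes model(src, tgt, src_mask, tgt_mask) returns
    # logits of shape (batch, tgt_len, vocab_size) and that the target tokenizer
    # exposes ids for its special tokens. The real implementation may differ.
    sos_id = tgt_tokenizer.token_to_id("<s>")    # assumed helper
    eos_id = tgt_tokenizer.token_to_id("</s>")   # assumed helper
    tgt = torch.tensor([[sos_id]], device=src_tokens.device)
    for _ in range(max_len - 1):
        logits = model(src_tokens, tgt, src_mask, None)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most probable next token
        tgt = torch.cat([tgt, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tgt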

Model Architecture

Model Size: small

  • Encoder Layers: 6
  • Decoder Layers: 6
  • Model Dimension: 100
  • Attention Heads: 6
  • FFN Dimension: 400
  • Normalization Type: postnorm
  • Dropout: 0.1
  • Pretrained Embeddings: GloVe
  • Positional Embeddings: sinusoidal (see the sketch after this list)
  • GloVe Version: glove.2024.wikigiga.100d
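
Positional information is injected with the fixed sinusoidal encoding from the original Transformer paper. A self-contained sketch of that standard formula, using the small model's dimensions, is given below; the package's own implementation may differ in details.

import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # Standard fixed sinusoidal encoding from "Attention Is All You Need".
    position = torch.arange(max_len).unsqueeze(1)                        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                      # (max_len, d_model)

pe = sinusoidal_positional_encoding(max_len=131, d_model=100)  # dimensions of the small model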

Tokenizer

  • Type: word
  • Max Sequence Length: 131
  • Max Vocabulary Size: 70000
  • Minimum Frequency: 2
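
The vocabulary limits above (at most 70000 word types, each seen at least twice) are applied when the tokenizer is trained. A rough sketch of how such a word-level vocabulary is typically built follows; the actual WordTokenizer in tfs_mt.data_utils may differ in details, and the special-token names are taken from config-lock.yaml below.

from collections import Counter

def build_word_vocab(sentences, max_vocab_size=70000, min_freq=2,
                     specials=("<PAD>", "<UNK>", "<s>", "</s>")):
    # Illustrative only: count whitespace-separated words, drop rare ones,
    # keep the most frequent types, and reserve slots for the special tokens.
    counts = Counter(tok for s in sentences for tok in s.split())
    kept = [w for w, c in counts.most_common() if c >= min_freq]
    kept = kept[: max_vocab_size - len(specials)]
    return {tok: i for i, tok in enumerate(list(specials) + kept)}

vocab = build_word_vocab(["hi , how are you ?", "how are you ?"])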

Dataset

  • Task: machine-translation
  • Dataset ID: Helsinki-NLP/europarl
  • Dataset Name: en-it
  • Source Language: en
  • Target Language: it
  • Train Split: 0.95
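
The corpus can be pulled directly from the Hugging Face Hub. A sketch of loading it and reproducing the 0.95/0.05 split is shown below; the field layout assumed here is the usual translation-dict format of the OPUS datasets, and the exact preprocessing done by tfs-mt may differ.

from datasets import load_dataset

raw = load_dataset("Helsinki-NLP/europarl", "en-it", split="train")
splits = raw.train_test_split(train_size=0.95, seed=42)   # seed 42 as in config-lock.yaml
train_set, test_set = splits["train"], splits["test"]

example = train_set[0]["translation"]
print(example["en"], "->", example["it"])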

Full training configuration

Complete config-lock.yaml:
seed: 42
log_every_iters: 1000
save_every_iters: 10000
eval_every_iters: 10000
update_pbar_every_iters: 100
time_limit_sec: -1
checkpoints_retain_n: 5
model_base_name: tfs_mt
model_parameters:
  dropout: 0.1
model_configs:
  pretrained_word_embeddings: GloVe
  positional_embeddings: sinusoidal
  nano:
    num_encoder_layers: 4
    num_decoder_layers: 4
    d_model: 50
    num_heads: 4
    d_ff: 200
    norm_type: postnorm
    glove_version: glove.2024.wikigiga.50d
    glove_filename: wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined
  small:
    num_encoder_layers: 6
    num_decoder_layers: 6
    d_model: 100
    num_heads: 6
    d_ff: 400
    norm_type: postnorm
    glove_version: glove.2024.wikigiga.100d
    glove_filename: wiki_giga_2024_100_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05.050_combined
  base:
    num_encoder_layers: 8
    num_decoder_layers: 8
    d_model: 300
    num_heads: 8
    d_ff: 800
    norm_type: postnorm
    glove_version: glove.2024.wikigiga.300d
    glove_filename: wiki_giga_2024_300_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05_combined
  original:
    num_encoder_layers: 6
    num_decoder_layers: 6
    d_model: 512
    num_heads: 8
    d_ff: 2048
    norm_type: postnorm
training_hp:
  num_epochs: 5
  distributed_training: false
  use_amp: true
  amp_dtype: bfloat16
  torch_compile_mode: max-autotune
  loss:
    type: KLdiv-labelsmoothing
    label_smoothing: 0.1
  optimizer:
    type: AdamW
    weight_decay: 0.0001
    beta1: 0.9
    beta2: 0.999
    eps: 1.0e-08
  lr_scheduler:
    type: original
    min_lr: 0.0003
    max_lr: 0.001
    warmup_iters: 25000
    stable_iters_prop: 0.7
  max_gradient_norm: 5.0
  early_stopping:
    enabled: false
    patience: 40000
    min_delta: 1.0e-05
tokenizer:
  type: word
  sos_token: <s>
  eos_token: </s>
  pad_token: <PAD>
  unk_token: <UNK>
  max_seq_len: 131
  max_vocab_size: 70000
  vocab_min_freq: 2
dataset:
  dataset_task: machine-translation
  dataset_id: Helsinki-NLP/europarl
  dataset_name: en-it
  train_split: 0.95
  src_lang: en
  tgt_lang: it
  max_len: -1
train_dataloader:
  batch_size: 64
  num_workers: 8
  shuffle: true
  drop_last: true
  prefetch_factor: 2
  pad_all_to_max_len: true
test_dataloader:
  batch_size: 128
  num_workers: 8
  shuffle: false
  drop_last: false
  prefetch_factor: 2
  pad_all_to_max_len: true
backend: none
chosen_model_size: small
model_name: tfs_mt_small_251104-1748
exec_mode: dev
src_tokenizer_vocab_size: 70000
tgt_tokenizer_vocab_size: 70000
num_train_iters_per_epoch: 28889
num_test_iters_per_epoch: 761
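
The loss entry in the configuration above (KLdiv-labelsmoothing with a smoothing factor of 0.1) is the label-smoothed KL-divergence objective used in the original Transformer. A minimal sketch of how such a loss is commonly implemented is given below; the class name and its exact details are illustrative, not the package's actual code.

import torch
import torch.nn as nn

class LabelSmoothingKLLoss(nn.Module):
    # Illustrative sketch: spreads a smoothing mass of 0.1 over the non-target,
    # non-padding vocabulary entries and compares against the model's log-probs.
    def __init__(self, vocab_size, pad_id, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.vocab_size = vocab_size
        self.pad_id = pad_id
        self.smoothing = smoothing

    def forward(self, log_probs, target):
        # log_probs: (N, vocab_size) log-probabilities, target: (N,) token ids
        true_dist = torch.full_like(log_probs, self.smoothing / (self.vocab_size - 2))
        true_dist.scatter_(1, target.unsqueeze(1), 1.0 - self.smoothing)
        true_dist[:, self.pad_id] = 0.0
        true_dist[target == self.pad_id] = 0.0   # zero out rows for padding targets
        return self.criterion(log_probs, true_dist)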

License

These model weights are licensed under the MIT License.

The base weights used for training were sourced from GloVe and are licensed under the ODC Public Domain Dedication and License (PDDL). As the PDDL allows unrestricted modification and redistribution, this derivative work is released under the MIT License.

Citation

If you use tfs-mt in your research or project, please cite:

@software{Spadaro_tfs-mt,
  author = {Spadaro, Giovanni},
  licenses = {MIT, CC BY-SA 4.0},
  title = {{tfs-mt}},
  url = {https://github.com/Giovo17/tfs-mt}
}