
VocADT is a vocabulary adaptation method that uses adapter modules trained to learn an optimal linear combination of the existing embeddings while keeping the model's weights fixed. VocADT offers a flexible and scalable solution that requires no external resources or language constraints.
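As a rough, minimal sketch of that idea (toy dimensions and a plain linear adapter are illustrative assumptions here, not the paper's exact parameterization):

```python
import torch
import torch.nn as nn

# Toy sizes; the real models map a ~32k base vocabulary to a 50k one.
vocab_old, vocab_new, d_model = 100, 160, 16

E_old = torch.randn(vocab_old, d_model)  # frozen base-model embeddings
A = nn.Parameter(torch.randn(vocab_new, vocab_old) / vocab_old)  # trainable adapter

# Only A receives gradients during training; all transformer weights stay frozen.
E_new = A @ E_old  # (vocab_new, d_model): each new embedding is a learned mix of old ones
print(E_new.shape)
```

Because the frozen transformer only ever sees embedding vectors of the same dimensionality, swapping the embedding table in this way leaves the rest of the network untouched.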

## New Vocabulary Adapted Models

Only the input/output embeddings are replaced; all other weights of the base model remain fixed. The models below are merged versions: after training the adapters, we merge the original embeddings with the adapter to produce the new embeddings.
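Conceptually, the merge step bakes the trained adapter into standalone embedding tables. A minimal sketch, assuming a single adapter matrix saved as a hypothetical `adapter.pt` (the checkpoints in the table below already ship merged, so none of this is needed to use them):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
A = torch.load("adapter.pt")  # hypothetical trained adapter, shape (50000, 32000)

E_old = model.get_input_embeddings().weight.detach()   # original input embeddings
H_old = model.get_output_embeddings().weight.detach()  # original output embeddings

model.resize_token_embeddings(A.shape[0])  # grow both tables to the new vocab size
with torch.no_grad():
    model.get_input_embeddings().weight.copy_(A @ E_old)   # merged input embeddings
    model.get_output_embeddings().weight.copy_(A @ H_old)  # merged output embeddings
```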

| Name | Adapted Model | Base Model | New Vocab Size | Focused Languages |
|------|---------------|------------|----------------|-------------------|
| VocADT-Latin-Mistral | h-j-han/Mistral-7B-VocADT-50k-Latin | Mistral | 50k | Swahili (sw), Indonesian (id), Estonian (et), Haitian Creole (ht), English (en) |
| VocADT-Mixed-Mistral | h-j-han/Mistral-7B-VocADT-50k-Mixed | Mistral | 50k | Korean (ko), Greek (el), Russian (ru), Bulgarian (bg), English (en) |
| VocADT-Cyrillic-Mistral | h-j-han/Mistral-7B-VocADT-50k-Cyrillic | Mistral | 50k | Russian (ru), Bulgarian (bg), Ukrainian (uk), Kazakh (kk), English (en) |
| VocADT-All-Mistral | h-j-han/Mistral-7B-VocADT-50k-All | Mistral | 50k | Swahili (sw), Indonesian (id), Estonian (et), Haitian Creole (ht), Korean (ko), Greek (el), Russian (ru), Bulgarian (bg), Ukrainian (uk), Kazakh (kk), English (en) |
| VocADT-Latin-Llama | h-j-han/Llama2-7B-VocADT-50k-Latin | Llama 2 | 50k | Swahili (sw), Indonesian (id), Estonian (et), Haitian Creole (ht), English (en) |
| VocADT-Mixed-Llama | h-j-han/Llama2-7B-VocADT-50k-Mixed | Llama 2 | 50k | Korean (ko), Greek (el), Russian (ru), Bulgarian (bg), English (en) |
| VocADT-Cyrillic-Llama | h-j-han/Llama2-7B-VocADT-50k-Cyrillic | Llama 2 | 50k | Russian (ru), Bulgarian (bg), Ukrainian (uk), Kazakh (kk), English (en) |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "mistralai/Mistral-7B-v0.1"  # Base Model
model_name = "h-j-han/Mistral-7B-VocADT-50k-All"  # Vocabulary Adapted Model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prefix = "\nEnglish: Hello \nKorean: ์•ˆ๋…•ํ•˜์„ธ์š” \nEnglish: Thank you\nKorean: ๊ณ ๋ง™์Šต๋‹ˆ๋‹ค\nEnglish: "
line = "I'm a student."
suffix = "\nKorean:"
prompt = prefix + line + suffix

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Base Model Output: "๋‚˜๋Š” ํ•™"  # an incomplete Korean phrase: 5 tokens go only this far with the base vocabulary
# VocADT Output: "์ €๋Š” ํ•™์ƒ์ž…๋‹ˆ๋‹ค."  # a complete, fluent sentence within the same 5 tokens
```
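One way to see why five new tokens are enough for the adapted model is to compare how many tokens each tokenizer spends on the Korean target (a quick check; exact counts may vary with the tokenizer version):

```python
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
vocadt_tok = AutoTokenizer.from_pretrained("h-j-han/Mistral-7B-VocADT-50k-All")

text = "์ €๋Š” ํ•™์ƒ์ž…๋‹ˆ๋‹ค."
print(len(base_tok.tokenize(text)))    # many byte/subword pieces with the base vocabulary
print(len(vocadt_tok.tokenize(text)))  # far fewer pieces with the adapted 50k vocabulary
```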

## Reference

We provide the code in our GitHub repo: https://github.com/h-j-han/VocADT. Further details are in the paper:

```bibtex
@misc{han2024vocadt,
      title={Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?},
      author={HyoJung Han and Akiko Eriguchi and Haoran Xu and Hieu Hoang and Marine Carpuat and Huda Khayrallah},
      year={2024},
      eprint={2410.09644},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.09644},
}
```