🧚🏻‍♀️ brown-fairy-base-v0 Model Card


Fairies are among the most enchanting and magical beings in folklore and mythology. They appear across countless cultures and stories, from ancient forests to modern gardens. They are celebrated for their ability to bridge the mundane and magical realms, known for their ethereal grace and transformative powers. Fairies are tiny, higher-dimensional beings that can interact with the world in ways that are beyond our understanding.

The fairy series of models is an attempt to tune the beetle series of models to be more suitable for downstream tasks. These models are meant to be fully open experiments in building state-of-the-art static embeddings.

The brown-fairy-base-v0 model is a distillation of the BAAI/bge-base-en-v1.5 model into the brown-beetle-base-v0 model. No PCA or Zipf weighting was applied to this model.

Installation

Install model2vec using pip:

pip install model2vec

Usage

Load this model using the from_pretrained method:

from model2vec import StaticModel

# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("bhavnicksm/brown-fairy-base-v0")

# Compute text embeddings
embeddings = model.encode(["Example sentence"])
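Once embeddings are computed, a common next step is to compare them. The following is a minimal, dependency-free sketch of cosine similarity. The name `cosine_similarity` is a hypothetical helper and not part of Model2Vec, and the sketch assumes each embedding is a sequence of floats, such as one row of the array returned by `model.encode`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)
```

After encoding two or more sentences, `cosine_similarity(embeddings[0], embeddings[1])` would score the first two against each other.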

Read more about the Model2Vec library at https://github.com/MinishLab/model2vec.

Reproduce this model

This model was trained on a subset of 2 million texts from the FineWeb-Edu dataset, which were labeled by the BAAI/bge-base-en-v1.5 model.
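The exact data preparation is not shown in this card, so the following is only a sketch of one way such a labeled dataset might be assembled. It assumes the sentence-transformers convention of a text column plus a `label` column holding the target embedding. `build_distillation_rows`, `teacher_encode`, and `toy_teacher` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Sequence


def build_distillation_rows(
    texts: Sequence[str],
    teacher_encode: Callable[[Sequence[str]], Sequence[Sequence[float]]],
) -> List[Dict[str, object]]:
    """Pair each text with the teacher embedding that serves as its target."""
    embeddings = teacher_encode(texts)
    if len(embeddings) != len(texts):
        raise ValueError("teacher must return one embedding per text")
    return [
        # Column names are an assumption, not taken from the original pipeline.
        {"sentence": text, "label": [float(v) for v in emb]}
        for text, emb in zip(texts, embeddings)
    ]


# Toy stand-in for the teacher so the sketch runs without downloading a model.
def toy_teacher(texts: Sequence[str]) -> List[List[float]]:
    return [[float(len(t)), float(t.count(" "))] for t in texts]


rows = build_distillation_rows(["a b", "hello"], toy_teacher)
```

In practice, `teacher_encode` could be the `encode` method of a SentenceTransformer loaded from BAAI/bge-base-en-v1.5, and the rows could be turned into a dataset with `datasets.Dataset.from_list(rows)`.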

Training Code

Note: The datasets need to be made separately and loaded with the datasets library.

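from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import NanoBEIREvaluator
from sentence_transformers.losses import MSELoss
from sentence_transformers.models import StaticEmbedding
from sentence_transformers.training_args import BatchSamplers

# Start from the brown-beetle-base-v0 static embeddings
static_embedding = StaticEmbedding.from_model2vec("bhavnicksm/brown-beetle-base-v0")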
model = SentenceTransformer(
    modules=[static_embedding]
)

loss = MSELoss(model)

run_name = "brown-fairy-base-v0"
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir=f"output/{run_name}",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=2048,
    per_device_eval_batch_size=2048,
    learning_rate=1e-1,
    warmup_ratio=0.1,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=5,
    logging_steps=50,
    logging_first_step=True,
    run_name=run_name, 
)

# Baseline evaluation before training
evaluator = NanoBEIREvaluator()
evaluator(model)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()

evaluator(model)

model.save_pretrained(f"output/{run_name}")
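The `MSELoss` used in the training code penalizes the squared difference between the embedding the student produces and the target embedding supplied as the label. As a rough illustration of that objective only (the library computes it on tensors), here is a pure-Python sketch. The function name `mse` is hypothetical, and averaging over every element of the batch is an assumption of this sketch.

```python
from typing import Sequence


def mse(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean squared error averaged over every element of the batch."""
    if len(student) != len(teacher):
        raise ValueError("batches must contain the same number of vectors")
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("student and teacher dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("cannot compute a loss over an empty batch")
    return total / count
```

The dimension check reflects a practical constraint of this kind of distillation: the student and target embeddings need the same dimensionality for the element-wise difference to be defined.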

Comparison with other models

Coming soon...

Acknowledgements

This model is based on the Model2Vec library. Credit goes to the Minish Lab team for developing this library.

Citation

This model builds on work done by Minish Lab. Please cite the Model2Vec repository if you use this model in your work.

@software{minishlab2024model2vec,
  author = {Stephan Tulkens and Thomas van Dongen},
  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec},
}