
🧠 Text Similarity Model using Sentence-BERT

This project fine-tunes a Sentence-BERT model (paraphrase-MiniLM-L6-v2) on the English split of the STS Benchmark dataset (stsb_multi_mt) to score the semantic similarity between two text inputs.


🚀 Features

  • 🔁 Fine-tunes sentence-transformers/paraphrase-MiniLM-L6-v2
  • 🔧 Trained on the stsb_multi_mt dataset (English split)
  • 🧪 Predicts cosine similarity between sentence pairs (0 to 1)
  • ⚙️ Uses a custom PyTorch model and manual training loop
  • 💾 Model is saved as similarity_model.pt
  • 🧠 Supports inference on custom sentence pairs

📦 Dependencies

Install required libraries:

```bash
pip install -q transformers datasets sentence-transformers evaluate --upgrade
```

📊 Dataset

  • Dataset: stsb_multi_mt
  • Split: "en"
  • Purpose: Provides sentence pairs with similarity scores ranging from 0 to 5, which are normalized to 0–1 for training.

```python
from datasets import load_dataset

dataset = load_dataset("stsb_multi_mt", name="en", split="train")
dataset = dataset.shuffle(seed=42).select(range(10000))  # Sample subset for faster training
```
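
The raw similarity scores are on a 0–5 scale; a minimal normalization sketch (assuming the dataset's `sentence1` / `sentence2` / `similarity_score` column names):

```python
# Scale the 0–5 similarity_score column down to the 0–1 range used as the training target
dataset = dataset.map(lambda ex: {"label": ex["similarity_score"] / 5.0})
```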

🏗️ Model Architecture

✅ Base Model

  • sentence-transformers/paraphrase-MiniLM-L6-v2 (from Hugging Face)

✅ Fine-Tuning

  • Cosine similarity computed between the CLS token embeddings of two inputs

  • Loss: Mean Squared Error (MSE) between predicted similarity and true score
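
A minimal sketch of how this setup could look as a custom PyTorch module (the `SimilarityModel` name and structure are illustrative, not the project's exact code):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SimilarityModel(nn.Module):
    def __init__(self, base_name="sentence-transformers/paraphrase-MiniLM-L6-v2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_name)

    def forward(self, enc_a, enc_b):
        # CLS token embedding = first token of the last hidden state
        emb_a = self.encoder(**enc_a).last_hidden_state[:, 0]
        emb_b = self.encoder(**enc_b).last_hidden_state[:, 0]
        # Cosine similarity of the two embeddings; trained against the 0–1 target with MSE
        return torch.nn.functional.cosine_similarity(emb_a, emb_b)
```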

🧠 Training

  • Epochs: 3

  • Optimizer: Adam

  • Loss: MSELoss

  • Manual training loop using PyTorch
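
A condensed sketch of that manual loop (batch size, learning rate, and tokenization details are illustrative assumptions; `SimilarityModel` refers to the sketch in the architecture section):

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
model = SimilarityModel()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = torch.nn.MSELoss()
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for epoch in range(3):
    for batch in loader:
        enc_a = tokenizer(batch["sentence1"], padding=True, truncation=True, return_tensors="pt")
        enc_b = tokenizer(batch["sentence2"], padding=True, truncation=True, return_tensors="pt")
        target = batch["label"].float()   # normalized 0–1 scores
        pred = model(enc_a, enc_b)        # predicted cosine similarity
        loss = loss_fn(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "similarity_model.pt")
```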

Files and Structure

```
📦 text-similarity-project
 ┣ 📜 similarity_model.pt    # Trained PyTorch model
 ┣ 📜 training_script.py     # Full training and inference script
 ┗ 📜 README.md              # Documentation
```
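
🔍 Inference Example

Scoring a custom sentence pair with the saved weights (a sketch; reuses the hypothetical `SimilarityModel` class from the architecture section):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
model = SimilarityModel()
model.load_state_dict(torch.load("similarity_model.pt", map_location="cpu"))
model.eval()

sent_a = "A man is playing a guitar."
sent_b = "Someone is strumming a guitar."
with torch.no_grad():
    enc_a = tokenizer([sent_a], padding=True, truncation=True, return_tensors="pt")
    enc_b = tokenizer([sent_b], padding=True, truncation=True, return_tensors="pt")
    score = model(enc_a, enc_b).item()
print(f"Similarity: {score:.3f}")
```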
