Fine-Tuned BLIP Model for Fashion Image Captioning
This is a fine-tuned BLIP (Bootstrapping Language-Image Pre-training) model for fashion image captioning. It was fine-tuned on the Marqo Fashion Dataset to generate descriptive, contextually relevant captions for fashion-related images.
Model Details
- Model Type: BLIP (Vision-Language Pretraining)
- Architecture: BLIP uses a multimodal transformer architecture to jointly model visual and textual information.
- Fine-Tuning Dataset: Marqo Fashion Dataset (fashion images paired with descriptive captions)
- Task: Fashion Image Captioning
- License: Apache 2.0
Usage
You can use this model with the Hugging Face transformers library for fashion image captioning, as shown in the example below.
Installation
First, install the required libraries:
pip install transformers torch
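Below is a minimal inference sketch. It assumes the model is hosted under the repo ID rcfg/FashionBLIP-1 (as listed in the model tree below) and exposes the standard BLIP conditional-generation interface from transformers; the image path is a placeholder for your own fashion image.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and the fine-tuned captioning model
# (repo ID taken from this model card; adjust if you use a local checkpoint)
processor = BlipProcessor.from_pretrained("rcfg/FashionBLIP-1")
model = BlipForConditionalGeneration.from_pretrained("rcfg/FashionBLIP-1")

# Open a fashion image (placeholder filename)
image = Image.open("fashion_item.jpg").convert("RGB")

# Preprocess the image and generate a caption
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(output_ids[0], skip_special_tokens=True)

print(caption)
```

For batches of images, you can pass a list of PIL images to the processor and decode each generated sequence in turn.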
Model Tree for rcfg/FashionBLIP-1
- Base model: Salesforce/blip-image-captioning-large