ABC
A collection of models and datasets from ABC: Achieving Better Control of Multimodal Embeddings using VLMs.
Viewer • Updated • 2.25M • 479 • 5 • Note Pretraining data for ABC-Qwen2VL-Pretrain, derived from Conceptual Captions using negative mining. For details, see the paper.
TIGER-Lab/ABC-VG-Instruct
Viewer • Updated • 12.5k • 124 • Note Instruction finetuning dataset derived from Visual Genome; it contains multiple instructions for each image, which can serve as negatives for one another during training.
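To make the negatives idea concrete, here is a minimal, illustrative PyTorch sketch (not the authors' training code) of a contrastive loss where the targets of an image's other instructions supply the negatives; the function name, shapes, and loss form are assumptions for the example.

```python
# Illustrative sketch (not the authors' training code) of using the
# targets of the same image's other instructions as negatives.
# All names, shapes, and the loss form are assumptions for the example.
import torch
import torch.nn.functional as F

def info_nce(query, candidates, temperature=0.07):
    """query: (D,) embedding of one (image, instruction) pair.
    candidates: (N, D) target embeddings; index 0 is the positive,
    indices 1..N-1 come from the image's other instructions."""
    query = F.normalize(query, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    logits = candidates @ query / temperature  # (N,) cosine similarities
    target = torch.tensor([0])                 # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

# Example: one positive plus three negatives drawn from the same
# image's other instructions.
query = torch.randn(768)
candidates = torch.randn(4, 768)
print(info_nce(query, candidates))
```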
TIGER-Lab/ABC-Qwen2VL-Pretrain
Image-Text-to-Text • Updated • 22 • 1 • Note The pretrained base adapter. It supports text and image inputs (similar to CLIP) for creating embeddings. If you are training your own adapter, use this as the base.
TIGER-Lab/ABC-Qwen2VL-Instruct
Image-Text-to-Text • Updated • 15 • Note The final instruction-finetuned model. It supports text, image, and combined image-text inputs when creating embeddings.
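As a rough usage sketch only: the following assumes the checkpoint loads as a standard Qwen2-VL model through transformers and that mean pooling of the last hidden state stands in for whatever embedding interface the authors provide; consult the model card for the intended usage.

```python
# Hedged sketch of instruction-conditioned embedding extraction with a
# Qwen2-VL backbone. The loading path and pooling are assumptions, not
# the authors' documented interface.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "TIGER-Lab/ABC-Qwen2VL-Instruct"  # assumed loadable this way
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Image plus instruction: the instruction steers which aspect of the
# image the embedding should capture.
image = Image.new("RGB", (224, 224), "red")  # placeholder image
text = "<|vision_start|><|image_pad|><|vision_end|>Focus on the main object."
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
# Mean-pool the final hidden states into a single embedding vector.
embedding = out.hidden_states[-1].mean(dim=1)
print(embedding.shape)
```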
ABC: Achieving Better Control of Multimodal Embeddings using VLMs
Paper • 2503.00329 • Published • 20