Model Card: ClipMD

Model Details

ClipMD is a medical image-text matching model based on OpenAI's CLIP model with a sliding window text encoder.

Model Description

The model uses a ViT-B/32 Transformer as its image encoder and a masked sliding-window self-attention Transformer as its text encoder. The encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
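
For intuition, here is a minimal sketch of this CLIP-style contrastive objective (illustrative only, not the exact training code; logit_scale stands in for CLIP's learned temperature):

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, logit_scale):
    # Normalize so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarities for every (image, text) combination in the batch.
    logits = logit_scale * image_embeds @ text_embeds.t()
    # Matching pairs sit on the diagonal of the similarity matrix.
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy pulls matched pairs together and pushes mismatched pairs apart.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2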

The model was fine-tuned on the ROCO (Radiology Objects in COntext) dataset.

Use with Transformers

from PIL import Image
from transformers import AutoProcessor, AutoModel

# Load the model and its processor from the Hugging Face Hub.
model = AutoModel.from_pretrained("Idan0405/ClipMD", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Idan0405/ClipMD")

# Open the image you want to match against the candidate texts.
image = Image.open("your image path")

# Preprocess the image and tokenize the candidate texts.
inputs = processor(text=["chest x-ray", "head MRI"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs[0]  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over the texts gives label probabilities
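
The probabilities follow the order of the candidate texts, so they can be read out, for example, as:

labels = ["chest x-ray", "head MRI"]
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")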
