# SMOLLM_VISON_Image_Captioner

## Overview
This project implements an image-captioning model that combines OpenAI's CLIP model with a causal language model (LLM). CLIP extracts image features, and a fine-tuned LLM generates the captions. The model is trained on the Flickr8k dataset.
## Requirements
Before running the code, ensure you have installed the necessary dependencies:
```bash
pip install transformers==4.47.0 torch opencv-python matplotlib pillow requests
```
## Model and Tokenizer Configuration
The code utilizes the following models:
- **CLIP:** `openai/clip-vit-large-patch14`
- **LLM:** `alibidaran/SMOLL_image_captioner`
- **Tokenizer:** `HuggingFaceTB/SmolLM2-360M`
## Installation and Setup

### Load Necessary Libraries
```python
from PIL import Image
import requests
import cv2
import torch
import matplotlib.pyplot as plt
from transformers import CLIPProcessor, CLIPModel, AutoTokenizer, AutoModelForCausalLM
```
### Load CLIP Model
```python
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to('cuda:0')
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
print(torch.cuda.is_available())
```
### Load Tokenizer and LLM Model
```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
llm_model = AutoModelForCausalLM.from_pretrained("alibidaran/SMOLL_image_captioner").to(device)
```
### Download Pretrained Model Weights
```bash
wget https://huggingface.co/alibidaran/SMOLL_image_captioner/resolve/main/content/SMOLL_image_captioner.pt
```
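If you prefer to stay in Python, the same checkpoint can be fetched with the `huggingface_hub` client. This is an optional alternative to `wget`; the variable name `checkpoint_path` is only for illustration.

```python
from huggingface_hub import hf_hub_download

# Download the checkpoint from the model repository into the local HF cache
checkpoint_path = hf_hub_download(
    repo_id="alibidaran/SMOLL_image_captioner",
    filename="content/SMOLL_image_captioner.pt",
)
print(checkpoint_path)  # local path to the downloaded .pt file
```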
## Image Captioning Model

### Load Model Weights
```python
from SMOLLM_VisionModel import SMOLLm_VISION_ImageCaptioning, SmoLLM_processor

# Build the captioning model around the LLM and wrap CLIP in the image processor
image_captioning_model = SMOLLm_VISION_ImageCaptioning(llm_model=llm_model, hidden_dim=4096).to('cuda')
model = image_captioning_model
processor = SmoLLM_processor(image_model=clip_model, image_processor=clip_processor)

# Load the pretrained captioning model downloaded above
saved_model = torch.load('/content/SMOLL_image_captioner.pt', map_location=torch.device('cuda'))
```
## Image Caption Generation

### Load Image and Extract Features
```python
# Path to a local image (replace with your own file)
image_path = '/content/54322546688_71515f8335_w.jpg'
image = Image.open(image_path)

# Extract CLIP image features for the captioning model
image_features = processor.get_features(image_path, device='cuda')
```
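For orientation, `SmoLLM_processor.get_features` presumably wraps a standard CLIP feature-extraction call along the lines below. This is an illustrative sketch of the idea, not the actual implementation shipped in `SMOLLM_VisionModel`; the function name `get_clip_features` is made up for this example.

```python
# Illustrative sketch (assumption): preprocess the image with the CLIP processor
# and encode it with CLIP's vision tower to obtain a single feature vector.
def get_clip_features(image_path, device='cuda'):
    image = Image.open(image_path).convert('RGB')
    inputs = clip_processor(images=image, return_tensors='pt').to(device)
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)  # [1, 768] for clip-vit-large-patch14
    return features.squeeze(0)
```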
### Generate Caption
```python
# Use the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

# Prompt format expected by the fine-tuned model
prompt = """
##User <image> Write a caption
##Assitant:"""

# Tokenize input
tokenized = tokenizer(prompt, return_tensors='pt')
label = tokenized['input_ids'].to('cuda')
att = tokenized['attention_mask'].to('cuda')

# Generate caption
with torch.no_grad():
    _, embeds = model(image_features.unsqueeze(0).to('cuda'), label, att)
    generate_kwargs = {
        "input_ids": None,
        "inputs_embeds": embeds,
        "max_new_tokens": 50,
    }
    output = saved_model.llm_model.generate(**generate_kwargs, do_sample=True, temperature=0.8, top_p=0.99, top_k=10)

# Decode and display result
print(tokenizer.decode(output[0]))
plt.imshow(image)
plt.axis('off')
plt.show()
```
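The steps above can be bundled into a small convenience function. `caption_image` is a hypothetical helper written for this README (it reuses the `processor`, `model`, `tokenizer`, and `saved_model` objects defined above); it is not part of the released package.

```python
def caption_image(image_path, max_new_tokens=50):
    """Illustrative helper: generate a caption for a local image with the loaded models."""
    features = processor.get_features(image_path, device='cuda')
    prompt = """
##User <image> Write a caption
##Assitant:"""
    tokenized = tokenizer(prompt, return_tensors='pt')
    input_ids = tokenized['input_ids'].to('cuda')
    attention_mask = tokenized['attention_mask'].to('cuda')
    with torch.no_grad():
        # Fuse the image features with the prompt embeddings, then decode with the LLM
        _, embeds = model(features.unsqueeze(0).to('cuda'), input_ids, attention_mask)
        output = saved_model.llm_model.generate(
            input_ids=None,
            inputs_embeds=embeds,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.99,
            top_k=10,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(caption_image('/content/54322546688_71515f8335_w.jpg'))
```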