SMOLLM_VISION_Image_Captioner

Overview

This project implements an image captioning model that combines OpenAI's CLIP model with a causal language model (LLM). Image features are extracted with CLIP and captions are generated by a fine-tuned LLM. The model is trained on the Flickr8k dataset.
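The captioning head itself (SMOLLm_VISION_ImageCaptioning, used later in this card) is not reproduced here. As a rough orientation, a minimal sketch of such a CLIP-to-LLM bridge, assuming a single linear projection from the CLIP image embedding into the LLM's embedding space, could look like the following; the class name, dimensions, and forward signature are illustrative assumptions, not the actual implementation:

import torch
import torch.nn as nn

class ClipToLLMBridgeSketch(nn.Module):
    """Hypothetical sketch of a CLIP-to-LLM captioning bridge (not the real class)."""
    def __init__(self, llm_model, clip_dim, hidden_dim):
        super().__init__()
        self.llm_model = llm_model
        # clip_dim: width of the CLIP image embedding (768 for ViT-L/14)
        # hidden_dim: must match the LLM's embedding width
        self.projection = nn.Linear(clip_dim, hidden_dim)

    def forward(self, image_features, input_ids, attention_mask):
        # Embed the text prompt with the LLM's own embedding table
        text_embeds = self.llm_model.get_input_embeddings()(input_ids)
        # Project the image features into a single "visual token" prefix
        image_embeds = self.projection(image_features).unsqueeze(1)   # [B, 1, hidden]
        embeds = torch.cat([image_embeds, text_embeds], dim=1)        # prefix + prompt
        prefix_mask = torch.ones(image_embeds.shape[:2], dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        full_mask = torch.cat([prefix_mask, attention_mask], dim=1)
        outputs = self.llm_model(inputs_embeds=embeds, attention_mask=full_mask)
        return outputs.logits, embeds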

Requirements

Before running the code, ensure you have installed the necessary dependencies:

pip install transformers==4.47.0 torch opencv-python matplotlib pillow requests

Model and Token Configuration

The code utilizes the following models:

  • CLIP: openai/clip-vit-large-patch14
  • LLM: alibidaran/SMOLL_image_captioner
  • Tokenizer: HuggingFaceTB/SmolLM2-360M

Installation and Setup

Load Necessary Libraries

from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
import cv2
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import matplotlib.pyplot as plt

Load CLIP Model

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to('cuda:0')
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
print(torch.cuda.is_available())
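
CLIP ViT-L/14 projects each image to a 768-dimensional embedding; if you want to confirm the feature width that will feed the captioning head, you can check the config:

# ViT-L/14 projects images into a 768-dimensional embedding space
print(clip_model.config.projection_dim)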

Load Tokenizer and LLM Model

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

# Load the fine-tuned causal LM and move it to the detected device
llm_model = AutoModelForCausalLM.from_pretrained("alibidaran/SMOLL_image_captioner").to(device)

Download Pretrained Model Weights

wget https://huggingface.co/alibidaran/SMOLL_image_captioner/resolve/main/content/SMOLL_image_captioner.pt
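
Alternatively, the same checkpoint can be fetched from Python with huggingface_hub (assuming the file sits at content/SMOLL_image_captioner.pt inside the repository, as in the URL above):

from huggingface_hub import hf_hub_download

# Downloads the checkpoint into the local Hugging Face cache and returns its path
checkpoint_path = hf_hub_download(
    repo_id="alibidaran/SMOLL_image_captioner",
    filename="content/SMOLL_image_captioner.pt",
)
print(checkpoint_path)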

Image Captioning Model

Load Model Weights

from SMOLLM_VisionModel import SMOLLm_VISION_ImageCaptioning, SmoLLM_processor

# Build the captioning wrapper around the LLM and the CLIP-based processor
image_captioning_model = SMOLLm_VISION_ImageCaptioning(llm_model=llm_model, hidden_dim=4096).to('cuda')
model = image_captioning_model
processor = SmoLLM_processor(image_model=clip_model, image_processor=clip_processor)

# Load the pretrained captioning model (a full pickled model, not just a state_dict)
saved_model = torch.load('/content/SMOLL_image_captioner.pt', map_location=torch.device('cuda'))
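
Note that the checkpoint appears to be a full pickled model rather than a plain state_dict (it is later used as saved_model.llm_model.generate). On PyTorch 2.6 and newer, torch.load defaults to weights_only=True and will refuse to unpickle a module, so you may need to pass the flag explicitly; only do this for checkpoints you trust, since unpickling executes arbitrary code:

# Only needed on PyTorch >= 2.6, where weights_only defaults to True
saved_model = torch.load(
    '/content/SMOLL_image_captioner.pt',
    map_location=torch.device('cuda'),
    weights_only=False,
)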

Image Caption Generation

Load Image and Extract Features

import cv2
import matplotlib.pyplot as plt

# Local path to the image to caption (a file path, not a remote URL)
image_url = '/content/54322546688_71515f8335_w.jpg'
image_features = processor.get_features(image_url, device='cuda')
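
The SmoLLM_processor class is not shown in this card. Presumably get_features runs the image through the CLIP processor and returns CLIP's pooled image embedding; a rough sketch of an equivalent helper (the function name and exact behavior are assumptions) would be:

def get_features_sketch(image_path, device='cuda'):
    # Hypothetical stand-in for processor.get_features()
    image = Image.open(image_path).convert('RGB')
    inputs = clip_processor(images=image, return_tensors='pt').to(device)
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)  # shape [1, 768] for ViT-L/14
    return features.squeeze(0)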

Generate Caption

tokenizer.pad_token = tokenizer.eos_token
prompt = """
        ##User <image> Write a caption
        ##Assitant:"""

# Tokenize input
tokenized = tokenizer(prompt, return_tensors='pt')
label = tokenized['input_ids'].to('cuda')
att = tokenized['attention_mask'].to('cuda')

# Generate the caption: the loaded pretrained model turns the image features and the
# prompt into input embeddings, which are then passed to the LLM's generate method
with torch.no_grad():
    _, embeds = saved_model(image_features.unsqueeze(0).to('cuda'), label, att)
    generate_kwargs = {
        "input_ids": None,
        "inputs_embeds": embeds,
        "max_new_tokens": 50,
    }
    output = saved_model.llm_model.generate(**generate_kwargs, do_sample=True, temperature=0.8, top_p=0.99, top_k=10)

# Decode and display result
print(tokenizer.decode(output[0]))
image = Image.open(image_url)
plt.imshow(image)
plt.axis('off')
plt.show()
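
For convenience, the steps above can be folded into a single helper; the function below is a hypothetical wrapper around the objects already created (prompt, processor, saved_model, tokenizer), not part of the original code:

def caption_image(image_path, max_new_tokens=50):
    # End-to-end caption generation for one local image file
    features = processor.get_features(image_path, device='cuda')
    tokenized = tokenizer(prompt, return_tensors='pt')
    label = tokenized['input_ids'].to('cuda')
    att = tokenized['attention_mask'].to('cuda')
    with torch.no_grad():
        _, embeds = saved_model(features.unsqueeze(0).to('cuda'), label, att)
        output = saved_model.llm_model.generate(
            input_ids=None,
            inputs_embeds=embeds,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.99,
            top_k=10,
        )
    return tokenizer.decode(output[0])

print(caption_image('/content/54322546688_71515f8335_w.jpg'))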