---
base_model: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- mllama
license: apache-2.0
language:
- en
datasets:
- unsloth/Radiology_mini
library_name: transformers
---

# Uploaded finetuned model

- **Developed by:** Haq Nawaz Malik
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit

# Documentation: Hnm_Llama3.2_(11B)-Vision_lora_model

## Overview

The **Hnm_Llama3.2_(11B)-Vision_lora_model** is a fine-tuned version of **Llama 3.2 (11B) Vision**, adapted with **LoRA-based parameter-efficient fine-tuning (PEFT)**. It specializes in **vision-language tasks**, particularly **medical image captioning and understanding**.

The model was fine-tuned on a **Tesla T4 GPU (Google Colab)** using **Unsloth**, a framework designed for efficient fine-tuning of large models.

---

## Features

- **Fine-tuned on Radiology Images**: trained on the **Radiology_mini** dataset.
- **Supports Image Captioning**: can describe medical images in natural language.
- **4-bit Quantization (QLoRA)**: memory efficient; runs on consumer GPUs.
- **LoRA-based PEFT**: updates only around **1% of the parameters**, significantly reducing computational cost.
- **Multi-modal Capabilities**: works with both **text and image** inputs.
- **Supports both Vision and Language fine-tuning**.

---

## Model Details

- **Base Model**: `unsloth/Llama-3.2-11B-Vision-Instruct`
- **Fine-tuning Method**: LoRA + 4-bit quantization (QLoRA), as sketched below
- **Dataset**: `unsloth/Radiology_mini`
- **Framework**: Unsloth + Hugging Face Transformers
- **Training Environment**: Google Colab (Tesla T4 GPU)
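
For reference, this is roughly how such a LoRA setup is configured with Unsloth's `FastVisionModel.get_peft_model`. The hyperparameter values below (rank, alpha, dropout, which layers to adapt) are illustrative assumptions, not the recorded training configuration of this checkpoint:

```python
from unsloth import FastVisionModel

# Load the 4-bit quantized base model (QLoRA-style memory savings)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,
)

# Attach LoRA adapters to both the vision and language stacks.
# All hyperparameter values below are illustrative placeholders.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,    # adapt the vision encoder
    finetune_language_layers=True,  # adapt the language model
    r=16,                           # LoRA rank
    lora_alpha=16,                  # LoRA scaling factor
    lora_dropout=0,
)
```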

---
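
## Getting Started

### 1. Install Dependencies

Install Unsloth, which pulls in compatible versions of `transformers` and its other requirements, in a CUDA-capable environment such as Google Colab (e.g. `pip install unsloth`).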

### 2. Load the Model

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Omarrran/Hnm_Llama3_2_Vision_lora_model",
    load_in_4bit=True,  # Set to False for full precision (needs much more VRAM)
)
```
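
After loading, switch the model into inference mode with `FastVisionModel.for_inference(model)` before generating; the usage example below does this as its first step.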

---

## Usage

### **1. Image Captioning Example**

```python
import torch
from datasets import load_dataset
from transformers import TextStreamer

FastVisionModel.for_inference(model)  # Enable inference mode

# Load a sample image from the dataset
dataset = load_dataset("unsloth/Radiology_mini", split="train")
image = dataset[0]["image"]
instruction = "Describe this medical image accurately."

# Build a multi-modal chat message: one image plus the text instruction
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]

# Render the chat template, then process the image/text pair together
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to("cuda")

# Stream the generated caption token by token
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128,
                   use_cache=True, temperature=1.5, min_p=0.1)
```
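
If you prefer the caption as a Python string rather than streamed console output, a small variant (reusing `model`, `tokenizer`, and `inputs` from the example above, and assuming the processor exposes `decode` as Mllama processors do):

```python
# Generate without a streamer, then decode only the newly generated tokens
output_ids = model.generate(**inputs, max_new_tokens=128, use_cache=True)
new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]  # drop the prompt
caption = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(caption)
```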

## Notes

- This model is optimized for vision-language tasks in the medical domain but can be adapted to other applications.
- It uses **LoRA adapters**, so it can be fine-tuned further with very modest GPU resources (see the sketch below).
- It is hosted on the **Hugging Face Model Hub** for easy deployment and sharing.
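
As a starting point for continued fine-tuning, the adapters can be reloaded and switched back into training mode. A minimal sketch, assuming the Hub repo id given in the citation below:

```python
from unsloth import FastVisionModel

# Reload the checkpoint (base model + LoRA adapters) in 4-bit
model, tokenizer = FastVisionModel.from_pretrained(
    "Omarrran/Hnm_Llama3_2_Vision_lora_model",
    load_in_4bit=True,
)

# Re-enable gradients on the LoRA layers for continued training
FastVisionModel.for_training(model)
```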

---

## Citation

If you use this model, please cite:

```bibtex
@misc{Hnm_Llama3.2_11B_Vision,
  author = {Haq Nawaz Malik},
  title  = {Fine-tuned Llama 3.2 (11B) Vision Model},
  year   = {2025},
  url    = {https://huggingface.co/Omarrran/Hnm_Llama3_2_Vision_lora_model}
}
```

---

## Contact

For questions or support, reach out via:

- **GitHub**: [Haq-Nawaz-Malik](https://github.com/Haq-Nawaz-Malik)
- **Hugging Face**: [Omarrran](https://huggingface.co/Omarrran)