---
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- diffusion
- pytorch
- transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
- conceptual_captions
- coco
model_type: VLV_decoder
---
# VLV Captioner Model
This is a VLV (Vision-Language-Vision) model for image captioning. It combines a Stable Diffusion-based image encoder with the Qwen2.5-3B language model to generate descriptive captions from images.
## Model Description
The VLV Captioner is a multimodal model that:
- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images
## Model Architecture
- **Vision Encoder**: Stable Diffusion-based image encoder with Florence2 components
- **Language Model**: Qwen2.5-3B transformer model
- **Image Size**: 384x384 pixels (see the preprocessing sketch below)
- **Max Caption Length**: 300 tokens
- **Precision**: Mixed precision (bfloat16/float32)
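The model's remote code accepts PIL images directly (see Usage below), so manual preprocessing is usually unnecessary. For reference, here is a minimal sketch of what a 384x384 resize and tensor conversion would look like if you needed to prepare inputs yourself; the interpolation mode and the absence of normalization are assumptions, not values confirmed by the released code:
```python
import torch
from PIL import Image
from torchvision import transforms

# Illustrative preprocessing matching the documented 384x384 input size.
# Normalization statistics are intentionally omitted because they are not
# documented in this card.
preprocess = transforms.Compose([
    transforms.Resize((384, 384)),  # assumed bilinear resize
    transforms.ToTensor(),          # float32 tensor in [0, 1], shape (3, 384, 384)
])

image = Image.open("path/to/your/image.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # add batch dimension

# Cast to bfloat16 on GPU to match the card's mixed-precision setting
if torch.cuda.is_available():
    pixel_values = pixel_values.to("cuda", dtype=torch.bfloat16)
```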
## Usage
### Method 1: Load from Hugging Face Hub
```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: set a custom cache directory if needed
cache_dir = "/path/to/your/cache"  # Use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with an authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    token=token,
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # Specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")

    # Load and process an image
    image = Image.open("path/to/your/image.jpg")

    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")

    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)

    # Handle different possible output formats
    if hasattr(captions, 'generated_text'):
        print("Generated caption:", captions.generated_text[0])
    elif isinstance(captions, list):
        print("Generated caption:", captions[0])
    else:
        print("Generated caption:", captions)

except Exception as e:
    print(f"Error during model loading or inference: {e}")
    # If cached files are corrupted, clear the cache and redownload
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)
    # Retry with force download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```
### Method 2: Load from original checkpoint
```python
import torch
from PIL import Image

from VLV_stage2 import VLV_MODEL

# Load from the original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)
print(captions.generated_text[0])  # Generated caption
```
## Model Details
- **Model Type**: Vision-Language Model
- **Architecture**: VLV_decoder
- **Language Backbone**: Qwen/Qwen2.5-3B
- **Vision Backbone**: Stable Diffusion + Florence2
- **Training Data**: Image-caption datasets, including Conceptual Captions and COCO
- **Framework**: PyTorch, Transformers
## Inference Configuration
- **Batch Size**: 1 (inference)
- **Learnable Token Length**: 77
- **Guidance Scale**: 7.5
- **Inference Steps**: 50
- **Beam Search**: 4 beams (see the call sketch below)
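Assuming the Hub wrapper forwards generation keyword arguments the same way the original inference script's CLI flags do (an assumption; only `max_length` appears in the Usage examples above), the defaults above translate to a call like this:
```python
import torch
from PIL import Image

# Hypothetical sketch: `model` is the AutoModel loaded in the Usage section.
# Passing num_beams mirrors the beam-search default listed above; whether the
# wrapper accepts it as a keyword argument is an assumption.
image = Image.open("path/to/your/image.jpg").convert("RGB")

with torch.no_grad():
    captions = model([image], max_length=300, num_beams=4)
```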
## Requirements
```bash
pip install torch transformers safetensors torchvision pillow diffusers
```
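After installing, a quick sanity check that the core dependencies import and that a GPU is visible (this card does not pin specific versions):
```python
import torch
import transformers
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
```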
## Troubleshooting
### Common Issues and Solutions
#### 1. Meta Tensor Issues
If you encounter meta tensor errors, avoid using `device_map="auto"` when loading the model:
```python
import torch
from transformers import AutoModel

# ❌ Don't use this - it can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```
#### 2. Cache Issues
If you experience corrupted cache files, clear the cache and redownload:
```python
import os
import shutil
from transformers import AutoModel

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```
#### 3. Authentication Issues
Make sure your Hugging Face token is properly set:
```bash
# Option 1: Environment variable
export HUGGINGFACE_TOKEN="your_token_here"
# Option 2: Hugging Face CLI login
huggingface-cli login
```
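A third option is to authenticate programmatically with `huggingface_hub` before loading the model:
```python
import os
from huggingface_hub import login

# Reads the token from the environment; pass the token string directly if you prefer
login(token=os.getenv("HUGGINGFACE_TOKEN"))
```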
#### 4. Memory Issues
For large models, use a custom cache directory with sufficient space:
```python
import os
from transformers import AutoModel

cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```
## Advanced Usage
### Batch Processing with Original Inference Script
For large-scale inference, you can use the original inference script from the training repository:
```bash
python Caption_inference.py \
--input_path /path/to/images \
--output_path captions.json \
--clip_decoder_checkpoint /path/to/model.pt \
--qwen_model Qwen/Qwen2.5-3B \
--stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
--florence2_model_path microsoft/Florence-2-large \
--batch_size 4 \
--max_length 300 \
--num_beams 4 \
--image_size 384 \
--guidance_scale 7.5 \
--use_text_encoder \
--distributed # For multi-GPU inference
```
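If you prefer to stay in Python, the following sketch batches images from a folder and writes a JSON file of captions. It assumes the Hub model loaded in the Usage section accepts a list of PIL images per call and returns either a list of strings or an object with `.generated_text`; paths and batch size are illustrative:
```python
import glob
import json
import torch
from PIL import Image

image_paths = sorted(glob.glob("/path/to/images/*.jpg"))
batch_size = 4
results = {}

for start in range(0, len(image_paths), batch_size):
    batch_paths = image_paths[start:start + batch_size]
    images = [Image.open(p).convert("RGB") for p in batch_paths]
    with torch.no_grad():
        captions = model(images, max_length=300)
    texts = captions.generated_text if hasattr(captions, "generated_text") else captions
    results.update(dict(zip(batch_paths, texts)))

with open("captions.json", "w") as f:
    json.dump(results, f, indent=2)
```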
### Configuration Parameters
- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable CLIP text encoder (recommended: True)
## Citation
```bibtex
@article{vlv_autoencoder,
title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
journal={arXiv preprint},
year={2024}
}
```
## License
This model is released under the Apache 2.0 license.