---
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- diffusion
- pytorch
- transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
- conceptual_captions
- coco
model_type: VLV_decoder
---

# VLV Captioner Model

This is a VLV (Vision-Language-Vision) model for image captioning. It combines a Stable Diffusion-based image encoder with the Qwen2.5 language model to generate descriptive captions for input images.

## Model Description

The VLV Captioner is a multimodal model that:

- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images

## Model Architecture

- **Vision Encoder**: Stable Diffusion-based image encoder with Florence2 components
- **Language Model**: Qwen2.5-3B transformer model
- **Image Size**: 384x384 pixels
- **Max Caption Length**: 300 tokens
- **Precision**: Mixed precision (bfloat16/float32)
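
The 384x384 input size is the main preprocessing constraint. Below is a minimal, illustrative resizing sketch for users who want to prepare tensors themselves; note that the usage examples later in this card pass PIL images directly to the model, so this manual step may not be required:

```python
from PIL import Image
from torchvision import transforms

# Illustrative resize to the model's native 384x384 resolution.
# If the repository ships its own image processor, prefer that instead.
preprocess = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
])

image = Image.open("path/to/your/image.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # shape: (1, 3, 384, 384)
```
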
## Usage

### Method 1: Load from Hugging Face Hub

```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: Set custom cache directory if needed
cache_dir = "/path/to/your/cache"  # Use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    token=token,
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # Specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: Avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")

    # Load and process an image
    image = Image.open("path/to/your/image.jpg")

    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")

    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)

    # Handle different possible output formats
    if hasattr(captions, 'generated_text'):
        print("Generated caption:", captions.generated_text[0])
    elif isinstance(captions, list):
        print("Generated caption:", captions[0])
    else:
        print("Generated caption:", captions)

except Exception as e:
    print(f"Error during model loading or inference: {e}")
    # If cached files are corrupted, try clearing cache and redownloading
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)

    # Retry with force download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```

### Method 2: Load from original checkpoint

```python
from PIL import Image
import torch

from VLV_stage2 import VLV_MODEL

# Load from original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)
    print(captions.generated_text[0])  # Generated caption
```

## Model Details

- **Model Type**: Vision-Language Model
- **Architecture**: VLV_decoder
- **Language Backbone**: Qwen/Qwen2.5-3B
- **Vision Backbone**: Stable Diffusion + Florence2
- **Training Data**: Image-caption datasets, including Conceptual Captions and COCO
- **Framework**: PyTorch, Transformers

## Training Configuration

- **Batch Size**: 1 (inference)
- **Learnable Token Length**: 77
- **Guidance Scale**: 7.5
- **Inference Steps**: 50
- **Beam Search**: 4 beams
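
These defaults correspond to the generation call shown in the usage examples. Here is a hedged sketch of overriding them at inference time; the `num_beams` keyword is an assumption based on the beam-search default above, and only `max_length` appears in the original examples:

```python
import torch
from PIL import Image
from transformers import AutoModel

# Load as in Method 1 above.
model = AutoModel.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)

image = Image.open("path/to/your/image.jpg")

# max_length matches the documented 300-token cap; num_beams=4 mirrors the
# beam-search default (keyword name assumed, not verified against the code).
with torch.no_grad():
    captions = model([image], max_length=300, num_beams=4)
```
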
## Requirements

```bash
pip install torch transformers safetensors torchvision pillow diffusers
```

## Troubleshooting

### Common Issues and Solutions

#### 1. Meta Tensor Issues

If you encounter meta tensor errors, avoid using `device_map="auto"` when loading the model:

```python
import torch
from transformers import AutoModel

# ❌ Don't use this - can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```

#### 2. Cache Issues

If you experience corrupted cache files, clear the cache and redownload:

```python
import os
import shutil

from transformers import AutoModel

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```

#### 3. Authentication Issues

Make sure your Hugging Face token is properly set:

```bash
# Option 1: Environment variable
export HUGGINGFACE_TOKEN="your_token_here"

# Option 2: Hugging Face CLI login
huggingface-cli login
```
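
Alternatively, you can authenticate from Python with `huggingface_hub` (standard Hub tooling, not specific to this model):

```python
import os

from huggingface_hub import login

# Reads the token from the environment; you can also pass the token string directly.
login(token=os.getenv("HUGGINGFACE_TOKEN"))
```
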
#### 4. Memory Issues

For large models, use a custom cache directory with sufficient space:

```python
import os

from transformers import AutoModel

cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```

## Advanced Usage

### Batch Processing with Original Inference Script

For large-scale inference, you can use the original inference script from the training codebase:

```bash
python Caption_inference.py \
    --input_path /path/to/images \
    --output_path captions.json \
    --clip_decoder_checkpoint /path/to/model.pt \
    --qwen_model Qwen/Qwen2.5-3B \
    --stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
    --florence2_model_path microsoft/Florence-2-large \
    --batch_size 4 \
    --max_length 300 \
    --num_beams 4 \
    --image_size 384 \
    --guidance_scale 7.5 \
    --use_text_encoder \
    --distributed  # For multi-GPU inference
```

### Configuration Parameters

- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable CLIP text encoder (recommended: True)
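
When loading from the Hub, you can inspect the exported config to see which of these parameters it actually carries. Whether each one is exposed as a config attribute, and under what name, is an assumption; the CLI flags above are the authoritative reference:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
)

# Prints all exported fields; entries such as image_size or guidance_scale
# may or may not be present, depending on how the checkpoint was exported.
print(config)
```
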
## Citation

```bibtex
@article{vlv_autoencoder,
  title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
  journal={arXiv preprint},
  year={2024}
}
```

## License

This model is released under the Apache 2.0 license.