---
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- diffusion
- pytorch
- transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
- conceptual_captions
- coco
model_type: VLV_decoder
---

# VLV Captioner Model

This is a VLV (Vision-Language-Vision) model for image captioning. The model combines a Stable Diffusion-based image encoder with a Qwen language model to generate descriptive captions from images.

## Model Description

The VLV Captioner is a multimodal model that:

- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images

## Model Architecture

- **Vision Encoder**: Stable Diffusion-based image encoder with Florence2 components
- **Language Model**: Qwen2.5-3B transformer model
- **Image Size**: 384x384 pixels
- **Max Caption Length**: 300 tokens
- **Precision**: Mixed precision (bfloat16/float32)

## Usage

### Method 1: Load from Hugging Face Hub

```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: set a custom cache directory if needed
cache_dir = "/path/to/your/cache"  # Use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with an authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    token=token,
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # Specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")

    # Load and process an image
    image = Image.open("path/to/your/image.jpg")

    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")

    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)

    # Handle different possible output formats
    if hasattr(captions, 'generated_text'):
        print("Generated caption:", captions.generated_text[0])
    elif isinstance(captions, list):
        print("Generated caption:", captions[0])
    else:
        print("Generated caption:", captions)

except Exception as e:
    print(f"Error during model loading or inference: {e}")

    # If cached files are corrupted, try clearing the cache and redownloading
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)

    # Retry with forced download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```

### Method 2: Load from original checkpoint

```python
from VLV_stage2 import VLV_MODEL
from PIL import Image
import torch

# Load from the original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)

print(captions.generated_text[0])  # Generated caption
```

## Model Details

- **Model Type**: Vision-Language Model
- **Architecture**: VLV_decoder
- **Language Backbone**: Qwen/Qwen2.5-3B
- **Vision Backbone**: Stable Diffusion + Florence2
- **Training Data**: Various image-caption datasets
- **Framework**: PyTorch, Transformers

## Training Configuration

- **Batch Size**: 1 (inference)
- **Learnable Token Length**: 77
- **Guidance Scale**: 7.5
- **Inference Steps**: 50
- **Beam Search**: 4 beams

## Requirements

```bash
pip install torch transformers safetensors torchvision pillow diffusers
```

## Troubleshooting

### Common Issues and Solutions

#### 1. Meta Tensor Issues

If you encounter meta tensor errors, avoid using `device_map="auto"` when loading the model:

```python
# ❌ Don't use this - can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```

#### 2. Cache Issues

If you experience corrupted cache files, clear the cache and redownload:

```python
import shutil
import os

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```

#### 3. Authentication Issues

Make sure your Hugging Face token is properly set:

```bash
# Option 1: Environment variable
export HUGGINGFACE_TOKEN="your_token_here"

# Option 2: Hugging Face CLI login
huggingface-cli login
```

#### 4. Memory Issues

For large models, use a custom cache directory with sufficient space:

```python
cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```

## Advanced Usage

### Batch Processing with Original Inference Script

For large-scale inference, you can use the original training inference script:

```bash
python Caption_inference.py \
    --input_path /path/to/images \
    --output_path captions.json \
    --clip_decoder_checkpoint /path/to/model.pt \
    --qwen_model Qwen/Qwen2.5-3B \
    --stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
    --florence2_model_path microsoft/Florence-2-large \
    --batch_size 4 \
    --max_length 300 \
    --num_beams 4 \
    --image_size 384 \
    --guidance_scale 7.5 \
    --use_text_encoder \
    --distributed  # For multi-GPU inference
```
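If you would rather stay within the Hugging Face interface from Method 1, the sketch below shows one way to caption a folder of images directly in Python. It is a minimal example, not part of the released code: it assumes the model call accepts a list of several PIL images at once (Method 1 only demonstrates a single-image list) and that the output is either an object with `generated_text` or a plain list of strings; `repo_id`, `image_dir`, and `output_path` are placeholders.

```python
import json
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoModel

repo_id = "your-username/vlv-captioner"  # placeholder repo id
image_dir = Path("/path/to/images")      # placeholder input directory
output_path = "captions.json"            # placeholder output file
batch_size = 4                           # assumption: multi-image calls are supported

model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)
if torch.cuda.is_available():
    model = model.to("cuda")

# Collect image files from the input directory
image_paths = sorted(
    p for p in image_dir.iterdir() if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)

results = {}
with torch.no_grad():
    for start in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[start:start + batch_size]
        images = [Image.open(p).convert("RGB") for p in batch_paths]
        captions = model(images, max_length=300)
        # Handle both output formats described in Method 1
        texts = captions.generated_text if hasattr(captions, "generated_text") else captions
        for path, text in zip(batch_paths, texts):
            results[path.name] = text

with open(output_path, "w") as f:
    json.dump(results, f, indent=2)
```

If multi-image calls turn out not to be supported by the remote code, fall back to looping over images one at a time with a single-image list, as in Method 1.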
### Configuration Parameters

- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable CLIP text encoder (recommended: True)

## Citation

```bibtex
@article{vlv_autoencoder,
  title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
  journal={arXiv preprint},
  year={2024}
}
```

## License

This model is released under the Apache 2.0 license.