---
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- diffusion
- pytorch
- transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
- conceptual_captions
- coco
model_type: VLV_decoder
---

# VLV Captioner Model

This is a VLV (Vision-Language-Vision) model for image captioning. The model combines a Stable Diffusion-based image encoder with a Qwen language model to generate descriptive captions from images.

## Model Description

The VLV Captioner is a multimodal model that:

- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images

## Model Architecture

- **Vision Encoder**: Stable Diffusion-based image encoder with Florence2 components
- **Language Model**: Qwen2.5-3B transformer model
- **Image Size**: 384x384 pixels
- **Max Caption Length**: 300 tokens
- **Precision**: Mixed precision (bfloat16/float32)

## Usage

### Method 1: Load from Hugging Face Hub

```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: set a custom cache directory if needed
cache_dir = "/path/to/your/cache"  # Use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with an authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    token=token,
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # Specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")

    # Load and process an image
    image = Image.open("path/to/your/image.jpg")

    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")

    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)

    # Handle different possible output formats
    if hasattr(captions, 'generated_text'):
        print("Generated caption:", captions.generated_text[0])
    elif isinstance(captions, list):
        print("Generated caption:", captions[0])
    else:
        print("Generated caption:", captions)

except Exception as e:
    print(f"Error during model loading or inference: {e}")

    # If cached files are corrupted, try clearing the cache and redownloading
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)

    # Retry with forced download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```

### Method 2: Load from original checkpoint

```python
from VLV_stage2 import VLV_MODEL
from PIL import Image
import torch

# Load from the original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)

print(captions.generated_text[0])  # Generated caption
```

## Model Details

- **Model Type**: Vision-Language Model
- **Architecture**: VLV_decoder
- **Language Backbone**: Qwen/Qwen2.5-3B
- **Vision Backbone**: Stable Diffusion + Florence2
- **Training Data**: Various image-caption datasets
- **Framework**: PyTorch, Transformers

## Training Configuration

- **Batch Size**: 1 (inference)
- **Learnable Token Length**: 77
- **Guidance Scale**: 7.5
- **Inference Steps**: 50
- **Beam Search**: 4 beams

## Requirements

```bash
pip install torch transformers safetensors torchvision pillow diffusers
```

## Troubleshooting

### Common Issues and Solutions

#### 1. Meta Tensor Issues

If you encounter meta tensor errors, avoid using `device_map="auto"` when loading the model:

```python
# ❌ Don't use this - can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```

#### 2. Cache Issues

If you experience corrupted cache files, clear the cache and redownload:

```python
import shutil
import os

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```

#### 3. Authentication Issues

Make sure your Hugging Face token is properly set:

```bash
# Option 1: Environment variable
export HUGGINGFACE_TOKEN="your_token_here"

# Option 2: Hugging Face CLI login
huggingface-cli login
```

#### 4. Memory Issues

For large models, use a custom cache directory with sufficient space:

```python
cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```

## Advanced Usage

### Batch Processing with Original Inference Script

For large-scale inference, you can use the original training inference script:

```bash
python Caption_inference.py \
    --input_path /path/to/images \
    --output_path captions.json \
    --clip_decoder_checkpoint /path/to/model.pt \
    --qwen_model Qwen/Qwen2.5-3B \
    --stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
    --florence2_model_path microsoft/Florence-2-large \
    --batch_size 4 \
    --max_length 300 \
    --num_beams 4 \
    --image_size 384 \
    --guidance_scale 7.5 \
    --use_text_encoder \
    --distributed  # For multi-GPU inference
```
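If you would rather stay within the Hugging Face interface from Method 1, the sketch below shows one way to caption a folder of images directly in Python. It is a minimal example, not part of the released code: it assumes the model call accepts a list of several PIL images at once (Method 1 only demonstrates a single-image list) and that the output is either an object with `generated_text` or a plain list of strings; `repo_id`, `image_dir`, and `output_path` are placeholders.

```python
import json
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoModel

repo_id = "your-username/vlv-captioner"  # placeholder repo id
image_dir = Path("/path/to/images")      # placeholder input directory
output_path = "captions.json"            # placeholder output file
batch_size = 4                           # assumption: multi-image calls are supported

model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)
if torch.cuda.is_available():
    model = model.to("cuda")

# Collect image files from the input directory
image_paths = sorted(
    p for p in image_dir.iterdir() if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)

results = {}
with torch.no_grad():
    for start in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[start:start + batch_size]
        images = [Image.open(p).convert("RGB") for p in batch_paths]
        captions = model(images, max_length=300)
        # Handle both output formats described in Method 1
        texts = captions.generated_text if hasattr(captions, "generated_text") else captions
        for path, text in zip(batch_paths, texts):
            results[path.name] = text

with open(output_path, "w") as f:
    json.dump(results, f, indent=2)
```

If multi-image calls turn out not to be supported by the remote code, fall back to looping over images one at a time with a single-image list, as in Method 1.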
### Configuration Parameters

- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable CLIP text encoder (recommended: True)

## Citation

```bibtex
@article{vlv_autoencoder,
  title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
  journal={arXiv preprint},
  year={2024}
}
```

## License

This model is released under the Apache 2.0 license.