---
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- diffusion
- pytorch
- transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
- conceptual_captions
- coco
model_type: VLV_decoder
---
# VLV Captioner Model
This is a VLV (Vision-Language-Vision) model for image captioning. It combines a Stable Diffusion-based image encoder with the Qwen2.5-3B language model to generate descriptive captions from images.
## Model Description
The VLV Captioner is a multimodal model that:
- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images
## Model Architecture
- **Vision Encoder**: Stable Diffusion-based image encoder with Florence2 components
- **Language Model**: Qwen2.5-3B transformer model
- **Image Size**: 384x384 pixels (see the preprocessing sketch below)
- **Max Caption Length**: 300 tokens
- **Precision**: Mixed precision (bfloat16/float32)
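The model's remote code accepts PIL images directly (see Usage below), so manual preprocessing is usually unnecessary. For reference, here is a minimal sketch of what a 384x384 resize and tensor conversion would look like if you needed to prepare inputs yourself; the interpolation mode and the absence of normalization are assumptions, not values confirmed by the released code:
```python
import torch
from PIL import Image
from torchvision import transforms

# Illustrative preprocessing matching the documented 384x384 input size.
# Normalization statistics are intentionally omitted because they are not
# documented in this card.
preprocess = transforms.Compose([
    transforms.Resize((384, 384)),  # assumed bilinear resize
    transforms.ToTensor(),          # float32 tensor in [0, 1], shape (3, 384, 384)
])

image = Image.open("path/to/your/image.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # add batch dimension

# Cast to bfloat16 on GPU to match the card's mixed-precision setting
if torch.cuda.is_available():
    pixel_values = pixel_values.to("cuda", dtype=torch.bfloat16)
```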
## Usage
### Method 1: Load from Hugging Face Hub
```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: set a custom cache directory if needed
cache_dir = "/path/to/your/cache"  # Use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with an authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    token=token,
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # Specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")

    # Load and process an image
    image = Image.open("path/to/your/image.jpg")

    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")

    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)

    # Handle different possible output formats
    if hasattr(captions, 'generated_text'):
        print("Generated caption:", captions.generated_text[0])
    elif isinstance(captions, list):
        print("Generated caption:", captions[0])
    else:
        print("Generated caption:", captions)

except Exception as e:
    print(f"Error during model loading or inference: {e}")
    # If cached files are corrupted, clear the cache and redownload
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)
    # Retry with force download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```
### Method 2: Load from original checkpoint
```python
import torch
from PIL import Image

from VLV_stage2 import VLV_MODEL

# Load from the original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)
print(captions.generated_text[0])  # Generated caption
```
## Model Details
- **Model Type**: Vision-Language Model
- **Architecture**: VLV_decoder
- **Language Backbone**: Qwen/Qwen2.5-3B
- **Vision Backbone**: Stable Diffusion + Florence2
- **Training Data**: Image-caption datasets, including Conceptual Captions and COCO
- **Framework**: PyTorch, Transformers
## Inference Configuration
- **Batch Size**: 1 (inference)
- **Learnable Token Length**: 77
- **Guidance Scale**: 7.5
- **Inference Steps**: 50
- **Beam Search**: 4 beams (see the call sketch below)
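Assuming the Hub wrapper forwards generation keyword arguments the same way the original inference script's CLI flags do (an assumption; only `max_length` appears in the Usage examples above), the defaults above translate to a call like this:
```python
import torch
from PIL import Image

# Hypothetical sketch: `model` is the AutoModel loaded in the Usage section.
# Passing num_beams mirrors the beam-search default listed above; whether the
# wrapper accepts it as a keyword argument is an assumption.
image = Image.open("path/to/your/image.jpg").convert("RGB")

with torch.no_grad():
    captions = model([image], max_length=300, num_beams=4)
```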
## Requirements
```bash
pip install torch transformers safetensors torchvision pillow diffusers
```
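After installing, a quick sanity check that the core dependencies import and that a GPU is visible (this card does not pin specific versions):
```python
import torch
import transformers
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
```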
## Troubleshooting
### Common Issues and Solutions
#### 1. Meta Tensor Issues
If you encounter meta tensor errors, avoid using `device_map="auto"` when loading the model:
```python
import torch
from transformers import AutoModel

# ❌ Don't use this - it can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```
#### 2. Cache Issues
If you experience corrupted cache files, clear the cache and redownload:
```python
import os
import shutil
from transformers import AutoModel

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```
#### 3. Authentication Issues
Make sure your Hugging Face token is properly set:
```bash
# Option 1: Environment variable
export HUGGINGFACE_TOKEN="your_token_here"
# Option 2: Hugging Face CLI login
huggingface-cli login
```
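A third option is to authenticate programmatically with `huggingface_hub` before loading the model:
```python
import os
from huggingface_hub import login

# Reads the token from the environment; pass the token string directly if you prefer
login(token=os.getenv("HUGGINGFACE_TOKEN"))
```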
#### 4. Memory Issues
For large models, use a custom cache directory with sufficient space:
```python
import os
from transformers import AutoModel

cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```
## Advanced Usage
### Batch Processing with Original Inference Script
For large-scale inference, you can use the original inference script from the training repository:
```bash
python Caption_inference.py \
--input_path /path/to/images \
--output_path captions.json \
--clip_decoder_checkpoint /path/to/model.pt \
--qwen_model Qwen/Qwen2.5-3B \
--stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
--florence2_model_path microsoft/Florence-2-large \
--batch_size 4 \
--max_length 300 \
--num_beams 4 \
--image_size 384 \
--guidance_scale 7.5 \
--use_text_encoder \
--distributed # For multi-GPU inference
```
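If you prefer to stay in Python, the following sketch batches images from a folder and writes a JSON file of captions. It assumes the Hub model loaded in the Usage section accepts a list of PIL images per call and returns either a list of strings or an object with `.generated_text`; paths and batch size are illustrative:
```python
import glob
import json
import torch
from PIL import Image

image_paths = sorted(glob.glob("/path/to/images/*.jpg"))
batch_size = 4
results = {}

for start in range(0, len(image_paths), batch_size):
    batch_paths = image_paths[start:start + batch_size]
    images = [Image.open(p).convert("RGB") for p in batch_paths]
    with torch.no_grad():
        captions = model(images, max_length=300)
    texts = captions.generated_text if hasattr(captions, "generated_text") else captions
    results.update(dict(zip(batch_paths, texts)))

with open("captions.json", "w") as f:
    json.dump(results, f, indent=2)
```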
### Configuration Parameters
- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable CLIP text encoder (recommended: True)
## Citation
```bibtex
@article{vlv_autoencoder,
title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
journal={arXiv preprint},
year={2024}
}
```
## License
This model is released under the Apache 2.0 license.