---
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
tags:
- trl
- VisionLanguageAttribution
- VisualUnderstanding
- text-generation-inference
- AttributeCaptioning
- VLA
- High-Fidelity
datasets:
- prithivMLmods/blip3o-caption-mini-arrow
- prithivMLmods/Caption3o-Opt-v3
- prithivMLmods/Caption3o-Opt-v2
- >-
  Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647
---

![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/OWDu8wz6b4UlDWMJaSJu6.png)

# **DeepCaption-VLA-7B**

> The **DeepCaption-VLA-7B** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, tailored for **Image Captioning** and **Vision Language Attribution**. This variant is designed to generate precise, highly descriptive captions with a focus on **defining visual properties, object attributes, and scene details** across a wide spectrum of images and aspect ratios.

# Key Highlights

1. **Vision Language Attribution (VLA):** Specially fine-tuned to attribute and define the visual properties of objects, scenes, and environments.
2. **Detailed Object Definitions:** Generates captions with rich attribute descriptions, making outputs more precise than generic captioners.
3. **High-Fidelity Descriptions:** Handles general, artistic, technical, abstract, and low-context images with descriptive depth.
4. **Robust Across Aspect Ratios:** Accurately captions images regardless of format—wide, tall, square, or irregular.
5. **Variational Detail Control:** Supports both concise summaries and fine-grained attributions, depending on prompt structure.
6. **Foundation on Qwen2.5-VL Architecture:** Leverages Qwen2.5-VL-7B’s multimodal reasoning for visual comprehension and instruction following.
7. **Multilingual Capability:** Defaults to English, but adaptable to multilingual captioning through prompt engineering.

> Model type: experimental

# Training Details

This model was fine-tuned on a curated mix of datasets focused on **caption richness and object-attribute alignment**:

* [prithivMLmods/blip3o-caption-mini-arrow](https://huggingface.co/datasets/prithivMLmods/blip3o-caption-mini-arrow)
* [prithivMLmods/Caption3o-Opt-v3](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v3)
* [prithivMLmods/Caption3o-Opt-v2](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v2)
* [Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647](https://huggingface.co/datasets/Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647)
* Private/unlisted datasets for domain-specific image-captioning tasks.

The training objective emphasized **Vision Language Attribution**: defining image properties, attributes, and objects with clarity while preserving descriptive fluency.

---

## Example of a SYSTEM_PROMPT ✋

```py
CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol:

1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.

2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.

3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.
   - Use the syntax: `{class_name==write_the_core_theme}`
   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`

4. Maintain the following strict format in your output:
   - **Caption:**
   - **Attributes:**
   - **{class_name==core_theme}**

5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.

6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.
""".strip()
```
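Because the protocol fixes the output layout, the three fields can be recovered mechanically on the client side. The snippet below is a minimal parsing sketch; the helper `parse_caption_output` and its regexes are illustrative assumptions about responses that follow the format above, not part of the model or library API.

```python
import re

def parse_caption_output(text: str) -> dict:
    """Split a response that follows the Caption / Attributes / class_name
    protocol into its three fields. The regexes assume the markdown layout
    requested by CAPTION_SYSTEM_PROMPT above (illustrative, not an official API)."""
    caption = re.search(
        r"\*\*Caption:\*\*\s*(.+?)(?=\n\s*(?:-\s*)?\*\*Attributes:\*\*|\Z)", text, re.S
    )
    attributes = re.search(
        r"\*\*Attributes:\*\*\s*(.+?)(?=\n\s*(?:-\s*)?\*\*\{class_name|\Z)", text, re.S
    )
    class_name = re.search(r"\{class_name==([^}]+)\}", text)
    return {
        "caption": caption.group(1).strip() if caption else None,
        "attributes": attributes.group(1).strip() if attributes else None,
        "class_name": class_name.group(1).strip() if class_name else None,
    }

# Example with a response shaped like the protocol above:
sample = """**Caption:** A golden retriever leaps to catch a frisbee in a sunlit park.
**Attributes:** dog, frisbee, green grass, mid-air motion, daylight, joyful mood
**{class_name==dog_playing}**"""
print(parse_caption_output(sample))
# {'caption': 'A golden retriever ...', 'attributes': 'dog, frisbee, ...', 'class_name': 'dog_playing'}
```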
---

> [!note]
> General Query: Caption the image precisely.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://huggingface.co/prithivMLmods/DeepCaption-VLA-7B/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb)

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepCaption-VLA-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-7B")

# A single-turn request: one image plus a captioning instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with detailed attributes and properties."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

---

# Intended Use

* Generating attribute-rich image captions for research, dataset creation, and AI training.
* Vision-language attribution for object detection, scene understanding, and dataset annotation.
* Supporting creative, artistic, and technical applications that require detailed descriptions.
* Captioning across varied aspect ratios, unusual visual styles, and non-standard datasets.

# Limitations

* May over-attribute or infer properties that are not explicitly visible in ambiguous images.
* Output tone can vary depending on prompt phrasing.
* Not intended for filtered captioning tasks (explicit or sensitive content may appear).
* Accuracy may degrade on synthetic or highly abstract visual domains.
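# Quantized Loading (4-bit)

The Colab badge above links to a 4-bit notebook demo. For GPUs with limited memory, a 4-bit load can be sketched with `bitsandbytes` quantization as shown below; this is an illustrative configuration and may differ from the exact settings used in the notebook.

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Illustrative 4-bit NF4 quantization config (assumed settings, not the notebook's exact ones).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # weights stored in 4-bit, compute runs in fp16
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepCaption-VLA-7B",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-7B")
# The quick-start pipeline above works unchanged with this quantized model object.
```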