---
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
tags:
- trl
- VisionLanguageAttribution
- VisualUnderstanding
- text-generation-inference
- AttributeCaptioning
- VLA
- High-Fidelity
datasets:
- prithivMLmods/blip3o-caption-mini-arrow
- prithivMLmods/Caption3o-Opt-v3
- prithivMLmods/Caption3o-Opt-v2
- >-
  Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647
---

![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/OWDu8wz6b4UlDWMJaSJu6.png)

# **DeepCaption-VLA-7B**

> The **DeepCaption-VLA-7B** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, tailored for **Image Captioning** and **Vision Language Attribution**. This variant is designed to generate precise, highly descriptive captions with a focus on **defining visual properties, object attributes, and scene details** across a wide spectrum of images and aspect ratios.

[![Download Demo Notebook](https://img.shields.io/badge/Open%20Demo%20Notebook-DeepCaption--VLA--7B-blue?style=for-the-badge&logo=jupyter)](https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb)

# Key Highlights

1. **Vision Language Attribution (VLA):** Specially fine-tuned to attribute and define visual properties of objects, scenes, and environments.
2. **Detailed Object Definitions:** Generates captions with rich attribute descriptions, making outputs more precise than generic captioners.
3. **High-Fidelity Descriptions:** Handles general, artistic, technical, abstract, and low-context images with descriptive depth.
4. **Robust Across Aspect Ratios:** Accurately captions images regardless of format—wide, tall, square, or irregular.
5. **Variational Detail Control:** Supports both concise summaries and fine-grained attributions depending on prompt structure (see the prompt sketch after this list).
6. **Foundation on Qwen2.5-VL Architecture:** Leverages Qwen2.5-VL-7B’s multimodal reasoning for visual comprehension and instruction-following.
7. **Multilingual Capability:** Defaults to English, but adaptable to multilingual captioning through prompt engineering.
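
As an illustration of the prompt-dependent detail control noted in point 5, the sketch below shows two hypothetical user prompts, one asking for a concise summary and one asking for fine-grained attributes, together with a small helper for assembling the Qwen2.5-VL chat message. The prompt wording and the helper name are assumptions for illustration, not a fixed interface.

```python
# Hypothetical prompt variants for controlling caption detail (illustrative, not a fixed API).
CONCISE_PROMPT = "Caption the image precisely in one sentence."
DETAILED_PROMPT = (
    "Describe this image with detailed attributes and properties: "
    "objects, colors, materials, actions, environment, and mood."
)

def build_user_message(prompt: str, image_url: str) -> dict:
    """Assemble a single user turn in the Qwen2.5-VL multimodal chat format."""
    return {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": prompt},
        ],
    }
```

Either message can then be passed through `processor.apply_chat_template` exactly as in the Quick Start section below.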

> model type: experimental

# Training Details

This model was fine-tuned with a curated mix of datasets focused on **caption richness and object-attribute alignment**:

* [prithivMLmods/blip3o-caption-mini-arrow](https://huggingface.co/datasets/prithivMLmods/blip3o-caption-mini-arrow)
* [prithivMLmods/Caption3o-Opt-v3](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v3)
* [prithivMLmods/Caption3o-Opt-v2](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v2)
* [Multimodal-Fatima/Caltech101\_not\_background\_test\_facebook\_opt\_2.7b\_Attributes\_Caption\_ns\_5647](https://huggingface.co/datasets/Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647)
* Private/unlisted datasets for domain-specific image captioning tasks.

The training objective emphasized **Vision Language Attribution**: defining image properties, attributes, and objects with clarity, while preserving descriptive fluency.

---

## Example of a SYSTEM_PROMPT ✋

```py
CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol:

1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.

2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.

3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.  
   - Use the syntax: `{class_name==write_the_core_theme}`  
   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`  

4. Maintain the following strict format in your output:
   - **Caption:** <one-sentence description>  
   - **Attributes:** <comma-separated list of visual attributes>  
   - **{class_name==core_theme}**

5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.

6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.

""".strip()
```
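
A minimal sketch of how this system prompt can be attached as a `system` turn, and how the `{class_name==...}` tag can be pulled out of the formatted reply, assuming the same model/processor setup as in the Quick Start section below; the parsing helper is an illustrative assumption, not part of the model's API.

```python
import re

# Sketch: prepend the system prompt to the chat (assumes the Quick Start setup below).
messages = [
    {"role": "system", "content": [{"type": "text", "text": CAPTION_SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Caption the image precisely."},
        ],
    },
]

def parse_class_name(output: str):
    """Extract the core theme from a '{class_name==...}' tag (illustrative helper)."""
    match = re.search(r"\{class_name==([^}]+)\}", output)
    return match.group(1) if match else None
```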

---

> [!NOTE]
> General Query: Caption the image precisely.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/prithivMLmods/DeepCaption-VLA-7B/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb)

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepCaption-VLA-7B", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with detailed attributes and properties."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
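
The linked demo notebook runs this model in 4-bit. The sketch below shows one way to load it quantized with `bitsandbytes`, assuming the library is installed; the quantization parameters are illustrative choices, not the notebook's exact configuration.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

# Illustrative 4-bit NF4 quantization config (assumption; not the notebook's exact settings).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepCaption-VLA-7B",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-7B")
# The rest of the pipeline (chat template, processor call, generate) is unchanged.
```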

---

# Intended Use

* Generating attribute-rich image captions for research, dataset creation, and AI training.
* Vision-language attribution for object detection, scene understanding, and dataset annotation.
* Supporting creative, artistic, and technical applications requiring detailed descriptions.
* Captioning across varied aspect ratios, unusual visual styles, and non-standard datasets.

# Limitations

* May over-attribute or infer properties not explicitly visible in ambiguous images.
* Outputs can vary in tone depending on prompt phrasing.
* Not intended for safety-filtered captioning tasks; explicit or sensitive content may appear in outputs.
* Accuracy may degrade on synthetic or highly abstract visual domains.