Improve model card: Add library name, key features, and usage example
#1 by nielsr (HF Staff) · opened

README.md CHANGED
@@ -1,19 +1,123 @@
---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- psp-dada/SENTINEL
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Model Card for SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention

This repository contains the **SENTINEL** model, a fine-tuned version of `Qwen2.5-VL-7B-Instruct` designed to mitigate object hallucinations in Multimodal Large Language Models (MLLMs). SENTINEL introduces a novel framework for **S**entence-level **E**arly i**N**tervention **T**hrough **IN**-domain pr**E**ference **L**earning, eliminating the dependency on human annotations for hallucination mitigation.

<a href='https://arxiv.org/abs/2507.12455'>
<img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a>
<a href='https://github.com/pspdada/SENTINEL'>
<img src='https://img.shields.io/badge/Github-Repo-Green'></a>

## Key Features

* 🧠 **Early intervention halts hallucination propagation**: We find that hallucinations of MLLMs predominantly arise in early sentences and propagate through the rest of the output. SENTINEL interrupts this chain early to maximize mitigation.
* 🔍 **In-domain contextual preference learning without human labels**: SENTINEL constructs hallucinated/factual samples via detector cross-validation and builds context-aware preference data without relying on proprietary LLMs or manual annotations.
* 💡 **Context matters: rich coherence drives robustness**: By prioritizing context-coherent positive samples over hallucinated ones, SENTINEL significantly boosts generalization.
* ♻️ **Iterative contextual bootstrapping for diverse hallucination-free contexts**: Our pipeline dynamically grows non-hallucinated contexts and expands coverage across varied scenes, improving robustness across generations.
* 📊 **State-of-the-art results across benchmarks**: SENTINEL achieves **up to 92% reduction** in hallucinations and outperforms prior SOTA methods across Object HalBench, AMBER, and HallusionBench, while maintaining or improving general task performance.

## How to Use

You can easily load and use the SENTINEL model with the Hugging Face `transformers` library, combining the base model with the provided LoRA adapter weights.

```python
import requests
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the base model and its processor
base_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(base_model_id)

# Load the SENTINEL LoRA adapter weights
lora_model_id = "psp-dada/Qwen2.5-VL-7B-SENTINEL"
model = PeftModel.from_pretrained(model, lora_model_id)
# Optional: merge adapter weights into the base model for direct usage if no further training is planned
# model = model.merge_and_unload()

# Example: describe an image
image_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-images/resolve/main/r_and_c_cat.png"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": raw_image},
        {"type": "text", "text": "Describe the image in detail."},
    ]}
]

# Apply the chat template and prepare the inputs.
# The Qwen2.5-VL processor handles both vision and text tokenization.
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text_input, images=raw_image, return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens
output_text = processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(output_text)
```
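
If you want a standalone checkpoint for deployment, one option is to merge the LoRA adapter into the base weights and save the result. Below is a minimal sketch assuming `model` and `processor` from the example above; the output directory name is only an example, not part of the release.

```python
# Merge the LoRA adapter into the base model and save a standalone copy.
# Assumes `model` is the PeftModel and `processor` the AutoProcessor from the example above;
# "Qwen2.5-VL-7B-SENTINEL-merged" is an example output path.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("Qwen2.5-VL-7B-SENTINEL-merged")
processor.save_pretrained("Qwen2.5-VL-7B-SENTINEL-merged")

# The merged checkpoint can then be loaded directly, without peft:
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen2.5-VL-7B-SENTINEL-merged", torch_dtype="auto", device_map="auto"
# )
```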

## Dataset

We present the [**SENTINEL Dataset**](https://huggingface.co/datasets/psp-dada/SENTINEL), an in-domain multimodal preference dataset for mitigating object hallucination, constructed **without** human annotation.

<details>
<summary>Dataset details</summary>

The SENTINEL dataset records the preference pairs of the `LLaVA-v1.5`, `LLaVA-v1.6`, `Qwen2-VL`, and `Qwen2.5-VL` families, enabling robust and scalable hallucination mitigation without external supervision.

It contains the following components:

* `image_data.jsonl` file

  This file contains a selection of open-source images extracted from the Visual Genome dataset. It includes only three fields: `image_id`, `image_path`, and `question`, and is used to construct preference training data for hallucination suppression in image captioning tasks.

  **Note**: If you want to use the data from this file, please make sure to replace the `image_path` field with the path to your local copy of the Visual Genome dataset (see the loading sketch after this section).

* `<model_name>.json` files

  These files are the preference training datasets produced by the training data construction step, with each file corresponding to a specific model.

  They include the fields needed for **C-DPO training**, such as `"question"`, `"context"`, `"y_win"`, and `"y_lose"`.

<table align="center">
<p align="center">
<img src="/docs/figures/dataset.png" width="80%" />
</p>
</table>
</details>
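
As a concrete illustration of how these files fit together, here is a minimal sketch that loads `image_data.jsonl`, points `image_path` at a local Visual Genome copy, and reads the preference pairs from a per-model JSON file. The local directory and the `Qwen2.5-VL-7B.json` file name are assumptions for the example; only the field names come from the dataset description above.

```python
import json
from pathlib import Path

VG_ROOT = Path("/data/visual_genome")  # assumed local Visual Genome copy

# Load the image metadata and rewrite `image_path` to the local dataset copy.
images = []
with open("image_data.jsonl") as f:
    for line in f:
        record = json.loads(line)  # fields: image_id, image_path, question
        record["image_path"] = str(VG_ROOT / Path(record["image_path"]).name)
        images.append(record)

# Load the per-model preference pairs used for C-DPO training.
# "Qwen2.5-VL-7B.json" is a placeholder for the actual <model_name>.json file.
with open("Qwen2.5-VL-7B.json") as f:
    preference_pairs = json.load(f)

sample = preference_pairs[0]
print(sample["question"])  # captioning prompt
print(sample["context"])   # preceding non-hallucinated context
print(sample["y_win"])     # preferred (factual) continuation
print(sample["y_lose"])    # rejected (hallucinated) continuation
```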

## Acknowledgement

* [LLaVA](https://github.com/haotian-liu/LLaVA): LLaVA-v1.5 is an excellent open-source project on MLLMs.
* [HA-DPO](https://github.com/opendatalab/HA-DPO): Our code for the LLaVA-v1.5 part is based on HA-DPO, an influential work on object hallucination in MLLMs that provided us with valuable inspiration.
* [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): A unified and efficient fine-tuning framework for LLMs. Our implementations for LLaVA-v1.6, Qwen2-VL, and Qwen2.5-VL are based on this framework.

## Citation

If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!

```bibtex
@article{peng2025mitigating,
  title={Mitigating Object Hallucinations via Sentence-Level Early Intervention},
  author={Peng, Shangpin and Yang, Senqiao and Jiang, Li and Tian, Zhuotao},
  journal={arXiv preprint arXiv:2507.12455},
  year={2025}
}
```