Improve model card: Add `library_name`, abstract, project page, and usage example
#1
by nielsr (HF Staff) - opened

README.md CHANGED

---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
datasets:
- psp-dada/SENTINEL
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Model Card for SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention

<a href='https://arxiv.org/abs/2507.12455'><img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a>
<a href='https://github.com/pspdada/SENTINEL'><img src='https://img.shields.io/badge/Github-Repo-Green'></a>
<a href='https://huggingface.co/collections/psp-dada/sentinel-686ea70912079af142015286'><img src='https://img.shields.io/badge/Project-HuggingFace_Collection-orange'></a>

## Abstract

Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations: fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose **SENTINEL** (**S**entence-level **E**arly i**N**tervention **T**hrough **IN**-domain pr**E**ference **L**earning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL reduces hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capability benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
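
To make the objective concrete, here is a minimal sketch of a sentence-level preference loss in the standard DPO form, where the preferred and dispreferred continuations share the same hallucination-free context. This illustrates the C-DPO idea only, under the assumption that it follows the usual DPO formulation; the function name and details are ours, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sentence_level_dpo_loss(
    logp_pos: torch.Tensor,      # log p_theta(non-hallucinated sentence | shared context)
    logp_neg: torch.Tensor,      # log p_theta(hallucinated sentence | shared context)
    ref_logp_pos: torch.Tensor,  # same log-probs under the frozen reference model
    ref_logp_neg: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards are log-probability ratios against the reference model.
    reward_pos = beta * (logp_pos - ref_logp_pos)
    reward_neg = beta * (logp_neg - ref_logp_neg)
    # Because both continuations extend the same context, the gradient
    # concentrates on the sentence where hallucination first appears.
    return -F.logsigmoid(reward_pos - reward_neg).mean()
```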

## 🚀 Overview

**SENTINEL** introduces an automatic, sentence-level early intervention strategy to prevent and mitigate object hallucinations in multimodal large language models (MLLMs). Key advantages:

- **Annotation-free**: No human labeling required.
- **Model-agnostic**: Compatible with any MLLM architecture.
- **Efficient**: Lightweight LoRA fine-tuning.

## 🔑 Key Features

- 🧠 **Early intervention halts hallucination propagation**. We find that hallucinations of MLLMs predominantly arise in early sentences and propagate through the rest of the output. SENTINEL interrupts this chain early to maximize mitigation.
- 🔍 **In-domain contextual preference learning without human labels**. SENTINEL constructs hallucinated/factual samples via detector cross-validation (see the sketch after this list) and builds context-aware preference data without relying on proprietary LLMs or manual annotations.
- 💡 **Context matters: rich coherence drives robustness**. By prioritizing context-coherent positive samples over hallucinated ones, SENTINEL significantly boosts generalization.
- ♻️ **Iterative contextual bootstrapping for diverse hallucination-free contexts**. Our pipeline dynamically grows non-hallucinated contexts and expands coverage across varied scenes, improving robustness across generations.
- 📊 **State-of-the-art results across benchmarks**. SENTINEL achieves **up to 92% reduction** in hallucinations and outperforms prior SOTA methods on Object HalBench, AMBER, and HallusionBench, while maintaining or improving general task performance.
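
For intuition, the detector cross-validation step can be pictured as below. This is a schematic sketch only: `classify_sentence`, the `Detector` interface, and the agreement rule are hypothetical stand-ins for the repository's actual pipeline.

```python
from typing import Callable, List, Set

from PIL import Image

# Hypothetical interface: given an image, a detector returns the set of
# object labels it finds above its own confidence threshold.
Detector = Callable[[Image.Image], Set[str]]

def classify_sentence(
    mentioned_objects: List[str],
    image: Image.Image,
    detector_a: Detector,
    detector_b: Detector,
) -> str:
    """Label a generated sentence by cross-checking its object mentions."""
    found_a = detector_a(image)
    found_b = detector_b(image)
    for obj in mentioned_objects:
        # An object mention counts as grounded only if BOTH open-vocabulary
        # detectors agree that it appears in the image.
        if obj not in found_a or obj not in found_b:
            return "hallucinated"
    return "non-hallucinated"
```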

## 📦 Model Weights

This model is a LoRA adapter for `Qwen/Qwen2-VL-7B-Instruct`. It plugs into the corresponding base model for inference or further fine-tuning.
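
If you prefer a standalone checkpoint, the adapter can also be merged into the base weights with `peft`. A minimal sketch, assuming the repository is a standard PEFT LoRA adapter and `peft` is installed:

```python
from peft import PeftModel
from transformers import Qwen2VLForConditionalGeneration

# Load the base model, attach the SENTINEL adapter, then fold the LoRA
# deltas into the base weights and save a standalone checkpoint.
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto"
)
model = PeftModel.from_pretrained(base, "psp-dada/Qwen2-VL-7B-SENTINEL")
merged = model.merge_and_unload()
merged.save_pretrained("./qwen2-vl-7b-sentinel-merged")
```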

## Usage

You can use this model with the Hugging Face `transformers` and `peft` libraries. Since this repository hosts a LoRA adapter, load the `Qwen/Qwen2-VL-7B-Instruct` base model first and then attach the adapter; Qwen2-VL is supported natively in recent `transformers` releases, so `trust_remote_code` is not required.
```python
from io import BytesIO

import requests
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the base model and attach the SENTINEL LoRA adapter.
base_model_id = "Qwen/Qwen2-VL-7B-Instruct"
adapter_id = "psp-dada/Qwen2-VL-7B-SENTINEL"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    base_model_id,
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)
processor = AutoProcessor.from_pretrained(base_model_id)

# Example image from the SENTINEL GitHub repository
image_url = "https://raw.githubusercontent.com/pspdada/SENTINEL/main/docs/figures/figure1.png"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Prepare messages following the Qwen2-VL chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Apply the chat template and preprocess the image-text inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

# Generate a response
generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# The generated ids include the prompt as a prefix; trim it for clean output.
output_text = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(output_text)
```
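
The single-image flow above keeps the example self-contained; for multi-image or video inputs, the upstream Qwen2-VL examples use the `qwen_vl_utils` helper package (`process_vision_info`) to prepare vision inputs.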

## 📝 Citation

If you find our model, code, data, or paper helpful, please consider citing our paper 📝 and starring us ⭐️!

```bibtex
@article{peng2025mitigating,
  title={Mitigating Object Hallucinations via Sentence-Level Early Intervention},
  author={Peng, Shangpin and Yang, Senqiao and Jiang, Li and Tian, Zhuotao},
  journal={arXiv preprint arXiv:2507.12455},
  year={2025}
}
```