|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
datasets: |
|
- psp-dada/SENTINEL |
|
language: |
|
- en |
|
license: apache-2.0 |
|
pipeline_tag: image-text-to-text |
|
library_name: transformers |
|
--- |
|
|
|
# Model Card for SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention
|
|
|
This repository contains the **SENTINEL** model, a fine-tuned version of `Qwen2.5-VL-7B-Instruct` designed to mitigate object hallucinations in Multimodal Large Language Models (MLLMs). SENTINEL introduces a novel framework for **S**entence-level **E**arly i**N**tervention **T**hrough **IN**-domain pr**E**ference **L**earning, eliminating the dependency on human annotations for hallucination mitigation. |
|
|
|
<a href='https://arxiv.org/abs/2507.12455'> |
|
<img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a> |
|
<a href='https://github.com/pspdada/SENTINEL'> |
|
<img src='https://img.shields.io/badge/Github-Repo-Green'></a> |
|
|
|
## Key Features |
|
|
|
* 🧠 **Early intervention halts hallucination propagation**: We find that hallucinations of MLLMs predominantly arise in early sentences and propagate through the rest of the output. SENTINEL interrupts this chain early to maximize mitigation. |
|
* 🔍 **In-domain contextual preference learning without human labels**: SENTINEL constructs hallucinated and factual samples via detector cross-validation and builds context-aware preference data without relying on proprietary LLMs or manual annotations (an illustrative sketch follows this list).
|
* 💡 **Context matters: rich coherence drives robustness**: By prioritizing context-coherent positive samples over hallucinated ones, SENTINEL significantly boosts generalization. |
|
* ♻️ **Iterative contextual bootstrapping for diverse hallucination-free contexts**: Our pipeline dynamically grows non-hallucinated contexts and expands coverage across varied scenes, improving robustness across generations. |
|
* 📊 **State-of-the-art results across benchmarks**: SENTINEL achieves **up to 92% reduction** in hallucinations and outperforms prior SOTA methods across Object HalBench, AMBER, and HallusionBench, while maintaining or improving general task performance. |
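
To make the preference-data construction idea concrete, here is a minimal, illustrative sketch of detector cross-validation for building sentence-level preference pairs. It is **not** the authors' pipeline: `detector_a`, `detector_b`, and `extract_objects` are hypothetical placeholders for open-vocabulary detectors and an object-mention extractor.

```python
def label_sentence(sentence, image, detector_a, detector_b, extract_objects):
    """Label a candidate sentence by cross-validating its object mentions
    against two (hypothetical) open-vocabulary detectors."""
    mentioned = extract_objects(sentence)                     # object names mentioned in the sentence
    found_a, found_b = detector_a(image), detector_b(image)   # object names each detector finds
    if all(obj in found_a and obj in found_b for obj in mentioned):
        return "factual"        # both detectors confirm every mentioned object
    if any(obj not in found_a and obj not in found_b for obj in mentioned):
        return "hallucinated"   # both detectors miss at least one mentioned object
    return "ambiguous"          # detectors disagree; discard the sentence


def build_preference_pair(context, candidates, image, detector_a, detector_b, extract_objects):
    """Pair one factual and one hallucinated continuation of a hallucination-free
    context, yielding the (y_win, y_lose) structure used for preference learning."""
    labels = {c: label_sentence(c, image, detector_a, detector_b, extract_objects)
              for c in candidates}
    wins = [c for c, label in labels.items() if label == "factual"]
    loses = [c for c, label in labels.items() if label == "hallucinated"]
    if wins and loses:
        return {"context": context, "y_win": wins[0], "y_lose": loses[0]}
    return None  # not enough signal for this context; sample more candidates
```

In the actual pipeline, accepted factual sentences also extend the context used for the next round of generation, which is the iterative contextual bootstrapping described above.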
|
|
|
## How to Use |
|
|
|
You can load and use the SENTINEL model with the Hugging Face `transformers` and `peft` libraries by applying the provided LoRA adapter weights to the base model.
|
|
|
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
from PIL import Image
import requests

# Load the base model and its processor
base_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model_id)

# Load the SENTINEL LoRA adapter weights
lora_model_id = "psp-dada/Qwen2.5-VL-7B-SENTINEL"
model = PeftModel.from_pretrained(model, lora_model_id)
# Optional: merge the adapter into the base model for direct use if no further training is planned
# model = model.merge_and_unload()

# Example: describe an image
image_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-images/resolve/main/r_and_c_cat.png"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": raw_image},
        {"type": "text", "text": "Describe the image in detail."},
    ]}
]

# Apply the chat template and prepare the inputs.
# The Qwen2.5-VL processor handles both vision and text tokenization.
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text_input], images=[raw_image], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens
output_text = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(output_text)
```
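
If you want a standalone checkpoint that no longer requires `peft` at load time, you can merge the adapter into the base weights and save the result (a minimal sketch; the output directory name is just an example, not an official checkpoint):

```python
# Merge the LoRA adapter into the base model weights and save a standalone copy.
# "Qwen2.5-VL-7B-SENTINEL-merged" is an example output path.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("Qwen2.5-VL-7B-SENTINEL-merged")
processor.save_pretrained("Qwen2.5-VL-7B-SENTINEL-merged")
```

The merged directory can then be loaded directly with `Qwen2_5_VLForConditionalGeneration.from_pretrained`, without the `PeftModel` wrapper.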
|
|
|
## Dataset |
|
|
|
We present the [**SENTINEL Dataset**](https://huggingface.co/datasets/psp-dada/SENTINEL), an in-domain multimodal preference dataset for mitigating object hallucinations, constructed **without** human annotation.
|
|
|
<details> |
|
<summary>Dataset details</summary> |
|
|
|
The SENTINEL dataset records preference pairs for the `LLaVA-v1.5`, `LLaVA-v1.6`, `Qwen2-VL`, and `Qwen2.5-VL` model families, enabling robust and scalable hallucination mitigation without external supervision.
|
|
|
It contains the following components: |
|
|
|
* `image_data.jsonl` file |
|
|
|
This file contains a selection of open-source images extracted from the Visual Genome dataset. It includes only three fields: `image_id`, `image_path`, and `question`, and is used to construct preference training data for hallucination suppression in image captioning tasks. |
|
|
|
**Note**: If you want to use the data from this file, please make sure to replace the `image_path` field with the path to your local copy of the Visual Genome dataset. |
|
|
|
* `<model_name>.json` files |
|
|
|
These files are the preference training datasets produced by the training data construction step, with each file corresponding to a specific model.
|
|
|
They include the necessary fields for **C-DPO training**, such as: `"question"`, `"context"`, `"y_win"`, and `"y_lose"`. |
|
|
|
<p align="center">
  <img src="/docs/figures/dataset.png" alt="Overview of the SENTINEL dataset" width="80%" />
</p>
|
</details> |
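
As a quick way to look at the data, the sketch below downloads one of the per-model preference files with `huggingface_hub` and prints a single record. The filename is illustrative; list the dataset repository to find the exact file for your model.

```python
import json

from huggingface_hub import hf_hub_download

# Download one per-model preference file from the SENTINEL dataset repository.
# The filename below is a guess for illustration; check the repo for the real names.
path = hf_hub_download(
    repo_id="psp-dada/SENTINEL",
    repo_type="dataset",
    filename="Qwen2.5-VL-7B.json",
)

with open(path) as f:
    records = json.load(f)

# Each record carries the fields used for C-DPO training.
sample = records[0]
print(sample["question"])  # the image-grounded prompt
print(sample["context"])   # hallucination-free context generated so far
print(sample["y_win"])     # preferred (factual) continuation
print(sample["y_lose"])    # rejected (hallucinated) continuation
```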
|
|
|
## Acknowledgement |
|
|
|
* [LLaVA](https://github.com/haotian-liu/LLaVA): LLaVA-v1.5 is an excellent open-source project on MLLMs. |
|
* [HA-DPO](https://github.com/opendatalab/HA-DPO): Our code for the LLaVA-v1.5 part is based on HA-DPO, an influential work in the field of object hallucination in MLLMs. It provided us with valuable inspiration. |
|
* [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): A unified and efficient fine-tuning framework for LLMs. Our implementations for LLaVA-v1.6, Qwen2-VL, and Qwen2.5-VL are based on this framework.
|
|
|
## Citation |
|
|
|
If you find our model, code, data, or paper helpful, please consider citing our paper 📝 and giving the repo a star ⭐️!
|
|
|
```bibtex
@article{peng2025mitigating,
  title={Mitigating Object Hallucinations via Sentence-Level Early Intervention},
  author={Peng, Shangpin and Yang, Senqiao and Jiang, Li and Tian, Zhuotao},
  journal={arXiv preprint arXiv:2507.12455},
  year={2025}
}
```