|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
datasets: |
|
- psp-dada/SENTINEL |
|
language: |
|
- en |
|
license: apache-2.0 |
|
pipeline_tag: image-text-to-text |
|
library_name: transformers |
|
--- |
|
|
|
# Model Card for SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention
|
|
|
This repository contains the **SENTINEL** model, a fine-tuned version of `Qwen2.5-VL-7B-Instruct` designed to mitigate object hallucinations in Multimodal Large Language Models (MLLMs). SENTINEL introduces a novel framework for **S**entence-level **E**arly i**N**tervention **T**hrough **IN**-domain pr**E**ference **L**earning, eliminating the dependency on human annotations for hallucination mitigation. |
|
|
|
<a href='https://arxiv.org/abs/2507.12455'> |
|
<img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a> |
|
<a href='https://github.com/pspdada/SENTINEL'> |
|
<img src='https://img.shields.io/badge/Github-Repo-Green'></a> |
|
|
|
## Key Features |
|
|
|
* 🧠 **Early intervention halts hallucination propagation**: We find that hallucinations of MLLMs predominantly arise in early sentences and propagate through the rest of the output. SENTINEL interrupts this chain early to maximize mitigation. |
|
* 🔍 **In-domain contextual preference learning without human labels**: SENTINEL constructs hallucinated and factual samples via detector cross-validation and builds context-aware preference data without relying on proprietary LLMs or manual annotations (an illustrative sketch follows this list).
|
* 💡 **Context matters: rich coherence drives robustness**: By prioritizing context-coherent positive samples over hallucinated ones, SENTINEL significantly boosts generalization. |
|
* ♻️ **Iterative contextual bootstrapping for diverse hallucination-free contexts**: Our pipeline dynamically grows non-hallucinated contexts and expands coverage across varied scenes, improving robustness across generations. |
|
* 📊 **State-of-the-art results across benchmarks**: SENTINEL achieves **up to 92% reduction** in hallucinations and outperforms prior SOTA methods across Object HalBench, AMBER, and HallusionBench, while maintaining or improving general task performance. |
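
To make the preference-data construction idea concrete, here is a minimal, illustrative sketch of detector cross-validation for building sentence-level preference pairs. It is **not** the authors' pipeline: `detector_a`, `detector_b`, and `extract_objects` are hypothetical placeholders for open-vocabulary detectors and an object-mention extractor.

```python
def label_sentence(sentence, image, detector_a, detector_b, extract_objects):
    """Label a candidate sentence by cross-validating its object mentions
    against two (hypothetical) open-vocabulary detectors."""
    mentioned = extract_objects(sentence)                     # object names mentioned in the sentence
    found_a, found_b = detector_a(image), detector_b(image)   # object names each detector finds
    if all(obj in found_a and obj in found_b for obj in mentioned):
        return "factual"        # both detectors confirm every mentioned object
    if any(obj not in found_a and obj not in found_b for obj in mentioned):
        return "hallucinated"   # both detectors miss at least one mentioned object
    return "ambiguous"          # detectors disagree; discard the sentence


def build_preference_pair(context, candidates, image, detector_a, detector_b, extract_objects):
    """Pair one factual and one hallucinated continuation of a hallucination-free
    context, yielding the (y_win, y_lose) structure used for preference learning."""
    labels = {c: label_sentence(c, image, detector_a, detector_b, extract_objects)
              for c in candidates}
    wins = [c for c, label in labels.items() if label == "factual"]
    loses = [c for c, label in labels.items() if label == "hallucinated"]
    if wins and loses:
        return {"context": context, "y_win": wins[0], "y_lose": loses[0]}
    return None  # not enough signal for this context; sample more candidates
```

In the actual pipeline, accepted factual sentences also extend the context used for the next round of generation, which is the iterative contextual bootstrapping described above.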
|
|
|
## How to Use |
|
|
|
You can load and use the SENTINEL model with the Hugging Face `transformers` and `peft` libraries by applying the provided LoRA adapter weights to the base model.
|
|
|
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
from PIL import Image
import requests

# Load the base model and its processor
base_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model_id)

# Load the SENTINEL LoRA adapter weights
lora_model_id = "psp-dada/Qwen2.5-VL-7B-SENTINEL"
model = PeftModel.from_pretrained(model, lora_model_id)
# Optional: merge the adapter into the base model for direct use if no further training is planned
# model = model.merge_and_unload()

# Example: describe an image
image_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-images/resolve/main/r_and_c_cat.png"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": raw_image},
        {"type": "text", "text": "Describe the image in detail."},
    ]}
]

# Apply the chat template and prepare the inputs.
# The Qwen2.5-VL processor handles both vision and text tokenization.
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text_input], images=[raw_image], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens
output_text = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(output_text)
```
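
If you want a standalone checkpoint that no longer requires `peft` at load time, you can merge the adapter into the base weights and save the result (a minimal sketch; the output directory name is just an example, not an official checkpoint):

```python
# Merge the LoRA adapter into the base model weights and save a standalone copy.
# "Qwen2.5-VL-7B-SENTINEL-merged" is an example output path.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("Qwen2.5-VL-7B-SENTINEL-merged")
processor.save_pretrained("Qwen2.5-VL-7B-SENTINEL-merged")
```

The merged directory can then be loaded directly with `Qwen2_5_VLForConditionalGeneration.from_pretrained`, without the `PeftModel` wrapper.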
|
|
|
## Dataset |
|
|
|
We present the [**SENTINEL Dataset**](https://huggingface.co/datasets/psp-dada/SENTINEL), an in-domain multimodal preference dataset for mitigating object hallucinations, constructed **without** human annotation.
|
|
|
<details> |
|
<summary>Dataset details</summary> |
|
|
|
The SENTINEL dataset records preference pairs for the `LLaVA-v1.5`, `LLaVA-v1.6`, `Qwen2-VL`, and `Qwen2.5-VL` model families, enabling robust and scalable hallucination mitigation without external supervision.
|
|
|
It contains the following components: |
|
|
|
* `image_data.jsonl` file |
|
|
|
This file contains a selection of open-source images extracted from the Visual Genome dataset. It includes only three fields: `image_id`, `image_path`, and `question`, and is used to construct preference training data for hallucination suppression in image captioning tasks. |
|
|
|
**Note**: If you want to use the data from this file, please make sure to replace the `image_path` field with the path to your local copy of the Visual Genome dataset. |
|
|
|
* `<model_name>.json` files |
|
|
|
These files are the preference training datasets produced by the training data construction step, with each file corresponding to a specific model.
|
|
|
They include the necessary fields for **C-DPO training**, such as: `"question"`, `"context"`, `"y_win"`, and `"y_lose"`. |
|
|
|
<p align="center">
  <img src="/docs/figures/dataset.png" alt="Overview of the SENTINEL dataset" width="80%" />
</p>
|
</details> |
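
As a quick way to look at the data, the sketch below downloads one of the per-model preference files with `huggingface_hub` and prints a single record. The filename is illustrative; list the dataset repository to find the exact file for your model.

```python
import json

from huggingface_hub import hf_hub_download

# Download one per-model preference file from the SENTINEL dataset repository.
# The filename below is a guess for illustration; check the repo for the real names.
path = hf_hub_download(
    repo_id="psp-dada/SENTINEL",
    repo_type="dataset",
    filename="Qwen2.5-VL-7B.json",
)

with open(path) as f:
    records = json.load(f)

# Each record carries the fields used for C-DPO training.
sample = records[0]
print(sample["question"])  # the image-grounded prompt
print(sample["context"])   # hallucination-free context generated so far
print(sample["y_win"])     # preferred (factual) continuation
print(sample["y_lose"])    # rejected (hallucinated) continuation
```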
|
|
|
## Acknowledgement |
|
|
|
* [LLaVA](https://github.com/haotian-liu/LLaVA): LLaVA-v1.5 is an excellent open-source project on MLLMs. |
|
* [HA-DPO](https://github.com/opendatalab/HA-DPO): Our code for the LLaVA-v1.5 part is based on HA-DPO, an influential work in the field of object hallucination in MLLMs. It provided us with valuable inspiration. |
|
* [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): A unified and efficient fine-tuning framework for LLMs. Our implementations for LLaVA-v1.6, Qwen2-VL, and Qwen2.5-VL are based on this framework.
|
|
|
## Citation |
|
|
|
If you find our model, code, data, or paper helpful, please consider citing our paper 📝 and giving the repo a star ⭐️!
|
|
|
```bibtex
@article{peng2025mitigating,
  title={Mitigating Object Hallucinations via Sentence-Level Early Intervention},
  author={Peng, Shangpin and Yang, Senqiao and Jiang, Li and Tian, Zhuotao},
  journal={arXiv preprint arXiv:2507.12455},
  year={2025}
}
```