---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- psp-dada/SENTINEL
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# Model Card for SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention
This repository contains the SENTINEL model, a fine-tuned version of Qwen2.5-VL-7B-Instruct
designed to mitigate object hallucinations in Multimodal Large Language Models (MLLMs). SENTINEL introduces a novel framework for Sentence-level Early iNtervention Through IN-domain prEference Learning, eliminating the dependency on human annotations for hallucination mitigation.
## Key Features
- 🧠 Early intervention halts hallucination propagation: We find that hallucinations of MLLMs predominantly arise in early sentences and propagate through the rest of the output. SENTINEL interrupts this chain early to maximize mitigation.
- 🔍 In-domain contextual preference learning without human labels: SENTINEL constructs hallucinated/factual samples via detector cross-validation and builds context-aware preference data without relying on proprietary LLMs or manual annotations (a conceptual sketch follows this list).
- 💡 Context matters: rich coherence drives robustness: By prioritizing context-coherent positive samples over hallucinated ones, SENTINEL significantly boosts generalization.
- ♻️ Iterative contextual bootstrapping for diverse hallucination-free contexts: Our pipeline dynamically grows non-hallucinated contexts and expands coverage across varied scenes, improving robustness across generations.
- 📊 State-of-the-art results across benchmarks: SENTINEL achieves up to 92% reduction in hallucinations and outperforms prior SOTA methods across Object HalBench, AMBER, and HallusionBench, while maintaining or improving general task performance.
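To make the detector cross-validation idea above concrete, here is a conceptual sketch, not the released pipeline: a candidate sentence is kept as a factual positive only if every object it mentions is confirmed by two independent detectors, and treated as a hallucinated negative if both detectors reject some mentioned object. The `label_sentence`, `detect_a`, and `detect_b` names and the exact agreement rule are illustrative assumptions.

```python
from typing import Callable, Iterable

def label_sentence(
    objects: Iterable[str],           # object mentions extracted from one generated sentence
    detect_a: Callable[[str], bool],  # detector A: is this object visible in the image?
    detect_b: Callable[[str], bool],  # detector B: an independent second detector
) -> str:
    """Conceptual cross-validation rule for building preference data without human labels."""
    verdicts = [(detect_a(obj), detect_b(obj)) for obj in objects]
    if all(a and b for a, b in verdicts):
        return "factual"        # both detectors confirm every mentioned object
    if any(not a and not b for a, b in verdicts):
        return "hallucinated"   # both detectors reject at least one mentioned object
    return "ambiguous"          # detectors disagree; such sentences can simply be discarded

# Toy usage with stub detectors that only "see" a cat and a sofa.
seen = {"cat", "sofa"}
print(label_sentence(["cat", "dog"], lambda o: o in seen, lambda o: o in seen))  # -> hallucinated
```

Sentences labeled this way can then serve as the preferred and rejected continuations in context-aware preference pairs.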
## How to Use
You can load and use the SENTINEL model with the Hugging Face `transformers` and `peft` libraries by combining the base model with the provided LoRA adapter weights:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image
import requests

# Load the base model and its processor
base_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model_id)

# Load the SENTINEL LoRA adapter weights
lora_model_id = "psp-dada/Qwen2.5-VL-7B-SENTINEL"
model = PeftModel.from_pretrained(model, lora_model_id)

# Optional: merge the adapter weights into the base model for direct use if no further training is planned
# model = model.merge_and_unload()

# Example: describe an image
image_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-images/resolve/main/r_and_c_cat.png"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": raw_image},
        {"type": "text", "text": "Describe the image in detail."}
    ]}
]

# Apply the chat template and prepare inputs.
# The Qwen2.5-VL processor handles both vision and text tokenization.
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text_input], images=[raw_image], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens
output_text = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(output_text)
```
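If you prefer a standalone checkpoint that no longer depends on `peft` at load time, you can merge the adapter into the base weights and save the result. A minimal sketch; the output directory name is just a local placeholder, not an official checkpoint:

```python
# Merge the LoRA adapter into the base model and save a standalone copy.
merged = model.merge_and_unload()
merged.save_pretrained("./Qwen2.5-VL-7B-SENTINEL-merged")    # placeholder local path
processor.save_pretrained("./Qwen2.5-VL-7B-SENTINEL-merged")
```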
## Dataset
We present the SENTINEL Dataset, an in-domain multimodal preference dataset for mitigating object hallucinations, constructed without human annotation.
### Dataset details
The SENTINEL dataset records preference pairs for the LLaVA-v1.5, LLaVA-v1.6, Qwen2-VL, and Qwen2.5-VL families, enabling robust and scalable hallucination mitigation without external supervision.
It contains the following components:
- `image_data.jsonl` file: This file contains a selection of open-source images extracted from the Visual Genome dataset. It includes only three fields: `image_id`, `image_path`, and `question`, and is used to construct preference training data for hallucination suppression in image captioning tasks.
  Note: If you want to use the data from this file, please make sure to replace the `image_path` field with the path to your local copy of the Visual Genome dataset.
- `<model_name>.json` files: These files are the preference training datasets generated by the training data construction step, with each file corresponding to a specific model. They include the fields required for C-DPO training, such as `"question"`, `"context"`, `"y_win"`, and `"y_lose"`.
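For illustration, here is a minimal sketch of consuming these files; the local file names, the Visual Genome root, and the per-model file name (`Qwen2.5-VL.json`) are placeholders, not guaranteed paths inside the released dataset:

```python
import json
import os

VG_ROOT = "/data/visual_genome"  # path to your local Visual Genome copy (placeholder)

# Point image_path at the local Visual Genome copy, as the note above requires.
with open("image_data.jsonl") as f:
    image_records = [json.loads(line) for line in f]
for rec in image_records:
    rec["image_path"] = os.path.join(VG_ROOT, os.path.basename(rec["image_path"]))

# Inspect one C-DPO preference pair from a per-model file (file name is illustrative).
with open("Qwen2.5-VL.json") as f:
    pairs = json.load(f)
sample = pairs[0]
print(sample["question"])
print(sample["context"])
print("preferred (y_win): ", sample["y_win"])
print("rejected  (y_lose):", sample["y_lose"])
```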
## Acknowledgement
- LLaVA: LLaVA-v1.5 is an excellent open-source project on MLLMs.
- HA-DPO: Our code for the LLaVA-v1.5 part is based on HA-DPO, an influential work in the field of object hallucination in MLLMs. It provided us with valuable inspiration.
- LLaMA-Factory: A unified and efficient fine-tuning framework for LLMs. Our implementations for LLaVA-v1.6, Qwen2-VL, and Qwen2.5-VL are based on this framework.
## Citation
If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!
```bibtex
@article{peng2025mitigating,
  title={Mitigating Object Hallucinations via Sentence-Level Early Intervention},
  author={Peng, Shangpin and Yang, Senqiao and Jiang, Li and Tian, Zhuotao},
  journal={arXiv preprint arXiv:2507.12455},
  year={2025}
}
```