Improve model card: Add library name, key features, and usage example

#1
by nielsr · opened
Files changed (1)
  README.md +109 -5
README.md CHANGED
@@ -1,19 +1,123 @@
  ---
- license: apache-2.0
  datasets:
  - psp-dada/SENTINEL
  language:
  - en
- base_model:
- - Qwen/Qwen2.5-VL-7B-Instruct
  pipeline_tag: image-text-to-text
  ---

- # Model Card for SENTINEL:<br> Mitigating Object Hallucinations via Sentence-Level Early Intervention <!-- omit in toc -->

  <a href='https://arxiv.org/abs/2507.12455'>
  <img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a>
  <a href='https://github.com/pspdada/SENTINEL'>
  <img src='https://img.shields.io/badge/Github-Repo-Green'></a>

- For the details of this model, please refer to the [documentation](https://github.com/pspdada/SENTINEL?tab=readme-ov-file#-model-weights) of the GitHub repo.

  ---
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
  datasets:
  - psp-dada/SENTINEL
  language:
  - en
+ license: apache-2.0
  pipeline_tag: image-text-to-text
+ library_name: transformers
  ---

+ # Model Card for SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention
+
+ This repository contains the **SENTINEL** model, a fine-tuned version of `Qwen2.5-VL-7B-Instruct` designed to mitigate object hallucinations in Multimodal Large Language Models (MLLMs). SENTINEL introduces a framework for **S**entence-level **E**arly i**N**tervention **T**hrough **IN**-domain pr**E**ference **L**earning that eliminates the dependency on human annotations for hallucination mitigation.
+
  <a href='https://arxiv.org/abs/2507.12455'>
  <img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a>
  <a href='https://github.com/pspdada/SENTINEL'>
  <img src='https://img.shields.io/badge/Github-Repo-Green'></a>

+ ## Key Features
+
+ * 🧠 **Early intervention halts hallucination propagation**: We find that hallucinations of MLLMs predominantly arise in early sentences and propagate through the rest of the output. SENTINEL interrupts this chain early to maximize mitigation.
+ * 🔍 **In-domain contextual preference learning without human labels**: SENTINEL constructs hallucinated/factual samples via detector cross-validation and builds context-aware preference data without relying on proprietary LLMs or manual annotations.
+ * 💡 **Context matters: rich coherence drives robustness**: By prioritizing context-coherent positive samples over hallucinated ones, SENTINEL significantly boosts generalization.
+ * ♻️ **Iterative contextual bootstrapping for diverse hallucination-free contexts**: Our pipeline dynamically grows non-hallucinated contexts and expands coverage across varied scenes, improving robustness across generations.
+ * 📊 **State-of-the-art results across benchmarks**: SENTINEL achieves **up to a 92% reduction** in hallucinations and outperforms prior SOTA methods on Object HalBench, AMBER, and HallusionBench, while maintaining or improving general task performance.
+
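+ To make the in-domain preference-learning setup concrete, the sketch below shows the shape of a single context-aware preference record. The field names follow the `<model_name>.json` files described in the Dataset section; the example values are invented purely for illustration.
+
+ ```python
+ # Illustrative (made-up) context-aware preference record. "context" is a
+ # hallucination-free prefix already generated for an image, "y_win" continues it
+ # faithfully, and "y_lose" introduces a hallucinated object; C-DPO training
+ # teaches the model to prefer y_win over y_lose given the same context.
+ preference_record = {
+     "question": "Describe the image in detail.",
+     "context": "A man in a gray jacket is riding a bicycle along a tree-lined path.",
+     "y_win": "He wears a dark helmet and keeps both hands on the handlebars.",
+     "y_lose": "A dog runs beside him carrying a bright red frisbee.",
+ }
+ ```
+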
+ ## How to Use
+
+ You can load SENTINEL with the Hugging Face `transformers` and `peft` libraries by applying the LoRA adapter from this repository on top of the base model.
+
+ ```python
+ import requests
+ from PIL import Image
+ from peft import PeftModel
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+
+ # Load the base model and its processor
+ base_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     base_model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(base_model_id)
+
+ # Load the SENTINEL LoRA adapter weights
+ lora_model_id = "psp-dada/Qwen2.5-VL-7B-SENTINEL"
+ model = PeftModel.from_pretrained(model, lora_model_id)
+ # Optional: merge adapter weights into the base model for direct usage if no further training is planned
+ # model = model.merge_and_unload()
+
+ # Example: describe an image
+ image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": raw_image},
+             {"type": "text", "text": "Describe the image in detail."},
+         ],
+     }
+ ]
+
+ # Apply the chat template and prepare inputs;
+ # the Qwen2.5-VL processor handles both vision and text tokenization.
+ text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text_input], images=[raw_image], return_tensors="pt").to(model.device)
+
+ # Generate a response
+ generated_ids = model.generate(**inputs, max_new_tokens=512)
+
+ # Decode only the newly generated tokens
+ output_text = processor.batch_decode(
+     generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
+ )[0]
+ print(output_text)
+ ```
+
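+ If you plan to serve the adapted model without `peft` at inference time, you can merge the adapter into the base weights and save a standalone checkpoint. This is a minimal sketch building on the snippet above; the output directory name is only an example.
+
+ ```python
+ # Merge the LoRA adapter into the base weights and save a standalone checkpoint.
+ # Assumes `model` (a PeftModel) and `processor` from the snippet above;
+ # the output directory name is illustrative.
+ merged_model = model.merge_and_unload()
+ merged_model.save_pretrained("Qwen2.5-VL-7B-SENTINEL-merged")
+ processor.save_pretrained("Qwen2.5-VL-7B-SENTINEL-merged")
+ ```
+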
76
+ ## Dataset
77
+
78
+ We present the [**SENTINEL Dataset**](https://huggingface.co/datasets/psp-dada/SENTINEL), a in-domain multimodal preference dataset for mitigating object hallucination constructed **without** human annotation.
79
+
80
+ <details>
81
+ <summary>Dataset details</summary>
82
+
83
+ The SENTINEL dataset records the preference pairs of the `LLaVA-v1.5`, `LLaVA-v1.6`, `Qwen2-VL` and `Qwen2.5-VL` family, enabling robust and scalable hallucination mitigation without external supervision.
84
+
85
+ It contains the following components:
86
+
87
+ * `image_data.jsonl` file
88
+
89
+ This file contains a selection of open-source images extracted from the Visual Genome dataset. It includes only three fields: `image_id`, `image_path`, and `question`, and is used to construct preference training data for hallucination suppression in image captioning tasks.
90
+
91
+ **Note**: If you want to use the data from this file, please make sure to replace the `image_path` field with the path to your local copy of the Visual Genome dataset.
92
+
93
+ * `<model_name>.json` files
94
+
95
+ These files represent the preference training datasets generated after the training data construction step, with each file corresponding to a specific model.
96
+
97
+ They include the necessary fields for **C-DPO training**, such as: `"question"`, `"context"`, `"y_win"`, and `"y_lose"`.
98
+
99
+ <table align="center">
100
+ <p align="center">
101
+ <img src="/docs/figures/dataset.png" width="80%" />
102
+ </p>
103
+ </table>
104
+ </details>
105
+
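+ As a minimal sketch of how these files could be loaded for inspection (the local Visual Genome directory, the data directory, and the `Qwen2.5-VL-7B-Instruct.json` file name below are assumptions; adjust them to your setup and to the actual file names in the dataset repo):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ # Assumed local paths (illustrative only).
+ VG_ROOT = Path("/data/visual_genome/images")
+ DATA_DIR = Path("./SENTINEL")
+
+ # image_data.jsonl: one record per line with image_id, image_path, and question.
+ # Remap image_path to the local Visual Genome copy, as noted above.
+ images = []
+ with open(DATA_DIR / "image_data.jsonl") as f:
+     for line in f:
+         record = json.loads(line)
+         record["image_path"] = str(VG_ROOT / Path(record["image_path"]).name)
+         images.append(record)
+
+ # <model_name>.json: preference pairs with question, context, y_win, and y_lose.
+ with open(DATA_DIR / "Qwen2.5-VL-7B-Instruct.json") as f:  # hypothetical file name
+     pairs = json.load(f)
+
+ print(f"{len(images)} images, {len(pairs)} preference pairs")
+ print({k: pairs[0][k] for k in ("question", "context", "y_win", "y_lose")})
+ ```
+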
+ ## Acknowledgement
+
+ * [LLaVA](https://github.com/haotian-liu/LLaVA): LLaVA-v1.5 is an excellent open-source project on MLLMs.
+ * [HA-DPO](https://github.com/opendatalab/HA-DPO): Our code for the LLaVA-v1.5 part is based on HA-DPO, an influential work on object hallucination in MLLMs that provided us with valuable inspiration.
+ * [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): A unified and efficient fine-tuning framework for LLMs. Our implementations for LLaVA-v1.6, Qwen2-VL, and Qwen2.5-VL are based on this framework.
+
+ ## Citation
+
+ If you find our model, code, data, or paper helpful, please consider citing our paper 📝 and giving us a star ⭐️!
+
+ ```bibtex
+ @article{peng2025mitigating,
+   title={Mitigating Object Hallucinations via Sentence-Level Early Intervention},
+   author={Peng, Shangpin and Yang, Senqiao and Jiang, Li and Tian, Zhuotao},
+   journal={arXiv preprint arXiv:2507.12455},
+   year={2025}
+ }
+ ```