Improve model card: Add `library_name`, abstract, project page, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +96 -5
README.md CHANGED
@@ -1,19 +1,110 @@
  ---
- license: apache-2.0
  datasets:
  - psp-dada/SENTINEL
  language:
  - en
- base_model:
- - Qwen/Qwen2-VL-7B-Instruct
  pipeline_tag: image-text-to-text
  ---

- # Model Card for SENTINEL:<br> Mitigating Object Hallucinations via Sentence-Level Early Intervention <!-- omit in toc -->

  <a href='https://arxiv.org/abs/2507.12455'>
  <img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a>
  <a href='https://github.com/pspdada/SENTINEL'>
  <img src='https://img.shields.io/badge/Github-Repo-Green'></a>

- For the details of this model, please refer to the [documentation](https://github.com/pspdada/SENTINEL?tab=readme-ov-file#-model-weights) of the GitHub repo.

  ---
+ base_model:
+ - Qwen/Qwen2-VL-7B-Instruct
  datasets:
  - psp-dada/SENTINEL
  language:
  - en
+ license: apache-2.0
  pipeline_tag: image-text-to-text
+ library_name: transformers
  ---

+ # Model Card for SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention

  <a href='https://arxiv.org/abs/2507.12455'>
  <img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a>
  <a href='https://github.com/pspdada/SENTINEL'>
  <img src='https://img.shields.io/badge/Github-Repo-Green'></a>
+ <a href='https://huggingface.co/collections/psp-dada/sentinel-686ea70912079af142015286'>
+ <img src='https://img.shields.io/badge/Project-HuggingFace_Collection-orange'></a>
+
+ ## Abstract
+
+ Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose **SENTINEL** (**S**entence-level **E**arly i**N**tervention **T**hrough **IN**-domain pr**E**ference **L**earning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at this https URL.
+
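The abstract names a context-aware preference loss (C-DPO) but does not give its formula here. For orientation only, the sketch below shows the standard DPO objective for a single preference pair, which the name suggests C-DPO builds on; that relationship, the function name `dpo_loss`, and the default `beta` are assumptions for illustration, not the authors' implementation. In SENTINEL's setting the "chosen" response would be a non-hallucinated sentence and the "rejected" one a hallucinated sentence for the same image and context.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (inputs are sequence log-probs).

    Illustrative only: this is plain DPO, not the paper's C-DPO variant.
    """
    # How much more the policy prefers the chosen response over the rejected
    # one, measured relative to the frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference assign identical log-probabilities the margin is zero and the loss equals log 2; it decreases as the policy shifts probability toward the chosen sentence.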
+ ## 🚀 Overview
+
+ **SENTINEL** introduces an automatic, sentence‑level early intervention strategy to prevent and mitigate object hallucinations in multimodal large language models (MLLMs). Key advantages:
+
+ - **Annotation‑free**: No human labeling required.
+ - **Model-agnostic**: Compatible with any MLLM architecture.
+ - **Efficient**: Lightweight LoRA fine‑tuning.
+
+ ## 🔑 Key Features
+
+ - 🧠 **Early intervention halts hallucination propagation**. We find that hallucinations of MLLMs predominantly arise in early sentences and propagate through the rest of the output. SENTINEL interrupts this chain early to maximize mitigation.
+ - 🔍 **In-domain contextual preference learning without human labels**. SENTINEL constructs hallucinated/factual samples via detector cross-validation and builds context-aware preference data without relying on proprietary LLMs or manual annotations.
+ - 💡 **Context matters: rich coherence drives robustness**. By prioritizing context-coherent positive samples over hallucinated ones, SENTINEL significantly boosts generalization.
+ - ♻️ **Iterative contextual bootstrapping for diverse hallucination-free contexts**. Our pipeline dynamically grows non-hallucinated contexts and expands coverage across varied scenes, improving robustness across generations.
+ - 📊 **State-of-the-art results across benchmarks**. SENTINEL achieves **up to 92% reduction** in hallucinations and outperforms prior SOTA methods across Object HalBench, AMBER, and HallusionBench, while maintaining or improving general task performance.
+
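The detector cross-validation mentioned in the feature list can be pictured with a small sketch. The README does not spell out the decision rule, so everything here is a hypothetical illustration: the function name `classify_sentence`, the three labels, and the rule that an object counts as present only when both detectors find it and as absent only when neither does.

```python
def classify_sentence(mentioned_objects, detector_a_hits, detector_b_hits):
    """Label a sentence by cross-checking its objects against two detectors.

    Hypothetical rule (not taken from the paper): an object is 'present' if
    both detectors report it, 'absent' if neither does, and 'uncertain' if
    the detectors disagree.
    """
    verdicts = set()
    for obj in mentioned_objects:
        in_a = obj in detector_a_hits
        in_b = obj in detector_b_hits
        if in_a and in_b:
            verdicts.add("present")
        elif not in_a and not in_b:
            verdicts.add("absent")
        else:
            verdicts.add("uncertain")

    # Any object that both detectors miss marks the sentence as hallucinated.
    if "absent" in verdicts:
        return "hallucinated"
    # Every mentioned object was confirmed twice (or none was mentioned).
    if verdicts <= {"present"}:
        return "non-hallucinated"
    # Detectors disagree on at least one object; such sentences could be skipped.
    return "uncertain"
```

Under this assumed rule, a sentence mentioning `dog` and `cat` is labeled hallucinated when both detectors report only `dog`, and sentences on which the detectors disagree are left out of the preference data.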
+ ## 📦 Model Weights
+
+ This model is a LoRA adapter for `Qwen/Qwen2-VL-7B-Instruct`. It can be seamlessly plugged into the corresponding base model for inference or further fine-tuning.
+
+ ## Usage
+
+ You can use this model with the Hugging Face `transformers` library. Qwen2-VL is supported natively (`transformers` 4.45 or later), so `trust_remote_code=True` is not required. Because this repository is a LoRA adapter, loading it directly with `from_pretrained` also requires the `peft` package.
+
+ ```python
+ from io import BytesIO
+
+ import requests
+ from PIL import Image
+ from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
+
+ # Load model and processor.
+ # Qwen2-VL has a dedicated model class in transformers, so no remote code is needed.
+ # With peft installed, from_pretrained resolves the adapter and its base model.
+ model_id = "psp-dada/Qwen2-VL-7B-SENTINEL"
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto",
+ )
+ # If the adapter repository ships no processor files, load the processor
+ # from the base model "Qwen/Qwen2-VL-7B-Instruct" instead.
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Example image from the SENTINEL GitHub repository
+ image_url = "https://raw.githubusercontent.com/pspdada/SENTINEL/main/docs/figures/figure1.png"
+ response = requests.get(image_url, timeout=30)
+ response.raise_for_status()
+ image = Image.open(BytesIO(response.content)).convert("RGB")
+
+ # Prepare messages following the Qwen2-VL chat template
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "Describe this image in detail."},
+         ],
+     }
+ ]
+
+ # Apply chat template and process inputs
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
+ inputs = inputs.to(model.device)
+
+ # Generate response (greedy decoding)
+ generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+
+ # Decode and print the output.
+ # generated_ids contains the input_ids as a prefix; trim them for clean output.
+ prompt_length = inputs["input_ids"].shape[1]
+ output_text = processor.batch_decode(
+     generated_ids[:, prompt_length:],
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False,
+ )[0]
+ print(output_text)
+ ```
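The README describes the checkpoint as a LoRA adapter for `Qwen/Qwen2-VL-7B-Instruct`. As a minimal sketch, the adapter could also be attached to the base model explicitly with `peft`, which makes the two-step loading visible and allows merging the weights. This assumes the repository stores a standard PEFT-format adapter and that `peft` and `accelerate` are installed; the helper name `load_sentinel` and its `merge` flag are illustrative, not part of the project.

```python
def load_sentinel(adapter_id="psp-dada/Qwen2-VL-7B-SENTINEL",
                  base_id="Qwen/Qwen2-VL-7B-Instruct",
                  merge=False):
    """Load the base model, attach the LoRA adapter, return (model, processor).

    Assumes `adapter_id` points to a PEFT-format LoRA adapter.
    """
    # Imported lazily so that defining the helper does not need the heavy deps.
    from peft import PeftModel
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    # Step 1: load the unmodified base model.
    base = Qwen2VLForConditionalGeneration.from_pretrained(
        base_id, torch_dtype="auto", device_map="auto"
    )
    # Step 2: wrap it with the LoRA adapter weights.
    model = PeftModel.from_pretrained(base, adapter_id)
    if merge:
        # Fold the LoRA updates into the base weights for adapter-free inference.
        model = model.merge_and_unload()

    # The processor (tokenizer plus image preprocessing) comes from the base model.
    processor = AutoProcessor.from_pretrained(base_id)
    return model, processor
```

Calling `load_sentinel(merge=True)` returns a plain Qwen2-VL model with the adapter merged in, which may be convenient for inference; keeping `merge=False` leaves the adapter separate, which is the usual starting point for further fine-tuning.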
+
+ ## 📝 Citation
+
+ If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!

+ ```bibtex
+ @article{peng2025mitigating,
+   title={Mitigating Object Hallucinations via Sentence-Level Early Intervention},
+   author={Peng, Shangpin and Yang, Senqiao and Jiang, Li and Tian, Zhuotao},
+   journal={arXiv preprint arXiv:2507.12455},
+   year={2025}
+ }
+ ```