nielsr HF Staff committed · Commit b7fa79e · verified · 1 Parent(s): 81fddc1

Improve model card with tags, links, abstract, and usage example


This PR enhances the model card for **PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning** by:

- Adding `pipeline_tag: image-text-to-text` for better discoverability on the Hub.
- Specifying `library_name: transformers` so the Hub shows the correct library integration and usage snippet.
- Updating the paper link to the official Hugging Face paper page.
- Including direct links to the project page and the official GitHub repository for easier access to resources.
- Adding the paper abstract for a comprehensive overview of the model.
- Providing a Python code snippet to demonstrate how to use the model for inference.

These updates improve the model's visibility and usability for the community.
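For context on what the new `pipeline_tag` enables, here is a minimal sketch of loading the model through the `image-text-to-text` pipeline that tag maps to. The model ID is assumed (mirroring the usage snippet in the diff below) and the call is an untested illustration:

```python
from transformers import pipeline

# Hypothetical model ID, mirroring the usage snippet in the diff below
pipe = pipeline("image-text-to-text", model="PAPOGalaxy/PAPO")

# Chat-style input: the pipeline fetches and preprocesses the image itself
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=64))
```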

Files changed (1)
  1. README.md +69 -4
README.md CHANGED
@@ -1,14 +1,79 @@
 ---
-license: mit
 datasets:
 - PAPOGalaxy/PAPO_train
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
-
 # PAPO Model
 
 ## Model Source
-This is the official model released for our paper **PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning** (arxiv.org/abs/2507.06448)
+This is the official model released for our paper **[Perception-Aware Policy Optimization for Multimodal Reasoning](https://huggingface.co/papers/2507.06448)**.
+
+**Project Page**: [https://mikewangwzhl.github.io/PAPO/](https://mikewangwzhl.github.io/PAPO/)
+**Code**: [https://github.com/mikewangwzhl/PAPO](https://github.com/mikewangwzhl/PAPO)
+
+## Abstract
+Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning.
 
 ## Model Version
-PAPO - No RL_ref
+PAPO - No RL_ref
+
+## Usage
+You can use the model with the `transformers` library for multimodal reasoning.
+
+```python
+from transformers import AutoProcessor, AutoModelForImageTextToText
+from PIL import Image
+import torch
+import requests
+from io import BytesIO
+
+model_id = "PAPOGalaxy/PAPO"  # Replace with the actual model ID if different
+# The processor bundles the tokenizer and the image preprocessor
+processor = AutoProcessor.from_pretrained(model_id)
+model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+
+# Load your image (replace with your image path or a URL)
+image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
+image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")
+
+prompt_text = "Describe this image in detail."
+
+# The model expects a chat format with separate image and text content entries
+chat_history = [
+    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text}]},
+]
+
+# Render the chat template to a prompt string (with the image placeholder),
+# then let the processor tokenize the text and preprocess the image together
+text = processor.apply_chat_template(chat_history, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=text, images=image, return_tensors="pt")
+inputs = inputs.to(model.device)  # move all input tensors to the model's device
+
+# Generate output
+with torch.no_grad():
+    outputs = model.generate(**inputs, max_new_tokens=200)
+
+# Decode the full sequence (prompt plus completion)
+generated_text = processor.decode(outputs[0], skip_special_tokens=True)
+print(generated_text)
+```
+
+## Citation
+
+```bibtex
+@misc{wang2025perceptionawarepolicyoptimizationmultimodal,
+      title={Perception-Aware Policy Optimization for Multimodal Reasoning},
+      author={Zhenhailong Wang and Xuehang Guo and Sofia Stoica and Haiyang Xu and Hongru Wang and Hyeonjeong Ha and Xiusi Chen and Yangyi Chen and Ming Yan and Fei Huang and Heng Ji},
+      year={2025},
+      eprint={2507.06448},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2507.06448},
+}
+```
+
+## License
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
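As context for the abstract added above: PAPO's core change is an Implicit Perception Loss, a KL divergence term added to the GRPO objective and computed entirely from internal signals. Below is a minimal, hypothetical sketch of how such a term could be wired into a GRPO-style loss. The masked-image contrast, the `gamma` weight, and all names are illustrative assumptions, not the authors' implementation; see the paper and GitHub repository for the actual objective and the Double Entropy Loss used to mitigate the loss-hacking issue the abstract mentions.

```python
import torch
import torch.nn.functional as F

def implicit_perception_kl(logits_full: torch.Tensor,
                           logits_masked: torch.Tensor) -> torch.Tensor:
    """Per-token KL( p(y | q, image) || p(y | q, masked image) ), averaged.

    Illustrative only: the abstract says PAPO adds a KL-divergence
    'Implicit Perception Loss' to GRPO; using a masked image as the source
    of the second distribution is an assumption made for this sketch.
    """
    logp_full = F.log_softmax(logits_full, dim=-1)
    logp_masked = F.log_softmax(logits_masked, dim=-1)
    # KL(P || Q) = sum_v p(v) * (log p(v) - log q(v)), over the vocab axis
    kl = (logp_full.exp() * (logp_full - logp_masked)).sum(dim=-1)
    return kl.mean()

def papo_loss(grpo_loss: torch.Tensor,
              logits_full: torch.Tensor,
              logits_masked: torch.Tensor,
              gamma: float = 0.01) -> torch.Tensor:
    # Encouraging the policy to *depend* on the image means maximizing the
    # KL term, i.e. subtracting it from the loss being minimized.
    return grpo_loss - gamma * implicit_perception_kl(logits_full, logits_masked)
```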