Improve model card: Add pipeline tag, library name, abstract, and links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +47 -5
README.md CHANGED
@@ -1,14 +1,56 @@
  ---
- license: mit
  datasets:
  - PAPOGalaxy/PAPO_train
  ---

-
  # PAPO Model

- ## Model Source
- This is the official model released for paper **PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning** (arxiv.org/abs/2507.06448)

  ## Model Version
- PAPO (NO KL_ref)
  ---
  datasets:
  - PAPOGalaxy/PAPO_train
+ license: mit
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---

  # PAPO Model

+ This is the official model for the paper [**Perception-Aware Policy Optimization for Multimodal Reasoning**](https://huggingface.co/papers/2507.06448).
+
+ ## Abstract
+ Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose **Perception-Aware Policy Optimization (PAPO)**, a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning.
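+
+ Schematically (illustrative notation only: the weight $\gamma$ and the exact construction, sign, and weighting of the KL term are defined in the paper), PAPO extends the GRPO objective with the Implicit Perception Loss:
+
+ $$\mathcal{J}_{\text{PAPO}}(\theta) = \mathcal{J}_{\text{GRPO}}(\theta) + \gamma \, \mathcal{L}_{\text{perception}}(\theta),$$
+
+ where $\mathcal{L}_{\text{perception}}$ is the KL-divergence-based term that encourages visually grounded reasoning.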
+
+ **Project Page**: [https://mikewangwzhl.github.io/PAPO/](https://mikewangwzhl.github.io/PAPO/)
+ **Code**: [https://github.com/mikewangwzhl/PAPO](https://github.com/mikewangwzhl/PAPO)

  ## Model Version
+ PAPO (NO KL_ref)
+
+ ## Usage
+
+ This model is designed for multimodal reasoning tasks, taking both image and text inputs to generate text. You can load it using the `transformers` library. For detailed usage instructions, including how to prepare multimodal inputs, please refer to the official project page and code repository.
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+
+ model_id = "mikewangwzhl/PAPO-7B-NO-KL" # Or the specific model identifier on the Hub
+
+ # Load the processor and the model. The checkpoint is a vision-language model,
+ # so it is loaded through the image-text-to-text auto class (matching the pipeline tag above).
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+
+ # Move the model to the appropriate device (e.g., GPU if available)
+ if torch.cuda.is_available():
+     model = model.to("cuda")
+
+ # Example of a text-only generation.
+ # For full multimodal usage, you would also prepare and pass image data (see below).
+ prompt = "Hello, what can you do?"
+ messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
+
+ # Apply the chat template and tokenize the prompt
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Generate a response
+ output_ids = model.generate(**inputs, max_new_tokens=100)
+
+ # Decode the generated text
+ generated_text = processor.decode(output_ids[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
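+
+ For image + text inputs, the sketch below continues from the snippet above (reusing `processor` and `model`) and follows the generic `transformers` image-text-to-text chat workflow; the exact preprocessing expected by this checkpoint may differ, so please defer to the official code repository. The image URL is only a placeholder.
+
+ ```python
+ import requests
+ from PIL import Image
+
+ # Load an example image (placeholder URL; replace with your own image)
+ image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+ image = Image.open(requests.get(image_url, stream=True).raw)
+
+ # A chat message that pairs an image with a text question
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "What is shown in this image? Explain your reasoning step by step."},
+         ],
+     }
+ ]
+
+ # Render the chat template to a prompt string, then let the processor
+ # tokenize the text and preprocess the image together
+ chat_prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(text=chat_prompt, images=image, return_tensors="pt").to(model.device)
+
+ # Generate and decode a response
+ output_ids = model.generate(**inputs, max_new_tokens=256)
+ print(processor.decode(output_ids[0], skip_special_tokens=True))
+ ```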