Improve model card: Add metadata, project page, code, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +67 -5
README.md CHANGED
@@ -1,14 +1,76 @@
  ---
- license: mit
  datasets:
  - PAPOGalaxy/PAPO_train
  ---

-
  # PAPO Model

- ## Model Source
- This is the official model released for paper **PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning** (arxiv.org/abs/2507.06448)

  ## Model Version
- PAPO (γ=0.01)
  ---
  datasets:
  - PAPOGalaxy/PAPO_train
+ license: mit
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - multimodal
+ - qwen2_5
  ---

  # PAPO Model

+ This repository contains the official model checkpoint for the paper **[Perception-Aware Policy Optimization for Multimodal Reasoning](https://arxiv.org/abs/2507.06448)**.
+
+ ## About
+ Perception-Aware Policy Optimization (PAPO) is a novel approach that extends Reinforcement Learning with Verifiable Rewards (RLVR) to significantly enhance the multimodal reasoning abilities of large language models. Addressing a major bottleneck where up to 67% of errors stem from poor visual perception rather than logical reasoning, PAPO encourages the model to learn to perceive while simultaneously learning to reason. This is achieved entirely through internal supervision signals, eliminating the need for additional data curation, external reward models, or proprietary models.
+
+ Specifically, PAPO integrates an Implicit Perception Loss as a KL divergence term within the GRPO objective, as sketched below. Despite its simplicity, this yields significant overall improvements (4.4%) on diverse multimodal benchmarks, with even more pronounced gains (approaching 8.0%) on vision-dependent tasks. It also reduces perception errors by 30.5%, demonstrating improved perceptual capabilities. PAPO serves as a direct drop-in replacement for GRPO.
+
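+ As a schematic sketch (notation ours, not from the paper; see the paper for the exact formulation), the PAPO training objective can be read as the GRPO objective plus a weighted perception term:
+
+ $$\mathcal{J}_{\text{PAPO}}(\theta) = \mathcal{J}_{\text{GRPO}}(\theta) + \gamma\,\mathcal{L}_{\text{prcp}}(\theta)$$
+
+ where L_prcp denotes the KL-divergence-based Implicit Perception Loss and γ controls its weight (γ = 0.01 for this checkpoint, see the model version below).
+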
+ ## Project Page
+ Find more details, visualizations, and updates on the official project page: [https://mikewangwzhl.github.io/PAPO/](https://mikewangwzhl.github.io/PAPO/)
+
+ ## Code
+ The official code repository for PAPO is available here: [https://github.com/mikewangwzhl/PAPO](https://github.com/mikewangwzhl/PAPO)
 
  ## Model Version
+ PAPO (γ=0.01)
+
+ ## Usage
+ This model can be loaded and used with the `transformers` library. It is based on Qwen2.5-VL, so make sure your installed `transformers` version supports Qwen2.5-VL models.
+
+ ```python
+ from io import BytesIO
+
+ import requests
+ from PIL import Image
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+
+ # Replace "PAPOGalaxy/PAPO-Qwen2_5-7B-VL" with the actual model ID if different
+ # The official checkpoints are available at: https://huggingface.co/collections/PAPOGalaxy/papo-qwen-686d92dd3d43b1ce698f851a
+ model_id = "PAPOGalaxy/PAPO-Qwen2_5-7B-VL"
+
+ # Qwen2.5-VL checkpoints load through the image-text-to-text auto classes
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForImageTextToText.from_pretrained(model_id)
+
+ # Load an example image from a URL (or open a local file with Image.open)
+ image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
+ image = Image.open(BytesIO(requests.get(image_url, timeout=30).content))
+
+
+ def ask(image: Image.Image, question: str, max_new_tokens: int = 512) -> str:
+     """Run a single image + text query through the model and return the decoded answer."""
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {"type": "image", "image": image},
+                 {"type": "text", "text": question},
+             ],
+         }
+     ]
+     # The chat template inserts the image placeholder tokens the model expects
+     text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
+     generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
+     # Drop the prompt tokens so only the newly generated answer is decoded
+     answer_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
+     return processor.batch_decode(answer_ids, skip_special_tokens=True)[0]
+
+
+ # Example 1: describe an image
+ print(ask(image, "Describe this image in detail."))
+
+ # Example 2: answer a question about the image
+ print(ask(image, "What is the main subject of this image?", max_new_tokens=100))
+ ```
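+
+ If a GPU is available, it is usually worth loading the weights in half precision and letting the library place them automatically. A minimal sketch (assumes `torch` and `accelerate` are installed; adjust to your hardware):
+
+ ```python
+ import torch
+ from transformers import AutoModelForImageTextToText
+
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id,                    # same model_id as above
+     torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
+     device_map="auto",           # place the model on the available GPU(s)
+ )
+ ```
+
+ The `ask` helper above already moves its inputs to `model.device`, so no other changes are needed.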