|
---
datasets:
- PAPOGalaxy/PAPO_train
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- qwen2_5
---
|
|
|
# PAPO Model |
|
|
|
This repository contains the official model checkpoint for the paper **[Perception-Aware Policy Optimization for Multimodal Reasoning](https://arxiv.org/abs/2507.06448)**. |
|
|
|
## About |
|
Perception-Aware Policy Optimization (PAPO) extends Reinforcement Learning with Verifiable Rewards (RLVR) to substantially improve multimodal reasoning in large language models (LLMs). It targets a major bottleneck: up to 67% of reasoning errors stem from poor visual perception rather than flawed logic. PAPO trains the model to perceive while it learns to reason, relying entirely on internal supervision signals, with no additional data curation, external reward models, or proprietary models required.
|
|
|
Specifically, PAPO integrates an Implicit Perception Loss as a KL divergence term within the GRPO objective. Despite its simplicity, this yields significant overall improvements (4.4%) on diverse multimodal benchmarks, with even more pronounced gains (approaching 8.0%) on vision-dependent tasks. It also notably reduces perception errors by 30.5%, demonstrating improved perceptual capabilities. PAPO serves as a direct drop-in replacement for GRPO. |
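As an illustration only (not code from the PAPO repository), the Implicit Perception Loss can be sketched as a per-token KL divergence between the policy's token distributions conditioned on the original image versus a masked copy, which PAPO weights by γ inside the GRPO objective. The function name, shapes, and masking setup below are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def implicit_perception_kl(logits_full: torch.Tensor,
                           logits_masked: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of the Implicit Perception Loss.

    logits_full:   policy logits given the original image, shape (B, T, V)
    logits_masked: policy logits given a masked image,      shape (B, T, V)

    Computes the mean per-token KL(p_full || p_masked). A larger value
    means the response distribution actually depends on the visual input,
    which PAPO encourages (weighted by gamma in the GRPO objective).
    """
    logp_full = F.log_softmax(logits_full, dim=-1)
    logp_masked = F.log_softmax(logits_masked, dim=-1)
    # Per-token KL divergence over the vocabulary, then average.
    kl = (logp_full.exp() * (logp_full - logp_masked)).sum(dim=-1)
    return kl.mean()
```

Because KL divergence is non-negative and zero only when the two distributions match, this term is zero exactly when masking the image changes nothing about the model's predictions.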
|
|
|
## Project Page |
|
Find more details, visualizations, and updates on the official project page: [https://mikewangwzhl.github.io/PAPO/](https://mikewangwzhl.github.io/PAPO/) |
|
|
|
## Code |
|
The official code repository for PAPO is available here: [https://github.com/mikewangwzhl/PAPO](https://github.com/mikewangwzhl/PAPO) |
|
|
|
## Model Version |
|
PAPO (γ = 0.01), where γ is the weight of the Implicit Perception Loss.
|
|
|
## Usage |
|
This model can be loaded and used with the `transformers` library. Because the checkpoint is based on Qwen2.5-VL, use a recent release of `transformers` (v4.49 or later) that includes the Qwen2.5-VL model classes.
|
|
|
```python
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Replace "PAPOGalaxy/PAPO-Qwen2_5-7B-VL" with the actual model ID if different.
# The official checkpoints are available at:
# https://huggingface.co/collections/PAPOGalaxy/papo-qwen-686d92dd3d43b1ce698f851a
model_id = "PAPOGalaxy/PAPO-Qwen2_5-7B-VL"

processor = AutoProcessor.from_pretrained(model_id)
# Qwen2.5-VL is a vision-language model, so load it with the
# conditional-generation class (transformers >= 4.49), not AutoModelForCausalLM.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Example image loaded from a URL; a local path works as well.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
try:
    image = Image.open(BytesIO(requests.get(image_url, timeout=30).content))
except Exception as e:
    raise SystemExit(f"Error loading image from URL: {e}. "
                     "Please ensure the URL is valid or use a local path.")

def ask(image, question, max_new_tokens=512):
    """Run one image + text query through the model's chat template."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(images=[image], text=[text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Trim the prompt tokens so only the newly generated response is decoded.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Example: describe the image in detail.
print("Generated text:")
print(ask(image, "Describe this image in detail."))

# Example: answer a question about the image.
print("\nGenerated QA response:")
print(ask(image, "What is the main subject of this image?", max_new_tokens=100))
```