|
---
datasets:
- PAPOGalaxy/PAPO_train
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- qwen2_5
---
|
|
|
# PAPO Model |
|
|
|
This repository contains the official model checkpoint for the paper **[Perception-Aware Policy Optimization for Multimodal Reasoning](https://arxiv.org/abs/2507.06448)**. |
|
|
|
## About |
|
Perception-Aware Policy Optimization (PAPO) extends Reinforcement Learning with Verifiable Rewards (RLVR) to substantially improve multimodal reasoning in large language models (LLMs). It targets a major bottleneck: up to 67% of reasoning errors stem from poor visual perception rather than flawed logic. PAPO trains the model to perceive while it learns to reason, relying entirely on internal supervision signals, with no additional data curation, external reward models, or proprietary models required.
|
|
|
Specifically, PAPO integrates an Implicit Perception Loss as a KL divergence term within the GRPO objective. Despite its simplicity, this yields significant overall improvements (4.4%) on diverse multimodal benchmarks, with even more pronounced gains (approaching 8.0%) on vision-dependent tasks. It also notably reduces perception errors by 30.5%, demonstrating improved perceptual capabilities. PAPO serves as a direct drop-in replacement for GRPO. |
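As an illustration only (not code from the PAPO repository), the Implicit Perception Loss can be sketched as a per-token KL divergence between the policy's token distributions conditioned on the original image versus a masked copy, which PAPO weights by γ inside the GRPO objective. The function name, shapes, and masking setup below are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def implicit_perception_kl(logits_full: torch.Tensor,
                           logits_masked: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of the Implicit Perception Loss.

    logits_full:   policy logits given the original image, shape (B, T, V)
    logits_masked: policy logits given a masked image,      shape (B, T, V)

    Computes the mean per-token KL(p_full || p_masked). A larger value
    means the response distribution actually depends on the visual input,
    which PAPO encourages (weighted by gamma in the GRPO objective).
    """
    logp_full = F.log_softmax(logits_full, dim=-1)
    logp_masked = F.log_softmax(logits_masked, dim=-1)
    # Per-token KL divergence over the vocabulary, then average.
    kl = (logp_full.exp() * (logp_full - logp_masked)).sum(dim=-1)
    return kl.mean()
```

Because KL divergence is non-negative and zero only when the two distributions match, this term is zero exactly when masking the image changes nothing about the model's predictions.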
|
|
|
## Project Page |
|
Find more details, visualizations, and updates on the official project page: [https://mikewangwzhl.github.io/PAPO/](https://mikewangwzhl.github.io/PAPO/) |
|
|
|
## Code |
|
The official code repository for PAPO is available here: [https://github.com/mikewangwzhl/PAPO](https://github.com/mikewangwzhl/PAPO) |
|
|
|
## Model Version |
|
PAPO (γ = 0.01), where γ is the weight of the Implicit Perception Loss.
|
|
|
## Usage |
|
This model can be loaded and used with the `transformers` library. Because the checkpoint is based on Qwen2.5-VL, use a recent release of `transformers` (v4.49 or later) that includes the Qwen2.5-VL model classes.
|
|
|
```python
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Replace "PAPOGalaxy/PAPO-Qwen2_5-7B-VL" with the actual model ID if different.
# The official checkpoints are available at:
# https://huggingface.co/collections/PAPOGalaxy/papo-qwen-686d92dd3d43b1ce698f851a
model_id = "PAPOGalaxy/PAPO-Qwen2_5-7B-VL"

processor = AutoProcessor.from_pretrained(model_id)
# Qwen2.5-VL is a vision-language model, so load it with the
# conditional-generation class (transformers >= 4.49), not AutoModelForCausalLM.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Example image loaded from a URL; a local path works as well.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
try:
    image = Image.open(BytesIO(requests.get(image_url, timeout=30).content))
except Exception as e:
    raise SystemExit(f"Error loading image from URL: {e}. "
                     "Please ensure the URL is valid or use a local path.")

def ask(image, question, max_new_tokens=512):
    """Run one image + text query through the model's chat template."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(images=[image], text=[text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Trim the prompt tokens so only the newly generated response is decoded.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Example: describe the image in detail.
print("Generated text:")
print(ask(image, "Describe this image in detail."))

# Example: answer a question about the image.
print("\nGenerated QA response:")
print(ask(image, "What is the main subject of this image?", max_new_tokens=100))
```