|
---
license: mit
pipeline_tag: image-to-text
library_name: transformers
tags:
- chart-captioning
- multimodal
- vision-language-model
---
|
|
|
# ChartCap: Mitigating Hallucination of Dense Chart Captioning |
|
|
|
This repository contains the model presented in the paper [**ChartCap: Mitigating Hallucination of Dense Chart Captioning**](https://huggingface.co/papers/2508.03164). |
|
|
|
**Project Page** (WIP)**:** [https://junyoung-00.github.io/ChartCap/](https://junyoung-00.github.io/ChartCap/)
|
**Code:** [https://github.com/junyoung-00/ChartCap](https://github.com/junyoung-00/ChartCap) |
|
|
|
## Model Description |
|
|
|
`Phi-3.5-vision-instruct-ChartCap` is a version of [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) fine-tuned on the ChartCap dataset.
|
|
|
The model generates high-quality, dense captions for charts: text that accurately captures the structural elements and key insights discernible from a chart while avoiding extraneous or hallucinated content.
|
|
|
## How to Use |
|
|
|
```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "junyoung-00/Phi-3.5-vision-instruct-ChartCap"

# Phi-3.5-vision ships custom processing code, so trust_remote_code is required.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load an example chart image (URL or local path)
image_url = "https://your-server.com/example_chart.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Phi-3.5-vision expects numbered image placeholders such as <|image_1|>.
prompt = "Please provide a detailed caption for the chart."
messages = [
    {"role": "user", "content": f"<|image_1|>\n{prompt}"},
]

# Render the chat template to a prompt string, then let the processor
# pair it with the image.
prompt_text = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt_text, [image], return_tensors="pt").to(model.device)

# Generate the caption
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens so only the newly generated caption is decoded
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response.strip())
```
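
If captions need to be reproducible across runs, decoding can be made deterministic. The snippet below is a minimal sketch using standard `generate` keyword arguments (`do_sample`, `max_new_tokens`); the `generation_args` name is just illustrative, and these values are not settings prescribed by the paper:

```python
# Greedy (deterministic) decoding: same input image and prompt -> same caption.
generation_args = {
    "max_new_tokens": 512,  # upper bound on caption length
    "do_sample": False,     # disable sampling randomness
}
generated_ids = model.generate(**inputs, **generation_args)
```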
|
|
|
## Citation |
|
|
|
If you find this model or the associated research helpful, please cite: |
|
|
|
```bibtex
@inproceedings{lim2025chartcap,
  title={ChartCap: Mitigating Hallucination of Dense Chart Captioning},
  author={Junyoung Lim and Jaewoo Ahn and Gunhee Kim},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```