---
license: mit
pipeline_tag: image-to-text
library_name: transformers
tags:
- chart-captioning
- multimodal
- vision-language-model
---
# ChartCap: Mitigating Hallucination of Dense Chart Captioning
This repository contains the model presented in the paper [**ChartCap: Mitigating Hallucination of Dense Chart Captioning**](https://huggingface.co/papers/2508.03164).
**Project Page:** [https://junyoung-00.github.io/ChartCap/](https://junyoung-00.github.io/ChartCap/)\
**Code:** [https://github.com/junyoung-00/ChartCap](https://github.com/junyoung-00/ChartCap)
## Model Description
`Phi-3.5-vision-instruct-ChartCap` is a ChartCap-fine-tuned version of [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct).
The model generates high-quality, dense chart captions that faithfully capture the structural elements and key insights discernible from a chart while avoiding extraneous or hallucinated content.
## Required Packages
```bash
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```
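Assuming a standard pip environment, the pinned versions above can be installed directly. Note that `flash_attn` builds against CUDA and imports `torch` during its build, so installing it after `torch` with a matching CUDA toolchain available is the safer order:

```bash
pip install numpy==1.24.4 Pillow==10.3.0 Requests==2.31.0 torch==2.3.0 \
    torchvision==0.18.0 transformers==4.43.0 accelerate==0.30.0
pip install flash_attn==2.5.8
```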
## How to Use
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

model_id = "junyoung-00/Phi-3.5-vision-instruct-ChartCap"

# Phi-3.5-vision ships custom modeling/processor code, so trust_remote_code is required
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load an example chart image (URL or local path)
image_url = "https://your-server.com/example_chart.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Phi-3.5-vision expects numbered image placeholders: <|image_1|>, <|image_2|>, ...
prompt = "Please provide a detailed caption for the chart."
messages = [
    {"role": "user", "content": f"<|image_1|>\n{prompt}"},
]

# Render the chat template to a string, then let the processor pair
# the text with the image and build the model inputs
prompt_text = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt_text, [image], return_tensors="pt").to(model.device)

# Generate the caption
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Drop the prompt tokens so only the newly generated caption is decoded
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response.strip())
```
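A few notes on the snippet: `trust_remote_code=True` is needed because Phi-3.5-vision loads custom model and processor code from the repository; `<|image_1|>` is the numbered image token the model expects in the prompt; and slicing `generated_ids` past the prompt length ensures the decoded output contains only the caption. The image URL above is a placeholder, so substitute your own chart or open a local file with `Image.open("chart.png").convert("RGB")`.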
## Citation
If you find this model or the associated research helpful, please cite:
```bibtex
@inproceedings{lim2025chartcap,
  title     = {ChartCap: Mitigating Hallucination of Dense Chart Captioning},
  author    = {Junyoung Lim and Jaewoo Ahn and Gunhee Kim},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2025}
}
``` |