---
license: mit
pipeline_tag: image-to-text
library_name: transformers
tags:
- chart-captioning
- multimodal
- vision-language-model
---
# ChartCap: Mitigating Hallucination of Dense Chart Captioning
This repository contains the model presented in the paper [ChartCap: Mitigating Hallucination of Dense Chart Captioning](https://arxiv.org/abs/2508.03164).

- Project Page: https://junyoung-00.github.io/ChartCap/
- Code: https://github.com/junyoung-00/ChartCap
## Model Description
ChartCap is a vision-language model fine-tuned to generate accurate, informative, and hallucination-free captions for charts. It addresses the shortcomings of existing chart captioning models on two fronts: the data it is trained on (the ChartCap dataset) and how caption quality is measured (a novel evaluation metric introduced in the paper).

The model generates high-quality, dense captions for a variety of chart types, capturing the structural elements and key insights that are actually discernible from each chart while avoiding extraneous or hallucinated information.
## Key Features
- Dense Chart Captioning: Generates detailed, type-specific captions that highlight structural elements and key insights from charts.
- Hallucination Mitigation: Designed to reduce the generation of extraneous information not discernible from the chart data.
- Real-world Data: Fine-tuned on ChartCap, a large-scale dataset of 565K real-world chart images with high-quality, dense captions.
## How to Use
You can use the ChartCap model with the Hugging Face `transformers` library. The model is built on a Phi-3.5-vision-instruct base, so it follows that model's multimodal chat template: an image placeholder (`<|image_1|>`) followed by the text prompt. Because Phi-3.5-vision ships custom modeling code, loading it typically requires `trust_remote_code=True`.
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

# Replace "your_model_id" with the actual model ID from the Hugging Face Hub.
# For example, if this model is hosted at `junyoung-00/ChartCap-Phi3V`, use "junyoung-00/ChartCap-Phi3V".
model_id = "your_model_id"

# Phi-3.5-vision-based checkpoints ship custom code, so trust_remote_code=True is needed.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Example image: a chart from the project page (replace with your chart image URL or local path).
# For a local image: image = Image.open("path/to/your/chart_image.png").convert("RGB")
image_url = "https://junyoung-00.github.io/ChartCap/assets/images/teaser.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Define the prompt for dense chart captioning.
# Phi-3.5-vision expects an <|image_1|> placeholder for the (first) image in the user turn.
prompt = "Describe this chart in detail, focusing on its structural elements and key insights."
messages = [
    {"role": "user", "content": f"<|image_1|>\n{prompt}"}
]

# Render the chat template to a string, then let the processor pair it with the image.
prompt_text = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt_text, [image], return_tensors="pt").to(model.device)

# Generate the caption.
generated_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=512,
)

# Strip the prompt tokens, then decode and print the caption.
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response.strip())
```
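
If GPU memory is tight, the same checkpoint can usually be loaded with 4-bit quantization via `bitsandbytes`. The snippet below is a minimal sketch rather than part of the official release; it assumes the `bitsandbytes` package is installed and that the checkpoint loads cleanly under quantization.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization config (assumes `bitsandbytes` is installed).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "your_model_id",  # same placeholder model ID as above
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

The processor is loaded exactly as in the full-precision example; only the model loading changes.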
## Dataset
This model was fine-tuned on ChartCap, a large-scale dataset featuring 565K real-world chart images paired with type-specific, dense captions. The dataset generation pipeline ensures captions are derived solely from discernible chart data, emphasizing structural elements and key insights to mitigate hallucination.
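
If the dataset is published on the Hugging Face Hub, it can be browsed with the `datasets` library. The dataset ID and field names below are hypothetical placeholders; check the project page for the actual location and schema.

```python
from datasets import load_dataset

# Hypothetical dataset ID; substitute the real Hub ID from the project page.
dataset = load_dataset("junyoung-00/ChartCap", split="train")

# Each example is expected to pair a chart image with its dense caption
# (field names are assumptions and may differ in the released dataset).
example = dataset[0]
print(example.keys())
```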
## Citation
If you find this model or the associated research helpful, please consider citing the paper:
```bibtex
@article{Kim2025ChartCapMH,
  title   = {ChartCap: Mitigating Hallucination of Dense Chart Captioning},
  author  = {Junyoung Kim and Suhyang Gwon and Jonghun Kim and Hyeonseop Song and Seung-Hoon Na and Junmo Kim},
  journal = {arXiv preprint arXiv:2508.03164},
  year    = {2025},
  url     = {https://arxiv.org/abs/2508.03164}
}
```