---
license: mit
pipeline_tag: image-to-text
library_name: transformers
tags:
- chart-captioning
- multimodal
- vision-language-model
---

# ChartCap: Mitigating Hallucination of Dense Chart Captioning

This repository contains the model presented in the paper [**ChartCap: Mitigating Hallucination of Dense Chart Captioning**](https://huggingface.co/papers/2508.03164).

**Project Page:** (WIP) [https://junyoung-00.github.io/ChartCap/](https://junyoung-00.github.io/ChartCap/)\
**Code:** [https://github.com/junyoung-00/ChartCap](https://github.com/junyoung-00/ChartCap)

## Model Description

`Phi-3.5-vision-instruct-ChartCap` is a version of [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) fine-tuned on the ChartCap dataset. The model generates dense, high-quality captions for charts, accurately describing the structural elements and key insights discernible from the chart while avoiding extraneous or hallucinated content.

## How to Use

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

model_id = "junyoung-00/Phi-3.5-vision-instruct-ChartCap"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load an example chart image (URL or local path)
image_url = "https://your-server.com/example_chart.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Phi-3.5-vision expects numbered image placeholders such as <|image_1|> in the prompt
prompt = "Please provide a detailed caption for the chart."
messages = [
    {"role": "user", "content": f"<|image_1|>\n{prompt}"}
]

# Build the text prompt with the chat template, then let the processor embed the image
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

# Generate the caption
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens and print the output
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response.strip())
```

## Citation

If you find this model or the associated research helpful, please cite:

```bibtex
@inproceedings{lim2025chartcap,
  title={ChartCap: Mitigating Hallucination of Dense Chart Captioning},
  author={Junyoung Lim and Jaewoo Ahn and Gunhee Kim},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```