Enhance model card for ChartCap with metadata, links, and usage example (#1)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED

---
license: mit
pipeline_tag: image-to-text
library_name: transformers
tags:
- chart-captioning
- multimodal
- vision-language-model
---

# ChartCap: Mitigating Hallucination of Dense Chart Captioning

This repository contains the model presented in the paper [**ChartCap: Mitigating Hallucination of Dense Chart Captioning**](https://huggingface.co/papers/2508.03164).

**Project Page:** [https://junyoung-00.github.io/ChartCap/](https://junyoung-00.github.io/ChartCap/)
**Code:** [https://github.com/junyoung-00/ChartCap](https://github.com/junyoung-00/ChartCap)

## Model Description

`ChartCap` is a vision-language model fine-tuned to generate accurate, informative, hallucination-free captions for charts. It addresses the shortcomings of existing chart captioning models through two innovations: a high-quality training dataset and a novel evaluation metric.

The model generates dense captions for a variety of chart types, ensuring that the text accurately captures structural elements and key insights discernible from the chart while avoiding extraneous or hallucinated information.

## Key Features

* **Dense Chart Captioning:** Generates detailed, type-specific captions that highlight structural elements and key insights from charts.
* **Hallucination Mitigation:** Designed to avoid generating information that is not discernible from the chart itself.
* **Real-world Data:** Fine-tuned on `ChartCap`, a large-scale dataset of 565K real-world chart images with high-quality, dense captions.

## How to Use

You can use the model with the Hugging Face `transformers` library. It is built on a Phi-3.5-vision-instruct base, so it follows that model's multimodal chat template (a numbered `<|image_1|>` placeholder inside the user turn).

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

# Replace "your_model_id" with the actual model ID from the Hugging Face Hub,
# e.g. "junyoung-00/ChartCap-Phi3V" if the checkpoint is hosted there.
model_id = "your_model_id"

# Phi-3.5-vision-based checkpoints ship custom modeling code, so trust_remote_code is required.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Example chart image from the project page.
# For a local image: image = Image.open("path/to/your/chart_image.png").convert("RGB")
image_url = "https://junyoung-00.github.io/ChartCap/assets/images/teaser.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Phi-3.5-vision uses numbered image placeholders (<|image_1|>, <|image_2|>, ...).
prompt = "Describe this chart in detail, focusing on its structural elements and key insights."
messages = [
    {"role": "user", "content": f"<|image_1|>\n{prompt}"}
]

# Render the chat template to a string, then let the processor pair it with the image.
prompt_text = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt_text, [image], return_tensors="pt").to(model.device)

# Generate a caption and decode only the newly generated tokens.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response.strip())
```
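Note that the placeholder format (`<|image_1|>`) and the `trust_remote_code=True` arguments above follow Phi-3.5-vision-instruct conventions, which this model is stated to build on; if the checkpoint's Hub page ships its own usage snippet, prefer that. Slicing the generated IDs past `inputs["input_ids"].shape[1]` before decoding keeps the prompt and chat template out of the printed caption.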

## Dataset

This model was fine-tuned on **ChartCap**, a large-scale dataset of 565K real-world chart images paired with type-specific, dense captions. The dataset generation pipeline ensures captions are derived solely from data discernible in the chart, emphasizing structural elements and key insights to mitigate hallucination.
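To inspect the training data itself, a minimal sketch using the `datasets` library is below. It rests on assumptions: the dataset ID `junyoung-00/ChartCap` and the record schema are hypothetical, so check the project page or the Hub for the actual location and field names.

```python
from datasets import load_dataset

# Hypothetical dataset ID -- confirm the actual one on the project page or the Hub.
ds = load_dataset("junyoung-00/ChartCap", split="train")

# Inspect the schema; expect a chart image paired with its dense caption.
example = ds[0]
print(example.keys())
```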

## Citation

If you find this model or the associated research helpful, please consider citing the paper:

```bibtex
@article{Kim2025ChartCapMH,
  title={ChartCap: Mitigating Hallucination of Dense Chart Captioning},
  author={Junyoung Kim and Suhyang Gwon and Jonghun Kim and Hyeonseop Song and Seung-Hoon Na and Junmo Kim},
  journal={arXiv preprint arXiv:2508.03164},
  year={2025},
  url={https://arxiv.org/abs/2508.03164}
}
```