---
base_model:
- lmsys/vicuna-7b-v1.1
datasets:
- MovieCORE/MovieCORE
- Enxin/MovieChat-1K-test
license: mit
pipeline_tag: video-text-to-text
---

<div align="center">
  <img src="https://github.com/joslefaure/MovieCORE/raw/main/assets/moviecore_icon.png" alt="MovieCORE Icon" width="150"/>
  
  # MovieCORE: COgnitive REasoning in Movies
  
  **A Video Question Answering Dataset for Probing Deeper Cognitive Understanding of Movie Content**
  
  [![arXiv](https://img.shields.io/badge/arXiv-2508.19026-b31b1b.svg)](https://arxiv.org/abs/2508.19026)
  [![Hugging Face Paper](https://img.shields.io/badge/%F0%9F%A4%97%20Paper-HuggingFace-blue)](https://huggingface.co/papers/2508.19026)
  [![Hugging Face Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-HuggingFace-yellow.svg)](https://huggingface.co/datasets/MovieCORE/MovieCORE)
  [![GitHub Code](https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github)](https://github.com/joslefaure/moviecore)
  [![Project Page](https://img.shields.io/badge/Project%20Page-Website-green.svg)](https://joslefaure.github.io/assets/html/moviecore.html)
  [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/joslefaure/MovieCORE/blob/main/LICENSE)
  
  ![MovieCore Dataset Teaser](https://github.com/joslefaure/MovieCORE/raw/main/assets/poster_teaser.png)
</div>

## 📖 Overview

MovieCORE is a video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike traditional VQA datasets that focus on surface-level visual understanding, MovieCORE challenges models to reason about narrative structure, character development, thematic elements, and complex temporal relationships in cinematic content.

## 🗂️ Data Preparation

The MovieCORE dataset builds upon video content from MovieChat. To get started:

### Video Data
Download the video files from MovieChat's HuggingFace repositories:
- **Training Data**: [MovieChat-1K Train](https://huggingface.co/datasets/Enxin/MovieChat-1K_train)
- **Test Data**: [MovieChat-1K Test](https://huggingface.co/datasets/Enxin/MovieChat-1K-test)

### Annotations
Access our annotations on HuggingFace:
- **MovieCORE Annotations**: [🤗 HuggingFace Dataset](https://huggingface.co/datasets/MovieCORE/MovieCORE/tree/main)

Extract and organize the data according to your model's requirements, then use our annotations for evaluation.
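
As a starting point, the sketch below shows one way to fetch the test videos and load the annotations with the `huggingface_hub` and `datasets` libraries. The `local_dir` layout and the assumption that the annotations load directly via `load_dataset` are illustrative, not prescriptive; organize the files however your model expects.

```python
# Minimal sketch: download MovieChat-1K test videos and load MovieCORE annotations.
# Assumes `pip install huggingface_hub datasets`; the local paths are illustrative.
from huggingface_hub import snapshot_download
from datasets import load_dataset

# Fetch the raw video files from MovieChat's dataset repository.
video_dir = snapshot_download(
    repo_id="Enxin/MovieChat-1K-test",
    repo_type="dataset",
    local_dir="data/moviechat_test",
)

# Load the MovieCORE question-answer annotations.
annotations = load_dataset("MovieCORE/MovieCORE")
print(annotations)
```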

## 🚀 Quick Start

### Installation
```bash
git clone https://github.com/joslefaure/MovieCORE.git
cd MovieCORE
```

## 🎯 Baselines
- We provide a script to run [HERMES](https://github.com/joslefaure/HERMES) (ICCV'25) on MovieCORE; see the linked project for details.

## 📊 Evaluation Dimensions

MovieCORE employs a comprehensive multi-dimensional evaluation framework to assess model performance across different aspects of cognitive understanding:

| Dimension | Description |
|-----------|-------------|
| **🎯 Accuracy** | Measures semantic similarity between predicted and ground truth answers |
| **📋 Comprehensiveness** | Assesses coverage of all key aspects mentioned in the ground truth |
| **🧠 Depth** | Evaluates level of reasoning and insight demonstrated in predictions |
| **🔍 Evidence** | Checks quality and relevance of supporting evidence provided |
| **🔗 Coherence** | Measures logical flow, organization, and clarity of responses |

Each dimension provides unique insights into different cognitive capabilities required for deep video understanding.

## 💻 Usage

### Evaluation Script

Evaluate your model's performance on MovieCORE using our evaluation script:

```bash
export OPENAI_API_KEY='your_openai_api_key'
python evaluate_moviecore.py --pred_path path/to/your/predictions.json
```

### 📝 Input Format

Your predictions should follow this JSON structure:

```json
{
    "video_1.mp4": [
        {
            "question": "How does the video depict the unique adaptations of the species in the Sahara Desert, and what roles do these species play in their ecosystem?",
            "answer": "The ground truth answer.",
            "pred": "Your model's prediction.",
            "classification": "the question classification"
        },
        {
            "question": "The second question for video 1?",
            "answer": "The ground truth answer.",
            "pred": "Your model's prediction.",
            "classification": "the question classification"
        }
    ],
    "video_2.mp4": [
        {
            "question": "The only question for video 2",
            "answer": "The ground truth answer.",
            "pred": "Your model's prediction.",
            "classification": "the question classification"
        }
    ]
}
```
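
As a concrete illustration, here is a minimal sketch for assembling predictions into this structure; `run_model` is a hypothetical stand-in for your own inference call, and the annotation field names (`video`, `question`, `answer`, `classification`) are assumed to mirror the format above.

```python
# Minimal sketch: build a predictions.json in the expected format.
# `run_model` is a hypothetical placeholder for your model's inference call.
import json
from collections import defaultdict

def run_model(video_path: str, question: str) -> str:
    raise NotImplementedError("Replace with your model's inference call.")

# `annotations` is assumed to be an iterable of dicts with the fields shown above.
predictions = defaultdict(list)
for item in annotations:
    predictions[item["video"]].append({
        "question": item["question"],
        "answer": item["answer"],  # ground truth answer
        "pred": run_model(item["video"], item["question"]),
        "classification": item["classification"],
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=4)
```

The resulting `predictions.json` can then be passed to `evaluate_moviecore.py` via `--pred_path` as shown above.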

### 📈 Output

The evaluation script provides:
- Overall scores across all dimensions
- Classification-specific performance metrics
- Detailed breakdowns for comprehensive analysis

## 📚 Citation

If you use MovieCORE in your research, please cite our paper:

```bibtex
@misc{faure2025moviecorecognitivereasoningmovies,
      title={MovieCORE: COgnitive REasoning in Movies}, 
      author={Gueter Josmy Faure and Min-Hung Chen and Jia-Fong Yeh and Ying Cheng and Hung-Ting Su and Yung-Hao Tang and Shang-Hong Lai and Winston H. Hsu},
      year={2025},
      eprint={2508.19026},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.19026}, 
}
```

## 🤝 Contributing

We welcome contributions to MovieCORE! Please feel free to:
- Report issues or bugs
- Suggest improvements or new features
- Submit baseline implementations
- Provide feedback on the evaluation framework

## 📄 License

This dataset is provided under the MIT License. See [LICENSE](https://github.com/joslefaure/MovieCORE/blob/main/LICENSE) for more details.

---

<div align="center">
  <p>🎬 <strong>Advancing Video Understanding Through Cognitive Evaluation</strong> 🎬</p>
  
  **[📖 Paper](https://arxiv.org/abs/2508.19026v1) | [🤗 Dataset](https://huggingface.co/datasets/MovieCORE/MovieCORE) | [💻 Code](https://github.com/joslefaure/moviecore)**
</div>