Add pipeline tag
This PR adds the `pipeline_tag` to the model card, ensuring the model can be found at https://huggingface.co/models?pipeline_tag=video-text-to-text.
README.md
CHANGED
```diff
@@ -1,22 +1,22 @@
 ---
 library_name: transformers
 license: mit
+pipeline_tag: video-text-to-text
 ---
 
-
-
 <a href='https://arxiv.org/abs/2501.13919v1'><img src='https://img.shields.io/badge/arXiv-paper-red'></a><a href='https://ruili33.github.io/tpo_website/'><img src='https://img.shields.io/badge/project-TPO-blue'></a>
 <a href='https://huggingface.co/collections/ruili0/temporal-preference-optimization-67874b451f65db189fa35e10'><img src='https://img.shields.io/badge/model-checkpoints-yellow'></a>
 <a href='https://github.com/ruili33/TPO'><img src='https://img.shields.io/badge/github-repository-purple'></a>
 <img src="cvpr_figure_TPO.png"></img>
 # LLaVA-Video-7B-Qwen2-TPO
 
-LLaVA-Video-7B-Qwen2-TPO, introduced by paper [Temporal Preference Optimization for Long-form Video Understanding](https://
+LLaVA-Video-7B-Qwen2-TPO, introduced by paper [Temporal Preference Optimization for Long-form Video Understanding](https://huggingface.co/papers/2501.13919v1), optimized
 by temporal preference based on LLaVA-Video-7B-Qwen2. The LLaVA-Video-7B-Qwen2-TPO model establishes state-of-the-art performance across a range of
 benchmarks, demonstrating an average performance improvement of 1.5% compared to LLaVA-Video-7B.
 Notably, it emerges as the leading 7B parameter model on the Video-MME benchmark.
 
-
+Project page: https://ruili33.github.io/tpo_website/
+Code: https://github.com/ruili33/TPO
 
 ## Evaluation Results
 | **Model** | **Size** | **LongVideoBench** | **MLVU** | **VideoMME (Average)** |
@@ -121,4 +121,4 @@ This project utilizes certain datasets and checkpoints that are subject to their
 
 [1]. Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., ... & Lu, Y. (2024). NVILA: Efficient Frontier Visual Language Models. arXiv preprint arXiv:2412.04468.
 
-[2]. Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., & Li, C. (2024). Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713.
+[2]. Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., & Li, C. (2024). Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713.
```
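The whole change amounts to one new key in the README's YAML front matter, which the Hub reads to index the model under a pipeline. A minimal, dependency-free sketch of how such a front-matter block parses (the `front_matter` helper is hypothetical, written here for illustration, not part of any Hub library):

```python
# Assumption: model card metadata lives in a YAML-style front matter block
# delimited by "---" lines at the top of README.md, with flat key: value pairs.
README = """\
---
library_name: transformers
license: mit
pipeline_tag: video-text-to-text
---

# LLaVA-Video-7B-Qwen2-TPO
"""

def front_matter(text: str) -> dict:
    """Extract flat key: value pairs from the leading ----delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter block at all
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter: stop before the card body
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

print(front_matter(README)["pipeline_tag"])  # video-text-to-text
```

With `pipeline_tag` present, the Hub can list the model on the filtered page linked in the PR description; without it, the card still renders but the model is invisible to that filter.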