ruili0/LongVA-7B-TPO

ruili0 committed on Jan 23

Commit d27f9a5 · verified · 1 Parent(s): 9c67ba5

Update README.md

Files changed (1)

README.md +4 -7

@@ -24,10 +24,8 @@ benchmarks, demonstrating an average performance improvement of 2% compared to L
 ## Evaluation Results
 | **Model**                           | **Size** | **LongVideoBench** | **MLVU** | **VideoMME (Short)** | **VideoMME (Medium)** | **VideoMME (Long)** | **VideoMME (Average)** |
 |-------------------------------------|----------|---------------------|----------|----------------------|-----------------------|----------------------|-------------------------|
-| **LongLLaVA [1]**                  | 7B       | -                  | 56.3     | 61.9/66.2           | 51.4/54.7            | 45.4/50.3           | 52.9/57.1              |
-| **Video-CCAM [2]**                | 14B      | -                  | 63.1     | 62.2/66.0           | 50.6/56.3            | 46.7/49.9           | 53.2/57.4              |
-| **LongVA-7B [3]**                 | 7B       | 51.3               | 58.8     | 61.3/61.6           | 50.4/53.6            | 46.2/47.6           | 52.6/54.3              |
-| **LongVA-TPO (ours)**              | 7B       | **54.2**              | 61.7     | 63.1/66.6           | 54.8/55.3            | 47.4/47.9           | **55.1**/56.6              |
+| **LongVA-7B [1]**                 | 7B       | 51.3               | 58.8     | 61.3/61.6           | 50.4/53.6            | 46.2/47.6           | 52.6/54.3              |
+| **LongVA-TPO (ours)**              | 7B       | **54.2**              | **61.7**     | **63.1/66.6**           | **54.8/55.3**            | **47.4/47.9**           | **55.1/56.6**              |
 ##  Get Started
@@ -94,6 +92,5 @@ This project utilizes certain datasets and checkpoints that are subject to their
 **References:**
-[1]. Wang, X., Song, D., Chen, S., Zhang, C., & Wang, B. (2024). LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture. arXiv preprint arXiv:2409.02889.
-[2]. Fei, J., Li, D., Deng, Z., Wang, Z., Liu, G., & Wang, H. (2024). Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023.
-[3]. Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., ... & Liu, Z. (2024). Long context transfer from language to vision. arXiv preprint arXiv:2406.16852.
+[1]. Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., ... & Liu, Z. (2024). Long context transfer from language to vision. arXiv preprint arXiv:2406.16852.