Commit d27f9a5 (verified), committed by ruili0
1 Parent(s): 9c67ba5

Update README.md

  1. README.md +4 -7
README.md CHANGED
@@ -24,10 +24,8 @@ benchmarks, demonstrating an average performance improvement of 2% compared to L
  ## Evaluation Results
  | **Model** | **Size** | **LongVideoBench** | **MLVU** | **VideoMME (Short)** | **VideoMME (Medium)** | **VideoMME (Long)** | **VideoMME (Average)** |
  |-------------------------------------|----------|---------------------|----------|----------------------|-----------------------|----------------------|-------------------------|
- | **LongLLaVA [1]** | 7B | - | 56.3 | 61.9/66.2 | 51.4/54.7 | 45.4/50.3 | 52.9/57.1 |
- | **Video-CCAM [2]** | 14B | - | 63.1 | 62.2/66.0 | 50.6/56.3 | 46.7/49.9 | 53.2/57.4 |
- | **LongVA-7B [3]** | 7B | 51.3 | 58.8 | 61.3/61.6 | 50.4/53.6 | 46.2/47.6 | 52.6/54.3 |
- | **LongVA-TPO (ours)** | 7B | **54.2** | 61.7 | 63.1/66.6 | 54.8/55.3 | 47.4/47.9 | **55.1**/56.6 |
+ | **LongVA-7B [1]** | 7B | 51.3 | 58.8 | 61.3/61.6 | 50.4/53.6 | 46.2/47.6 | 52.6/54.3 |
+ | **LongVA-TPO (ours)** | 7B | **54.2** | **61.7** | **63.1/66.6** | **54.8/55.3** | **47.4/47.9** | **55.1/56.6** |
  
  ## Get Started
  
@@ -94,6 +92,5 @@ This project utilizes certain datasets and checkpoints that are subject to their
  
  **References:**
  
- [1]. Wang, X., Song, D., Chen, S., Zhang, C., & Wang, B. (2024). LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture. arXiv preprint arXiv:2409.02889.
- [2]. Fei, J., Li, D., Deng, Z., Wang, Z., Liu, G., & Wang, H. (2024). Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023.
- [3]. Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., ... & Liu, Z. (2024). Long context transfer from language to vision. arXiv preprint arXiv:2406.16852.
+
+ [1]. Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., ... & Liu, Z. (2024). Long context transfer from language to vision. arXiv preprint arXiv:2406.16852.