Update README.md
README.md
CHANGED
@@ -6,6 +6,12 @@ base_model:
 - Qwen/Qwen2.5-0.5B
 - openai/clip-vit-large-patch14-336
 ---
+
+ # Note: this model repository contains errors.
+ # In later study, I found that my model used only one visual token, a fatal mistake that degraded the model's performance.
+ # I will revise this repository and release a new model when I have time.
+
+
 # Visual Language Model Based on Qwen and CLIP

 This is a visual language multimodal model built upon the Qwen series language models and the CLIP visual encoder. It was trained for 10 epochs on the LLaVA pre-training dataset and nearly 800K instruction examples (the 150K instruction fine-tuning set and the 665K mixed instruction fine-tuning set). However, because the data size far exceeds the capacity of such a small model, it can only perform simple question-answering tasks on images and currently supports only English question answering.
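
The card does not include usage code, so below is a minimal sketch, assuming a standard LLaVA-style wiring, of how the two base models named above can be combined: CLIP patch features are projected into Qwen's embedding space and prepended to the text prompt. The `projector` layer, the `encode_image`/`answer` helpers, and the way tokens are spliced together are illustrative assumptions, not this repository's actual implementation; the comment in `encode_image` also marks the single-visual-token mistake the author mentions.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
)

# Load the two published base components named in the card's metadata.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Hypothetical projector mapping CLIP's hidden size (1024 for ViT-L/14) into the LLM's
# embedding size (896 for Qwen2.5-0.5B). In a trained checkpoint this layer would carry
# learned weights; here it only illustrates the shapes involved.
projector = torch.nn.Linear(vision.config.hidden_size, llm.config.hidden_size)

@torch.no_grad()
def encode_image(pil_image):
    pixels = image_processor(images=pil_image, return_tensors="pt").pixel_values
    patch_states = vision(pixel_values=pixels).last_hidden_state  # (1, 577, 1024): CLS + 24x24 patches
    # The mistake described in the note above would be collapsing this to one token,
    # e.g. projector(patch_states[:, :1]); keeping all patch tokens preserves spatial detail:
    return projector(patch_states[:, 1:])                         # (1, 576, 896)

@torch.no_grad()
def answer(pil_image, question, max_new_tokens=64):
    visual_tokens = encode_image(pil_image)
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)
    # Prepend the projected image tokens to the text embeddings and generate.
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    output_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```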