Safetensors
lbourdois committed · Commit af874b9 · verified · 1 Parent(s): e314b28

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +79 -65
README.md CHANGED
@@ -1,66 +1,80 @@
- ---
- license: apache-2.0
- datasets:
- - liuhaotian/LLaVA-CC3M-Pretrain-595K
- base_model:
- - Qwen/Qwen2.5-0.5B
- - openai/clip-vit-large-patch14-336
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - liuhaotian/LLaVA-CC3M-Pretrain-595K
+ base_model:
+ - Qwen/Qwen2.5-0.5B
+ - openai/clip-vit-large-patch14-336
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ ---

# Note that this is a model repository with errors.
# In later experiments, I found that my model used only one visual token, a fatal mistake that degraded its performance.
# I will revise this repository and release a new model when I have time.


# Visual Language Model Based on Qwen and CLIP

This is a visual-language multimodal model built upon the Qwen series language models and the CLIP visual encoder. It was trained for 10 epochs in total on the LLaVA pre-training dataset and nearly 800K instruction examples (150K instruction fine-tuning and 665K mixed instruction fine-tuning). However, because the model is small relative to this amount of data, it can only perform simple question-answering tasks on images and currently supports English question answering only.

## Training Details

- The model combines the visual encoder from `openai/clip-vit-base-patch32` with `qwen2.5-0.5B` as the language model, using a multi-layer perceptron (MLP) for alignment (a sketch of such a projector follows this list). The alignment layer was trained separately for four epochs on the pre-training dataset, but no significant loss improvement was observed after the second epoch.
- The model was then trained for three epochs on the 150K LLaVA instruction fine-tuning dataset, with a token length of 1024 in the first epoch and 2048 in the second and third epochs. The visual encoder was frozen during training, so only the alignment layer and the language model were trained.
- Finally, it underwent three epochs of training on the 665K LLaVA instruction dataset, maintaining a consistent token length of 2048 across all epochs, as in the 150K instruction fine-tuning setup. The visual encoder remained frozen throughout these epochs.
- Model hallucinations still exist: such a small model struggles to fit a dataset of this size, so its answer accuracy cannot be compared to that of the full LLaVA model. However, as a small visual language model trained from scratch, it demonstrates the multimodal learning capability of transformers in visual-language interaction. I will publish all of my training code and model files for researchers interested in visual language models.
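
For illustration, here is a minimal, LLaVA-style sketch of what such an MLP alignment layer can look like. The hidden sizes (CLIP feature dimension 1024, Qwen2.5-0.5B hidden size 896) and the use of patch-level visual tokens are assumptions for the sketch, not the repository's actual configuration (the notice above mentions that only one visual token ended up being used):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps CLIP image features into the LM embedding space.

    Hypothetical dimensions for illustration only: CLIP hidden size 1024,
    Qwen2.5-0.5B hidden size 896.
    """
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 896):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the frozen CLIP encoder
        return self.proj(image_features)

# The projected features would be concatenated with the text embeddings
# before being fed to the (trainable) language model.
projector = VisionProjector()
dummy_image_features = torch.randn(1, 577, 1024)  # e.g. ViT-L/14-336 patch tokens
visual_embeds = projector(dummy_image_features)   # shape: (1, 577, 896)
```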

### Training Resource Consumption

- Training consumed roughly 67 hours on a single H20 GPU (H20 × 1 × 67 h; for reference only).

### Uploading Issues

I attempted to upload the model with Hugging Face's PyTorch model classes, but found that they did not record all of my weights correctly, which caused problems during inference. It is therefore recommended to load the model with plain PyTorch.
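
As a general illustration of the plain-PyTorch route recommended here (the repository's `qwenva.py` presumably already does the equivalent when it builds `model` and loads `qwenva.pth`; the stand-in module below exists only for the example):

```python
import torch
import torch.nn as nn

# Generic PyTorch save/load round trip, shown with a stand-in module.
dummy = nn.Linear(4, 4)
torch.save(dummy.state_dict(), "weights.pth")  # persist every parameter explicitly

restored = nn.Linear(4, 4)                     # rebuild the architecture in code
restored.load_state_dict(torch.load("weights.pth", map_location="cpu"))
```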

If you do not have an image at hand, you can download the one in the repository; it shows a small bird with red and black feathers.

![a small bird with red and black feathers](./bird.jpeg)

### Loading Instructions

Below are the steps to load the model using PyTorch:

1. Download the `qwenva.py` file and the `qwenva.pth` weights from the repository, ensuring that both the weight and model architecture files are in the same directory.
2. Import the model and processor from the `qwenva` file:
```python
from qwenva import model, processor
from PIL import Image
import torch

# Pick a device and move the model onto it.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load the example image and build the model inputs.
image = Image.open("./bird.jpeg")
input_ = processor("please describe the image", image)
input_ = {k: v.to(device) for k, v in input_.items()}

# Index of the last prompt token (not used further in this snippet).
image_idx = torch.tensor(input_['input_ids'].shape[1] - 1).unsqueeze(0)

# Generate, then keep and decode only the newly generated tokens.
generated_ids = model.generate(
    **input_,
    max_length=512,
)
generated_ids = generated_ids[0][input_['input_ids'].size(1):]
response = processor.tokenizer.decode(generated_ids, skip_special_tokens=True)
print(response)
```

Example output:

"The image features a beautiful red bird perched on a branch, surrounded by leaves. The bird appears to be looking down, possibly observing its surroundings. The leaves and branches of the tree provide a natural and natural environment for the bird to rest and observe its environment."