desiree committed · verified
Commit 4e870b7 · 1 Parent(s): 9b48f2a

Unsloth Model Card

Files changed (1): README.md (+12, -146)
README.md CHANGED
@@ -1,155 +1,21 @@
  ---
- base_model: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
  tags:
- - text-generation-inference
- - transformers
- - unsloth
- - mllama
  license: apache-2.0
  language:
- - en
- pipeline_tag: image-text-to-text
- widget:
- - src: "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/image_captioning/sample_image.png"
-   text: "What is this image about?"
-   example_title: "Sample Image Input"
  ---
-
- ## Overview
-
- This repository provides a custom inference pipeline for our finetuned vision-language model. The pipeline:
-
- - **Accepts Image + Text Input:**
-   The model processes both an image (or an image placeholder) and a text prompt.
- - **Generates Responses:**
-   It outputs a textual response based on the input prompt and any visual cues. You can ask it to provide detailed reasoning if desired.
- - **Optimized for Efficiency:**
-   The model is loaded in 4-bit precision, making it more memory-efficient without significantly compromising performance.
-
- ---
-
- ## What the Model Does
-
- 1. **Image + Text Understanding:**
-    It takes an image along with a text instruction; in the code below, a dummy image often serves as a placeholder.
- 2. **Instruction Following:**
-    The model is fine-tuned to follow instructions. For example, you can ask it to describe the image, provide step-by-step reasoning, or answer specific questions about the image.
- 3. **Efficient Inference:**
-    With 4-bit quantization, the model uses less GPU memory, making it suitable for environments with limited VRAM.
- 4. **Flexible Prompting:**
-    The final output depends on your prompt. Ask for step-by-step reasoning, concise answers, or detailed descriptions based on your needs.
-
- instructions: >
-   This section contains metadata, instructions, code, and an explanation
-   for using a custom pipeline with a finetuned vision-language model.
-
-   **Setup Steps**
-   1. Install dependencies with: `pip install transformers Pillow`.
-   2. Load your model and tokenizer via Unsloth.
-   3. Place your dummy image (e.g. "Image_Editor.png") in the same folder.
-   4. Run the code in the `code` section to create and test the custom pipeline.
-
- code: |
-   ```python
-   from PIL import Image
-   from transformers.pipelines import Pipeline
-
-   # Open your dummy image (ensure "Image_Editor.png" is in your working directory).
-   dummy_image = Image.open("Image_Editor.png")
-
-   # Make sure your model and tokenizer are already loaded, for example:
-   # from unsloth import FastVisionModel
-   # model, tokenizer = FastVisionModel.from_pretrained(
-   #     "unsloth/Llama-3.2-11B-Vision-Instruct",
-   #     load_in_4bit=True,
-   #     use_gradient_checkpointing="unsloth",
-   # )
-
-   # --- Patch the tokenizer if it lacks a usable pad_token_id ---
-   if getattr(tokenizer, "pad_token_id", None) is None:
-       tokenizer.pad_token_id = getattr(tokenizer, "eos_token_id", None) or 0
-
-   class CustomImageTextToTextPipeline(Pipeline):
-       """
-       A custom pipeline that accepts inputs as a dict (or a list of dicts) with
-       "role" and "content" keys. It constructs a prompt that includes an image
-       placeholder (using dummy_image) and tokenizes the prompt along with the image.
-       """
-       def __init__(self, model, tokenizer, dummy_image, **kwargs):
-           super().__init__(model=model, tokenizer=tokenizer, **kwargs)
-           self.dummy_image = dummy_image
-           # Determine the device from the model parameters.
-           self.device = next(model.parameters()).device
-
-       def _sanitize_parameters(self, **kwargs):
-           # Required to instantiate the pipeline.
-           return {}, kwargs, {}
-
-       def preprocess(self, inputs, **kwargs):
-           """
-           Expects inputs as a dict (or a list of dicts) with keys "role" and "content".
-           Constructs a chat prompt with an image placeholder and tokenizes it.
-           """
-           if isinstance(inputs, list):
-               message = inputs[0]
-           elif isinstance(inputs, dict):
-               message = inputs
-           else:
-               raise ValueError("Input must be a dict or a list of dicts.")
-
-           text = message.get("content", "")
-           # Build a chat message in the expected multimodal format.
-           messages = [{
-               "role": message.get("role", "user"),
-               "content": [
-                   {"type": "image"},               # image placeholder
-                   {"type": "text", "text": text},  # the input text
-               ],
-           }]
-           # Use the tokenizer's chat template to construct the final prompt.
-           input_text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=True)
-           # Tokenize the prompt together with the dummy image.
-           model_inputs = self.tokenizer(
-               self.dummy_image,
-               input_text,
-               add_special_tokens=False,
-               return_tensors="pt",
-           ).to(self.device)
-           return model_inputs
-
-       def _forward(self, model_inputs):
-           # Generate the model output.
-           return self.model.generate(
-               **model_inputs,
-               max_new_tokens=128,
-               use_cache=True,
-               temperature=1.5,
-               min_p=0.1,
-           )
-
-       def postprocess(self, model_outputs, **kwargs):
-           # Decode the generated tokens into human-readable text.
-           return self.tokenizer.decode(model_outputs[0], skip_special_tokens=True)
-
-   # Create an instance of the custom pipeline (do not specify a device when using Accelerate).
-   custom_pipe = CustomImageTextToTextPipeline(
-       model=model,
-       tokenizer=tokenizer,
-       dummy_image=dummy_image,
-   )
-
-   # Test the pipeline with the expected input format.
-   messages = [{
-       "role": "user",
-       "content": "What does this image look like? Please explain your reasoning step-by-step before giving your answer.",
-   }]
-   result = custom_pipe(messages)
-   print(result)
-   ```
  ---
+ base_model: unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit
  tags:
+ - text-generation-inference
+ - transformers
+ - unsloth
+ - mllama
  license: apache-2.0
  language:
+ - en
  ---

+ # Uploaded finetuned model

+ - **Developed by:** desiree
+ - **License:** apache-2.0
+ - **Finetuned from model:** unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit

+ This mllama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
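For readers who want to query the uploaded model directly, here is a minimal inference sketch. It assumes Unsloth's `FastVisionModel` API (the same loader referenced in the comments of the removed README); the repo id and image path are hypothetical placeholders, not values from this commit.

```python
# Minimal inference sketch. Assumes Unsloth's FastVisionModel API; the repo id
# and image path below are hypothetical placeholders, not values from this commit.
from PIL import Image
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "your-username/your-finetuned-model",  # hypothetical placeholder repo id
    load_in_4bit=True,                     # 4-bit quantization to reduce VRAM use
)
FastVisionModel.for_inference(model)  # switch from training to inference mode

image = Image.open("sample.png")  # any RGB image

# Build a multimodal chat message: an image slot plus the text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this image look like?"},
    ],
}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize the image and prompt together, then generate a response.
inputs = tokenizer(image, prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Loading with `load_in_4bit=True` keeps VRAM usage low, consistent with the quantized base checkpoint named in the model card's metadata.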