desiree committed · verified
Commit 4e870b7 · 1 Parent(s): 9b48f2a

Unsloth Model Card

Files changed (1): README.md (+12, -146)
README.md CHANGED
@@ -1,155 +1,21 @@
  ---
- base_model: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
  tags:
- - text-generation-inference
- - transformers
- - unsloth
- - mllama
  license: apache-2.0
  language:
- - en
- pipeline_tag: image-text-to-text
- widget:
- - src: "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/image_captioning/sample_image.png"
-   text: "What is this image about?"
-   example_title: "Sample Image Input"
  ---
-
- ## Overview
-
- This repository provides a custom inference pipeline for our finetuned vision-language model. The pipeline:
-
- - **Accepts Image + Text Input:**
-   The model processes both an image (or an image placeholder) and a text prompt.
- - **Generates Responses:**
-   It outputs a textual response based on the input prompt and any visual cues. You can ask it to provide detailed reasoning if desired.
- - **Optimized for Efficiency:**
-   The model is loaded in 4-bit precision, making it more memory-efficient without significantly compromising performance.
-
- ---
-
- ## What the Model Does
-
- 1. **Image + Text Understanding:**
-    It takes an image along with a text instruction; in the code below, a dummy image often serves as a placeholder.
- 2. **Instruction Following:**
-    The model is fine-tuned to follow instructions. For example, you can ask it to describe the image, provide step-by-step reasoning, or answer specific questions about the image.
- 3. **Efficient Inference:**
-    With 4-bit quantization, the model uses less GPU memory, making it suitable for environments with limited VRAM.
- 4. **Flexible Prompting:**
-    The final output depends on your prompt. Ask for step-by-step reasoning, concise answers, or detailed descriptions based on your needs.
-
- instructions: >
-   This section contains metadata, instructions, code, and an explanation
-   for using a custom pipeline with a finetuned vision-language model.
-
-   **Setup Steps**
-   1. Install dependencies with: `pip install transformers Pillow`.
-   2. Load your model and tokenizer via Unsloth.
-   3. Place your dummy image (e.g. "Image_Editor.png") in the same folder.
-   4. Run the code in the `code` section to create and test the custom pipeline.
-
- code: |
-   ```python
-   from PIL import Image
-   from transformers.pipelines import Pipeline
-
-   # Open your dummy image (ensure "Image_Editor.png" is in your working directory).
-   dummy_image = Image.open("Image_Editor.png")
-
-   # Make sure your model and tokenizer are already loaded, for example:
-   # from unsloth import FastVisionModel
-   # model, tokenizer = FastVisionModel.from_pretrained(
-   #     "unsloth/Llama-3.2-11B-Vision-Instruct",
-   #     load_in_4bit=True,
-   #     use_gradient_checkpointing="unsloth",
-   # )
-
-   # --- Patch the tokenizer if it lacks a usable pad_token_id ---
-   if getattr(tokenizer, "pad_token_id", None) is None:
-       tokenizer.pad_token_id = getattr(tokenizer, "eos_token_id", None) or 0
-
-   class CustomImageTextToTextPipeline(Pipeline):
-       """
-       A custom pipeline that accepts inputs as a dict (or a list of dicts) with
-       "role" and "content" keys. It constructs a prompt that includes an image
-       placeholder (using dummy_image) and tokenizes the prompt along with the image.
-       """
-       def __init__(self, model, tokenizer, dummy_image, **kwargs):
-           super().__init__(model=model, tokenizer=tokenizer, **kwargs)
-           self.dummy_image = dummy_image
-           # Determine the device from the model parameters.
-           self.device = next(model.parameters()).device
-
-       def _sanitize_parameters(self, **kwargs):
-           # Required to instantiate the pipeline.
-           return {}, kwargs, {}
-
-       def preprocess(self, inputs, **kwargs):
-           """
-           Expects inputs as a dict (or a list of dicts) with keys "role" and "content".
-           Constructs a chat prompt with an image placeholder and tokenizes it.
-           """
-           if isinstance(inputs, list):
-               message = inputs[0]
-           elif isinstance(inputs, dict):
-               message = inputs
-           else:
-               raise ValueError("Input must be a dict or a list of dicts.")
-
-           text = message.get("content", "")
-           # Build a chat message in the expected multimodal format.
-           messages = [{
-               "role": message.get("role", "user"),
-               "content": [
-                   {"type": "image"},               # image placeholder
-                   {"type": "text", "text": text},  # the input text
-               ],
-           }]
-           # Use the tokenizer's chat template to construct the final prompt.
-           input_text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=True)
-           # Tokenize the prompt together with the dummy image.
-           model_inputs = self.tokenizer(
-               self.dummy_image,
-               input_text,
-               add_special_tokens=False,
-               return_tensors="pt",
-           ).to(self.device)
-           return model_inputs
-
-       def _forward(self, model_inputs):
-           # Generate the model output.
-           return self.model.generate(
-               **model_inputs,
-               max_new_tokens=128,
-               use_cache=True,
-               temperature=1.5,
-               min_p=0.1,
-           )
-
-       def postprocess(self, model_outputs, **kwargs):
-           # Decode the generated tokens into human-readable text.
-           return self.tokenizer.decode(model_outputs[0], skip_special_tokens=True)
-
-   # Create an instance of the custom pipeline (do not specify a device when using Accelerate).
-   custom_pipe = CustomImageTextToTextPipeline(
-       model=model,
-       tokenizer=tokenizer,
-       dummy_image=dummy_image,
-   )
-
-   # Test the pipeline with the expected input format.
-   messages = [{
-       "role": "user",
-       "content": "What does this image look like? Please explain your reasoning step-by-step before giving your answer.",
-   }]
-   result = custom_pipe(messages)
-   print(result)
-   ```
  ---
+ base_model: unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit
  tags:
+ - text-generation-inference
+ - transformers
+ - unsloth
+ - mllama
  license: apache-2.0
  language:
+ - en
  ---

+ # Uploaded finetuned model

+ - **Developed by:** desiree
+ - **License:** apache-2.0
+ - **Finetuned from model:** unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit

+ This mllama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
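For readers who want to query the uploaded model directly, here is a minimal inference sketch. It assumes Unsloth's `FastVisionModel` API (the same loader referenced in the comments of the removed README); the repo id and image path are hypothetical placeholders, not values from this commit.

```python
# Minimal inference sketch. Assumes Unsloth's FastVisionModel API; the repo id
# and image path below are hypothetical placeholders, not values from this commit.
from PIL import Image
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "your-username/your-finetuned-model",  # hypothetical placeholder repo id
    load_in_4bit=True,                     # 4-bit quantization to reduce VRAM use
)
FastVisionModel.for_inference(model)  # switch from training to inference mode

image = Image.open("sample.png")  # any RGB image

# Build a multimodal chat message: an image slot plus the text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this image look like?"},
    ],
}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize the image and prompt together, then generate a response.
inputs = tokenizer(image, prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Loading with `load_in_4bit=True` keeps VRAM usage low, consistent with the quantized base checkpoint named in the model card's metadata.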