---
language: en
tags:
- vqa
- engineering-drawing
- visual-question-answering
license: mit
metrics:
- accuracy
- f1
model_categories:
- image-to-text
base_model: microsoft/Florence-2-base-ft
task: Visual Question Answering (VQA)
architecture: Causal Language Model (CLM)
framework: Hugging Face Transformers
---
# Florence 2 VQA - Engineering Drawings
## Model Overview
The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) tasks, specifically for **engineering drawings**. It takes both an **image** (e.g., a technical drawing) and a **textual question** as input, and generates a text-based answer related to the content of the image.
---
## Model Details
- **Base Model**: [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
- **Task**: Visual Question Answering (VQA)
- **Architecture**: Causal Language Model (CLM)
- **Framework**: Hugging Face Transformers
---
## How to Use the Model
### **Install Dependencies**
Make sure you have the required libraries installed:
```bash
pip install transformers torch datasets pillow gradio
```
### **Load the Model**
To load the model and processor for inference, use the following code:
```python
from transformers import AutoConfig, AutoModelForCausalLM
import torch
# Determine if a GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load configuration from the base model
config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
# Load the model using the base model's configuration
model = AutoModelForCausalLM.from_pretrained(
"fauzail/Florence-2-VQA",
config=config,
trust_remote_code=True
).to(device)
```
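On a GPU you can optionally load the weights in half precision to reduce memory use. This is a minimal variant of the call above; `torch_dtype` is a standard `from_pretrained` argument, but verify that this checkpoint behaves well in `float16` before relying on it:
```python
# Optional: half-precision loading on GPU (assumption: the checkpoint is float16-safe)
model = AutoModelForCausalLM.from_pretrained(
    "fauzail/Florence-2-VQA",
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
).to(device)
```
If you do load in `float16`, remember to cast the processor's `pixel_values` to the model's dtype before calling `generate`.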
### **Load the Processor**
```python
from transformers import AutoProcessor
# Load the processor for the model
processor = AutoProcessor.from_pretrained("fauzail/Florence-2-VQA", trust_remote_code=True)
```
### **Define the Prediction Function**
Once the model and processor are loaded, define a prediction function that takes an image path and a question as input:
```python
from PIL import Image

def predict(image_path, question):
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    # Prepare inputs using the processor
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    # Generate the output; max_new_tokens bounds the answer length
    # (generate's default length limit would truncate longer answers)
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode the output tokens into a human-readable string
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
```
### **Test with an Example**
Now, test the model using an image and a question:
```python
image_path = "test.png"  # Replace with your image path
question = "Describe this image in detail."
# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)
```
### **Alternative: Use Gradio for Interactive Web Interface**
If you prefer an interactive interface, you can use Gradio to deploy the model:
```python
import gradio as gr

# Prediction function for Gradio; gr.Image(type="pil") hands the function a PIL image directly
def predict(image, question):
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface
interface = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)

# Launch the Gradio interface
interface.launch()
```
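If you are running in a notebook or on a remote machine, Gradio can also expose a temporary public URL:
```python
# Launch with a temporary public link (useful on Colab or remote servers)
interface.launch(share=True)
```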
---
## Training Details
- **Preprocessing**:
- Images were resized and normalized.
- Text data (questions and answers) was tokenized using the Florence tokenizer.
- **Hyperparameters**:
- **Learning Rate**: `1e-6`
- **Batch Size**: `2`
- **Gradient Accumulation Steps**: `4`
- **Epochs**: `10`
Training was performed using mixed precision for efficiency.
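For reference, the snippet below is a minimal sketch of a fine-tuning setup matching these hyperparameters, using the Hugging Face `Trainer`. The dataset (`train_dataset`), its column names, and the collate function are illustrative assumptions, not the exact script used for this model:
```python
import torch
from transformers import Trainer, TrainingArguments

# Hypothetical collate function: assumes each example is a dict like
# {"image": PIL.Image, "question": str, "answer": str}
def collate_fn(examples):
    questions = [ex["question"] for ex in examples]
    images = [ex["image"].convert("RGB") for ex in examples]
    inputs = processor(text=questions, images=images, return_tensors="pt", padding=True)
    labels = processor.tokenizer(
        [ex["answer"] for ex in examples], return_tensors="pt", padding=True
    ).input_ids
    # (in practice, pad token ids in labels are usually replaced with -100 to mask the loss)
    inputs["labels"] = labels
    return inputs

training_args = TrainingArguments(
    output_dir="./florence2-vqa",
    learning_rate=1e-6,                 # as reported above
    per_device_train_batch_size=2,      # batch size 2
    gradient_accumulation_steps=4,      # effective batch size 8
    num_train_epochs=10,
    fp16=torch.cuda.is_available(),     # mixed precision, as reported
    remove_unused_columns=False,        # keep raw columns for the collator
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,        # assumed: image/question/answer records
    data_collator=collate_fn,
)
trainer.train()
```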
---