---
language: en
tags:
- vqa
- engineering-drawing
- visual-question-answering
license: mit
metrics:
- accuracy
- f1
model_categories:
- image-to-text
base_model: microsoft/Florence-2-base-ft
task: Visual Question Answering (VQA)
architecture: Causal Language Model (CLM)
framework: Hugging Face Transformers
---

# Florence 2 VQA - Engineering Drawings

## Model Overview
The **Florence 2 VQA** model is fine-tuned for visual question answering (VQA) tasks, specifically for **engineering drawings**. It takes both an **image** (e.g., a technical drawing) and a **textual question** as input, and generates a text-based answer related to the content of the image.

---

## Model Details
- **Base Model**: [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)  
- **Task**: Visual Question Answering (VQA)  
- **Architecture**: Causal Language Model (CLM)  
- **Framework**: Hugging Face Transformers  

---

## How to Use the Model

### **Install Dependencies**
Make sure you have the required libraries installed:
```bash
pip install transformers torch datasets pillow gradio
```
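
Depending on your environment, the Florence-2 remote code may also require extra packages such as `einops` and `timm`; if loading the model fails with a missing-module error, installing them usually resolves it:

```bash
# Only needed if the model load complains about missing modules
pip install einops timm
```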

### **Load the Model**

To load the fine-tuned model for inference, use the following code (the processor is loaded in the next step):

```python
from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the configuration from the base model
config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Load the fine-tuned model using the base model's configuration
model = AutoModelForCausalLM.from_pretrained(
    "fauzail/Florence-2-VQA",
    config=config,
    trust_remote_code=True
).to(device)
```

### **Load the Processor**

```python
from transformers import AutoProcessor

# Load the processor for the model
processor = AutoProcessor.from_pretrained("fauzail/Florence-2-VQA", trust_remote_code=True)
```
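
Before running inference, it is usually worth switching the model to evaluation mode; on a GPU you can optionally cast it to half precision to reduce memory use. A minimal, optional sketch (not required by the model itself):

```python
# Disable dropout and other training-only behaviour for inference
model.eval()

# Optional: half precision on GPU to reduce memory usage
if device.type == "cuda":
    model = model.half()
```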

### **Define the Prediction Function**

Once the model and processor are loaded, define a prediction function that takes an image and question as input:

```python
from PIL import Image

def predict(image_path, question):
    # Load the drawing and convert it to RGB
    image = Image.open(image_path).convert("RGB")

    # Prepare inputs using the processor
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)

    # Generate the answer; without max_new_tokens the default limit often truncates detailed answers
    outputs = model.generate(**inputs, max_new_tokens=256)

    # Decode the output tokens into a human-readable string
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
```
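
If the generated answers come out too short or unstable, the `generate` call inside `predict` can be tuned. A small sketch reusing `model` and `inputs` from above; the values are illustrative, not the author's settings:

```python
# Illustrative generation settings; adjust for your drawings and questions
outputs = model.generate(
    **inputs,
    max_new_tokens=512,  # allow longer answers
    num_beams=3,         # beam search for more consistent output
)
```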

### **Test the Model with an Example**

Now, test the model using an image and a question:

```python
image_path = "test.png"  # Replace with your image path
question = "Tell me in detail about the image?"

# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)
```
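
To answer the same question for many drawings, the function can simply be looped over a folder. A minimal sketch, assuming a hypothetical local `drawings/` directory of PNG files:

```python
from pathlib import Path

question = "Tell me in detail about the image?"

# Ask the same question about every PNG in a (hypothetical) drawings/ folder
for path in sorted(Path("drawings").glob("*.png")):
    print(path.name, "->", predict(str(path), question))
```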

### **Alternative: Use Gradio for an Interactive Web Interface**

If you prefer an interactive interface, you can use Gradio to deploy the model:

```python
import gradio as gr
from PIL import Image

# Define the prediction function for Gradio
def predict(image, question):
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs, max_new_tokens=256)  # cap answer length, as above
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface
interface = gr.Interface(
    fn=predict,
    inputs=["image", "text"],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)

# Launch the Gradio interface
interface.launch()
```

---

## Training Details
- **Preprocessing**:
  - Images were resized and normalized.
  - Text data (questions and answers) was tokenized using the Florence tokenizer.  
- **Hyperparameters**:
  - **Learning Rate**: `1e-6`  
  - **Batch Size**: `2`  
  - **Gradient Accumulation Steps**: `4`  
  - **Epochs**: `10`  

Training was performed using mixed precision for efficiency.
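
For reference, a minimal sketch of how these hyperparameters could map onto Hugging Face `TrainingArguments`; the output directory, logging, and saving settings below are illustrative placeholders, not the author's exact configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./florence2-vqa-finetune",  # hypothetical output path
    learning_rate=1e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    fp16=True,                    # mixed-precision training
    logging_steps=50,             # illustrative
    save_strategy="epoch",        # illustrative
    remove_unused_columns=False,  # keep image tensors for a custom collator
)
```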



---