---
license: apache-2.0
---

# BLIP-2 SnapGarden

BLIP-2 SnapGarden is a fine-tuned version of the BLIP-2 model, adapted to the SnapGarden dataset to answer questions about plants. The model can also generate short descriptions of images, extending its usefulness to image-captioning tasks.

## Model Overview

BLIP-2 (Bootstrapping Language-Image Pre-training) is a state-of-the-art model that bridges the gap between vision and language understanding. By LoRA fine-tuning BLIP-2 on the SnapGarden dataset, this model has learned to generate captions that are contextually relevant and descriptive, making it suitable for applications in image understanding and accessibility tools.

## SnapGarden Dataset

The SnapGarden dataset is a curated collection of images focusing on various plant species, gardening activities, and related scenes. It provides a diverse set of images with corresponding captions, making it well suited for training models in the domain of botany and gardening.

## Model Details

- Model Name: *BLIP-2 SnapGarden*
- Base Model: *BLIP-2*
- Fine-tuning Dataset: *Baran657/SnapGarden_v0.6*
- Task: *VQA*

## Usage

To use this model with the Hugging Face `transformers` library:

#### Running the model on CPU
```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
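The demo image and question above come from the base BLIP-2 model card. Since this checkpoint is tuned for plant Q&A, a more representative prompt uses a plant photo. The sketch below is a minimal example; the file name `my_monstera.jpg` and the question are placeholders you would replace with your own image and query.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden")

# Placeholder path: replace with your own plant photo.
raw_image = Image.open("my_monstera.jpg").convert("RGB")

question = "What plant is this and how often should it be watered?"
inputs = processor(raw_image, question, return_tensors="pt")

# max_new_tokens caps the length of the generated answer; adjust to taste.
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```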
#### Running the model on GPU

##### In full precision
```python
# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
##### In half precision (`float16`)
```python
# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
##### In 8-bit precision (`int8`)
```python
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
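##### In 4-bit precision (`nf4`)

This variant is not part of the original card: if `bitsandbytes` is installed, the checkpoint may also be loaded in 4-bit precision via `BitsAndBytesConfig`, reducing memory further than int8. The snippet below is a minimal sketch assuming a recent `transformers`/`bitsandbytes` stack; the quality trade-off has not been evaluated for this checkpoint.

```python
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")

# 4-bit NF4 quantization with float16 compute (assumes your bitsandbytes build supports it).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Baran657/blip_2_snapgarden", quantization_config=quant_config, device_map="auto"
)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```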
## Applications

- Botanical Research: Assisting researchers in identifying and describing house plant species.
- Educational Tools: Providing descriptive content for educational materials in botany.
- Accessibility: Enhancing image descriptions for visually impaired individuals in gardening contexts.

## Limitations

While BLIP-2 SnapGarden performs well at generating captions for plant-related images, it may not generalize effectively to images outside the gardening domain. Users should be cautious when applying this model to unrelated image datasets. In addition, the training of this model can be further optimized; that optimization is planned for the end of this week.

## License

This model is distributed under the Apache 2.0 License.

## Acknowledgements

- The original BLIP-2 model for providing the foundational architecture.
- The creators of the SnapGarden dataset for their valuable contribution to the field.

For more details and updates, please visit the Hugging Face model page.