---
license: apache-2.0
---

# BLIP-2 SnapGarden

BLIP-2 SnapGarden is a fine-tuned version of the BLIP-2 model, adapted to the SnapGarden dataset to answer questions about plants. The model can also generate short descriptions of images, extending its usefulness to image-captioning tasks.

## Model Overview

BLIP-2 (Bootstrapping Language-Image Pre-training) is a state-of-the-art model that bridges the gap between vision and language understanding. By LoRA fine-tuning BLIP-2 on the SnapGarden dataset, this model has learned to generate captions that are contextually relevant and descriptive, making it suitable for applications in image understanding and accessibility tools.

## SnapGarden Dataset

The SnapGarden dataset is a curated collection of images focusing on various plant species, gardening activities, and related scenes. It provides a diverse set of images with corresponding captions, making it well suited for training models in the domain of botany and gardening.

## Model Details

- Model Name: *BLIP-2 SnapGarden*
- Base Model: *BLIP-2*
- Fine-tuning Dataset: *Baran657/SnapGarden_v0.6*
- Task: *VQA*

## Usage

To use this model with the Hugging Face `transformers` library:

#### Running the model on CPU
```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
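The demo image and question above come from the base BLIP-2 model card. Since this checkpoint is tuned for plant Q&A, a more representative prompt uses a plant photo. The sketch below is a minimal example; the file name `my_monstera.jpg` and the question are placeholders you would replace with your own image and query.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden")

# Placeholder path: replace with your own plant photo.
raw_image = Image.open("my_monstera.jpg").convert("RGB")

question = "What plant is this and how often should it be watered?"
inputs = processor(raw_image, question, return_tensors="pt")

# max_new_tokens caps the length of the generated answer; adjust to taste.
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```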
#### Running the model on GPU

##### In full precision
```python
# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
##### In half precision (`float16`)
```python
# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
##### In 8-bit precision (`int8`)
```python
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")
model = Blip2ForConditionalGeneration.from_pretrained("Baran657/blip_2_snapgarden", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
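##### In 4-bit precision (`nf4`)

This variant is not part of the original card: if `bitsandbytes` is installed, the checkpoint may also be loaded in 4-bit precision via `BitsAndBytesConfig`, reducing memory further than int8. The snippet below is a minimal sketch assuming a recent `transformers`/`bitsandbytes` stack; the quality trade-off has not been evaluated for this checkpoint.

```python
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig

processor = Blip2Processor.from_pretrained("Baran657/blip_2_snapgarden")

# 4-bit NF4 quantization with float16 compute (assumes your bitsandbytes build supports it).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Baran657/blip_2_snapgarden", quantization_config=quant_config, device_map="auto"
)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```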
## Applications

- Botanical Research: Assisting researchers in identifying and describing house plant species.
- Educational Tools: Providing descriptive content for educational materials in botany.
- Accessibility: Enhancing image descriptions for visually impaired individuals in gardening contexts.

## Limitations

While BLIP-2 SnapGarden performs well at generating captions for plant-related images, it may not generalize effectively to images outside the gardening domain. Users should be cautious when applying this model to unrelated image datasets. In addition, the training of this model can be further optimized; that optimization is planned for the end of this week.

## License

This model is distributed under the Apache 2.0 License.

## Acknowledgements

- The original BLIP-2 model for providing the foundational architecture.
- The creators of the SnapGarden dataset for their valuable contribution to the field.

For more details and updates, please visit the Hugging Face model page.