---
library_name: transformers
license: apache-2.0
pipeline_tag: image-to-text
---

# rmfg
<img src="https://i.pinimg.com/736x/7e/46/a6/7e46a6881623dfd3e1a2a5a2ae692374.jpg" width="300">
## Example

**Image**

<img src="https://media-cldnry.s-nbcnews.com/image/upload/t_fit-760w,f_auto,q_auto:best/rockcms/2023-12/231202-elon-musk-mjf-1715-fc0be2.jpg" width="300">

**Output**

> A man in a black cowboy hat and sunglasses stands in front of a white car, holding a microphone and speaking into it.

---

- Underfit; doesn't perform well yet.
- This model marks the beginning of my tiny vision language model series, serving as a prelude to what's to come over the next few days.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "aloobun/rmfg"

# trust_remote_code is required: encode_image and answer_question are
# custom methods defined in the repo's modeling code, not in transformers.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("692374.jpg")
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
```
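
If your input images are large, you may want to downscale them before passing them to `encode_image`. A minimal PIL sketch; the 768 px target side is an arbitrary assumption for illustration, not a requirement of this model:

```python
from PIL import Image


def downscale(image: Image.Image, max_side: int = 768) -> Image.Image:
    """Shrink an image so its longer side is at most max_side, keeping aspect ratio."""
    scale = max_side / max(image.size)
    if scale >= 1.0:
        # Image is already small enough; never enlarge.
        return image
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.LANCZOS)


# Example: a 1600x900 image is scaled down to 768x432.
img = Image.new("RGB", (1600, 900))
print(downscale(img).size)  # (768, 432)
```

The downscaled image can then be handed to `model.encode_image(...)` exactly as in the snippet above.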