# FemtoVLM: Tiniest Vision Language Models

FemtoVLM is the smallest visual question answering/captioning model in the world. It accepts image and text inputs and produces text outputs: it can answer questions about images and describe visual content. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance.

FemtoVLM comes in four sizes: 116M (femto), 143M (tiny), 160M (base), and 225M (dino). All models are trained for image captioning and question answering in real-world contexts. FemtoVLM cannot perform optical character recognition (OCR), multi-turn question answering, or scientific question answering.

## Setup

```bash
pip install git+https://github.com/facebookresearch/schedule_free.git
pip install peft
git clone https://github.com/mkturkcan/seers.git
cd seers/seers/
git clone https://huggingface.co/mehmetkeremturkcan/FemtoVLM-Femto
```
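
As a quick sanity check that the two training-time dependencies resolved, the imports below should succeed (package names follow the installs above):

```python
# Sanity check: both packages installed in the setup steps should import cleanly.
import schedulefree  # schedule-free optimizers from facebookresearch/schedule_free
import peft          # parameter-efficient fine-tuning (LoRA etc.)

print(schedulefree.AdamWScheduleFree)  # schedule-free AdamW variant
print(peft.LoraConfig)                 # LoRA adapter configuration
```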

## Test

From the seers/seers folder, run:

```bash
python femtovlm_inference.py
```
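
For reference, the sketch below shows what a single image-question round trip could look like if the checkpoint loads through the generic transformers image-text-to-text interface. This is an assumption, not the contents of femtovlm_inference.py; the auto classes used here may differ from what the script actually does.

```python
# Hypothetical sketch, not taken from femtovlm_inference.py: load the checkpoint
# through the generic transformers image-text-to-text auto classes.
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "mehmetkeremturkcan/FemtoVLM-Femto"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local RGB image
prompt = "Describe this image."

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```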

## Train

The seers training code is public! Run:

```bash
python femtovlm_train.py
```
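
The setup steps install peft and the schedule-free optimizer, which suggests the training recipe combines LoRA adapters with schedule-free AdamW. The sketch below illustrates those two ingredients only; it is an assumption about the recipe, not an excerpt from femtovlm_train.py, and `target_modules` in particular is a placeholder.

```python
# Hypothetical sketch of the training ingredients implied by the setup steps
# (LoRA via peft + schedule-free AdamW); not an excerpt from femtovlm_train.py.
import schedulefree
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "mehmetkeremturkcan/FemtoVLM-Femto"
)

# Attach LoRA adapters; the target_modules names are placeholders.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# Schedule-free AdamW needs no LR schedule, but it must be switched between
# train() and eval() modes alongside the model.
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-4)
optimizer.train()
# ... standard forward / loss.backward() / optimizer.step() loop ...
optimizer.eval()  # before evaluation or checkpointing
```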