A newer version of this model is available: embedl/Cosmos-Reason2-2B-W4A16-Edge2

Cosmos-Reason2-2B-W4A16

Cosmos-Reason2-2B Benchmark Results

A quantized version of nvidia/Cosmos-Reason2-2B, optimized for reduced GPU memory usage and improved inference efficiency while maintaining high-quality multimodal reasoning performance.

This model was created by quantizing the base language model to INT4 weights while keeping activations in FP16 precision. In our evaluation on the Physical AI Bench Reason Task it scores 48.68 overall versus 50.60 for the original Cosmos-Reason2-2B (see Accuracy), while the memory footprint of the model weights is significantly reduced.
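To make the W4A16 idea concrete, here is a minimal sketch of symmetric group-wise INT4 weight quantization in plain Python. It is an illustration only: the group size, the rounding scheme, and all function names are assumptions, not the recipe used to produce this model.

```python
# Illustrative only: symmetric group-wise INT4 weight quantization.
# GROUP_SIZE and the rounding scheme are assumptions, not Embedl's recipe.

GROUP_SIZE = 4  # hypothetical; real W4A16 schemes typically use larger groups


def quantize_group(weights):
    """Map a group of float weights to INT4 codes in [-8, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate float weights; activations are never quantized."""
    return [c * scale for c in codes]


def quantize(weights, group_size=GROUP_SIZE):
    """Quantize a flat list of weights group by group."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Concatenate the dequantized groups back into one flat list."""
    restored = []
    for codes, scale in groups:
        restored.extend(dequantize_group(codes, scale))
    return restored


weights = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, -0.6]
groups = quantize(weights)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f}")
```

Each group stores 4-bit codes plus a single scale, so the per-weight rounding error is bounded by half the group's scale. At inference time the weights are dequantized and multiplied with 16-bit activations, which is what the "A16" in W4A16 refers to.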

More efficient inference relies on Embedl’s proprietary optimizations and architectural enhancements, which require a patched vLLM. These patches will be released at a later date. Until then, the model can be used with vLLM through the NVIDIA Jetson container.


Model Details

| Field | Value |
| --- | --- |
| Base Model | nvidia/Cosmos-Reason2-2B |
| Input / Output | Text + Image / Video → Text |
| Release Date | 2026-02-13 |
| Version | 1.0 |
| Optimizations | Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License; Additional Information: Apache License 2.0; Optimized Components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose NLP on NVIDIA GPUs |

Optimizations

  • Quantization (W4A16) - large reduction in memory footprint and latency.
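As a rough sense of scale for the memory reduction, the arithmetic below compares 16-bit and 4-bit weight storage. It is a back-of-envelope sketch: the parameter count is the nominal "2B" from the model name, and treating every weight as quantized is an assumption (some tensors usually remain in 16-bit, and quantization scales add overhead), so the real savings will be smaller.

```python
# Back-of-envelope weight-memory estimate under simplifying assumptions.

def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


NUM_PARAMS = 2_000_000_000  # nominal figure, not an exact parameter count

fp16_gib = weight_bytes(NUM_PARAMS, 16) / 2**30
int4_gib = weight_bytes(NUM_PARAMS, 4) / 2**30

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f"4-bit weights:  {int4_gib:.2f} GiB")
print(f"upper bound on weight-memory reduction: {fp16_gib / int4_gib:.0f}x")
```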

Accuracy

For comparative evaluation, we report benchmark scores using the Physical AI Bench Reason Task.

We have not been able to reproduce the baseline scores reported for nvidia/Cosmos-Reason2-2B on the Physical AI Bench Leaderboard; see the related issue: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52

Overall + Category Scores

| Model | Overall | Embodied Reasoning | Common Sense |
| --- | --- | --- | --- |
| nvidia/Cosmos-Reason2-2B | 50.60 | 53.93 | 47.19 |
| embedl/Cosmos-Reason2-2B-NVFP4A16 | 49.84 | 50.16 | 49.50 |
| embedl/Cosmos-Reason2-2B-W4A16 | 48.68 | 50.49 | 46.85 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 50.58 | 53.61 | 47.52 |
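To read the overall scores relative to the baseline, the following sketch computes the absolute difference and the percentage of the baseline score retained by each optimized variant. The numbers are copied from the table above; the helper name `compare_to_baseline` is hypothetical.

```python
# Overall scores copied from the table above (Physical AI Bench Reason Task).
OVERALL = {
    "nvidia/Cosmos-Reason2-2B": 50.60,
    "embedl/Cosmos-Reason2-2B-NVFP4A16": 49.84,
    "embedl/Cosmos-Reason2-2B-W4A16": 48.68,
    "embedl/Cosmos-Reason2-2B-W4A16-Edge2": 50.58,
}
BASELINE = "nvidia/Cosmos-Reason2-2B"


def compare_to_baseline(scores, baseline=BASELINE):
    """Return {model: (absolute delta, retained percentage)} vs. the baseline."""
    base = scores[baseline]
    return {
        name: (round(score - base, 2), round(100 * score / base, 1))
        for name, score in scores.items()
        if name != baseline
    }


results = compare_to_baseline(OVERALL)
for name, (delta, retained) in results.items():
    print(f"{name}: {delta:+.2f} points, {retained:.1f}% of baseline")
```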

Subcategory Scores

| Model | AV | Physical World | Time | Space | Agibot | HoloAssist | RoboFail | RoboVQA | BridgeData V2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nvidia/Cosmos-Reason2-2B | 44.00 | 46.90 | 45.30 | 55.00 | 34.00 | 60.00 | 49.00 | 90.91 | 42.00 |
| embedl/Cosmos-Reason2-2B-NVFP4A16 | 44.00 | 45.13 | 52.01 | 52.50 | 28.00 | 58.00 | 51.00 | 84.55 | 32.00 |
| embedl/Cosmos-Reason2-2B-W4A16 | 36.00 | 47.79 | 44.30 | 53.75 | 36.00 | 61.00 | 42.00 | 80.91 | 44.00 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 45.00 | 44.25 | 48.66 | 52.50 | 32.00 | 59.00 | 54.00 | 85.45 | 43.00 |
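The subcategory scores move in different directions, so it can help to sort the per-subcategory differences for a given variant. The sketch below does this for the W4A16 row against the baseline row, using the values from the table above; the helper name `subcategory_deltas` is hypothetical.

```python
# Subcategory scores copied from the table above.
SUBCATEGORIES = [
    "AV", "Physical World", "Time", "Space", "Agibot",
    "HoloAssist", "RoboFail", "RoboVQA", "BridgeData V2",
]
BASELINE = [44.00, 46.90, 45.30, 55.00, 34.00, 60.00, 49.00, 90.91, 42.00]
W4A16 = [36.00, 47.79, 44.30, 53.75, 36.00, 61.00, 42.00, 80.91, 44.00]


def subcategory_deltas(model_scores, baseline_scores, names=SUBCATEGORIES):
    """Return {subcategory: model score minus baseline score}."""
    return {
        name: round(model - base, 2)
        for name, model, base in zip(names, model_scores, baseline_scores)
    }


deltas = subcategory_deltas(W4A16, BASELINE)
# Print from the largest drop to the largest gain.
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:<15} {delta:+.2f}")
```

For this row, the largest drops are on RoboVQA, AV, and RoboFail, while Agibot, BridgeData V2, HoloAssist, and Physical World score slightly above the baseline.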

Performance

On-device performance benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

Screenshot Edge Inference Benchmarks

Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).
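To see why a long context can exhaust memory, the sketch below estimates the KV cache needed for one full-length sequence. The formula is the standard one for transformer attention caches, but the layer count, KV head count, and head dimension are placeholders chosen for illustration, not values taken from this model; read the real ones from the model's config.json before relying on the numbers.

```python
# Rough KV-cache size estimate for a single full-length sequence.
# The architecture numbers in ASSUMED are hypothetical placeholders.

def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Bytes for keys and values across all layers (16-bit cache by default)."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


ASSUMED = dict(num_layers=28, num_kv_heads=8, head_dim=128)  # hypothetical

for max_model_len in (131072, 32768, 8192):
    gib = kv_cache_bytes(max_model_len, **ASSUMED) / 2**30
    print(f"max_model_len={max_model_len:>6}: ~{gib:.3f} GiB of KV cache")
```

Under these assumed values the cache grows linearly with max_model_len, which is why lowering it is usually the first thing to try on memory-constrained devices. The estimate covers the KV cache only; model weights and activation memory come on top.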

vLLM Video Inference

vLLM image: NVIDIA vLLM 0.14.0 for Jetson

Test Hardware: NVIDIA Jetson AGX Orin

--gpu-memory-utilization and --max-num-seqs should be adapted to system specifications (i.e., available RAM).

docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2
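Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable on localhost at vLLM's default port 8000; the helper names `build_payload` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumption: the server started above listens on localhost:8000.
BASE_URL = "http://localhost:8000/v1"
MODEL = "embedl/Cosmos-Reason2-2B-W4A16"
VIDEO_URL = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"


def build_payload(prompt, video_url=VIDEO_URL, max_tokens=256):
    """Build an OpenAI-style chat completion request body with one video."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def chat(prompt):
    """POST the request to the server and return the generated text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the container running, calling `print(chat("Describe this video in detail."))` should print the model's description of the clip.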

Test Hardware: NVIDIA Jetson AGX Orin, NVIDIA Jetson Orin Nano Super

gpu_memory_utilization and max_num_seqs should be adapted to system specifications (i.e., available RAM).

from vllm import LLM, SamplingParams

if __name__ == "__main__":

    model = "embedl/Cosmos-Reason2-2B-W4A16"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": video_url, "fps": 4},
                },
                {
                    "type": "text",
                    "text": "Describe this video in detail.",
                },
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {
                "count": 1,
                "num_frames": 12,
                "width": 1920,
                "height": 1080,
            },
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        # System-specific settings - Adapt depending on available RAM
        disable_log_stats=False,
        gpu_memory_utilization=0.75,
        max_num_seqs=2,
    )

    output = llm.chat(
        messages,
        sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
    )
    print(output[0].outputs[0].text)

Transformers Inference

Test Hardware: NVIDIA L4 GPU

Adapted from nvidia/Cosmos-Reason2-2B.

import torch
import transformers

if __name__ == "__main__":
    model_name = "embedl/Cosmos-Reason2-2B-W4A16"
    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        device_map="auto",
        attn_implementation="sdpa",
    )
    processor: transformers.Qwen3VLProcessor = (
        transformers.AutoProcessor.from_pretrained(model_name)
    )
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    video_messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_url, "fps": 4},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    # Process inputs
    inputs = processor.apply_chat_template(
        video_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        truncation=False,
        fps=4,
    )
    inputs = inputs.to(model.device)

    # Run inference
    generated_ids = model.generate(**inputs, max_new_tokens=4096)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(
            inputs.input_ids, generated_ids, strict=False
        )
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text[0])
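The trimming step in the script exists because, for decoder-only models, generate returns the prompt tokens followed by the newly generated tokens. The toy example below shows the same slicing on plain lists; the token IDs are made up for illustration.

```python
# Toy token IDs, purely illustrative.
input_ids = [[101, 7, 8, 9]]                 # one prompt of four tokens
generated_ids = [[101, 7, 8, 9, 42, 43, 2]]  # prompt followed by three new tokens

# Drop the prompt prefix from each sequence, keeping only the new tokens.
trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(trimmed)  # [[42, 43, 2]]
```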

License

Built on NVIDIA Cosmos

This model is a derivative of nvidia/Cosmos-Reason2-2B.

Licensed by NVIDIA Corporation under the NVIDIA Open Model License


Contact

Enterprise & Commercial Inquiries contact@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on this model, or exploring custom model optimization, reach out for:

  • Engineering support for on-prem/edge deployments
  • Early access & partner co-marketing opportunities

Contact: contact@embedl.com

