nomic-embed-code-W4A16-AWQ

This is a W4A16 quantized version of nomic-ai/nomic-embed-code.

Quantized using AWQ (Activation-aware Weight Quantization) with llm-compressor!

Quantization Details

  • Method: one-shot post-training quantization (PTQ) with llm-compressor (a recipe sketch follows this list)
  • Algorithm: AWQ (Activation-aware Weight Quantization)
  • Scheme: W4A16
  • Weight bits: 4-bit
  • Activation bits: 16-bit
  • Group size: 128
  • Format: compressed-tensors
  • Size reduction: ~75% compared to FP16
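
The exact calibration setup used for this checkpoint is not included here, but a minimal llm-compressor AWQ one-shot recipe matching the settings above could look like the sketch below. The calibration dataset, sample count, and sequence length are placeholders, and modifier arguments and import paths may vary between llm-compressor versions.

from transformers import AutoModel, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "nomic-ai/nomic-embed-code"

model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# W4A16: 4-bit weights (group size 128 for this scheme), 16-bit activations
recipe = [AWQModifier(scheme="W4A16", targets=["Linear"], ignore=["lm_head"])]

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",        # placeholder calibration dataset
    num_calibration_samples=256,    # placeholder sample count
    max_seq_length=2048,            # placeholder sequence length
)

# Save in the compressed-tensors format
model.save_pretrained("nomic-embed-code-W4A16-AWQ", save_compressed=True)
tokenizer.save_pretrained("nomic-embed-code-W4A16-AWQ")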

Usage

import torch
from transformers import AutoModel, AutoTokenizer

# Load quantized model and tokenizer
model = AutoModel.from_pretrained(
    "pyrymikko/nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "pyrymikko/nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True
)

# Generate embeddings with attention-mask-aware mean pooling,
# so padding tokens do not dilute the average
texts = ["Hello world", "Example text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)
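
Because the embeddings above are L2-normalized, cosine similarity between two texts is just a dot product; for example:

# Cosine similarity between the two example texts (embeddings are normalized above)
similarity = (embeddings[0] @ embeddings[1]).item()
print(f"cosine similarity: {similarity:.3f}")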

Performance

  • Memory usage: ~75% reduction vs FP16 (a rough estimate follows this list)
  • Inference speed: Similar or faster on compatible hardware
  • Quality: Minimal degradation (<1% on most embedding tasks)
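
As a rough sanity check on the size figure, assume one FP16 scale and one 4-bit zero point per group of 128 weights (the exact compressed-tensors layout may differ slightly); storage per weight then works out to roughly a quarter of FP16:

# Back-of-the-envelope estimate of W4A16 storage with group size 128
GROUP_SIZE = 128
bits_per_weight = 4 + (16 + 4) / GROUP_SIZE    # weight bits + amortized scale/zero point
reduction = 1 - bits_per_weight / 16           # vs. 16-bit FP16 weights
print(f"~{bits_per_weight:.2f} bits/weight, ~{reduction:.0%} smaller than FP16")
# -> ~4.16 bits/weight, ~74% smaller than FP16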

Why AWQ?

AWQ (Activation-aware Weight Quantization) is a one-shot weight quantization method that:

  • Protects salient weight channels based on observed activation magnitudes (hence "activation-aware")
  • Uses calibration data to identify which weight channels matter most
  • Provides better accuracy than GPTQ and naive round-to-nearest (RTN) quantization
  • Works efficiently with group-wise quantization (group size 128); see the toy sketch after this list
  • Maintains model quality while achieving 75% size reduction
  • Optimal for embedding models that rely on preserving semantic relationships
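
To make the group-wise scheme concrete, the toy sketch below quantizes and dequantizes a weight vector with symmetric 4-bit codes and one scale per group of 128 values. It only illustrates the storage scheme; it is not llm-compressor's implementation and omits AWQ's activation-aware channel scaling.

import torch

def quant_dequant_w4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Toy symmetric 4-bit group-wise quantize/dequantize."""
    groups = w.reshape(-1, group_size)                     # one scale per group of 128
    scale = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0  # map max |w| to int4 level 7
    q = torch.clamp(torch.round(groups / scale), -8, 7)    # 4-bit integer codes
    return (q * scale).reshape(w.shape)                    # dequantized back to float

w = torch.randn(4096)
print("max abs error:", (w - quant_dequant_w4(w)).abs().max().item())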

Original Model

This quantized model is based on nomic-ai/nomic-embed-code.

Citation

If you use this model, please cite the original model and llmcompressor:

@software{llmcompressor,
  title = {LLM Compressor},
  author = {Neural Magic},
  url = {https://github.com/vllm-project/llm-compressor},
  year = {2024}
}