---
license: mit
tags:
- quantized
- embedding
- W4A16
- llmcompressor
- awq
- 4-bit
- activation-aware
base_model: nomic-ai/nomic-embed-code
---

# nomic-embed-code-W4A16-AWQ

This is a **W4A16 quantized** version of [nomic-ai/nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code).

**Quantized using AWQ (Activation-aware Weight Quantization) with [llm-compressor](https://github.com/vllm-project/llm-compressor).**

## Quantization Details

- **Method**: one-shot post-training quantization (PTQ) with [llm-compressor](https://github.com/vllm-project/llm-compressor); see the recipe sketch below
- **Algorithm**: AWQ (Activation-aware Weight Quantization)
- **Scheme**: W4A16 (4-bit weights, 16-bit activations)
- **Weight bits**: 4
- **Activation bits**: 16
- **Group size**: 128
- **Format**: compressed-tensors
- **Size reduction**: ~75% smaller weights compared to FP16
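
The checkpoint was produced with an llm-compressor one-shot AWQ recipe. The sketch below shows the general shape of such a recipe; the calibration dataset, sample count, and exact `AWQModifier` arguments are assumptions (they are not recorded here), import paths vary between llm-compressor versions, and for an encoder-style embedding model the base model may need to be loaded and passed in explicitly rather than referenced by name.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# W4A16: 4-bit weights with group-wise scales, activations kept in 16-bit
recipe = AWQModifier(scheme="W4A16", targets=["Linear"])

# One-shot PTQ over a small calibration set
# (dataset name and sample count here are assumptions, not the actual settings)
oneshot(
    model="nomic-ai/nomic-embed-code",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="nomic-embed-code-W4A16-AWQ",
    max_seq_length=512,
    num_calibration_samples=256,
)
```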

## Usage

Loading a compressed-tensors checkpoint through `transformers` requires the `compressed-tensors` package to be installed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the quantized model and tokenizer
model = AutoModel.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)

# Generate embeddings
texts = ["Hello world", "Example text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mean-pool over real tokens only, so padding does not dilute the embedding
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)
```
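
Continuing from the snippet above, the pooled vectors can be L2-normalized and compared with cosine similarity (a generic check; any task-specific prompt formatting recommended by the original model card is not shown here):

```python
import torch.nn.functional as F

# Normalize, then compute a matrix of pairwise cosine similarities
normed = F.normalize(embeddings, p=2, dim=1)
similarity = normed @ normed.T
print(similarity)
```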

## Performance

- **Memory usage**: roughly 75% less weight memory than FP16 (see the check below)
- **Inference speed**: similar or faster on hardware and kernels with W4A16 support
- **Quality**: minimal degradation expected (typically <1% on most embedding tasks)
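
One way to sanity-check the memory claim (a rough sketch; `get_memory_footprint()` counts parameters and buffers only, and the exact ratio depends on how the checkpoint is loaded):

```python
import torch
from transformers import AutoModel

quantized = AutoModel.from_pretrained("nomic-embed-code-W4A16-AWQ", trust_remote_code=True)
baseline = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-code", torch_dtype=torch.float16, trust_remote_code=True
)

gib = 1024 ** 3
print(f"quantized: {quantized.get_memory_footprint() / gib:.2f} GiB")
print(f"fp16 base: {baseline.get_memory_footprint() / gib:.2f} GiB")
```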

## Why AWQ?

AWQ (Activation-aware Weight Quantization) is a one-shot weight quantization method that:

- **Activation-aware**: protects salient weight channels based on activation magnitudes rather than weight values alone
- Uses a small calibration set to identify and rescale those important channels before rounding
- Typically gives better accuracy than naive round-to-nearest (RTN), and often better than GPTQ
- Pairs well with group-wise quantization (group size 128); see the sketch after this list
- Maintains model quality while achieving the ~75% size reduction
- Is a good fit for embedding models, which depend on preserving semantic relationships between vectors
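
To make the group-wise part concrete, here is a toy illustration of 4-bit round-to-nearest quantization with one scale per group of 128 weights (this is the baseline that AWQ improves on; AWQ additionally rescales salient channels before rounding, which is not shown):

```python
import torch

def quantize_w4_groupwise(weight: torch.Tensor, group_size: int = 128):
    """Toy symmetric 4-bit quantization with one FP scale per group of weights."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True) / 7.0           # symmetric int4 range [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)              # 4-bit integer weights
    w_hat = (q * scale).reshape(out_features, in_features)      # what the kernel reconstructs
    return q, scale, w_hat

w = torch.randn(16, 256)                   # in_features must be divisible by the group size
q, scale, w_hat = quantize_w4_groupwise(w)
print((w - w_hat).abs().max())             # per-group scales keep the rounding error small
```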

## Original Model

This quantized model is based on [nomic-ai/nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code).

## Citation

If you use this model, please cite the original model and llm-compressor:

```bibtex
@software{llmcompressor,
  title  = {LLM Compressor},
  author = {Neural Magic},
  url    = {https://github.com/vllm-project/llm-compressor},
  year   = {2024}
}
```