---
license: mit
tags:
- quantized
- embedding
- W4A16
- llmcompressor
- awq
- 4-bit
- activation-aware
base_model: nomic-ai/nomic-embed-code
---

# nomic-embed-code-W4A16-AWQ

This is a **W4A16 quantized** version of [nomic-ai/nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code).

**Quantized with AWQ (Activation-aware Weight Quantization) using [llm-compressor](https://github.com/vllm-project/llm-compressor).**

## Quantization Details

- **Method**: llmcompressor (AWQ one-shot PTQ)
- **Algorithm**: AWQ (Activation-aware Weight Quantization)
- **Scheme**: W4A16
- **Weight bits**: 4-bit
- **Activation bits**: 16-bit
- **Group size**: 128 (see the numeric sketch below)
- **Format**: compressed-tensors
- **Size reduction**: ~75% compared to FP16
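
As a rough illustration of what the W4A16 / group-size-128 scheme means numerically, here is a minimal per-group asymmetric quantize/dequantize sketch in plain PyTorch. This is not the compressed-tensors kernel, just the arithmetic it implements; shapes and names are illustrative.

```python
import torch

def quantize_w4a16_group(weights: torch.Tensor, group_size: int = 128):
    """Illustrative per-group 4-bit asymmetric weight quantization."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)   # 4 bits -> 16 levels (0..15)
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, 15)  # packed to 4-bit in practice
    dequant = (q - zero) * scale                        # reconstructed at runtime
    return q.to(torch.uint8), scale, zero, dequant.reshape(weights.shape)

w = torch.randn(4096, 4096)                # a hypothetical weight matrix
q, scale, zero, w_hat = quantize_w4a16_group(w)
print((w - w_hat).abs().mean())            # small per-weight reconstruction error
```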

## Usage

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load quantized model
model = AutoModel.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True
)

# Generate embeddings with mask-aware mean pooling (excludes padding tokens)
texts = ["Hello world", "Example text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)
```
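
If the embeddings feed a similarity search, they are typically L2-normalized before comparison. A short follow-up using the `embeddings` tensor from above (assuming cosine similarity is the downstream metric):

```python
import torch.nn.functional as F

normed = F.normalize(embeddings, p=2, dim=1)   # unit-length vectors
similarity = normed @ normed.T                 # pairwise cosine similarity matrix
print(similarity)
```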

## Performance

- **Memory usage**: ~75% reduction vs FP16 (rough arithmetic below)
- **Inference speed**: Comparable to or faster than FP16 on runtimes with 4-bit kernel support
- **Quality**: Minimal degradation (<1% on most embedding tasks)
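
Back-of-the-envelope for the ~75% figure, assuming 16-bit group scales and 4-bit zero-points per group of 128 (unquantized modules such as embeddings keep their FP16 size, so the end-to-end reduction is somewhat smaller):

```python
# Approximate storage per quantized weight: 4-bit value plus amortized group metadata
bits_per_weight = 4 + (16 + 4) / 128      # weight + (scale + zero-point) / group_size
print(bits_per_weight)                    # ~4.16 bits, vs 16 bits for FP16
print(1 - bits_per_weight / 16)           # ~0.74 -> roughly a 75% reduction
```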

## Why AWQ?

AWQ (Activation-aware Weight Quantization) is a one-shot, post-training weight quantization method that:
- Protects salient weight channels, identified from activation magnitudes on calibration data
- Typically recovers more accuracy than round-to-nearest (RTN) and is competitive with GPTQ
- Works efficiently with group-wise quantization (group size 128 here)
- Maintains model quality while achieving roughly 75% size reduction
- Suits embedding models, which depend on preserving semantic relationships (the scaling idea is sketched below)
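
Conceptually, AWQ derives per-input-channel scales from calibration activation statistics, scales the weights up on those channels before 4-bit quantization, and folds the inverse scale into the preceding operation so the layer's output is mathematically unchanged. A toy sketch of that idea, not llm-compressor's actual implementation (the function, shapes, and the fixed `alpha` exponent are illustrative; real AWQ grid-searches the exponent):

```python
import torch

def awq_style_scale(weight: torch.Tensor, act_magnitude: torch.Tensor, alpha: float = 0.5):
    """Toy per-input-channel scaling in the spirit of AWQ.

    weight:         [out_features, in_features] of a Linear layer
    act_magnitude:  mean |activation| per input channel, from calibration data
    Channels with large activations get larger scales, so their weights keep
    more resolution on the 4-bit grid after quantization.
    """
    s = act_magnitude.clamp(min=1e-5) ** alpha   # real AWQ searches over this exponent
    scaled_weight = weight * s                   # this is what gets quantized to 4-bit
    # At runtime the preceding activations are divided by s (folded into the prior op),
    # so (x / s) @ (weight * s).T == x @ weight.T before any quantization error.
    return scaled_weight, s

W = torch.randn(1024, 4096)        # hypothetical Linear weight [out, in]
act_mag = torch.rand(4096)         # per-input-channel activation magnitudes (calibration)
W_scaled, s = awq_style_scale(W, act_mag)
```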

## Original Model

This quantized model is based on [nomic-ai/nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code).

## Citation

If you use this model, please cite the original model and llmcompressor:

```bibtex
@software{llmcompressor,
  title = {LLM Compressor},
  author = {Neural Magic},
  url = {https://github.com/vllm-project/llm-compressor},
  year = {2024}
}
```