---
license: mit
tags:
- quantized
- embedding
- W4A16
- llmcompressor
- awq
- 4-bit
- activation-aware
base_model: nomic-ai/nomic-embed-code
---

# nomic-embed-code-W4A16-AWQ

This is a **W4A16 quantized** version of [nomic-ai/nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code).

**Quantized using AWQ (Activation-aware Weight Quantization) with [llm-compressor](https://github.com/vllm-project/llm-compressor).**

## Quantization Details

- **Method**: one-shot post-training quantization (PTQ) with [llm-compressor](https://github.com/vllm-project/llm-compressor); see the recipe sketch below
- **Algorithm**: AWQ (Activation-aware Weight Quantization)
- **Scheme**: W4A16 (4-bit weights, 16-bit activations)
- **Weight bits**: 4
- **Activation bits**: 16
- **Group size**: 128
- **Format**: compressed-tensors
- **Size reduction**: ~75% smaller weights compared to FP16
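
The checkpoint was produced with an llm-compressor one-shot AWQ recipe. The sketch below shows the general shape of such a recipe; the calibration dataset, sample count, and exact `AWQModifier` arguments are assumptions (they are not recorded here), import paths vary between llm-compressor versions, and for an encoder-style embedding model the base model may need to be loaded and passed in explicitly rather than referenced by name.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# W4A16: 4-bit weights with group-wise scales, activations kept in 16-bit
recipe = AWQModifier(scheme="W4A16", targets=["Linear"])

# One-shot PTQ over a small calibration set
# (dataset name and sample count here are assumptions, not the actual settings)
oneshot(
    model="nomic-ai/nomic-embed-code",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="nomic-embed-code-W4A16-AWQ",
    max_seq_length=512,
    num_calibration_samples=256,
)
```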

## Usage

Loading a compressed-tensors checkpoint through `transformers` requires the `compressed-tensors` package to be installed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the quantized model and tokenizer
model = AutoModel.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)

# Generate embeddings
texts = ["Hello world", "Example text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mean-pool over real tokens only, so padding does not dilute the embedding
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)
```
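
Continuing from the snippet above, the pooled vectors can be L2-normalized and compared with cosine similarity (a generic check; any task-specific prompt formatting recommended by the original model card is not shown here):

```python
import torch.nn.functional as F

# Normalize, then compute a matrix of pairwise cosine similarities
normed = F.normalize(embeddings, p=2, dim=1)
similarity = normed @ normed.T
print(similarity)
```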

## Performance

- **Memory usage**: roughly 75% less weight memory than FP16 (see the check below)
- **Inference speed**: similar or faster on hardware and kernels with W4A16 support
- **Quality**: minimal degradation expected (typically <1% on most embedding tasks)
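
One way to sanity-check the memory claim (a rough sketch; `get_memory_footprint()` counts parameters and buffers only, and the exact ratio depends on how the checkpoint is loaded):

```python
import torch
from transformers import AutoModel

quantized = AutoModel.from_pretrained("nomic-embed-code-W4A16-AWQ", trust_remote_code=True)
baseline = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-code", torch_dtype=torch.float16, trust_remote_code=True
)

gib = 1024 ** 3
print(f"quantized: {quantized.get_memory_footprint() / gib:.2f} GiB")
print(f"fp16 base: {baseline.get_memory_footprint() / gib:.2f} GiB")
```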

## Why AWQ?

AWQ (Activation-aware Weight Quantization) is a one-shot weight quantization method that:

- **Activation-aware**: protects salient weight channels based on activation magnitudes rather than weight values alone
- Uses a small calibration set to identify and rescale those important channels before rounding
- Typically gives better accuracy than naive round-to-nearest (RTN), and often better than GPTQ
- Pairs well with group-wise quantization (group size 128); see the sketch after this list
- Maintains model quality while achieving the ~75% size reduction
- Is a good fit for embedding models, which depend on preserving semantic relationships between vectors
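
To make the group-wise part concrete, here is a toy illustration of 4-bit round-to-nearest quantization with one scale per group of 128 weights (this is the baseline that AWQ improves on; AWQ additionally rescales salient channels before rounding, which is not shown):

```python
import torch

def quantize_w4_groupwise(weight: torch.Tensor, group_size: int = 128):
    """Toy symmetric 4-bit quantization with one FP scale per group of weights."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True) / 7.0           # symmetric int4 range [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)              # 4-bit integer weights
    w_hat = (q * scale).reshape(out_features, in_features)      # what the kernel reconstructs
    return q, scale, w_hat

w = torch.randn(16, 256)                   # in_features must be divisible by the group size
q, scale, w_hat = quantize_w4_groupwise(w)
print((w - w_hat).abs().max())             # per-group scales keep the rounding error small
```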

## Original Model

This quantized model is based on [nomic-ai/nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code).

## Citation

If you use this model, please cite the original model and llm-compressor:

```bibtex
@software{llmcompressor,
  title  = {LLM Compressor},
  author = {Neural Magic},
  url    = {https://github.com/vllm-project/llm-compressor},
  year   = {2024}
}
```