---
license: mit
tags:
- quantized
- embedding
- W4A16
- llmcompressor
- awq
- 4-bit
- activation-aware
base_model: nomic-ai/nomic-embed-code
---

# nomic-embed-code-W4A16-AWQ

This is a **W4A16 quantized** version of [nomic-ai/nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code), quantized with **AWQ (Activation-aware Weight Quantization)** using [llm-compressor](https://github.com/vllm-project/llm-compressor).

## Quantization Details

- **Method**: llm-compressor (AWQ one-shot PTQ)
- **Algorithm**: AWQ (Activation-aware Weight Quantization)
- **Scheme**: W4A16
- **Weight bits**: 4
- **Activation bits**: 16
- **Group size**: 128
- **Format**: compressed-tensors
- **Size reduction**: ~75% compared to FP16

A sketch of how this quantization can be reproduced with llm-compressor is included at the end of this card.

## Usage

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the quantized model and tokenizer
model = AutoModel.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True,
)

# Generate embeddings with attention-mask-aware mean pooling
texts = ["Hello world", "Example text"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1)  # exclude padding tokens from the mean
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```

## Performance

- **Memory usage**: ~75% reduction vs FP16
- **Inference speed**: similar to or faster than FP16 on compatible hardware
- **Quality**: minimal degradation (<1% on most embedding tasks)

## Why AWQ?

AWQ (Activation-aware Weight Quantization) is a one-shot weight quantization method that:

- **Activation-aware**: protects salient weight channels based on the activation magnitudes observed during calibration
- **Calibration-driven**: uses a small calibration set to identify the weight channels that matter most
- **Accurate**: typically outperforms naive round-to-nearest (RTN) and is competitive with GPTQ
- **Group-wise**: quantizes weights in groups (group size 128 here) for a good accuracy/size trade-off
- **Compact**: maintains model quality while achieving a ~75% size reduction
- **Well suited to embedding models**, which depend on preserving semantic relationships

## Original Model

This quantized model is based on [nomic-ai/nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code).

## Citation

If you use this model, please cite the original model and llm-compressor:

```bibtex
@software{llmcompressor,
  title  = {LLM Compressor},
  author = {Neural Magic},
  url    = {https://github.com/vllm-project/llm-compressor},
  year   = {2024}
}
```
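
## Reproducing the Quantization

The exact recipe used for this checkpoint is not documented here, but a one-shot AWQ flow with llm-compressor generally looks like the sketch below. The calibration dataset, text column, sample count, sequence length, and output directory are placeholders, and the import paths (`oneshot`, `AWQModifier`) and scheme string assume a recent llm-compressor release; check the llm-compressor documentation for the API that matches your installed version.

```python
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# llm-compressor imports (paths assume a recent release that ships AWQModifier)
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "nomic-ai/nomic-embed-code"
SAVE_DIR = "nomic-embed-code-W4A16-AWQ"         # placeholder output directory
CALIBRATION_DATASET = "bigcode/the-stack-smol"  # placeholder calibration set
NUM_CALIBRATION_SAMPLES = 256                   # placeholder sample count
MAX_SEQUENCE_LENGTH = 2048                      # placeholder sequence length

model = AutoModel.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Build a small, pre-tokenized calibration set
ds = load_dataset(CALIBRATION_DATASET, split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def tokenize(sample):
    return tokenizer(
        sample["content"],  # text column name depends on the chosen dataset
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# AWQ recipe: 4-bit weights, 16-bit activations, applied to all Linear layers
recipe = [AWQModifier(targets=["Linear"], scheme="W4A16")]

# One-shot post-training quantization with activation-aware calibration
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save weights in the compressed-tensors format
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

For a code embedding model, a code-heavy calibration set is a more natural choice than general text; swap the dataset and text column for whatever corpus best matches your retrieval domain.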