EliasOenal committed on
Commit 8bb7311 · verified · Parent(s): f2e6aaf

Update README.md

Files changed (1): README.md (+44 −3)
README.md (updated):

---
license: apache-2.0
language:
- en
tags:
- mistral
- mistral-small
- w8a8
- vllm
base_model: mistralai/Mistral-Small-24B-Instruct-2501
library_name: transformers
datasets:
- neuralmagic/LLM_compression_calibration
---

# Mistral-Small-24B-Instruct-2501-W8A8-dynamic

## Model Overview
- **Model Architecture:** Mistral-Small-24B-Instruct-2501
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Release Date:** 2/12/2025
- **Version:** 1.0
- **Model Developers:** Elias Oenal

Quantized version of [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501).

### Model Optimizations

This model was obtained by quantizing the weights and activations of the base model to the W8A8 data type, ready for inference with vLLM. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks are quantized; activation quantization is dynamic, with scales computed per token at runtime (hence the "dynamic" suffix in the model name).
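
As a back-of-the-envelope check on the 50% figure (a sketch only; the ~24B parameter count is read off the model name, and only weight storage is counted):

```python
# Approximate weight storage for a ~24B-parameter model.
params = 24e9
print(f"16-bit weights: ~{params * 2 / 1e9:.0f} GB")  # ~48 GB
print(f"INT8 weights:   ~{params * 1 / 1e9:.0f} GB")  # ~24 GB
```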

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, for example as in the sketch below.

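A minimal offline-generation example (the repository id is assumed from this card's title; the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Repository id assumed from the model card title.
model_id = "EliasOenal/Mistral-Small-24B-Instruct-2501-W8A8-dynamic"

# vLLM reads the quantization config from the checkpoint and selects
# its INT8 W8A8 kernels automatically; no extra flags are required.
llm = LLM(model=model_id)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [{"role": "user", "content": "Briefly explain INT8 quantization."}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be served behind an OpenAI-compatible API with `vllm serve <model_id>`.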

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) and the [neuralmagic/LLM_compression_calibration](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) dataset.
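
The exact recipe is not included in this card; the following one-shot flow is a plausible reconstruction. The SmoothQuant/GPTQ recipe, calibration sample count, sequence length, and dataset column name are assumptions, not the author's confirmed settings.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistralai/Mistral-Small-24B-Instruct-2501"
NUM_SAMPLES = 512  # assumed
MAX_LEN = 2048     # assumed

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration set; oneshot tokenizes the dataset's "text" column
# (the column name is an assumption about this dataset).
ds = load_dataset(
    "neuralmagic/LLM_compression_calibration", split=f"train[:{NUM_SAMPLES}]"
)

# W8A8 scheme: INT8 weights (static, per channel) plus INT8 activations
# (dynamic, per token) on Linear layers; the lm_head stays unquantized.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

SAVE_DIR = "Mistral-Small-24B-Instruct-2501-W8A8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```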