hypothetical committed on
Commit eba5c8b · verified · 1 Parent(s): ee415d9

Update README.md (#2)

- Update README.md (c95c6cc7178956a7f7486e94fbdb9c3a5b1f722d)

Files changed (1)
  1. README.md +31 -20
README.md CHANGED
@@ -7,8 +7,7 @@ pipeline_tag: text2text-generation
 
 # Elastic models
 
-Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator.
-ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:
+Elastic models are models produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA allows you to control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized variants:
 
 * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
 
@@ -25,10 +24,10 @@ __Goals of elastic models:__
 * Provide clear quality and latency benchmarks
 * Provide interface of HF libraries: transformers and diffusers with a single line of code
 * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
+* Provide the best models and service for self-hosting.
 
 > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.
 
-
 ## Inference
 
 To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:
@@ -38,21 +37,28 @@ import torch
 from transformers import AutoTokenizer
 from elastic_models.transformers import AutoModelForCausalLM
 
+# Currently we require your HF token, since we use the
+# original weights for part of the layers, as well as
+# the model configuration.
 model_name = "mistralai/Mistral-7B-Instruct-v0.3"
-token = ''
-
+hf_token = ''
+hf_cache_dir = ''
 device = torch.device("cuda")
-tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
 
+# Create the tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained(
+    model_name, token=hf_token
+)
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
-    token=token,
-    cache_dir="/mnt/rnd/huggingface_cache",
+    token=hf_token,
+    cache_dir=hf_cache_dir,
     torch_dtype=torch.bfloat16,
     attn_implementation="sdpa"
 ).to(device)
 model.generation_config.pad_token_id = tokenizer.eos_token_id
 
+# Inference is as simple as with the transformers library
 prompt = "Describe basics of DNNs quantization."
 inputs = tokenizer(prompt, return_tensors="pt")
 inputs.to(device)
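
Between this hunk and the next, the unchanged generation call is elided by the diff (old lines 59-64). For readers reconstructing the full script, here is a minimal sketch of that step, assuming standard `transformers` generation semantics; the variable name and the `max_new_tokens` value are illustrative assumptions, not taken from the README:

```python
# Illustrative sketch (assumed, not from the README): produce token ids
# from the tokenized prompt; the next hunk decodes them with
# tokenizer.batch_decode. max_new_tokens is an assumed value.
with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=256)
```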
@@ -65,12 +71,13 @@ output = tokenizer.batch_decode(
     skip_special_tokens=True,
     clean_up_tokenization_spaces=False
 )[0]
+
+# Validate answer
 print(f"# Q:\n{prompt}\n")
 print(f"# A:\n{output}\n")
 ```
 
----
-### System requirements
+### Installation
 
 __GPUs__: H100, L40s
 
@@ -78,17 +85,14 @@ __OS__: Linux #TODO
 
 __Python__: 3.10-3.12
 
-
----
-### Installation
+To work with our models, install the following packages:
 
 ```shell
 pip install thestage
 pip install elastic_models
 ```
 
-Then go to app.thestage.ai, login and generate API token from your profile page.
-Set up API token as follows:
+Then go to app.thestage.ai, log in, and generate an API token from your profile page. Set up the API token as follows:
 
 ```shell
 thestage config set --api-token <YOUR_API_TOKEN>
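
With the packages installed and the token configured, a quick import check can confirm the environment resolves before running the full example above. A minimal smoke test, assuming only the `elastic_models.transformers` entry point already used in the README:

```python
# Minimal smoke test: confirms elastic_models is installed and its
# transformers-compatible entry point is importable.
from elastic_models.transformers import AutoModelForCausalLM

print("elastic_models.transformers is importable")
```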
@@ -96,6 +100,7 @@ thestage config set --api-token <YOUR_API_TOKEN>
 ```
 
 Congrats, now you can use accelerated models!
 
+----
 
 ## Benchmarks
 
@@ -113,7 +118,7 @@ For quality evaluation we have used: #TODO link to github
 | Winogrande | 0 | 0 | 0 | 0 | 0 | 0 |
 
-> __MMLU__: Evaluates/shows ...
+> __MMLU__: Evaluates/shows {MMLU}
 
 > __MMLU__: Evaluates/shows ...
 
@@ -121,28 +126,34 @@ For quality evaluation we have used: #TODO link to github
 
 > __PIQA__: Evaluates/shows ...
 
-
 ### Latency benchmarks
 
 We have profiled models in different scenarios:
 
-__100 input/300 output tok/s__:
+<table>
+<tr><th> 100 input/300 output; tok/s </th><th> 1000 input/1000 output; tok/s </th></tr>
+<tr><td>
+
 | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
 |-----------|-----|---|---|----|----------|------------|
 | H100 | 189 | 0 | 0 | 0 | 48 | 0 |
 | L40s | 79 | 0 | 0 | 0 | 42 | 0 |
 
-__1000 input/1000 output tok/s__:
+
+</td><td>
+
 | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
 |-----------|-----|---|---|----|----------|------------|
 | H100 | 189 | 0 | 0 | 0 | 48 | 0 |
 | L40s | 79 | 0 | 0 | 0 | 42 | 0 |
 
+</td></tr> </table>
+
 ## Links
 
 * __Platform__: [app.thestage.ai](app.thestage.ai)
 * __Elastic models Github__: [app.thestage.ai](app.thestage.ai)
 * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
-* __Contact email__: [email protected]
+* __Contact email__: [email protected]
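
For scale, the populated entries in the latency tables above imply that the S model decodes roughly 3.9x faster than the original on H100 (189 vs 48 tok/s) and roughly 1.9x faster on L40s (79 vs 42 tok/s); the zero entries appear to be placeholders pending measurement.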
 