Update README.md (#2)
- Update README.md (c95c6cc7178956a7f7486e94fbdb9c3a5b1f722d)
README.md
CHANGED
@@ -7,8 +7,7 @@ pipeline_tag: text2text-generation

# Elastic models

-Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator.
-ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

@@ -25,10 +24,10 @@ __Goals of elastic models:__
* Provide clear quality and latency benchmarks
* Provide interface of HF libraries: transformers and diffusers with a single line of code
* Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.

> It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

-
## Inference

To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:
@@ -38,21 +37,28 @@ import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
-
-
device = torch.device("cuda")
-tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
-    token=
-    cache_dir=
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa"
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

prompt = "Describe basics of DNNs quantization."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.to(device)
@@ -65,12 +71,13 @@ output = tokenizer.batch_decode(
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```

-
-### System requirements

__GPUs__: H100, L40s

@@ -78,17 +85,14 @@ __OS__: Linux #TODO

__Python__: 3.10-3.12

-
----
-### Installation

```shell
pip install thestage
pip install elastic_models
```

-Then go to app.thestage.ai, login and generate API token from your profile page.
-Set up API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
@@ -96,6 +100,7 @@ thestage config set --api-token <YOUR_API_TOKEN>

Congrats, now you can use accelerated models!


## Benchmarks

@@ -113,7 +118,7 @@ For quality evaluation we have used: #TODO link to github
| Winogrande | 0 | 0 | 0 | 0 | 0 | 0 |


-> __MMLU__: Evaluates/shows

> __MMLU__: Evaluates/shows ...

@@ -121,28 +126,34 @@ For quality evaluation we have used: #TODO link to github

> __PIQA__: Evaluates/shows ...

-
### Latency benchmarks

We have profiled models in different scenarios:

-
| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100 | 189 | 0 | 0 | 0 | 48 | 0 |
| L40s | 79 | 0 | 0 | 0 | 42 | 0 |


-
| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100 | 189 | 0 | 0 | 0 | 48 | 0 |
| L40s | 79 | 0 | 0 | 0 | 42 | 0 |


## Links

* __Platform__: [app.thestage.ai](app.thestage.ai)
* __Elastic models Github__: [app.thestage.ai](app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
-* __Contact email__: [email protected]
@@ -7,8 +7,7 @@ pipeline_tag: text2text-generation

# Elastic models

+Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

@@ -25,10 +24,10 @@ __Goals of elastic models:__
* Provide clear quality and latency benchmarks
* Provide interface of HF libraries: transformers and diffusers with a single line of code
* Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
+* Provide the best models and service for self-hosting.

> It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

## Inference

To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:
@@ -38,21 +37,28 @@ import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

+# Currently your HF token is required,
+# as we use the original weights for part of the layers
+# and the model configuration as well
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
+hf_token = ''
+hf_cache_dir = ''
device = torch.device("cuda")

+# Create model
+tokenizer = AutoTokenizer.from_pretrained(
+    model_name, token=hf_token
+)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
+    token=hf_token,
+    cache_dir=hf_cache_dir,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa"
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

+# Inference is as simple as with the transformers library
prompt = "Describe basics of DNNs quantization."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.to(device)
@@ -65,12 +71,13 @@ output = tokenizer.batch_decode(
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
+
+# Validate answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```

+### Installation

__GPUs__: H100, L40s

@@ -78,17 +85,14 @@ __OS__: Linux #TODO

__Python__: 3.10-3.12
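
The Python requirement above can be checked before installing anything. The following is a minimal sketch using only the standard library; the helper name `python_is_supported` and the two range constants are illustrative and not part of `thestage` or `elastic_models`.

```python
import sys

# Supported range taken from the requirement line above (inclusive).
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 12)


def python_is_supported(version=None):
    """Return True if the (major, minor) pair falls inside the supported range.

    `version` is any tuple starting with (major, minor); it defaults to the
    running interpreter. This helper is hypothetical, for illustration only.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if not python_is_supported():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the supported 3.10-3.12 range"
    )
```

Comparing `(major, minor)` tuples avoids string parsing and treats any patch release of 3.10, 3.11 or 3.12 as supported.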

+To work with our models:

```shell
pip install thestage
pip install elastic_models
```

+Then go to app.thestage.ai, log in, and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
@@ -96,6 +100,7 @@ thestage config set --api-token <YOUR_API_TOKEN>

Congrats, now you can use accelerated models!

+----

## Benchmarks

@@ -113,7 +118,7 @@ For quality evaluation we have used: #TODO link to github
| Winogrande | 0 | 0 | 0 | 0 | 0 | 0 |


+> __MMLU__: Evaluates/shows {MMLU}

> __MMLU__: Evaluates/shows ...

@@ -121,28 +126,34 @@ For quality evaluation we have used: #TODO link to github

> __PIQA__: Evaluates/shows ...

### Latency benchmarks

We have profiled models in different scenarios:

+<table>
+<tr><th> 100 input/300 output; tok/s </th><th> 1000 input/1000 output; tok/s </th></tr>
+<tr><td>
+
| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100 | 189 | 0 | 0 | 0 | 48 | 0 |
| L40s | 79 | 0 | 0 | 0 | 42 | 0 |


+
+</td><td>
+
| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100 | 189 | 0 | 0 | 0 | 48 | 0 |
| L40s | 79 | 0 | 0 | 0 | 42 | 0 |

+</td></tr> </table>
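
To read the table as a speedup rather than raw throughput, divide the S column by the Original column. The sketch below does that arithmetic, assuming the values are tokens per second as the column headers state. Only the S and Original columns are used, because the other columns hold `0` in this revision. The `throughput` dict and the `speedup` helper are illustrative names, not part of any library.

```python
# Throughput in tokens/s, copied from the S and Original columns of the table.
throughput = {
    "H100": {"S": 189, "Original": 48},
    "L40s": {"S": 79, "Original": 42},
}


def speedup(gpu, variant="S", baseline="Original"):
    """Return how many times higher the `variant` throughput is than `baseline`."""
    row = throughput[gpu]
    return row[variant] / row[baseline]


for gpu in throughput:
    print(f"{gpu}: S is {speedup(gpu):.2f}x the Original throughput")
# H100: S is 3.94x the Original throughput
# L40s: S is 1.88x the Original throughput
```

Both scenario tables list the same numbers in this revision, so the ratios are identical for the two input/output settings.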
+

## Links

* __Platform__: [app.thestage.ai](app.thestage.ai)
* __Elastic models Github__: [app.thestage.ai](app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
+* __Contact email__: [email protected]