hypothetical committed on
Commit eba5c8b · verified · 1 Parent(s): ee415d9

Update README.md (#2)

- Update README.md (c95c6cc7178956a7f7486e94fbdb9c3a5b1f722d)

Files changed (1)
  1. README.md +31 -20
README.md CHANGED
@@ -7,8 +7,7 @@ pipeline_tag: text2text-generation
 
 # Elastic models
 
-Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator.
-ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:
+Elastic models are models produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA allows you to control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized variants:
 
 * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
 
@@ -25,10 +24,10 @@ __Goals of elastic models:__
 * Provide clear quality and latency benchmarks
 * Provide interface of HF libraries: transformers and diffusers with a single line of code
 * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
+* Provide the best models and service for self-hosting.
 
 > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.
 
-
 ## Inference
 
 To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:
@@ -38,21 +37,28 @@ import torch
 from transformers import AutoTokenizer
 from elastic_models.transformers import AutoModelForCausalLM
 
+# Currently we require your HF token, since we use the
+# original weights for part of the layers, as well as
+# the model configuration.
 model_name = "mistralai/Mistral-7B-Instruct-v0.3"
-token = ''
-
+hf_token = ''
+hf_cache_dir = ''
 device = torch.device("cuda")
-tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
 
+# Create the tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained(
+    model_name, token=hf_token
+)
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
-    token=token,
-    cache_dir="/mnt/rnd/huggingface_cache",
+    token=hf_token,
+    cache_dir=hf_cache_dir,
     torch_dtype=torch.bfloat16,
     attn_implementation="sdpa"
 ).to(device)
 model.generation_config.pad_token_id = tokenizer.eos_token_id
 
+# Inference is as simple as with the transformers library
 prompt = "Describe basics of DNNs quantization."
 inputs = tokenizer(prompt, return_tensors="pt")
 inputs.to(device)
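
Between this hunk and the next, the unchanged generation call is elided by the diff (old lines 59-64). For readers reconstructing the full script, here is a minimal sketch of that step, assuming standard `transformers` generation semantics; the variable name and the `max_new_tokens` value are illustrative assumptions, not taken from the README:

```python
# Illustrative sketch (assumed, not from the README): produce token ids
# from the tokenized prompt; the next hunk decodes them with
# tokenizer.batch_decode. max_new_tokens is an assumed value.
with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=256)
```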
@@ -65,12 +71,13 @@ output = tokenizer.batch_decode(
     skip_special_tokens=True,
     clean_up_tokenization_spaces=False
 )[0]
+
+# Validate answer
 print(f"# Q:\n{prompt}\n")
 print(f"# A:\n{output}\n")
 ```
 
----
-### System requirements
+### Installation
 
 __GPUs__: H100, L40s
 
@@ -78,17 +85,14 @@ __OS__: Linux #TODO
 
 __Python__: 3.10-3.12
 
-
----
-### Installation
+To work with our models, install the following packages:
 
 ```shell
 pip install thestage
 pip install elastic_models
 ```
 
-Then go to app.thestage.ai, login and generate API token from your profile page.
-Set up API token as follows:
+Then go to app.thestage.ai, log in, and generate an API token from your profile page. Set up the API token as follows:
 
 ```shell
 thestage config set --api-token <YOUR_API_TOKEN>
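
With the packages installed and the token configured, a quick import check can confirm the environment resolves before running the full example above. A minimal smoke test, assuming only the `elastic_models.transformers` entry point already used in the README:

```python
# Minimal smoke test: confirms elastic_models is installed and its
# transformers-compatible entry point is importable.
from elastic_models.transformers import AutoModelForCausalLM

print("elastic_models.transformers is importable")
```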
@@ -96,6 +100,7 @@ thestage config set --api-token <YOUR_API_TOKEN>
 ```
 
 Congrats, now you can use accelerated models!
 
+----
 
 ## Benchmarks
 
@@ -113,7 +118,7 @@ For quality evaluation we have used: #TODO link to github
 | Winogrande | 0 | 0 | 0 | 0 | 0 | 0 |
 
-> __MMLU__: Evaluates/shows ...
+> __MMLU__: Evaluates/shows {MMLU}
 
 > __MMLU__: Evaluates/shows ...
 
@@ -121,28 +126,34 @@ For quality evaluation we have used: #TODO link to github
 
 > __PIQA__: Evaluates/shows ...
 
-
 ### Latency benchmarks
 
 We have profiled models in different scenarios:
 
-__100 input/300 output tok/s__:
+<table>
+<tr><th> 100 input/300 output; tok/s </th><th> 1000 input/1000 output; tok/s </th></tr>
+<tr><td>
+
 | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
 |-----------|-----|---|---|----|----------|------------|
 | H100 | 189 | 0 | 0 | 0 | 48 | 0 |
 | L40s | 79 | 0 | 0 | 0 | 42 | 0 |
 
-__1000 input/1000 output tok/s__:
+
+</td><td>
+
 | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
 |-----------|-----|---|---|----|----------|------------|
 | H100 | 189 | 0 | 0 | 0 | 48 | 0 |
 | L40s | 79 | 0 | 0 | 0 | 42 | 0 |
 
+</td></tr> </table>
+
 ## Links
 
 * __Platform__: [app.thestage.ai](app.thestage.ai)
 * __Elastic models Github__: [app.thestage.ai](app.thestage.ai)
 * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
-* __Contact email__: [email protected]
+* __Contact email__: [email protected]
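
For scale, the populated entries in the latency tables above imply that the S model decodes roughly 3.9x faster than the original on H100 (189 vs 48 tok/s) and roughly 1.9x faster on L40s (79 vs 42 tok/s); the zero entries appear to be placeholders pending measurement.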
 