hypothetical committed
Commit ee415d9 · verified · 1 parent: 65336cd

Create README.md (#1)

Files changed (1): README.md (+148, -0)

README.md ADDED

---
license: apache-2.0
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
pipeline_tag: text2text-generation
---

# Elastic models

Elastic models are the models produced by TheStage AI ANNA (Automated Neural Networks Accelerator).
ANNA allows you to control model size, latency, and quality with a simple slider movement. For each base model, ANNA produces a series of optimized versions:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.

* __M__: Faster model, with accuracy degradation under 1.5%.

* __S__: The fastest model, with accuracy degradation under 2%.

__Goals of elastic models:__

* Provide flexibility in cost vs quality selection for inference
* Provide clear quality and latency benchmarks
* Provide the interface of HF libraries (transformers and diffusers) with a single line of code
* Provide models supported on a wide range of hardware, pre-compiled and requiring no JIT

> Note that the actual quality degradation varies from model to model; an S model, for instance, can show as little as 0.5% degradation.

## Inference

To run inference with our models, simply replace the `transformers` import with `elastic_models.transformers`:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
token = ""  # your Hugging Face access token

device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa"
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

prompt = "Describe basics of DNNs quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate, then strip the prompt tokens from the output.
generate_ids = model.generate(**inputs, max_length=500)
input_len = inputs["input_ids"].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
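
The snippet above loads the default elastic model. To pin one of the tiers listed earlier (S, M, L, XL), something like the following may work; note that the `mode` argument here is our illustrative assumption rather than confirmed `elastic_models` API, so check the library's documentation for the actual selector:

```python
import torch
from elastic_models.transformers import AutoModelForCausalLM

# Hypothetical tier selection: `mode` is an assumed keyword for picking
# one of the S/M/L/XL variants and may differ in the real API.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    mode="S",  # hypothetical: one of "S", "M", "L", "XL"
).to("cuda")
```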

---
### System requirements

__GPUs__: H100, L40s

__OS__: Linux #TODO

__Python__: 3.10-3.12

---
### Installation

```shell
pip install thestage
pip install elastic_models
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page.
Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!
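
As a quick sanity check that the installation succeeded, the import used in the inference example should now resolve:

```python
# Smoke test: this is the same import as in the inference example above.
from elastic_models.transformers import AutoModelForCausalLM
print("elastic_models import OK")
```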

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers, using the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!
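
To make the `W8A8, int8` baseline concrete, here is a minimal sketch (not TheStage's implementation) of symmetric per-tensor W8A8 quantization for a single linear layer, where both weights and activations are rounded to int8 and the matmul accumulates in int32:

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map [-max|x|, max|x|] onto [-127, 127].
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    qx, sx = quantize_int8(x)       # A8: quantize activations
    qw, sw = quantize_int8(weight)  # W8: quantize weights
    # Integer matmul with int32 accumulation, then dequantize.
    acc = qx.to(torch.int32) @ qw.t().to(torch.int32)
    return acc.to(torch.float32) * (sx * sw)

x, w = torch.randn(4, 16), torch.randn(8, 16)
err = (w8a8_linear(x, w) - x @ w.t()).abs().max()
print(f"max abs error vs fp32: {err:.4f}")
```

Applied naively to every linear layer, this is exactly what costs quality on sensitive layers; ANNA's job is to decide where such aggressive quantization is safe.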

### Quality benchmarks

For quality evaluation we have used: #TODO link to github

| Metric/Model  | S | M | L | XL | Original | W8A8, int8 |
|---------------|---|---|---|----|----------|------------|
| MMLU          | 0 | 0 | 0 | 0  | 0        | 0          |
| PIQA          | 0 | 0 | 0 | 0  | 0        | 0          |
| Arc Challenge | 0 | 0 | 0 | 0  | 0        | 0          |
| Winogrande    | 0 | 0 | 0 | 0  | 0        | 0          |


> __MMLU__: Evaluates general knowledge and problem solving across 57 subjects.

> __PIQA__: Evaluates physical commonsense reasoning.

> __Arc Challenge__: Evaluates grade-school science reasoning on challenging multiple-choice questions.

> __Winogrande__: Evaluates commonsense reasoning through pronoun resolution.
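
Until the evaluation-code link lands, a plausible way to reproduce these metrics is EleutherAI's lm-evaluation-harness (an assumption on our part; the task names below are that harness's identifiers):

```python
# Assumed setup: pip install lm-eval. This evaluates the original checkpoint;
# swap in an elastic model to compare tiers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.3,dtype=bfloat16",
    tasks=["mmlu", "piqa", "arc_challenge", "winogrande"],
)
print(results["results"])
```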

### Latency benchmarks

We have profiled the models in different scenarios; the tables report throughput in output tokens per second (tok/s).

__100 input / 300 output tokens__:

| GPU/Model | S   | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100      | 189 | 0 | 0 | 0  | 48       | 0          |
| L40s      | 79  | 0 | 0 | 0  | 42       | 0          |

__1000 input / 1000 output tokens__:

| GPU/Model | S   | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100      | 189 | 0 | 0 | 0  | 48       | 0          |
| L40s      | 79  | 0 | 0 | 0  | 42       | 0          |
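
Reading the tables: on H100 the S model generates 189 tok/s versus 48 tok/s for the original, roughly a 3.9x speedup. A minimal sketch for reproducing such numbers (exact figures depend on batch size, drivers, and warm-up) is:

```python
import time
import torch

def output_tokens_per_second(model, tokenizer, prompt: str, new_tokens: int = 300) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=8)  # warm-up run
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(
        **inputs,
        min_new_tokens=new_tokens,
        max_new_tokens=new_tokens,
    )
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    generated = out.shape[1] - inputs["input_ids"].shape[1]
    return generated / elapsed
```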

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Elastic models Github__: [app.thestage.ai](https://app.thestage.ai) #TODO replace with the repository link
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: [email protected]