LHC88 committed on
Commit
2c82be4
·
1 Parent(s): f13dda8

added examples in readme

Browse files
Files changed (1)
  1. README.md +75 -117
README.md CHANGED
@@ -26,10 +26,10 @@ tags:
26
  **DISCLAIMER**
27
  *Tool calling template is a work in progress*
28
 
29
- Mistral Small 3 (2501) sets a new benchmark in the "small" Large Language Models category below 70B, boasting 24B parameters and achieving state-of-the-art capabilities comparable to larger models!
30
  This model is an instruction-fine-tuned version of the base model: [Mistral-Small-24B-Base-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501).
31
 
32
- Mistral Small can be deployed locally and is exceptionally "knowledge-dense", fitting in a single RTX 4090 or a 32GB RAM MacBook once quantized.
33
  Perfect for:
34
  - Fast response conversational agents.
35
  - Low latency function calling.
@@ -38,7 +38,7 @@ Perfect for:
38
 
39
  For enterprises that need specialized capabilities (increased context, particular modalities, domain specific knowledge, etc.), we will be releasing commercial models beyond what Mistral AI contributes to the community.
40
 
41
- This release demonstrates our commitment to open source, serving as a strong base model.
42
 
43
  Learn more about Mistral Small in our [blog post](https://mistral.ai/news/mistral-small-3/).
44
 
@@ -99,7 +99,7 @@ Model developer: Mistral AI Team
99
  **Note**:
100
 
101
  - Performance accuracy on all benchmarks was obtained through the same internal evaluation pipeline; as such, numbers may vary slightly from previously reported performance
102
- ([Qwen2.5-32B-Instruct](https://qwenlm.github.io/blog/qwen2.5/), [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), [Gemma-2-27B-IT](https://huggingface.co/google/gemma-2-27b-it)).
103
  - Judge-based evals such as WildBench, Arena-Hard and MT-Bench were based on gpt-4o-2024-05-13.
104
 
105
  ### Basic Instruct Template (V7-Tekken)
@@ -115,7 +115,6 @@ Model developer: Mistral AI Team
115
 
116
  The model can be used with the following frameworks:
117
  - [`vllm`](https://github.com/vllm-project/vllm): See [here](#vllm)
118
- - [`transformers`](https://github.com/huggingface/transformers): See [here](#transformers)
119
 
120
  ### vLLM
121
 
@@ -124,7 +123,7 @@ to implement production-ready inference pipelines.
124
 
125
  **Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.
126
 
127
- **Note 2**: Make sure to add a system prompt to the model to best tailor it to your needs. If you want to use the model as a general assistant, we recommend the following
128
  system prompt:
129
 
130
  ```
@@ -134,6 +133,11 @@ When you're not sure about some information, you say that you don't have the inf
134
  If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")"""
135
  ```
136
137
  **_Installation_**
138
 
139
  Make sure you install [`vLLM >= 0.6.4`](https://github.com/vllm-project/vllm/releases/tag/v0.6.4):
@@ -152,7 +156,7 @@ You can also make use of a ready-to-go [docker image](https://github.com/vllm-pr
152
 
153
  #### Server
154
 
155
- We recommend that you use Mistral-Small-24B-Instruct-2501 in a server/client setting.
156
 
157
  1. Spin up a server:
158
 
@@ -160,7 +164,7 @@ We recommand that you use Mistral-Small-24B-Instruct-2501 in a server/client set
160
  vllm serve mistralai/Mistral-Small-24B-Instruct-2501 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
161
  ```
162
 
163
- **Note:** Running Mistral-Small-24B-Instruct-2501 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
164
 
165
  2. To ping the server you can use a simple Python snippet.
166
 
@@ -209,7 +213,9 @@ print(response.json()["choices"][0]["message"]["content"])
209
 
210
  Mistral-Small-24B-Instruct-2501 is excellent at function/tool calling tasks via vLLM. *E.g.:*
211
 
212
- #### Prompt template
213
  Jinja is a powerful and flexible template engine for Python. It allows developers to create dynamic content by separating the structure of a document from its varying parts. Jinja templates are widely used in web development, configuration management, and data processing tasks.
214
 
215
  Key features of Jinja templates include:
@@ -266,117 +272,69 @@ jq --rawfile template chat_template_with_tools.jinja '.chat_template = $template
266
  <summary>Tool Calling Example</summary>
267
 
268
  ```py
269
- import requests
270
  import json
271
- from huggingface_hub import hf_hub_download
272
- from datetime import datetime, timedelta
273
-
274
- url = "http://<your-url>:8000/v1/chat/completions"
275
- headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
276
-
277
- model = "mistralai/Mistral-Small-24B-Instruct-2501"
278
 
279
-
280
- def load_system_prompt(repo_id: str, filename: str) -> str:
281
- file_path = hf_hub_download(repo_id=repo_id, filename=filename)
282
- with open(file_path, "r") as file:
283
- system_prompt = file.read()
284
- today = datetime.today().strftime("%Y-%m-%d")
285
- yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
286
- model_name = repo_id.split("/")[-1]
287
- return system_prompt.format(name=model_name, today=today, yesterday=yesterday)
288
-
289
-
290
- SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
291
-
292
-
293
- tools = [
294
- {
295
- "type": "function",
296
- "function": {
297
- "name": "get_current_weather",
298
- "description": "Get the current weather in a given location",
299
- "parameters": {
300
- "type": "object",
301
- "properties": {
302
- "city": {
303
- "type": "string",
304
- "description": "The city to find the weather for, e.g. 'San Francisco'",
305
- },
306
- "state": {
307
- "type": "string",
308
- "description": "The state abbreviation, e.g. 'CA' for California",
309
- },
310
- "unit": {
311
- "type": "string",
312
- "description": "The unit for temperature",
313
- "enum": ["celsius", "fahrenheit"],
314
- },
315
- },
316
- "required": ["city", "state", "unit"],
317
- },
318
- },
319
- },
320
- {
321
- "type": "function",
322
- "function": {
323
- "name": "rewrite",
324
- "description": "Rewrite a given text for improved clarity",
325
- "parameters": {
326
- "type": "object",
327
- "properties": {
328
- "text": {
329
- "type": "string",
330
- "description": "The input text to rewrite",
331
- }
332
- },
333
  },
334
- },
335
- },
336
- ]
337
-
338
- messages = [
339
- {"role": "system", "content": SYSTEM_PROMPT},
340
- {
341
- "role": "user",
342
- "content": "Could you please make the below article more concise?\n\nOpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership.",
343
- },
344
- {
345
- "role": "assistant",
346
- "content": "",
347
- "tool_calls": [
348
- {
349
- "id": "bbc5b7ede",
350
- "type": "function",
351
- "function": {
352
- "name": "rewrite",
353
- "arguments": '{"text": "OpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership."}',
354
- },
355
  }
356
- ],
357
- },
358
- {
359
- "role": "tool",
360
- "content": '{"action":"rewrite","outcome":"OpenAI is a FOR-profit company."}',
361
- "tool_call_id": "bbc5b7ede",
362
- "name": "rewrite",
363
  },
364
- {
365
- "role": "assistant",
366
- "content": "---\n\nOpenAI is a FOR-profit company.",
367
- },
368
- {
369
- "role": "user",
370
- "content": "Can you tell me what the temperature will be in Dallas, in Fahrenheit?",
371
- },
372
- ]
373
-
374
- data = {"model": model, "messages": messages, "tools": tools}
375
-
376
- response = requests.post(url, headers=headers, data=json.dumps(data))
377
- import ipdb; ipdb.set_trace()
378
- print(response.json()["choices"][0]["message"]["tool_calls"])
379
- # [{'id': '8PdihwL6d', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{"city": "Dallas", "state": "TX", "unit": "fahrenheit"}'}}]
380
  ```
381
 
382
  </details>
@@ -443,13 +401,13 @@ chatbot(messages)
443
 
444
  ### Ollama
445
 
446
- [Ollama](https://github.com/ollama/ollama) can run this model locally on macOS, Windows, and Linux.
447
 
448
  ```
449
  ollama run mistral-small
450
  ```
451
 
452
- 4-bit quantization (the default `mistral-small` tag above is an alias for this):
453
  ```
454
  ollama run mistral-small:24b-instruct-2501-q4_K_M
455
  ```
 
26
  **DISCLAIMER**
27
  *Tool calling template is a work in progress*
28
 
29
+ Mistral Small 3 (2501) sets a new benchmark in the "small" Large Language Models category below 70B, boasting 24B parameters and achieving state-of-the-art capabilities comparable to larger models!
30
  This model is an instruction-fine-tuned version of the base model: [Mistral-Small-24B-Base-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501).
31
 
32
+ Mistral Small can be deployed locally and is exceptionally "knowledge-dense", fitting in a single RTX 4090 or a 32GB RAM MacBook once quantized.
33
  Perfect for:
34
  - Fast response conversational agents.
35
  - Low latency function calling.
 
38
 
39
  For enterprises that need specialized capabilities (increased context, particular modalities, domain specific knowledge, etc.), we will be releasing commercial models beyond what Mistral AI contributes to the community.
40
 
41
+ This release demonstrates our commitment to open source, serving as a strong base model.
42
 
43
  Learn more about Mistral Small in our [blog post](https://mistral.ai/news/mistral-small-3/).
44
 
 
99
  **Note**:
100
 
101
  - Performance accuracy on all benchmarks was obtained through the same internal evaluation pipeline; as such, numbers may vary slightly from previously reported performance
102
+ ([Qwen2.5-32B-Instruct](https://qwenlm.github.io/blog/qwen2.5/), [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), [Gemma-2-27B-IT](https://huggingface.co/google/gemma-2-27b-it)).
103
  - Judge-based evals such as WildBench, Arena-Hard and MT-Bench were based on gpt-4o-2024-05-13.
104
 
105
  ### Basic Instruct Template (V7-Tekken)
 
115
 
116
  The model can be used with the following frameworks:
117
  - [`vllm`](https://github.com/vllm-project/vllm): See [here](#vllm)
 
118
 
119
  ### vLLM
120
 
 
123
 
124
  **Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.
125
 
126
+ **Note 2**: Make sure to add a system prompt to the model to best tailor it to your needs. If you want to use the model as a general assistant, we recommend the following
127
  system prompt:
128
 
129
  ```
 
133
  If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")"""
134
  ```
135
 
136
+ **Note 3**: Make sure to pass the following sampling parameter at inference time for tool calling to work:
137
+ ```json
138
+ "skip_special_tokens": False
139
+ ```
140
+
141
  **_Installation_**
142
 
143
  Make sure you install [`vLLM >= 0.6.4`](https://github.com/vllm-project/vllm/releases/tag/v0.6.4):
 
156
 
157
  #### Server
158
 
159
+ We recommend that you use Mistral-Small-24B-Instruct-2501 in a server/client setting.
160
 
161
  1. Spin up a server:
162
 
 
164
  vllm serve mistralai/Mistral-Small-24B-Instruct-2501 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
165
  ```
166
 
167
+ **Note:** Running Mistral-Small-24B-Instruct-2501 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
168
 
169
  2. To ping the server you can use a simple Python snippet.
170
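The snippet itself falls outside this diff's hunks; a minimal sketch of such a request, assuming the server started above is reachable on vLLM's default port 8000, could look like this:

```py
import requests

# Assumes the `vllm serve` command above is running locally on the default port 8000.
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

data = {
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [
        # See Note 2 above for the full recommended system prompt.
        {"role": "system", "content": "You are Mistral Small 3, a helpful assistant."},
        {"role": "user", "content": "Give me one short sentence about the Eiffel Tower."},
    ],
    "temperature": 0.15,  # Note 1: a relatively low temperature is recommended
}

response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])
```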
 
 
213
 
214
  Mistral-Small-24B-Instruct-2501 is excellent at function/tool calling tasks via vLLM. *E.g.:*
215
 
216
+
217
+
218
+ #### Prompt template
219
  Jinja is a powerful and flexible template engine for Python. It allows developers to create dynamic content by separating the structure of a document from its varying parts. Jinja templates are widely used in web development, configuration management, and data processing tasks.
220
 
221
  Key features of Jinja templates include:
 
272
  <summary>Tool Calling Example</summary>
273
 
274
  ```py
from openai import OpenAI
import json

# api_key can be any placeholder string when the local vLLM server is started without --api-key
client = OpenAI(base_url="http://localhost:1337/v1", api_key="EMPTY")


def get_weather(location: str, unit: str):
    return f"Weather {location} in {unit} is bad!"


def get_gold_price(currency: str = "USD"):
    return f"Getting the gold price in {currency} is enormous!"


tool_functions = {"get_weather": get_weather, "get_gold_price": get_gold_price}

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location", "unit"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_gold_price",
            "description": "Get the current gold price in the wanted currency (defaults to USD).",
            "parameters": {
                "type": "object",
                "properties": {
                    "currency": {"type": "string", "description": "Currency code, e.g. USD or EUR."},
                },
            },
        },
    },
]

response = client.chat.completions.create(
    model="uai/lm-base",
    messages=[{"role": "user", "content": "What's the weather like in San Francisco? And what's the current gold price?"}],
    temperature=0,
    extra_body={
        "skip_special_tokens": False  # required for tool calling, see Note 3 above
    },
    tools=tools,
    tool_choice="auto",
)

print(f"Function called: {response.choices[0]}")
tool_calls = response.choices[0].message.tool_calls

for index, tool_call in enumerate(tool_calls):
    call_response = tool_call.function
    print(f"{index}. Function called: {call_response.name}")
    print(f"Arguments: {call_response.arguments}")
    # Dispatch to the matching local function through the tool_functions mapping
    result = tool_functions[call_response.name](**json.loads(call_response.arguments))
    print(f"Result: {result}")
  ```
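The tool results can then be appended to the conversation and sent back to the model for a final natural-language answer. Below is a minimal sketch of that round trip, continuing the snippet above (the exact set of tool-message fields accepted may vary slightly across vLLM versions):

```py
# Continues the example above: execute the requested tools and feed the results back.
followup_messages = [
    {"role": "user", "content": "What's the weather like in San Francisco? And what's the current gold price?"},
    response.choices[0].message,  # the assistant turn that contains the tool calls
]
for tool_call in tool_calls:
    result = tool_functions[tool_call.function.name](**json.loads(tool_call.function.arguments))
    followup_messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )

final = client.chat.completions.create(
    model="uai/lm-base",
    messages=followup_messages,
    temperature=0,
    extra_body={"skip_special_tokens": False},
)
print(final.choices[0].message.content)
```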
339
 
340
  </details>
 
401
 
402
  ### Ollama
403
 
404
+ [Ollama](https://github.com/ollama/ollama) can run this model locally on macOS, Windows, and Linux.
405
 
406
  ```
407
  ollama run mistral-small
408
  ```
409
 
410
+ 4-bit quantization (the default `mistral-small` tag above is an alias for this):
411
  ```
412
  ollama run mistral-small:24b-instruct-2501-q4_K_M
413
  ```
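
Ollama also exposes a local REST API, so the quantized model can be called programmatically; a minimal sketch against the default endpoint (assuming Ollama is running on its standard port 11434) could look like this:

```py
import requests

# Assumes Ollama is running locally on its default port (11434).
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral-small:24b-instruct-2501-q4_K_M",
        "messages": [{"role": "user", "content": "Summarize Mistral Small 3 in one sentence."}],
        "stream": False,  # return a single JSON object instead of a token stream
    },
)
print(response.json()["message"]["content"])
```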