added examples in readme
README.md
CHANGED
@@ -26,10 +26,10 @@ tags:
**DISCLAIMER**
*Tool calling template is a work in progress*

-Mistral Small 3 (2501) sets a new benchmark in the "small" Large Language Models category below 70B, boasting 24B parameters and achieving state-of-the-art capabilities comparable to larger models!
+Mistral Small 3 (2501) sets a new benchmark in the "small" Large Language Models category below 70B, boasting 24B parameters and achieving state-of-the-art capabilities comparable to larger models!
This model is an instruction-fine-tuned version of the base model: [Mistral-Small-24B-Base-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501).

-Mistral Small can be deployed locally and is exceptionally "knowledge-dense", fitting in a single RTX 4090 or a 32GB RAM MacBook once quantized.
+Mistral Small can be deployed locally and is exceptionally "knowledge-dense", fitting in a single RTX 4090 or a 32GB RAM MacBook once quantized.
Perfect for:
- Fast response conversational agents.
- Low latency function calling.
@@ -38,7 +38,7 @@ Perfect for:

For enterprises that need specialized capabilities (increased context, particular modalities, domain specific knowledge, etc.), we will be releasing commercial models beyond what Mistral AI contributes to the community.

-This release demonstrates our commitment to open source, serving as a strong base model.
+This release demonstrates our commitment to open source, serving as a strong base model.

Learn more about Mistral Small in our [blog post](https://mistral.ai/news/mistral-small-3/).

@@ -99,7 +99,7 @@ Model developper: Mistral AI Team
**Note**:

- Performance accuracy on all benchmarks was obtained through the same internal evaluation pipeline - as such, numbers may vary slightly from previously reported performance
-([Qwen2.5-32B-Instruct](https://qwenlm.github.io/blog/qwen2.5/), [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), [Gemma-2-27B-IT](https://huggingface.co/google/gemma-2-27b-it)).
+([Qwen2.5-32B-Instruct](https://qwenlm.github.io/blog/qwen2.5/), [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), [Gemma-2-27B-IT](https://huggingface.co/google/gemma-2-27b-it)).
- Judge-based evals such as Wildbench, Arena hard and MTBench were based on gpt-4o-2024-05-13.

### Basic Instruct Template (V7-Tekken)
@@ -115,7 +115,6 @@ Model developper: Mistral AI Team

The model can be used with the following frameworks:
- [`vllm`](https://github.com/vllm-project/vllm): See [here](#vllm)
-- [`transformers`](https://github.com/huggingface/transformers): See [here](#transformers)

### vLLM

@@ -124,7 +123,7 @@ to implement production-ready inference pipelines.

**Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.

-**Note 2**: Make sure to add a system prompt to the model to best tailor it for your needs. If you want to use the model as a general assistant, we recommend the following
+**Note 2**: Make sure to add a system prompt to the model to best tailor it for your needs. If you want to use the model as a general assistant, we recommend the following
system prompt:

```
@@ -134,6 +133,11 @@ When you're not sure about some information, you say that you don't have the inf
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")"""
```

+**Note 3**: Make sure to add the following sampling parameter at inference time for tool calling to work:
+```json
+"skip_special_tokens": False
+```
+
**_Installation_**

Make sure you install [`vLLM >= 0.6.4`](https://github.com/vllm-project/vllm/releases/tag/v0.6.4):
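Note 3 above concerns vLLM's sampling parameters. A minimal offline sketch of the same setting, assuming the `vllm` Python API (`SamplingParams` and `LLM.chat`) rather than the server route, might look like this; it is illustrative and not part of the commit:

```py
# Illustrative sketch (not from this commit): the skip_special_tokens setting
# from Note 3 expressed through vLLM's offline SamplingParams API.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(
    temperature=0.15,           # Note 1: keep the temperature low
    skip_special_tokens=False,  # Note 3: keep special tokens so tool calls can be parsed
    max_tokens=512,
)

llm = LLM(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    tokenizer_mode="mistral",
    config_format="mistral",
    load_format="mistral",
)

messages = [{"role": "user", "content": "Ping?"}]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```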
@@ -152,7 +156,7 @@ You can also make use of a ready-to-go [docker image](https://github.com/vllm-pr

#### Server

-We recommend that you use Mistral-Small-24B-Instruct-2501 in a server/client setting.
+We recommend that you use Mistral-Small-24B-Instruct-2501 in a server/client setting.

1. Spin up a server:

@@ -160,7 +164,7 @@ We recommand that you use Mistral-Small-24B-Instruct-2501 in a server/client set
vllm serve mistralai/Mistral-Small-24B-Instruct-2501 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
```

-**Note:** Running Mistral-Small-24B-Instruct-2501 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
+**Note:** Running Mistral-Small-24B-Instruct-2501 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.

2. To ping the client you can use a simple Python snippet.

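For context, a minimal sketch of such a ping against the OpenAI-compatible route might look as follows. The port (vLLM's default 8000), the example question, and the exact system prompt wording are assumptions based on the notes above, not the README's verbatim snippet:

```py
# Minimal sketch of pinging the server started above (assumes vLLM's default port 8000).
import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

payload = {
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [
        # Note 2: always pass a system prompt; this is a shortened variant of the recommended one.
        {"role": "system", "content": "You are Mistral Small 3, a helpful assistant. When you're not sure about some information, you say that you don't have the information and don't try to make up anything."},
        {"role": "user", "content": "Give me a one-sentence summary of what you can do."},
    ],
    "temperature": 0.15,           # Note 1
    "skip_special_tokens": False,  # Note 3: extra sampling parameter accepted by vLLM's server
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```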
@@ -209,7 +213,9 @@ print(response.json()["choices"][0]["message"]["content"])

Mistral-Small-24B-Instruct-2501 is excellent at function / tool calling tasks via vLLM. *E.g.:*

-
+
+
+#### Prompt template
Jinja is a powerful and flexible template engine for Python. It allows developers to create dynamic content by separating the structure of a document from its varying parts. Jinja templates are widely used in web development, configuration management, and data processing tasks.

Key features of Jinja templates include:
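As a quick illustration of the paragraph above (generic Jinja, not the model's actual chat template; the variables are made up):

```py
# Generic Jinja illustration: the template fixes the structure, the data varies.
from jinja2 import Template

template = Template("{% for m in messages %}[{{ m.role }}] {{ m.content }}\n{% endfor %}")
print(template.render(messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))
# [system] You are a helpful assistant.
# [user] Hello!
```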
@@ -266,117 +272,69 @@ jq --rawfile template chat_template_with_tools.jinja '.chat_template = $template
<summary>Tool Calling Example</summary>

```py
-import
+from openai import OpenAI
import json
-from huggingface_hub import hf_hub_download
-from datetime import datetime, timedelta
-
-url = "http://<your-url>:8000/v1/chat/completions"
-headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
-
-model = "mistralai/Mistral-Small-24B-Instruct-2501"

-        "description": "Get the current weather in a given location",
-        "parameters": {
-            "type": "object",
-            "properties": {
-                "city": {
-                    "type": "string",
-                    "description": "The city to find the weather for, e.g. 'San Francisco'",
-                },
-                "state": {
-                    "type": "string",
-                    "description": "The state abbreviation, e.g. 'CA' for California",
-                },
-                "unit": {
-                    "type": "string",
-                    "description": "The unit for temperature",
-                    "enum": ["celsius", "fahrenheit"],
-                },
-            },
-            "required": ["city", "state", "unit"],
-        },
-    },
-},
-{
-    "type": "function",
-    "function": {
-        "name": "rewrite",
-        "description": "Rewrite a given text for improved clarity",
-        "parameters": {
-            "type": "object",
-            "properties": {
-                "text": {
-                    "type": "string",
-                    "description": "The input text to rewrite",
-                }
-            },
+client = OpenAI(base_url="http://localhost:1337/v1")
+
+
+def get_weather(location: str, unit: str):
+    return f"Weather {location} in {unit} is bad!"
+def get_gold_price(currency: str = "USD"):
+    return f"Getting the gold price in {currency} is enormous!"
+tool_functions = {"get_weather": get_weather, "get_gold_price": get_gold_price}
+
+tools = [{
+    "type": "function",
+    "function": {
+        "name": "get_weather",
+        "description": "Get the current weather in a given location",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
+                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
-{
-    "tool_calls": [
-        {
-            "id": "bbc5b7ede",
-            "type": "function",
-            "function": {
-                "name": "rewrite",
-                "arguments": '{"text": "OpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership."}',
-            },
+            "required": ["location", "unit"]
+        }
+    }
+},
+{
+    "type": "function",
+    "function": {
+        "name": "get_gold_price",
+        "description": "Get the current gold price in the desired currency (defaults to USD).",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "currency": {"type": "string", "description": "Currency code, e.g. USD or EUR."}
            }
-}
+        }
+    }
+}]
+
+response = client.chat.completions.create(
+    model="uai/lm-base",
+    messages=[{"role": "user", "content": "What's the weather like in San Francisco? And what's the current gold price?"}],
+    temperature=0,
+    extra_body={
+        "skip_special_tokens": False
    },
-print(
-# [{'id': '8PdihwL6d', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{"city": "Dallas", "state": "TX", "unit": "fahrenheit"}'}}]
+    tools=tools,
+    tool_choice="auto"
+)
+
+print(f"Function called: {response.choices[0]}")
+tool_calls = response.choices[0].message.tool_calls
+
+for index, tool_call in enumerate(tool_calls):
+    call_response = tool_call.function
+    print(f"{index}. Function called: {call_response.name}")
+    print(f"Arguments: {call_response.arguments}")
+    if index == 0:
+        print(f"Result: {get_weather(**json.loads(call_response.arguments))}")
+    elif index == 1:
+        print(f"Result: {get_gold_price(**json.loads(call_response.arguments))}")
```

</details>
@@ -443,13 +401,13 @@ chatbot(messages)

### Ollama

-[Ollama](https://github.com/ollama/ollama) can run this model locally on MacOS, Windows and Linux.
+[Ollama](https://github.com/ollama/ollama) can run this model locally on MacOS, Windows and Linux.

```
ollama run mistral-small
```

-4-bit quantization (aliased to default):
+4-bit quantization (aliased to default):
```
ollama run mistral-small:24b-instruct-2501-q4_K_M
```
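Beyond the `ollama run` commands above, Ollama also serves a local HTTP API; as a small illustrative sketch (assuming Ollama's default port 11434 and that the `mistral-small` tag has already been pulled), the same model can be queried from Python:

```py
# Illustrative sketch: query the local Ollama server (default port 11434)
# after `ollama run mistral-small` has pulled the model.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral-small",
        "messages": [{"role": "user", "content": "Summarize Mistral Small 3 in one sentence."}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```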