license: apache-2.0
language:
  - en
tags:
  - How to use reasoning models.
  - How to use thinking models.
  - How to create reasoning models.
  - deepseek
  - reasoning
  - reason
  - thinking
  - all use cases
  - creative
  - fiction writing
  - plot generation
  - sub-plot generation
  - story generation
  - scene continue
  - storytelling
  - fiction story
  - romance
  - all genres
  - story
  - writing
  - vivid writing
  - fiction
  - roleplaying
  - bfloat16
  - float32
  - float16
  - role play
  - sillytavern
  - backyard
  - lmstudio
  - Text Generation WebUI
  - llama 3
  - mistral
  - llama 3.1
  - qwen 2.5
  - context 128k
  - mergekit
  - merge
pipeline_tag: text-generation

How-To-Use-Reasoning-Thinking-Models-and-Create-Them - DOCUMENT

This document covers suggestions and methods to get the most out of "Reasoning/Thinking" models, including tips/tricks for generation, parameters/samplers, System Prompt/Role settings, as well as links to "Reasoning/Thinking" models and how to create your own (via adapters).

This is a live document and updates will occur often.

This document and the information contained in it can be used for ANY "Reasoning/Thinking" model - at my repo and/or other repos.

LINKS to models and adapters:

#1 All Reasoning/Thinking Models - including MOEs - (collection) (GGUF):

[ https://huggingface.co/collections/DavidAU/d-au-reasoning-deepseek-models-with-thinking-reasoning-67a41ec81d9df996fd1cdd60 ]

#2 All Reasoning/Thinking Models - including MOEs - (collection) (source code to generate GGUF, EXL2, AWQ, GPTQ, HQQ, etc., and for direct usage):

[ https://huggingface.co/collections/DavidAU/d-au-reasoning-source-files-for-gguf-exl2-awq-gptq-67b296c5f09f3b49a6aa2704 ]

#3 All Adapters (collection) - Turn a "regular" model into a "thinking/reasoning" model:

[ https://huggingface.co/collections/DavidAU/d-au-reasoning-adapters-loras-any-model-to-reasoning-67bdb1a7156a97f6ec42ce36 ]

These collections will update over time. Newest items are usually at the bottom of each collection.


Support: Document about Parameters, Samplers and How to Set These:


For additional generation support, general questions, detailed parameter info and a lot more, see also:

https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters


Support: AI Auto-Correct Engine (software patch for SillyTavern Front End)


The AI Auto-Correct Engine (built and programmed by DavidAU) auto-corrects AI generation in real time, including modification of the live generation stream to and from the AI, creating a two-way street of information that operates, changes, and edits automatically. This system works with all GGUF, EXL2, HQQ, and other quants/compressions, as well as full source models.

Below is an example generation using a standard GGUF (and standard AI app), but auto-corrected via this engine. The engine is an API level system.

Software Link:

https://huggingface.co/DavidAU/AI_Autocorrect__Auto-Creative-Enhancement__Auto-Low-Quant-Optimization__gguf-exl2-hqq-SOFTWARE


MAIN: How To Use Reasoning / Thinking Models 101

Special Operation Instructions:


Template Considerations:

For most reasoning/thinking models your template CHOICE is critical, as well as your System Prompt/Role setting(s) - below.

For most models you will need: Llama 3 Instruct or Chat, ChatML and/or Command-R, OR the standard "Jinja Autoloaded Template" (this is contained in the quant and will autoload in SOME AI apps).

The last one is usually the BEST CHOICE for a reasoning/thinking model (and in many cases for other models too).

In LMStudio, this option appears in the lower left: "template to use" -> "Manual" or "Jinja Template".

This option/setting will vary from AI/LLM app to app.

A "Jinja" template usually lives in the model's "source code" / "full precision" version, in the "tokenizer_config.json" file (usually at the very BOTTOM/END of the file); it is then "copied" into the GGUF quants and made available to AI/LLM apps.
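As a sketch of where that template lives, it can be pulled out of "tokenizer_config.json" with a few lines of Python (the file used here is a tiny stand-in written on the fly; point config_path at a real model's file instead):

```python
import json
import tempfile

def get_chat_template(config_path):
    """Return the embedded Jinja chat template, or None if the model has none."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("chat_template")

# Demo with a tiny stand-in config; a real tokenizer_config.json has many more
# keys, with "chat_template" usually at the very end of the file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"chat_template": "{{ bos_token }}{{ messages[0]['content'] }}"}, f)
    sample_path = f.name

template = get_chat_template(sample_path)
print(template)
```

If this returns None, the model ships no embedded template and you must set one manually in your app.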

Here is a Qwen 2.5 version example (DO NOT USE: I have added spacing/breaks for readability):


"chat_template": "{% if not add_generation_prompt is defined %}
  {% set add_generation_prompt = false %}
  {% endif %}
  {% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}
  {%- for message in messages %}
  {%- if message['role'] == 'system' %}
  {% set ns.system_prompt = message['content'] %}
  {%- endif %}
  {%- endfor %}
  {{bos_token}}
  {{ns.system_prompt}}
  {%- for message in messages %}
  {%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}
  {{'<|User|>' + message['content']}}
    {%- endif %}
    {%- if message['role'] == 'assistant' and message['content'] is none %}
    {%- set ns.is_tool = false -%}
    {%- for tool in message['tool_calls']%}
    {%- if not ns.is_first %}
    {{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\\n'
      + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<|tool▁call▁end|>'}}
        {%- set ns.is_first = true -%}
        {%- else %}
        {{'\\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' 
          + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' 
          + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}
            {%- endif %}
            {%- endfor %}
            {%- endif %}
            {%- if message['role'] == 'assistant' and message['content'] is not none %}
            {%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}
              {%- set ns.is_tool = false -%}
              {%- else %}
              {% set content = message['content'] %}
              {% if '</think>' in content %}
              {% set content = content.split('</think>')[-1] %}
              {% endif %}
              {{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}
                {%- endif %}{%- endif %}
                {%- if message['role'] == 'tool' %}
                {%- set ns.is_tool = true -%}
                {%- if ns.is_output_first %}
                {{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}
                  {%- set ns.is_output_first = false %}
                  {%- else %}
                  {{'\\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}
                    {%- endif %}
                    {%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}
                      {% endif %}
                      {% if add_generation_prompt and not ns.is_tool %}
                      {{'<|Assistant|>'}}
                        {% endif %}"

In some cases you may need to set a "tokenizer" too - depending on the LLM/AI app - to work with specific reasoning/thinking models. Usually this is NOT an issue as this is auto-detected/set, but if you are getting strange results then this might be the cause.
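To make concrete what such a template actually does, here is a plain-Python sketch of the rendering step (this is NOT the real Jinja engine; the tag names follow the DeepSeek-style template above, and real apps render the Jinja template directly):

```python
def render_prompt(messages, bos_token="<|begin▁of▁sentence|>"):
    """Crude stand-in for the Jinja template above: system text first,
    then alternating user/assistant turns with DeepSeek-style tags."""
    system = "".join(m["content"] for m in messages if m["role"] == "system")
    out = bos_token + system
    for m in messages:
        if m["role"] == "user":
            out += "<|User|>" + m["content"]
        elif m["role"] == "assistant":
            out += "<|Assistant|>" + m["content"] + "<|end▁of▁sentence|>"
    # Trailing assistant tag plays the role of "add_generation_prompt":
    # it cues the model that it is its turn to write.
    return out + "<|Assistant|>"

prompt = render_prompt([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Solve this riddle step by step."},
])
print(prompt)
```

If the tags your app produces do not match what the model was trained on, you get exactly the "strange results" described above.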

Additional Section "General Notes" is at the end of this document.

TEMP/SETTINGS:

  1. Set temp between 0 and .8; above this, the "think" functions will activate differently. The most "stable" temp seems to be .6, with a variance of +/- 0.05. Lower it for more "logical" reasoning, raise it for more "creative" reasoning (max .8 or so). Also set context to at least 4096 to account for "thoughts" generation.
  2. At temps of 1+, 2+ etc., thought(s) will expand and become deeper and richer.
  3. Set "repeat penalty" between 1.02 and 1.07 (recommended).

PROMPTS:

  1. If you enter a prompt without implied "step by step" requirements (ie: Generate a scene, write a story, give me 6 plots for xyz), "thinking" (one or more) MAY activate AFTER first generation. (IE: Generate a scene -> scene will generate, followed by suggestions for improvement in "thoughts")
  2. If you enter a prompt where "thinking" is stated or implied (ie: a puzzle, riddle, "solve this", "brainstorm this idea", etc.), the "thoughts" process(es) in Deepseek will activate almost immediately. Sometimes you will need to regen the prompt for it to activate.
  3. You will also get a lot of variations - some will continue the generation, others will talk about how to improve it, and some (ie generation of a scene) will cause the characters to "reason" about this situation. In some cases, the model will ask you to continue generation / thoughts too.
  4. In some cases the model's "thoughts" may appear in the generation itself.
  5. State the word size length max IN THE PROMPT for best results, especially for activation of "thinking." (see examples below)
  6. Sometimes the "censorship" (from Deepseek) will activate, regen the prompt to clear it.
  7. You may want to try your prompt once at "default" or "safe" temp settings, another at temp 1.2, and a third at 2.5 as an example. This will give you a broad range of "reasoning/thoughts/problem" solving.
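Point 7 can be scripted as a simple temperature sweep; this sketch only builds OpenAI-style request payloads, one per temperature (the model name is a placeholder and nothing is actually sent):

```python
def build_requests(prompt, temps=(0.6, 1.2, 2.5), max_tokens=2048):
    """One chat-completion payload per temperature, covering a broad range
    of reasoning behaviour: "safe", creative, and far out there."""
    return [
        {
            "model": "local-reasoning-model",  # placeholder name
            "messages": [{"role": "user", "content": prompt}],
            "temperature": t,
            "max_tokens": max_tokens,
        }
        for t in temps
    ]

requests = build_requests(
    "Brainstorm 6 uncommon plots for a sci-fi story. 800 words max."
)
for r in requests:
    print(r["temperature"])
```

Send each payload in a fresh chat/session, per the "new chat per prompt" suggestion below.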

GENERATION - THOUGHTS/REASONING:

  1. It may take one or more regens for "thinking" to "activate." (depending on the prompt)
  2. Model can generate a LOT of "thoughts". Sometimes the most interesting ones are 3,4,5 or more levels deep.
  3. Many times the "thoughts" are unique and very different from one another.
  4. Temp/rep pen settings can affect reasoning/thoughts too.
  5. Change up or add directives/instructions or increase the detail level(s) in your prompt to improve reasoning/thinking.
  6. Adding to your prompt: "think outside the box", "brainstorm X number of ideas", "focus on the most uncommon approaches" can drastically improve your results.

GENERAL SUGGESTIONS:

  1. I have found opening a "new chat" per prompt works best for "thinking/reasoning activation", with temp .6 and rep pen 1.05 ... THEN "regen" as required.
  2. Sometimes the model will get completely unhinged and you will need to stop generation manually.
  3. Depending on your AI app, "thoughts" may appear inside "<think>" and "</think>" tags AND/OR the AI will generate "thoughts" directly in the main output or later output(s).
  4. Although quant Q4_K_M was used for testing/examples, higher quants will provide better generation and more sound "reasoning/thinking".
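If your app does not strip the thought block for you, separating "thoughts" from the final answer is a small string operation; a minimal sketch (it assumes the model closes every <think> tag it opens):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL | re.IGNORECASE)

def split_thoughts(text):
    """Return (list_of_thought_blocks, answer_with_thoughts_removed)."""
    thoughts = THINK_RE.findall(text)
    answer = THINK_RE.sub("", text).strip()
    return thoughts, answer

raw = ("<think>Step 1: consider the genre.\nStep 2: pick a twist.</think>"
       "The story opens at dusk...")
thoughts, answer = split_thoughts(raw)
print(len(thoughts), "->", answer)
```

When the model "forgets" the closing tag (point 3 above), the regex will find nothing and the raw text passes through untouched, so check for a dangling open tag if that matters to you.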

ADDITIONAL SUPPORT:

For additional generation support, general questions, detailed parameter info and a lot more, see also:

https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters


Recommended Settings (all) - For usage with "Think" / "Reasoning":

temp: .6 , rep pen: 1.07 (range : 1.02 to 1.12), rep pen range: 64, top_k: 40, top_p: .95, min_p: .05

Temp of 1+, 2+, 3+ will result in much deeper, richer and "more interesting" thoughts and reasoning.

Model behaviour may change with other parameter(s) and/or sampler(s) activated - especially the "thinking/reasoning" process.
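The recommended settings above, expressed as a single dictionary (key names roughly follow llama.cpp conventions; check your app's docs, as names differ between front ends):

```python
# Recommended "Think"/"Reasoning" settings from this document.
# Key names are illustrative (roughly llama.cpp-style), not universal.
REASONING_SETTINGS = {
    "temperature": 0.6,      # raise toward 1-3 for deeper, richer thoughts
    "repeat_penalty": 1.07,  # recommended range: 1.02 to 1.12
    "repeat_last_n": 64,     # "rep pen range"
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
}

print(REASONING_SETTINGS)
```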


System Role / System Prompt - Augment The Model's Power:


If you set / have a system prompt this will affect both "generation" and "thinking/reasoning".

SIMPLE:

This is the generic system prompt used for generation and testing:

You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.

This System Role/Prompt will give you "basic thinking/reasoning":

You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.

ADVANCED:

Logical and Creative - these will SIGNIFICANTLY alter the output, and many times improve it too.

This will also cause more thoughts, deeper thoughts, and in many cases more detailed/stronger thoughts too.

Keep in mind you may also want to test the model with NO system prompt at all - including the default one.

Special credit to: Eric Hartford, Cognitivecomputations; these are based on his work.

CRITICAL:

Copy and paste exactly as shown, preserve formatting and line breaks.

SIDE NOTE:

These can be used in ANY Deepseek / Thinking model, including models not at this repo.

These, if used in a "non thinking" model, will also alter model performance too.

You are an AI assistant developed by the world wide community of ai experts.

Your primary directive is to provide well-reasoned, structured, and extensively detailed responses.

Formatting Requirements:

1. Always structure your replies using: <think>{reasoning}</think>{answer}
2. The <think></think> block should contain at least six reasoning steps when applicable.
3. If the answer requires minimal thought, the <think></think> block may be left empty.
4. The user does not see the <think></think> section. Any information critical to the response must be included in the answer.
5. If you notice that you have engaged in circular reasoning or repetition, immediately terminate {reasoning} with a </think> and proceed to the {answer}

Response Guidelines:

1. Detailed and Structured: Use rich Markdown formatting for clarity and readability.
2. Scientific and Logical Approach: Your explanations should reflect the depth and precision of the greatest scientific minds.
3. Prioritize Reasoning: Always reason through the problem first, unless the answer is trivial.
4. Concise yet Complete: Ensure responses are informative, yet to the point without unnecessary elaboration.
5. Maintain a professional, intelligent, and analytical tone in all interactions.

CREATIVE:

You are an AI assistant developed by a world wide community of ai experts.

Your primary directive is to provide highly creative, well-reasoned, structured, and extensively detailed responses.

Formatting Requirements:

1. Always structure your replies using: <think>{reasoning}</think>{answer}
2. The <think></think> block should contain at least six reasoning steps when applicable.
3. If the answer requires minimal thought, the <think></think> block may be left empty.
4. The user does not see the <think></think> section. Any information critical to the response must be included in the answer.
5. If you notice that you have engaged in circular reasoning or repetition, immediately terminate {reasoning} with a </think> and proceed to the {answer}

Response Guidelines:

1. Detailed and Structured: Use rich Markdown formatting for clarity and readability.
2. Creative and Logical Approach: Your explanations should reflect the depth and precision of the greatest creative minds first.
3. Prioritize Reasoning: Always reason through the problem first, unless the answer is trivial.
4. Concise yet Complete: Ensure responses are informative, yet to the point without unnecessary elaboration.
5. Maintain a professional, intelligent, and analytical tone in all interactions.

General Notes:

These are general notes that have been collected from my various repos and/or from various experiences with both specific models and all models.

These notes may assist you with other model(s) operation(s).


From :

https://huggingface.co/DavidAU/L3.1-MOE-2X8B-Deepseek-DeepHermes-e32-uncensored-abliterated-13.7B-gguf

Due to how this model is configured, I suggest 2-4 generations depending on your use case(s) as each will vary widely in terms of context, thinking/reasoning and response.

Likewise, again depending on how your prompt is worded, it may take 1-4 regens for "thinking" to engage, however sometimes the model will generate a response, then think/reason and improve on this response and continue again. This is in part from "Deepseek" parts in the model.

If you raise temp over .9, you may want to consider 4+ generations.

Note on "reasoning/thinking": this will activate depending on the wording of your prompt(s) and also the temp selected.

There can also be variations because of how the models interact per generation.

Also, as general note:

If you are getting "long winded" generation/thinking/reasoning, you may want to break down the "problem(s)" to solve into one or more prompts. This will allow the model to focus more strongly, and in some cases give far better answers.

IE:

If you ask it to generate 6 general plots for a story VS one plot with specific requirements, you may get better results with the latter.


From :

https://huggingface.co/DavidAU/Qwen2.5-MOE-6x1.5B-DeepSeek-Reasoning-e32-gguf

Temp of .4 to .8 is suggested, however it will still operate at much higher temps like 1.8, 2.6 etc.

Depending on your prompt, change temp SLOWLY: IE: .41, .42, .43 ... etc.

Likewise, because these are small models, they may do a tonne of "thinking"/"reasoning" and then "forget" to finish the task(s). In this case, prompt the model to "Complete the task XYZ with the 'reasoning plan' above".

Likewise, it may function better if you break down the reasoning/thinking task(s) into smaller pieces:

IE: Instead of asking for 6 plots for theme XYZ, ask it for ONE plot for theme XYZ at a time.

Also set the context limit at 4k minimum; 8k+ suggested.
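That "one plot at a time" suggestion can be scripted as a loop of small, focused prompts instead of one big prompt (the wording here is only an example):

```python
def plot_prompts(theme, count=6, max_words=200):
    """One small, focused prompt per plot, instead of one big 6-plot prompt."""
    return [
        f"Generate ONE plot (plot {i} of {count}) for the theme: {theme}. "
        f"{max_words} words max. Focus on an uncommon approach."
        for i in range(1, count + 1)
    ]

for p in plot_prompts("first contact in a small fishing town"):
    print(p)
```

Run each prompt in a fresh chat so the model's "reasoning" budget goes to one plot at a time.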