|
--- |
|
license: other |
|
license_name: apache-2.0-or-mnpl-0.1 |
|
license_link: https://mistral.ai/licences/MNPL-0.1.md |
|
tags: |
|
- code |
|
- generation |
|
- debugging |
|
- editing |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# Code Logic Debugger v0.1 |
|
|
|
Hardware requirement for GPT-4o-level inference speed with the models in this repo: >= 24 GB of VRAM.
|
|
|
Note: The following results are based only on my day-to-day workflows, run on a single RTX 3090. My goal was to run private models that could beat GPT-4o and Claude-3.5 at code debugging and generation, so I could 'load balance' between OpenAI/Anthropic's free plans and local models, avoid hitting rate limits, and upload as few lines of my code and ideas to their servers as possible.
|
|
|
An example of a complex debugging scenario is one where you build library A on top of library B, which requires library C as a dependency, but the root cause is a variable in library C. In that case, the workflow below guided me to correctly identify the problem.
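
To make the scenario concrete, here is a toy, single-file reconstruction; every class and variable name is invented for illustration and does not refer to real libraries:

```python
class LibC:
    # "library C": a module-level default that other layers silently rely on
    DEFAULT_RETRIES = 0  # root cause: 0 means "never retry", not "unlimited"

class LibB:
    # "library B": built on C, forwards the value without validating it
    @staticmethod
    def fetch(url: str) -> str:
        if LibC.DEFAULT_RETRIES == 0:
            raise RuntimeError(f"gave up immediately while fetching {url}")
        return f"fetched {url}"

# "library A": your code, built on B; the symptom surfaces here,
# two layers above the actual cause in C
try:
    LibB.fetch("https://example.com")
except RuntimeError as err:
    print("symptom observed in A:", err)
```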
|
|
|
<br> |
|
|
|
## Throughput |
|
|
|
 |
|
|
|
IQ here refers to Importance Matrix Quantization. For a performance comparison against regular GGUF quants, please read [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/). For more info on the technique, please see [this GitHub discussion](https://github.com/ggerganov/llama.cpp/discussions/5006/).
|
|
|
<br> |
|
|
|
## Personal Preference Ranking |
|
|
|
Evaluated on two programming tasks: debugging and generation. The ranking is somewhat subjective. `DeepSeekV2 Coder Instruct` is ranked lower because DeepSeek's Privacy Policy states that they may collect "text input, prompt", and there is no way to opt out.
|
|
|
|
|
Code debugging/editing prompt template used: |
|
``` |
|
<code>
<current output>
<the problem description of the current output>
<expected output (in English is fine)>
<any hints>
Think step by step. Solve this problem without removing any existing functionalities, logic, or checks, except any incorrect code that interferes with your edits.
|
``` |
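
A small helper of my own (purely illustrative, not part of any library) that glues those sections into one prompt string:

```python
def build_debug_prompt(code: str, current_output: str, problem: str,
                       expected_output: str, hints: str = "") -> str:
    """Assemble the code debugging/editing prompt from the template above."""
    sections = [code, current_output, problem, expected_output]
    if hints:
        sections.append(hints)
    sections.append(
        "Think step by step. Solve this problem without removing any existing "
        "functionalities, logic, or checks, except any incorrect code that "
        "interferes with your edits."
    )
    return "\n\n".join(sections)
```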
|
|
|
| **Rank** | **Model Name** | **Token Speed (tokens/s)** | **Debugging Performance** | **Code Generation Performance** | **Notes** | |
|
|----------|----------------------------------------------|----------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------------| |
|
| 1* | codestral-22b-v0.1-IQ6_K.gguf (this repo) | 34.21 | Excellent at complex debugging, often surpasses GPT-4o and Claude-3.5 | Good, but may not be on par with GPT-4o | One of the best overall for debugging in my workflow; use Balanced Mode. |
|
| 1* | Claude-3.5-Sonnet | N/A | Poor in complex debugging compared to Codestral | Excellent, better in design and more creative than GPT-4o in code generation | Great for code generation, but weaker in debugging. | |
|
| 1* | GPT-4o | N/A | Good at complex debugging, but can be outperformed by Codestral | Excellent, generally reliable for code generation, more knowledgeable | Balanced performance between code debugging and generation. |
|
| 4 | DeepSeekV2 Coder Instruct | N/A | Good, but outputs the same code in complex scenarios | Excellent at general code generation, rivals GPT-4o | Excellent at code generation, but has data privacy concerns as per Privacy Policy. | |
|
| 5* | Qwen2-7b-Instruct bf16 | 78.22 | Average, can think of correct approaches | Sometimes helps generate new ideas | High speed, useful for generating ideas. | |
|
| 5* | AutoCoder.IQ4_K.gguf (this repo) | 26.43 | Excellent at solutions that require one to few lines of edits | Generates useful short code segments | Try Precise Mode or Balanced Mode. | |
|
| 7 | GPT-4o-mini | N/A | Decent, but struggles with complex debugging tasks | Reliable for shorter or simpler code generation tasks | Suitable for less complex coding tasks. | |
|
| 8 | Meta-Llama-3.1-70B-Instruct-IQ2_XS.gguf | 2.55 | Poor, occasionally helps generate ideas | --- | Speed is a significant limitation. | |
|
| 9 | Trinity-2-Codestral-22B-Q6_K_L | N/A | Poor, similar issues to DeepSeekV2 in outputting the same code | --- | Similar problem to DeepSeekV2; not recommended for my complex tasks. |
|
| 10 | DeepSeekV2 Coder Lite Instruct Q_8L | N/A | Poor, repeats code similar to other models in its family | Not as effective in my context | Not recommended overall based on my criteria. | |
|
|
|
|
|
<br> |
|
|
|
## Generation Kwargs |
|
|
|
Balanced Mode: |
|
```python |
|
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.7,
    "stream": True,
    "top_k": 50,
    "top_p": 0.95,
}
|
``` |
|
|
|
Precise Mode: |
|
```python |
|
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.0,
    "stream": True,
    "top_p": 1.0,
}
|
``` |
|
|
|
Qwen2 7B: |
|
```python |
|
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.4,
    "stream": True,
    "top_k": 20,
    "top_p": 0.8,
}
|
``` |
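
For context, a minimal sketch of how one of these dicts can be passed to a local GGUF model. I'm assuming `llama-cpp-python` here (stop list abbreviated), but any llama.cpp frontend that accepts these keys works the same way:

```python
from llama_cpp import Llama

# Load the quantized model; n_gpu_layers=-1 offloads all layers to the GPU
# (the >= 24 GB VRAM figure above assumes full offload).
llm = Llama(
    model_path="./codestral-22b-v0.1-IQ6_K.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
)

generation_kwargs = {  # Balanced Mode from above, stop list abbreviated
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|eot_id|>"],
    "temperature": 0.7,
    "stream": True,
    "top_k": 50,
    "top_p": 0.95,
}

prompt = "Think step by step. Why does this loop never terminate?\nwhile True: pass"
for chunk in llm(prompt, **generation_kwargs):
    # with stream=True, each chunk carries an incremental piece of the reply
    print(chunk["choices"][0]["text"], end="", flush=True)
```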
|
|
|
Other variations of temperature, top_k, and top_p were tested 5-8 times per model as well, but I'm sticking with the three configurations above.
|
|
|
<br> |
|
|
|
## New Discoveries |
|
|
|
The following were tested in my workflow, but may not generalize well to other workflows.
|
|
|
- In general, if there's an error in the code, copy-pasting the last few lines of the stack trace (excluding the frames from inside libraries) into the LLM seems to work; see the sketch after this list.
|
- Adding "Reflect." after a failed attempt at code generation sometimes allows Claude-3.5-Sonnet to generate the correct version. |
|
- If GPT-4o reasons correctly in its first response and the conversation is then continued with GPT-4o-mini, the mini model can maintain a comparable level of reasoning/accuracy to GPT-4o.
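
A rough sketch of the stack-trace trimming mentioned in the first bullet; treating anything under `site-packages` as "library stack trace" is my simplifying assumption:

```python
import traceback

def trace_tail(exc: BaseException, keep: int = 5) -> str:
    """Return the last few stack-trace lines, dropping frames that live in
    installed libraries (anything whose path contains site-packages)."""
    lines = traceback.format_exception(type(exc), exc, exc.__traceback__)
    own = [line for line in lines if "site-packages" not in line]
    return "".join(own[-keep:])

try:
    {}["missing_key"]
except KeyError as exc:
    print(trace_tail(exc))  # paste this, not the full trace, into the prompt
```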
|
|
|
<br> |
|
|
|
## License |
|
|
|
A reminder that `codestral-22b-v0.1-IQ6_K.gguf` may only be used for non-commercial projects, per the MNPL-0.1 license.
|
|
|
Please use `Qwen2-7b-Instruct bf16` and `AutoCoder.IQ4_K.gguf` as alternatives for commercial activities.
|
|
|
<br> |
|
|
|
## Download |
|
|
|
``` |
|
pip install -U "huggingface_hub[cli]" |
|
``` |
|
|
|
Commercial use: |
|
``` |
|
huggingface-cli download FredZhang7/claudegpt-code-logic-debugger-v0.1 --include "AutoCoder.IQ4_K.gguf" --local-dir ./ |
|
``` |
|
|
|
Non-commercial use (e.g., testing, research, personal, or evaluation purposes):
|
``` |
|
huggingface-cli download FredZhang7/claudegpt-code-logic-debugger-v0.1 --include "codestral-22b-v0.1-IQ6_K.gguf" --local-dir ./ |
|
``` |