Make GGUFs, please.

#1
by drmcbride - opened

We need GGUFs!

Jinx org

You can try building a GGUF yourself with this tool. It runs online and doesn't need local resources. Have fun!
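
If you prefer to do it locally instead, a minimal sketch with llama.cpp looks like this (the model path, output names, and chosen quant type are placeholders, not a tested recipe):

python convert_hf_to_gguf.py /path/to/Jinx-Qwen3-30B-A3B-Thinking-2507 --outfile jinx-qwen3-30b-a3b-bf16.gguf --outtype bf16
./llama-quantize jinx-qwen3-30b-a3b-bf16.gguf jinx-qwen3-30b-a3b-Q5_K_M.gguf Q5_K_M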

Best,
Jinx Team

I tried running Jinx-org's Qwen3-235B-A22B-Thinking-2507 and the 32B model; both respond without an opening <think> tag but do emit </think>. I tried quantized versions and the f16 32B version, and they all have this problem.

Could you please describe your process step by step? For example, your environment setup, IDE version, the commands you're running, and the output you're seeing.

Let me try:

  1. on Linux with ik_llama.cpp
  2. python convert_hf_to_gguf.py ~/.cache/huggingface/hub/models--Jinx-org--Jinx-Qwen3-235B-A22B-Thinking-2507/snapshots/fe1b7faefb33dd8d321eac938ed1db862e29035b --outfile Jinx-Qwen3-235B-A22B-Thinking-2507.gguf --outtype bf16
  3. llama-server --jinja --threads 16 --threads-batch 32 --no-mmap -m Jinx-Qwen3-30B-A3B-Thinking-2507.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.01 -c 32768 -np 1 -fmoe -ub 4096 -b 4096
  4. ask any question in the browser, get a response without <think> (see the template check below)
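
One thing worth checking at this point is whether the converted GGUF's embedded chat template already ends the generation prompt with <think>. Here is a minimal sketch with the gguf Python package (gguf-py), assuming the bf16 file from step 2; the string-decoding detail may differ across gguf-py versions:

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Jinx-Qwen3-235B-A22B-Thinking-2507.gguf")
field = reader.get_field("tokenizer.chat_template")
if field is None:
    print("no chat template embedded in this GGUF")
else:
    # string metadata is stored as raw UTF-8 bytes in the field's last part
    template = bytes(field.parts[-1]).decode("utf-8")
    # if the template ends the generation prompt with "<think>", the opening
    # tag is injected by the prompt rather than generated by the model
    print(template[-200:])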

Could you please check whether your downloaded transformers weights work correctly with this script?

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jinx-org/Jinx-Qwen3-235B-A22B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content) # no opening <think> tag
print("content:", content)
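
For what it's worth, if these models reuse the upstream Qwen3-*-Thinking-2507 chat template (an assumption worth verifying), the missing opening tag is expected behavior: the template itself appends <think> to the generation prompt, so the model only emits the closing </think>. A quick self-contained check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Jinx-org/Jinx-Qwen3-235B-A22B-Thinking-2507")
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
# If the rendered prompt ends with "<think>\n", the opening tag comes from the
# chat template, so generated text will only ever contain the closing "</think>".
print(repr(rendered[-40:]))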

Respectfully, if people are having issues, why not just publish the GGUFs yourselves? You'll get much more attention from local users if you do.

hi @Jeol

I'd like to, but my device can't run the safetensors versions of Jinx-Qwen3-30B-A3B-Thinking-2507 or Jinx-org/Jinx-Qwen3-235B-A22B-Thinking-2507.

Sorry...

Hi @drmcbride , you are right. I should do this. I will add GGUFs for each model before the end of this week.

I promise people are going to use them, and then people will talk about your Jinx org.

To minimize the workload, I'd rather not run quantization for every possible setup. What's your preferred quantization approach, or do you have suggestions for the most effective configurations to prioritize?

I use ik_llama.cpp to get the best speed; you can try the "Secret Recipe" for ik_llama.cpp from https://huggingface.co/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF

Most people use plain llama.cpp, LM Studio, or Ollama; for them, the Unsloth-style quants are the best option.

I guess Q5_K_M is a good starting point; for a bigger model like Jinx-org/Jinx-Qwen3-235B-A22B-Thinking-2507, Q3_K/Q2_K quants could be useful for low-memory setups.
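
For example, that prioritized set could be produced with llama.cpp's llama-quantize along these lines (file names are placeholders; the quant types follow the suggestion above):

./llama-quantize jinx-qwen3-235b-a22b-bf16.gguf jinx-qwen3-235b-a22b-Q5_K_M.gguf Q5_K_M
./llama-quantize jinx-qwen3-235b-a22b-bf16.gguf jinx-qwen3-235b-a22b-Q3_K_M.gguf Q3_K_M
./llama-quantize jinx-qwen3-235b-a22b-bf16.gguf jinx-qwen3-235b-a22b-Q2_K.gguf Q2_K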
