Yes, two machines: a DGX Spark and a Linux workstation, connected via a dedicated 100Gbps RoCE network.
I've managed to set up a distributed inference infrastructure at home using a DGX Spark (128GB unified LPDDR5x) and a Linux workstation with an RTX 6000 Pro (96GB GDDR7), connected via 100Gbps RoCEv2. The model I used (https://lnkd.in/gx6J7YuB) is about 140GB, so it could not fit on either GPU alone. Full setup and tutorial coming soon on devquasar.com
Screen recording:
https://lnkd.in/gKM9H5GJ
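Until the full tutorial lands, here is a hedged sketch of one way to split a model that fits on neither GPU; the post doesn't name the framework used, so llama.cpp's RPC backend below is just an assumption, and the addresses and paths are placeholders:

import subprocess

SPARK = "192.168.100.2:50052"  # hypothetical address of the DGX Spark on the RoCE network

# Step 1 (on the DGX Spark): expose its GPU to remote llama.cpp instances:
#   rpc-server -H 0.0.0.0 -p 50052

# Step 2 (on the RTX 6000 Pro workstation): serve the model, keeping the
# layers that fit locally and offloading the rest over the 100Gbps link.
subprocess.run([
    "llama-server",
    "-m", "model.gguf",  # placeholder for the ~140GB model
    "--rpc", SPARK,
    "-ngl", "99",        # offload all layers across the local GPU + RPC device
], check=True)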
https://github.com/csabakecskemeti/ministral-3_dequantizer_fp8-bf16
(The instruct model weights are in FP8)
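The core idea of the dequantizer, as a minimal sketch (assumptions: FP8 e4m3 weights with a per-tensor "weight_scale" companion tensor; the real checkpoint may use block-wise scales and different tensor names):

import torch
from safetensors.torch import load_file, save_file

tensors = load_file("model-fp8.safetensors")  # placeholder shard name
out = {}
for name, t in tensors.items():
    if t.dtype == torch.float8_e4m3fn and name.endswith(".weight"):
        w = t.to(torch.float32)                # upcast before scaling
        scale = tensors.get(name + "_scale")   # hypothetical scale naming
        if scale is not None:
            w = w * scale.to(torch.float32)
        out[name] = w.to(torch.bfloat16)       # store as BF16
    elif not name.endswith(".weight_scale"):
        out[name] = t                          # pass through everything else
save_file(out, "model-bf16.safetensors")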
I've used this:
https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8/tree/main/inference
I hoped I could make it work on my CPU... :P
@ubergarm you might have the resources!? 😀
SGLang supports channel-wise INT8 quants on CPUs with AMX instructions (5th-gen Xeon and above, AFAIK):
https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/
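For reference, a CPU launch might look roughly like this; this is only a sketch based on the linked blog, and the exact flags (--device cpu, --quantization w8a8_int8) are assumptions that can vary between SGLang versions:

import subprocess, sys

# Launch SGLang's CPU backend against the channel-wise INT8 checkpoint.
subprocess.run([
    sys.executable, "-m", "sglang.launch_server",
    "--model-path", "meituan/DeepSeek-R1-Channel-INT8",
    "--device", "cpu",              # use the AMX-optimized CPU backend
    "--quantization", "w8a8_int8",  # channel-wise INT8 weights/activations
], check=True)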
Currently uploading an INT8 version of DeepSeek V3.2 Speciale:
DevQuasar/deepseek-ai.DeepSeek-V3.2-Speciale-Channel-INT8
I cannot test this myself since I'm on AMD:
"AssertionError: W8A8Int8LinearMethod on CPU requires that CPU has AMX support"
(I assumed it could fall back to some non-optimized kernel, but apparently not.)
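If you're unsure whether your CPU has AMX, here's a quick Linux-only check; the flag names are what the kernel exposes in /proc/cpuinfo:

def has_amx() -> bool:
    """Return True if /proc/cpuinfo advertises the AMX tile/INT8 features."""
    with open("/proc/cpuinfo") as f:
        flags = f.read()
    return all(flag in flags for flag in ("amx_tile", "amx_int8"))

print("AMX INT8 support:", has_amx())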
If anyone with the required resources (5th/6th-gen Intel Xeon + ~768GB-1TB RAM) can help test this, that would be awesome.
If you have hints on how to make this work on the AMD Threadripper 7000 Pro series, please guide me.
Thanks all!
Deep-TOON
My goal was to handle JSON structures with complex embeddings token-efficiently.
So this is what I built over the weekend. Feel free to try it:
https://pypi.org/project/deep-toon/0.1.0/
(I believe I've overthought it a bit :) )
https://youtu.be/Iqu5s9aFaXA?si=QWZe293iTKf_3ELU
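To illustrate the idea (a hypothetical example, not deep-toon's actual API): a uniform list of JSON objects can collapse into one header row plus data rows, so repeated keys stop costing tokens on every record:

import json

records = [
    {"id": 1, "name": "alpha", "score": 0.91},
    {"id": 2, "name": "beta", "score": 0.87},
]

keys = list(records[0])
header = f"items[{len(records)}]{{{','.join(keys)}}}:"
rows = ["  " + ",".join(str(r[k]) for k in keys) for r in records]

print("\n".join([header] + rows))  # compact TOON-style form
print(json.dumps(records))         # verbose original, for comparison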
DevQuasar/deepseek-ai.DeepSeek-R1-0528-GGUF
https://youtu.be/4F8g_LThli0?si=MGba2SUTHt6xYw3T
Quants uploading now
Big thanks to @ngxson !
Quants DevQuasar/meta-llama.Llama-4-Scout-17B-16E-Instruct-GGUF
The systems vary (different motherboards and CPUs), but that probably has little effect on inference performance.
https://devquasar.com/gpu-gguf-inference-comparison/
The exact models used are listed on the page.
I'd welcome results from other GPUs if you have access to anything else; everything you need is in the post. Hopefully this is useful information for everyone.
| model | size | params | backend | ngl | test | t/s |
| ----- | ---- | ------ | ------- | --- | ---- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | pp512 | 12207.44 ± 481.67 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg128 | 143.18 ± 0.18 |
Comparison with other GPUs:
http://devquasar.com/gpu-gguf-inference-comparison/
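If you'd like to contribute numbers, rows like the ones above come from llama-bench (assumption: default pp512/tg128 tests with full GPU offload; the model path is a placeholder):

import subprocess

subprocess.run([
    "llama-bench",
    "-m", "llama-8b-q8_0.gguf",  # placeholder model path
    "-ngl", "99",                # offload all layers to the GPU
    "-p", "512",                 # prompt-processing test (pp512)
    "-n", "128",                 # token-generation test (tg128)
], check=True)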
Follow-up
With the smaller-context-length dataset, the training succeeded.
nvidia/Llama-3_3-Nemotron-Super-49B-v1
GGUFs:
DevQuasar/nvidia.Llama-3_3-Nemotron-Super-49B-v1-GGUF
Enjoy!
DevQuasar/CohereForAI.c4ai-command-a-03-2025-GGUF
6.7 t/s on a 3-GPU setup (4080 + 2x 3090)
(Q3 and Q4 quants currently uploading)
No success so far; the training data contains some larger contexts, and it fails just before completing the first epoch.
(dataset: DevQuasar/brainstorm-v3.1_vicnua_1k)
Does anyone have further suggestions for the bnb config (with ROCm on an MI100)?
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with nested quantization, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base model to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quant type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute for matmuls
)
Now testing with my other dataset, which is smaller, so the memory requirements seem lower:
DevQuasar/brainstorm_vicuna_1k
It had failed by the morning; I need to find more ways to reduce memory usage.
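A hedged sketch of knobs that usually cut training memory further (assumptions: a Transformers/TRL-style trainer; the values below are illustrative, not from the original run):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # smallest micro-batch
    gradient_accumulation_steps=16,  # keep the effective batch size up
    gradient_checkpointing=True,     # trade compute for activation memory
    bf16=True,                       # matches bnb_4bit_compute_dtype above
    optim="adamw_torch",             # bnb paged optimizers may be shaky on ROCm
)
# Also consider filtering or truncating the longest samples, since the
# failures correlate with the larger-context examples in the dataset.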