Quantum Entanglement and the Sentient Toaster: Revolutionizing LLM Training

#3
by mradermacher - opened

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I assume that is in GB and not GiB, in which case 474 GiB might fit, as we have 503 GiB of RAM (after subtracting the RAM reserved for hardware), but it would be extremely tight given the RAM required for context.
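As a quick sanity check of that conversion (assuming the 509G really is decimal GB):

# 509 decimal GB expressed in GiB (2^30 bytes)
echo $(( 509 * 1000**3 / 1024**3 ))   # prints 474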

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

Q6_K is fine for me. Q8_0 might not fit without offloading and it is unclear if offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus, there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have the Q8_0 locally, you should give it a try and see if it fits, but if not, Q6_K is fine for me.

I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf so please give it a try to see if it fits. I believe it should fit if nothing else is running as the model has such a small number of layers. If it doesn't fit use Q6_K instead.

474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I'll try an offload of 1 and 0, then Q6. Hopefully it does not crash.

I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).

So, despite it listing both GPUs, it only allocated something on GPU0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

457.4g after warming up.

So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)

llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1 and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?

[screenshot: grafik.png]

I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.

[screenshot: grafik.png]

dstat shows a relatively high disk read rate so imatrix might now be streaming from SSD:

[screenshot: grafik.png]

Yes it is clearly streaming from SSD now:

[screenshot: grafik.png]

Once the quantisation tasks are interrupted it should work without SSD streaming again.

Thanks! Takes longer than I thought to get this merged. But I guess I'll hold back then and keep the downloaded model on /gpool for the time being.

Although the error (can not map tensor) is not normally indicative of a format problem - the converter simply doesn't recognize that tensor. Anyway, I think I will crap out and kindly ask you to convert it for me when the patch is merged :)

@mradermacher I created a new node named nico2. You can access it over SSH using [email protected]:2111 as I copied the authorized_keys from nico1. For WireGuard I opened UDP port 7104 as port 7103 was already in use by nico1.

nico2 currently has the following specifications:

CPU: AMD Ryzen Threadripper 3970X (32 cores 64 threads)
RAM: 256 GB DDR4 ECC
Storage: ADATA SX8200PNP 2 TB, ZFS formatted, 88% empty.
PSU: Seasonic PRIME PX-2200 2200W ATX 3.1
Cooler: be quiet! Silent Loop 2
LAN 0: 10 Gbit internet access and intranet access to nico1 over 10 Gbit internet switch
LAN 1: 10 Gbit intranet access to nico1 over 10 Gbit intranet switch (recommended to use for transfer between nico1 and nico2)
GPU: RTX 3080 (currently not attached to your container)
OS: Debian 12

Thanks! Takes longer than I thought to get this merged.

And what is even worse: the pull request is still in draft stage with a lot of fundamental discussions still going on, so it doesn't look like it is getting merged anytime soon. And this despite there not being a single code change for the past 3 weeks.

But I guess I'll hold back then and keep the downloaded model on /gpool for the time being.

Let's at least give it a try to see if it works with BF16, because FP8 not working is absolutely expected. I'm now using the following command to convert it. I expect this to take a long time and it will make use of one of the GPUs, but not enough that it would conflict with imatrix computation:

CUDA_VISIBLE_DEVICES=1 venv/bin/python fp8_cast_bf16.py --input-fp8-hf-path /HDD/Z1-Zero --output-bf16-hf-path /HDD/Z1-Zero-BF16

Although the error (can not map tensor) is not normally indicative of a format problem - the converter simply doesn't recognize that tensor. Anyway, I think I will crap out and kindly ask you to convert it for me when the patch is merged :)

Well unless fp8_cast_bf16.py processes this exact tensor...

for weight_name, weight in current_state_dict.items():
    if weight_name.endswith("_scale_inv"):
        continue
    elif weight.element_size() == 1:  # FP8 weight
        scale_inv_name = f"{weight_name}_scale_inv"
        try:
            # Get scale_inv from the correct file

lucky that i sneaked in a "normally" there. anyway, the fp8 to bf16 step is what i was missing

but how did you even get the idea of checking the safetensor against the base model? you looked at the commit history?

somewhat unrelated, I've cleaned up llama.cpp usage, and it should now be possible to use any custom llama.cpp variant per-job. i'd even support if somebody else (cough) would take over maintaining any llama.cpp forks we might want to use. all that is required is to have some llama.cpp source directory with a build directory under it where it was built (I use cmake).

Is there something special I need to do to convert a deepseek model, or does the above just indicate that I am trying to convert a broken model? (it's Z1-Zero in /gpool).

All the DeepSeek V3/R1 based models are a pain to convert to GGUF. You first need to convert them from FP8 to BF16, which makes them use 1.3 TB of storage, and then convert them to a 1.3 TB GGUF. It is not possible to directly convert the source model to GGUF without going through BF16.
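As a rough sketch, that two-step pipeline looks like this (paths are placeholders; step 1 is DeepSeek's fp8_cast_bf16.py as used above, step 2 is llama.cpp's convert_hf_to_gguf.py):

# step 1: dequantize the FP8 checkpoint to a ~1.3 TB BF16 checkpoint
python fp8_cast_bf16.py --input-fp8-hf-path /gpool/DeepSeek-R1 --output-bf16-hf-path /gpool/DeepSeek-R1-BF16
# step 2: convert the BF16 checkpoint to a ~1.3 TB BF16 source GGUF
python convert_hf_to_gguf.py /gpool/DeepSeek-R1-BF16 --outtype bf16 --outfile /gpool/DeepSeek-R1.bf16.gguf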

https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6 This is another method that saves you a step.

@mradermacher I just got wake on LAN working for nico2 so you can execute /root/wakeNico2.sh on nico1 to turn on nico2 should it be off.

To shut down the nico2 host execute /root/shutdownHost.sh
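For reference, a wake-on-LAN script like that typically just sends a magic packet to the target NIC; a minimal sketch of what it might contain (the MAC address below is a placeholder, not the real one):

# hypothetical contents of /root/wakeNico2.sh
wakeonlan 00:11:22:33:44:55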

You can access nico2 from nico1 using ssh [email protected].
You can access nico1 from nico2 using ssh [email protected].

Like on nico1, there are /host/proc/meminfo, /host/proc/stat, /host/proc/uptime and /host/proc/vmstat to check stats of the host. But given that currently your LXC container is usually the only thing running on CastlePeak, it is unlikely you need that.

memlock is set to unlimited and nproc to 4000, like on nico1.

I changed the sysctl.conf on CastlePeak as follows to match what we set on StormPeak:

# allow TCP with buffers up to 128 MiB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728

# set default buffer size during socket creation to 2 MiB
net.core.rmem_default = 2097152
net.core.wmem_default = 2097152

# increase TCP autotuning buffer limits
net.ipv4.tcp_rmem = 4096 4194304 67108864
net.ipv4.tcp_wmem = 4096 4194304 67108864

# Sets maximum size of the network interface's receive queue
net.core.netdev_max_backlog = 30000

# Use improved TCP congestion control algorithm
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr
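They take effect after reloading sysctl (the tcp_bbr module needs to be available for the last setting):

sysctl -p /etc/sysctl.conf                # reload the settings without a reboot
sysctl net.ipv4.tcp_congestion_control    # verify that bbr is actually in use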

I now know the probable cause of why the RPC setup was so slow. When I checked the CastlePeak BIOS settings to enable wake on LAN, I realized that the PCIe port currently hosting the RTX 3080 GPU was set to x4x4x4x4 mode instead of x16, because a few weeks ago I plugged 3x Intel ARC GPUs into that port using PCIe bifurcation and then forgot to change it back.

but how did you even get the idea of checking the safetensor against the base model? you looked at the commit history?

I for sure did not believe that he was able to finetune DeepSeek R1, as none of the public frameworks currently support it on a setup with less than 1500 GB of GPU memory, which in a single node is only possible on AMD, and AMD is not DeepSeek R1 compatible, so I obviously wanted to know whether he really finetuned it or is just a fraud. The commit message "Duplicate from deepseek-ai/DeepSeek-R1" visible in the model card and the commit history instantly gave it away that he just cloned DeepSeek R1. But even without it I obviously would have compared hashes with all other publicly released V3/R1 models.

What is much more interesting is what he uploaded now. Did he really accidentally claim a DeepSeek R1 clone to be his DeepSeek R1 Zero finetune and now upload a real model he created using his own custom finetuning code, or is it yet another lie? At first glance it looks real, so I downloaded it to gpool, but I only spent like a minute investigating it so far as I was busy setting up nico2.

The trending models are probably much more desired than the daily ones. In the end, I would assume there should be considerable overlap, but since our methods are distinct, they should complement each other.

Great to know so I will queue them using priority 0 as well.

I recently started to only statically quantize some bigger models, intending to queue imatrix jobs at a later date.

Just make sure not to forget about them

Well, if the output changes, we should make new imatrix ones. If the imatrix has missing entries for tensors that are quantized to few bits, then llama-quantize will likely crash or abort.

Yes we unfortunately likely have to compute the imatrix of all the DeepSeek V2/V3/R1 models again. For V3/R1 we will unfortunately need RPC to do so in Q8.
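If we do end up needing RPC for that, I'd expect the setup to look roughly like this (untested sketch; binary names, ports, addresses and flags are assumptions to double-check against our llama.cpp build):

# on each helper node: expose its resources via the ggml RPC backend
./build/bin/rpc-server --host 0.0.0.0 --port 50052
# on the main node: run imatrix on the Q8_0 GGUF across the local GPUs plus the RPC servers
./build/bin/llama-imatrix -m DeepSeek-R1.Q8_0.gguf -f calibration.txt \
    -o DeepSeek-R1.imatrix -ngl 99 --rpc 10.0.0.2:50052,10.0.0.3:50052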

Nope. Xpath, cool. Why would you even need... ah, you are essentially parsing the html page. Right... the trending info is not available via the api? That totally escaped me.

There is https://huggingface.co/models-json?pipeline_tag=text-generation&library=safetensors&p=1&sort=trending&withCount=false which should contain the same information and to which I might switch soon. Keep in mind that before my Python script, I was copying some JavaScript code into the Firefox development console, so getting the data out of HTML using XPath was easier. I often prefer getting data out of HTML: HTML usually has fewer rate limit issues, and XPath is a well-defined, well-established standard, while JSONPath was standardized less than a year ago.
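Something like this should pull the same trending list from that endpoint (the JSON field names are an assumption I would still have to verify):

curl -s 'https://huggingface.co/models-json?pipeline_tag=text-generation&library=safetensors&p=1&sort=trending&withCount=false' | jq -r '.models[].id'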

I need to add those external locations to the llmjob runner though, so they are visible inside the sandbox. Maybe it's slowly starting to get time for a config file...

That would be cool especially should we decide to add external storage to rich1 as well.

Love your optimism, but they are far from done, and any time we could get more :) I plan to have one running at all times, maybe in addition to the two normal jobs, leaving the source on /gpool, which essentially gives it a background priority.

Having one with gpool as source running all the time in addition to the two normal jobs makes a lot of sense, as HDDs are much slower than SSDs.

That in itself is just a friendly notice by your kernel, not an indication of a bug. It's not uncommon to increase it on a busy server. When a lot of complicated transactions have been made, or there is simply a lot of bulk I/O, transactions can indeed hang for a long time. And it is being used quite a bit at the moment.

That explains why it happened when I was hfd-ing massive models to it with 8 threads... I was quite worried about it and relieved to know that this is normal for BTRFS HDDs.

While we are at it, could you mount my / without discard and have a daily or so fstrim cronjob instead? Deleting large ggufs regularly causes long hangs. It's really just a minor optimisation, but it bugs me a lot for some reason :) I even moved some deletes into the background :)

I will do so the next time we have to reboot nico1 and rich1. I read there is discard=async in BTRFS, which makes even more sense, because only trimming once a day might mean slower write speeds due to writing to non-trimmed blocks.
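A sketch of the two options (mount options and paths are illustrative, not the exact fstab entries):

# either: asynchronous discard in the btrfs mount options
#   /dev/nvme0n1p2  /  btrfs  defaults,compress=zstd,discard=async  0 0
# or: no discard at mount time and a daily trim of all mounted filesystems
printf '#!/bin/sh\nexec /usr/sbin/fstrim -av\n' > /etc/cron.daily/fstrim
chmod +x /etc/cron.daily/fstrim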

I totally understand, and I am as willing to do large quants in the future as I was in the past. Just trying to put things into perspective (and getting priorities straight). I don't see the queue growing due to overlooked models much in the future.

I believe and hope this is a one-time thing, and once we finally are done with the massive backlog things should relax. It has been a crazy past 4 months.

Unfortunately, nico1 is also the best host for static quants. rain/back/kaos/marco are often I/O limited, for example. rich1, too. leia less, because I queue smaller jobs on it.

I wonder if we could do something about rich1 being I/O limited. Currently we are using the 2 TB NVMe SSD but there is another 1 TB SATA SSD we could use. I already thought about RAID 0 them together, but RAID 0 of a fast NVMe SSD with a slower SATA SSD seems like a bad idea. We could also disable BTRFS compression and see if that helps. I likely have to disable discard on rich1 as well anyways.

For example, I was even thinking about splitting jobs by quants, so that fast quants (Q*, essentially) are done by nico1 and slow ones (IQ* essentially) are done by other hosts. To get more throughput. At the expense of disk wear and increased I/O. Didn't seem appealing enough for me so far, but I did think of it :)

I really couldn't care less about disk wear. If we continue at the current rate, they will last another 3 years, and I wouldn't mind if they break earlier, as then I have a reason to replace them with high quality 4 TB SSDs. I currently filled all 8 NVMe slots of StormPeak so I can't add any more of them without replacing an existing one. Currently they are at 25% and 21% wear.

But only doing non-IQ quants would be an internet bandwidth concern. While there is no fair use clause in my contract, testing out how much my ISP is willing to tolerate before kicking me out is not the smartest idea, given that all other ISPs use the unreliable fiber network maintained by Swisscom instead of the stable high-quality fiber network maintained by Quickline. But I guess nico2 is worth risking pushing the limits at least a bit.

If we only modestly increase quanting throughput we should also be able to stay under 500 TB/month.

I think so as well. This is not really a hard limit anyways. It's just what their competitor Init7 put as fair use into their contract, so if they complain and I'm below 500 TB/month, I could tell them that their competitor would be fine with me using as much traffic as I do.
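For perspective, 500 TB/month is a fairly modest sustained rate:

# 500 TB spread over a 30-day month, in MB/s
echo $(( 500 * 10**12 / (30*24*3600) / 10**6 )) MB/s   # ~192 MB/s, roughly 1.5 Gbit/s on a 10 Gbit uplink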

I would be ok with continuing as it is. I hope the current fad of kicking out 20 70b's per day will die down when everybody has distilled and sce-merged enough. Maybe they will start doing that to 405Bs instead, but hopefully not.

I wouldn't count on it so we better slightly increase our throughput so we can finally catch up with the latest models and work on our massive backlog.

In any case, it would be, hopefully, a one-time thing (the current queue). If I ever get to queue 2023 models, it will be far, far fewer of them, because it will only be models I will personally recognize somehow :)

That's what I'm thinking as well. We do have the resources required for us to scale up, so let's do it.

somewhat unrelated, I've cleaned up llama.cpp usage, and it should now be possible to use any custom llama.cpp variant per-job.

That's cool so we can finally redo the snowflake arctic models because we all miss massive models.

i'd even support if somebody else (cough) would take over maintaining any llama.cpp forks we might want to use. all that is required is to have some llama.cpp source directory with a build directory under it where it was built (I use cmake).

Sure, if you create a pull request with your own llama.cpp patches to https://github.com/nicoboss/llama.cpp with "mradermacher" as the target branch, I can do so. Then we also finally have a place to merge my imatrix-allow-partial-data branch into.
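The layout you described would then be something like this (build flags are just an example, not necessarily what you use):

git clone https://github.com/nicoboss/llama.cpp llama.cpp-mradermacher
cd llama.cpp-mradermacher
git checkout mradermacher        # the branch the patches get merged into
cmake -B build -DGGML_CUDA=ON    # a llama.cpp source dir with a build directory under it
cmake --build build -j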

https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6 This is another method that saves you a step.

Awesome! Thanks a lot for your recommendation. Do you have any idea if the GGUF produced that way will be equal to what you get from converting the downloaded DeepSeek V3/R1 model to BF16 and then the BF16 model to GGUF? Will a source GGUF produced by this code be compatible with the official llama.cpp?

Did he really accidentally claim a DeepSeek R1

I was also a bit suspicious after he uploaded again. Too bad we don't have the original repository (with the un-downloadable file), but he was awfully quick in re-uploading a new repo.

I just got wake on LAN working for nico2

Uh, ah, oh, wow - I'll try to set it up tomorrow.

To shut down the nico2 host execute /root/shutdownHost.sh

Does that mean I should shut it down automatically before nightfall or so? I can probably devise a strategy (like, start at 7, interrupt/stop at 1600 and shut down). Of course, it will be a challenge :)
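Something crude like this in a crontab on nico1 might already do (times, the host alias and the job-interruption step are placeholders):

0 7  * * *  /root/wakeNico2.sh                       # power nico2 on in the morning
0 16 * * *  ssh root@nico2 /root/shutdownHost.sh     # after interrupting/stopping the jobs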

I now know the probable cause why the RPC

Sounds like a probable cause indeed. We'll find out soon enough :)

"intending to queue imatrix jobs at a later date." Just make sure not to forget about them

Some have already gone away :-)

JSON Path got standardized not even a year ago.

But web pages not intended for scraping are not standardized at all, unlike an api. In any case, I don't care; whatever made it easiest for you to come up with the script wins.

I will do so the next time we have to reboot nico1 and rich1.

Fine with me, although it surely is remountable.

Having one with gpool as source running all the time in addition to the two normal jobs makes a lot

I have to hand-designate such jobs, but there is now some hacky code to do exactly that, in use for Tulu.

Because only trimming once a day might mean slower write speeds due to writing to non-trimmed blocks.

fstrim should be more efficient than online trimming and, for some reason, less blocking. But I have no real experience with specifically your nvme disks, and it's very disk dependent. I just noticed that deletes are surprisingly slow, and overwriting tends to be more efficient overall, even if individual writes may be a bit slower. As long as there is either some trimming or enough spare.

RAID 0 of a fast NVMe SSD with a slower SATA SSD seems like a bad idea.

I agree. Especially if it's a non-enterprise sata ssd, we might end up at <<200 MBps write speed, maybe much less.

In any case, rich1 is not always I/O-limited, only when it is doing lots of static quants, or when it is converting to gguf. It's by far not as much of a problem as on rain/back/kaos/marco.

I could tell them that their competitor would be fine with me using as much traffic as I use.

Between you and a big corporation, you usually end up at the losing end of any argument.

Sure if you create a pull request with your own llama.cpp patches

I have no functional changes. But if somebody would maintain a fork, I might be bothered to add e.g. timestamps to imatrix output. I brought it up mainly because I didn't want to maintain (and regularly merge) the imatrix patches. I'd be happy to use you as upstream.

If I ever get to queue 2023 models

I just got a taste of it by looking at bluemoonrp-13b - the repo consists of 4bit safetensors, some ggml files and the original pytorch files in a subdirectory. And bluemoonrp-30b seems to be lost other than a 4 and 5 bit version. I was surprised it did convert fine, though - I have the feeling we might have to resort to some early 2024 version of convert.py.

https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6

wrong thread?

@nicoboss did you do something special on nico1 to let me set btrfs properties (such as compression)? it's because my rsync's currently fail on nico2 because they can't set extended attributes (had to play around with it before going to sleep :)
