@Eligy you really can't. Hosting the full unquantised model takes somewhere between 4 and 8 H200s, and even an aggressively quantised version needs roughly 151GB of VRAM at a minimum, a level where quality issues start to appear and which no end user realistically has, even with Strix Halo. A system that performs half decently at that model size costs somewhere in the hundreds of thousands of dollars. You can step down to the 70B distilled model, which needs either 2x 5090s or an A100, a ~$10k investment in GPUs alone.
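For a rough sense of where those numbers come from, here's a back-of-envelope sketch in Python. The bytes-per-weight figures and the 1.2x overhead factor (KV cache, activations, framework overhead) are my own approximations, not official sizing guidance:

```python
def vram_estimate_gb(params_billion: float, bytes_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM to serve a model: weight memory times an overhead factor for KV cache etc."""
    weights_gb = params_billion * bytes_per_weight  # ~1 GB per billion params at 1 byte/weight
    return weights_gb * overhead

# Full DeepSeek-R1 is ~671B parameters, released in FP8 (~1 byte/weight)
print(vram_estimate_gb(671, 1.0))   # ~805 GB -> several 141GB H200s
# Aggressive ~1.7-bit dynamic quant (~0.21 bytes/weight effective)
print(vram_estimate_gb(671, 0.21))  # ~170 GB -> the "150GB+ of VRAM" ballpark
# 70B distill at 4-bit (~0.5 bytes/weight)
print(vram_estimate_gb(70, 0.5))    # ~42 GB -> 2x high-end consumer cards or one 80GB A100
```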
So you've spent $10k on GPUs, plus extra on RAM; you've invested significant effort in learning how to run a local model, gotten it running, and hooked up your local interface; and now you're pumping out 10 tokens per second with occasional crashes, from a model that performs NEARLY as well as the aging GPT-4o, or Gemini 2.0 Flash Thinking (which is free).
OR you could spend a few dollars and get access to models that compete with or beat the full R1, output 120 tokens per second, require no local setup or effort, and are significantly more reliable.
It isn't even a comparison. You cannot realistically run DeepSeek-R1 locally. You can maybe run the 14B distill, which is inferior to everything online by a substantial margin and has very little practical value, or you can use an online API endpoint, which is faster, cheaper, and better.
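To show how little effort the hosted route takes, here's a minimal sketch using the `openai` Python client against an OpenAI-compatible endpoint. The base URL and model name are illustrative (DeepSeek's own API exposes R1 as `deepseek-reasoner`); swap in whichever provider you actually use:

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint works; check your provider's docs for the
# exact base URL and model name. These values are examples, not a recommendation.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Summarise the trade-offs of running R1 locally."}],
)
print(response.choices[0].message.content)
```

That's the entire "setup": one pip install, one API key, and you're getting full-size R1-class output at provider speeds instead of 10 tokens per second at home.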