onekq (Yi Cui)

posted an update 2 days ago

Post

1878

Still waiting for 👽Grok👽 3 API ⌛😞😫

replied to their post 6 days ago

Done. So I understand this: you do not change model weights, but rather tweak the inference logic? Somehow remind me of speculative decoding.

replied to their post 9 days ago

Sure, this is what I intend to do.

But a HF 🤗 collection cannot include anything outside HF 🤗. It has to be a dataset, model, space, or paper. Do you have anything like those?

posted an update 10 days ago

Post

1758

R1 is still trending. Here is a collection of works trying to replicate R1.
onekq-ai/r1-reproduction-works-67a93f2fb8b21202c9eedf0b

Players include Huggingface (Open R1), Stanford (simple scaling), Berkeley (Bespoke, Open thoughts, etc.), ServiceNow, etc. I know there is another work from HKUST but couldn't find it on 🤗. Let me know if I miss any teams.

5 replies

·

replied to their post 19 days ago

In my case I asked both models to write code. The model is good if the code passes tests. What are your prompts?

https://huggingface.co/datasets/onekq-ai/WebApp1K-Duo-React

I know though Anthropic weighs in on safety.

replied to their post 21 days ago

And their python package too 😜

Having AI to do the refactor is a great idea though. It will be breaking change if you switch your model from non-reasoning to reasoning.

posted an update 21 days ago

Post

1675

o3-mini is slightly better than R1, but lags behind Claude. Sorry folks, no new SOTA 😕

But OAI definitely owns the fashion of API. temperature and top_p are history now, reasoning_effort will be copied by other vendors.

onekq-ai/WebApp1K-models-leaderboard

4 replies

·

posted an update 22 days ago

Post

1301

Mistral Small 3 is SUPER fast, and highest score for 20+b model, but still 11 points below Qwen 2.5 coder 32b.

I believe specialty model is the future. The more you know what to do with the model, the better bang you can get for your buck. If Mistral scopes this small model to coding only, I'm confident they can beat Qwen.

One day my leaderboard will be dominated by smol models excellent on one thing, not monolithic ones costing $$$. And I'm looking forward to that.

onekq-ai/WebApp1K-models-leaderboard

1 reply

·

replied to their post 25 days ago

Adding Qwen2.5-Max

posted an update 29 days ago

Post

2292

So 🐋DeepSeek🐋 hits the mainstream media. But it has been a star in our little cult for at least 6 months. Its meteoric success is not overnight, but two years in the making.

To learn their history, just look at their 🤗 repo https://huggingface.co/deepseek-ai

* End of 2023, they launched the first model (pretrained by themselves) following Llama 2 architecture
* June 2024, v2 (MoE architecture) surpassed Gemini 1.5, but behind Mistral
* September, v2.5 surpassed GPT 4o mini
* December, v3 surpassed GPT 4o
* Now R1 surpassed o1

Most importantly, if you think DeepSeek success is singular and unrivaled, that's WRONG. The following models are also near or equal the o1 bar.

* Minimax-01
* Kimi k1.5
* Doubao 1.5 pro

1 reply

·

reacted to clem's post with 🔥 29 days ago

Post

2434

The 🐳 just crossed 10,000 followers on HF

https://huggingface.co/deepseek-ai

replied to their post 29 days ago

My conclusion is the same. The R1 paper already reported lower success rate of the distilled models. This is not surprising since we cannot expect the same outcomes out of a much smaller model.

Here is the problem. The small models released by frontier labs are always generic, i.e. decent but lower performance than the flagship model on every benchmark. But we GPU deplorables often want a specialized model which is excellent on only one thing, hence the disappointment.

I guess we will have to help ourselves on this one. Distill an opinionated dataset from the flagship model to a small model of your choice, then hill climb the benchmark you care about.

replied to their post about 1 month ago

1000% agree.

Also reasoning models sure spit out lots of tokens. The same benchmark cost 4x or 5x the money and time to run than regular LLMs. Exciting time for inference players.

Have you tried the distilled models of R1(Qwen and Llama)?

replied to their post about 1 month ago

+1

Also the velocity of progress. I have wanted to learn Monte Carlo Tree Search and process rewards etc. and haven't got the time. I guess now I can skip them 🤗

posted an update about 1 month ago

Post

2684

This is historical. 🎉

DeepSeek 🐋R1🐋 surpassed OpenAI 🍓o1🍓 on the dual leaderboard. What a year for the open source!

onekq-ai/WebApp1K-models-leaderboard

posted an update about 1 month ago

Post

4735

🐋DeepSeek 🐋 is the real OpenAI 😯

6 replies

·

posted an update about 1 month ago

Post

1651

Qwen 2.5 Coder 32b is a dime among nickels. Amazing performance for its size, so much so it earns a spot in the duo leaderboard. The day of small models is here.

onekq-ai/WebApp1K-models-leaderboard
Qwen/Qwen2.5-Coder-32B-Instruct

posted an update about 2 months ago

Post

3051

🐋 DeepSeek 🐋v3 achieves a solid 7 point jump than v2.5, surpassing GPT-4o, but is still behind 🍓 o1 🍓and Claude 3.5.

onekq-ai/WebApp1K-models-leaderboard

posted an update 4 months ago

Post

599

October version of Claude 3.5 lifts SOTA (set by its June version) by 7 points.
onekq-ai/WebApp1K-models-leaderboard

Closed sourced models are widening the gap again.

Note: Our frontier leaderboard now uses double test scenarios because the single-scenario test suit has been saturated.

posted an update 4 months ago

Post

1866

I'm now working on finetuning of coding models. If you are GPU-hungry like me, you will find quantized models very helpful. But quantization for finetuning and inference are different and incompatible. So I made two collections here.

Inference (GGUF, via Ollama, CPU is enough)
onekq-ai/ollama-ready-coding-models-67118c3cfa1af2cf04a926d6

Finetuning (Bitsandbytes, QLora, GPU is needed)
onekq-ai/qlora-ready-coding-models-67118771ce001b8f4cf946b2

For quantization, the inference models are far more popular on HF than finetuning models. I use https://huggingface.co/QuantFactory to generate inference models (GGUF), and there are a few other choices.

But there hasn't been such a service for finetuning models. DIY isn't too hard though. I made a few myself and you can find the script in the model cards. If the original model is small enough, you can even do it on a free T4 (available via Google Colab).

If you know a (small) coding model worthy of quantization, please let me know and I'd love to add it to the collections.

Yi Cui

AI & ML interests

Recent Activity

Organizations

onekq's activity