I've been awarded 128 NVIDIA Blackwell GPUs through NIPA (Korea's National IT Industry Promotion Agency). Sharing this here first — because Hugging Face is where it all started.
I design LLM architectures from scratch. HF was my lab — dissecting Transformers internals, analyzing thousands of checkpoints, iterating on Spaces with global feedback.
Our FINAL Bench reached #5 globally in HF dataset popularity, and this research is exactly what earned the GPU grant. 👉 FINAL-Bench/Leaderboard
These 128 Blackwells will scale AETHER-Net — our Proto-AGI architecture (Emergence Engine · Meta-Cognition · SLAI · Multi-Intelligence · Synergy & Critique) — validated at 0.8B with MoE expansion to 2.1B params. Next stop: 166B.
People I must thank:
@John6666 — Guardian of this ecosystem. Never misses a forum question, interested in every project, active 24/7. I've genuinely wondered if you're a machine. Remarkable.
@bartowski — Master of quantization. The hidden infrastructure of open-source LLMs. Countless experiments were possible thanks to you.
@SaylorTwift — You see what others miss. Insight that cuts to the essence. Deep respect.
My promise: AETHER-Net design docs, training recipes, checkpoints, and failure logs — all shared here openly.
🤗 Thank you, Hugging Face. Let's turn the next page together. 🚀
🏟️ Smol AI WorldCup: A 4B Model Just Beat 8B — Here's the Data
We evaluated 18 small language models from 12 makers on 125 questions across 7 languages. The results challenge the assumption that bigger is always better.
→ A 1.3B model fabricates confident fake content 80% of the time when prompted with nonexistent entities. Qwen3 family hits 100% trap detection across all sizes.
→ Qwen3-1.7B (1.2GB) outscores Mistral-7B, Llama-3.1-8B, and DeepSeek-R1-14B. A newer architecture at 1.7B beats an older architecture at 14B.
What makes this benchmark different?
Most benchmarks ask "how smart?" — we measure five axes simultaneously: Size, Honesty, Intelligence, Fast, Thrift (SHIFT). Our ranking metric WCS = sqrt(SHIFT x PIR_norm) rewards models that are both high-quality AND efficient. Smart but massive? Low rank. Tiny but poor? Also low.
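For anyone who wants to play with the ranking math, here is a minimal sketch of how WCS could be computed. Only the outer formula, WCS = sqrt(SHIFT x PIR_norm), comes from the description above; the normalization of each axis to [0, 1], the simple mean used to aggregate the five SHIFT axes, the `pir_norm` argument, and the example scores are assumptions for illustration, not the benchmark's actual implementation.

```python
import math

def wcs(size, honesty, intelligence, fast, thrift, pir_norm):
    """Illustrative World Cup Score (WCS).

    Assumed: every SHIFT axis and pir_norm is pre-normalized to [0, 1],
    and SHIFT is the plain mean of the five axis scores. Only the outer
    formula, sqrt(SHIFT * PIR_norm), is taken from the benchmark post.
    """
    shift = (size + honesty + intelligence + fast + thrift) / 5
    return math.sqrt(shift * pir_norm)

# Hypothetical scores: the geometric mean pulls down any model that is weak
# on either the efficiency side (SHIFT) or the quality side (PIR_norm),
# so the balanced model ranks above both extremes.
print(round(wcs(0.9, 0.9, 0.4, 0.9, 0.9, 0.35), 3))   # tiny but weak
print(round(wcs(0.2, 0.9, 0.95, 0.4, 0.2, 0.95), 3))  # smart but massive
print(round(wcs(0.8, 0.9, 0.85, 0.8, 0.8, 0.85), 3))  # balanced, ranks highest
```

Under these assumed inputs, the balanced model scores highest, which matches the stated intent: high quality alone or small size alone is not enough.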