Shyam Sunder Kumar

theainerd

AI & ML interests

Natural Language Processing

Recent Activity

Organizations

Neuropark · Speech Recognition Community Event Version 2 · Open-Source AI Meetup · Social Post Explorers · Hugging Face Discord Community

theainerd's activity

reacted to cogwheelhead's post with 👍 3 days ago
My team and I have performed an in-depth investigation comparing o1 to R1 (and other reasoning models).

Link: https://toloka.ai/blog/r1-is-not-on-par-with-o1-and-the-difference-is-qualitative-not-quantitative

It started with us evaluating them on our own university-math benchmarks: U-MATH for problem-solving and μ-MATH for judging solution correctness (see the HF leaderboard: toloka/u-math-leaderboard).
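
For reference, a benchmark like this can be pulled straight from the Hub with the `datasets` library. A minimal sketch; the repo ID and split name are assumptions for illustration, not taken from the post:

```python
# Minimal sketch: load a university-math benchmark from the Hugging Face Hub.
# The repo ID and split name are assumptions for illustration.
from datasets import load_dataset

umath = load_dataset("toloka/u-math", split="test")  # assumed repo ID / split

# Inspect a few problems; field names vary per dataset, so print whole records.
for example in umath.select(range(3)):
    print(example)
```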

tl;dr: R1 is certainly impressive, but we find that it lags behind in novelty adaptation and reliability:
* performance drops when benchmarks are updated with fresh, unseen tasks (e.g. AIME 2024 → 2025)
* the R1–o1 gap widens on niche subdomains (e.g. university-specific math instead of the more common Olympiad-style contests)
* the same holds when moving to altogether unconventional domains (e.g. chess) or skills (e.g. judging solutions instead of solving problems)
* R1 also runs into failure modes far more often, e.g. making illegal chess moves or falling into endless generation loops (a legality check is sketched after this list)
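
The illegal-move failure mode is easy to check mechanically. A minimal sketch using the python-chess package; the assumption that model outputs arrive in SAN notation is ours:

```python
# Sketch: check whether a model-proposed chess move is legal in a position,
# one way to count the "illegal move" failures mentioned above.
# Assumes moves come back in SAN notation (e.g. "Nf3").
import chess

def is_legal_san(board: chess.Board, move_san: str) -> bool:
    """Return True if move_san parses to a legal move on this board."""
    try:
        board.parse_san(move_san)  # raises a ValueError subclass if illegal
        return True
    except ValueError:
        return False

board = chess.Board()              # standard starting position
print(is_legal_san(board, "e4"))   # True
print(is_legal_san(board, "Ke2"))  # False: king is blocked by its own pawn
```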

Our point here is not to bash DeepSeek: they've done exceptional work, R1 is a game-changer, and we have no intention of downplaying that. R1's release is a perfect opportunity to study where all these models differ and to build an understanding of how to move forward from here.
reacted to dreamerdeo's post with 🚀 3 days ago
🚀 Excited to share our technical report on the Southeast Asian multilingual model Sailor2 and its latest updates!

Our 49-page report details Sailor2's development journey, including multilingual data cleaning, small-model data-mixture simulations, multi-stage continual pre-training, multi-stage post-training, and multicultural, multilingual evaluations. Sailor2 aims to streamline multilingual model pre-training for the community.

🧭 We highlight Sailor2's impressive performance in low-resource language translation scenarios and its cultural understanding advantages in Southeast Asia, promoting practical applications for regional languages.

Model updates include: 
💡 More precise outputs: Reduced redundancy in model outputs through refined post-training data and optimization techniques. 
🌈 Handling longer texts: Expanded to handle up to 128K context length in Southeast Asian languages through long-text training. 
⚡️ Faster inference: Achieved 2.5x faster inference with speculative decoding (a generic sketch follows this list).
🌪️ More model sizes: Introduced new sizes of 3B and 14B through model pruning.
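
The post doesn't include code, but speculative decoding is exposed in `transformers` as assisted generation. A minimal sketch with assumed model IDs (the post does not name a draft model, and standard assisted generation requires the draft to share the target's tokenizer):

```python
# Minimal sketch of speculative (assisted) decoding with transformers.
# Model IDs are assumptions; the post does not name a draft model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "sail/Sailor2-8B-Chat"  # assumed target model
draft_id = "sail/Sailor2-1B-Chat"   # assumed smaller draft model

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16)
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Halo! Apa kabar?", return_tensors="pt")
# The draft model proposes several tokens at a time; the target model verifies
# them in one forward pass, which is where the latency win comes from.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```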

🌟 All models are Apache-licensed for commercial use; development tools (code, resources) are open-source.

📚 Technical report: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs (2502.12982) 
🤖️ Models: sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b 
💬 Demo: sail/Sailor2-20B-Chat (a minimal usage sketch follows below)
📣 Sailor2 community: https://huggingface.co/sailor2
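
A minimal sketch of querying the chat model with `transformers`, using the repo name from the demo link above; the chat-template call is standard `transformers` usage and the prompt is an arbitrary example, neither is taken from the post:

```python
# Minimal sketch: one-turn chat with Sailor2-20B-Chat via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-20B-Chat"  # repo name from the demo link above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Terjemahkan ke bahasa Inggris: Selamat pagi!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
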
upvoted an article 11 days ago

Open-source DeepResearch – Freeing our search agents

upvoted an article 12 days ago