It started with us evaluating R1 and o1 on our own university-math benchmarks: U-MATH for problem-solving and μ-MATH for judging solution correctness (see the HF leaderboard: toloka/u-math-leaderboard).
tl;dr: R1 is genuinely impressive, but we find it lags behind o1 in novelty adaptation and reliability:
* performance drops when benchmarks are refreshed with fresh, unseen tasks (e.g. AIME 2024 -> 2025)
* the R1-o1 gap widens on niche subdomains (e.g. university-specific math instead of the more common Olympiad-style contests)
* the same holds for altogether unconventional domains (e.g. chess) or skills (e.g. judging solutions instead of solving problems)
* R1 also runs into failure modes far more often (e.g. making illegal chess moves or falling into endless generation loops; see the sketch below)
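To make that last failure mode concrete, here is a minimal sketch of the kind of heuristic one could use to flag completions stuck in a generation loop. The function name and thresholds are purely illustrative and are not the exact check used in the evaluation.

```python
def looks_like_generation_loop(text: str, ngram: int = 20, min_repeats: int = 3) -> bool:
    """Heuristic: does the completion end by repeating the same token window over and over?

    `ngram` and `min_repeats` are illustrative thresholds, not values from the benchmark tooling.
    """
    tokens = text.split()
    if len(tokens) < ngram * min_repeats:
        return False
    tail = tokens[-ngram:]
    # Count how many consecutive times the trailing window repeats at the end of the text.
    repeats = 0
    for start in range(len(tokens) - ngram, -1, -ngram):
        if tokens[start:start + ngram] == tail:
            repeats += 1
        else:
            break
    return repeats >= min_repeats

# Example: a degenerate completion that keeps restating the same step.
completion = "Thus x = 2. " * 50
print(looks_like_generation_loop(completion))  # True
```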
Our point here is not to bash DeepSeek: they've done exceptional work, R1 is a game-changer, and we have no intention of downplaying that. R1's release is a perfect opportunity to study where these models differ and to better understand how to move forward from here.
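For anyone who wants to run the comparison themselves, below is a minimal sketch of pulling the two benchmarks from the Hugging Face Hub. The dataset ids, splits, and column names here are assumptions; the leaderboard page (toloka/u-math-leaderboard) lists the exact resources.

```python
# Sketch only: dataset ids and structure are assumptions, check the leaderboard page for the real ones.
from datasets import load_dataset

u_math = load_dataset("toloka/u-math")    # problem-solving tasks
mu_math = load_dataset("toloka/mu-math")  # solution-judgment (meta-evaluation) tasks

print(u_math)   # inspect available splits and columns before wiring up a model or judge
print(mu_math)
```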
🚀 Excited to share our technical report on the Southeast Asian multilingual model Sailor2 and its latest updates!
Our 49-page report details Sailor2's development journey, including multilingual data cleaning, small-model data-mixture simulations, multi-stage continual pre-training, multi-stage post-training, and multicultural, multilingual evaluation. Sailor2 aims to streamline multilingual model pre-training for the community.
🧭 We highlight Sailor2's strong performance in low-resource-language translation and its advantage in Southeast Asian cultural understanding, promoting practical applications for regional languages.
Model updates include:
💡 More precise outputs: reduced redundancy in model outputs through refined post-training data and optimization techniques.
🌈 Handling longer texts: expanded to a 128K context length in Southeast Asian languages through long-text training.
⚡️ Faster inference: 2.5x faster inference speed with speculative decoding (see the sketch below).
🌪️ More model sizes: new 3B and 14B sizes introduced through model pruning.
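As a rough illustration of the speculative-decoding idea, the sketch below uses Hugging Face transformers' assisted generation, where a small draft model proposes tokens that the larger target model verifies. The repo ids are assumptions about how the checkpoints are named, and the 2.5x figure comes from the report's own setup rather than this exact snippet.

```python
# Sketch of speculative (assisted) decoding with transformers; model ids are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "sail/Sailor2-20B-Chat"  # assumed repo id for the large target model
draft_id = "sail/Sailor2-1B-Chat"    # assumed repo id for the small draft model

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Terjemahkan ke bahasa Inggris: Selamat datang di Asia Tenggara."
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# assistant_model enables assisted generation: the draft proposes tokens and the
# target verifies them, which speeds up decoding without changing the output distribution.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```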
🌟 All models are Apache-licensed for commercial use; development tools (code, resources) are open-source.