Hugging Face
liu
yixin
1 follower · 2 following
AI & ML interests
None yet
Recent Activity
liked a model 3 days ago
nvidia/NV-Embed-v2
liked a Space 11 days ago
gaia-benchmark/leaderboard
reacted to m-ric's post 13 days ago
Adyen's new Data Agents Benchmark shows that DeepSeek-R1 struggles on data science tasks!

How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Search, OpenAI's o1 was by far the best model to power an agentic system.

So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand.

But they really missed the mark. DeepSeek-R1 only got 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers.

These results really surprised us. We thoroughly checked them. We even thought our APIs for DeepSeek were broken, and colleagues Leandro and Anton helped me start custom instances of R1 on our own H100s to make sure it worked well. But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data.

Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data. It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark, looking forward to seeing people propose better agents!

Read more in the blog post: https://huggingface.co/blog/dabstep
Organizations
None yet
Papers
1
arxiv:2402.17177
spaces
1
pinned
Runtime error
Experiment-Command-Generator
models
3
yixin/output
Updated Jan 7
yixin/trained-flux
Updated Aug 25, 2024
yixin/liqe
Updated Jul 5, 2024
datasets
2
yixin/metacloak_vggface2_protected_11
Viewer • Updated Jul 22, 2024 • 192 • 49
yixin/metacloak_celeba_vggface2
Viewer • Updated Apr 2, 2024 • 768 • 95 • 1