
On Path to Multimodal Generalist: General-Level and General-Bench


[📖 Project] [🏆 Leaderboard] [📄 Paper] [🤗 Paper-HF] [🤗 Dataset-HF (Close-Set)] [🤗 Dataset-HF (Open-Set)] [📝 Github]

---

Does higher performance across tasks indicate a stronger MLLM and bring us closer to AGI?

NO! But synergy does.

Most current MLLMs predominantly build on the language intelligence of LLMs and extend it to aid multimodal understanding, which yields only an indirect, simulated form of multimodal intelligence. While LLMs (e.g., ChatGPT) have already demonstrated such synergy across NLP tasks, reflecting genuine language intelligence, the vast majority of MLLMs unfortunately fail to achieve it across modalities and tasks.

We argue that the key to advancing towards AGI lies in the synergy effect: a capability that enables knowledge learned in one modality or task to generalize to, and enhance mastery of, other modalities or tasks, fostering mutual improvement through interconnected learning.
---

This project introduces General-Level and General-Bench.

---

🌐🌐🌐 Keypoints

(See the Project link above for an overview of the keypoints.)

---

🏆🏆🏆 Overall Leaderboard

(See the Leaderboard link above for the full rankings.)

---

🚀🚀🚀 General-Level

**A 5-scale level evaluation framework with a new norm for assessing multimodal generalists (multimodal LLMs/agents). Its core is the use of synergy as the evaluative criterion, categorizing capabilities by whether MLLMs preserve synergy across comprehension and generation, as well as across multimodal interactions.**
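To make the criterion concrete, below is a toy sketch of how a synergy-style check could be computed. It is **not** the official General-Level scoring formula (please refer to the paper for the exact level definitions and metrics); the `synergy_ratio` helper and the example scores are purely illustrative assumptions that only capture the spirit of the criterion: a generalist exhibits synergy when joint multimodal training lets it match or surpass the best specialist on the same tasks.

```python
# Illustrative sketch only -- NOT the official General-Level scoring formula.
# It mimics the spirit of the synergy criterion: a generalist shows synergy on a
# task group when its jointly trained scores reach the best specialist's scores.
from typing import Dict


def synergy_ratio(generalist: Dict[str, float], specialist_sota: Dict[str, float]) -> float:
    """Fraction of shared tasks where the generalist reaches the specialist SoTA.

    `generalist` and `specialist_sota` map task names to scores on the same
    metric (higher is better). A ratio near 1.0 suggests the model is not just
    covering tasks but benefiting from cross-task / cross-modal learning.
    """
    shared = [t for t in generalist if t in specialist_sota]
    if not shared:
        return 0.0
    wins = sum(generalist[t] >= specialist_sota[t] for t in shared)
    return wins / len(shared)


# Hypothetical numbers for illustration only.
comprehension = {"image-captioning": 81.2, "vqa": 66.5}
sota = {"image-captioning": 84.0, "vqa": 65.0}
print(f"Synergy ratio: {synergy_ratio(comprehension, sota):.2f}")  # -> 0.50
```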
---

🍕🍕🍕 General-Bench

**A companion massive multimodal benchmark that encompasses a broad spectrum of skills, modalities, formats, and capabilities, covering over 700 tasks and 325K instances.**

We provide two dataset variants according to the intended use:
- [**General-Bench-Openset**](https://huggingface.co/datasets/General-Level/General-Bench-Openset), with both sample inputs and labels publicly available, for **free open-world use** (e.g., academic experiments and comparisons); see the download sketch after this list.
- [**General-Bench-Closeset**](https://huggingface.co/datasets/General-Level/General-Bench-Closeset), with only sample inputs available, used for **leaderboard ranking**; participants need to submit their predictions to us for internal evaluation.
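For local, open-world experiments, the Openset files can be fetched like any other Hugging Face dataset repository. The snippet below is a minimal sketch using `huggingface_hub`; the `local_dir` destination is just an example, and the dataset card remains the authoritative guide for loading and evaluation.

```python
# Minimal sketch: download General-Bench-Openset for local, open-world use.
# This only assumes the repo is a standard Hugging Face dataset repository;
# refer to the dataset card for the official loading/evaluation instructions.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="General-Level/General-Bench-Openset",
    repo_type="dataset",
    local_dir="./General-Bench-Openset",  # example destination; change as needed
)
print(f"Benchmark files downloaded to: {local_path}")
```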
---

📌📌📌 Citation

If you find this project useful for your research, please kindly cite our paper:

```bibtex
@misc{fei2025pathmultimodalgeneralistgenerallevel,
  title={On Path to Multimodal Generalist: General-Level and General-Bench},
  author={Hao Fei and Yuan Zhou and Juncheng Li and Xiangtai Li and Qingshan Xu and Bobo Li and Shengqiong Wu and Yaoting Wang and Junbao Zhou and Jiahao Meng and Qingyu Shi and Zhiyuan Zhou and Liangtao Shi and Minghe Gao and Daoan Zhang and Zhiqi Ge and Weiming Wu and Siliang Tang and Kaihang Pan and Yaobo Ye and Haobo Yuan and Tao Zhang and Tianjie Ju and Zixiang Meng and Shilin Xu and Liyu Jia and Wentao Hu and Meng Luo and Jiebo Luo and Tat-Seng Chua and Shuicheng Yan and Hanwang Zhang},
  year={2025},
  eprint={2505.04620},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.04620},
}
```