Pretraining Datasets
wikimedia/wikipedia • Viewer • Updated Jan 9, 2024 • 61.6M • 101k • 1.18k
togethercomputer/RedPajama-Data-V2 • Updated Nov 21, 2024 • 5.49k • 400
Skywork/SkyPile-150B • Viewer • Updated Dec 7, 2023 • 1.76M • 32.9k • 404
Awesome Instruction Tuning Dataset
Open-Orca/OpenOrca • Viewer • Updated Feb 19, 2025 • 2.94M • 18.7k • 1.51k
glaiveai/glaive-code-assistant • Viewer • Updated Sep 27, 2023 • 136k • 488 • 100
silk-road/alpaca-data-gpt4-chinese • Viewer • Updated May 23, 2023 • 52k • 1.04k • 103
anon8231489123/ShareGPT_Vicuna_unfiltered • Updated Apr 12, 2023 • 139k • 851