NVIDIA Releases Improved Pretraining Dataset: Preserves High-Value Math and Code, Adds Multilingual Data
NVIDIA is doubling down on its commitment to open, high-quality AI with the release of Nemotron-Pre-Training-Dataset-v1, a pretraining dataset comprising 6.6 trillion tokens of premium math, code, and multilingual Q&A data — built from carefully curated, high-signal web content and large-scale synthetic data generation.
Released alongside the NVIDIA Nemotron Nano 2 family of large language models, this dataset isn’t just a research artifact — it’s the very data used to train these leading open models.
The results speak for themselves:
Figure. Accuracy and throughput comparison of Nemotron Nano V2 and Qwen3-8B from the tech report. NVIDIA-Nemotron-Nano-v2-9B achieves comparable or better accuracy on complex reasoning benchmarks while delivering up to 6.3× higher throughput for such workloads. We abbreviate input sequence length as ISL and output sequence length as OSL.
Decomposition
The Nemotron-Pre-Training-Dataset-v1 collection is organized into four core categories, plus a small sample set:
- Nemotron-CC-v2: Follow-up to Nemotron-CC (Su et al., 2025) with eight additional Common Crawl snapshots (2024–2025). The data has undergone global deduplication and synthetic rephrasing using Qwen3-30B-A3B. It also contains synthetic diverse QA pairs translated into 15 languages, supporting robust multilingual reasoning and general knowledge pretraining.
- Nemotron-CC-Math-v1: A 133B-token math-focused dataset derived from Common Crawl using NVIDIA’s Lynx + LLM pipeline, which preserves equations and code formatting while standardizing math content to LaTeX. This keeps critical math and code snippets intact, yielding high-quality pretraining data that outperforms prior math datasets on benchmarks.
- Nemotron-Pretraining-Code-v1: A large-scale curated code dataset sourced from GitHub and filtered through multi-stage deduplication, license enforcement, and heuristic quality checks. It also includes LLM-generated code question–answer pairs in 11 programming languages.
- Nemotron-Pretraining-SFT-v1: A synthetically generated dataset covering STEM, academic, reasoning, and multilingual domains. This includes complex multiple-choice and analytical questions derived from high-quality math and science seeds, graduate-level academic texts, and instruction-tuned SFT data spanning math, code, general QA, and reasoning tasks.
- Nemotron-Pretraining-Dataset-sample: A small sampled version of the collection with 10 representative subsets, offering insight into high-quality QA data, math-focused extractions, code metadata, and SFT-style instruction data. A loading sketch follows below.
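If you just want to poke around, the sample set is the quickest entry point. Below is a minimal sketch, assuming the sample is published on Hugging Face under the repo id `nvidia/Nemotron-Pretraining-Dataset-sample` with each representative subset exposed as a dataset configuration and a standard `train` split; check the dataset card for the exact names.

```python
from datasets import get_dataset_config_names, load_dataset

# Assumed repo id for the sample collection; verify on the dataset card.
REPO = "nvidia/Nemotron-Pretraining-Dataset-sample"

# List the representative subsets (assumed to be exposed as dataset configurations).
configs = get_dataset_config_names(REPO)
print(configs)

# Stream a single record from the first subset without downloading the full files
# (the "train" split name is an assumption; check the card).
subset = load_dataset(REPO, configs[0], split="train", streaming=True)
print(next(iter(subset)))
```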
Token distribution
| Dataset Category | Token Count (B) |
|---|---|
| English Common Crawl | 3360.1 |
| English Synthetic CC | 1257.1 |
| Diverse QA | 692.4 |
| Translated Diverse QA | 558.1 |
| Math | 206.3 |
| Math SFT | 190.4 |
| Synthetic Code | 175.1 |
| MMLU SFT | 81.7 |
| Code SFT | 58.5 |
| General SFT | 5.8 |
| TOTAL | 6585.4 |
Additionally, we release metadata to reproduce a 747.4B token curated code dataset.
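As a quick sanity check on these numbers, the short sketch below recomputes the total and each category’s share of the token budget; the counts are copied directly from the table, and the recomputed total matches the reported 6585.4B up to rounding.

```python
# Token counts in billions, copied from the table above.
token_counts_b = {
    "English Common Crawl": 3360.1,
    "English Synthetic CC": 1257.1,
    "Diverse QA": 692.4,
    "Translated Diverse QA": 558.1,
    "Math": 206.3,
    "Math SFT": 190.4,
    "Synthetic Code": 175.1,
    "MMLU SFT": 81.7,
    "Code SFT": 58.5,
    "General SFT": 5.8,
}

total_b = sum(token_counts_b.values())
print(f"Total: {total_b:.1f}B tokens")  # matches the reported 6585.4B up to rounding

# Share of each category in the overall token budget.
for name, tokens in sorted(token_counts_b.items(), key=lambda kv: -kv[1]):
    print(f"{name:<22} {tokens:8.1f}B  {100 * tokens / total_b:5.1f}%")
```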
What’s in the Dataset and How Did We Build It?
Math
In building this dataset, we paid special attention to preserving high-value mathematical and code content from Common Crawl — data that is often lost or corrupted in typical pretraining pipelines. Our work (see full details in the math blog post; an illustrative sketch follows the list below) introduces a new extraction process that:
- Correctly renders math equations in multiple formats (MathJax, KaTeX, MathML, LaTeX) using a layout-aware text browser,
- Uses a lightweight LLM pass to clean boilerplate, standardize equations to LaTeX, and fix formatting errors,
- Retains code blocks with full syntax and indentation — instead of flattening them into plain text like many previous datasets.
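The full Lynx + LLM pipeline is described in the math paper. As a flavor of what “standardizing math content to LaTeX” means in practice, here is a minimal, purely illustrative sketch that maps common MathJax-style delimiters to standard LaTeX dollar delimiters; the real pipeline does far more, including rendering MathML and cleaning boilerplate with an LLM pass.

```python
import re

def normalize_math_delimiters(text: str) -> str:
    r"""Illustrative only: rewrite MathJax-style \( \) and \[ \] delimiters as $ / $$."""
    # Display math: \[ ... \]  ->  $$ ... $$
    text = re.sub(r"\\\[(.+?)\\\]", r"$$\1$$", text, flags=re.DOTALL)
    # Inline math: \( ... \)  ->  $ ... $
    text = re.sub(r"\\\((.+?)\\\)", r"$\1$", text, flags=re.DOTALL)
    return text

print(normalize_math_delimiters(r"The roots are \(x = \pm\sqrt{2}\)."))
# -> The roots are $x = \pm\sqrt{2}$.
```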
The result is 133B tokens of math-rich documents in our full corpus, with a 52B-token highest-quality subset. This high-quality set is 5.5× larger than the best previous open math dataset (FineMath-4+). We also regenerated the Nemotron-MIND dataset using our highest-quality subset, resulting in a 73B-token synthetic dataset that consistently improves math reasoning and general knowledge (MMLU, MMLU-Pro, MMLU-STEM) and gains +14.4 points on MATH over the prior MIND version.
Because our pipeline preserves structure instead of stripping it away, we also capture a large incidental set of code snippets — over 4.3M code-containing documents — making the data useful for both mathematical reasoning and code generation. In internal pretraining experiments, models trained with Nemotron-CC-Math data saw:
- +4.8 to +12.6 points on MATH over strongest baselines,
- +4.6 to +14.3 points on MBPP+ for code generation,
- +2 to +5 points on STEM-heavy general knowledge benchmarks like MMLU-STEM.
This means the dataset not only boosts mathematical ability, but also strengthens logical reasoning, coding skills, and general-domain knowledge.
Code
Our curated code dataset comprises 747.4B tokens of GitHub-sourced files that underwent multi-stage deduplication, license enforcement, and heuristic filtering. We are releasing the metadata needed to reproduce this dataset. In addition, we generate and release large-scale synthetic question–answer pairs across 11 programming languages, totaling 175.1B tokens.
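What we release is metadata, not pipeline code, but to make the curation stages concrete, here is a minimal sketch of two of them: exact-hash deduplication and a simple heuristic filter. The record format (a dict with a "content" field) and the thresholds are assumptions made for illustration; the actual pipeline also includes fuzzy deduplication and license enforcement.

```python
import hashlib

def dedup_and_filter(files, max_line_len=1000, min_lines=3):
    """Minimal sketch: exact-hash dedup plus a crude heuristic quality filter.

    `files` is assumed to be an iterable of dicts with a "content" field.
    """
    seen = set()
    for record in files:
        content = record["content"]
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)

        lines = content.splitlines()
        if len(lines) < min_lines:
            continue  # too short to be a useful source file
        if max(len(line) for line in lines) > max_line_len:
            continue  # extremely long lines often indicate minified or generated code
        yield record
```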
Diverse QA & Multilingual
We generated high-quality multilingual question–answer data from two main sources. First, we translated our English Diverse QA dataset into 15 languages using Qwen3-30B-A3B, ensuring accurate linguistic and contextual alignment. Second, we generated synthetic QA pairs directly in these languages from Wikipedia articles, prompting the model to write both questions and answers in the target language. Additionally, a subset of our GSM8K STEM augmentation data was translated, with each solution post-processed to append a clear concluding sentence indicating the final answer (e.g., “La respuesta es …” in Spanish, “Die Antwort lautet …” in German).
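As a concrete illustration of that post-processing step, the sketch below appends a language-specific concluding sentence stating the final answer. The phrase table and function are assumptions made for this example, not the exact templates used to build the dataset.

```python
# Assumed phrase templates for the concluding sentence (illustrative only).
CONCLUSIONS = {
    "es": "La respuesta es {answer}.",
    "de": "Die Antwort lautet {answer}.",
}

def append_conclusion(solution: str, answer: str, lang: str) -> str:
    """Append a clear, language-specific final-answer sentence to a translated solution."""
    return solution.rstrip() + " " + CONCLUSIONS[lang].format(answer=answer)

print(append_conclusion("Sumando ambos términos se obtiene 2 + 3 = 5.", "5", "es"))
# -> Sumando ambos términos se obtiene 2 + 3 = 5. La respuesta es 5.
```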
This multilingual pipeline provides broad linguistic coverage and strong problem-solving focus. In our ablation studies, including this translated diverse QA data boosted average Global-MMLU accuracy to 47.0, compared to 37.0 when using only multilingual Common Crawl data.
SFT Data
We included synthetically generated SFT-style data to strengthen model reasoning, code generation, and instruction-following abilities. This covers:
- Code SFT: solving programming problems across multiple languages.
- Math SFT: complex reasoning and problem-solving.
- MMLU-style SFT: diverse question–answer examples across knowledge domains.
- General instruction SFT: broad instruction-following tasks.
The data spans multiple difficulty levels and topics, ensuring comprehensive pretraining that complements our STEM, academic, and multilingual datasets.
Data Examples
Example: Our pipeline preserves both math and code, unlike prior pretraining datasets that often lose or corrupt math equations.
How to Use It
All the datasets in the collection can be accessed using the 🤗 Datasets library. For example:

```python
from datasets import load_dataset

# Stream the "4plus" subset of Nemotron-CC-Math-v1 (streaming avoids downloading the full dataset up front).
ds = load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus", streaming=True)
```
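Because `streaming=True` returns an iterable dataset, you can peek at a few records without downloading the full files. The snippet below assumes the standard `train` split; check the dataset card for the splits actually published.

```python
# Iterate over a few streamed records (assumes a "train" split).
for i, example in enumerate(ds["train"]):
    print(example)
    if i == 2:
        break
```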
👉 Explore the dataset collection here and get in touch to explore enterprise or research use cases.
References
👉 For more details, please see the following paper and technical report.
- Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
- NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Contributors
Nemotron-CC-Math-v1. Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye.
Nemotron-CC-v2. Ying Lin, Dan Su, Kezhi Kong, Joseph Jennings, Brandon Norick, Arham Mehta, Ayush Dattagupta, Ranjit Rajan, Sarah Yurick, Vineeth Kalluru, Markus Kliegl.
Nemotron-Pretraining-Code-v1. Brandon Norick, Joseph Jennings, Miguel Martinez, Vitaly Kurin, Rabeeh Karimi Mahabadi.
Nemotron-Pretraining-SFT-v1. Abhinav Khattar, Aleksander Ficek, Brandon Norick, Dan Su, Daria Gitman, Evelina Bakhturina, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jane Polak Scowcroft, Jocelyn Huang, Joseph Jennings, Jupinder Parmar, Markus Kliegl, Matvei Novikov, Mehrzad Samadi, Miguel Martinez, Pavlo Molchanov, Pritam Gundecha, Rabeeh Karimi Mahabadi, Rima Shahbazyan, Sanjeev Satheesh, Sean Narenthiran, Seungju Han, Shizhe Diao, Shrimai Prabhumoye, Shubham Toshniwal, Siddhartha Jain, Somshubra Majumdar, Syeda Nahida Akter, Vahid Noroozi, Vitaly Kurin, Wasi Uddin Ahmad, Wei Du, Ximing Lu, Yejin Choi, Ying Lin.
Legal and Compliance. Barnaby Simkin, Dina Yared, Iain Cunningham, Katherine Cheung, Laya Sleiman, Meredith Price, Michael Boone, Nikki Pope, Saori Kaji.
Project Management. Amy Shen, Ann Guan, Ashton Sharabiani, Krzysztof Pawelec, Negar Habibi, Twinkle Vashishth.
Leadership. Jane Polak Scowcroft, Boris Ginsburg, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro.