Math-AI Dataset

AutoMathText-2.5

A foundational high-quality STEM training dataset.
Over 2 trillion tokens of deduplicated web, mathematics, code, reasoning, and bilingual text for language-model pretraining, midtraining, and finetuning.

Yifan Zhang and Team Math-AI
2026  ·  7.11 TB  ·  Hugging Face Dataset
STEM  ·  Reasoning  ·  Pretraining  ·  English + Chinese

Overview

AutoMathText-2.5 consists of more than 2 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual data. It is designed as a broad STEM-oriented training corpus for large language models.

The dataset draws on more than 50 data sources and applies semantic deduplication, contamination detection, and text cleaning to raise training data quality across diverse domains.

2T+ Tokens
50+ Data Sources
7.11 TB Total Size
2 Languages

Dataset

Hugging Face

The release is hosted at math-ai/AutoMathText-2.5.

Tasks

Text generation and question answering, with tags covering LLM training, pretraining, finetuning, midtraining, reasoning, and STEM.

Modalities

Text-first dataset with English and Chinese content. The Hugging Face card lists the dataset size range as 10B<n<100B and the total file size as 7.11 TB.

Curation Pipeline

The dataset card describes a three-tier deduplication pipeline and AI-powered quality assessment. The release emphasizes semantic deduplication, contamination detection, and text cleaning as central processing stages.

01 Deduplicate
02 Detect Contamination
03 Clean Text
04 Quality Score
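
The four stages above can be illustrated with a toy sketch. This is not the pipeline used to build AutoMathText-2.5: it swaps in exact hashing for semantic deduplication, word n-gram overlap for contamination detection, and a character-ratio heuristic for the AI-powered quality assessment. All function names, the n-gram length, and the score threshold are assumptions chosen for illustration.

```python
import hashlib
import re


def dedup_key(text: str) -> str:
    """Hash of the lowercased, whitespace-normalized text (exact dedup only)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def word_ngrams(text: str, n: int = 5) -> set:
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(text: str, benchmark_ngrams: set, n: int = 5) -> bool:
    """True if the text shares at least one n-gram with the benchmark set."""
    return not word_ngrams(text, n).isdisjoint(benchmark_ngrams)


def clean_text(text: str) -> str:
    """Drop control characters (except tab/newline/CR) and collapse whitespace."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def quality_score(text: str) -> float:
    """Toy stand-in for a learned scorer: share of letters and spaces."""
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)


def curate(docs, benchmark_items, n: int = 5, min_score: float = 0.7):
    """Run the four stages in order and return the surviving documents."""
    benchmark_ngrams = set()
    for item in benchmark_items:
        benchmark_ngrams |= word_ngrams(item, n)

    seen, kept = set(), []
    for doc in docs:
        key = dedup_key(doc)                      # 01 Deduplicate
        if key in seen:
            continue
        seen.add(key)
        if is_contaminated(doc, benchmark_ngrams, n):  # 02 Detect Contamination
            continue
        cleaned = clean_text(doc)                 # 03 Clean Text
        if quality_score(cleaned) >= min_score:   # 04 Quality Score
            kept.append(cleaned)
    return kept


if __name__ == "__main__":
    docs = [
        "The  derivative of x^2 is 2x.",
        "the derivative of x^2 is 2x.",
        "What is the sum of two and three? Five.",
        "@@@@ #### $$$$ 1234",
    ]
    print(curate(docs, ["What is the sum of two and three?"]))
```

At corpus scale, exact hashing would typically be complemented by near-duplicate and embedding-based methods, and the heuristic score by a model-based classifier; the sketch only shows how the stages compose.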

License

AutoMathText-2.5 is released under the AutoMathText Data Agreement for Model Training. Review the full LICENSE before using the dataset.

Use Scope

The agreement makes the dataset available for internal training of AI solutions with facts, ideas, patterns, and correlations, subject to the restrictions in the license.

Citation

If you use AutoMathText-2.5, please cite the dataset and related paper:

@misc{automathtext_2_5,
  title     = {AutoMathText-2.5: A Foundational High-Quality STEM Training Dataset},
  author    = {Zhang, Yifan and Math-AI Team},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/math-ai/AutoMathText-2.5}
}

@article{zhang2025autonomous,
  title   = {Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts},
  author  = {Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew C},
  journal = {Findings of the Association for Computational Linguistics: ACL 2025},
  year    = {2025}
}