AutoMathText-2.5: A Foundational High-Quality STEM Training Dataset

Yifan Zhang; Team Math-AI

Overview

AutoMathText-2.5 consists of more than 2 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual data. It is designed as a broad STEM-oriented training corpus for large language models.

The dataset draws on more than 50 data sources and applies semantic deduplication, contamination detection, and text cleaning to improve training data quality across diverse domains.

2T+ Tokens

50+ Data Sources

7.11 TB Total Size

2 Languages

Dataset

Hugging Face

The release is hosted at math-ai/AutoMathText-2.5.
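Because the release is several terabytes in size, streaming a few records is a practical way to inspect it before committing to a full download. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed; the `peek` helper, its `loader` parameter, and the default `split="train"` are illustrative assumptions, and the configuration and split names actually published on the dataset card should be checked before use.

```python
from itertools import islice

# Repository id taken from the release page.
REPO_ID = "math-ai/AutoMathText-2.5"


def peek(repo_id=REPO_ID, n=3, config=None, split="train", loader=None):
    """Return the first `n` records of a streamed split.

    `config` and `split` are assumptions: consult the dataset card for the
    names that actually exist. `loader` is a hypothetical hook that lets the
    function be exercised offline with a stand-in for `load_dataset`.
    """
    if loader is None:
        # Imported lazily so the helper can be tested without the library.
        from datasets import load_dataset
        loader = load_dataset
    # streaming=True yields records lazily instead of downloading every file.
    stream = loader(repo_id, config, split=split, streaming=True)
    return list(islice(stream, n))


# Example call (requires network access and a valid config/split):
#   for record in peek(n=2):
#       print(record.keys())
```

Streaming keeps local disk usage low, at the cost of sequential access; a full download may still be preferable for repeated training runs.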

Tasks

The dataset card lists text generation and question answering as tasks, with tags covering LLM training, pretraining, finetuning, midtraining, reasoning, and STEM.

Modalities

Text-first dataset with English and Chinese content. The Hugging Face card lists the dataset size range as 10B<n<100B and the total file size as 7.11 TB.

Paper

Curation Pipeline

The dataset card describes a three-tier deduplication pipeline and AI-powered quality assessment. The release emphasizes semantic deduplication, contamination detection, and text cleaning as central processing stages.

01 Deduplicate

02 Detect Contamination

03 Clean Text

04 Quality Score
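To make the four stages concrete, here is a toy sketch of how such a pipeline could be chained. It is not the authors' implementation: exact hashing of normalized text stands in for the three-tier semantic deduplication, word n-gram overlap stands in for contamination detection, and a character-ratio heuristic stands in for the AI-powered quality assessment. All function names, the n-gram length, and the threshold are assumptions chosen for illustration.

```python
import hashlib
import re
import unicodedata


def fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-collapsed text (exact-match dedup only)."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def word_ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(text: str, benchmark_ngrams: set, n: int = 8) -> bool:
    """Flag a document that shares any n-gram with the benchmark items."""
    return not word_ngrams(text, n).isdisjoint(benchmark_ngrams)


def clean_text(text: str) -> str:
    """Normalize Unicode, drop control characters, collapse extra whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def quality_score(text: str) -> float:
    """Toy heuristic: share of characters that are alphanumeric or whitespace."""
    if not text:
        return 0.0
    return sum(ch.isalnum() or ch.isspace() for ch in text) / len(text)


def curate(docs, benchmark_items, n=8, min_quality=0.8):
    """Run the stages in the order listed: dedup, contamination, clean, score."""
    benchmark_ngrams = set()
    for item in benchmark_items:
        benchmark_ngrams |= word_ngrams(item, n)

    seen, kept = set(), []
    for raw in docs:
        fp = fingerprint(raw)
        if fp in seen:                                    # 01 Deduplicate
            continue
        seen.add(fp)
        if is_contaminated(raw, benchmark_ngrams, n):     # 02 Detect Contamination
            continue
        text = clean_text(raw)                            # 03 Clean Text
        if quality_score(text) < min_quality:             # 04 Quality Score
            continue
        kept.append(text)
    return kept


docs = [
    "The  derivative of x^2 is 2x.",
    "the derivative of x^2 is 2x.",   # duplicate after normalization
    "What is the sum of the first ten positive odd integers today",
    "@@@@ #### $$$$ %%%%",            # mostly symbols, low score
]
benchmark = ["What is the sum of the first ten positive odd integers"]
print(curate(docs, benchmark))  # ['The derivative of x^2 is 2x.']
```

A production pipeline would typically replace each stand-in with a stronger method, for example embedding-based near-duplicate detection and a model-based quality classifier, while keeping the same filter-then-keep structure.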

License

AutoMathText-2.5 is released under the AutoMathText Data Agreement for Model Training. Review the full LICENSE before using the dataset.

Use Scope

The agreement makes the dataset available for internal training of AI solutions with facts, ideas, patterns, and correlations, subject to the restrictions in the license.

Citation

If you use AutoMathText-2.5, please cite the dataset and related paper:

@misc{automathtext_2_5,
  title     = {AutoMathText-2.5: A Foundational High-Quality STEM Training Dataset},
  author    = {Zhang, Yifan and Math-AI Team},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/math-ai/AutoMathText-2.5}
}

@article{zhang2025autonomous,
  title   = {Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts},
  author  = {Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew C},
  journal = {Findings of the Association for Computational Linguistics: ACL 2025},
  year    = {2025}
}
