Overview
AutoMathText-2.5 consists of more than 2 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual data. It is designed as a broad STEM-oriented training corpus for large language models.
The dataset combines more than 50 premium data sources and applies semantic deduplication, contamination detection, and text cleaning to improve training data quality across diverse domains.
Dataset
Hugging Face
The release is hosted at math-ai/AutoMathText-2.5.
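As a rough illustration of how a release hosted under that repository id might be sampled, the sketch below streams a few records with the Hugging Face `datasets` library rather than downloading the full 7.11 TB. The helper names `dataset_url` and `stream_examples` are hypothetical, and the `config_name` and `split` values are assumptions; the card should be consulted for the actual subset and split names.

```python
from typing import Iterator, Optional

REPO_ID = "math-ai/AutoMathText-2.5"  # repository id as stated on the card


def dataset_url(repo_id: str = REPO_ID) -> str:
    """Build the Hugging Face dataset page URL for a repository id."""
    return f"https://huggingface.co/datasets/{repo_id}"


def stream_examples(
    config_name: Optional[str] = None,  # ASSUMPTION: the real subset names are not given here
    split: str = "train",               # ASSUMPTION: the split name may differ
    limit: int = 3,
) -> Iterator[dict]:
    """Yield up to `limit` records in streaming mode (no full download)."""
    # Imported lazily so the URL helper works without the third-party package.
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset(REPO_ID, name=config_name, split=split, streaming=True)
    for i, example in enumerate(ds):
        if i >= limit:
            break
        yield example
```

Streaming is used here because the listed file size makes a full local copy impractical for a quick inspection; iterating over `stream_examples()` requires network access and, possibly, acceptance of the license on the Hugging Face page.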
Tasks
Text generation and question answering, with tags covering LLM training, pretraining, finetuning, midtraining, reasoning, and STEM.
Modalities
Text-first dataset with English and Chinese content. The Hugging Face card lists the dataset size range as 10B<n<100B and the total file size as 7.11 TB.
Paper
The related paper is AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts.
Curation Pipeline
The dataset card describes a three-tier deduplication pipeline and AI-powered quality assessment. The release emphasizes semantic deduplication, contamination detection, and text cleaning as central processing stages.
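The card does not spell out what the three tiers are, so the following is only a minimal sketch of how such stages are commonly built, not the release's actual pipeline. It assumes an exact-hash tier, a near-duplicate tier based on Jaccard similarity of word trigrams, and an n-gram overlap check for contamination; a semantic tier would need an embedding model and is omitted. All function names and thresholds are hypothetical.

```python
import hashlib
import re
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: Iterable[str], near_dup_threshold: float = 0.8) -> List[str]:
    """Keep the first occurrence of each document, dropping later duplicates.

    Tier 1 (assumed): exact match on a hash of the normalized text.
    Tier 2 (assumed): near-duplicate if trigram Jaccard >= threshold.
    The pairwise comparison is quadratic; a real pipeline would use
    something like MinHash with locality-sensitive hashing instead.
    """
    seen_hashes: Set[str] = set()
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        sh = shingles(doc)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(sh)
    return kept


def is_contaminated(doc: str, benchmark_items: Iterable[str], n: int = 8) -> bool:
    """Flag a document that shares any word n-gram with a benchmark item."""
    doc_grams = shingles(doc, n)
    return any(doc_grams & shingles(item, n) for item in benchmark_items)
```

In this sketch, exact hashing runs first because it is cheap and removes most repeats before the more expensive similarity comparison. The 8-gram window for contamination is an illustrative choice: shorter windows flag more false positives, longer ones miss paraphrased benchmark items.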
License
AutoMathText-2.5 is released under the AutoMathText Data Agreement for Model Training. Review the full LICENSE before using the dataset.
Use Scope
The agreement permits internal use of the dataset to train AI solutions with the facts, ideas, patterns, and correlations it contains, subject to the restrictions in the license.
Citation
If you use AutoMathText-2.5, please cite the dataset and related paper:
@misc{automathtext_2_5,
  title     = {AutoMathText-2.5: A Foundational High-Quality STEM Training Dataset},
  author    = {Zhang, Yifan and Math-AI Team},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/math-ai/AutoMathText-2.5}
}
@article{zhang2025autonomous,
  title   = {Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts},
  author  = {Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew C},
  journal = {Findings of the Association for Computational Linguistics: ACL 2025},
  year    = {2025}
}