Top 8 Chinese Corpus Datasets for Large Models in 2026: Essentials for Training High-Quality Chinese Models

📅 2026-05-18 11:18:58 👤 DouWen Editorial

In 2026, the core competitiveness of large models has shifted from parameter scale to corpus quality. With the same training algorithm, a model trained on high-quality native Chinese corpora performs far better in Chinese scenarios than the versions overseas vendors train on translated data. High-quality public Chinese corpora are relatively scarce, however, and collecting and cleaning them is technically difficult. This article surveys 8 major open or semi-open Chinese datasets available in 2026 and explains their sources, scale, characteristics, and how to obtain them.


Why is Chinese corpus harder to obtain than English?

There are four reasons.

First, there are fewer public web pages. English accounts for about 41% of Common Crawl, while Chinese accounts for only about 5%. Crawling Chinese websites directly is constrained by anti-crawling measures and compliance issues.

Second, high-quality content is concentrated on closed platforms. WeChat official accounts, Zhihu, Xiaohongshu, and Douyin hold a large amount of high-quality content, but their open interfaces are limited.

Third, the digitization of traditional publications lags behind. About 12% of Chinese books have been digitized, far below the 58% for English books, and the National Library of China's digitization project is progressing slowly.

Fourth, cleaning is expensive. Chinese word segmentation, Traditional-Simplified conversion, and the mix of spoken and written registers all make preprocessing harder. Cleaning at the same scale takes about 30% more compute for Chinese than for English.

For these reasons, Chinese LLM companies in 2026 must either crawl data themselves or pay for it, which makes open-source datasets the most economical starting point.

WuDaoCorpora 2.0, the flagship of Beijing Zhiyuan

WuDaoCorpora is a Chinese corpus first released by the Beijing Academy of Artificial Intelligence (BAAI, also known as Zhiyuan) in 2021. It was upgraded to version 2.0 in 2024, and a 3.0 beta launched in early 2026.

Size and composition. Version 3.0 totals 4TB raw, of which about 800GB of high-quality text is retained after cleaning. Sources include Baidu Encyclopedia, selected Zhihu Q&A, official media such as Xinhua News Agency, excerpts from literary websites, and abstracts of academic papers.

Features. Chinese accounts for 95% and English for 5%. Text lengths are evenly distributed, from short sentences to long articles. Cleaning filters out ads, duplicate content, and low-quality short text. Each record comes with its source domain and a timestamp.

Access. Apply on the Zhiyuan official website, baai.ac.cn; it is free for academic users. Commercial users need to sign a license agreement, with the price negotiated based on scale.

Applicable scenarios. Pre-training general-purpose base models. WuDaoCorpora is one of the core training datasets for the domestic GLM, WuDao, and ChatGLM model series.

Open corpus aggregated by the OpenCSG data community

OpenCSG is a Chinese counterpart to Hugging Face that emerged in 2024. By 2026 it aggregates more than 60,000 datasets, of which 38% are Chinese.

Size and composition. The Chinese datasets include the ChineseWebText series, Chinese Fineweb, Chinese Wikipedia, and vertical corpora in professional fields such as medicine, law, and finance.

Features. The platform is community-run, and each dataset has a quality score and download statistics. Datasets with more than 100,000 downloads are generally considered high quality. Many datasets come with metadata such as token count, document count, and average length, which makes it easier to estimate pre-training compute.
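To illustrate how a published token count feeds a compute estimate, here is a minimal sketch. It assumes the widely used rule of thumb that training costs roughly 6 FLOPs per parameter per token; that rule, the function names, and the hardware figures are assumptions for illustration and do not come from OpenCSG.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def gpu_days(flops: float, peak_flops_per_gpu: float, utilization: float) -> float:
    """Convert total FLOPs into GPU-days at an assumed sustained utilization."""
    seconds = flops / (peak_flops_per_gpu * utilization)
    return seconds / 86_400


# Example: a 7B-parameter model trained once over a 340B-token dataset.
flops = training_flops(7e9, 340e9)
# Hypothetical accelerator: 300 TFLOP/s peak, 40% sustained utilization.
days = gpu_days(flops, 300e12, 0.40)
print(f"{flops:.3e} FLOPs, about {days:.0f} GPU-days")
```

Swapping in the token count from a dataset card gives a quick first-order budget before committing to a download.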

Access. Register an account at opencsg.com; most datasets can be downloaded for free. Large datasets for commercial use require payment, priced by token count at about 5,000 yuan per 1B tokens.

Applicable scenarios. The first place to look for Chinese corpora in specific fields. OpenCSG offers more diversity than any single-source dataset.

SkyPile-150B, Skywork's open large-scale web corpus

SkyPile is an open-source Chinese web corpus released by Kunlun Skywork in 2024 and expanded to 280B tokens in 2026.

Size and composition. 280B Chinese tokens, entirely from public web crawling. Skywork's in-house crawler, SkySpider, crawls more than 70,000 Chinese websites; pages are kept only after ad removal, deduplication, and filtering.

Features. Covers news, blogs, forums, e-commerce product descriptions, and knowledge bases. Diversity is high but quality varies. Skywork provides a quality score field so users can filter.

Access. Fully open source on Hugging Face under the Apache 2.0 license. Download the sharded jsonl files and use them directly.

Applicable scenarios. A low-cost starting point for pre-training Chinese large models. If your budget is tight and you cannot afford to buy high-quality data, SkyPile is the cheapest entry-level option.

Limitations. Quality is lower than WuDaoCorpora, so a second cleaning pass is needed. News content carries copyright risk; although Skywork has filtered it, users still need to be cautious.
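A second cleaning pass over jsonl shards can start with a simple filter on the quality score. The sketch below is illustrative only: the field name quality_score, the threshold, and the sample records are assumptions, so check the actual schema of the shards you download.

```python
import json
from typing import Iterable, Iterator


def filter_by_quality(lines: Iterable[str], min_score: float,
                      score_key: str = "quality_score") -> Iterator[dict]:
    """Yield parsed JSONL records whose score is at least min_score."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        score = record.get(score_key)
        # Records without a numeric score are dropped rather than guessed at.
        if isinstance(score, (int, float)) and score >= min_score:
            yield record


# Made-up records standing in for one shard.
sample = [
    '{"text": "一篇较完整的新闻报道。", "quality_score": 0.91}',
    '{"text": "点击购买", "quality_score": 0.12}',
    '{"text": "没有分数的记录"}',
]
kept = list(filter_by_quality(sample, min_score=0.5))
print(len(kept))
```

For a real shard, pass an open file handle (opened with encoding="utf-8") instead of the sample list, so records stream through without loading the whole file.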

ChineseWebText 2.0, a curated selection from the Chinese Academy of Sciences

ChineseWebText is a Chinese web corpus released by the Institute of Automation, Chinese Academy of Sciences, in 2023; version 2.0 went online in 2024.

Size and composition. 2.4TB of raw web pages, 1.4TB after cleaning, or about 600B Chinese tokens. Sources are the Chinese portion of Common Crawl plus supplementary in-house crawling.

Features. Every article is scored and assigned to one of five quality tiers, from high to low. Research shows that a model trained on the high-tier data scores 8 points higher on CMMLU than one trained on the mixed data.

Access. Free from the Zhiyuan (BAAI) mirror and synced to Hugging Face, under Apache 2.0.

Applicable scenarios. Mainly teaching and research. The quality tiers in ChineseWebText make it good material for studying the relationship between corpus quality and model performance.

Chinese Fineweb, a Chinese replica of the Fineweb approach

Fineweb is a high-quality English corpus open-sourced by Hugging Face in 2024. Chinese Fineweb is an open-source replica built by the Chinese community in 2025, following the same approach.

Size and composition. 340B Chinese tokens, drawn from multiple Common Crawl snapshots. The cleaning method follows Fineweb's pipeline of C4-style filtering plus deduplication.

Features. Deduplication is the most thorough of the datasets covered here, and the corpus has more long articles and fewer short snippets, which makes it suitable for training long-context models.

Access. Available as a subset of PleIAs/CommonCorpus on Hugging Face, free and open source.

Applicable scenarios. Curated, academic-leaning content in the style of Fineweb-Edu. It complements SkyPile: Chinese Fineweb focuses on in-depth long articles, while SkyPile focuses on breadth of coverage.

CCI3, Zhiyuan's high-quality Chinese instruction fine-tuning data

CCI3 is a Chinese instruction dataset released by Zhiyuan in 2025 and expanded to v3.5 in 2026.

Size and composition. About 1.2M instruction records covering question answering, writing, rewriting, reasoning, coding, mathematics, and other tasks. Each record is a manually or semi-manually labeled "instruction + output" pair.

Features. Quality is higher than early datasets such as Alpaca-Chinese, thanks to manual review and back-translation verification. Instruction lengths are consistent, and outputs average about 500 words.

Access. Apply on the BAAI official website. Some subsets are publicly available on Hugging Face, and the data is free for academic use.

Applicable scenarios. The SFT stage. If you have already pre-trained a base model and want to do instruction tuning, CCI3 is one of the strongest open datasets in China.

MOSS Chinese instruction data, Fudan's open-source effort

MOSS is a Chinese ChatGPT alternative released by Fudan University in 2023, and its instruction data is also open source.

Size and composition. About 110M tokens of Chinese instruction dialogue data. It contains multi-turn dialogue samples covering role-playing, knowledge Q&A, tool use, and other scenarios.

Features. Multi-turn conversations make up a large share, so the data suits training chat models rather than pure instruction followers. Quality is slightly lower than CCI3, but the data is fully open source under a CC-BY license, which is commercially friendly.

Access. The OpenLMLab/MOSS repository on GitHub, with a Hugging Face mirror.

Applicable scenarios. Starter data for instruction fine-tuning of commercial chat models. The permissive license is an advantage.

CMMLU and C-Eval, the Chinese evaluation datasets

Although they are not training data, CMMLU and C-Eval are must-run evaluation benchmarks for Chinese models in 2026.

CMMLU was released in 2023. It covers 67 subjects across the humanities, social sciences, science and engineering, medicine, and law, with 11,528 multiple-choice questions.

C-Eval was released in 2023 by a team including Tsinghua University. It covers 52 subjects, mainly exam questions at secondary-school level and above, with 13,948 questions.

Access. Both are fully open source on Hugging Face.

Value. These two datasets are the yardstick for judging the quality of Chinese models. Chinese models released in 2026 publish CMMLU and C-Eval scores as a matter of course.
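Both benchmarks are multiple-choice, so the headline score is plain accuracy. A minimal sketch of that scoring step follows; the function name and the sample answers are made up for illustration, and real evaluation harnesses also handle prompting and answer extraction, which are omitted here.

```python
def multiple_choice_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Share of questions where the predicted letter matches the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)


gold = ["A", "C", "B", "D"]        # made-up gold answers
predicted = ["A", "c", "D", "D"]   # made-up model outputs
print(multiple_choice_accuracy(predicted, gold))
```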

Comparison of the 8 datasets

Size. WuDaoCorpora 3.0: 800GB after cleaning. OpenCSG collection: multiple datasets totaling 3TB+. SkyPile: 280B tokens. ChineseWebText: 600B tokens. Chinese Fineweb: 340B tokens. CCI3: 1.2M instructions. MOSS: 110M tokens. CMMLU/C-Eval: about 25k evaluation questions.

Quality. WuDaoCorpora: highest, because of manual cleaning. CCI3: highest, because of manual annotation. Chinese Fineweb: high, because of strict deduplication. ChineseWebText: medium to high, with five tiers. OpenCSG: varies by dataset. SkyPile: low to medium, as raw web crawl. MOSS: medium, with early-stage annotation.

Commercial license. SkyPile, Chinese Fineweb, and MOSS are fully open source and commercially friendly. WuDaoCorpora and CCI3 require authorization. ChineseWebText is permissive for academic use but calls for caution in commercial use. CMMLU/C-Eval are free for evaluation.

Suitable stage. WuDaoCorpora, SkyPile, Chinese Fineweb, and ChineseWebText are the mainstays for pre-training. CCI3 and MOSS are the mainstays for SFT. CMMLU and C-Eval are required for evaluation.

Practical suggestions for collecting Chinese corpus yourself

If the open data is not enough and you crawl your own, pay attention to 4 things.

First, choose compliant sources. Chinese Wikipedia is fully open, Weibo's public API is limited, and Zhihu's public Q&A can be crawled as long as robots.txt is respected. Commercial platforms such as WeChat and Douyin must not be crawled for commercial use.
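Respecting robots.txt can be automated with Python's standard library. The sketch below parses rules that have already been downloaded; the rules, the bot name my-corpus-bot, and the example.com URLs are hypothetical and not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, not taken from any real site.
example_rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = build_parser(example_rules)
allowed = parser.can_fetch("my-corpus-bot", "https://example.com/question/123")
blocked = parser.can_fetch("my-corpus-bot", "https://example.com/private/data")
print(allowed, blocked)
```

A crawler would run this check before every request and skip any URL for which can_fetch returns False. Passing robots.txt is a technical courtesy, not a substitute for checking a site's terms of use.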

Second, diversity. Don't crawl just one domain. A mix of 30% news, 25% forums, 25% encyclopedias, 10% literature, and 10% long articles gives the model the strongest generalization.
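The mix above translates directly into per-source token targets. A minimal sketch, assuming a hypothetical 100B-token crawl; the total and the function name are illustrative only.

```python
MIX = {
    "news": 0.30,
    "forums": 0.25,
    "encyclopedias": 0.25,
    "literature": 0.10,
    "long_articles": 0.10,
}


def token_budget(total_tokens: int, mix: dict[str, float]) -> dict[str, int]:
    """Split a total token budget across sources according to the mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix weights must sum to 1")
    return {source: round(total_tokens * share) for source, share in mix.items()}


budget = token_budget(100_000_000_000, MIX)  # hypothetical 100B-token crawl
for source, tokens in budget.items():
    print(f"{source:14s} {tokens / 1e9:.0f}B tokens")
```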

Third, deduplicate strictly. Use MinHash or SimHash to remove duplicates; document-level deduplication is more effective than paragraph-level. Models trained on duplicated data overfit to certain expressions.
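To make the MinHash idea concrete, here is a small pure-Python sketch of document-level near-duplicate detection. It uses character 5-grams, which avoids Chinese word segmentation; the shingle size, the 64 hash functions, and the sample sentences are assumptions for illustration. Production pipelines usually add locality-sensitive hashing so that not every pair of documents has to be compared.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams; works for Chinese without word segmentation."""
    text = "".join(text.split())  # drop whitespace before shingling
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64, k: int = 5) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "人工智能正在改变各行各业的工作方式,尤其是在内容生产领域。"
doc_b = "人工智能正在改变各行各业的工作方式,尤其是在内容创作领域。"
doc_c = "今天的天气非常适合去公园散步,顺便买一杯咖啡。"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

The near-duplicate pair should score well above the unrelated pair; a pipeline would drop one document from any pair above a chosen threshold.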

Fourth, quality scoring. Use a small classifier to score each article and train only on the top 50%. Research shows that, at the same compute budget, training on a high-quality subset beats training on the full set.
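Once every article has a score, keeping the top half is a sort and a slice. In this sketch the scores are made up and the score field name is an assumption; in practice they would come from the small classifier described above.

```python
def top_fraction(docs: list[dict], fraction: float = 0.5,
                 score_key: str = "score") -> list[dict]:
    """Return the highest-scoring fraction of documents, best first."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(docs, key=lambda d: d[score_key], reverse=True)
    keep = max(1, int(len(ranked) * fraction)) if ranked else 0
    return ranked[:keep]


# Scores would come from a small quality classifier; these are made up.
docs = [
    {"id": "a", "score": 0.92},
    {"id": "b", "score": 0.15},
    {"id": "c", "score": 0.67},
    {"id": "d", "score": 0.40},
]
selected = top_fraction(docs)
print([d["id"] for d in selected])
```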

FAQ

How many tokens are needed to train a 7B-parameter Chinese model?

As a rule of thumb in the industry, a 7B-parameter model needs at least 2T training tokens to be competitive, and more than 3T works better. Chinese tokens should make up more than 70% for the model to perform well in Chinese scenarios. Using only SkyPile and Chinese Fineweb gives about 620B Chinese tokens, enough for one epoch on a 7B model. Adding wiki and news supplements can bring the total to 1.5T, which is basically enough. Large commercial vendors usually train on 5T tokens.
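The arithmetic behind these figures is easy to check. The sketch below uses only numbers quoted in this article (2T tokens, a 70% Chinese share, 280B for SkyPile, 340B for Chinese Fineweb); the variable names are illustrative.

```python
PARAMS = 7e9                 # 7B parameters
TARGET_TOKENS = 2e12         # 2T-token floor quoted above
CHINESE_SHARE = 0.70         # recommended minimum Chinese share
SKYPILE = 280e9
CHINESE_FINEWEB = 340e9

tokens_per_param = TARGET_TOKENS / PARAMS
chinese_needed = TARGET_TOKENS * CHINESE_SHARE
open_chinese = SKYPILE + CHINESE_FINEWEB
shortfall = chinese_needed - open_chinese

print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"Chinese tokens needed: {chinese_needed / 1e9:.0f}B")
print(f"SkyPile + Chinese Fineweb: {open_chinese / 1e9:.0f}B")
print(f"shortfall: {shortfall / 1e9:.0f}B")
```

At a 2T target with 70% Chinese, the two datasets alone leave a gap of several hundred billion Chinese tokens, which is why the supplements mentioned above matter.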

Are these datasets enough to train a Chinese model that rivals GPT-4?

No. GPT-4's training data is estimated at more than 15T tokens and includes a large amount of reinforcement learning from human feedback data. A Chinese model trained solely on open-source data tops out at roughly GPT-3.5 level. Catching up with GPT-4 requires proprietary data such as human-written instruction pairs, domain-specific data, and synthetic data augmentation. This is why major domestic companies are willing to pay for dedicated data teams.

How should Traditional and Simplified Chinese be handled in a corpus?

The mainstream approach is to unify everything into Simplified Chinese. OpenCC is the de facto standard tool; in Python, install it with pip install opencc-python-reimplemented. Batch-converting 100GB of text takes about 6 hours on a single core. Alternatively, let the model learn both Simplified and Traditional during training and adapt automatically, but keep Simplified above 90% of the data, otherwise performance in Simplified scenarios declines.
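A minimal sketch of streaming conversion follows. The helper names are hypothetical; OpenCC("t2s") is the Traditional-to-Simplified configuration of the opencc-python-reimplemented package and is imported lazily, so the streaming helper itself has no dependency. The demo at the end uses a two-character stand-in table purely so the sketch runs without the package.

```python
from typing import Callable, Iterable, Iterator


def convert_lines(lines: Iterable[str],
                  convert: Callable[[str], str]) -> Iterator[str]:
    """Apply a converter line by line so large files never sit fully in memory."""
    for line in lines:
        yield convert(line)


def make_t2s_converter() -> Callable[[str], str]:
    """Build a Traditional-to-Simplified converter (needs opencc-python-reimplemented)."""
    from opencc import OpenCC  # lazy import keeps convert_lines dependency-free
    return OpenCC("t2s").convert


# With the package installed, real usage would look like:
#     to_simplified = make_t2s_converter()
#     for out in convert_lines(open_file_handle, to_simplified): ...

# Stand-in converter so the sketch runs without the package installed;
# it maps only two characters and is NOT a substitute for OpenCC.
_DEMO_TABLE = str.maketrans({"體": "体", "語": "语"})
demo = list(convert_lines(["繁體中文", "語料"], lambda s: s.translate(_DEMO_TABLE)))
print(demo)
```

At the quoted rate of 100GB in about 6 hours on one core, throughput works out to roughly 4.6MB per second, so splitting shards across worker processes is the obvious way to shorten the wall-clock time.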

What should you pay attention to when using open source Chinese data sets for commercial projects?

There are 3 legal risk points. First, the original web pages may be copyrighted: Common Crawl's crawling is itself legal, but reusing large sections of text may infringe. Second, license propagation: models fine-tuned on Apache 2.0 data can be used commercially, while models trained on GPL data must be open-sourced. Third, personal data: a dataset may contain personal information, and both GDPR and China's Personal Information Protection Law require de-identification. Have legal counsel review the project before commercial use.

How do I determine whether a dataset suits my task?

Use a 3-step assessment. First, look at the distribution of source domain names and topics: what fields does the dataset mainly cover, and do they align with your target scenario? Second, run small-scale pre-training or fine-tuning on a 1% subset and check whether the model improves on your downstream tasks. Third, look at the results of other projects that used the dataset; if several open-source models have used it and published evaluation scores, you can compare directly.
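The first two steps can be sketched in a few lines. The code below counts documents per source domain and draws a deterministic 1% subset by hashing document ids; the URLs, the doc-N id scheme, and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from urllib.parse import urlparse


def domain_distribution(urls: list[str]) -> Counter:
    """Count documents per source domain."""
    return Counter(urlparse(u).netloc.lower() for u in urls)


def in_sample(doc_id: str, percent: float = 1.0) -> bool:
    """Deterministically keep roughly `percent`% of documents by hashing their id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


urls = [
    "https://news.example.com/a/1",
    "https://news.example.com/a/2",
    "https://forum.example.org/t/9",
]
print(domain_distribution(urls).most_common())

subset = [i for i in range(100_000) if in_sample(f"doc-{i}")]
print(len(subset))
```

Hash-based sampling returns the same subset on every run and on every machine, which keeps small-scale experiments comparable across dataset candidates.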

Source of inspiration: Issue 390 of Ruan Yifeng's "Technology Enthusiasts Weekly" https://www.ruanyifeng.com/blog/2025/08/weekly-issue-390.html

This article comes from DouWen (www.douwen.me). Please keep the attribution when reposting.

Original link: https://douwen.me/archives/1065/
