UltraData: OpenBMB's tiered data stack, from raw web to deep-thinking SFT
Collection: openbmb/ultradata on Hugging Face
Platform: ultradata.openbmb.cn · validated on MiniCPM5-1B
Tagline: Ultra Scale, Ultra Quality, Ultra Coverage.
TL;DR
UltraData isn’t one dataset; it’s a tiered data stack, organized L0 through L4, that mirrors how a corpus should flow from raw scrape to refined supervision. Crucially, the whole pyramid has been validated end-to-end on MiniCPM5-1B rather than being a dump of files nobody trained on. If you’re pretraining or post-training a small-to-mid LLM and want a data recipe with a known-good outcome attached, this is one of the more transparent open releases around.
The L0–L4 idea
The framing is a pipeline, not a pile. Raw data enters at the bottom and gets progressively filtered, rewritten, and refined as it climbs:
- L2 — selected web (general): Ultra-FineWeb, the curated web layer.
- L3 — refined supervision: UltraData-SFT-2605, the core-domain post-training set.
- Specialized verticals (math) run their own L1 → L3 progression alongside the general track.
Thinking in tiers is the useful part: it makes explicit that “web text” and “SFT data” are the same material at different stages of refinement, not separate problems.
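One way to picture the "same material, different stages" idea is a chain of stage functions, each consuming the previous tier's output. The sketch below is purely illustrative: the L2 and L3 labels follow the list above, but the `Doc` type, the `"raw"` label, the stage functions, and the word-count threshold are hypothetical stand-ins, not OpenBMB's actual pipeline.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Doc:
    text: str
    tier: str  # stage label only; "raw" is an assumed name for unprocessed input


Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def select_web(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Toy stand-in for an L2-style selection step (hypothetical rule)."""
    for d in docs:
        # Keep only documents with at least five whitespace-separated words.
        if len(d.text.split()) >= 5:
            yield replace(d, tier="L2")


def refine(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Toy stand-in for an L3-style refinement step (hypothetical rule)."""
    for d in docs:
        # Collapse runs of whitespace; a real step would rewrite or restructure.
        yield replace(d, text=" ".join(d.text.split()), tier="L3")


def run_pipeline(docs: Iterable[Doc], stages: List[Stage]) -> List[Doc]:
    """Feed each stage the output of the previous one, then materialize."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


raw = [
    Doc("click here", "raw"),
    Doc("Gradient  descent updates weights along the negative gradient.", "raw"),
]
refined = run_pipeline(raw, [select_web, refine])
for d in refined:
    print(d.tier, "|", d.text)
```

The point of the structure is that every tier is just the previous tier's documents passed through one more function, so adding or swapping a stage does not change the shape of the data.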
The pieces worth knowing
Ultra-FineWeb — the L2 web layer. Built on top of the base FineWeb pipeline, it uses MiniCPM4 and Qwen3 for Q&A-pair generation and multi-style rewriting, producing 400B+ English tokens and 200B+ Chinese tokens. The Chinese portion is described as the largest open-source synthetic Chinese pretraining dataset to date, which makes it a genuinely scarce resource.
UltraData-Math — a large-scale high-quality math pretraining set, 290B+ tokens across three progressive tiers (L1 / L2-preview / L3). Released Feb 9, 2026.
UltraData-SFT-2605 — the L3 refined SFT set used in MiniCPM5-1B’s post-training. 15M+ samples spanning math, code, knowledge, and instruction following, and notably split into Deep Thinking and Non-thinking examples — i.e., it carries both reasoning-trace supervision and direct-answer supervision. Released May 28, 2026.
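To make the Deep Thinking / Non-thinking distinction concrete, here is a minimal sketch of how the two kinds of sample could be turned into training targets. Everything in it is an assumption for illustration: the `SFTSample` fields, the `render_target` helper, and the `<think>` delimiters are hypothetical and are not taken from the UltraData-SFT-2605 schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SFTSample:
    prompt: str
    answer: str
    # Hypothetical field: set only for deep-thinking samples.
    reasoning: Optional[str] = None


def render_target(sample: SFTSample) -> str:
    """Build the assistant-side text the model is trained to produce."""
    if sample.reasoning is not None:
        # Deep thinking: supervise the reasoning trace, then the final answer.
        # The <think> tags are an assumed convention, not the dataset's format.
        return f"<think>\n{sample.reasoning}\n</think>\n{sample.answer}"
    # Non-thinking: supervise the direct answer only.
    return sample.answer


samples = [
    SFTSample(
        prompt="What is 17 * 6?",
        answer="102",
        reasoning="17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102",
    ),
    SFTSample(prompt="What is the capital of France?", answer="Paris"),
]
targets = [render_target(s) for s in samples]
for t in targets:
    print(repr(t))
```

Keeping both modes in one sample type means a training mix can be rebalanced by filtering on whether `reasoning` is set, without maintaining two separate datasets.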
The collection also gathers OpenBMB’s earlier well-known sets — UltraChat (large-scale dialogue), UltraFeedback (preference/feedback data), and DCAD-2000 — under the same umbrella, plus tooling: a math parser, a web-text quality classifier, and a math Q&A generator.
Why it matters (to me)
The thing I keep coming back to is the “battle-tested on a real model” claim. Open datasets usually ship with file sizes and a license and nothing about whether the recipe actually produces a good model; you find out by burning your own compute. Pinning the entire L0–L4 stack to MiniCPM5-1B’s results turns the collection into something closer to a reproducible recipe than a raw resource. The Deep-Thinking / Non-thinking split in the SFT layer is the other sign that this tracks the current frontier: post-training is no longer one homogeneous instruction set but a deliberate mix of “show your work” and “just answer,” and UltraData ships that distinction as a first-class axis.
Sources: UltraData platform · Ultra-FineWeb · UltraData-Math · UltraData-SFT-2605 · MiniCPM