MiniMind-O: training a from-scratch omni model over a weekend
Project: jingyaogong/minimind-o
Why this one: it’s the smallest credible omni model I’ve found where the full pipeline fits on a single consumer GPU.
TL;DR
MiniMind-O is “a small end-to-end Omni model from scratch” — a single set of weights that jointly handles text / audio / image inputs and produces text + streaming-speech outputs (24 kHz via the Mimi codec). The base model is ~115M trainable params (minimind-3o), with a ~315M MoE variant (minimind-3o-moe). The hook for me: the mini dataset runs the entire Thinker→Talker pipeline in ~2 hours on one RTX 3090. That’s a weekend with hours to spare for poking at it.
What’s actually trainable
The honest accounting: only ~0.1B params are yours to train. MiniMind-O leans on three frozen off-the-shelf modules for the heavy perceptual lifting (~425M params total):
- SenseVoice-Small — speech recognition (audio in)
- SigLIP2 — vision encoder (image in)
- Mimi codec — neural audio codec for streaming speech out
So the from-scratch part is the connective tissue and the language/audio reasoning, not the encoders. That’s the right tradeoff for a learning project — you train the interesting part and rent the parts that need industrial-scale data.
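To make that accounting concrete, here is a quick back-of-the-envelope sketch using only the approximate figures quoted above (~115M trainable, ~425M frozen). The function name and dictionary keys are mine, purely for illustration.

```python
# Parameter budget from the approximate figures quoted above (in millions).
TRAINABLE_M = 115  # minimind-3o, the part trained from scratch
FROZEN_M = 425     # SenseVoice-Small + SigLIP2 + Mimi combined, all frozen


def param_budget(trainable_m: float, frozen_m: float) -> dict:
    """Return the total parameter count and the share that is trainable."""
    total = trainable_m + frozen_m
    return {"total_m": total, "trainable_share": trainable_m / total}


budget = param_budget(TRAINABLE_M, FROZEN_M)
print(f"total ~ {budget['total_m']:.0f}M params, "
      f"trainable ~ {budget['trainable_share']:.0%}")
```

By these numbers, roughly a fifth of the weights in the running system are the ones you actually update; the rest is borrowed perception.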
Architecture: Thinker + Talker
Two paths:
- Thinker — handles multimodal understanding (fuses text/audio/image into the language model).
- Talker — generates Mimi audio codes for speech output, using Multi-Token Prediction (MTP) so the speech streams instead of arriving all at once.
This Thinker/Talker split is the same shape as the bigger omni models (Qwen2.5-Omni and friends) — MiniMind-O is the miniature, readable version of that pattern, which is exactly why it’s worth training by hand.
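To get a feel for what the Talker has to produce, here is a rough token-budget sketch. The 24 kHz sample rate comes from the description above; the 12.5 Hz frame rate is Mimi's published figure and I am assuming MiniMind-O uses it unchanged. The number of codebooks is a guess (8 is only a placeholder default), and `mimi_token_budget` is a hypothetical helper, not something from the repo.

```python
import math

SAMPLE_RATE_HZ = 24_000  # Mimi output sample rate, as stated above
FRAME_RATE_HZ = 12.5     # Mimi's published frame rate (assumed unchanged here)


def mimi_token_budget(seconds: float, num_codebooks: int = 8) -> dict:
    """Estimate how many codec frames and codes cover `seconds` of speech.

    num_codebooks is an assumption: the post does not say how many
    codebooks MiniMind-O's Talker predicts per frame.
    """
    frames = math.ceil(seconds * FRAME_RATE_HZ)
    return {
        "frames": frames,
        "codes": frames * num_codebooks,
        "samples_per_frame": int(SAMPLE_RATE_HZ / FRAME_RATE_HZ),
        "ms_per_frame": 1000 / FRAME_RATE_HZ,
    }


print(mimi_token_budget(10.0))
```

Each frame covers 80 ms of audio, so a frame can in principle be decoded and played as soon as its codes exist. That per-frame handoff is my reading of why predicting several codes per step matters for streaming.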
The training pipeline (three stages)
- T2A — Text-to-Audio: teach the Talker to speak.
- A2A — Audio-to-Audio: speech in, speech out.
- I2T — Image-to-Text: bolt on vision understanding.
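The three stages can be sketched as a simple sequential runner. This is only an illustration of the ordering: the `Stage` class, `run_pipeline`, and the stand-in `fake_train` are all hypothetical, and I am assuming each stage starts from the previous stage's checkpoint, which the repo may or may not do in exactly this way.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    description: str
    # Takes the previous checkpoint (or None) and returns the new checkpoint id.
    train: Callable[[Optional[str]], str]


def run_pipeline(stages: List[Stage],
                 init_ckpt: Optional[str] = None) -> Tuple[Optional[str], Dict[str, float]]:
    """Run stages in order, handing each one the previous checkpoint."""
    ckpt = init_ckpt
    timings: Dict[str, float] = {}
    for stage in stages:
        start = time.perf_counter()
        ckpt = stage.train(ckpt)
        timings[stage.name] = time.perf_counter() - start
    return ckpt, timings


def fake_train(tag: str) -> Callable[[Optional[str]], str]:
    """Stand-in for a real training call; it only records the stage order."""
    def _train(prev: Optional[str]) -> str:
        return f"{prev}+{tag}" if prev else tag
    return _train


stages = [
    Stage("T2A", "text-to-audio: teach the Talker to speak", fake_train("t2a")),
    Stage("A2A", "audio-to-audio: speech in, speech out", fake_train("a2a")),
    Stage("I2T", "image-to-text: add vision understanding", fake_train("i2t")),
]

final_ckpt, timings = run_pipeline(stages)
print(final_ckpt, list(timings))
```

Swapping `fake_train` for real training calls would also give per-stage wall-clock numbers, which is handy for checking where the quoted ~2 hours actually goes.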
The weekend plan
Mini dataset (single RTX 3090, 24 GB VRAM):
- ~470 hours of output speech (English T2A)
- ~74 hours input / ~56 hours output (English A2A)
- Full pipeline end-to-end ≈ 2 hours
That leaves the rest of the weekend for the fun part: feeding it your own prompts, listening to the streamed speech, and breaking it to see where 0.1B params runs out of capacity.
Full dataset (if I get greedy): ~1,636 hours of T2A output speech and ~1,712 hours input / 423 hours output for A2A, mixed Chinese and English, documented on 8× RTX 3090. That’s out of single-GPU weekend scope, but good to know the same code scales.
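For a sense of how big the jump is, here is the ratio of full to mini dataset hours, computed from the figures quoted in this section. The dictionary keys are my own labels. This compares audio hours only and says nothing about wall-clock time, since the full run is documented on different hardware.

```python
# Audio hours quoted above for the mini and full datasets.
MINI = {"t2a_out_h": 470, "a2a_in_h": 74, "a2a_out_h": 56}
FULL = {"t2a_out_h": 1636, "a2a_in_h": 1712, "a2a_out_h": 423}

# How many times larger the full dataset is, split by split.
ratios = {key: FULL[key] / MINI[key] for key in MINI}

for key, ratio in ratios.items():
    print(f"{key}: {MINI[key]}h -> {FULL[key]}h ({ratio:.1f}x)")
```

The growth is uneven: T2A output grows about 3.5x, while A2A input grows about 23x, so the full run leans much harder on the audio-understanding side than the mini run does.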
My rough order of operations
- Clone the repo, set up the env, download the mini dataset only.
- Pull the three frozen modules (SenseVoice-Small, SigLIP2, Mimi) so nothing blocks mid-run.
- Run T2A → A2A → I2T in sequence on the 3090; expect ~2h wall-clock total.
- Run inference: text→speech first (simplest signal that the Talker works), then audio→audio, then image→text.
- Compare base vs. the MoE variant if there’s time left.
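For the "so nothing blocks mid-run" step, a small preflight check is cheap insurance. The sketch below only verifies that a directory exists and is non-empty for each frozen module. The paths in `REQUIRED_MODULES` are placeholders I made up, not the repo's actual layout, so adjust them to wherever the downloads really land.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical locations; the real repo layout may differ.
REQUIRED_MODULES: Dict[str, str] = {
    "SenseVoice-Small": "model/sensevoice-small",
    "SigLIP2": "model/siglip2",
    "Mimi": "model/mimi",
}


def missing_modules(root: str, required: Dict[str, str] = REQUIRED_MODULES) -> List[str]:
    """Return names of modules whose directory is absent or empty under root."""
    base = Path(root)
    missing = []
    for name, rel_path in required.items():
        path = base / rel_path
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_modules(".")
    print("missing:", gaps if gaps else "none")
```

Running this before kicking off T2A means a forgotten download shows up as a one-line message rather than a crash partway through a stage.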
Why I want to do this
Reading an omni-model paper tells you the shape of the system; training the smallest honest version of it tells you where the bodies are buried — how brittle the audio codec handoff is, how much the frozen encoders bottleneck understanding, what 0.1B params can and can’t hold. At this scale the whole thing is legible end-to-end in an afternoon, and the 2-hour train loop means the iteration cost of “what if I change this” is coffee-break sized rather than cloud-bill sized. That’s the entire appeal of the MiniMind family — it’s not trying to be good, it’s trying to be understandable.