MiniMind-O: training a from-scratch omni model over a weekend

Jun 9, 2026

Weekend build

omni-model multimodal train-from-scratch weekend-project

Project: jingyaogong/minimind-o
Why this one: it’s the smallest credible omni model I’ve found where the full pipeline fits on a single consumer GPU.

TL;DR

MiniMind-O is “a small end-to-end Omni model from scratch” — a single set of weights that jointly handles text / audio / image inputs and produces text + streaming-speech outputs (24 kHz via the Mimi codec). The base model is ~115M trainable params (minimind-3o), with a ~315M MoE variant (minimind-3o-moe). The hook for me: the mini dataset runs the entire Thinker→Talker pipeline in ~2 hours on one RTX 3090. That’s a weekend with hours to spare for poking at it.

What’s actually trainable

The honest accounting: only ~0.1B params are yours to train. MiniMind-O leans on three frozen off-the-shelf modules for the heavy perceptual lifting (~425M params total):

SenseVoice-Small — speech recognition (audio in)
SigLIP2 — vision encoder (image in)
Mimi codec — neural audio codec for streaming speech out

So the from-scratch part is the connective tissue and the language/audio reasoning, not the encoders. That’s the right tradeoff for a learning project — you train the interesting part and rent the parts that need industrial-scale data.
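To make that accounting concrete, here is a quick back-of-the-envelope sketch using only the approximate figures quoted above (~115M trainable for the base model, ~425M frozen). The function name is mine, and the exact counts in the repo will differ slightly from these rounded numbers.

```python
# Parameter budget, using the approximate figures quoted in the post.
TRAINABLE_PARAMS = 115e6   # base model (minimind-3o), ~115M trainable
FROZEN_PARAMS = 425e6      # SenseVoice-Small + SigLIP2 + Mimi combined, ~425M


def trainable_fraction(trainable: float, frozen: float) -> float:
    """Share of all parameters in the running system that receive gradients."""
    return trainable / (trainable + frozen)


total = TRAINABLE_PARAMS + FROZEN_PARAMS
share = trainable_fraction(TRAINABLE_PARAMS, FROZEN_PARAMS)
print(f"total ≈ {total / 1e6:.0f}M params, trainable share ≈ {share:.1%}")
# prints: total ≈ 540M params, trainable share ≈ 21.3%
```

So roughly one parameter in five is actually being learned; the rest is borrowed perception.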

Architecture: Thinker + Talker

Two paths:

Thinker — handles multimodal understanding (fuses text/audio/image into the language model).
Talker — generates Mimi audio codes for speech output, using Multi-Token Prediction (MTP) so the speech streams instead of arriving all at once.

This Thinker/Talker split is the same shape as the bigger omni models (Qwen2.5-Omni and friends) — MiniMind-O is the miniature, readable version of that pattern, which is exactly why it’s worth training by hand.
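The streaming idea is easier to see as control flow than as prose. The sketch below is not the repo's Talker: `fake_talker_step` is a hypothetical stand-in that returns random codes, and `tokens_per_step=4` and `vocab_size=2048` are illustrative assumptions, not values taken from MiniMind-O. It only shows the difference that matters: a consumer can start decoding audio as soon as the first group of codes arrives, and each step hands over several codes rather than one.

```python
import random
from typing import Iterator, List


def fake_talker_step(tokens_per_step: int, vocab_size: int,
                     rng: random.Random) -> List[int]:
    """Hypothetical stand-in for one Talker forward pass.

    A real Talker would condition on the Thinker's hidden states; here the
    codes are random, purely to illustrate the control flow.
    """
    return [rng.randrange(vocab_size) for _ in range(tokens_per_step)]


def stream_audio_codes(num_steps: int, tokens_per_step: int = 4,
                       vocab_size: int = 2048,
                       seed: int = 0) -> Iterator[List[int]]:
    """Yield each group of audio codes as soon as it is produced."""
    rng = random.Random(seed)
    for _ in range(num_steps):
        # Several codes per step (the multi-token part), handed to the
        # caller immediately (the streaming part).
        yield fake_talker_step(tokens_per_step, vocab_size, rng)


for i, codes in enumerate(stream_audio_codes(num_steps=3)):
    # In a real system this is where a codec decoder would turn codes
    # into a short chunk of waveform and start playback.
    print(f"step {i}: received {len(codes)} codes")
```

A non-streaming version would build the full list first and return it at the end, so the listener hears nothing until generation finishes.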

The training pipeline (three stages)

T2A — Text-to-Audio: teach the Talker to speak.
A2A — Audio-to-Audio: speech in, speech out.
I2T — Image-to-Text: bolt on vision understanding.
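One way to keep the three stages straight is to write down what goes in, what comes out, and which frozen module each stage therefore has to touch. The mapping below is my own inference from the modality labels in the module list above, not something the repo states; the `Stage` class and `frozen_modules_needed` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List

# Frozen modules and the modality each one serves, as listed in the post.
FROZEN_MODULE_FOR = {
    ("audio", "in"): "SenseVoice-Small",
    ("image", "in"): "SigLIP2",
    ("audio", "out"): "Mimi",
}


@dataclass(frozen=True)
class Stage:
    name: str
    source: str  # input modality
    target: str  # output modality


# Order taken from the post: T2A, then A2A, then I2T.
STAGES = [
    Stage("T2A", source="text", target="audio"),
    Stage("A2A", source="audio", target="audio"),
    Stage("I2T", source="image", target="text"),
]


def frozen_modules_needed(stage: Stage) -> List[str]:
    """Inferred (not documented) frozen modules a stage relies on."""
    needed = []
    for modality, direction in ((stage.source, "in"), (stage.target, "out")):
        module = FROZEN_MODULE_FOR.get((modality, direction))
        if module is not None:
            needed.append(module)
    return needed


for stage in STAGES:
    print(stage.name, "->", frozen_modules_needed(stage))
```

Under that reading, T2A only needs the codec, A2A needs both the speech recognizer and the codec, and I2T only needs the vision encoder.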

The weekend plan

Mini dataset (single RTX 3090, 24 GB):

~470 hours of output speech (English T2A)
~74 hours input / ~56 hours output (English A2A)
Full pipeline end-to-end ≈ 2 hours
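For a feel of how much the Talker actually sees, the hours above can be converted into codec frames. This sketch assumes Mimi's published frame rate of 12.5 Hz applies unchanged here; the 24 kHz sample rate comes from the TL;DR. I don't know how many codebooks MiniMind-O uses per frame, so this counts frames, not tokens.

```python
MIMI_FRAME_RATE_HZ = 12.5   # assumption: Mimi's published frame rate
SAMPLE_RATE_HZ = 24_000     # from the post: 24 kHz output


def speech_hours_to_frames(hours: float,
                           frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> int:
    """Number of codec frames needed to represent `hours` of speech."""
    return round(hours * 3600 * frame_rate_hz)


samples_per_frame = SAMPLE_RATE_HZ / MIMI_FRAME_RATE_HZ
t2a_frames = speech_hours_to_frames(470)      # mini T2A output speech
a2a_out_frames = speech_hours_to_frames(56)   # mini A2A output speech

print(f"samples per frame: {samples_per_frame:.0f}")   # 1920 (80 ms of audio)
print(f"T2A output frames: {t2a_frames:,}")            # 21,150,000
print(f"A2A output frames: {a2a_out_frames:,}")        # 2,520,000
```

Multiply the frame counts by the number of codebooks per frame to get a token count once you've checked the repo's config.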

That leaves the rest of the weekend for the fun part: feeding it your own prompts, listening to the streamed speech, and breaking it to see where 0.1B params run out of capacity.

Full dataset (if I get greedy): ~1,636 hours T2A and ~1,712 hours input / ~423 hours output A2A, Chinese-English mixed, documented on 8× RTX 3090. That’s out of single-GPU weekend scope, but good to know the same code scales.

My rough order of operations

Clone the repo, set up the env, download the mini dataset only.
Pull the three frozen modules (SenseVoice-Small, SigLIP2, Mimi) so nothing blocks mid-run.
Run T2A → A2A → I2T in sequence on the 3090; expect ~2h wall-clock total.
Run inference: text→speech first (simplest signal that the Talker works), then audio→audio, then image→text.
Compare base vs. the MoE variant if there’s time left.
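The steps above can be wrapped in a small driver so a missing download fails fast instead of two stages in. This is a sketch, not the repo's tooling: the directory names are hypothetical placeholders, and `placeholder_train_stage` stands in for whatever command or function actually launches each stage.

```python
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def missing_paths(required: Iterable[Path]) -> List[Path]:
    """Return the required paths that do not exist yet."""
    return [p for p in required if not p.exists()]


def run_stages(stages: List[str],
               train_stage: Callable[[str], None]) -> Dict[str, float]:
    """Run stages strictly in order and record wall-clock seconds for each."""
    timings: Dict[str, float] = {}
    for name in stages:
        start = time.perf_counter()
        train_stage(name)
        timings[name] = time.perf_counter() - start
    return timings


def placeholder_train_stage(name: str) -> None:
    # Hypothetical: replace with the real launch call for this stage.
    print(f"(would train stage {name} here)")


# Hypothetical local directories for the frozen modules and the mini dataset.
required = [
    Path("frozen/SenseVoice-Small"),
    Path("frozen/SigLIP2"),
    Path("frozen/Mimi"),
    Path("dataset/mini"),
]

missing = missing_paths(required)
if missing:
    print("download these before starting:", [str(p) for p in missing])
else:
    timings = run_stages(["T2A", "A2A", "I2T"], placeholder_train_stage)
    for name, seconds in timings.items():
        print(f"{name}: {seconds / 60:.1f} min")
```

Keeping per-stage timings around also gives a baseline for later "what if I change this" experiments.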

Why I want to do this

Reading an omni-model paper tells you the shape of the system; training the smallest honest version of it tells you where the bodies are buried — how brittle the audio codec handoff is, how much the frozen encoders bottleneck understanding, what 0.1B params can and can’t hold. At this scale the whole thing is legible end-to-end in an afternoon, and the 2-hour train loop means the iteration cost of “what if I change this” is coffee-break sized rather than cloud-bill sized. That’s the entire appeal of the MiniMind family — it’s not trying to be good, it’s trying to be understandable.