QuCo-RAG: deciding when to retrieve by asking the pre-training corpus
Paper: QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation, by Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng (2025)
Code: github.com/ZhishanQ/QuCo-RAG (MIT)
TL;DR
Decide when a RAG system should retrieve by reading the pre-training corpus, not the model: rare entities and never-seen entity co-occurrences predict hallucination risk better than the model’s own confidence signals. Reported +5–12 EM over prior SOTA across HotpotQA and 2WikiMultihopQA (up to +14 on transfer settings).
For readers who see the name and assume quantum computing: “QuCo” stands for Quantifying Uncertainty from the pre-training Corpus. No qubits involved.
Problem
Dynamic (adaptive) RAG wants to retrieve only when the model is about to hallucinate, to save the latency and context-budget cost of always-on retrieval. The catch is that the standard trigger signals (next-token probability, entropy, self-verification) are noisy, especially in long generations and on multi-hop questions. Under-triggering lets hallucinations through; over-triggering throws away the latency win. The root problem is that we’re asking the model to assess its own knowledge, which it cannot reliably introspect on.
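To make the model-internal signals concrete, here is a minimal sketch of an entropy-based trigger. It is a generic illustration, not the implementation of any specific baseline: the function names, the threshold of 1.0 nats, and the toy probability vectors are all assumptions chosen for the example.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_trigger(step_distributions, threshold=1.0):
    """Fire retrieval if any generation step is more uncertain than `threshold`.

    `threshold` is a hypothetical value picked for this toy example.
    """
    return any(token_entropy(p) > threshold for p in step_distributions)

# Toy distributions over a 4-token vocabulary (invented for illustration).
peaked = [[0.97, 0.01, 0.01, 0.01]]   # model is confident
flat = [[0.25, 0.25, 0.25, 0.25]]     # model is maximally unsure

print(entropy_trigger(peaked))  # does not fire
print(entropy_trigger(flat))    # fires
```

The weakness is visible in the peaked case: the trigger only sees the shape of the distribution, so a confidently wrong token looks exactly like a confidently right one and retrieval never fires.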
Approach
Ignore the model’s confidence; look at the pre-training corpus instead. The authors use Infini-gram over a 4-trillion-token index to query entity-level statistics in milliseconds, and gate retrieval in two stages:
- Pre-generation: parse entities from the question. If any entity’s corpus frequency is below a threshold, retrieve up front — the model probably hasn’t internalized it.
- During generation: as new entities are emitted, check their co-occurrence with question entities in the corpus. Zero or near-zero co-occurrence is a strong hallucination signal → retrieve mid-generation and continue.
The appeal is that both signals are grounded in something external and tractable: corpus statistics are a direct proxy for “did the model actually see this during training?” That’s a much cleaner question than “how confident is the model in this token?”
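The two-stage gate can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: a four-sentence in-memory list stands in for the Infini-gram index, entities are assumed to be already extracted, the thresholds `min_freq` and `min_cooc` are hypothetical, and taking the maximum co-occurrence across question entities is my own aggregation choice.

```python
from typing import List, Sequence

# Toy stand-in for the pre-training corpus (sentences invented for
# illustration). A real system would query a large n-gram index instead.
TOY_CORPUS: List[str] = [
    "Marie Curie was born in Warsaw and later worked in Paris.",
    "Marie Curie won the Nobel Prize in Physics.",
    "Warsaw is the capital of Poland.",
    "Paris is the capital of France.",
]

def entity_count(entity: str, corpus: Sequence[str] = TOY_CORPUS) -> int:
    """Number of corpus documents that mention `entity` (case-insensitive)."""
    needle = entity.lower()
    return sum(needle in doc.lower() for doc in corpus)

def cooccurrence_count(a: str, b: str, corpus: Sequence[str] = TOY_CORPUS) -> int:
    """Number of corpus documents that mention both `a` and `b`."""
    a, b = a.lower(), b.lower()
    return sum(a in doc.lower() and b in doc.lower() for doc in corpus)

def retrieve_before_generation(question_entities: Sequence[str],
                               min_freq: int = 1) -> bool:
    """Stage 1: retrieve up front if any question entity is rarer than `min_freq`."""
    return any(entity_count(e) < min_freq for e in question_entities)

def retrieve_during_generation(question_entities: Sequence[str],
                               new_entity: str,
                               min_cooc: int = 1) -> bool:
    """Stage 2: retrieve if a newly generated entity (almost) never
    co-occurs with any question entity.

    Assumption: we take the best co-occurrence over all question entities.
    """
    best = max((cooccurrence_count(q, new_entity) for q in question_entities),
               default=0)
    return best < min_cooc

question_entities = ["Marie Curie"]
print(retrieve_before_generation(question_entities))            # entity is known
print(retrieve_before_generation(["Zorblat Quinn"]))            # unseen entity
print(retrieve_during_generation(question_entities, "Warsaw"))  # seen together
print(retrieve_during_generation(question_entities, "France"))  # never seen together
```

In the toy corpus, "Marie Curie" and "Warsaw" share a sentence, so generation continues. "France" never appears alongside "Marie Curie", so a draft answer introducing it would trigger retrieval mid-generation.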
Results
Tested with OLMo, Llama, and Qwen backbones against SR-RAG, FL-RAG, FLARE, DRAGIN, and a no-retrieval baseline. Headline numbers (from the abstract):
- +5–12 EM points over prior SOTA on HotpotQA and 2WikiMultihopQA (multi-hop QA — the harder setting where adaptive retrieval should pay off most).
- Up to +14 points on transfer to models whose pre-training data is undisclosed, using a proxy corpus — the “does this break when we don’t know what the target model saw?” check.
Limitations & open questions
- Compositional uncertainty isn’t captured by entity co-occurrence alone. A question’s entities can each be high-frequency on their own while the combination of facts it asks about is novel; the two-stage trigger would under-fire there.
- Proxy corpus for closed models. The transfer result is encouraging, but it’s an open question how closely a public 4T-token index tracks what a given proprietary model actually trained on. Drift here bounds generalization.
- Multilingual and code generation aren’t in the evaluation. Both are obvious next settings: the corpus-statistics prior might transfer cleanly to multilingual text (multilingual Infini-gram indices exist) and might break down on code (code is tokenized differently and its statistics look quite unlike natural language).
- Entity extractor sensitivity. Performance presumably depends on the quality of upstream NER. A worst-case failure mode is an extractor that misses rare entities entirely, collapsing the trigger back toward “never retrieve.”
Why it matters (to me)
The pattern — replacing model-introspection signals with external, measurable ones — generalizes beyond RAG gating. Any time you need a model to “know what it doesn’t know,” offloading that judgment to a cheap external oracle (a corpus index, a retrieval-augmented check, a structured memory lookup) is usually more robust than fine-tuning calibration into the model itself. QuCo-RAG is a nice concrete instance of that design principle, and Infini-gram’s millisecond query latency makes it actually practical in a generation loop.