LocateAnything: decoding bounding boxes in parallel instead of one token at a time

Jun 9, 2026

Paper notes

Project: LocateAnything, in the Embodied directory of NVIDIA’s Eagle repo
Paper: LocateAnything (PDF), by Wang, Liu, Kuang, Wei, et al., arXiv:2605.27365 (2026)
Model: nvidia/LocateAnything-3B · Demo: HF Space
License: code under Apache 2.0, weights under the NVIDIA Model License

TL;DR

LocateAnything is a 3B vision-language model for visual grounding: it can box a referred object, detect many objects at once, localize a GUI element or a line of text, or drop a point on a target. The interesting bit is how it emits boxes. Instead of generating coordinate tokens autoregressively, one number after another, it uses Parallel Box Decoding (PBD) to predict each box as a single atomic unit in one forward pass. The payoff is throughput: a reported 12.7 boxes per second on an H100, ~10× faster than Qwen3-VL and ~2.5× faster than Rex-Omni, while still posting state-of-the-art accuracy on grounding benchmarks.

The problem with autoregressive boxes

Most VLM-based detectors fold localization into text generation: a box becomes a little string like [x1, y1, x2, y2], emitted token by token. That’s elegant — one unified head for everything — but it’s slow and fragile at scale. Decoding 50 objects means decoding hundreds of coordinate tokens strictly in sequence, and a single mis-sampled digit mid-box corrupts the geometry. The autoregressive prior also doesn’t really want to model “a set of boxes”; it models “a sequence of characters that happen to be coordinates.”
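
To put rough numbers on that, here is a toy sketch. The digit-level tokenizer and the exact `[x1, y1, x2, y2]` string format are assumptions for illustration only (real tokenizers and coordinate vocabularies differ, and this is not the paper's setup), and every function name is made up.

```python
import re

def serialize_boxes(boxes):
    """Render boxes as text, the way a text-generating detector might."""
    return " ".join(f"[{x1}, {y1}, {x2}, {y2}]" for x1, y1, x2, y2 in boxes)

def toy_tokenize(text):
    """Hypothetical tokenizer: one token per digit, bracket, or comma."""
    return re.findall(r"\d|[\[\],]", text)

def parse_boxes(text):
    """Recover integer boxes from the serialized string."""
    pattern = r"\[(\d+), (\d+), (\d+), (\d+)\]"
    return [tuple(int(v) for v in m) for m in re.findall(pattern, text)]

def is_valid(box):
    """A box is geometrically valid only if it has positive width and height."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2

# 50 boxes with three-digit coordinates.
boxes = [(100 + i, 200 + i, 300 + i, 400 + i) for i in range(50)]
text = serialize_boxes(boxes)

# 17 toy tokens per box (12 digits + 5 punctuation marks), decoded in sequence.
n_tokens = len(toy_tokenize(text))  # 850

# Simulate one mis-sampled digit: the leading "3" of the first x2 becomes "0".
corrupted = text.replace("300", "000", 1)
first_box = parse_boxes(corrupted)[0]  # (100, 200, 0, 400)
print(n_tokens, first_box, is_valid(first_box))
```

Under these assumptions, 50 boxes cost 850 sequential decoding steps, and flipping one of them turns a valid box into one with negative width.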

Approach

Two ideas carry the work:

Parallel Box Decoding. A box is treated as one atomic output rather than a token sequence, so geometric coherence is preserved and many boxes decode concurrently. This is where the throughput comes from — the model isn’t paying the per-token serialization tax for every coordinate.
Hybrid inference. A fast multi-token-prediction (MTP) path does the bulk of the work, with a next-token-prediction (NTP) fallback for stability when the fast path is uncertain. You get speed by default and correctness when it matters.
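
A minimal sketch of how the two ideas could fit together, as I read them. The gating rule here (fall back when the least confident coordinate drops below a threshold) is my assumption, since these notes don't record the paper's actual criterion; `decode_box_parallel`, `hybrid_decode`, `slow_path`, and the `threshold` value are all hypothetical names, and the sequential path is a stub.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, ...]
# Four probability distributions over coordinate bins, one per box coordinate.
CoordDists = Sequence[Sequence[float]]

def decode_box_parallel(dists: CoordDists) -> Tuple[Box, float]:
    """Read all four coordinates from a single set of outputs (argmax per coordinate).

    Returns the box and the lowest per-coordinate confidence.
    """
    coords = [max(range(len(d)), key=d.__getitem__) for d in dists]
    confidence = min(max(d) for d in dists)
    return tuple(coords), confidence

def hybrid_decode(
    all_dists: Sequence[CoordDists],
    slow_path: Callable[[int], Box],
    threshold: float = 0.5,  # assumed value, not from the paper
) -> Tuple[List[Box], List[int]]:
    """Use the fast path by default; re-decode uncertain boxes with the slow path."""
    boxes, used_fallback = [], []
    for i, dists in enumerate(all_dists):
        box, confidence = decode_box_parallel(dists)
        if confidence < threshold:
            box = slow_path(i)  # stand-in for token-by-token decoding
            used_fallback.append(i)
        boxes.append(box)
    return boxes, used_fallback

# Tiny demo with 4 coordinate bins instead of a real coordinate vocabulary.
confident = [[0.9, 0.05, 0.03, 0.02], [0.1, 0.8, 0.05, 0.05],
             [0.0, 0.1, 0.9, 0.0], [0.0, 0.0, 0.1, 0.9]]
uncertain = [[0.4, 0.35, 0.15, 0.1], [0.1, 0.8, 0.05, 0.05],
             [0.0, 0.1, 0.9, 0.0], [0.0, 0.0, 0.1, 0.9]]

demo_boxes, demo_fallback = hybrid_decode(
    [confident, uncertain],
    slow_path=lambda i: (1, 1, 3, 3),  # pretend result of sequential decoding
)
print(demo_boxes, demo_fallback)
```

The first box is accepted straight from the parallel path; the second has one coordinate at 0.4 confidence, so only that box pays for the slower route.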

The stack itself is conventional in the good way: a Moon-ViT vision encoder at native resolution, an MLP projector bridge, and a Qwen2.5 language decoder. On Hopper/Blackwell it can use MagiAttention; otherwise it falls back to plain SDPA.

What it was trained on

LocateAnything-Data is the other half of the story: ~138M language queries and ~785M boxes across ~12M images, spanning detection, referring-expression comprehension, GUI grounding, OCR/text localization, layout understanding, and pointing. One model, one schema, six task families.
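
Those headline counts imply some useful densities. This is back-of-envelope arithmetic on the rounded figures above, so treat the results as approximate averages rather than numbers from the paper.

```python
# Rounded dataset counts quoted above.
queries = 138e6
boxes = 785e6
images = 12e6

boxes_per_query = boxes / queries     # about 5.7
boxes_per_image = boxes / images      # about 65.4
queries_per_image = queries / images  # 11.5

print(f"{boxes_per_query:.1f} boxes per query")
print(f"{boxes_per_image:.1f} boxes per image")
print(f"{queries_per_image:.1f} queries per image")
```

Roughly 65 boxes per image on average is dense supervision, which fits a model meant to emit many boxes per frame.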

Results

A single backbone covers a spread that usually needs separate specialist models:

LVIS — 50.7 F1 (open-vocabulary detection)
COCO — 54.7 F1
ScreenSpot-Pro — 60.3 avg (GUI element grounding)
DocLayNet — 76.8 F1 (document layout)
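
For readers less used to F1 as a detection metric, here is a minimal sketch of one way to score a set of predicted boxes. The greedy matching and the 0.5 IoU threshold are assumptions on my part; these notes don't record each benchmark's official protocol, which may differ. `iou` and `box_f1` are illustrative names.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def box_f1(preds, gts, thr=0.5):
    """F1 with greedy one-to-one matching at an IoU threshold (assumed protocol)."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)  # each ground-truth box can match only once
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 10), (50, 50, 60, 60), (21, 21, 30, 30)]
score = box_f1(preds, gts)  # 2 true positives, 1 false positive, 0 misses -> 0.8
print(score)
```

Unlike average precision, F1 scores the final set of boxes the model commits to, with no ranking by confidence.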

Why it matters (to me)

The recurring lesson is that output representation is a design decision, not a given. The field defaulted to “everything is a token sequence” because it unified training, but for structured spatial outputs that default quietly costs you an order of magnitude in throughput. PBD is a clean reminder that when your output has structure (a box is four numbers with geometric constraints, not a string of independent characters), modeling that structure directly beats forcing it through the autoregressive funnel. For anything robotics- or agent-facing, where you’re localizing dozens of things per frame in a loop, that ~10× is the difference between usable and not.