LocateAnything: decoding bounding boxes in parallel instead of one token at a time
Paper notes

Paper notes on NVIDIA's LocateAnything, a 3B vision-language grounding model that treats each bounding box as an atomic unit via Parallel Box Decoding and reaches 10× the throughput of comparable VLMs.

paper-notes vision-language grounding nvidia
