LocateAnything: decoding bounding boxes in parallel instead of one token at a time
Paper notes
Paper notes on NVIDIA's LocateAnything, a 3B-parameter vision-language grounding model that treats each bounding box as an atomic unit via Parallel Box Decoding and reports 10× the throughput of comparable VLMs.