GUI Automation Agent
Controls the plant dashboard by reasoning over screenshots, with fingerprint-based element remapping and dynamic schema constraints that keep the model in bounds.
Why it matters
Voice and text are only useful if they can actually drive the UI. This agent takes a screenshot plus a structured list of currently visible interactive elements and returns a click, type, scroll, or select action, reliably and without hallucinating IDs that don’t exist. It’s the hands that connect every conversational agent to the panels operators already use.
Capabilities
- Set-of-Marks pattern — the frontend detects interactive elements, fingerprints them by tag, data attributes, and label, and sends a filtered, viewport-only list alongside the screenshot.
- Dynamic schema constraints — the element-id field is bounded to the exact valid range on every request, enforced by structured output so the model literally cannot pick an index that doesn’t exist.
- Fingerprint-based remapping — between reasoning and execution the DOM can shift; elements are re-matched by fingerprint so index drift doesn’t cause wrong clicks.
- Shortcut actions — common panel operations bypass fragile click chains and execute as single high-level intents.
- Ambiguity handling — when multiple elements match, the agent returns an “ambiguous” action with the candidates and asks the user to pick.
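The remapping and ambiguity bullets above can be sketched together. This is a minimal illustration, not the project's code: the `Fingerprint` fields follow the tag, data attributes, and label named above, but the class names, the `remap` function, the zero-based indices, and the `not_found` action are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Identity of an element that survives re-indexing (field layout is assumed)."""
    tag: str
    data_attrs: tuple  # sorted (name, value) pairs, e.g. (("data-panel", "pump-1"),)
    label: str


@dataclass
class Element:
    index: int  # position in the element list at snapshot time
    fingerprint: Fingerprint


def remap(chosen: Element, current: list) -> dict:
    """Re-locate the element the model chose inside a fresh DOM snapshot."""
    matches = [e for e in current if e.fingerprint == chosen.fingerprint]
    if len(matches) == 1:
        # Exactly one match: act on its current index, not the stale one.
        return {"action": "click", "element_id": matches[0].index}
    if not matches:
        # Hypothetical outcome: the element disappeared before execution.
        return {"action": "not_found"}
    # Several matches: hand the candidates back so the user can pick.
    return {"action": "ambiguous", "candidates": [e.index for e in matches]}


start = Fingerprint("button", (("data-panel", "pump-1"),), "Start")
stop = Fingerprint("button", (("data-panel", "pump-1"),), "Stop")
banner = Fingerprint("div", (), "Alarm banner")

# Snapshot the model reasoned over.
before = [Element(0, start), Element(1, stop)]
# Snapshot at execution time: a banner appeared and shifted every index by one.
after = [Element(0, banner), Element(1, start), Element(2, stop)]

result = remap(before[1], after)  # the model chose "Stop" at stale index 1
```

Here the stale index 1 would now hit "Start"; matching on the fingerprint resolves the click to index 2 instead. Matching on exact equality is the simplest choice; a real matcher might tolerate small label changes.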
What makes it hold up
The whole class of “AI hits button 997 out of 74” bugs comes from the same root cause: the model is unconstrained about what IDs it can invent. Rebuilding the output schema per request — so the constraint travels with the call, not the prompt — closes that door entirely. Fingerprints cover the other half: the world under the model’s feet moves between reasoning and action, and fingerprints let it land anyway.
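As a rough sketch of what rebuilding the schema per request could look like, the snippet below generates a JSON Schema whose `element_id` range matches the number of elements sent with that request. The function names, the field names, and the zero-based indexing are assumptions for illustration. `in_bounds` is only a local stand-in for the check that structured output would enforce at generation time.

```python
def build_action_schema(element_count: int) -> dict:
    """Build a JSON Schema whose element_id range is valid for this request only."""
    if element_count < 1:
        raise ValueError("no interactive elements are visible")
    return {
        "type": "object",
        "properties": {
            # Action names taken from the four actions listed in the writeup.
            "action": {"enum": ["click", "type", "scroll", "select"]},
            "element_id": {
                "type": "integer",
                "minimum": 0,                  # assumed zero-based indexing
                "maximum": element_count - 1,  # last valid index for this request
            },
        },
        "required": ["action", "element_id"],
        "additionalProperties": False,
    }


def in_bounds(output: dict, schema: dict) -> bool:
    """Check an output against the action enum and the element_id range."""
    props = schema["properties"]
    element_id = output.get("element_id")
    return (
        output.get("action") in props["action"]["enum"]
        and isinstance(element_id, int)
        and not isinstance(element_id, bool)  # bool is a subclass of int in Python
        and props["element_id"]["minimum"] <= element_id <= props["element_id"]["maximum"]
    )


schema = build_action_schema(74)  # 74 visible elements, so valid ids are 0..73
rejected = in_bounds({"action": "click", "element_id": 997}, schema)
accepted = in_bounds({"action": "click", "element_id": 73}, schema)
```

With 74 elements, id 997 falls outside the schema and id 73 is the last one allowed. Whether a given structured-output backend honors `minimum` and `maximum` varies by provider; listing the valid ids as an `enum` is one alternative.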
Enterprise project. The official writeup and demo link will be added once they are online.