A fundamental challenge for GUI agents is robustly grounding natural language instructions, which requires not only precise spatial alignment (locating elements accurately) but also correct semantic ...
🖼️ Multimodal Vision: Your agent can see what you see. It can view the scene, look through any camera, watch play mode, and inspect asset thumbnails.
🔎 Powerful Search: Go beyond the project panel ...