Turing Test on Screen: Benchmarking Human-Like Mobile GUI Agents

Researchers introduce a new benchmark for evaluating the humanization of mobile GUI agents, framing the detector-agent interaction as a min-max optimization problem. The work highlights the growing need for agents to avoid detection in human-centric digital ecosystems.

Researchers have introduced the "Turing Test on Screen," a new benchmark designed to evaluate the humanization capabilities of mobile GUI agents. The study, published on arXiv, argues that existing research has prioritized utility and robustness over the critical dimension of anti-detection. The authors formally model the interaction between a detector and an agent as a min-max optimization problem, in which the agent minimizes its behavioral divergence from human users while the detector maximizes its ability to tell the two apart.
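For intuition, here is a minimal sketch of what such a min-max objective could look like, assuming a GAN-style adversarial formulation; the notation below is illustrative, not the paper's own. Here τ denotes an interaction trajectory (a sequence of taps, swipes, and timings), π is the agent's policy, p_human is the distribution of human trajectories, and D is a detector that scores how human a trajectory looks:

```latex
% Illustrative adversarial objective (our notation, not necessarily the paper's):
% the detector D maximizes its ability to separate human from agent trajectories,
% while the agent policy \pi minimizes that separability.
\min_{\pi} \max_{D} \;
\mathbb{E}_{\tau \sim p_{\mathrm{human}}}\!\left[\log D(\tau)\right]
+ \mathbb{E}_{\tau \sim \pi}\!\left[\log\bigl(1 - D(\tau)\bigr)\right]
```

Under this reading, the agent's humanization improves precisely when the advantage of the best available detector over random guessing shrinks.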

This work is significant because it addresses the growing need for autonomous agents to operate undetected in human-centric digital ecosystems. As platforms deploy adversarial countermeasures, agents must mimic human behavior ever more convincingly. The benchmark includes a new high-fidelity dataset of mobile interactions, giving researchers a concrete testbed for measuring and improving agent humanization.

The benchmark is timely: the proliferation of autonomous agents across applications, from customer service to data entry, has drawn increased scrutiny from digital platforms. Future research will likely focus on refining the optimization models and expanding the dataset to cover more complex interaction scenarios. If these efforts succeed, they could pave the way for agents that are not only functional but also indistinguishable from human users.

#ai-agents #humanization #turing-test #gui-agents #anti-detection #optimization