Learn where to Click from Yourself:
On-Policy Self-Distillation for GUI Grounding

Yan Zhang1,3,*, Daiqing Wu1,3,*, Huawen Shen1,3, Yu Zhou2,†, Can Ma1,3,†
1Institute of Information Engineering, Chinese Academy of Sciences
2VCIP & TMCC & DISSec, College of Computer Science, Nankai University
3School of Cyber Security, University of Chinese Academy of Sciences
* Equal contribution   † Corresponding authors

First OPSD for GUI Grounding

First exploration of OPSD in the GUI grounding domain, offering an efficient alternative to GRPO.


Visual Privileged Guidance

Visually grounded teacher guidance with entropy-aware distillation for rich and reliable supervision.

4x Faster Training

Single rollout with dense token-level supervision, ~4x faster than GRPO-based methods.

Abstract

Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on multiple expensive rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at https://zhangyan-ucas.github.io/GUI-SD/.

Contributions

  • We present the first exploration of the OPSD framework in the GUI grounding domain, offering an efficient alternative to GRPO-based methods that suffer from multiple expensive rollouts and sparse signals on hard samples.
  • We propose GUI-SD, which integrates visually grounded teacher guidance with entropy-aware distillation, enabling rich and reliable supervision that concentrates optimization on the most impactful coordinate tokens.
  • Extensive experiments across six representative GUI grounding benchmarks verify the effectiveness of GUI-SD over naive OPSD and GRPO-based methods, demonstrating significant improvements in both accuracy and training efficiency and establishing OPSD as a promising paradigm for future GUI grounding research.

Method Overview

GUI-SD consists of two complementary components: (a) Teacher Privileged Guidance — the teacher receives a visually enriched privileged context (drawn bounding box + Gaussian soft-mask), while the student receives the original image. Both share the same policy weights. (b) Entropy-Guided Optimization — computes reverse KL divergence between teacher and student logits at each token position, weighted by positional credit assignment (higher weight for high-order digits) and entropy-gated supervision (higher weight for low-entropy, high-confidence teacher predictions).

[Figure: GUI-SD framework overview]

Figure 3. Overview of the GUI-SD framework. (a) The teacher takes a privileged context x_pri, which augments the student's original input x with visual cues and a hint prompt, to produce richer soft labels for guiding the student. (b) The training objective is a weighted KL divergence, where the weight w(t) prioritizes high-order tokens via positional credit and filters unreliable supervision via entropy gating on teacher confidence.
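
To make component (a) concrete, below is a minimal sketch of how such a privileged image could be constructed. The paper specifies only a drawn bounding box plus a Gaussian soft mask; the helper name build_privileged_image, the overlay color, and the sigma_scale and alpha blending values are our illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the visually privileged context (component (a)).
# Assumptions: overlay color, sigma_scale, and alpha are illustrative.
import numpy as np
from PIL import Image, ImageDraw

def build_privileged_image(image: Image.Image, bbox, sigma_scale=0.5, alpha=0.4):
    """Overlay the target bbox and a Gaussian soft mask centered on it.

    bbox: (x0, y0, x1, y1) of the target element in pixel coordinates.
    The cues are purely visual, so exact coordinates are never leaked as text.
    """
    x0, y0, x1, y1 = bbox
    out = image.convert("RGB")

    # 1) Draw the target bounding box as a visual cue.
    ImageDraw.Draw(out).rectangle(bbox, outline=(255, 0, 0), width=3)

    # 2) Build a Gaussian soft mask centered on the box.
    w, h = out.size
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx = max(x1 - x0, 1) * sigma_scale
    sy = max(y1 - y0, 1) * sigma_scale
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2)

    # 3) Blend: dim the background, keep the target region bright.
    arr = np.asarray(out, dtype=np.float32)
    weight = (1 - alpha) + alpha * mask[..., None]  # 1 at center, 1-alpha far away
    return Image.fromarray((arr * weight).astype(np.uint8))
```

The design constraint here is that the guidance stays visual: the teacher can see where the target is, but the coordinates are never given as text, so its soft labels remain informative rather than collapsing to one-hot targets.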

Key Findings

Finding 1: Textual Privilege Causes Distribution Collapse. When the privileged hint is provided as text rather than visually, the teacher's distribution collapses into near-one-hot targets with near-zero entropy, so the distillation signal degenerates into hard-label supervision and offers negligible benefit over SFT.
Finding 2: High-Order Digits Carry the Most Reliable Signal. At the per-token level, higher-order digits (e.g., the hundreds place of a coordinate) exhibit the largest teacher-student confidence gap and carry the most reliable supervisory signal: an error in the hundreds digit displaces the prediction by up to hundreds of pixels, whereas a units-digit error barely moves it. Optimization should therefore prioritize these positions, yet the standard reverse KL treats all tokens uniformly and fails to exploit this.
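
These two findings motivate the weighting in component (b). The sketch below is a minimal PyTorch rendition of a positionally credited, entropy-gated reverse KL; the paper names the ingredients (positional credit on digit order, entropy gating on teacher confidence, reverse KL under a stop-gradient teacher), while the concrete weight formulas and the temperature tau here are our assumptions.

```python
# Minimal PyTorch sketch of the entropy-guided objective (component (b)).
# The exact weight formulas and constants below are illustrative assumptions.
import torch
import torch.nn.functional as F

def guisd_loss(student_logits, teacher_logits, digit_place, tau=1.0):
    """Weighted reverse KL between the (stop-gradient) teacher and student.

    student_logits, teacher_logits: (T, V) logits at coordinate-token positions.
    digit_place: (T,) place value of each digit token (e.g. 2 for hundreds,
                 1 for tens, 0 for units); non-digit tokens can use 0.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)           # student log-probs
    log_p_t = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher is not updated
    p_s = log_p_s.exp()

    # Reverse KL: KL(student || teacher), computed per token position.
    rkl = (p_s * (log_p_s - log_p_t)).sum(-1)                 # (T,)

    # Positional credit: higher-order digits get larger weight.
    w_pos = 10.0 ** digit_place.float()
    w_pos = w_pos / w_pos.max()

    # Entropy gate: trust low-entropy (confident) teacher positions more.
    h_t = -(log_p_t.exp() * log_p_t).sum(-1)                  # teacher entropy, (T,)
    w_ent = torch.exp(-h_t / tau)

    w = w_pos * w_ent
    return (w * rkl).sum() / w.sum().clamp_min(1e-8)
```

For a three-digit coordinate such as 347, the positional term assigns relative weights 1.0, 0.1, and 0.01 to the hundreds, tens, and units tokens before the entropy gate is applied.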

Why GUI-SD?

GRPO requires multiple expensive rollouts per prompt and yields zero reward on hard samples where no rollout succeeds. Naive OPSD forwards the policy only twice and distills via reverse KL with uniform per-token weights, but suffers from distillation-to-SFT collapse and indiscriminate optimization. GUI-SD addresses both issues through teacher privileged guidance and entropy-guided optimization (a schematic training step is sketched after Figure 1 below). By replacing sparse outcome-level rewards with dense token-level guidance, OPSD improves both training efficiency and supervision quality.

[Figure: Comparison between GRPO, naive OPSD, and GUI-SD]

Figure 1. (a) GRPO requires multiple expensive rollouts and produces zero reward on hard samples. (b) Naive OPSD forwards the policy twice and distills via reverse KL with uniform per-token weights, yet suffers from distillation-to-SFT collapse and indiscriminate optimization. (c) Ours addresses both issues via teacher privileged guidance and entropy-guided optimization.
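
Putting the pieces together, a single GUI-SD step might look like the schematic below: one on-policy rollout, then two forward passes of the same policy weights over that rollout (privileged teacher view without gradients, plain student view with gradients). Here policy, render_prompt, and digit_places are hypothetical placeholders rather than the released API; build_privileged_image and guisd_loss refer to the sketches above.

```python
# Schematic GUI-SD training step (one rollout; contrast with GRPO's k rollouts).
# `policy`, `render_prompt`, and `digit_places` are hypothetical placeholders.
import torch

def guisd_step(policy, optimizer, image, instruction, bbox):
    # 1) Single on-policy rollout from the student view (original screenshot).
    response_ids = policy.generate(render_prompt(image, instruction))

    # 2) Teacher view: identical weights, visually privileged input, no gradient.
    with torch.no_grad():
        teacher_logits = policy(
            render_prompt(build_privileged_image(image, bbox), instruction),
            response_ids,
        )

    # 3) Student view: original input, gradients enabled.
    student_logits = policy(render_prompt(image, instruction), response_ids)

    # 4) Dense token-level supervision on the rollout's coordinate digits.
    loss = guisd_loss(student_logits, teacher_logits, digit_places(response_ids))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because no reward sampling is involved, a hard example still yields a full gradient signal: the teacher's privileged view supplies supervision even when the student's rollout misses the target.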

Main Results

GUI grounding accuracy (%) on six representative benchmarks (SSP = ScreenSpot-Pro, OSW-GR = OSWorld-G-Refine); Time/epoch is hours per training epoch. GUI-SD achieves the best result in every column.

Method              Time/epoch (h)   SSP    SS2    UIV   OSW-G  OSW-GR   MMG    Avg.
Qwen3-VL-Instruct         –          53.6   93.2   25.2   58.7   67.4    83.0   63.5
+ GRPO-Binary            16.9        56.8   94.6   27.6   61.2   68.6    84.3   65.5
+ GRPO-Distance          16.7        56.6   93.8   27.5   62.1   69.9    83.3   65.5
+ GRPO-Gaussian          16.8        57.4   94.0   28.2   61.9   70.0    83.7   65.9
+ GUI-SD (Ours)           4.2        60.7   95.1   33.3   64.0   70.9    86.7   68.4

  • +2.5 average accuracy over the best GRPO baseline (68.4 vs. 65.9)
  • ~4x faster training per epoch (4.2h vs. 16.8h)

Comparison with SOTA Methods

GUI-SD surpasses existing GUI grounding methods in average accuracy across six representative benchmarks:

  • On ScreenSpot-Pro, GUI-SD achieves 60.7%, outperforming Propose-then-Critic (58.7%), which relies on test-time scaling.
  • On OSWorld-G-Refine, GUI-SD reaches 70.9%, surpassing ZwZ (69.0%), which leverages large teacher models (Qwen3-VL-235B).
  • Notably, GUI-SD achieves these improvements through a self-distillation framework that uses only the policy model itself as the teacher, without relying on test-time scaling or external large-scale models.

BibTeX

@article{zhang2025guisd,
  title     = {Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding},
  author    = {Zhang, Yan and Wu, Daiqing and Shen, Huawen and Zhou, Yu and Ma, Can},
  journal   = {arXiv preprint arXiv:2605.00642},
  year      = {2025},
}