UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents

1CUHK MMLab, 2vivo AI Lab, 3Princeton University
4Shenzhen Loop Area Institute, 5Shanghai AI Lab
*Equal Contribution, Corresponding Author

Comparison of RL paradigms for GUI agents. (a) Standard Online RL suffers from sparse rewards. (b) Experience Replay and (c) Dense Reward address sample efficiency and credit assignment respectively, but both lack mechanisms for Cross-Task Transfer. (d) Our Framework introduces an Evolving Memory that provides hierarchical guidance for exploration and continuously updates itself by abstracting successful plans and failure patterns from new trajectories, enabling cross-task knowledge transfer.

Abstract

Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer.

To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent's evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications.

Method


Overview of the proposed UI-Mem framework. Given a task instruction, the agent retrieves hierarchical experience including Workflows, Subtask Skills, and Failure Patterns. We employ Stratified Group Sampling to generate a group of trajectories under varying levels of guidance, enabling effective advantage estimation for Policy Optimization. Finally, a Self-Evolving Loop extracts abstract plans from successful trajectories and diagnoses from failures to update the memory.


Hierarchical Experience Memory

UI-Mem constructs a structured memory pool that stores reusable workflows, subtask skills, and failure patterns as parameterized templates. This hierarchical structure allows the agent to retrieve relevant past experience and instantiate it to form specific plans when facing novel tasks.
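
For concreteness, the three levels of stored experience can be pictured as parameterized templates roughly like the sketch below. The Python schema is purely illustrative; all class and field names are our own assumptions rather than the exact format used by UI-Mem.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Illustrative schema for the three levels of experience memory.
# Names and fields are assumptions for exposition, not UI-Mem's exact format.

@dataclass
class Workflow:
    """High-level plan template, e.g. 'check the weather in {city}'."""
    task_template: str          # instruction pattern with slot variables
    steps: list[str]            # ordered subtask descriptions, also with slots

@dataclass
class SubtaskSkill:
    """Reusable low-level routine, e.g. 'pick a date in a date picker'."""
    name: str
    action_sequence: list[str]  # parameterized UI actions

@dataclass
class FailurePattern:
    """Abstracted error plus how to avoid repeating it."""
    symptom: str                # e.g. 'pressed Back instead of opening the list'
    correction: str             # guideline injected into later rollouts

@dataclass
class ExperienceMemory:
    workflows: list[Workflow] = field(default_factory=list)
    skills: dict[str, SubtaskSkill] = field(default_factory=dict)
    failures: list[FailurePattern] = field(default_factory=list)
```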


Illustration of the Hierarchical Experience Retrieval process. Given a task instruction, the system performs template matching to extract specific variables (e.g., city names) and instantiates the retrieved experience to form a concrete plan.
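
A minimal sketch of this retrieval-and-instantiation step, assuming workflow templates mark their variables with curly-brace slots; the template text, slot names, and the instantiate helper are hypothetical examples rather than UI-Mem's actual matcher.

```python
import re

# Hypothetical retrieved workflow: slots such as {city} mark task variables.
template = "Check the weather in {city} using the Weather app"
steps = [
    "Open the Weather app",
    "Tap the search bar and type {city}",
    "Select {city} from the results",
]

def instantiate(template: str, steps: list, instruction: str):
    """Match the instruction against the template and fill slot values into the plan."""
    # Turn each "{slot}" into a named regex group; keep the remaining text literal.
    pattern = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>.+)", re.escape(template))
    match = re.fullmatch(pattern, instruction)
    if match is None:
        return None                       # this template does not apply
    slots = match.groupdict()             # e.g. {"city": "Tokyo"}
    return [s.format(**slots) for s in steps]

plan = instantiate(template, steps, "Check the weather in Tokyo using the Weather app")
# plan -> ["Open the Weather app", "Tap the search bar and type Tokyo",
#          "Select Tokyo from the results"]
```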

Memory-Guided Exploration

To leverage memory effectively, we introduce Stratified Group Sampling, which injects different strengths of memory guidance (Strong, Weak, and No Guidance) into the same GRPO rollout group:

  • Strong Guidance: Provides full hierarchical plans to stabilize training and ensure high-quality trajectories.
  • Weak Guidance: Provides only high-level workflows, forcing the agent to learn low-level execution details.
  • No Guidance: Encourages pure exploration to provide an unbiased estimate of the agent's internalized policy.

This strategy facilitates effective advantage estimation while preventing the agent from becoming dependent on external guidance.
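
The sketch below shows how one such rollout group could be assembled and scored in a GRPO-style setup. The group size, the 2/2/4 split across guidance levels, and the memory helpers (full_plan, workflow_only) are illustrative assumptions, not the configuration reported in the paper.

```python
import random
import statistics

def build_rollout_group(task, memory, group_size=8, mix=(2, 2, 4)):
    """Assign a guidance level to each trajectory slot in one rollout group.

    mix = (#strong, #weak, #none); the exact split here is an assumption.
    """
    n_strong, n_weak, n_none = mix
    assert n_strong + n_weak + n_none == group_size
    levels = ["strong"] * n_strong + ["weak"] * n_weak + ["none"] * n_none
    random.shuffle(levels)

    prompts = []
    for level in levels:
        if level == "strong":
            guidance = memory.full_plan(task)      # workflow + skills + failure warnings
        elif level == "weak":
            guidance = memory.workflow_only(task)  # high-level steps only
        else:
            guidance = ""                          # pure exploration
        prompts.append({"task": task, "guidance": guidance, "level": level})
    return prompts

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0        # guard against an all-equal group
    return [(r - mean) / std for r in rewards]
```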

Self-Evolving Loop

Finally, the Self-Evolving Loop continuously refines the memory by extracting novel experience from the newly collected trajectories. This enables progressive improvement and cross-task transfer.


The Self-Evolving Loop. Successful plans and failure causes are extracted from new trajectories to continually refine the memory and guide the next round of rollouts.
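
In pseudocode terms, one update round of the loop might look like the sketch below. The abstractor stands in for the LLM-based summarizer implied by the figure; its method names and the reward-1.0 success criterion are assumptions made here for illustration (the reward scale follows the qualitative examples later on this page).

```python
def self_evolve(memory, trajectories, abstractor):
    """One memory-update round after a batch of rollouts.

    `abstractor` is a hypothetical LLM-based summarizer that turns raw
    trajectories into parameterized templates or failure diagnoses.
    """
    for traj in trajectories:
        if traj["reward"] >= 1.0:                                  # fully successful episode
            memory.add_workflow(abstractor.abstract_plan(traj))    # high-level plan template
            for skill in abstractor.extract_skills(traj):          # reusable subtask routines
                memory.add_skill(skill)
        else:
            memory.add_failure(abstractor.diagnose(traj))          # where and why it failed
    memory.deduplicate()                                           # keep the pool compact
    return memory
```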

Experimental Results

Performance comparison on the AndroidWorld benchmark. * denotes inference-time memory retrieval.

Model                 Params   Success Rate (%)
Seed1.5-VL            -        62.1
UI-Tars-1.5           -        64.2
Gemini-2.5-Pro        -        69.7
Seed1.8               -        70.7
MAI-UI-2B             2B       49.1
Ferret-UI Lite-3B     3B       28.0
Qwen3-VL-4B (Base)    4B       45.3
UI-Mem-4B (Ours)      4B       58.2
UI-Mem-4B* (Ours)     4B       62.5
GUI-Owl-7B            7B       66.4
Step-GUI-8B           8B       67.7
Qwen3-VL-8B (Base)    8B       47.6
UI-Mem-8B (Ours)      8B       66.8
UI-Mem-8B* (Ours)     8B       71.1

Performance comparison on the AndroidLab benchmark. * denotes inference-time memory retrieval.

Model                 Sub-Goal SR (%)   Reasonable Op. Ratio (%)   Success Rate (%)
GPT-4o                35.0              85.4                       31.2
AutoGLM               -                 -                          36.2
UI-Genie-Agent-3B     35.4              90.6                       28.8
Qwen3-VL-4B (Base)    48.2              90.5                       37.0
UI-Mem-4B (Ours)      49.5              93.5                       37.7
UI-Mem-4B* (Ours)     51.9              94.6                       39.9
UI-Genie-Agent-7B     46.3              91.4                       38.7
UI-TARS-1.5-7B        49.4              92.5                       40.6
MobileRL (7B)         -                 -                          42.5
Qwen3-VL-8B (Base)    45.3              91.8                       34.8
UI-Mem-8B (Ours)      52.7              90.9                       43.5
UI-Mem-8B* (Ours)     56.0              94.9                       44.9

Qualitative Analysis

Impact of Memory Guidance


Trajectory analysis on the task "Create a new contact...". Top (Reward 1.0): full memory guidance yields a perfect execution. Middle (Reward 0.6): weak guidance leads to an incomplete execution in which the agent omits the contact name. Bottom (Reward 0): without memory guidance, the agent fails entirely.


Error Correction via Failure Diagnosis


Visualizing the Failure Diagnosis mechanism. The system identifies the navigation error in the first rollout (navigating back instead of entering the list) and generates a specific Correction Guideline, which enables success in the second rollout.
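
As a rough sketch of how such a correction guideline could be fed back into the prompt for the next rollout (the prompt wording is invented, and the failure objects reuse the illustrative FailurePattern schema sketched in the Method section):

```python
def inject_corrections(base_prompt: str, failure_patterns: list) -> str:
    """Append retrieved failure diagnoses as explicit guidelines for the next rollout."""
    if not failure_patterns:
        return base_prompt
    guidelines = "\n".join(
        f"- Avoid: {p.symptom}; instead: {p.correction}"
        for p in failure_patterns
    )
    return f"{base_prompt}\n\nLessons from earlier attempts:\n{guidelines}"
```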

BibTeX

@article{xiao2026uimem,
  title={UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents},
  author={Xiao, Han and Wang, Guozhi and Wang, Hao and Liu, Shilong and Chai, Yuxiang and Pan, Yue and Zhou, Yufeng and Chen, Xiaoxin and Wen, Yafei and Li, Hongsheng},
  journal={arXiv preprint},
  year={2026}
}