UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents

1CUHK MMLab, 2vivo AI Lab, 3Princeton University
4Shenzhen Loop Area Institute, 5Shanghai AI Lab
*Equal Contribution, Corresponding Author

Comparison of RL paradigms for GUI agents. (a) Standard Online RL suffers from sparse rewards. (b) Experience Replay and (c) Dense Reward address sample efficiency and credit assignment respectively, but both lack mechanisms for Cross-Task Transfer. (d) Our Framework introduces an Evolving Memory that provides hierarchical guidance for exploration and continuously updates itself by abstracting successful plans and failure patterns from new trajectories, enabling cross-task knowledge transfer.

Abstract

Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer.

To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent's evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications.

Method


Overview of the proposed UI-Mem framework. Given a task instruction, the agent retrieves hierarchical experience including Workflows, Subtask Skills, and Failure Patterns. We employ Stratified Group Sampling to generate a group of trajectories under varying levels of guidance, enabling effective advantage estimation for Policy Optimization. Finally, a Self-Evolving Loop extracts abstract plans from successful trajectories and diagnoses from failures to update the memory.


Hierarchical Experience Memory

UI-Mem constructs a structured memory pool that stores reusable workflows, subtask skills, and failure patterns as parameterized templates. This hierarchical structure allows the agent to retrieve relevant past experience and instantiate it to form specific plans when facing novel tasks.
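
For concreteness, the three levels of stored experience can be pictured as parameterized templates roughly like the sketch below. The Python schema is purely illustrative; all class and field names are our own assumptions rather than the exact format used by UI-Mem.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Illustrative schema for the three levels of experience memory.
# Names and fields are assumptions for exposition, not UI-Mem's exact format.

@dataclass
class Workflow:
    """High-level plan template, e.g. 'check the weather in {city}'."""
    task_template: str          # instruction pattern with slot variables
    steps: list[str]            # ordered subtask descriptions, also with slots

@dataclass
class SubtaskSkill:
    """Reusable low-level routine, e.g. 'pick a date in a date picker'."""
    name: str
    action_sequence: list[str]  # parameterized UI actions

@dataclass
class FailurePattern:
    """Abstracted error plus how to avoid repeating it."""
    symptom: str                # e.g. 'pressed Back instead of opening the list'
    correction: str             # guideline injected into later rollouts

@dataclass
class ExperienceMemory:
    workflows: list[Workflow] = field(default_factory=list)
    skills: dict[str, SubtaskSkill] = field(default_factory=dict)
    failures: list[FailurePattern] = field(default_factory=list)
```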


Illustration of the Hierarchical Experience Retrieval process. Given a task instruction, the system performs template matching to extract specific variables (e.g., city names) and instantiates the retrieved experience to form a concrete plan.
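
A minimal sketch of this retrieval-and-instantiation step, assuming workflow templates mark their variables with curly-brace slots; the template text, slot names, and the instantiate helper are hypothetical examples rather than UI-Mem's actual matcher.

```python
import re

# Hypothetical retrieved workflow: slots such as {city} mark task variables.
template = "Check the weather in {city} using the Weather app"
steps = [
    "Open the Weather app",
    "Tap the search bar and type {city}",
    "Select {city} from the results",
]

def instantiate(template: str, steps: list, instruction: str):
    """Match the instruction against the template and fill slot values into the plan."""
    # Turn each "{slot}" into a named regex group; keep the remaining text literal.
    pattern = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>.+)", re.escape(template))
    match = re.fullmatch(pattern, instruction)
    if match is None:
        return None                       # this template does not apply
    slots = match.groupdict()             # e.g. {"city": "Tokyo"}
    return [s.format(**slots) for s in steps]

plan = instantiate(template, steps, "Check the weather in Tokyo using the Weather app")
# plan -> ["Open the Weather app", "Tap the search bar and type Tokyo",
#          "Select Tokyo from the results"]
```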

Memory-Guided Exploration

To leverage memory effectively, we introduce Stratified Group Sampling, which injects different strengths of memory guidance (Strong, Weak, and No Guidance) into the same GRPO rollout group:

  • Strong Guidance: Provides full hierarchical plans to stabilize training and ensure high-quality trajectories.
  • Weak Guidance: Provides only high-level workflows, forcing the agent to learn low-level execution details.
  • No Guidance: Encourages pure exploration to provide an unbiased estimate of the agent's internalized policy.

This strategy facilitates effective advantage estimation while preventing the agent from becoming dependent on external guidance.
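
The sketch below shows how one such rollout group could be assembled and scored in a GRPO-style setup. The group size, the 2/2/4 split across guidance levels, and the memory helpers (full_plan, workflow_only) are illustrative assumptions, not the configuration reported in the paper.

```python
import random
import statistics

def build_rollout_group(task, memory, group_size=8, mix=(2, 2, 4)):
    """Assign a guidance level to each trajectory slot in one rollout group.

    mix = (#strong, #weak, #none); the exact split here is an assumption.
    """
    n_strong, n_weak, n_none = mix
    assert n_strong + n_weak + n_none == group_size
    levels = ["strong"] * n_strong + ["weak"] * n_weak + ["none"] * n_none
    random.shuffle(levels)

    prompts = []
    for level in levels:
        if level == "strong":
            guidance = memory.full_plan(task)      # workflow + skills + failure warnings
        elif level == "weak":
            guidance = memory.workflow_only(task)  # high-level steps only
        else:
            guidance = ""                          # pure exploration
        prompts.append({"task": task, "guidance": guidance, "level": level})
    return prompts

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0        # guard against an all-equal group
    return [(r - mean) / std for r in rewards]
```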

Self-Evolving Loop

Finally, the Self-Evolving Loop continuously refines the memory by extracting novel experience from the newly collected trajectories. This enables progressive improvement and cross-task transfer.


The Self-Evolving Loop. Successful plans and failure causes are extracted from new trajectories to continually refine the memory and guide the next round of rollouts.
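
In pseudocode terms, one update round of the loop might look like the sketch below. The abstractor stands in for the LLM-based summarizer implied by the figure; its method names and the reward-1.0 success criterion are assumptions made here for illustration (the reward scale follows the qualitative examples later on this page).

```python
def self_evolve(memory, trajectories, abstractor):
    """One memory-update round after a batch of rollouts.

    `abstractor` is a hypothetical LLM-based summarizer that turns raw
    trajectories into parameterized templates or failure diagnoses.
    """
    for traj in trajectories:
        if traj["reward"] >= 1.0:                                  # fully successful episode
            memory.add_workflow(abstractor.abstract_plan(traj))    # high-level plan template
            for skill in abstractor.extract_skills(traj):          # reusable subtask routines
                memory.add_skill(skill)
        else:
            memory.add_failure(abstractor.diagnose(traj))          # where and why it failed
    memory.deduplicate()                                           # keep the pool compact
    return memory
```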

Experimental Results

Performance comparison on the AndroidWorld benchmark. * denotes inference-time memory retrieval.

Model                 Params   Success Rate (%)
Seed1.5-VL            -        62.1
UI-Tars-1.5           -        64.2
Gemini-2.5-Pro        -        69.7
Seed1.8               -        70.7
MAI-UI-2B             2B       49.1
Ferret-UI Lite-3B     3B       28.0
Qwen3-VL-4B (Base)    4B       45.3
UI-Mem-4B (Ours)      4B       58.2
UI-Mem-4B* (Ours)     4B       62.5
GUI-Owl-7B            7B       66.4
Step-GUI-8B           8B       67.7
Qwen3-VL-8B (Base)    8B       47.6
UI-Mem-8B (Ours)      8B       66.8
UI-Mem-8B* (Ours)     8B       71.1

Performance comparison on the AndroidLab benchmark. * denotes inference-time memory retrieval.

Model                 Sub-Goal SR (%)   Reasonable Op. Ratio (%)   Success Rate (%)
GPT-4o                35.0              85.4                       31.2
AutoGLM               -                 -                          36.2
UI-Genie-Agent-3B     35.4              90.6                       28.8
Qwen3-VL-4B (Base)    48.2              90.5                       37.0
UI-Mem-4B (Ours)      49.5              93.5                       37.7
UI-Mem-4B* (Ours)     51.9              94.6                       39.9
UI-Genie-Agent-7B     46.3              91.4                       38.7
UI-TARS-1.5-7B        49.4              92.5                       40.6
MobileRL (7B)         -                 -                          42.5
Qwen3-VL-8B (Base)    45.3              91.8                       34.8
UI-Mem-8B (Ours)      52.7              90.9                       43.5
UI-Mem-8B* (Ours)     56.0              94.9                       44.9

Qualitative Analysis

Impact of Memory Guidance


Trajectory analysis on the task "Create a new contact...". Top (Reward 1.0): full memory guidance yields a perfect execution. Middle (Reward 0.6): weak guidance leads to an incomplete execution in which the agent omits the contact name. Bottom (Reward 0): without memory guidance, the agent fails entirely.


Error Correction via Failure Diagnosis


Visualizing the Failure Diagnosis mechanism. The system identifies the navigation error in the first rollout (navigating back instead of entering the list) and generates a specific Correction Guideline, which enables success in the second rollout.
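
As a rough sketch of how such a correction guideline could be fed back into the prompt for the next rollout (the prompt wording is invented, and the failure objects reuse the illustrative FailurePattern schema sketched in the Method section):

```python
def inject_corrections(base_prompt: str, failure_patterns: list) -> str:
    """Append retrieved failure diagnoses as explicit guidelines for the next rollout."""
    if not failure_patterns:
        return base_prompt
    guidelines = "\n".join(
        f"- Avoid: {p.symptom}; instead: {p.correction}"
        for p in failure_patterns
    )
    return f"{base_prompt}\n\nLessons from earlier attempts:\n{guidelines}"
```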

BibTeX

@article{xiao2026uimem,
  title={UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents},
  author={Xiao, Han and Wang, Guozhi and Wang, Hao and Liu, Shilong and Chai, Yuxiang and Pan, Yue and Zhou, Yufeng and Chen, Xiaoxin and Wen, Yafei and Li, Hongsheng},
  journal={arXiv preprint},
  year={2026}
}