Oppo Open-Sources X-OmniClaw: An On-Device Android Agent That Runs Camera, Screen, and Voice Locally
Oppo's Multi-X team has published X-OmniClaw on GitHub under an Apache 2.0 licence. The edge-native Android AI agent integrates camera, screen recognition, and voice, processes data on-device, and calls cloud LLMs only for high-level reasoning, making it the first fully open-source multimodal mobile agent from a major Android OEM.

Oppo's Multi-X research team released X-OmniClaw on GitHub on May 17, 2026, under an Apache 2.0 licence — an Android AI agent designed to operate directly on the physical device, combining camera input, on-screen UI analysis, and voice commands to automate tasks across apps without routing sensitive user data to remote servers for processing.
Architecture: Three Interlocking Layers
The technical report accompanying the release, published as arXiv 2605.05765, describes X-OmniClaw through three components that correspond to perception, memory, and execution.
Omni Perception functions as a unified ingress pipeline that ingests screen state in XML, real-world visual context through the camera, and speech input through the microphone, aligning them temporally before handing off to a vision-language model for interpretation. This is what allows the agent to simultaneously understand what is on the screen, what the camera sees, and what the user said — treating them as a single, coherent observation rather than three separate signals.
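The temporal alignment step can be illustrated with a minimal sketch. It is not taken from the X-OmniClaw code: the `Event` and `Observation` types, the `align` function, the choice of voice as the anchor stream, and the one-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

SOURCES = ("screen", "camera", "voice")


@dataclass
class Event:
    source: str       # one of SOURCES
    timestamp: float  # seconds since the session started
    payload: str      # XML dump, frame id, or transcript (hypothetical)


@dataclass
class Observation:
    timestamp: float
    screen: Optional[str] = None
    camera: Optional[str] = None
    voice: Optional[str] = None


def align(events, anchor_source="voice", window=1.0):
    """Build one Observation per anchor event.

    For every other source, attach the event closest in time to the
    anchor, provided it falls within `window` seconds of it.
    """
    anchors = sorted(
        (e for e in events if e.source == anchor_source),
        key=lambda e: e.timestamp,
    )
    observations = []
    for anchor in anchors:
        obs = Observation(timestamp=anchor.timestamp)
        setattr(obs, anchor.source, anchor.payload)
        for source in SOURCES:
            if source == anchor.source:
                continue
            candidates = [
                e for e in events
                if e.source == source
                and abs(e.timestamp - anchor.timestamp) <= window
            ]
            if candidates:
                nearest = min(
                    candidates,
                    key=lambda e: abs(e.timestamp - anchor.timestamp),
                )
                setattr(obs, source, nearest.payload)
        observations.append(obs)
    return observations
```

In this sketch a spoken command pulls in the screen dump and camera frame captured closest to it, so a downstream model receives one merged observation instead of three unrelated streams.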
Omni Memory operates on two levels. Working memory maintains continuity across the steps of an active task. Long-term personal memory is distilled from device data — gallery contents, app usage patterns, saved workflows — with filtering applied before storage to prevent sensitive information from being retained. Photos in the gallery, for instance, are converted into searchable semantic descriptions stored in a local text index rather than uploaded as images.
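The filter-then-index idea behind long-term memory might look roughly like the sketch below. It assumes the photo descriptions have already been generated as text; the `LocalMemoryIndex` class, the regular expressions used as a sensitivity filter, and the keyword-based search are hypothetical stand-ins, not the project's actual filtering rules or index format.

```python
import re
from collections import defaultdict

# Hypothetical sensitivity rules: long digit runs (card-like numbers)
# and email addresses. The real filter is not described in the article.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{13,19}\b"),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
]


def is_sensitive(text):
    return any(p.search(text) for p in SENSITIVE_PATTERNS)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class LocalMemoryIndex:
    """In-memory inverted index over text descriptions of photos."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._descriptions = {}

    def add(self, photo_id, description):
        """Store a description unless the filter rejects it."""
        if is_sensitive(description):
            return False
        self._descriptions[photo_id] = description
        for token in tokenize(description):
            self._postings[token].add(photo_id)
        return True

    def search(self, query):
        """Return ids of photos whose description has every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        matches = set.intersection(
            *(self._postings.get(t, set()) for t in tokens)
        )
        return sorted(matches)
```

A request such as "make an album of beach photos" would then be answered by querying the text index, with no raw image leaving the device.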
Omni Action executes tasks using a hybrid UI grounding approach that combines XML structural metadata, visual element recognition, and optical character recognition. The combination is significant: XML alone is unreliable in ad-heavy interfaces where element boundaries shift; visual grounding alone is slow. Switching between them based on context gives the agent more robust performance across the full range of apps a typical Android user might run.
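A simplified sketch of hybrid grounding follows. It tries the XML hierarchy first and falls back to OCR results only when the label is absent from the XML, which is one plausible policy rather than the switching logic the paper describes. The XML layout is modelled on a uiautomator-style dump with `text` and `bounds` attributes, and the OCR boxes are assumed to be supplied by some other component as `(text, (x1, y1, x2, y2))` tuples.

```python
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def centre(bounds):
    """Return the tap point at the middle of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bounds
    return ((x1 + x2) // 2, (y1 + y2) // 2)


def ground_via_xml(xml_dump, label):
    """Find a node whose text equals the label, ignoring case."""
    root = ET.fromstring(xml_dump)
    for node in root.iter("node"):
        if node.get("text", "").strip().lower() == label.lower():
            match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
            if match:
                return centre(tuple(int(g) for g in match.groups()))
    return None


def ground_via_ocr(ocr_boxes, label):
    """Fallback: match the label against OCR text boxes."""
    for text, bounds in ocr_boxes:
        if text.strip().lower() == label.lower():
            return centre(bounds)
    return None


def ground(label, xml_dump, ocr_boxes):
    """Return (method, tap_point), or None if the label is not found."""
    point = ground_via_xml(xml_dump, label)
    if point is not None:
        return ("xml", point)
    point = ground_via_ocr(ocr_boxes, label)
    if point is not None:
        return ("ocr", point)
    return None
```

The fast structural path handles well-behaved screens, while the slower visual path is reserved for elements the XML does not expose, such as text rendered inside an ad banner.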
A behaviour cloning module records user navigation sequences and extracts the underlying Android Intent and deeplink parameters, converting them into reusable skill trajectories. When the agent needs to reach a deeply nested screen — say, a specific product category inside an e-commerce app — it can replay the Intent directly rather than re-executing the full tap sequence, which would be fragile against UI updates.
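The extraction step could be sketched as below. The recorded-step dictionaries, the `Skill` type, and the `shop://` deeplink are invented for illustration; only the idea of capturing a deeplink's parameters and replaying them with new values comes from the article.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


@dataclass(frozen=True)
class Skill:
    name: str
    action: str    # Android Intent action string
    scheme: str
    host: str
    path: str
    params: tuple  # (key, value) pairs taken from the query string


def extract_skill(name, recorded_steps):
    """Build a Skill from the last recorded step that carries a deeplink."""
    for step in reversed(recorded_steps):
        uri = step.get("deeplink")
        if uri:
            parts = urlsplit(uri)
            return Skill(
                name=name,
                action=step.get("action", "android.intent.action.VIEW"),
                scheme=parts.scheme,
                host=parts.netloc,
                path=parts.path,
                params=tuple(parse_qsl(parts.query)),
            )
    return None


def replay_uri(skill, **overrides):
    """Rebuild the deeplink, optionally replacing some parameters."""
    params = dict(skill.params)
    params.update(overrides)
    return urlunsplit(
        (skill.scheme, skill.host, skill.path, urlencode(params), "")
    )
```

Once a skill is stored, reaching a different category only requires changing one parameter, for example `replay_uri(skill, id="7")`, instead of repeating every tap.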
What It Can Do in Practice
The release demonstrates three concrete scenarios. In the first, a user photographs a product in the real world and asks X-OmniClaw to find the cheapest available price; the agent opens a shopping app, searches using the camera input, and compares results without the user specifying which app to use or how to navigate it. In the second, acting as a "ScreenAvatar" companion, it watches a maths exercise on screen and provides step-by-step guidance. In the third, it creates a photo album from gallery contents based on a spoken description, querying its local semantic memory index rather than scanning raw image files.
The system requires Android 8.0 or later and is written primarily in Kotlin, with a Python component for the cloud reasoning integration. It supports multiple LLM provider backends, including OpenRouter, Anthropic, OpenAI, and Ollama, so users can route the reasoning tier to a local model for fully offline operation or to a cloud API for more capable responses.
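How a reasoning tier might choose between these backends can be shown with a small routing sketch. The `Backend` type, the `choose_backend` function, and its fallback policy are assumptions; the project's real configuration format is not described here. Ollama is treated as the local option because it serves models on the user's own machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    local: bool  # True if the model runs without a network connection


# Provider names come from the article; the ordering is an assumption.
BACKENDS = [
    Backend("ollama", local=True),
    Backend("openrouter", local=False),
    Backend("anthropic", local=False),
    Backend("openai", local=False),
]


def choose_backend(preferred, offline_only=False, network_available=True):
    """Pick the preferred backend if allowed, else the first usable one."""
    local_only = offline_only or not network_available
    candidates = [b for b in BACKENDS if b.local or not local_only]
    if not candidates:
        raise RuntimeError("no usable reasoning backend")
    for backend in candidates:
        if backend.name == preferred:
            return backend
    return candidates[0]
```

Under this policy a user who prefers a cloud provider still gets a working agent when the connection drops, at the cost of a less capable local model.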
Significance for the Open-Source Mobile Agent Ecosystem
Mobile AI agents have been an active research area for several years, but the field has largely been dominated by cloud-orchestrated approaches — frameworks that stream screenshots to a remote server, receive action instructions, and replay them. These architectures work but require a continuous internet connection, introduce latency, and create privacy exposure at every step.
X-OmniClaw's edge-first design is a genuine architectural departure. By keeping perception and memory on-device and using cloud LLMs only for the planning and reasoning tier (the step that requires broad world knowledge and multi-step logic), it considerably reduces the amount of user data that leaves the device. The distinction matters not just for privacy but for usability in low-connectivity environments.
The release is also notable for being the first fully open-source multimodal mobile agent from a major Android OEM. Google, Samsung, and Xiaomi have all developed proprietary on-device agent capabilities, but have not published the underlying architectures or released them under permissive licences. Oppo's code is published under Apache 2.0; the Creative Commons licence noted on the project page applies only to the documentation. The code can therefore be incorporated into derivative products, including by competitors.
The repository had 140 stars and 17 forks within the first day of wider circulation on Hacker News. The team indicated in the paper that future work will focus on self-evolving skill acquisition, dynamic memory consolidation, and improved balancing of on-device versus cloud processing — the last of which is likely to become increasingly relevant as on-device foundation models grow more capable.