
MMSkills: Towards Multimodal Skills for General Visual Agents

Kangning Zhang1,2‡*, Shuai Shao1,2‡*, Qingyao Li1,2*, Jianghao Lin1, Lingyue Fu1, Shijian Wang3,
Wenxiang Jiao2,✉, Yuan Lu2, ✉, Weiwen Liu1,✉, Weinan Zhang1,✉, Yong Yu1,✉

1Shanghai Jiao Tong University   2Xiaohongshu Inc.   3Southeast University
‡Equal contribution. *Work done during internship at Xiaohongshu Inc. ✉Corresponding authors.

Overview of the MMSkills framework

MMSkills turns public visual-agent trajectories into reusable multimodal skill packages with text procedures, runtime state cards, and multi-view keyframes. At inference time, branch loading lets agents inspect only the relevant skill evidence before acting in the live environment.

What MMSkills Provides

Multimodal skill packages

Each MMSkill binds a reusable textual procedure to state cards and visual keyframes that specify when to use the procedure, when to skip it, and how to verify progress.
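In code, such a package might be modeled as a small record type. The sketch below is a minimal illustration; every class and field name is an assumption for exposition, not the repository's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class StateCard:
    """One runtime checkpoint with its visual evidence (illustrative schema)."""
    name: str                 # e.g. "chart_dialog_open"
    when_to_use: str          # trigger condition for this step
    when_to_skip: str         # condition under which the step is unnecessary
    verify: str               # how to confirm the step succeeded
    keyframes: list = field(default_factory=list)  # paths to multi-view screenshots

@dataclass
class MMSkill:
    skill_id: str
    procedure: str            # reusable textual procedure
    state_cards: list = field(default_factory=list)

# A minimal instance for a chart-creation skill.
skill = MMSkill(
    skill_id="calc.insert_chart",
    procedure="Select the data range, open Insert > Chart, pick a type, confirm.",
    state_cards=[StateCard(
        name="chart_dialog_open",
        when_to_use="The Chart Wizard dialog is not yet visible.",
        when_to_skip="The dialog is already open on the active sheet.",
        verify="The dialog title reads 'Chart Wizard'.",
        keyframes=["keyframes/chart_dialog_front.png"],
    )],
)
```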

Trajectory-to-skill generator

Public non-evaluation trajectories are grouped, merged, drafted, grounded, and audited into compact reusable skills rather than stored as raw demonstrations.
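The five stages can be sketched as a toy pipeline over dict-shaped trajectories. This is only an illustration of the group/merge/draft/ground/audit flow; the real generator uses an LLM for drafting and auditing, and none of the helper logic below comes from the repository.

```python
from collections import defaultdict

def trajectories_to_skills(trajectories):
    """Illustrative pipeline; each trajectory is {'task': str, 'steps': [str]}."""
    # 1. Group trajectories by task family.
    groups = defaultdict(list)
    for traj in trajectories:
        groups[traj["task"]].append(traj)

    skills = []
    for task, group in groups.items():
        # 2. Merge: keep the shortest demonstration as the canonical trace.
        merged = min(group, key=lambda t: len(t["steps"]))
        # 3. Draft a textual procedure from the merged steps.
        procedure = " -> ".join(merged["steps"])
        # 4. Ground: attach one placeholder state card per step.
        cards = [{"step": s, "verify": f"confirm '{s}' took effect"}
                 for s in merged["steps"]]
        # 5. Audit: drop degenerate single-step skills before indexing.
        if len(cards) >= 2:
            skills.append({"skill_id": task, "procedure": procedure,
                           "state_cards": cards})
    return skills

demo = [
    {"task": "calc.insert_chart",
     "steps": ["select range", "open Insert > Chart", "confirm"]},
    {"task": "calc.insert_chart",
     "steps": ["select range", "open menu", "open Insert > Chart", "confirm"]},
]
print(trajectories_to_skills(demo)[0]["procedure"])
# → select range -> open Insert > Chart -> confirm
```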

Branch-loaded utilization

A temporary branch selects relevant state cards and views, aligns them with the live screenshot, and returns structured guidance to the main agent.
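A rough sketch of that selection step, assuming state cards and the live screenshot can both be reduced to text tags; the scoring rule and all data shapes here are hypothetical, chosen only to make the idea concrete.

```python
def branch_load(skill, live_screenshot_tags, top_k=2):
    """Score each state card against tags from the live screenshot and
    return only the most relevant guidance (illustrative, not the real API)."""
    def score(card):
        # Crude relevance: word overlap between the card trigger and screen tags.
        return len(set(card["when_to_use"].lower().split()) & set(live_screenshot_tags))

    selected = sorted(skill["state_cards"], key=score, reverse=True)[:top_k]
    return {"skill_id": skill["skill_id"],
            "guidance": [c["when_to_use"] for c in selected]}

skill = {"skill_id": "calc.insert_chart", "state_cards": [
    {"when_to_use": "the chart dialog is not visible"},
    {"when_to_use": "the data range is not selected"},
]}
print(branch_load(skill, ["dialog", "visible", "sheet"], top_k=1)["guidance"])
# → ['the chart dialog is not visible']
```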

Case Studies

Four OSWorld case studies compare no-skill, text-only skill, and multimodal MMSkills runs with the same task instruction. The examples show how visual state cards help agents avoid repeated clicks, wrong routes, and misplaced GUI actions.

A Concrete MMSkill Package

Concrete MMSkills example with skill package and three usage modes

A concrete MMSkills example. A multimodal skill package combines a textual procedure, runtime state cards, and multi-view visual evidence. For the same chart-creation task, text-only guidance can miss the active sheet state, while branch-loaded MMSkills align skill evidence with the live screen and return state-aware guidance for the main agent.

Ubuntu Skill Library

The public Skill Library currently indexes 247 Ubuntu desktop MMSkills spanning browser, office, system, code editor, email, media, image-editing, and cross-application workflows.

Chrome · GIMP · LibreOffice Calc · LibreOffice Impress · LibreOffice Writer · Multi-App Workflows · OS · Thunderbird · VLC · VS Code


Results Overview

Across GUI and game-based visual-agent benchmarks, MMSkills improves performance over no-skill and text-only skill conditions, especially for smaller or click-heavy visual agents.

| Benchmark | Model                | No skill | MMSkills | Gain   |
|-----------|----------------------|----------|----------|--------|
| OSWorld   | Gemini 3.1 Pro       | 44.08    | 50.11    | +6.03  |
| OSWorld   | Gemini 3 Flash       | 36.65    | 47.97    | +11.32 |
| OSWorld   | Qwen3-VL-235B        | 21.34    | 39.17    | +17.83 |
| OSWorld   | Qwen3-VL-8B-Instruct | 10.78    | 25.40    | +14.62 |
Ablation results for MMSkills

Ablations show that runtime state cards, visual keyframes, branch loading, and view selection all contribute to effective multimodal skill use.

Behavioral Shift

MMSkills does more than raise the final success rate. It changes how agents behave: fewer low-level primitives, fewer repeated actions, stronger completion judgments, and more structured, state-aware execution.

Behavior shifts induced by MMSkills
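One of these shifts, fewer repeated actions, can be measured directly on an action trace. The helper below is a hypothetical metric written only to illustrate the idea; it is not the paper's definition.

```python
def repeated_action_rate(actions):
    """Fraction of transitions that exactly repeat the previous action
    (illustrative metric for behavioral analysis)."""
    if len(actions) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(actions, actions[1:]) if a == b)
    return repeats / (len(actions) - 1)

# A click-heavy trace with one stuck repeat out of three transitions.
trace = ["click(120, 40)", "click(120, 40)", "type('Q3')", "press('enter')"]
print(round(repeated_action_rate(trace), 3))  # → 0.333
```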

Interaction Case Study

In representative OSWorld traces, the main agent acts directly when possible, consults skill branches at decision points, and receives compact guidance grounded in selected state cards and visual evidence.

Representative interaction case with branch-loaded MMSkills
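That act-then-consult loop can be sketched end to end with toy stand-ins. Every class and interface here is an assumption made only to keep the example self-contained and runnable; the real agent, branch, and environment APIs will differ.

```python
class ToyEnv:
    """Two-state toy task: a dialog must be opened before confirming."""
    def reset(self):
        self.dialog_open = False
        return {"dialog_open": False}
    def step(self, action):
        if action == "open_dialog":
            self.dialog_open = True
        done = (action == "confirm" and self.dialog_open)
        return {"dialog_open": self.dialog_open}, done

class ToyAgent:
    """Wants to confirm; only confident once the dialog is open."""
    def propose(self, obs, guidance=None):
        if obs["dialog_open"]:
            return "confirm", True
        if guidance:                 # follow the branch's structured guidance
            return guidance, True
        return "confirm", False      # unsure: would click prematurely

class ToyBranch:
    """Returns the state-aware next step for a closed dialog."""
    def consult(self, obs):
        return "open_dialog"

def run_episode(env, agent, branch, max_steps=5):
    obs = env.reset()
    for _ in range(max_steps):
        action, confident = agent.propose(obs)
        if not confident:
            # Decision point: consult the skill branch, then re-plan.
            action, _ = agent.propose(obs, guidance=branch.consult(obs))
        obs, done = env.step(action)
        if done:
            return True
    return False

print(run_episode(ToyEnv(), ToyAgent(), ToyBranch()))  # → True
```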

Quick Start

git clone https://github.com/DeepExperience/MMSkills.git
cd MMSkills
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

python3 scripts/install_into_osworld.py /path/to/OSWorld --with-runner --with-skills
python run.py \
  --agent_type mm_skill \
  --model gpt-4o \
  --api_backend openai \
  --observation_type screenshot \
  --action_space pyautogui \
  --max_steps 20 \
  --skills_library_dir skills_library \
  --task_skill_mapping_root task_skill_mappings/task_skill_mapping.json \
  --skill_mode multimodal \
  --task_skill_top_k 6 \
  --save_conversation_json \
  --test_all_meta_path evaluation_examples/test_nogdrive.json \
  --domain chrome \
  --result_dir results/mm_skill_multimodal
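The `--task_skill_mapping_root` flag points at a JSON file linking evaluation tasks to candidate skills. The actual schema is not shown here; a plausible shape, purely illustrative, with placeholder task IDs and skill names, might be:

```json
{
  "chrome": {
    "<osworld-task-id>": ["chrome.clear_cache", "chrome.pin_tab"]
  }
}
```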

BibTeX

@software{mmskills2026,
  author = {Zhang, Kangning and Shao, Shuai and Li, Qingyao and Lin, Jianghao and Fu, Lingyue and Wang, Shijian and Jiao, Wenxiang and Lu, Yuan and Liu, Weiwen and Zhang, Weinan and Yu, Yong},
  title  = {MMSkills: Towards Multimodal Skills for General Visual Agents},
  year   = {2026},
  url    = {https://github.com/DeepExperience/MMSkills}
}