MissionBench

Zero-Shot Mission-Level Evaluation for Aerial MLLM Agents

1 University of Technology Nuremberg     2 Fraunhofer IVI     3 THWS     4 AIST
*Equal contribution
MissionBench overview figure

MissionBench. Our proposed MLLM benchmark evaluates mission-level aerial reasoning across three task families (Visual Inspection, Manipulation, and Patrol) and five environment types. Each episode begins with a single natural-language directive; the agent must then iteratively decide where to go, how to position itself, and what outcomes to report in order to complete the mission.

Abstract

MissionBench is a mission-level aerial benchmark for testing frozen, off-the-shelf MLLMs on long-horizon drone tasks from a single high-level instruction. Instead of evaluating isolated skills, it measures end-to-end performance in realistic environments where agents must coordinate perception, planning, movement, viewpoint selection, and reporting. Across 22 models evaluated in the paper, even the best systems achieve below 35% success, showing that mission-level embodied reasoning remains a hard open problem.

  • 120 missions across 5 photorealistic environments.
  • 3 task families: Visual Inspection, Manipulation, and Patrol.
  • A closed-loop evaluation protocol that separates mission understanding from spatial execution.
  • Evaluation of 22 open- and closed-source MLLMs in zero-shot settings.
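The closed-loop protocol can be pictured as a simple observe–decide–act cycle driven by a frozen model. The sketch below is illustrative only: the `env`/`mllm` interfaces, the action vocabulary, and the step budget are our assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of one closed-loop episode; NOT the benchmark's real
# interface. `env`, `mllm`, the action names, and max_steps are assumed.
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    success: bool
    steps: int
    report: str = ""

def run_episode(env, mllm, instruction: str, max_steps: int = 50) -> EpisodeResult:
    """Iterate observe -> decide -> act until the model terminates or times out."""
    obs = env.reset(instruction)
    for step in range(1, max_steps + 1):
        # The frozen MLLM sees the current view plus the directive and
        # returns one discrete action (e.g. move, rotate, report, done).
        action = mllm.decide(instruction, obs)
        if action.name == "done":
            # Declaring completion ends the episode; success is judged
            # by the environment, not by the model's own claim.
            return EpisodeResult(env.check_success(), step, action.payload)
        obs = env.step(action)
    return EpisodeResult(False, max_steps)  # timeout counts as failure
```

Keeping success judgment inside the environment (rather than trusting the model's report) is what lets such a protocol separate mission understanding from spatial execution.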

Current models struggle with robust mission completion in aerial environments. Performance improves with model scale, but gains are still far from reliable autonomy. Strong mission performance requires coordinated capabilities beyond visual recognition alone.

Sample Missions

Visual Inspection

  • Inspect the top of the isolated house (Isolated House Rooftop)
  • Find and report the cause of the fire (Forest Fire)
  • Report license number of the red car (Red Car)

Area Patrolling

  • Patrol the main road (Main Road)
  • Patrol the coastal region (Coastal Region)
  • Patrol all the wind turbines (Wind Turbines)

Object Manipulation

  • Drop medical supplies on yard (Yard)
  • Land the drone on the building (Red Cross Building)
  • Land the drone on the red cross (Red Cross Forest)

MissionBench Leaderboard

Performance of 22 models on the MissionBench evaluation set (SR5). Open-weight models shown with transparent fill.

For full documentation and configuration templates, see the GitHub repository.

Qualitative Results (Failure Modes)

Even top-performing models exhibit systematic failure patterns across mission types.

Premature Termination
The agent declares task completion before reaching or confirming the target, cutting the mission short.

Alignment Failure
The drone positions itself near the target but fails to achieve the required viewpoint alignment for accurate inspection.

Drift & Oscillation
Repeated back-and-forth movement around a waypoint prevents stable hovering and leads to mission timeout.
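The drift-and-oscillation pattern can be detected mechanically from a flight trace. The sketch below is a minimal illustration under our own assumptions (2-D waypoint-frame positions, hand-picked thresholds); it is not the diagnostic used in the benchmark.

```python
def detect_oscillation(positions, min_reversals=4, min_leg=0.5):
    """Flag repeated back-and-forth motion in a position trace.

    positions: list of (x, y) coordinates over time, assumed to be
    expressed relative to the current waypoint (an illustrative choice).
    A 'reversal' is a sign flip in the per-step displacement along x or y
    whose length exceeds min_leg; many reversals suggest the drone is
    oscillating around the waypoint instead of settling into a hover.
    """
    reversals = 0
    prev = None  # previous significant displacement (dx, dy)
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        dx, dy = x1 - x0, y1 - y0
        if max(abs(dx), abs(dy)) < min_leg:
            continue  # ignore small hover jitter
        if prev is not None and (dx * prev[0] < 0 or dy * prev[1] < 0):
            reversals += 1
        prev = (dx, dy)
    return reversals >= min_reversals
```

A trace that ping-pongs across the waypoint trips the detector, while steady progress toward a target does not.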

BibTeX

@article{MissionBench,
  title   = {MissionBench: Zero-Shot Mission-Level Evaluation for Aerial MLLM Agents},
  author  = {Suman Navaratnarajah and Taehyoung Kim and Ishaan Bhimwal and
             Ryousuke Yamada and Yannik Blei and Wolfram Burgard and
             Jona Ruthardt and Yuki M. Asano},
  year    = {2026}
}