MissionBench

Zero-Shot Mission-Level Evaluation for Aerial MLLM Agents

1 University of Technology Nuremberg     2 Fraunhofer IVI     3 THWS     4 AIST
*Equal contribution
MissionBench overview figure

MissionBench. Our proposed MLLM benchmark evaluates mission-level aerial reasoning across three task families (Visual Inspection, Manipulation, and Patrol) and five environment types. Each episode begins with a single natural-language directive; the agent must then iteratively decide where to go, how to position itself, and what outcomes to report in order to complete the mission.

Abstract

MissionBench is a mission-level aerial benchmark for testing frozen, off-the-shelf MLLMs on long-horizon drone tasks from a single high-level instruction. Instead of evaluating isolated skills, it measures end-to-end performance in realistic environments where agents must coordinate perception, planning, movement, viewpoint selection, and reporting. Across 22 models evaluated in the paper, even the best systems achieve below 35% success, showing that mission-level embodied reasoning remains a hard open problem.

  • 120 missions across 5 photorealistic environments.
  • 3 task families: Visual Inspection, Manipulation, and Patrol.
  • A closed-loop evaluation protocol that separates mission understanding from spatial execution.
  • Evaluation of 22 open- and closed-source MLLMs in zero-shot settings.
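The closed-loop protocol can be pictured as a simple observe–decide–act cycle driven by a frozen model. The sketch below is illustrative only: the `env`/`mllm` interfaces, the action vocabulary, and the step budget are our assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of one closed-loop episode; NOT the benchmark's real
# interface. `env`, `mllm`, the action names, and max_steps are assumed.
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    success: bool
    steps: int
    report: str = ""

def run_episode(env, mllm, instruction: str, max_steps: int = 50) -> EpisodeResult:
    """Iterate observe -> decide -> act until the model terminates or times out."""
    obs = env.reset(instruction)
    for step in range(1, max_steps + 1):
        # The frozen MLLM sees the current view plus the directive and
        # returns one discrete action (e.g. move, rotate, report, done).
        action = mllm.decide(instruction, obs)
        if action.name == "done":
            # Declaring completion ends the episode; success is judged
            # by the environment, not by the model's own claim.
            return EpisodeResult(env.check_success(), step, action.payload)
        obs = env.step(action)
    return EpisodeResult(False, max_steps)  # timeout counts as failure
```

Keeping success judgment inside the environment (rather than trusting the model's report) is what lets such a protocol separate mission understanding from spatial execution.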

Current models struggle with robust mission completion in aerial environments. Performance improves with model scale, but gains are still far from reliable autonomy. Strong mission performance requires coordinated capabilities beyond visual recognition alone.

Sample Missions

Visual Inspection

  • Inspect the top of the isolated house (Isolated House Rooftop)
  • Find and report the cause of the fire (Forest Fire)
  • Report license number of the red car (Red Car)

Area Patrolling

  • Patrol the main road (Main Road)
  • Patrol the coastal region (Coastal Region)
  • Patrol all the wind turbines (Wind Turbines)

Object Manipulation

  • Drop medical supplies on yard (Yard)
  • Land the drone on the building (Red Cross Building)
  • Land the drone on the red cross (Red Cross Forest)

MissionBench Leaderboard

Performance of 22 models on the MissionBench evaluation set (SR5). Open-weight models shown with transparent fill.

For full documentation and configuration templates, see the GitHub repository.

Qualitative Results (Failure Modes)

Even top-performing models exhibit systematic failure patterns across mission types.

Premature Termination
The agent declares task completion before reaching or confirming the target, cutting the mission short.

Alignment Failure
The drone positions itself near the target but fails to achieve the required viewpoint alignment for accurate inspection.

Drift & Oscillation
Repeated back-and-forth movement around a waypoint prevents stable hovering and leads to mission timeout.
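The drift-and-oscillation pattern can be detected mechanically from a flight trace. The sketch below is a minimal illustration under our own assumptions (2-D waypoint-frame positions, hand-picked thresholds); it is not the diagnostic used in the benchmark.

```python
def detect_oscillation(positions, min_reversals=4, min_leg=0.5):
    """Flag repeated back-and-forth motion in a position trace.

    positions: list of (x, y) coordinates over time, assumed to be
    expressed relative to the current waypoint (an illustrative choice).
    A 'reversal' is a sign flip in the per-step displacement along x or y
    whose length exceeds min_leg; many reversals suggest the drone is
    oscillating around the waypoint instead of settling into a hover.
    """
    reversals = 0
    prev = None  # previous significant displacement (dx, dy)
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        dx, dy = x1 - x0, y1 - y0
        if max(abs(dx), abs(dy)) < min_leg:
            continue  # ignore small hover jitter
        if prev is not None and (dx * prev[0] < 0 or dy * prev[1] < 0):
            reversals += 1
        prev = (dx, dy)
    return reversals >= min_reversals
```

A trace that ping-pongs across the waypoint trips the detector, while steady progress toward a target does not.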

BibTeX

@article{MissionBench,
  title   = {MissionBench: Zero-Shot Mission-Level Evaluation for Aerial MLLM Agents},
  author  = {Suman Navaratnarajah and Taehyoung Kim and Ishaan Bhimwal and
             Ryousuke Yamada and Yannik Blei and Wolfram Burgard and
             Jona Ruthardt and Yuki M. Asano},
  year    = {2026}
}