MissionBench is a mission-level aerial benchmark for testing frozen, off-the-shelf multimodal large language models (MLLMs) on long-horizon drone tasks issued as a single high-level instruction. Instead of evaluating isolated skills, it measures end-to-end performance in realistic environments where agents must coordinate perception, planning, movement, viewpoint selection, and reporting. Across the 22 models evaluated in the paper, even the best systems achieve below 35% success, showing that mission-level embodied reasoning remains a hard open problem.
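To make the setup concrete, the sketch below shows what a zero-shot mission rollout with a frozen MLLM might look like. The environment interface, action vocabulary, and `query_mllm` call are illustrative assumptions, not MissionBench's actual API.

```python
from dataclasses import dataclass

# Illustrative interfaces only -- MissionBench's real API may differ.

@dataclass
class Observation:
    rgb: bytes                                 # onboard camera frame (encoded image)
    pose: tuple[float, float, float, float]    # (x, y, z, yaw)

class MissionEnv:
    """Stand-in for a mission-level aerial environment."""

    def reset(self, instruction: str) -> Observation:
        raise NotImplementedError

    def step(self, action: str) -> tuple[Observation, bool]:
        """Apply one action; returns (next observation, mission_done)."""
        raise NotImplementedError

def query_mllm(instruction: str, obs: Observation, history: list[str]) -> str:
    """Placeholder for a single call to a frozen, off-the-shelf MLLM.

    The model sees the mission instruction, the current frame, and the
    action history, and returns one textual action, e.g. 'move_forward',
    'yaw_left', 'descend', or 'report(<answer>)' (hypothetical vocabulary).
    """
    raise NotImplementedError

def run_mission(env: MissionEnv, instruction: str, max_steps: int = 200) -> list[str]:
    """Zero-shot rollout: no fine-tuning, no per-task prompt engineering."""
    obs = env.reset(instruction)
    history: list[str] = []
    for _ in range(max_steps):
        action = query_mllm(instruction, obs, history)
        history.append(action)
        obs, done = env.step(action)
        if done:                    # agent reported completion or failed terminally
            break
    return history
```

The point of the loop is that the model is queried as-is at every step; all coordination of perception, planning, and reporting has to emerge from the single instruction and the accumulated history.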
Current models struggle with robust mission completion in aerial environments. Performance improves with model scale, but even the largest models remain far from reliable autonomy. Strong mission performance requires coordinated capabilities beyond visual recognition alone.
Figure: Performance of 20 models on the MissionBench evaluation set (SR5). Open-weight models are shown with a transparent fill.
For full documentation and configuration templates, see the GitHub repository.
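Purely as an illustration of what such a configuration template might contain, a mission spec could resemble the sketch below; every field name here is hypothetical, so consult the repository for the actual templates.

```python
# Hypothetical mission configuration -- field names are illustrative,
# not taken from MissionBench's actual templates.
mission_config = {
    "instruction": "Survey the harbor and report the number of moored boats.",
    "environment": "coastal_town",         # simulated scene to load
    "start_pose": {"x": 0.0, "y": 0.0, "z": 25.0, "yaw": 90.0},
    "max_steps": 200,                      # step budget before the run is cut off
    "model": {
        "name": "<frozen-mllm-id>",        # off-the-shelf model under test
        "temperature": 0.0,                # deterministic decoding for evaluation
    },
}
```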
Even top-performing models exhibit systematic failure patterns across mission types.
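One simple way to surface such patterns is to break outcomes down by mission type. A minimal sketch follows, assuming an illustrative episode-log schema rather than MissionBench's actual log format:

```python
from collections import Counter, defaultdict

def outcomes_by_mission_type(episodes):
    """Tally success rates and dominant failure modes per mission type.

    `episodes` is assumed to be an iterable of dicts like
    {"mission_type": "search", "success": False, "failure_mode": "lost_target"}
    -- an illustrative schema, not MissionBench's log format.
    """
    successes = defaultdict(list)
    failures = defaultdict(Counter)
    for ep in episodes:
        successes[ep["mission_type"]].append(ep["success"])
        if not ep["success"]:
            failures[ep["mission_type"]][ep.get("failure_mode", "unknown")] += 1
    return (
        {t: sum(v) / len(v) for t, v in successes.items()},   # success rate per type
        {t: c.most_common(3) for t, c in failures.items()},   # top failure modes
    )
```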
@article{MissionBench,
  title  = {MissionBench: Zero-Shot Mission-Level Evaluation for Aerial MLLM Agents},
  author = {Suman Navaratnarajah and Taehyoung Kim and Ishaan Bhimwal and Ryousuke Yamada and Yannik Blei and Wolfram Burgard and Jona Ruthardt and Yuki M. Asano},
  year   = {2026}
}