Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach—suited to onboard real-time inference—achieving high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP temporally propagates the predictions of an efficient image segmentation model, using global registration alignment to compensate for camera movements. It combines the current estimate and the prior prediction by linear interpolation, with weights computed from the feature similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency-aware Knowledge Distillation (KD) training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high-quality supervision on all frames. KD-SSP obtains a significant temporal consistency increase over the base image segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively, with higher accuracy and comparable inference speed. On these aerial datasets, KD-SSP offers a better trade-off between segmentation quality and inference speed than other video methods proposed for general applications, and shows considerably higher consistency.
SSP enhances temporal consistency by combining current segmentation predictions with past frame predictions using linear interpolation. The interpolation weights, determined by semantic feature similarities between consecutive frames, are computed through convolutional layers. Additionally, global registration alignment compensates efficiently for camera movements, aligning previous predictions to the current frame without relying on expensive optical flow.
During inference, SSP proceeds step by step: it combines the current frame's prediction (logits) from an image segmentation model with the previous frame's aligned prediction using linear interpolation. The alignment is achieved via a global registration homography transformation (H) that efficiently compensates for camera movements. Interpolation weights are computed by a similarity layer, which applies convolutional layers to the semantic feature maps extracted from both the current and past frames. This design yields efficient, stable, and accurate semantic predictions throughout entire videos.
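The propagation step above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the homography warp uses nearest-neighbour sampling for simplicity, and the learned convolutional similarity layer is approximated here by a per-pixel cosine similarity between feature maps (both are assumptions made for this sketch).

```python
import numpy as np

def warp_with_homography(tensor, H):
    """Warp a (C, H, W) tensor into the current frame using a 3x3 homography H.
    Nearest-neighbour sampling keeps this sketch dependency-free."""
    C, h, w = tensor.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous pixel coords
    src = H @ pts
    sx = np.round(src[0] / src[2]).astype(int).clip(0, w - 1)
    sy = np.round(src[1] / src[2]).astype(int).clip(0, h - 1)
    return tensor[:, sy, sx].reshape(C, h, w)

def ssp_blend(curr_logits, prev_logits, curr_feat, prev_feat, H):
    """Linearly interpolate current logits with globally aligned previous logits.
    The per-pixel weight is a cosine-similarity stand-in for the paper's
    learned similarity layer (assumption)."""
    prev_logits_aligned = warp_with_homography(prev_logits, H)
    prev_feat_aligned = warp_with_homography(prev_feat, H)
    num = (curr_feat * prev_feat_aligned).sum(axis=0)
    den = (np.linalg.norm(curr_feat, axis=0)
           * np.linalg.norm(prev_feat_aligned, axis=0) + 1e-8)
    w = np.clip(num / den, 0.0, 1.0)  # high similarity -> trust the propagated prediction
    return w * prev_logits_aligned + (1.0 - w) * curr_logits
```

With an identity homography and identical features, the blend reduces to the previous prediction, which matches the intuition that static, unchanged regions should keep their past labels.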
Due to sparse labeling in aerial datasets, the paper proposes a semi-supervised knowledge distillation (KD) training process. A larger, high-quality teacher model provides consistent annotations across all frames, enabling the SSP model to learn effectively from unlabeled frames. This method significantly enhances accuracy and temporal consistency without sacrificing inference speed.
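A minimal sketch of this semi-supervised KD objective is shown below. The exact loss weighting and scheduling used in the paper may differ; the structure here (ground-truth supervision on the sparse labeled frames, teacher soft labels everywhere else) is the assumed core idea.

```python
import numpy as np

def cross_entropy(logits, target_probs):
    """Mean per-pixel cross-entropy between (C, H, W) logits and target distributions."""
    m = logits.max(axis=0, keepdims=True)                     # numerical stability
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=0, keepdims=True))
    return -(target_probs * logp).sum(axis=0).mean()

def kd_loss(student_logits, teacher_probs, gt_onehot=None):
    """Semi-supervised KD sketch: supervise with ground truth where a sparse
    label exists, otherwise with the teacher's soft prediction (assumption)."""
    if gt_onehot is not None:                                  # labeled frame
        return cross_entropy(student_logits, gt_onehot)
    return cross_entropy(student_logits, teacher_probs)        # unlabeled frame
```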
Experiments on UAVid and RuralScapes datasets demonstrate that SSP and KD-SSP significantly outperform baseline image models and other established video segmentation methods. Specifically, KD-SSP achieves 80.63% mIoU and 91.53% Temporal Consistency (TC) on UAVid and 64.56% mIoU and 94.00% TC on RuralScapes, showing a superior balance of accuracy, consistency, and inference speed.
Qualitative comparisons confirm SSP's superior temporal stability and segmentation quality across video frames, clearly outperforming traditional image-based models and other video segmentation methods like NetWarp and TCB-OCR. Overall, SSP provides an efficient, accurate, and highly temporally consistent video segmentation solution suited for autonomous UAV applications, effectively addressing the unique challenges posed by aerial footage, including limited annotations and camera motion.
@misc{vincent2025hightemporalconsistencysemantic,
  title={High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight},
  author={Cédric Vincent and Taehyoung Kim and Henri Meeß},
  year={2025},
  eprint={2503.15676},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.15676},
}