arxiv:2412.07072

Stable Mean Teacher for Semi-supervised Video Action Detection

Published on Dec 10, 2024

Authors:

Akash Kumar ,

Abstract

Stable Mean Teacher with Error Recovery and Difference of Pixels constraints improves semi-supervised video action detection by enhancing pseudo labels and ensuring temporal consistency.

AI-generated summary

In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatiotemporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel Error Recovery (EoR) module, which learns from students' mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatiotemporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To address this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, leading to coherent temporal detections. We evaluate our approach on four different spatiotemporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of data, it provides competitive performance compared to the supervised baseline trained on 100% annotations on UCF101-24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and YouTube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain. Code and models are publicly available.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2412.07072 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2412.07072 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2412.07072 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.