Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification

MoTIF - Moving Temporal Interpretable Framework

Patrick Knab, Sascha Marton, Philipp Schubert, Drago Nilo, Christian Bartelt

Technical University of Clausthal, Germany • CORE Research Group • Ramblr.ai Research, Germany

Spotlight acceptances: MoTIF was accepted as a Spotlight paper at CompLearn @ ICML 2026 and XAI4CV @ CVPR 2026.
Figure 1: MoTIF architecture showing the modular design with concept extraction, temporal bottlenecks, and interpretable classification.

Concept Visualization: MoTIF identifies and highlights key temporal concepts in video sequences.

Abstract

Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., "bow," "mount," "shoot") that reoccur across time, forming motifs that collectively describe and explain actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance.

Method

Contributions

CBM Framework for Video

MoTIF supports arbitrary-length inputs and integrates seamlessly with vision–language backbones.

Three Complementary Explanation Modes

MoTIF is the first method to enable:

  • Global concept relevance via log-sum-exp (LSE) pooling
  • Localized temporal explanations using windowed concept attributions
  • Attention-based temporal maps that visualize how a concept channel distributes its focus across time
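The first of these modes, log-sum-exp pooling over time, can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function name `lse_pool` and the temperature `tau` are assumptions introduced here, and the exact normalization used in the paper may differ.

```python
import numpy as np

def lse_pool(X, tau=1.0):
    """Log-sum-exp over the time axis of X with shape (T, C); returns (C,).

    `tau` is a hypothetical temperature: small values approach max-pooling.
    """
    Z = X / tau
    m = Z.max(axis=0, keepdims=True)  # subtract the max for numerical stability
    return tau * (m[0] + np.log(np.exp(Z - m).sum(axis=0)))

# Toy activations: T=3 windows, C=2 concepts.
X = np.array([[0.0, 1.0],
              [0.0, 3.0],
              [0.0, 2.0]])
pooled = lse_pool(X)
print(pooled)
```

LSE acts as a smooth maximum: each pooled score lies between the per-concept maximum over time and that maximum plus `tau * log(T)`, so a concept that fires strongly in a single window still dominates its global score.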

Per-Channel Temporal Self-Attention

Preserves concept independence within transformer blocks and models temporal dynamics on a per-concept basis.

Architecture

MoTIF pipeline overview
Videos are embedded with a vision–language backbone and mapped to concept activations via cosine similarity. Per‑channel temporal self‑attention models dynamics independently for each concept, followed by a non‑negative affine transformation and classification. MoTIF enables explanations across three views: global concepts, local concepts, and temporal dependencies. Sample frames from SSv2 with MoTIF (ViT‑L/14).

Video and concept embeddings

Frames are embedded with an image–text aligned backbone (e.g., CLIP) into a shared space. For each temporal window we use either a representative frame or a video‑adapted CLIP embedding. Concept activations X (T×C) are obtained as cosine similarities to a bank of human‑interpretable actions and objects. The concept bank is built from natural‑language descriptions; a large language model proposes candidate concepts, and we adopt the resulting set directly.
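As a rough illustration of this step, the sketch below computes the T×C activation matrix from precomputed embeddings. The random arrays stand in for real backbone outputs, and the function name `concept_activations` is an assumption made for this example, not part of the released code.

```python
import numpy as np

def concept_activations(frame_emb, concept_emb, eps=1e-8):
    """Cosine similarity between window and concept embeddings.

    frame_emb: (T, D) window embeddings; concept_emb: (C, D) text embeddings.
    Returns X with shape (T, C).
    """
    f = frame_emb / (np.linalg.norm(frame_emb, axis=1, keepdims=True) + eps)
    c = concept_emb / (np.linalg.norm(concept_emb, axis=1, keepdims=True) + eps)
    return f @ c.T

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 512))    # stand-in for T=8 window embeddings
concepts = rng.normal(size=(5, 512))  # stand-in for C=5 concept embeddings
X = concept_activations(frames, concepts)
print(X.shape)  # (8, 5)
```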

Per‑channel temporal self‑attention (diagonal)

Standard transformers mix channels in Q/K/V projections, which obscures concept attribution. MoTIF keeps concepts independent using depthwise 1×1 projections so each concept owns its Q, K and V. Attention is computed within a concept across time, yielding a T×T weight map per concept and refined activations.
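The idea can be sketched with NumPy under a simplifying assumption: each depthwise 1×1 projection is reduced to one scalar weight per concept (no bias), so every concept behaves like a one-dimensional attention head. The names `per_channel_attention`, `wq`, `wk`, and `wv` are hypothetical.

```python
import numpy as np

def per_channel_attention(X, wq, wk, wv):
    """X: (T, C) activations; wq, wk, wv: (C,) per-concept scalar weights.

    Returns refined activations (T, C) and attention maps (C, T, T).
    """
    Q, K, V = X * wq, X * wk, X * wv  # elementwise: no mixing across concepts
    # scores[c, i, j] = Q[i, c] * K[j, c]; the head dimension is 1, so no scaling
    scores = np.einsum("ic,jc->cij", Q, K)
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax over j
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    out = np.einsum("cij,jc->ic", A, V)  # weighted sum over time, per concept
    return out, A

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # T=6 windows, C=4 concepts
wq, wk, wv = rng.normal(size=(3, 4))
out, A = per_channel_attention(X, wq, wk, wv)
print(out.shape, A.shape)  # (6, 4) (4, 6, 6)
```

Because nothing in the computation crosses the concept axis, perturbing one concept's input leaves the outputs of all other concepts unchanged, which is what makes the per-concept T×T maps directly attributable.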

Per‑concept affine transformation

Refined activations are scaled and shifted by concept‑specific parameters and passed through Softplus to keep activations non‑negative. A lightweight depthwise two‑layer feed‑forward block (GELU, dropout) is applied.
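A minimal sketch of the scale, shift, and Softplus step follows; the feed-forward block is omitted. The parameter names `gamma` and `beta` are assumptions chosen for illustration.

```python
import numpy as np

def softplus(z):
    """log(1 + exp(z)), computed stably via logaddexp."""
    return np.logaddexp(0.0, z)

def concept_affine(H, gamma, beta):
    """H: (T, C) refined activations; gamma, beta: (C,) per-concept parameters."""
    return softplus(H * gamma + beta)

H = np.array([[-2.0, 0.0],
              [ 1.0, 3.0]])
gamma = np.array([1.0, 0.5])  # hypothetical per-concept scales
beta = np.array([0.0, -1.0])  # hypothetical per-concept shifts
Y = concept_affine(H, gamma, beta)
print(Y)
```

Since Softplus maps every real input to a non-negative value, a downstream concept score can only add evidence for a class, which keeps contributions easy to read.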

Complexity

Diagonal attention removes the channel-mixing cost of the projections (from O(C²T) to O(CT)) but computes a T×T attention map per concept, giving O(CT²) for the attention itself. Compared with standard multi-head attention, which computes O(HT²) attention-map entries with H ≪ C, this trades efficiency for strict concept isolation.
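To make the trade-off concrete, the short calculation below counts attention-map entries for the two designs. The values of T, C, and H are illustrative assumptions, not settings reported in the paper.

```python
def attention_map_entries(T, n_maps):
    """Number of entries across n_maps attention maps of size T x T."""
    return n_maps * T * T

T, C, H = 16, 1000, 12  # hypothetical: 16 windows, 1000 concepts, 12 heads

diag = attention_map_entries(T, C)  # one map per concept
mha = attention_map_entries(T, H)   # one map per head
print(diag, mha, diag / mha)
```

The ratio is simply C / H, so the overhead of per-concept attention grows with the size of the concept bank rather than with sequence length.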

Results

Performance Comparison

The table below reflects the updated paper results. We compare zero-shot backbones, Global CBM baselines, MoTIF and MoTIF-ST variants, and representative non-interpretable video baselines. The main takeaway remains that temporal concept bottlenecks consistently improve over global concept pooling while providing interpretable video reasoning.

| Method | Breakfast | HMDB51 | UCF101 | SSv2 |
|---|---|---|---|---|
| Zero-shot | | | | |
| CLIP-RN/50 | 18.6 ± 2.6 | 29.8 ± 0.5 | 57.2 ± 0.9 | 0.8 |
| CLIP-ViT-B/32 | 23.2 ± 2.9 | 38.1 ± 0.3 | 59.9 ± 0.4 | 0.9 |
| CLIP-ViT-L/14 | 31.1 ± 4.7 | 45.7 ± 0.1 | 70.6 ± 0.5 | 0.9 |
| SigLIP-L/14 | 23.6 ± 5.0 | 49.3 ± 0.8 | 80.4 ± 1.4 | 1.3 |
| PE-L/14 | 41.4 ± 7.0 | 56.7 ± 0.6 | 74.6 ± 0.9 | 2.2 |
| Global CBM | | | | |
| CLIP-RN/50 | 36.5 ± 9.0 | 59.3 ± 0.8 | 80.0 ± 0.7 | 13.7 |
| CLIP-ViT-B/32 | 37.2 ± 9.1 | 61.6 ± 1.6 | 82.8 ± 0.7 | 15.2 |
| CLIP-ViT-L/14 | 55.3 ± 10.2 | 68.4 ± 0.5 | 90.0 ± 1.1 | 18.1 |
| SigLIP-L/14 | 57.1 ± 10.9 | 65.0 ± 2.1 | 90.5 ± 0.5 | 19.6 |
| PE-L/14 | 72.9 ± 10.3 | 74.4 ± 0.6 | 94.5 ± 0.6 | 25.5 |
| MoTIF (Ours) | | | | |
| MoTIF (RN/50) | 52.8 ± 6.9 | 62.8 ± 1.1 | 82.8 ± 0.6 | 16.0 |
| MoTIF (ViT-B/32) | 53.4 ± 6.9 | 65.3 ± 1.8 | 85.6 ± 1.2 | 17.5 |
| MoTIF (ViT-L/14) | 69.3 ± 6.2 | 73.3 ± 1.0 | 93.2 ± 0.7 | 20.4 |
| MoTIF-ST (ViT-L/14) | 71.1 ± 7.7 | 74.8 ± 1.0 | 93.8 ± 0.9 | 23.9 |
| MoTIF (SigLIP-L/14) | 73.5 ± 8.6 | 73.2 ± 2.4 | 94.0 ± 0.8 | 22.4 |
| MoTIF (PE-L/14) | 83.6 ± 6.5 | 79.6 ± 0.3 | 95.4 ± 0.7 | 30.0 |
| MoTIF-ST (PE-L/14) | 84.1 ± 6.4 | 79.6 ± 0.7 | 96.3 ± 0.6 | 35.1 |
| Existing Video Models | | | | |
| TSM | 59.1¹ | 73.5 | 95.9 | 61.7 |
| No frame left behind | 62.0¹ | 73.4¹ | 96.4¹ | 62.7¹ |
| VideoMAE V2 | -- | 88.1 | 99.6 | 76.8 |

¹ Literature results. Ranking emphasis was removed here to avoid ambiguity from ties and mixed interpretable vs. non-interpretable comparisons.

Comparison with DANCE

We also compare MoTIF to DANCE and related explainable action-recognition baselines on UCF101, HAA-100, and HAA-500. Unlike DANCE, MoTIF explicitly models temporally localized concept activations and how they evolve over time.

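| Method | Backbone | UCF101 | HAA-100 | HAA-500 |
|---|---|---|---|---|
| DANCE [29] | Baseline w/o interp. | 88.4 | 73.5 | |
| DANCE [29] | DANCE | 87.5 | 70.7 | |
| LF-CBM [30] | Disentangled concepts | 85.5 | 66.5 | |
| MoTIF | ViT-B/32 | 88.5 ± 0.6 | 61.3 | 55.3 |
| MoTIF | PE-L/14 | 94.8 ± 0.4 | 87.8 | 80.9 |
| MoTIF | PE-G/14 | 98.0 ± 0.2 | 89.9 | 84.1 |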

Concept Interventions

MoTIF’s bottleneck also supports direct interventions. The destructive columns report normalized prediction overlap after removing concepts or windows, while the corrective columns report top-1 repair rate after targeted manual edits on misclassified top-5 cases.
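A destructive intervention of the global kind can be illustrated as follows. This is only a sketch under stated assumptions: it uses a plain linear head over pooled concept scores, ranks concepts by their contribution to the predicted class, and checks whether the prediction survives their removal. The function `remove_top_k` and the toy weights are hypothetical, and the normalized overlap metric reported in the tables is not reproduced here.

```python
import numpy as np

def remove_top_k(pooled, W, k):
    """Zero the k concepts contributing most to the predicted class.

    pooled: (C,) non-negative concept scores; W: (num_classes, C) weights.
    Returns (original prediction, prediction after removal).
    """
    pred = int(np.argmax(W @ pooled))
    contrib = W[pred] * pooled           # per-concept contribution to the logit
    top = np.argsort(contrib)[::-1][:k]  # indices of the k largest contributions
    edited = pooled.copy()
    edited[top] = 0.0
    return pred, int(np.argmax(W @ edited))

pooled = np.array([2.0, 1.0, 0.5])  # toy scores for C=3 concepts
W = np.array([[1.0, 0.0, 0.0],      # toy head with 2 classes
              [0.0, 1.0, 0.0]])
before, after = remove_top_k(pooled, W, k=1)
print(before, after)  # 0 1
```

Averaging the agreement between `before` and `after` over a dataset would give one simple measure of how strongly predictions depend on the removed concepts.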

Destructive Interventions

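| Dataset | k | Global Top-k ↓ | Global Rand. ↓ | Local Slot Top-k ↓ | Window Top-k ↓ | Window Rand. ↓ |
|---|---|---|---|---|---|---|
| Breakfast | 0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Breakfast | 1 | 0.496 | 0.972 | 0.954 | 0.866 | 0.986 |
| Breakfast | 2 | 0.229 | 0.950 | 0.933 | 0.775 | 0.977 |
| Breakfast | 3 | 0.085 | 0.937 | 0.908 | 0.754 | 0.973 |
| Breakfast | 4 | 0.028 | 0.909 | 0.891 | 0.732 | 0.960 |
| HMDB51 | 0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| HMDB51 | 1 | 0.603 | 0.975 | 0.934 | 0.875 | 0.963 |
| HMDB51 | 2 | 0.374 | 0.959 | 0.892 | 0.801 | 0.947 |
| HMDB51 | 3 | 0.238 | 0.942 | 0.852 | 0.731 | 0.918 |
| HMDB51 | 4 | 0.142 | 0.926 | 0.809 | 0.674 | 0.886 |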

Corrective Interventions

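| k | Breakfast Global Edit ↑ | Breakfast Local Edit ↑ | HMDB51 Global Edit ↑ | HMDB51 Local Edit ↑ |
|---|---|---|---|---|
| 0 | | | | |
| 1 | 0.20 | 0.03 | 0.47 | 0.10 |
| 2 | 0.47 | 0.10 | 0.60 | 0.23 |
| 3 | 0.57 | 0.17 | 0.70 | 0.27 |
| 4 | 0.80 | 0.20 | 0.83 | 0.30 |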

Pre-trained checkpoints for all MoTIF models are available at Hugging Face.

Citation

@misc{knab2025conceptsmotiontemporalbottlenecks,
    title={Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification},
    author={Patrick Knab and Sascha Marton and Philipp J. Schubert and Drago Guggiana and Christian Bartelt},
    year={2025},
    eprint={2509.20899},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.20899},
}