Concept Visualization: MoTIF identifies and highlights key temporal concepts in video sequences.
Abstract
Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-inspired architecture that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., "bow," "mount," "shoot") that reoccur across time—forming motifs that collectively describe and explain actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance.
Method
Contributions
CBM Framework for Video
MoTIF supports arbitrary-length inputs and integrates with vision–language backbones such as CLIP, SigLIP, and PE
Three Complementary Explanation Modes
MoTIF is the first method to enable:
- Global concept relevance via log-sum-exp (LSE) pooling (see the sketch after this list)
- Localized temporal explanations using windowed concept attributions
- Attention-based temporal maps that visualize how a concept channel distributes its focus across time
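A minimal sketch of how the first two modes could be computed, assuming per-window concept contributions (e.g., concept activation multiplied by the classifier weight of the predicted class) are already available; the function names and the temperature parameter are illustrative, not taken from the paper:

```python
import torch

def global_concept_relevance(window_contributions: torch.Tensor,
                             tau: float = 1.0) -> torch.Tensor:
    """Global relevance per concept via log-sum-exp (LSE) pooling over time.

    window_contributions: (T, C) per-window contribution of each concept to
    the predicted class. LSE acts as a smooth maximum, so a concept that
    fires strongly anywhere in the video still receives high global relevance.
    """
    return tau * torch.logsumexp(window_contributions / tau, dim=0)  # (C,)

def local_concept_relevance(window_contributions: torch.Tensor,
                            start: int, end: int,
                            tau: float = 1.0) -> torch.Tensor:
    """Localized explanation: the same pooling restricted to a temporal window."""
    return tau * torch.logsumexp(window_contributions[start:end] / tau, dim=0)
```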
Per-Channel Temporal Self-Attention
Preserves concept independence within transformer blocks and models temporal dynamics on a per-concept basis
Architecture
Video and concept embeddings
Frames are embedded with an image–text aligned backbone (e.g., CLIP) into a shared space. For each temporal window we use either a representative frame or a video‑adapted CLIP embedding. Concept activations X (T×C) are obtained as cosine similarities to a bank of human‑interpretable actions and objects. The concept bank is built from natural‑language descriptions; a large language model proposes candidate concepts, and we adopt the resulting set directly.
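A minimal sketch of this step, assuming the window embeddings and the concept-text embeddings from the CLIP-like backbone have already been computed (tensor names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def concept_activations(window_embeddings: torch.Tensor,
                        concept_embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity concept scores.

    window_embeddings:  (T, D) one embedding per temporal window from an
                        image-text aligned backbone (assumed precomputed).
    concept_embeddings: (C, D) text embeddings of the concept bank
                        (LLM-proposed actions/objects) in the same space.
    Returns X: (T, C) concept activations.
    """
    w = F.normalize(window_embeddings, dim=-1)
    c = F.normalize(concept_embeddings, dim=-1)
    return w @ c.T  # cosine similarity, since both sides are unit-normalized
```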
Per‑channel temporal self‑attention (diagonal)
Standard transformers mix channels in Q/K/V projections, which obscures concept attribution. MoTIF keeps concepts independent using depthwise 1×1 projections so each concept owns its Q, K and V. Attention is computed within a concept across time, yielding a T×T weight map per concept and refined activations.
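A sketch of a diagonal attention block matching this description; the module name and the use of grouped 1×1 convolutions to realise the depthwise projections are implementation assumptions:

```python
import torch
import torch.nn as nn

class DiagonalTemporalAttention(nn.Module):
    """Per-concept temporal self-attention.

    Each of the C concept channels has its own scalar Q/K/V projection
    (depthwise 1x1), so channels are never mixed; attention is computed
    across the T time steps separately for every concept, yielding one
    T x T attention map per concept.
    """

    def __init__(self, num_concepts: int):
        super().__init__()
        # Depthwise 1x1 projections: one scalar weight and bias per concept.
        self.q_proj = nn.Conv1d(num_concepts, num_concepts, 1, groups=num_concepts)
        self.k_proj = nn.Conv1d(num_concepts, num_concepts, 1, groups=num_concepts)
        self.v_proj = nn.Conv1d(num_concepts, num_concepts, 1, groups=num_concepts)

    def forward(self, x: torch.Tensor):
        # x: (B, T, C) concept activations over T temporal windows.
        x_cf = x.transpose(1, 2)                      # (B, C, T) for Conv1d
        q, k, v = self.q_proj(x_cf), self.k_proj(x_cf), self.v_proj(x_cf)
        # Per-concept attention scores across time (head dim is 1, so no scaling).
        scores = torch.einsum("bct,bcs->bcts", q, k)  # (B, C, T, T)
        attn = scores.softmax(dim=-1)
        out = torch.einsum("bcts,bcs->bct", attn, v)  # refined activations (B, C, T)
        return out.transpose(1, 2), attn              # (B, T, C), per-concept T x T maps
```

The returned (B, C, T, T) maps are the per-concept temporal attention maps that feed the third explanation mode.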
Per‑concept affine transformation
Refined activations are scaled and shifted by concept‑specific parameters and passed through a Softplus to keep them non‑negative. A lightweight depthwise two‑layer feed‑forward block (GELU, dropout) follows.
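A sketch of this refinement step; the hidden expansion factor, dropout rate, and the residual connection are assumptions rather than values from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerConceptRefinement(nn.Module):
    """Per-concept affine transform, Softplus, and depthwise feed-forward block."""

    def __init__(self, num_concepts: int, hidden: int = 4, p_drop: float = 0.1):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_concepts))   # concept-specific scale
        self.shift = nn.Parameter(torch.zeros(num_concepts))  # concept-specific shift
        # Two-layer depthwise feed-forward: groups=C keeps every concept separate.
        self.ffn = nn.Sequential(
            nn.Conv1d(num_concepts, num_concepts * hidden, 1, groups=num_concepts),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Conv1d(num_concepts * hidden, num_concepts, 1, groups=num_concepts),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) refined activations from the diagonal attention block.
        x = F.softplus(x * self.scale + self.shift)              # non-negative activations
        # Residual connection around the depthwise FFN (an assumption here).
        return x + self.ffn(x.transpose(1, 2)).transpose(1, 2)   # (B, T, C)
```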
Complexity
Diagonal attention removes the channel‑mixing cost of the Q/K/V projections (from O(C²T) to O(CT)) but computes a T×T attention map per concept, giving O(CT²) for the attention itself. Compared with standard multi‑head attention, whose H ≪ C heads cost O(HT²), this trades efficiency for strict concept isolation.
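As an illustrative back‑of‑the‑envelope comparison (the sizes are hypothetical, not from the paper): with C = 150 concepts, T = 32 windows, and H = 8 heads, diagonal attention computes 150 maps of 32 × 32 ≈ 153.6K attention scores versus about 8.2K for standard multi‑head attention, while each depthwise projection needs only 150 multiplications per window instead of the 150² = 22,500 required by a full channel‑mixing projection.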
Results
Performance Comparison
The table below provides a comparison including accuracies from non-interpretable baselines. We compare against TSM, No Frame Left Behind, and VideoMAE V2, all of which report results on our selected datasets. As expected, these models generally outperform our interpretable variant. However, on Breakfast, MoTIF exceeds two of the baselines that report scores. Importantly, our objective is not to surpass state-of-the-art benchmarks, but to demonstrate that MoTIF, a novel concept-based framework for video data, provides unique interpretability insights.
| Method | Breakfast | HMDB51 | UCF101 | SSv2 |
|---|---|---|---|---|
| **Zero-shot** | | | | |
| CLIP-RN/50 | 18.6 ± 2.6 | 29.8 ± 0.5 | 57.2 ± 0.9 | 0.8 |
| CLIP-ViT-B/32 | 23.2 ± 2.9 | 38.1 ± 0.3 | 59.9 ± 0.4 | 0.9 |
| CLIP-ViT-L/14 | 31.1 ± 4.7 | 45.7 ± 0.1 | 70.6 ± 0.5 | 0.7 |
| SigLIP-L/14 | 23.6 ± 5.0 | 49.3 ± 0.8 | 80.4 ± 1.4 | 1.3 |
| PE-L/14 | 41.4 ± 7.0 | 56.7 ± 0.6 | 74.6 ± 0.9 | 2.2 |
| **Linear Probe** | | | | |
| CLIP-RN/50 | 36.5 ± 9.0 | 59.3 ± 0.8 | 80.0 ± 0.7 | 13.7 |
| CLIP-ViT-B/32 | 37.2 ± 9.1 | 61.6 ± 1.6 | 82.8 ± 0.7 | 15.2 |
| CLIP-ViT-L/14 | 55.3 ± 10.2 | 68.4 ± 0.5 | 90.0 ± 1.1 | 18.1 |
| SigLIP-L/14 | 57.1 ± 10.9 | 65.0 ± 2.1 | 90.5 ± 0.5 | 19.6 |
| PE-L/14 | 72.9 ± 10.3 | 74.4 ± 0.6 | 94.5 ± 0.6 | 25.5 |
| **MoTIF (Ours)** | | | | |
| MoTIF (RN/50) | 52.8 ± 6.9 | 62.8 ± 1.1 | 82.8 ± 0.6 | 16.0 |
| MoTIF (ViT-B/32) | 53.4 ± 6.9 | 65.3 ± 1.8 | 85.6 ± 1.2 | 17.5 |
| MoTIF (ViT-L/14) | 69.3 ± 6.2 | 73.3 ± 1.0 | 93.2 ± 0.7 | 20.4 |
| MoTIF (SigLIP-L/14) | 73.5 ± 8.6 | 73.2 ± 2.4 | 94.0 ± 0.8 | 22.4 |
| MoTIF (PE-L/14) | 83.6 ± 6.5 | 79.6 ± 0.3 | 95.4 ± 0.7 | 30.0 |
| **Existing Video Models** | | | | |
| TSM | 59.1¹ | 73.5 | 95.9 | 61.7 |
| No frame left behind | 62.0¹ | 73.4¹ | 96.4¹ | 62.7¹ |
| VideoMAE V2 | -- | 88.1 | 99.6 | 76.8 |
¹ Results taken from the literature.
Citation
@misc{knab2025conceptsmotiontemporalbottlenecks,
title={Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification},
author={Patrick Knab and Sascha Marton and Philipp J. Schubert and Drago Guggiana and Christian Bartelt},
year={2025},
eprint={2509.20899},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.20899},
}