Beyond Pixels: Enhancing LIME with Hierarchical Features and Segmentation Foundation Models

A Novel Approach to Explainable AI in Image Analysis

📄 Read on arXiv · Accepted at the ICLR 2025 FM-Wild Workshop

Patrick Knab

Technical University of Clausthal

patrick.knab@tu-clausthal.de

Sascha Marton

University of Mannheim

sascha.marton@uni-mannheim.de

Christian Bartelt

Technical University of Clausthal

christian.bartelt@tu-clausthal.de

Example Dog Image Analysis

Abstract

LIME (Local Interpretable Model-agnostic Explanations) is a popular XAI framework for unraveling decision-making processes in vision machine-learning models. The technique relies on image segmentation to define the fixed regions over which feature importance scores are computed as explanations. Poor segmentation can therefore weaken the explanation and blur the importance of individual segments, ultimately reducing the overall clarity of the interpretation. To address these challenges, we introduce the DSEG-LIME (Data-Driven Segmentation LIME) framework, featuring: i) data-driven segmentation that generates human-recognizable features by integrating segmentation foundation models, and ii) user-steered granularity in the hierarchical segmentation procedure through composition. Our findings demonstrate that DSEG-LIME outperforms conventional segmentation techniques on several XAI metrics for pre-trained ImageNet models and improves the alignment of explanations with human-recognized concepts.

Core Technologies

  • 🔍 Foundation Models

    SAM, DETR

  • 🧠 Base Models

    EfficientNet, ResNet, ViT, ConvNext, CLIP

  • 🎯 Applications

    XAI, Vision, Concept Hierarchy

Key Features

Data-Driven Segmentation

Advanced feature generation aligned with human recognition patterns using foundation models

  • 🔍 SAM (Segment Anything)

    Precise object segmentation

  • 🎯 DETR Object Detection

    End-to-end object recognition

Primary Use:

Feature extraction and semantic segmentation
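To make this concrete, here is a minimal sketch of how SAM-style masks could become LIME features. SAM's automatic mask generator returns one dictionary per mask with the binary mask under a `segmentation` key; the `masks_to_labels` helper and the `min_size` threshold below are illustrative stand-ins for DSEG's segment-filtering step, not the paper's exact code:

```python
import numpy as np

def masks_to_labels(masks, min_size):
    """Merge SAM-style binary masks (the 'segmentation' key of each mask
    dict) into one integer label map, dropping masks smaller than
    `min_size` pixels. Larger masks are painted first so smaller,
    nested masks stay on top; unassigned pixels keep label 0."""
    h, w = masks[0]["segmentation"].shape
    labels = np.zeros((h, w), dtype=np.int32)
    kept = [m for m in masks if m["segmentation"].sum() >= min_size]
    kept.sort(key=lambda m: -m["segmentation"].sum())
    for i, m in enumerate(kept, start=1):
        labels[m["segmentation"]] = i
    return labels

# Toy stand-in for SAM output: a large region, a nested region, and a
# 1-pixel speck that the size filter removes.
big = np.zeros((8, 8), dtype=bool);   big[:, :5] = True
small = np.zeros((8, 8), dtype=bool); small[2:4, 2:4] = True
tiny = np.zeros((8, 8), dtype=bool);  tiny[0, 7] = True
labels = masks_to_labels(
    [{"segmentation": big}, {"segmentation": small}, {"segmentation": tiny}],
    min_size=3,
)
print(sorted(np.unique(labels).tolist()))  # [0, 1, 2]
```

In a real pipeline the mask dicts would come from `SamAutomaticMaskGenerator.generate(image)`; only their boolean masks are needed here.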

Hierarchical Approach

User-steered granularity in the segmentation procedure through composition

  • 🌟 ConvNeXt Integration

    Modern architecture support

  • 🔮 ResNet Backbone

    Robust feature hierarchy

Primary Use:

Multi-level feature analysis and composition
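The composition step can be sketched as a containment test over binary masks: a segment's parent is the smallest strictly larger mask that covers most of its pixels. `build_hierarchy` and the `overlap` threshold are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def build_hierarchy(masks, overlap=0.9):
    """Assign each mask a parent: the smallest strictly larger mask that
    covers at least `overlap` of its pixels. Masks with no such
    container become roots (parent -1)."""
    areas = [int(m.sum()) for m in masks]
    parents = []
    for i, m in enumerate(masks):
        containers = [j for j, other in enumerate(masks)
                      if j != i and areas[j] > areas[i]
                      and int((m & other).sum()) >= overlap * areas[i]]
        parents.append(min(containers, key=lambda j: areas[j]) if containers else -1)
    return parents

# Toy scene: a wheel nested inside a car, plus a separate tree.
car = np.zeros((10, 10), dtype=bool);   car[:, :6] = True
wheel = np.zeros((10, 10), dtype=bool); wheel[7:9, 1:3] = True
tree = np.zeros((10, 10), dtype=bool);  tree[:, 7:] = True
print(build_hierarchy([car, wheel, tree]))  # [-1, 0, -1]
```

The resulting parent pointers are what lets the framework descend from whole objects (the car) into their parts (the wheel) at deeper explanation levels.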

Improved Interpretability

Enhanced alignment with human-recognized concepts and better XAI metrics

  • 👁️ VisionTransformer Support

    Attention-based analysis

  • 🎯 CLIP Zero-shot Analysis

    Cross-modal understanding

Primary Use:

Human-aligned explanations, including explanations for zero-shot predictions

DSEG-LIME Technique

DSEG-LIME Technique Visualization

Algorithm 1: DSEG-LIME Framework

Input: $f$ (black-box model), $\zeta$ (segmentation function), $x$ (input instance), $g$ (interpretable model), $d$ (maximum depth), $hp$ (hyperparameters), $\theta$ (minimum segment size), $k$ (top segments)
1. Initial segmentation:
$\mathcal{S} \leftarrow \zeta(x, hp) \hspace{2em} \triangleright$ Segment the input instance
2. Small cluster removal:
$\mathcal{S'} \leftarrow \{ s_i \in \mathcal{S} \mid \text{size}(s_i) \geq \theta \} \hspace{2em} \triangleright$ Remove small clusters
3. Hierarchical ordering:
$\mathcal{H} \leftarrow \text{BuildHierarchy}(\mathcal{S'}) \hspace{2em} \triangleright$ Build hierarchical segmentation
$\textbf{for}$ $l \leftarrow 1$ $\textbf{to}$ $d$ $\textbf{do}$
$\textbf{if}$ $l = 1$ $\textbf{then}$
$\mathcal{S}_l \leftarrow \mathcal{H}[l] \hspace{2em} \triangleright$ Segments at depth $1$
$\textbf{else}$
$\mathcal{S}_l \leftarrow \{ s_i \in \mathcal{H}[l] \mid \text{parent}(s_i) \in \text{top\_ids} \} \hspace{2em} \triangleright$ Select child segments
$\textbf{end if}$
4. Empty space removal:
$\mathcal{S}_l \leftarrow \text{NearestNeighbor}(\mathcal{S}_l) \hspace{2em} \triangleright$ Fill unsegmented space
$Z \leftarrow \text{Perturb}(x, \mathcal{S}_l) \hspace{2em} \triangleright$ Create neighborhood perturbations
$w \leftarrow \text{Proximity}(Z, x) \hspace{2em} \triangleright$ Compute sample weights
$\text{preds} \leftarrow \{f(z) \mid z \in Z\} \hspace{2em} \triangleright$ Get predictions
$g \leftarrow \text{InitializeModel}(g) \hspace{2em} \triangleright$ Initialize model
$g \leftarrow \text{Fit}(g, Z, \text{preds}, w) \hspace{2em} \triangleright$ Train model
$\text{top\_ids} \leftarrow \{ \text{id}(s_i) \mid s_i \in \mathcal{S}_l, s_i \text{ is among top } k \text{ features in } g \} \hspace{2em} \triangleright$ Update top IDs
$\textbf{end for}$
Return $g \hspace{2em} \triangleright$ Return the local surrogate model

Key Implementation Details

  • Hierarchical segmentation with depth control (\(d\))
  • Adaptive feature selection based on parent importance
  • Integration with any black-box model (\(f\)) and interpretable model (\(g\))
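One depth level of Algorithm 1 can be condensed into a short sketch: perturb segments on/off, weight each sample by its proximity to the original image, and fit a weighted linear surrogate. `lime_step` is illustrative and assumes grayscale images with a mean-color fill for disabled segments; the full framework additionally recurses into the top-k segments at the next depth:

```python
import numpy as np

def lime_step(f, image, labels, n_samples=200, sigma=0.25, seed=0):
    """One depth level of the DSEG-LIME loop: perturb segments, weight
    samples by proximity, fit a weighted linear surrogate, and return
    one importance score per segment id."""
    rng = np.random.default_rng(seed)
    seg_ids = np.unique(labels)
    Z = rng.integers(0, 2, size=(n_samples, len(seg_ids)))  # on/off per segment
    Z[0] = 1                                                # include the original
    baseline = image.mean()                                 # fill for "off" segments
    y, w = [], []
    for z in Z:
        x = image.astype(float).copy()
        for bit, s in zip(z, seg_ids):
            if not bit:
                x[labels == s] = baseline
        y.append(f(x))
        dist = 1.0 - z.mean()                               # fraction of segments removed
        w.append(np.exp(-dist**2 / sigma**2))               # exponential proximity kernel
    A = np.hstack([Z, np.ones((n_samples, 1))])             # features + intercept
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A + 1e-6 * np.eye(A.shape[1]),
                           A.T @ W @ np.array(y))           # ridge-stabilized WLS
    return dict(zip(seg_ids, beta[:-1]))

# Toy model: the prediction is the mean of the top-left quadrant, which
# is exactly segment 0 -- so segment 0 should dominate the explanation.
labels = np.zeros((8, 8), dtype=int)
labels[:4, 4:] = 1; labels[4:, :4] = 2; labels[4:, 4:] = 3
image = np.zeros((8, 8)); image[:4, :4] = 1.0
scores = lime_step(lambda x: x[:4, :4].mean(), image, labels)
print(max(scores, key=scores.get))  # 0
```

In the hierarchical loop, the ids of the top-k scored segments become `top_ids`, and only their children are expanded at depth l+1.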

Evaluation Setup

Methodology

Model Architecture

🧠 Primary Models

EfficientNet (B4)

Core architecture for evaluations

ResNet-101

Feature extraction backbone

👁️ Advanced Models

VisionTransformer-384

Self-attention mechanisms

ConvNeXt

Modern CNN architecture

Foundation Models

  • 🔍 Segment Anything (SAM)

    State-of-the-art segmentation model for precise object boundary detection

  • 🎯 DETR

    End-to-End object detection with transformers

  • 🎯 CLIP

    Zero-shot visual-language understanding and classification

Segmentation Methods

DSEG (Ours)

Data-driven hierarchical segmentation with SAM

SLIC

Superpixel segmentation

Quickshift

Mode-seeking segmentation

Felzenszwalb

Graph-based segmentation

Watershed

Marker-based segmentation

Evaluation Framework

Comprehensive evaluation across multiple dimensions:

  • Correctness (Random Model, Random Explanation, Single Deletion)
  • Output Completeness (Preservation and Deletion)
  • Consistency (Noise Preservation and Deletion)
  • Contrastivity (Preservation and Deletion)
  • Performance Metrics (Gini Index, Stability, Computation Time)
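The output-completeness checks can be sketched as follows: preservation keeps only the top-k segments and deletion removes exactly those segments; a faithful explanation keeps the model's score high in the first case and drops it in the second. The function name and mean-color fill are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def preservation_deletion(f, image, labels, scores, k=1):
    """Preservation: gray out everything except the top-k segments.
    Deletion: gray out exactly the top-k segments.
    Returns the model's score under each condition."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    baseline = image.mean()
    keep = image.astype(float).copy()
    keep[~np.isin(labels, top)] = baseline
    drop = image.astype(float).copy()
    drop[np.isin(labels, top)] = baseline
    return f(keep), f(drop)

# Quadrant segments; the toy model only looks at the top-left quadrant
# (segment 0), which the explanation scores highest.
labels = np.zeros((8, 8), dtype=int)
labels[:4, 4:] = 1; labels[4:, :4] = 2; labels[4:, 4:] = 3
image = np.zeros((8, 8)); image[:4, :4] = 1.0
scores = {0: 0.75, 1: 0.0, 2: 0.0, 3: 0.0}
keep_score, drop_score = preservation_deletion(
    lambda x: x[:4, :4].mean(), image, labels, scores)
print(keep_score > drop_score)  # True: prediction survives preservation only
```

The consistency and contrastivity variants apply the same preservation/deletion logic after adding noise or switching to a contrasting class.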

Quantitative Results

Our comprehensive evaluation demonstrates DSEG-LIME's superior performance across multiple metrics:

Performance Metrics

Gini Index

0.54

Best score (DSEG-GLIME)

Stability

0.010

Best repetition stability score

Random Model

74%

Highest correctness score

Detailed Evaluation Results

| Metric | Category | DSEG Score | Performance |
| --- | --- | --- | --- |
| Random Model | Correctness | 74% | Highest among all methods |
| Random Explanation | Correctness | 93% | Best performance (DSEG-GLIME) |
| Single Deletion | Correctness | 64% | Significantly outperforms other methods |
| Deletion Score | Output Completeness | 74% | Consistent across SLIME, GLIME, and BayLIME variants |
| Noise Preservation | Consistency | 77% | Best consistency score (DSEG-GLIME/BayLIME) |

Key Findings

  • ✓ Superior performance in correctness metrics
  • ✓ Best-in-class stability (0.010)
  • ✓ Highest Gini index (0.54) with GLIME variant
  • ✓ Consistent high performance across all LIME variants
  • ✓ Balanced computation time (28.5-31.9s)
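The Gini index above measures how concentrated the attribution mass is over segments: 0 means importance is spread evenly, values near 1 mean it is focused on a single segment, so higher is sparser and easier to read. The computation below is a plausible reading of the metric as a sketch, not necessarily the paper's exact code:

```python
import numpy as np

def gini(weights):
    """Gini index of absolute attribution scores: sort ascending, then
    apply the standard closed form; 0 = perfectly even importance,
    values approaching 1 = all importance on one segment."""
    w = np.sort(np.abs(np.asarray(weights, dtype=float)))
    n = len(w)
    k = np.arange(1, n + 1)
    return float(((2 * k - n - 1) * w).sum() / (n * w.sum()))

print(gini([1, 0, 0, 0]))  # 0.75 -- fully concentrated (max for n=4)
print(gini([1, 1, 1, 1]))  # 0.0  -- perfectly even
```

Under this reading, DSEG's higher Gini index indicates that its importance scores single out a few semantically meaningful segments rather than smearing attribution over many superpixels.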

Qualitative Results

Our comprehensive user study with 87 participants evaluated the quality and interpretability of explanations across different segmentation methods:

User Study Results

| Method | Avg. Score ↑ | Best Rated ↑ |
| --- | --- | --- |
| DSEG | 4.16 | 1042 |
| SLIC | 3.01 | 150 |
| Quickshift | 1.99 | 90 |
| Felzenszwalb | 3.25 | 253 |
| Watershed | 2.59 | 205 |

Note: Average scores are on a scale of 1-5, with 5 being the best. Best Rated shows the number of times each method was rated highest.

Key Findings

  • 📈 Highest Average Score: DSEG achieved 4.16/5, significantly outperforming other methods
  • 🏆 Most Preferred: DSEG was rated best 1042 times, over 4x more than the next best method
  • 📊 Consistent Performance: DSEG maintained superior ratings across different image types and scenarios
  • 🎯 User Satisfaction: Participants particularly appreciated DSEG's alignment with human-recognized features

Example Explanations

Feature Attribution Heatmaps

Feature Attribution Heatmaps

DSEG-LIME generates precise feature attribution heatmaps for different object classes. The examples above show attribution maps for a dishwasher, gorilla, and airplane, where darker blue regions indicate stronger positive contributions to the model's decision. This demonstrates DSEG-LIME's ability to identify semantically meaningful regions that influence the model's predictions.

Multimodal Analysis & Zero-Shot Classification

Deer Multimodal Analysis

DSEG-LIME's capability to integrate both visual and textual information for comprehensive explanations, showcasing its versatility in zero-shot classification scenarios using CLIP. The example demonstrates analysis of a deer image with associated text about wildlife conservation, where CLIP successfully identifies the image as a "land mammal" without prior training on this specific category. The bar chart shows feature importance scores that incorporate both visual elements and textual context, highlighting terms like "wildlife" and "climate" that influence the model's predictions. This demonstrates DSEG-LIME's ability to explain zero-shot predictions by leveraging CLIP's multimodal understanding.
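The zero-shot step reduces to cosine similarity between a normalized image embedding and the embeddings of candidate text prompts, scaled by a temperature and softmaxed. The embeddings below are hand-crafted stand-ins; a real pipeline would obtain them from CLIP's image and text encoders:

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot classification: cosine similarity between a
    normalized image embedding and normalized prompt embeddings, scaled
    by a temperature and softmaxed into class probabilities."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

# Hand-crafted embeddings: the "image" lies closest to the first prompt.
prompts = ["a photo of a land mammal", "a photo of a bird", "a photo of a fish"]
text_embs = np.eye(3)
image_emb = np.array([0.9, 0.3, 0.1])
probs = zero_shot_probs(image_emb, text_embs)
print(prompts[int(probs.argmax())])  # a photo of a land mammal
```

DSEG-LIME then perturbs the image segments (and, in the multimodal case, text tokens) and attributes these zero-shot probabilities back to them.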

Granularity Control

Zebra Granularity Example

Demonstration of DSEG-LIME's user-controlled granularity: The original image (left) is analyzed at different hierarchical depths. At d=1, the explanation captures major features, while at d=2, it provides finer-grained segmentation focusing on more specific regions like the head and striped pattern. This flexibility allows users to choose the level of detail in explanations.
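Given a segment hierarchy, selecting a granularity level amounts to a simple parent-pointer walk. `segments_at_depth` and the parent-array layout are illustrative assumptions about how the depth parameter d could be resolved:

```python
def segments_at_depth(parents, d):
    """Return the segment indices at hierarchy depth d (roots have
    depth 1). parents[i] holds the parent index of segment i, or -1
    for a root segment."""
    def depth(i):
        return 1 if parents[i] == -1 else 1 + depth(parents[i])
    return [i for i in range(len(parents)) if depth(i) == d]

# Zebra toy hierarchy: segment 0 is the whole zebra; 1 (head) and
# 2 (striped body) are its children.
parents = [-1, 0, 0]
print(segments_at_depth(parents, 1))  # [0]    -> coarse explanation (d=1)
print(segments_at_depth(parents, 2))  # [1, 2] -> finer sub-regions (d=2)
```

Raising d trades the readability of whole-object explanations for finer attribution over sub-parts, which is exactly the choice the zebra example illustrates.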

View More Examples

Quick Start

Get started with DSEG-LIME using our interactive notebook:

Try Demo Notebook · View on GitHub