Evaluation Metrics Suite

ViStoryBench Evaluation Metrics

A comprehensive suite of automated and human-verified metrics designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings.

Character Identification Similarity (CIDS)

Evaluates character identity consistency through a detection-and-feature-matching pipeline. It quantifies how well the model preserves each character's identity across generated shots (Self-Similarity) and against the reference anchor (Cross-Similarity).

Detection & Extraction Pipeline

Utilizes Grounding DINO for precise bounding box detection. Features are extracted via a tri-model ensemble (ArcFace, AdaFace, FaceNet) for realistic subjects, or CLIP for stylized characters, producing robust 512-d vectors.
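Below is a minimal sketch of this extraction step. The helpers `detector`, `face_encoders`, and `clip_encoder` are hypothetical stand-ins for Grounding DINO, the ArcFace/AdaFace/FaceNet ensemble, and the CLIP encoder; averaging the three face embeddings is one plausible ensembling strategy, not necessarily the exact one used by the benchmark.

```python
import numpy as np

def extract_identity_features(image, realistic, detector, face_encoders, clip_encoder):
    """Detect characters and return one L2-normalized 512-d identity vector per detection.

    `detector`, `face_encoders` (ArcFace/AdaFace/FaceNet), and `clip_encoder`
    are hypothetical callables standing in for the actual models.
    """
    boxes = detector(image)                      # Grounding DINO-style boxes (x0, y0, x1, y1)
    feats = []
    for box in boxes:
        crop = image.crop(box)                   # PIL-style crop of the detected character
        if realistic:
            # One plausible ensembling strategy: average the three 512-d face embeddings.
            emb = np.mean([enc(crop) for enc in face_encoders], axis=0)
        else:
            emb = clip_encoder(crop)             # CLIP features for stylized characters
        feats.append(emb / np.linalg.norm(emb))  # unit-normalize for cosine similarity
    return feats
```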

Cross-Similarity

Measures fidelity between generated characters and their ground-truth reference images.

Self-Similarity

Evaluates identity stability across the sequence of generated shots within a story.
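Given the unit-normalized identity vectors above, both scores reduce to cosine-similarity averages. The exact aggregation (mean over all pairs) is an assumption; the sketch below illustrates the idea.

```python
import numpy as np

def cross_similarity(gen_feats, ref_feats):
    """Mean cosine similarity between generated-character and reference features (all unit-normalized)."""
    return float(np.mean([g @ r for g in gen_feats for r in ref_feats]))

def self_similarity(shot_feats):
    """Mean pairwise cosine similarity of one character's features across generated shots."""
    pairs = [shot_feats[i] @ shot_feats[j]
             for i in range(len(shot_feats)) for j in range(i + 1, len(shot_feats))]
    return float(np.mean(pairs)) if pairs else 0.0
```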

Style Similarity (CSD)

Adopts the CSD (CLIP Style Disentanglement) metric to evaluate stylistic coherence. This metric decouples content from style, ensuring that the generated sequence maintains a consistent artistic direction, independent of the changing semantic content.

CSD Pipeline Implementation
  • Image Encoding: Encodes images using a CLIP vision encoder trained on large-scale style datasets.
  • Feature Disentanglement: Features pass through CSD layers to isolate style embeddings from content.
  • Similarity Scoring: Computes pairwise cosine similarity between style embeddings to measure both intra-sequence (Self) and reference-target (Cross) consistency.
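A minimal sketch of the scoring step, assuming a hypothetical `style_embed` callable that wraps the CSD encoder and returns L2-normalized style embeddings; the pairwise-mean aggregation mirrors the CIDS computation above and is an assumption about the exact implementation.

```python
import numpy as np

def csd_scores(gen_images, ref_images, style_embed):
    """Self (intra-sequence) and Cross (reference-target) style consistency.

    `style_embed` is a hypothetical callable returning an L2-normalized
    style embedding from the CSD encoder.
    """
    gen = [style_embed(im) for im in gen_images]
    ref = [style_embed(im) for im in ref_images]
    self_sims = [gen[i] @ gen[j] for i in range(len(gen)) for j in range(i + 1, len(gen))]
    cross_sims = [g @ r for g in gen for r in ref]
    return {
        "self_csd": float(np.mean(self_sims)) if self_sims else 0.0,
        "cross_csd": float(np.mean(cross_sims)) if cross_sims else 0.0,
    }
```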

Prompt Alignment

Evaluates how well generated images align with the multi-grained storyboard descriptions. We employ GPT-4.1 to rate four distinct aspects of narrative fidelity on a 0-4 Likert scale, which is then converted to a 100-point score (a conversion sketch follows the four aspects below).

Character Interaction

Alignment of group-level interactions (e.g., "hugging", "fighting") with the static shot description.

Shooting Method

Consistency of camera perspective (Close-up, Wide shot, etc.) with the shot design.

Static Shot Description

Global correspondence of scene setting, mood, and layout with the narrative details.

Individual Actions

Accuracy of specific gestures, expressions, and poses for each character.
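The sketch below shows how per-aspect Likert ratings could be mapped to the 100-point scale. The aspect keys are illustrative names, and the linear rescaling (score / 4 × 100) is an assumption about the conversion rather than the documented procedure.

```python
ASPECTS = ("character_interaction", "shooting_method",
           "static_shot_description", "individual_actions")

def to_hundred_scale(likert):
    """Map per-aspect 0-4 Likert ratings (e.g., from a GPT-4.1 judge) to 100-point scores.

    The linear rescaling (score / 4 * 100) is an assumption about the conversion.
    """
    return {aspect: likert[aspect] / 4.0 * 100.0 for aspect in ASPECTS}

# A shot rated 3/4 on every aspect maps to 75.0 per aspect.
print(to_hundred_scale({a: 3 for a in ASPECTS}))
```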

Onstage Character Count Matching (OCCM)

A specialized metric addressing hallucination (superfluous characters) or omission (missing characters). It calculates the accuracy of the character count based on detected vs. expected entities.

$$\text{OCCM} = 100 \times \exp\left(-\frac{|D - E|}{\epsilon + E}\right)$$

where $D$ is the detected character count, $E$ is the expected character count, and $\epsilon = e^{-6}$ is a smoothing constant that prevents division by zero.
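The formula translates directly into code; a minimal sketch:

```python
import math

def occm(detected, expected, eps=math.exp(-6)):
    """Onstage Character Count Matching: 100 * exp(-|D - E| / (eps + E))."""
    return 100.0 * math.exp(-abs(detected - expected) / (eps + expected))

# Example: all 3 expected characters detected -> 100.0; one missing -> ~71.7.
print(occm(3, 3), occm(2, 3))
```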

Details of Copy-Paste Detection

To rigorously evaluate whether the generated image is merely a replication of a specific reference image (denoted as the anchor or target reference r0) rather than a generalized synthesis from the provided character concept, we employ a Softmax-based Copy-Paste Score.

Let g be the unit-normalized feature vector of the generated image, and R = {r0, r1, ..., rN} be the set of unit-normalized feature vectors for the input reference images, where r0 represents the primary reference subject to copy-paste detection, and {r1, ..., rN} serves as the set of auxiliary references for the same character.

We first calculate the cosine similarity between the generated image and each reference image in the set R. To quantify the exclusivity of the match between g and the target r0 relative to other references, we formulate the score as a probability distribution using a temperature-scaled Softmax function.

$$ \text{CopyRate}(\mathbf{g} \mid \mathcal{R}) = \frac{\exp(\mathbf{g}^\top \mathbf{r}_0 / \tau)}{\sum_{k=0}^{N} \exp(\mathbf{g}^\top \mathbf{r}_k / \tau)} $$

Equation: Temperature-scaled Softmax

Where τ is a temperature hyperparameter set to 0.01. This low value sharpens the distribution, making the metric highly sensitive to the nearest neighbor in feature space.

Interpretation: A score approaching 1 suggests "copy-paste" overfitting to r0. Lower scores imply successful generalization beyond the specific appearance of r0.
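A minimal sketch of the score as defined above, operating on unit-normalized feature vectors; the max-subtraction is only a standard numerical-stability step and does not change the result.

```python
import numpy as np

def copy_rate(g, refs, tau=0.01):
    """Temperature-scaled Softmax copy-paste score.

    g:    unit-normalized feature of the generated image.
    refs: [r0, r1, ..., rN], unit-normalized reference features,
          with r0 the reference tested for copy-paste.
    """
    logits = np.array([g @ r for r in refs]) / tau
    logits -= logits.max()                      # stabilize before exponentiation
    probs = np.exp(logits)
    probs /= probs.sum()
    return float(probs[0])                      # probability mass assigned to r0
```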

Image Quality & Aesthetic

Evaluates visual fidelity using a composite of Aesthetic Predictors and Diversity metrics.

Aesthetic Score (V2.5)

Based on a SigLIP model, assessing images on a 1-10 scale. Penalizes blur/noise; scores > 5.5 indicate high quality.

Inception Score (IS)

Measures the model's ability to generate varied outputs (Diversity) and distinct features (Clarity).
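For reference, a numpy sketch of the standard IS definition computed from per-image class probabilities (e.g., Inception-v3 softmax outputs); the split-based averaging follows common practice and is an implementation choice, not necessarily the benchmark's exact setup.

```python
import numpy as np

def inception_score(probs, splits=10):
    """Inception Score from an (N, num_classes) array of per-image class probabilities.

    IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ); averaging over `splits` chunks
    follows common practice and is an implementation choice.
    """
    scores = []
    for chunk in np.array_split(probs, splits):
        p_y = chunk.mean(axis=0, keepdims=True)        # marginal label distribution
        kl = (chunk * (np.log(chunk + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))
```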

Human Evaluation

A rigorous user study validating automated metrics against human preference across three key dimensions.

  • Environment Consistency — correlated automated metric: CSD
  • Character ID Consistency — correlated automated metric: CIDS
  • Subjective Aesthetics — correlated automated metric: Aesthetic Score

Analysis reveals strong correlation between the automated metrics and human judgment (e.g., Pearson's r = 0.80 for Self-CIDS).
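Such correlations can be computed with standard tools; a minimal sketch, assuming paired per-model (or per-story) score lists:

```python
from scipy.stats import pearsonr

def metric_human_correlation(metric_scores, human_scores):
    """Pearson correlation between automated metric scores and paired human ratings."""
    r, p_value = pearsonr(metric_scores, human_scores)
    return r, p_value
```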