Benchmark Metrics Definition

Comprehensive evaluation metrics for story visualization, assessing consistency, adherence, quality, diversity, and human perception.

1. Character Consistency (CRef / CIDS)

Evaluates whether characters maintain consistent appearance across an image sequence. Focuses on self-similarity of character identity features and cross-similarity with reference images. Also known as Character Identification Similarity (CIDS) in ViStoryBench.

Average Character Cosine Similarity (aCCS)

Calculates character similarity across frames using CLIP image features (following TheaterGen and Story-Adapter). Suitable for general characters regardless of rendering style.
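
A minimal sketch of aCCS, assuming CLIP ViT-B/32 via Hugging Face `transformers` and pre-cropped character images; the checkpoint and the all-pairs averaging scheme are illustrative choices:

```python
import itertools
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# aCCS sketch: mean pairwise cosine similarity of CLIP image features
# over the character crops of one story (assumes >= 2 crops).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def accs(character_crops: list[Image.Image]) -> float:
    inputs = processor(images=character_crops, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize
    sims = [
        (feats[i] @ feats[j]).item()
        for i, j in itertools.combinations(range(len(feats)), 2)
    ]
    return sum(sims) / len(sims)  # average over all frame pairs
```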

ArcFace / AdaFace

Face recognition models specialized for real human faces, extracting facial features for similarity computation.
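
A sketch of face-identity similarity using `insightface`, whose default model pack includes an ArcFace-family recognizer; the package choice and model pack are assumptions, and AdaFace would be used analogously with its own weights:

```python
import numpy as np
from insightface.app import FaceAnalysis

# Face-identity similarity for photorealistic characters. Inputs are
# BGR arrays as loaded by cv2.imread; assumes one face per image.
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

def face_similarity(img_a: np.ndarray, img_b: np.ndarray) -> float:
    emb_a = app.get(img_a)[0].normed_embedding  # first detected face
    emb_b = app.get(img_b)[0].normed_embedding
    return float(np.dot(emb_a, emb_b))  # cosine sim of unit vectors
```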

CCIP (Contrastive Character Image Pre-training)

Evaluates consistency for anime-style characters, often requiring detection of individual characters within images.

Multi-Character Confusion

Measures the model's ability to distinguish different characters in multi-character scenes, avoiding identity mismatches.

Grounding DINO + Similarity

First uses Grounding DINO to detect and crop the main characters based on text descriptions, then measures their similarity to reference images using feature embeddings such as CLIP or DreamSim, or distribution-level distances such as FID.
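
A sketch of the detection-and-crop step, assuming the `transformers` port of Grounding DINO; the checkpoint and thresholds are illustrative, and the resulting crop would then be fed to a CLIP/DreamSim/FID comparison:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

ckpt = "IDEA-Research/grounding-dino-base"
processor = AutoProcessor.from_pretrained(ckpt)
detector = GroundingDinoForObjectDetection.from_pretrained(ckpt)

def crop_character(image: Image.Image, description: str) -> Image.Image:
    # Grounding DINO expects lowercase queries ending with a period.
    inputs = processor(images=image, text=description.lower() + ".",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, box_threshold=0.35,
        text_threshold=0.25, target_sizes=[image.size[::-1]])[0]
    # Keep the highest-scoring box (no-detection handling omitted).
    box = results["boxes"][results["scores"].argmax()].tolist()
    return image.crop(tuple(box))
```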

DreamSim Similarity

A learned perceptual similarity metric for fine-grained image-to-image comparison, used to evaluate how closely generated images match reference images.
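
A usage sketch assuming the `dreamsim` PyPI package; the file names are placeholders:

```python
import torch
from PIL import Image
from dreamsim import dreamsim  # pip install dreamsim

# Compare a generated character crop against its reference image;
# lower distance means perceptually closer.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = dreamsim(pretrained=True, device=device)

gen = preprocess(Image.open("generated_crop.png")).to(device)
ref = preprocess(Image.open("reference.png")).to(device)
distance = model(gen, ref)
print(float(distance))
```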

VBench Character Tracking

Adapts video-based character-consistency evaluation to image sequences, tracking how stably characters persist across the sequence.

Char-F1 / Char-Acc

Uses a pre-trained character classifier to identify characters in generated images and compares with ground truth labels to calculate F1 score or accuracy.
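
A minimal sketch of the scoring step with `scikit-learn`, assuming per-image character labels are already produced by a classifier (not shown); the label values are illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["alice", "bob", "alice", "carol"]    # ground-truth labels
y_pred = ["alice", "alice", "alice", "carol"]  # classifier output

char_f1 = f1_score(y_true, y_pred, average="macro")  # Char-F1
char_acc = accuracy_score(y_true, y_pred)            # Char-Acc
```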

VLM Verification

Utilizes Vision Language Models (VLMs) to compare characters in generated images with manually annotated ground truth character information for consistency verification.

2. Style Consistency (SRef / CSD)

Evaluates whether the artistic style remains consistent across an image sequence, including style cross-similarity with reference images and style self-similarity among generated images. ViStoryBench uses CSD (Contrastive Style Descriptors) for this.

CSD (Contrastive Style Descriptors)

Quantifies self-similarity (within generated images) and cross-similarity (between generated and reference images) through CSD-CLIP feature analysis. First, features are extracted using a CLIP vision encoder, then CSD layers separate content and style features, and finally, cosine similarity between style feature embeddings is computed.
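
An abstract sketch of the two CSD scores; `StyleHead` below is only a stand-in for the real CSD style branch, and its weights and dimensions are assumptions made to keep the sketch executable:

```python
import torch
import torch.nn.functional as F

class StyleHead(torch.nn.Module):
    """Placeholder for the CSD style projection over CLIP features."""
    def __init__(self, dim_in: int = 768, dim_out: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(dim_in, dim_out)  # placeholder weights

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

def csd_scores(gen_feats, ref_feats, head):
    g = F.normalize(head(gen_feats), dim=-1)  # (N, D) generated styles
    r = F.normalize(head(ref_feats), dim=-1)  # (M, D) reference styles
    n = g.shape[0]                            # assumes N >= 2
    self_sim = (g @ g.T).triu(1).sum() / (n * (n - 1) / 2)  # within-set
    cross_sim = (g @ r.T).mean()              # generated vs. reference
    return self_sim.item(), cross_sim.item()
```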

VLM Scoring

Utilizes Vision Language Models like Qwen or GPT-4o to score the style consistency of generated image sequences.

ArtFID

A variant of FID (Fréchet Inception Distance) tailored to artistic imagery, originally proposed for the quantitative evaluation of neural style transfer; it assesses the realism and diversity of generated images in terms of artistic aesthetics.

Frequency Analysis / VGG Distance

Evaluates style similarity by analyzing the frequency-domain characteristics of images or the distances between deep features extracted by VGG networks; these are more exploratory approaches.
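
A sketch of the VGG variant using the classic Gram-matrix style distance from neural style transfer; the feature-layer index is an assumption:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()

def gram(feats: torch.Tensor) -> torch.Tensor:
    b, c, h, w = feats.shape
    f = feats.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # normalized Gram matrix

def style_distance(img_a: torch.Tensor, img_b: torch.Tensor,
                   layer: int = 21) -> float:
    # img_*: (1, 3, H, W) tensors, ImageNet-normalized; `layer` picks
    # an intermediate conv block (choice here is illustrative).
    with torch.no_grad():
        fa = vgg[: layer + 1](img_a)
        fb = vgg[: layer + 1](img_b)
    return F.mse_loss(gram(fa), gram(fb)).item()
```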

3. Prompt Adherence

Assesses how well the generated images align with the storyboard descriptions (prompts), including character interactions, camera/shot types, the accuracy of the number of on-stage characters, and individual actions.

Objective Assessment - Entity Presence (VLM)

Uses VLMs to evaluate whether key entities (objects, colors, character count and descriptions, behaviors) appear as prompted. In particular, Onstage Character Count Matching (OCCM) is calculated as `100 * exp(-|Detected - Expected| / (epsilon + Expected))`, where `epsilon` is a small constant that guards against division by zero.
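
A direct transcription of the OCCM formula above; the value of `epsilon` is an assumption:

```python
import math

EPSILON = 1e-6  # assumed small constant; only matters when Expected == 0

def occm(detected: int, expected: int) -> float:
    return 100 * math.exp(-abs(detected - expected) / (EPSILON + expected))

occm(3, 3)  # 100.0  -> exact count match
occm(2, 3)  # ~71.65 -> one character missing
```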

Objective Assessment - Camera/Shot Type (VLM)

Evaluates whether the camera work in the generated image (e.g., long/medium shot, high/low angle) matches the "Shot Perspective Design" in the storyboard.

Objective Assessment - Environmental Consistency (VLM)

Evaluates whether scene attributes, time, location, etc., in the generated image align with the "Setting Description" in the storyboard.

Subjective Assessment - Emotion / Atmosphere (VLM)

Evaluates whether the emotion and atmosphere conveyed by the generated image align with the descriptions in the prompt (e.g., character interactions, static descriptions).

Subjective Assessment - Action / Expression (Grounding DINO + VLM)

After character segmentation using Grounding DINO, VLMs are used to verify if individual character actions and expressions conform to the "Static Shot Description" in the storyboard.

4. Generation Quality

Evaluates the visual quality and aesthetic appeal of generated images, and whether there is an issue of "copy-pasting" reference images.

Aesthetic Score (aesthetic-predictor-v2-5)

Uses Aesthetic Predictor V2.5 (based on SigLIP) to score image aesthetics on a 1-10 scale; scores above 5.5 are generally considered excellent quality. It also captures low-level defects such as blur and noise.

FID (Fréchet Inception Distance)

Measures the similarity in feature space distribution between generated images and real images, commonly used to assess the quality of realistic images.
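
A usage sketch with `torchmetrics` (one possible implementation choice); the random tensors below merely stand in for real and generated image batches:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# uint8 (N, 3, H, W) batches; in practice FID needs many images
# per distribution to be reliable.
real = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())  # lower is better
```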

PickScore

A CLIP-based image quality assessment model that predicts human preference rankings for images, reflecting the overall attractiveness of generated images.
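
A scoring sketch following the PickScore release (processor from the LAION CLIP-H checkpoint, model weights `yuvalkirstain/PickScore_v1`); pairing one prompt with one image here is illustrative:

```python
import torch
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval()

def pick_score(prompt: str, image) -> float:
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt")
    with torch.no_grad():
        img = model.get_image_features(**image_inputs)
        txt = model.get_text_features(**text_inputs)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Scaled cosine similarity; higher = stronger predicted preference.
        return (model.logit_scale.exp() * (txt @ img.T)).item()
```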

Copy-Paste Detection

Detects whether the model over-relies on, or directly copies, the character reference images. For single-image-input models, it compares the output character's similarity to the provided reference against its similarity to an alternative reference image; a large gap suggests copying. This item is not applicable to multi-image-input models.
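
A sketch of the comparison at the heart of this check; feature extraction is assumed to happen upstream (CLIP, ArcFace, etc.), and any decision threshold on the gap is an assumption:

```python
import numpy as np

def copy_paste_score(gen_feat: np.ndarray, ref_feat: np.ndarray,
                     alt_ref_feat: np.ndarray) -> float:
    # Similarity gap between the provided reference and an alternative
    # reference of the same character.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos(gen_feat, ref_feat) - cos(gen_feat, alt_ref_feat)

# Score near 0: the model abstracts identity rather than copying;
# a large positive gap: suspected direct copying of the input image.
```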

Degradation Analysis

Analyzes whether generation quality decreases as the length of the image sequence increases.
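
A minimal sketch, assuming per-frame quality scores (e.g., aesthetic scores) are already available; the numbers are illustrative:

```python
import numpy as np

# Fit a line to per-frame quality over frame index; a clearly negative
# slope indicates quality decay as the sequence gets longer.
scores = np.array([5.8, 5.7, 5.5, 5.6, 5.2, 5.0])  # illustrative values
slope = np.polyfit(np.arange(len(scores)), scores, deg=1)[0]
print(f"quality trend per frame: {slope:+.3f}")
```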

5. Diversity

Measures the model's ability to generate multiple different outputs from a single prompt or to exhibit variation in sequence generation.

Inception Score (IS)

Evaluates the clarity and diversity of a batch of generated images. For multiple results generated from a single prompt, IS can reflect the diversity of their content.
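
A usage sketch with `torchmetrics`; the random tensor stands in for multiple generations from one prompt:

```python
import torch
from torchmetrics.image.inception import InceptionScore

inception = InceptionScore()
imgs = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
inception.update(imgs)
mean, std = inception.compute()  # higher mean = clearer + more diverse
```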

CLIP Feature Variance

Calculates the variance of CLIP features for all generated images in a story sequence. Higher variance indicates more content or style variation among images, thus higher diversity.
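
A minimal sketch, assuming an `(N, D)` tensor of CLIP features for the N frames of one story:

```python
import torch

def clip_feature_variance(feats: torch.Tensor) -> float:
    # Mean per-dimension variance of the normalized frame features.
    feats = feats / feats.norm(dim=-1, keepdim=True)  # (N, D), unit norm
    return feats.var(dim=0, unbiased=True).mean().item()
```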

6. Human Evaluation

Subjective assessment of generated results by human evaluators, typically covering environmental consistency, character identification consistency, and subjective aesthetics.

Cinematography

Assesses the use of camera language, the rationality and aesthetics of composition, and the effectiveness of visual storytelling.

Narrative Coherence

Evaluates the fluency of visual presentation, the rationality of plot development, and the overall comprehensibility of the story through methods like questionnaires.

Visual Coherence

Assesses the continuity and consistency of visual elements (such as lighting, color, object placement) in adjacent or related images within an image sequence.

Layout (Comics)

For comic generation tasks, evaluates the appeal, narrative guidance, and visual impact of panel layouts.