
Highlights

  • Creating images is the easy part. Judging their quality is much harder. Human feedback is slow, expensive, biased, and often inconsistent. Plus, quality has many faces: creativity, realism, and style don’t always align. Improving one can hurt another. (View Highlight)
  • That’s why we need clear, objective metrics that capture quality, coherence, and originality. We’ll look at ways to measure image quality and compare models with Pruna, beyond just “does it look cool?” (View Highlight)
  • There is no single correct way to categorize evaluation metrics, as a metric can belong to multiple categories depending on its usage and the data it evaluates. In our repository, all quality metrics can be computed in two modes: single and pairwise.
    • Single mode evaluates a model by comparing the generated images to input references or ground-truth images, producing one score per model.
    • Pairwise mode compares two models by directly evaluating the generated images from each model together, producing a single comparative score for these two models. (View Highlight)
  • This flexibility enables both absolute evaluations (assessing each model individually) and relative evaluations (direct comparisons between models). (View Highlight)
  • Efficiency Metrics: Measure the speed, memory usage, energy consumption, carbon emissions, and similar characteristics of models during inference. At Pruna, we focus on making your models smaller, faster, cheaper, and greener, so evaluating your models with these efficiency metrics is a natural fit. However, because efficiency metrics are not specific to image generation tasks, we won’t discuss them in detail in this blog post. If you’d like to learn more about these metrics, please refer to our documentation. (View Highlight)
  • Quality Metrics: Measure the intrinsic quality of generated images and their alignment to intended prompts or references. These include:
    • Distribution Alignment: How closely generated images resemble real-world distributions.
    • Prompt Alignment: Semantic similarity between generated images and their intended prompts.
    • Perceptual Alignment: Pixel-level or perceptual similarity between generated and reference images. (View Highlight)
  • Distribution Alignment Metrics: Distribution alignment metrics measure how closely generated images resemble real-world data distributions, comparing both low- and high-level features. In pairwise mode, they compare outputs from different models to produce a single score that reflects relative image quality. In the first example, the generated image closely resembles the real one and the distributions are well aligned, suggesting good quality. In the second, the generated image is noticeably off and the distributions differ significantly, which the metric captures as a mismatch. (View Highlight)
  • Fréchet Inception Distance (FID): FID (introduced here) is one of the most popular metrics for evaluating how realistic AI-generated images are. It works by comparing the feature distribution of the reference images (e.g. real images) to that of the images generated by the model under evaluation. Here’s how it works in a nutshell:
    1. We take a pretrained surrogate model and pass both real and generated images through it. The surrogate model is usually Inception v3, which explains the metric’s name.
    2. The model turns each image into a feature embedding (a numerical summary of the image). We assume the embeddings from each set form a Gaussian distribution.
    3. FID then measures the distance between the two distributions — the closer they are, the better. A lower FID score indicates that the generated images are more similar to real ones, meaning better image quality. (View Highlight)
  • FID is calculated as the Fréchet distance between two multivariate Gaussians:

    FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2})

    Where:
    • (μ_r, Σ_r) are the mean and covariance of real image features.
    • (μ_g, Σ_g) are the mean and covariance of generated image features.
    • Tr denotes the trace of a matrix.
    • (Σ_r Σ_g)^{1/2} is the matrix square root of the product of the two covariance matrices. (View Highlight)
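To make the formula concrete, here is a minimal NumPy/SciPy sketch of the Fréchet distance between two sets of pre-extracted Inception-v3 embeddings. The array names are illustrative; in practice you would rely on a maintained implementation rather than rolling your own.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID between two (N, D) arrays of image embeddings (e.g. Inception-v3 features)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the product of the covariance matrices.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts caused by numerical error

    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean))
```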
  • CLIP Maximum Mean Discrepancy (CMMD): CMMD (introduced here) is another way to measure how close your generated images are to real ones. Like FID, it compares feature distributions, but instead of using Inception features, it uses embeddings from a pretrained CLIP model. Here’s how it works:
    1. We take a pretrained surrogate model and pass both real and generated images through it. The surrogate model is usually CLIP.
    2. The model turns each image into a feature embedding (a numerical summary of the image). We do not assume the embeddings from each set form a Gaussian distribution.
    3. We use a kernel function (usually an RBF kernel) to compare how the two distributions differ, without assuming they are Gaussian. A lower CMMD score indicates that the feature distributions of generated images are more similar to those of real images, meaning better image quality. (View Highlight)
  • CMMD is based on the Maximum Mean Discrepancy (MMD) and is computed as:

    CMMD = 𝔼[k(φ(x_r), φ(x′_r))] + 𝔼[k(φ(x_g), φ(x′_g))] − 2𝔼[k(φ(x_r), φ(x_g))]

    Where:
    • φ(x_r) and φ(x′_r) are two independent real image embeddings extracted from CLIP.
    • φ(x_g) and φ(x′_g) are two independent generated image embeddings extracted from CLIP.
    • k(x, y) is a positive definite kernel function that measures similarity between embeddings.
    • The expectations 𝔼[·] are computed over multiple sample pairs. (View Highlight)
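A small self-contained sketch of this computation, assuming `real_emb` and `gen_emb` are (N, D) arrays of CLIP image embeddings you have already extracted. The RBF bandwidth below is an illustrative choice, not the value used by any particular implementation.

```python
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, sigma: float) -> np.ndarray:
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for every pair of rows.
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd_rbf(real_emb: np.ndarray, gen_emb: np.ndarray, sigma: float = 10.0) -> float:
    """Biased MMD estimate between two sets of CLIP embeddings (lower is better)."""
    k_rr = rbf_kernel(real_emb, real_emb, sigma).mean()
    k_gg = rbf_kernel(gen_emb, gen_emb, sigma).mean()
    k_rg = rbf_kernel(real_emb, gen_emb, sigma).mean()
    return float(k_rr + k_gg - 2.0 * k_rg)
```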
  • Prompt alignment metrics evaluate how well generated images match their input prompts, especially in text-to-image tasks. In pairwise mode, they instead measure semantic similarity between outputs from different models, shifting the focus from prompt alignment to model agreement.
  • CLIPScore: CLIPScore (introduced here) tells you how well a generated image matches the text prompt that produced it. It uses a pretrained CLIP model, which maps both text and images into the same embedding space. Here’s the idea:
    1. Pass the image and its prompt through the surrogate CLIP model to get their embeddings.
    2. Measure how close these two embeddings are. The closer they are, the better the alignment between the image and the prompt. CLIPScore ranges from 0 to 100. A higher score means the image is more semantically aligned with the prompt. Note that this metric doesn’t look at visual quality, just the match in meaning. (View Highlight)
  • Given an image x and its corresponding text prompt t, CLIPScore is computed as:

    CLIPScore(x, t) = max(100 × (φ_I(x) · φ_T(t)) / (||φ_I(x)|| · ||φ_T(t)||), 0)

    Where:
    • φ_I(x) is the CLIP image embedding of the generated image.
    • φ_T(t) is the CLIP text embedding of the associated prompt.

    CLIPScore ranges from 0 to 100, with higher scores indicating better alignment between the image and its prompt. However, it may be insensitive to image quality since it focuses on semantic similarity rather than visual fidelity. (View Highlight)
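As an illustration, the scoring step can be sketched with the Hugging Face CLIP implementation. The checkpoint name and preprocessing here are assumptions for the example, not necessarily what Pruna uses internally.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; any CLIP variant with image and text towers works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, scaled to [0, 100]."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return max(100.0 * cos, 0.0)
```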
  • Perceptual alignment metrics evaluate the perceptual quality and internal consistency of generated images. They compare pixel-level or feature-level differences between images. These metrics are often pairwise by nature, as comparing generated images with other generated images is more appropriate in certain cases, such as pixel-by-pixel comparisons. (View Highlight)
  • Peak Signal-to-Noise Ratio (PSNR): PSNR measures the pixel-level similarity between a generated image and its reference (ground truth) image. It is widely used for evaluating image compression and restoration models. A higher PSNR value indicates better image quality, but PSNR does not always correlate well with human perception. (View Highlight)
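PSNR is defined as 10·log10(MAX² / MSE), where MAX is the maximum possible pixel value. A minimal NumPy sketch, assuming 8-bit images stored as same-shape arrays:

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB; higher means the images are more similar."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```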
  • Structural Similarity Index (SSIM): SSIM improves upon PSNR by comparing local patterns of pixel intensities instead of just raw pixel differences. It models human visual perception by considering luminance, contrast, and structure in small image patches. SSIM ranges from -1 to 1, where 1 indicates perfect similarity. (View Highlight)
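In practice SSIM is rarely implemented by hand. A sketch using scikit-image, assuming (H, W, 3) uint8 arrays and a recent scikit-image version that accepts `channel_axis`:

```python
from skimage.metrics import structural_similarity

def ssim(reference, generated) -> float:
    # data_range=255 matches 8-bit images; channel_axis=-1 treats the last axis as color.
    return float(structural_similarity(reference, generated,
                                        channel_axis=-1, data_range=255))
```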
  • Learned Perceptual Image Patch Similarity (LPIPS): LPIPS is a deep-learning-based metric that measures perceptual similarity between images using features from a pre-trained neural network (e.g., VGG, AlexNet). Unlike PSNR and SSIM, LPIPS captures high-level perceptual differences rather than pixel-wise differences. (View Highlight)
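A sketch using the reference `lpips` package; the AlexNet backbone and the [-1, 1] input scaling follow that package's conventions.

```python
import lpips
import torch

loss_fn = lpips.LPIPS(net="alex")  # "vgg" and "squeeze" backbones are also available

def lpips_distance(reference: torch.Tensor, generated: torch.Tensor) -> float:
    # Both tensors: shape (N, 3, H, W), values scaled to [-1, 1]. Lower = more similar.
    with torch.no_grad():
        return float(loss_fn(reference, generated).mean())
```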
  • The results in the image illustrate how different types of distortions affect the scores given by these task-based metrics. Notably:
    • Blurred images tend to score higher in SSIM than in PSNR. This suggests that while fine details are lost, the overall structure and patterns of the image remain intact, which aligns with SSIM’s focus on structural consistency.
    • Pixelated images, on the other hand, maintain relatively high PSNR values but drop in SSIM ranking. This indicates that while pixel intensity differences remain small, the structural coherence of the image is significantly degraded, highlighting SSIM’s sensitivity to spatial relationships rather than just pixel-level accuracy. (View Highlight)
  • The evaluation framework in pruna consists of several key components:
  • Step 1: Define what you want to measure. Use the Task object to specify which quality metrics you’d like to compute. You can provide the metrics in three different ways depending on how much control you need. (View Highlight)
  • Step 2: Run the Evaluation Agent Pass your model to the EvaluationAgent and let it handle everything: running inference, computing metrics, and returning the final scores. (View Highlight)
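Putting the two steps together, here is a hedged sketch of what that workflow might look like. The module paths, metric identifiers, and argument names below are assumptions for illustration, not a verified API; check the Pruna documentation for the exact imports and signatures.

```python
# Illustrative only: names below are assumptions, not a verified Pruna API.
from pruna.evaluation.task import Task
from pruna.evaluation.evaluation_agent import EvaluationAgent

# Step 1: define what you want to measure (hypothetical metric identifiers).
task = Task(["clip_score", "fid"])

# Step 2: let the agent run inference, compute the metrics, and return the scores.
agent = EvaluationAgent(task)
results = agent.evaluate(model)  # `model` is your (possibly smashed) pipeline
print(results)
```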
  • As AI-generated images become more prevalent, evaluating their quality effectively is more important than ever. Whether you’re optimizing for realism, accuracy, or perceptual similarity, selecting the right evaluation metric is key. With Pruna now open-source, you have the freedom to explore, customize, and even contribute new evaluation metrics to the community. (View Highlight)
  • Our documentation and tutorials (here) provide a step-by-step guide on how to add your own metrics, making it easier than ever to tailor evaluations to your needs. Try it out today, contribute, and help shape the future of AI image evaluation! (View Highlight)