Diffusion KPIs: A Guide to the Qualitative Metrics of Diffusion Models (DDMs) - Part 1/2
If you can't measure it, it doesn't exist!
Table of Contents
Introduction
Types of Evaluation
Qualitative Evaluation
Quantitative Evaluation
Taxonomy of Evaluation Metrics
General Evaluation Metrics
Domain Specific Metrics
Qualitative Metrics
Prompts Evaluation
What is Prompts Evaluation?
Why Is It Important?
How does it work?
Prompts datasets
DrawBench Dataset
Parti Prompts Dataset
COCO Dataset
Visual Genome Dataset
Flickr30k Dataset
Prompts evaluation score
Recap
References
⚠️ Note: This post uses the term KPIs to mean the technical KPIs for evaluating diffusion models, so don't confuse it with business KPIs!
1. Introduction
In our last post, we explored different techniques to speed up diffusion models and learned how DDIMs introduced a simple approach that makes them work in practice. In this post, we will sail on a very interesting journey to explore how these models are evaluated in practice.
Model evaluation is one of the most important steps when developing a machine learning model, because without it you don't know whether your model has learned an informative representation of your data or is just misleading you with some crap predictions. In this post, we will dive into the two main categories of model evaluation and see where diffusion metrics fall under this broad umbrella.
Subsequently, we will see how human evaluation is conducted in practice using different prompts datasets. Finally, we will visit each group of diffusion metrics and discuss them in detail, and when to use them.
Let's get started!
2. Types of Evaluation
A machine learning model can be measured from two different angles. The first angle is qualitative evaluation, in which model predictions are presented to a human evaluator for feedback. The second angle is quantitative evaluation, where the goal is to use automated metrics to assess the quality of these predictions without any human intervention.
2.1 Qualitative Evaluation
Qualitative evaluation is interchangeably called human evaluation. In this type of evaluation, human evaluators assess model predictions according to some specific criteria and assign a score that describes how close these predictions are to their viewpoint of the real world. For example, let's say that we have trained a GAN model to generate different designs for a Barbie doll fashion competition (that's a bit weird, I know!).
So, the human evaluators (in this case, the judges) will tell you how good that design is according to some factors such as novelty, color matching, style, etc. For each factor, they will assign a score (e.g. 6/10), and the final score will be the average of these scores. Here is a visual illustration of this example:
2.2 Quantitative Evaluation
Humans are biased!
I can't imagine how many times you've heard someone say this sentence, and it's completely true. Each person has their own viewpoint of the real world and life in general. You can't compare two people in an absolute manner, because each of them has a unique childhood background, social background, educational background, etc.
However, we can reduce this evaluation bias by comparing in micro-settings: we just need to focus on very specific cases, like comparing two people on how good they are at riding a specific type of bike, or on how good they are at composing classical poems or a piece of classical music.
For these reasons, we turn to designing quantitative metrics (no human in the loop) that each measure a specific qualitative concept. In other words:
Each quantitative metric is a measurement of a qualitative aspect or concept.
For example, in diffusion models, the CLIP score is used to measure the compatibility of a text-image pair.
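To make this concrete, here is a minimal sketch of how such a score could be computed with the CLIPScore metric from the torchmetrics library; the random tensors below only stand in for diffusion-generated images, and the checkpoint name is just one possible choice:

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

# The metric downloads the named CLIP checkpoint on first use.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Stand-ins for diffusion-generated images: uint8 tensors of shape (N, 3, H, W).
generated_images = torch.randint(0, 255, (2, 3, 224, 224), dtype=torch.uint8)
prompts = [
    "a red bicycle parked next to a tree",
    "an astronaut riding a horse on the moon",
]

score = clip_score(generated_images, prompts)
print(f"CLIP score: {score.item():.2f}")  # higher means better text-image compatibility
```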
3. Taxonomy of Evaluation Metrics
Diffusion models can be leveraged for different machine-learning tasks and applications. The paper that introduced diffusion models used image data in the training and evaluation steps because these models work better on image-like datasets (2D/3D data, e.g. the mel-spectrogram representation of audio).
Image generation, image editing, video generation, and CAD design in industrial car manufacturing are all different use cases for diffusion models. Thus, we cannot use the same metrics for every application; we have to introduce some changes to make them compatible with each application and domain.
3.1 General Evaluation Metrics
There are some metrics that can be used in most of the applications or domains in which diffusion models are used. These metrics measure some universal aspects or factors of the diffusion-generated samples. At the time of writing, these are the known general metrics:
CLIP Similarity Score: measures the compatibility of a text-image pair. It quantifies how similar the generated image is to the input text.
Fréchet Inception Distance (FID): measures the quality and diversity of generated samples (a quick sketch follows this list).
Inception Score (IS): measures the quality and diversity of generated samples.
Peak Signal-to-Noise Ratio (PSNR): measures the amount of distortion in an image or signal (e.g. audio).
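As a quick illustration of the FID metric above, here is a minimal sketch using the FrechetInceptionDistance implementation from torchmetrics; the random tensors simply stand in for batches of real and generated images:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# feature=64 keeps this toy example fast; 2048 is the usual choice in papers.
fid = FrechetInceptionDistance(feature=64)

# uint8 image batches of shape (N, 3, H, W); in practice these come from the
# reference dataset and from the diffusion model, respectively.
real_images = torch.randint(0, 255, (32, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 255, (32, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower means generated images are closer to real ones
```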
3.2 Domain Specific Metrics
In image editing, we cannot use the standard CLIP score to measure the similarity between text-image pairs (before and after editing); the CLIP directional score is the most relevant metric in this scenario. Another example: the quality of the objects inside generated images can't be fully measured using the standard FID metric; we need precision and recall to assess how good the diffusion model is. In the next post, we will highlight some of these metrics and their most common applications in detail.
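As a small preview, here is a rough sketch of what a CLIP directional score could look like, built on the CLIP model from the transformers library. The idea is to compare the direction of change in image space with the direction of change in caption space; the checkpoint name and the helper function are illustrative assumptions, not a standardized implementation:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

@torch.no_grad()
def directional_clip_score(image_before: Image.Image, image_after: Image.Image,
                           caption_before: str, caption_after: str) -> float:
    """Cosine similarity between the image-edit direction and the caption-edit direction."""
    img_inputs = processor(images=[image_before, image_after], return_tensors="pt")
    txt_inputs = processor(text=[caption_before, caption_after],
                           return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**img_inputs)  # shape (2, D)
    txt_emb = model.get_text_features(**txt_inputs)   # shape (2, D)
    image_direction = img_emb[1] - img_emb[0]          # how the image changed
    text_direction = txt_emb[1] - txt_emb[0]           # how the caption changed
    return F.cosine_similarity(image_direction, text_direction, dim=0).item()
```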
4. Qualitative Metrics
There are common datasets that are used to evaluate diffusion models qualitatively. These datasets could also be used for training diffusion models; however, researchers and practitioners tend to use their private datasets (e.g. scraped from the internet) for the training process, while using the following datasets for evaluation. Also, we should keep in mind that the datasets depend on the task itself. For example, a drawing-to-image (D2I) task will have a completely different dataset compared to a text-to-image (T2I) task.
4.1 Prompts Evaluation
4.1.1 What is Prompts Evaluation?
Generative models are mostly useful when they are conditioned on something during the generation process. There are multiple ways to condition generative models: text, drawing, audio, image, or even video frames. Diffusion models tend to work with all these kinds of inputs, but the most used input is text (i.e. generating an image from text). Hence, the following datasets are commonly used in text-to-image tasks, but they could also work for other tasks.
Finally, the evaluation datasets for diffusion models in the text-to-image task are commonly referred to as “prompts evaluation” datasets, which signifies using text as a prompt for generating images with diffusion models. In the next section, we will highlight some of these datasets and explain their use cases.
4.1.2 Why Is It Important?
Prompts datasets provide several benefits when evaluating diffusion models:
Reducing Human Bias: These datasets are collected in a very careful manner, taking care of several aspects that challenge diffusion models. Also, the evaluation process involves a group of people who score the same diffusion predictions on these datasets, which mitigates the bias and subjectivity of relying on just one or two humans.
Comprehensive Evaluation: Diffusion models are challenged on different aspects during the qualitative evaluation process. For example, to evaluate the “Quality” of generated images from diffusion models, the quality term is usually measured across different aspects like compositionality, spatial relations, long-form text, and rare words. Thus, the dataset used for the evaluation contains examples that address these aspects to challenge diffusion models.
4.1.3 How Does it Work?
The qualitative evaluation of diffusion models consists of the following steps (a minimal code sketch follows the list):
Define the challenge aspects that you want to test on your diffusion model.
Choose your prompts dataset(s). (You could also design your own.)
Choose a sample size for evaluating your model. For example, a sample size of 10 means each human evaluator will score the diffusion predictions on 10 inputs.
Select a group of people to score your model predictions.
Calculate the average score.
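Here is a minimal sketch of steps 2 and 3, plus generating the images that the evaluators will score, using the diffusers and datasets libraries. The PartiPrompts dataset id ("nateraw/parti-prompts") and the Stable Diffusion checkpoint are assumptions for illustration only:

```python
import torch
from datasets import load_dataset
from diffusers import DiffusionPipeline

SAMPLE_SIZE = 10  # each human evaluator will score the predictions on 10 prompts

# Sample prompts from the PartiPrompts benchmark (column name is "Prompt").
parti_prompts = load_dataset("nateraw/parti-prompts", split="train")
prompts = parti_prompts.shuffle(seed=42).select(range(SAMPLE_SIZE))["Prompt"]

# Any text-to-image diffusion checkpoint works here; this one is just an example.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

for i, prompt in enumerate(prompts):
    image = pipeline(prompt).images[0]
    # Each (prompt, image) pair is later shown to the human evaluators for scoring.
    image.save(f"sample_{i:02d}.png")
```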
4.2 Prompts Datasets
Here are the most common benchmark datasets used to evaluate diffusion models:
4.2.1 DrawBench Dataset
The DrawBench dataset is a collection of text-only prompt examples. It tests the diffusion model on the following aspects:
Compositionality.
Cardinality.
Spatial relations.
Long-form text.
Rare words.
Challenging prompts.
Number of samples: 200 text samples.
Common Tasks: Text to Image generation.
Link: DrawBench Dataset.
4.2.2 Parti Prompts Dataset
The PartiPrompts dataset consists of a diverse collection of more than 1,600 prompts. It evaluates diffusion models across twelve content categories (e.g. animals, artifacts, people, vehicles, and world knowledge) and a set of challenge aspects such as imagination, fine-grained detail, quantity, perspective, and writing & symbols.
Number of samples: 1600 text samples.
Common Tasks: Text to Image generation.
Link: PartiPrompt Dataset.
4.2.3 COCO Dataset
The COCO (Common Objects in Context) dataset is used for various computer vision tasks, including object detection, image segmentation, and captioning. It contains more than 330K images with 5 captions per image.
Number of samples: 330K samples.
Common Tasks: Text to Image generation, object detection, image segmentation, and Image captioning.
Link: COCO Dataset.
4.2.4 Visual Genome Dataset
Visual Genome is a dataset that has more than 108K images with 3.8 million object instances.
Number of samples: 108,077 samples.
Common Tasks: Text to Image generation, object detection, image captioning, and visual question answering.
Link: Visual Genome Dataset.
4.2.5 Flickr30k Dataset
Flickr30k is a dataset consisting of 31,783 images sourced from Flickr, each paired with five descriptive captions, resulting in a total of 158,915 captions. It is an extension of the Flickr8k dataset.
Number of samples: 31,783 samples.
Common Tasks: Text to Image generation, object detection, and Image captioning.
Link: Flickr30k Dataset.
4.3 Prompts Evaluation Score
This is the last step in the qualitative evaluation, where the researcher compiles all the results from the human evaluators and performs statistical analysis on them. The average score could be calculated to represent the final qualitative score. For example:
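Here is a toy sketch of such an aggregation (all numbers are made up): two evaluators each rate three generated samples on two aspects on a 1-10 scale, and we average per aspect and then overall:

```python
# Scores on a 1-10 scale: one inner list per evaluator, one number per generated sample.
scores = {
    "image quality": [[7, 8, 6], [8, 7, 7]],
    "text alignment": [[9, 6, 8], [7, 8, 9]],
}

aspect_means = {
    aspect: round(
        sum(sum(per_evaluator) for per_evaluator in ratings)
        / sum(len(per_evaluator) for per_evaluator in ratings),
        2,
    )
    for aspect, ratings in scores.items()
}
overall_score = sum(aspect_means.values()) / len(aspect_means)

print(aspect_means)                     # {'image quality': 7.17, 'text alignment': 7.83}
print(f"overall: {overall_score:.2f}")  # overall: 7.50 -> final qualitative score
```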
5. Recap
In this post, we began by discussing the importance of evaluating diffusion models and understanding how they work in practice.
We introduced two main categories of evaluation: qualitative and quantitative, and explained each one of them.
We highlighted the importance of using domain-specific metrics tailored to different applications of diffusion models.
Common qualitative evaluation datasets were introduced, including prompts evaluation datasets like DrawBench, PartiPrompts, and COCO.
We focused on how these datasets help reduce human bias and provide comprehensive evaluations.
Then, we walked through the process of conducting qualitative evaluation using prompts datasets, step by step.
Finally, we explored other benchmark datasets such as Visual Genome and Flickr30k.
6. References
Evaluating Diffusion Models, by Hugging Face. Available at: https://huggingface.co/docs/diffusers/conceptual/evaluation
Microsoft COCO: Common Objects in Context, by Tsung-Yi Lin et al. (2014).
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, by Ranjay Krishna et al. (2016). International Journal of Computer Vision.
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, by Peter Young et al. (2014). Transactions of the Association for Computational Linguistics.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation, by Jiahui Yu et al. (2022). Transactions on Machine Learning Research, 2022.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, by Chitwan Saharia et al. (2022). Computer Vision and Pattern Recognition.
Before Goodbye!
Want to Cite this Article?
@article{khamies2023qualitative,
title = "Diffusion KPIs: A Guide to the Qualitative Metrics of Diffusion Models (DDMs) - Part 1/2",
author = "Waleed Khamies",
journal = "Zitoon.ai",
year = "2023",
month = "Sept",
url = "https://publication.zitoon.ai/diffusion-KPIs-a-guide-to-the-qualitative-metrics-of-diffusion-models"
}
New to this Series?
New to the “Generative Modeling Series”? Here you can find the previous articles in this series [link to the full series].
Any oversights in this post?
Please report them through this Feedback Form; we really appreciate that!
Thank you for reading!
We appreciate you reading this post! If you would like to receive the following posts in this series in your email, please feel free to subscribe to the ZitoonAI Newsletter. Come and join the family!