Analyzing Samplers And Schedulers For ComfyUI In Latent Diffusion Workflows

Motivation

I have always been reluctant towards AI due to the black-box effect. Let me explain. As a former creative turned technologist I took great pride in translating the vision, the effect or the process into an algorithm: I still do. This algorithm would then become a piece in a bigger system which would grow based on a similar principle. This practical fascination is accompanied by a conceptual one, as each algorithm is a translation of creative intent into a mechanization that teaches the system exact instructions to produce a new result. A new concept. From this perspective I understood programming as a craft, or dare I say it, an art form. A form of expression that transcends result-driven creativity into process-driven creativity. However, AI was initially this new concept for me that abstracted away the process -- the algorithm -- and based on an input "just" produced an output. I felt this was the downfall of the algorithmic ingenuity which I grew to like across the boundaries of these two disciplines.

Until recently...

I have been playing around a lot with AI, in the form of LLMs and diffusion models for text-to-image and image-to-image workflows, to produce concepts and assets for my game prototypes. Thanks to ComfyUI I was able to work with AI models, combine them and experiment with different niche workflows specific to my needs. The AI models became pieces in a bigger system, laid out into an overarching workflow. By myself. But this is not all: I realized that without the right nodes capable of processing them, the models are very, very useless. And put simply, everything in the system that is not a model has to be an algorithm, right? This led to a follow-up observation: contrary to my beliefs, AI actually requires a ton of algorithms. So I explored where the models interface with algorithms, and how these algorithms impact the creative process.

Data collection, data preparation and learning are all modeled as algorithms, shaping the potential output of an AI. In latent diffusion (which for me includes all models ranging from the original Stable Diffusion to the newer Flux.1) some key elements for great results are parameters like the step count, samplers and schedulers. They shape the actual output of the AI. Trying desperately to close the gap between AI and my obsession with algorithmic ingenuity, I came to accept AI as the interplay of algorithms and models. Sound familiar?
The building blocks of any program are algorithms and data structures, interacting inside an architecture. To fit AI into this conceptual model, I simply abstract the AI model to a very, very complex data structure that also incorporates a very complex function as its interface. But the key point is: it is not self-sufficient. This realization felt liberating.

Goal of the Study

This personal insight sparked a newfound curiosity: getting into AI myself, by analyzing the algorithms that interact directly with the model. A topic which, regretfully, I had put aside for far too long. Since I love to analyze things and derive meaning from processes, I thought this could be a fruitful gateway into the domain. Which brings me to this article. One element which seems to greatly affect the results of my outputs inside the ComfyUI pipeline are the mystical samplers and schedulers with very cryptic names.

There are multiple articles (1,2,3,4,5) covering the different types of samplers and schedulers inside the Stable Diffusion pipeline, as well as more practical guides on which applications they are suited for.
However, the tinkerer in me is often not satisfied by application-driven examples alone, and I therefore want to examine the algorithms themselves to find out:

  • How does the Algorithm produce the observable effect?
  • How does the Algorithm internally impact the output of the image?
  • Based on these two questions, what is the conceptualization of each algorithm?

This article will grow and change over time, becoming an analytical playground where I share all my thoughts while breaking down the source code of the modules implementing the samplers and schedulers. I will try my best, with the available resources, to make sense of the mathematical formulas employed and understand the "why" behind their implementations. When possible I will consult the official papers of the methods.

Disclaimer

An algorithm as envisioned by the original author will never be 100% reproducible. Similar to a cooking recipe, every cook puts their own flair on the final dish, often involuntarily, through their own ability or the techniques used. This is why I think understanding the fundamental conceptual ideas behind each implementation is much more valuable. The observations I make here are purely for educational purposes, for people like me who are curious about the inner workings of these components. Also, these observations only apply to the specific implementations inside the ComfyUI system.

Some Context: Latent Diffusion and ComfyUI

Let us begin by giving this research endeavor a little bit of context.

First thing, what is ComfyUI?
They describe themselves as:

The most powerful open source node-based application for creating images, videos, and audio with GenAI (1).

They started as an open source project on GitHub and are to this day an active community making the usage of latent diffusion models accessible and user friendly. They first came to my attention as an alternative to Automatic1111's WebUI, which followed a similar goal but sacrificed flexibility for simplicity and accessibility.
I was already familiar with node-based development from my Unreal Engine experience with Blueprints and the node software VVVV, so I find it quite attractive. Nowadays ComfyUI features a fully self-contained desktop app with a model and node manager, centralizing all your diffusion AI needs in one place.
Everyone can develop new custom nodes, either to add utility functions that interact with simpler data types like numbers, strings and images, or to extend the functionality for interacting with AI models.

simple_comfy.gif

This is how you build the simplest "base" workflow for any diffusion task. It is the hello world of ComfyUI, so to speak.

  1. You load the models (diffusion U-Net, CLIP, VAE), or, as in this case, you use a checkpoint which packages all of them together.
  2. You create a text prompt which is then encoded into an embedding for the U-Net model using CLIP.
  3. You generate an empty image in the latent space.
  4. You connect them all to a KSampler, which adds random noise based on a seed and then, over multiple steps, samples the latent image according to a scheduler and denoises it, guided by the embeddings, into a final latent image.
  5. The denoised latent image is then turned from latent space into pixel space by the VAE (Variational Auto-Encoder).
  6. The final image can then be saved or previewed with the corresponding final node.

I think the time has come to talk about the technical details.
What is a diffusion model? What is a CLIP and what is a VAE?
Since this article is already going to be very long, I will address these questions by explaining them in simple concepts and linking to full guides, as I am myself not a machine learning expert and lack the terminology to go into the details.

Let us begin conceptualizing it. The idea is to take an AI model that has been trained on denoising images (by prediction), the U-Net, and to do so in latent space, which is another word for hidden, or abstracted. All the inputs, be it pixels (empty or from another image) or prompts in the form of embeddings, feed into complex input tensors (the official name for vectors and matrices that extend into higher dimensions) which the AI can make sense of. (1,2,3) The embeddings are a collection of numbers derived from the used keywords and text prompt, produced by yet another AI model. The CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. (1)
In other words, given an image, it can tell you the most relevant keywords to describe that image. Well, in the case of Stable Diffusion this has been flipped, or rather used as a guide, telling the AI whether what it is denoising corresponds to the keywords used: algorithmic ingenuity! :D

At this point one has to mention two additional model types that are not required but make the workflow even more powerful. LoRAs, which allow sub-models to be trained on specific keywords (the triggers) to influence the conditioning. Or in other words, they introduce new concepts for the AI to represent, like specific characters, art styles or objects. There is a fishnet LoRA, for example.
Yeah...

And then there are ControlNets, which complement the conceptual with the compositional. They come in many types, like Depth, Pose or Scribble, and help to give the picture the structure you want. For example, there is a ControlNet that encodes a functioning QR code into your image.

And finally the VAE (Variational Auto-Encoder), an autoencoder: a type of neural network designed to learn efficient data representations, primarily for the purpose of dimensionality reduction or feature learning. (1) Put simply, autoencoders can learn to represent something made of many, many data points in an abstract space with far fewer parameters, and then, based on these parameters, reproduce something resembling the original data. It is the magician turning the hidden latent image into a visible image. While this sounds like a glorified AI compressor and decompressor -- which in this pipeline (or dare I say algorithm) it is -- one can use autoencoders for creative applications: reducing a design space down to a few parameters, playing around with those, and generating new results similar to the "decompressed" originals, like levels, character designs and much more. (1)
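To make the compression/decompression idea a bit more concrete, here is a deliberately tiny, generic autoencoder sketch in PyTorch. This is not the Stable Diffusion VAE (which is convolutional and variational); the class name and dimensions are made up purely for illustration.

import torch
import torch.nn as nn

# A deliberately tiny, generic autoencoder -- NOT the Stable Diffusion VAE --
# just to illustrate "many data points in, few latent parameters, data back out".
class TinyAutoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)            # compress into the "latent space"
        return self.decoder(z), z      # reconstruct from the few latent parameters

model = TinyAutoencoder()
x = torch.rand(1, 784)                 # e.g. a flattened 28x28 image
recon, z = model(x)
print(z.shape, recon.shape)            # torch.Size([1, 8]) torch.Size([1, 784])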
But the most important component in all this is the KSampler.

The KSampler node is designed for advanced sampling operations within generative models, allowing for the customization of sampling processes through various parameters. It facilitates the generation of new data samples by manipulating latent space representations, leveraging conditioning, and adjusting noise levels. (1)

Or in other words, it is the bridge between U-Net, CLIP, and VAE to direct all of these models into a synergy which allows us to turn a text description into juicy anime boobs.
But more importantly it is filled with juicy juicy algorithms, ready to be explored.

ComfyUI_W0hNHWCypD.png

ComfyUI_uokfV8t3dw.png

The KSampler has various parameters, like noise (with a seed) as a starting point, the number of steps, a cfg value, and denoise. These are all numbers, quantitative parameters, which need to be interpreted by an algorithm. However, sampler name and scheduler are names, labels (cryptic ones, that is). Meaning they hide concepts behind them, ideas and patterns following certain strategies to solve the same problem. How do these concepts/strategies work? How do they "interpret" our numerical values and to what effect? This is what we will explore now, diving deep into their implementation.

Code_ic0s3vodQJ.png

Inside comfy/k_diffusion/sampling.py and comfy/samplers.py we shall find our answers.

First Steps

First of all, let us analyze the labels themselves. The first thing that becomes apparent: a lot of acronyms and/or names (people's names). What also sticks out are common suffixes like:

  • ancestral
  • cfg_pp
  • sde
  • sde_gpu
  • _2(s) and _3(m)

On the scheduler side we have a handful with clearer names. As they are fewer, and their implementations are at first glance limited to self-contained functions, I will start with them and might expand to samplers if I need more context.

Schedulers

The first thing that becomes apparent from research is that scheduler and sampler are used interchangeably in many sources: in some UIs there is only a single option, which is the combination of a sampler and a scheduler (1,2), while others differentiate between samplers and schedule types (3).
So it seems that, once again, the difference is made explicit in ComfyUI for increased control. Let us see what they have to say.
All I could find is

The type of schedule to use, see the samplers page for more details on the available schedules.(1)

As such, scheduling is part of the sampling process, and therefore a subcomponent of samplers (1).

The noise schedule controls the noise level at each sampling step. The noise is highest at the first step and gradually reduces to zero at the last step.
At each step, the sampler’s job is to produce an image with a noise level matching the noise schedule. (1)

Samplers

So let us go back to what other articles call Scheduler to understand what Samplers are in our case.

Schedulers are algorithms that are used alongside the UNet component of the Stable Diffusion pipeline.  They play a key role in the denoising process and run multiple times iteratively (called steps) to create a clean image from a completely random noisy image. The key to all these scheduler algorithms is to progressively perturb data with intensifying random noise (called the “diffusion” process), then successively remove noise to generate new data samples. Sometimes, Schedulers are also referred to as Samplers (1).

Algorithms! Hurray. Also interestingly enough the steps are not an input to the algorithm but the count of how many times the algorithm is applied!

One of the samplers/schedulers mentioned in the article is DPM2 Karras, a sampler that is also very familiar from my time using Automatic1111, which likewise did not distinguish between sampler and scheduler. Looking at how ComfyUI presents them, Karras is in fact a scheduler, which in this setup can be applied to any sampler, not just DPM2.

Before we finally begin with the code analysis, let us quickly break down the pattern each algorithm re-interprets in its own way:

  1. progressively perturb data with intensifying random noise (called the “diffusion” process)
  2. then successively remove noise to generate new data samples

"Many schedulers are implemented from the k-diffusion library by Katherine Crowson, and they’re also widely used in A1111."(1)

Schedulers

anime_00001_.png

This is an experiment I did in ComfyUI using the Flux.1 Schnell U-Net and the custom node flux-sampler-parameters to spit out a matrix of results combining the Euler sampler with all the schedulers, each at 1, 2, 4, 8 and 16 steps (Flux Schnell does not require many steps). The prompt was

anime grizzly bear fighting a pikachu mid air. The bear wears a blue necklace with big shiny beads. The pickachu is curled up and emitting a potent thunderstorm. The perspective is from above. Seing the city below them in a dynamic perspective

I chose this prompt to combine many elements on which I wanted to challenge the AI and see how the samplers influence the result: a style (anime), a species of animal (grizzly bear) and its interpretation in that style, an iconic character (Pikachu), specific details (blue necklace with big shiny beads), conceptual actions (fighting mid-air, emitting a potent thunderstorm) and a specific composition with a challenging perspective (from above, city below). This relates to the two components I described briefly before (conceptual -> LoRA and compositional -> ControlNet), but I did not use either of them.

To help you make sense of the image grid: all images were made with the Euler sampler. Each row represents one scheduler going from 1 to 16 steps. While most are similar, there are a few noteworthy outliers. KL Optimal produced an empty black image on the first step, while DDIM Uniform works with a flat gray image for the most part, turning it into a composition distinct from the other schedulers only after more steps. I find its result at 4 steps particularly cute, as it looks like a soup of tiny image molecules, almost as if DDIM Uniform was not quite finished cooking yet...

Considering that this was its result at 32 steps:

anime_00012_euler_ddmi_uniform_32s.png

What I am showing you next is the registration of the labels and their internal implementations:

SCHEDULER_HANDLERS = {
    "normal": SchedulerHandler(normal_scheduler),
    "karras": SchedulerHandler(k_diffusion_sampling.get_sigmas_karras, use_ms=False),
    "exponential": SchedulerHandler(k_diffusion_sampling.get_sigmas_exponential, use_ms=False),
    "sgm_uniform": SchedulerHandler(partial(normal_scheduler, sgm=True)),
    "simple": SchedulerHandler(simple_scheduler),
    "ddim_uniform": SchedulerHandler(ddim_scheduler),
    "beta": SchedulerHandler(beta_scheduler),
    "linear_quadratic": SchedulerHandler(linear_quadratic_schedule),
    "kl_optimal": SchedulerHandler(kl_optimal_scheduler, use_ms=False),
}

Noteworthy here is that karras and exponential are imported from the k_diffusion_sampling module. They have a different function signature and are therefore integrated differently into the system.

Signature 1: (model_sampling, steps)

Signature 1 belongs to the functions residing in comfy/samplers.py. Here we have model_sampling and steps as the common denominator.

The number of steps is passed into the function as input together with model_sampling (which at first I assumed to be the sampler itself). On first glance it seems that model_sampling is queried for some parameters internally. After a more thorough look, I could identify where the scheduler functions are invoked.

def calculate_sigmas(model_sampling: object, scheduler_name: str, steps: int) -> torch.Tensor:
    handler = SCHEDULER_HANDLERS.get(scheduler_name)
    if handler is None:
        err = f"error invalid scheduler {scheduler_name}"
        logging.error(err)
        raise ValueError(err)
    if handler.use_ms:
        return handler.handler(model_sampling, steps)
    return handler.handler(n=steps, sigma_min=float(model_sampling.sigma_min), sigma_max=float(model_sampling.sigma_max))

This function is in turn invoked inside the KSampler class, in the method definition of the same name calculate_sigmas

class KSampler:
...
    def calculate_sigmas(self, steps):
        ...
        calculate_sigmas(self.model.get_model_object("model_sampling"), self.scheduler, steps)
        ...

After tracing through 5-6 layers, model_sampling resolves into a torch.nn.Module (the base class for all neural network modules, in which you can assign the submodules as regular attributes (1)), via a util function that retrieves parts of the model by transforming the string name into an attribute. This model_sampling attribute is set here based on the type of the U-Net architecture.

class BaseModel(torch.nn.Module):
...
   self.model_sampling = model_sampling(model_config, model_type)
...

def model_sampling(model_config, model_type):
    s = ModelSamplingDiscrete
    if model_type == ModelType.EPS:
        c = EPS

    elif model_type == ModelType.V_PREDICTION:
        c = V_PREDICTION

    elif model_type == ModelType.V_PREDICTION_EDM:
        c = V_PREDICTION
        s = ModelSamplingContinuousEDM

    elif model_type == ModelType.FLOW:
        c = comfy.model_sampling.CONST
        s = comfy.model_sampling.ModelSamplingDiscreteFlow

    elif model_type == ModelType.STABLE_CASCADE:
        c = EPS
        s = StableCascadeSampling

    elif model_type == ModelType.EDM:
        c = EDM
        s = ModelSamplingContinuousEDM

    elif model_type == ModelType.V_PREDICTION_CONTINUOUS:
        c = V_PREDICTION
        s = ModelSamplingContinuousV

    elif model_type == ModelType.FLUX:
        c = comfy.model_sampling.CONST
        s = comfy.model_sampling.ModelSamplingFlux
        
    class ModelSampling(s, c):
        pass
    return ModelSampling(model_config)

We will see in each implementation what they influence.
Returning to the KSampler class, we will have a look at how steps are passed in and processed.

class KSampler:
...
        self.set_steps(steps, denoise)
...

    def set_steps(self, steps, denoise=None):

        self.steps = steps
        if denoise is None or denoise > 0.9999:
            self.sigmas = self.calculate_sigmas(steps).to(self.device)
        else:
            if denoise <= 0.0:
                self.sigmas = torch.FloatTensor([])
            else:
                new_steps = int(steps/denoise)
                sigmas = self.calculate_sigmas(new_steps).to(self.device)
                self.sigmas = sigmas[-(steps + 1):]

Here two interesting things can be observed.

  1. Steps and denoise interact. The steps are divided by the denoise amount; since dividing by a value between 0 and 1 amounts to a multiplication, the lower the denoise, the more steps are calculated, of which only the low-noise tail is kept (the sketch after this list illustrates the slicing).
  2. Also, the scheduler implementation is called directly inside the set_steps method of the KSampler, through the calculate_sigmas method.
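To make the interaction concrete, here is a small sketch of what the slicing in set_steps achieves. The dummy_schedule function is a made-up stand-in for calculate_sigmas; only the arithmetic around it mirrors the code above.

import torch

def dummy_schedule(steps):
    # made-up stand-in for calculate_sigmas(): an even ramp from 14.6 down to 0.0
    return torch.linspace(14.6, 0.0, steps + 1)

steps, denoise = 4, 0.5
new_steps = int(steps / denoise)        # 8 steps get scheduled...
sigmas = dummy_schedule(new_steps)
partial = sigmas[-(steps + 1):]         # ...but only the low-noise tail of 4 steps is kept
print(sigmas)   # 9 values from 14.6 down to 0.0
print(partial)  # roughly tensor([7.3000, 5.4750, 3.6500, 1.8250, 0.0000])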

Signature 2: (n, sigma_min, sigma_max)

The first thing that becomes apparent: this signature is not directly influenced by model_sampling, as is the case with the first signature. We have also already encountered this signature at this layer of abstraction:

def calculate_sigmas(model_sampling: object, scheduler_name: str, steps: int) -> torch.Tensor:
    handler = SCHEDULER_HANDLERS.get(scheduler_name)
    if handler is None:
        err = f"error invalid scheduler {scheduler_name}"
        logging.error(err)
        raise ValueError(err)
    if handler.use_ms:
        return handler.handler(model_sampling, steps)
    return handler.handler(n=steps, sigma_min=float(model_sampling.sigma_min), sigma_max=float(model_sampling.sigma_max))

This shows us several things.

  1. n is the same as steps from Signature 1.
  2. The distinguishing aspect between schedulers using Signature 1 and 2 is the property use_ms.
  3. Signature 2 deals with sigma_min and sigma_max, which are taken from model_sampling, meaning those are the only two model properties taken into account.

So before we proceed we need to understand a few things: what does "use ms" mean? And what are sigmas?

class SchedulerHandler(NamedTuple):
    handler: Callable[..., torch.Tensor]
    # Boolean indicates whether to call the handler like:
    #  scheduler_function(model_sampling, steps) or
    #  scheduler_function(n, sigma_min: float, sigma_max: float)
    use_ms: bool = True

All I could find about use_ms is this single flag, which confirms what has already become apparent from my analysis: it decides between the two signatures. My research on the internet also did not show any clear evidence for what "ms" stands for. But since there is no explicit mention, it is probably short for "use model sampling", as that is also the distinguishing feature of Signature 1 schedulers.

Sigmas

Bear with me, this is going to be the most technical and mathematical part of the article, but a necessary one to associate a lot of the calculations with a meaningful concept.

An important detail related to the diffusion process is the noise schedule. This is the level of noise (a.k.a. sigma, a.k.a. standard deviation of normal distribution) added to the image depending on the time step, a.k.a. t, during the diffusion process. Namely, we add random numbers coming from the normal distribution with a given sigma parameter to each pixel value in each channel of an image. (1)

...Additive Gaussian noise is most commonly used as the corruption method. ... The total standard deviation is found as σ = √(∑ᵢ σᵢ²), where σ₁, σ₂, … are the standard deviations of the noise added at each point in the sequence. Therefore, at each point in the corruption process, we can ask: what is the total amount of noise that has been added so far – what is its standard deviation? We can write this as σ(t), where t is a time variable that indicates how far the corruption process has progressed. This function σ(t) is what we typically refer to as the noise schedule. (2)

Sigma, the Greek letter "σ", stands for the standard deviation of the normal distribution.
Sigmas represent the magnitude of noise added to the image at each timestep, and the sequence of sigma values represents the noise schedule.
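As a small illustration of my own (not ComfyUI code): adding noise at a given sigma simply means adding numbers drawn from a normal distribution with that standard deviation to every value of the latent.

import torch

latent = torch.zeros(4, 64, 64)          # a dummy "clean" latent, all zeros
for sigma in (0.1, 1.0, 14.6):
    noisy = latent + sigma * torch.randn_like(latent)
    # the measured standard deviation of the result approaches the chosen sigma
    print(f"sigma={sigma:>5}: measured std ~ {noisy.std().item():.3f}")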

Now that we have established the context and the most important pieces let us finally dive into the different implementations.

Normal

def normal_scheduler(model_sampling, steps, sgm=False, floor=False):

    s = model_sampling
    start = s.timestep(s.sigma_max)
    end = s.timestep(s.sigma_min)

    append_zero = True

    if sgm:
        timesteps = torch.linspace(start, end, steps + 1)[:-1]
    else:
        if math.isclose(float(s.sigma(end)), 0, abs_tol=0.00001):
            steps += 1
            append_zero = False
        timesteps = torch.linspace(start, end, steps)
    
    sigs = []

    for x in range(len(timesteps)):
        ts = timesteps[x]
        sigs.append(float(s.sigma(ts)))

    if append_zero:
        sigs += [0.0]

    return torch.FloatTensor(sigs)

The Normal scheduler follows Signature 1 and has two additional parameters, sgm and floor. I will skip the sgm flag for now, because it is the distinguishing parameter between Normal and SGM Uniform, and as such we will dedicate that section to it. floor is a dead parameter, never used inside the function, so we will ignore it. Probably a remnant of an older implementation.

The function returns a torch.FloatTensor (1) of sigs. A tensor, as we have already briefly touched on, is a multi-dimensional matrix containing elements of a single data type (1). I leave here a more in-depth explanation for those curious about what a tensor is.
They are the fundamental data structures used in machine learning (2) and are particularly useful for handling complex data such as images, audio, and text. For example, a color image can be represented as a 3D tensor with dimensions corresponding to height, width, and color channels (red, green, and blue). Similarly, a video can be represented as a 4D tensor with time as the additional dimension. How tensors can be used to represent and model all sorts of concepts for machine learning would warrant an article of its own. The most essential information for us is that tensors are the parent class of many structures that are more familiar, corresponding to different ranks: a scalar (a single value) is a rank 0 tensor, a vector is a rank 1 tensor and a matrix is a rank 2 tensor. Simply put, the rank corresponds to its dimensionality. Essentially a tensor is an n-dimensional parameter for our AI model; in Python terms it behaves like a nested list. Which we can see in the code, even though it is a rank 1 tensor in this case (see the small example below).
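A quick, self-contained illustration of ranks (my own example, not from the ComfyUI code base):

import torch

scalar = torch.tensor(3.14)                   # rank 0
vector = torch.tensor([0.9, 0.5, 0.0])        # rank 1 -- the shape our sigma schedules take
matrix = torch.tensor([[1., 2.], [3., 4.]])   # rank 2
image  = torch.rand(3, 512, 512)              # rank 3: channels x height x width
print(scalar.ndim, vector.ndim, matrix.ndim, image.ndim)  # 0 1 2 3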

sigs, which I assume are the sigmas, or the sequence of noise levels, is initialized as an empty list. Then we iterate over a range of the length of timesteps and append the result of the model_sampling method sigma for the given timestep. If the flag append_zero is true, a last sigma of 0.0 is added, which I assume runs a last cleaning step with no noise for the final polish. What stands out here is:

  1. The model is responsible to calculate the sigma
  2. The timestep value is the most impactful
  3. There is a final 0.0 value added to the sigma sequence

We have two branches (one of which we ignore, as it deals with sgm). If end (the timestep of the model at sigma_min) maps to a sigma of almost 0, we do not append_zero and instead add an additional step, achieving essentially the same result: either the last native step is already near 0, or we add our own 0 to "normalize" the sequence.

And finally we have timesteps as the result of torch.linspace of start (the timestep of the model's highest sigma), end (the timestep of the model's lowest sigma) and the steps parameter provided by the user. torch.linspace Creates a one-dimensional tensor of size steps whose values are evenly spaced from start to end, inclusive. (1)

The Normal scheduler relies on the built-in timesteps of the model, distributing them evenly and normalizing with a 0.0 sigma at the final timestep. This feels like the most "vanilla" (unaltered) version of scheduling.
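To isolate the scheduler's own contribution, here is a trimmed sketch of the non-sgm branch, run against a dummy model_sampling whose timestep() and sigma() methods are identity mappings. Real models replace these with a learned, non-linear relationship, so the numbers below only demonstrate the even spacing and the trailing zero.

import torch

class DummyModelSampling:
    # hypothetical stand-in for the real model_sampling object:
    # timestep() and sigma() are identity mappings here, which real models
    # replace with a non-linear relationship
    sigma_max = 14.6
    sigma_min = 0.03
    def timestep(self, sigma): return sigma
    def sigma(self, t): return t

def normal_like(s, steps):
    # trimmed copy of the non-sgm branch: even timesteps, then a trailing 0.0
    start, end = s.timestep(s.sigma_max), s.timestep(s.sigma_min)
    timesteps = torch.linspace(start, end, steps)
    return torch.FloatTensor([float(s.sigma(t)) for t in timesteps] + [0.0])

print(normal_like(DummyModelSampling(), 5))
# roughly tensor([14.6000, 10.9575, 7.3150, 3.6725, 0.0300, 0.0000])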

Karras

def get_sigmas_karras(n, sigma_min, sigma_max, rho=7., device='cpu'):

    """Constructs the noise schedule of Karras et al. (2022)."""

    ramp = torch.linspace(0, 1, n, device=device)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho
    
    return append_zero(sigmas).to(device)

Here we see the first scheduler using Signature 2. sigma_min and sigma_max are passed in externally and the scheduler does not rely on model-specific definitions. So the implementations of Signature 2 are the most self-sufficient and probably the most influential.

The function returns append_zero of sigmas converted to device. I suppose append_zero is a custom utility function. So let's track it down.

def append_zero(x):
    return torch.cat([x, x.new_zeros([1])])

Given a tensor x, it concatenates a zero tensor to it. To me this looks like an inline variant of what the Normal scheduler did with its append_zero flag and if block. And .to performs tensor dtype and/or device conversion (1).

The sigmas are produced from ramp, which is a distribution from 0 to 1 with n steps. Unlike in the Normal scheduler, a flat 0-to-1 distribution is used here instead of the model's built-in sigmas-by-timestep. Then min_inv_rho and max_inv_rho are calculated, which are sigma_min and sigma_max raised to the power of 1/rho: a shaping function (1).
ramp (a rank 1 tensor) is multiplied by the difference (min_inv_rho - max_inv_rho), added to max_inv_rho, and the result is raised to the power of rho, effectively creating a non-linear noise schedule. The use of rho as an exponent allows for fine-tuning the curve of the noise schedule.

What may be counter-intuitive here is that for values between 0 and 1 most operations appear inverted: raising such a value to the power of 1/rho makes it larger, and the final power of rho brings it back down again, reversing the operation.

As demonstrated here

>>> 0.2 ** (1/8)
0.8177654339579425
>>> 0.8177654339579425 ** 8
0.20000000000000007

Now the question remains: do I, as a ComfyUI user, get to use rho?
The short answer is no, but also yes. The rho value defaults to 7 if we select the scheduler inside the KSampler node. However, there is a dedicated node just for the Karras scheduler that gives access to the rho parameter.

Karras is a scheduling strategy presented in the paper of the same name (Karras et al., 2022) by a research group from NVIDIA, introducing multiple new algorithms for the sake of a clearer design space. The value 7 has been found to be a good default for a well-balanced design space.

To sum up, Karras is a deterministic and universal (model- and sampler-agnostic) scheduler that produces a non-linear noise schedule. Rho controls how much the steps near sigma_min are shortened at the expense of longer steps near sigma_max (1).
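To see what rho actually does to the shape, the same formula can be run standalone with different rho values. The sigma bounds below are SD-like values I picked for illustration.

import torch

def karras_like(n, sigma_min, sigma_max, rho):
    # same formula as get_sigmas_karras, without the appended zero
    ramp = torch.linspace(0, 1, n)
    min_inv, max_inv = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return (max_inv + ramp * (min_inv - max_inv)) ** rho

print(karras_like(5, 0.03, 14.6, rho=1.0))  # rho=1 degenerates into a plain linear ramp
print(karras_like(5, 0.03, 14.6, rho=7.0))  # same endpoints, but the middle values are pulled
                                            # towards sigma_min: finer steps at low noise,
                                            # coarser steps at high noise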

Exponential

def get_sigmas_exponential(n, sigma_min, sigma_max, device='cpu'):

    """Constructs an exponential noise schedule."""

    sigmas = torch.linspace(math.log(sigma_max), math.log(sigma_min), n, device=device).exp()

    return append_zero(sigmas)

This one also follows Signature 2 and looks quite simple. Again, it returns append_zero applied to sigmas,
which are given by a linear distribution from the log of sigma_max to the log of sigma_min, passed through exp.

The logarithm of a number gives the exponent that the base needs to be raised to in order to produce that number. If the base is not given in Python, it defaults to e, the base of the natural logarithm and exponential function.

>>> math.e
2.718281828459045
>>> math.log(0.2)
-1.6094379124341003
>>> math.e ** -1.6094379124341003
0.20000000000000004

So a similar mechanism takes place here: we first apply the inverse operation to our boundary values to build the distribution, and then map it back with the actually intended operation.

Exponential is straightforward: it gives a deterministic exponential distribution.
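A quick way to see the "exponential" in the name (my own sketch, with made-up SD-like bounds): consecutive sigmas share a constant ratio rather than a constant difference, i.e. the schedule is a geometric progression.

import math
import torch

sigmas = torch.linspace(math.log(14.6), math.log(0.03), 5).exp()
print(sigmas)                   # decays from 14.6 down to 0.03
print(sigmas[1:] / sigmas[:-1]) # every ratio is (nearly) identical: a geometric progression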

SGM Uniform

SGM Uniform reuses the Normal implementation with the sgm flag set to True, bound via the partial function.

So what is SGM, and what does it do?
Score-based generative models (SGMs) sample from a target distribution by iteratively transforming noise using the score function of the perturbed target. (1)

OK, but this sounds very dynamic. Is this reflected in the implementation?

def normal_scheduler(model_sampling, steps, sgm=False, floor=False):

    s = model_sampling
    start = s.timestep(s.sigma_max)
    end = s.timestep(s.sigma_min)
    ...
    if sgm:
        timesteps = torch.linspace(start, end, steps + 1)[:-1]
...
    sigs = []

    for x in range(len(timesteps)):
        ts = timesteps[x]
        sigs.append(float(s.sigma(ts)))
...
    return torch.FloatTensor(sigs)

Based on the implementation, no such dynamic becomes apparent. After some research I found out that this schedule is tailored to a specific category of models (1). So, contrary to the Signature 2 schedulers, this one is very use-case specific.

Anyway, in the algorithm we increase the target step count by one, which tightens the spacing of the linear distribution, and then cut off the last entry (countering the previously added step).

I have created a dummy distribution function to demonstrate.

def distribution(minV, maxV, n):
    if n < 3:
        return [minV, maxV]
    result = []
    step = (maxV - minV) / (n - 1)
    for x in range(n-1):
        result.append(minV + x * step)
    result.append(maxV)
    return result

Which given

print(distribution(0.2, 0.8, 8)) # min_sigma, max_sigma, steps
print(distribution(0.2, 0.8, 8+1)[:-1])

produces

[0.2, 0.28571428571428575, 0.37142857142857144, 0.4571428571428572, 0.5428571428571429, 0.6285714285714287, 0.7142857142857144, 0.8] # normal
[0.2, 0.275, 0.35000000000000003, 0.42500000000000004, 0.5, 0.5750000000000001, 0.6500000000000001, 0.7250000000000001] # sgm uniform

As we can see, the SGM Uniform values are spaced more tightly and do not include the final entry (in the real scheduler, the timestep corresponding to sigma_min).

Simple

def simple_scheduler(model_sampling, steps):

    s = model_sampling
    sigs = []
    ss = len(s.sigmas) / steps

    for x in range(steps):
        sigs += [float(s.sigmas[-(1 + int(x * ss))])]

    sigs += [0.0]
    return torch.FloatTensor(sigs)

Again a function of Signature 1, which is supposedly "simple" according to its name.
We again return a tensor of sigmas ending in a 0.0 sigma. We iterate over the number of steps and add the model's sigma at index -(1 + int(x * ss)), where x * ss is truncated to an integer. This effectively indexes the sigma values of the model in reverse order, from most noise to least.
ss is the ratio of the model's total sigma count to the steps. So if the model has 12 sigmas and we have 4 steps, ss is 3, effectively picking entries -1, -4, -7 and -10 of the model sigmas.

The Simple scheduler simply relates the target steps to the model's sigmas and, based on that ratio, cherry-picks them from end to start at a regular interval.

DDIM Uniform

Finally we get to our special snowflake, the misunderstood cook that swam against the stream of all the other schedulers!

def ddim_scheduler(model_sampling, steps):

    s = model_sampling
    sigs = []
    x = 1

    if math.isclose(float(s.sigmas[x]), 0, abs_tol=0.00001):
        steps += 1
        sigs = []

    else:
        sigs = [0.0]

    ss = max(len(s.sigmas) // steps, 1)

    while x < len(s.sigmas):
        sigs += [float(s.sigmas[x])]
        x += ss

    sigs = sigs[::-1]

    return torch.FloatTensor(sigs)

It follows Signature 1 and, like most, returns a torch.FloatTensor of sigs.
The [::-1] index (start:end:step) effectively reverses the sigma sequence. This time the iteration is done through a while loop in which x walks all the way to the length of the model_sampling sigmas.
x is incremented by ss, which is the count of model sigmas floor-divided by steps, with a minimum value of 1.

This is similar to the Simple scheduler, which takes discrete steps inside the full list of model sigmas based on the step size (the ratio of sigmas to steps). Conceptually it does most things Simple does, at different moments and with slightly different syntax. For example, Simple divided the model sigmas by steps (with a normal /) and, inside the loop, multiplied x by ss before casting it to int.
Here the number of sigmas is divided by steps using //, flooring to an integer, which is then clamped to at least 1. And instead of indexing with a for loop and multiplying the index by ss, ss is added to x inside a while loop, which usually implies a less deterministic loop end condition. Two implementations with, seemingly, the same result.

However, as SGM Uniform was the counterpart to Normal, DDIM Uniform seems to be the counterpart to Simple. This analysis could suffice were the results of Simple and DDIM Uniform identical, but as you saw, they could not be further apart. I have to dig deeper into the nuances.

First of all, x, which becomes the indexer inside the loop, is initialized at 1, meaning one sigma at the start of the model's list is skipped and every successive index is offset by 1; combined with the final flip, this means the highest sigma of the model is effectively never reached. The distribution is then padded with 0.0.

I reimplemented the two schedulers using a dummy list of 80 evenly spaced sigmas and got these results at three different step counts: 4, 12 and 32.
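This is roughly what that re-implementation looks like, stripped of the torch and model_sampling plumbing. The dummy list assumes 80 evenly spaced sigmas from 0.0 to 0.9875, ordered from low to high as in the model.

def simple_like(sigmas, steps):
    ss = len(sigmas) / steps
    return [sigmas[-(1 + int(x * ss))] for x in range(steps)] + [0.0]

def ddim_like(sigmas, steps):
    # the isclose() special case is skipped: our dummy sigmas[1] is not close to 0,
    # so the real implementation would also just prepend the 0.0
    sigs = [0.0]
    ss = max(len(sigmas) // steps, 1)
    x = 1
    while x < len(sigmas):
        sigs.append(sigmas[x])
        x += ss
    return sigs[::-1]

dummy = [round(i * 0.0125, 4) for i in range(80)]   # 0.0, 0.0125, ..., 0.9875 (low to high)
print(simple_like(dummy, 4))    # [0.9875, 0.7375, 0.4875, 0.2375, 0.0]
print(ddim_like(dummy, 4))      # [0.7625, 0.5125, 0.2625, 0.0125, 0.0]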

# At 4 steps
Simple: [0.9875, 0.7375, 0.4875, 0.2375, 0.0]
DDIM: [0.7625, 0.5125, 0.2625, 0.0125, 0.0] # evenly spaced at 0.25
# At 12 steps
Simple: [0.9875, 0.9125, 0.825, 0.7375, 0.6625, 0.575, 0.4875, 0.4125, 0.325, 0.2375, 0.1625, 0.075, 0.0]
DDIM: [0.9875, 0.9125, 0.8375, 0.7625, 0.6875, 0.6125, 0.5375, 0.4625, 0.3875, 0.3125, 0.2375, 0.1625, 0.0875, 0.0125, 0.0] # evenly spaced at 0.0750
# At 32 steps
Simple: [0.9875, 0.9625, 0.925, 0.9, 0.8625, 0.8375, 0.8, 0.775, 0.7375, 0.7125, 0.675, 0.65, 0.6125, 0.5875, 0.55, 0.525, 0.4875, 0.4625, 0.425, 0.4, 0.3625, 0.3375, 0.3, 0.275, 0.2375, 0.2125, 0.175, 0.15, 0.1125, 0.0875, 0.05, 0.025, 0.0]
DDIM: [0.9875, 0.9625, 0.9375, 0.9125, 0.8875, 0.8625, 0.8375, 0.8125, 0.7875, 0.7625, 0.7375, 0.7125, 0.6875, 0.6625, 0.6375, 0.6125, 0.5875, 0.5625, 0.5375, 0.5125, 0.4875, 0.4625, 0.4375, 0.4125, 0.3875, 0.3625, 0.3375, 0.3125, 0.2875, 0.2625, 0.2375, 0.2125, 0.1875, 0.1625, 0.1375, 0.1125, 0.0875, 0.0625, 0.0375, 0.0125, 0.0] # evenly spaced at .0250

Of course this even spacing is due to the evenly spaced sigma sequence I provided as input, but nonetheless DDIM maintains a balanced distribution while Simple does not.
I also noticed something else.

Simple: [0.9875, 0.9625, 0.925, 0.9, 0.8625, 0.8375, 0.8, 0.775, 0.7375, 0.7125, 0.675, 0.65, 0.6125, 0.5875, 0.55, 0.525, 0.4875, 0.4625, 0.425, 0.4, 0.3625, 0.3375, 0.3, 0.275, 0.2375, 0.2125, 0.175, 0.15, 0.1125, 0.0875, 0.05, 0.025, 0.0] with 33 entries
DDIM: [0.9875, 0.9625, 0.9375, 0.9125, 0.8875, 0.8625, 0.8375, 0.8125, 0.7875, 0.7625, 0.7375, 0.7125, 0.6875, 0.6625, 0.6375, 0.6125, 0.5875, 0.5625, 0.5375, 0.5125, 0.4875, 0.4625, 0.4375, 0.4125, 0.3875, 0.3625, 0.3375, 0.3125, 0.2875, 0.2625, 0.2375, 0.2125, 0.1875, 0.1625, 0.1375, 0.1125, 0.0875, 0.0625, 0.0375, 0.0125, 0.0] with 41 entries

DDIM Uniform can produce more sigmas than steps, while Simple stays constant at steps + 1.

Beta

def beta_scheduler(model_sampling, steps, alpha=0.6, beta=0.6):

    total_timesteps = (len(model_sampling.sigmas) - 1)
    ts = 1 - numpy.linspace(0, 1, steps, endpoint=False)
    ts = numpy.rint(scipy.stats.beta.ppf(ts, alpha, beta) * total_timesteps)
    sigs = []
    last_t = -1

    for t in ts:
        if t != last_t:
            sigs += [float(model_sampling.sigmas[int(t)])]
        last_t = t
    sigs += [0.0]
    return torch.FloatTensor(sigs)

Here we have Signature 1 again, returning a torch.FloatTensor of sigs, once more padded by a final 0.0 value for the last step. We iterate through ts, which is a sequence of values compared throughout the loop: whenever a value differs from the previous one, it is cast to int and used to index the model's sigmas. last_t is initialized to -1 as a sentinel value. So the question now is: what is ts?

First we initialize the value by filling a linear distribution (from which the last value is excluded (1)) and applying the element-wise operation 1 - x. In these domains between 0 and 1, subtracting from 1 inverts the number; think of it like flipping positive and negative space: 0.2 (0.2 away from 0) becomes 0.8 (0.2 away from 1).

Now we have a statistical distribution called beta, which represents a beta continuous random variable (1), and whose ppf method returns the percent point function (the inverse of cdf, the cumulative distribution function).

The short version is that the Beta distribution can be understood as representing a distribution of probabilities, that is, it represents all the possible values of a probability when we don't know what that probability is. (1)

This method takes a q, which is our ts, plus alpha and beta (both defaulting to 0.6). The resulting values are then multiplied by the total timesteps and rounded to the nearest integer (element-wise).

So without going too deep into statistical nuances: this approach warps the evenly spaced timesteps through the beta distribution's quantile function (a deterministic but non-uniform mapping, which with the default alpha and beta of 0.6 spaces steps more finely near the high- and low-noise ends), and guarantees that no two identical sigmas appear in succession. A small sketch of the warping follows.
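Here is a small sketch of that warping, assuming a model with 1000 sigmas (so 999 usable timestep indices); the exact numbers depend on scipy, only the qualitative shape matters.

import numpy
import scipy.stats

steps = 8
total_timesteps = 999                                     # assuming 1000 model sigmas
ts = 1 - numpy.linspace(0, 1, steps, endpoint=False)      # 1.0, 0.875, ..., 0.125
warped = scipy.stats.beta.ppf(ts, 0.6, 0.6)               # quantile function of Beta(0.6, 0.6)
indices = numpy.rint(warped * total_timesteps).astype(int)
print(indices)
# with alpha = beta = 0.6 the indices are spaced somewhat more finely near the
# extremes of the 0..999 range than in the middle, while still running high to low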

Linear Quadratic

def linear_quadratic_schedule(model_sampling, steps, threshold_noise=0.025, linear_steps=None):

    if steps == 1:
        sigma_schedule = [1.0, 0.0]

    else:
        if linear_steps is None:
            linear_steps = steps // 2

        linear_sigma_schedule = [i * threshold_noise / linear_steps for i in range(linear_steps)]
        threshold_noise_step_diff = linear_steps - threshold_noise * steps
        quadratic_steps = steps - linear_steps
        quadratic_coef = threshold_noise_step_diff / (linear_steps * quadratic_steps ** 2)
        linear_coef = threshold_noise / linear_steps - 2 * threshold_noise_step_diff / (quadratic_steps ** 2)
        const = quadratic_coef * (linear_steps ** 2)
        quadratic_sigma_schedule = [
            quadratic_coef * (i ** 2) + linear_coef * i + const
            for i in range(linear_steps, steps)
        ]
        sigma_schedule = linear_sigma_schedule + quadratic_sigma_schedule + [1.0]
        sigma_schedule = [1.0 - x for x in sigma_schedule]
    return torch.FloatTensor(sigma_schedule) * model_sampling.sigma_max.cpu()

Well, this one is a beast. It follows Signature 1, takes two additional inputs, threshold_noise and linear_steps, and (element-wise) multiplies the final torch.FloatTensor of sigma_schedule by the model's sigma_max. Meaning the entire distribution is put in relation to the maximum sigma of the model.

sigma_schedule is the result of linear_sigma_schedule, followed by quadratic_sigma_schedule and a 1.0. This compound sequence is then inverted (with the same 1.0 - x technique observed in Beta), making the final 1.0 the same 0.0 endpoint as in all the other schedulers. So now it is a matter of finding out how linear_sigma_schedule and quadratic_sigma_schedule are composed.

First of all, the function bails out of the entire algorithm at a step count of 1, returning a simple [1.0, 0.0] schedule. If the input linear_steps is not provided, the algorithm defaults to steps floor-divided by 2, which determines where the linear distribution ends and the quadratic distribution begins (in the default case, the mathematical middle).
linear_sigma_schedule is computed through a list comprehension in which each index from 0 to linear_steps is multiplied by threshold_noise (which defaults to 0.025) divided by the given number of linear_steps.

Before quadratic_sigma_schedule is calculated, two intermediary values are derived. threshold_noise_step_diff is linear_steps minus the total steps multiplied by the threshold noise. quadratic_steps is the step count remaining when linear_steps is subtracted from the total. quadratic_coef is threshold_noise_step_diff divided by linear_steps times quadratic_steps squared, which is probably where this scheduler gets its name from. linear_coef is threshold_noise divided by linear_steps, minus twice the ratio of threshold_noise_step_diff to quadratic_steps squared.
Finally, a constant const is calculated as quadratic_coef multiplied by linear_steps squared. quadratic_sigma_schedule is then given by a list comprehension from linear_steps to steps (the range from the midpoint to the end) in which each step index is squared and multiplied by quadratic_coef, added to the index multiplied by linear_coef, and finally added to const.

Once again this is a rather verbose shaping function which follows a linear progression up to a certain point and then becomes quadratic towards the end, with the contrast in shape governed by the threshold_noise value. I checked, however, and unlike exponential and karras there is no dedicated linear-quadratic scheduler node to customize the parameters. A standalone run of the formula follows.
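Here is the formula run standalone for 8 steps with the default threshold_noise, before the final multiplication by sigma_max (my own re-run of the math, values rounded):

def linear_quadratic_like(steps, threshold_noise=0.025, linear_steps=None):
    # same arithmetic as above, minus the model_sampling / torch plumbing
    if linear_steps is None:
        linear_steps = steps // 2
    linear = [i * threshold_noise / linear_steps for i in range(linear_steps)]
    diff = linear_steps - threshold_noise * steps
    quad_steps = steps - linear_steps
    quad_coef = diff / (linear_steps * quad_steps ** 2)
    lin_coef = threshold_noise / linear_steps - 2 * diff / (quad_steps ** 2)
    const = quad_coef * (linear_steps ** 2)
    quadratic = [quad_coef * (i ** 2) + lin_coef * i + const for i in range(linear_steps, steps)]
    return [1.0 - x for x in linear + quadratic + [1.0]]

print(linear_quadratic_like(8))
# roughly [1.0, 0.994, 0.988, 0.981, 0.975, 0.909, 0.725, 0.422, 0.0]:
# a gentle linear descent for the first half, then a quadratic plunge down to 0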

KL Optimal

def kl_optimal_scheduler(n: int, sigma_min: float, sigma_max: float) -> torch.Tensor:

    adj_idxs = torch.arange(n, dtype=torch.float).div_(n - 1)
    sigmas = adj_idxs.new_zeros(n + 1)
    sigmas[:-1] = (adj_idxs * math.atan(sigma_min) + (1 - adj_idxs) * math.atan(sigma_max)).tan_()
    return sigmas

Our last built-in function completes the set of self-sufficient Signature 2 schedulers, with no additional parameters. It returns sigmas unaltered, which implies that we already produce and operate on a torch.Tensor somewhere inside the function.

It all begins with adj_idxs, given by torch.arange over 0 to n (incrementing by 1, since everything is defaulted (1)), divided element-wise by n - 1, yielding an even ramp from 0 to 1.
sigmas is then created as a series of 0.0 of size n + 1, with the same dtype as adj_idxs (1).
The assignment to sigmas with the [:-1] index replaces everything except the last value; since new_zeros allocated one element more than the ramp, this fills sigmas except for a trailing 0.0, which is yet another approach to achieve the zero padding of the other implementations. Another nice example of how the concept of an algorithm can be translated with different implementation techniques.
The preceding values are given by an element-wise operation: each element multiplied by atan(sigma_min), added to its inverse (1 - element) multiplied by atan(sigma_max), with the element-wise tangent taken of the result.

So the schedule interpolates linearly between atan(sigma_max) and atan(sigma_min) and then maps the result back through the tangent. Because atan strongly compresses large values, this means big jumps at the high-sigma end and increasingly fine steps towards sigma_min, with the shape given purely by sigma_min and sigma_max. See the sketch below.
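Written out by hand, the mechanism becomes visible (my own sketch, with made-up but SD-like sigma bounds):

import math

sigma_min, sigma_max = 0.03, 14.6     # made-up, SD-like bounds
n = 5
for i in range(n):
    t = i / (n - 1)                   # 0.0 ... 1.0, like adj_idxs
    sigma = math.tan(t * math.atan(sigma_min) + (1 - t) * math.atan(sigma_max))
    print(round(sigma, 4))
# starts at exactly sigma_max (14.6) and ends at exactly sigma_min (0.03);
# because atan compresses large values, the schedule takes big jumps at the
# high-sigma end and increasingly fine steps as it approaches sigma_min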

Conclusion

Schedulers are the algorithms preparing the sequence of noise levels, the noise schedule, for the samplers to solve. Conceptually the noise schedule is a distribution going from a lot of noise down to no noise at the end. We have generally seen two approaches to this.
One more thing: being now able to visualize the effect of a scheduler as a distribution with a certain shape given by its algorithm, that shape should become more apparent with more data points, in this case a higher step count. So I repeated the experiment with 1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 60 and 120 steps, this time with the Heun2pp sampler.

Signature 1 functions are strongly tied to the model's specific sigmas.

Normal

anime_00016_.png

The Normal scheduler distributes timesteps linearly and then picks sigmas from the model based on those timesteps. This results in acceptable quality at every step count. It also provides a good benchmark to compare the other schedulers against.

SGM Uniform

anime_00019_.png

SGM Uniform stretches the timestep distribution over one extra step, tightening the interval, and skips the last entry, which results in uniform steps. Other than that, it works like Normal (as it uses the same function implementation). Normal and SGM Uniform both converge to almost the same picture at 120 steps. However, at 60 steps SGM Uniform produces a Pikachu more consistent (proportion-wise) with its results at lower steps. SGM Uniform also hallucinates some geometric patterns into the background in the middle steps, where Normal renders clouds of varying detail, except at 24. I also find that SGM Uniform outperforms Normal at lower step counts.

Simple

anime_00020_.png

Simple indexes the sigmas directly, at a regular interval given by the ratio of sigmas to steps. Since both Normal and Simple rely on the built-in sigmas of the model, they converge to a similar result. What becomes apparent on closer inspection is that Simple is always a few steps ahead of Normal and can produce slightly crisper results at lower step counts.
This could be due to the slight indirection Normal introduces by using timesteps to resolve sigmas, instead of directly indexing the sigmas inside the model_sampling layer. The biggest discrepancy in quality happens between steps 8 and 32, but both converge to the same result at 60 and 120. (Pikachu, however, has his eyes closed at 60 with the Simple scheduler.)

DDIM Uniform

anime_00021_.png

DDIM Uniform prioritizes a uniform (direct) indexing strategy, not always respecting the requested steps. The algorithm skips the highest sigma, which at a low step count means that almost no denoising takes place. This explains why DDIM Uniform takes so long to warm up. We have also seen that it produces uniform steps by letting the total sigma count grow beyond the step count, which should be visible in the total diffusion time.
If we compare a low step count like 8 between Simple (70.36s) and DDIM Uniform (72.37s) we see a slightly slower result, which is confirmed at 120 steps with 1125.16s and 1132.57s respectively.

Beta

anime_00022_.png

Beta warps the timestep distribution with the beta quantile function and uses the warped timesteps to index the existing sigmas (a similar indirection as Normal), avoiding identical sigmas in succession. As the experiment shows, the Beta scheduler produces the most varied results across step counts. Compared to Normal, Beta also has the most varied styles and details, which can be observed on the bear, the city and Pikachu between steps 8 and 24. However, when converging at 60 and 120 steps, while the results look similar to the other schedulers, Beta tends to resolve more than necessary and ends up removing details: Pikachu loses some limbs and the picture generally gets degraded a bit. If we remember the algorithm, it keeps comparatively high sigmas (strong denoising) in play further into the step sequence, which can explain this behavior.

Linear Quadratic

anime_00023_.png

Finally, linear_quadratic is a compound sequence of a linear part and a quadratic part, split at the halfway point of the timesteps by default. The Linear Quadratic scheduler produces very interesting results at lower steps like 2 and 4, and is the fastest to converge to the final design at around 8 and 12 steps, though it starts to cut off parts of the composition, which can be observed in the city backdrop.


Then we have Signature 2 Schedulers which take a different approach, being more deterministic and self-contained, ignoring the existing sigma distribution of the model and generating their own.

Karras

anime_00017_.png

Karras creates a linear ramp from 0 to 1 which is combined with a custom shaping function given by rho, sigma_min and sigma_max. Karras is the slowest to converge; for example, we can see the Pikachu ears on the bear only disappearing at very high step counts. It is behind Normal by many steps: its result at 24 steps only comes close to Normal's result at 8 steps, and it converges to a different result at 120 which resembles Normal at 28 steps. This is probably due to the low default rho. More experiments need to be done with the custom node, which allows customizing the shape of the distribution.

Exponential

anime_00018_.png

The Exponential scheduler returns an exponential distribution based on sigma_min and sigma_max. We compare this to our Normal benchmark and to Karras. It is even slower to converge to the results of Normal: its result at 60 steps corresponds to a combination of Normal's steps 8 and 12, and its step 120 corresponds to Normal's step 20 with slightly different proportions and details. Compared to Karras it produces almost identical results at lower steps, until Karras drifts off with slightly more varied results.

KL Optimal

anime_00024_.png

And finally KL Optimal, which interpolates between sigma_max and sigma_min in arctan space. Again we compare it to our Normal benchmark, but I would also like to compare it to Beta, as their behavior at higher step counts looks related. It is the only scheduler producing an empty black image at 1 step. At lower steps, the results are very similar to Normal; I feel that at 8 steps, KL Optimal gives the bear the angriest expression. It converges faster than Karras and Exponential, almost following Normal. Interestingly enough, the results at higher step counts are very different, mirroring the behavior observed with Beta. At 120 steps the result looks very similar to Normal's at 60 steps. Compared to Beta, KL Optimal does not showcase as much variation at lower steps; observe, for example, the difference between steps 8 and 24. The variation only starts to appear at higher steps.


This shows me that with Flux Schnell it is better to stick with the Signature 1 schedulers.
DDIM Uniform needs higher step counts to get started, while Simple and Normal are good base options. Beta produces the most unpredictable results across step counts, and Linear Quadratic is an efficient scheduler converging faster at lower steps.
Signature 2 schedulers, on the other hand, are very slow to converge and need higher step counts for the most variation. I think they are only worth using when controlled and customized through the custom nodes specifically designed for them.

Samplers

For the samplers I have repeated the same experiment, with the same seed, prompt and fixed step count, twice for each sampler: once with the normal scheduler (right) and once with the beta scheduler (left). I chose a step count low enough that, when in doubt, we can follow the process step by step, yet high enough to get past the threshold where some schedulers underperform due to how their distributions are shaped (looking at you, DDIM Uniform).

anime_00015_2cols.png

What should become apparent is that not every sampler works with Flux Schnell: euler_cfg_pp, euler_ancestral_cfg_pp, dpmpp_2s_ancestral_cfg_pp, dpmpp_sde, dpmpp_sde_gpu, dpmpp_2m_cfg_pp, dpmpp_2m_sde, dpmpp_2m_sde_gpu, dpmpp_3m_sde, dpmpp_3m_sde_gpu, res_multistep_cfg_pp, uni_pc, uni_pc_bh2 and ddpm perform very poorly (at least at 8 steps). We can see some common features:
all the samplers ending in cfg_pp and _sde have problems, as well as uni_pc and ddpm.
The former tells me that Flux is not well suited to the cfg and sde variations, while the latter tells me that the samplers uni_pc and ddpm might be outdated or not well suited to the Flux architecture.

Also we have several different families of Euler, Heun, DPM (with the most variations and improvements), LMS, DDPM, LCM, IPNDM, Deis, Res Multistep, Gradient Estimation, DDIM and Uni PC.

While some samplers have their own variants, which we will cover in their implementation analysis, most of them share common variants with familiar names (Ancestral, PP, CFG, SDE), which we will briefly cover here before diving in.

Ancestral

To understand ancestral, we need to recall an observation from our scheduler experiment, namely the point of convergence, past which a design does not change much and no new variants are introduced: the design stays stable and the denoising only adds details or improves the crispness of the image. Ancestral samplers are different.

An ancestral sampler adds noise to the image at each sampling step. They are stochastic samplers because the sampling outcome has some randomness to it... using an ancestral sampler is that the image would not converge.(1)
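To make that concrete, here is roughly the helper k-diffusion uses to split every step into a deterministic part and freshly injected noise (reproduced from memory, so treat the exact expression as an approximation):

def get_ancestral_step(sigma_from, sigma_to, eta=1.):
    # how far we can step down deterministically (sigma_down) and how much
    # fresh noise to re-inject afterwards (sigma_up); eta=0 disables the noise
    if not eta:
        return sigma_to, 0.
    sigma_up = min(sigma_to, eta * (sigma_to ** 2 * (sigma_from ** 2 - sigma_to ** 2) / sigma_from ** 2) ** 0.5)
    sigma_down = (sigma_to ** 2 - sigma_up ** 2) ** 0.5
    return sigma_down, sigma_up

An ancestral sampler first takes a regular step down to sigma_down and then adds sigma_up worth of new random noise, which is exactly why the image keeps changing instead of converging.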

CFG

Classifier-Free Guidance (CFG) is a technique used in diffusion models to improve the quality and controllability of generated samples. It has become a crucial component in state-of-the-art text-to-image generation systems, such as Stable Diffusion.
... This interpolation is controlled by a guidance scale, which determines how much influence the conditioning information has on the generation process. A higher guidance scale typically results in outputs that more closely match the input prompt, but can also lead to artifacts and reduced sample diversity. (1)

However, as of the time of writing this article, to my knowledge Flux.1 is not affected by CFG, which might explain the incompatibility with the samplers that emphasize it. Or rather, the advice is to keep CFG at 1, as Flux does not interact with the negative prompt. (1)
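As a sketch, the guidance step itself is just a linear interpolation/extrapolation between the unconditional and the conditional model outputs (the names below are illustrative, not ComfyUI's actual API):

def cfg_mix(uncond_pred, cond_pred, guidance_scale):
    # guidance_scale = 1.0 collapses to the purely conditional prediction,
    # which is why Flux workflows leave CFG at 1 and ignore the negative prompt
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)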

PP

So pp most likely stands for ++, as is the case with "C++" and .cpp files. (1)
It is generally a suffix for an improved version.

SDE and ODE

According to (1), SDE and ODE are mathematically derived techniques, Stochastic Differential Equations and Ordinary Differential Equations, where SDEs seem to be related to score-based diffusion models.

Score-based diffusion models convert data into noise through a diffusion process governed by a stochastic differential equation (SDE). Generating new data points is achieved by sampling noise particles and simulating a reverse-time dynamics of this diffusion process, driven by an equation known as the reverse SDE. The reverse SDE has a closed-form expression which depends solely on the time-dependent gradient field (the so-called score) of the logarithm of the perturbed data distribution.

ODEs, on the other hand, facilitate deterministic dynamics, controlled by a probability flow.
To evaluate the quality of images created with diffusion models, a metric called the Fréchet inception distance (FID) score is used.

For those curious, here is a paper talking in depth about SDEs in the context of market and stock price evolution. Same goes for ODEs in this article. We are just interested in their concept, how they relate to sampling, and how they affect the result.
ODEs are differential equations dependent on a single independent variable, whose unknowns are functions defined through their derivatives. In an SDE that progression is random.
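To give the terms some shape, the two central equations from the score-based diffusion literature look roughly like this, where score(x, t) stands for the gradient of the log of the perturbed data distribution that the model learns to approximate:

    reverse-time SDE:      dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw
    probability flow ODE:  dx = [f(x, t) - 0.5 * g(t)^2 * score(x, t)] dt

The SDE keeps the random dw term at every step, while the ODE drops it, which is the stochastic versus deterministic split we see reflected in the sampler names.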

For example, Euler, Heun and LMS are considered to be old-school ODE solvers (1), while the SDE samplers... don't really work with Flux.


As we did with the schedulers, let's try to find an entry point where all samplers are registered. This gives us the application context of the algorithm and helps identify any differences between them on the surface.


def ksampler(sampler_name, extra_options={}, inpaint_options={}):

    if sampler_name == "dpm_fast":
...
    elif sampler_name == "dpm_adaptive":
...
    else:
        sampler_function = getattr(k_diffusion_sampling, "sample_{}".format(sampler_name))
    return KSAMPLER(sampler_function, extra_options, inpaint_options)

In the ksampler function, which is used as a middle step to resolve the name to the implementation inside the KSampler class, we can already observe a few things.

  1. dpm_fast and dpm_adaptive get special treatment, probably some pre-processing/preparation steps before the main algorithm
  2. getattr is used to transform the name into an attribute called sample_<name> and resolve it to a Python symbol (a function in this case); see the short illustration after this list
  3. The sampler_function is then passed into a KSAMPLER construct together with options and inpaint options.
    KSAMPLER is a class that has a sample method.
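Here is a minimal, standalone illustration of that getattr-based resolution (the real module is comfy.k_diffusion.sampling; I stub it out so the snippet runs on its own):

import types

# stand-in for comfy.k_diffusion.sampling with a single fake sampler function
k_diffusion_sampling = types.SimpleNamespace(
    sample_euler=lambda *args, **kwargs: "sample_euler called",
)

sampler_name = "euler"
sampler_function = getattr(k_diffusion_sampling, "sample_{}".format(sampler_name))
print(sampler_function())  # sample_euler called

And here is the sample method that KSAMPLER exposes: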

def sample(self, model_wrap, sigmas, extra_args, callback, noise, latent_image=None, denoise_mask=None, disable_pbar=False):

        extra_args["denoise_mask"] = denoise_mask
        model_k = KSamplerX0Inpaint(model_wrap, sigmas)
        model_k.latent_image = latent_image

        if self.inpaint_options.get("random", False): #TODO: Should this be the default?
            generator = torch.manual_seed(extra_args.get("seed", 41) + 1)
            model_k.noise = torch.randn(noise.shape, generator=generator, device="cpu").to(noise.dtype).to(noise.device)

        else:
            model_k.noise = noise
        noise = model_wrap.inner_model.model_sampling.noise_scaling(sigmas[0], noise, latent_image, self.max_denoise(model_wrap, sigmas))
        k_callback = None
        total_steps = len(sigmas) - 1
        
        if callback is not None:
            k_callback = lambda x: callback(x["i"], x["denoised"], x["x"], total_steps)

        samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)

        samples = model_wrap.inner_model.model_sampling.inverse_noise_scaling(sigmas[-1], samples)

        return samples

This is where the magic happens. sample takes a model_wrap, sigmas (the noise schedule from our scheduler), some extra args, a callback, noise and the latent image, as well as a mask.
Here we can see the main algorithm. The number of steps is derived from the sigmas by subtracting 1 from their count.

Sigmas is the noise schedule produced by the schedulers. The noise is scaled by the inner_model's model_sampling using a noise_scaling function based on the maximum sigma (the first entry of our noise schedule).

The sampling pass is stored in samples by running the sampler_function with the wrapped model, noise and sigmas. The result is then passed into inner_model.model_sampling.inverse_noise_scaling together with the last (lowest) sigma entry, which undoes the noise scaling on the finished sample.

Inside the KSampler class I have also found this implementation of sample.

def sample(self, noise, positive, negative, cfg, latent_image=None, start_step=None, last_step=None, force_full_denoise=False, denoise_mask=None, sigmas=None, callback=None, disable_pbar=False, seed=None):

        if sigmas is None:
            sigmas = self.sigmas
  
        if last_step is not None and last_step < (len(sigmas) - 1):
            sigmas = sigmas[:last_step + 1]

            if force_full_denoise:
                sigmas[-1] = 0

        if start_step is not None:

            if start_step < (len(sigmas) - 1):
                sigmas = sigmas[start_step:]

            else:

                if latent_image is not None:
                    return latent_image

                else:
                    return torch.zeros_like(noise)

        sampler = sampler_object(self.sampler)

        return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)

At first it looks like a recursive function, but if you examine the signature carefully, it turns out to be yet another wrapper around yet another sample function.

def sample(model, noise, positive, negative, cfg, device, sampler, sigmas, model_options={}, latent_image=None, denoise_mask=None, callback=None, disable_pbar=False, seed=None):

    cfg_guider = CFGGuider(model)

    cfg_guider.set_conds(positive, negative)

    cfg_guider.set_cfg(cfg)

    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)

This function brings in the CFGGuider, a class with yet another collection of sample methods. Following all the references to other functions inside this class would mean covering the full content of samplers.py (1120 lines of code), so I can't show a listing of all the intermediate functions I went through. Suffice it to say, the entire process is hidden behind multiple layers for extensibility and flexibility. Inside inner_sampler of CFGGuider, the sampler class and the sampler.sample method are wrapped into ExecutableWrapper.
This all resolves back to the sampler_object utility function from the start. However, there is an interesting observation:

def sampler_object(name):
    if name == "uni_pc":
        sampler = KSAMPLER(uni_pc.sample_unipc)

    elif name == "uni_pc_bh2":
        sampler = KSAMPLER(uni_pc.sample_unipc_bh2)

    elif name == "ddim":
        sampler = ksampler("euler", inpaint_options={"random": True})

    else:
        sampler = ksampler(name)

    return sampler

The names are resolved through the ksampler factory, except for uni_pc, uni_pc_bh2 and ddim; the first two are, coincidentally, among the samplers that underperformed in our experiment, while ddim is quietly rerouted to euler with random inpaint noise. However, one pressing question remains open. The process has been described as iterative, and the reason I am doing all this backtracking is to find the context in which the sampling is iterated. But no matter how far I trace it, I can't find any loops or recursive calls.

def sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=1.0, disable_noise=False, start_step=None, last_step=None, force_full_denoise=False, noise_mask=None, sigmas=None, callback=None, disable_pbar=False, seed=None):

    sampler = comfy.samplers.KSampler(model, steps=steps, device=model.load_device, sampler=sampler_name, scheduler=scheduler, denoise=denoise, model_options=model.model_options)

    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)

    samples = samples.to(comfy.model_management.intermediate_device())

    return samples

KSampler (not KSAMPLER and not ksampler) is called within yet another sample function.
(I am showing you this so you can comprehend the agony I am putting myself through.)
It seems that the iteration takes place inside the individual sampler functions after all. So let's get to it!

Euler

Since this is our first sampler implementation, we are going to encounter a lot of new concepts and questions applicable to samplers in general, so bear with me, it is going to be a long section.

@torch.no_grad()
def sample_euler(model, x, sigmas, extra_args=None, callback=None, disable=None, s_churn=0., s_tmin=0., s_tmax=float('inf'), s_noise=1.):
    """Implements Algorithm 2 (Euler steps) from Karras et al. (2022)."""
    extra_args = {} if extra_args is None else extra_args
    s_in = x.new_ones([x.shape[0]])
    
    for i in trange(len(sigmas) - 1, disable=disable):
    
        if s_churn > 0:
            gamma = min(s_churn / (len(sigmas) - 1), 2 ** 0.5 - 1) if s_tmin <= sigmas[i] <= s_tmax else 0.
            sigma_hat = sigmas[i] * (gamma + 1)
        else:
            gamma = 0
            sigma_hat = sigmas[i]
        if gamma > 0:
            eps = torch.randn_like(x) * s_noise
            x = x + eps * (sigma_hat ** 2 - sigmas[i] ** 2) ** 0.5
        denoised = model(x, sigma_hat * s_in, **extra_args)
        d = to_d(x, sigma_hat, denoised)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigma_hat, 'denoised': denoised})
        dt = sigmas[i + 1] - sigma_hat
        # Euler method
        x = x + d * dt

    return x

We will proceed like we did with the scheduler analysis: from the outside in, and then from the end to the start. The first thing we can see here is the decorator @torch.no_grad(). It is a decorator (and context manager) that disables gradient calculation, saving memory; it is useful for functions that are only used for inference (1).
This tells us we are where we need to be, in the place where inference takes place.
Our inputs are: model (the model), x (which is passed by KSAMPLER as noise), sigmas (the noise schedule computed with the schedulers), extra_args (which we will ignore), callback (which we will ignore), disable, s_churn, s_tmin, s_tmax and s_noise.
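A tiny demonstration of what the decorator does (nothing ComfyUI-specific, just plain PyTorch):

import torch

@torch.no_grad()
def inference_step(x):
    # any operations in here do not build an autograd graph
    return x * 2

y = inference_step(torch.ones(3, requires_grad=True))
print(y.requires_grad)  # False: no gradients are tracked, which saves memory during sampling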

We return x. What is x? Probably a tensor. But what is it conceptually? According to my previous contextualization of the function, it is noise. But at the moment I don't really understand what that means. I thought noise comes from the sigmas. Maybe it is the generated noise from a seed?

My back-tracing ended up here, in the nodes.py file calling comfy.sample.sample (Do I have to comment this any further?...).

def common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent, denoise=1.0, disable_noise=False, start_step=None, last_step=None, force_full_denoise=False):
...
        noise = comfy.sample.prepare_noise(latent_image, seed, batch_inds)
...
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
...

And this is where the noise is generated:

def prepare_noise(latent_image, seed, noise_inds=None):
    """
    creates random noise given a latent image and a seed.
    optional arg skip can be used to skip and discard x number of noise generations for a given seed
    """
    generator = torch.manual_seed(seed)

    if noise_inds is None:

        return torch.randn(latent_image.size(), dtype=latent_image.dtype, layout=latent_image.layout, generator=generator, device="cpu")

    unique_inds, inverse = np.unique(noise_inds, return_inverse=True)

    noises = []

    for i in range(unique_inds[-1]+1):

        noise = torch.randn([1] + list(latent_image.size())[1:], dtype=latent_image.dtype, layout=latent_image.layout, generator=generator, device="cpu")

        if i in unique_inds:

            noises.append(noise)

    noises = [noises[i] for i in inverse]
    noises = torch.cat(noises, axis=0)

    return noises

The function called prepare_noise generates noises. Ufff....
But let's limit ourselves to what we need to know. Based on noise_inds it creates a collection of noise tensors. The noise is generated with torch.randn, seeded via torch.manual_seed and matching the size, dtype and layout of the latent_image.
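As a quick illustration of why the seed matters here (the latent shape below is just a typical SD-style example, not taken from the code):

import torch

shape = (1, 4, 64, 64)  # hypothetical latent layout

noise_a = torch.randn(shape, generator=torch.manual_seed(42), device="cpu")
noise_b = torch.randn(shape, generator=torch.manual_seed(42), device="cpu")

print(torch.equal(noise_a, noise_b))  # True: the seed fully determines the initial noise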

Back to the sampler.

x is a tensor: the collection of noise in the latent_image layout.
x is updated as x + d * dt, which is the Euler method. So let us see what d and dt are.
This all takes place inside a loop: our sampling process!

dt is sigma_hat subtracted from the next sigma in the schedule. d is the result of the function to_d, which takes x (the noise), sigma_hat and denoised. Based on a comment in the to_d definition, it converts a denoiser output to a Karras ODE derivative. So d is the ODE derivative... of what?

denoised is the result of the model given x (the noise) and sigma_hat multiplied by s_in. This is where algorithm meets AI model. Great, so the model does what it learned: given noise, it denoises it. I think this is what all the wrapper layers were for: dependency injection, to get all the conditioning into the model without having to pass it explicitly through every layer all the way down to the sampling method. My first guess for what the sampler does is now: given a certain noise schedule, the sampler prepares the noise for the model, lets the model denoise it, and then processes the denoised result against the remaining noise, preparing it for the next iteration.

The derivative that to_d finds is therefore the one pointing from the current noise towards the denoised prediction, scaled by sigma_hat, which ties it to our noise schedule.

Now we get to the complicated "mathy" stuff. Let's break it down. We have a branch on s_churn, depending on whether it is greater than zero. s_churn is part of a collection of parameters that can be passed into the function through extra_options, but inside the ksampler factory I could not see any of them being passed, so I will use the default values from here on: 0. That sets gamma to 0 and sigma_hat to the current sigma of the schedule.
So s_churn, and in consequence gamma, seem to be parameters to adjust the sigma (the standard deviation of the normal distribution for this sampling step) by re-injecting some noise. With the defaults, sigma_hat is simply the current sigma.

The loop iterates over trange of the sigma count minus 1, probably because the calculation of dt has to look one sigma ahead. Remember how our schedulers always appended 0.0 at the end of the schedule? This could be the reason for that. trange is tqdm's shorthand for tqdm(range(n)).
tqdm is a customizable progress bar decorator for iterators that prints to stderr by default. In simpler words, this is what drives the progress bar on top of the KSampler node while sampling runs. It is relevant for the UI/UX but not for the sampling algorithm.

So, to answer our questions (how does the sampling algorithm affect the result, and how can it be conceptualized?), we need to translate d and dt into concepts.

def to_d(x, sigma, denoised):
    """Converts a denoiser output to a Karras ODE derivative."""
    return (x - denoised) / utils.append_dims(sigma, x.ndim)

to_d takes x (the noise), the sigma and the denoised sample, subtracts denoised from x, and divides the result by append_dims(sigma, x.ndim) (x.ndim being the number of dimensions of the tensor (1)). append_dims is a utility function that appends trailing singleton dimensions to sigma until it matches that number of dimensions, so that the division broadcasts correctly across the tensor. But the whole (x - denoised) / sigma is very very underwhelming. How can this produce a derivative? I will dig a bit deeper into tensor operations and also maybe into the Karras paper. dt is the difference between the next sigma and sigma_hat (a negative number, since sigmas decrease).
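In the meantime, here is a toy numerical check (hypothetical numbers, no real model) which at least shows why the Euler update lands exactly on the next noise level, if we assume the parameterization x = denoised + sigma * eps:

import torch

def to_d(x, sigma, denoised):
    return (x - denoised) / sigma

eps = torch.randn(4)
denoised = torch.zeros(4)          # pretend the model predicts an all-zero clean sample
sigma, sigma_next = 10.0, 5.0

x = denoised + sigma * eps         # current noisy sample at sigma = 10
d = to_d(x, sigma, denoised)       # recovers eps
x_next = x + d * (sigma_next - sigma)

print(torch.allclose(x_next, denoised + sigma_next * eps))  # True

Under that assumption, dx/dsigma = (x - denoised) / sigma = eps, which is exactly what to_d returns, and stepping by dt = sigma_next - sigma removes precisely the right amount of noise.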

So Euler simply takes the derivative pointing from the current noise towards the denoised prediction and multiplies it by the sigma jump from the current step to the next, effectively producing the next iteration of noise for the model. With my current knowledge, this reminds me of matrix multiplication in linear algebra, and since matrices are rank-2 tensors, I am using this simplification to explain what is going on. Matrices at first also look like an arbitrary collection of numbers, but if we multiply matrices and vectors we can combine or apply transforms; we can express all the manipulations of 2D and 3D space with this simple operation on rank-1 and rank-2 tensors.
So imagine what could be expressed with tensors of arbitrary rank.

Over the next months I am going to complement my analysis by studying some more advanced math concepts, and will then revise some of these observations and fill in the gaps that I am currently abstracting over. But I think for a conceptual overview this will suffice for now.

Euler A(ncestral)

Euler CFG PP

Euler A(ncestral) CFG PP

DPM Adaptive