Analyzing Samplers And Schedulers For ComfyUI In Latent Diffusion Workflows
Motivation
I have always been reluctant about AI because of the black box effect. Let me explain. As a former creative turned technologist I took great pride in translating a vision, an effect or a process into an algorithm: I still do. This algorithm would then become a piece in a bigger system, which would grow based on the same principle. This practical fascination is accompanied by a conceptual one, as each algorithm is a translation of creative intent into a mechanization that teaches the system exact instructions to produce a new result. A new concept. From this perspective I understood programming as a craft, or dare I say it, an art form. A form of expression that transcends result-driven creativity into process-driven creativity. However, AI initially felt like a concept that abstracted away the process -- the algorithm -- and based on an input "just" produced an output. I felt this was the downfall of the algorithmic ingenuity which I had grown to like across the boundaries of these two disciplines.
Until recently...
I have been playing around a lot with AI, in the form of LLMs and diffusion models for text-to-image and image-to-image workflows, to produce concepts and assets for my game prototypes. Thanks to ComfyUI I was able to work with AI models, combine them and experiment with different niche workflows specific to my needs. The AI models became pieces in a bigger system, laid out into an overarching workflow. By myself. But this is not all: I realized that without the right nodes capable of processing it, a model is very, very useless. And put simply, everything in the system that is not a model has to be an algorithm, right? This led to a follow-up observation that, contrary to my beliefs, AI actually requires a ton of algorithms, and I explored where the models interface with algorithms, and how these algorithms impact the creative process.
Data collection, data preparation and learning are all modeled as algorithms, shaping the potential output of an AI. In latent diffusion (which for me includes all models ranging from the original Stable Diffusion to the newer Flux.1), some key elements for great success are parameters like step count, samplers and schedulers. They shape the actual output of the AI. Trying desperately to close the gap between AI and my obsession with algorithmic ingenuity, I came to accept AI as the interplay of algorithms and models. Sounds familiar?
The building blocks of any program are algorithms and data structures, interacting inside an architecture. To fit AI into this conceptual model, I simply abstract the AI model into a very, very complex data structure that also incorporates a very complex function as its interface. But the key point is: it is not self-sufficient. This realization felt liberating.
Goal of the Study
This personal insight sparked a newfound curiosity: getting into AI myself, by analyzing the algorithms that interact directly with the model. A topic which regretfully I had put aside for far too long. Since I love to analyze things and derive meaning from processes, I thought this could be a fruitful gateway into the domain. Which brings me to this article. One element which seems to greatly affect the results of my outputs inside the ComfyUI pipeline are the mystical Samplers and Schedulers with very cryptic names.
There are multiple articles (1,2,3,4,5) covering the different types of samplers and schedulers inside the Stable Diffusion pipeline, and more practical guides about the applications for which they are suitable.
However the tinkerer in me is often not satisfied by application-driven examples alone, and I therefore want to examine the algorithms themselves to find out:
- How does the Algorithm produce the observable effect?
- How does the Algorithm internally impact the output of the image?
- Based on these two questions, what is the conceptualization of each algorithm?
This article will grow and change over time, becoming an analytical playground where I share my thoughts while breaking down the source code of the modules implementing the samplers and schedulers. I will try my best, with the available resources, to make sense of the mathematical formulas employed and to understand the "why" behind their implementations. When possible I will consult the official papers of the methods.
Disclaimer
An algorithm as envisioned by the original author will never be 100% reproducible. Similar to a cooking recipe, every cook puts his own flair on the final dish, often involuntarily, through his own ability or the techniques used. This is why I think understanding the fundamental conceptual ideas behind each implementation is much more valuable. The observations I make here are purely for educational purposes, for people like me who are curious about the inner workings of these components. These observations also only apply to the specific implementation inside the ComfyUI system.
Some Context: Latent Diffusion and ComfyUI
Let us begin by giving this research endeavor a little bit of context.
First thing, what is ComfyUI?
They describe themselves as:
The most powerful open source node-based application for creating images, videos, and audio with GenAI (1).
They started as an open source project on GitHub and are to this day an active community making the usage of latent diffusion models accessible and user friendly. They first came to my attention as an alternative to Automatic1111's WebUI, which followed a similar goal but sacrificed flexibility for simplicity and accessibility.
I was already familiar with node-based development from my Unreal Engine experience with Blueprints, and from the node software VVVV, so I find the approach quite attractive. Nowadays ComfyUI features a fully self-contained desktop app with a model and node manager, centralizing all your diffusion AI needs in one place.
Anyone can develop new custom nodes, either to add utility functions that interact with simpler data types like numbers, strings and images, or to extend the functionality for interacting with AI models.
This is how you build the simplest "base" workflow for any diffusion task. It is the hello world of ComfyUI, so to say.
- You load the models (diffusion U-Net, CLIP, VAE), or, as in this case, you use a checkpoint which packages all of them together.
- You create a text prompt, which is then encoded into an embedding for the U-Net model using CLIP.
- You generate an empty image in the latent space.
- You connect them all to a KSampler, which adds random noise (from a seed) to the latent image and then, over multiple steps, samples and denoises the latent image according to a scheduler, guided by the embeddings, into a final latent image.
- The denoised latent image is then turned from latent space into pixel space by the VAE (Variational Auto-Encoder).
- The final image can then be saved or previewed with the corresponding final node.
I think the time has come to talk about the technical details.
What is a Diffusion Model? What is a CLIP and what is a VAE?
Since this article is already going to be very long, I will address these questions by explaining them in simple concepts and linking to full guides, as I am not a machine learning expert myself and lack the terminology to go into detail.
Let us begin conceptualizing it. The idea is to take an AI model that has been trained on denoising images (by prediction), the U-Net, and to do so in latent space, which is another word for hidden, or abstracted. All the inputs, be it pixels (an empty or existing image) or prompts in the form of embeddings, feed into complex input tensors (the official name for vectors and matrices that extend into higher dimensions), which the AI can make sense of. (1,2,3) The embeddings are a collection of numbers derived from the used keywords and the text prompt, produced by yet another AI model. The CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. (1)
In other words, given an image, it can tell you the most relevant keywords to describe that image. Well, in the case of Stable Diffusion this has been flipped, or rather used as a guide telling the AI whether what it is denoising corresponds to the keywords used: algorithmic ingenuity! :D
At this point one has to mention two additional model types that are not required but make the workflow even more powerful. LoRAs allow sub-models trained on specific keywords (the triggers) to influence the conditioning. In other words, they introduce new concepts for the AI to represent, like specific characters, art styles or objects. There is a fishnet LoRA, for example.
Yeah...
And then there are ControlNets, which complement the conceptual with the compositional. They come in many types, like Depth, Pose or Scribble, and help give the picture the structure you want. For example, there is a ControlNet that encodes a functioning QR code into your image.
And finally the VAE (Variational Auto-Encoder). Autoencoders are a type of neural network designed to learn efficient data representations, primarily for the purpose of dimensionality reduction or feature learning. (1) Put simply, autoencoders can learn to represent something described by many, many data points in an abstract space with far fewer parameters, and then, based on these latent parameters, reproduce something resembling the original data. It is the magician turning the hidden latent image into a visible image. While this sounds like a glorified AI compressor and decompressor -- which, in this pipeline (or dare I say algorithm), it is -- autoencoders can also be used for creative applications: reducing a design space down to a few parameters, playing around with those and generating new results similar to the "decompressed" originals, like levels, character designs and much more. (1)
But the most important component in all this is the KSampler.
The KSampler node is designed for advanced sampling operations within generative models, allowing for the customization of sampling processes through various parameters. It facilitates the generation of new data samples by manipulating latent space representations, leveraging conditioning, and adjusting noise levels. (1)
Or in other words, it is the bridge between U-Net, CLIP, and VAE to direct all of these models into a synergy which allows us to turn a text description into juicy anime boobs.
But more importantly it is filled with juicy juicy algorithms, ready to be explored.
The KSampler has various parameters, like noise (with a seed) as a starting point, the number of steps, a cfg value, and denoise. These are all numbers, quantitative parameters, which need to be interpreted by an algorithm. However, sampler name and scheduler are names, labels (cryptic ones at that). That means they hide concepts behind them, ideas and patterns following certain strategies to solve the same problem. How do these concepts/strategies work? How do they "interpret" our numerical values, and to what effect? This is what we will explore now, diving deep into their implementation.
Inside comfy/k_diffusion/sampling.py and comfy/samplers.py we shall find our answers.
First Steps
First of all, let us analyze the labels themselves. The first thing that becomes apparent: a lot of acronyms and/or names (people's names). What also sticks out are common suffixes like:
- ancestral
- cfg_pp
- sde
- sde_gpu
- _2(s) and _3(m)
On the scheduler side we have a handful with clearer names. As there are fewer of them and, at first glance, their implementations are limited to self-contained functions, I will start with them, and might expand to samplers if I need more context.
Schedulers
The first thing that becomes apparent from research is that Scheduler and Sampler are used interchangeably in many sources, and in some UIs there is only a single option, which is the combination of a Sampler and a Scheduler (1,2), while others differentiate between Schedulers and Schedule Type (3).
So it seems once again the difference is made explicit in ComfyUI for increased control. Let us see what they have to say.
All I could find is
The type of schedule to use, see the samplers page for more details on the available schedules.(1)
As such, scheduling is part of the sampling process, and therefore a subcomponent of Samplers (1).
The noise schedule controls the noise level at each sampling step. The noise is highest at the first step and gradually reduces to zero at the last step.
At each step, the sampler’s job is to produce an image with a noise level matching the noise schedule. (1)
Samplers
So let us go back to what other articles call a Scheduler to understand what Samplers are in our case.
Schedulers are algorithms that are used alongside the UNet component of the Stable Diffusion pipeline. They play a key role in the denoising process and run multiple times iteratively (called steps) to create a clean image from a completely random noisy image. The key to all these scheduler algorithms is to progressively perturb data with intensifying random noise (called the “diffusion” process), then successively remove noise to generate new data samples. Sometimes, Schedulers are also referred to as Samplers (1).
Algorithms! Hurray. Also, interestingly enough, the steps are not an input to the algorithm but the count of how many times the algorithm is applied!
One of the Samplers/Schedulers mentioned in the article is DPM2 Karras, a sampler also very familiar from my time using Automatic1111, which likewise did not distinguish between sampler and scheduler. Looking at how ComfyUI presents them, karras is in fact a scheduler, which in this setup can be applied to any sampler, not just DPM2.
Before we finally begin with the code analysis, let us quickly break down the pattern each algorithm reinterprets in its own way:
- progressively perturb data with intensifying random noise (called the “diffusion” process)
- then successively remove noise to generate new data samples
"Many schedulers are implemented from the k-diffusion library by Katherine Crowson, and they’re also widely used in A1111."(1)
Schedulers
This is an experiment I did in ComfyUI using the Flux.1 Schnell U-Net and the custom node flux-sampler-parameters to spit out a matrix of results combining the Euler sampler with all the schedulers, each at 1, 2, 4, 8 and 16 steps (Flux Schnell does not require many steps). The prompt was:
anime grizzly bear fighting a pikachu mid air. The bear wears a blue necklace with big shiny beads. The pickachu is curled up and emitting a potent thunderstorm. The perspective is from above. Seing the city below them in a dynamic perspective
I chose this prompt to combine many elements on which I wanted to challenge the AI and see how the samplers influence the result: a style (anime), a species of animal (grizzly bear) and its interpretation in the style, an iconic character (Pikachu), specific details (blue necklace with big shiny beads), conceptual actions (fighting mid air, emitting a potent thunderstorm) and a specific composition with a challenging perspective (from above, city below). This relates to the two components I described briefly before (conceptual -> LoRA and compositional -> ControlNet), but I did not use either of them.
To help you make sense of the image grid: all images were made with the Euler sampler. Each row represents one scheduler, going from 1 to 16 steps. While most are similar, there are a few noteworthy outliers. KL Optimal produced an empty black image at the first step, while DDIM Uniform works with a flat gray image for the most part, turning it into a distinct composition (compared to the other schedulers) only after more steps. I find the result at 4 steps particularly cute, as it looks like a soup of tiny image molecules, almost as if DDIM Uniform was not quite finished cooking yet...
Considering that this was its result at 32 steps.
What I am showing you next is the registration of the labels and their internal implementations
SCHEDULER_HANDLERS = {
"normal": SchedulerHandler(normal_scheduler),
"karras": SchedulerHandler(k_diffusion_sampling.get_sigmas_karras, use_ms=False),
"exponential": SchedulerHandler(k_diffusion_sampling.get_sigmas_exponential, use_ms=False),
"sgm_uniform": SchedulerHandler(partial(normal_scheduler, sgm=True)),
"simple": SchedulerHandler(simple_scheduler),
"ddim_uniform": SchedulerHandler(ddim_scheduler),
"beta": SchedulerHandler(beta_scheduler),
"linear_quadratic": SchedulerHandler(linear_quadratic_schedule),
"kl_optimal": SchedulerHandler(kl_optimal_scheduler, use_ms=False),
}
Noteworthy here is that karras and exponential are imported from the k_diffusion_sampling module. They have a different function signature and are therefore integrated differently into the system.
Signature 1: (model_sampling, steps)
Signature 1 belongs to the functions residing in comfy/samplers.py. Here we have model_sampling and steps as the common denominator.
The number of steps is passed into the function as input together with model_sampling (which I initially assumed was the sampler itself). At first glance it seems that model_sampling is queried for some parameters internally. After a more thorough look, I could identify where the scheduler functions are invoked.
def calculate_sigmas(model_sampling: object, scheduler_name: str, steps: int) -> torch.Tensor:
handler = SCHEDULER_HANDLERS.get(scheduler_name)
if handler is None:
err = f"error invalid scheduler {scheduler_name}"
logging.error(err)
raise ValueError(err)
if handler.use_ms:
return handler.handler(model_sampling, steps)
return handler.handler(n=steps, sigma_min=float(model_sampling.sigma_min), sigma_max=float(model_sampling.sigma_max))
This function is in turn invoked inside the KSampler class, in the method definition of the same name, calculate_sigmas:
class KSampler:
...
def calculate_sigmas(self, steps):
...
calculate_sigmas(self.model.get_model_object("model_sampling"), self.scheduler, steps)
...
After tracing through a total of 5-6 layers, model_sampling seems to resolve into a torch.nn.Module (the base class for all neural network modules, in which you can assign submodules as regular attributes (1)), using a util function that retrieves parts of the model by transforming the string name into an attribute. This model_sampling attribute is set here based on the type of the U-Net architecture.
class BaseModel(torch.nn.Module):
...
self.model_sampling = model_sampling(model_config, model_type)
...
def model_sampling(model_config, model_type):
s = ModelSamplingDiscrete
if model_type == ModelType.EPS:
c = EPS
elif model_type == ModelType.V_PREDICTION:
c = V_PREDICTION
elif model_type == ModelType.V_PREDICTION_EDM:
c = V_PREDICTION
s = ModelSamplingContinuousEDM
elif model_type == ModelType.FLOW:
c = comfy.model_sampling.CONST
s = comfy.model_sampling.ModelSamplingDiscreteFlow
elif model_type == ModelType.STABLE_CASCADE:
c = EPS
s = StableCascadeSampling
elif model_type == ModelType.EDM:
c = EDM
s = ModelSamplingContinuousEDM
elif model_type == ModelType.V_PREDICTION_CONTINUOUS:
c = V_PREDICTION
s = ModelSamplingContinuousV
elif model_type == ModelType.FLUX:
c = comfy.model_sampling.CONST
s = comfy.model_sampling.ModelSamplingFlux
class ModelSampling(s, c):
pass
return ModelSampling(model_config)
We will see in each implementation what they influence.
Returning to the KSampler class, we will have a look at how steps are passed in and processed.
class KSampler:
...
self.set_steps(steps, denoise)
...
def set_steps(self, steps, denoise=None):
self.steps = steps
if denoise is None or denoise > 0.9999:
self.sigmas = self.calculate_sigmas(steps).to(self.device)
else:
if denoise <= 0.0:
self.sigmas = torch.FloatTensor([])
else:
new_steps = int(steps/denoise)
sigmas = self.calculate_sigmas(new_steps).to(self.device)
self.sigmas = sigmas[-(steps + 1):]
Here two interesting things can be observed.
- Steps and denoise interact. The steps are divided by the denoise amount, and since dividing by a value between 0 and 1 effectively multiplies, the lower the denoise, the longer the computed schedule. Note, however, that only the last steps + 1 sigmas of that longer schedule are kept, so a low denoise effectively starts sampling from a lower noise level rather than running more steps (see the sketch below).
- Also, the scheduler implementation is called directly inside the set_steps method of the KSampler through the calculate_sigmas method.
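Because the slice at the end is easy to overlook, here is a small worked example of that interaction, sketched with a dummy linear schedule standing in for calculate_sigmas (the numbers are mine, not ComfyUI's):
# Sketch of the steps/denoise interaction with a dummy linear schedule
steps, denoise = 20, 0.5
new_steps = int(steps / denoise)                                      # 40
full_schedule = [1.0 - i / new_steps for i in range(new_steps + 1)]   # 41 values, 1.0 -> 0.0
sigmas = full_schedule[-(steps + 1):]                                 # last 21 values: 0.5 -> 0.0
print(sigmas[0], sigmas[-1])                                          # 0.5 0.0
So a denoise of 0.5 does not run more steps; it computes a longer schedule and starts sampling from its midpoint, i.e. from a lower noise level.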
Signature 2: (n, sigma_min, sigma_max)
The first thing that becomes apparent: this signature is not directly influenced by model_sampling, as is the case with the first signature. Also, we have already encountered this signature at this layer of abstraction:
def calculate_sigmas(model_sampling: object, scheduler_name: str, steps: int) -> torch.Tensor:
handler = SCHEDULER_HANDLERS.get(scheduler_name)
if handler is None:
err = f"error invalid scheduler {scheduler_name}"
logging.error(err)
raise ValueError(err)
if handler.use_ms:
return handler.handler(model_sampling, steps)
return handler.handler(n=steps, sigma_min=float(model_sampling.sigma_min), sigma_max=float(model_sampling.sigma_max))
This shows us several things.
- n is the same as steps from Signature 1.
- The distinguishing aspect between schedulers using Signature 1 and Signature 2 is the property use_ms.
- Signature 2 deals with sigma_min and sigma_max, which are taken from model_sampling, meaning they are the only two properties of the model taken into account.
So before we proceed we need to understand a few things: what does "use ms" mean, and what are sigmas?
class SchedulerHandler(NamedTuple):
handler: Callable[..., torch.Tensor]
# Boolean indicates whether to call the handler like:
# scheduler_function(model_sampling, steps) or
# scheduler_function(n, sigma_min: float, sigma_max: float)
use_ms: bool = True
All I could find about use_ms is this single flag, which confirms what has already become apparent from my analysis: it decides between the two signatures. My research on the internet also did not turn up any clear evidence for what ms stands for. But since there is no explicit mention, it is probably short for "use model sampling", as that is also the distinguishing feature of Signature 1 schedulers.
Sigmas
Bear with me, this is going to be the most technical and mathematical part of the article, but it is necessary to associate a lot of the calculations with a meaningful concept.
An important detail related to the diffusion process is the noise schedule. This is the level of noise (a.k.a. sigma, a.k.a. standard deviation of normal distribution) added to the image depending on the time step, a.k.a. t, during the diffusion process. Namely, we add random numbers coming from the normal distribution with a given sigma parameter to each pixel value in each channel of an image. (1)
...Additive Gaussian noise is most commonly used as the corruption method. ... The total standard deviation is found as σ = √(∑ᵢ σᵢ²), where σ₁, σ₂, … are the standard deviations of the noise added at each point in the sequence. Therefore, at each point in the corruption process, we can ask: what is the total amount of noise that has been added so far – what is its standard deviation? We can write this as σ(t), where t is a time variable that indicates how far the corruption process has progressed. This function σ(t) is what we typically refer to as the noise schedule. (2)
Sigma, the greek letter: "σ" stands for the standard deviation of the normal distribution.
Sigmas represent the magnitude of noise added to the image at each timestep, and the sequence of sigma values represents the noise schedule.
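As a tiny illustration of what a single sigma means in practice (the shapes and values below are my own, not ComfyUI's):
import torch

# Noise drawn from a normal distribution with standard deviation sigma is added to the latent.
latent = torch.zeros(1, 4, 64, 64)             # stand-in for a clean latent image
sigma = 14.6                                   # an illustrative "maximum noise" level
noisy = latent + sigma * torch.randn_like(latent)
print(noisy.std())                             # roughly equal to sigma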
Now that we have established the context and the most important pieces let us finally dive into the different implementations.
Normal
def normal_scheduler(model_sampling, steps, sgm=False, floor=False):
s = model_sampling
start = s.timestep(s.sigma_max)
end = s.timestep(s.sigma_min)
append_zero = True
if sgm:
timesteps = torch.linspace(start, end, steps + 1)[:-1]
else:
if math.isclose(float(s.sigma(end)), 0, abs_tol=0.00001):
steps += 1
append_zero = False
timesteps = torch.linspace(start, end, steps)
sigs = []
for x in range(len(timesteps)):
ts = timesteps[x]
sigs.append(float(s.sigma(ts)))
if append_zero:
sigs += [0.0]
return torch.FloatTensor(sigs)
The Normal scheduler follows Signature 1 and has additional sgm and floor parameters. I will skip the sgm flag for now, because it is the distinguishing parameter between normal and SGM Uniform, and as such we will dedicate that section to it. floor is a dead parameter, never used inside the function, so we will ignore it; it is probably a leftover from an older implementation.
The function returns a torch.FloatTensor (1) of sigs. A Tensor, as we have already briefly touched on, is a multi-dimensional matrix containing elements of a single data type (1). I leave here a more in-depth explanation for those curious about what a tensor is.
Tensors are the fundamental data structures used in machine learning (2) and are particularly useful for handling complex data such as images, audio, and text. For example, a color image can be represented as a 3D tensor with dimensions corresponding to height, width, and color channels (red, green, and blue). Similarly, a video can be represented as a 4D tensor, with time being the additional dimension. How tensors can be used to represent and model all sorts of concepts for the purpose of machine learning warrants an entire article of its own. The most essential information for us is that tensors generalize structures that are more familiar, corresponding to different ranks: a scalar (or single value) is a rank 0 tensor, a vector is a rank 1 tensor and a matrix is a rank 2 tensor. Simply put, the rank corresponds to its dimensionality, and essentially a tensor is an n-dimensional parameter for our AI model. In terms of Python you can think of it as a nested list, which we can see in the code, even though it seems to be a rank 1 tensor in this case.
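If the rank terminology feels abstract, here is a tiny PyTorch illustration (my own example, not from the ComfyUI code):
import torch

# Quick illustration of tensor ranks (dimensionality):
scalar = torch.tensor(3.14)            # rank 0
vector = torch.tensor([1.0, 2.0])      # rank 1 (like the sigma schedule here)
matrix = torch.ones(2, 3)              # rank 2
print(scalar.dim(), vector.dim(), matrix.dim())   # 0 1 2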
sigs, which I assume are the sigmas, or the sequence of noise levels, is initialized as an empty list. Then we iterate over the range of the length of timesteps and append the result of the model_sampling method sigma for each timestep. If the flag append_zero is true, a last sigma of 0.0 is added, which I assume is to run a last cleaning step with no noise for the final polish. What stands out here is:
- The model is responsible for calculating the sigma
- The timestep value is the most impactful
- There is a final 0.0 value added to the sigma sequence
We have two branches (one of which we ignore, as it deals with sgm). If the sigma at end (the timestep of the model at sigma_min) is already almost 0, we do not append a zero and instead add an additional step. Essentially both branches achieve a similar result: either the last native step is already near 0, or we append our own 0 to "normalize" the sequence.
And finally we have timesteps as the result of torch.linspace over start (the timestep of the model's highest sigma), end (the timestep of the model's lowest sigma) and the steps parameter provided by the user. torch.linspace creates a one-dimensional tensor of size steps whose values are evenly spaced from start to end, inclusive. (1)
The Normal scheduler relies on the built-in timesteps of the model, distributing them evenly and normalizing with a 0.0 sigma at the final timestep. This feels like the most "vanilla" (unaltered) version of scheduling.
Karras
def get_sigmas_karras(n, sigma_min, sigma_max, rho=7., device='cpu'):
"""Constructs the noise schedule of Karras et al. (2022)."""
ramp = torch.linspace(0, 1, n, device=device)
min_inv_rho = sigma_min ** (1 / rho)
max_inv_rho = sigma_max ** (1 / rho)
sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho
return append_zero(sigmas).to(device)
We now see the first scheduler using Signature 2. Here sigma_min and sigma_max are passed in externally and the scheduler does not rely on model-specific definitions. So the implementations of Signature 2 are the most self-sufficient and probably the most influential.
The function returns append_zero of sigmas, converted to device. I suppose append_zero is a custom utility function, so let's track it down.
def append_zero(x):
return torch.cat([x, x.new_zeros([1])])
Given a tensor x, it concatenates a zero tensor to it. To me this looks like an inline variant of what the Normal scheduler did with its append_zero flag and if block. And .to performs tensor dtype and/or device conversion (1).
The sigmas are produced from ramp, which is a distribution from 0 to 1 with n steps. Unlike in the Normal scheduler, a flat 0-to-1 distribution is used here instead of the model's built-in sigmas by timestep. Then min_inv_rho and max_inv_rho are calculated, which are the sigma bounds raised to the power of 1/rho (an inverted exponent), a shaping function (1).
ramp, which is a rank 1 tensor, is multiplied by the difference (min_inv_rho - max_inv_rho), added to max_inv_rho, and the result is raised to the power of rho, effectively creating a non-linear noise schedule. The use of rho as an exponent allows for fine-tuning the curve of the noise schedule.
What can be difficult to comprehend is that, for values between 0 and 1 (such as sigma_min), most operations behave "inverted": division effectively multiplies, multiplication effectively divides, and powers are inverted as well. For such values the inv intermediates are higher than the original values, and the final power of rho brings them back down, reversing the operation.
As demonstrated here
>>> 0.2 ** (1/8)
0.8177654339579425
>>> 0.8177654339579425 ** 8
0.20000000000000007
Now the question remains: do I, as a ComfyUI user, get to use rho?
The short answer is no, but also yes. The rho value defaults to 7 if we select the scheduler inside the KSampler node. However, there is a dedicated node just for the Karras scheduler that gives access to the rho parameter.
Karras is a scheduling strategy presented in the paper of the same name (Karras et al., 2022) by a research group from NVIDIA, which introduces multiple new algorithms for the sake of a clearer design space. The value 7 has been found to be a good default for a well-balanced design space.
To sum up, Karras is a deterministic and universal (model- and sampler-agnostic) scheduler that produces a non-linear noise schedule. Rho controls how much the steps near sigma_min are shortened at the expense of longer steps near sigma_max (1).
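To make the effect of rho tangible, here is a plain-float sketch of the same formula (my rewrite, not the original code); sigma_min = 0.03 and sigma_max = 14.6 are just illustrative values, roughly in the range of Stable-Diffusion-style models:
def karras_sigmas(n, sigma_min, sigma_max, rho=7.0):
    # Plain-float sketch of get_sigmas_karras above.
    max_inv_rho = sigma_max ** (1 / rho)
    min_inv_rho = sigma_min ** (1 / rho)
    sigmas = [(max_inv_rho + i / (n - 1) * (min_inv_rho - max_inv_rho)) ** rho
              for i in range(n)]
    return sigmas + [0.0]

# rho=1 degenerates into a straight line from sigma_max to sigma_min; the default
# rho=7 bends the curve so that the fine-grained, small steps happen at low sigmas.
print(karras_sigmas(8, 0.03, 14.6, rho=1.0))
print(karras_sigmas(8, 0.03, 14.6, rho=7.0))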
Exponential
def get_sigmas_exponential(n, sigma_min, sigma_max, device='cpu'):
"""Constructs an exponential noise schedule."""
sigmas = torch.linspace(math.log(sigma_max), math.log(sigma_min), n, device=device).exp()
return append_zero(sigmas)
This one also follows Signature 2 and looks quite simple. Again it returns append_zero applied to sigmas, which is given by a linear distribution from the log of sigma_max to the log of sigma_min, on which exp is then applied member-wise.
The logarithm of a number gives the exponent that the base needs to be raised to in order to produce that number. If the base is not given in Python, it defaults to e, the base of the natural logarithm and exponential function.
>>> math.e
2.718281828459045
>>> math.log(0.2)
-1.6094379124341003
>>> math.e ** -1.6094379124341003
0.20000000000000004
So a similar mechanism takes place here: we first apply the inverse operation (log) to our bounds to build the distribution, and then return to the actual sigma range with the intended operation (exp).
Exponential is straightforward: it gives a deterministic, exponentially spaced distribution.
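A plain-float sketch of the same idea (my rewrite) shows why "exponential" really means "geometric spacing":
import math

def exponential_sigmas(n, sigma_min, sigma_max):
    # Plain-float sketch of get_sigmas_exponential above.
    log_max, log_min = math.log(sigma_max), math.log(sigma_min)
    logs = [log_max + i / (n - 1) * (log_min - log_max) for i in range(n)]
    return [math.exp(v) for v in logs] + [0.0]

# Even spacing in log-space means each sigma is the previous one times a constant
# factor, i.e. a geometric progression:
print(exponential_sigmas(5, 0.5, 8.0))   # ≈ [8.0, 4.0, 2.0, 1.0, 0.5, 0.0]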
SGM Uniform
SGM Uniform uses the Normal implementation with the sgm flag set to True, bound via the partial function in the handler registration. This leads to another observation.
So what is SGM, and what does it do?
Score-based generative models (SGMs) sample from a target distribution by iteratively transforming noise using the score function of the perturbed target.(1)
Ok, but this sounds very dynamic. Is this reflected in the implementation?
def normal_scheduler(model_sampling, steps, sgm=False, floor=False):
s = model_sampling
start = s.timestep(s.sigma_max)
end = s.timestep(s.sigma_min)
...
if sgm:
timesteps = torch.linspace(start, end, steps + 1)[:-1]
...
sigs = []
for x in range(len(timesteps)):
ts = timesteps[x]
sigs.append(float(s.sigma(ts)))
...
return torch.FloatTensor(sigs)
Based on the implementation, no such dynamic becomes apparent. After some research I found out that this schedule is tailored to a specific category of models (1). So contrary to the Signature 2 schedulers, this one is very use-case specific.
Anyway, in the algorithm we increase the target step count by one, stretching the linear distribution slightly (and thereby reducing the step size), before cutting off the last entry (countering the previously added step).
I have created a dummy distribution function to demonstrate.
def distribution(minV, maxV, n):
if n < 3:
return [minV, maxV]
result = []
step = (maxV - minV) / (n - 1)
for x in range(n-1):
result.append(minV + x * step)
result.append(maxV)
return result
Which given
print(distribution(0.2, 0.8, 8)) # min_sigma, max_sigma, steps
print(distribution(0.2, 0.8, 8+1)[:-1])
produces
[0.2, 0.28571428571428575, 0.37142857142857144, 0.4571428571428572, 0.5428571428571429, 0.6285714285714287, 0.7142857142857144, 0.8] # normal
[0.2, 0.275, 0.35000000000000003, 0.42500000000000004, 0.5, 0.5750000000000001, 0.6500000000000001, 0.7250000000000001] # sgm uniform
As we can see, the values of SGM Uniform are spread more uniformly across the range and do not end at the timestep of sigma_min.
Simple
def simple_scheduler(model_sampling, steps):
s = model_sampling
sigs = []
ss = len(s.sigmas) / steps
for x in range(steps):
sigs += [float(s.sigmas[-(1 + int(x * ss))])]
sigs += [0.0]
return torch.FloatTensor(sigs)
Again a function of Signature 1 which is supposedly "simple", according to its name.
We again return a tensor of sigmas ending in a 0.0 sigma. We iterate over the number of steps and add the sigmas of the model at index -(1 + int(x * ss)), where x * ss is truncated to an integer. This effectively indexes the sigma values of the model in reverse order, from most noise to least.
ss is the ratio of the model's total sigma count to the steps. So if the model has 12 sigmas and we have 4 steps, ss is 3, effectively picking indices -1, -4, -7 and -10 of the model sigmas (see the quick check below).
The Simple scheduler simply relates the target steps to the sigmas of the model and, based on that ratio, cherry-picks them from end to start at the regular interval given by the ratio.
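A quick sanity check of that indexing, sketched with a hypothetical 12-entry sigma list (my own example):
# Which model sigma indices does Simple pick?
model_sigmas = list(range(12))       # stand-in for model_sampling.sigmas
steps = 4
ss = len(model_sigmas) / steps       # 3.0
picked = [-(1 + int(x * ss)) for x in range(steps)]
print(picked)                        # [-1, -4, -7, -10]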
DDIM Uniform
Finally we get to our special snowflake, the misunderstood cook that swam against the stream of all the other schedulers!
def ddim_scheduler(model_sampling, steps):
s = model_sampling
sigs = []
x = 1
if math.isclose(float(s.sigmas[x]), 0, abs_tol=0.00001):
steps += 1
sigs = []
else:
sigs = [0.0]
ss = max(len(s.sigmas) // steps, 1)
while x < len(s.sigmas):
sigs += [float(s.sigmas[x])]
x += ss
sigs = sigs[::-1]
return torch.FloatTensor(sigs)
It follows Signature 1 and, like most, returns a torch.FloatTensor of sigs.
The [::-1] index (start:stop:step) effectively reverses the sigma sequence. This time the iteration is done through a while loop in which x goes all the way up to the length of the model_sampling sigmas.
x is incremented by ss, which is the count of model sigmas floor-divided by steps, with a minimum value of 1.
This is similar to the Simple scheduler, which takes discrete steps through the full list of model sigmas based on the step size (the ratio of sigmas to steps). Conceptually it does most of what Simple does, at different moments and with slightly different syntax. For example, Simple divided the model sigma count by steps (with a normal /) and, inside the loop, multiplied x by ss before casting it to int. Here the number of sigmas is divided by steps using //, forcing the nearest lower integer, which is then clamped to at least 1. And instead of indexing with a for loop and multiplying the index by ss, ss is added to x inside a while loop, which usually implies a less deterministic loop end condition. Two implementations with roughly the same result.
However, as SGM Uniform was the counterpart to Normal, this DDIM Uniform seems to be the counterpart to Simple. This analysis could suffice if the results of Simple and DDIM Uniform were identical, but as you saw, they could not be further apart. I have to dig deeper into the nuances.
First of all, x, which becomes the indexer inside the loop, is initialized at 1, meaning the model's first sigma (indexed at 0) is never used; and because the loop then strides upward from 1 in increments of ss, the model's very highest sigma at the other end of the list can be missed as well, especially at low step counts. Because of this initial offset of 1, every successive index is offset by 1 too. The distribution is then padded with 0.0.
I reimplemented the two schedulers using a dummy list of 80 sigmas and got these two results on three different step counts: 4, 12, 32.
# At 4 steps
Simple: [0.9875, 0.7375, 0.4875, 0.2375, 0.0]
DDIM: [0.7625, 0.5125, 0.2625, 0.0125, 0.0] # evenly spaced at 0.25
# At 12 steps
Simple: [0.9875, 0.9125, 0.825, 0.7375, 0.6625, 0.575, 0.4875, 0.4125, 0.325, 0.2375, 0.1625, 0.075, 0.0]
DDIM: [0.9875, 0.9125, 0.8375, 0.7625, 0.6875, 0.6125, 0.5375, 0.4625, 0.3875, 0.3125, 0.2375, 0.1625, 0.0875, 0.0125, 0.0] # evenly spaced at 0.0750
# At 32 steps
Simple: [0.9875, 0.9625, 0.925, 0.9, 0.8625, 0.8375, 0.8, 0.775, 0.7375, 0.7125, 0.675, 0.65, 0.6125, 0.5875, 0.55, 0.525, 0.4875, 0.4625, 0.425, 0.4, 0.3625, 0.3375, 0.3, 0.275, 0.2375, 0.2125, 0.175, 0.15, 0.1125, 0.0875, 0.05, 0.025, 0.0]
DDIM: [0.9875, 0.9625, 0.9375, 0.9125, 0.8875, 0.8625, 0.8375, 0.8125, 0.7875, 0.7625, 0.7375, 0.7125, 0.6875, 0.6625, 0.6375, 0.6125, 0.5875, 0.5625, 0.5375, 0.5125, 0.4875, 0.4625, 0.4375, 0.4125, 0.3875, 0.3625, 0.3375, 0.3125, 0.2875, 0.2625, 0.2375, 0.2125, 0.1875, 0.1625, 0.1375, 0.1125, 0.0875, 0.0625, 0.0375, 0.0125, 0.0] # evenly spaced at .0250
Of course this even spacing is due to the evenly distributed sigma sequence I provided as input, but nonetheless DDIM maintains a balanced distribution while Simple does not.
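For reference, here is a minimal sketch of that reimplementation. The exact dummy list is my assumption (the text above only says "80 sigmas"), but with 80 evenly spaced values from 0.0 to 0.9875 it reproduces the numbers shown above:
import math

# Assumed dummy input: 80 evenly spaced sigmas, ordered low to high like model_sampling.sigmas
dummy_sigmas = [i / 80 for i in range(80)]

def simple_like(sigmas, steps):
    ss = len(sigmas) / steps
    return [float(sigmas[-(1 + int(x * ss))]) for x in range(steps)] + [0.0]

def ddim_like(sigmas, steps):
    x = 1
    if math.isclose(float(sigmas[x]), 0, abs_tol=0.00001):
        steps += 1
        sigs = []
    else:
        sigs = [0.0]
    ss = max(len(sigmas) // steps, 1)
    while x < len(sigmas):
        sigs += [float(sigmas[x])]
        x += ss
    return sigs[::-1]

for steps in (4, 12, 32):
    print(steps, simple_like(dummy_sigmas, steps))
    print(steps, ddim_like(dummy_sigmas, steps))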
I also noticed something else.
Simple: [0.9875, 0.9625, 0.925, 0.9, 0.8625, 0.8375, 0.8, 0.775, 0.7375, 0.7125, 0.675, 0.65, 0.6125, 0.5875, 0.55, 0.525, 0.4875, 0.4625, 0.425, 0.4, 0.3625, 0.3375, 0.3, 0.275, 0.2375, 0.2125, 0.175, 0.15, 0.1125, 0.0875, 0.05, 0.025, 0.0] with 33 entries
DDIM: [0.9875, 0.9625, 0.9375, 0.9125, 0.8875, 0.8625, 0.8375, 0.8125, 0.7875, 0.7625, 0.7375, 0.7125, 0.6875, 0.6625, 0.6375, 0.6125, 0.5875, 0.5625, 0.5375, 0.5125, 0.4875, 0.4625, 0.4375, 0.4125, 0.3875, 0.3625, 0.3375, 0.3125, 0.2875, 0.2625, 0.2375, 0.2125, 0.1875, 0.1625, 0.1375, 0.1125, 0.0875, 0.0625, 0.0375, 0.0125, 0.0] with 41 entries
DDIM Uniform can produce more sigmas than steps, while Simple stays constant at steps + 1 entries.
Beta
def beta_scheduler(model_sampling, steps, alpha=0.6, beta=0.6):
total_timesteps = (len(model_sampling.sigmas) - 1)
ts = 1 - numpy.linspace(0, 1, steps, endpoint=False)
ts = numpy.rint(scipy.stats.beta.ppf(ts, alpha, beta) * total_timesteps)
sigs = []
last_t = -1
for t in ts:
if t != last_t:
sigs += [float(model_sampling.sigmas[int(t)])]
last_t = t
sigs += [0.0]
return torch.FloatTensor(sigs)
Here we have Signature 1 again and return a torch.FloatTensor of sigs, padded once more with a final 0.0 value for the last step. We iterate through ts, which is a sequence of values compared pairwise throughout the loop: if a value differs from the previous one, t is cast to int and used to index the sigma from the model. last_t is initialized to -1 as a nil value. So the question now is: what is ts?
First we initialize the value by filling a linear distribution (of which the last value is excluded (1)) and applying the member-wise operation 1 - member. In these domains between 0 and 1, subtracting from 1 inverts the number; think of it like flipping positive and negative space. A 0.2 (0.2 away from 0) becomes a 0.8 (0.2 away from 1).
Now we have a statistical distribution called beta, which represents a beta continuous random variable (1), of which the ppf method returns the percent point function (the inverse of cdf, the cumulative distribution function).
The short version is that the Beta distribution can be understood as representing a distribution of probabilities, that is, it represents all the possible values of a probability when we don't know what that probability is. (1)
This method takes a q, which is our ts, plus an alpha and a beta (both defaulting to 0.6). This distribution is then multiplied by the number of timesteps and rounded to integers (member-wise).
So without going too deep into statistical nuances: this approach picks sigmas of the model according to the shape of the beta distribution, guaranteeing that no two equal sigmas appear in succession.
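To see which timestep indices the beta shape actually selects, here is a small sketch of just the index computation; the hypothetical 100-sigma model is my own assumption, and the described clustering follows from the shape of Beta(0.6, 0.6) rather than from measured ComfyUI output:
import numpy
import scipy.stats

total_timesteps = 100 - 1
steps = 8
ts = 1 - numpy.linspace(0, 1, steps, endpoint=False)   # quantiles 1.0, 0.875, ..., 0.125
idxs = numpy.rint(scipy.stats.beta.ppf(ts, 0.6, 0.6) * total_timesteps)
print(idxs)
# Because Beta(0.6, 0.6) has a U-shaped density, its quantile function is flat near
# q=0 and q=1 and steep around q=0.5, so the chosen indices cluster towards both ends
# of the sigma list: extra resolution at very high and at very low noise levels.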
Linear Quadratic
def linear_quadratic_schedule(model_sampling, steps, threshold_noise=0.025, linear_steps=None):
if steps == 1:
sigma_schedule = [1.0, 0.0]
else:
if linear_steps is None:
linear_steps = steps // 2
linear_sigma_schedule = [i * threshold_noise / linear_steps for i in range(linear_steps)]
threshold_noise_step_diff = linear_steps - threshold_noise * steps
quadratic_steps = steps - linear_steps
quadratic_coef = threshold_noise_step_diff / (linear_steps * quadratic_steps ** 2)
linear_coef = threshold_noise / linear_steps - 2 * threshold_noise_step_diff / (quadratic_steps ** 2)
const = quadratic_coef * (linear_steps ** 2)
quadratic_sigma_schedule = [
quadratic_coef * (i ** 2) + linear_coef * i + const
for i in range(linear_steps, steps)
]
sigma_schedule = linear_sigma_schedule + quadratic_sigma_schedule + [1.0]
sigma_schedule = [1.0 - x for x in sigma_schedule]
return torch.FloatTensor(sigma_schedule) * model_sampling.sigma_max.cpu()
Well, this one is a beast. It follows Signature 1, has two additional inputs, threshold_noise and linear_steps, and (member-wise) multiplies the final torch.FloatTensor of sigma_schedule by the sigma_max of the model, meaning the entire distribution is put in relation to the maximum sigma of the model.
sigma_schedule is the result of a linear_sigma_schedule, followed by a quadratic_sigma_schedule and a 1.0. This compound sequence is then inverted (with the same 1.0 - x technique observed in Beta), making the final 1.0 the same 0.0 endpoint as in all the other schedulers. So now it is a matter of finding out how the linear_sigma_schedule and the quadratic_sigma_schedule are composed.
First of all, the function breaks out of the entire algorithm at a step count of 1, returning a simple [1, 0] schedule. If the input linear_steps is not provided, the algorithm defaults to steps floor-divided by 2, which determines where the linear distribution ends and the quadratic distribution begins (in the default case, the mathematical middle).
The linear_sigma_schedule is computed through a list comprehension in which each value from 0 to linear_steps is multiplied by threshold_noise (which defaults to 0.025) divided by the given number of linear_steps.
Before the quadratic_sigma_schedule is calculated, two intermediary values are derived from linear_steps. threshold_noise_step_diff is linear_steps minus the total step count multiplied by the threshold noise. quadratic_steps is the step count remaining when linear_steps are removed from the total steps. The coefficient quadratic_coef is given by threshold_noise_step_diff divided by linear_steps multiplied by quadratic_steps squared, which is probably where this scheduler gets its name from. linear_coef is then threshold_noise divided by linear_steps, minus twice the ratio of threshold_noise_step_diff to quadratic_steps squared.
Finally, a constant const is calculated as quadratic_coef multiplied by linear_steps squared. The quadratic_sigma_schedule is then given by a list comprehension from linear_steps to steps (the range from the midpoint to the end), in which each step index is squared and multiplied by quadratic_coef, added to the index multiplied by linear_coef, and finally added to const.
Once again this is a rather verbose shaping function, which up to a certain point follows a linear progression and then becomes quadratic towards the end, where the contrast in shape is controlled by the threshold_noise value. I checked, however, and unlike exponential and karras there is no dedicated linear-quadratic scheduler node to customize the parameters.
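To visualize the shape, here is a plain-float sketch of the same formula (my rewrite; the printed values are approximate and sigma_max=1.0 is just an illustrative choice):
def linear_quadratic_sigmas(steps, sigma_max=1.0, threshold_noise=0.025, linear_steps=None):
    # Plain-float sketch of linear_quadratic_schedule above.
    if steps == 1:
        return [sigma_max, 0.0]
    if linear_steps is None:
        linear_steps = steps // 2
    linear = [i * threshold_noise / linear_steps for i in range(linear_steps)]
    diff = linear_steps - threshold_noise * steps
    quadratic_steps = steps - linear_steps
    quadratic_coef = diff / (linear_steps * quadratic_steps ** 2)
    linear_coef = threshold_noise / linear_steps - 2 * diff / (quadratic_steps ** 2)
    const = quadratic_coef * (linear_steps ** 2)
    quadratic = [quadratic_coef * (i ** 2) + linear_coef * i + const
                 for i in range(linear_steps, steps)]
    schedule = linear + quadratic + [1.0]
    return [sigma_max * (1.0 - x) for x in schedule]

# With the defaults the schedule barely moves during the linear half
# and then plunges quadratically towards zero:
print(linear_quadratic_sigmas(8))
# ≈ [1.0, 0.994, 0.9875, 0.981, 0.975, 0.909, 0.725, 0.422, 0.0]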
KL Optimal
def kl_optimal_scheduler(n: int, sigma_min: float, sigma_max: float) -> torch.Tensor:
adj_idxs = torch.arange(n, dtype=torch.float).div_(n - 1)
sigmas = adj_idxs.new_zeros(n + 1)
sigmas[:-1] = (adj_idxs * math.atan(sigma_min) + (1 - adj_idxs) * math.atan(sigma_max)).tan_()
return sigmas
Our last built-in function completes our self-sufficient Signature 2 schedulers, with no additional parameters. It returns sigmas unaltered, which implies that we already produce and operate on a torch.Tensor somewhere inside the function.
It all begins with adj_idxs, given by torch.arange counting from 0 to n-1 in increments of 1 (since everything is defaulted (1)), which is then (member-wise) divided by n - 1, yielding values running from 0 to 1.
sigmas is then created as a series of zeros of size n + 1, with the same dtype as adj_idxs (1).
The assignment to sigmas with the [:-1] index replaces everything except the last value. Since new_zeros allocated one entry more than n, this fills sigmas except for a trailing 0.0, yet another approach to achieve the zero padding of the other implementations. A nice example of how the concept of an algorithm can be translated using different implementation techniques.
The values before the final zero are given by a member-wise operation: each adjusted index is multiplied by atan(sigma_min) and added to its inverse (1 - adj_idxs) multiplied by atan(sigma_max), and the member-wise tangent is then taken of the result.
In other words, the schedule interpolates linearly between atan(sigma_max) and atan(sigma_min) and maps the result back through tan. Since atan squashes the sigma range into a bounded interval and tan is steep near atan(sigma_max), equal steps in arctangent space translate into large sigma decrements at the high-noise end and progressively finer steps towards sigma_min, all based purely on sigma_min and sigma_max.
Conclusion
Schedulers are the algorithms preparing the sequence of noise levels, called the noise schedule, for the samplers to solve. Conceptually the noise schedule is a distribution going from a lot of noise at the start to no noise at the end. We have generally seen two approaches to this.
One more thing: being now able to visualize the effect of a scheduler as a distribution with a certain shape given by the algorithm, that shape will probably become more apparent with more data points, or in this case a higher step count. So I repeated the experiment with 1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 60 and 120 steps, this time with the Heun2pp sampler.
Signature 1 functions are strongly tied to the model's specific sigmas.
Normal
The normal scheduler distributes timesteps linearly and then picks sigmas from the model based on those timesteps. This results in acceptable quality in relation to each step count. It also provides a good benchmark to compare the other schedulers against.
SGM Uniform
SGM Uniform stretches the timestep distribution, tightening the interval and skipping the last entry, which results in uniform steps. Other than that it works like Normal (as it uses the same function implementation). Normal and SGM Uniform both converge to almost the same picture at 120 steps. However, at 60 steps SGM Uniform produces a Pikachu that is more consistent (proportion-wise) with its results at lower steps. SGM Uniform also hallucinates some geometric patterns into the background in the middle steps, where Normal renders clouds of various detail levels, except at 24. I also find that SGM Uniform outperforms Normal at lower step counts.
Simple
Simple indexes sigmas directly, at a regular interval given by the ratio of sigmas to steps. Since both Normal and Simple rely on the built-in sigmas of the model, they converge to a similar result. What becomes apparent on closer inspection is that Simple is always a few steps ahead of Normal and can produce slightly crisper results at lower steps. This could be due to the slight indirection Normal introduces by using timesteps to resolve sigmas, instead of directly indexing the sigmas inside the model_sampling layer. The biggest discrepancy in quality happens between steps 8 and 32, but both converge to the same result at 60 and 120. (Pikachu, however, has his eyes closed at 60 with the simple scheduler.)
DDIM Uniform
DDIM Uniform prioritizes a uniform (direct) indexing strategy, not always respecting the step count. The algorithm skips the first index (and, at low step counts, the highest sigma), which means that almost no denoising takes place early on. This explains why DDIM Uniform takes so long to warm up. We have also seen that it produces uniform steps by pushing the total number of sigmas beyond the step count, which should be visible in the total diffusion time. If we compare a low step count like 8 between simple (70.36s) and DDIM Uniform (72.37s) we can see a slightly slower result, which is confirmed at 120 steps with 1125.16s and 1132.57s respectively.
Beta
Beta maps a beta-distributed spread onto the timesteps, which are then used to index the existing sigmas (a similar indirection as Normal), while avoiding adding identical sigmas in succession. As the algorithm suggests, the Beta scheduler produces the most varied results depending on its step count. Compared to Normal, Beta also has the most varied styles and details, which can be observed on the bear, the city and Pikachu between steps 8 and 24. However, when converging at steps 60 and 120, while the results look similar to those of the other schedulers, Beta tends to resolve more than necessary, ending up removing details: Pikachu loses some limbs and generally the picture gets destroyed a bit. If we recall the algorithm, the exact sigmas picked depend heavily on the step count, which makes the results hard to predict and can explain this behavior.
Linear Quadratic
Finally, linear_quadratic is a compound sequence with a linear part and a quadratic part, split by default at the halfway point of the timesteps. The Linear Quadratic scheduler produces very interesting results at lower steps like 2 and 4, and is the fastest to converge to the final design at around 8 and 12 steps, though it starts to cut off parts of the composition, which can be observed in the city backdrop.
Then we have the Signature 2 schedulers, which take a different approach: more deterministic and self-contained, ignoring the existing sigma distribution of the model and generating their own.
Karras
We have Karras, which creates a linear distribution from 0 to 1 combined with a custom shaping function given by rho, sigma_min and sigma_max. Karras is the slowest to converge. For example, we can see the Pikachu ears on the bear only disappearing at very high step counts. It is behind Normal by many steps: the result at 24 steps only then comes close to the result of Normal at 8 steps. It converges to a different result at 120 steps, which resembles the result of Normal at 28 steps. This is probably due to the low default rho. More experiments need to be done with the custom node, which allows customizing the shape of the distribution.
Exponential
The exponential scheduler returns an exponentially spaced distribution based on sigma_min and sigma_max. We compare it to our Normal benchmark and to Karras. It is even slower to converge to the results of Normal: the result of Exponential at 60 steps corresponds to something between steps 8 and 12 of Normal, and step 120 of Exponential corresponds to step 20 of Normal, with slightly different proportions and details. Compared to Karras it produces almost identical results at lower steps, until Karras drifts off with slightly more varied results.
KL Optimal
And finally KL Optimal, which interpolates between sigma_max and sigma_min in arctangent space. Again we compare it to our Normal benchmark, but I would also like to compare it to Beta, since both deviate noticeably from an evenly spaced schedule. It is the only scheduler producing an empty black image at 1 step. At lower steps the results are very similar to Normal; I feel that at 8 steps the bear has the angriest expression with KL Optimal. It converges faster than Karras and Exponential, almost keeping up with Normal. Interestingly enough, the results at higher step counts are very different, echoing the behavior observed with Beta. At 120 steps the result looks very similar to the one of Normal at 60 steps. Compared to Beta, KL Optimal does not show as much variation at lower steps; observe, for example, the difference between steps 8 and 24. The variation only starts to appear at higher step counts.
This shows me that with Flux.Schnell it is better to stick with the Signature 1 schedulers. DDIM Uniform needs higher step counts to get started, while Simple and Normal are good base options. Beta produces the most unpredictable results with respect to step count, and Linear Quadratic is an efficient scheduler that converges faster at lower steps.
Signature 2 schedulers, on the other hand, are very slow to converge and need higher step counts for the most variation. I think they are only worth using when controlled and customized through the custom nodes specifically designed for them.
Samplers
For the samplers I repeated the same experiment, with the same seed, prompt and a fixed step count, twice for each sampler: once with the normal scheduler (right) and once with the beta scheduler (left). I chose a step count low enough that, when in doubt, we can follow the process step by step, and high enough to get past the threshold where some schedulers underperform because of how their distributions are shaped (looking at you, DDIM Uniform).
What should become apparent is that not every sampler works with Flux.Schnell: euler_cfg_pp, euler_ancestral_cfg_pp, dpmpp_2s_ancestral_cfg_pp, dpmpp_sde, dpmpp_sde_gpu, dpmpp_2m_cfg_pp, dpmpp_2m_sde, dpmpp_2m_sde_gpu, dpmpp_3m_sde, dpmpp_3m_sde_gpu, res_multistep_cfg_pp, uni_pc, uni_pc_bh2 and ddpm all perform very poorly (at least at 8 steps). We can see some common features.
All the samplers ending in cfg_pp and _sde have problems, as do uni_pc and ddpm. The former tells me that Flux is not well optimized for the cfg and sde variations, while the latter tells me that the uni_pc and ddpm samplers might be outdated or not well suited to the Flux architecture.
We also have several different families: Euler, Heun, DPM (with the most variations and improvements), LMS, DDPM, LCM, IPNDM, Deis, Res Multistep, Gradient Estimation, DDIM and Uni PC.
While some samplers have variants of their own, which we will cover in their implementation analysis, most of them share common variants with familiar names: Ancestral, PP, CFG and SDE, which we will briefly cover here before diving in.
Ancestral
To understand ancestral samplers, we need to remember an observation from our scheduler experiment: the point of convergence, where a design no longer changes much and no new variants are introduced. The design stays stable, and further denoising only adds details or improves the crispness of the image. Ancestral samplers are different.
An ancestral sampler adds noise to the image at each sampling step. They are stochastic samplers because the sampling outcome has some randomness to it... using an ancestral sampler is that the image would not converge.(1)
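To make that concrete, here is a conceptual sketch of a single ancestral (Euler-style) step, modeled loosely on how I understand the k-diffusion samplers; the helper and its names are my own simplification, not the ComfyUI code:
import torch

def ancestral_euler_step(x, denoised, sigma, sigma_next, eta=1.0):
    # Split the move to sigma_next into a deterministic part (down to sigma_down)
    # and a stochastic part (fresh noise scaled by sigma_up), chosen so that
    # sigma_down**2 + sigma_up**2 == sigma_next**2.
    sigma_up = min(sigma_next, eta * (sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2) ** 0.5)
    sigma_down = (sigma_next**2 - sigma_up**2) ** 0.5
    d = (x - denoised) / sigma                 # estimated noise direction at the current sigma
    x = x + d * (sigma_down - sigma)           # deterministic Euler step down to sigma_down
    return x + torch.randn_like(x) * sigma_up  # re-inject fresh noise: the "ancestral" part
The fresh torch.randn_like draw at every step is why the image never fully converges: each step removes noise and then injects a little new randomness.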
CFG
Classifier-Free Guidance (CFG) is a technique used in diffusion models to improve the quality and controllability of generated samples. It has become a crucial component in state-of-the-art text-to-image generation systems, such as Stable Diffusion.
... This interpolation is controlled by a guidance scale, which determines how much influence the conditioning information has on the generation process. A higher guidance scale typically results in outputs that more closely match the input prompt, but can also lead to artifacts and reduced sample diversity. (1)
However, as of the time of writing this article, and to my knowledge, Flux.1 is not affected by CFG, which might explain the incompatibility with the samplers emphasizing it. Or rather, the advice is to keep CFG < 1, as Flux does not interact with the negative prompt. (1)
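As a reference for what that interpolation looks like, here is a minimal sketch of the classifier-free guidance mix (illustrative names and tensors, not ComfyUI's implementation):
import torch

def cfg_mix(cond_pred: torch.Tensor, uncond_pred: torch.Tensor, scale: float) -> torch.Tensor:
    # scale = 1.0 returns the conditional prediction unchanged; higher values push
    # the result further away from the unconditional (negative-prompt) prediction.
    return uncond_pred + scale * (cond_pred - uncond_pred)

cond, uncond = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
guided = cfg_mix(cond, uncond, scale=7.0)
With the scale fixed at 1.0 the unconditional branch cancels out completely, which fits the advice above to keep CFG low for Flux.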
PP
The pp most likely stands for ++, as is the case with "c++" and .cpp files (1). So it is generally a suffix for an improved version.
SDE and ODE
According to (1), SDE and ODE are mathematically derived techniques called Stochastic Differential Equations and Ordinary Differential Equations, where SDEs seem to be related to score-based diffusion models.
Score-based diffusion models convert data into noise through a diffusion process governed by a stochastic differential equation (SDE). Generating new data points is achieved by sampling noise particles and simulating a reverse-time dynamics of this diffusion process, driven by an equation known as the reverse SDE. The reverse SDE has a closed-form expression which depends solely on the time-dependent gradient field (the so-called score) of the logarithm of the perturbed data distribution.
ODEs, on the other hand, describe deterministic dynamics governed by a probability flow.
To evaluate the quality of images created with diffusion models, a metric called the Fréchet inception distance (FID) score is used.
For those curious, here is a paper discussing SDEs in depth in the context of market and stock price evolution, and the same goes for ODEs in this article. We are only interested in the concepts, how they relate to sampling and how they affect the result.
ODEs are differential equations in a single independent variable, whose unknowns are functions and which involve their derivatives. In an SDE that progression additionally has a random component.
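For the mathematically inclined, the score-based formulation that the quote above paraphrases (Song et al.) can be summarized compactly; this is general background material, not something taken from the ComfyUI code:
\begin{aligned}
\text{forward SDE:} \quad & dx = f(x,t)\,dt + g(t)\,dw \\
\text{reverse SDE:} \quad & dx = \big[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w} \\
\text{probability flow ODE:} \quad & dx = \big[f(x,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\big]\,dt
\end{aligned}
Here \nabla_x \log p_t(x) is the score the quote mentions; the SDE keeps the random d\bar{w} term, while the probability flow ODE drops it, which is exactly the stochastic versus deterministic distinction above.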
For example Euler, Heun and LMS are considered to be old-school ODE solvers (1). The SDE samplers, meanwhile... don't really work with Flux.
As we did with the schedulers, let's try to find the entry point where all samplers are registered. This gives us the application context of the algorithm and lets us identify any surface-level differences between them.
def ksampler(sampler_name, extra_options={}, inpaint_options={}):
    if sampler_name == "dpm_fast":
        ...
    elif sampler_name == "dpm_adaptive":
        ...
    else:
        sampler_function = getattr(k_diffusion_sampling, "sample_{}".format(sampler_name))

    return KSAMPLER(sampler_function, extra_options, inpaint_options)
In the ksampler function, which is used as a middle step to resolve the name to the implementation inside the KSampler class, we can already observe a few things.
- dpm_fast and dpm_adaptive get special treatment, probably some pre-processing/preparation steps before the main algorithm
- getattr is used to turn the name into the string "sample_" plus the sampler name and resolve it to a Python symbol (a function in this case); see the small sketch after this list
- The sampler_function is then passed into a KSAMPLER construct together with the options and inpaint options.
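For anyone unfamiliar with this dispatch pattern, here is a tiny standalone sketch of the same getattr-based resolution with made-up names (FakeSamplingModule and its contents are purely illustrative, not part of ComfyUI):
# hypothetical stand-in for the k_diffusion_sampling module
class FakeSamplingModule:
    @staticmethod
    def sample_euler(x):
        return x  # placeholder implementation

def resolve(sampler_name):
    # "euler" -> attribute name "sample_euler" -> the function object itself
    return getattr(FakeSamplingModule, "sample_{}".format(sampler_name))

fn = resolve("euler")
print(fn(42))  # prints 42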
KSAMPLER is a class that has a sample method.
def sample(self, model_wrap, sigmas, extra_args, callback, noise, latent_image=None, denoise_mask=None, disable_pbar=False):
    extra_args["denoise_mask"] = denoise_mask
    model_k = KSamplerX0Inpaint(model_wrap, sigmas)
    model_k.latent_image = latent_image
    if self.inpaint_options.get("random", False): #TODO: Should this be the default?
        generator = torch.manual_seed(extra_args.get("seed", 41) + 1)
        model_k.noise = torch.randn(noise.shape, generator=generator, device="cpu").to(noise.dtype).to(noise.device)
    else:
        model_k.noise = noise

    noise = model_wrap.inner_model.model_sampling.noise_scaling(sigmas[0], noise, latent_image, self.max_denoise(model_wrap, sigmas))

    k_callback = None
    total_steps = len(sigmas) - 1
    if callback is not None:
        k_callback = lambda x: callback(x["i"], x["denoised"], x["x"], total_steps)

    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
    samples = model_wrap.inner_model.model_sampling.inverse_noise_scaling(sigmas[-1], samples)
    return samples
This is where the magic happens. sample takes a model_wrap, sigmas (the noise schedule from our scheduler), some extra args, a callback, the noise and the latent image, as well as a mask.
Here we can see the main flow. The step count is derived from the sigmas by subtracting 1; sigmas is the noise schedule produced by the schedulers. The noise is scaled by the inner_model's model_sampling via its noise_scaling function, based on the maximum sigma (the first entry of our noise schedule).
The sampling pass is stored in samples by running the sampler_function with the wrapped diffusion model, the noise and the sigmas. The result is then passed into the inner_model's inverse_noise_scaling with the last (lowest) sigma entry, which prepares the samples for the steps that follow.
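To make the relationship between sigmas and steps tangible, here is a hypothetical schedule with made-up values:
import torch

# made-up noise schedule with 5 sigma values; the trailing 0.0 is what the schedulers append
sigmas = torch.tensor([14.61, 7.49, 3.07, 0.93, 0.0])
total_steps = len(sigmas) - 1  # 4 denoising steps, one per pair of neighbouring sigmas
print(total_steps)  # 4
Each step moves the latent from one sigma to the next, which is why a schedule of N + 1 sigmas yields N steps.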
Inside the KSampler class I have also found this implementation of sample.
def sample(self, noise, positive, negative, cfg, latent_image=None, start_step=None, last_step=None, force_full_denoise=False, denoise_mask=None, sigmas=None, callback=None, disable_pbar=False, seed=None):
    if sigmas is None:
        sigmas = self.sigmas

    if last_step is not None and last_step < (len(sigmas) - 1):
        sigmas = sigmas[:last_step + 1]
        if force_full_denoise:
            sigmas[-1] = 0

    if start_step is not None:
        if start_step < (len(sigmas) - 1):
            sigmas = sigmas[start_step:]
        else:
            if latent_image is not None:
                return latent_image
            else:
                return torch.zeros_like(noise)

    sampler = sampler_object(self.sampler)

    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
At first glance it looks like a recursive function, but if you examine the signature carefully, it reveals itself to be yet another wrapper around another sample function.
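Before following that wrapper, the slicing at the top of the method is worth a quick illustration with made-up numbers (presumably what start_step and last_step are there for, e.g. partial denoising):
# made-up 5-entry schedule, purely illustrative
sigmas = [14.6, 7.5, 3.1, 0.9, 0.0]
last_step, start_step = 3, 1
sigmas = sigmas[:last_step + 1]   # keep everything up to last_step  -> [14.6, 7.5, 3.1, 0.9]
sigmas = sigmas[start_step:]      # drop the first start_step entries -> [7.5, 3.1, 0.9]
With that aside, here is the sample function it delegates to: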
def sample(model, noise, positive, negative, cfg, device, sampler, sigmas, model_options={}, latent_image=None, denoise_mask=None, callback=None, disable_pbar=False, seed=None):
    cfg_guider = CFGGuider(model)
    cfg_guider.set_conds(positive, negative)
    cfg_guider.set_cfg(cfg)
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
This function introduces the CFGGuider, a class with yet another collection of sample methods. Following all the references to other functions inside this class would mean covering the full content of samplers.py (1120 lines of code), so I can't show a listing of every intermediate function I went through. Suffice it to say the entire process is hidden behind multiple layers for extensibility and flexibility. Inside inner_sampler of CFGGuider, the sampler class and its sampler.sample method are wrapped into an ExecutableWrapper.
This all resolves back to the sampler_object utility function from the start. However, there is an interesting observation:
def sampler_object(name):
    if name == "uni_pc":
        sampler = KSAMPLER(uni_pc.sample_unipc)
    elif name == "uni_pc_bh2":
        sampler = KSAMPLER(uni_pc.sample_unipc_bh2)
    elif name == "ddim":
        sampler = ksampler("euler", inpaint_options={"random": True})
    else:
        sampler = ksampler(name)
    return sampler
The names are resolved through the ksampler factory, except for uni_pc, uni_pc_bh2 and ddim; the first two are, coincidentally, among the samplers that underperformed in our experiment. However, one pressing question remains open. The process has been described as iterative, and the reason I am doing all this backtracking is to find the context in which the sampling is iterated. Yet no matter how far I trace it, I can't find any loops or recursive calls.
def sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=1.0, disable_noise=False, start_step=None, last_step=None, force_full_denoise=False, noise_mask=None, sigmas=None, callback=None, disable_pbar=False, seed=None):
    sampler = comfy.samplers.KSampler(model, steps=steps, device=model.load_device, sampler=sampler_name, scheduler=scheduler, denoise=denoise, model_options=model.model_options)

    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
    samples = samples.to(comfy.model_management.intermediate_device())
    return samples
KSampler (not KSAMPLER and not ksampler) is instantiated inside yet another sample function.
(I am showing you this so you can comprehend the agony I am subjecting myself to.)
It seems that the iteration takes place inside the individual sampler functions after all. So let's get to it!
Euler
Since this is our first Sampler implementation, we are going to encounter a lot of new concepts and questions applicable to samplers in general, so bear with me, it is going to be a long section.
@torch.no_grad()
def sample_euler(model, x, sigmas, extra_args=None, callback=None, disable=None, s_churn=0., s_tmin=0., s_tmax=float('inf'), s_noise=1.):
    """Implements Algorithm 2 (Euler steps) from Karras et al. (2022)."""
    extra_args = {} if extra_args is None else extra_args
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1, disable=disable):
        if s_churn > 0:
            gamma = min(s_churn / (len(sigmas) - 1), 2 ** 0.5 - 1) if s_tmin <= sigmas[i] <= s_tmax else 0.
            sigma_hat = sigmas[i] * (gamma + 1)
        else:
            gamma = 0
            sigma_hat = sigmas[i]

        if gamma > 0:
            eps = torch.randn_like(x) * s_noise
            x = x + eps * (sigma_hat ** 2 - sigmas[i] ** 2) ** 0.5
        denoised = model(x, sigma_hat * s_in, **extra_args)
        d = to_d(x, sigma_hat, denoised)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigma_hat, 'denoised': denoised})
        dt = sigmas[i + 1] - sigma_hat
        # Euler method
        x = x + d * dt
    return x
We will proceed like we did with the scheduler analysis: from the outside in, and then from the end to the start. The first thing we can see here is the @torch.no_grad() decorator, a decorator (context manager) that disables gradient calculation, saving memory, which is useful for functions that are only used for inference (1). This tells us we are where we need to be: in the place where inference takes place.
Our inputs are: model (the model), x (which is passed in by the KSAMPLER as noise), sigmas (the noise schedule computed with the schedulers), extra_args (which we will ignore), callback (which we will ignore), disable, s_churn, s_tmin, s_tmax and s_noise.
We return x. What is x? Probably a tensor. But what is it conceptually? According to my earlier contextualization of the function, it is noise, though at the moment I don't really understand what that means. I thought noise comes from the sigmas. Maybe it is the noise generated from a seed?
My back-tracing ended up here, in the nodes.py file, calling comfy.sample.sample (do I have to comment on this any further?...).
def common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent, denoise=1.0, disable_noise=False, start_step=None, last_step=None, force_full_denoise=False):
    ...
    noise = comfy.sample.prepare_noise(latent_image, seed, batch_inds)
    ...
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
    ...
And this is where noise is generated:
def prepare_noise(latent_image, seed, noise_inds=None):
    """
    creates random noise given a latent image and a seed.
    optional arg skip can be used to skip and discard x number of noise generations for a given seed
    """
    generator = torch.manual_seed(seed)
    if noise_inds is None:
        return torch.randn(latent_image.size(), dtype=latent_image.dtype, layout=latent_image.layout, generator=generator, device="cpu")

    unique_inds, inverse = np.unique(noise_inds, return_inverse=True)
    noises = []
    for i in range(unique_inds[-1]+1):
        noise = torch.randn([1] + list(latent_image.size())[1:], dtype=latent_image.dtype, layout=latent_image.layout, generator=generator, device="cpu")
        if i in unique_inds:
            noises.append(noise)
    noises = [noises[i] for i in inverse]
    noises = torch.cat(noises, axis=0)
    return noises
The function called prepare_noise generates noises. Ufff....
But let's limit ourselves to what we need to know. Based on noise_inds it creates a collection of noise tensors. The noise is generated with torch.randn using the size, dtype and layout of the latent_image.
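To see why a fixed seed gives reproducible starting noise, here is the essence of that first branch in isolation; the latent shape below is made up, the real one depends on the model and resolution:
import torch

latent_image = torch.zeros([1, 4, 64, 64])  # hypothetical latent shape, purely illustrative

generator = torch.manual_seed(42)
noise_a = torch.randn(latent_image.size(), dtype=latent_image.dtype, layout=latent_image.layout, generator=generator, device="cpu")

generator = torch.manual_seed(42)
noise_b = torch.randn(latent_image.size(), dtype=latent_image.dtype, layout=latent_image.layout, generator=generator, device="cpu")

print(torch.equal(noise_a, noise_b))  # True: same seed, same starting noise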
Back to the sampler.
x is this collection of noise in the latent_image layout, and it is a tensor.
The update x = x + d * dt is the Euler method itself. So let us see what d and dt are.
This all takes place inside a loop: our sampling process!
dt is sigma_hat subtracted from the next sigma in the schedule. d is the result of the function to_d, which takes x (the noise), sigma_hat and denoised. Based on a comment in the to_d definition, it converts a denoiser output to a Karras ODE derivative. So d is the ODE derivative... of what?
denoised is the result of the model given x (the noise) and sigma_hat multiplied by s_in. This is where the algorithm meets the AI model. Great, so the model does what it learned: given noise, it denoises it. I think this is what all the wrapper layers were for: dependency injection, to get all the conditioning into the model without having to pass it explicitly through every layer down to the sampling method. My first guess at what the sampler does is now: given a certain noise schedule, the sampler prepares the noise for the model, lets the model denoise it, and then processes the denoised result back into the accumulated remaining noise, preparing it for the next iteration.
The derivative that to_d finds is therefore the one pointing from the current noise towards denoised, scaled by sigma_hat, which ties it back to our noise schedule.
Now we get to the complicated "mathy" stuff. Let's break it down. We have a branch based on s_churn, depending on whether it is positive or not. s_churn is part of a collection of parameters that can be passed into the function through **extra_options, but inside the ksampler factory I could not see any of them being passed. So I will use the default values from here on: 0. This sets gamma to 0 and sigma_hat to the current sigma of the schedule.
So s_churn, and in consequence gamma, appear to be parameters for adjusting the sigma (the standard deviation of the normal distribution for this sampling step). In our case sigma_hat is simply the current sigma.
The loop iterates over trange of the sigma count minus 1, probably because the calculation of dt looks one entry ahead. Remember how our schedulers always appended 0.0 at the end of the schedule? This could be the reason for that. trange is shorthand for tqdm(range(...)), that is, a standard range wrapped in a tqdm progress bar, which also accepts **kwargs.
tqdm is a customizable progress-bar decorator for iterators, with a default range iterator printing to stderr. In simpler words, this is what feeds the progress bar we see on top of the KSampler node during sampling, letting us follow the process while we continue working. It is relevant for the UI/UX but not for the sampling algorithm.
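If you have never used it, the pattern is as simple as this standalone toy loop (nothing ComfyUI-specific about it):
import time
from tqdm.auto import trange

for i in trange(4, desc="sampling"):  # renders a live progress bar while the loop runs
    time.sleep(0.1)  # stand-in for one denoising step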
So, to answer our question of how the sampling algorithm affects the result and how it can be conceptualized, we need to translate d and dt into concepts.
def to_d(x, sigma, denoised):
    """Converts a denoiser output to a Karras ODE derivative."""
    return (x - denoised) / utils.append_dims(sigma, x.ndim)
to_d takes x (the noise), sigma and the denoised sample, subtracts denoised from x, and divides the result by append_dims of sigma and x.ndim (the number of dimensions of the tensor (1)). append_dims is a utility function that appends trailing dimensions to a tensor until it reaches that number, conceptually a bit like the padding tricks used in machine learning, and probably done here so that the tensor operations broadcast correctly. But the whole (x - denoised) / sigma is very, very underwhelming. How can this produce a derivative? I will dig a bit deeper into tensor operations and maybe into the Karras paper. dt is the difference between the current sigma and the next.
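To convince myself that these few lines really are the whole trick, here is a toy, self-contained version of the loop with a fake "denoiser" that simply predicts an all-zero image; append_dims is re-implemented here following the same idea, and none of this is ComfyUI's actual code:
import torch

def append_dims(t, target_ndim):
    # add trailing singleton dimensions so the division below broadcasts
    return t[(...,) + (None,) * (target_ndim - t.ndim)]

def to_d(x, sigma, denoised):
    return (x - denoised) / append_dims(sigma, x.ndim)

def fake_denoiser(x, sigma):
    return x * 0.0  # pretend the "clean" image is all zeros

x = torch.randn(1, 4, 8, 8) * 14.6           # start from pure noise at the highest sigma
sigmas = torch.tensor([14.6, 7.5, 3.1, 0.9, 0.0])
for i in range(len(sigmas) - 1):
    denoised = fake_denoiser(x, sigmas[i])
    d = to_d(x, sigmas[i], denoised)
    x = x + d * (sigmas[i + 1] - sigmas[i])   # the Euler update
print(x.abs().max())                          # 0: the noise has been fully walked down the schedule
With this fake denoiser, d equals x / sigma, so each step scales x by sigma_next / sigma and the final step (sigma_next = 0) removes everything. With a real model, denoised is a prediction of the clean image rather than zeros, and the same update walks the latent from noise towards that prediction.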
So Euler simply takes the derivative pointing from the noisy image towards the denoised one, multiplies it by the sigma jump from the current step to the next, and adds it back, effectively producing the next, slightly less noisy, input for the model. With my current knowledge, this reminds me of matrix multiplication in linear algebra, and since matrices are rank-2 tensors, I am using this simplification to explain what is going on. Matrices at first also look like an arbitrary collection of numbers, yet by multiplying matrices and vectors we can combine and apply transforms; all the manipulations of 2D and 3D space can be expressed with this simple operation on rank-1 and rank-2 tensors.
So imagine what could be expressed when the same idea is applied to rank-n tensors.
Over the next months I am going to complement this analysis by studying some more advanced math concepts, and will then revise some of these observations and fill in the gaps that I am currently abstracting away. But for a conceptual overview, I think this will suffice for now.