
J: If f(x) is a neural net that outputs the probability of the input being real rather than noise, then $\frac{df(x)}{dx}$ gives the direction in which x needs to move to make f(x) higher, i.e. the direction that makes it look more like a digit.

S: However, note that this gradient is with respect to x, not the weights of the net. So we presumably need to train a net to distinguish noise from real images?
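
The point above can be sketched in NumPy with a toy stand-in "net" f(x) = sigmoid(w·x + b), where w, b, and the step size are made-up values for illustration: holding the weights fixed and following df/dx moves x toward higher f(x).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in "net": f(x) = sigmoid(w @ x + b); w and b are arbitrary.
w = np.array([0.5, -0.3, 0.8])
b = -0.1

def f(x):
    return sigmoid(w @ x + b)

def df_dx(x):
    # Chain rule: d sigmoid(z)/dz = sigmoid(z) * (1 - sigmoid(z)), times w.
    s = f(x)
    return w * s * (1.0 - s)

x = np.array([0.2, 0.4, -0.5])
before = f(x)
for _ in range(100):
    x = x + 0.1 * df_dx(x)   # gradient *ascent* on the input; weights never change
after = f(x)
```

Note that the update touches only x: the weights stay frozen, which is exactly why a separate training phase is needed to get a useful f in the first place.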


J: Or in other words, if we just try to predict the noise, this tells us what to subtract from the (noised) input.
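
A minimal sketch of that subtraction step, using the true noise as a stand-in for a perfect noise predictor (the 1-D "image" and the 0.3 noise scale are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(size=64)        # a "clean image" (toy 1-D stand-in)
eps = rng.normal(size=64)        # the noise that was added
x_noised = x0 + 0.3 * eps        # the noised input the net would see

# Pretend the net predicts the (scaled) noise perfectly; subtracting its
# prediction from the noised input recovers the clean image exactly.
predicted_noise = 0.3 * eps
x_denoised = x_noised - predicted_noise
```

In practice the prediction is imperfect, so samplers subtract only a fraction of it and repeat over many steps.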

J: A U-Net is used as the architecture.

S: Why do we use U-Nets? Why aren’t we using encoder-only Transformers now? It’s the same idea of the input and output sizes being the same. Is it a speed issue?

J: There are no cross connections here, because in this case we are talking about autoencoders, used to compress pixels into a representation. So the f above operates on the latents (the compressed version), not the actual pixels.
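
A shape-level sketch of that setup, using a toy linear "autoencoder" (a random matrix and its pseudo-inverse; the 64/8 sizes are made up and much smaller than any real model): the denoising step would run on the small latent z, not on the pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
d_pix, d_lat = 64, 8                 # toy sizes, purely illustrative

# Toy linear "autoencoder": encode with E, decode with its pseudo-inverse.
E = rng.normal(size=(d_lat, d_pix))
D = np.linalg.pinv(E)

pixels = rng.uniform(size=d_pix)
z = E @ pixels                       # compressed latent representation
# ... the diffusion / noise-prediction step would operate on z here ...
recon = D @ z                        # map back to pixel space
```

The payoff is that the expensive denoising net only ever sees the 8-dimensional z, not the 64-dimensional pixels.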


J: Now we give the net some extra information, such as the label. Therefore, we should expect it to learn the noise better. This method is called guidance.
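
One common form of this is classifier-free guidance (the notes above don't name the specific variant, so this is an assumption): the net predicts the noise twice, with and without the label, and the two predictions are combined. The arrays and the guidance scale s = 7.5 below are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the net's two noise predictions at one denoising step:
eps_uncond = rng.normal(size=16)   # prediction without the label/caption
eps_cond = rng.normal(size=16)     # prediction given the label/caption
s = 7.5                            # guidance scale (illustrative value)

# Classifier-free guidance: push the prediction further in the direction
# the conditioning information suggests.
eps_guided = eps_uncond + s * (eps_cond - eps_uncond)
```

With s = 1 this reduces to the plain conditional prediction; s > 1 exaggerates the effect of the label.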

J: Talks about CLIP. My post on CLIP.


S: “Things that appear like this [points to noised image] never appeared in the training set.” But isn’t that all that appeared in training: noised images?

J: This looks a lot like optimizers. Can we use momentum? (Yes.)
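
For reference, the classic momentum update from optimizers, which is the analogy being drawn (the quadratic objective and the lr/beta values are made up for illustration):

```python
import numpy as np

def momentum_step(x, grad, v, lr=0.1, beta=0.9):
    # Classic momentum: accumulate a velocity, then move x along it.
    v = beta * v + grad
    x = x - lr * v
    return x, v

# Minimize f(x) = x**2 (gradient 2x), starting from x = 5.0.
x, v = 5.0, 0.0
for _ in range(100):
    x, v = momentum_step(x, 2 * x, v)
```

The analogy: each denoising step is an update on the sample, so the same tricks that stabilize optimizer trajectories can be applied to sampler trajectories.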

J: The model inputs are the noised image, the caption, and the timestep, i.e. (x_t, z, t), where t is an indication of how much noise there is.
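
A sketch of why t matters, with a made-up linear noise schedule (real schedules, e.g. cosine, differ): the same clean image and noise give very different x_t at small vs. large t, so the net needs t to know how much noise to expect.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(size=32)              # clean image (toy stand-in)
eps = rng.normal(size=32)

def noise_level(t, T=1000):
    # Made-up linear schedule, purely for illustration.
    return t / T

def make_xt(x0, eps, t):
    a = 1.0 - noise_level(t)           # fraction of signal that survives
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

x_small_t = make_xt(x0, eps, 10)       # barely noised, close to x0
x_large_t = make_xt(x0, eps, 900)      # mostly noise
# The net would receive (x_t, caption z, t) so it knows the noise level.
```

Passing t explicitly is what lets a single net handle every noise level instead of training one net per step.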

Open questions