Flow matching, Score-based Generative Model, Schrödinger Bridge and Optimal Transport
References
Denoising Diffusion Probabilistic Models
Diffusion Schrödinger Bridge Matching
Simplified Diffusion Schrödinger Bridge
Adversarial Schrödinger Bridge Matching
Introduction
Generative models resonate with two deep principles: the thermodynamics of entropy and the mathematics of optimal transport.
Symbols and Preliminaries
Probability Space
- \((\Omega,\mathcal F,\mathbb P)\) is a probability space.
- \((E,\mathcal E) = (\mathbb R^d, \mathcal B(\mathbb R^d))\) is the measurable space.
- A random variable is a measurable map \(X:(\Omega,\mathcal F)\to(E,\mathcal E)\), i.e. \[ X^{-1}(B) \in \mathcal F, \quad \forall B\in \mathcal B(\mathbb R^d). \]
- The distribution (pushforward measure) of \(X\) is \[ \mu = \mathrm{Law}(X) = \mathbb P \circ X^{-1}, \quad \mu \in \mathcal P(\mathbb R^d). \]
- If \(\mu\) is absolutely continuous w.r.t. the Lebesgue measure \(\lambda\), then there exists a density function \(p(x): E\to [0,+\infty)\) such that \[ d\mu(x) = p(x)\, dx. \]
- Notation: we write \(X\sim \mu\); if \(\mu\) admits a density, we often abbreviate \(X\sim p(x)\).
Stochastic Process
- A stochastic process is a time-parametrized family of random variables \[ \{X_t: t\in [0,T]\},\quad X_t:(\Omega, \mathcal{F}) \to (E,\mathcal{E}). \]
- The marginal distribution at time \(t\) is \[ \mu_t = \mathrm{Law}(X_t),\quad X_t\sim \mu_t. \]
Standard Brownian Motion (Wiener Process)
A \(d\)-dimensional standard Brownian motion \((W_t)_{t\ge0}\) with respect to a filtration \((\mathcal F_t)_{t\ge0}\) satisfies:
1. \(W_0 = 0\) almost surely.
2. For \(0\le s<t\), the increment \(W_t-W_s \sim \mathcal N(0,(t-s)I_d)\).
3. The increments \(W_t-W_s\) are independent of \(\mathcal F_s\) (independent increments).
4. The paths \(t\mapsto W_t(\omega)\) are almost surely continuous.
Itô Integral and Itô’s Lemma
Quadratic Variation.
For a continuous semimartingale \(X=(X_t)_{t\ge0}\), the quadratic variation is defined as \[ [X]_t = \lim_{\|\Pi\|\to 0} \sum_{k} (X_{t_{k+1}}-X_{t_k})^2, \] where \(\Pi=\{0=t_0<\cdots<t_n=t\}\) is a partition of \([0,t]\) and the limit is taken in probability.
For one-dimensional Brownian motion \((W_t)\), \[ [W]_t = t, \quad \text{a.s.} \] In higher dimensions, for \(W=(W^{(1)},\dots,W^{(d)})\), \[ [W^{(i)},W^{(j)}]_t = \begin{cases} t & i=j,\\ 0 & i\ne j, \end{cases} \] where \([X,Y]_t\) denotes the quadratic covariation.
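As a quick numerical illustration (a toy sketch of my own, not from the text above), the snippet below simulates a Brownian path from its independent Gaussian increments and checks that the realized quadratic variation on \([0,T]\) concentrates around \(T\):

```python
import numpy as np

# Simulate one Brownian path on [0, T] from its increments and check that
# sum_k (W_{t_{k+1}} - W_{t_k})^2 is close to T (it concentrates as the mesh shrinks).
rng = np.random.default_rng(0)
T, n = 1.0, 1_000_000
dt = T / n
increments = rng.normal(0.0, np.sqrt(dt), size=n)   # W_{t_{k+1}} - W_{t_k} ~ N(0, dt)
W = np.concatenate([[0.0], np.cumsum(increments)])  # sampled path, W_0 = 0

realized_qv = np.sum(increments**2)                 # realized quadratic variation on [0, T]
print(realized_qv)                                  # ≈ T = 1.0
```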
Itô Integral.
Let \((W_t)\) be a one-dimensional Brownian motion and let \(\phi_t\) be a real-valued adapted process with \[ \mathbb E\int_0^T \phi_t^2\, dt < \infty. \] Then the Itô integral is defined as \[ \int_0^T \phi_t\, dW_t = L^2\text{-}\lim_{\|\Pi\|\to0}\sum_k \phi_{t_k}(W_{t_{k+1}}-W_{t_k}). \] It satisfies the Itô isometry: \[ \mathbb E\left[\left(\int_0^T \phi_t\, dW_t\right)^2\right] = \mathbb E\int_0^T \phi_t^2\, dt. \]
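A small Monte Carlo check of the isometry (again a toy sketch, with the illustrative choice \(\phi_t = W_t\)); both sides should be close to \(T^2/2\):

```python
import numpy as np

# With phi_t = W_t, the Ito isometry predicts
#   E[(∫_0^T W_t dW_t)^2] = E[∫_0^T W_t^2 dt] = T^2 / 2.
rng = np.random.default_rng(1)
T, n, paths = 1.0, 1_000, 5_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=(paths, n))
W = np.cumsum(dW, axis=1) - dW                      # left endpoints W_{t_k} (so W_{t_0} = 0)

ito_integral = np.sum(W * dW, axis=1)               # ∑_k W_{t_k} (W_{t_{k+1}} - W_{t_k})
lhs = np.mean(ito_integral**2)                      # Monte Carlo E[(∫ W dW)^2]
rhs = np.mean(np.sum(W**2, axis=1) * dt)            # Monte Carlo E[∫ W_t^2 dt]
print(lhs, rhs, T**2 / 2)                           # all three ≈ 0.5
```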
Itô’s Lemma (Itô formula, one-dimensional).
Suppose \(X_t\) satisfies the SDE \[ dX_t = f(X_t,t)\,dt + g(X_t,t)\,dW_t, \] and let \(F:\mathbb R\times[0,T]\to\mathbb R\) be \(C^{2,1}\) (twice continuously differentiable in \(x\) and once in \(t\)). Then \[ dF(X_t,t) = \Big(\partial_t F + f\,\partial_x F + \tfrac{1}{2} g^2\,\partial_{xx}F\Big)\,dt + g\,\partial_x F\, dW_t. \]
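As a quick worked example: take \(X_t = W_t\) (so \(f=0\), \(g=1\)) and \(F(x,t)=x^2\). Itô's lemma gives \[ d(W_t^2) = dt + 2W_t\,dW_t, \] hence \(\int_0^T W_t\,dW_t = \tfrac12\big(W_T^2 - T\big)\); the extra \(-T/2\), absent in classical calculus, is exactly the quadratic-variation correction.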
Multidimensional Itô’s Lemma.
If \(X_t\in\mathbb R^d\) satisfies \[ dX_t = f(X_t,t)\,dt + G(X_t,t)\,dW_t, \quad G\in\mathbb R^{d\times m}, \] where \(W_t\) is an \(m\)-dimensional Brownian motion, and \(F:\mathbb R^d\times[0,T]\to\mathbb R\) is \(C^{2,1}\), then \[ dF(X_t,t) = \Big(\partial_t F + f^\top \nabla_x F + \tfrac12 \mathrm{Tr}\!\big(GG^\top \nabla_x^2 F\big)\Big)dt + (\nabla_x F)^\top G\, dW_t. \]
Remark.
- The term \(\tfrac12 g^2 \partial_{xx}F\) (or \(\tfrac12 \mathrm{Tr}(GG^\top\nabla^2F)\) in higher dimensions) arises from the quadratic variation of Brownian motion, i.e. \([W]_t=t\).
- This correction term is what distinguishes Itô calculus from classical calculus and is fundamental in stochastic analysis.
Other
The Kullback–Leibler (KL) divergence between two probability measures \(\mu\) and \(\nu\) (with \(\mu\ll\nu\)) is \[ D_{\mathrm{KL}}(\mu \| \nu)=\int_{E}\log\left(\frac{d\mu}{d\nu}(x)\right) d\mu(x), \] where \(\frac{d\mu}{d\nu}\) is the Radon–Nikodym derivative.
For the KL divergence between two Gaussian distributions, we have
\[ \begin{align*} \mathrm{KL}(\mathcal{N}({\mu}_x,{\Sigma}_x)\|\mathcal{N}({\mu}_y,{\Sigma}_y)) &=\frac{1}{2}\left[\log\frac{|{\Sigma}_y|}{|{\Sigma}_x|} - d + \text{tr}({\Sigma}_y^{-1}{\Sigma}_x) + ({\mu}_y-{\mu}_x)^T {\Sigma}_y^{-1} ({\mu}_y-{\mu}_x)\right] \end{align*} \]
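A direct implementation of this formula (a minimal sketch; the helper name and the test values are illustrative):

```python
import numpy as np

# Closed-form KL between two multivariate Gaussians, following the formula above.
def gaussian_kl(mu_x, Sigma_x, mu_y, Sigma_y):
    d = mu_x.shape[0]
    Sigma_y_inv = np.linalg.inv(Sigma_y)
    diff = mu_y - mu_x
    return 0.5 * (
        np.log(np.linalg.det(Sigma_y) / np.linalg.det(Sigma_x))
        - d
        + np.trace(Sigma_y_inv @ Sigma_x)
        + diff @ Sigma_y_inv @ diff
    )

# Sanity checks: zero for identical Gaussians, positive otherwise.
mu, Sigma = np.zeros(2), np.eye(2)
print(gaussian_kl(mu, Sigma, mu, Sigma))                  # 0.0
print(gaussian_kl(mu, Sigma, np.ones(2), 2 * np.eye(2)))  # ≈ 0.693 > 0
```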
Diffusion Models
DDPM
Let the data distribution be \(q(x)\), with samples \[ x\sim q(x). \] How can we sample from \(q(x)\)?
We define a Markov chain \((x_t)\) as the forward process, such that \[ x_0=x,\qquad q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1}), \] where \(q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1}, \beta_t I) = \frac{1}{\sqrt{(2\pi\beta_t)^d}}\exp\left(-\frac{\|x_t-\sqrt{1-\beta_t}x_{t-1}\|^2}{2\beta_t}\right)\).
We call \(\{\beta_t\}_{t=1}^T\) the variance schedule.
By induction, we know \(x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon,\quad \epsilon \sim\mathcal{N}(0,I),\quad \bar\alpha_t =\prod_{s=1}^{t}(1-\beta_s)\),
\[ x_t \sim \mathcal{N}(\sqrt{\bar \alpha_t}x_0, (1-\bar\alpha_t)I). \]
So we can sample \(x_t\) directly.
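A minimal sketch of this one-shot sampling (the linear \(\beta\) schedule, shapes, and indexing below are my own illustrative assumptions; index \(t=0,\dots,T-1\) stands for steps \(1,\dots,T\)):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)          # illustrative linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)         # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def sample_xt(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) directly."""
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.random.randn(16, 3)                 # toy "data" batch
xt = sample_xt(x0, t=500)                   # no need to iterate through steps 1..500
```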
Assuming \(\bar\alpha_T \to 0\) for large \(T\), we can regard \[ x_T \sim \mathcal N(0,I). \]
Now, given \(x_T\), we want to sample \(x_0\) by reversing the process.
Consider the posterior density \[ q(x_{t-1}|x_{t}) = \int q(x_{t-1}|x_{t},x_{0})\, q(x_0|x_t)\, d x_0, \] which is intractable to compute, since it requires integrating over \(x_0\) (i.e. knowing \(q(x_0|x_t)\)).
Instead we consider \[ q(x_{t-1}|x_t,x_0) = \frac{q(x_t|x_{t-1})q(x_{t-1}|x_{0})}{q(x_t|x_0)}. \] So \[ q(x_{t-1}|x_t,x_0)\propto q(x_t|x_{t-1})q(x_{t-1}|x_{0}) \] is still a Gaussian distribution (a product of Gaussian densities in \(x_{t-1}\)): \[ q(x_{t-1}|x_t,x_0) = \mathcal{N}(x_{t-1};\mu,\Sigma). \] Recall that for such a product, \(\Sigma^{-1}=\Sigma_{1}^{-1}+\Sigma_{2}^{-1}\) and \(\Sigma^{-1}\mu = \Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2\).
Note that, as a function of \(x_{t-1}\), \[ q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_t I)\propto \mathcal{N}\Big(x_{t-1};\frac{1}{\sqrt{1-\beta_t}}x_{t},\, \frac{\beta_t}{1-\beta_t}I\Big), \] so \(\Sigma_1=\frac{\beta_t}{1-\beta_t} I,\ \mu_1=\frac{1}{\sqrt{1-\beta_t}}x_{t}\).
From \(q(x_{t-1}|x_{0})=\mathcal N(x_{t-1};\sqrt{\bar\alpha_{t-1}}x_0,(1-\bar\alpha_{t-1})I)\): \(\Sigma_2=(1-\bar\alpha_{t-1})I,\ \mu_2=\sqrt{\bar \alpha_{t-1}}x_0\).
\(\Sigma_t=(\frac{1-\beta_{t}}{\beta_t}+\frac{1}{1-\bar\alpha_{t-1}})^{-1}I=\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_tI=\tilde \beta_t I\)
\(\mu_t=\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t(\frac{\sqrt{1-\beta_t}}{\beta_t}x_t+\frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}x_0) = \frac{\beta_t\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}x_0 + \frac{(1-\bar\alpha_{t-1})\sqrt{1-\beta_t}}{1-\bar\alpha_t}x_t\).
Now, to train \(p_{\theta}(x_{t-1}|x_t)\), we fix the variance to \(\tilde \beta_t\) and let the model predict \(\mu_t\).
We minimize the KL divergence: \[ L_t(\theta)=\mathrm{KL}(q(x_{t-1}|x_0,x_t)\|p_{\theta}(x_{t-1}|x_t)) \propto \mathbb{E}[\|\mu_t(x_t,x_0)-\mu_{\theta}(x_t,t)\|^2]. \] Alternatively, let the model predict the noise \(\epsilon=\frac{x_t-\sqrt{\bar\alpha_t}x_0}{\sqrt{1-\bar\alpha_t}}\) (so \(x_0=\frac{1}{\sqrt{\bar\alpha_t}}(x_t-\sqrt{1-\bar\alpha_t}\epsilon)\)) and minimize \[ \mathbb E_{x_0,\epsilon,t}\left[\|\epsilon-\epsilon_{\theta}(x_t,t)\|^2\right]. \] This noise-prediction parametrization generally yields better sample quality in practice.
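Here is a hedged sketch of the noise-prediction loss and of one ancestral sampling step built from the posterior mean \(\mu_t\) and variance \(\tilde\beta_t\) derived above; `model` is a stand-in callable for \(\epsilon_\theta\), and the schedule is the same illustrative assumption as in the previous snippet:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddpm_loss(model, x0):
    """Monte Carlo estimate of E ||eps - eps_theta(x_t, t)||^2 at a random step t."""
    t = np.random.randint(T)
    eps = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - model(xt, t)) ** 2)

def ancestral_step(model, xt, t):
    """One step x_t -> x_{t-1} of p_theta(x_{t-1} | x_t) with the fixed variance tilde(beta)_t."""
    eps_hat = model(xt, t)
    x0_hat = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
    mean = (betas[t] * np.sqrt(ab_prev) / (1.0 - alpha_bar[t]) * x0_hat
            + (1.0 - ab_prev) * np.sqrt(1.0 - betas[t]) / (1.0 - alpha_bar[t]) * xt)
    var = (1.0 - ab_prev) / (1.0 - alpha_bar[t]) * betas[t]       # tilde(beta)_t
    noise = np.random.randn(*xt.shape) if t > 0 else 0.0
    return mean + np.sqrt(var) * noise
```

With a trained \(\epsilon_\theta\), iterating `ancestral_step` from \(x_T\sim\mathcal N(0,I)\) down to \(t=0\) produces an approximate sample from \(q(x)\).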
Score-based Generative Model (SGM)
We view noise injection as an SDE and learn the score \(s_t(x) := \nabla_x \log p_t(x)\) of the noisy marginal \((p_t)_{t\in [0,1]}\). Sampling is done by integrating a reverse-time SDE (or its ODE counterpart), where the score guides the dynamics back to data.
Forward SDE
Let \(X_t\in \mathbb R^d\) solve: \[ dX_t = f(X_t,t) dt + g(t) dW_t, \quad t\in [0,1], \] with \(f:\mathbb R^d\times [0,1] \to \mathbb R^d\), diffusion scale \(g:[0,1]\to \mathbb R_+\). Denote \(p_t = \mathrm{Law}(X_t)\).
DDPM as VP-SDE
Set the Variance-Preserving (VP) SDE \[ dX_t = -\frac{1}{2}\beta(t)X_t dt + \sqrt{\beta(t)} d W_t, \] with \(\beta(t)\geq 0\) integrable and \(X_0 \sim p_{\mathrm{data}}\).
This linear SDE has the explicit solution \[ X_t = \sqrt{\bar \alpha_t} X_0 + \underbrace{\int_0^t \exp\left(-\frac{1}{2}\int_s^t \beta(u) du\right)\sqrt{\beta(s)} dW_s}_{\text{zero-mean Gaussian}}, \] where \(\bar\alpha_t:=\exp\left(-\int_0^t\beta(s)ds\right)\).
Hence the conditional marginal matches the DDPM forward process:
\[ X_t|X_0\sim \mathcal{N}(\sqrt{\bar\alpha_t}X_0, (1-\bar\alpha_t)I). \]
Claim. The DDPM forward chain with \(\beta_1,\ldots,\beta_T\) is (to first order in \(\beta_t\)) an Euler–Maruyama discretization of the VP-SDE with piecewise-constant \(\beta(t)\), and the conditional marginals \(q(x_t|x_0)\) agree in the small-step limit, since \(\prod_s(1-\beta_s)\approx\exp(-\sum_s\beta_s)\).
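The following toy check (my own construction, with a constant \(\beta\)) runs Euler–Maruyama on the VP-SDE and compares the empirical mean and variance of \(X_T\mid X_0\) against the closed form \(\mathcal N(\sqrt{\bar\alpha_T}X_0,(1-\bar\alpha_T)I)\):

```python
import numpy as np

# Euler–Maruyama on dX = -0.5 beta X dt + sqrt(beta) dW with constant beta,
# compared against the closed-form conditional marginal of the VP-SDE.
rng = np.random.default_rng(2)
beta, T, n, paths = 1.0, 1.0, 1_000, 50_000
dt = T / n
x0 = 2.0
x = np.full(paths, x0)
for _ in range(n):
    x = x - 0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.normal(size=paths)

alpha_bar_T = np.exp(-beta * T)
print(x.mean(), np.sqrt(alpha_bar_T) * x0)   # both ≈ 2 e^{-1/2} ≈ 1.213
print(x.var(), 1.0 - alpha_bar_T)            # both ≈ 1 - e^{-1} ≈ 0.632
```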
Variance-Exploding SDE (VE-SDE)
Alternatively, set \[ dX_t = g(t)\,dW_t, \qquad f\equiv 0, \] with \(g(t)=\sqrt{d[\sigma^2(t)]/dt}\) and \(\sigma(0)=0\). Then \[ X_t = X_0 + \sigma(t)Z,\quad Z\sim\mathcal N(0,I). \] So \(p_t = p_0 * \mathcal N(0,\sigma(t)^2I)\), i.e. Gaussian smoothing.
Reverse-time SDE and Probability Flow ODE
The reverse SDE (Anderson, 1982) is \[ dX_t = \big(f(X_t,t)-g(t)^2\nabla_x\log p_t(X_t)\big)dt + g(t)\,d\bar W_t, \] where \(\bar W_t\) is a backward Wiener process. Replacing \(\nabla_x\log p_t\) with a learned \(s_\theta\) gives a generative sampler.
The equivalent deterministic probability flow ODE is \[ \frac{dX_t}{dt} = f(X_t,t) - \tfrac12 g(t)^2 \nabla_x\log p_t(X_t). \] This is the continuous analogue of DDIM.
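To make the reverse-time dynamics concrete, here is a self-contained toy sketch (my own example, not a general recipe): for a Gaussian data distribution \(p_0=\mathcal N(m, s^2)\) under the VP-SDE with constant \(\beta\), the score of \(p_t\) is available in closed form, so we can integrate the reverse SDE with Euler–Maruyama and check that samples land back on \(p_0\) without any learned \(s_\theta\):

```python
import numpy as np

# Reverse-time Euler–Maruyama for the VP-SDE, using the exact score of
# p_t = N(sqrt(alpha_bar_t) m, alpha_bar_t s^2 + 1 - alpha_bar_t) in place of s_theta.
rng = np.random.default_rng(3)
beta, m, s = 5.0, 3.0, 0.5
n, paths = 1_000, 50_000
dt = 1.0 / n

def score(x, t):
    a_bar = np.exp(-beta * t)                       # alpha_bar(t) for constant beta
    var = a_bar * s**2 + (1.0 - a_bar)              # variance of the marginal p_t
    return -(x - np.sqrt(a_bar) * m) / var          # grad_x log p_t(x)

x = rng.normal(size=paths)                          # prior N(0, 1); close to p_1 here since alpha_bar(1) = e^{-5}
for k in range(n):
    t = 1.0 - k * dt
    drift = -0.5 * beta * x - beta * score(x, t)    # f - g^2 * score, with f = -beta x / 2
    x = x - drift * dt + np.sqrt(beta * dt) * rng.normal(size=paths)   # step from t to t - dt
print(x.mean(), x.std())                            # ≈ m = 3.0 and s = 0.5
```

Dropping the noise term and halving the score contribution in the drift turns the same loop into an Euler integrator for the probability flow ODE.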
Noise Prediction vs. Score Prediction
For VP, \[ \nabla_{x_t}\log q(x_t|x_0) = -\frac{x_t-\sqrt{\bar\alpha_t}x_0}{1-\bar\alpha_t} = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}. \] Thus noise prediction and score prediction are equivalent: \[ s_\theta(x_t,t) = -\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar\alpha_t}}. \]
An alternative data predictor: \[ \hat x_0(x_t) = \frac{1}{\sqrt{\bar\alpha_t}}\big(x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)\big). \]
Denoising Score Matching (DSM)
Training objective: \[ \min_\theta\; \mathbb E_{x_0,\sigma,z}\;\Big\|s_\theta(x_0+\sigma z,\sigma) + \tfrac{z}{\sigma}\Big\|^2, \qquad z\sim\mathcal N(0,I). \]
For VP, perturbing \(\sqrt{\bar\alpha_t}\,x_0\) instead of \(x_0\) and choosing \(\sigma(t)=\sqrt{1-\bar\alpha_t}\) reduces DSM to the DDPM noise-prediction MSE, up to the per-\(t\) weighting \(1/(1-\bar\alpha_t)\).
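For completeness, a minimal sketch of the DSM objective at a single noise level \(\sigma\) (my own toy helper; `score_model` stands in for \(s_\theta\), and the oracle below uses the fact that for \(p_0=\mathcal N(0,I)\) the smoothed score is \(-x/(1+\sigma^2)\)):

```python
import numpy as np

def dsm_loss(score_model, x0, sigma):
    """E || s_theta(x_0 + sigma z, sigma) + z / sigma ||^2 for z ~ N(0, I)."""
    z = np.random.randn(*x0.shape)
    x_noisy = x0 + sigma * z
    return np.mean(np.sum((score_model(x_noisy, sigma) + z / sigma) ** 2, axis=-1))

x0 = np.random.randn(8, 2)
oracle = lambda x, sigma: -x / (1.0 + sigma**2)   # exact score of N(0, I) * N(0, sigma^2 I)
print(dsm_loss(oracle, x0, sigma=0.5))            # the objective at its minimizer (not zero)
```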
Derivation and Intuition
TODO: Weak form.
Flow Matching Models
See Flow Matching
Schrödinger Bridge Models
Relation with Entropy-Regularized Optimal Transport
IPF (DSB, S-DSB)
IMF
D-IMF (ADSB)
CDSB
TODO.