The objective of generative modeling is to learn the underlying
distribution of the
training data . This is typically
achieved by training a generative model to approximate , allowing for the generation
of new samples from the learned distribution.
Let be a complete and
separable metric space such as . Then, the space of probability measures
is defined as the
set of all Borel probability measures on .
Generative models usually aims to learn a mapping from a simple
distribution (e.g., Gaussian) to the complex data distribution by transforming samples from
the simple distribution into samples from the data distribution.
Topology on
Weak Topology
the weak topology on is defined as the coarsest topology
making all functionals of the form continuous, where ranges over all the bounded continuous functions on
.
Convergence characterization: A sequence converges to in the weak topology if and only if
for all bounded continuous functions .
Actually, let be
the space of all signed, finite measure on . can be seen as the dual space
induced by the pairing:
The relation between a measure on and a linear functional (dual space) on
:
is linear and
injective.
When is compact, is bijective by the Riesz
representation theorem. .
When is locally compact and
separable, , where is
the compactly support function on , which vanishing on infinity.
And we have the weak* topology on the dual space of
So induces the weak*
topology on and its
subset .
Wasserstein Metric Topology
When is a compact Riemann
manifold , the -Wasserstein distance is a metric on
, and induced a
topology on . More precisely, for
probability distribution with finite -th moment, .
Kantorovich-Rubinstein duality states that for any
-Wasserstein distance, we
have:
where is
the -Wasserstein distance.
This plays an important role in WGAN.
So -Wasserstein convergence is
equivalent to convergence over all Lipschitz functions , which is stronger than weak
convergence (all
functions).
Generally, we have
So -Wasserstein topology is
weaker than -Wasserstein metric
topology, etc.
But if we assume all the probability measure has compact support, the
convergence is equivalent.
Let be
the simple distribution, e.g. , and let be the data
distribution. The goal is to learn a mapping such that , where is the pushforward measure of
under the mapping defined by , where
is a Borel
set.
But in reality, we only have the samples from the data distribution
. So the best we can do is to
approximate by where is the
sample data and is the
Dirac measure centered at .
This can be viewed as a semi-discrete optimal transport problem,
where we aim to find a mapping that pushes forward the simple distribution to the empirical distribution .
Data Manifold Hypothesis
We view the real distribution as a distribution over , where is a manifold embedded in
and is the dimensionality of the data. More
precisely, we assume that the data distribution has support on a lower-dimensional
manifold , typically with . This is the data
manifold hypothesis, which posits that high-dimensional data often lies
on or near a lower-dimensional manifold.
Theoretical
Framework vs. Practical Implementation
Theoretical assumption:
The true data distribution is absolutely continuous with
respect to the volume measure on the data manifold , i.e., there exists a smooth
density function such that:
Practical reality:
We only observe finite samples from the true
distribution, leading to the empirical distribution:
Note that each Dirac measure is not absolutely continuous
with respect to the volume measure, since has zero volume but .
Regularization assumption: We assume that the empirical distribution
converges weakly to the
true continuous distribution as
:
This weak convergence justifies treating the discrete
empirical distribution as an approximation to the underlying continuous
distribution.
But we can instead use a Gaussian kernel to smooth the empirical
distribution:
where is a
Gaussian kernel with bandwidth .
We shall see this is the choice of the flow matching.
(Discrete) Normalizing Flow
We learn a bijective mapping such that the pushforward measure
matches the target
distribution .
The Normalizing Flow method assumes that can be expressed as a sequence of
invertible transformations, typically using neural networks, such that
the Jacobian determinant can be efficiently computed.
$$ $$
And each is a neural network
parameterized by , i.e.,
.
Loss and Training
The objective is maximum likelihood estimation or KL divergence,
equivalently.
By the change of variables formula, we have:
Continuous Normalizing Flows
Let , it is
equivalent to construct a vector field (and corresponding ODE) and take
the transform process as particle flows.
Given a time dependent transformation
The flow induces a
time-dependent one parameter group of diffeomorphisms: And the trajectory of a particle is denoted as : Note that here particle
means it was initially at position at time .
We want a gradual transformation of the input distribution to the
target distribution, i.e. And satisfies .
Fix a position and describe
the speed of the particles at , we
can define the velocity field as:
And the Eulerian description and Lagrangian description are related
as follows: Eulerian: Lagrangian:
We shall now consider the conservation of mass, i.e., we want to
ensure that the flow preserves the total mass of the distribution.
We thus introduce a density function induced by the flow And denote as .
We require Particularly, .
And we assume is tangent to
the boundary , and
the flow remains inside of ,
thus .
It must satisfy the continuity equation to make the probability mass
conserve
Proof of continuity equation
Let , By the arbitrariness of , we get the continuity equation:
We say the solve
the continuity in distributional sense, which is called a weak solution
or distributional solution.
From the Lagrangian description, we know the formula: Take differential on both side, And
Hence
Take it back, and eliminate And .
since is a
diffeomorphism, we have:
Note: If we further assume the incompressibility condition, i.e.
then we have
So we know
Now we focus on a particle is initially at , moving on trajectory
So Thus
Loss and Training
We simulate the flow by using a neural network with parameter .
Similar to the discrete normalizing flow, we use MLE:
Or equivalently, minimize the KL divergence between and .
Flow Matching
Instead of learning , flow
matching directly learns the velocity field .
The theoretical objective is: But this cannot be trained directly since we don’t know .
Conditional Flow Matching
conditioning on , CFM use the
conditional distribution
Key theorem:
Given conditional probability path satisfying the continuity
equation with conditional vector field .
Then the marginal vector field : satisfy the continuity equation and generate the marginal
probability path .
And one can choose any conditional probability path as long as And the paper use Gaussian kernal as an approximation.
If the above condition is satisfied, the paper proves that optimizing
the objective of CFM is equivalent to optimizing the objective of the
original flow model.
And
And you can also condition on .
or both and , the formula is still
valid if if the coupling is fixed.
Rectified Flow
In the paper, rectified flow optimized the velocity
field by minimizing the
following objective:
where .
First, lets translate this into the CFM language.
Conditioning on and , rectified flow defined a
deterministic path as the interpolation between and :
And apply the CFM objective: And this is exactly the objective of rectified flow.
Actually, the author of Flow Matching For Generative
Modeling also proposed an similar objective, but they use the name
“OT-Flow” instead of “rectified flow” (And Gaussian kernal). We shall
see the reason in the next section.
Relation to Optimal
Transport
What does the rectified flow and conditional flow matching have to do
with optimal transport? Let’s recall the method of conditional
probability path.
The claim is that to get the vector field , we can use the conditional vector
field and do
summation over all and . Then we shall get the vector field
that satisfies the continuity
equation and generates the marginal probability path .
It is not difficult to see that summing over the conditional vector
field will yield a valid vector field that satisfies the continuity
equation and transport the distribution from to . But what is the property of such
vector field?
Let’s think about a principle in traditional physics, the
principle of least action. It states that the path
taken by a system between two states is the one for which the action is
stationary (usually minimized).
In the context of traditional physics, the shortest path between two
points is a straight line. But in quantum mechanics, the particle is no
longer deterministic as a point, but rather a wave function that can be
spread out over a region of space assigning a probability amplitude to
each point. But the principle of least action still holds, and the path
taken by the particle is the one that minimizes the action.
Benamou-Brenier Theory
The Benamou-Brenier theory states that the optimal transport problem
can be reformulated as a dynamic problem, where the optimal transport
plan is the one that minimizes the action functional.
In every time , we have a
distribution and a velocity
field . The kinetic energy is
given by the integral of the squared velocity field:
The action functional is then defined as:
And we see the action functional is exactly the total energy of using
the velocity field to transport
the distribution from to .
Problem (Benamou-Brenier): Given two probability
measures and , find the optimal transport plan that
minimizes the action functional subject to some conditions. Admissible pairs are pairs such that: - , , - in distributional sense, - , - , i.e., the velocity field is square integrable with respect
to the measure . - , i.e., the probability path is absolutely continuous
with respect to the weak* topology on the space of probability
measures.
Theorem: (Benamou-Brenier)
The -Wasserstein distance
between two probability measures and can be expressed as the infimum of
the action functional over all admissible pairs
This means that the optimal transport plan is the one that minimizes
the action functional, which is equivalent to minimizing the total
energy of the velocity field.
Variational Calculus
We introduce a Lagrange multiplier to enforce the continuity
equation constraint:
Variation with respect to : Using integration by parts on the second term where the boundary term vanishes due to the assumption that
is tangent to the boundary .
we get
Which forces
Variation with respect to : Similarly, using integration by parts on the second term and
the third term Note and
are fixed, so
variations are zero
at the endpoints .
we get Setting this to zero for all , we have
Substituting , we have
To summarize, we have three equations for the action functional,
which gives us the necessary conditions for optimality:
By the , we can get
Kantorovich potential ,
as the dual potentials for the optimal transport problem.
One can prove it by integral along the optimal pair , and verify the duality
optimality. (Actually, this maybe give a proof to the Benamou-Brenier
theorem)
Economic interpretation
The optimal transport problem can be interpreted in an economic
context, where we want to transport resources (e.g., goods, people) from
one location to another while minimizing the cost of transportation.
(Monge’s formulation) where is the
cost of transporting from location to location , and and are the source and target
distributions, respectively.
(Kantorovich’s formulation) where is a joint
distribution over the source and target locations, and is the set of all couplings
between and .
(Kantorovich dual formulation)
Let’s consider the economic interpretation:
is the distribution of
resources at the source location ,
and is the target distribution
of resources at the destination location . The cost function represents the cost of
transporting resources from location to location . And merchants are trying to maximize
their profit by transporting resources.
Let be the
potential function that represents the value of resources at , and in the nation , let denote the cost of buying resources
at and denote the cost of selling resources
at .
Now look back at the setting: The potential
represents the change of the balance of merchants buying resources at
the source location , and the
potential represents the gain
of merchants selling resources at the destination location .
So the Kantorovich dual formulation can be interpreted as maximizing
the profit of merchants by buying resources at the source location and selling them at the destination
location .
And the no arbitrage condition ensures that the profit from buying and
selling resources is not greater than the cost of transportation,
preventing arbitrage opportunities.
Look back at the optimality conditions, we have: 1. The velocity
field
represents the optimal transportation direction of resources, which is
the gradient of the potential function .
This is quite intuitive.
The Hamilton-Jacobi equation describes the
evolution of the potential function over time, where the term can be
interpreted as the cost of transportation.
But here is a problem, the cost of transportation is formulated as
the integral of the instantaneous cost over time, but in the original
optimal transport problem, the cost is formulated as independent of
path, i.e., the cost is only dependent on the source and target
locations, not on the path taken.
To make the two formulations consistent, we impose a constraint:
Let
be the cost of transportation along the path , if then the two formulations give the same optimal transport
plan.
A common situation is where is an
integer.
When , the cost function is
linear and not strictly convex, and the optimal transport plan is not
unique. Any linear parameterization of the line segment connecting and will yield the same cost.
When , the cost function
is strictly convex, and the optimal transport plan is unique. The
optimal path is the straight line connecting and with velocity parameterized by the arc
length.
Lemma: If is
a convex function over ,
then Moreover, if is
strictly convex, then the infimum is achieved by a unique path .
Proof
Use the Jensen’s inequality of integral.
So for the -Wasserstein
distance, the cost function can be seen as the integral of velocity
field over time. where the best path is the straight line connecting
and with constant velocity .
Why Conditioning Works?
Turning back to the rectified flow, the author condition on , let be the product coupling
.
It is easy to see if , then the marginal vector field is exactly the optimal transport
vector field.
But rectified flow set , so it cannot get the optimal transport vector field in 1
step. But the paper proves every time of reflow matching will decrease
the transport cost, and the rectified flow will converge to the optimal
transport vector field by the speed of , where is the number of iterations.
And in the Flow Matching
For Generative Modeling, the author proposed “OT-flow” is
conditioning on only.
Therefore, since the target is a Gaussian kernal and the source is a
Gaussian distribution, they just know the exact form of the optimal
transport vector field and there is no problem of coupling since they only condition
on .
Why Reflow Works?
The -rectified flow is
calculated by: where is the independent coupling of the source and target
distributions. This is the expectation of line direction that pass at time .
For , the -rectified flow is calculated by: where is the coupling induced by the
-rectified flow.
as the transport cost of the coupling .
We assume the cost function has the form and is convex and , we know the cost can be represented
as the infimum of the path cost.
We has the following transform sequence: where is pushforward
.
For , given , is calculated by: And is determined by
the following equation:
Theorem:
Proof:
Note that as the marginal
distribution of , is
the distribution generated by the marginal vector field . They are the same since the
continuity equation is satisfied.