Flow Matching

Reference

Flow Matching For Generative Modeling

Continuous Normalizing Flows

Flow Straight and Fast

Preliminaries

The objective of generative modeling is to learn the underlying distribution of the training data . This is typically achieved by training a generative model to approximate , allowing for the generation of new samples from the learned distribution.

Let be a complete and separable metric space such as . Then, the space of probability measures is defined as the set of all Borel probability measures on .

Generative models usually aims to learn a mapping from a simple distribution (e.g., Gaussian) to the complex data distribution by transforming samples from the simple distribution into samples from the data distribution.

Topology on

Weak Topology

the weak topology on is defined as the coarsest topology making all functionals of the form continuous, where ranges over all the bounded continuous functions on .

Convergence characterization: A sequence converges to in the weak topology if and only if for all bounded continuous functions .

Actually, let be the space of all signed, finite measure on . can be seen as the dual space induced by the pairing:

The relation between a measure on and a linear functional (dual space) on :

is linear and injective.

When is compact, is bijective by the Riesz representation theorem. .

When is locally compact and separable, , where is the compactly support function on , which vanishing on infinity.

And we have the weak* topology on the dual space of

So induces the weak* topology on and its subset .

Wasserstein Metric Topology

When is a compact Riemann manifold , the -Wasserstein distance is a metric on , and induced a topology on . More precisely, for probability distribution with finite -th moment, .

Kantorovich-Rubinstein duality states that for any -Wasserstein distance, we have:

where is the -Wasserstein distance.

This plays an important role in WGAN.

So -Wasserstein convergence is equivalent to convergence over all Lipschitz functions , which is stronger than weak convergence (all functions).

Generally, we have

So -Wasserstein topology is weaker than -Wasserstein metric topology, etc.

But if we assume all the probability measure has compact support, the convergence is equivalent.


Let be the simple distribution, e.g. , and let be the data distribution. The goal is to learn a mapping such that , where is the pushforward measure of under the mapping defined by , where is a Borel set.

But in reality, we only have the samples from the data distribution . So the best we can do is to approximate by where is the sample data and is the Dirac measure centered at .

This can be viewed as a semi-discrete optimal transport problem, where we aim to find a mapping that pushes forward the simple distribution to the empirical distribution .

Data Manifold Hypothesis

We view the real distribution as a distribution over , where is a manifold embedded in and is the dimensionality of the data. More precisely, we assume that the data distribution has support on a lower-dimensional manifold , typically with . This is the data manifold hypothesis, which posits that high-dimensional data often lies on or near a lower-dimensional manifold.

Theoretical Framework vs. Practical Implementation

Theoretical assumption:

The true data distribution is absolutely continuous with respect to the volume measure on the data manifold , i.e., there exists a smooth density function such that:

Practical reality:

We only observe finite samples from the true distribution, leading to the empirical distribution:

Note that each Dirac measure is not absolutely continuous with respect to the volume measure, since has zero volume but .

Regularization assumption: We assume that the empirical distribution converges weakly to the true continuous distribution as :

This weak convergence justifies treating the discrete empirical distribution as an approximation to the underlying continuous distribution.

But we can instead use a Gaussian kernel to smooth the empirical distribution:

where is a Gaussian kernel with bandwidth .

We shall see this is the choice of the flow matching.

(Discrete) Normalizing Flow

We learn a bijective mapping such that the pushforward measure matches the target distribution .

The Normalizing Flow method assumes that can be expressed as a sequence of invertible transformations, typically using neural networks, such that the Jacobian determinant can be efficiently computed.

$$ $$

And each is a neural network parameterized by , i.e., .

Loss and Training

The objective is maximum likelihood estimation or KL divergence, equivalently.

By the change of variables formula, we have:

Continuous Normalizing Flows

Let , it is equivalent to construct a vector field (and corresponding ODE) and take the transform process as particle flows.

Given a time dependent transformation

The flow induces a time-dependent one parameter group of diffeomorphisms: And the trajectory of a particle is denoted as : Note that here particle means it was initially at position at time .

We want a gradual transformation of the input distribution to the target distribution, i.e. And satisfies .

Fix a position and describe the speed of the particles at , we can define the velocity field as:

And the Eulerian description and Lagrangian description are related as follows: Eulerian: Lagrangian:

We shall now consider the conservation of mass, i.e., we want to ensure that the flow preserves the total mass of the distribution.

We thus introduce a density function induced by the flow And denote as .

We require Particularly, .

And we assume is tangent to the boundary , and the flow remains inside of , thus .

It must satisfy the continuity equation to make the probability mass conserve

Proof of continuity equation


Let , By the arbitrariness of , we get the continuity equation:

We say the solve the continuity in distributional sense, which is called a weak solution or distributional solution.

From the Lagrangian description, we know the formula: Take differential on both side, And

Hence

Take it back, and eliminate And . since is a diffeomorphism, we have:

Note: If we further assume the incompressibility condition, i.e. then we have


So we know

Now we focus on a particle is initially at , moving on trajectory

So Thus

Loss and Training

We simulate the flow by using a neural network with parameter .

Similar to the discrete normalizing flow, we use MLE:

Or equivalently, minimize the KL divergence between and .

Flow Matching

Instead of learning , flow matching directly learns the velocity field .

The theoretical objective is: But this cannot be trained directly since we don’t know .

Conditional Flow Matching

conditioning on , CFM use the conditional distribution

Key theorem:

Given conditional probability path satisfying the continuity equation with conditional vector field .

Then the marginal vector field : satisfy the continuity equation and generate the marginal probability path .

And one can choose any conditional probability path as long as And the paper use Gaussian kernal as an approximation.

If the above condition is satisfied, the paper proves that optimizing the objective of CFM is equivalent to optimizing the objective of the original flow model.

And

And you can also condition on .

or both and , the formula is still valid if if the coupling is fixed.

Rectified Flow

In the paper, rectified flow optimized the velocity field by minimizing the following objective:

where .

First, lets translate this into the CFM language.

Conditioning on and , rectified flow defined a deterministic path as the interpolation between and :

And apply the CFM objective: And this is exactly the objective of rectified flow.

Actually, the author of Flow Matching For Generative Modeling also proposed an similar objective, but they use the name “OT-Flow” instead of “rectified flow” (And Gaussian kernal). We shall see the reason in the next section.

Relation to Optimal Transport

What does the rectified flow and conditional flow matching have to do with optimal transport? Let’s recall the method of conditional probability path.

The claim is that to get the vector field , we can use the conditional vector field and do summation over all and . Then we shall get the vector field that satisfies the continuity equation and generates the marginal probability path .

It is not difficult to see that summing over the conditional vector field will yield a valid vector field that satisfies the continuity equation and transport the distribution from to . But what is the property of such vector field?

Let’s think about a principle in traditional physics, the principle of least action. It states that the path taken by a system between two states is the one for which the action is stationary (usually minimized).

In the context of traditional physics, the shortest path between two points is a straight line. But in quantum mechanics, the particle is no longer deterministic as a point, but rather a wave function that can be spread out over a region of space assigning a probability amplitude to each point. But the principle of least action still holds, and the path taken by the particle is the one that minimizes the action.

Benamou-Brenier Theory

The Benamou-Brenier theory states that the optimal transport problem can be reformulated as a dynamic problem, where the optimal transport plan is the one that minimizes the action functional.

In every time , we have a distribution and a velocity field . The kinetic energy is given by the integral of the squared velocity field:

The action functional is then defined as:

And we see the action functional is exactly the total energy of using the velocity field to transport the distribution from to .

Problem (Benamou-Brenier): Given two probability measures and , find the optimal transport plan that minimizes the action functional subject to some conditions. Admissible pairs are pairs such that: - , , - in distributional sense, - , - , i.e., the velocity field is square integrable with respect to the measure . - , i.e., the probability path is absolutely continuous with respect to the weak* topology on the space of probability measures.

Theorem: (Benamou-Brenier)

The -Wasserstein distance between two probability measures and can be expressed as the infimum of the action functional over all admissible pairs

This means that the optimal transport plan is the one that minimizes the action functional, which is equivalent to minimizing the total energy of the velocity field.

Variational Calculus

We introduce a Lagrange multiplier to enforce the continuity equation constraint:

Variation with respect to : Using integration by parts on the second term where the boundary term vanishes due to the assumption that is tangent to the boundary .

we get

Which forces

Variation with respect to : Similarly, using integration by parts on the second term and the third term Note and are fixed, so variations are zero at the endpoints .

we get Setting this to zero for all , we have

Substituting , we have

To summarize, we have three equations for the action functional, which gives us the necessary conditions for optimality:

By the , we can get Kantorovich potential , as the dual potentials for the optimal transport problem.

One can prove it by integral along the optimal pair , and verify the duality optimality. (Actually, this maybe give a proof to the Benamou-Brenier theorem)

Economic interpretation

The optimal transport problem can be interpreted in an economic context, where we want to transport resources (e.g., goods, people) from one location to another while minimizing the cost of transportation.

(Monge’s formulation) where is the cost of transporting from location to location , and and are the source and target distributions, respectively.

(Kantorovich’s formulation) where is a joint distribution over the source and target locations, and is the set of all couplings between and .

(Kantorovich dual formulation)

Let’s consider the economic interpretation:

is the distribution of resources at the source location , and is the target distribution of resources at the destination location . The cost function represents the cost of transporting resources from location to location . And merchants are trying to maximize their profit by transporting resources.

Let be the potential function that represents the value of resources at , and in the nation , let denote the cost of buying resources at and denote the cost of selling resources at .

Now look back at the setting: The potential represents the change of the balance of merchants buying resources at the source location , and the potential represents the gain of merchants selling resources at the destination location .

So the Kantorovich dual formulation can be interpreted as maximizing the profit of merchants by buying resources at the source location and selling them at the destination location .

And the no arbitrage condition ensures that the profit from buying and selling resources is not greater than the cost of transportation, preventing arbitrage opportunities.

Look back at the optimality conditions, we have: 1. The velocity field represents the optimal transportation direction of resources, which is the gradient of the potential function .

This is quite intuitive.

  1. The Hamilton-Jacobi equation describes the evolution of the potential function over time, where the term can be interpreted as the cost of transportation.

But here is a problem, the cost of transportation is formulated as the integral of the instantaneous cost over time, but in the original optimal transport problem, the cost is formulated as independent of path, i.e., the cost is only dependent on the source and target locations, not on the path taken.

To make the two formulations consistent, we impose a constraint:

Let

be the cost of transportation along the path , if then the two formulations give the same optimal transport plan.

A common situation is where is an integer.

When , the cost function is linear and not strictly convex, and the optimal transport plan is not unique. Any linear parameterization of the line segment connecting and will yield the same cost.

When , the cost function is strictly convex, and the optimal transport plan is unique. The optimal path is the straight line connecting and with velocity parameterized by the arc length.


Lemma: If is a convex function over , then Moreover, if is strictly convex, then the infimum is achieved by a unique path .

Proof

Use the Jensen’s inequality of integral.


So for the -Wasserstein distance, the cost function can be seen as the integral of velocity field over time. where the best path is the straight line connecting and with constant velocity .

Why Conditioning Works?

Turning back to the rectified flow, the author condition on , let be the product coupling .

It is easy to see if , then the marginal vector field is exactly the optimal transport vector field.

But rectified flow set , so it cannot get the optimal transport vector field in 1 step. But the paper proves every time of reflow matching will decrease the transport cost, and the rectified flow will converge to the optimal transport vector field by the speed of , where is the number of iterations.

And in the Flow Matching For Generative Modeling, the author proposed “OT-flow” is conditioning on only. Therefore, since the target is a Gaussian kernal and the source is a Gaussian distribution, they just know the exact form of the optimal transport vector field and there is no problem of coupling since they only condition on .

Why Reflow Works?

The -rectified flow is calculated by: where is the independent coupling of the source and target distributions. This is the expectation of line direction that pass at time .

For , the -rectified flow is calculated by: where is the coupling induced by the -rectified flow.

as the transport cost of the coupling .

We assume the cost function has the form and is convex and , we know the cost can be represented as the infimum of the path cost.

We has the following transform sequence: where is pushforward .

For , given , is calculated by: And is determined by the following equation:

Theorem:

Proof:

Note that as the marginal distribution of , is the distribution generated by the marginal vector field . They are the same since the continuity equation is satisfied.

Proof

For all ,

Solves the weak form of the continuity equation.


This is the key construct in the rectified flow.

Straightness

Define as the straightness of the flow .

Theorem 3.7 in Flow Straight and Fast states:


Flow Matching
https://notdesigned.github.io/2025/08/09/Flow-Matching/
Author
Luocheng Liang
Posted on
August 9, 2025
Licensed under