Flow-based generative model

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow,[1][2][3] which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution, and applying the flow transformation.

In contrast, many alternative generative modeling methods such as variational autoencoder (VAE) and generative adversarial network do not explicitly represent the likelihood function.

Method

[Figure: Scheme for normalizing flows]

Let $z_0$ be a (possibly multivariate) random variable with distribution $p_0(z_0)$.

For $i = 1, \dots, K$, let $z_i = f_i(z_{i-1})$ be a sequence of random variables transformed from $z_0$. The functions $f_1, \dots, f_K$ should be invertible, i.e. the inverse function $f_i^{-1}$ exists. The final output $z_K$ models the target distribution.

The log likelihood of $z_K$ is (see derivation):

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log\left|\det\frac{df_i(z_{i-1})}{dz_{i-1}}\right|$$

Learning probability distributions by differentiating such log Jacobians originated in the Infomax (maximum likelihood) approach to ICA,[4] which forms a single-layer ($K=1$) flow-based model. Relatedly, the single-layer precursor of conditional generative flows appeared in the work of Roth and Baram.[5]

To efficiently compute the log likelihood, the functions $f_1, \dots, f_K$ should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functions $f_1, \dots, f_K$ are modeled using deep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,[6] RealNVP,[7] and Glow.[8]
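As a minimal sketch of the change-of-variables computation, the following evaluates the log likelihood of a two-layer flow by inverting the layers and subtracting the accumulated log Jacobian terms. The scalar affine layers and their parameter values are assumptions chosen for illustration (they stand in for neural-network layers), and `flow_logpdf` is a hypothetical helper name.

```python
import math

def std_normal_logpdf(z):
    """Log density of the base distribution p_0 = N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

# Two invertible scalar layers f_i(z) = a_i * z + b_i with a_i != 0 (illustrative values).
layers = [(2.0, 1.0), (1.5, -3.0)]

def flow_logpdf(x):
    """log p_K(x) = log p_0(z_0) - sum_i log|df_i/dz_{i-1}|."""
    z, sum_logdet = x, 0.0
    for a, b in reversed(layers):
        z = (z - b) / a                  # z_{i-1} = f_i^{-1}(z_i)
        sum_logdet += math.log(abs(a))   # the Jacobian of f_i is the scalar a_i
    return std_normal_logpdf(z) - sum_logdet

# The composed map is x = 3 * z_0 - 1.5, so x ~ N(-1.5, 3^2); compare with that density.
x = 1.7
expected = std_normal_logpdf((x + 1.5) / 3.0) - math.log(3.0)
```

Because the composition of the two layers is itself affine, the flow density can be checked against the known Gaussian density, which is what `expected` holds.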

Derivation of log likelihood

Consider $z_1$ and $z_0$. Note that $z_0 = f_1^{-1}(z_1)$.

By the change of variable formula, the distribution of $z_1$ is:

$$p_1(z_1) = p_0(z_0)\left|\det\frac{df_1^{-1}(z_1)}{dz_1}\right|$$

where $\det\frac{df_1^{-1}(z_1)}{dz_1}$ is the determinant of the Jacobian matrix of $f_1^{-1}$.

By the inverse function theorem:

$$p_1(z_1) = p_0(z_0)\left|\det\left(\frac{df_1(z_0)}{dz_0}\right)^{-1}\right|$$

By the identity $\det(A^{-1}) = \det(A)^{-1}$ (where $A$ is an invertible matrix), we have:

$$p_1(z_1) = p_0(z_0)\left|\det\frac{df_1(z_0)}{dz_0}\right|^{-1}$$

The log likelihood is thus:

$$\log p_1(z_1) = \log p_0(z_0) - \log\left|\det\frac{df_1(z_0)}{dz_0}\right|$$

In general, the above applies to any $z_i$ and $z_{i-1}$. Since $\log p_i(z_i)$ is equal to $\log p_{i-1}(z_{i-1})$ minus a non-recursive term, we can infer by induction that:

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log\left|\det\frac{df_i(z_{i-1})}{dz_{i-1}}\right|$$

Training method

As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting by $p_\theta$ the model's likelihood and by $p^*$ the target distribution to learn, the (forward) KL-divergence is:

$$D_{KL}[p^*(x) \,\|\, p_\theta(x)] = -\mathbb{E}_{p^*(x)}[\log p_\theta(x)] + \mathbb{E}_{p^*(x)}[\log p^*(x)]$$

The second term on the right-hand side of the equation is the negative entropy of the target distribution and is independent of the parameter $\theta$ we want the model to learn, which leaves only the expected negative log-likelihood under the target distribution to minimize. This intractable term can be approximated by a Monte Carlo estimate. Indeed, if we have a dataset $\{x_i\}_{i=1}^N$ of samples each independently drawn from the target distribution $p^*(x)$, then this term can be estimated as:

$$-\hat{\mathbb{E}}_{p^*(x)}[\log p_\theta(x)] = -\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i)$$

Therefore, the learning objective

$$\arg\min_\theta\, D_{KL}[p^*(x) \,\|\, p_\theta(x)]$$

is replaced by

$$\arg\max_\theta\, \sum_{i=1}^{N}\log p_\theta(x_i)$$

In other words, minimizing the Kullback–Leibler divergence between the model's likelihood and the target distribution is equivalent to maximizing the model likelihood under observed samples of the target distribution.[9]

A pseudocode for training normalizing flows is as follows:[10]

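  • INPUT. dataset $x_{1:n}$, normalizing flow model $f_\theta(\cdot), p_0$.
  • SOLVE. $\max_\theta \sum_j \log p_\theta(x_j)$ by gradient ascent (equivalently, gradient descent on the negative log-likelihood)
  • RETURN. $\hat\theta$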
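The training loop can be sketched for the simplest possible flow. This is an illustrative assumption rather than a realistic architecture: a single scalar affine layer $x = e^{s} z + b$ with a standard normal base distribution, synthetic data, hand-derived gradients, and an arbitrarily chosen learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=2000)   # synthetic samples from the target p*

# One-layer flow x = f_theta(z) = exp(s) * z + b, with z ~ N(0, 1) and theta = (s, b).
s, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    z = (data - b) * np.exp(-s)            # z = f_theta^{-1}(x)
    # log p_theta(x) = -0.5 * z**2 - s + const, so the gradients of its mean are:
    grad_s = np.mean(z * z - 1.0)
    grad_b = np.mean(z) * np.exp(-s)
    s += learning_rate * grad_s            # ascent step on the mean log-likelihood
    b += learning_rate * grad_b

fitted_mean, fitted_std = b, np.exp(s)
```

For this model the maximum-likelihood solution is known in closed form (the sample mean and the uncorrected sample standard deviation), which gives a simple way to check that the loop has converged.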

Variants

Planar Flow

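The earliest example.[11] Fix some activation function $h$, and let $\theta = (u, w, b)$ with the appropriate dimensions, then

$$x = f_\theta(z) = z + u\,h(\langle w, z\rangle + b)$$

The inverse $f_\theta^{-1}$ has no closed-form solution in general.

The absolute value of the Jacobian determinant is $|\det(I + h'(\langle w, z\rangle + b)\,u w^T)| = |1 + h'(\langle w, z\rangle + b)\langle u, w\rangle|$.

For the flow to be invertible everywhere, this determinant must be nonzero everywhere. For example, $h = \tanh$ and $\langle u, w\rangle > -1$ satisfies the requirement.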
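The following sketch checks the determinant identity numerically for $h = \tanh$: the full Jacobian $I + h'(\langle w, z\rangle + b)\,u w^T$ and the scalar shortcut $1 + h'(\langle w, z\rangle + b)\langle u, w\rangle$ should give the same log-determinant. The dimension, the random parameters, and the sign flip used to enforce $\langle u, w\rangle > -1$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
u, w, b = rng.normal(size=d), rng.normal(size=d), 0.3
# Illustrative way to satisfy the invertibility condition <u, w> > -1.
if u @ w <= -1.0:
    u = -u
z = rng.normal(size=d)

a = w @ z + b
h, h_prime = np.tanh(a), 1.0 - np.tanh(a) ** 2

x = z + u * h                                         # planar flow f_theta(z)
J = np.eye(d) + h_prime * np.outer(u, w)              # full d-by-d Jacobian
logdet_full = np.log(abs(np.linalg.det(J)))           # cubic cost in d
logdet_fast = np.log(abs(1.0 + h_prime * (u @ w)))    # linear cost in d
```

The shortcut follows from the matrix determinant lemma, which is why planar flows have cheap likelihood evaluation even though their inverse is not available in closed form.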

Nonlinear Independent Components Estimation (NICE)

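Let $x, z \in \mathbb{R}^{2n}$ be even-dimensional, and split them in the middle.[6] Then the normalizing flow functions are

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}$$

where $m_\theta$ is any neural network with weights $\theta$.

$f_\theta^{-1}$ is just $z_1 = x_1,\ z_2 = x_2 - m_\theta(x_1)$, and the Jacobian determinant is just 1, that is, the flow is volume-preserving.

When $n = 1$, this is seen as a curvy shearing along the $x_2$ direction.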

Real Non-Volume Preserving (Real NVP)

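The Real Non-Volume Preserving model generalizes the NICE model by:[7]

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ e^{s_\theta(z_1)} \odot z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}$$

Its inverse is $z_1 = x_1,\ z_2 = e^{-s_\theta(x_1)} \odot (x_2 - m_\theta(x_1))$, and its Jacobian determinant is $\prod_{i=1}^{n} e^{s_\theta(z_1)_i}$. The NICE model is recovered by setting $s_\theta = 0$. Since the Real NVP map keeps the first and second halves of the vector $x$ separate, it is usually required to add a permutation $(x_1, x_2) \mapsto (x_2, x_1)$ after every Real NVP layer.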
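A minimal sketch of one affine coupling layer, with its inverse and log Jacobian determinant, is given below. The functions `s_net` and `m_net` are hypothetical stand-ins for the neural networks $s_\theta$ and $m_\theta$ (here fixed random linear maps, with a tanh on the scale), chosen only so that the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
# Hypothetical stand-ins for the networks s_theta and m_theta.
Ws, Wm = rng.normal(size=(n, n)), rng.normal(size=(n, n))
s_net = lambda h: np.tanh(Ws @ h)
m_net = lambda h: Wm @ h

def forward(z):
    """z -> x, returning also log|det dx/dz| = sum_i s_theta(z_1)_i."""
    z1, z2 = z[:n], z[n:]
    x2 = np.exp(s_net(z1)) * z2 + m_net(z1)
    return np.concatenate([z1, x2]), np.sum(s_net(z1))

def inverse(x):
    """x -> z, using only forward evaluations of s_net and m_net."""
    x1, x2 = x[:n], x[n:]
    z2 = np.exp(-s_net(x1)) * (x2 - m_net(x1))
    return np.concatenate([x1, z2])

z = rng.normal(size=2 * n)
x, logdet = forward(z)
z_recovered = inverse(x)
```

Setting `s_net` to return zeros turns this into the additive (NICE) coupling, for which `logdet` is 0.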

Generative Flow (Glow)

In the generative flow model,[8] each layer has 3 parts:

  • channel-wise affine transform $y_{cij} = s_c(x_{cij} + b_c)$, with Jacobian determinant $\prod_c s_c^{HW}$.
  • invertible 1x1 convolution $z_{cij} = \sum_{c'} K_{cc'}\, y_{c'ij}$, with Jacobian determinant $\det(K)^{HW}$. Here $K$ is any invertible matrix.
  • Real NVP, with Jacobian as described in Real NVP.

The idea of using the invertible 1x1 convolution is to mix all channels through a learned linear map, instead of merely permuting the first and second half, as in Real NVP.
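To illustrate the Jacobian determinant of the invertible 1x1 convolution, the sketch below applies a channel-mixing matrix $K$ at every pixel and compares $HW \log|\det K|$ with the log-determinant of the equivalent full linear map on the flattened tensor. The tensor sizes and the random $K$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 2, 2
K = rng.normal(size=(C, C))                 # channel-mixing matrix (invertible with probability 1)
y = rng.normal(size=(C, H, W))

z = np.einsum("cd,dij->cij", K, y)          # z_cij = sum_c' K_cc' y_c'ij

logdet_fast = H * W * np.log(abs(np.linalg.det(K)))

# The same map written as one big matrix acting on y.reshape(-1) (channel-major order).
full = np.kron(K, np.eye(H * W))
logdet_full = np.linalg.slogdet(full)[1]
```

The shortcut matters because the full matrix has $(CHW)^2$ entries, while $K$ has only $C^2$.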

Masked Autoregressive Flow (MAF)

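An autoregressive model of a distribution on $\mathbb{R}^n$ is defined as the following stochastic process:[12]

$$\begin{aligned} x_1 &\sim N(\mu_1, \sigma_1^2) \\ x_2 &\sim N(\mu_2(x_1), \sigma_2(x_1)^2) \\ &\;\;\vdots \\ x_n &\sim N(\mu_n(x_{1:n-1}), \sigma_n(x_{1:n-1})^2) \end{aligned}$$

where $\mu_i: \mathbb{R}^{i-1} \to \mathbb{R}$ and $\sigma_i: \mathbb{R}^{i-1} \to (0, \infty)$ are fixed functions that define the autoregressive model.

By the reparameterization trick, the autoregressive model is generalized to a normalizing flow:

$$\begin{aligned} x_1 &= \mu_1 + \sigma_1 z_1 \\ x_2 &= \mu_2(x_1) + \sigma_2(x_1) z_2 \\ &\;\;\vdots \\ x_n &= \mu_n(x_{1:n-1}) + \sigma_n(x_{1:n-1}) z_n \end{aligned}$$

The autoregressive model is recovered by setting $z \sim N(0, I_n)$.

The forward mapping is slow (because it is sequential), but the backward mapping is fast (because it is parallel).

The Jacobian matrix is lower-triangular, so the Jacobian determinant is $\sigma_1 \sigma_2(x_1) \cdots \sigma_n(x_{1:n-1})$.

Reversing the two maps $f_\theta$ and $f_\theta^{-1}$ of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.[13]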
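The asymmetry between the two directions can be seen in a small sketch. The conditioners below are hypothetical: strictly lower-triangular linear maps stand in for the masked networks, so that the $i$-th mean and scale depend only on the preceding coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Hypothetical autoregressive conditioners: strictly lower-triangular weights,
# so mu(x)[i] and sigma(x)[i] depend only on x[:i].
A = np.tril(rng.normal(size=(n, n)), k=-1)
B = np.tril(rng.normal(size=(n, n)), k=-1)
mu = lambda x: A @ x
sigma = lambda x: np.exp(np.tanh(B @ x))

def forward(z):
    """z -> x: sequential, since x_i needs x_1, ..., x_{i-1}."""
    x = np.zeros(n)
    for i in range(n):
        x[i] = mu(x)[i] + sigma(x)[i] * z[i]
    return x

def inverse(x):
    """x -> z: all coordinates at once."""
    return (x - mu(x)) / sigma(x)

z = rng.normal(size=n)
x = forward(z)
# dx/dz is lower-triangular with diagonal sigma_i, so its log-determinant is a sum of logs.
log_abs_det = np.sum(np.log(sigma(x)))
```

Density evaluation only needs `inverse` and `log_abs_det`, which is why this direction is the cheap one for MAF.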

Continuous Normalizing Flow (CNF)

Instead of constructing a flow by function composition, another approach is to formulate the flow as a continuous-time dynamic.[14][15] Let $z_0$ be the latent variable with distribution $p(z_0)$. Map this latent variable to data space with the following flow function:

$$x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t)\,dt$$

where $f$ is an arbitrary function and can be modeled with e.g. neural networks.

The inverse function is then naturally:[14]

$$z_0 = F^{-1}(x) = z_T + \int_T^0 f(z_t, t)\,dt = z_T - \int_0^T f(z_t, t)\,dt$$

And the log-likelihood of $x$ can be found as:[14]

$$\log(p(x)) = \log(p(z_0)) - \int_0^T \mathrm{Tr}\left[\frac{\partial f}{\partial z_t}\right] dt$$

Since the trace depends only on the diagonal of the Jacobian $\partial_{z_t} f$, this allows a "free-form" Jacobian.[16] Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be only upper- or lower-triangular, so that the Jacobian determinant can be evaluated efficiently.
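As a rough illustration of the continuous change-of-variables formula, the sketch below integrates a one-dimensional flow together with the running integral of the trace term. The dynamics $f(z, t) = \tanh(z)$, the fixed-step Euler integrator, and the step count are assumptions chosen for simplicity; practical implementations use adaptive ODE solvers.

```python
import math

def f(z, t):
    return math.tanh(z)                  # hypothetical 1-D dynamics

def trace_df_dz(z, t):
    return 1.0 - math.tanh(z) ** 2       # trace of the 1x1 Jacobian df/dz

def flow_with_trace(z0, T=1.0, steps=2000):
    """Euler-integrate z_t together with the integral of Tr[df/dz_t] over [0, T]."""
    z, trace_integral, dt = z0, 0.0, T / steps
    for k in range(steps):
        t = k * dt
        trace_integral += trace_df_dz(z, t) * dt
        z += f(z, t) * dt
    return z, trace_integral

def std_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

z0 = 0.4
x, trace_integral = flow_with_trace(z0)
log_p_x = std_normal_logpdf(z0) - trace_integral

# For this particular f the flow satisfies sinh(z_T) = exp(T) * sinh(z_0), hence
# log|dx/dz0| = T + log cosh(z0) - log cosh(x), which the trace integral approximates.
exact_log_jac = 1.0 + math.log(math.cosh(z0)) - math.log(math.cosh(x))
```

In one dimension the trace is the whole Jacobian, so the trace integral can be compared directly with the log-derivative of the closed-form flow map.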

The trace can be estimated by "Hutchinson's trick":[17][18]

Given any matrix $W \in \mathbb{R}^{n \times n}$, and any random $u \in \mathbb{R}^n$ with $E[uu^T] = I$, we have $E[u^T W u] = \mathrm{tr}(W)$. (Proof: expand the expectation directly.)

Usually, the random vector is sampled from $N(0, I)$ (normal distribution) or $\{\pm 1\}^n$ (Rademacher distribution).
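A short numerical sketch of the estimator follows. The matrix size, sample count, and random seed are arbitrary assumptions; the Rademacher probes use independent $\pm 1$ entries, which satisfy $E[uu^T] = I$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_samples = 6, 200_000
W = rng.normal(size=(n, n))

u_gauss = rng.normal(size=(num_samples, n))                # rows u ~ N(0, I)
u_rad = rng.choice([-1.0, 1.0], size=(num_samples, n))     # rows of independent +/-1 signs

# u^T W u for every row, then the sample mean as the trace estimate.
est_gauss = np.mean(np.einsum("si,ij,sj->s", u_gauss, W, u_gauss))
est_rad = np.mean(np.einsum("si,ij,sj->s", u_rad, W, u_rad))
exact = np.trace(W)
```

In a continuous normalizing flow the product $u^T W u$ would be computed with vector-Jacobian products rather than by forming $W$ explicitly; the explicit matrix is used here only to make the comparison with `exact` possible.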

When f is implemented as a neural network, neural ODE methods[19] would be needed. Indeed, CNF was first proposed in the same paper that proposed neural ODE.

There are two main deficiencies of CNF. The first is that a continuous flow must be a homeomorphism, and thus preserves orientation and ambient isotopy (for example, it is impossible to flip a left hand into a right hand by continuously deforming space, and it is impossible to turn a sphere inside out or undo a knot). The second is that the learned flow $f$ might be ill-behaved, due to degeneracy (that is, there are infinitely many possible $f$ that all solve the same problem).

By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just like how one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE".[20]

Any homeomorphism of $\mathbb{R}^n$ can be approximated by a neural ODE operating on $\mathbb{R}^{2n+1}$, proved by combining the Whitney embedding theorem for manifolds and the universal approximation theorem for neural networks.[21]

To regularize the flow $f$, one can impose regularization losses. The paper [17] proposed the following regularization loss based on optimal transport theory:

$$\lambda_K \int_0^T \|f(z_t, t)\|^2\,dt + \lambda_J \int_0^T \|\nabla_z f(z_t, t)\|_F^2\,dt$$

where $\lambda_K, \lambda_J > 0$ are hyperparameters. The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space. Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.

Flows on manifolds

When a probabilistic flow transforms a distribution on an $m$-dimensional smooth manifold embedded in $\mathbb{R}^n$, where $m < n$, and where the transformation is specified as a function, $\mathbb{R}^n \to \mathbb{R}^n$, the scaling factor between the source and transformed PDFs is not given by the naive computation of the determinant of the $n$-by-$n$ Jacobian (which is zero), but instead by the determinant(s) of one or more suitably defined $m$-by-$m$ matrices. This section is an interpretation of the tutorial in the appendix of Sorrenson et al. (2023),[22] where the more general case of non-isometrically embedded Riemann manifolds is also treated. Here we restrict attention to isometrically embedded manifolds.

As running examples of manifolds with smooth, isometric embedding in $\mathbb{R}^n$ we shall use the unit sphere, $\mathbb{S}^{n-1} = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{x}'\mathbf{x} = 1\}$, and the simplex, $\Delta^{n-1} = \{\mathbf{p} \in \mathbb{R}^n : p_i > 0, \sum_{i=1}^n p_i = 1\}$.

As a first example of a spherical manifold flow transform, consider the normalized linear transform, which radially projects onto the unit sphere the output of an invertible linear transform, parametrized by the $n$-by-$n$ invertible matrix $\mathbf{M}$:

$$f_\text{lin}(\mathbf{x}; \mathbf{M}) = \frac{\mathbf{M}\mathbf{x}}{\|\mathbf{M}\mathbf{x}\|}$$

In full Euclidean space, $f_\text{lin}: \mathbb{R}^n \to \mathbb{R}^n$ is not invertible, but if we restrict the domain and co-domain to the unit sphere, then $f_\text{lin}: \mathbb{S}^{n-1} \to \mathbb{S}^{n-1}$ is invertible (more specifically it is a bijection and a homeomorphism and a diffeomorphism), with inverse $f_\text{lin}(\cdot\,; \mathbf{M}^{-1})$. The Jacobian of $f_\text{lin}: \mathbb{R}^n \to \mathbb{R}^n$, at $\mathbf{y} = f_\text{lin}(\mathbf{x}; \mathbf{M})$, is $\|\mathbf{M}\mathbf{x}\|^{-1}(\mathbf{I}_n - \mathbf{y}\mathbf{y}')\mathbf{M}$, which has rank $n-1$ and determinant of zero; while, as explained in the subsections below, the factor relating source and transformed densities is: $\|\mathbf{M}\mathbf{x}\|^{-n}\,|\det\mathbf{M}|$.

Differential volume ratio

For $m < n$, let $\mathcal{M} \subset \mathbb{R}^n$ be an $m$-dimensional manifold with a smooth, isometric embedding into $\mathbb{R}^n$. Let $f: \mathbb{R}^n \to \mathbb{R}^n$ be a smooth flow transform with range restricted to $\mathcal{M}$. Let $\mathbf{x} \in \mathcal{M}$ be sampled from a distribution with density $P_X$. Let $\mathbf{y} = f(\mathbf{x})$, with resultant (pushforward) density $P_Y$. Let $U \subset \mathcal{M}$ be a small, convex region containing $\mathbf{x}$ and let $V = f(U)$ be its image, which contains $\mathbf{y}$; then by conservation of probability mass:

$$P_X(\mathbf{x})\,\mathrm{volume}(U) \approx P_Y(\mathbf{y})\,\mathrm{volume}(V)$$

where volume (for very small regions) is given by Lebesgue measure in $m$-dimensional tangent space. By making the regions infinitesimally small, the factor relating the two densities is the ratio of volumes, which we term the differential volume ratio.

To obtain concrete formulas for volume on the $m$-dimensional manifold, we construct $U$ by mapping an $m$-dimensional rectangle in (local) coordinate space to the manifold via a smooth embedding function: $\mathbb{R}^m \to \mathbb{R}^n$. At very small scale, the embedding function becomes essentially linear so that $U$ is a parallelotope (multidimensional generalization of a parallelogram). Similarly, the flow transform, $f$, becomes linear, so that the image, $V = f(U)$, is also a parallelotope. In $\mathbb{R}^m$, we can represent an $m$-dimensional parallelotope with an $m$-by-$m$ matrix whose column-vectors are a set of edges (meeting at a common vertex) that span the parallelotope. The volume is given by the absolute value of the determinant of this matrix. If more generally (as is the case here), an $m$-dimensional parallelotope is embedded in $\mathbb{R}^n$, it can be represented with a (tall) $n$-by-$m$ matrix, say $\mathbf{V}$. Denoting the parallelotope as $/\mathbf{V}/$, its volume is then given by the square root of the Gram determinant:

$$\mathrm{volume}\,/\mathbf{V}/ = \sqrt{|\det(\mathbf{V}'\mathbf{V})|}$$
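The Gram-determinant volume can be checked on a small case. The sketch below uses an arbitrarily chosen parallelogram ($m = 2$) embedded in $\mathbb{R}^3$, where the area is also given by the norm of the cross product of the two edges.

```python
import numpy as np

# Columns of V are the two edges of a parallelogram embedded in R^3 (illustrative values).
V = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 2.0]])

# Square root of the Gram determinant of the tall 3-by-2 matrix.
vol_gram = np.sqrt(abs(np.linalg.det(V.T @ V)))

# Cross-check: in R^3 the area of a parallelogram is the norm of the cross product of its edges.
vol_cross = np.linalg.norm(np.cross(V[:, 0], V[:, 1]))
```

Here the Gram matrix is $\begin{bmatrix}2 & 2\\ 2 & 8\end{bmatrix}$ with determinant 12, so both expressions evaluate to $\sqrt{12}$.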

In the sections below, we show various ways to use this volume formula to derive the differential volume ratio.

Simplex flow

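As a first example, we develop expressions for the differential volume ratio of a simplex flow, $\mathbf{q} = f(\mathbf{p})$, where $\mathbf{p}, \mathbf{q} \in \mathcal{M} = \Delta^{n-1}$. Define the embedding function:

$$e: \tilde{\mathbf{p}} = (p_1, \dots, p_{n-1}) \mapsto \mathbf{p} = \left(p_1, \dots, p_{n-1}, 1 - \sum_{i=1}^{n-1} p_i\right)$$

which maps a conveniently chosen, $(n-1)$-dimensional representation, $\tilde{\mathbf{p}}$, to the embedded manifold. The $n$-by-$(n-1)$ Jacobian is $\mathbf{E} = \begin{bmatrix} \mathbf{I}_{n-1} \\ -\mathbf{1}' \end{bmatrix}$. To define $U$, the differential volume element at the transformation input ($\mathbf{p} \in \Delta^{n-1}$), we start with a rectangle in $\tilde{\mathbf{p}}$-space, having (signed) differential side-lengths, $dp_1, \dots, dp_{n-1}$, from which we form the square diagonal matrix $\mathbf{D}$, the columns of which span the rectangle. At very small scale, we get $U = e(\mathbf{D}) = /\mathbf{E}\mathbf{D}/$, with:

$$\mathrm{volume}(U) = \sqrt{|\det(\mathbf{D}\mathbf{E}'\mathbf{E}\mathbf{D})|} = \sqrt{|\det(\mathbf{E}'\mathbf{E})|}\,|\det\mathbf{D}| = \sqrt{n}\prod_{i=1}^{n-1}|dp_i|$$

[Figure: For the 1-simplex (blue) embedded in $\mathbb{R}^2$, when we pull back Lebesgue measure from tangent space (parallel to the simplex), via the embedding $p_1 \mapsto (p_1, 1 - p_1)$, with Jacobian $\mathbf{E} = \begin{bmatrix} 1 & -1 \end{bmatrix}'$, a scaling factor of $\sqrt{\mathbf{E}'\mathbf{E}} = \sqrt{2}$ results.]

To understand the geometric interpretation of the factor $\sqrt{n}$, see the example for the 1-simplex in the figure.

The differential volume element at the transformation output ($\mathbf{q} \in \Delta^{n-1}$) is the parallelotope, $V = f(U) = /\mathbf{F_p}\mathbf{E}\mathbf{D}/$, where $\mathbf{F_p}$ is the $n$-by-$n$ Jacobian of $f$ at $\mathbf{p} = e(\tilde{\mathbf{p}})$. Its volume is:

$$\mathrm{volume}(V) = \sqrt{|\det(\mathbf{D}\mathbf{E}'\mathbf{F_p}'\mathbf{F_p}\mathbf{E}\mathbf{D})|} = \sqrt{|\det(\mathbf{E}'\mathbf{F_p}'\mathbf{F_p}\mathbf{E})|}\,|\det\mathbf{D}|$$

so that the factor $|\det\mathbf{D}|$ cancels in the volume ratio, which can now already be numerically evaluated. It can however be rewritten in a sometimes more convenient form by also introducing the representation function, $r: \mathbf{p} \mapsto \tilde{\mathbf{p}}$, which simply extracts the first $(n-1)$ components. The Jacobian is $\mathbf{R} = \begin{bmatrix} \mathbf{I}_{n-1} & \mathbf{0} \end{bmatrix}$. Observe that, since $e \circ r \circ f = f$, the chain rule for function composition gives: $\mathbf{E}\mathbf{R}\mathbf{F_p} = \mathbf{F_p}$. By plugging this expansion into the above Gram determinant and then refactoring it as a product of determinants of square matrices, we can extract the factor $\sqrt{|\det(\mathbf{E}'\mathbf{E})|} = \sqrt{n}$, which now also cancels in the ratio, which finally simplifies to the determinant of the Jacobian of the "sandwiched" flow transformation, $r \circ f \circ e$:

$$R_f^\Delta(\mathbf{p}) = \frac{\mathrm{volume}(V)}{\mathrm{volume}(U)} = |\det(\mathbf{R}\mathbf{F_p}\mathbf{E})|$$

which, if $\mathbf{p} \sim P_\mathbf{P}$, can be used to derive the pushforward density after a change of variables, $\mathbf{q} = f(\mathbf{p})$:

$$P_\mathbf{Q}(\mathbf{q}) = \frac{P_\mathbf{P}(\mathbf{p})}{R_f^\Delta(\mathbf{p})}, \quad\text{where}\quad \mathbf{p} = f^{-1}(\mathbf{q})$$

This formula is valid only because the simplex is flat and the Jacobian, $\mathbf{E}$, is constant. The more general case for curved manifolds is discussed below, after we present two concrete examples of simplex flow transforms.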

Simplex calibration transform

A calibration transform, $f_\text{cal}: \Delta^{n-1} \to \Delta^{n-1}$, which is sometimes used in machine learning for post-processing of the (class posterior) outputs of a probabilistic $n$-class classifier,[23][24] uses the softmax function to renormalize categorical distributions after scaling and translation of the input distributions in log-probability space. For $\mathbf{p}, \mathbf{q} \in \Delta^{n-1}$ and with parameters, $a \ne 0$ and $\mathbf{c} \in \mathbb{R}^n$, the transform can be specified as:

$$\mathbf{q} = f_\text{cal}(\mathbf{p}; a, \mathbf{c}) = \mathrm{softmax}(a^{-1}\log\mathbf{p} + \mathbf{c}) \;\iff\; \mathbf{p} = f_\text{cal}^{-1}(\mathbf{q}; a, \mathbf{c}) = \mathrm{softmax}(a\log\mathbf{q} - a\mathbf{c})$$

where the log is applied elementwise. After some algebra the differential volume ratio can be expressed as:

$$R_\text{cal}^\Delta(\mathbf{p}; a, \mathbf{c}) = |\det(\mathbf{R}\mathbf{F_p}\mathbf{E})| = |a|^{1-n}\prod_{i=1}^{n}\frac{q_i}{p_i}$$

  • This result can also be obtained by factoring the density of the SGB distribution,[25] which is obtained by sending Dirichlet variates through $f_\text{cal}$.
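The closed-form ratio $|a|^{1-n}\prod_i q_i/p_i$ can be checked numerically against the determinant of the Jacobian of the sandwiched map (embed the first $n-1$ coordinates into the simplex, apply the transform, drop the last coordinate). In the sketch below the parameter values, the test point, and the central-difference Jacobian are illustrative assumptions.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def f_cal(p, a, c):
    """Calibration transform: softmax(log(p) / a + c)."""
    return softmax(np.log(p) / a + c)

n, a = 4, 1.7
c = np.array([0.5, -0.2, 0.1, 0.0])
p = np.array([0.1, 0.2, 0.3, 0.4])
q = f_cal(p, a, c)

embed = lambda pt: np.append(pt, 1.0 - pt.sum())       # e: first n-1 coordinates -> simplex
sandwiched = lambda pt: f_cal(embed(pt), a, c)[:-1]    # r o f o e

# Central-difference Jacobian of the sandwiched map at p~ = p[:-1].
eps = 1e-6
J = np.stack([(sandwiched(p[:-1] + eps * d) - sandwiched(p[:-1] - eps * d)) / (2 * eps)
              for d in np.eye(n - 1)], axis=1)

ratio_numeric = abs(np.linalg.det(J))
ratio_closed = abs(a) ** (1 - n) * np.prod(q / p)
```

The same pattern (finite differences or automatic differentiation on the sandwiched map) can be used to sanity-check other hand-derived simplex flow ratios.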

While calibration transforms are most often trained as discriminative models, the reinterpretation here as a probabilistic flow allows also the design of generative calibration models based on this transform. When used for calibration, the restriction a>0 can be imposed to prevent direction reversal in log-probability space. With the additional restriction 𝐜=0, this transform (with discriminative training) is known in machine learning as temperature scaling.

Generalized calibration transform

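The above calibration transform can be generalized to $f_\text{gcal}: \Delta^{n-1} \to \Delta^{n-1}$, with parameters $\mathbf{c} \in \mathbb{R}^n$ and $\mathbf{A}$ $n$-by-$n$ invertible:[26]

$$\mathbf{q} = f_\text{gcal}(\mathbf{p}; \mathbf{A}, \mathbf{c}) = \mathrm{softmax}(\mathbf{A}\log\mathbf{p} + \mathbf{c}), \quad\text{subject to}\quad \mathbf{A}\mathbf{1} = \lambda\mathbf{1}$$

where the condition that $\mathbf{A}$ has $\mathbf{1}$ as an eigenvector ensures invertibility by sidestepping the information loss due to the invariance: $\mathrm{softmax}(\mathbf{x} + \alpha\mathbf{1}) = \mathrm{softmax}(\mathbf{x})$. Note in particular that $\mathbf{A} = \lambda\mathbf{I}_n$ is the only allowed diagonal parametrization, in which case we recover $f_\text{cal}(\mathbf{p}; \lambda^{-1}, \mathbf{c})$, while (for $n > 2$) generalization is possible with non-diagonal matrices. The inverse is:

$$\mathbf{p} = f_\text{gcal}^{-1}(\mathbf{q}; \mathbf{A}, \mathbf{c}) = f_\text{gcal}(\mathbf{q}; \mathbf{A}^{-1}, -\mathbf{A}^{-1}\mathbf{c}), \quad\text{where}\quad \mathbf{A}\mathbf{1} = \lambda\mathbf{1} \Longrightarrow \mathbf{A}^{-1}\mathbf{1} = \lambda^{-1}\mathbf{1}$$

The differential volume ratio is:

$$R_\text{gcal}^\Delta(\mathbf{p}; \mathbf{A}, \mathbf{c}) = \frac{|\det(\mathbf{A})|}{|\lambda|}\prod_{i=1}^{n}\frac{q_i}{p_i}$$

If $f_\text{gcal}$ is to be used as a calibration transform, further constraints could be imposed, for example that $\mathbf{A}$ be positive definite, so that $(\mathbf{A}\mathbf{x})'\mathbf{x} > 0$, which avoids direction reversals. (This is one possible generalization of $a > 0$ in the $f_\text{cal}$ parameter.)

For $n = 2$, $a > 0$ and $\mathbf{A}$ positive definite, $f_\text{cal}$ and $f_\text{gcal}$ are equivalent in the sense that in both cases, $\log\frac{p_1}{p_2} \mapsto \log\frac{q_1}{q_2}$ is a straight line, the (positive) slope and offset of which are functions of the transform parameters. For $n > 2$, $f_\text{gcal}$ does generalize $f_\text{cal}$.

It must however be noted that chaining multiple $f_\text{gcal}$ flow transformations does not give a further generalization, because:

$$f_\text{gcal}(\cdot\,; \mathbf{A}_1, \mathbf{c}_1) \circ f_\text{gcal}(\cdot\,; \mathbf{A}_2, \mathbf{c}_2) = f_\text{gcal}(\cdot\,; \mathbf{A}_1\mathbf{A}_2, \mathbf{c}_1 + \mathbf{A}_1\mathbf{c}_2)$$

In fact, the set of $f_\text{gcal}$ transformations forms a group under function composition. The set of $f_\text{cal}$ transformations forms a subgroup.

Also see: Dirichlet calibration,[27] which generalizes $f_\text{gcal}$ by not placing any restriction on the matrix, $\mathbf{A}$, so that invertibility is not guaranteed. While Dirichlet calibration is trained as a discriminative model, $f_\text{gcal}$ can also be trained as part of a generative calibration model.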

Differential volume ratio for curved manifolds

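Consider a flow, $\mathbf{y} = f(\mathbf{x})$, on a curved manifold, for example $\mathbb{S}^{n-1}$, which we equip with the embedding function, $e$, that maps a set of $(n-1)$ angular spherical coordinates to $\mathbb{S}^{n-1}$. The Jacobian of $e$ is non-constant and we have to evaluate it at both input ($\mathbf{E_x}$) and output ($\mathbf{E_y}$). The same applies to $r$, the representation function that recovers spherical coordinates from points on $\mathbb{S}^{n-1}$, for which we need the Jacobian at the output ($\mathbf{R_y}$). The differential volume ratio now generalizes to:

$$R_f(\mathbf{x}) = |\det(\mathbf{R_y}\mathbf{F_x}\mathbf{E_x})|\,\frac{\sqrt{|\det(\mathbf{E_y}'\mathbf{E_y})|}}{\sqrt{|\det(\mathbf{E_x}'\mathbf{E_x})|}}$$

For geometric insight, consider $\mathbb{S}^2$, where the spherical coordinates are co-latitude, $\theta \in [0, \pi]$, and longitude, $\phi \in [0, 2\pi)$. At $\mathbf{x} = e(\theta, \phi)$, we get $\sqrt{|\det(\mathbf{E_x}'\mathbf{E_x})|} = \sin\theta$, which gives the radius of the circle at that latitude (compare e.g. polar circle to equator). The differential volume (surface area on the sphere) is: $\sin\theta\,d\theta\,d\phi$.

The above derivation for $R_f$ is fragile in the sense that when using fixed functions $e, r$, there may be places where they are not well-defined, for example at the poles of the 2-sphere where longitude is arbitrary. This problem is sidestepped (using standard manifold machinery) by generalizing to local coordinates (charts), where in the vicinities of $\mathbf{x}, \mathbf{y} \in \mathcal{M}$, we map from local $m$-dimensional coordinates to $\mathbb{R}^n$ and back using the respective function pairs $e_\mathbf{x}, r_\mathbf{x}$ and $e_\mathbf{y}, r_\mathbf{y}$. We continue to use the same notation for the Jacobians of these functions ($\mathbf{E_x}, \mathbf{E_y}, \mathbf{R_y}$), so that the above formula for $R_f$ remains valid.

We can however choose our local coordinate system in a way that simplifies the expression for $R_f$ and indeed also its practical implementation.[22] Let $\pi: \mathcal{P} \to \mathbb{R}^n$ be a smooth idempotent projection ($\pi \circ \pi = \pi$) from the projectible set, $\mathcal{P} \subseteq \mathbb{R}^n$, onto the embedded manifold. For example:

  • The positive orthant of $\mathbb{R}^n$ is projected onto the simplex as: $\pi(\mathbf{z}) = \left(\sum_{i=1}^n z_i\right)^{-1}\mathbf{z}$
  • Non-zero vectors in $\mathbb{R}^n$ are projected onto the unit sphere as: $\pi(\mathbf{z}) = \left(\sum_{i=1}^n z_i^2\right)^{-\frac{1}{2}}\mathbf{z}$

For every $\mathbf{x} \in \mathcal{M}$, we require of $\pi$ that its $n$-by-$n$ Jacobian, $\boldsymbol{\Pi}_\mathbf{x}$, has rank $m$ (the manifold dimension), in which case $\boldsymbol{\Pi}_\mathbf{x}$ is an idempotent linear projection onto the local tangent space (orthogonal for the unit sphere: $\mathbf{I}_n - \mathbf{x}\mathbf{x}'$; oblique for the simplex: $\mathbf{I}_n - \mathbf{x}\mathbf{1}'$). The columns of $\boldsymbol{\Pi}_\mathbf{x}$ span the $m$-dimensional tangent space at $\mathbf{x}$. We use the notation, $\mathbf{T_x}$, for any $n$-by-$m$ matrix with orthonormal columns ($\mathbf{T_x}'\mathbf{T_x} = \mathbf{I}_m$) that span the local tangent space. Also note: $\boldsymbol{\Pi}_\mathbf{x}\mathbf{T_x} = \mathbf{T_x}$. We can now choose our local coordinate embedding function, $e_\mathbf{x}: \mathbb{R}^m \to \mathbb{R}^n$:

$$e_\mathbf{x}(\tilde{\mathbf{x}}) = \pi(\mathbf{x} + \mathbf{T_x}\tilde{\mathbf{x}}), \quad\text{with Jacobian:}\quad \mathbf{E_x} = \mathbf{T_x} \quad\text{at}\quad \tilde{\mathbf{x}} = \mathbf{0}.$$

Since the Jacobian is injective (full rank: $m$), a local (not necessarily unique) left inverse, say $r_\mathbf{x}^*$ with Jacobian $\mathbf{R}_\mathbf{x}^*$, exists such that $r_\mathbf{x}^*(e_\mathbf{x}(\tilde{\mathbf{x}})) = \tilde{\mathbf{x}}$ and $\mathbf{R}_\mathbf{x}^*\mathbf{T_x} = \mathbf{I}_m$. In practice we do not need the left inverse function itself, but we do need its Jacobian, for which the above equation does not give a unique solution. We can however enforce a unique solution for the Jacobian by choosing the left inverse as, $r_\mathbf{x}: \mathbb{R}^n \to \mathbb{R}^m$:

$$r_\mathbf{x}(\mathbf{z}) = r_\mathbf{x}^*(\pi(\mathbf{z})), \quad\text{with Jacobian:}\quad \mathbf{R_x} = \mathbf{T_x}'$$

We can now finally plug $\mathbf{E_x} = \mathbf{T_x}$ and $\mathbf{R_y} = \mathbf{T_y}'$ into our previous expression for $R_f$, the differential volume ratio, which because of the orthonormal Jacobians, simplifies to:[28]

$$R_f(\mathbf{x}) = |\det(\mathbf{T_y}'\mathbf{F_x}\mathbf{T_x})|$$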

Practical implementation

For learning the parameters of a manifold flow transformation, we need access to the differential volume ratio, $R_f$, or at least to its gradient w.r.t. the parameters. Moreover, for some inference tasks, we need access to $R_f$ itself. Practical solutions include:

  • Sorrenson et al. (2023)[22] give a solution for computationally efficient stochastic parameter gradient approximation for $\log R_f$.
  • For some hand-designed flow transforms, $R_f$ can be analytically derived in closed form, for example the above-mentioned simplex calibration transforms. Further examples are given below in the section on simple spherical flows.
  • On a software platform equipped with linear algebra and automatic differentiation, $R_f(\mathbf{x}) = |\det(\mathbf{T_y}'\mathbf{F_x}\mathbf{T_x})|$ can be automatically evaluated, given access to only $\mathbf{x}, f, \pi$.[29] But this is expensive for high-dimensional data, with at least $\mathcal{O}(n^3)$ computational costs. Even then, the slow automatic solution can be invaluable as a tool for numerically verifying hand-designed closed-form solutions.

Simple spherical flows

In machine learning literature, various complex spherical flows formed by deep neural network architectures may be found.[22] In contrast, this section compiles from statistics literature the details of three very simple spherical flow transforms, with simple closed-form expressions for inverses and differential volume ratios. These flows can be used individually, or chained, to generalize distributions on the unit sphere, $\mathbb{S}^{n-1}$. All three flows are compositions of an invertible affine transform in $\mathbb{R}^n$, followed by radial projection back onto the sphere. The flavours we consider for the affine transform are: pure translation, pure linear and general affine. To make these flows fully functional for learning, inference and sampling, the tasks are:

  • To derive the inverse transform, with suitable restrictions on the parameters to ensure invertibility.
  • To derive in simple closed form the differential volume ratio, $R_f$.

An interesting property of these simple spherical flows is that they don't make use of any non-linearities apart from the radial projection. Even the simplest of them, the normalized translation flow, can be chained to form perhaps surprisingly flexible distributions.

Normalized translation flow

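The normalized translation flow, $f_\text{trans}: \mathbb{S}^{n-1} \to \mathbb{S}^{n-1}$, with parameter $\mathbf{c} \in \mathbb{R}^n$, is given by:

$$\mathbf{y} = f_\text{trans}(\mathbf{x}; \mathbf{c}) = \frac{\mathbf{x} + \mathbf{c}}{\|\mathbf{x} + \mathbf{c}\|}, \quad\text{where}\quad \|\mathbf{c}\| < 1$$

The inverse function may be derived by considering, for $\ell > 0$: $\mathbf{y} = \ell^{-1}(\mathbf{x} + \mathbf{c})$ and then using $\mathbf{x}'\mathbf{x} = 1$ to get a quadratic equation to recover $\ell$, which gives:

$$\mathbf{x} = f_\text{trans}^{-1}(\mathbf{y}; \mathbf{c}) = \ell\mathbf{y} - \mathbf{c}, \quad\text{where}\quad \ell = \mathbf{y}'\mathbf{c} + \sqrt{(\mathbf{y}'\mathbf{c})^2 + 1 - \mathbf{c}'\mathbf{c}}$$

from which we see that we need $\|\mathbf{c}\| < 1$ to keep $\ell$ real and positive for all $\mathbf{y} \in \mathbb{S}^{n-1}$. The differential volume ratio is given (without derivation) by Boulerice & Ducharme (1994) as:[30]

$$R_\text{trans}(\mathbf{x}; \mathbf{c}) = \frac{1 + \mathbf{x}'\mathbf{c}}{\|\mathbf{x} + \mathbf{c}\|^n}$$

This can indeed be verified analytically:

  • By a laborious manipulation of $R_f(\mathbf{x}) = |\det(\mathbf{T_y}'\mathbf{F_x}\mathbf{T_x})|$.
  • By setting $\mathbf{M} = \mathbf{I}_n$ in $R_\text{aff}(\mathbf{x}; \mathbf{M}, \mathbf{c})$, which is given below.

Finally, it is worth noting that $f_\text{trans}$ and $f_\text{trans}^{-1}$ do not have the same functional form.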
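The inverse and the closed-form ratio $(1 + \mathbf{x}'\mathbf{c})/\|\mathbf{x}+\mathbf{c}\|^n$ can also be checked numerically against the tangent-space determinant. In this sketch the dimension, the random inputs, and the helper `tangent_basis` (a QR-based construction of an orthonormal basis orthogonal to a unit vector) are assumptions for illustration.

```python
import numpy as np

def f_trans(x, c):
    v = x + c
    return v / np.linalg.norm(v)

def f_trans_inv(y, c):
    yc = y @ c
    ell = yc + np.sqrt(yc ** 2 + 1.0 - c @ c)    # positive root of the quadratic in ell
    return ell * y - c

def tangent_basis(x):
    """Orthonormal n-by-(n-1) basis of the tangent space of the unit sphere at x."""
    q, _ = np.linalg.qr(np.column_stack([x, np.eye(len(x))]))
    return q[:, 1:]                               # first column of q is parallel to x

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n); x /= np.linalg.norm(x)
c = rng.normal(size=n); c *= 0.5 / np.linalg.norm(c)    # ||c|| = 0.5 < 1

y = f_trans(x, c)
F = (np.eye(n) - np.outer(y, y)) / np.linalg.norm(x + c)   # n-by-n Jacobian of f_trans at x
R_numeric = abs(np.linalg.det(tangent_basis(y).T @ F @ tangent_basis(x)))
R_closed = (1.0 + x @ c) / np.linalg.norm(x + c) ** n
```

Any other orthonormal tangent bases would give the same `R_numeric`, since the absolute determinant is invariant to rotations of the bases.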

Normalized linear flow

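The normalized linear flow, $f_\text{lin}: \mathbb{S}^{n-1} \to \mathbb{S}^{n-1}$, where parameter $\mathbf{M}$ is an invertible $n$-by-$n$ matrix, is given by:

$$\mathbf{y} = f_\text{lin}(\mathbf{x}; \mathbf{M}) = \frac{\mathbf{M}\mathbf{x}}{\|\mathbf{M}\mathbf{x}\|} \;\iff\; \mathbf{x} = f_\text{lin}^{-1}(\mathbf{y}; \mathbf{M}) = f_\text{lin}(\mathbf{y}; \mathbf{M}^{-1}) = \frac{\mathbf{M}^{-1}\mathbf{y}}{\|\mathbf{M}^{-1}\mathbf{y}\|}$$

The differential volume ratio is:

$$R_\text{lin}(\mathbf{x}; \mathbf{M}) = \frac{|\det\mathbf{M}|}{\|\mathbf{M}\mathbf{x}\|^n}$$

This result can be derived indirectly via the Angular central Gaussian distribution (ACG),[31] which can be obtained via normalized linear transform of either Gaussian, or uniform spherical variates. The first relationship can be used to derive the ACG density by a marginalization integral over the radius; after which the second relationship can be used to factor out the differential volume ratio. For details, see ACG distribution.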

Normalized affine flow

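The normalized affine flow, $f_\text{aff}: \mathbb{S}^{n-1} \to \mathbb{S}^{n-1}$, with parameters $\mathbf{c} \in \mathbb{R}^n$ and $\mathbf{M}$, $n$-by-$n$ invertible, is given by:

$$f_\text{aff}(\mathbf{x}; \mathbf{M}, \mathbf{c}) = \frac{\mathbf{M}\mathbf{x} + \mathbf{c}}{\|\mathbf{M}\mathbf{x} + \mathbf{c}\|}, \quad\text{where}\quad \|\mathbf{M}^{-1}\mathbf{c}\| < 1$$

The inverse function, derived in a similar way to the normalized translation inverse, is:

$$\mathbf{x} = f_\text{aff}^{-1}(\mathbf{y}; \mathbf{M}, \mathbf{c}) = \mathbf{M}^{-1}(\ell\mathbf{y} - \mathbf{c}), \quad\text{where}\quad \ell = \frac{\mathbf{y}'\mathbf{W}\mathbf{c} + \sqrt{(\mathbf{y}'\mathbf{W}\mathbf{c})^2 + \mathbf{y}'\mathbf{W}\mathbf{y}\,(1 - \mathbf{c}'\mathbf{W}\mathbf{c})}}{\mathbf{y}'\mathbf{W}\mathbf{y}}$$

where $\mathbf{W} = (\mathbf{M}\mathbf{M}')^{-1}$. The differential volume ratio is:

$$R_\text{aff}(\mathbf{x}; \mathbf{M}, \mathbf{c}) = R_\text{lin}(\mathbf{x}; \mathbf{M} + \mathbf{c}\mathbf{x}') = \frac{|\det\mathbf{M}|\,(1 + \mathbf{x}'\mathbf{M}^{-1}\mathbf{c})}{\|\mathbf{M}\mathbf{x} + \mathbf{c}\|^n}$$

The final RHS numerator was expanded from $\det(\mathbf{M} + \mathbf{c}\mathbf{x}')$ by the matrix determinant lemma. Recalling $R_f(\mathbf{x}) = |\det(\mathbf{T_y}'\mathbf{F_x}\mathbf{T_x})|$, the equality between $R_\text{aff}$ and $R_\text{lin}$ holds because not only:

$$\mathbf{x}'\mathbf{x} = 1 \;\Longrightarrow\; \mathbf{y} = f_\text{aff}(\mathbf{x}; \mathbf{M}, \mathbf{c}) = f_\text{lin}(\mathbf{x}; \mathbf{M} + \mathbf{c}\mathbf{x}')$$

but also, by orthogonality of $\mathbf{x}$ to the local tangent space:

$$\mathbf{x}'\mathbf{T_x} = \mathbf{0} \;\Longrightarrow\; \mathbf{F}_\mathbf{x}^\text{aff}\mathbf{T_x} = \mathbf{F}_\mathbf{x}^\text{lin}\mathbf{T_x}$$

where $\mathbf{F}_\mathbf{x}^\text{lin} = \|\mathbf{M}\mathbf{x} + \mathbf{c}\|^{-1}(\mathbf{I}_n - \mathbf{y}\mathbf{y}')(\mathbf{M} + \mathbf{c}\mathbf{x}')$ is the Jacobian of $f_\text{lin}$ differentiated w.r.t. its input, but not also w.r.t. its parameter.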
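A numerical sanity check of the affine case is sketched below: it verifies the inverse and compares the closed-form ratio $|\det\mathbf{M}|(1 + \mathbf{x}'\mathbf{M}^{-1}\mathbf{c})/\|\mathbf{M}\mathbf{x}+\mathbf{c}\|^n$ with the tangent-space determinant. The random parameters (with $\|\mathbf{M}^{-1}\mathbf{c}\| = 0.5$ by construction) and the QR-based helper `tangent_basis` are assumptions for illustration. Setting `c` to zero gives the normalized linear flow as a special case.

```python
import numpy as np

def f_aff(x, M, c):
    v = M @ x + c
    return v / np.linalg.norm(v)

def f_aff_inv(y, M, c):
    W = np.linalg.inv(M @ M.T)
    yWy, yWc, cWc = y @ W @ y, y @ W @ c, c @ W @ c
    ell = (yWc + np.sqrt(yWc ** 2 + yWy * (1.0 - cWc))) / yWy
    return np.linalg.solve(M, ell * y - c)

def tangent_basis(x):
    """Orthonormal n-by-(n-1) basis of the tangent space of the unit sphere at x."""
    q, _ = np.linalg.qr(np.column_stack([x, np.eye(len(x))]))
    return q[:, 1:]

rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n)) + 3.0 * np.eye(n)     # illustrative invertible matrix
d = rng.normal(size=n)
c = M @ (0.5 * d / np.linalg.norm(d))             # guarantees ||M^{-1} c|| = 0.5 < 1
x = rng.normal(size=n); x /= np.linalg.norm(x)

y = f_aff(x, M, c)
F = (np.eye(n) - np.outer(y, y)) @ M / np.linalg.norm(M @ x + c)   # Jacobian of f_aff at x
R_numeric = abs(np.linalg.det(tangent_basis(y).T @ F @ tangent_basis(x)))
R_closed = (abs(np.linalg.det(M)) * (1.0 + x @ np.linalg.solve(M, c))
            / np.linalg.norm(M @ x + c) ** n)
```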

Downsides

Despite normalizing flows' success in estimating high-dimensional densities, some downsides still exist in their designs. First of all, the latent space onto which input data is projected is not lower-dimensional; therefore, flow-based models do not allow for compression of data by default and require a lot of computation. However, it is still possible to perform image compression with them.[32]

Flow-based models are also notorious for failing to estimate the likelihood of out-of-distribution samples (i.e. samples that were not drawn from the same distribution as the training set).[33] Several hypotheses have been formulated to explain this phenomenon, among them the typical set hypothesis,[34] estimation issues when training models,[35] and fundamental issues due to the entropy of the data distributions.[36]

One of the most interesting properties of normalizing flows is the invertibility of their learned bijective map. This property is given by constraints in the design of the models (cf. RealNVP, Glow) which guarantee theoretical invertibility. The integrity of the inverse is important in order to ensure the applicability of the change-of-variable theorem, the computation of the Jacobian of the map, as well as sampling with the model. However, in practice this invertibility can be violated and the inverse map can explode because of numerical imprecision.[37]

Applications

Flow-based generative models have been applied to a variety of modeling tasks, including:

  • Audio generation[38]
  • Image generation[8]
  • Molecular graph generation[39]
  • Point-cloud modeling[40]
  • Video generation[41]
  • Lossy image compression[32]
  • Anomaly detection[42]

References

  1. ^ [citation text missing]
  2. ^ [citation text missing]
  3. ^ [citation text missing]
  4. ^ Bell, A. J.; Sejnowski, T. J. (1995). "An information-maximization approach to blind separation and blind deconvolution". Neural Computation. 7 (6): 1129–1159. doi:10.1162/neco.1995.7.6.1129.
  5. ^ Roth, Z.; Baram, Y. (1996). "Multidimensional density shaping by sigmoids". IEEE Transactions on Neural Networks. 7 (5): 1291–1298. doi:10.1109/72.536322.
  6. ^ a b [citation text missing]
  7. ^ a b [citation text missing]
  8. ^ a b c [citation text missing]
  9. ^ [citation text missing]
  10. ^ [citation text missing]
  11. ^ [citation text missing]
  12. ^ [citation text missing]
  13. ^ [citation text missing]
  14. ^ a b c [citation text missing]
  15. ^ [citation text missing]
  16. ^ [citation text missing]
  17. ^ a b [citation text missing]
  18. ^ [citation text missing]
  19. ^ [citation text missing]
  20. ^ [citation text missing]
  21. ^ [citation text missing]
  22. ^ a b c d [citation text missing]
  23. ^ [citation text missing]
  24. ^ [citation text missing]
  25. ^ [citation text missing]
  26. ^ [citation text missing]
  27. ^ [citation text missing]
  28. ^ The tangent matrices are not unique: if $\mathbf{T}$ has orthonormal columns and $\mathbf{Q}$ is an orthogonal matrix, then $\mathbf{T}\mathbf{Q}$ also has orthonormal columns that span the same subspace; it is easy to verify that $|\det(\mathbf{T_y}'\mathbf{F_x}\mathbf{T_x})|$ is invariant to such transformations of the tangent representatives.
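  29. ^ With PyTorch:
    from torch.linalg import qr
    from torch.func import jacrev
    def logRf(pi, m, f, x):
        # Log differential volume ratio of flow f at point x, for an
        # m-dimensional manifold with idempotent projection pi.
        y = f(x)
        # Fx: n-by-n Jacobian of f at x; PI: function returning the Jacobian of pi.
        Fx, PI = jacrev(f)(x), jacrev(pi)
        # Orthonormal tangent bases at x and y: first m columns of Q in the
        # QR factorization of the projection Jacobian.
        Tx, Ty = [qr(PI(z)).Q[:,:m] for z in (x,y)]
        return (Ty.T @ Fx @ Tx).slogdet().logabsdet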
    
  30. ^ [citation text missing]
  31. ^ [citation text missing]
  32. ^ a b [citation text missing]
  33. ^ [citation text missing]
  34. ^ [citation text missing]
  35. ^ [citation text missing]
  36. ^ [citation text missing]
  37. ^ [citation text missing]
  38. ^ [citation text missing]
  39. ^ [citation text missing]
  40. ^ [citation text missing]
  41. ^ [citation text missing]
  42. ^ [citation text missing]