f-divergence

From Wikipedia, the free encyclopedia


In probability theory, an $f$-divergence is a certain type of function $D_f(P \| Q)$ that measures the difference between two probability distributions $P$ and $Q$. Many common divergences, such as the KL-divergence, Hellinger distance, and total variation distance, are special cases of $f$-divergence.

History


These divergences were introduced by Alfréd Rényi[1] in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966) and are sometimes known as Csiszár f-divergences, Csiszár–Morimoto divergences, or Ali–Silvey distances.

Definition


Non-singular case


Let $P$ and $Q$ be two probability distributions over a space $\Omega$ such that $P \ll Q$, that is, $P$ is absolutely continuous with respect to $Q$ ($Q(A) = 0$ implies $P(A) = 0$ for every measurable $A$). Then, for a convex function $f : [0, +\infty) \to (-\infty, +\infty]$ such that $f(x)$ is finite for all $x > 0$, $f(1) = 0$, and $f(0) = \lim_{t \to 0^+} f(t)$ (which could be infinite), the $f$-divergence of $P$ from $Q$ is defined as

$$D_f(P \,\|\, Q) \equiv \int_{\Omega} f\!\left(\frac{dP}{dQ}\right) dQ.$$

We call $f$ the generator of $D_f$.

In concrete applications, there is usually a reference distribution $\mu$ on $\Omega$ (for example, when $\Omega = \mathbb{R}^n$, the reference distribution is the Lebesgue measure) such that $P, Q \ll \mu$. We can then use the Radon–Nikodym theorem to take their probability densities $p$ and $q$, giving

$$D_f(P \,\|\, Q) = \int_{\Omega} f\!\left(\frac{p(x)}{q(x)}\right) q(x) \, d\mu(x).$$

When there is no such reference distribution ready at hand, we can simply define μ=P+Q, and proceed as above. This is a useful technique in more abstract proofs.

The above definition can be extended to cases where $P \ll Q$ is no longer satisfied (Definition 7.1 of [2]).

Since $f$ is convex and $f(1) = 0$, the function $x \mapsto \frac{f(x)}{x - 1}$ must be nondecreasing, so there exists $f'(\infty) := \lim_{x \to \infty} f(x)/x$, taking value in $(-\infty, +\infty]$.

Since for any $p(x) > 0$ we have $\lim_{q(x) \to 0} q(x) \, f\!\left(\frac{p(x)}{q(x)}\right) = p(x) \, f'(\infty)$, we can extend the $f$-divergence to the case $P \not\ll Q$.
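As a concrete illustration of this definition, the discrete form $D_f(P\|Q) = \sum_i q_i f(p_i/q_i)$, together with the $p_i \, f'(\infty)$ extension at points where $q_i = 0$, can be sketched as follows. The generator and the distributions are illustrative choices, not part of the article:

```python
import numpy as np

def f_divergence(p, q, f, f_prime_inf=np.inf):
    """D_f(P||Q) = sum_i q_i * f(p_i / q_i) for discrete distributions.

    Where q_i = 0 but p_i > 0, the summand is taken as p_i * f'(inf),
    matching the extension to P not << Q described above.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0:
            total += qi * f(pi / qi)
        elif pi > 0:
            total += pi * f_prime_inf
    return total

# KL generator: f(t) = t ln t, with f(0) = 0 and f'(inf) = +inf
kl_gen = lambda t: t * np.log(t) if t > 0 else 0.0

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(f_divergence(p, q, kl_gen))  # ln 2 ≈ 0.6931
```

Note that when the supports are swapped (some $q_i = 0$ with $p_i > 0$), the same function correctly returns $+\infty$ for the KL generator.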

Properties


Basic relations between f-divergences

  • Linearity: $D_{\sum_i a_i f_i} = \sum_i a_i D_{f_i}$ given a finite sequence of nonnegative real numbers $a_i$ and generators $f_i$.
  • $D_f = D_g$ if and only if $f(x) = g(x) + c(x - 1)$ for some real constant $c$.
Proof

If $f(x) = g(x) + c(x - 1)$, then $D_f = D_g$, since the linear term integrates to zero: $\int c \left(\frac{dP}{dQ} - 1\right) dQ = c(1 - 1) = 0$.

Conversely, if $D_f - D_g = 0$, then let $h = f - g$. For any two probability measures $P, Q$ on the set $\{0, 1\}$, since $D_f(P\|Q) - D_g(P\|Q) = 0$, we get $h(P_1/Q_1) = -\frac{Q_0}{Q_1} h(P_0/Q_0)$.

Since each probability measure $P, Q$ has one degree of freedom, we can solve $\frac{P_0}{Q_0} = a$, $\frac{P_1}{Q_1} = x$ for every choice of $0 < a < 1 < x$.

Linear algebra yields $Q_0 = \frac{x - 1}{x - a}$, $Q_1 = \frac{1 - a}{x - a}$, which is a valid probability measure. Then we obtain $h(x) = \frac{h(a)}{a - 1}(x - 1)$ and $h(a) = \frac{h(x)}{x - 1}(a - 1)$.

Thus $h(x) = \begin{cases} c_1 (x - 1) & \text{if } x > 1, \\ c_0 (x - 1) & \text{if } 0 < x < 1, \end{cases}$ for some constants $c_0, c_1$. Plugging this formula into $h(x) = \frac{h(a)}{a - 1}(x - 1)$ yields $c_0 = c_1$.

Basic properties of f-divergences

  • Non-negativity: the $f$-divergence is always nonnegative; it is zero if the measures $P$ and $Q$ coincide. This follows immediately from Jensen's inequality:
    $$D_f(P\|Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ \;\ge\; f\!\left(\int \frac{dP}{dQ} \, dQ\right) = f(1) = 0.$$
  • Data-processing inequality: if $\kappa$ is an arbitrary transition probability that transforms measures $P$ and $Q$ into $P_\kappa$ and $Q_\kappa$ correspondingly, then
    $$D_f(P \| Q) \;\ge\; D_f(P_\kappa \| Q_\kappa).$$
    The equality here holds if and only if the transition is induced from a sufficient statistic with respect to $\{P, Q\}$.
  • Joint convexity: for any $0 \le \lambda \le 1$,
    $$D_f\big(\lambda P_1 + (1 - \lambda) P_2 \,\big\|\, \lambda Q_1 + (1 - \lambda) Q_2\big) \;\le\; \lambda D_f(P_1 \| Q_1) + (1 - \lambda) D_f(P_2 \| Q_2).$$
    This follows from the convexity of the mapping $(p, q) \mapsto q f(p/q)$ on $\mathbb{R}_+^2$.
  • Reversal by convex inversion: for any function $f$, its convex inversion is defined as $g(t) := t f(1/t)$. When $f$ satisfies the defining features of an $f$-divergence generator ($f(x)$ is finite for all $x > 0$, $f(1) = 0$, and $f(0) = \lim_{t \to 0^+} f(t)$), then $g$ satisfies the same features, and thus defines an $f$-divergence $D_g$. This is the "reverse" of $D_f$, in the sense that $D_g(P\|Q) = D_f(Q\|P)$ for all $P, Q$ that are absolutely continuous with respect to each other. In this way, every $f$-divergence $D_f$ can be made symmetric by taking $D_{\frac{1}{2}(f + g)}$. For example, this symmetrization turns the KL-divergence into the Jeffreys divergence.
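The reversal property $D_g(P\|Q) = D_f(Q\|P)$ is easy to check numerically. A minimal sketch with the KL generator; the distributions are arbitrary illustrative choices:

```python
import numpy as np

def f_div(p, q, f):
    # D_f(P||Q) for strictly positive discrete distributions
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

f = lambda t: t * np.log(t)   # KL generator
g = lambda t: t * f(1.0 / t)  # convex inversion; simplifies to -ln t

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
print(f_div(p, q, g), f_div(q, p, f))  # equal: D_g(P||Q) = D_f(Q||P)
```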

In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution $P^*$, then $D_f(P(t) \| P^*)$ is a monotonically non-increasing function of time, where the probability distribution $P(t)$ is a solution of the Kolmogorov forward equations (or master equation), which describe the time evolution of the probability distribution in the Markov process. This means that all $f$-divergences $D_f(P(t) \| P^*)$ are Lyapunov functions of the Kolmogorov forward equations. The converse statement is also true: if $H(P)$ is a Lyapunov function for all Markov chains with positive equilibrium $P^*$ and is of trace form ($H(P) = \sum_i f(P_i, P_i^*)$), then $H(P) = D_f(P \| P^*)$ for some convex function $f$.[3][4] Bregman divergences, for example, do not in general have this property and can increase in Markov processes.[5]

Analytic properties


The $f$-divergences can be expressed using Taylor series and rewritten as a weighted sum of chi-type distances (Nielsen & Nock (2013)).

Basic variational representation


Let $f^*$ be the convex conjugate of $f$. Let $\mathrm{effdom}(f^*)$ be the effective domain of $f^*$, that is, $\mathrm{effdom}(f^*) = \{y : f^*(y) < \infty\}$. Then we have two variational representations of $D_f$, which we describe below.

Under the above setup,

Theorem — $$D_f(P \,\|\, Q) = \sup_{g : \Omega \to \mathrm{effdom}(f^*)} \; E_P[g] - E_Q[f^* \circ g].$$

This is Theorem 7.24 in [2].

Example applications


Applying this theorem to the total variation distance, with generator $f(x) = \frac{1}{2}|x - 1|$, the convex conjugate is $f^*(x^*) = \begin{cases} x^* & \text{on } [-1/2, 1/2], \\ +\infty & \text{else}, \end{cases}$ and we obtain $$TV(P \| Q) = \sup_{|g| \le 1/2} E_P[g(X)] - E_Q[g(X)].$$ For the chi-squared divergence, defined by $f(x) = (x - 1)^2$, $f^*(y) = y^2/4 + y$, we obtain $$\chi^2(P \| Q) = \sup_g E_P[g(X)] - E_Q[g(X)^2/4 + g(X)].$$ Since the variational term is not affine-invariant in $g$, even though the domain over which $g$ varies is affine-invariant, we can exploit the affine invariance to obtain a leaner expression.

Replacing $g$ by $a g + b$ and taking the maximum over $a, b \in \mathbb{R}$, we obtain $$\chi^2(P \| Q) = \sup_g \frac{\big(E_P[g(X)] - E_Q[g(X)]\big)^2}{\mathrm{Var}_Q[g(X)]},$$ which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound (Theorem 29.1 and its corollary in [2]).
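The variational form above means that every witness $g$ supplies a lower bound on $\chi^2$, with equality at $g = dP/dQ$ (up to affine transformations). A small numerical sketch, with illustrative discrete distributions:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

chi2 = np.sum((p - q) ** 2 / q)  # chi-squared divergence, generator (t-1)^2

def lower_bound(g):
    # (E_P[g] - E_Q[g])^2 / Var_Q[g]: the affine-invariant variational form
    num = (np.dot(p, g) - np.dot(q, g)) ** 2
    var = np.dot(q, g ** 2) - np.dot(q, g) ** 2
    return num / var

print(chi2)                                    # ≈ 0.575
print(lower_bound(np.array([1.0, 2.0, 3.0])))  # any witness gives a lower bound
print(lower_bound(p / q))                      # optimal witness g = dP/dQ attains chi2
```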

For the $\alpha$-divergence with $\alpha \in (-\infty, 0) \cup (0, 1)$, we have $f_\alpha(x) = \frac{x^\alpha - \alpha x - (1 - \alpha)}{\alpha(\alpha - 1)}$, with domain $x \in [0, \infty)$. Its convex conjugate is $f_\alpha^*(y) = \frac{1}{\alpha}\big(x(y)^\alpha - 1\big)$ with domain $y \in \big(-\infty, (1 - \alpha)^{-1}\big)$, where $x(y) = \big((\alpha - 1) y + 1\big)^{\frac{1}{\alpha - 1}}$.

Applying this theorem yields, after the substitution $h = \big((\alpha - 1) g + 1\big)^{\frac{1}{\alpha - 1}}$, $$D_\alpha(P \| Q) = \frac{1}{\alpha(1 - \alpha)} - \inf_{h : \Omega \to (0, \infty)} \left( E_Q\!\left[\frac{h^\alpha}{\alpha}\right] + E_P\!\left[\frac{h^{\alpha - 1}}{1 - \alpha}\right] \right),$$ or, releasing the constraint on $h$, $$D_\alpha(P \| Q) = \frac{1}{\alpha(1 - \alpha)} - \inf_{h : \Omega \to \mathbb{R}} \left( E_Q\!\left[\frac{|h|^\alpha}{\alpha}\right] + E_P\!\left[\frac{|h|^{\alpha - 1}}{1 - \alpha}\right] \right).$$ Setting $\alpha = -1$ yields the variational representation of the $\chi^2$-divergence obtained above.

The domain over which $h$ varies is not affine-invariant in general, unlike in the $\chi^2$-divergence case. The $\chi^2$-divergence is special since, in that case, the absolute value can be dropped from $|h|$.

For general $\alpha \in (-\infty, 0) \cup (0, 1)$, the domain over which $h$ varies is merely scale-invariant. Similarly to the above, we can replace $h$ by $a h$ and take the minimum over $a > 0$ to obtain $$D_\alpha(P \| Q) = \sup_{h > 0} \left[ \frac{1}{\alpha(1 - \alpha)} \left( 1 - E_P\big[h^{\alpha - 1}\big]^{\alpha} \, E_Q\big[h^{\alpha}\big]^{1 - \alpha} \right) \right].$$ Setting $\alpha = \frac{1}{2}$ and performing another substitution $g = \sqrt{h}$ yields two variational representations of the squared Hellinger distance: $$H^2(P \| Q) = \frac{1}{2} D_{1/2}(P \| Q) = 2 - \inf_{g > 0} \big( E_Q[g(X)] + E_P[g(X)^{-1}] \big),$$ $$H^2(P \| Q) = 2 \sup_{g > 0} \left( 1 - \sqrt{E_P[g^{-1}] \, E_Q[g]} \right).$$ Applying this theorem to the KL-divergence, defined by $f(x) = x \ln x$, $f^*(y) = e^{y - 1}$, yields $$D_{KL}(P \| Q) = \sup_g E_P[g(X)] - e^{-1} E_Q\big[e^{g(X)}\big].$$ This is strictly less efficient than the Donsker–Varadhan representation $$D_{KL}(P \| Q) = \sup_g E_P[g(X)] - \ln E_Q\big[e^{g(X)}\big].$$ This defect is fixed by the next theorem.
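The gap between the two KL representations can be seen numerically: since $\ln x \le x/e$, the Donsker–Varadhan bound dominates the naive bound for every witness $g$, and attains the KL divergence at $g = \ln(dP/dQ)$, whereas the naive bound needs the shifted witness $g + 1$. A sketch with illustrative distributions:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))  # true KL divergence

def naive(g):              # E_P[g] - e^{-1} E_Q[e^g], from the basic representation
    return np.dot(p, g) - np.exp(-1.0) * np.dot(q, np.exp(g))

def donsker_varadhan(g):   # E_P[g] - ln E_Q[e^g]
    return np.dot(p, g) - np.log(np.dot(q, np.exp(g)))

g = np.log(p / q)  # optimal Donsker-Varadhan witness
print(kl, donsker_varadhan(g))  # equal
print(naive(g))                 # falls short of kl by 1/e at this witness
print(naive(g + 1.0))           # the shifted witness recovers kl
```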

Improved variational representation


Assume the setup in the beginning of this section ("Basic variational representation").

Theorem — If $f(x) = +\infty$ on $x < 0$ (redefine $f$ if necessary), then

$$D_f(P \| Q) = f'(\infty) \, P[S^c] + \sup_g \; \mathbb{E}_P[g \, 1_S] - \Psi_{Q, P}^*(g),$$

where $\Psi_{Q, P}^*(g) := \inf_{a \in \mathbb{R}} \; \mathbb{E}_Q[f^*(g(X) - a)] + a \, P[S]$ and $S := \{q > 0\}$, where $q$ is the probability density function of $Q$ with respect to some underlying measure.

In the special case of $f'(\infty) = +\infty$, we have

$$D_f(P \| Q) = \sup_g \; \mathbb{E}_P[g] - \Psi_Q^*(g), \qquad \Psi_Q^*(g) := \inf_{a \in \mathbb{R}} \; \mathbb{E}_Q[f^*(g(X) - a)] + a.$$

This is Theorem 7.25 in [2].

Example applications


Applying this theorem to KL-divergence yields the Donsker–Varadhan representation.

Attempting to apply this theorem to the general $\alpha$-divergence with $\alpha \in (-\infty, 0) \cup (0, 1)$ does not yield a closed-form solution.

Common examples of f-divergences


The following table lists many of the common divergences between probability distributions and the possible generating functions to which they correspond. Notably, except for total variation distance, all others are special cases of α-divergence, or linear sums of α-divergences.

For each $f$-divergence $D_f$, its generating function is not uniquely defined, but only up to addition of a term $c(t - 1)$, where $c$ is any real constant. That is, for any $f$ that generates an $f$-divergence, $D_{f(t)} = D_{f(t) + c(t - 1)}$. This freedom is not only convenient, but actually necessary.
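This invariance follows because $\int c\left(\frac{dP}{dQ} - 1\right) dQ = 0$, and it is easy to verify numerically. A small sketch with the KL generator and illustrative distributions:

```python
import numpy as np

def f_div(p, q, f):
    # D_f(P||Q) for strictly positive discrete distributions
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = [0.1, 0.6, 0.3]
q = [0.3, 0.3, 0.4]

f = lambda t: t * np.log(t)  # KL generator
values = [f_div(p, q, lambda t, c=c: f(t) + c * (t - 1))
          for c in (-2.0, 0.0, 0.5, 3.0)]
print(values)  # all four values coincide: the c(t-1) term integrates to zero
```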

Divergence | Corresponding $f(t)$ | Discrete form
$\chi^\alpha$-divergence, $\alpha \ge 1$ | $\tfrac{1}{2}\lvert t - 1\rvert^\alpha$ | $\tfrac{1}{2}\sum_i q_i \left\lvert \frac{p_i - q_i}{q_i} \right\rvert^\alpha$
Total variation distance ($\alpha = 1$) | $\tfrac{1}{2}\lvert t - 1\rvert$ | $\tfrac{1}{2}\sum_i \lvert p_i - q_i\rvert$
$\alpha$-divergence | $\begin{cases} \frac{t^\alpha - \alpha t - (1 - \alpha)}{\alpha(\alpha - 1)} & \text{if } \alpha \ne 0, 1, \\ t \ln t - t + 1 & \text{if } \alpha = 1, \\ -\ln t + t - 1 & \text{if } \alpha = 0 \end{cases}$ |
KL-divergence ($\alpha = 1$) | $t \ln t$ | $\sum_i p_i \ln\frac{p_i}{q_i}$
reverse KL-divergence ($\alpha = 0$) | $-\ln t$ | $\sum_i q_i \ln\frac{q_i}{p_i}$
Jensen–Shannon divergence | $\tfrac{1}{2}\left(t \ln t - (t + 1)\ln\frac{t + 1}{2}\right)$ | $\tfrac{1}{2}\sum_i \left(p_i \ln\frac{p_i}{(p_i + q_i)/2} + q_i \ln\frac{q_i}{(p_i + q_i)/2}\right)$
Jeffreys divergence (KL + reverse KL) | $(t - 1)\ln t$ | $\sum_i (p_i - q_i)\ln\frac{p_i}{q_i}$
squared Hellinger distance ($\alpha = \tfrac{1}{2}$) | $\tfrac{1}{2}(\sqrt{t} - 1)^2$; $1 - \sqrt{t}$ | $\tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$; $1 - \sum_i \sqrt{p_i q_i}$
Pearson $\chi^2$-divergence | $(t - 1)^2$ | $\sum_i \frac{(p_i - q_i)^2}{q_i}$
Neyman $\chi^2$-divergence | $\frac{(t - 1)^2}{t}$ | $\sum_i \frac{(p_i - q_i)^2}{p_i}$
[Figure: Comparison between the generators of alpha-divergences, as $\alpha$ varies from $-1$ to $2$.]
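The discrete forms in the table can be computed directly. A small sketch, with an illustrative pair of distributions, that also checks two identities implicit in the table (Jeffreys = KL + reverse KL, and the two equivalent squared-Hellinger forms):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

tv       = 0.5 * np.sum(np.abs(p - q))                   # total variation
kl       = np.sum(p * np.log(p / q))                     # KL divergence
rkl      = np.sum(q * np.log(q / p))                     # reverse KL
jeffreys = np.sum((p - q) * np.log(p / q))               # Jeffreys = KL + reverse KL
m = (p + q) / 2
js       = 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))  # Jensen-Shannon
hell2    = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)  # squared Hellinger
print(tv, kl, rkl, jeffreys, js, hell2)
```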

Let $f_\alpha$ be the generator of the $\alpha$-divergence; then $f_\alpha$ and $f_{1 - \alpha}$ are convex inversions of each other, so $D_\alpha(P \| Q) = D_{1 - \alpha}(Q \| P)$. In particular, this shows that the squared Hellinger distance ($\alpha = 1/2$) and the Jensen–Shannon divergence are symmetric.

In the literature, the $\alpha$-divergences are sometimes parametrized as

$$f_\alpha(t) = \begin{cases} \frac{4}{1 - \alpha^2}\big(1 - t^{(1 + \alpha)/2}\big) & \text{if } \alpha \ne \pm 1, \\ t \ln t & \text{if } \alpha = 1, \\ -\ln t & \text{if } \alpha = -1, \end{cases}$$

which is equivalent to the parametrization on this page upon substituting $\alpha \leftarrow \frac{\alpha + 1}{2}$.

Relations to other statistical divergences


Here, we compare f-divergences with other statistical divergences.

Rényi divergence


The Rényi divergences are a family of divergences defined by

$$R_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log\left( E_Q\!\left[ \left(\frac{dP}{dQ}\right)^{\alpha} \right] \right)$$

when $\alpha \in (0, 1) \cup (1, +\infty)$. It is extended to the cases $\alpha = 0, 1, +\infty$ by taking the limit.

Simple algebra shows that $R_\alpha(P \| Q) = \frac{1}{\alpha - 1} \ln\big(1 + \alpha(\alpha - 1) D_\alpha(P \| Q)\big)$, where $D_\alpha$ is the $\alpha$-divergence defined above.
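The identity follows because $D_\alpha(P\|Q) = \frac{E_Q[(dP/dQ)^\alpha] - 1}{\alpha(\alpha - 1)}$, so $1 + \alpha(\alpha - 1) D_\alpha$ recovers the moment inside the logarithm. A numerical check, with illustrative distributions and an illustrative $\alpha$:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
alpha = 0.7

moment  = np.dot(q, (p / q) ** alpha)           # E_Q[(dP/dQ)^alpha]
renyi   = np.log(moment) / (alpha - 1)          # Renyi divergence R_alpha
d_alpha = (moment - 1) / (alpha * (alpha - 1))  # alpha-divergence D_alpha
# identity: R_alpha = ln(1 + alpha (alpha - 1) D_alpha) / (alpha - 1)
print(renyi, np.log(1 + alpha * (alpha - 1) * d_alpha) / (alpha - 1))  # equal
```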

Bregman divergence


The only f-divergence that is also a Bregman divergence is the KL divergence.[6]

Integral probability metrics


The only f-divergence that is also an integral probability metric is the total variation.[7]

Financial interpretation


A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players, the expected profit rate has the same general form as the $f$-divergence.[8]


References

  1. ^ Rényi, Alfréd (1961). "On measures of entropy and information". Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Eq. (4.20)
  2. ^ a b c d Polyanskiy, Yury; Wu, Yihong. Information Theory: From Coding to Learning. Cambridge University Press.
  3. ^
  4. ^
  5. ^
  6. ^
  7. ^
  8. ^
  • Csiszár, I. (1963). "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten". Magyar Tud. Akad. Mat. Kutató Int. Közl.
  • Morimoto, T. (1963). "Markov processes and the H-theorem". Journal of the Physical Society of Japan.
  • Ali, S. M.; Silvey, S. D. (1966). "A general class of coefficients of divergence of one distribution from another". Journal of the Royal Statistical Society, Series B.
  • Nielsen, F.; Nock, R. (2013). "On the chi square and higher-order chi distances for approximating f-divergences". IEEE Signal Processing Letters.