Matrix Chernoff bound

For certain applications in linear algebra, it is useful to know properties of the probability distribution of the largest eigenvalue of a finite sum of random matrices. Suppose $\{\mathbf{X}_k\}$ is a finite sequence of random matrices. Analogous to the well-known Chernoff bound for sums of scalars, a bound on the following is sought for a given parameter $t$:

$$\Pr\left\{\lambda_{\max}\left(\sum_k \mathbf{X}_k\right) \geq t\right\}$$

The following theorems answer this general question under various assumptions; these assumptions are named below by analogy to their classical, scalar counterparts. All of these theorems can be found in (Tropp 2010), as the specific application of a general result which is derived below. A summary of related works is given.

Matrix Gaussian and Rademacher series

Self-adjoint matrices case

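Consider a finite sequence $\{\mathbf{A}_k\}$ of fixed, self-adjoint matrices with dimension $d$, and let $\{\xi_k\}$ be a finite sequence of independent standard normal or independent Rademacher random variables.

Then, for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\sum_k \xi_k \mathbf{A}_k\right) \geq t\right\} \leq d \cdot e^{-t^2/2\sigma^2}$$

where

$$\sigma^2 = \left\Vert \sum_k \mathbf{A}_k^2 \right\Vert.$$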
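As a quick numerical illustration of this bound, the following sketch draws Rademacher signs for a hypothetical set of symmetric matrices and compares the empirical tail frequency with $d \cdot e^{-t^2/2\sigma^2}$. The function names, matrix sizes, seeds, and trial count are illustrative assumptions and are not taken from (Tropp 2010).

```python
import numpy as np

def series_tail_bound(A_list, t):
    """Return (bound, sigma2) with bound = d * exp(-t^2 / (2 * sigma2))."""
    d = A_list[0].shape[0]
    # sigma^2 is the spectral norm of sum_k A_k^2.
    sigma2 = np.linalg.norm(sum(A @ A for A in A_list), 2)
    return d * np.exp(-t**2 / (2.0 * sigma2)), sigma2

def empirical_tail(A_list, t, trials=2000, seed=0):
    """Fraction of Rademacher draws with lambda_max(sum_k xi_k A_k) >= t."""
    rng = np.random.default_rng(seed)
    A = np.stack(A_list)                       # shape (n, d, d)
    hits = 0
    for _ in range(trials):
        xi = rng.choice([-1.0, 1.0], size=len(A_list))
        S = np.tensordot(xi, A, axes=1)        # sum_k xi_k A_k
        if np.linalg.eigvalsh(S)[-1] >= t:     # eigvalsh sorts ascending
            hits += 1
    return hits / trials

# Hypothetical fixed matrices: symmetrized Gaussian draws (an assumption).
rng = np.random.default_rng(1)
d, n = 4, 30
A_list = [(G + G.T) / 2 for G in rng.standard_normal((n, d, d))]

_, sigma2 = series_tail_bound(A_list, 0.0)
t = 3.0 * np.sqrt(sigma2)                      # three "standard deviations"
bound, _ = series_tail_bound(A_list, t)
freq = empirical_tail(A_list, t)
print(f"empirical tail: {freq:.4f}, theoretical bound: {bound:.4f}")
```

The bound is typically far from tight for a given set of matrices, so the empirical frequency is expected to sit well below it.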

Rectangular case

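Consider a finite sequence $\{\mathbf{B}_k\}$ of fixed matrices with dimension $d_1 \times d_2$, and let $\{\xi_k\}$ be a finite sequence of independent standard normal or independent Rademacher random variables. Define the variance parameter

$$\sigma^2 = \max\left\{\left\Vert \sum_k \mathbf{B}_k \mathbf{B}_k^* \right\Vert, \left\Vert \sum_k \mathbf{B}_k^* \mathbf{B}_k \right\Vert\right\}.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\left\Vert \sum_k \xi_k \mathbf{B}_k \right\Vert \geq t\right\} \leq (d_1 + d_2) \cdot e^{-t^2/2\sigma^2}.$$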

Matrix Chernoff inequalities

The classical Chernoff bounds concern the sum of independent, nonnegative, and uniformly bounded random variables. In the matrix setting, the analogous theorem concerns a sum of positive-semidefinite random matrices subjected to a uniform eigenvalue bound.

Matrix Chernoff I

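Consider a finite sequence $\{\mathbf{X}_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies

$$\mathbf{X}_k \succeq \mathbf{0} \quad \text{and} \quad \lambda_{\max}(\mathbf{X}_k) \leq R$$

almost surely.

Define

$$\mu_{\min} = \lambda_{\min}\left(\sum_k \mathbb{E}\,\mathbf{X}_k\right) \quad \text{and} \quad \mu_{\max} = \lambda_{\max}\left(\sum_k \mathbb{E}\,\mathbf{X}_k\right).$$

Then

$$\Pr\left\{\lambda_{\min}\left(\sum_k \mathbf{X}_k\right) \leq (1-\delta)\mu_{\min}\right\} \leq d \cdot \left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\mu_{\min}/R} \quad \text{for } \delta \in [0,1), \text{ and}$$
$$\Pr\left\{\lambda_{\max}\left(\sum_k \mathbf{X}_k\right) \geq (1+\delta)\mu_{\max}\right\} \leq d \cdot \left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu_{\max}/R} \quad \text{for } \delta \geq 0.$$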
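To get a feel for the size of these two bounds, here is a minimal sketch that evaluates their right-hand sides. The function names `chernoff_upper` and `chernoff_lower` and the sample values are hypothetical; the exponent is computed in log space only to avoid overflow.

```python
import math

def chernoff_upper(d, mu_max, R, delta):
    """d * [e^delta / (1+delta)^(1+delta)]^(mu_max/R), valid for delta >= 0."""
    if delta < 0:
        raise ValueError("delta must be nonnegative")
    log_base = delta - (1.0 + delta) * math.log1p(delta)
    return d * math.exp((mu_max / R) * log_base)

def chernoff_lower(d, mu_min, R, delta):
    """d * [e^-delta / (1-delta)^(1-delta)]^(mu_min/R), valid for 0 <= delta < 1."""
    if not 0 <= delta < 1:
        raise ValueError("delta must lie in [0, 1)")
    log_base = -delta - (1.0 - delta) * math.log1p(-delta)
    return d * math.exp((mu_min / R) * log_base)

# Hypothetical parameters: dimension 10, summands bounded by R = 1.
for delta in (0.1, 0.5, 1.0):
    print(delta, chernoff_upper(10, 50.0, 1.0, delta))
```

Both bounds equal the dimension $d$ at $\delta = 0$ and only become informative once $\mu/R$ is large compared with $\log d$.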

Matrix Chernoff II

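Consider a sequence $\{\mathbf{X}_k : k = 1, 2, \ldots, n\}$ of independent, random, self-adjoint matrices that satisfy

$$\mathbf{X}_k \succeq \mathbf{0} \quad \text{and} \quad \lambda_{\max}(\mathbf{X}_k) \leq 1$$

almost surely.

Compute the minimum and maximum eigenvalues of the average expectation,

$$\bar{\mu}_{\min} = \lambda_{\min}\left(\frac{1}{n}\sum_{k=1}^n \mathbb{E}\,\mathbf{X}_k\right) \quad \text{and} \quad \bar{\mu}_{\max} = \lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^n \mathbb{E}\,\mathbf{X}_k\right).$$

Then

$$\Pr\left\{\lambda_{\min}\left(\frac{1}{n}\sum_{k=1}^n \mathbf{X}_k\right) \leq \alpha\right\} \leq d \cdot e^{-nD(\alpha \Vert \bar{\mu}_{\min})} \quad \text{for } 0 \leq \alpha \leq \bar{\mu}_{\min}, \text{ and}$$
$$\Pr\left\{\lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^n \mathbf{X}_k\right) \geq \alpha\right\} \leq d \cdot e^{-nD(\alpha \Vert \bar{\mu}_{\max})} \quad \text{for } \bar{\mu}_{\max} \leq \alpha \leq 1.$$

The binary information divergence is defined as

$$D(a \Vert u) = a(\log a - \log u) + (1-a)\left(\log(1-a) - \log(1-u)\right)$$

for $a, u \in [0,1]$.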
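The binary information divergence and the resulting upper-tail bound can be evaluated with a few lines of code. This is a sketch under stated assumptions: the names `binary_divergence` and `chernoff2_upper` are hypothetical, the convention $0 \log 0 = 0$ is used, and $u$ is assumed to lie strictly between 0 and 1.

```python
import math

def binary_divergence(a, u):
    """D(a || u) for a in [0, 1] and u in (0, 1), with 0 * log 0 taken as 0."""
    def term(x, y):
        return 0.0 if x == 0 else x * (math.log(x) - math.log(y))
    return term(a, u) + term(1.0 - a, 1.0 - u)

def chernoff2_upper(d, n, mu_bar_max, alpha):
    """d * exp(-n * D(alpha || mu_bar_max)), for mu_bar_max <= alpha <= 1."""
    if not mu_bar_max <= alpha <= 1:
        raise ValueError("alpha must lie in [mu_bar_max, 1]")
    return d * math.exp(-n * binary_divergence(alpha, mu_bar_max))

# Hypothetical example: d = 5, n = 100 summands, average mean eigenvalue 0.25.
print(chernoff2_upper(5, 100, 0.25, 0.5))
```

The divergence vanishes when $a = u$, so the bound reduces to the trivial value $d$ at $\alpha = \bar{\mu}_{\max}$ and decays exponentially in $n$ as $\alpha$ moves away from it.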

Matrix Bennett and Bernstein inequalities

In the scalar setting, Bennett and Bernstein inequalities describe the upper tail of a sum of independent, zero-mean random variables that are either bounded or subexponential. In the matrix case, the analogous results concern a sum of zero-mean random matrices.

Bounded case

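Consider a finite sequence $\{\mathbf{X}_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies

$$\mathbb{E}\,\mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \lambda_{\max}(\mathbf{X}_k) \leq R$$

almost surely.

Compute the norm of the total variance,

$$\sigma^2 = \left\Vert \sum_k \mathbb{E}\left(\mathbf{X}_k^2\right) \right\Vert.$$

Then, the following chain of inequalities holds for all $t \geq 0$:

$$\begin{aligned}
\Pr\left\{\lambda_{\max}\left(\sum_k \mathbf{X}_k\right) \geq t\right\}
&\leq d \cdot \exp\left(-\frac{\sigma^2}{R^2} \cdot h\left(\frac{Rt}{\sigma^2}\right)\right) \\
&\leq d \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right) \\
&\leq \begin{cases} d \cdot \exp\left(-3t^2/8\sigma^2\right) & \text{for } t \leq \sigma^2/R; \\ d \cdot \exp\left(-3t/8R\right) & \text{for } t \geq \sigma^2/R. \end{cases}
\end{aligned}$$

The function $h(u)$ is defined as $h(u) = (1+u)\log(1+u) - u$ for $u \geq 0$.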
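The three levels of this chain (Bennett form, Bernstein form, and the split sub-Gaussian/subexponential form) can be compared numerically. The sketch below uses hypothetical function names and sample parameters; it assumes the Bernstein exponent $-(t^2/2)/(\sigma^2 + Rt/3)$, which is the form consistent with the constants $3/8$ in the final case split.

```python
import math

def h(u):
    """h(u) = (1 + u) log(1 + u) - u for u >= 0."""
    return (1.0 + u) * math.log1p(u) - u

def bennett_bound(d, sigma2, R, t):
    return d * math.exp(-(sigma2 / R**2) * h(R * t / sigma2))

def bernstein_bound(d, sigma2, R, t):
    return d * math.exp(-(t**2 / 2.0) / (sigma2 + R * t / 3.0))

def split_bound(d, sigma2, R, t):
    # Sub-Gaussian regime for small t, subexponential regime for large t.
    if t <= sigma2 / R:
        return d * math.exp(-3.0 * t**2 / (8.0 * sigma2))
    return d * math.exp(-3.0 * t / (8.0 * R))

# Hypothetical parameters: d = 10, sigma^2 = 4, R = 2 (regimes meet at t = 2).
d, sigma2, R = 10, 4.0, 2.0
for t in (0.5, 1.0, 2.0, 5.0, 10.0, 50.0):
    row = (bennett_bound(d, sigma2, R, t),
           bernstein_bound(d, sigma2, R, t),
           split_bound(d, sigma2, R, t))
    print(t, row)
```

Each successive bound is weaker but easier to work with; the last two coincide at the crossover point $t = \sigma^2/R$.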

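Consider a sequence $\{\mathbf{X}_k\}_{k=1}^n$ of independent and identically distributed random column vectors in $\mathbb{R}^d$. Assume that each random vector satisfies $\Vert \mathbf{X}_k \Vert_2 \leq M$ almost surely, and $\left\Vert \mathbb{E}\left[\mathbf{X}_k \mathbf{X}_k^T\right] \right\Vert_2 \leq 1$. Then, for all $t \geq 0$,[1]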

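$$\Pr\left\{\left\Vert \frac{1}{n}\sum_{k=1}^n \mathbf{X}_k \mathbf{X}_k^T - \mathbb{E}\left[\mathbf{X}_1 \mathbf{X}_1^T\right] \right\Vert_2 \geq t\right\} \leq (2\min(d,n))^2 \exp\left(-\frac{n(t-1)}{4M^2}\right)$$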

Subexponential case

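Consider a finite sequence $\{\mathbf{X}_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that

$$\mathbb{E}\,\mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \mathbb{E}\left(\mathbf{X}_k^p\right) \preceq \frac{p!}{2} \cdot R^{p-2} \mathbf{A}_k^2$$

for $p = 2, 3, 4, \ldots$.

Compute the variance parameter,

$$\sigma^2 = \left\Vert \sum_k \mathbf{A}_k^2 \right\Vert.$$

Then, the following chain of inequalities holds for all $t \geq 0$:

$$\begin{aligned}
\Pr\left\{\lambda_{\max}\left(\sum_k \mathbf{X}_k\right) \geq t\right\}
&\leq d \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt}\right) \\
&\leq \begin{cases} d \cdot \exp\left(-t^2/4\sigma^2\right) & \text{for } t \leq \sigma^2/R; \\ d \cdot \exp\left(-t/4R\right) & \text{for } t \geq \sigma^2/R. \end{cases}
\end{aligned}$$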

Rectangular case

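Consider a finite sequence $\{\mathbf{Z}_k\}$ of independent, random matrices with dimension $d_1 \times d_2$. Assume that each random matrix satisfies

$$\mathbb{E}\,\mathbf{Z}_k = \mathbf{0} \quad \text{and} \quad \Vert \mathbf{Z}_k \Vert \leq R$$

almost surely. Define the variance parameter

$$\sigma^2 = \max\left\{\left\Vert \sum_k \mathbb{E}\left(\mathbf{Z}_k \mathbf{Z}_k^*\right) \right\Vert, \left\Vert \sum_k \mathbb{E}\left(\mathbf{Z}_k^* \mathbf{Z}_k\right) \right\Vert\right\}.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\left\Vert \sum_k \mathbf{Z}_k \right\Vert \geq t\right\} \leq (d_1 + d_2) \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right).$$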

Matrix Azuma, Hoeffding, and McDiarmid inequalities

Matrix Azuma

The scalar version of Azuma's inequality states that a scalar martingale exhibits normal concentration about its mean value, and the scale for deviations is controlled by the total maximum squared range of the difference sequence. The following is its extension to the matrix setting.

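Consider a finite adapted sequence $\{\mathbf{X}_k\}$ of self-adjoint matrices with dimension $d$, and a fixed sequence $\{\mathbf{A}_k\}$ of self-adjoint matrices that satisfy

$$\mathbb{E}_{k-1}\,\mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \mathbf{X}_k^2 \preceq \mathbf{A}_k^2$$

almost surely.

Compute the variance parameter

$$\sigma^2 = \left\Vert \sum_k \mathbf{A}_k^2 \right\Vert.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\sum_k \mathbf{X}_k\right) \geq t\right\} \leq d \cdot e^{-t^2/8\sigma^2}.$$

The constant 1/8 can be improved to 1/2 when additional information is available. One case occurs when each summand $\mathbf{X}_k$ is conditionally symmetric. Another example requires the assumption that $\mathbf{X}_k$ commutes almost surely with $\mathbf{A}_k$.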

Matrix Hoeffding

Adding the assumption that the summands in the matrix Azuma inequality are independent gives a matrix extension of Hoeffding's inequality.

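Consider a finite sequence $\{\mathbf{X}_k\}$ of independent, random, self-adjoint matrices with dimension $d$, and let $\{\mathbf{A}_k\}$ be a sequence of fixed self-adjoint matrices. Assume that each random matrix satisfies

$$\mathbb{E}\,\mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \mathbf{X}_k^2 \preceq \mathbf{A}_k^2$$

almost surely.

Then, for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\sum_k \mathbf{X}_k\right) \geq t\right\} \leq d \cdot e^{-t^2/8\sigma^2}$$

where

$$\sigma^2 = \left\Vert \sum_k \mathbf{A}_k^2 \right\Vert.$$

An improvement of this result was established in (Mackey et al. 2012): for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\sum_k \mathbf{X}_k\right) \geq t\right\} \leq d \cdot e^{-t^2/2\sigma^2}$$

where

$$\sigma^2 = \frac{1}{2}\left\Vert \sum_k \left(\mathbf{A}_k^2 + \mathbb{E}\,\mathbf{X}_k^2\right) \right\Vert \leq \left\Vert \sum_k \mathbf{A}_k^2 \right\Vert.$$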

Matrix bounded difference (McDiarmid)

In the scalar setting, McDiarmid's inequality bounds the deviation of a function of independent random variables from its mean when the function has bounded differences; it is commonly derived by applying Azuma's inequality to a Doob martingale. A version of the bounded differences inequality holds in the matrix setting.

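Let $\{Z_k : k = 1, 2, \ldots, n\}$ be an independent family of random variables, and let $\mathbf{H}$ be a function that maps $n$ variables to a self-adjoint matrix of dimension $d$. Consider a sequence $\{\mathbf{A}_k\}$ of fixed self-adjoint matrices that satisfy

$$\left(\mathbf{H}(z_1, \ldots, z_k, \ldots, z_n) - \mathbf{H}(z_1, \ldots, z'_k, \ldots, z_n)\right)^2 \preceq \mathbf{A}_k^2,$$

where $z_i$ and $z'_i$ range over all possible values of $Z_i$ for each index $i$. Compute the variance parameter

$$\sigma^2 = \left\Vert \sum_k \mathbf{A}_k^2 \right\Vert.$$

Then, for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\mathbf{H}(\mathbf{z}) - \mathbb{E}\,\mathbf{H}(\mathbf{z})\right) \geq t\right\} \leq d \cdot e^{-t^2/8\sigma^2},$$

where $\mathbf{z} = (Z_1, \ldots, Z_n)$.

An improvement of this result was established in (Paulin, Mackey & Tropp 2013) (see also (Paulin, Mackey & Tropp 2016)): for all $t \geq 0$,

$$\Pr\left\{\lambda_{\max}\left(\mathbf{H}(\mathbf{z}) - \mathbb{E}\,\mathbf{H}(\mathbf{z})\right) \geq t\right\} \leq d \cdot e^{-t^2/\sigma^2},$$

where $\mathbf{z} = (Z_1, \ldots, Z_n)$ and $\sigma^2 = \left\Vert \sum_k \mathbf{A}_k^2 \right\Vert$.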

Survey of related theorems

The first bounds of this type were derived by (Ahlswede & Winter 2003). Recall the theorem above for self-adjoint matrix Gaussian and Rademacher bounds: for a finite sequence $\{\mathbf{A}_k\}$ of fixed, self-adjoint matrices with dimension $d$ and for $\{\xi_k\}$ a finite sequence of independent standard normal or independent Rademacher random variables,

$$\Pr\left\{\lambda_{\max}\left(\sum_k \xi_k \mathbf{A}_k\right) \geq t\right\} \leq d \cdot e^{-t^2/2\sigma^2}$$

where

$$\sigma^2 = \left\Vert \sum_k \mathbf{A}_k^2 \right\Vert.$$

Ahlswede and Winter would give the same result, except with

$$\sigma_{AW}^2 = \sum_k \lambda_{\max}\left(\mathbf{A}_k^2\right).$$

By comparison, the $\sigma^2$ in the theorem above commutes $\Sigma$ and $\lambda_{\max}$; that is, it is the largest eigenvalue of the sum rather than the sum of the largest eigenvalues. It is never larger than the Ahlswede–Winter value (by the norm triangle inequality), but can be much smaller. Therefore, the theorem above gives a tighter bound than the Ahlswede–Winter result.
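The gap between the two variance parameters is easy to see numerically. In the sketch below (function name and test matrices are illustrative assumptions), the matrices $\mathbf{A}_k = \mathbf{e}_k \mathbf{e}_k^T$ give $\sum_k \mathbf{A}_k^2 = \mathbf{I}$, so the largest eigenvalue of the sum is 1 while the sum of largest eigenvalues is $d$.

```python
import numpy as np

def variance_parameters(A_list):
    """Return (sigma2, sigma2_aw) for fixed real symmetric matrices A_k.

    sigma2    = lambda_max(sum_k A_k^2)   (largest eigenvalue of the sum)
    sigma2_aw = sum_k lambda_max(A_k^2)   (sum of largest eigenvalues)
    """
    squares = [A @ A for A in A_list]
    sigma2 = np.linalg.eigvalsh(sum(squares))[-1]
    sigma2_aw = sum(np.linalg.eigvalsh(S)[-1] for S in squares)
    return sigma2, sigma2_aw

# Extreme case: rank-one coordinate projections e_k e_k^T.
d = 5
basis = [np.diag(np.eye(d)[k]) for k in range(d)]
sigma2, sigma2_aw = variance_parameters(basis)
print(sigma2, sigma2_aw)

# Generic case: hypothetical symmetrized Gaussian matrices.
rng = np.random.default_rng(0)
generic = [(G + G.T) / 2 for G in rng.standard_normal((20, d, d))]
g_sigma2, g_sigma2_aw = variance_parameters(generic)
print(g_sigma2, g_sigma2_aw)
```

In the coordinate-projection example the ratio of the two parameters is exactly the dimension $d$, which is the kind of gap the text refers to.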

The chief contribution of (Ahlswede & Winter 2003) was the extension of the Laplace-transform method used to prove the scalar Chernoff bound (see the additive form of the Chernoff bound) to the case of self-adjoint matrices. The procedure is given in the derivation below. All of the recent works on this topic follow this same procedure, and the chief differences follow from subsequent steps. Ahlswede & Winter use the Golden–Thompson inequality to proceed, whereas Tropp (Tropp 2010) uses Lieb's Theorem.

Suppose one wished to vary the length of the series (n) and the dimensions of the matrices (d) while keeping the right-hand side approximately constant. Then n must vary approximately as the log of d. Several papers have attempted to establish a bound without a dependence on dimensions. Rudelson and Vershynin (Rudelson & Vershynin 2007) give a result for matrices which are the outer product of two vectors. (Magen & Zouzias 2010) provide a result without the dimensional dependence for low rank matrices. The original result was derived independently from the Ahlswede–Winter approach, but (Oliveira 2010b) proves a similar result using the Ahlswede–Winter approach.

Finally, Oliveira (Oliveira 2010a) proves a result for matrix martingales independently from the Ahlswede–Winter framework. Tropp (Tropp 2011) slightly improves on the result using the Ahlswede–Winter framework. Neither result is presented in this article.

Derivation and proof

Ahlswede and Winter

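The Laplace transform argument found in (Ahlswede & Winter 2003) is a significant result in its own right: Let $\mathbf{Y}$ be a random self-adjoint matrix. Then

$$\Pr\left\{\lambda_{\max}(\mathbf{Y}) \geq t\right\} \leq \inf_{\theta > 0}\left\{e^{-\theta t} \cdot \operatorname{E}\left[\operatorname{tr} e^{\theta \mathbf{Y}}\right]\right\}.$$

To prove this, fix $\theta > 0$. Then

$$\begin{aligned}
\Pr\left\{\lambda_{\max}(\mathbf{Y}) \geq t\right\}
&= \Pr\left\{\lambda_{\max}(\theta\mathbf{Y}) \geq \theta t\right\} \\
&= \Pr\left\{e^{\lambda_{\max}(\theta\mathbf{Y})} \geq e^{\theta t}\right\} \\
&\leq e^{-\theta t} \operatorname{E} e^{\lambda_{\max}(\theta\mathbf{Y})} \\
&\leq e^{-\theta t} \operatorname{E} \operatorname{tr} e^{\theta\mathbf{Y}}
\end{aligned}$$

The second-to-last inequality is Markov's inequality. The last inequality holds since $e^{\lambda_{\max}(\theta\mathbf{Y})} = \lambda_{\max}(e^{\theta\mathbf{Y}}) \leq \operatorname{tr}(e^{\theta\mathbf{Y}})$. Since the left-most quantity is independent of $\theta$, the infimum over $\theta > 0$ remains an upper bound for it.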
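The spectral fact behind the last inequality can be checked numerically for a single realization of the matrix. This is only a sanity check under assumptions: the helper `expm_sym`, the random test matrix, and the value of `theta` are all hypothetical, and the matrix exponential is computed through an eigendecomposition, which is valid for real symmetric input.

```python
import numpy as np

def expm_sym(S):
    """Matrix exponential of a real symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T       # V diag(exp(w)) V^T

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
Y = (G + G.T) / 2                      # one realization of a self-adjoint matrix
theta = 0.7

exp_of_lam_max = np.exp(np.linalg.eigvalsh(theta * Y)[-1])
E = expm_sym(theta * Y)
lam_max_of_exp = np.linalg.eigvalsh(E)[-1]
trace_of_exp = np.trace(E)

# exp(lambda_max(theta Y)) equals lambda_max(exp(theta Y)), and the trace of a
# positive-definite matrix dominates its largest eigenvalue.
print(exp_of_lam_max, lam_max_of_exp, trace_of_exp)
```

The trace step costs at most a factor of $d$, which is where the dimensional prefactor in all of the bounds originates.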

Thus, our task is to understand $\operatorname{E}[\operatorname{tr}(e^{\theta\mathbf{Y}})]$. Since trace and expectation are both linear, we can commute them, so it is sufficient to consider $\operatorname{E} e^{\theta\mathbf{Y}} := \mathbf{M}_{\mathbf{Y}}(\theta)$, which we call the matrix generating function. This is where the methods of (Ahlswede & Winter 2003) and (Tropp 2010) diverge. The immediately following presentation follows (Ahlswede & Winter 2003).

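The Golden–Thompson inequality, $\operatorname{tr} e^{\mathbf{A}+\mathbf{B}} \leq \operatorname{tr}\left(e^{\mathbf{A}} e^{\mathbf{B}}\right)$ for self-adjoint $\mathbf{A}$ and $\mathbf{B}$, implies that for independent $\mathbf{X}_1$ and $\mathbf{X}_2$

$$\operatorname{tr} \mathbf{M}_{\mathbf{X}_1 + \mathbf{X}_2}(\theta) \leq \operatorname{tr}\left[\left(\operatorname{E} e^{\theta\mathbf{X}_1}\right)\left(\operatorname{E} e^{\theta\mathbf{X}_2}\right)\right] = \operatorname{tr} \mathbf{M}_{\mathbf{X}_1}(\theta)\, \mathbf{M}_{\mathbf{X}_2}(\theta),$$

where we used independence and the linearity of expectation several times.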
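The deterministic Golden–Thompson inequality underlying this step, $\operatorname{tr} e^{\mathbf{A}+\mathbf{B}} \leq \operatorname{tr}(e^{\mathbf{A}} e^{\mathbf{B}})$ for self-adjoint matrices, can be sanity-checked numerically. The helper names and random test matrices below are illustrative assumptions, not part of the cited derivation.

```python
import numpy as np

def expm_sym(S):
    """Matrix exponential of a real symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def golden_thompson_sides(A, B):
    """Return (tr exp(A + B), tr(exp(A) exp(B))) for real symmetric A, B."""
    lhs = np.trace(expm_sym(A + B))
    rhs = np.trace(expm_sym(A) @ expm_sym(B))
    return lhs, rhs

rng = np.random.default_rng(0)
G1, G2 = rng.standard_normal((2, 4, 4))
A, B = (G1 + G1.T) / 2, (G2 + G2.T) / 2
lhs, rhs = golden_thompson_sides(A, B)
print(lhs, rhs)                         # lhs should not exceed rhs

# When A and B commute (here: both diagonal) the two sides coincide.
lhs_c, rhs_c = golden_thompson_sides(np.diag([1.0, 2.0]), np.diag([0.5, -1.0]))
print(lhs_c, rhs_c)
```

The inequality does not extend to three matrices in general, which is one reason the Ahlswede–Winter iteration has to peel off one summand at a time.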

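Suppose $\mathbf{Y} = \sum_k \mathbf{X}_k$. We can find an upper bound for $\operatorname{tr} \mathbf{M}_{\mathbf{Y}}(\theta)$ by iterating this result. Noting that $\operatorname{tr}(\mathbf{AB}) \leq \operatorname{tr}(\mathbf{A})\, \lambda_{\max}(\mathbf{B})$ when $\mathbf{A}$ is positive semidefinite, then

$$\operatorname{tr} \mathbf{M}_{\mathbf{Y}}(\theta) \leq \operatorname{tr}\left[\left(\operatorname{E} e^{\sum_{k=1}^{n-1} \theta\mathbf{X}_k}\right)\left(\operatorname{E} e^{\theta\mathbf{X}_n}\right)\right] \leq \operatorname{tr}\left(\operatorname{E} e^{\sum_{k=1}^{n-1} \theta\mathbf{X}_k}\right) \lambda_{\max}\left(\operatorname{E} e^{\theta\mathbf{X}_n}\right).$$

Iterating this, we get

$$\operatorname{tr} \mathbf{M}_{\mathbf{Y}}(\theta) \leq (\operatorname{tr} \mathbf{I}) \left[\prod_k \lambda_{\max}\left(\operatorname{E} e^{\theta\mathbf{X}_k}\right)\right] = d \, e^{\sum_k \lambda_{\max}\left(\log \operatorname{E} e^{\theta\mathbf{X}_k}\right)}$$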

So far we have found a bound with an infimum over θ. In turn, this can be bounded. At any rate, one can see how the Ahlswede–Winter bound arises as the sum of largest eigenvalues.

Tropp

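The major contribution of (Tropp 2010) is the application of Lieb's theorem where (Ahlswede & Winter 2003) had applied the Golden–Thompson inequality. Tropp's corollary is the following: If $\mathbf{H}$ is a fixed self-adjoint matrix and $\mathbf{X}$ is a random self-adjoint matrix, then

$$\operatorname{E} \operatorname{tr} e^{\mathbf{H} + \mathbf{X}} \leq \operatorname{tr} e^{\mathbf{H} + \log\left(\operatorname{E} e^{\mathbf{X}}\right)}$$

Proof: Let $\mathbf{Y} = e^{\mathbf{X}}$. Then Lieb's theorem tells us that

$$f(\mathbf{Y}) = \operatorname{tr} e^{\mathbf{H} + \log(\mathbf{Y})}$$

is concave. The final step is to use Jensen's inequality to move the expectation inside the function:

$$\operatorname{E} \operatorname{tr} e^{\mathbf{H} + \log(\mathbf{Y})} \leq \operatorname{tr} e^{\mathbf{H} + \log(\operatorname{E} \mathbf{Y})}.$$

This gives us the major result of the paper: the subadditivity of the log of the matrix generating function.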
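Tropp's corollary can be checked numerically for a simple random matrix. The sketch below assumes a hypothetical two-point distribution, where $\mathbf{X}$ takes each of two fixed symmetric values with probability 1/2, so that both expectations are plain averages. The helper `sym_fun` applies a scalar function to a symmetric matrix through its eigendecomposition.

```python
import numpy as np

def sym_fun(S, f):
    """Apply the scalar function f to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return (V * f(w)) @ V.T

rng = np.random.default_rng(0)

def rand_sym(d):
    G = rng.standard_normal((d, d))
    return (G + G.T) / 2

d = 4
H = rand_sym(d)                               # fixed self-adjoint matrix
X_values = [rand_sym(d), rand_sym(d)]         # X is uniform on these two values

# Left side: E tr exp(H + X).
lhs = np.mean([np.trace(sym_fun(H + X, np.exp)) for X in X_values])

# Right side: tr exp(H + log(E exp(X))); E exp(X) is positive definite,
# so its matrix logarithm is well defined.
mean_exp = sum(sym_fun(X, np.exp) for X in X_values) / len(X_values)
rhs = np.trace(sym_fun(H + sym_fun(mean_exp, np.log), np.exp))

print(lhs, rhs)                               # lhs should not exceed rhs
```

A single numerical instance is of course no proof; the concavity supplied by Lieb's theorem is what guarantees the inequality for every distribution.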

Subadditivity of log mgf

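Let $\{\mathbf{X}_k\}$ be a finite sequence of independent, random self-adjoint matrices. Then for all $\theta$,

$$\operatorname{tr} \mathbf{M}_{\sum_k \mathbf{X}_k}(\theta) \leq \operatorname{tr} e^{\sum_k \log \mathbf{M}_{\mathbf{X}_k}(\theta)}$$

Proof: Expanding the definitions, we need to show that

$$\operatorname{E} \operatorname{tr} e^{\sum_k \theta\mathbf{X}_k} \leq \operatorname{tr} e^{\sum_k \log \operatorname{E} e^{\theta\mathbf{X}_k}}.$$

It is sufficient to let $\theta = 1$, since the general case follows by applying the result to the matrices $\theta\mathbf{X}_k$. To complete the proof, we use the law of total expectation. Let $\operatorname{E}_k$ be the expectation conditioned on $\mathbf{X}_1, \ldots, \mathbf{X}_k$. Since we assume all the $\mathbf{X}_i$ are independent,

$$\operatorname{E}_{k-1} e^{\mathbf{X}_k} = \operatorname{E} e^{\mathbf{X}_k}.$$

Define $\mathbf{\Xi}_k = \log \operatorname{E}_{k-1} e^{\mathbf{X}_k} = \log \mathbf{M}_{\mathbf{X}_k}(1)$.

Finally, we have

$$\begin{aligned}
\operatorname{E} \operatorname{tr} e^{\sum_{k=1}^n \mathbf{X}_k}
&= \operatorname{E}_0 \cdots \operatorname{E}_{n-1} \operatorname{tr} e^{\sum_{k=1}^{n-1} \mathbf{X}_k + \mathbf{X}_n} \\
&\leq \operatorname{E}_0 \cdots \operatorname{E}_{n-2} \operatorname{tr} e^{\sum_{k=1}^{n-1} \mathbf{X}_k + \log\left(\operatorname{E}_{n-1} e^{\mathbf{X}_n}\right)} \\
&= \operatorname{E}_0 \cdots \operatorname{E}_{n-2} \operatorname{tr} e^{\sum_{k=1}^{n-2} \mathbf{X}_k + \mathbf{X}_{n-1} + \mathbf{\Xi}_n} \\
&\;\;\vdots \\
&\leq \operatorname{tr} e^{\sum_{k=1}^n \mathbf{\Xi}_k}
\end{aligned}$$

where at every step $m$ we use Tropp's corollary with

$$\mathbf{H}_m = \sum_{k=1}^{m-1} \mathbf{X}_k + \sum_{k=m+1}^n \mathbf{\Xi}_k$$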

Master tail bound

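Combining the Laplace transform bound with the subadditivity of the log of the matrix generating function gives immediately:

$$\Pr\left\{\lambda_{\max}\left(\sum_k \mathbf{X}_k\right) \geq t\right\} \leq \inf_{\theta > 0}\left\{e^{-\theta t} \operatorname{tr} e^{\sum_k \log \mathbf{M}_{\mathbf{X}_k}(\theta)}\right\}$$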
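As an illustration of how a concrete theorem falls out of the master bound, consider a Rademacher series $\sum_k \xi_k \mathbf{A}_k$. For a Rademacher variable $\xi$, $\mathbf{M}_{\xi\mathbf{A}}(\theta) = \cosh(\theta\mathbf{A})$, so the master bound can be evaluated exactly and minimized over a grid of $\theta$ values. The sketch below is a hedged example: the helper names, the test matrices, and the grid are assumptions. It compares the result with the closed form $d \cdot e^{-t^2/2\sigma^2}$, which corresponds to the particular choice $\theta = t/\sigma^2$.

```python
import numpy as np

def sym_fun(S, f):
    """Apply the scalar function f to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return (V * f(w)) @ V.T

def master_bound(A_list, t, thetas):
    """min over thetas of exp(-theta t) * tr exp(sum_k log cosh(theta A_k))."""
    best = np.inf
    for theta in thetas:
        # log M_{xi A_k}(theta) = log cosh(theta A_k) for Rademacher xi.
        log_mgf_sum = sum(sym_fun(theta * A, lambda w: np.log(np.cosh(w)))
                          for A in A_list)
        value = np.exp(-theta * t) * np.trace(sym_fun(log_mgf_sum, np.exp))
        best = min(best, value)
    return best

# Hypothetical fixed matrices: symmetrized Gaussian draws.
rng = np.random.default_rng(0)
d, n = 3, 10
A_list = [(G + G.T) / 2 for G in rng.standard_normal((n, d, d))]

sigma2 = np.linalg.norm(sum(A @ A for A in A_list), 2)
t = 2.0 * np.sqrt(sigma2)
theta_star = t / sigma2                       # minimizer of the closed form
thetas = np.append(np.linspace(0.01, 3.0 * theta_star, 200), theta_star)

master = master_bound(A_list, t, thetas)
closed_form = d * np.exp(-t**2 / (2.0 * sigma2))
print(master, closed_form)                    # master should not exceed closed_form
```

The exact evaluation is at least as tight as the closed form because $\log\cosh(\theta\mathbf{A}) \preceq \theta^2\mathbf{A}^2/2$; the named theorems trade this tightness for an expression that needs only $\sigma^2$.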

All of the theorems given above are derived from this bound; the theorems consist in various ways to bound the infimum. These steps are significantly simpler than the proofs given.
