Uniform convergence in probability

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family uniformly converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory. Specifically, the Glivenko-Cantelli theorem and the homonymous classes of functions are fundamentally related to uniform convergence.

The law of large numbers says that, for each single event A, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many application however, the need arises to judge simultaneously the probabilities of events of an entire class S from one and the same sample. Moreover it, is required that the relative frequency of the events converge to the probability uniformly over the entire class of events S [1]. The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event-family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.

Definitions

[edit | edit source]

For a class of predicates H defined on a set X and a set of samples x=(x1,x2,,xm), where xiX, the empirical frequency of hH on x is

Q^x(h)=1m|{i:1im,h(xi)=1}|.

The theoretical probability of hH is defined as QP(h)=P{yX:h(y)=1}.

The Uniform Convergence Theorem states, roughly, that if H is "simple" and we draw samples independently (with replacement) from X according to any distribution P, then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability.[2]

Here "simple" means that the Vapnik–Chervonenkis dimension of the class H is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.

The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis[1] using the concept of growth function.

Uniform Convergence Theorem

[edit | edit source]

The statement of the Uniform Convergence Theorem is as follows:[3]

If H is a set of {0,1}-valued functions defined on a set X and P is a probability distribution on X then for ε>0 and m a positive integer, we have:

Pm{|QP(h)Qx^(h)|ε for some hH}4ΠH(2m)eε2m/8.

In the above, for any xXm,QP(h)=P{(yX:h(y)=1},Q^x(h)=1m|{i:1im,h(xi)=1}| and |x|=m. Pm indicates that the probability is taken over x consisting of m i.i.d. draws from the distribution P.

Finally, the growth function ΠH is defined in the following way, for any {0,1}-valued functions H over X and for any natural number m: ΠH(m)=max|{hD:DX,|D|=m,hH}|.

From the point of view of Learning Theory one can consider H to be the Concept/Hypothesis class defined over the instance set X. Crucially, the Sauer–Shelah lemma implies that ΠH(m)md, where d is the VC dimension of H.

Proof of the Uniform Convergence Theorem

[edit | edit source]

[1] and [3] are the sources of the proof below. Before we get into the details of the proof of the Uniform Convergence Theorem we will present a high level overview of the proof.

  1. Symmetrization: We transform the problem of analyzing |QP(h)Q^x(h)|ε into the problem of analyzing |Q^r(h)Q^s(h)|ε/2, where r and s are i.i.d samples of size m drawn according to the distribution P. One can view r as the original randomly drawn sample of length m, while s may be thought as the testing sample which is used to estimate QP(h).
  2. Permutation: Since r and s are picked identically and independently, so swapping elements between them will not change the probability distribution on r and s. So, we will try to bound the probability of |Q^r(h)Q^s(h)|ε/2 for some hH by considering the effect of a specific collection of permutations of the joint sample x=r||s. Specifically, we consider permutations σ(x) which swap xi and xm+i in some subset of 1,2,...,m. The symbol r||s means the concatenation of r and s.[citation needed]
  3. Reduction to a finite class: We can now restrict the function class H to a fixed joint sample and hence, if H has finite VC Dimension, it reduces to the problem to one involving a finite function class.

We present the technical details of the proof. It should be stressed that this proof glosses over details like the measurability of the events V and R; measurability is granted in the case of H being finite or countable, but this is not normally the case in standard applications of the theorem (e.g. for statistical learning theory or to prove the Glivenko-Cantelli theorem). To get measurability, one needs to use a notion of separability of the underlying space, possibly related to H[4].

Symmetrization

[edit | edit source]

Lemma: Let V={xXm:|QP(h)Q^x(h)|ε for some hH} and

R={(r,s)Xm×Xm:|Qr^(h)Q^s(h)|ε/2 for some hH}.

Then for m2ε2, Pm(V)2P2m(R).

Proof: By the triangle inequality,
if |QP(h)Q^r(h)|ε and |QP(h)Q^s(h)|ε/2 then |Q^r(h)Q^s(h)|ε/2.

Therefore,

P2m(R)P2m{hH,|QP(h)Q^r(h)|ε and |QP(h)Q^s(h)|ε/2}=VPm{s:hH,|QP(h)Q^r(h)|ε and |QP(h)Q^s(h)|ε/2}dPm(r)=A

since r and s are independent.

Now for rV fix an hH such that |QP(h)Q^r(h)|ε. For this h, we shall show that

Pm{|QP(h)Q^s(h)|ε2}12.

Thus for any rV, APm(V)2 and hence P2m(R)Pm(V)2. And hence we perform the first step of our high level idea.

Notice, mQ^s(h) is a binomial random variable with expectation mQP(h) and variance mQP(h)(1QP(h)). By Chebyshev's inequality we get

Pm{|QP(h)Qs(h)^|>ε2}mQP(h)(1QP(h))(εm/2)21ε2m12

for the mentioned bound on m. Here we use the fact that x(1x)1/4 for x.

Permutations

[edit | edit source]

Let Γm be the set of all permutations of {1,2,3,,2m} that swaps i and m+i i in some subset of {1,2,3,,2m}.

Lemma: Let R be any subset of X2m and P any probability distribution on X. Then,

P2m(R)=E[Pr[σ(x)R]]maxxX2m(Pr[σ(x)R]),

where the expectation is over x chosen according to P2m, and the probability is over σ chosen uniformly from Γm.

Proof: For any σΓm,

P2m(R)=P2m{x:σ(x)R}

(since coordinate permutations preserve the product distribution P2m.)

P2m(R)=X2m1R(x)dP2m(x)=1|Γm|σΓmX2m1R(σ(x))dP2m(x)=X2m1|Γm|σΓm1R(σ(x))dP2m(x)(because |Γm| is finite)=X2mPr[σ(x)R]dP2m(x)(the expectation)maxxX2m(Pr[σ(x)R]).

The maximum is guaranteed to exist since there is only a finite set of values that probability under a random permutation can take.

Reduction to a finite class

[edit | edit source]

Lemma: Basing on the previous lemma,

maxxX2m(Pr[σ(x)R])4ΠH(2m)eε2m/8.

Proof: Let us define x=(x1,x2,,x2m) and t=|H|x| which is at most ΠH(2m). This means there are functions h1,h2,,htH such that for any hH,i between 1 and t with hi(xk)=h(xk) for 1k2m.

We see that σ(x)R iff for some h in H satisfies, |1m|{1im:h(xσi)=1}|1m|{m+1i2m:h(xσi)=1}||ε2. Hence if we define wij=1 if hj(xi)=1 and wij=0 otherwise.

For 1im and 1jt, we have that σ(x)R iff for some j in 1,,t satisfies |1m(iwσ(i)jiwσ(m+i)j)|ε2. By union bound we get

Pr[σ(x)R]tmax(Pr[|1m(iwσijiwσm+ij)|ε2])
ΠH(2m)max(Pr[|1m(iwσijiwσm+ij)|ε2]).

Since, the distribution over the permutations σ is uniform for each i, so wσijwσm+ij equals ±|wijwm+ij|, with equal probability.

Thus,

Pr[|1m(i(wσijwσm+ij))|ε2]=Pr[|1m(i|wijwm+ij|βi)|ε2],

where the probability on the right is over βi and both the possibilities are equally likely. By Hoeffding's inequality, this is at most 2emε2/8.

Finally, combining all the three parts of the proof we get the Uniform Convergence Theorem.

References

[edit | edit source]
  1. ^ a b c Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value). This is an English translation, by B. Seckler, of the Russian paper: Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value). The translation was reproduced as: Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  2. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  3. ^ a b Martin Anthony Peter, l. Bartlett. Neural Network Learning: Theoretical Foundations, pages 46–50. First Edition, 1999. Cambridge University Press Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  4. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).