Uniform convergence in probability

Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family uniformly converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory. Specifically, the Glivenko–Cantelli theorem and the classes of functions named after it (Glivenko–Cantelli classes) are fundamentally related to uniform convergence.

The law of large numbers says that, for each single event $A$, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of events of an entire class $S$ from one and the same sample. Moreover, it is required that the relative frequency of the events converge to the probability uniformly over the entire class of events $S$.^[1] The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event-family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.

Definitions

For a class of predicates $H$ defined on a set $X$ and a set of samples $x = (x_{1}, x_{2}, \dots, x_{m})$ , where $x_{i} \in X$ , the empirical frequency of $h \in H$ on $x$ is

\hat{Q}_{x} (h) = \frac{1}{m} | \{ i : 1 \leq i \leq m, h (x_{i}) = 1 \} | .

The theoretical probability of $h \in H$ is defined as $Q_{P} (h) = P \{ y \in X : h (y) = 1 \} .$
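The two definitions above can be illustrated with a minimal sketch. The finite set $X$ (a fair die), the predicate and the hand-picked sample are all hypothetical, chosen only so that both quantities can be computed exactly with rational arithmetic.

```python
from fractions import Fraction

def empirical_frequency(h, sample):
    """Fraction of sample points x_i with h(x_i) = 1."""
    return Fraction(sum(1 for x in sample if h(x) == 1), len(sample))

def theoretical_probability(h, dist):
    """P{y in X : h(y) = 1} for a finite distribution given as {point: probability}."""
    return sum((p for y, p in dist.items() if h(y) == 1), Fraction(0))

# Hypothetical example: X = {1, ..., 6} with the uniform distribution,
# and the predicate "the outcome is even".
dist = {y: Fraction(1, 6) for y in range(1, 7)}

def is_even(y):
    return 1 if y % 2 == 0 else 0

sample = (2, 3, 3, 6, 1, 4, 4, 2)  # m = 8 draws, written down by hand

q_hat = empirical_frequency(is_even, sample)    # 5 of the 8 points are even
q_p = theoretical_probability(is_even, dist)    # 3 of the 6 outcomes are even
print(q_hat, q_p, abs(q_p - q_hat))
```

The printed deviation is the quantity $| Q_{P} (h) - \hat{Q}_{x} (h) |$ for this single predicate; the theorem concerns the largest such deviation over all of $H$.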

The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to any distribution $P$ , then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability.^[2]

Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H$ is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
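As an informal illustration of this behaviour, the sketch below takes a deliberately simple class: the threshold predicates $h_{t} (y) = 1$ if $y \leq t$ on $X = [0, 1]$, which have VC dimension 1. It assumes $P$ is the uniform distribution, so that $Q_{P} (h_{t}) = t$. The class, the distribution, the seed and the sample sizes are assumptions made for the example and do not come from the theorem itself.

```python
import random

def sup_deviation_thresholds(sample):
    """sup over t in [0, 1] of |Q_P(h_t) - Q_hat_x(h_t)| for h_t(y) = 1 if y <= t,
    assuming P is uniform on [0, 1], so that Q_P(h_t) = t."""
    xs = sorted(sample)
    m = len(xs)
    dev = 0.0
    for i, x in enumerate(xs, start=1):
        # The empirical frequency jumps from (i - 1)/m to i/m at the point x,
        # so the largest gap near x is attained just before or at the jump.
        dev = max(dev, i / m - x, x - (i - 1) / m)
    return dev

rng = random.Random(0)  # fixed seed so the run is repeatable
for m in (100, 10_000):
    sample = [rng.random() for _ in range(m)]
    print(m, sup_deviation_thresholds(sample))
```

For this class the worst-case deviation over all thresholds typically shrinks as $m$ grows, which is the uniform behaviour the theorem quantifies.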

The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis^[1] using the concept of growth function.

Uniform Convergence Theorem

The statement of the Uniform Convergence Theorem is as follows:^[3]

If $H$ is a set of $\{0, 1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $ε > 0$ and $m$ a positive integer, we have:

P^{m} \{ | Q_{P} (h) - \hat{Q}_{x} (h) | \geq ε \text{ for some } h \in H \} \leq 4 Π_{H} (2 m) e^{- ε^{2} m / 8} .

In the above, for any $x \in X^{m},$ $Q_{P} (h) = P \{ y \in X : h (y) = 1 \},$ $\hat{Q}_{x} (h) = \frac{1}{m} | \{ i : 1 \leq i \leq m, h (x_{i}) = 1 \} |$ and $| x | = m .$ $P^{m}$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P .$

Finally, the growth function $Π_{H}$ is defined in the following way. Identify each $h \in H$ with the set $\{ y \in X : h (y) = 1 \}$. Then, for any set $H$ of $\{0, 1\}$-valued functions over $X$ and for any natural number $m$: $Π_{H} (m) = \max_{D \subseteq X, | D | = m} | \{ h \cap D : h \in H \} | .$
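For a small finite domain, the growth function can be computed by brute force, as in the sketch below. The domain `range(6)` and the two classes (thresholds and intervals) are hypothetical examples, and the function name `growth_function` is introduced here for illustration only.

```python
from itertools import combinations

def growth_function(hypotheses, domain, m):
    """Max, over subsets D of the domain with |D| = m, of the number of
    distinct restrictions of the hypotheses to D."""
    best = 0
    for D in combinations(domain, m):
        patterns = {tuple(h(x) for x in D) for h in hypotheses}
        best = max(best, len(patterns))
    return best

domain = range(6)

# Thresholds h_t(x) = 1 if x >= t, for t = 0, ..., 6.
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in range(7)]

# Intervals h_{a,b}(x) = 1 if a <= x <= b; b = a - 1 gives the empty interval.
intervals = [lambda x, a=a, b=b: 1 if a <= x <= b else 0
             for a in range(6) for b in range(a - 1, 6)]

for m in (1, 2, 3):
    print(m, growth_function(thresholds, domain, m),
          growth_function(intervals, domain, m))
```

On two points the thresholds realise only 3 of the 4 possible labelings, while the intervals realise all 4; on three points the intervals realise 7 of 8. This matches VC dimensions 1 and 2 for these two classes.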

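From the point of view of learning theory, one can consider $H$ to be the concept (hypothesis) class defined over the instance set $X$. Crucially, the Sauer–Shelah lemma implies that $Π_{H} (m) \leq \sum_{i = 0}^{d} \binom{m}{i}$, where $d$ is the VC dimension of $H$. This is a polynomial in $m$ of degree $d$, and it is at most $(e m / d)^{d}$ for $m \geq d \geq 1$.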
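To get a feel for the numbers, the right-hand side of the theorem can be evaluated with the growth function replaced by the Sauer–Shelah bound $\sum_{i = 0}^{d} \binom{m}{i}$. The sketch below does that; the function names and the doubling search are illustrative choices rather than part of the theorem, and the resulting sample sizes are upper estimates only.

```python
import math

def sauer_bound(m, d):
    """Sauer-Shelah upper bound on the growth function: sum_{i=0}^{d} C(m, i)."""
    return sum(math.comb(m, i) for i in range(d + 1))

def uniform_convergence_bound(m, eps, d):
    """4 * Pi_H(2m) * exp(-eps^2 * m / 8), with Pi_H(2m) replaced by its
    Sauer-Shelah upper bound for a class of VC dimension d."""
    return 4 * sauer_bound(2 * m, d) * math.exp(-(eps ** 2) * m / 8)

def sample_size_for(eps, delta, d):
    """First power of two m (found by doubling) at which the bound is at most
    delta. This is a coarse search, not the minimal sufficient m."""
    m = 1
    while uniform_convergence_bound(m, eps, d) > delta:
        m *= 2
    return m

print(uniform_convergence_bound(10_000, 0.1, 1))
print(sample_size_for(0.1, 0.05, 1))
```

The bound is vacuous (larger than 1) for small $m$ and then decays exponentially, because the factor $e^{- ε^{2} m / 8}$ eventually dominates the polynomial growth of $Π_{H} (2 m)$.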

Proof of the Uniform Convergence Theorem

The proof below follows ^[1] and ^[3]. Before getting into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview.

Symmetrization: We transform the problem of analyzing $| Q_{P} (h) - \hat{Q}_{x} (h) | \geq ε$ into the problem of analyzing $| \hat{Q}_{r} (h) - \hat{Q}_{s} (h) | \geq ε / 2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_{P} (h)$.
Permutation: Since $r$ and $s$ are drawn independently from the same distribution, swapping elements between them does not change the probability distribution on $r$ and $s$. So, we bound the probability of $| \hat{Q}_{r} (h) - \hat{Q}_{s} (h) | \geq ε / 2$ for some $h \in H$ by considering the effect of a specific collection of permutations of the joint sample $x = r | | s$, where $r | | s$ denotes the concatenation of $r$ and $s$. Specifically, we consider permutations $σ (x)$ which swap $x_{i}$ and $x_{m + i}$ for all $i$ in some subset of $\{ 1, 2, \dots, m \}$.
Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample; if $H$ has finite VC dimension, this reduces the problem to one involving a finite function class.

We present the technical details of the proof. It should be stressed that this proof glosses over details like the measurability of the events $V$ and $R$ ; measurability is granted in the case of $H$ being finite or countable, but this is not normally the case in standard applications of the theorem (e.g. for statistical learning theory or to prove the Glivenko-Cantelli theorem). To get measurability, one needs to use a notion of separability of the underlying space, possibly related to $H$ ^[4].

Symmetrization

Lemma: Let $V = \{ x \in X^{m} : | Q_{P} (h) - \hat{Q}_{x} (h) | \geq ε \text{ for some } h \in H \}$ and

R = \{ (r, s) \in X^{m} \times X^{m} : | \hat{Q}_{r} (h) - \hat{Q}_{s} (h) | \geq ε / 2 \text{ for some } h \in H \} .

Then for $m \geq \frac{2}{ε^{2}}$, $P^{m} (V) \leq 2 P^{2 m} (R)$.

Proof: By the triangle inequality,
if $| Q_{P} (h) - {\hat{Q}}_{r} (h) | \geq ε$ and $| Q_{P} (h) - {\hat{Q}}_{s} (h) | \leq ε / 2$ then $| {\hat{Q}}_{r} (h) - {\hat{Q}}_{s} (h) | \geq ε / 2$ .

Therefore,

\begin{matrix} P^{2 m} (R) & \geq & P^{2 m} \{ \exists h \in H, | Q_{P} (h) - \hat{Q}_{r} (h) | \geq ε \text{ and } | Q_{P} (h) - \hat{Q}_{s} (h) | \leq ε / 2 \} \\ & = & \int_{V} P^{m} \{ s : \exists h \in H, | Q_{P} (h) - \hat{Q}_{r} (h) | \geq ε \text{ and } | Q_{P} (h) - \hat{Q}_{s} (h) | \leq ε / 2 \} \, d P^{m} (r) \\ & = & A \end{matrix}

since $r$ and $s$ are independent.

Now for $r \in V$ fix an $h \in H$ such that $| Q_{P} (h) - \hat{Q}_{r} (h) | \geq ε$. For this $h$, we shall show that

P^{m} \{ s : | Q_{P} (h) - \hat{Q}_{s} (h) | \leq \frac{ε}{2} \} \geq \frac{1}{2} .

Granting this, the integrand in $A$ is at least $\frac{1}{2}$ for every $r \in V$, so $A \geq \frac{P^{m} (V)}{2}$ and hence $P^{2 m} (R) \geq \frac{P^{m} (V)}{2}$. This completes the first step of the high-level idea; it remains to prove the displayed bound.

Notice that $m \cdot \hat{Q}_{s} (h)$ is a binomial random variable with expectation $m \cdot Q_{P} (h)$ and variance $m \cdot Q_{P} (h) (1 - Q_{P} (h))$. By Chebyshev's inequality we get

P^{m} \{ s : | Q_{P} (h) - \hat{Q}_{s} (h) | > \frac{ε}{2} \} \leq \frac{m \cdot Q_{P} (h) (1 - Q_{P} (h))}{(ε m / 2)^{2}} \leq \frac{1}{ε^{2} m} \leq \frac{1}{2}

for the mentioned bound on $m$, namely $m \geq \frac{2}{ε^{2}}$. Here we use the fact that $x (1 - x) \leq 1 / 4$ for all $x$.
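The Chebyshev step can be checked numerically against the exact binomial distribution. The following sketch is only a sanity check for particular, arbitrarily chosen values of $ε$ and $Q_{P} (h)$; the function name `prob_within` is introduced here for illustration.

```python
import math

def prob_within(m, q, eps):
    """Exact P^m{ |Q_P(h) - Q_hat_s(h)| <= eps/2 } when m * Q_hat_s(h)
    is Binomial(m, q), i.e. q plays the role of Q_P(h)."""
    total = 0.0
    for k in range(m + 1):
        if abs(q - k / m) <= eps / 2:
            total += math.comb(m, k) * q ** k * (1 - q) ** (m - k)
    return total

eps = 0.2
m = 50  # equals 2 / eps^2, the smallest m the lemma allows for this eps

for q in (0.1, 0.3, 0.5, 0.9):
    # The lemma's argument guarantees each value is at least 1/2.
    print(q, prob_within(m, q, eps))
```

In such runs the exact probabilities are usually well above $\frac{1}{2}$, which reflects how loose Chebyshev's inequality is; the proof only needs the constant $\frac{1}{2}$.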

Permutations

Let $Γ_{m}$ be the set of all permutations of $\{ 1, 2, 3, \dots, 2 m \}$ that swap $i$ and $m + i$ for all $i$ in some subset of $\{ 1, 2, 3, \dots, m \}$.

Lemma: Let $R$ be any subset of $X^{2 m}$ and $P$ any probability distribution on $X$ . Then,

P^{2 m} (R) = E [\Pr [σ (x) \in R]] \leq \max_{x \in X^{2 m}} (\Pr [σ (x) \in R]),

where the expectation is over $x$ chosen according to $P^{2 m}$ , and the probability is over $σ$ chosen uniformly from $Γ_{m}$ .

Proof: For any $σ \in Γ_{m},$

P^{2 m} (R) = P^{2 m} \{ x : σ (x) \in R \}

(since coordinate permutations preserve the product distribution $P^{2 m}$ .)

\begin{matrix} ∴ P^{2 m} (R) & = & \int_{X^{2 m}} 1_{R} (x) \, d P^{2 m} (x) \\ & = & \frac{1}{| Γ_{m} |} \sum_{σ \in Γ_{m}} \int_{X^{2 m}} 1_{R} (σ (x)) \, d P^{2 m} (x) \\ & = & \int_{X^{2 m}} \frac{1}{| Γ_{m} |} \sum_{σ \in Γ_{m}} 1_{R} (σ (x)) \, d P^{2 m} (x) \quad \text{(because } | Γ_{m} | \text{ is finite)} \\ & = & \int_{X^{2 m}} \Pr [σ (x) \in R] \, d P^{2 m} (x) \quad \text{(the expectation)} \\ & \leq & \max_{x \in X^{2 m}} (\Pr [σ (x) \in R]) . \end{matrix}

The maximum is guaranteed to exist since there is only a finite set of values that probability under a random permutation can take.
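The lemma can be verified exactly on a toy case by enumerating $Γ_{m}$ and all joint samples. In the sketch below, the two-point set $X = \{0, 1\}$, the distribution $P$, the value $m = 2$ and the event $R$ are hypothetical choices, and indices are 0-based.

```python
from fractions import Fraction
from itertools import product
from math import prod

def swap_permutations(m):
    """All 2^m elements of Gamma_m as tuples of 0-based indices into a joint
    sample of length 2m: positions i and m + i are exchanged for each i in a
    chosen subset of {0, ..., m - 1}."""
    perms = []
    for flips in product((False, True), repeat=m):
        sigma = list(range(2 * m))
        for i, flip in enumerate(flips):
            if flip:
                sigma[i], sigma[m + i] = m + i, i
        perms.append(tuple(sigma))
    return perms

def permute(sigma, x):
    return tuple(x[j] for j in sigma)

m = 2
gamma = swap_permutations(m)
p = {0: Fraction(2, 3), 1: Fraction(1, 3)}  # hypothetical P on X = {0, 1}

def in_R(x):
    # Hypothetical event: the frequencies of 1s in the two halves differ by >= 1/2.
    return 2 * abs(sum(x[:m]) - sum(x[m:])) >= m

def prob_sigma(x):
    """Pr over sigma uniform in Gamma_m that sigma(x) lies in R."""
    return Fraction(sum(in_R(permute(s, x)) for s in gamma), len(gamma))

points = list(product(p, repeat=2 * m))

def weight(x):
    return prod(p[v] for v in x)  # P^{2m}({x}) for the product distribution

lhs = sum(weight(x) for x in points if in_R(x))        # P^{2m}(R)
rhs = sum(weight(x) * prob_sigma(x) for x in points)   # E[Pr[sigma(x) in R]]
bound = max(prob_sigma(x) for x in points)
print(lhs, rhs, bound)
```

Because rational arithmetic is used, the equality between `lhs` and `rhs` is exact rather than approximate, and `bound` is the maximum that the next lemma goes on to control.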

Reduction to a finite class

Lemma: Building on the previous lemma, for the set $R$ defined in the symmetrization step,

\max_{x \in X^{2 m}} (\Pr [σ (x) \in R]) \leq 2 Π_{H} (2 m) e^{- ε^{2} m / 8} .

Proof: Let $x = (x_{1}, x_{2}, \dots, x_{2 m})$ and let $t = | H|_{x} |$ be the number of distinct restrictions of functions in $H$ to the points of $x$, which is at most $Π_{H} (2 m)$. This means there are functions $h_{1}, h_{2}, \dots, h_{t} \in H$ such that for any $h \in H$ there is an $i$ between $1$ and $t$ with $h_{i} (x_{k}) = h (x_{k})$ for $1 \leq k \leq 2 m .$

We see that $σ (x) \in R$ iff some $h$ in $H$ satisfies $| \frac{1}{m} | \{ 1 \leq i \leq m : h (x_{σ (i)}) = 1 \} | - \frac{1}{m} | \{ m + 1 \leq i \leq 2 m : h (x_{σ (i)}) = 1 \} | | \geq \frac{ε}{2}$. Hence, if we define $w_{i}^{j} = 1$ if $h_{j} (x_{i}) = 1$ and $w_{i}^{j} = 0$ otherwise, for $1 \leq i \leq 2 m$ and $1 \leq j \leq t$, we have that $σ (x) \in R$ iff some $j$ in $1, \dots, t$ satisfies $| \frac{1}{m} (\sum_{i = 1}^{m} w_{σ (i)}^{j} - \sum_{i = 1}^{m} w_{σ (m + i)}^{j}) | \geq \frac{ε}{2}$. By the union bound we get

\Pr [σ (x) \in R] \leq t \cdot \max_{1 \leq j \leq t} (\Pr [| \frac{1}{m} (\sum_{i = 1}^{m} w_{σ (i)}^{j} - \sum_{i = 1}^{m} w_{σ (m + i)}^{j}) | \geq \frac{ε}{2}])

\leq Π_{H} (2 m) \cdot \max_{1 \leq j \leq t} (\Pr [| \frac{1}{m} (\sum_{i = 1}^{m} w_{σ (i)}^{j} - \sum_{i = 1}^{m} w_{σ (m + i)}^{j}) | \geq \frac{ε}{2}]) .

Since $σ$ is chosen uniformly from $Γ_{m}$, each pair $(i, m + i)$ is swapped independently with probability $\frac{1}{2}$, so for each $i$ the difference $w_{σ (i)}^{j} - w_{σ (m + i)}^{j}$ equals $\pm | w_{i}^{j} - w_{m + i}^{j} |$, with equal probability.

Thus,

\Pr [| \frac{1}{m} (\sum_{i = 1}^{m} (w_{σ (i)}^{j} - w_{σ (m + i)}^{j})) | \geq \frac{ε}{2}] = \Pr [| \frac{1}{m} (\sum_{i = 1}^{m} | w_{i}^{j} - w_{m + i}^{j} | β_{i}) | \geq \frac{ε}{2}],

where the probability on the right is over independent signs $β_{i} \in \{ -1, +1 \}$, each value being equally likely. By Hoeffding's inequality, this is at most $2 e^{- m ε^{2} / 8}$.
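For small $m$ the probability on the right can be computed exactly by enumerating all sign vectors and compared with the Hoeffding bound. In the sketch below, the vector `a` stands for hypothetical values of $| w_{i}^{j} - w_{m + i}^{j} |$, and the function names are introduced for illustration.

```python
import math
from itertools import product

def exact_tail(a, eps):
    """Pr[ |(1/m) * sum_i a_i * beta_i| >= eps/2 ] for independent uniform
    signs beta_i in {-1, +1}, by enumerating all 2^m sign vectors."""
    m = len(a)
    hits = sum(
        1
        for beta in product((-1, 1), repeat=m)
        if 2 * abs(sum(ai * bi for ai, bi in zip(a, beta))) >= eps * m
    )
    return hits / 2 ** m

def hoeffding_bound(m, eps):
    """The bound 2 * exp(-m * eps^2 / 8) used in the proof."""
    return 2 * math.exp(-m * eps ** 2 / 8)

a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical |w_i - w_{m+i}| values, m = 10
for eps in (0.5, 1.0):
    print(eps, exact_tail(a, eps), hoeffding_bound(len(a), eps))
```

The enumeration costs $2^{m}$ steps, so it is only feasible for very small $m$. Its purpose is to make the inequality concrete, not to replace it.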

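Finally, we combine the three parts of the proof. For $m \geq 2 / ε^{2}$, symmetrization gives $P^{m} (V) \leq 2 P^{2 m} (R)$, the permutation lemma bounds $P^{2 m} (R)$ by $\max_{x \in X^{2 m}} \Pr [σ (x) \in R]$, and the union bound together with Hoeffding's inequality bounds this maximum by $Π_{H} (2 m) \cdot 2 e^{- ε^{2} m / 8}$. Hence $P^{m} (V) \leq 4 Π_{H} (2 m) e^{- ε^{2} m / 8}$, which is the Uniform Convergence Theorem. For $m < 2 / ε^{2}$ the right-hand side exceeds $1$ (since $Π_{H} (2 m) \geq 1$ and $4 e^{- 1 / 4} > 1$), so the bound holds trivially.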

References

[1] Vapnik and Chervonenkis (bibliographic details missing from the source). English translation, by B. Seckler, of the Russian paper; the translation was later reproduced.
[2] (bibliographic details missing from the source)
[3] Martin Anthony, Peter L. Bartlett. Neural Network Learning: Theoretical Foundations, pages 46–50. First Edition, 1999. Cambridge University Press.
[4] (bibliographic details missing from the source; citation key KrappWirthFTSL)
