Bayes classifier

In statistical classification, the Bayes classifier is the classifier that has the smallest probability of misclassification among all classifiers using the same set of features.^[1]

Definition

Suppose a pair $(X, Y)$ takes values in $ℝ^{d} \times \{1, 2, \dots, K\}$, where $Y$ is the class label of an element whose features are given by $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by $(X ∣ Y = r) \sim P_{r}$ for $r = 1, 2, \dots, K$, where "$\sim$" means "is distributed as" and $P_{r}$ denotes a probability distribution.

A classifier is a rule that assigns to an observation $X = x$ a guess or estimate of what the unobserved label $Y = r$ actually was. In theoretical terms, a classifier is a measurable function $C : ℝ^{d} \to \{1, 2, \dots, K\}$, with the interpretation that $C$ classifies the point $x$ to the class $C (x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as $ℛ (C) = P \{C (X) \neq Y\}.$

The Bayes classifier is $C^{Bayes} (x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} P (Y = r ∣ X = x).$

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively, in this case $P (Y = r ∣ X = x)$. Because these distributions are rarely known exactly, the Bayes classifier mainly serves as a benchmark in statistical classification.
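The definition can be made concrete on a toy problem in which the distributions are known exactly. The following is a minimal sketch, not a practical method: the priors, the class-conditional probabilities, and the names `PRIOR`, `LIKELIHOOD`, `posterior`, `bayes_classifier` and `risk` are all hypothetical, and a single discrete feature stands in for $ℝ^{d}$. Exact fractions are used so that the risk can be read off without rounding.

```python
from fractions import Fraction

# Hypothetical toy model: K = 2 classes, one discrete feature x in {0, 1, 2}.
PRIOR = {1: Fraction(1, 2), 2: Fraction(1, 2)}  # P(Y = r)
LIKELIHOOD = {  # P_r(x) = P(X = x | Y = r)
    1: {0: Fraction(6, 10), 1: Fraction(3, 10), 2: Fraction(1, 10)},
    2: {0: Fraction(1, 10), 1: Fraction(3, 10), 2: Fraction(6, 10)},
}


def posterior(x):
    """Return {r: P(Y = r | X = x)}, computed with Bayes' rule."""
    joint = {r: PRIOR[r] * LIKELIHOOD[r][x] for r in PRIOR}
    evidence = sum(joint.values())  # P(X = x)
    return {r: p / evidence for r, p in joint.items()}


def bayes_classifier(x):
    """Return the label with the largest posterior probability.

    Ties are broken in favour of the smallest label (an arbitrary choice).
    """
    post = posterior(x)
    return max(sorted(post), key=post.get)


def risk(classifier):
    """Exact misclassification probability P(C(X) != Y) under the toy model."""
    return sum(
        PRIOR[r] * LIKELIHOOD[r][x]
        for r in PRIOR
        for x in LIKELIHOOD[r]
        if classifier(x) != r
    )
```

Under these assumed numbers, `risk(bayes_classifier)` gives the smallest risk any rule on this toy model can achieve, and `risk(lambda x: 1)` gives the risk of a constant rule for comparison.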

The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as $ℛ (C) - ℛ (C^{Bayes}).$ This quantity is non-negative, because no classifier has a smaller risk than the Bayes classifier, and it is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.^[2]
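As an illustration of excess risk and consistency, the sketch below trains a simple plug-in classifier on simulated data and computes its exact excess risk. Everything in it is an assumption made for the example: the three-point feature space, the posterior values in `ETA`, the uniform distribution `P_X`, and the helper names. It is one way to observe the excess risk shrinking with the sample size, not a general proof of consistency.

```python
import random

# Hypothetical model: X uniform on {0, 1, 2}, binary label Y with
# ETA[x] = P(Y = 1 | X = x).
ETA = {0: 0.9, 1: 0.2, 2: 0.7}
XS = list(ETA)
P_X = {x: 1 / len(XS) for x in XS}


def risk(classifier):
    """Exact risk P(C(X) != Y) of a classifier mapping x to 0 or 1."""
    return sum(
        P_X[x] * (ETA[x] if classifier(x) == 0 else 1 - ETA[x])
        for x in XS
    )


def bayes(x):
    """Bayes classifier for the assumed model: threshold ETA at 1/2."""
    return 1 if ETA[x] >= 0.5 else 0


def train_plugin(n, rng):
    """Fit a plug-in classifier from n simulated samples.

    The posterior at x is estimated by the fraction of label-1 samples
    observed at x, and the estimate is thresholded at 1/2.
    """
    counts = {x: [0, 0] for x in XS}  # [samples seen at x, samples with Y = 1]
    for _ in range(n):
        x = rng.choice(XS)
        y = 1 if rng.random() < ETA[x] else 0
        counts[x][0] += 1
        counts[x][1] += y

    def classifier(x):
        total, ones = counts[x]
        # With no data at x, fall back to predicting 0 (an arbitrary choice).
        return 1 if total > 0 and ones / total >= 0.5 else 0

    return classifier


def excess_risk(classifier):
    """Risk of the classifier minus the Bayes risk of the assumed model."""
    return risk(classifier) - risk(bayes)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (5, 50, 500, 5000):
        print(n, excess_risk(train_plugin(n, rng)))
```

The printed values depend on the random seed; the point of the sketch is only that the excess risk is never negative and tends to vanish once every feature value has been observed often enough.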

Assuming the components $x_{i}$ of $x$ to be conditionally independent given the class label, we get the naive Bayes classifier, where $C^{Bayes} (x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} P (Y = r) \prod_{i = 1}^{d} P_{r} (x_{i}).$
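The product form above can be sketched for binary features as follows. The priors in `PRIOR`, the per-feature probabilities in `THETA`, and the function names are hypothetical values chosen for illustration; the scores are computed in log space, a common choice that avoids underflow when $d$ is large.

```python
import math

# Hypothetical model with K = 2 classes and d = 3 binary features.
PRIOR = {1: 0.6, 2: 0.4}  # P(Y = r)
# THETA[r][i] = P(x_i = 1 | Y = r); features are treated as independent
# given the class, which is the naive Bayes assumption.
THETA = {1: [0.8, 0.1, 0.5], 2: [0.3, 0.7, 0.5]}


def log_score(r, x):
    """Return log P(Y = r) + sum_i log P_r(x_i) for a binary feature vector x."""
    score = math.log(PRIOR[r])
    for theta, xi in zip(THETA[r], x):
        score += math.log(theta if xi == 1 else 1 - theta)
    return score


def naive_bayes(x):
    """Return the class with the largest score (ties go to the smallest label)."""
    return max(sorted(PRIOR), key=lambda r: log_score(r, x))
```

For instance, `naive_bayes((1, 0, 1))` compares the products $0.6 \cdot 0.8 \cdot 0.9 \cdot 0.5$ and $0.4 \cdot 0.3 \cdot 0.3 \cdot 0.5$ and returns the class with the larger one.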

Properties

The proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows for the binary case.

Define the variables: risk $R (h)$, Bayes risk $R^{*}$, and class labels taking values in $\{0, 1\}$. Let the posterior probability of a point belonging to class 1 be $η (x) = \Pr (Y = 1 | X = x)$. Define the classifier $h^{*}$ as $h^{*} (x) = \begin{cases} 1 & \text{if } η (x) ⩾ 0.5, \\ 0 & \text{otherwise.} \end{cases}$

Then we have the following results:

(a) $R (h^{*}) = R^{*}$, i.e. $h^{*}$ is a Bayes classifier.
(b) For any classifier $h$, the excess risk satisfies $R (h) - R^{*} = 2 𝔼_{X} [| η (X) - 0.5 | \cdot 𝕀_{\{h (X) \neq h^{*} (X)\}}]$.
(c) $R^{*} = 𝔼_{X} [\min (η (X), 1 - η (X))]$.
(d) $R^{*} = \frac{1}{2} - \frac{1}{2} 𝔼 [| 2 η (X) - 1 |]$.

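Proof of (a): For any classifier $h$, we have $\begin{aligned} R (h) &= 𝔼_{X Y} [𝕀_{\{h (X) \neq Y\}}] \\ &= 𝔼_{X} 𝔼_{Y | X} [𝕀_{\{h (X) \neq Y\}}] \\ &= 𝔼_{X} [η (X) 𝕀_{\{h (X) = 0\}} + (1 - η (X)) 𝕀_{\{h (X) = 1\}}] \end{aligned}$ where the second line was derived through Fubini's theorem. The third line holds because, given $X$, a prediction of 0 is wrong with probability $η (X)$ and a prediction of 1 is wrong with probability $1 - η (X)$.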

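Notice that $R (h)$ is minimised by choosing, for every point $x$, the label with the smaller conditional error, that is, $h (x) = \begin{cases} 1 & \text{if } η (x) ⩾ 1 - η (x), \\ 0 & \text{otherwise.} \end{cases}$ The condition $η (x) ⩾ 1 - η (x)$ is equivalent to $η (x) ⩾ 0.5$, so this classifier is exactly $h^{*}$.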

Therefore the minimum possible risk is the Bayes risk, $R^{*} = R (h^{*})$ .

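Proof of (b): $\begin{aligned} R (h) - R^{*} &= R (h) - R (h^{*}) \\ &= 𝔼_{X} [η (X) 𝕀_{\{h (X) = 0\}} + (1 - η (X)) 𝕀_{\{h (X) = 1\}} - η (X) 𝕀_{\{h^{*} (X) = 0\}} - (1 - η (X)) 𝕀_{\{h^{*} (X) = 1\}}] \\ &= 𝔼_{X} [| 2 η (X) - 1 | 𝕀_{\{h (X) \neq h^{*} (X)\}}] \\ &= 2 𝔼_{X} [| η (X) - 0.5 | 𝕀_{\{h (X) \neq h^{*} (X)\}}] \end{aligned}$ The third line holds because the integrand vanishes where $h (X) = h^{*} (X)$, and where the two disagree it equals $2 η (X) - 1$ if $h^{*} (X) = 1$ and $1 - 2 η (X)$ if $h^{*} (X) = 0$, both of which are non-negative by the definition of $h^{*}$.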

Proof of (c): $\begin{aligned} R (h^{*}) &= 𝔼_{X} [η (X) 𝕀_{\{h^{*} (X) = 0\}} + (1 - η (X)) 𝕀_{\{h^{*} (X) = 1\}}] \\ &= 𝔼_{X} [\min (η (X), 1 - η (X))] \end{aligned}$

Proof of (d): $\begin{aligned} R (h^{*}) &= 𝔼_{X} [\min (η (X), 1 - η (X))] \\ &= \frac{1}{2} - 𝔼_{X} [\max (η (X) - 1 / 2, 1 / 2 - η (X))] \\ &= \frac{1}{2} - \frac{1}{2} 𝔼 [| 2 η (X) - 1 |] \end{aligned}$
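The four results can be checked numerically on a small finite feature space by enumerating every possible classifier. The sketch below is only a sanity check under assumed values: the distribution `P_X`, the posterior `ETA`, and all function names are hypothetical, and exact fractions are used so that the identities can be tested with equality.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical model: three feature values with P(X = x) and
# ETA[x] = P(Y = 1 | X = x).
P_X = {0: F(1, 2), 1: F(1, 4), 2: F(1, 4)}
ETA = {0: F(9, 10), 1: F(1, 5), 2: F(1, 2)}
POINTS = sorted(P_X)


def expect(f):
    """Expectation of f(X) under P_X."""
    return sum(P_X[x] * f(x) for x in POINTS)


def risk(h):
    """R(h) = E[eta(X) 1{h(X) = 0} + (1 - eta(X)) 1{h(X) = 1}]."""
    return expect(lambda x: ETA[x] if h[x] == 0 else 1 - ETA[x])


# h*(x) = 1 if eta(x) >= 1/2, else 0.
h_star = {x: 1 if ETA[x] >= F(1, 2) else 0 for x in POINTS}
bayes_risk = risk(h_star)

# Every map from the three points to {0, 1}: 2**3 = 8 classifiers.
all_classifiers = [
    dict(zip(POINTS, labels)) for labels in product((0, 1), repeat=len(POINTS))
]


def check_results():
    """Assert (a)-(d) for the assumed model and return the Bayes risk."""
    # (a) no classifier has a smaller risk than h*.
    assert all(risk(h) >= bayes_risk for h in all_classifiers)
    # (b) excess risk formula, for every classifier h.
    for h in all_classifiers:
        rhs = 2 * expect(
            lambda x: abs(ETA[x] - F(1, 2)) * int(h[x] != h_star[x])
        )
        assert risk(h) - bayes_risk == rhs
    # (c) Bayes risk as the expected minimum of eta and 1 - eta.
    assert bayes_risk == expect(lambda x: min(ETA[x], 1 - ETA[x]))
    # (d) Bayes risk in terms of E|2 eta - 1|.
    assert bayes_risk == F(1, 2) - F(1, 2) * expect(lambda x: abs(2 * ETA[x] - 1))
    return bayes_risk
```

Calling `check_results()` raises an `AssertionError` if any identity fails for the assumed numbers and otherwise returns the Bayes risk as an exact fraction.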

General case

The general case, in which the Bayes classifier minimises the classification error when each element can belong to any of $n$ categories, proceeds by the tower property of expectation as follows. $\begin{aligned} 𝔼_{X Y} (𝕀_{\{y \neq \hat{y}\}}) &= 𝔼_{X} 𝔼_{Y | X} (𝕀_{\{y \neq \hat{y}\}} | X = x) \\ &= 𝔼 [\Pr (Y = 1 | X = x) 𝕀_{\{\hat{y} \in \{2, 3, \dots, n\}\}} + \Pr (Y = 2 | X = x) 𝕀_{\{\hat{y} \in \{1, 3, \dots, n\}\}} + \dots + \Pr (Y = n | X = x) 𝕀_{\{\hat{y} \in \{1, 2, 3, \dots, n - 1\}\}}] \end{aligned}$

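This is minimised by minimising the term inside the expectation for every $x$ simultaneously. For a given $x$ and prediction $\hat{y}$, that term equals $1 - \Pr (Y = \hat{y} | X = x)$, so the minimiser is the classifier $h (x) = \arg \max_{k} \Pr (Y = k | X = x)$ for each observation $x$.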
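The multi-class argument can be checked by brute force on a small example. The sketch below assumes three classes and two feature values with hypothetical posteriors in `POSTERIOR`; it compares the risk of the argmax rule with the smallest risk over all $3^{2} = 9$ possible classifiers. All names are illustrative.

```python
from fractions import Fraction as F
from itertools import product

CLASSES = (1, 2, 3)
# Hypothetical model: P(X = x) and POSTERIOR[x][k] = P(Y = k | X = x).
P_X = {"a": F(1, 2), "b": F(1, 2)}
POSTERIOR = {
    "a": {1: F(1, 5), 2: F(1, 2), 3: F(3, 10)},
    "b": {1: F(3, 5), 2: F(1, 10), 3: F(3, 10)},
}
POINTS = sorted(P_X)


def conditional_error(x, y_hat):
    """P(Y != y_hat | X = x): sum of the posteriors of all other classes."""
    return sum(p for k, p in POSTERIOR[x].items() if k != y_hat)


def risk(h):
    """Misclassification probability of a classifier given as {x: label}."""
    return sum(P_X[x] * conditional_error(x, h[x]) for x in POINTS)


# Argmax rule: pick the class with the largest posterior at each point.
h_bayes = {x: max(CLASSES, key=POSTERIOR[x].get) for x in POINTS}


def brute_force_minimum():
    """Smallest risk over every map from POINTS to CLASSES."""
    candidates = (
        dict(zip(POINTS, labels))
        for labels in product(CLASSES, repeat=len(POINTS))
    )
    return min(risk(h) for h in candidates)
```

On these assumed numbers `risk(h_bayes)` and `brute_force_minimum()` coincide, which is the pointwise-minimisation argument carried out by enumeration.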

References

[1]

[2]
