Binomial proportion confidence interval


In statistics, a binomial proportion confidence interval is a confidence interval for the probability of success calculated from the outcome of a series of success–failure experiments (Bernoulli trials). In other words, a binomial proportion confidence interval is an interval estimate of a success probability $p$ when only the number of experiments $n$ and the number of successes $n_s$ are known.

There are several formulas for a binomial confidence interval, but all of them rely on the assumption of a binomial distribution. In general, a binomial distribution applies when an experiment is repeated a fixed number of times, each trial of the experiment has two possible outcomes (success and failure), the probability of success is the same for each trial, and the trials are statistically independent. Because the binomial distribution is a discrete probability distribution (i.e., not continuous) and difficult to calculate for large numbers of trials, a variety of approximations are used to calculate this confidence interval, all with their own tradeoffs in accuracy and computational intensity.

A simple example of a binomial distribution is the set of various possible outcomes, and their probabilities, for the number of heads observed when a coin is flipped ten times. The observed binomial proportion is the fraction of the flips that turn out to be heads. Given this observed proportion, the confidence interval for the true probability of the coin landing on heads is a range of possible proportions, which may or may not contain the true proportion. A 95% confidence interval for the proportion, for instance, will contain the true proportion 95% of the times that the procedure for constructing the confidence interval is employed.[1]

Problems with using a normal approximation or "Wald interval"

Plotting the normal approximation interval on an arbitrary logistic curve reveals problems of overshoot and zero-width intervals.[2]

A commonly used formula for a binomial confidence interval relies on approximating the distribution of error about a binomially distributed observation, $\hat{p}$, with a normal distribution.[3] The normal approximation depends on the de Moivre–Laplace theorem (the original, binomial-only version of the central limit theorem) and becomes unreliable when it violates the theorem's premises, as the sample size becomes small or the success probability grows close to either 0 or 1.[4]

Using the normal approximation, the success probability p is estimated by

$p \approx \hat{p} \pm \frac{z_\alpha}{\sqrt{n}}\sqrt{\hat{p}(1-\hat{p})},$

where $\hat{p} \equiv \frac{n_s}{n}$ is the proportion of successes in a Bernoulli trial process and an estimator for $p$ in the underlying Bernoulli distribution. The equivalent formula in terms of observation counts is

$p \approx \frac{n_s}{n} \pm \frac{z_\alpha}{\sqrt{n}}\sqrt{\frac{n_s}{n}\frac{n_f}{n}},$

where the data are the results of $n$ trials that yielded $n_s$ successes and $n_f = n - n_s$ failures. The distribution function argument $z_\alpha$ is the $1 - \frac{\alpha}{2}$ quantile of a standard normal distribution (i.e., the probit) corresponding to the target error rate $\alpha$. For a 95% confidence level, the error $\alpha = 1 - 0.95 = 0.05$, so $1 - \frac{\alpha}{2} = 0.975$ and $z_{.05} = 1.96$.
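As an illustrative sketch (the function name is not from any library), the Wald formula can be computed in a few lines; the example also shows the zero-width pathology discussed below:

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(n_s, n, alpha=0.05):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p_hat = n_s / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # 1 - alpha/2 quantile, e.g. 1.96
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# 20 heads in 40 flips at 95% confidence:
lo, hi = wald_interval(20, 40)      # about (0.345, 0.655)
# Zero-width pathology: no successes gives a degenerate interval.
assert wald_interval(0, 10) == (0.0, 0.0)
```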

When using the Wald formula to estimate p, or just considering the possible outcomes of this calculation, two problems immediately become apparent:

  • First, for $\hat{p}$ approaching either 1 or 0, the interval narrows to zero width (falsely implying certainty).
  • Second, for values of $\hat{p} < \frac{1}{1 + n/z_\alpha^2}$ (probability too low / too close to 0), the interval boundaries exceed $[0, 1]$ (overshoot).

(Another version of the second, overshoot problem arises when instead $1 - \hat{p}$ falls below the same bound: probability too high / too close to 1.)

An important theoretical derivation of this confidence interval involves the inversion of a hypothesis test. Under this formulation, the confidence interval represents those values of the population parameter that would have large P-values if they were tested as a hypothesized population proportion. The collection of values, $\theta$, for which the normal approximation is valid can be represented as

$\left\{\theta \;\middle|\; y_\alpha \le \frac{\hat{p} - \theta}{\sqrt{\frac{1}{n}\hat{p}(1-\hat{p})}} \le z_\alpha\right\},$

where $y_\alpha$ is the lower $\frac{\alpha}{2}$ quantile of a standard normal distribution, vs. $z_\alpha$, which is the upper (i.e., $1 - \frac{\alpha}{2}$) quantile.

Since the test in the middle of the inequality is a Wald test, the normal approximation interval is sometimes called the Wald interval or Wald method, after Abraham Wald, but it was first described by Laplace (1812).[5]

Bracketing the confidence interval


Extending the normal approximation and Wald-Laplace interval concepts, Michael Short has shown that inequalities on the approximation error between the binomial distribution and the normal distribution can be used to accurately bracket the estimate of the confidence interval around p:[6]

$\frac{k + C_{L1} - z_\alpha\sqrt{\widehat{W}}}{n + z_\alpha^2} \;\le\; p \;\le\; \frac{k + C_{U1} + z_\alpha\sqrt{\widehat{W}}}{n + z_\alpha^2}$

with $\widehat{W} \equiv \frac{nk - k^2 + C_{L2}\,n - C_{L3}\,k + C_{L4}}{n},$

and where $p$ is again the (unknown) proportion of successes in a Bernoulli trial process (as opposed to $\hat{p} \equiv \frac{n_s}{n}$ that estimates it) measured with $n$ trials yielding $k$ successes, $z_\alpha$ is the $1 - \frac{\alpha}{2}$ quantile of a standard normal distribution (i.e., the probit) corresponding to the target error rate $\alpha$, and the constants $C_{L1}$, $C_{L2}$, $C_{L3}$, $C_{L4}$, $C_{U1}$, $C_{U2}$, $C_{U3}$, and $C_{U4}$ are simple algebraic functions of $z_\alpha$.[6] For a fixed $\alpha$ (and hence $z_\alpha$), the above inequalities give easily computed one- or two-sided intervals which bracket the exact binomial upper and lower confidence limits corresponding to the error rate $\alpha$.

Standard error of a proportion estimation when using weighted data


Let there be a simple random sample $X_1, \ldots, X_n$ where each $X_i$ is i.i.d. from a Bernoulli($p$) distribution, and let $w_i$ be the (positive) weight for each observation, with the weights normalized so they sum to 1. The weighted sample proportion is $\hat{p} = \sum_{i=1}^n w_i X_i$. Since each of the $X_i$ is independent from all the others, and each one has variance $\operatorname{var}\{X_i\} = p(1-p)$ for every $i = 1, \ldots, n$, the sampling variance of the proportion therefore is:[7]

$\operatorname{var}\{\hat{p}\} = \sum_{i=1}^n \operatorname{var}\{w_i X_i\} = p(1-p)\sum_{i=1}^n w_i^2.$

The standard error of $\hat{p}$ is the square root of this quantity. Because we do not know $p(1-p)$, we have to estimate it. Although there are many possible estimators, a conventional one is to use $\hat{p}$, the sample mean, and plug this into the formula. That gives:

$\operatorname{SE}\{\hat{p}\} \approx \sqrt{\hat{p}(1-\hat{p})\sum_{i=1}^n w_i^2}$

For otherwise unweighted data, the effective weights are uniform, $w_i = 1/n$, giving $\sum_{i=1}^n w_i^2 = \frac{1}{n}$. The SE becomes $\sqrt{\frac{1}{n}\hat{p}(1-\hat{p})}$, leading to the familiar formulas, showing that the calculation for weighted data is a direct generalization of them.
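A minimal sketch of the weighted standard error (the function name is illustrative), checking that uniform weights recover the unweighted formula:

```python
from math import sqrt

def weighted_proportion_se(x, w):
    """Estimated standard error of a weighted sample proportion.

    x: 0/1 outcomes; w: positive weights (normalized here to sum to 1).
    """
    total = sum(w)
    w = [wi / total for wi in w]                  # normalize weights
    p_hat = sum(wi * xi for wi, xi in zip(w, x))  # weighted proportion
    return sqrt(p_hat * (1 - p_hat) * sum(wi ** 2 for wi in w))

# Uniform weights recover the familiar sqrt(p_hat * (1 - p_hat) / n):
x = [1, 0, 1, 1, 0]
assert abs(weighted_proportion_se(x, [1] * 5) - sqrt(0.6 * 0.4 / 5)) < 1e-12
```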

Wilson score interval

Wilson score intervals plotted on a logistic curve, revealing asymmetry and good performance for small n and where p is at or near 0 or 1.

The Wilson score interval was developed by E. B. Wilson (1927).[8] It is an improvement over the normal approximation interval in multiple respects: unlike the symmetric normal approximation interval (above), the Wilson score interval is asymmetric, and it doesn't suffer from the problems of overshoot and zero-width intervals that afflict the normal interval. It can be safely employed with small samples and skewed observations.[3] The observed coverage probability is consistently closer to the nominal value, $1 - \alpha$.[2]

Like the normal interval, the interval can be computed directly from a formula.

Wilson started with the normal approximation to the binomial:

$z_\alpha \approx \frac{p - \hat{p}}{\sigma_n},$

where $z_\alpha$ is the standard normal interval half-width corresponding to the desired confidence $1 - \alpha$. The analytic formula for a binomial sample standard deviation is $\sigma_n = \sqrt{\frac{p(1-p)}{n}}$. Combining the two, and squaring out the radical, gives an equation that is quadratic in $p$:

$(p - \hat{p})^2 = \frac{z_\alpha^2}{n}\,p(1-p)$

or

$p^2 - 2p\hat{p} + \hat{p}^2 = p\,\frac{z_\alpha^2}{n} - p^2\,\frac{z_\alpha^2}{n}.$

Transforming the relation into a standard-form quadratic equation for $p$, treating $\hat{p}$ and $n$ as known values from the sample (see prior section), and using the value of $z_\alpha$ that corresponds to the desired confidence $1 - \alpha$ for the estimate of $p$ gives this:

$\left(1 + \frac{z_\alpha^2}{n}\right)p^2 - \left(2\hat{p} + \frac{z_\alpha^2}{n}\right)p + \hat{p}^2 = 0,$

where all of the values bracketed by parentheses are known quantities. The solutions for $p$ estimate the upper and lower limits of the confidence interval for $p$. Hence the probability of success $p$ is estimated by $\hat{p}$ and with $1 - \alpha$ confidence bracketed in the interval

$p_\alpha(w^-, w^+) = \frac{1}{1 + z_\alpha^2/n}\left(\hat{p} + \frac{z_\alpha^2}{2n} \pm \frac{z_\alpha}{2n}\sqrt{4n\hat{p}(1-\hat{p}) + z_\alpha^2}\right)$

where the notation $p_\alpha(w^-, w^+)$ abbreviates the statement that

$\mathbb{P}\{p \in (w^-, w^+)\} = 1 - \alpha.$

An equivalent expression using the observation counts $n_s$ and $n_f$ is

$p_\alpha \approx \frac{n_s + \frac{1}{2}z_\alpha^2}{n + z_\alpha^2} \pm \frac{z_\alpha}{n + z_\alpha^2}\sqrt{\frac{n_s n_f}{n} + \frac{z_\alpha^2}{4}},$

with the counts as above: $n_s$ the count of observed "successes", $n_f$ the count of observed "failures", and their sum is the total number of observations $n = n_s + n_f$.

In practical tests of the formula's results, users find that this interval has good properties even for a small number of trials and/or the extremes of the probability estimate, $\hat{p} \equiv \frac{n_s}{n}$.[2][3][9]

Intuitively, the center value of this interval is the weighted average of $\hat{p}$ and $\frac{1}{2}$, with $\hat{p}$ receiving greater weight as the sample size increases. Formally, the center value corresponds to using a pseudocount of $\frac{1}{2}z_\alpha^2$, the number of standard deviations of the confidence interval: add this number to both the count of successes and of failures to yield the estimate of the ratio. For the common interval of two standard deviations in each direction (approximately 95% coverage, which itself is approximately 1.96 standard deviations), this yields the estimate $\frac{n_s + 2}{n + 4}$, which is known as the "plus four rule".

Although the quadratic can be solved explicitly, Wilson's equations can also be solved numerically using the fixed-point iteration $p_{k+1} = \hat{p} \pm z_\alpha\sqrt{\frac{1}{n}p_k(1-p_k)}$ with $p_0 = \hat{p}$.
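The count form of the interval above can be sketched directly (the function name is illustrative, not from any library):

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(n_s, n, alpha=0.05):
    """Wilson score interval from success count n_s and trial count n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)       # 1 - alpha/2 quantile
    centre = (n_s + z * z / 2) / (n + z * z)      # pseudocount-adjusted centre
    half = z / (n + z * z) * sqrt(n_s * (n - n_s) / n + z * z / 4)
    return centre - half, centre + half

# Unlike the Wald interval, the bounds stay inside [0, 1]; at n_s = 0
# the lower bound is 0 rather than a degenerate zero-width interval.
lo, hi = wilson_interval(0, 10)
```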

The Wilson interval can also be derived from the single-sample z-test or Pearson's chi-squared test with two categories. The resulting interval,

$\left\{\theta \;\middle|\; y_\alpha \le \frac{\hat{p} - \theta}{\sqrt{\frac{1}{n}\theta(1-\theta)}} \le z_\alpha\right\},$

(with $y_\alpha$ the lower $\frac{\alpha}{2}$ quantile) can then be solved for $\theta$ to produce the Wilson score interval. The test in the middle of the inequality is a score test.

The interval equality principle

The probability density function (PDF) for the Wilson score interval, plus PDFs at interval bounds. Tail areas are equal.

Since the interval is derived by solving from the normal approximation to the binomial, the Wilson score interval $(w^-, w^+)$ has the property of being guaranteed to obtain the same result as the equivalent z-test or chi-squared test.

This property can be visualised by plotting the probability density function for the Wilson score interval (see Wallis)[9](pp. 297–313) and then plotting a normal PDF across each bound. The tail areas of the resulting Wilson and normal distributions, which represent the chance of a significant result in that direction, must be equal.

The continuity-corrected Wilson score interval and the Clopper–Pearson interval also comply with this property. The practical import is that these intervals may be employed as significance tests, with identical results to the source test, and new tests may be derived by geometry.[9]

Wilson score interval with continuity correction


The Wilson interval may be modified by employing a continuity correction, in order to align the minimum coverage probability, rather than the average coverage probability, with the nominal value, $1 - \alpha$.

Just as the Wilson interval mirrors Pearson's chi-squared test, the Wilson interval with continuity correction mirrors the equivalent Yates' chi-squared test.

The following formulae for the lower and upper bounds of the Wilson score interval with continuity correction $(w_{cc}^-, w_{cc}^+)$ are derived from Newcombe:[2]

$w_{cc}^- = \max\left\{0,\; \frac{2n\hat{p} + z_\alpha^2 - \left[z_\alpha\sqrt{z_\alpha^2 - \frac{1}{n} + 4n\hat{p}(1-\hat{p}) + (4\hat{p} - 2)} + 1\right]}{2(n + z_\alpha^2)}\right\},$

$w_{cc}^+ = \min\left\{1,\; \frac{2n\hat{p} + z_\alpha^2 + \left[z_\alpha\sqrt{z_\alpha^2 - \frac{1}{n} + 4n\hat{p}(1-\hat{p}) - (4\hat{p} - 2)} + 1\right]}{2(n + z_\alpha^2)}\right\},$

for $\hat{p} \neq 0$ and $\hat{p} \neq 1$.

If $\hat{p} = 0$, then $w_{cc}^-$ must instead be set to 0; if $\hat{p} = 1$, then $w_{cc}^+$ must instead be set to 1.
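A direct transcription of Newcombe's continuity-corrected bounds, as a sketch (the function name is illustrative; the $\hat{p} = 0$ and $\hat{p} = 1$ boundary rules are handled explicitly):

```python
from math import sqrt
from statistics import NormalDist

def wilson_cc_interval(n_s, n, alpha=0.05):
    """Wilson score interval with continuity correction (Newcombe's formulae)."""
    p = n_s / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    z2 = z * z
    denom = 2 * (n + z2)
    if n_s == 0:
        lo = 0.0                                   # boundary rule for p_hat = 0
    else:
        lo = max(0.0, (2 * n * p + z2
                       - (z * sqrt(z2 - 1 / n + 4 * n * p * (1 - p) + (4 * p - 2)) + 1)) / denom)
    if n_s == n:
        hi = 1.0                                   # boundary rule for p_hat = 1
    else:
        hi = min(1.0, (2 * n * p + z2
                       + (z * sqrt(z2 - 1 / n + 4 * n * p * (1 - p) - (4 * p - 2)) + 1)) / denom)
    return lo, hi
```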

Wallis (2021)[9] identifies a simpler method for computing continuity-corrected Wilson intervals that employs a special function based on Wilson's lower-bound formula. In Wallis' notation, for the lower bound, let

$\mathrm{Wilson}_{\mathrm{lower}}(\hat{p}, n, \tfrac{\alpha}{2}) \equiv w^- = \frac{1}{1 + z_\alpha^2/n}\left(\hat{p} + \frac{z_\alpha^2}{2n} - \frac{z_\alpha}{2n}\sqrt{4n\hat{p}(1-\hat{p}) + z_\alpha^2}\right),$

where $\alpha$ is the selected tolerable error level for $z_\alpha$. Then $w_{cc}^- = \mathrm{Wilson}_{\mathrm{lower}}(\max\{\hat{p} - \frac{1}{2n}, 0\}, n, \frac{\alpha}{2}).$

This method has the advantage of being further decomposable.

Jeffreys interval

Jeffreys intervals plotted on a logistic curve, revealing asymmetry and good performance for small n and where p is at or near 0 or 1.

The Jeffreys interval has a Bayesian derivation, but good frequentist properties (outperforming most frequentist constructions). In particular, it has coverage properties that are similar to those of the Wilson interval, but it is one of the few intervals with the advantage of being equal-tailed (e.g., for a 95% confidence interval, the probabilities of the interval lying above or below the true value are both close to 2.5%). In contrast, the Wilson interval has a systematic bias such that it is centred too close to p=0.5.[10]

The Jeffreys interval is the Bayesian credible interval obtained when using the non-informative Jeffreys prior for the binomial proportion $p$. The Jeffreys prior for this problem is a Beta distribution with parameters $(\frac{1}{2}, \frac{1}{2})$, a conjugate prior. After observing $x$ successes in $n$ trials, the posterior distribution for $p$ is a Beta distribution with parameters $(x + \frac{1}{2},\, n - x + \frac{1}{2})$.

When $x \neq 0$ and $x \neq n$, the Jeffreys interval is taken to be the $100(1-\alpha)\%$ equal-tailed posterior probability interval, i.e., the $\frac{\alpha}{2}$ and $1 - \frac{\alpha}{2}$ quantiles of a Beta distribution with parameters $(x + \frac{1}{2},\, n - x + \frac{1}{2})$.

In order to avoid the coverage probability tending to zero when $p \to 0$ or $1$, when $x = 0$ the upper limit is calculated as before but the lower limit is set to 0, and when $x = n$ the lower limit is calculated as before but the upper limit is set to 1.[4]

Jeffreys' interval can also be thought of as a frequentist interval based on inverting the p-value from the G-test after applying the Yates correction to avoid a potentially infinite value for the test statistic.
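Using SciPy's beta distribution, the Jeffreys interval can be sketched as follows (the function name is illustrative; the $x = 0$ and $x = n$ boundary rules follow the text above):

```python
from scipy.stats import beta

def jeffreys_interval(x, n, alpha=0.05):
    """Equal-tailed Jeffreys interval from the Beta(1/2, 1/2) prior."""
    a, b = x + 0.5, n - x + 0.5          # posterior Beta parameters
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, a, b)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, a, b)
    return lo, hi

lo, hi = jeffreys_interval(20, 40)       # symmetric posterior, centred on 0.5
```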

Clopper–Pearson interval


The Clopper–Pearson interval is an early and very common method for calculating binomial confidence intervals.[11] This is often called an 'exact' method, as it attains the nominal coverage level in an exact sense, meaning that the coverage level is never less than the nominal 1α.[2]

The Clopper–Pearson interval can be written as

$S_{\le} \cap S_{\ge}$

or equivalently,

$\left(\inf S_{\ge},\; \sup S_{\le}\right)$

with

$S_{\le} \equiv \left\{p \;\middle|\; \mathbb{P}\!\left[\mathrm{Bin}(n; p) \le x\right] > \frac{\alpha}{2}\right\}$ and $S_{\ge} \equiv \left\{p \;\middle|\; \mathbb{P}\!\left[\mathrm{Bin}(n; p) \ge x\right] > \frac{\alpha}{2}\right\},$

where $0 \le x \le n$ is the number of successes observed in the sample and $\mathrm{Bin}(n; p)$ is a binomial random variable with $n$ trials and probability of success $p$.

Equivalently, we can say that the Clopper–Pearson interval is $\left(\frac{x}{n} - \varepsilon_1,\; \frac{x}{n} + \varepsilon_2\right)$ with confidence level $1 - \alpha$ if each $\varepsilon_i$ is the infimum of those values such that the following tests of hypothesis succeed with significance $\frac{\alpha}{2}$:

  1. $H_0$: $p = \frac{x}{n} - \varepsilon_1$ with $H_A$: $p > \frac{x}{n} - \varepsilon_1$
  2. $H_0$: $p = \frac{x}{n} + \varepsilon_2$ with $H_A$: $p < \frac{x}{n} + \varepsilon_2$.

Because of a relationship between the binomial distribution and the beta distribution, the Clopper–Pearson interval is sometimes presented in an alternate format that uses quantiles from the beta distribution.[12]

$B\!\left(\frac{\alpha}{2};\; x,\; n - x + 1\right) < p < B\!\left(1 - \frac{\alpha}{2};\; x + 1,\; n - x\right)$

where $x$ is the number of successes, $n$ is the number of trials, and $B(p; v, w)$ is the $p$th quantile from a beta distribution with shape parameters $v$ and $w$.

Thus, $p_{\min} < p < p_{\max}$, where

$\frac{\Gamma(n+1)}{\Gamma(x)\,\Gamma(n-x+1)} \int_0^{p_{\min}} t^{x-1}(1-t)^{n-x}\,dt = \frac{\alpha}{2},$

$\frac{\Gamma(n+1)}{\Gamma(x+1)\,\Gamma(n-x)} \int_0^{p_{\max}} t^{x}(1-t)^{n-x-1}\,dt = 1 - \frac{\alpha}{2}.$

The binomial proportion confidence interval is then $(p_{\min}, p_{\max})$, as follows from the relation between the binomial distribution cumulative distribution function and the regularized incomplete beta function.

When $x$ is either 0 or $n$, closed-form expressions for the interval bounds are available: when $x = 0$ the interval is $\left(0,\; 1 - \left(\frac{\alpha}{2}\right)^{1/n}\right)$ and when $x = n$ it is $\left(\left(\frac{\alpha}{2}\right)^{1/n},\; 1\right)$.[12]

The beta distribution is, in turn, related to the F-distribution, so a third formulation of the Clopper–Pearson interval can be written using F quantiles:

$\left(1 + \frac{n - x + 1}{x\,F\!\left[\frac{\alpha}{2};\; 2x,\; 2(n - x + 1)\right]}\right)^{-1} < p < \left(1 + \frac{n - x}{(x + 1)\,F\!\left[1 - \frac{\alpha}{2};\; 2(x + 1),\; 2(n - x)\right]}\right)^{-1}$

where $x$ is the number of successes, $n$ is the number of trials, and $F(c; d_1, d_2)$ is the $c$ quantile from an F-distribution with $d_1$ and $d_2$ degrees of freedom.[13]

The Clopper–Pearson interval is an 'exact' interval, since it is based directly on the binomial distribution rather than any approximation to the binomial distribution. This interval never has less than the nominal coverage for any population proportion, but that means that it is usually conservative. For example, the true coverage rate of a 95% Clopper–Pearson interval may be well above 95%, depending on $n$ and $p$.[4] Thus the interval may be wider than it needs to be to achieve 95% confidence, and wider than other intervals. In contrast, other confidence intervals may have coverage levels lower than the nominal $1 - \alpha$: the normal approximation (or "standard") interval, Wilson interval,[8] Agresti–Coull interval,[13] etc., with a nominal coverage of 95% may in fact cover less than 95%,[4] even for large sample sizes.[12]

The definition of the Clopper–Pearson interval can also be modified to obtain exact confidence intervals for different distributions. For instance, it can also be applied to the case where the samples are drawn without replacement from a population of a known size, instead of repeated draws of a binomial distribution. In this case, the underlying distribution would be the hypergeometric distribution.

The interval boundaries can be computed with numerical functions qbeta[14] in R and scipy.stats.beta.ppf[15] in Python.

from scipy.stats import beta
import numpy as np

k = 20        # observed successes
n = 400       # trials
alpha = 0.05  # target error rate for a 95% interval

# Lower limit: the alpha/2 quantile of Beta(k, n - k + 1);
# upper limit: the 1 - alpha/2 quantile of Beta(k + 1, n - k).
p_u, p_o = beta.ppf([alpha / 2, 1 - alpha / 2], [k, k + 1], [n - k + 1, n - k])
# beta.ppf yields NaN in the degenerate cases (k = 0 for the lower
# limit, k = n for the upper limit); the bounds are then 0 and 1.
if np.isnan(p_o):
    p_o = 1
if np.isnan(p_u):
    p_u = 0

Agresti–Coull interval


The Agresti–Coull interval is another approximate binomial confidence interval.[13]

Given $n_s$ successes in $n$ trials, define $\tilde{n} \equiv n + z_\alpha^2$

and $\tilde{p} = \frac{1}{\tilde{n}}\left(n_s + \frac{z_\alpha^2}{2}\right).$

Then, a confidence interval for p is given by

$p \approx \tilde{p} \pm z_\alpha\sqrt{\frac{\tilde{p}}{\tilde{n}}(1 - \tilde{p})}$

where $z_\alpha = \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)$ is the quantile of a standard normal distribution, as before (for example, a 95% confidence interval requires $\alpha = 0.05$, thereby producing $z_{.05} = 1.96$). According to Brown, Cai, & DasGupta (2001),[4] taking $z = 2$ instead of 1.96 produces the "add 2 successes and 2 failures" interval previously described by Agresti & Coull.[13]

This interval can be summarised as employing the centre-point adjustment, $\tilde{p}$, of the Wilson score interval, and then applying the normal approximation to this point.[3][4]

$\tilde{p} = \frac{\hat{p} + \frac{z_\alpha^2}{2n}}{1 + \frac{z_\alpha^2}{n}}$
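As a sketch (the function name is illustrative), the Agresti–Coull interval is simply the Wald formula evaluated at the adjusted counts:

```python
from math import sqrt
from statistics import NormalDist

def agresti_coull_interval(n_s, n, alpha=0.05):
    """Agresti-Coull interval: Wald formula applied at the adjusted centre."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_t = n + z * z                       # n-tilde, adjusted trial count
    p_t = (n_s + z * z / 2) / n_t         # p-tilde, the Wilson centre point
    half = z * sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half, p_t + half
```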

Arcsine transformation


The arcsine transformation has the effect of pulling out the ends of the distribution.[16] While it can stabilize the variance (and thus confidence intervals) of proportion data, its use has been criticized in several contexts.[17]

Let $X$ be the number of successes in $n$ trials and let $p = \frac{X}{n}$. The variance of $p$ is

$\operatorname{var}\{p\} = \frac{1}{n}p(1-p).$

Using the arcsine transform, the variance of $\arcsin\sqrt{p}$ is[18]

$\operatorname{var}\{\arcsin\sqrt{p}\} \approx \frac{\operatorname{var}\{p\}}{4p(1-p)} = \frac{p(1-p)}{4np(1-p)} = \frac{1}{4n}.$

So, the confidence interval itself has the form

$\sin^2\!\left(\arcsin\sqrt{p} - \frac{z_\alpha}{2\sqrt{n}}\right) < \theta < \sin^2\!\left(\arcsin\sqrt{p} + \frac{z_\alpha}{2\sqrt{n}}\right),$

where $z_\alpha$ is the $1 - \frac{\alpha}{2}$ quantile of a standard normal distribution.

This method may be used to estimate the variance of $p$, but its use is problematic when $p$ is close to 0 or 1.
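A sketch of the arcsine-transform interval (the function name is illustrative; clamping the transformed bounds into $[0, \pi/2]$ is an added guard, not part of the formula above):

```python
from math import asin, pi, sin, sqrt
from statistics import NormalDist

def arcsine_interval(n_s, n, alpha=0.05):
    """Confidence interval via the variance-stabilizing arcsine transform."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = asin(sqrt(n_s / n))          # transformed proportion
    half = z / (2 * sqrt(n))              # constant half-width, from var = 1/(4n)
    lo_angle = max(centre - half, 0.0)    # clamp to keep bounds inside [0, 1]
    hi_angle = min(centre + half, pi / 2)
    return sin(lo_angle) ** 2, sin(hi_angle) ** 2
```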

t_a transform


Let $p$ be the proportion of successes. For $0 \le a \le 2$,

$t_a = \log\!\left(\frac{p^a}{(1-p)^{2-a}}\right) = a\log p - (2 - a)\log(1 - p).$

This family is a generalisation of the logit transform, which is the special case $a = 1$, and can be used to transform a proportion data distribution to an approximately normal distribution. The parameter $a$ has to be estimated from the data set.
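A minimal sketch of the transform (the function name is illustrative), checking that $a = 1$ recovers the log-odds:

```python
from math import log

def t_a(p, a):
    """t_a = a*log(p) - (2 - a)*log(1 - p), for 0 <= a <= 2."""
    return a * log(p) - (2 - a) * log(1 - p)

# a = 1 recovers the logit (log-odds) transform:
assert abs(t_a(0.7, 1) - log(0.7 / 0.3)) < 1e-12
```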

Rule of three — for when no successes are observed


The rule of three is used to provide a simple way of stating an approximate 95% confidence interval for $p$, in the special case that no successes ($\hat{p} = 0$) have been observed.[19] The interval is $\left(0, \frac{3}{n}\right)$.

By symmetry, in the case of only successes ($\hat{p} = 1$), the interval is $\left(1 - \frac{3}{n}, 1\right)$.
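The approximation behind the rule can be checked numerically: it follows from solving $(1-p)^n = 0.05$ for $p$, together with $-\log 0.05 \approx 3$ (a small sketch):

```python
# The 95% rule of three follows from solving (1 - p)**n = 0.05 for p:
#   p = 1 - 0.05**(1/n) ~ -log(0.05)/n ~ 3/n,  since -log(0.05) ~ 2.996.
def rule_of_three(n):
    return 0.0, 3.0 / n

for n in (30, 100, 1000):
    exact_upper = 1 - 0.05 ** (1 / n)
    assert abs(exact_upper - rule_of_three(n)[1]) / exact_upper < 0.06
```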

Comparison and discussion


There are several research papers that compare these and other confidence intervals for the binomial proportion.[3][2][20][21]

Both Ross (2003)[22] and Agresti & Coull (1998)[13] point out that exact methods such as the Clopper–Pearson interval may not work as well as some approximations. The normal approximation interval and its presentation in textbooks have been heavily criticised, with many statisticians advocating that it not be used.[4] The principal problems are overshoot (bounds exceeding $[0, 1]$), zero-width intervals at $\hat{p} = 0$ or $1$ (falsely implying certainty),[2] and overall inconsistency with significance testing.[3]

Of the approximations listed above, Wilson score interval methods (with or without continuity correction) have been shown to be the most accurate and the most robust,[3][4][2] though some prefer Agresti & Coull's approach for larger sample sizes.[4] Wilson and Clopper–Pearson methods obtain results consistent with their source significance tests,[9] and this property is decisive for many researchers.

Many of these intervals can be calculated in R using packages like binom.[23]

