Influential observation

From Wikipedia, the free encyclopedia
Figure: Anscombe's quartet. The two datasets on the bottom both contain influential points. All four sets are identical when examined using simple summary statistics, but vary considerably when graphed. If the influential point were removed from either of the bottom datasets, the fitted line would look very different.

In statistics, an influential observation is an observation for a statistical calculation whose deletion from the dataset would noticeably change the result of the calculation.[1] In particular, in regression analysis an influential observation is one whose deletion has a large effect on the parameter estimates.[2]
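
The definition applies to any statistical calculation, not only to regression. As a minimal sketch, the following deletes each observation in turn and records how much a statistic changes. It uses the y values of the third dataset of Anscombe's quartet; the helper name leave_one_out_change is hypothetical and chosen for illustration only.

```python
import numpy as np

# y values of the third dataset of Anscombe's quartet.
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

def leave_one_out_change(stat, data):
    """Hypothetical helper: stat(data) minus stat(data without point i), for every i."""
    full = stat(data)
    return np.array([full - stat(np.delete(data, i)) for i in range(len(data))])

mean_change = leave_one_out_change(np.mean, y)
median_change = leave_one_out_change(np.median, y)

# Index of the observation whose deletion moves the mean the most.
most_influential = int(np.argmax(np.abs(mean_change)))

print("most influential index for the mean:", most_influential)
print("change in mean:  ", np.round(mean_change, 3))
print("change in median:", np.round(median_change, 3))
```

In this sketch the same observation can be influential for one calculation (the mean) and much less so for another (the median), which is why influence is always stated relative to a particular calculation.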

Assessment

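Various methods have been proposed for measuring influence.[3][4] Assume an estimated regression $\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$, where $\mathbf{y}$ is an $n \times 1$ column vector for the response variable, $\mathbf{X}$ is the $n \times k$ design matrix of explanatory variables (including a constant), $\mathbf{e}$ is the $n \times 1$ residual vector, and $\mathbf{b}$ is a $k \times 1$ vector of estimates of some population parameter $\boldsymbol{\beta} \in \mathbb{R}^{k}$. Also define $\mathbf{H} \equiv \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}$, the projection matrix of $\mathbf{X}$. Then we have the following measures of influence: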
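
The quantities defined above can be computed directly. The following is a minimal sketch using NumPy on the third dataset of Anscombe's quartet, with a constant and one explanatory variable (so k = 2). Forming the inverse of X'X explicitly mirrors the formulas and is an assumption made for readability; numerical software usually relies on a matrix factorization instead.

```python
import numpy as np

# Third dataset of Anscombe's quartet.
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

X = np.column_stack([np.ones_like(x), x])  # n x k design matrix, first column is the constant
n, k = X.shape

XtX_inv = np.linalg.inv(X.T @ X)           # (X'X)^{-1}
b = XtX_inv @ X.T @ y                      # k x 1 vector of estimates
e = y - X @ b                              # n x 1 residual vector
H = X @ XtX_inv @ X.T                      # n x n projection matrix

print("estimates b:", np.round(b, 3))
print("H is idempotent:", np.allclose(H @ H, H))
print("trace of H:", round(float(np.trace(H)), 6))
```

The projection matrix maps the observed responses to the fitted values (H y equals X b), it is idempotent, and its trace equals k, the number of estimated coefficients.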

  1. $\text{DFBETA}_i \equiv \mathbf{b} - \mathbf{b}_{(-i)} = \dfrac{(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{x}_i^{\mathsf{T}} e_i}{1 - h_{ii}}$, where $\mathbf{b}_{(-i)}$ denotes the coefficients estimated with the i-th row $\mathbf{x}_i$ of $\mathbf{X}$ deleted, and $h_{ii} = \mathbf{x}_i(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{x}_i^{\mathsf{T}}$ denotes the i-th value on the main diagonal of $\mathbf{H}$. Thus DFBETA measures the difference in each parameter estimate with and without the i-th observation. There is a DFBETA for each variable and each observation (if there are $n$ observations and $k$ variables, there are $n \cdot k$ DFBETAs).[5] The following table shows DFBETAs for the third dataset from Anscombe's quartet (bottom left chart in the figure):
        x       y   DFBETA (intercept)   DFBETA (slope)
     10.0    7.46               -0.005           -0.044
      8.0    6.77               -0.037            0.019
     13.0   12.74             -357.910          525.268
      9.0    7.11               -0.033            0
     11.0    7.81                0.049           -0.117
     14.0    8.84                0.490           -0.667
      6.0    6.08                0.027           -0.021
      4.0    5.39                0.241           -0.209
     12.0    8.15                0.137           -0.231
      7.0    6.42               -0.020            0.013
      5.0    5.73                0.105           -0.087
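  2. DFFITS (difference in fits) measures the change in the fitted value of an observation when that observation is deleted, scaled by an estimate of its standard error.
  3. Cook's D measures the effect of removing a data point on all the parameters combined.[2]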
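
The three measures can be sketched in a few lines of NumPy for the third dataset of Anscombe's quartet. The DFBETA line follows the closed form given in item 1 and is checked against a brute-force refit with each row deleted. The expressions used for the deleted residual variance, the standardized DFBETAS, DFFITS and Cook's D are the usual textbook forms and are assumptions here, since this article does not spell them out. Under those assumptions the standardized values for the third observation come out near -358 and 525, which suggests that the table reports standardized rather than raw differences.

```python
import numpy as np

x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.sum((X @ XtX_inv) * X, axis=1)      # h_ii = x_i (X'X)^{-1} x_i'

# DFBETA: row i holds b - b_(-i), from (X'X)^{-1} x_i' e_i / (1 - h_ii).
dfbeta = (X @ XtX_inv) * (e / (1.0 - h))[:, None]

def coef_without(i):
    """Refit by least squares with row i deleted (brute-force check)."""
    keep = np.arange(n) != i
    return np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

brute = np.array([b - coef_without(i) for i in range(n)])

# Residual variance from all points, and with point i deleted (assumed textbook form).
s2 = e @ e / (n - k)
s2_del = ((n - k) * s2 - e**2 / (1.0 - h)) / (n - k - 1)

# Standardized DFBETAS: raw difference divided by an estimate of its standard error.
dfbetas = dfbeta / np.sqrt(np.outer(s2_del, np.diag(XtX_inv)))

# DFFITS: change in the i-th fitted value, scaled by sqrt(s2_del * h_ii).
dffits = e * np.sqrt(h) / ((1.0 - h) * np.sqrt(s2_del))

# Cook's D: combined effect of deleting point i on all k coefficients.
cooks_d = e**2 * h / (k * s2 * (1.0 - h) ** 2)

print("closed form matches refit:", np.allclose(dfbeta, brute))
print("raw DFBETA, third point:  ", np.round(dfbeta[2], 3))
print("DFBETAS, third point:     ", np.round(dfbetas[2], 1))
print("largest Cook's D at index:", int(np.argmax(cooks_d)))
```

The closed form avoids refitting the regression n times, which is the main practical reason for using it. It breaks down when h_ii equals 1, because the denominator 1 - h_ii is then zero.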

Outliers, leverage and influence

An outlier may be defined as a data point that differs markedly from other observations.[6][7] A high-leverage point is an observation made at an extreme value of the independent variables.[8] Both types of atypical observations will force the regression line to be close to the point.[2] In Anscombe's quartet, the bottom right image has a point with high leverage and the bottom left image has an outlying point.
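
Leverage depends only on the explanatory variables, so the distinction can be illustrated from the x values alone. The sketch below computes the diagonal of the projection matrix for the third and fourth datasets of Anscombe's quartet. The x values of the fourth dataset (ten points at 8 and one at 19) are taken from the published quartet rather than from this article, and the cutoff 2k/n is a common rule of thumb that is likewise an assumption, not something this article states.

```python
import numpy as np

def leverage(x):
    """Diagonal of H = X (X'X)^{-1} X' for a regression on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    Q, _ = np.linalg.qr(X)            # reduced QR; H = Q Q', so h_ii is a row sum of Q**2
    return np.sum(Q**2, axis=1)

x3 = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
x4 = np.array([8.0] * 7 + [19.0] + [8.0] * 3)   # assumed: x values of the fourth dataset

h3 = leverage(x3)
h4 = leverage(x4)

k, n = 2, len(x3)
cutoff = 2 * k / n                    # rule-of-thumb threshold (assumption)

print("dataset III leverages:", np.round(h3, 3))
print("dataset IV leverages: ", np.round(h4, 3))
print("above cutoff in III:", np.flatnonzero(h3 > cutoff))
print("above cutoff in IV: ", np.flatnonzero(h4 > cutoff))
```

Under these assumptions the outlying point of the third dataset has only moderate leverage, because its x value is not extreme, whereas the isolated point of the fourth dataset has leverage equal to 1 and therefore determines the fitted slope on its own.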
