In statistics, explained variation measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set. Often, variation is quantified as variance; then, the more specific term explained variance can be used.
The complementary part of the total variation is called unexplained or residual variation; likewise, when discussing variance as such, this is referred to as unexplained or residual variance.
Definition in terms of information gain
Information gain by better modelling
Following Kent (1983),[1] we use the Fraser information (Fraser 1965)[2]

F(θ) = ∫ g(r) ln f(r; θ) dr,

where g(r) is the probability density of a random variable R, and f(r; θ) with θ ∈ Θᵢ (i = 0, 1) are two families of parametric models. Model family 0 is the simpler one, with a restricted parameter space Θ₀ ⊂ Θ₁.
Parameters are determined by maximum likelihood estimation,

θ̂ᵢ = argmax_{θ ∈ Θᵢ} F(θ).

The information gain of model 1 over model 0 is written as

Γ(θ̂₁ : θ̂₀) = 2 [F(θ̂₁) − F(θ̂₀)],
where a factor of 2 is included for convenience. Γ is always nonnegative; it measures the extent to which the best model of family 1 is better than the best model of family 0 in explaining g(r).
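As an illustrative sketch (not from the source): for nested Gaussian families, the Fraser information evaluated against an empirical sample reduces to the average log-likelihood, so the information gain can be computed directly. The data, the restricted family (mean fixed at zero), and all variable names below are assumptions chosen for the example.

```python
import numpy as np

def avg_loglik(r, mu, sigma):
    """Average Gaussian log-likelihood: empirical stand-in for F(theta)."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (r - mu) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
r = rng.normal(loc=0.5, scale=1.0, size=1000)  # sample standing in for g(r)

# Family 0: N(0, sigma^2), mean restricted to zero (smaller parameter space).
# Family 1: N(mu, sigma^2), mean free.  MLEs are closed-form for Gaussians.
sigma0 = np.sqrt(np.mean(r**2))       # MLE of scale with mean fixed at 0
mu1, sigma1 = r.mean(), r.std()       # unrestricted MLEs

F0 = avg_loglik(r, 0.0, sigma0)
F1 = avg_loglik(r, mu1, sigma1)
gamma = 2 * (F1 - F0)                 # information gain of model 1 over model 0
```

Because family 0 is nested in family 1, the unrestricted fit can never be worse, so `gamma` comes out nonnegative, as the text states.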
Information gain by a conditional model
Assume a two-dimensional random variable R = (X, Y), where X shall be considered as an explanatory variable and Y as a dependent variable. Models of family 1 "explain" Y in terms of X,

f(x, y) = f(x) f(y | x; θ),

whereas in family 0, X and Y are assumed to be independent. We define the randomness of Y by D(Y) = exp[−2 F(θ̂₀)], and the randomness of Y, given X, by D(Y | X) = exp[−2 F(θ̂₁)]. Then,

ρ_C² = 1 − D(Y | X) / D(Y)

can be interpreted as the proportion of the data dispersion which is "explained" by X.
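A small sketch of this quantity (an illustration, not part of the source): for Gaussian conditional models, exp[−2F] is proportional to the residual variance, so the dispersion ratio reduces to a ratio of variances. The simulated data and the linear conditional model below are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)    # Y depends on X plus Gaussian noise

# Family 0: Y alone, Gaussian.  Family 1: Y | X, Gaussian around a linear fit.
# For Gaussian models the randomness D reduces (up to a constant) to a variance.
var_y = y.var()
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
var_y_given_x = resid.var()

rho_c2 = 1 - var_y_given_x / var_y    # proportion of dispersion "explained" by X
```

With the slope 2.0 and unit noise used here, roughly four fifths of the dispersion of Y is attributed to X.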
Special cases and generalized usage
Linear regression
The fraction of variance unexplained is an established concept in the context of linear regression. The usual definition of the coefficient of determination is based on the fundamental concept of explained variance.
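A minimal sketch of that usual definition, assuming nothing beyond ordinary least squares (the toy data are invented for illustration):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 minus unexplained over total variation."""
    ss_res = np.sum((y - y_hat) ** 2)        # residual (unexplained) variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return 1 - ss_res / ss_tot

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
slope, intercept = np.polyfit(x, y, 1)       # least-squares line
r2 = r_squared(y, intercept + slope * x)
```

The complement, `ss_res / ss_tot`, is the fraction of variance unexplained.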
Correlation coefficient as measure of explained variance
Let X be a random vector, and Y a random variable that is modeled by a normal distribution with centre μ + ΨᵀX. In this case, the above-derived proportion of explained variation ρ_C² equals the squared correlation coefficient R².
Note the strong model assumptions: the centre of the Y distribution must be a linear function of X, and for any given x, the Y distribution must be normal. In other situations, it is generally not justified to interpret ρ_C² as a proportion of explained variance.
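Under those assumptions the identity can be checked numerically. This sketch (simulated data and variable names are my own, not from the source) compares the squared sample correlation with the explained-variation ratio of a least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 1.0 + 0.8 * x + rng.normal(size=1000)  # centre linear in x, Gaussian noise

r2_corr = np.corrcoef(x, y)[0, 1] ** 2     # squared correlation coefficient

slope, intercept = np.polyfit(x, y, 1)     # least-squares fit
resid = y - (intercept + slope * x)
r2_fit = 1 - resid.var() / y.var()         # explained-variation ratio
```

For simple least squares with an intercept, the two quantities agree up to floating-point error.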
In principal component analysis
Explained variance is routinely used in principal component analysis. The relation to the Fraser–Kent information gain remains to be clarified.
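As an illustrative sketch (not from the source): in PCA the per-component explained-variance ratios are the eigenvalues of the sample covariance matrix divided by their sum. The synthetic data below are an assumption chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
# correlated 2-D data: most of the variance lies along one direction
z = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

cov = np.cov(z, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]     # principal-component variances, descending
explained_ratio = eigvals / eigvals.sum()   # "explained variance" per component
```

The ratios sum to one; the first component "explains" the largest share of the total variance.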
Criticism
As the fraction of "explained variance" equals the squared correlation coefficient R², it shares all the disadvantages of the latter: it reflects not only the quality of the regression, but also the distribution of the independent (conditioning) variables.
In the words of one critic: "Thus R² gives the 'percentage of variance explained' by the regression, an expression that, for most social scientists, is of doubtful meaning but great rhetorical value. If this number is large, the regression gives a good fit, and there is little point in searching for additional variables. Other regression equations on different data sets are said to be less satisfactory or less powerful if their R² is lower. Nothing about R² supports these claims."[3]: 58 And, after constructing an example where R² is enhanced just by jointly considering data from two different populations: "'Explained variance' explains nothing."[3][page needed][4]: 183
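This effect can be reproduced in a sketch (the populations and numbers below are my own construction, not Achen's actual example): two groups with the same weak X–Y relation yield a much larger R² when pooled, purely because of the between-group spread.

```python
import numpy as np

def r2(x, y):
    """Explained-variation ratio of a least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(4)
n = 300
# two populations, same weak within-group relation (slope 0.3), shifted centres
x1 = rng.normal(loc=0.0, size=n)
x2 = rng.normal(loc=6.0, size=n)
y1 = 0.3 * x1 + rng.normal(size=n)
y2 = 4.0 + 0.3 * x2 + rng.normal(size=n)

r2_within = max(r2(x1, y1), r2(x2, y2))          # modest in each group
r2_pooled = r2(np.concatenate([x1, x2]),
               np.concatenate([y1, y2]))         # inflated by pooling
```

The pooled fit "explains" far more variance although the underlying relationship within each population is unchanged.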
References
edit- ^Kent, J. T. (1983). "Information gain and a general measure of correlation".Biometrika.70(1): 163–173.doi:10.1093/biomet/70.1.163.JSTOR2335954.
- ^Fraser, D. A. S. (1965)."On Information in Statistics".Ann. Math. Statist.36(3): 890–896.doi:10.1214/aoms/1177700061.
- ^abAchen, C. H. (1982).Interpreting and Using Regression.Beverly Hills: Sage. pp. 58–59.ISBN0-8039-1915-8.
- ^Achen, C. H. (1990). "'What Does "Explained Variance" Explain?: Reply ".Political Analysis.2(1): 173–184.doi:10.1093/pan/2.1.173.