Weighted least squares

Weighted least squares (WLS), also known as weighted linear regression,[1][2] is a generalization of ordinary least squares and linear regression in which knowledge of the unequal variance of observations (heteroscedasticity) is incorporated into the regression. WLS is also a specialization of generalized least squares, arising when all the off-diagonal entries of the covariance matrix of the errors are null.

Formulation

The fit of a model to a data point is measured by its residual, $r_i$, defined as the difference between a measured value of the dependent variable, $y_i$, and the value predicted by the model, $f(x_i, \boldsymbol\beta)$:

    $r_i(\boldsymbol\beta) = y_i - f(x_i, \boldsymbol\beta).$

If the errors are uncorrelated and have equal variance, then the function

    $S(\boldsymbol\beta) = \sum_i r_i(\boldsymbol\beta)^2$

is minimised at $\hat{\boldsymbol\beta}$, such that $\frac{\partial S}{\partial \beta_j}(\hat{\boldsymbol\beta}) = 0$.

The Gauss–Markov theorem shows that, when this is so, $\hat{\boldsymbol\beta}$ is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach might be adopted. Aitken showed that when a weighted sum of squared residuals is minimized, $\hat{\boldsymbol\beta}$ is the BLUE if each weight is equal to the reciprocal of the variance of the measurement:

    $S = \sum_{i=1}^{n} W_{ii}\, r_i^2, \qquad W_{ii} = \frac{1}{\sigma_i^2}.$

The gradient equations for this sum of squares are

    $-2 \sum_{i} W_{ii}\, r_i \, \frac{\partial f(x_i, \boldsymbol\beta)}{\partial \beta_j} = 0, \qquad j = 1, \ldots, m,$

which, in a linear least squares system, give the modified normal equations

    $\sum_{i=1}^{n} \sum_{k=1}^{m} X_{ij} W_{ii} X_{ik}\, \hat{\beta}_k = \sum_{i=1}^{n} X_{ij} W_{ii}\, y_i, \qquad j = 1, \ldots, m.$

The matrix $\mathbf{X}$ above is as defined in the corresponding discussion of linear least squares.

When the observational errors are uncorrelated and the weight matrix, $\mathbf{W} = \boldsymbol{\Omega}^{-1}$, is diagonal, these may be written as

    $\mathbf{X}^{\mathrm T} \mathbf{W} \mathbf{X}\, \hat{\boldsymbol\beta} = \mathbf{X}^{\mathrm T} \mathbf{W} \mathbf{y}.$
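
As an illustration only (not part of the article's derivation), a short NumPy sketch can solve these diagonal-weight normal equations directly; the straight-line model, data values, and variable names below are hypothetical.

import numpy as np

# Hypothetical data: straight-line model y = b0 + b1*x with known, unequal variances.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.1, 0.2, 0.1, 0.4, 0.3])   # per-observation standard deviations

X = np.column_stack([np.ones_like(x), x])     # design matrix
W = np.diag(1.0 / sigma**2)                   # weights = reciprocal variances, as above

# Normal equations: (X^T W X) beta = X^T W y
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_hat)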

If the errors are correlated, the resulting estimator is the BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.

When the errors are uncorrelated, it is convenient to simplify the calculations by factoring the weight matrix as $\mathbf{W} = \mathbf{W}^{1/2} \mathbf{W}^{1/2}$, with $\mathbf{W}^{1/2} = \operatorname{diag}\left(\sqrt{W_{11}}, \ldots, \sqrt{W_{nn}}\right)$. The normal equations can then be written in the same form as ordinary least squares:

    $\left(\mathbf{X}'^{\mathrm T} \mathbf{X}'\right) \hat{\boldsymbol\beta} = \mathbf{X}'^{\mathrm T} \mathbf{y}',$

where we define the following scaled matrix and vector:

    $\mathbf{X}' = \mathbf{W}^{1/2} \mathbf{X}, \qquad \mathbf{y}' = \mathbf{W}^{1/2} \mathbf{y} = \mathbf{y} \oslash \boldsymbol\sigma.$

This is a type of whitening transformation; the last expression involves an entrywise division of $\mathbf{y}$ by the vector of observation standard deviations $\boldsymbol\sigma$.
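
A minimal sketch of the same whitening idea, using the same hypothetical data as the snippet above: scaling the rows of X and y by the square-root weights reduces the problem to ordinary least squares.

import numpy as np

# Same hypothetical straight-line data as in the previous sketch.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.1, 0.2, 0.1, 0.4, 0.3])
X = np.column_stack([np.ones_like(x), x])

# Whitening: multiply each row by sqrt(w_i) = 1/sigma_i, then solve ordinary least squares.
Xs = X / sigma[:, None]    # X' = W^(1/2) X
ys = y / sigma             # y' = y entrywise-divided by sigma
beta_hat, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
print(beta_hat)            # agrees with the normal-equation solution above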

For non-linear least squares systems a similar argument shows that the normal equations should be modified as follows:

    $\left(\mathbf{J}^{\mathrm T} \mathbf{W} \mathbf{J}\right) \Delta\boldsymbol\beta = \mathbf{J}^{\mathrm T} \mathbf{W}\, \Delta\mathbf{y},$

where $\mathbf{J}$ is the Jacobian matrix of the model function with respect to the parameters.
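
As a sketch of how this modified equation is used in practice, the following is an illustrative weighted Gauss–Newton loop, not a prescription from the article; the exponential model, data values, and initial guess are hypothetical, and real code would add step control.

import numpy as np

# Hypothetical non-linear model f(x; a, b) = a * exp(b * x), with unequal error variances.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 1.7, 2.7, 4.3, 6.7])
W = np.diag(1.0 / np.array([0.05, 0.05, 0.1, 0.2, 0.3])**2)

beta = np.array([1.0, 1.0])        # initial guess for (a, b)
for _ in range(20):                # plain iteration; no damping or line search
    a, b = beta
    f = a * np.exp(b * x)
    J = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])  # Jacobian df/d(a, b)
    dy = y - f                                                   # residual vector
    dbeta = np.linalg.solve(J.T @ W @ J, J.T @ W @ dy)           # (J^T W J) dbeta = J^T W dy
    beta = beta + dbeta
print(beta)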

Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this, feasible generalized least squares (FGLS) techniques may be used; in this case it is specialized for a diagonal covariance matrix, thus yielding a feasible weighted least squares solution.
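
One common feasible weighted least squares recipe is sketched below, under the assumption that the error variance varies smoothly with the regressors; modelling the log of the squared OLS residuals is only one of several possible choices, and the data are simulated purely for illustration.

import numpy as np

# Simulated data with variance growing with x (heteroscedastic noise).
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x + rng.normal(scale=0.1 * x)

# Step 1: ordinary least squares to obtain preliminary residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# Step 2: model log(residual^2) on the regressors to estimate per-point variances.
gamma, *_ = np.linalg.lstsq(X, np.log(resid**2 + 1e-12), rcond=None)
var_hat = np.exp(X @ gamma)

# Step 3: weighted least squares with the estimated weights w_i = 1 / var_hat_i.
W = np.diag(1.0 / var_hat)
beta_fwls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_ols, beta_fwls)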

If the uncertainty of the observations is not known from external sources, then the weights could be estimated from the given observations. This can be useful, for example, to identify outliers. After the outliers have been removed from the data set, the weights should be reset to one.[3]

Motivation

In some cases the observations may be weighted—for example, they may not be equally reliable. In this case, one can minimize the weighted sum of squares:

    $\underset{\boldsymbol\beta}{\operatorname{arg\,min}}\; \sum_{i=1}^{n} w_i \left| y_i - \sum_{j=1}^{m} X_{ij} \beta_j \right|^2 = \underset{\boldsymbol\beta}{\operatorname{arg\,min}}\; \left\| W^{1/2} \left( \mathbf{y} - X \boldsymbol\beta \right) \right\|^2,$

where $w_i > 0$ is the weight of the $i$th observation, and $W$ is the diagonal matrix of such weights.

The weights should, ideally, be equal to the reciprocal of the variance of the measurement. (This implies that the observations are uncorrelated. If the observations are correlated, the expression $S = \mathbf{r}^{\mathrm T} W \mathbf{r}$ applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations).[3] The normal equations are then:

    $\left( X^{\mathrm T} W X \right) \hat{\boldsymbol\beta} = X^{\mathrm T} W \mathbf{y}.$
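
For readers who want the intermediate step, the normal equations follow from setting the gradient of the weighted objective to zero (a standard derivation, restated here for convenience):

    $S(\boldsymbol\beta) = \left( \mathbf{y} - X\boldsymbol\beta \right)^{\mathrm T} W \left( \mathbf{y} - X\boldsymbol\beta \right),$

    $\frac{\partial S}{\partial \boldsymbol\beta} = -2\, X^{\mathrm T} W \left( \mathbf{y} - X\boldsymbol\beta \right) = \mathbf{0} \quad\Longrightarrow\quad X^{\mathrm T} W X \, \hat{\boldsymbol\beta} = X^{\mathrm T} W \mathbf{y}.$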

This method is used in iteratively reweighted least squares.

Solution

Parameter errors and correlation

The estimated parameter values are linear combinations of the observed values:

    $\hat{\boldsymbol\beta} = \left( X^{\mathrm T} W X \right)^{-1} X^{\mathrm T} W \mathbf{y}.$

Therefore, an expression for the estimated variance-covariance matrix of the parameter estimates can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by $M$ and that of the estimated parameters by $M^{\boldsymbol\beta}$. Then

    $M^{\boldsymbol\beta} = \left( X^{\mathrm T} W X \right)^{-1} X^{\mathrm T} W \, M \, W^{\mathrm T} X \left( X^{\mathrm T} W X \right)^{-1}.$

When $W = M^{-1}$, this simplifies to

    $M^{\boldsymbol\beta} = \left( X^{\mathrm T} W X \right)^{-1}.$

When unit weights are used ($W = I$, the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: $M = \sigma^2 I$, where $\sigma^2$ is the a priori variance of an observation. In any case, $\sigma^2$ is approximated by the reduced chi-squared $\chi^2_\nu$:

    $\chi^2_\nu = \frac{S}{n - m}, \qquad M^{\boldsymbol\beta} = \chi^2_\nu \left( X^{\mathrm T} X \right)^{-1},$

where $S$ is the minimum value of the weighted objective function:

    $S = \mathbf{r}^{\mathrm T} W \mathbf{r} = \sum_{i=1}^{n} W_{ii}\, r_i^2.$

The denominator, $\nu = n - m$, is the number of degrees of freedom; see effective degrees of freedom for generalizations for the case of correlated observations.

In all cases, the variance of the parameter estimate $\hat\beta_i$ is given by $M^{\boldsymbol\beta}_{ii}$ and the covariance between the parameter estimates $\hat\beta_i$ and $\hat\beta_j$ is given by $M^{\boldsymbol\beta}_{ij}$. The standard deviation is the square root of variance, $\sigma_i = \sqrt{M^{\boldsymbol\beta}_{ii}}$, and the correlation coefficient is given by $\rho_{ij} = M^{\boldsymbol\beta}_{ij} / (\sigma_i \sigma_j)$. These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors, which, by definition, cannot be quantified. Note that even though the observations may be uncorrelated, the parameters are typically correlated.
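
Continuing the hypothetical straight-line sketch from the Formulation section, the parameter covariance, standard errors, and correlation can be computed directly from $(X^{\mathrm T} W X)^{-1}$ when the weights are the reciprocal observation variances:

import numpy as np

# Hypothetical straight-line data with known per-observation standard deviations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.1, 0.2, 0.1, 0.4, 0.3])
X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)

XtWX_inv = np.linalg.inv(X.T @ W @ X)
beta_hat = XtWX_inv @ X.T @ W @ y

# With W = M^{-1}, the parameter covariance matrix is (X^T W X)^{-1}.
M_beta = XtWX_inv
std_err = np.sqrt(np.diag(M_beta))           # standard deviations of the estimates
corr = M_beta / np.outer(std_err, std_err)   # correlation matrix of the estimates
print(std_err, corr)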

Parameter confidence limits

It is often assumed, for want of any concrete evidence but often appealing to the central limit theorem (see Normal distribution#Occurrence and applications), that the error on each observation belongs to a normal distribution with a mean of zero and standard deviation $\sigma$. Under that assumption the following probabilities can be derived for a single scalar parameter estimate in terms of its estimated standard error $\sigma_\beta$ (given above):

  • 68% that the interval $\hat\beta \pm \sigma_\beta$ encompasses the true coefficient value
  • 95% that the interval $\hat\beta \pm 2\sigma_\beta$ encompasses the true coefficient value
  • 99% that the interval $\hat\beta \pm 2.5\sigma_\beta$ encompasses the true coefficient value

The assumption is not unreasonable when n >> m. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with n − m degrees of freedom. When n >> m, Student's t-distribution approximates a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.[4]
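
As a hedged sketch (continuing the hypothetical straight-line data used earlier), t-based confidence limits of the kind described here can be computed with SciPy:

import numpy as np
from scipy import stats

# Hypothetical straight-line example: 95% confidence limits for each parameter.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.1, 0.2, 0.1, 0.4, 0.3])
X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)

n, m = X.shape
XtWX_inv = np.linalg.inv(X.T @ W @ X)
beta_hat = XtWX_inv @ X.T @ W @ y
std_err = np.sqrt(np.diag(XtWX_inv))

t_crit = stats.t.ppf(0.975, df=n - m)   # two-sided 95% quantile, n - m degrees of freedom
lower = beta_hat - t_crit * std_err
upper = beta_hat + t_crit * std_err
print(np.column_stack([lower, upper]))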

When the number of observations is relatively small, Chebyshev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2, or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.

Residual values and correlation

The residuals are related to the observations by

    $\hat{\mathbf{r}} = \mathbf{y} - X \hat{\boldsymbol\beta} = \mathbf{y} - H \mathbf{y} = (I - H)\, \mathbf{y},$

where $H$ is the idempotent matrix known as the hat matrix:

    $H = X \left( X^{\mathrm T} W X \right)^{-1} X^{\mathrm T} W,$

and $I$ is the identity matrix. The variance-covariance matrix of the residuals, $M^{\mathbf{r}}$, is given by

    $M^{\mathbf{r}} = (I - H)\, M \, (I - H)^{\mathrm T}.$

Thus the residuals are correlated, even if the observations are not.

When $W = M^{-1}$, this simplifies to

    $M^{\mathbf{r}} = (I - H)\, M.$

The sum of weighted residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by $X^{\mathrm T} W^{\mathrm T}$:

    $X^{\mathrm T} W^{\mathrm T} \hat{\mathbf{r}} = X^{\mathrm T} W^{\mathrm T} \left( \mathbf{y} - X \hat{\boldsymbol\beta} \right) = X^{\mathrm T} W^{\mathrm T} \mathbf{y} - X^{\mathrm T} W^{\mathrm T} X \hat{\boldsymbol\beta} = \mathbf{0}.$

Say, for example, that the first term of the model is a constant, so that $X_{i1} = 1$ for all $i$. In that case it follows that

    $\sum_{i=1}^{n} W_{ii}\, \hat{r}_i = 0.$

Thus the fact that the sum of weighted residual values is equal to zero is not accidental, but is a consequence of the presence of the constant term in the model.
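
A short numerical check of these residual relations, again with the hypothetical straight-line data used earlier (the intercept column plays the role of the constant term):

import numpy as np

# Hypothetical data; hat matrix, residual covariance, and the zero weighted-residual sum.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.1, 0.2, 0.1, 0.4, 0.3])
X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)
M = np.diag(sigma**2)                          # observation covariance, so W = M^{-1}

H = X @ np.linalg.inv(X.T @ W @ X) @ X.T @ W   # hat matrix
r = (np.eye(len(y)) - H) @ y                   # residuals
M_r = (np.eye(len(y)) - H) @ M                 # residual covariance when W = M^{-1}

print(np.sum(np.diag(W) * r))                  # ~0 because of the constant (intercept) term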

If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals,[5] but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.

References

  1. ^"Weighted regression".
  2. ^"Visualize a weighted regression".
  3. ^abStrutz, T. (2016). "3".Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond).Springer Vieweg.ISBN978-3-658-11455-8.
  4. ^Mandel, John (1964).The Statistical Analysis of Experimental Data.New York: Interscience.
  5. ^Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979).Multivariate analysis.New York: Academic Press.ISBN0-12-471250-9.