Generalization error

For supervised learning applications in machine learning and statistical learning theory, generalization error[1] (also known as the out-of-sample error[2] or the risk) is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. Because learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive to sampling error. As a result, measurements of prediction error on the current data may not provide much information about predictive ability on new data. Generalization error can be minimized by avoiding overfitting in the learning algorithm. The performance of a machine learning algorithm is visualized by plots that show values of estimates of the generalization error through the learning process, which are called learning curves.

Definition

In a learning problem, the goal is to develop a function $f_n(\vec{x})$ that predicts output values $y$ for each input datum $\vec{x}$. The subscript $n$ indicates that the function $f_n$ is developed based on a data set of $n$ data points. The generalization error or expected loss or risk $I[f]$ of a particular function $f$ over all possible values of $\vec{x}$ and $y$ is the expected value of the loss function $V(f)$:[1]

$$I[f] = \int_{X \times Y} V(f(\vec{x}), y)\, \rho(\vec{x}, y)\, d\vec{x}\, dy,$$

where $\rho(\vec{x}, y)$ is the unknown joint probability distribution for $\vec{x}$ and $y$.

Without knowing the joint probability distribution $\rho$, it is impossible to compute $I[f]$. Instead, we can compute the error on sample data, which is called empirical error (or empirical risk). Given $n$ data points, the empirical error of a candidate function $f$ is:

$$I_n[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(\vec{x}_i), y_i)$$
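To make the two quantities concrete, the following sketch (an illustration, not from the article's sources) evaluates both errors for a fixed candidate function, assuming a squared loss $V(f(\vec{x}), y) = (f(\vec{x}) - y)^2$ and a synthetic joint distribution, so the true risk can be approximated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Synthetic joint distribution rho(x, y): y = x plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = x + rng.normal(0.0, 0.1, n)
    return x, y

def squared_loss(pred, y):
    return (pred - y) ** 2

f = lambda x: 0.9 * x  # a fixed candidate function f

# Empirical error I_n[f] on a small sample of n data points.
x_n, y_n = sample(10)
I_n = squared_loss(f(x_n), y_n).mean()

# Monte Carlo approximation of the true risk I[f] from a large fresh sample.
x_big, y_big = sample(1_000_000)
I = squared_loss(f(x_big), y_big).mean()

print(f"I_n[f] = {I_n:.4f}, I[f] = {I:.4f}, gap = {abs(I - I_n):.4f}")
```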
An algorithm is said to generalize if:

$$\lim_{n \to \infty} I[f] - I_n[f] = 0$$
Of particular importance is the generalization error $I[f_n]$ of the data-dependent function $f_n$ that is found by a learning algorithm based on the sample. Again, for an unknown probability distribution, $I[f_n]$ cannot be computed. Instead, the aim of many problems in statistical learning theory is to bound or characterize the difference of the generalization error and the empirical error in probability:

$$P_G = P(I[f_n] - I_n[f_n] \leq \epsilon) \geq 1 - \delta_n$$
That is, the goal is to characterize the probability $1 - \delta_n$ that the generalization error is less than the empirical error plus some error bound $\epsilon$ (generally dependent on $\delta$ and $n$). For many types of algorithms, it has been shown that an algorithm has generalization bounds if it meets certain stability criteria. Specifically, if an algorithm is symmetric (the order of inputs does not affect the result), has bounded loss and meets two stability conditions, it will generalize. The first stability condition, leave-one-out cross-validation stability, says that to be stable, the prediction error for each data point when leave-one-out cross-validation is used must converge to zero as $n \to \infty$. The second condition, expected-to-leave-one-out error stability (also known as hypothesis stability if operating in the $L_1$ norm) is met if the prediction on a left-out data point does not change when a single data point is removed from the training data set.[3]

These conditions can be formalized as:

Leave-one-out cross-validation stability

An algorithm $L$ has $CVloo$ stability if for each $n$, there exists a $\beta_{CV}^{(n)}$ and $\delta_{CV}^{(n)}$ such that:

$$\forall i \in \{1, \dots, n\},\quad \mathbb{P}_S\left\{\left|V\left(f_{S^i}, z_i\right) - V\left(f_S, z_i\right)\right| \leq \beta_{CV}^{(n)}\right\} \geq 1 - \delta_{CV}^{(n)}$$

and $\beta_{CV}^{(n)}$ and $\delta_{CV}^{(n)}$ go to zero as $n$ goes to infinity.[3]

Expected-leave-one-out error stability

An algorithm $L$ has $Eloo_{err}$ stability if for each $n$ there exists a $\beta_{EL}^{(n)}$ and a $\delta_{EL}^{(n)}$ such that:

$$\forall i \in \{1, \dots, n\},\quad \mathbb{P}_S\left\{\left|I[f_S] - \frac{1}{n}\sum_{i=1}^{n} V\left(f_{S^i}, z_i\right)\right| \leq \beta_{EL}^{(n)}\right\} \geq 1 - \delta_{EL}^{(n)}$$

with $\beta_{EL}^{(n)}$ and $\delta_{EL}^{(n)}$ going to zero for $n \to \infty$.

For leave-one-out stability in the $L_1$ norm, this is the same as hypothesis stability:

$$\mathbb{E}_{S,z}\left[\left|V(f_S, z) - V(f_{S^i}, z)\right|\right] \leq \beta_H^{(n)}$$

with $\beta_H^{(n)}$ going to zero as $n$ goes to infinity.[3]
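As a rough numeric illustration of the hypothesis-stability condition (a sketch under assumed settings, not part of the cited analysis), the following estimates the expected leave-one-out perturbation for a very simple learner, the empirical-mean predictor, under squared loss; the estimate stands in for $\beta_H^{(n)}$ and should shrink roughly like $O(1/n)$:

```python
import numpy as np

rng = np.random.default_rng(1)

def loo_stability_estimate(n, trials=200):
    """Monte Carlo estimate of E_{S,z} |V(f_S, z) - V(f_{S^i}, z)| for the
    mean predictor under squared loss, a stand-in for beta_H^{(n)}."""
    diffs = []
    for _ in range(trials):
        y = rng.normal(0.0, 1.0, n)    # training outputs S
        z = rng.normal(0.0, 1.0)       # fresh test point z
        f_S = y.mean()                 # learner: predict the sample mean
        f_Si = np.delete(y, 0).mean()  # retrain with point i = 0 removed
        diffs.append(abs((f_S - z) ** 2 - (f_Si - z) ** 2))
    return np.mean(diffs)

for n in (10, 100, 1000):
    print(n, loo_stability_estimate(n))
```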

Algorithms with proven stability

A number of algorithms have been proven to be stable and, as a result, have bounds on their generalization error. A list of these algorithms and the papers that proved stability is available in the article on stability (learning theory).

Relation to overfitting

Figure: Illustration of the relationship between overfitting and the generalization error $I[f_n] - I_n[f_n]$. Data points were generated from the relationship $y = x$ with white noise added to the $y$ values. In the left column, a set of training points is shown in blue, and a seventh-order polynomial function is fit to the training data. In the right column, the function is tested on data sampled from the underlying joint probability distribution of $x$ and $y$. In the top row, the function is fit on a sample data set of 10 data points; in the bottom row, on a sample data set of 100 data points. For small sample sizes and complex functions, the error on the training set is small but the error on the underlying distribution of data is large, so the data have been overfit and the generalization error is large. As the number of sample points increases, the prediction error on training and test data converges, and the generalization error goes to 0.

The concepts of generalization error and overfitting are closely related. Overfitting occurs when the learned function $f_n$ becomes sensitive to the noise in the sample. As a result, the function will perform well on the training set but not perform well on other data from the joint probability distribution of $x$ and $y$. Thus, the more overfitting occurs, the larger the generalization error.
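The experiment described in the figure caption can be sketched as follows; the noise level, input range, and test-set size are assumptions, not values from the original figure:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n, noise=0.5):
    # y = x plus white noise, as in the figure caption.
    x = rng.uniform(-3.0, 3.0, n)
    return x, x + rng.normal(0.0, noise, n)

for n in (10, 100):
    x_tr, y_tr = make_data(n)
    coeffs = np.polyfit(x_tr, y_tr, deg=7)  # seventh-order polynomial fit
    x_te, y_te = make_data(100_000)         # fresh draw from the same distribution
    train_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"n={n:4d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With 10 training points the training error is small while the test error is large (overfitting); with 100 points the two errors converge.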

The amount of overfitting can be tested using cross-validation methods, which split the sample into simulated training samples and testing samples. The model is then trained on a training sample and evaluated on the testing sample. The testing sample is previously unseen by the algorithm and so represents a random sample from the joint probability distribution of $x$ and $y$. This test sample allows us to approximate the expected error, and as a result to approximate a particular form of the generalization error.
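A minimal k-fold cross-validation sketch (plain NumPy, with an assumed polynomial model and squared loss) illustrating how the held-out folds approximate the expected error:

```python
import numpy as np

rng = np.random.default_rng(3)

def kfold_mse(x, y, k=5, degree=7):
    """Estimate out-of-sample MSE of a polynomial fit by k-fold cross-validation."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        errs.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    return np.mean(errs)

x = rng.uniform(-3.0, 3.0, 50)
y = x + rng.normal(0.0, 0.5, 50)
print("5-fold CV estimate of test MSE:", kfold_mse(x, y))
```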

Many algorithms exist to prevent overfitting. The minimization algorithm can penalize more complex functions (known as Tikhonov regularization), or the hypothesis space can be constrained, either explicitly in the form of the functions or by adding constraints to the minimization function (Ivanov regularization).
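A minimal sketch of Tikhonov regularization in its ridge-regression form, assuming a linear model over polynomial features and a hand-picked penalty $\lambda$ (both assumptions for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Tikhonov-regularized least squares: minimize ||Xw - y||^2 + lam * ||w||^2.
    Closed form: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
x = rng.uniform(-3.0, 3.0, 30)
y = x + rng.normal(0.0, 0.5, 30)
X = np.vander(x, N=8)                 # degree-7 polynomial features
w_unreg = ridge_fit(X, y, lam=1e-12)  # essentially unregularized
w_ridge = ridge_fit(X, y, lam=10.0)   # penalized: coefficients shrink
print("unregularized |w|:", np.linalg.norm(w_unreg))
print("ridge         |w|:", np.linalg.norm(w_ridge))
```

The penalty term shrinks the coefficient vector, which restricts the effective complexity of the fitted function.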

The approach to finding a function that does not overfit is at odds with the goal of finding a function that is sufficiently complex to capture the particular characteristics of the data. This is known as the bias–variance tradeoff. Keeping a function simple to avoid overfitting may introduce a bias in the resulting predictions, while allowing it to be more complex leads to overfitting and a higher variance in the predictions. It is impossible to minimize both simultaneously.
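The tradeoff can be illustrated numerically; in this sketch (the degrees, sample size, and noise level are assumptions) an overly simple constant model shows bias at a fixed query point, while a seventh-order polynomial shows high variance across resampled training sets:

```python
import numpy as np

rng = np.random.default_rng(5)

def prediction_spread(degree, n=20, trials=300, x0=0.5):
    """Refit a polynomial of the given degree on fresh samples and measure
    the bias and variance of its prediction at a fixed query point x0."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(-3.0, 3.0, n)
        y = x + rng.normal(0.0, 0.5, n)
        preds.append(np.polyval(np.polyfit(x, y, deg=degree), x0))
    preds = np.array(preds)
    bias = preds.mean() - x0  # the true regression function is y = x
    return bias, preds.var()

for deg in (0, 7):
    bias, var = prediction_spread(deg)
    print(f"degree={deg}: bias={bias:+.4f}, variance={var:.4f}")
```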

References

  1. ^ a b Mohri, M., Rostamizadeh, A., Talwalkar, A. (2018) Foundations of Machine Learning, 2nd ed., Boston: MIT Press.
  2. ^ Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012) Learning from Data, AMLBook Press. ISBN 978-1600490064.
  3. ^ a b c Mukherjee, S.; Niyogi, P.; Poggio, T.; Rifkin, R. M. (2006). "Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization" (PDF). Adv. Comput. Math. 25 (1–3): 161–193. doi:10.1007/s10444-004-7634-z. S2CID 2240256.
