Supervised learning (SL) is a paradigm in machine learning in which a model is trained on input objects (for example, a vector of predictor variables) paired with desired output values (also known as a human-labeled supervisory signal). The training data is processed, building a function that maps new data to expected output values.[1] In the optimal scenario, the algorithm correctly determines output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias). This statistical quality of an algorithm is measured through the so-called generalization error.

In supervised learning, the training data is labeled with the expected answers, while in unsupervised learning, the model identifies patterns or structures in unlabeled data.

Steps to follow

To solve a given problem of supervised learning, one has to perform the following steps:

  1. Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, an entire sentence of handwriting or perhaps a full paragraph of handwriting.
  2. Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
  3. Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality, but should contain enough information to accurately predict the output.
  4. Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support-vector machines or decision trees.
  5. Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
  6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
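
The steps above can be sketched end to end in a few lines. The following is a minimal illustration rather than a reference implementation: it assumes a synthetic one-dimensional data set, a k-nearest-neighbour classifier as the learned function, and k as the only control parameter. All function names and constants are hypothetical.

```python
import random

def make_data(n, rng):
    """Synthetic data (an assumption for illustration): the label is 1 when
    x > 0.5, and 10% of labels are flipped to mimic labelling errors."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k odd)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in neighbours)
    return int(2 * votes > k)

def accuracy(train, examples, k):
    """Fraction of examples whose label is predicted correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in examples)
    return correct / len(examples)

rng = random.Random(0)
data = make_data(300, rng)                      # step 2: gather the data
train, validation, test = data[:200], data[200:250], data[250:]

# Step 5: choose the control parameter k using the validation set only.
candidates = [1, 5, 15]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# Step 6: measure performance on the held-out test set.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

The test set is touched only once, after k has been fixed, so the reported accuracy is not biased by the parameter search.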

Algorithm choice

A wide range of supervised learning algorithms are available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).

There are four major issues to consider in supervised learning:

Bias-variance tradeoff

A first issue is the tradeoff between bias and variance.[2] Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input $x$ if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for $x$. A learning algorithm has high variance for a particular input $x$ if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.[3] Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
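
The thought experiment with several training sets can be simulated directly. The sketch below is illustrative only and rests on assumptions not made in the text: a regression setting with a sine-shaped true function, Gaussian output noise, and two hypothetical learners (a constant "mean" model and a 1-nearest-neighbour model). It estimates the squared bias and the variance of each learner's prediction at a single query input.

```python
import math
import random

def true_f(x):
    """Assumed 'true' function, used only to generate synthetic data."""
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=20, noise=0.3):
    """Draw one training set of n noisy observations of true_f."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]

def fit_mean(train):
    """Inflexible learner: always predicts the mean training output."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_1nn(train):
    """Flexible learner: predicts the output of the nearest training input."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def bias2_and_variance(fit, x0, rng, repeats=500):
    """Train on many independent training sets and summarise predictions at x0."""
    preds = [fit(sample_training_set(rng))(x0) for _ in range(repeats)]
    mean_pred = sum(preds) / repeats
    bias2 = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / repeats
    return bias2, variance

rng = random.Random(0)
x0 = 0.25
bias2_mean, var_mean = bias2_and_variance(fit_mean, x0, rng)
bias2_nn, var_nn = bias2_and_variance(fit_1nn, x0, rng)
print(f"mean model: bias^2={bias2_mean:.3f} variance={var_mean:.3f}")
print(f"1-NN model: bias^2={bias2_nn:.3f} variance={var_nn:.3f}")
```

Under these assumptions the constant model should show the larger squared bias and the nearest-neighbour model the larger variance, which is the tradeoff described above.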

Function complexity and amount of training data

The second issue is the amount of training data available relative to the complexity of the "true" function (classifier or regression function). If the true function is simple, then an "inflexible" learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function can only be learned from a large amount of training data paired with a "flexible" learning algorithm with low bias and high variance.

Dimensionality of the input space

A third issue is the dimensionality of the input space. If the input feature vectors are high-dimensional, learning the function can be difficult even if the true function only depends on a small number of those features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high-dimensional input data typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, it will likely improve the accuracy of the learned function. In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.
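
As one simple illustration of feature selection, the sketch below ranks features by the absolute Pearson correlation between each feature and the target and keeps the highest-scoring ones. This univariate filter is only one of many possible choices and is used here as an assumption for brevity; it can miss features that matter only in combination. The data set and all names are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def select_features(X, y, keep):
    """Return the indices of the `keep` features most correlated with y."""
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        scores.append((abs(pearson(column, y)), j))
    top = sorted(scores, reverse=True)[:keep]
    return sorted(j for _, j in top)

# Synthetic data: five features, but the target depends only on features 0 and 2.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * row[0] - 2 * row[2] + rng.gauss(0, 0.1) for row in X]

selected = select_features(X, y, keep=2)
X_reduced = [[row[j] for j in selected] for row in X]   # lower-dimensional input
print(selected)
```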

Noise in the output values

A fourth issue is the degree of noise in the desired output values (the supervisory target variables). If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Attempting to fit the data too carefully leads to overfitting. Overfitting can occur even when there are no measurement errors (stochastic noise) if the function to be learned is too complex for the learning model. In such a situation, the part of the target function that cannot be modeled "corrupts" the training data; this phenomenon has been called deterministic noise. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.

In practice, there are several approaches to alleviate noise in the output values, such as early stopping to prevent overfitting, as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm. Several algorithms identify noisy training examples, and removing the suspected noisy training examples prior to training has decreased generalization error with statistical significance.[4][5]
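
To make the idea of filtering noisy examples concrete, the following sketch flags a training example when more than half of its nearest neighbours carry a different label. This neighbourhood-agreement rule is an assumption chosen for brevity; it is not the procedure of the cited works, and the function name and toy data are hypothetical.

```python
def flag_suspect_labels(examples, k=4):
    """Return indices of examples whose label disagrees with more than half
    of the k nearest other examples (1-D inputs, absolute distance)."""
    flagged = []
    for i, (x_i, y_i) in enumerate(examples):
        others = [p for j, p in enumerate(examples) if j != i]
        neighbours = sorted(others, key=lambda p: abs(p[0] - x_i))[:k]
        disagreements = sum(y != y_i for _, y in neighbours)
        if 2 * disagreements > k:
            flagged.append(i)
    return flagged

# Toy data: the label is 1 when x >= 50; two labels are flipped on purpose.
examples = [(x, int(x >= 50)) for x in range(100)]
examples[10] = (10, 1)
examples[80] = (80, 0)

flagged = flag_suspect_labels(examples)
cleaned = [p for i, p in enumerate(examples) if i not in flagged]
print(flagged, len(cleaned))
```

Requiring a strict majority of disagreeing neighbours keeps correctly labelled examples near the class boundary from being discarded in this toy setting; a real filter would need to be validated on held-out data.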

Other factors to consider

Other factors to consider when choosing and applying a learning algorithm include the following:

When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross-validation). Tuning the performance of a learning algorithm can be very time-consuming. Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.

Algorithms

The most widely used learning algorithms are:

How supervised learning algorithms work

Given a set of $N$ training examples of the form $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ such that $x_i$ is the feature vector of the $i$-th example and $y_i$ is its label (i.e., class), a learning algorithm seeks a function $g: X \to Y$, where $X$ is the input space and $Y$ is the output space. The function $g$ is an element of some space of possible functions $G$, usually called the hypothesis space. It is sometimes convenient to represent $g$ using a scoring function $f: X \times Y \to \mathbb{R}$ such that $g$ is defined as returning the $y$ value that gives the highest score: $g(x) = \arg\max_y f(x, y)$. Let $F$ denote the space of scoring functions.
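
The relationship between a scoring function and the classifier it induces can be shown in a few lines. In this sketch the scoring function (negative squared distance to a per-class centroid) and the centroid values are hypothetical, chosen only to make the arg-max construction concrete.

```python
def make_classifier(score, labels):
    """Build g from a scoring function f: g(x) returns the label y
    with the highest score f(x, y)."""
    def g(x):
        return max(labels, key=lambda y: score(x, y))
    return g

# Hypothetical scoring function: closeness of x to a fixed centroid per class.
centroids = {"low": 0.0, "high": 1.0}

def score(x, y):
    return -(x - centroids[y]) ** 2

g = make_classifier(score, list(centroids))
print(g(0.2), g(0.9))
```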

Although $G$ and $F$ can be any space of functions, many learning algorithms are probabilistic models where $g$ takes the form of a conditional probability model $g(x) = \arg\max_y P(y \mid x)$, or $f$ takes the form of a joint probability model $f(x, y) = P(x, y)$. For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.

There are two basic approaches to choosing $f$ or $g$: empirical risk minimization and structural risk minimization.[6] Empirical risk minimization seeks the function that best fits the training data. Structural risk minimization includes a penalty function that controls the bias/variance tradeoff.

In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs, $(x_i, y_i)$. In order to measure how well a function fits the training data, a loss function $L: Y \times Y \to \mathbb{R}^{\ge 0}$ is defined. For training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$.

The risk $R(g)$ of function $g$ is defined as the expected loss of $g$. This can be estimated from the training data as

$$R_{emp}(g) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, g(x_i)).$$
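
The estimate of the risk from training data is simply the average loss over the training examples. A minimal sketch, assuming a hypothetical threshold classifier, a four-example data set, and the zero-one loss:

```python
def empirical_risk(g, examples, loss):
    """Average loss of g over the (x, y) pairs in examples."""
    return sum(loss(y, g(x)) for x, y in examples) / len(examples)

def zero_one_loss(y, y_hat):
    """1 for a wrong prediction, 0 for a correct one."""
    return float(y != y_hat)

def g(x):
    """Hypothetical classifier: predict 1 when x is at least 2."""
    return int(x >= 2)

examples = [(1, 1), (2, 0), (3, 1), (4, 1)]
risk = empirical_risk(g, examples, zero_one_loss)
print(risk)
```

Here the classifier is wrong on the first two examples and right on the last two, so the empirical risk is 0.5.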

Empirical risk minimization

In empirical risk minimization, the supervised learning algorithm seeks the function $g$ that minimizes $R_{emp}(g)$. Hence, a supervised learning algorithm can be constructed by applying an optimization algorithm to find $g$.

When $g$ is a conditional probability distribution $P(y \mid x)$ and the loss function is the negative log likelihood: $L(y, \hat{y}) = -\log P(y \mid x)$, then empirical risk minimization is equivalent to maximum likelihood estimation.
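
The equivalence can be seen by writing out the minimization. As a notational assumption, let $P_g(y \mid x)$ denote the conditional distribution represented by a candidate function $g$. With the negative log likelihood as the loss and independent, identically distributed training pairs:

$$
\begin{aligned}
\arg\min_{g \in G} \frac{1}{N} \sum_{i=1}^{N} -\log P_g(y_i \mid x_i)
&= \arg\max_{g \in G} \sum_{i=1}^{N} \log P_g(y_i \mid x_i) \\
&= \arg\max_{g \in G} \prod_{i=1}^{N} P_g(y_i \mid x_i).
\end{aligned}
$$

The factor $1/N$ and the monotone logarithm do not change the maximizer, and the final product is the conditional likelihood of the training labels, so the minimizer of the empirical risk is the maximum likelihood estimate.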

When $G$ contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization. The learning algorithm is able to memorize the training examples without generalizing well. This is called overfitting.

Structural risk minimization

Structural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization. The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.

A wide variety of penalties have been employed that correspond to different definitions of complexity. For example, consider the case where the function $g$ is a linear function of the form

$$g(x) = \sum_{j=1}^{d} \beta_j x_j.$$

A popular regularization penalty is $\sum_j \beta_j^2$, which is the squared Euclidean norm of the weights, also known as the $L_2$ norm. Other norms include the $L_1$ norm, $\sum_j |\beta_j|$, and the $L_0$ "norm", which is the number of non-zero $\beta_j$s. The penalty will be denoted by $C(g)$.

The supervised learning optimization problem is to find the function $g$ that minimizes

$$J(g) = R_{emp}(g) + \lambda C(g).$$

The parameter $\lambda$ controls the bias-variance tradeoff. When $\lambda = 0$, this gives empirical risk minimization with low bias and high variance. When $\lambda$ is large, the learning algorithm will have high bias and low variance. The value of $\lambda$ can be chosen empirically via cross-validation.
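
The effect of the regularization parameter can be seen in the simplest possible case. The sketch below assumes a linear function with a single weight, squared-error loss, and a squared-weight penalty, for which the minimizer of the penalized objective has a closed form. The data values and the function name are hypothetical.

```python
def ridge_weight(xs, ys, lam):
    """Minimiser of (1/N) * sum((y - beta * x)**2) + lam * beta**2.

    Setting the derivative with respect to beta to zero gives
    beta = sum(x * y) / (sum(x * x) + N * lam).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A penalty weight of 0 is plain empirical risk minimization (least squares).
weights = {lam: ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 10.0)}
for lam, beta in weights.items():
    print(f"lambda={lam}: beta={beta:.3f}")
```

As the penalty weight grows, the fitted weight shrinks toward zero: the function becomes simpler and depends less on the particular training sample, at the cost of fitting the data less closely.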

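The complexity penalty has a Bayesian interpretation as the negative log prior probability of $g$, $-\log P(g)$, in which case (when the loss is the negative log likelihood) minimizing $J(g)$ corresponds to maximizing the posterior probability of $g$.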
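
A short derivation makes the Bayesian reading explicit. It assumes, as a sketch, that the loss is the negative log likelihood $-\log P(y_i \mid x_i, g)$, that the penalty is the negative log prior $-\log P(g)$, and that $D$ denotes the training set of $N$ independent pairs. By Bayes' rule:

$$
\begin{aligned}
-\log P(g \mid D)
&= -\sum_{i=1}^{N} \log P(y_i \mid x_i, g) - \log P(g) + \text{const} \\
&= N \left[ \frac{1}{N} \sum_{i=1}^{N} -\log P(y_i \mid x_i, g) + \frac{1}{N} \bigl(-\log P(g)\bigr) \right] + \text{const}.
\end{aligned}
$$

The bracketed term is the empirical risk plus the penalty weighted by $1/N$, so under these assumptions the function that minimizes the regularized objective with weight $1/N$ is the maximum a posteriori estimate.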

Generative training

The training methods described above are discriminative training methods, because they seek to find a function $g$ that discriminates well between the different output values (see discriminative model). For the special case where $f(x, y) = P(x, y)$ is a joint probability distribution and the loss function is the negative log likelihood $-\sum_i \log P(x_i, y_i)$, a risk minimization algorithm is said to perform generative training, because $f$ can be regarded as a generative model that explains how the data were generated. Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms. In some cases, the solution can be computed in closed form, as in naive Bayes and linear discriminant analysis.
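
A closed-form generative fit can be illustrated with a deliberately small model. The sketch below assumes one-dimensional inputs and a Gaussian distribution of the input within each class (an assumption made for illustration, not a description of any particular library). The maximum likelihood estimates of the class priors, means, and variances are simple averages, and prediction picks the label with the highest joint log-probability.

```python
import math
from collections import defaultdict

def fit_generative(examples):
    """Closed-form maximum likelihood estimates of (prior, mean, variance)
    for each class. Assumes every class has at least two distinct inputs,
    so that the variance is non-zero."""
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append(x)
    n = len(examples)
    params = {}
    for y, xs in by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[y] = (len(xs) / n, mean, var)
    return params

def log_joint(params, x, y):
    """log P(x, y) = log P(y) + log N(x; mean_y, var_y)."""
    prior, mean, var = params[y]
    return (math.log(prior)
            - 0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var))

def predict(params, x):
    """Return the label with the highest joint log-probability."""
    return max(params, key=lambda y: log_joint(params, x, y))

examples = [(1.0, "a"), (1.2, "a"), (0.8, "a"),
            (3.0, "b"), (3.3, "b"), (2.7, "b")]
params = fit_generative(examples)
print(predict(params, 1.1), predict(params, 2.9))
```

No iterative optimization is needed: training is a single pass that accumulates counts and averages, which is why generative training can be computationally cheap.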

Generalizations

Figure: Tendency for a task to employ supervised vs. unsupervised methods. Task names intentionally straddle circle boundaries, showing that the classical division of imaginative tasks (left) employing unsupervised methods is blurred in today's learning schemes.

There are several ways in which the standard supervised learning problem can be generalized:

  • Semi-supervised learning or weak supervision: The desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled.
  • Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.
  • Structured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, then standard methods must be extended.
  • Learning to rank: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.

Approaches and algorithms

Applications

General issues

See also

References

  1. Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar (2012). Foundations of Machine Learning. The MIT Press. ISBN 9780262018258.
  2. S. Geman, E. Bienenstock, and R. Doursat (1992). Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58.
  3. G. James (2003). Variance and Bias for General Loss Functions. Machine Learning 51, 115–135. (http://www-bcf.usc.edu/~gareth/research/bv.pdf)
  4. C.E. Brodley and M.A. Friedl (1999). Identifying and Eliminating Mislabeled Training Instances. Journal of Artificial Intelligence Research 11, 131–167. (http://jair.org/media/606/live-606-1803-jair.pdf)
  5. M.R. Smith and T. Martinez (2011). "Improving Classification Accuracy by Identifying and Removing Instances that Should Be Misclassified". Proceedings of International Joint Conference on Neural Networks (IJCNN 2011). pp. 2690–2697. CiteSeerX 10.1.1.221.1371. doi:10.1109/IJCNN.2011.6033571.
  6. V. N. Vapnik (2000). The Nature of Statistical Learning Theory (2nd Ed.). Springer Verlag.
  7. A. Maity (2016). "Supervised Classification of RADARSAT-2 Polarimetric Data for Different Land Features". arXiv:1608.00501 [cs.CV].
  8. "Key Technologies for Agile Procurement | SIPMM Publications". publication.sipmm.edu.sg. 2020-10-09. Retrieved 2022-06-16.