
C4.5 algorithm


C4.5 is an algorithm developed by Ross Quinlan that is used to generate a decision tree.[1] C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. In 2011, authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date".[2]

It became quite popular after ranking #1 in the pre-eminent paper Top 10 Algorithms in Data Mining, published by Springer LNCS in 2008.[3]

Algorithm

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set $S = \{s_1, s_2, \ldots\}$ of already classified samples. Each sample $s_i$ consists of a p-dimensional vector $(x_{1,i}, x_{2,i}, \ldots, x_{p,i})$, where the $x_j$ represent attribute values or features of the sample, as well as the class in which $s_i$ falls.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the partitioned sublists.
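
As a concrete illustration, the following is a minimal sketch of this selection step on a toy dataset. The data, the dictionary-per-sample representation, and the helper names (entropy, gain_ratio) are illustrative assumptions rather than Quinlan's implementation, and the sketch handles only discrete attributes.

```python
# Minimal sketch of the gain-ratio criterion on a toy dataset.
# Dataset and helper names are illustrative, not Quinlan's code.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(samples, labels, attr):
    """Information gain from splitting on `attr`, normalized by the
    entropy of the partition itself (the gain ratio)."""
    total = len(labels)
    subsets = {}
    for sample, label in zip(samples, labels):
        subsets.setdefault(sample[attr], []).append(label)
    # Weighted class entropy remaining after the split.
    remainder = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the subset sizes themselves, which
    # penalizes attributes that fragment the data into many small subsets.
    split_info = -sum((len(sub) / total) * log2(len(sub) / total)
                      for sub in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy training set: each sample maps attribute names to values,
# with the class label held in a parallel list.
samples = [
    {"outlook": "sunny",    "windy": False},
    {"outlook": "sunny",    "windy": True},
    {"outlook": "overcast", "windy": False},
    {"outlook": "rain",     "windy": False},
    {"outlook": "rain",     "windy": True},
]
labels = ["no", "no", "yes", "yes", "no"]

for attr in ("outlook", "windy"):
    print(attr, round(gain_ratio(samples, labels, attr), 3))
# outlook 0.375
# windy 0.433
```

On this toy data the raw information gain would favour outlook, but the split-information penalty makes windy the better-scoring attribute, which is exactly what the normalization is intended to do.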

This algorithm has a few base cases.

  • All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
  • None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
  • An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

Pseudocode

In pseudocode, the general algorithm for building decision trees is:[4]

  1. Check for the above base cases.
  2. For each attribute a, find the normalized information gain ratio from splitting on a.
  3. Let a_best be the attribute with the highest normalized information gain.
  4. Create a decision node that splits on a_best.
  5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node.
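
A minimal runnable sketch of this recursion follows, reusing the entropy and gain_ratio helpers from the sketch above. The nested-dictionary node layout and the majority_class helper are illustrative choices, and the sketch omits C4.5's handling of continuous attributes, missing values, and pruning.

```python
# Sketch of the recursion in the pseudocode above. Reuses the entropy /
# gain_ratio helpers defined earlier; node layout is an illustrative choice.
from collections import Counter

def majority_class(labels):
    """Expected (majority) class, used for the fallback base cases."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(samples, labels, attrs):
    # Base case: all samples share one class -> leaf predicting that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case: no attributes left, or none with positive gain ratio
    # -> leaf predicting the expected (majority) class.
    if not attrs:
        return majority_class(labels)
    best_score, a_best = max((gain_ratio(samples, labels, a), a) for a in attrs)
    if best_score <= 0:
        return majority_class(labels)
    # Create a decision node on a_best and recurse on each sublist.
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[a_best], []).append((sample, label))
    remaining = [a for a in attrs if a != a_best]
    children = {}
    for value, part in partitions.items():
        sub_samples = [s for s, _ in part]
        sub_labels = [l for _, l in part]
        children[value] = build_tree(sub_samples, sub_labels, remaining)
    return {"split": a_best, "children": children}

print(build_tree(samples, labels, ["outlook", "windy"]))
# {'split': 'windy', 'children': {False: {'split': 'outlook', 'children':
#  {'sunny': 'no', 'overcast': 'yes', 'rain': 'yes'}}, True: 'no'}}
```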

Implementations

J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool.

Improvements from ID3 algorithm

C4.5 made a number of improvements to ID3. Some of these are:

  • Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it[5] (see the sketch after this list).
  • Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations (also illustrated in the sketch below).
  • Handling attributes with differing costs.
  • Pruning trees after creation - C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes.
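
As a rough illustration of the first two items above, the sketch below chooses a binary threshold for a numeric attribute by scanning midpoints between adjacent sorted values, and simply drops samples marked ? before computing the gain. The function name, the midpoint rule, and the toy data are assumptions made for illustration; Quinlan's actual implementation differs in its details.[5]

```python
# Illustrative sketch: thresholding a continuous attribute and ignoring
# '?' (missing) values in the gain computation. Names and details are
# assumptions, not Quinlan's implementation.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (gain, threshold) for the best binary split `value <= t`
    of a numeric attribute, skipping samples whose value is '?'."""
    known = sorted((v, c) for v, c in zip(values, labels) if v != "?")
    base = entropy([c for _, c in known])
    best_gain, best_t = -1.0, None
    for (lo, _), (hi, _) in zip(known, known[1:]):
        if lo == hi:
            continue  # no boundary between equal attribute values
        t = (lo + hi) / 2  # candidate threshold at the midpoint
        left = [c for v, c in known if v <= t]
        right = [c for v, c in known if v > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(known)
        if base - remainder > best_gain:
            best_gain, best_t = base - remainder, t
    return best_gain, best_t

humidity = [70, 90, 85, "?", 95, 70]
play     = ["yes", "no", "no", "yes", "no", "yes"]
print(best_threshold(humidity, play))  # ≈ (0.971, 77.5)
```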

Improvements in C5.0/See5 algorithm

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are:[6][7]

  • Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude).
  • Memory usage - C5.0 is more memory efficient than C4.5.
  • Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
  • Support for boosting - boosting builds an ensemble of trees that vote on the final classification, typically improving accuracy.
  • Weighting - C5.0 allows you to weight different cases and misclassification types.
  • Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.

Source for a single-threaded Linux version of C5.0 is available under the GNU General Public License (GPL).

References

  1. ^ Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
  2. ^ Witten, Ian H.; Frank, Eibe; Hall, Mark A. (2011). Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.). Morgan Kaufmann, San Francisco. p. 191.
  3. ^ Top 10 Algorithms in Data Mining (umd.edu).
  4. ^ Kotsiantis, S. B. (2007). "Supervised Machine Learning: A Review of Classification Techniques". Informatica 31: 249–268.
  5. ^ Quinlan, J. R. (1996). "Improved Use of Continuous Attributes in C4.5". Journal of Artificial Intelligence Research 4: 77–90.
  6. ^ Is See5/C5.0 Better Than C4.5?
  7. ^ Kuhn, M.; Johnson, K. (2013). Applied Predictive Modeling. Springer.
