Information gain analysis



ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work by Claude Shannon on information theory, which studied the value, or "information content," of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness, or "impurity," in those partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

The expected information needed to classify a tuple in D is given by

Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)

where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. A log function to the base 2 is used, because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.
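As a concrete check of this formula, Info(D) can be computed directly from a partition's class labels. The sketch below is illustrative only; the two-class split of 9 "yes" and 5 "no" tuples is a hypothetical example, not data from the text:

```python
import math
from collections import Counter

def info(labels):
    """Entropy Info(D) = -sum over classes of p_i * log2(p_i),
    where p_i is estimated as |Ci,D| / |D|."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical partition: 9 tuples of class "yes", 5 of class "no"
labels = ["yes"] * 9 + ["no"] * 5
print(round(info(labels), 3))  # 0.940
```

A pure partition (all tuples in one class) yields entropy 0, the minimum; a 50/50 two-class split yields 1 bit, the maximum for two classes.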

Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1, a2, . . . , av}, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, {D1, D2, . . . , Dv}, where Dj contains those tuples in D that have outcome aj of A. These partitions would correspond to the branches grown from node N.
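Putting the pieces together, the standard definition gives Gain(A) = Info(D) - InfoA(D), where InfoA(D) = Σ_j (|Dj|/|D|) × Info(Dj) is the expected information still needed after splitting on A. A minimal sketch, assuming a discrete-valued attribute; the attribute values and labels below are an illustrative example, not data from the text:

```python
import math
from collections import Counter, defaultdict

def info(labels):
    """Entropy Info(D) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj),
    splitting D on the attribute at position attr_index."""
    partitions = defaultdict(list)
    for row, label in zip(rows, labels):
        partitions[row[attr_index]].append(label)
    n = len(labels)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# Hypothetical attribute "age" with values youth / middle_aged / senior
ages = ["youth"] * 5 + ["middle_aged"] * 4 + ["senior"] * 5
labels = (["yes", "yes", "no", "no", "no"] + ["yes"] * 4
          + ["yes", "yes", "yes", "no", "no"])
rows = [[a] for a in ages]
print(round(info_gain(rows, 0, labels), 3))  # 0.247
```

An attribute whose partitions each contain the same class mix as D gives a gain of 0; ID3 picks the attribute with the highest gain as the split for node N.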


A branch is grown from node N for each outcome of the criterion, and the tuples are partitioned accordingly. This section describes three popular attribute selection measures: information gain, gain ratio, and Gini index.
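Of the three measures, the Gini index has the simplest form: Gini(D) = 1 - Σ_i p_i², with p_i estimated as |Ci,D|/|D| just as for entropy. A minimal sketch; the 9/5 class split is an illustrative example, not data from the text:

```python
from collections import Counter

def gini(labels):
    """Gini index Gini(D) = 1 - sum over classes of p_i**2,
    where p_i is estimated as |Ci,D| / |D|."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical partition: 9 tuples of class "yes", 5 of class "no"
labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))  # 0.459
```

Like entropy, the Gini index is 0 for a pure partition and largest when classes are evenly mixed; gain ratio, by contrast, normalizes information gain by the split information of the attribute to penalize many-valued splits.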

The notation used herein is as follows. Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, . . . , m). Let Ci,D be the set of tuples of class Ci in D. Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.

