Information Gain

Describing Information Gain

  • Information gain is the amount of information gained about a random variable from observing another random variable
  • Information gain is a metric used to evaluate candidate splitting criteria for a decision tree
  • In other words, we evaluate the information gain of an attribute (or split) to tell us how important the attribute is
  • The following are some example attributes we may evaluate using information gain:

    • Should we split on balance > 50k
    • Should we split on applicant = employed
    • Should we split on weather = sunny
  • Information gain uses the entropy metric in its calculation
  • Our goal is to find a split that maximizes the information gain, which will happen if we minimize the entropy for the groups created by the split
  • Specifically, information gain can be defined as the following equation:
$$Gain = entropy_{p} - mean(entropy_{c})$$
  • Where $Gain$ is the information gain
  • Where $entropy_{p}$ is the total entropy of the parent node
  • Where $entropy_{c}$ is the entropy of a proposed child node, and the mean is weighted by the number of data points in each child (see the sketch below)
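
As a rough illustration, the following is a minimal Python sketch of this formula (the function name `information_gain` and its argument format are my own choices, and the numbers in the usage line are made up); it assumes the parent entropy and each child's entropy have already been computed, and it weights the child entropies by the size of each child group:

```python
def information_gain(parent_entropy, children):
    """Information gain of a proposed split.

    children: list of (n_points, entropy) pairs, one per child node.
    The child entropies are combined as a mean weighted by the fraction
    of data points that falls into each child.
    """
    total = sum(n for n, _ in children)
    mean_child_entropy = sum((n / total) * e for n, e in children)
    return parent_entropy - mean_child_entropy


# A parent with entropy 1.0 split into two equal-sized children,
# each with entropy 0.5, yields an information gain of 0.5.
print(information_gain(1.0, [(10, 0.5), (10, 0.5)]))  # 0.5
```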

Describing Entropy

  • Entropy measures the level of impurity in a group created by a proposed split
  • We should only think about splitting if our group is very impure (i.e. entropy close to 1)
  • Our goal is to find a split that minimizes the entropy for each group created by the split
  • Minimizing the entropy is the same as minimizing the impurity
  • In other words, the best split will be the one that makes sure each group contains data with the same value (i.e. the least impure)
  • Entropy is defined by the following equation:
$$Entropy = -\sum_{i=1}^{c} p_{i}\log_{2}(p_{i})$$
  • Where $p_{i}$ is the probability of class (or group) $i$
  • Where $c$ is the number of classes, and the logarithm is base 2 (so entropy is measured in bits)
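
As a rough sketch in Python (the function name `entropy` and the use of raw class counts are my own choices), this formula can be computed directly from the counts of each class in a group:

```python
from math import log2

def entropy(class_counts):
    """Entropy (in bits) of a group, given the count of each class.

    Classes with a count of zero are skipped, following the usual
    convention that 0 * log2(0) = 0.
    """
    total = sum(class_counts)
    return sum(-(c / total) * log2(c / total) for c in class_counts if c > 0)


print(entropy([15, 15]))  # 1.0: a 50/50 mix is maximally impure (two classes)
```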

Describing Gini Impurity

  • Gini impurity and entropy are different types of measures of impurity
  • In practice, the CART algorithm treats the two measures as nearly interchangeable
  • Given a choice, some prefer the gini impurity
  • This is because it doesn't require computing logarithms, which makes it slightly cheaper to evaluate
  • The gini impurity can be defined as the following:
$$Gini = 1 - \sum_{i=1}^{c} p_{i}^{2}$$
  • Where $p_{i}$ is the probability of class (or group) $i$
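
A matching Python sketch for the Gini impurity, using the same class-count convention as the entropy sketch above (again, the function name is my own):

```python
def gini(class_counts):
    """Gini impurity of a group, given the count of each class."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)


print(gini([15, 15]))  # 0.5: maximum Gini impurity for two classes
print(gini([16, 0]))   # 0.0: a perfectly pure group
```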

Example of Calculating Entropy

  • Let's say we are trying to evaluate the purity of a group, where we have 16 males and 14 females in our sample
  • Then, we could define our entropy as the following:
$$Entropy = -\left(\tfrac{16}{30}\right)\log_{2}\left(\tfrac{16}{30}\right) - \left(\tfrac{14}{30}\right)\log_{2}\left(\tfrac{14}{30}\right) \approx \left(\tfrac{16}{30}\right)(0.9) + \left(\tfrac{14}{30}\right)(1.1) \approx 0.99$$
  • In this case, we should think about splitting, since the group is extremely impure (the entropy is close to 1)
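
This result can be checked numerically; the short Python snippet below mirrors the hand calculation above (using a base-2 logarithm, as in the rest of these notes):

```python
from math import log2

# 16 males and 14 females is close to a 50/50 mix, so entropy is near 1.
p_male, p_female = 16 / 30, 14 / 30
print(-p_male * log2(p_male) - p_female * log2(p_female))  # ~0.997, i.e. the ~0.99 above
```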

Another Example of Calculating Entropy

  • Let's say we are trying to evaluate the purity of a group, where we have only 16 males in our sample
  • Then, we could define our entropy as the following:
$$Entropy = -\left(\tfrac{16}{16}\right)\log_{2}\left(\tfrac{16}{16}\right) = -(1)(0) = 0$$
  • In this case, we shouldn't think about splitting, since the group is extremely pure
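
The pure-group case can be checked the same way; this is also where the convention $0 \cdot \log_{2}(0) = 0$ matters if an empty second class were included in the calculation:

```python
from math import log2

# Only one class is present, so its probability is 1 and the entropy is zero.
p_male = 16 / 16
print(-p_male * log2(p_male))  # -0.0, i.e. zero entropy for a pure group
```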

Example of Information Gain Calculation

  • Let's say we are trying to evaluate a split of a group, where we initially have 16 males and 14 females in our sample
  • Calculate the parent entropy

    $$-\left(\tfrac{14}{30}\right)\log_{2}\left(\tfrac{14}{30}\right) - \left(\tfrac{16}{30}\right)\log_{2}\left(\tfrac{16}{30}\right) = 0.996$$
    • In this case, the impurity is large, so we should split
  • Calculate one child's entropy

    $$-\left(\tfrac{13}{17}\right)\log_{2}\left(\tfrac{13}{17}\right) - \left(\tfrac{4}{17}\right)\log_{2}\left(\tfrac{4}{17}\right) = 0.787$$
    • Here, there are $17$ data points in this group after the split
    • Also, $13$ of those data points are female, and $4$ of those data points are male
    • In this case, the impurity is fairly high for this group after the split
  • Calculate the other child's entropy

    $$-\left(\tfrac{1}{13}\right)\log_{2}\left(\tfrac{1}{13}\right) - \left(\tfrac{12}{13}\right)\log_{2}\left(\tfrac{12}{13}\right) = 0.391$$
    • Here, there are $13$ data points in this group after the split
    • Also, $1$ of those data points is female, and $12$ of those data points are male
    • In this case, the impurity is fairly small for this group after the split
  • Calculate the weighted average entropy of the children

    $$\left(\tfrac{17}{30}\right)(0.787) + \left(\tfrac{13}{30}\right)(0.391) = 0.615$$
  • Calculate the information gain

    $$0.996 - 0.615 = 0.381 \approx 0.38$$
    • Therefore, this split gives us $0.38$ bits of additional information
    • We should evaluate other splits, and choose this one only if no other split gives an information gain greater than $0.38$ (the sketch below reproduces these numbers)
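
The whole walkthrough can be reproduced in Python. The sketch below redefines the same class-count `entropy` helper introduced earlier (so it runs on its own) and then follows the exact steps above: parent entropy, the two child entropies, their weighted average, and finally the gain:

```python
from math import log2

def entropy(class_counts):
    """Entropy (in bits) of a group, given the count of each class."""
    total = sum(class_counts)
    return sum(-(c / total) * log2(c / total) for c in class_counts if c > 0)


parent = entropy([16, 14])       # ~0.996: parent group (16 males, 14 females)

child_1 = entropy([13, 4])       # ~0.787: first child (13 females, 4 males)
child_2 = entropy([1, 12])       # ~0.391: second child (1 female, 12 males)

# Weighted average of the child entropies, weighted by group size.
mean_child = (17 / 30) * child_1 + (13 / 30) * child_2   # ~0.615

gain = parent - mean_child
print(round(gain, 2))            # 0.38
```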
