Gradient Boosting

Introducing Gradient Boosting

  • Gradient boosting is a machine learning technique for regression and classification problems
  • Gradient boosting typically produces a prediction model in the form of an ensemble of weak prediction models
  • Typically, these weak models are shallow decision trees (a brief usage example follows this list)
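
As a quick illustration of how such an ensemble is used in practice, the snippet below fits a gradient boosting classifier with scikit-learn; the synthetic dataset and the hyperparameter values are assumptions chosen for demonstration, not part of these notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset (an assumption for demonstration)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of shallow decision trees, fit sequentially
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```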

Describing Gradient Boosting for Classification

  1. Choose a loss function $L$

    • The loss value measures how far off the model's prediction is for a given observation
  2. Compute an initial prediction $y^{*}$ that is the probability of the response:

    $$\log(\text{odds}) = \log\left(\frac{\text{num of True}}{\text{num of False}}\right)$$
    $$y^{*} = P(y=1) = \frac{e^{\log(\text{odds})}}{1+e^{\log(\text{odds})}}$$
  3. Compute the residuals $r_{i}$ between each observation $y_{i}$ and $y^{*}$
  4. Fit a regression tree to the $r_{i}$ values

    • Usually, these trees are shallow, but larger than a stump
  5. Send each observation through the new tree

    • Then, each observation is associated with a $j^{th}$ leaf
  6. Compute $\gamma_{j}$, the following transformation, for each $j^{th}$ leaf

    • Here, $\sum_{i \in \text{leaf } j} r_{i}$ is the sum of all the residuals in leaf $j$
    • Here, the denominator sums $y_{i}^{*}\,(1-y_{i}^{*})$ over the previous predictions $y_{i}^{*}$ in leaf $j$
    $$\gamma_{j} = \frac{\sum_{i \in \text{leaf } j} r_{i}}{\sum_{i \in \text{leaf } j} \left(y_{i}^{*} \times (1-y_{i}^{*})\right)}$$
  7. Create a new prediction $y^{*}$ that is:

    • Here, $\nu$ is the learning rate used for regularization
    • Here, $\gamma_{j}$ is the tree output associated with the $j^{th}$ leaf for the $i^{th}$ observation
    $$\overbrace{y_{i}^{*}}^{\text{new}} = \overbrace{y_{i}^{*}}^{\text{current}} + (\nu \times \gamma_{j})$$
  8. Repeat steps 3-7 until we have built $M$ different shallow trees

    • In practice, typically $M=100$ (a sketch of these steps in code follows this list)
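
The following is a minimal sketch of steps 1-8 in Python, assuming binary labels in {0, 1}, the log loss, and scikit-learn's DecisionTreeRegressor for the residual trees; all function and variable names here are illustrative. Note that the update in step 7 is applied to the running log-odds score, which is converted back to a probability before computing the next round of residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, M=100, nu=0.1, max_depth=3):
    """Sketch of steps 1-8 for a NumPy array y of 0/1 labels with the log loss."""
    # Step 2: initial prediction from the log odds of the response
    log_odds = np.log(y.sum() / (len(y) - y.sum()))
    F = np.full(len(y), log_odds)          # running log-odds score per observation
    trees, leaf_values = [], []

    for _ in range(M):                     # step 8: build M shallow trees
        p = 1.0 / (1.0 + np.exp(-F))       # current predicted probability y*_i
        r = y - p                          # step 3: residuals r_i = y_i - y*_i
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                     # step 4: fit a regression tree to the residuals
        leaf = tree.apply(X)               # step 5: leaf index j for each observation
        gamma = {}                         # step 6: gamma_j for each leaf j
        for j in np.unique(leaf):
            in_j = leaf == j
            gamma[j] = r[in_j].sum() / (p[in_j] * (1.0 - p[in_j])).sum()
        F = F + nu * np.array([gamma[j] for j in leaf])  # step 7: update predictions
        trees.append(tree)
        leaf_values.append(gamma)
    return log_odds, trees, leaf_values

def predict_proba(X, log_odds, trees, leaf_values, nu=0.1):
    """Send new observations through every tree and convert back to a probability."""
    F = np.full(len(X), log_odds)
    for tree, gamma in zip(trees, leaf_values):
        F = F + nu * np.array([gamma[j] for j in tree.apply(X)])
    return 1.0 / (1.0 + np.exp(-F))
```

Library implementations such as scikit-learn's GradientBoostingClassifier follow the same outline while adding refinements like subsampling and additional loss functions.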

Iteratively Building New Trees

  • Each tree that is built in step 4 is a shallow tree that minimizes the cost function
  • Splits are determined using a greedy split-finding algorithm
  • Specifically, it iterates over all the possible splits on all the features
  • Then, it selects the split with the highest information gain (a rough sketch of this search follows this list)
  • The depth of the tree is determined using a hyperparameter
  • The maximum number of leaves is determined using a hyperparameter too
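
As a rough illustration of that greedy search, the sketch below tries every threshold on every feature and keeps the split with the highest gain; for a regression tree fit to residuals, the gain is measured here as the reduction in residual variance, which plays the role of the information gain mentioned above. The helper name is hypothetical, not a library API.

```python
import numpy as np

def best_split(X, r):
    """Greedy search: try every threshold on every feature and keep the split
    with the highest gain (here, the reduction in residual variance)."""
    parent_impurity = np.var(r) * len(r)
    best_feature, best_threshold, best_gain = None, None, 0.0
    for f in range(X.shape[1]):                 # iterate over all features
        for t in np.unique(X[:, f]):            # ... and all candidate thresholds
            left, right = r[X[:, f] <= t], r[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue                        # skip degenerate splits
            child_impurity = np.var(left) * len(left) + np.var(right) * len(right)
            gain = parent_impurity - child_impurity
            if gain > best_gain:
                best_feature, best_threshold, best_gain = f, t, gain
    return best_feature, best_threshold, best_gain
```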
