Gradient Boosting

Introducing Gradient Boosting

  • Gradient boosting is a machine learning technique for regression and classification problems
  • Gradient boosting typically produces a prediction model in the form of an ensemble of weak prediction models
  • Typically, the weak prediction models are shallow decision trees
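
A minimal usage sketch, assuming scikit-learn's GradientBoostingRegressor and a synthetic dataset (the data and hyperparameter values below are illustrative, not prescriptive):

```python
# Fit a gradient boosting ensemble of shallow trees for regression.
# Dataset and hyperparameter values are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=100,   # number of weak trees in the ensemble
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # keeps each tree shallow
)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```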

Describing Gradient Boosting for Regression

  1. Choose a loss function $L$

    • Loss values represent how far off the model's predictions are for an observation
  2. Compute an initial prediction $y^{*}$ that is the average of the response (for squared-error loss, the average is the value that minimizes the loss)
  3. Compute the residuals $r_{i} = y_{i} - y^{*}$ for each observation $y_{i}$
  4. Fit a regression tree to the $r_{i}$ values

    • Usually, these trees are shallow, but larger than a stump
  5. Send each observation through the new tree

    • Then, each observation is associated with one of the tree's leaves, indexed by $j$
  6. Compute an output value $\gamma_{j}$ for each leaf $j$

    • Each $\gamma_{j}$ is the average of the residuals $r_{i}$ of all the observations associated with the $j^{th}$ leaf
  7. Update the prediction $y_{i}^{*}$ for each observation:

    $\overbrace{y_{i}^{*}}^{\text{new}} = \overbrace{y_{i}^{*}}^{\text{current}} + (\nu \times \gamma_{j})$

    • Here, $\nu$ is the learning rate used for regularization
    • Here, $\gamma_{j}$ is the output value of the $j^{th}$ leaf that the $i^{th}$ observation falls into
  8. Repeat steps 3-7 until we have built $M$ different shallow trees

    • In practice, typically $M = 100$
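
The numbered steps above can be sketched from scratch for squared-error loss. The helper names below (gradient_boost_fit, gradient_boost_predict) and the default hyperparameter values are assumptions for illustration, not a reference implementation:

```python
# Minimal from-scratch sketch of steps 1-8 for squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, nu=0.1, max_depth=3):
    """Build M shallow trees; return the initial prediction and the trees."""
    y_star = np.full(len(y), y.mean())        # step 2: initial prediction = mean response
    trees = []
    for _ in range(M):                        # step 8: build M shallow trees
        r = y - y_star                        # step 3: residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                        # step 4: fit a shallow tree to the residuals
        gamma = tree.predict(X)               # steps 5-6: each observation gets its leaf's mean residual
        y_star = y_star + nu * gamma          # step 7: update the predictions
        trees.append(tree)
    return y.mean(), trees

def gradient_boost_predict(X, initial, trees, nu=0.1):
    """Sum the initial prediction and the scaled contribution of every tree."""
    pred = np.full(X.shape[0], initial)
    for tree in trees:
        pred += nu * tree.predict(X)
    return pred
```

For squared-error loss, the mean residual predicted by each leaf of DecisionTreeRegressor is exactly the $\gamma_{j}$ of step 6, so steps 5-6 reduce to a single predict call in this sketch.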

Iteratively Building New Trees

  • Each tree built in step 4 is a shallow tree that minimizes the loss function
  • Splits are determined using a greedy split-finding algorithm
  • Specifically, it iterates over all the possible splits on all the features
  • Then, it chooses the split with the highest information gain, as sketched below
  • The depth of the tree is determined using a hyperparameter
  • The maximum number of leaves is determined using a hyperparameter too
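
A minimal sketch of the greedy split search for a single node, scoring candidates by the reduction in squared error (the regression analogue of information gain); the function name best_split is illustrative:

```python
# Greedy split finding for one node of a regression tree fit to residuals r.
# Scores every candidate split on every feature by its reduction in squared error.
import numpy as np

def best_split(X, r):
    """Return (feature index, threshold, gain) of the best split, or Nones if no split helps."""
    parent_sse = np.sum((r - r.mean()) ** 2)
    best_feature, best_threshold, best_gain = None, None, 0.0
    for j in range(X.shape[1]):                  # iterate over all features...
        for t in np.unique(X[:, j]):             # ...and all candidate thresholds
            left, right = r[X[:, j] <= t], r[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            child_sse = (np.sum((left - left.mean()) ** 2)
                         + np.sum((right - right.mean()) ** 2))
            gain = parent_sse - child_sse        # "information gain" of this split
            if gain > best_gain:
                best_feature, best_threshold, best_gain = j, t, gain
    return best_feature, best_threshold, best_gain
```

In practice this search is repeated recursively on each child node until limits such as the maximum depth or the maximum number of leaves (for example, max_depth and max_leaf_nodes in scikit-learn) are reached.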
