Introducing XGBoost for Regression
-
The model used in XGBoost is a tree ensemble
- Consisting of classification and regression trees
- Gradient boosting is used for building out the tree ensemble
-
Boosting algorithms are roughly functional gradient descent algorithms
- Meaning, they optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction
- XGBoost trees differ from normal decision trees in that each leaf holds a continuous score instead of an actual decision value
- Random forests and gradient boosted trees are the same model (tree ensembles), but differ in how we train them
- In XGBoost, these trees are learned by defining an objective function and optimizing it
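As a rough sketch of the functional gradient descent view (the notation here is illustrative, not taken from these notes): at iteration $m$ the ensemble $F_m$ is updated as

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$

where the weak hypothesis $h_m$ (a shallow tree) is fit to approximate the negative gradient of the loss with respect to the current predictions $F_{m-1}(x)$; for squared-error loss, these negative gradients are simply the residuals.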
-
The objective function contains two components in XGBoost:
- A loss function
- A regularization function
- The regularization function prunes the tree (decides whether to keep branches or not) by calculating the information gain for each branch and only keeping branches whose gain is large enough (see pruning below)
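As a concrete illustration, here is a minimal sketch of how these components map onto hyperparameters of the xgboost scikit-learn wrapper (the dataset and hyperparameter values below are made up for the example):

```python
# A minimal sketch tying the objective's components (loss + regularization)
# to xgboost's scikit-learn API. Values are illustrative, not recommendations.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=200)

model = xgb.XGBRegressor(
    objective="reg:squarederror",  # the loss function component
    reg_lambda=1.0,                # lambda: regularization on leaf output values
    gamma=0.0,                     # gamma: minimum gain required to keep a branch (pruning)
    learning_rate=0.3,             # eta: shrinks each tree's contribution
    max_depth=6,                   # limits the depth of each tree
    n_estimators=100,              # number of shallow trees in the ensemble
)
model.fit(X, y)
preds = model.predict(X[:5])
```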
Describing XGBoost for Regression
-
Choose a loss function
- A loss value represents how good our model is at predicting an observation
- Compute an initial prediction $\hat{y}_i^{(0)}$ (by default, XGBoost uses $0.5$)
- Compute the residuals $r_i = y_i - \hat{y}_i$ between each observation $y_i$ and its current prediction $\hat{y}_i$
-
Fit a regression tree to the residual values $r_1, r_2, \dots, r_n$
- Where, $n$ is the number of observations
- Where, $y_i$ is the observed value for observation $i$
- Where, $\hat{y}_i$ is the current predicted value for observation $i$
- Perform pruning on the tree by removing the lowest leaves
- Assign output values to each leaf: $O_{\text{leaf}} = \dfrac{\sum_{i \in \text{leaf}} r_i}{n_{\text{leaf}} + \lambda}$, where $n_{\text{leaf}}$ is the number of residuals in the leaf
-
Create a new prediction that is: $\hat{y}_i^{\text{new}} = \hat{y}_i^{\text{old}} + \eta \times O_{\text{leaf}}$
- Here, $\eta$ is the learning rate used for regularization
- Here, $O_{\text{leaf}}$ is the output value (a regularized average of residuals) associated with the leaf for the observation
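For example (with made-up numbers): if an observation's previous prediction is $0.5$, its leaf has an output value of $6$, and the learning rate is $\eta = 0.3$, then the new prediction is $0.5 + 0.3 \times 6 = 2.3$.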
-
Repeat the residual, tree-fitting, and prediction steps above until we build $M$ different shallow trees
- In practice, typically $M = 100$
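The loop below is a from-scratch sketch of these steps for squared-error loss. It uses sklearn's DecisionTreeRegressor as a stand-in for XGBoost's own tree-fitting method (described in the next section), so the leaf values are plain residual averages rather than XGBoost's regularized output values:

```python
# A from-scratch sketch of the boosting loop described above (squared-error loss).
# sklearn's DecisionTreeRegressor stands in for XGBoost's own tree-fitting method.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_regression(X, y, n_trees=100, learning_rate=0.3, max_depth=2):
    pred = np.full(len(y), 0.5)          # initial prediction (XGBoost's default is 0.5)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred             # residuals between observations and predictions
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)           # fit a shallow regression tree to the residuals
        pred = pred + learning_rate * tree.predict(X)  # shrink and add the tree's output
        trees.append(tree)
    return trees

def predict(trees, X, learning_rate=0.3):
    pred = np.full(X.shape[0], 0.5)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```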
Fitting an XGBoost Regression Tree
- In XGBoost, a tree is fit (the tree-fitting step above) using its own fitting method
-
Specifically, the XGBoost fitting method follows these steps:
- Select a single independent variable
- Select two observations with neighboring values of the selected independent variable
- Split on the average of these two observations
- Compute the following similarity score: $\text{Similarity} = \dfrac{\left(\sum_{i \in \text{leaf}} r_i\right)^2}{n_{\text{leaf}} + \lambda}$
- Compute the information gain of the split: $\text{Gain} = \text{Similarity}_{\text{left}} + \text{Similarity}_{\text{right}} - \text{Similarity}_{\text{root}}$
- Shift to the next two observations with neighboring values of the selected independent variable
- Continue to shift and compute similarity scores and gains by repeating the previous steps
- Continue to iterate through the other independent variables by repeating the steps above (a split-search sketch appears at the end of this section)
- Here, $\lambda$ is a regularization hyperparameter
-
Again, the similarity score is computed for each leaf
- Leaves with similar residuals produce a larger similarity score
- Leaves with dissimilar residuals produce a smaller similarity score
-
The information gain determines whether a split is good or not
- Splits are considered good if they produce a large information gain
- Splits are considered poor if they produce a small information gain
- Thus, the best split is the one with the highest information gain
- The depth of the tree is determined using a hyperparameter (max_depth in XGBoost)
- The maximum number of leaves is determined using a hyperparameter (max_leaves in XGBoost)
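Putting the similarity score and gain together, here is an illustrative single-feature split search (the function names and structure are my own sketch, not XGBoost's internal implementation):

```python
# Illustrative split search over one independent variable using the similarity
# score and gain defined above.
import numpy as np

def similarity(residuals, lam=1.0):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return residuals.sum() ** 2 / (len(residuals) + lam)

def best_split(x, residuals, lam=1.0):
    order = np.argsort(x)
    x, residuals = x[order], residuals[order]
    root_sim = similarity(residuals, lam)
    best_gain, best_threshold = -np.inf, None
    # Candidate thresholds are averages of neighboring values of the feature
    for i in range(len(x) - 1):
        threshold = (x[i] + x[i + 1]) / 2.0
        left, right = residuals[x <= threshold], residuals[x > threshold]
        gain = similarity(left, lam) + similarity(right, lam) - root_sim
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain
```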
Pruning an XGBoost Regression Tree
- A tunable hyperparameter $\gamma$ (gamma) is used for pruning trees
- Calculate the following for each branch: $\text{Gain} - \gamma$
- If the difference between the gain and $\gamma$ is negative, then we remove the branch
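A minimal sketch of this pruning check (the numbers are made up):

```python
# Keep a branch only if its gain exceeds gamma, i.e. gain - gamma > 0.
def keep_branch(gain, gamma=0.0):
    return gain - gamma > 0

print(keep_branch(1.3, gamma=2.0))   # False -> remove the branch
print(keep_branch(1.3, gamma=0.5))   # True  -> keep the branch
```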