Introducing XGBoost for Regression
-
The model used in XGBoost is a tree ensemble
- Consisting of classification and regression trees
- Gradient boosting is used for building out the tree ensemble
-
Boosting algorithms are roughly functional gradient descent algorithms
- Meaning, they optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction
- XGBoost trees differ from normal decision trees: each leaf holds a continuous score instead of an actual decision value
- Random forests and gradient boosted trees are the same model (tree ensembles), but differ in how we train them
- In XGBoost, these trees are learned by defining an objective function and optimizing it
-
The objective function contains two components in XGBoost:
- A loss function
- A regularization function
- Prunes the tree (decides whether to add a branch or not) by calculating the information gain of each branch and only adding branches with sufficiently large information gains
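For reference, the regularized objective optimized by XGBoost, as written in the XGBoost documentation, sums the loss over $n$ observations and the regularization over $K$ trees:

$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

- Where $l$ is the loss function and $f_k$ is the $k^{th}$ tree
- Where $T$ is the number of leaves in a tree and $w_j$ is the score of the $j^{th}$ leaf
- Where $\gamma$ and $\lambda$ are the regularization hyperparameters used in the pruning and similarity-score formulas below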
Describing XGBoost for Regression
-
Choose a loss function (e.g. squared error $\frac{1}{2}(y_i - \hat{y}_i)^2$ for regression)
- A loss value represents how good our model is at predicting an observation
- Compute an initial prediction that equals $0.5$ (the default base score in XGBoost)
- Compute the residuals $r_i$ between each observation $y_i$ and the current prediction $\hat{y}_i$
-
Fit a regression tree to the residuals $r_i$
- Where $n$ is the number of observations
- Where $r_i$ is the residual $y_i - \hat{y}_i$ for the $i^{th}$ observation
- Where $\hat{y}_i$ is the current prediction for the $i^{th}$ observation
- Perform pruning on the tree by removing the lowest leaves
- Assign output values to each leaf: $O_{leaf} = \frac{\sum_{i \in leaf} r_i}{n_{leaf} + \lambda}$
-
Create a new prediction that is: $\hat{y}_i^{new} = \hat{y}_i^{old} + \eta \cdot O_{leaf(i)}$
- Here, $\eta$ is the learning rate used for regularization ($0.3$ by default in XGBoost)
- Here, $O_{leaf(i)}$ is the output value (the regularized average of residuals) associated with the leaf containing the $i^{th}$ observation
-
Repeat the residual, tree-fitting, and prediction-update steps until we build $M$ different shallow trees
- In practice, typically around $M = 100$ trees are built (a sketch of this boosting loop is shown below)
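A minimal sketch of this boosting loop, using sklearn's DecisionTreeRegressor as a stand-in for XGBoost's own tree-fitting method (described in the next section); the function name is illustrative, and the leaf outputs here are plain residual averages without the $\lambda$ regularization.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predictions(X, y, n_trees=100, eta=0.3, max_depth=2):
    # Start every observation at the same initial prediction (XGBoost's default base score)
    y_pred = np.full(len(y), 0.5)
    trees = []
    for _ in range(n_trees):
        # Residuals between each observation and the current prediction
        residuals = y - y_pred
        # Fit a shallow regression tree to the residuals
        # (a stand-in for XGBoost's similarity-score-based fitting)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Update the prediction with the learning-rate-scaled leaf outputs
        y_pred = y_pred + eta * tree.predict(X)
        trees.append(tree)
    return trees, y_pred
```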
Fitting an XGBoost Regression Tree
- In XGBoost, a tree is fit to the residuals (the tree-fitting step above) using its own fitting method
-
Specifically, the XGBoost fitting method follows these steps:
- Select a single independent variable
- Select two observations with neighboring values of the selected independent variable
- Split on the average of these two observations
- Compute the following similarity score for the root and for each leaf of the split: $\text{Similarity} = \frac{(\sum_{i} r_i)^2}{n + \lambda}$
- Compute the information gain of the split: $\text{Gain} = \text{Similarity}_{left} + \text{Similarity}_{right} - \text{Similarity}_{root}$
- Shift to the next two observations with neighboring values of the selected independent variable
- Continue to shift and compute gains by repeating the splitting and scoring steps above
- Continue to iterate through the other independent variables by repeating the same steps
- Here, $\lambda$ (lambda) is a regularization hyperparameter
-
Again, the similarity score is computed for each leaf
- Leaves with similar residuals produce a larger similarity score
- Leaves with dissimilar residuals produce a smaller similarity score
-
The information gain determines whether a split is good or not
- Splits are considered good if they produce a large information gain
- Splits are considered poor if they produce a small information gain
- Thus, the best split is the one with the highest information gain (a sketch of this scoring is shown below)
- The depth of the tree is limited using a hyperparameter (max_depth in XGBoost)
- The maximum number of leaves is limited using a hyperparameter (max_leaves in XGBoost)
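A minimal sketch of the similarity-score and gain computation over the candidate splits of a single independent variable; the function and variable names are illustrative and not part of the XGBoost API.

```python
import numpy as np

def similarity(residuals, lam):
    # Similarity = (sum of residuals)^2 / (number of residuals + lambda)
    return residuals.sum() ** 2 / (len(residuals) + lam)

def best_split(x, residuals, lam=1.0):
    # Candidate thresholds: averages of neighboring values of the feature
    order = np.argsort(x)
    x_sorted, r_sorted = x[order], residuals[order]
    root_sim = similarity(r_sorted, lam)
    best_gain, best_threshold = -np.inf, None
    for i in range(len(x_sorted) - 1):
        threshold = (x_sorted[i] + x_sorted[i + 1]) / 2
        left, right = r_sorted[: i + 1], r_sorted[i + 1 :]
        # Gain = left similarity + right similarity - root similarity
        gain = similarity(left, lam) + similarity(right, lam) - root_sim
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain
```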
Pruning an XGBoost Regression Tree
- A tunable hyperparameter $\gamma$ (gamma) is used for pruning trees
- Calculate $\text{Gain} - \gamma$ for each branch, starting from the lowest branch in the tree
- If the difference between the gain and $\gamma$ is negative, then we remove the branch
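A small worked check of this pruning rule, continuing the gain computation sketched above; the gain and $\gamma$ values are illustrative.

```python
def keep_branch(gain, gamma):
    # A branch is removed when Gain - gamma is negative
    return gain - gamma >= 0

# e.g. a branch with gain 8.2 is pruned at gamma = 10 but kept at gamma = 5
print(keep_branch(8.2, gamma=10.0))  # False -> prune the branch
print(keep_branch(8.2, gamma=5.0))   # True  -> keep the branch
```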
Improving the Accuracy of Classification Trees
-
If the class labels are imbalanced, then try different balancing methods, such as the following (a sketch using SMOTE is shown after this list):
- Oversampling
- Undersampling
- SMOTE
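As a sketch of one of these options, oversampling the minority class with SMOTE from the imbalanced-learn package before fitting an XGBoost classifier; the function name and settings are illustrative (an alternative is XGBoost's own scale_pos_weight parameter).

```python
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

def fit_balanced(X_train, y_train):
    # Oversample the minority class so both labels have equal coverage
    X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
    # Fit the classifier on the rebalanced training data
    model = XGBClassifier(n_estimators=100, learning_rate=0.3)
    model.fit(X_resampled, y_resampled)
    return model
```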
-
Compare a model with only the most important features against a model including all features (see the sketch after this list):
- Typically, the reduced model keeps roughly the top 20 most important features
- The full model may include 100+ features
- However, XGBoost runs feature selection behind the scenes, so this most likely won't improve the accuracy
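A sketch of this comparison using XGBoost's built-in feature importances, assuming X is a NumPy array; the function name and the cutoff of 20 features mirror the note above and are illustrative.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def compare_feature_subsets(X, y, top_k=20):
    # Fit on all features and rank them by importance
    full_model = XGBClassifier(n_estimators=100)
    full_model.fit(X, y)
    top_idx = np.argsort(full_model.feature_importances_)[-top_k:]

    # Score the full-feature model and the top-k-feature model with cross-validation
    full_score = cross_val_score(XGBClassifier(n_estimators=100), X, y, cv=5).mean()
    reduced_score = cross_val_score(XGBClassifier(n_estimators=100), X[:, top_idx], y, cv=5).mean()
    return full_score, reduced_score
```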
-
Try running dimensionality reduction before classification:
- Most likely, this won't improve the accuracy of the model
- Since XGBoost already does feature selection behind the scenes
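For completeness, a sketch of running PCA before the classifier with a scikit-learn Pipeline; the number of components is illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Reduce to 10 principal components before classification (illustrative value)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", XGBClassifier(n_estimators=100)),
])
# pipeline.fit(X_train, y_train); pipeline.score(X_test, y_test)
```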
-
Try removing outliers, imputing outliers, and imputing missing values
- XGBoost is sensitive to outliers, so this may improve the accuracy
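A sketch of clipping outliers to the 1st/99th percentiles and imputing missing values with the median; the function name and percentile cutoffs are illustrative (note that XGBoost can also handle missing values natively).

```python
import numpy as np
from sklearn.impute import SimpleImputer

def clip_and_impute(X):
    # Clip each feature to its 1st/99th percentiles to soften outliers (illustrative cutoffs)
    lower = np.nanpercentile(X, 1, axis=0)
    upper = np.nanpercentile(X, 99, axis=0)
    X_clipped = np.clip(X, lower, upper)
    # Impute remaining missing values with the per-feature median
    return SimpleImputer(strategy="median").fit_transform(X_clipped)
```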
-
Run hyper-parameter optimization methods, like the following (a random-search sketch is shown after this list):
- Grid search
- Random search
- Bayesian prior weighting
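A sketch of a random search over a few common XGBoost hyper-parameters with scikit-learn; the parameter ranges are illustrative.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Illustrative search space over common XGBoost hyper-parameters
param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 10),
    "learning_rate": uniform(0.01, 0.3),
    "reg_lambda": uniform(0.0, 10.0),   # lambda regularization
    "gamma": uniform(0.0, 10.0),        # pruning threshold
}
search = RandomizedSearchCV(
    XGBClassifier(), param_distributions, n_iter=50, cv=5, scoring="accuracy"
)
# search.fit(X_train, y_train); search.best_params_
```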
-
Apply a mix of feature engineering methods including:
-
Creating meta-features, such as:
- Squaring or taking the log of other features
- Computing the standard deviation of other features
- Calculating the difference of multiple features
-
Creating a feature out of clustering labels (see the sketch below)
- This feature could be the labels from running a k-means clustering of other features
- Then, we'd hope these labels could serve as a predictor and help in predicting the response in our classifier
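A sketch combining these meta-feature ideas with a k-means cluster label as an extra predictor; the function name, column names, and cluster count are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def add_meta_features(df: pd.DataFrame, cols: list, n_clusters: int = 5) -> pd.DataFrame:
    out = df.copy()
    for col in cols:
        # Simple meta-features: squares and (shifted) logs of existing features
        out[f"{col}_squared"] = out[col] ** 2
        out[f"{col}_log"] = np.log1p(out[col] - out[col].min())
    # Row-wise spread and a pairwise difference across the selected features
    out["row_std"] = out[cols].std(axis=1)
    out["pair_diff"] = out[cols[0]] - out[cols[1]]
    # k-means cluster label over the original features, used as a new predictor
    out["cluster_label"] = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(out[cols])
    return out
```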
-
- Stacking and blending multiple classification models together
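A sketch of stacking an XGBoost classifier with other base models via scikit-learn's StackingClassifier; the base-model choices and settings are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Base models whose out-of-fold predictions feed the meta-learner
estimators = [
    ("xgb", XGBClassifier(n_estimators=100)),
    ("rf", RandomForestClassifier(n_estimators=100)),
    ("lr", LogisticRegression(max_iter=1000)),
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
# stack.fit(X_train, y_train); stack.predict(X_test)
```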
Motivating Improvements for Classification Models
- Data-specific improvements (e.g. feature engineering and including more data) are usually better than model-specific improvements
- Model-specific improvements (e.g. ensembling and trying different models) are usually better than hyper-parameter tuning