Random Forests

Motivating Random Forests

  • Decision trees are easy to build and easy to interpret
  • However, decision trees are typically less accurate than other learning methods in practice
  • In other words, they tend to work great with the data used to create them (i.e. training data), but they are not flexible when it comes to classifying new samples (i.e. testing data)
  • The good news is that random forests combine the simplicity of decision trees with flexibility
  • As a result, random forests typically achieve a large improvement in accuracy over a single decision tree

The Random Forest Algorithm

  1. Create a bootstrapped dataset from our training data

    • Sample from the training data with replacement in order to create a bootstrapped dataset
  2. Create a decision tree using the bootstrapped dataset

    • Only consider a random subset of variables (columns) when choosing each split
    • This keeps the individual trees different from one another, which helps prevent overfitting in the eventual random forest model
  3. Repeat steps 1 and 2 many times so we have many different trees, each built with different columns and trained on different bootstrapped data (see the first sketch after this list)
  4. Predict a new observation (i.e. a testing sample) by running its attributes through each and every decision tree

    • This step can refer to the model evaluation step or the prediction step
    • We typically want to perform model evaluation directly after collecting all of our decision trees
    • This process involves running testing data through our collection of decision trees and checking that the accuracy is high (see the second sketch after this list)
    • We typically predict on genuinely new data only after our model evaluation step
  5. Tally up all of those predictions and choose the option (or class) that received the most votes

    • We refer to this overall process as bagging
    • Specifically, bagging is short for bootstrap aggregating: building trees on bootstrapped data and aggregating their predictions
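
The first sketch below puts the five steps together in Python. It is a minimal illustration under some assumptions rather than a definitive implementation: it uses scikit-learn's DecisionTreeClassifier as the base tree, expects NumPy arrays with non-negative integer class labels, and the names build_forest, predict_forest, and the n_trees default are arbitrary choices for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def build_forest(X, y, n_trees=100, seed=0):
    """Steps 1-3: bootstrap the rows and grow one tree per bootstrapped dataset."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    forest = []
    for _ in range(n_trees):
        # Step 1: sample rows with replacement to create a bootstrapped dataset
        idx = rng.integers(0, n_rows, size=n_rows)
        # Step 2: only a random subset of variables (sqrt of the total) is
        # considered at each split, which keeps the trees different from each other
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X[idx], y[idx])
        forest.append(tree)  # Step 3: repeat to collect many different trees
    return forest


def predict_forest(forest, X_new):
    """Steps 4-5: run each new sample through every tree and take a majority vote."""
    votes = np.stack([tree.predict(X_new) for tree in forest])  # shape: (n_trees, n_samples)
    # For each sample (column), the class with the most votes wins
    # (assumes non-negative integer class labels)
    return np.array([np.bincount(col).argmax() for col in votes.astype(int).T])
```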
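
In practice, scikit-learn's RandomForestClassifier performs the bootstrapping, random variable selection, and voting internally. The second sketch shows the model evaluation and prediction parts of step 4 on a held-out test set; the iris dataset and the specific parameter values are only stand-ins for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)  # steps 1-3 (bootstrapping + random variable subsets) happen here

# Step 4: model evaluation -- run the testing data through the forest and verify high accuracy
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 5: the bagging vote happens inside predict(); here we predict a few new samples
print("predictions:", model.predict(X_test[:5]))
```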
