Motivating the Tuning Process
- One of the inconveniences of working with neural networks is tuning their hyperparameters
Specifically, we're interested in the following questions:
- What are the most important hyperparameters to tune?
 - What range of values should I tune my hyperparameters over?
 - How should I organize my tuning process?
 
 
Prioritizing Hyperparameters
- It is important to tune each hyperparameter
 - However, some hyperparameters are more important to tune than others
The following are some common hyperparameters:
- α: the learning rate used in our optimization algorithm
 - β: the momentum term used in momentum and adam (where it appears as β₁)
 - β₂: used in rmsprop and adam
 - ε: used in rmsprop and adam
 - L: the number of hidden layers in our network
 - n[l]: the number of hidden units per layer
 - decay rate: the rate at which the learning rate decays over time
 - mini-batch size: the number of examples in each mini-batch
 
 
| Hyperparameter | Priority |
|---|---|
| α | 1 |
| # of hidden units | 2 |
| mini-batch size | 2 |
| β for momentum | 2 |
| # of hidden layers | 3 |
| learning rate decay | 3 |
| β₁ for adam | 4 |
| β₂ for adam | 4 |
| ε for adam | 4 |
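
As a rough reference, the priorities above might translate into a search space like the following Python sketch, where every name and range is an illustrative assumption rather than a value prescribed by these notes:

```python
# A hypothetical search space reflecting the priorities above.
# All names and ranges are illustrative assumptions.
search_space = {
    "alpha": (1e-4, 1.0),          # learning rate (priority 1)
    "hidden_units": (32, 512),     # hidden units per layer (priority 2)
    "mini_batch_size": (32, 512),  # mini-batch size (priority 2)
    "beta": (0.85, 0.999),         # momentum term (priority 2)
    "num_layers": (2, 6),          # number of hidden layers (priority 3)
    "decay_rate": (0.0, 1.0),      # learning rate decay (priority 3)
}

# The adam-specific hyperparameters (priority 4) are rarely tuned;
# their common defaults are usually kept as-is:
adam_defaults = {"beta1": 0.9, "beta2": 0.999, "epsilon": 1e-8}
```
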
Tip 1: Tune using Random Sampling
- In the past, hyperparameters were tested in a systematic manner by looping over a fixed grid of candidate values
 - This technique is called grid search
 - Now, hyperparameters are typically tested using random sampling from a hyperparameter space
 - This is because grid search only works well when the number of hyperparameters is small
 - Visually, the difference between grid search (scenario 1) and random sampling (scenario 2) looks like the following:
 
 
- Here, we notice that only a few distinct values are tested for each hyperparameter in scenario 1
 - Specifically, training 25 models on, say, a 5×5 grid only gives us information about 5 distinct values of each hyperparameter
 - On the other hand, we test far more distinct values for each hyperparameter in scenario 2
 - Specifically, training 25 randomly sampled models gives us information about 25 distinct values of each hyperparameter, as in the sketch below
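
The following is a minimal Python sketch of the two scenarios, assuming a toy cost function (`train_and_evaluate`) in place of a real training run; the ranges and the budget of 25 models are illustrative:

```python
import random

def train_and_evaluate(alpha, beta):
    """Stand-in for training a model and returning its validation cost.
    This toy objective is an assumption made purely for illustration."""
    return (alpha - 0.01) ** 2 + 0.1 * (beta - 0.9) ** 2

# Scenario 1: grid search -- 25 models, but only 5 distinct
# values are ever tried for each hyperparameter.
grid_results = [
    (alpha, beta, train_and_evaluate(alpha, beta))
    for alpha in [0.001, 0.005, 0.01, 0.05, 0.1]
    for beta in [0.8, 0.85, 0.9, 0.95, 0.99]
]

# Scenario 2: random sampling -- the same budget of 25 models
# now covers 25 distinct values of each hyperparameter.
random_results = [
    (a, b, train_and_evaluate(a, b))
    for a, b in [
        (random.uniform(0.001, 0.1), random.uniform(0.8, 0.99))
        for _ in range(25)
    ]
]

print(min(grid_results, key=lambda r: r[2]))
print(min(random_results, key=lambda r: r[2]))
```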
 
Tip 2: Zooming into Smaller Regions
- Once our models are finished training, we'll want to evaluate the hyperparameters to see which ones minimized the cost function
 - After doing this, we'll typically notice that certain regions of the hyperparameter space work very well
 - Therefore, we'll usually focus on that region and sample more densely a second time (or more)
 - In other words, we'll keep focusing on a smaller range of hyperparameters after training
 
- In the above image, the data points represent combinations of hyperparameters
 - A darker data point represents a set of hyperparameters that corresponds to a very small cost
 - A lighter data point represents a set of hyperparameters that corresponds to a very large cost
 - We decided that our second window should be focused around the darker data points, since those sets of hyperparameters tend towards a small cost
 - During the second training run, we would focus on this window and observe more sets of hyperparameters in this specific region, as in the sketch below
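
Here is a minimal coarse-to-fine sketch in Python, reusing the toy `train_and_evaluate` stand-in from above; the window sizes and the choice to keep the best 5 settings are assumptions for illustration:

```python
import random

def train_and_evaluate(alpha, beta):
    # Toy stand-in for a full training run, as in the previous sketch.
    return (alpha - 0.01) ** 2 + 0.1 * (beta - 0.9) ** 2

def sample_window(alpha_range, beta_range, n=25):
    """Randomly sample n hyperparameter settings from a window."""
    return [
        (random.uniform(*alpha_range), random.uniform(*beta_range))
        for _ in range(n)
    ]

# First pass: sample coarsely from the full (assumed) ranges.
coarse = [
    (a, b, train_and_evaluate(a, b))
    for a, b in sample_window((0.001, 0.1), (0.8, 0.99))
]

# Keep the "darker" points, i.e. the settings with the smallest cost.
coarse.sort(key=lambda r: r[2])
best = coarse[:5]

# Second pass: shrink the window around the best settings and
# sample more densely inside it.
alphas = [a for a, _, _ in best]
betas = [b for _, b, _ in best]
fine = sample_window((min(alphas), max(alphas)), (min(betas), max(betas)))
print(len(fine), "settings sampled in the zoomed window")
```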
 
Scaling Hyperparameters Appropriately
- Up until now, we've been randomly sampling hyperparameters uniformly on a linear scale
 - However, we may want to randomly sample hyperparameters uniformly on a different scale
 - In other words, we'll want to first transform a hyperparameter to change its range of values before sampling
 - Specifically, we'll want to randomly sample hyperparameters uniformly on a logarithmic scale
 - When tuning a hyperparameter, we'll typically want to do this if we care about testing many values within a certain small region
 - For example, we may decide to randomly sample a learning rate uniformly on a linear scale between 0.0001 and 1
 - In this scenario, we would be using only about 10% of our resources to search for values of the hyperparameter between 0.0001 and 0.1
 
- However, we may want to use more of our resources to search for values between 0.0001 and 0.1
 - Therefore, we may want to randomly sample uniformly on a logarithmic scale, as in the sketch below
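
A minimal Python sketch of the difference, assuming the learning-rate range of 0.0001 to 1 from the example above:

```python
import random

# Linear-scale sampling between 0.0001 and 1: roughly 90% of the
# samples land between 0.1 and 1, so the region below 0.1 is
# barely explored.
alpha_linear = random.uniform(0.0001, 1)

# Log-scale sampling: draw an exponent r uniformly from [-4, 0]
# and set alpha = 10**r. Each decade (0.0001-0.001, 0.001-0.01,
# 0.01-0.1, 0.1-1) now receives an equal share of the samples.
r = random.uniform(-4, 0)
alpha_log = 10 ** r

print(alpha_linear, alpha_log)
```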
 
Re-evaluating Hyperparameters Occasionally
- Our data is always changing
 - That means the insights we take from our data can change over time as well
 - Therefore, we should always be re-evaluating our hyperparameters to keep our model up to date
There are two different methods used for re-evaluating hyperparameters:
Babysitting one model
- This method refers to tuning the hyperparameters of a single model and observing how its accuracy on a test set changes over time
 - If we notice a high enough accuracy for a set of hyperparameters on a certain day, then we should keep that set of hyperparameters
 - Usually, we use this method when we don't have enough computational resources
 
Training many models in parallel
- This method refers to tuning the hyperparameters of many models in parallel and observing how each model's accuracy on a test set changes over time
 - If we notice a high enough accuracy for one model's set of hyperparameters on a certain day, then we should select that set of hyperparameters
 - Usually, we use this method when we have an abundance of computational resources
 
 
 - Typically, we prefer the second approach if we have enough computational resources
 - If we're not fortunate enough to have an abundance of CPUs and GPUs, then we should use the babysitting method
 
tldr
- When tuning hyperparameters, we should test many possible values using random sampling from a hyperparameter space
 - We should not test many possible values using grid search
 - After training, we should focus on a smaller range of hyperparameters that minimize the cost function
 - By default, we randomly sample hyperparameters uniformly on a linear scale
 - We may want to randomly sample hyperparameters uniformly on a logarithmic scale
 - Typically, we prefer to tune hyperparameters of many models in parallel if we have enough computational resources
 - If we don't have enough resources, then we should monitor the accuracy of one model over time
 