Tuning Tips
- Importance of hyper-parameters:

| Priority | Hyper-parameters |
|---|---|
| Most important | Learning rate |
| 2nd | Momentum, mini-batch size |
| 3rd | Hidden units, number of layers, learning rate decay |
- Randomly sample hyper-parameters on a log scale. Don’t do grid search! (See the sketch below.)
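A minimal sketch of sampling a learning rate on a log scale with NumPy; the search range 10^-4 to 10^-1 is an illustrative assumption, not a recommendation from these notes:

```python
import numpy as np

# Draw the exponent uniformly, so the learning rate is uniform on a log scale.
def sample_learning_rate(low_exp=-4, high_exp=-1):
    r = np.random.uniform(low_exp, high_exp)  # e.g. r in [-4, -1]
    return 10 ** r                            # learning rate in [1e-4, 1e-1]

candidates = [sample_learning_rate() for _ in range(5)]
print(candidates)
```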
Mini-batch
- Small training set (m <= 2000): use batch gradient descent.
- Large training set (m > 2000): use a mini-batch size that is a power of two (2^n), for example 64, 128, 256, 512 (see the Keras example below).
- Keras Example
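A minimal Keras sketch, assuming a small binary classifier on dummy data; the architecture, data shapes, and batch size of 64 are illustrative assumptions:

```python
import numpy as np
from tensorflow import keras

# Dummy data standing in for a large training set (shapes are assumed).
X_train = np.random.rand(10000, 20)
y_train = np.random.randint(0, 2, size=(10000,))

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Mini-batch size set to a power of two, as recommended above.
model.fit(X_train, y_train, epochs=5, batch_size=64)
```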
Adam Optimizer
The Adam optimizer combines the advantages of both Momentum and RMSprop, enabling quick convergence.
Parameters:
- Learning rate: needs to be tuned
- Beta1: 0.9 (moving average of dW, the momentum term)
- Beta2: 0.999 (moving average of dW^2, the RMSprop term)
- Epsilon: 10^-8
- Tensorflow Example
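A minimal TensorFlow 2 sketch with the parameter values listed above; the toy variable and quadratic loss are illustrative assumptions:

```python
import tensorflow as tf

# Adam with the defaults listed above; the learning rate still needs tuning.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8
)

# One optimization step on a toy quadratic loss.
w = tf.Variable([1.0, 2.0])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.square(w))
grads = tape.gradient(loss, [w])
optimizer.apply_gradients(zip(grads, [w]))
```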
- Keras Example
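A minimal Keras sketch of the same configuration; the placeholder model and loss are illustrative assumptions:

```python
from tensorflow import keras

adam = keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8
)

# Placeholder model just to show where the optimizer is attached.
model = keras.Sequential([keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer=adam, loss="mse")
```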
Learning Rate Decay
The learning rate is reduced as a function of the current epoch number; common schedules include 1/t decay, exponential decay, and square-root decay (see the Keras sketch below).
- Keras Example
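A minimal Keras sketch, assuming a 1/t decay schedule applied through a `LearningRateScheduler` callback; `alpha_0`, `decay_rate`, and the commented alternatives are illustrative assumptions:

```python
from tensorflow import keras

alpha_0 = 0.1      # initial learning rate (assumed value)
decay_rate = 1.0   # decay hyper-parameter (assumed value)

# 1/t decay: alpha = alpha_0 / (1 + decay_rate * epoch).
# Other common choices: exponential decay (alpha = 0.95**epoch * alpha_0)
# or square-root decay (alpha = k / (epoch + 1) ** 0.5 * alpha_0).
def lr_schedule(epoch, lr):
    return alpha_0 / (1 + decay_rate * epoch)

lr_callback = keras.callbacks.LearningRateScheduler(lr_schedule)
# Usage: model.fit(X_train, y_train, epochs=20, callbacks=[lr_callback])
```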
Initialization
The reason we want to refine initialization is to avoid vanishing and exploding gradients in deep networks. For example, suppose g(z) = z and b = 0; then in a 15-layer NN we get y = (W^15) * x, so the gradients become either very large or very small. One way to avoid this is to scale the initial weights by the number of features feeding into the current layer. Some recommended scaling factors include:
- np.sqrt(2 / num_features_last_layer)  (He initialization, typically used with ReLU)
- np.sqrt(2 / (num_features_last_layer + num_features_this_layer))  (Xavier/Glorot-style)
- Tensorflow Example
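A minimal TensorFlow sketch; the layer shapes are illustrative assumptions. `HeNormal` draws weights with std sqrt(2 / fan_in) (the first factor above) and `GlorotNormal` with std sqrt(2 / (fan_in + fan_out)) (the second):

```python
import tensorflow as tf

# He initialization: std = sqrt(2 / fan_in); suited to ReLU activations.
he_init = tf.keras.initializers.HeNormal()
W1 = tf.Variable(he_init(shape=(256, 128)))

# Glorot (Xavier) initialization: std = sqrt(2 / (fan_in + fan_out)).
glorot_init = tf.keras.initializers.GlorotNormal()
W2 = tf.Variable(glorot_init(shape=(128, 64)))
```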
- Keras Example
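A minimal Keras sketch setting `kernel_initializer` per layer; the layer sizes are illustrative assumptions:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(256,),
                       kernel_initializer="he_normal"),
    keras.layers.Dense(64, activation="relu",
                       kernel_initializer="glorot_normal"),
    keras.layers.Dense(1, activation="sigmoid"),
])
```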