Improving Deep Neural Networks

Tuning Tips

  • Importance of hyper-parameters:
      1. Most important: learning rate
      2. Second: momentum, mini-batch size
      3. Third: number of hidden units, number of layers, learning rate decay
  • Randomly sample hyper-parameters on a log scale; don't do grid search (see the sketches below).
import numpy as np

r = -4 * np.random.rand()  # r in [-4, 0]
theta = 10 ** r            # hyper-parameter sampled uniformly on a log scale
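
A minimal sketch of random search over two hyper-parameters; train_and_evaluate is a hypothetical helper (not from these notes) that trains a model and returns a validation score:

import numpy as np

best_score, best_config = -np.inf, None
for trial in range(20):
    # Learning rate sampled uniformly on a log scale between 10^-4 and 10^0
    learning_rate = 10 ** (-4 * np.random.rand())
    # Mini-batch size sampled from powers of two
    batch_size = int(np.random.choice([64, 128, 256, 512]))

    score = train_and_evaluate(learning_rate, batch_size)  # hypothetical helper
    if score > best_score:
        best_score, best_config = score, (learning_rate, batch_size)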

Mini-batch

Small training set (m <= 2000): Use batch gradient descent

Large training set (m > 2000): use a mini-batch size that is a power of two, e.g. 64, 128, 256, or 512 (a NumPy sketch of mini-batch splitting follows).
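
A minimal NumPy sketch of shuffling a training set and splitting it into mini-batches; names are illustrative, and X is assumed to have shape (num_features, m) with Y of shape (1, m):

import numpy as np

def random_mini_batches(X, Y, batch_size=64):
    """Shuffle the training set and partition it into mini-batches."""
    m = X.shape[1]                          # number of training examples
    permutation = np.random.permutation(m)  # random reordering of the columns
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]

    mini_batches = []
    for k in range(0, m, batch_size):
        mini_batches.append((X_shuffled[:, k:k + batch_size],
                             Y_shuffled[:, k:k + batch_size]))
    return mini_batches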

## Cheatsheet from above link
def read_images(image_path_list, label_list):  # function name assumed (missing in the notes)
    imagepaths = tf.convert_to_tensor(image_path_list, dtype=tf.string)
    labels = tf.convert_to_tensor(label_list, dtype=tf.int32)
    # Build a TF Queue, shuffle data:
    # slice tensors into single instances and queue them up using threads.
    image, label = tf.train.slice_input_producer([imagepaths, labels],
                                                 shuffle=True)
    # Read images from disk
    image = tf.read_file(image)
    image = tf.image.decode_jpeg(image, channels=CHANNELS)
    # Resize images to a common size
    image = tf.image.resize_images(image, [IMG_HEIGHT, IMG_WIDTH])
    # Normalize from [0, 255] to [-1.0, 1.0]
    image = image * 1.0 / 127.5 - 1.0
    # Create batches
    X, Y = tf.train.batch([image, label], batch_size=batch_size,
                          capacity=batch_size * 8,
                          num_threads=4)
    return X, Y
with tf.Session() as sess:
    # Run the initializer
    sess.run(init)
    # Start the data queue
    tf.train.start_queue_runners()
    # Training cycle
    for step in range(1, num_steps+1):
        sess.run(train_op)
  • Keras Example
model.fit(x_train, y_train,
          epochs=2000,
          batch_size=128)

Adam Optimizer

The Adam optimizer combines the advantages of Momentum and RMSprop, enabling fast convergence.

Typical parameters (a NumPy sketch of the update rule follows the list):

  • Learning rate: needs to be tuned
  • beta1: 0.9 (moving average of dW)
  • beta2: 0.999 (moving average of dW^2)
  • epsilon: 10^-8
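
A minimal NumPy sketch of one Adam update for a single parameter matrix W, including bias correction (all names are illustrative):

import numpy as np

def adam_update(W, dW, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-style moving average of the gradient (first moment)
    v = beta1 * v + (1 - beta1) * dW
    # RMSprop-style moving average of the squared gradient (second moment)
    s = beta2 * s + (1 - beta2) * (dW ** 2)
    # Bias correction (t is the update step, starting at 1)
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Combined update
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

In practice beta1, beta2 and epsilon are usually left at their defaults; the learning rate is the parameter worth tuning.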

  • Tensorflow Example
"""
default setting:
__init__(
    learning_rate=0.001,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-08,
    use_locking=False,
    name='Adam'
)
"""
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
  • Keras Example
keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

Learning Rate Decay

The learning rate is decreased as a function of the current epoch number. Some common schedules are listed below (alpha is the initial learning rate, k a decay constant); a Keras scheduler sketch follows the formulas:

lr = 0.95 ** epoch_num * alpha       # exponential decay
lr = k / np.sqrt(epoch_num) * alpha  # 1/sqrt(epoch) decay
lr = alpha / (1 + k * epoch_num)     # inverse decay
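
The first schedule can be wired into Keras training with a LearningRateScheduler callback; a minimal sketch, assuming an initial rate alpha0 = 0.1:

from keras.callbacks import LearningRateScheduler

alpha0 = 0.1  # initial learning rate (assumed value)

def schedule(epoch):
    # Exponential decay: lr = 0.95^epoch * alpha0
    return 0.95 ** epoch * alpha0

lr_decay = LearningRateScheduler(schedule)
# model.fit(x_train, y_train, epochs=2000, batch_size=128, callbacks=[lr_decay])
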
# Optimizer: set up a variable that's incremented once per batch and controls the learning rate decay.
batch = tf.Variable(0, dtype=data_type())
learning_rate = tf.train.exponential_decay(
    0.01,                # Base learning rate.
    batch * BATCH_SIZE,  # Current index into the dataset.
    train_size,          # Decay step.
    0.95,                # Decay rate.
    staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                       0.9).minimize(loss,
                                                     global_step=batch)
  • Keras Example
keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.95)

Initialization

The reason we want to refine initialization is to avoid vanishing and exploding gradients in deep networks. For example, suppose g(z) = z and b = 0; then in a 15-layer network we have y = (W^15) * x, so the activations and gradients become either very large or very small. One way to avoid this is to scale the initial weights by the number of features feeding into the current layer. Some recommended scales for the initial weights include (a NumPy sketch follows the formulas):

np.sqrt(2 / num_features_last_layer)                              # He initialization (common with ReLU)

np.sqrt(2 / (num_features_last_layer + num_features_this_layer))  # Xavier/Glorot initialization
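
A minimal NumPy sketch of both scales for one weight matrix (layer sizes are illustrative):

import numpy as np

n_prev, n_curr = 512, 256  # fan-in and fan-out of the layer (illustrative sizes)

# He initialization: variance 2 / n_prev, common with ReLU activations
W_he = np.random.randn(n_curr, n_prev) * np.sqrt(2 / n_prev)

# Xavier/Glorot initialization: variance 2 / (n_prev + n_curr)
W_xavier = np.random.randn(n_curr, n_prev) * np.sqrt(2 / (n_prev + n_curr))

b = np.zeros((n_curr, 1))  # biases can safely start at zero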

  • Tensorflow Example
W = tf.get_variable('W', shape=(512, 256), initializer=tf.contrib.layers.xavier_initializer())
tf.contrib.layers.fully_connected(
    inputs,
    num_outputs,
    activation_fn=tf.nn.relu,
    normalizer_fn=None,
    normalizer_params=None,
    weights_initializer=initializers.xavier_initializer(),
    weights_regularizer=None,
    biases_initializer=tf.zeros_initializer(),
    biases_regularizer=None,
    reuse=None,
    variables_collections=None,
    outputs_collections=None,
    trainable=True,
    scope=None
)
  • Keras Example
model.add(Dense(64,
                kernel_initializer='random_uniform',
                bias_initializer='zeros'))
"""
Other choices include:
keras.initializers.Zeros()
keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=None)
keras.initializers.TruncatedNormal(mean=0.0, stddev=0.05, seed=None)
lecun_uniform(seed=None)
glorot_normal(seed=None) # Xavier normal initializer.
...
"""

Tensorflow Tutorials