This post collects the reasoning behind common practices for training neural networks, and maintains a list of references on how to train them.
In backpropagation, we just want to compute the gradient of the example-wise loss with respect to the parameters. The main trick is to apply the chain rule of derivatives wisely. A common property of backpropagation is that if computing the loss(example, parameters) is $O(n)$ computation, then so is computing the gradient.
Figure 1. Backpropagation
Both operations (computing the loss and computing the gradient) are visualized in Figure 1.
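To make the chain-rule bookkeeping concrete, here is a minimal sketch of a forward and backward pass for a one-hidden-layer network with a squared-error loss; the layer shapes, tanh nonlinearity, and loss choice are illustrative assumptions rather than the exact model in the figure.

```python
import numpy as np

# Minimal forward/backward pass for a one-hidden-layer network
# (illustrative sketch; shapes and loss are assumptions, not the post's exact model).
def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = W1 @ x + b1                        # pre-activation of hidden layer
    h = np.tanh(z1)                         # hidden activation
    y_hat = W2 @ h + b2                     # network output
    loss = 0.5 * np.sum((y_hat - y) ** 2)   # example-wise squared error

    # Backward pass: apply the chain rule, reusing each local gradient once.
    d_yhat = y_hat - y                      # dL/dy_hat
    dW2 = np.outer(d_yhat, h)               # dL/dW2
    db2 = d_yhat
    dh = W2.T @ d_yhat                      # propagate to the hidden layer
    dz1 = dh * (1.0 - np.tanh(z1) ** 2)     # multiply by tanh'(z1)
    dW1 = np.outer(dz1, x)
    db1 = dz1
    dx = W1.T @ dz1                         # gradient w.r.t. the input (e.g., word vectors)
    return loss, (dW1, db1, dW2, db2, dx)
```

Note that the backward pass costs about the same as the forward pass, which is the point of the property above.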
We already know that word vectors can be shared across tasks. The main idea of multi-task learning is that, besides the word vectors, the hidden layer weights can be shared too; only the final softmax weights differ per task. The cost function of multi-task learning is the sum of all the cross-entropy errors.
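As a sketch of this weight sharing (the two-task setup, layer sizes, and variable names are illustrative assumptions): both tasks go through the same hidden layer, each task has its own softmax weights, and the total cost is the sum of the per-task cross-entropy errors.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Shared hidden-layer weights and task-specific softmax weights.
# Sizes are illustrative assumptions.
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(50, 100))   # shared hidden layer
U_task1 = rng.normal(scale=0.1, size=(5, 50))      # softmax weights, task 1
U_task2 = rng.normal(scale=0.1, size=(3, 50))      # softmax weights, task 2

def multitask_loss(x1, y1, x2, y2):
    # Shared hidden representation for each task's input
    h1 = np.tanh(W_hidden @ x1)
    h2 = np.tanh(W_hidden @ x2)
    # Task-specific softmax layers
    p1 = softmax(U_task1 @ h1)
    p2 = softmax(U_task2 @ h2)
    # Total cost = sum of the per-task cross-entropy errors
    return -np.log(p1[y1]) - np.log(p2[y2])
```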
Figure 2. Different types of nonlinearities. (Left): hard tanh; (center): soft sign, tanh and softmax; (right): relu
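Since the figure only names these nonlinearities, here is a sketch of their usual definitions (the exact form of hard tanh can vary between sources):

```python
import numpy as np

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)        # linear in [-1, 1], saturates outside

def soft_sign(x):
    return x / (1.0 + np.abs(x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - np.max(z))           # shift for numerical stability
    return e / e.sum()
```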
Gradient checking lets you verify whether there are bugs in your neural network implementation. Just follow the steps below:
Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small $\epsilon$, and estimating the derivative for each parameter as $f'(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$ (see the sketch below)
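A minimal sketch of that step, assuming the loss is exposed as a function `J` of a flat parameter vector (the function names and the choice $\epsilon = 10^{-4}$ are illustrative):

```python
import numpy as np

def gradient_check(J, theta, analytic_grad, eps=1e-4):
    """Compare analytic gradients against centered finite differences.

    J: loss as a function of the flat parameter vector theta.
    eps: perturbation size (1e-4 is an illustrative choice).
    """
    numeric_grad = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta[i]
        theta[i] = old + eps
        plus = J(theta)
        theta[i] = old - eps
        minus = J(theta)
        theta[i] = old                      # restore the parameter
        numeric_grad[i] = (plus - minus) / (2 * eps)
    # Relative error; values that are not very small usually indicate a bug.
    diff = np.linalg.norm(numeric_grad - analytic_grad)
    denom = np.linalg.norm(numeric_grad) + np.linalg.norm(analytic_grad)
    return diff / max(denom, 1e-12)
```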
Before initializing parameters, make sure the input data are normalized to zero mean with small deviation. Some common practices for weight initialization:
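For instance, a minimal sketch of a Xavier/Glorot-style uniform initialization with zero biases (stated as an assumption about which common practice is meant, not as the post's prescription):

```python
import numpy as np

def init_layer(fan_in, fan_out, rng):
    # Uniform(-r, r) with r shrinking as the layer gets wider
    # (Xavier/Glorot-style range; an illustrative choice).
    r = np.sqrt(6.0 / (fan_in + fan_out))
    W = rng.uniform(-r, r, size=(fan_out, fan_in))
    b = np.zeros(fan_out)                  # biases start at zero
    return W, b

W1, b1 = init_layer(100, 50, np.random.default_rng(0))
```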
Gradient descent uses the total gradient over all examples per update; it is very slow and should never be used. The most commonly used method now is mini-batch stochastic gradient descent. With a mini-batch of $B$ examples, the update rule is

$$\theta^{new} = \theta^{old} - \alpha \frac{1}{B}\sum_{i=1}^{B} \nabla_\theta J_i(\theta),$$

where $B$ is the mini-batch size and $\alpha$ is the learning rate.
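A minimal sketch of that update loop (the `grad_fn` interface and the `alpha`, `batch_size`, and `epochs` defaults are illustrative assumptions):

```python
import numpy as np

def sgd(theta, grad_fn, data, alpha=0.01, batch_size=32, epochs=10, rng=None):
    """Mini-batch SGD: theta <- theta - alpha * (average gradient over the batch).

    grad_fn(theta, batch) should return the average gradient over the batch;
    alpha and batch_size are illustrative defaults, not the post's recommendations.
    """
    rng = rng or np.random.default_rng(0)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)          # shuffle examples each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            theta = theta - alpha * grad_fn(theta, batch)
    return theta
```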
The idea of momentum is to add a fraction of the previous update to the current one:

$$v = \mu v - \alpha \nabla_\theta J_t(\theta), \qquad \theta^{new} = \theta^{old} + v,$$

where $v$ is initialized to $0$ and $\mu$ is the momentum coefficient.
When the gradient keeps pointing in the same direction, this will increase the size of the steps taken towards the minimum.
Figure 3. Single SGD update with Momentum
Figure 4. Simple convex function optimization dynamics: without momentum (left), with momentum (right)
However, when using a lot of momentum, the global learning rate should be reduced.
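For concreteness, here is a minimal sketch of a single momentum update, with `mu` as the fraction of the previous update that is carried over (the example values of `alpha` and `mu` are assumptions, not taken from the post):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, alpha=0.01, mu=0.9):
    """One parameter update with classical momentum.

    v is initialized to zeros; mu is the fraction of the previous update
    carried over -- the default values here are illustrative.
    """
    v = mu * v - alpha * grad     # accumulate a fraction of the previous update
    theta = theta + v             # move along the accumulated direction
    return theta, v
```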
The simplest recipe is to keep the learning rate $\alpha$ fixed and use the same value for all parameters. Better results are obtained by allowing learning rates to decrease over time. Options:
AdaGrad adapts the learning rate differently for each parameter, and rare parameters get larger updates than frequently occurring parameters. Let