Vanishing and Exploding Gradients in Deep Learning: Causes and Solutions


This article discusses the problems of vanishing and exploding gradients in deep learning and their solutions. It is divided into three parts. The first part explains why gradient-based updates are used in deep learning. The second part presents the causes of vanishing and exploding gradients. The third part proposes solutions. Readers with a solid background can skip ahead.

The solutions covered mainly include the following:

- Pre-training plus fine-tuning

- Gradient clipping and weight regularization (for exploding gradients)

- Using better activation functions (ReLU and its variants)

- Using batch normalization

- Using residual structures

- Using LSTM networks

Before introducing vanishing and exploding gradients, let me briefly discuss their root cause in deep neural networks: backpropagation. Current deep learning methods let us build deeper networks to handle more complex tasks, for example deep convolutional networks and LSTMs, and the results show that for complex tasks, deep networks work better than shallow ones. However, the current methods for optimizing neural networks are based on the idea of backpropagation: the error computed by the loss function is propagated backward through the gradients in order to drive the updating and optimization of the deep network's weights. There is a reason this setup matters. A deep network consists of many stacked nonlinear layers. Each nonlinear layer can be viewed as a nonlinear function (the nonlinearity comes from the nonlinear activation function), so the entire deep network can be viewed as a composite nonlinear multivariate function.

Our ultimate goal is for this multivariate function to complete the mapping between input and output well. Assuming the desired output for a given input is g(x), optimizing the deep network means finding weights for which the loss reaches its minimum, for example the simplest squared-error loss Loss = ||g(x) − f(x)||², where f(x) is the network's output.

Assuming the loss surface over the weight space looks like the figure below, finding the optimal weights means finding the minimum point in the figure. For this kind of mathematical problem of locating a minimum, gradient descent could not be more suitable.

Part II: Causes of vanishing and exploding gradients

Vanishing and exploding gradients are really two faces of the same situation; details follow below. Vanishing gradients typically occur in two cases: in very deep networks, and when an unsuitable activation function such as sigmoid is used. Exploding gradients generally occur in deep networks when the weight initialization values are too large. We analyze the causes from these two perspectives.

1. Deep network perspective

Consider a simple deep network:

The figure shows a fully connected network with four layers. Assume that the output of the i-th layer is f_i(x), where x, the input of the i-th layer, is the output of the (i-1)-th layer, and F is the activation function. Then

f_{i+1} = F(f_i * w_{i+1} + b_{i+1})

written simply as

f_{i+1} = F(f_i * w_{i+1})

The BP algorithm is based on the gradient descent strategy and adjusts the parameters in the direction of the negative gradient of the objective, so the parameters are updated as w ← w + Δw. Given the learning rate α, we get

Δw = −α * ∂Loss/∂w

If we want to update the weights w_2 of the second hidden layer, the gradient information follows from the chain rule:

Δw_2 = −α * ∂Loss/∂w_2 = −α * (∂Loss/∂f_4) * (∂f_4/∂f_3) * (∂f_3/∂f_2) * (∂f_2/∂w_2)

It is easy to see that

∂f_2/∂w_2 = F'(f_1 * w_2) * f_1

that is, this factor contains f_1, the input of the second hidden layer.

Each of the intermediate factors has the form ∂f_{i+1}/∂f_i = F'(f_i * w_{i+1}) * w_{i+1}, i.e., the derivative of the activation function multiplied by the layer's weight. If this factor is greater than 1, the gradient update grows exponentially as the number of layers increases, i.e., the gradient explodes. If this factor is less than 1, the gradient update information shrinks exponentially as the number of layers increases, i.e., the gradient vanishes. If the mathematics is not intuitive, the following figures make it very intuitive.

Gradient problems in deep networks (figure content from the references):

Note: The hidden layer numbering in the following figure is exactly the opposite of the numbering in the fully connected diagram above.

The curves in the figure show the weight update speed. For the two hidden layers shown, we can see that hidden layer 2 updates its weights more slowly than hidden layer 1.

For a network with four hidden layers, the effect is even more obvious: the fourth hidden layer updates about two orders of magnitude more slowly than the first.
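To make the exponential effect concrete, here is a small, purely illustrative sketch; the per-layer factors are made-up values, not measurements from any real network:

```python
# Each backprop step multiplies the gradient by a per-layer factor F'(z) * w.
# With sigmoid, F'(z) <= 0.25, so modest weights give a factor below 1;
# a factor above 1 (e.g. from large weights) makes the product grow instead.
vanishing_factor = 0.25 * 0.9   # illustrative: sigmoid-style derivative * weight
exploding_factor = 1.0 * 1.5    # illustrative: derivative near 1, large weight

for n_layers in (5, 20, 50):
    print(f"{n_layers:2d} layers: "
          f"vanishing ~ {vanishing_factor ** n_layers:.3e}, "
          f"exploding ~ {exploding_factor ** n_layers:.3e}")
```

After 50 layers the shrinking factor is below 1e-30 while the growing one exceeds 1e8, which are exactly the two failure modes described above.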

Summary: From the deep-network perspective, different layers learn at very different speeds. The layers near the output learn well, while the layers near the input learn slowly; sometimes, even after long training, the weights of the early layers remain close to their random initialization. Hence, one root cause of vanishing and exploding gradients is that backpropagation-based training has inherent weaknesses. This is one reason Hinton proposed capsule networks as an attempt to move away from backpropagation entirely; if that approach becomes widespread, it would be a revolution.

2. Activation function perspective

In fact, as noted when computing the weight update information above, each step requires the partial derivative of the previous layer's activation. Therefore, if the activation function is not chosen properly, e.g., sigmoid, the gradient vanishes easily. The reason is shown in the figure below: the left side shows the sigmoid function and the right side the graph of its derivative. The derivative of sigmoid never exceeds 0.25, so after repeated chain-rule multiplications the gradient vanishes easily. The mathematical expression of the sigmoid function is:

σ(x) = 1 / (1 + e^(−x))
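The 0.25 cap on the sigmoid derivative is easy to check numerically; here is a small NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigma'(x) = sigma(x) * (1 - sigma(x)), maximized at x = 0.
x = np.linspace(-10.0, 10.0, 2001)
deriv = sigmoid(x) * (1.0 - sigmoid(x))
print("max sigmoid derivative:", deriv.max())  # peaks at 0.25, at x = 0
```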

Similarly, with tanh as the activation function, its derivative is shown in the figure below. Tanh does better than sigmoid, but its derivative is still less than 1 everywhere except at zero. The tanh expression is:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Part III: Solutions to vanishing and exploding gradients

3.1 Option 1: Pre-training plus fine-tuning

This method comes from a paper published by Hinton in 2006. To address the gradient problem, Hinton proposed an unsupervised, layer-wise training method. The basic idea is to train one hidden layer at a time: the output of each trained hidden layer is used as the input for training the next hidden layer. This layer-by-layer process is the "pre-training"; after it completes, the entire network is "fine-tuned". Hinton used this method when training deep belief networks: after the pre-training of each layer is finished, the whole network is trained with the BP algorithm. The idea is to find local optima first and then combine them to search for the global optimum. This method has certain advantages, but it is not used much anymore.

3.2 Option 2: Gradient clipping

Gradient clipping is proposed primarily for exploding gradients. The idea is to set a clipping threshold and, when updating the gradient, forcibly constrain any gradient that exceeds the threshold back into that range. This prevents gradient explosion.
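As a sketch of the idea (in NumPy, with function and variable names of my own choosing; TensorFlow's built-in `tf.clip_by_global_norm` does the same job):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # Rescale all gradients together so their combined L2 norm
    # never exceeds clip_norm; gradient directions are preserved.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > clip_norm:
        grads = [g * (clip_norm / global_norm) for g in grads]
    return grads, global_norm

# An "exploded" gradient with norm 50 gets rescaled down to norm 5.
grads = [np.full((10, 10), 5.0)]  # global norm = sqrt(100 * 25) = 50
clipped, norm_before = clip_by_global_norm(grads, clip_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)  # 50.0 5.0
```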

Note: WGAN also has a clipping operation, but it is not the same thing: WGAN clips the critic's weights in order to enforce the Lipschitz condition.

Another way to counter exploding gradients is weight regularization. The most common forms are L1 regularization and L2 regularization, and the various deep learning frameworks provide corresponding APIs. For example, in TensorFlow, after setting the regularization parameters when creating the variables, you can collect the regularization loss directly with the following code:

regularization_loss = tf.add_n(tf.losses.get_regularization_losses(scope='my_resnet_50'))

If no regularizer was attached when the variables were created, the L2 loss can also be computed directly with the following code:

l2_loss = tf.add_n([tf.nn.l2_loss(var) for var in tf.trainable_variables() if 'weights' in var.name])

Regularization limits overfitting by restricting the norm of the network weights. Take a closer look at the form of a regularized loss function:

Loss = (y − W^T x)² + α‖W‖²

Here α is the coefficient of the regularization term. If a gradient explosion occurs, the norm of the weights becomes very large, so the regularization term can partially limit the occurrence of the explosion.

Note: In deep neural networks, the gradient actually disappears more frequently.

3.3 Option 3: ReLU, Leaky ReLU, ELU and other activation functions

ReLU: The idea is simple: if the derivative of the activation function is 1, repeated chain-rule products neither vanish nor explode, and every layer of the network gets updated at the same speed. This is how ReLU was born. First, look at ReLU's mathematical expression:

ReLU(x) = max(0, x)

Its function graph:

From the figure it is easy to see that the derivative of the ReLU function in the positive region is always equal to 1, so using the ReLU activation function in a deep network avoids the vanishing and exploding gradients contributed by the activation function's derivative.

The main contributions of ReLU are:

- It alleviates the vanishing and exploding gradient problem

- The computation is simple and fast

- It accelerates network training

There are also some disadvantages:

- Since the negative part is always 0, some neurons may never be activated ("dead" neurons; this can be partially mitigated by setting a small learning rate)

- The output is not centered at 0

Despite its drawbacks, ReLU is still the most widely used activation function.
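A minimal sketch of ReLU and its derivative (the function names are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0 (undefined exactly at 0).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 1. 1.]
```

The constant derivative of 1 on the positive side is what keeps repeated chain-rule products from shrinking or growing; the constant 0 on the negative side is the "dead neuron" drawback listed above.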


Leaky ReLU is meant to fix the dead zone of ReLU in the negative region. Its mathematical expression is:

LeakyReLU(x) = max(k·x, x)

Generally, the leak coefficient k is set to 0.01 or 0.02, or it is obtained through learning. Leaky ReLU removes the zero-gradient region while keeping all the advantages of ReLU.


The ELU activation function also addresses the zero-gradient region of ReLU. Its mathematical expression is:

ELU(x) = x if x > 0, and α(e^x − 1) if x ≤ 0

Its function graph and derivative are shown below. However, ELU is more expensive to compute than Leaky ReLU because of the exponential.
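A small sketch comparing the two on the negative side (the values k = 0.01 and alpha = 1.0 are assumptions for illustration):

```python
import numpy as np

def leaky_relu(x, k=0.01):
    return np.where(x > 0, x, k * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(leaky_relu(x))  # small linear slope k on the negative side
print(elu(x))         # smooth curve saturating at -alpha on the negative side
```

Both keep a nonzero gradient for x < 0, which is what avoids ReLU's dead region; ELU pays for its smoothness with the exp() call.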

3.4 Solution 4: Batch normalization

Batchnorm is one of the most important techniques proposed since the rise of deep learning, and it is widely used in large networks: it accelerates convergence and improves training stability. Essentially, batchnorm addresses the gradient problems during backpropagation. Batchnorm is short for batch normalization (BN). It normalizes the output signal x of a layer to zero mean and unit variance, which keeps the network's activations in a stable range.

The full batchnorm mechanism is fairly involved, so I will not expand on it in detail here; this part only discusses how batchnorm addresses the gradient problem. During backpropagation, the gradient passing through each layer is multiplied by that layer's weights. A simple example:

In forward propagation, f2 = f1(w^T * x + b); then in backward propagation, ∂f2/∂x = ∂f2/∂f1 * w

The presence of w in the backpropagation formula means that the magnitude of w affects whether the gradient vanishes or explodes. Batchnorm counteracts the amplification or shrinkage caused by w by normalizing each layer's output to a fixed mean and variance, pulling activations back into a well-behaved range and thereby mitigating the vanishing and exploding gradient problems.
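The core normalization step can be sketched in a few lines of NumPy (the learned scale and shift parameters gamma and beta are omitted here):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) over the batch (rows)
    # to zero mean and unit variance -- the heart of BN.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=5.0, size=(64, 3))  # badly scaled activations
y = batch_norm(x)
print(y.mean(axis=0))  # ~ [0 0 0]
print(y.std(axis=0))   # ~ [1 1 1]
```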

3.5 Solution 5: Residual structure

Speaking of residuals, I have to mention this paper: Deep Residual Learning for Image Recognition. For an interpretation of the paper, you can refer to the discussions on Zhihu; here we only briefly introduce how residual connections address the gradient problem.

In fact, the emergence of residual networks marked a turning point in the ImageNet competition. Since residuals were proposed, almost all deep networks use them. Compared with the earlier networks of at most a few dozen layers, a residual network can easily reach hundreds or even a thousand layers without worrying about the gradient vanishing too quickly. The reason lies in the residual's shortcut (skip) connections. A residual unit is shown below:

Compared with the plain feedforward structure of earlier networks, a residual network contains many such cross-layer (shortcut) connections. This structure has a great advantage in backpropagation, as the following formula shows:

∂Loss/∂x_l = ∂Loss/∂x_L * ∂x_L/∂x_l = ∂Loss/∂x_L * (1 + ∂/∂x_l Σ_{i=l}^{L−1} F(x_i, W_i))

The first factor, ∂Loss/∂x_L, is the gradient of the loss at layer L. The 1 inside the parentheses shows that the shortcut mechanism propagates the gradient without attenuation, while the residual part of the gradient has to pass through the layers with weights and does not arrive directly via the shortcut. The residual gradient will not happen to be exactly −1 everywhere, and even when it is small, the presence of the 1 keeps the gradient from vanishing. This makes residual learning easier.

Note: The above derivation is not a strict proof.
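Still, a numeric sketch shows why the "1" matters; the per-block residual gradients below are drawn at random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_blocks = 100
# Small per-block gradient contributions from the weighted residual branch.
residual_terms = rng.uniform(-0.1, 0.1, size=n_blocks)

plain = np.prod(residual_terms)            # plain stacking: product of small terms
with_skip = np.prod(1.0 + residual_terms)  # residual: each factor is 1 + small term

print(f"plain chain:    {plain:.3e}")      # vanishes to essentially zero
print(f"residual chain: {with_skip:.3e}")  # stays on the order of 1
```

With 100 blocks, the plain product of sub-unit factors is astronomically small, while the product of (1 + small term) factors remains of order one.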

3.6 Solution 6: LSTM

LSTM stands for long short-term memory network; its gradients do not vanish easily, mainly because of the complex "gates" inside the LSTM. As the following figure shows, through its internal gates the LSTM can "remember" the residual memory from previous steps when it updates, so it is often used for tasks such as text generation. There are now also CNN-based alternatives to LSTMs; you can try them if you are interested.


References:

1. "Neural Networks and Deep Learning"

2. "Machine Learning", Zhou Zhihua



