Paper discussion blog for COMP 150DL at Tufts University
Tuesday, February 7, 2017
Tues, Feb 14: Batch Normalization
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Sergey Ioffe, Christian Szegedy, arXiv, 2015. Please comment with your summaries and questions.
Summary by Chris Mattioli: This paper presents a technique for training neural networks known as Batch Normalization. In a nutshell, at every layer in the network, each feature is normalized using that feature's mean and variance within the mini-batch, and then passed through a linear transform whose slope and intercept parameters are learned during training. This normalization at every layer is argued, and shown empirically, to reduce "Internal Covariate Shift" (ICS). ICS refers to the change in the distribution of values at each layer as the parameters of the preceding layers change during training. It can be rather severe in deep models with many non-linearities. The problem with ICS is that it requires more careful parameter settings and initializations, i.e., it makes training slower, more difficult, or even impossible under certain conditions. The idea of Batch Normalization is to fix the mean and variance of the values coming out of each layer, thereby reducing the ICS. When the trained model is finally applied to unseen data, the per-batch statistics are replaced by population statistics, estimated by averaging the means and variances over training mini-batches. Batch Normalization is also shown to make backpropagation through a layer unaffected by the scale of that layer's parameters, which makes it possible to use larger learning rates. Batch Normalization also has a built-in regularization effect. The paper also presents an application of the technique to great effect on the ImageNet data. Questions: "In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization." Why? (This isn't as much a BN question as it is a mini-batch question.) Also, in section 3.1, why do they choose the estimator for the variance that they did instead of taking the variance over the whole dataset? Similarly for the mean?
I can think of two reasons:
1. In mini-batch SGD, we only see a small subset of the entire dataset at each step (that is why they say it is impractical), so it makes sense to estimate the gradient, as well as the mean and the variance, from the mini-batch we see at that step.
2. As they say: "the statistics used for normalization can fully participate in the gradient backpropagation". When you do the backprop and compute the derivatives, only the x_i that are in the mini-batch matter. So I would think that if you used all the points in the training data to estimate the mean and the variance, it would wash out the gradient because of the scaling (1/N instead of 1/m for the estimators).
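For reference in this thread, the mini-batch estimators being discussed can be written out as below. This follows the transform as I read it in the paper, for a single feature over a mini-batch of size m; the small constant \epsilon is there for numerical stability.

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i
\qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2
\qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\qquad
y_i = \gamma\,\hat{x}_i + \beta
```

Both \mu_B and \sigma_B^2 are functions of the m examples in the current mini-batch only, which is why each x_i in the batch contributes to the gradient through them with a 1/m factor.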
The paper presents a novel approach to layer processing called Batch Normalization. Briefly, Batch Normalization rescales the inputs to any given layer so that each feature has a mean of 0 and a variance of 1 within the mini-batch. By keeping the layer’s inputs in this range, one lessens the impact of covariate shift. Additionally, this encourages values to remain centered around zero and thus avoids the flat gradients seen in activation functions such as the sigmoid.
The greatest benefit of Batch Normalization falls into the category of speed of solution convergence. Because Batch Normalization reduces the dependence of gradients on the scale of the parameters or of their initial values, one can greatly increase the learning rate of a system without risking divergence due to initial conditions. Thus, Batch Normalization can allow an algorithm to reach a specified accuracy in a fraction of the typical time.
It is important to note that normalizing a layer can change what that layer can represent. The authors get around this problem by inserting a linear transformation (a learned scale and shift) into the network, which can represent the identity transform. This leads to several new variables that must be trained along with the rest of the network. Hence, it is important to implement both the forward pass and the backpropagation of error through the transform.
Using this method, the authors found great success in comparison to existing image classification accuracies. They were able to match accuracies while using a fraction of the steps. They were also able to improve upon these accuracies by training for longer periods of time.
Batch Normalization appears to be a powerful tool to add to almost any deep network.
Discussion Questions.
Are there certain situations where batch normalization fails? Or makes accuracies worse?
Could one see any improvement by scaling each layer to different means and variances?
Can batch normalization be theoretically proven to decrease training times? Or are we stuck with an experimental conclusion?
-Sam Woolf
Nathan Watts' summary
This paper proposes batch normalization, a method for improving training and reducing overfitting, in some cases without the need for dropout. The initial idea came from the observation that models train significantly better if the input data is whitened, that is, zero centered, normalized, and decorrelated. This acts as a sort of regularization by preventing the model from fitting to biases in the training data. Input normalization also reduces covariate shift, a term which refers to changes in the input distribution, which can cause neurons to saturate as their inputs escape the expected bounds. The paper also introduces the notion of “internal covariate shift,” which refers to changes in the distribution of *internal* activations as the network's parameters change, which may cause later layers to saturate for the reason described above.
The premise the paper is testing is the logical extreme of input normalization: what if the activations are whitened in between *every* layer? Unlike input normalization, this introduces issues for optimization. If the activations are normalized on the forward pass, and this is not accounted for on the backward pass when computing the gradients, weights and biases can explode: extremely large activations still get scaled back to a reasonable range for the next layer, so the loss does not change and the same gradient keeps growing the parameters indefinitely. This problem can be avoided by computing the gradient of the whitening process, but this is very computationally expensive when accounting for the entire training set, and the whitening transform is not everywhere differentiable.
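The blow-up described above can be reproduced with a toy version of the example I believe the paper gives in its Section 2: a layer adds a learned bias b to its input and then subtracts the batch mean. This is only a sketch; the data, the squared-error loss, and the names `centered_output` and `naive_grad_b` are made up for illustration.

```python
def centered_output(u, b):
    """Add bias b to each input, then subtract the batch mean (no variance scaling)."""
    x = [ui + b for ui in u]
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]


def loss(x_hat, target):
    """Mean squared error between the centered outputs and some targets."""
    return sum((xh - t) ** 2 for xh, t in zip(x_hat, target)) / len(x_hat)


u = [1.0, 2.0, 4.0]        # hypothetical layer inputs for one mini-batch
target = [0.0, 0.0, 3.0]   # hypothetical targets (their sum is nonzero on purpose)
b = 0.0
lr = 0.1
history = []

for step in range(5):
    x_hat = centered_output(u, b)
    # "Naive" gradient: pretend the batch mean is a constant, so d x_hat_i / d b = 1.
    # The true derivative is 1 - 1 = 0, because the mean also shifts by b.
    naive_grad_b = sum(2.0 * (xh - t) / len(u) for xh, t in zip(x_hat, target))
    b -= lr * naive_grad_b
    history.append((b, loss(centered_output(u, b), target)))

for b_value, loss_value in history:
    print(f"b = {b_value:.2f}, loss = {loss_value:.4f}")
```

The bias moves by the same amount at every step while the loss stays put, because the centering cancels any change in b. Backpropagating through the mean would give a zero gradient for b and stop the drift.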
To fix this problem, the paper proposes simplifying the computation by zero-centering and scaling each feature to standard deviation 1 (no decorrelation), using *only* mini-batch statistics to normalize, and normalizing each feature independently. These changes dramatically reduce the computational complexity of both the normalization and the differentiation. They also add two parameters per feature which can scale and shift the activations after normalization so that the full range of the activation function’s shape is still available to the model. At inference time, however, the full population statistics are used, so that the output for an example depends only on that example and not on the rest of the batch. They also note that batch normalization works similarly for convolutional neural networks.
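A rough sketch of the switch to population statistics, for a single feature. It assumes equal-sized mini-batches and follows my reading of the paper's inference procedure (average the batch means, and average the batch variances with an m/(m-1) correction). The function names and data are hypothetical.

```python
import math


def batch_stats(batch):
    """Mean and (biased, 1/m) variance of one feature over a mini-batch."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    return mean, var


def population_stats(batches):
    """Average the per-batch statistics; apply the m/(m-1) correction to the variance."""
    m = len(batches[0])  # assumes every mini-batch has the same size m > 1
    stats = [batch_stats(b) for b in batches]
    pop_mean = sum(s[0] for s in stats) / len(stats)
    pop_var = (m / (m - 1)) * sum(s[1] for s in stats) / len(stats)
    return pop_mean, pop_var


def bn_inference(x, pop_mean, pop_var, gamma=1.0, beta=0.0, eps=1e-5):
    """With fixed statistics, batch norm is a plain affine map of a single input."""
    scale = gamma / math.sqrt(pop_var + eps)
    return scale * x + (beta - scale * pop_mean)


batches = [[1.0, 3.0], [2.0, 6.0], [0.0, 4.0]]  # made-up training mini-batches
pop_mean, pop_var = population_stats(batches)
print(pop_mean, pop_var)
print(bn_inference(5.0, pop_mean, pop_var))
```

In practice frameworks usually track these statistics with a moving average during training rather than a separate pass; that variant is not shown here.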
They end by confirming that batch normalization enables higher learning rates to be used safely and regularizes the model. They then detail the experiments that demonstrate their results and discuss ways to speed up training and/or improve accuracy.
Forgot to put my question: can some of the regularization effects of batchnorm be attributed to data augmentation effects, as any specific training sample will appear different if it appears within a different batch?
This paper introduces a technique called batch normalization that can significantly accelerate the training of a network. It also has other benefits, such as making training less dependent on the initialization and helping to keep nonlinearities out of their saturated regimes.
The idea comes from the observation that "internal covariate shift", defined as "the change in the distribution of network activations due to the change in network parameters during training", complicates training, and hence the authors would like to find a reasonable way to remove it from each layer of the network.
Based on previous studies showing that "whitening" could speed up training but is too costly to apply fully at every layer, the authors make two simplifications and arrive at batch normalization. The main idea is that each feature is normalized to have mean 0 and variance 1 within the mini-batch, with the mean and the variance estimated from the examples in that mini-batch. A scaling and a shifting parameter are added and learned during training, so that the normalization is "invertible" if necessary by learning suitable values for these parameters.
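To make the training-time transform and the "invertible" remark concrete, here is a small sketch in plain Python. The function name `batch_norm_forward` and the data are made up; setting gamma to sqrt(variance + eps) and beta to the mean is the choice that undoes the normalization.

```python
import math


def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale by gamma and shift by beta.

    batch: list of m examples, each a list of D features.
    gamma, beta: lists of length D (one pair per feature).
    """
    m = len(batch)
    d = len(batch[0])
    means = [sum(row[j] for row in batch) / m for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / m for j in range(d)]
    out = [
        [gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
         for j in range(d)]
        for row in batch
    ]
    return out, means, variances


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]  # 3 examples, 2 features (made up)
eps = 1e-5

# Plain normalization: gamma = 1, beta = 0 gives zero mean and (almost) unit variance.
normalized, means, variances = batch_norm_forward(batch, [1.0, 1.0], [0.0, 0.0], eps)

# "Inverting" the normalization: this choice of gamma and beta recovers the inputs.
gamma_id = [math.sqrt(v + eps) for v in variances]
recovered, _, _ = batch_norm_forward(batch, gamma_id, means, eps)

print(normalized)
print(recovered)
```

The network is free to learn anything between these two extremes for each feature, which is the point of adding the two parameters.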
Experiments were done to show that the method outperforms the previous state-of-the-art algorithms in both speed and accuracy.
Question: Is there no cost to introducing a lot more parameters, the gammas and betas? That is a total of sum_i 2 x D_i (D_i being the dimension of layer i) more parameters to learn.
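As a back-of-the-envelope check on this question, the count sum_i 2 x D_i can be compared with the number of weights and biases in a fully connected network. The layer widths below are hypothetical, chosen only for illustration.

```python
def extra_bn_parameters(layer_widths):
    """One gamma and one beta per feature in every normalized layer: sum_i 2 * D_i."""
    return sum(2 * d for d in layer_widths)


def dense_parameters(input_dim, layer_widths):
    """Weights plus biases of a plain fully connected network."""
    total = 0
    prev = input_dim
    for d in layer_widths:
        total += prev * d + d
        prev = d
    return total


widths = [512, 256, 10]  # hypothetical layer dimensions D_i
extra = extra_bn_parameters(widths)
base = dense_parameters(784, widths)  # 784 inputs is also just an example

print(extra, base, extra / base)
```

For fully connected layers the extra parameters grow linearly with the layer width while the weights grow with the product of adjacent widths, so the relative overhead in parameter count is small. The extra computation for the batch statistics is a separate cost that this count does not capture.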
This paper extends previous work and observations about neural net covariate shift and presents a novel optimization for stochastic gradient descent. The key insight that the authors manage to leverage is a surprisingly simple one: covariate shift is not only a global phenomenon in neural nets but also a local one. The authors noticed that although whitening the training data gives the network centered and uncorrelated inputs, the outputs of internal layers, which are the inputs to subsequent layers, are not necessarily uncorrelated or zero centered.
They coin the term Internal Covariate Shift for this phenomenon and present a method to reduce its effect on training time. Fully whitening the inputs to each layer would require the expensive computation of covariance matrices at each layer. This is not feasible, so the proposed method approximates whitening by instead dividing the zero-centered values of each feature by the standard deviation calculated for that feature.
The authors make this cheaper still by approximating the means and variances of the activations over the entire training set with the means and variances of small batches.
The authors have also introduced flexibility into this framework by adding two learnable variables (a scale and a shift) to each batch normalization step. The network can learn these variables and prevent normalization from obscuring some features in the data. This for me was the most profound observation.
The results of batch normalization are quite astonishing. Normalized networks can train much faster because they tolerate higher learning rates. Intuitively, the probability of parameters blowing up or vanishing decreases because we always center the activations.
Discussion:
The authors support intuitions about improved learning rate with math and clear writing. However, the authors fail to support their intuition about batch normalization as a regularization technique and instead leave the reader unsatisfied with a brief, passing passage. If batch normalization in essence centers and "rotates" the hyperplane upon which SGD is performed, how is it then penalizing overfitting? This is bizarre, astonishing and deserves more investigation.
This paper presents Batch Normalization, a method to normalize the inputs to neural network layers. Batch Normalization is a valuable tool because it allows for faster training, can reduce the need for dropout, and helps prevent saturating non-linearities, such as the sigmoid activation function, from getting stuck in their saturated regimes. Batch Normalization accomplishes these improvements by normalizing the inputs to the non-linearities between layers of the neural network using the mean and variance of the examples in a given mini-batch. In addition to normalizing these inputs, the Batch Normalization transform applies a linear function to each feature in each training example, scaling by a parameter gamma and then shifting by a bias parameter beta. Beta and gamma are learned parameters, trained together with the other model parameters by gradient descent. This final linear transformation of the data is included to ensure that the Batch Normalization transform can represent the identity transform, i.e., the un-normalized inputs to any layer can be recovered by setting gamma and beta to specific values. During training one must account for the Batch Normalization transform during backpropagation by propagating the gradient of the loss through the Batch Normalization “layer”. When utilizing Batch Normalization it is beneficial to increase the learning rate, remove dropout, reduce weight regularization, accelerate the learning rate decay, and prevent the same training examples from continually appearing together in mini-batches.
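Propagating the gradient through the Batch Normalization "layer" can be sketched for a single feature as below. The chain-rule steps follow the gradients as I understand them from the paper; the toy loss, the data, and the function names are assumptions made only so that the result can be checked against finite differences.

```python
import math


def bn_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for one feature over a mini-batch x."""
    m = len(x)
    mu = sum(x) / m
    var = sum((xi - mu) ** 2 for xi in x) / m
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(xi - mu) * inv_std for xi in x]
    y = [gamma * xh + beta for xh in x_hat]
    return y, (x, x_hat, mu, var, gamma, eps)


def bn_backward(dy, cache):
    """Given dL/dy, return dL/dx, dL/dgamma, dL/dbeta via the chain rule."""
    x, x_hat, mu, var, gamma, eps = cache
    m = len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    dx_hat = [g * gamma for g in dy]
    # The batch variance and mean depend on every x_i, so they get gradients too.
    dvar = sum(dxh * (xi - mu) for dxh, xi in zip(dx_hat, x)) * -0.5 * inv_std ** 3
    dmu = -inv_std * sum(dx_hat) + dvar * sum(-2.0 * (xi - mu) for xi in x) / m
    dx = [dxh * inv_std + dvar * 2.0 * (xi - mu) / m + dmu / m
          for dxh, xi in zip(dx_hat, x)]
    dgamma = sum(g * xh for g, xh in zip(dy, x_hat))
    dbeta = sum(dy)
    return dx, dgamma, dbeta


def toy_loss(y, weights):
    """Arbitrary differentiable loss, used only to test the gradients."""
    return sum(w * yi * yi for w, yi in zip(weights, y))


x = [0.5, -1.2, 3.0, 0.7]
weights = [1.0, 2.0, 0.5, -1.0]
gamma, beta = 1.5, 0.3

y, cache = bn_forward(x, gamma, beta)
dy = [2.0 * w * yi for w, yi in zip(weights, y)]
dx, dgamma, dbeta = bn_backward(dy, cache)


def numeric_dx(i, h=1e-6):
    """Central finite difference of the loss with respect to x[i]."""
    xp = list(x)
    xm = list(x)
    xp[i] += h
    xm[i] -= h
    return (toy_loss(bn_forward(xp, gamma, beta)[0], weights)
            - toy_loss(bn_forward(xm, gamma, beta)[0], weights)) / (2 * h)


max_error = max(abs(dx[i] - numeric_dx(i)) for i in range(len(x)))
print(max_error)
```

The gradients with respect to x sum to zero: shifting every input in the batch by the same amount leaves the output unchanged, which is the effect a naive backward pass would miss.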
Discussion:
Why does Batch Normalization achieve superior performance with reduced weight regularization? Are there any reasons not to use Batch Normalization? Is Batch Normalization an active research topic? What aspects of the Batch Normalization transform must be modified for application to recurrent neural networks?
One problem with training deep neural networks is known as internal covariate shift. During training, in addition to layers adapting to fit the set of training data, layers must also adapt to “fit” each other as they change throughout the training process. This greatly complicates the way in which a deep neural network behaves during training. The paper introduces a method to reduce internal covariate shift, in which the inputs to each layer's nonlinearity are normalized over each batch and then scaled and shifted according to two parameters, which are themselves learned during training. In order to take full advantage of this reduced internal covariate shift, a number of other changes were made to the training procedure for the network. Among these changes, learning rates were increased, dropout was eliminated, L_2 weight regularization was reduced, the learning rate decay was accelerated, local response normalization was no longer used, and training examples were shuffled more thoroughly so that the same examples do not always appear together in a batch. A network using this batch normalization strategy was tested on the ImageNet data with 1000 different classes, and was able to improve on the previous best results, surpassing the accuracy of human raters.
Discussion:
The paper mentioned that the network was trained on less distorted images. What kind of effect will this omission have on the network's performance when it comes to distorted images in real-world applications?
What other methods are there to reduce internal covariate shift? This seems like a huge problem. Might there be a way to estimate the portion of loss attributed to internal covariate shift during training?
This paper presents a method called Batch Normalization to address the Internal Covariate Shift problem via a normalization step that fixes the means and variances of layer inputs. Internal Covariate Shift refers to the change in the distributions of the internal nodes of a deep network in the course of training, which slows down training by requiring lower learning rates and careful parameter initialization. The normalization is done using mini-batch statistics, with two simplifications: the first is to normalize each scalar feature independently so that it has zero mean and a variance of one, and the second is to compute the statistics over mini-batches in stochastic gradient training. The authors conclude that adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training, and that an ensemble of such models performs better than the best known system on ImageNet.
Question: In section 3.3, there are two derivative equations. How can we show that they hold, to get the conclusion that scale doesn’t affect the gradient propagation? And how should we understand ‘larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth’?
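One way to get a feel for this question is a numerical check rather than a proof. Multiplying the weights by a positive scalar a multiplies every pre-activation, their batch mean, and their batch standard deviation by a, so the normalized output is unchanged, and the gradient with respect to the scaled weights comes out a times smaller. The sketch below assumes a > 0, gamma = 1, beta = 0 and eps = 0 so that the equality is exact up to rounding; the data and the toy loss are made up.

```python
import math


def bn(values, eps=0.0):
    """Normalize one feature over the mini-batch (no gamma/beta)."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    return [(v - mu) / math.sqrt(var + eps) for v in values]


def layer(w, inputs):
    """BN(w . u) for each input vector u in the batch."""
    pre = [sum(wj * uj for wj, uj in zip(w, u)) for u in inputs]
    return bn(pre)


def loss(w, inputs, coeffs):
    """Arbitrary linear loss on the normalized outputs, for testing only."""
    return sum(c * y for c, y in zip(coeffs, layer(w, inputs)))


def numeric_grad(w, inputs, coeffs, h=1e-6):
    """Central finite-difference gradient of the loss with respect to w."""
    grad = []
    for j in range(len(w)):
        wp = list(w)
        wm = list(w)
        wp[j] += h
        wm[j] -= h
        grad.append((loss(wp, inputs, coeffs) - loss(wm, inputs, coeffs)) / (2 * h))
    return grad


inputs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
coeffs = [1.0, -2.0, 0.5, 3.0]
w = [0.4, -0.7]
a = 10.0  # positive scale factor
w_scaled = [a * wj for wj in w]

out = layer(w, inputs)
out_scaled = layer(w_scaled, inputs)
grad = numeric_grad(w, inputs, coeffs)
grad_scaled = numeric_grad(w_scaled, inputs, coeffs)

print(out, out_scaled)
print(grad, [a * g for g in grad_scaled])
```

The second print shows the "larger weights lead to smaller gradients" statement: the gradient at a*w is the gradient at w divided by a, so weights that have grown large receive proportionally smaller updates.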
This paper presents a new method which can accelerate deep network training by reducing internal covariate shift.
In deep network training, we usually have many layers and we use gradient descent or its variants to update the weights of each layer. But at each step, when the weights are updated, the distribution of the inputs to the following layers also changes. That means that at each step, parameters need to adapt to inputs with different distributions. This makes training very slow. This phenomenon is called internal covariate shift.
Full whitening of each layer's inputs can reduce covariate shift, but it's costly and not everywhere differentiable. The authors make two simplifications in the paper: the first is to normalize each feature independently so that it has a mean of zero and a variance of 1; the second is to use mini-batch statistics to estimate the mean and variance, since it's impractical to use the whole dataset when using stochastic optimization. Simply normalizing each input of a layer may change what the layer can represent. To address this problem, the authors introduce a linear transformation with new parameters gamma and beta, so that the transform can recover the original inputs if needed.
This batch normalization technique can reduce internal covariate shift, which means it will speed up the training process. Besides, batch normalization can reduce the dependence of gradients on the scale of the parameters or of their initial values, which means we can use a larger learning rate and take less risk in choosing initial parameters.
This paper shows some experiments where batch normalization makes training much faster and gives better results.
Questions to discuss:
1. As I see it, batch normalization is meant to keep the distribution of the inputs to each layer consistent over training as parameters are updated, but having the same mean and variance doesn't mean the distribution is unchanged. Why can batch normalization still work so well? Or in which cases will it not work well?
2. It seems to me that batch normalization depends on the mini-batches. What if the batch size is very small? What if the mini-batch sample is not i.i.d. from the training data distribution?
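On the batch-size question, the two smallest cases can be worked out directly. With a single example the feature equals its own batch mean, so the output is just beta whatever the input is; with two distinct examples the normalized values are always close to -1 and +1, so only their order survives. A small sketch, with a hypothetical helper name:

```python
import math


def bn_single_feature(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time batch norm for one feature over a mini-batch."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in values]


# Batch size 1: the input is erased completely.
one_a = bn_single_feature([3.0])
one_b = bn_single_feature([-50.0])

# Batch size 2: only which example is larger is preserved, not by how much.
pair_a = bn_single_feature([1.0, 2.0])
pair_b = bn_single_feature([1.0, 200.0])

print(one_a, one_b)
print(pair_a, pair_b)
```

This only illustrates the degenerate end of the range; how performance degrades for small but realistic batch sizes is an empirical question that the sketch does not answer.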
Xinmeng Li's summary
This paper presents a new method for accelerating the training of deep neural networks. It differs from previous work in reducing the need for careful tuning of the model hyper-parameters, such as the learning rate used in optimization and the initial values of the model parameters. Batch Normalization fixes the means and variances of layer inputs to accelerate training; reduces the dependence of gradients on the scale of the parameters or of their initial values, which benefits gradient flow through the network; and makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes. The method not only accelerates training by more than ten times compared with the original model, but also improves accuracy by a significant margin. To take full advantage of it, the training setup has to be modified: increasing the learning rate, removing dropout and local response normalization, shuffling training examples more thoroughly, and reducing the photometric distortions applied to the “real” images.
Discussion: Mini-batch normalization depends on how the training examples are shuffled, but how big is the influence? Why might increasing the learning rate further, as in the BN-x30 model, cause the model to train somewhat slower initially but allow it to reach a higher final accuracy? Is that an optimal learning rate? How can one evaluate what a good learning rate is for this algorithm?
Batch Normalization describes the method of normalizing the input to layers of a neural network for each mini-batch. Because this prevents internal distributions from shifting wildly at each update (termed internal covariate shift in the paper), one can use a higher learning rate and be less cautious in choosing initial weights. In addition, using Batch Normalization trains an otherwise identical model much more quickly, and in some cases regularizes the model sufficiently that other methods like Dropout are not needed. To normalize one mini-batch at one layer, the algorithm adjusts the input to have a mean of 0 and a variance of 1, then scales and shifts by parameters learned during the training phase. The authors of the paper applied Batch Normalization to a modified version of the GoogLeNet network, using SGD with momentum as their update rule, and found that it outperforms the original network in both training speed and accuracy.
Discussion: How sensitive is Batch Normalization to mini-batch size? Would using a different update rule or learning rate for the scale and shift parameters have an effect on performance?
Batch normalization is a technique by which you normalize the activation with the mini-batch mean and variance so that the activation has zero mean and unit variance before you apply the non-linearity. This, it turns out, accelerates training, regularizes the model, and does better than the previous state of the art.
The authors repeatedly bring up the input distributions, and how it's a problem when they differ for training and test set. What do they mean by that? And why does normalizing the input distribution accelerate training?
Batch normalization is used to normalize the input to each layer, reducing the confounding factor of internal covariate shift. By normalizing at this level, the authors present an effective way of both (1) speeding up training and (2) improving regularization.
The authors found that by removing or reducing some typical regularization techniques they were able to speed up training or improve the performance of the network. What is the relationship between batch normalization and these other regularization techniques, beyond the fact that they all act as regularizers?
The paper presents a method to speed up training by normalizing, which allowed the authors to decrease the time spent finding appropriate hyper-parameters. They identified that the distribution of network activations shifts due to the change in network parameters during training. They developed a method to reduce this shift (internal covariate shift), which made finding appropriate hyper-parameters faster. The method is to normalize each feature by itself, using a subset of the data (the mini-batch) to estimate the mean and variance of the whole set, a tradeoff made for speed. They incorporate the normalization into the architecture of the layers themselves (stabilizing activation values). Their model yielded a significant increase in speed over the previous state of the art, and they achieved good results with fewer training steps.
Discussion: What kind of issues does this strategy introduce that one might need to be aware of?
In this paper, the authors present an architecture for improving deep network accuracy and training time. In particular, they suggest whitening-like transformations between the layers of a network. They show how these transformations help avoid internal covariate shift, a phenomenon in which unbalanced shifts propagate through layers, epoch after epoch, resulting in either a vanishing gradient that may get stuck in a local optimum or an exploding gradient that will find no optimum. They present an algorithm for calculating the transformations over mini-batches, and then show how this algorithm allows for networks with higher learning rates and makes networks more robust to outliers.
I did have one question. Whitening involves using a linear transformation on a dataset that scales and translates the data. The composition of linear transformations is itself linear. It seems like if we had only linear layers, we could collapse the linear functions so that we didn't have to do the matrix multiplication at each step. Would that make a big difference, computationally? Also, is it ok for us to do this whitening linear transformation on both sides of a nonlinear function, like a ReLU layer, since it may be shifting the discontinuity at zero? I could see that causing unexpected gradient explosions.
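On the collapsing part of the question: once training is done and the normalization uses fixed population statistics, batch norm is an affine map and can be folded into the linear layer that precedes it. During training this does not apply, because the statistics change with every mini-batch. The sketch below does the folding for one output unit; all numbers and function names are made up.

```python
import math


def linear(w, b, x):
    """One output unit of a fully connected layer."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def bn_inference(z, mean, var, gamma, beta, eps=1e-5):
    """Batch norm with fixed (population) statistics."""
    return gamma * (z - mean) / math.sqrt(var + eps) + beta


def fold_bn_into_linear(w, b, mean, var, gamma, beta, eps=1e-5):
    """Return weights and bias of a single linear unit equal to linear followed by BN."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * wj for wj in w], scale * (b - mean) + beta


w, b = [0.5, -1.0, 2.0], 0.1
mean, var, gamma, beta = 0.3, 4.0, 1.5, -0.2  # hypothetical statistics and parameters

w_folded, b_folded = fold_bn_into_linear(w, b, mean, var, gamma, beta)

x = [1.0, 2.0, -0.5]
separate = bn_inference(linear(w, b, x), mean, var, gamma, beta)
folded = linear(w_folded, b_folded, x)

print(separate, folded)
```

The saving is one elementwise scale and shift per layer at inference time, which is small next to the matrix multiplication itself. The second part of the question, about normalizing on both sides of a ReLU, is not something this sketch addresses.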
This paper describes batch normalization, a technique used to assist the training of deep neural networks. First, it describes the learning process of neural networks, typically performed by stochastic gradient descent operating on minibatches of data. It then introduces the problem of internal covariate shift, where the distribution of inputs to the various layers in the network shifts as upstream parameters are being learned. It introduces the method of batch normalization as a solution to this problem. The gist of this technique is to perform normalization on the inputs of each layer's activation during training. Also, to increase the representativeness of layer inputs, scale and shift parameters are added to these normalizations. In addition, the paper describes how this method is compatible with the minibatches used in stochastic gradient descent. After describing how backpropagation is extended to these normalizations, the paper then describes experiments where batch normalization was added to neural networks, and describes how increased learning rates and improved overall accuracy result from using batch normalization.
Discussion: Although the batch normalization increases the representativeness of normalized layer inputs through the scale and shift parameters, I wonder if the theoretical maximum efficiency of a deep network may be adversely affected by implementing batch normalization. Does a simple scale and shift recover all information that was lost during the normalization?
It has been known for a long time that internal covariate shift in neural networks, that is, the change in the probability distribution of the activations of internal layers from one training step to the next, can slow down the process of training them. If a single layer learns a distribution for one batch and is then suddenly presented with a new batch with a completely different distribution, it will have trouble adapting to this new situation. This effect, happening repeatedly over time, makes us choose lower learning rates and be more careful with the weight initialization.
This paper introduces the Batch Normalization method, which helps us solve this problem by introducing a new type of layer into the network architecture, placed right before each nonlinearity, that normalizes the distribution of each minibatch that is input to each internal layer. That way we achieve a mean of zero and a variance of 1. Doing this as an independent layer allows us to perform backpropagation on the parameters of the new layer.
This method has been shown to speed up training and also to improve the accuracy of the networks.
Q: In batch normalization, we are introducing yet another two parameters to our model. With the weight parameters, we choose their initial values carefully and we add methods to the model so that they do not overfit the training set. Are there similar issues with gamma and beta, or do we just fix them to proper values as hyperparameters and forget about the rest?
Batch normalization is a technique that is used to reduce internal covariate shift in a network. By normalizing each batch between layers, the distribution of each layer's inputs is more consistent between batches. As a result, the authors were able to train effectively with a higher learning rate and smaller batch size. This technique ultimately allows more performant training and higher, more generalized accuracy.
The authors show that introducing batch normalizing layers to a network is generally helpful for improving both training time and accuracy. When is it a bad idea to use batch normalization? We discussed in class that the added parameters and network stages introduce additional complexity into the model that could hurt performance. In what sorts of situations would this not be cancelled out by the performance gains?
Nathan Watts' summary
This paper proposes batch normalization, a method for improving training and reducing overfitting without the need for dropout. The initial idea came from the observation that models train significantly better if the input data is whitened, that is, zero centered, normalized, and decorrelated. This acts as a sort of regularization by preventing the model from fitting to biases in the training data. Input normalization also reduces covariate shift, a term which refers to changes in the input distribution, which can cause neurons to saturate as inputs escape the expected bounds. The paper also introduces a notion called “internal covariate shift,” which refers to changes in the distribution of *internal* activations as the network's parameters change, which may cause later layers to saturate for the reason described above.
The premise the paper tests is the logical extreme of input normalization: what if the activations are whitened between *every* layer? Unlike input normalization, this introduces issues for optimization. If the activations are normalized on the forward pass and this is not accounted for on the backward pass when computing the gradients, weights and biases can explode: extremely large activations still get scaled to a reasonable range for the next layer, so the gradient step keeps growing the parameters indefinitely. This problem can be avoided by computing the gradient of the whitening process, but that is very computationally expensive when accounting for the entire training set, and full whitening is not everywhere differentiable.
To fix this problem, the paper proposes simplifying the computation by zero-centering and scaling to standard deviation 1 (no decorrelation), using *only* mini-batch statistics to normalize, and normalizing each scalar feature independently. These changes dramatically reduce the computational complexity of both the normalization and the differentiation. They also add two learned parameters per normalized activation, which scale and shift the activations after normalization so that the full range of the activation function’s shape is still available to the model. At inference time, however, the full population statistics are used, for increased accuracy. They also note that batch normalization works similarly for convolutional neural networks.
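A minimal sketch of the simplified transform described above, for a single scalar feature and in plain Python. The function names, the toy batch, and the value of eps are illustrative assumptions, not code from the paper.

```python
import math

def batch_norm_train(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m  # mini-batch variance (1/m)
    ys = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]
    return ys, mean, var

def batch_norm_infer(x, pop_mean, pop_var, gamma=1.0, beta=0.0, eps=1e-5):
    """At inference, use fixed population estimates instead of batch statistics."""
    return gamma * (x - pop_mean) / math.sqrt(pop_var + eps) + beta

batch = [2.0, 4.0, 6.0, 8.0]  # toy activations (assumed values)
ys, mean, var = batch_norm_train(batch)

# The normalized outputs have (approximately) zero mean and unit variance.
out_mean = sum(ys) / len(ys)
out_var = sum((y - out_mean) ** 2 for y in ys) / len(ys)
print(mean, var, out_mean, out_var)
```

With gamma=1 and beta=0 the output is just the normalized feature; the learned scale and shift are what let the network move away from that default when it helps.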
They end by confirming that batch normalization regularizes the model and enables higher learning rates to be used safely, detailing the experiments that demonstrate their results, and then discussing ways to speed up training and/or improve accuracy.
Forgot to put my question: can some of the regularization effects of batchnorm be attributed to data augmentation effects, as any specific training sample will appear different if it appears within a different batch?
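The premise of this question can be checked with a small sketch: the same activation value is normalized to different outputs depending on which other examples share its mini-batch. The helper name and the toy values are assumptions made for illustration only.

```python
import math

def normalize(xs, eps=1e-5):
    """Zero-center and scale one feature using mini-batch statistics."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

sample = 1.0  # the same activation value, placed in two different batches

in_batch_a = normalize([sample, 2.0, 3.0])[0]
in_batch_b = normalize([sample, 1.5, 20.0])[0]

# The two normalized values differ because the batch statistics differ.
print(in_batch_a, in_batch_b)
```

This only shows that the per-example output depends on the batch; whether that noise explains the regularization effect is the open part of the question.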
Summary from Jie Li:
This paper introduces a technique called batch normalization that can significantly accelerate network training. It has other benefits as well, such as less dependence on initialization and less saturation of nonlinearities.
The idea comes from the observation that "internal covariate shift", which by definition means "the change in the distribution of network activations due to the change in network parameters during training", complicates training, and hence the authors look for a reasonable way to remove it from each layer of the network.
Building on earlier findings that "whitening" speeds up training but is too costly to apply fully to every layer, the authors make two simplifications and arrive at batch normalization. The main idea is that each feature is normalized to have mean 0 and variance 1 within the mini-batch, with the mean and variance estimated from the examples in that mini-batch. A scaling and a shifting parameter are added and learned during training, so that the normalization is "invertible" if necessary by learning these parameters.
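The "invertible" point can be illustrated with a short sketch: if the scale is set to the batch standard deviation and the shift to the batch mean, the transform returns the original inputs. The function name and the toy inputs are assumptions for illustration.

```python
import math

def batch_norm(xs, gamma, beta, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

eps = 1e-5
xs = [0.3, -1.2, 4.0, 2.5]  # toy activations (assumed values)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# Choosing gamma = sqrt(var + eps) and beta = mean undoes the normalization.
ys = batch_norm(xs, gamma=math.sqrt(var + eps), beta=mean, eps=eps)
print(ys)
```

In training these values are not set by hand; the sketch only shows that the identity is among the functions the learned parameters can represent.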
Experiments show that the method outperforms the previous state-of-the-art algorithms in both speed and accuracy.
Question: Is there no cost to introducing many more parameters, the gammas and betas? That is a total of sum_i 2 x D_i (where D_i is the dimension of layer i) more parameters to learn.
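To put the sum_i 2 x D_i count in perspective, here is a small sketch comparing it with the number of weights in a fully connected network. The input size and layer widths are hypothetical, chosen only for illustration.

```python
# Hypothetical fully connected network: 784 inputs, then three layers.
input_dim = 784
layer_dims = [256, 128, 10]

# One gamma and one beta per normalized activation: sum_i 2 * D_i.
bn_params = sum(2 * d for d in layer_dims)

# Weight matrices of the same network, for comparison (biases ignored).
fan_ins = [input_dim] + layer_dims[:-1]
weight_params = sum(n_in * n_out for n_in, n_out in zip(fan_ins, layer_dims))

print(bn_params, weight_params)  # 788 vs 234752
```

Under these assumed sizes the extra parameters are a small fraction of the weights, though this says nothing about the extra computation per step.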
This paper extends previous work and observations about covariate shift in neural nets and presents a novel optimization for stochastic gradient descent. The key insight that the authors leverage is a surprisingly simple one: covariate shift is not only a global phenomenon in neural nets but also a local one. The authors noticed that although whitening the training data centers and decorrelates the data consumed by the network, the outputs of internal layers, which are the inputs to subsequent layers, are not necessarily uncorrelated or zero centered.
They coin the term Internal Covariate Shift for this phenomenon and present a method to reduce its effect on training time. The mathematically optimal way to eliminate Internal Covariate Shift would require the expensive computation of covariances at each layer. This is not feasible, so the proposed method approximates whitening of the features by instead dividing the zero-centered values by the standard deviation calculated for each feature.
The authors simplify further by approximating the mean and variance of the activations over the entire training set with the mean and variance of small batches.
The authors have also introduced flexibility into this framework by introducing two slack variables for each batch normalization step. A neural network can learn these variables and prevent normalization from obscuring some features in the data. This for me was the most profound observation.
The results of batch normalization are quite astonishing. Normalized networks can train much faster because of higher learning rates. Intuitively, the probability of parameters blowing up or vanishing decreases because we always center the activations.
Discussion:
The authors support intuitions about improved learning rate with math and clear writing. However, the authors fail to support their intuition about batch normalization as a regularization technique and instead leave the reader unsatisfied with a brief, passing passage. If batch normalization in essence centers and "rotates" the hyperplane upon which SGD is performed, how is it then penalizing overfitting? This is bizarre, astonishing and deserves more investigation.
Summary by Jason Krone:
This paper presents Batch Normalization, a method to normalize the inputs to neural network layers. Batch Normalization is a valuable tool because it allows for faster training times, removes the need for dropout, and prevents the gradients of non-linearities, such as the sigmoid activation function, from becoming saturated. Batch normalization accomplishes these improvements by normalizing the inputs to non-linearities between layers of the neural network using the mean and variance of the examples in a given mini-batch. In addition to normalizing these inputs, the batch normalization transform applies a linear function to each feature in each training example, scaling by a parameter gamma and then shifting by a bias parameter beta. Beta and gamma are learned parameters, trained by gradient descent along with the rest of the network's weights. This final linear transformation of the data is included to ensure that the batch normalization transformation can represent the identity transform, i.e. the un-normalized inputs to any layer can be recovered by setting gamma and beta to specific values. During the training process one must account for the batch normalization transform during back propagation by propagating the gradient of the loss through the batch normalization “layer”. When utilizing Batch Normalization it is beneficial to increase the learning rate, remove dropout, reduce weight regularization, accelerate the learning rate decay, and prevent training examples from continually appearing together in mini-batches.
Discussion:
Why does Batch Normalization achieve superior performance with a reduced weight regularization? Are there any reasons to not use Batch Normalization? Is Batch Normalization an active research topic? What aspects of Batch Normalization transform must be modified for application to recurrent neural networks?
Sam Burck:
One problem with training deep neural networks is known as internal covariate shift. During training, in addition to layers adapting to fit the set of training data, layers must also adapt to “fit” each other as they change throughout the training process. This greatly complicates the way in which a deep neural network behaves during training. The paper introduces a method to reduce internal covariate shift, in which the inputs and outputs of each layer are normalized, scaled, and shifted for each batch according to two parameters, which are themselves learned during training.
In order to fully take advantage of this newly minimized internal covariate shift, a number of other changes were made to the training procedure for this network. Among these changes, learning rates were increased, dropout was eliminated, L_2 weight regularization was reduced, the learning rate decay was accelerated, local response normalization was no longer used, and training examples were thoroughly shuffled to reduce duplicate examples within a given batch.
A network using this batch normalization strategy was tested using ImageNet sets with 1000 different classes, and was able to improve on the previous best results, surpassing the accuracy of human raters.
Discussion:
The paper mentioned that the network was trained on less distorted images. What kind of effect will this omission have on the network's performance when it comes to distorted images in real-world applications?
What other methods are there to reduce internal covariate shift? This seems like a huge problem. Might there be a way to estimate the portion of loss attributed to internal covariate shift during training?
This paper presents a method called Batch Normalization to address the Internal Covariate Shift problem via a normalization step that fixes the means and variances of layer inputs.
The Internal Covariate Shift problem refers to the change in the distributions of internal nodes of a deep network in the course of training, which slows down training by requiring lower learning rates and careful parameter initialization. The paper normalizes using mini-batch statistics, with two simplifications: the first is to normalize each scalar feature independently, giving it zero mean and a variance of one; the second is to use mini-batches, as in stochastic gradient training. The authors conclude that adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training, and that an ensemble of such models performs better than the best known system on ImageNet.
Question: In section 3.3 there are two derivative equations. How can we show that they hold, so as to conclude that the scale of the weights doesn’t affect gradient propagation? And how should we understand ‘larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth’?
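One way to build intuition for this question is a numerical check rather than a proof. The sketch below assumes a toy two-input linear unit followed by normalization (with eps set to 0 so the invariance is exact) and an arbitrary linear loss; all names and values are illustrative. It checks that scaling the weights by a leaves the normalized output unchanged, and that the gradient with respect to the scaled weights is 1/a times the original gradient, which is the sense in which larger weights get smaller gradients.

```python
import math

def bn(xs):
    """Normalize over the batch (eps omitted so scale invariance is exact)."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return [(x - mean) / math.sqrt(var) for x in xs]

# Toy mini-batch of 2-dimensional inputs and arbitrary loss coefficients.
U = [(1.0, 2.0), (0.5, -1.0), (-2.0, 0.3), (1.5, 1.0)]
T = [0.2, -1.0, 0.7, 0.4]

def loss(w):
    xs = [w[0] * u0 + w[1] * u1 for u0, u1 in U]
    return sum(t * y for t, y in zip(T, bn(xs)))

def grad(w, h=1e-6):
    """Central finite-difference gradient of the loss with respect to w."""
    g = []
    for k in range(len(w)):
        wp, wm = list(w), list(w)
        wp[k] += h
        wm[k] -= h
        g.append((loss(wp) - loss(wm)) / (2 * h))
    return g

w = [0.8, -0.5]
a = 10.0
aw = [a * wk for wk in w]

g_w = grad(w)
g_aw = grad(aw)
print(loss(w), loss(aw))   # same loss for w and a*w
print(g_w, g_aw)           # g_aw is approximately g_w / a
```

Because the loss is unchanged along the direction of w, a gradient step cannot shrink w and tends to grow it, after which the gradients get smaller; that is one reading of the "stabilize the parameter growth" remark.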
Hongyan Wang's summary:
This paper presents a new method which can accelerate deep network training by reducing internal covariate shift.
In deep network training, we usually have many layers and we use gradient descent or its variants to update the weights of each layer. But at each step, when the weights are updated, the distribution of the inputs to the next layers also changes. That means at each step, parameters need to adapt to inputs with different distributions. This makes training very slow. This phenomenon is called internal covariate shift.
Full whitening of each layer's inputs can reduce covariate shift, but it's costly and not everywhere differentiable. The authors make two simplifications in the paper: the first is to normalize each feature independently so that it has a mean of zero and a variance of 1; the second is to use mini-batch statistics to estimate the mean and variance, since it's impractical to use the whole dataset when using stochastic optimization. Simply normalizing each input of a layer may change what the layer can represent. To address this problem, the authors introduce a linear transformation with new parameters gamma and beta such that the transform can recover the original inputs.
This batch normalization technique can reduce internal covariate shift, which means it will speed up the training process. Besides, batch normalization can reduce the dependence of gradients on the scale of the parameters or of their initial values, which means we can use a larger learning rate and take less risk in choosing initial parameters.
The paper shows some experiments where batch normalization makes training much faster and gives better results.
Questions to discuss:
1. As I see it, batch normalization aims to keep the distribution of layer inputs consistent over training as parameters are updated, but having the same mean and variance doesn't mean the distribution is unchanged. Why can batch normalization still work so well? And in which cases will it not?
2. It seems to me that batch normalization will depend on the mini-batches. What if the batch size is very small? What if the mini-batch sample is not drawn i.i.d. from the training data distribution?
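The batch-size half of the second question can be explored with a small simulation. The sketch below assumes a synthetic standard-normal "activation" population and compares how much the mini-batch mean fluctuates for a tiny batch versus a larger one; the population, the batch sizes, and the function name are all illustrative assumptions.

```python
import random
import statistics

random.seed(0)
# Synthetic population of activations (assumed standard normal).
population = [random.gauss(0.0, 1.0) for _ in range(10000)]

def spread_of_batch_means(batch_size, n_batches=200):
    """Standard deviation of the mini-batch mean across many sampled batches."""
    means = [
        statistics.mean(random.sample(population, batch_size))
        for _ in range(n_batches)
    ]
    return statistics.pstdev(means)

small = spread_of_batch_means(2)
large = spread_of_batch_means(64)

# Smaller batches give noisier estimates of the statistics used to normalize.
print(small, large)
```

The simulation only covers i.i.d. sampling; batches that are not i.i.d. would bias the estimates rather than merely add noise, which the sketch does not model.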
Xinmeng Li's summary
This paper presents a new method for accelerating the training of deep neural networks. It differs from previous work in not needing careful tuning of the model hyper-parameters, such as the learning rate used in optimization and the initial values of the model parameters. Batch Normalization fixes the mean and variance of layer inputs to accelerate training; reduces the dependence of gradients on the scale of the parameters or of their initial values, which improves gradient flow through the network; and makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes. The algorithm not only cuts training time by more than ten times compared with the original model, but also beats its accuracy by a significant margin. It does require accompanying changes, such as increasing the learning rate, removing dropout and local response normalization, shuffling training examples more thoroughly, and reducing the photometric distortions so that the network sees more “real” images.
Discussion:
Mini-batch normalization will depend on how the training examples are shuffled, but how big is the influence? What might be the reason that increasing the learning rate further, in the BN-x30 model, causes the model to train somewhat more slowly at first but allows it to reach a higher final accuracy? Is that an optimal learning rate? How can we evaluate what a good learning rate is for this algorithm?
Batch Normalization describes the method of normalizing the input to layers of a neural network for each mini-batch. Because this prevents internal distributions from shifting wildly at each update (termed internal covariate shift in the paper), one can use a higher learning rate, and be less cautious in choosing initial weights. In addition, using Batch Normalization trains an otherwise identical model much more quickly, and in some cases regularizes the model sufficiently that other methods like Dropout are not needed. To normalize one mini-batch at one layer, the algorithm adjusts the input to have a mean of 0 and a variance of 1, then scales and shifts by parameters learned during the training phase. The authors of the paper applied Batch Normalization to a modified version of the GoogLeNet network using SGD with momentum as their update function, and found that this outperforms the original network both in training speed and accuracy.
Discussion:
How sensitive is Batch Normalization to mini-batch size? Would using a different update rule or learning rate for the scale and shift parameters have an effect on performance?
-Lisa Fan
Batch normalization is a technique by which you normalize the activation with the mini-batch mean and variance such that the activation has zero mean and unit variance before you apply the non-linearity. This, it turns out, accelerates training, regularizes the model, and does better than the previous state of the art.
The authors repeatedly bring up the input distributions, and how it's a problem when they differ for training and test set. What do they mean by that? And why does normalizing the input distribution accelerate training?
-- Takuto
Batch normalization is used to normalize the input to each layer, reducing the confounding factor of covariate shift. By normalizing at this level the authors presented an effective way of both (1) speeding up training and (2) improving regularization.
The authors found that by removing or reducing some typical regularization techniques they were able to speed up or improve the performance of the network. What is the relationship between batch normalization and these other regularization techniques, besides the simple fact that they all provide some form of normalization?
The paper presents a method to speed up training by normalizing, which allowed the authors to decrease the time spent finding appropriate hyper-parameters. They identified that the distribution of network activations shifts due to the change in network parameters during training. They developed a method to reduce this shift (internal covariate shift), which made finding the appropriate hyper-parameters faster.
The method is to normalize each feature by itself, using a subset of the data to estimate the mean and variance of the whole set (trading accuracy for speed). They incorporate the normalization into the architecture of the layers themselves (stabilizing activation values). Their model yielded a significant increase in speed over the previous state of the art, and they achieved good results with fewer training steps.
Discussion: What kind of issues does this strategy introduce that one might need to be aware of?
-Cole
In this paper, the authors present an architecture for improving deep network accuracy and training time. In particular, they suggest whitening transformations between each layer within a network. They show how whitening transformations help avoid internal covariate shift, a phenomenon in which unbalanced shifts propagate through layers, epoch after epoch, resulting in either a vanishing gradient that may get stuck in a local optimum or an exploding gradient that will find no optimum. They present an algorithm for calculating the whitening transformations in mini-batches, and then show how this algorithm allows for networks with higher learning rates, and makes networks more robust to outliers.
I did have one question. Whitening involves using a linear transformation on a dataset that scales and translates the data. Linear transformations are commutative with linear functions. It seems like if we had only linear layers, we could collapse the linear functions so that we didn't have to do the matrix multiplication at each step. Would that make a big difference, computationally? Also, is it ok for us to do this whitening linear transformation on both sides of a nonlinear function, like a ReLU layer, since it may be shifting the discontinuity at zero? I could see that causing unexpected gradient explosions.
-Dylan Cashman
This paper describes batch normalization, a technique used to assist the training of deep neural networks. First, it describes the learning process of neural networks, typically performed by stochastic gradient descent, operating on minibatches of data. It then introduces the problem of covariate shift, where the distribution of inputs to the various layers in the network shifts as upstream parameters are being learned. It introduces the method of batch normalization as a solution to this problem. The gist of this technique is to perform normalization on the inputs of each layer's activation during training. Also, to increase the representativeness of layer inputs, scale and shift parameters are added to these normalizations. In addition, the paper describes how this method is compatible with the minibatches used in stochastic gradient descent. After describing how backpropagation is extended to these normalizations, the paper then describes experiments where batch normalization was added to neural networks, and describes how increased learning rates and improved overall accuracy result from using batch normalization.
Discussion: Although the batch normalization increases the representativeness of normalized layer inputs through the scale and shift parameters, I wonder if the theoretical maximum efficiency of a deep network may be adversely affected by implementing batch normalization. Does a simple scale and shift recover all information that was lost during the normalization?
-Jonathan Hohrath
It has been known for a long time that internal covariate shift in neural networks, that is, the change in the probability distribution of the activations of internal layers from one minibatch to the next, can slow down training. If a layer learns a distribution for one batch and is then suddenly presented with a new batch with a completely different distribution, it will have trouble adapting to this new situation. This effect, repeated over time, makes us choose lower learning rates and be more careful with weight initialization.
This paper introduces the Batch Normalization method, which helps us solve this problem by adding a new type of layer to the network architecture, placed right before each nonlinearity, that normalizes the distribution of each minibatch fed into each internal layer. That way we achieve a zero mean and a variance of 1. Doing that as an independent layer allows us to perform backpropagation on the parameters of this new layer.
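As a rough illustration of backpropagation on the new layer's parameters, the sketch below computes the gradients for the scale and shift of a single feature and compares them with finite differences. The gradient for the scale is the sum of each upstream gradient times the normalized value, and the gradient for the shift is the sum of the upstream gradients. The function names, toy inputs, and upstream gradients are assumptions for illustration.

```python
import math

def forward(xs, gamma, beta, eps=1e-5):
    """Batch-normalize one feature; return outputs and normalized values."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    xhat = [(x - mean) / math.sqrt(var + eps) for x in xs]
    ys = [gamma * h + beta for h in xhat]
    return ys, xhat

def param_grads(dys, xhat):
    """Gradients of the loss with respect to gamma and beta."""
    dgamma = sum(dy * h for dy, h in zip(dys, xhat))
    dbeta = sum(dys)
    return dgamma, dbeta

xs = [0.5, -1.0, 2.0, 3.5]     # toy activations (assumed values)
dys = [0.1, -0.2, 0.3, 0.4]    # assumed upstream gradients dL/dy_i
gamma, beta = 1.5, 0.2

ys, xhat = forward(xs, gamma, beta)
dgamma, dbeta = param_grads(dys, xhat)

# Finite-difference check against a loss that is linear in the outputs.
def loss(g, b):
    out, _ = forward(xs, g, b)
    return sum(dy * y for dy, y in zip(dys, out))

h = 1e-6
num_dgamma = (loss(gamma + h, beta) - loss(gamma - h, beta)) / (2 * h)
num_dbeta = (loss(gamma, beta + h) - loss(gamma, beta - h)) / (2 * h)
print(dgamma, num_dgamma, dbeta, num_dbeta)
```

The gradient with respect to the layer's inputs is more involved because the batch mean and variance also depend on those inputs; the sketch leaves that part out.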
This method has been shown to speed up training and also to improve the accuracy of the networks.
Q: In batch normalization, we are introducing yet another two parameters to our model. With the weight parameters, we choose their initial values carefully and we add methods to the model so that they do not overfit the training set. Are there similar issues with gamma and beta, or do we just fix them to proper values as hyperparameters and forget about the rest?
-- Jorge Sendino
Batch normalization is a technique that is used to reduce internal covariate shift in a network. By normalizing each batch between layers, each layer's domain distribution is more consistent between batches. As a result, the authors were able to train effectively with higher learning rate and smaller batch size. This technique ultimately allows more performant training and higher, more generalized accuracy.
The authors show that introducing batch normalizing layers to a network is generally helpful for improving both training time and accuracy. When is it a bad idea to use batch normalization? We discussed in class that the added parameters and network stages introduce additional complexity into the model that could hurt performance. In what sorts of situations would this not be cancelled out by the performance gains?
-Jay DeStories