Sunday, February 19, 2017

Tues Feb 21: AlexNet

ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton. NIPS 2012.

19 comments:

  1. In this paper, a team at the University of Toronto used new methodologies and a large network in order to greatly increase the accuracy of an image classification algorithm on the ImageNet LSVRC-2010 dataset. Specifically, the network included five convolutional layers in addition to three normal, feedforward neural network layers. In order to reach high accuracy, the team used a very large network with over 60 million parameters. As you can imagine, this took a long time to train: over five days on two GPUs. As networks get larger, overfitting is a key concern. The team combated this problem with dropout and data augmentation.

    Why is a CNN's theoretically best performance likely to be worse than that of a standard feedforward NN?
    Would batch normalization help here?

    -Sam Woolf

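    To make the five-convolutional-plus-three-fully-connected structure concrete, here is a minimal PyTorch sketch of an AlexNet-style network. The filter counts and sizes roughly follow the paper, but treat this as an illustrative single-GPU approximation: it omits the two-GPU split and the local response normalization, so it is not a faithful reimplementation.

        import torch
        import torch.nn as nn

        class AlexNetSketch(nn.Module):
            # Rough single-GPU approximation of the 5-conv + 3-FC architecture.
            def __init__(self, num_classes=1000):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
                    nn.MaxPool2d(kernel_size=3, stride=2),
                    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
                    nn.MaxPool2d(kernel_size=3, stride=2),
                    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(kernel_size=3, stride=2),
                )
                self.classifier = nn.Sequential(
                    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
                    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
                    nn.Linear(4096, num_classes),
                )

            def forward(self, x):
                return self.classifier(torch.flatten(self.features(x), 1))

        # A 224x224 RGB batch flows through to 1000 class scores.
        logits = AlexNetSketch()(torch.randn(1, 3, 224, 224))
        print(logits.shape)  # torch.Size([1, 1000])

    Summing p.numel() over this sketch's parameters lands in the neighborhood of the 60 million figure mentioned above.
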
  2. Summary: This paper was a famous early indicator of the effectiveness of convolutional neural networks (CNNs) for image classification. A couple of recent technological advances enabled the success of this network, including larger image databases and improved performance of GPUs. Also, the design of the network allowed it to learn image features without overfitting. The architecture included 8 layers – 5 convolutional and 3 fully connected. It used ReLU nonlinearities to prevent saturation, was trained on multiple GPUs for speed, used overlapping pooling, and had a softmax loss function. To prevent overfitting, data augmentation was performed through image translations, horizontal reflections, and RGB channel intensity modification. Also, dropout was implemented with probability 0.5. Training used stochastic gradient descent with momentum 0.9 and weight decay 0.0005. This network achieved top-1 and top-5 test error rates of 37.5% and 17.0% on the ILSVRC-2010 dataset, vastly outperforming the second-place entry.
    Discussion: Data augmentation was used in this network to reduce overfitting. We haven’t heard too much about this in class – is this something that is typically done in recent networks during training?

    -Jonathan Hohrath

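    On the data augmentation question: this kind of augmentation is still completely standard, and it is usually written as a transform pipeline applied on the fly during training. A minimal sketch with torchvision, in the spirit of the paper (random 224x224 crops of the 256x256 preprocessed image plus horizontal flips); the ColorJitter parameters are illustrative stand-ins for the paper's PCA-based RGB perturbation:

        from torchvision import transforms

        train_augment = transforms.Compose([
            transforms.RandomCrop(224),          # random patch of the 256x256 image
            transforms.RandomHorizontalFlip(),   # 50% chance of a left-right mirror
            transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
            transforms.ToTensor(),
        ])
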
  3. Summary (Chris Mattioli): This paper describes a CNN that achieved, at the time of publication, very low error on an image recognition task. It focuses on the details of the model's architecture and introduces/experiments with some different layer configurations and activation functions. Section 3 covers the architecture: 5 convolutional layers and 3 fully connected layers. The authors estimated that their use of ReLU activations vastly sped up training (something desirable given the size of their training data). Additionally, they added "local response" normalization, which gave them about a percent-and-a-half improvement in validation error. They also applied max-pooling layers in their architecture. Their final architecture looked like: Conv->Norm->Pool->Conv->Norm->Pool->Conv->Conv->Conv->Pool->FC(w/drop)->FC(w/drop)->FC->softmax. They foresaw overfitting being a big problem despite the size of their dataset, so Section 4 discusses the lengths to which they went to prevent it. First, they used "data augmentation": they took random crops of each image, only slightly smaller than the original, along with each crop's horizontal reflection, turning one image into many. They also perturbed the RGB values by performing PCA over the RGB pixel values of the entire dataset and adding multiples of the principal components found. Apart from augmentation, they also used dropout (p = 0.5) for regularization. Their optimization consisted of SGD with momentum (0.9) and weight decay (0.0005). They also manually reduced the learning rate when the loss started to converge. Their results were the best to date on the task.
    Question: I found their method of manually reducing the learning rate at certain points interesting, and I wonder how effective that is theoretically when using momentum. Is the momentum term also adjusted to the new learning rate? And if it's not, wouldn't that term dominate the update in theory, since the new learning rate is even smaller? (A small sketch of one common way to automate this schedule appears below.)
    Food-for-thought question: They mentioned that removing one of their conv layers resulted in a 2% loss of performance, but they didn't mention experimenting with more than 5 conv layers. Was this due to hardware constraints? Do extra layers present not only diminishing returns, but detrimental performance?

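    On the learning-rate question above: one common way to automate the "divide the learning rate when the error stops improving" heuristic is a plateau-based scheduler. A minimal sketch with the hyperparameters reported in the paper (momentum 0.9, weight decay 0.0005, initial learning rate 0.01); the toy validation-error list simply stands in for real epochs. Note that in PyTorch's SGD the learning rate scales the entire momentum-smoothed step, whereas, if I read the paper's update rule correctly, its velocity term carries over unscaled across the drop, which is exactly the interaction the question is asking about.

        import torch

        model = torch.nn.Linear(10, 2)  # stand-in for the real network
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                    momentum=0.9, weight_decay=0.0005)
        # Divide the learning rate by 10 once the monitored error plateaus.
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.1, patience=2)

        fake_val_errors = [0.40, 0.35, 0.33, 0.33, 0.33, 0.33]  # toy plateau
        for err in fake_val_errors:
            optimizer.step()              # a real training epoch would go here
            scheduler.step(err)
            print(optimizer.param_groups[0]['lr'])
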
  4. This comment has been removed by the author.

  5. Sam Burck:

    This paper introduces a new type of CNN, which brought forth a new generation of CNNs - along with a renewed interest in the use of CNNs for image processing applications. The particular CNN described by the paper is now colloquially known as "AlexNet", and consists of 5 convolutional layers followed by 3 fully connected layers. Loss is calculated using the softmax function. One major departure of AlexNet is the use of rectified linear units (ReLUs) as activation functions. Some of the convolutional layers use max pooling to isolate local features and reduce the dimensionality of the following layer. ReLUs are beneficial because they are computationally simple and, unlike sigmoid or tanh units, do not saturate for positive inputs, so training does not stall when activations are large. Overfitting was handled by the use of data augmentation (making multiple images from single images through various means within the training set) and dropout. The network took roughly 5-6 days to train on a pair of high-end gaming GPUs, using a set of 1.2 million images. AlexNet achieved record results on the ILSVRC-2010 test set and won the ILSVRC-2012 competition by a significant margin.

    Discussion:

    Are ReLUs too good to be true? What benefits might sigmoid or tanh activation functions offer?

    Is there something magical about 5 conv. layers and 3 fully connected layers?

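    On the ReLU question: one way to see the saturation argument is to compare gradients at large input magnitudes. A tiny sketch (nothing here is from the paper, just autograd on the two activations):

        import torch

        x = torch.tensor([-6.0, -1.0, 0.5, 6.0], requires_grad=True)

        torch.tanh(x).sum().backward()
        print(x.grad)   # ~0 at +/-6: tanh saturates and gradients vanish

        x.grad = None
        torch.relu(x).sum().backward()
        print(x.grad)   # 0 or 1: no saturation on the positive side

    The flip side, relevant to the "too good to be true" question, is the zero gradient for negative inputs, which is one reason saturating or leaky alternatives are still discussed.
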
  6. In this paper, a research group at the University of Toronto introduced a novel CNN architecture that they developed. This CNN, later known as AlexNet, was able to beat the state-of-the-art visual recognition systems on the ImageNet LSVRC-2010 challenge.

    This architecture combined 5 convolutional layers and 3 fully connected layers and achieved top-1 and top-5 error rates of 37.5% and 17.0% in the mentioned contest. The authors also introduced max pooling layers between some of the convolutional layers and ReLU activation functions after each convolution and each fully connected layer, an innovation that has been shown to improve network performance. After the ReLU nonlinearity in some of the layers, a local response normalization layer was introduced to improve generalization.

    The network was split across two GTX 580 GPUs, with the kernels in each layer distributed evenly between both GPUs and the GPUs communicating only in certain layers (e.g., the third convolutional layer takes input from both). With this setup, using SGD with momentum, it took 5-6 days to train the network.

    Discussion:

    When using BN, we add learnable gamma and beta parameters so that the new transformation can represent the identity function. Here it seems that they add a new normalization layer with some hand-picked hyperparameters, but without accounting for this nor discussing backpropagation through this layer. Could this affect the performance of the CNN?

    It looks like each fully connected layer has a lot more parameters than each convolutional layer. Could we obtain the same results by reducing the number of FC layers and adding convolutional layers and so reducing the complexity of the model?

    -- Jorge Sendino

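    On the question about fully connected versus convolutional parameters, a quick count makes the imbalance explicit. A rough sketch with layer sizes following the paper's description (single-GPU view, biases included, no attempt to reproduce the exact two-GPU split):

        import torch.nn as nn

        conv = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.Conv2d(96, 256, 5, padding=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.Conv2d(384, 384, 3, padding=1),
            nn.Conv2d(384, 256, 3, padding=1),
        )
        fc = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.Linear(4096, 4096), nn.Linear(4096, 1000),
        )

        count = lambda m: sum(p.numel() for p in m.parameters())
        print(count(conv), count(fc))  # roughly 3.7M conv vs 58.6M fully connected
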
  7. Jie Li's summary:

    This paper introduced a particular architecture of deep convolutional neural network, later known as AlexNet, and showed its performance on the well-known ImageNet competitions ILSVRC-2010 and ILSVRC-2012. The key features of this CNN include using ReLU activations, pooling layers, local response normalization, etc. The optimization scheme chosen was SGD with momentum. The architecture consists of 5 convolutional layers with pooling and 3 fully connected layers. Dropout with a rate of 0.5 and two different kinds of data augmentation are also introduced to prevent overfitting. The network outperformed its competitors, and the authors believe that the performance could be improved further with faster GPUs and more data.

    Discussion:
    Any intuitions about how to choose the architecture? It seems there are many things to choose but little reasoning explained.

  8. This paper presents a very successful convolutional network. It's five convolutional layers deep and, according to the authors, every layer counts. The conv net splits the work across two GPUs for speed. The layers on separate GPUs don't communicate with one another except in the third convolutional layer. Remarkably, the two GPUs learn distinct sets of filters (e.g. the filters on one GPU are largely color-agnostic while those on the other are color-specific). Since the network was huge, they carefully avoided overfitting via data augmentation (translations and a PCA-based color perturbation) and dropout.

    They also apply a funky normalization after ReLU. The normalization is based on the sum of activations from other filters (in the same input region). As Sam W said, I wonder how this compares to batch normalization.

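    On the comparison with batch normalization: the paper's "funky" scheme is what frameworks now call local response normalization. It divides each activation by a term built from neighboring channels at the same spatial position and has no learned parameters, whereas batch normalization uses per-channel batch statistics plus a learned scale and shift. A quick usage sketch (the LRN hyperparameters are the ones reported in the paper):

        import torch
        import torch.nn as nn

        x = torch.relu(torch.randn(8, 96, 55, 55))   # e.g. activations after conv1 + ReLU

        lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)
        bn = nn.BatchNorm2d(96)                       # learned scale/shift per channel

        print(lrn(x).shape, bn(x).shape)              # both keep the (8, 96, 55, 55) shape
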
  9. Nathan Watts
    In this paper, the authors describe a neural network architecture which beat previous records for top-5 performance on the ImageNet dataset, a massive dataset of natural images with 1000 categories. Their network used 5 convolutional layers, followed by 3 fully-connected layers. The network used a few unique features which dramatically improved performance over previous convolutional neural networks. They used ReLU nonlinearities after every convolutional and fully-connected layer instead of tanh or sigmoid, which prevents the units from saturating and is less computationally intensive, resulting in much faster convergence. Additionally, they parallelized training across multiple GPUs, which increased both training speed and model size. They also normalized the convolutional layers using a brightness heuristic. The most novel technique used, however, was overlapping pooling filters. This means that after pooling, the same feature might be represented in two adjacent locations if it fell between two pooling filters, thus preserving additional positional information.

    The model showed an extreme tendency for overfitting. They used two techniques to combat this: to avoid overfitting in the convolutional layers, they used data augmentation, randomly translating or mirroring the images, and adjusting the intensities of the three color channels. This artificially inflated the size of the training set, which minimized the effect of outliers and prevented repetition of training samples. Overfitting in the fully connected layers was avoided using 50% dropout.

    Training was performed with a small amount of weight decay and a relatively high momentum term (0.0005 and 0.9 respectively). Learning rate annealing was also used, dividing the learning rate by 10 when the validation error rate stopped improving. This resulted in state-of-the-art (for the time) performance on the ImageNet dataset.

    Questions:
    While the overlapping pooling can preserve more positional data, could it also result in false positives in the case of repeating patterns, since the same pattern appearing in the center of each of two filters vs that pattern appearing in between two overlapping filters would appear the same after pooling?
    Doesn’t the manual learning rate annealing carry the risk of damaging repeatability?

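    To make the overlapping-pooling point concrete: "overlapping" just means the pooling window (3x3) is larger than the stride (2), so adjacent windows share a row or column of inputs. A tiny sketch comparing it with conventional non-overlapping 2x2, stride-2 pooling:

        import torch
        import torch.nn as nn

        x = torch.randn(1, 96, 55, 55)                         # e.g. conv1 output

        overlapping = nn.MaxPool2d(kernel_size=3, stride=2)    # the paper's choice
        non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)

        print(overlapping(x).shape)      # torch.Size([1, 96, 27, 27])
        print(non_overlapping(x).shape)  # torch.Size([1, 96, 27, 27]); same size here,
                                         # but the 3x3 windows overlap by one row/column
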
  10. In the paper, the authors apply state-of-the-art techniques to train a deep convolutional neural network for the ImageNet LSVRC-2010 image classification contest. The network was deeper than the previous state of the art, using new techniques including ReLU nonlinearities and dropout to make a single larger network practical, as opposed to training multiple networks in an ensemble, which was done previously with the best performing models. With minimal preprocessing, and carefully avoiding overfitting with dropout, data augmentation, and weight decay, they are able to beat the previous state of the art by a wide margin.

    The paper describes how the authors used "brightness" normalization in some convolutional layers, but only a mean shift at the preprocessing step. I'm curious why they didn't shift the mean within the network, and why they didn't perform any normalization in the fully connected layers.

    The authors split the kernels in the convolutional layers and limited their communication due to computational limitations. Have there been any follow-up papers exploring whether that architectural decision contributed to their success? Have people retrained the network without the separation on current hardware to compare?

    -Jay DeStories

  11. Summary: A team from U of T (including the eponymous Alex) created a deep network with convolutional and fully connected layers that achieved far superior performance to the previous state of the art in the ImageNet contest. To achieve this, they implemented one of the largest convolutional networks at the time and used various new techniques. Convolutional layers were used, the authors claim, to help bring in more "knowledge" to make up for the fact that they perceive image classification as a problem that would require far more data than the ImageNet dataset provides, and because they are faster and easier to train (than standard layers). The authors also used some new or non-standard techniques (at the time):
    • Activation: ReLU (rectified linear units) was used as opposed to traditional saturating functions, which makes training significantly faster.
    • Normalization: They used local response normalization, which has a few hyperparameters they chose using a validation set. The local response normalization looks over a number of "adjacent" kernel maps (at the same spatial position), and so introduces a sort of lateral inhibition.
    • Pooling: A modified overlapping pooling was used which improved performance.
    The authors note that their network has 60 million parameters, and thus with their dataset it is at considerable risk of overfitting, so they used two ways to combat it. The first is data augmentation, which they achieved by taking slightly smaller random patches of each image (and their horizontal reflections), and also by performing PCA on the RGB values to perturb the pixel colors. The second strategy for avoiding overfitting is their implementation of dropout (with 0.5 probability a neuron outputs 0).
    This network was a famous indicator that large convolutional neural nets could be very effective.

    Questions: I would like to know more about local response normalization (how it works and who is using it).
    Also, how do convolutional layers bring “knowledge”, and how could their need potentially be obviated by a huge data set?

    -Cole

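    Since the question asks how local response normalization works: each activation is divided by a term built from the squared activations of the n neighboring kernel maps at the same spatial position; the paper uses k=2, n=5, alpha=1e-4, beta=0.75 chosen on a validation set. A direct, unoptimized sketch of that formula (framework built-ins may scale alpha or handle the channel edges slightly differently):

        import torch

        def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
            # b[i] = a[i] / (k + alpha * sum_j a[j]^2)^beta, where j runs over the
            # n kernel maps adjacent to i (clipped at the channel edges).
            N = a.shape[1]
            out = torch.empty_like(a)
            for i in range(N):
                lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
                denom = k + alpha * (a[:, lo:hi + 1] ** 2).sum(dim=1)
                out[:, i] = a[:, i] / denom ** beta
            return out

        a = torch.relu(torch.randn(2, 96, 13, 13))   # ReLU output of some conv layer
        print(local_response_norm(a).shape)          # same shape as the input
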
  12. Xinmeng Li's Summary

    This paper presents a method to classify ImageNet images using a deep convolutional neural network with five convolutional layers and three fully connected layers. The method in the paper has better performance than the state-of-the-art approaches, with a lower error rate and less training time. The fully connected layers are regularized with "dropout", which consists of setting the output of each hidden neuron to zero with probability 0.5 to avoid overfitting. The method also benefits from using multiple GPUs to compute the layers of the network and from non-saturating neurons. However, GPU memory limited the network to five convolutional layers, although more layers would likely give better performance.

    Discussion:
    The images are preprocessed as follows: "rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image" (a sketch of this pipeline appears below). Is this the best way to preserve the image information? Is there a better way of preprocessing the images?

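    The preprocessing quoted above maps directly onto a standard transform pipeline. A minimal sketch with torchvision (passing a single integer to Resize scales the shorter side while keeping the aspect ratio):

        from torchvision import transforms

        preprocess = transforms.Compose([
            transforms.Resize(256),      # shorter side -> 256, aspect ratio preserved
            transforms.CenterCrop(256),  # central 256x256 patch
            transforms.ToTensor(),
        ])
        # At training time the paper then takes random 224x224 crops (and flips)
        # of this 256x256 image as data augmentation.
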
  13. Alex Tong's Summary

    This paper presents a learning system that works significantly better on this scale of data than any previous system. The authors made many compromises based on the available computing power, but were still able to show a significant improvement over other systems. Significant thought was put into reducing training time and reducing overfitting, the most common problems with networks of this size.

    I'm curious about the relative computing power available to us now. Are networks now trained on multiple GPUs? If so, is this an automatic or a manual adaptation? What does the trajectory of GPU computing power look like, and when, if ever, will we hit a wall similar to the heat/clock-speed tradeoff of CPUs?

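    On the multi-GPU question: yes, training across several GPUs is now routine, and frameworks largely automate the simplest form of it (data parallelism, where each GPU gets a slice of the batch). Note this differs from the paper's scheme, which splits the model's kernels across GPUs. A minimal sketch, runnable even without a GPU:

        import torch
        import torch.nn as nn

        model = nn.Sequential(nn.Conv2d(3, 96, 11, stride=4), nn.ReLU())
        if torch.cuda.device_count() > 1:
            # Replicates the model on each GPU and splits every batch across them.
            model = nn.DataParallel(model)
        model = model.to('cuda' if torch.cuda.is_available() else 'cpu')
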
  14. In this paper the authors describe an architecture they used to get major improvements on the best-to-date accuracy on an image recognition dataset. In particular, they used a deep network. They outline the architecture as well as several tricks they used to improve accuracy, avoid overfitting, and to train efficiently.

    They use 5 convolutional layers and 3 fully connected layers. The convolutional layers make use of the semantic meaning of locality in pictures, making such a network feasible to compute. In addition, they use ReLU activations, local response normalization (reminiscent of, though different from, batch normalization), overlapping pooling layers, and a larger set of neurons afforded by using two GPUs concurrently to improve training. To avoid overfitting, they augment their data by taking sub-images of each picture, and also by perturbing each pixel's RGB values along the dataset's principal components.

    I think it's a really impressive paper. Each one of their tricks seems to be a worthy contribution in its own right. Were these tricks used in the past? I was particularly impressed with the RGB pixel perturbations for data augmentation. I also really liked how they showed in Figure 4 the picture similarities their final layer found.

    My biggest question coming out of it is how the convolutional layers actually fit. They say the images are 224 X 224 X 3, and then they use 96 kernels to cover that space. First, 96 isn't a square, so it seems like the kernels can't squarely fit over the image. I also don't know what they mean by the stride.

    -Dylan Cashman

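    On the "96 kernels" question: 96 is the number of output channels (each 11x11x3 kernel is slid across the whole image to produce its own 55x55 feature map), not a spatial tiling, and the stride is how many pixels the kernel moves between applications. A small sketch of the bookkeeping; note the often-remarked quirk that 224 only works out to 55 if the effective input is 227, e.g. via a little padding:

        import torch
        import torch.nn as nn

        def conv_out(size, kernel, stride, pad=0):
            # Standard convolution arithmetic: floor((W + 2P - K) / S) + 1
            return (size + 2 * pad - kernel) // stride + 1

        print(conv_out(227, 11, 4))         # 55
        print(conv_out(224, 11, 4, pad=2))  # 55 as well, with padding

        conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)
        x = torch.randn(1, 3, 227, 227)
        print(conv1(x).shape)               # torch.Size([1, 96, 55, 55])
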
  15. I'm Ben Papp and this is my summary:

    This paper outlines a network which made record-breaking strides in identifying high-resolution images, trained on over a million images from the ImageNet LSVRC-2010 competition. The network uses 5 convolutional layers and 3 fully-connected layers, parallelized across 2 GPUs, to create a network deep enough to analyze and form intuitions about images from ImageNet's massive dataset. To help the results generalize, the network uses max pooling, local normalization, and ReLU layers as part of its architecture. The training images were randomly cropped and flipped horizontally, and principal component analysis was applied to their RGB values, to vary the input data enough to prevent overfitting. The fully connected layers also use dropout with probability 0.5 to force redundancy within those layers, which helps the network make more nuanced high-level observations. Learning was done with stochastic gradient descent with weight decay and momentum. In the competition, the network achieved top-1 and top-5 error rates of 37.5% and 17.0%, beating the next best competitors by over 8 percentage points in each category.

    I would love to form a better intuition about what their local normalization accomplishes by taking into account the n nearby kernels - does this make the weights of the next layer more similar to one another, or more distinct?

  16. In this paper, a new network consisting of 5 convolutional layers and 3 fully connected layers is introduced. The subtlety of the work lies in the design of those convolutional layers; despite containing only around one percent of the network's parameters, they play a significant role in the accuracy of the system. The network is tested on a down-sampled subset of the ImageNet dataset and the errors are measured as top-1 and top-5 rates. The results are better than the previous methods by significant margins. Another contribution of this paper is a fast implementation of convolution across 2 GPUs, which made training much more efficient.

    My main question is about the way the two GPUs are used in training the network. Figure 2 shows that the upper and lower parts are trained largely independently and make contact only at certain layers. I would like to know more about the criteria by which the authors decided to choose those layers, and the way of "trading information" between the two GPUs.

  17. Amit Patel:

    Summary: The researchers in this paper trained a deep convolutional neural network to classify over 1.2 million images and achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively. The team used ReLU nonlinearities in order to train networks several times faster than equivalent networks with tanh units. They also trained on multiple GPUs with a parallelization scheme that puts roughly half of the neurons on each GPU and has the GPUs communicate only in certain layers. The architecture of the net was eight layers with weights: the first five were convolutional and the last three were fully connected. The architecture featured over 60 million parameters, so overfitting was a concern. The team combated overfitting with two forms of data augmentation: the first involved image translations and horizontal reflections; the second used PCA on the RGB pixel values of the training set (a rough sketch of this appears below). The researchers also used dropout on the first two fully-connected layers.

    Overall, deep convolutional neural networks appear to achieve very impressive results on challenging datasets in supervised learning.

    In the discussion of the paper, it is mentioned that unsupervised pre-training might have helped. Have there been more recent papers that have attempted this, given the increase in computational power?

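    To make the second form of augmentation concrete, here is a rough numpy sketch of the PCA-based color perturbation: estimate the covariance of RGB values over the training set, take its eigenvectors and eigenvalues, and add a small random multiple of each component to every pixel of an image. The 0.1 standard deviation follows the paper; everything else (array shapes, clipping) is an illustrative choice.

        import numpy as np

        def pca_color_augment(image, rgb_pixels, sigma=0.1):
            # image: HxWx3 float array in [0, 1]; rgb_pixels: Nx3 sample of
            # training-set RGB values used to estimate the covariance.
            cov = np.cov(rgb_pixels, rowvar=False)         # 3x3 RGB covariance
            eigvals, eigvecs = np.linalg.eigh(cov)         # principal components
            alphas = np.random.normal(0.0, sigma, size=3)  # one draw per image
            shift = eigvecs @ (alphas * eigvals)           # [p1 p2 p3][a1*l1, ...]^T
            return np.clip(image + shift, 0.0, 1.0)

        rng = np.random.default_rng(0)
        img = rng.random((224, 224, 3))
        sample_pixels = rng.random((10000, 3))
        print(pca_color_augment(img, sample_pixels).shape)  # (224, 224, 3)
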
  18. Summary:

    The most important contributions brought about by this paper are an activation function that was novel at the time and a new neural net architecture that leverages the growing compute power of new GPUs.

    ReLU is used to prevent saturation. In most layers, neurons are connected only to neurons on the same GPU, with cross-GPU connections only in certain layers. Here the authors leverage the ever-growing RAM capacity of GPUs to train a deeper neural net with more parameters.

    One particular concern throughout is overfitting; the authors augment the training set and add dropout to prevent this. The augmentation using the pixel color covariance and its eigendecomposition is particularly interesting.

    Questions:

    Now that we have seen the power of batch normalization, how much faster would AlexNet train with batch normalization? Also, what is the intuition behind the choice of augmentation applied to the images? It is easy to say that the classifier should be invariant to color and luminosity changes, but why specifically this PCA-inspired transformation?

  19. Hongyan Wang's summary:

    This paper presents a deep convolutional neural network which achieves record-breaking results on a highly challenging dataset.

    This paper introduces several techniques which may be helpful for future applications. First, the ReLU nonlinearity is much faster than traditional saturating nonlinearities, and faster learning greatly influences the performance of large models trained on large datasets. Second, the authors trained on multiple GPUs. From this, we can see that memory and GPU capacity are limiting the power of deep neural networks; in the future, with more memory and more powerful GPUs, we will see better CNNs.

    To reduce overfitting, this paper uses several methods such as data augmentation and dropout.

    The overall architecture: the net contains eight layers with weights; the first five are convolutional and the remaining three are fully connected. A ReLU is applied to the output of every convolutional and fully-connected layer. The dropout rate is 0.5, and dropout is applied to the first two fully-connected layers. The learning method is SGD with momentum, with a momentum parameter of 0.9.

    Questions:
    The authors mention that, with faster GPUs and more memory, their model would become more powerful. How would they expand their model?
