Thursday, March 9, 2017

Tues, March 14: CNNs for Semantic Segmentation

Fully Convolutional Networks for Semantic Segmentation. Jonathan Long, Evan Shelhamer, Trevor Darrell. CVPR 2015.

20 comments:

  1. (Chris Mattioli) "Fully Convolutional Networks for Semantic Segmentation" takes a deep dive into the problem of pixel-by-pixel classification. Long et al. propose the use of FCNs, their "fully" convolutional networks, for this task. They emphasize that the beauty of their technique, compared to others that have attempted pixel-by-pixel classification, is that it is simply an FCN trained end-to-end. They mention that other techniques tend to "complicate" matters by adding pre/post-processing steps and by not including pre-training. Their FCNs are also special in that they include a "skip" architecture that allows them to learn coarse, semantic detail as well as shallow appearance information. First, they adapt the fully connected layers of popular CNNs such as VGG and GoogLeNet into convolutional layers. This produces an output that is a spatial map rather than a classification vector. The spatial map is the key to understanding the global and local nature of the task (pixel-by-pixel classification). Once they have the spatial map output, they need to link these (often) downsampled maps back to the pixels of the original image. This is achieved by convolving the spatial maps with fractional stride parameters. Once they have "convolutionized" the popular networks AlexNet and VGG, they train using this new output. One thing they also add is the aforementioned "skips". The purpose of the skips is to allow the model to make local decisions that respect the global structure of the image. They work by selecting the outputs of some layers and adding different convolution layers to those outputs, i.e., turning a single-line network topology into a DAG. The network ultimately fuses the results of many of these layers into the final output spatial map. Augmenting the network with this skip architecture greatly improved their accuracies, and in general their technique improved upon existing pixel-by-pixel classification methods.

    Question: I'm a little confused about the layer fusion they detail on page 6. They do a spatial alignment of the layers (with the necessary upsampling), but then they say they use a "score layer" and then sum the scores. Maybe an illustration of this 1x1 "score layer" would help my understanding.
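
    A minimal sketch of what that score-and-fuse step might look like, assuming PyTorch; the channel counts and layer names (score_pool4, score_conv7) are illustrative guesses, and plain bilinear interpolation stands in for the paper's learned 2x deconvolution:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n_classes = 21  # PASCAL VOC: 20 classes + background

    # Hypothetical feature maps from two depths of a VGG-like net.
    pool4 = torch.randn(1, 512, 28, 28)   # shallower, finer features
    conv7 = torch.randn(1, 4096, 14, 14)  # deeper, coarser features

    # A "score layer" is just a 1x1 convolution mapping features to
    # one channel of scores per class.
    score_pool4 = nn.Conv2d(512, n_classes, kernel_size=1)
    score_conv7 = nn.Conv2d(4096, n_classes, kernel_size=1)

    s4 = score_pool4(pool4)  # (1, 21, 28, 28)
    s7 = score_conv7(conv7)  # (1, 21, 14, 14)
    s7_up = F.interpolate(s7, size=s4.shape[2:], mode='bilinear',
                          align_corners=False)

    # "Fusion" is an elementwise sum of the spatially aligned score maps.
    fused = s4 + s7_up       # (1, 21, 28, 28)
    ```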

  2. The authors propose an FCN for semantic segmentation. Similar to previous papers we read, they take the approach of end-to-end training on full images. This greatly simplifies the overall pipeline and architecture. Additionally, they use more detailed pixel information from earlier layers to refine the final prediction, which helps with prediction detail. I'm unclear on what exactly the authors mean by shift-and-stitch.

    Broader question: I interpret Figure 7 to mean that segmentation works quite well on shape data alone. I wonder how often this sort of data filtering works well in other problem domains (classification, detection, etc.).

  3. This paper proposes the use of end-to-end trained fully convolutional networks to improve semantic segmentation, i.e. spatially dense prediction. The method improves segmentation by roughly 20% relative in mean IU over the previous best result. The authors cast AlexNet, VGG, and GoogLeNet into fully convolutional networks and then combine the semantic information from coarse and fine layers to improve accuracy. Converting the fully connected layers enables the classification net to output a spatial map, which preserves the spatial information needed for further end-to-end pixelwise learning. By combining classification and spatial information, the algorithm improves accuracy and reduces inference time at the same time. The method is restricted to spatially dense prediction tasks, rather than every variety of image task, since only these tasks can exploit a prior model of spatial prediction.

    Discussion
    What is the use of a batch size of one, mentioned in the image-to-image section, for online learning? Do the shallow, fine layers and the deep, coarse layers contribute evenly to the semantic information? How are the pooling and prediction layers in the grid combined? Why choose pools 3 and 4, with different strides for FCN-16s and FCN-8s respectively, rather than other pooling layers?

  4. The authors present the use of convolutional networks for pixel-level semantic segmentation. Prior work has made predictions at every pixel (labeled with the class of the enclosing object or region), but this paper addresses some previous shortcomings: it is the first work to train fully convolutional networks end-to-end for pixelwise prediction with supervised pre-training. Their architecture starts with pretrained models of AlexNet, VGG, and GoogLeNet; they then transform the fully connected layers into convolution layers and fine-tune. Having convolutional layers on top allows the classification net to output a spatial map, which gives them a "heat map" for each class. After this, since the network produces lower-resolution output, they upsample back to the original image resolution, and by using some fusing techniques they produce the final result.
    The loss was per-pixel softmax. Two interesting results stood out to me: they compared patchwise and whole-image training and found that whole-image training was just as good and fast, and when the network was fed only shapes (foreground/background masks) the accuracy was still surprisingly good. Overall the network achieved very good results.
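
    For reference, a per-pixel softmax loss is ordinary cross-entropy applied independently at every spatial location; a minimal sketch, assuming PyTorch and illustrative shapes:

    ```python
    import torch
    import torch.nn as nn

    n_classes = 21
    logits = torch.randn(1, n_classes, 64, 64)         # per-pixel class scores
    target = torch.randint(0, n_classes, (1, 64, 64))  # ground-truth label per pixel

    # CrossEntropyLoss softmaxes over the class dimension at each
    # pixel and averages the per-pixel losses.
    loss = nn.CrossEntropyLoss()(logits, target)
    ```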

    Question: I am a bit confused by the skip architecture and how the results are "fused" (i.e. how they combine coarse high-level info with fine low-level info).

  5. The paper proposes a fully convolutional network for the segmentation problem. Segmentation, as far as I understand, labels the input image into segments pixel-by-pixel. They discuss techniques for obtaining these dense (i.e. pixel-wise) predictions from coarse output layers, like upsampling and patch sampling, though I must say I didn't understand the details at all. They also make use of skip layers. I don't get their notation either (e.g. the first two equations in Section 3). They cite some advantages (faster, keeps spatial coordinates(?)) of using the fully convolutional architecture, as opposed to an architecture with early convolutional layers and fully connected layers attached. I believe this was brought up in R-CNN, and I would really like to understand this better.

  6. Summary by Jason Krone:

    This paper describes the use of fully convolutional networks for semantic segmentation, which is the task of assigning a class label to every pixel in a given image. Regarding architecture, the authors modify existing CNNs such as AlexNet and convert them into fully convolutional networks by removing the final classifier layer and converting all fully connected layers to convolutional layers. Another interesting aspect of this paper is that it learns upsampling layers, transposed convolutions, which are essentially the same as the backward pass for a convolution. In addition, the paper introduces the use of “skip connections” where multiple feature maps from different layers in the network are upsampled and combined to generate the solution to the segmentation task. The idea behind “skip connections” is that lower level layers with smaller receptive fields have finer grained information while later layers have a larger receptive field and recognize larger patterns in the image.
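
    A minimal sketch of such a learned upsampling layer, assuming PyTorch; initializing a transposed convolution to bilinear interpolation mirrors the paper's "start from bilinear, then learn" idea, though the shapes here are illustrative:

    ```python
    import torch
    import torch.nn as nn

    def bilinear_kernel(channels, stride):
        # Per-channel bilinear interpolation weights for upsampling.
        k = 2 * stride - stride % 2
        center = (k - 1) / 2 if k % 2 == 1 else stride - 0.5
        og = torch.arange(k, dtype=torch.float32)
        filt = 1 - (og - center).abs() / stride
        kernel = filt[:, None] * filt[None, :]
        weight = torch.zeros(channels, channels, k, k)
        for c in range(channels):
            weight[c, c] = kernel  # each class channel upsampled independently
        return weight

    stride, n_classes = 2, 21
    up = nn.ConvTranspose2d(n_classes, n_classes,
                            kernel_size=2 * stride - stride % 2,
                            stride=stride, padding=stride // 2, bias=False)
    with torch.no_grad():
        up.weight.copy_(bilinear_kernel(n_classes, stride))

    x = torch.randn(1, n_classes, 14, 14)
    y = up(x)  # (1, 21, 28, 28): 2x upsampled, and the weights stay learnable
    ```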

    Question: Can we remove pooling completely and just use convolutions? How exactly are the multiple layers for the skip connections combined?

  7. Fully Convolutional Networks for Semantic Segmentation
    This paper introduces a method for semantic segmentation of images using fully convolutional networks (FCNs). The task is to predict pixel-level labels across an entire input image. They start to build their network by transfer learning from trained classification networks. The fully connected layers at the end of the network are reinterpreted as convolutional layers, as sketched below. In order to produce pixelwise output, they upsample the output layers by adding "deconvolution layers", where upsampling is enabled by the layer having a backwards stride. A popular option for segmentation training is patchwise training on the input image; however, the authors found that whole-image training was just as efficient, so they stuck with that. To capture more fine-grained details in the image, they also added skip layers to their network that connect lower layers in the network to the higher ones. Testing on the PASCAL VOC dataset, they achieved state-of-the-art performance on mean IU by a relative margin of 20%. In addition, their inference time was significantly faster than previous methods, with an average time of ~175 ms.
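
    A minimal sketch of that reinterpretation, assuming PyTorch and VGG-16-like shapes (the fc6/conv6 names are illustrative):

    ```python
    import torch
    import torch.nn as nn

    # VGG-16's fc6 maps a flattened 512x7x7 feature map to 4096 units.
    fc6 = nn.Linear(512 * 7 * 7, 4096)

    # The same weights, reshaped into a 7x7 convolution: on a 7x7 input
    # it computes the identical output, but it now slides over larger
    # feature maps and produces a spatial grid of "fc6" activations.
    conv6 = nn.Conv2d(512, 4096, kernel_size=7)
    with torch.no_grad():
        conv6.weight.copy_(fc6.weight.view(4096, 512, 7, 7))
        conv6.bias.copy_(fc6.bias)

    feat = torch.randn(1, 512, 14, 14)  # larger than the training size
    out = conv6(feat)                   # (1, 4096, 8, 8) spatial map
    ```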

    Discussion: I had some confusion about the skip layers they implemented. Are these the same thing as the bypass connections present in ResNet? Also, the training of these seems confusing. Do they retrain the network every time they add a skip layer at a lower level?

    -Jonathan Hohrath

  8. Summary by Justin Lee

    This paper introduces the use of FCNs for semantic segmentation. Unlike classification, segmentation requires prediction at the pixel level. To do so, they apply pre-trained models and add upsampling deconvolution layers to recover an output with the same dimensions as the input.

    The paper discusses two upsampling techniques. Ultimately, bilinear interpolation through fractional striding was realized most elegantly with backwards convolution. Unlike shift-and-stitch, deconvolution can be learned and has less computational overhead.

    In order to recover fine local details lost in the coarse pooled layers, feature maps from earlier convolution layers, created by filters with smaller receptive fields, are linked in as a DAG to aid in reconstructing finer segmentation.


  9. Fully convolutional networks (FCNs) reinterpret contemporary classification convnets (AlexNet, VGG nets, and GoogLeNet) for semantic segmentation. The paper defines a novel architecture that combines information from different layers for segmentation. Typical recognition nets can only take fixed-size inputs and produce nonspatial outputs, but an FCN processes images of arbitrary size and produces correspondingly sized outputs with efficient inference and learning. For dense prediction, the paper examines the shift-and-stitch trick, which yields dense predictions without interpolation, but ultimately performs upsampling via deconvolution; this is done in-network for end-to-end learning by backpropagation from the pixelwise loss and is found to be fast and effective. Standard classification convnets are built for recognition; what FCN does is remove the classification layer, convert all fully connected layers to convolutions, and append a 1x1 convolution with channel dimension 21 to predict scores for each class at each of the coarse output locations. FCN is an end-to-end, very fast model for pixelwise problems.
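
    Since shift-and-stitch puzzles several commenters, here is a toy numpy sketch of the idea, with 2x2 max pooling standing in for a net that downsamples by f=2 (the wrap-around at the border is an artifact of this simplification):

    ```python
    import numpy as np

    def coarse_net(img):
        # Stand-in for a convnet that downsamples by f=2: 2x2 max pooling.
        h, w = img.shape
        return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    def shift_and_stitch(img, f=2):
        # Run the coarse net on f*f shifted copies of the input and
        # interleave (stitch) the coarse outputs into one dense prediction.
        h, w = img.shape
        dense = np.zeros((h, w))
        for dy in range(f):
            for dx in range(f):
                shifted = np.roll(img, shift=(-dy, -dx), axis=(0, 1))
                dense[dy::f, dx::f] = coarse_net(shifted)
        return dense

    print(shift_and_stitch(np.arange(16.0).reshape(4, 4)))
    ```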

  10. In this paper, the authors present a new architecture for image segmentation. These tasks typically require classification at the pixel level. Since typical CNNs (convolutional layers followed by fully connected ones) throw away the spatial dimensions in their FC layers, models that approach segmentation this way require a lot of pre- and post-processing.

    However, the architecture proposed by the authors is a fully convolutional network (FCN) that is able to do image segmentation in a single pipeline. To accomplish this task, they take a well-known classification CNN (such as GoogLeNet or VGG) and remove the FC layers. Then they rescale the feature maps obtained by the network using a series of interpolation layers (which can be viewed as reversed convolution layers) until they recover the original size.

    In addition to this, they add a new concept to the network: the skips. These skips are no more than fusions of the outputs of different layers, scored and upsampled. Using this trick, the network incorporates multiscale maps into its final decision, which in this context means using fine-grained information from shallower layers and coarser information from upper layers.

    They train the network using an augmented PASCAL dataset and obtain promising results.

    Question: I have the same question as others regarding how the scoring and fusing is done, and also what exactly it means that the interpolation is learned (they say that it is bilinear at first and then the network itself learns it).

    -- Jorge Sendino


  11. Summary from Jie:

    This is the first paper we read on the topic of semantic segmentation. To my understanding, semantic segmentation requires "dense prediction", that is, a prediction per pixel of the image. This paper proposes a fully convolutional network, adapted from well-known classifiers such as VGG and GoogLeNet, to do this task. They transform the fully connected layers in those classifiers into convolution layers so that the network can output a spatial map. One more innovative thing is that the network resolves the "where" and "what" problem by learning to combine coarse, high-layer information with fine, low-layer information. This is illustrated by Figure 3.

    Question: To be frank, I don't quite understand this paper. There are a lot of concepts that I've not heard of before. It would be helpful if we could review and summarize the most general ideas, like what exactly is meant by "pixels-to-pixels", "end-to-end", etc.

  12. Hongyan Wang's summary:

    This paper shows a fully convolutional network trained end-to-end, pixels-to-pixels, on semantic segmentation that performs better than the state-of-the-art without further machinery. They claim their method is very efficient both asymptotically and absolutely. Their model transfers successful classification networks to dense prediction by reinterpreting the classification nets as fully convolutional and fine-tuning from their learned representations. The paper also defines a novel "skip" architecture to combine deep, coarse, semantic information with shallow, fine, appearance information. It finds that shift-and-stitch is not very efficient, while in-network upsampling is fast and effective for learning dense prediction. I think the most important part of this paper is that it defines a new fully convolutional net for segmentation that combines layers of the feature hierarchy and refines the spatial precision of the output.

  13. This comment has been removed by the author.

  14. This paper is based on the intuition of arbitrary-sized inputs and end-to-end convolutional nets. The networks are initialized from GoogLeNet, VGG, and AlexNet. The idea is to train end-to-end, pixel-by-pixel, on semantic segmentation. Learning and inference are carried out "whole-image-at-a-time by dense feedforward computation and backpropagation". Basically, the final FC layers in the usual ConvNets are changed to convolution layers that output a spatial map.

    I would like to know more about how this claim of any-size input works. Because they change the last FC layers to convolutions, how does that lift the constraint on the input size?
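
    A minimal sketch of why a conv-only net is size-agnostic, assuming PyTorch: nothing below bakes a spatial size into the weights, so the output grid simply scales with the input, whereas a Linear layer would fix the input size.

    ```python
    import torch
    import torch.nn as nn

    # No Linear layers anywhere, so no fixed input size is baked in.
    net = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 21, kernel_size=1),  # per-location class scores
    )

    for size in (64, 96, 200):
        x = torch.randn(1, 3, size, size)
        print(tuple(x.shape[2:]), '->', tuple(net(x).shape[2:]))
    # (64, 64) -> (32, 32), (96, 96) -> (48, 48), (200, 200) -> (100, 100)
    ```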

  15. This paper puts forward a CNN architecture using only convolutional layers (and no fully connected layers) to perform image segmentation, which amounts to labeling every pixel in the input image with an object class. They first test their concept by modifying well-known classification networks like AlexNet and GoogLeNet, replacing the fully connected layers in those networks with "convolutional" layers whose filter size matches the size of the incoming feature map. This causes the network to output a heat map corresponding to the original image. The authors then "deconvolve" the output with fractionally strided filters to get a final output the same size as the original input image. The authors also introduce the idea of a skip architecture, which upsamples and makes predictions from denser, earlier layers rather than the final layer alone.

    Discussion:
    How does the shift-and-stitch concept work?
    Is there a reason why the authors decided to deconvolve from the downsampled layer to the final output in one step? Would there be a benefit to deconvolving in multiple stages rather than one large step (i.e. H/32xW/32 -> H/16xW/16 -> H/8xW/8 -> H/4xW/4 -> HxW instead of H/32xW/32 -> HxW)?

  16. In this paper, the authors present a new approach for converting classification networks into segmentation networks. They accomplish this by upsampling the reduced layers back to the dimensions of the original image. This upsampling is interpreted as a deconvolution: loosely, the reverse of the original convolution layers. They first demonstrate doing this with AlexNet, GoogLeNet, and other classic networks. They use this analysis to drive their construction of an optimal network for semantic segmentation. This differs from other work in semantic segmentation in that the entire task is handled by the pixel-to-pixel network. Their results show that segmenting images this way is both more accurate and very fast.

    Questions:

    I had a lot of trouble following the adaptation of classifiers for dense prediction. Figure 2 doesn't make much sense to me; the only difference seems to be that the filters are thicker. How does this work? I would also appreciate a comparison of dilation versus the upsampling they use.

  17. This paper presents a novel machine learning algorithm for semantic segmentation. The algorithm uses a fully convolutional network that takes in an arbitrarily sized input image and produces a similarly sized output that details the semantic segmentation. Specifically, the algorithm makes a prediction for every pixel in an image and uses the combination of pixel predictions to make higher-order inferences about the image.
    The first part of the network stack uses a typical ConvNet architecture. The second part of the architecture focuses on converting the output into a spatial representation of the input image.

    How did the authors determine which scaling techniques to include (e.g. shift-and-stitch, upsampling)? Were there other scaling methods they tried that didn't work as well?
    Instead of going from pixel-to-pixel, could it be better to go from one pixel to two pixels, or to half a pixel?

    Sam Woolf

  18. This paper proposes a technique for semantic segmentation of images by converting classifier networks into fully convolutional ones (that is, with no fully connected layers), which allows them to retain spatial information throughout the entire process and output at the original resolution of the image. The authors use several interesting techniques, including upsampling intermediary layers and using multiple shifted versions of an image to improve coarse feature maps.

    Questions: how does the deconvolution that upsamples the resolution work?

  19. The researchers in the paper present how convolutional neural networks alone can yield improvements in semantic segmentation. Existing recognition nets use fixed-size inputs and produce non-spatial outputs. Fully connected layers can be transformed into convolutional layers, resulting in fully convolutional networks that take inputs of any size and produce spatial output maps. These spatial output maps are a natural choice for semantic segmentation, since there is ground truth available at every output cell. The researchers also present "skip" layers that fuse layer outputs, which help make use of shallower, more local features in an effort to combine "what" and "where".

  20. Sam Burck:

    This paper covers the use of a fully convolutional network (FCN) for semantic segmentation. The FCN in this paper was trained end-to-end, pixels-to-pixels. The authors cast traditional CNN classifiers such as AlexNet, VGG nets, and GoogLeNet into FCNs and re-purpose them for semantic segmentation by performing in-network upsampling and using a pixelwise loss. FC layers are changed to convolutions, with the classification layer exchanged for a 1x1 convolution with 21 channels for pixel classification.
    The resulting network achieves state-of-the-art performance across a wide variety of metrics.

    Discussion: How might unique features of AlexNet, VGG nets, and GoogLeNet affect their semantic segmentation versions?
