Wednesday, March 8, 2017

Thurs, March 9: Single Shot Multi-box Detector

SSD: Single shot multibox detector. Liu, Wei, et al. ECCV 2016.

19 comments:

  1. (Chris Mattioli) Up to this point, the approach has been to propose bounding boxes, sample features/pixels from each box, and put this information through a classifier. This paper discusses an approach that greatly simplifies that pipeline, and shows that it is also quite accurate. To be a little more specific, the approach is called the "Single Shot Detector" (SSD) because it is able to do object detection end-to-end using one network. The authors bring out two important features of their design: the first is that they use detections from multiple convolutional layers at different scales; the second is that they make predictions from each of these layers using a fixed set of default bounding boxes. The latter approach was, at the time, quite novel. To be more specific, training requires ground truth boxes, to which some number of the default boxes are matched (this matching is done by calculating Jaccard overlap and applying a threshold). The scales of these default boxes are decided by equation 4; in essence, the scale of a default box is a function of the smallest scale, the largest scale, and the number of feature maps. They also mention that they used the detections of their smallest and largest feature maps rather than the whole set. They remove some of their negative training examples for faster optimization. Data augmentation was of particular importance to them, and it proved to yield better results compared to not having augmentation. For this they used their original images, took random crops of their images, and sampled crops subject to a Jaccard-overlap threshold. (They would also apply a horizontal flip with some probability.) Ultimately, their results showed that they were competitive with the state of the art in both accuracy and speed.
    Question: Why not use all the feature map detections? There was such an emphasis on that in the beginning of the paper, and then in one sentence they stealthily say that they only use the largest and smallest ones. I'm also a little confused about what the "non-maximum suppression" step is all about. Perhaps I know it by another name.

    ReplyDelete
    Replies
    1. NMS is an important thing to know about. It's a strategy for picking the max-scoring box in a set of boxes that overlap, i.e. have IoU over a threshold. How it works: for each detected bbox, is there a bbox with IoU > 0.5 (for example) that has a higher score? If so, 'suppress' this box (remove it from the list of detected bboxes). At the end of the loop, only a small number of bboxes will survive.
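      To make the loop concrete, here is a minimal NumPy sketch of NMS (the [x1, y1, x2, y2] box format and the 0.5 threshold are illustrative choices, not fixed by the paper):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring box, suppress boxes that
    overlap it by more than iou_thresh, and repeat on the rest."""
    order = np.argsort(scores)[::-1]  # indices, highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_thresh]
    return keep
```

      For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the single 0.9 box, while a disjoint third box survives.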

      Delete
    2. Thanks for the reply Genevieve. Makes more sense now!

      Delete
  2. Jie's summary:

    This paper introduces SSD: Single Shot MultiBox Detector, another detection model that outperforms its predecessors in both speed and accuracy. Like YOLO, SSD abandons the region-proposal idea that prevailed in the R-CNN series of models; it simplifies further than YOLO by detecting only on a set of default bounding boxes. Because region proposal was very slow and was the bottleneck of region-based methods, abandoning it entirely significantly accelerates the model. The default bounding boxes vary in location and aspect ratio so that they can capture the objects in the picture to the largest extent. Another innovative idea in this paper is the use of multi-scale feature maps for detection. The default boxes are defined per feature map, so they also vary in size, which helps with the difficulty of detecting large versus small objects. During training, the loss function is a weighted sum of the localization loss and the confidence loss, similar to YOLO. Their base network is VGG, which is also the same as YOLO's. They state that data augmentation is crucial to training, and finally they show their model is very competitive in both speed and accuracy on various widely used datasets.

    Question: I don't quite understand exactly how they used multiple-scale feature maps; I would like a more detailed explanation.

    ReplyDelete
  3. (Takuto)

    SSD is similar to YOLO, but they train the model a little differently. For one, multiple boxes may be matched as long as they overlap fairly well with the ground truth; in YOLO you were only allowed one box. I think the authors of SSD designed the loss function more carefully than YOLO did. They use a softmax output layer for the classification probabilities and the multi-class cross-entropy loss (i.e. the negative log of the probability of the correct class), which I think is more theoretically pleasing than the regression approach (i.e. softmax won't predict probabilities that fall outside of [0,1]).

    In terms of architecture, they added more convolutional layers after the feature maps from the pre-trained classification conv net (VGG). These additional layers, they claim, mimic viewing the image at multiple scales and sizes. They also feed the lower (earlier) conv layers into the final output because these layers purportedly "capture more fine details of the input objects," which makes sense.

    I lost them when they talked about the scale of the default boxes (around equation (4)). What are they talking about?
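    For what it's worth, equation (4) is just linear interpolation: with m feature maps used for prediction, the k-th map gets scale s_k = s_min + (s_max - s_min)(k - 1)/(m - 1), where each scale is a fraction of the input image size (the paper uses s_min = 0.2 and s_max = 0.9). Default box widths and heights then come from multiplying the scale by the square root of each aspect ratio (and dividing by it). A tiny sketch:

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Equation (4): m linearly spaced scales (m >= 2), each a fraction
    of the input image size; earlier feature maps get smaller boxes."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def box_shape(scale, aspect_ratio):
    """Width and height of a default box with the given scale and ratio;
    the area stays scale**2, the ratio only reshapes the box."""
    return scale * math.sqrt(aspect_ratio), scale / math.sqrt(aspect_ratio)

scales = default_box_scales(6)  # approx [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

    So the coarsest feature map proposes boxes covering most of the image, and the finest proposes small ones, which is how the architecture ties box size to layer depth.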

    ReplyDelete
  4. This paper introduces an image detection network, SSD. SSD is a single neural network that generates scores for bounding boxes with default aspect ratios and scales. Much like YOLO, SSD avoids the heavy computation of proposal-based methods by predicting bounding boxes for potential regions directly, which saves time. And by applying this to multiple layers, SSD improves accuracy compared to YOLO. SSD has three benefits. First, the single-network structure makes it easy to train and to integrate with other systems. Second, it can adapt to different resolutions to detect objects of various sizes. Third, the accuracy and speed of the method are competitive with other methods, especially on small objects. However, it has the downside of producing too many negative examples; the authors address this by selecting the negatives with the highest confidence loss. The method can match multiple overlapping default boxes rather than only the one with maximum overlap, which contributes to the matching accuracy.

    Discussion: How is the adjustment of the box realized to fit the object shape? How do the predictions from multiple feature maps map to different resolutions? While negative examples are abundant, would picking only the highest confidence losses for each default box introduce bias?

    ReplyDelete
  5. The authors introduce the image detection network SSD. At the time of publication, state-of-the-art object detection networks hypothesized bounding boxes and then applied a classifier to the pixels from each box. This is accurate, but slow. Previous attempts to increase speed had significantly decreased accuracy. SSD's improvement comes from eliminating the bounding box proposals and subsequent resampling, and doing it well (their network is somewhat similar to YOLO). Another speed boost comes from the fact that their network still works well on lower-resolution images (so it is faster). They claim that SSD is better than YOLO - more accurate and faster (at least at the time). Their architecture produces a fixed-size collection of bounding boxes along with scores for the presence of objects, which allows for relatively straightforward training. Along with adding convolution layers, one innovation is the use of multiple layers for prediction at different scales.

    Question: I did not really follow the hard negative mining. How is the ratio related to the confidence levels?

    -Cole
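    On the hard-negative-mining question above: after matching, most default boxes are negatives, so the paper sorts the negatives by confidence loss and keeps only the hardest ones, capping negatives at 3 per positive. A rough NumPy sketch of that selection (the function name and array layout are my own):

```python
import numpy as np

def hard_negative_mask(conf_loss, is_pos, neg_pos_ratio=3):
    """Keep all positive (matched) boxes; among negatives keep only
    the ones with the highest confidence loss, at most
    neg_pos_ratio negatives per positive."""
    num_pos = int(is_pos.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_pos).sum()))
    neg_loss = np.where(is_pos, -np.inf, conf_loss)  # mask out positives
    neg_idx = np.argsort(neg_loss)[::-1][:num_neg]   # hardest negatives first
    mask = is_pos.copy()
    mask[neg_idx] = True
    return mask
```

    The 3:1 cap is the ratio the paper reports as leading to faster, more stable optimization than using all negatives.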

    ReplyDelete
  6. This paper introduces a new model for doing object detection: SSD. Detection models up to now consist of three steps: prediction and scoring of bounding boxes, classification of the object in each bbox, and removal of the boxes that are not relevant via NMS. SSD, however, integrates everything in a single pipeline. Instead of first predicting bboxes and then classifying, it "collects" different bboxes from different layers of the network (each box with a different scale depending on the layer), along with the confidence of each box and the class distribution for the object in that box. These features are produced by convolutional filters applied to different feature maps. This approach allows the model to have a diverse set of predictions.

    To train the model, the authors used a loss function that combines a regression part (L1 smooth norm for the location of the boxes) and a classification part (softmax function for the class distribution). Random patches, horizontal reflections and distortions are used for data augmentation.

    With respect to the results, they obtain much higher performance than contemporary state-of-the-art detection models like Faster R-CNN or YOLO, both in terms of speed and mAP.

    Question: The different bounding boxes are computed from different feature maps (layers). Following the paper, it seems that those predictions are made by "small kernels" applied to every extra feature layer. If that is true, how do they make sure that the output of those added convolutions effectively represents a box? (I guess it is explained by Eq. 4, but I don't really get to understand it.)

    -- Jorge Sendino
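    Regarding the "small kernels": each added feature layer gets a 3x3 convolutional predictor whose output channels are interpreted, per default box, as 4 box offsets plus one score per class; the conv output is simply reshaped into box predictions, so the boxes come from the fixed defaults plus predicted offsets rather than from the convolution alone. A shape-only sketch (the helper name is mine; the 19x19 / 6-box / 21-class numbers follow the SSD300 setup on VOC):

```python
def predictor_output_shape(fmap_hw, num_defaults, num_classes):
    """For an H x W feature map, a 3x3 conv predictor emits
    num_defaults * (num_classes + 4) channels at every location:
    4 offsets (cx, cy, w, h) plus a score per class for each default box."""
    h, w = fmap_hw
    channels = num_defaults * (num_classes + 4)
    return (channels, h, w), h * w * num_defaults

# e.g. the 19x19 map with 6 default boxes and 21 classes (20 VOC + background)
shape, num_boxes = predictor_output_shape((19, 19), 6, 21)  # 150 channels, 2166 boxes
```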


    ReplyDelete
  7. This paper presents a method for detecting objects using a single deep neural network. It performs much better than Faster R-CNN for 512x512 input, has much better accuracy even with a smaller input image size, and offers a fundamental speed improvement since it eliminates the bounding box proposal and subsequent pixel or feature resampling stages. The SSD approach first produces a fixed-size set of bounding boxes and scores for the presence of object class instances in those boxes, then produces the final detections. Compared to YOLO, the SSD model adds several feature layers to the end of the feed-forward network, which predict the offsets to default boxes with different scales and aspect ratios, along with their associated confidences. The set of default boxes is applied to several feature maps of different resolutions, which discretizes the space of possible output box shapes. As the authors mention, the key contribution of this method is the use of multi-scale convolutional bounding box outputs attached to multiple feature maps at the end of the network.

    Question: About a set of default boxes, how does the SSD framework work for different feature maps, with different scales and aspect ratios?

    ReplyDelete
  8. The paper presents a fast method of object detection. SSD, like YOLO, makes all object predictions in a single pass of the network, making it much faster than other object detection networks. The main architectural differences between YOLO and SSD are that SSD uses convolutions to generate predictions instead of fully connected layers, it generates predictions at multiple layers with different granularity in order to improve detection at different scales, and it selects bounding boxes from a set of default boxes instead of using regression. SSD performs significantly better than YOLO in both accuracy and speed. It compares well with the state of the art for accuracy.

    In the paper they train using all default boxes which have at least 0.5 Jaccard overlap with the ground truth. Have any papers that don't use default boxes explored adding redundancy/noise to the ground truth boxes to get the same effect?

    -Jay DeStories
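    A small sketch of the matching step mentioned above, given a precomputed Jaccard (IoU) matrix between default boxes and ground-truth boxes. The two-stage rule, best default per ground truth first, then any default over the threshold, follows the paper; the function name and the -1 "unmatched" convention are mine:

```python
import numpy as np

def match_defaults(jaccard, threshold=0.5):
    """jaccard: (num_defaults, num_gt) overlap matrix.
    Returns, per default box, the index of its matched ground-truth box,
    or -1 if unmatched. Each gt's best default is always matched, then
    any remaining default with overlap above the threshold is matched
    to its best-overlapping gt."""
    matches = np.full(jaccard.shape[0], -1)
    best_default_per_gt = jaccard.argmax(axis=0)
    matches[best_default_per_gt] = np.arange(jaccard.shape[1])
    best_gt_per_default = jaccard.argmax(axis=1)
    above = (jaccard.max(axis=1) > threshold) & (matches == -1)
    matches[above] = best_gt_per_default[above]
    return matches
```

    Allowing several defaults per ground truth (rather than only the single best) is exactly the redundancy the comment above is asking about; SSD builds it in via the threshold.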

    ReplyDelete
  9. Single Shot MultiBox Detector is a network architecture used for identifying objects and their bounding boxes within an image. It has state of the art accuracy and performs in real time (greater than 30 frames per second). It is very similar to the paper we just read, YOLO. It differs in the initial representation of the problem, and also in its network architecture, which avoids having a slow fully-connected layer.

    YOLO chose to represent the core problem as a regression problem, regressing on the location of the bounding boxes. SSD does the same thing, but initializes its proposal bounding boxes differently. While YOLO divided up the image into a grid and placed two boxes in each grid cell, SSD generates boxes of different aspect ratios at different scales in the image. They do this to capture objects of different sizes and orientations, which was an issue with YOLO. As with YOLO, their loss function is a function of the distance of their bounding boxes as well as their classifications.

    To eliminate the need for a fully connected layer, SSD instead uses a sequence of reducing convolutions before its softmax layer. Each of these convolutions feeds not only into the next convolution, but also directly into the softmax layer. In effect, this generates features at different scales in our image space. It seems like this allows the network to capture global features in the very reduced features but also more fine-grained qualities in the earlier convolution layers.

    Question: Normally, a fully connected layer is used to apply a nonlinearity over all the variables before feeding into a softmax, so the network can bend and stretch around the solution space. Here, they seem to approximate that bending and stretching by using an ensemble of different scales of convolution layers. Is this applicable to other networks? Can we simply stop using fully connected layers, since they are so slow? Or is this special to image detection, since it is such a locality-based problem?

    -Dylan Cashman

    ReplyDelete
  10. Summary by Jason Krone

    The paper discusses the Single Shot Detector (SSD), which is an end-to-end image detection network. SSD is significant because it is both extremely fast and achieves very high performance as measured by the mAP metric. It notably outperforms YOLO in both fps and mAP score. The network architecture uses a portion of the VGG net as the base and then adds a number of convolutional layers that decrease in size onto the end of the VGG net. In addition, they apply filters to each feature layer to produce class scores and bounding box offsets for the default boxes. They use a loss function with two components: 1. a confidence loss (softmax loss) and 2. a localization loss (smooth L1 loss). In addition to their novel architecture and well-thought-out loss function, the authors' use of data augmentation, as well as default boxes with different aspect ratios at each location, enabled them to create a highly successful image detection pipeline.

    Questions:
    It would be great to review the localization component of the loss in more detail.

    Is it common to make predictions of “detections at multiple scales”? If so, where else is this used?
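    On the localization component: it is a smooth L1 penalty on the offsets of the matched (positive) default boxes, combined with the softmax confidence loss as (L_conf + alpha * L_loc) / N for N matched boxes. A simplified NumPy sketch (the real loss also encodes offsets relative to the default box center and size, and applies hard negative mining, both omitted here):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 used for the localization loss:
    quadratic near zero, linear for |x| >= 1."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def ssd_loss(loc_err, conf_probs, target_class, num_matched, alpha=1.0):
    """Weighted sum of confidence and localization losses, averaged over
    the N matched default boxes (defined as 0 when N == 0).
    loc_err: (boxes, 4) offset errors; conf_probs: (boxes, classes)
    softmax outputs; target_class: (boxes,) ground-truth class indices."""
    if num_matched == 0:
        return 0.0
    l_loc = smooth_l1(loc_err).sum()
    l_conf = -np.log(conf_probs[np.arange(len(target_class)), target_class]).sum()
    return (l_conf + alpha * l_loc) / num_matched
```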

    ReplyDelete
  11. SSD extends and refines core ideas in YOLO to present a faster and more accurate single-shot detection network.

    Instead of having one layer of cells propose bounding boxes via regression, SSD fixes the number and shapes of boxes in a cell and lets multiple layers with different dimensions at different depths suggest bounding boxes. Fixing these bounding box ratios and scales eliminates the computational cost associated with the bounding box regression problem that YOLO solves in its ROI proposal.

    Importantly, SSD makes it so that a box in each cell predicts class confidences independent from other default boxes in the cell.

    One interesting, mathematical, refinement that is made in the paper is that SSD uses a more sensible loss function for confidence loss. SSD uses a multi-class softmax loss for confidence predictions.

    Discussion:

    Would YOLO perform better with softmax loss? Can we adapt YOLO to also calculate confidence per box per cell instead of just per cell?

    Have we regressed back into the world of handcrafting, where we now handcraft hyperparameters? Can we learn which default boxes work best?

    ReplyDelete
  12. SSD is a new algorithm for object detection. In terms of mAP it overtakes previous algorithms such as YOLO and R-CNN, and it can run at 59 FPS with an mAP of 74.3%. The main speed improvement comes from getting rid of bounding box proposals and the resampling stages they required. However, it is not the first algorithm to use a structure without proposals. SSD makes a few slight yet important modifications which help it become much more precise than previous algorithms with the same structure.
    The improvements include using small convolutional filters to predict offsets to the bounding box locations. In the training section, the loss function is carefully designed to capture both localization and confidence error.

    Q: I would like to talk more about the sizes of the different layers of the pipeline, and also about deriving the loss function.

    - Hossein

    ReplyDelete
  13. This paper presents a method of object detection that outperforms other models in both speed and accuracy. SSD uses the VGG classification network as a base network, and adds convolutional layers to the end of the base to predict bounding boxes of different scales and aspect ratios. Each cell in these extra convolutional layers can predict bounding boxes with confidence scores for each of the classes, resulting in 8732 bounding boxes per class per image, compared to YOLO's 98. The loss is computed as a weighted sum of a localization term (using smoothed L1 distance) and a confidence term (using softmax).
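    The 8732 figure is just the default boxes summed over SSD300's six prediction maps (grid sizes and per-location box counts as in the paper):

```python
# SSD300's prediction maps: (spatial size, default boxes per location)
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4 = 8732
total = sum(size * size * k for size, k in feature_maps)
```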

    Discussion:
    In the abstract, the authors write “At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape.” What are these adjustments?

    ReplyDelete
  14. This is another paper that tackles the object detection task. The approach, dubbed 'SSD', is similar to YOLO in that it eliminates the bounding box proposal generation step by discretizing the bounding boxes proposed during detection. The novel part of this paper is that they added convolutional filters to the later stages of the network to predict categories and bounding box offsets for different aspect ratios. These minor improvements greatly improved the detection accuracy (mAP) of the model while maintaining the fast speed of YOLO. As a recap on this type of object detection: the network divides the image into an SxS grid, with each cell able to predict B bounding boxes, object classifications, and confidence scores. The model tunes these parameters during training, so that cells containing objects have high confidence that an object is there and predict what the object class is.
    The model architecture includes a base network, similar to image classification models. Upon this network, progressively smaller convolutional layers are added that assist with multi-scale classification.
    A limitation of this model is that it does not predict small objects as well as larger ones. To address this, part of their data augmentation included cropping, which "zoomed in" on smaller objects during training.
    Discussion: I was a little bit confused as to how they designed the smaller layers to predict smaller objects and the larger layers to predict larger objects. Was this a result of the default box sizes they initialized before training the network?
    -Jonathan Hohrath

    ReplyDelete
  15. Amit Patel


    The researchers in the paper present a network-based object detector called SSD (Single Shot Detector) that does not resample pixels the way other networks typically do, while still maintaining accuracy. Because there is no resampling, there is a large improvement in speed - even faster than YOLO. In addition, the network is significantly more accurate. The model architecture is based on the VGG-16 network with auxiliary structure added. One of the core improvements made by SSD is predicting object categories and offsets in bounding box locations.

    Question:
    What was the researchers' reasoning for choosing VGG16 as the base network and what are some examples of other base networks that could improve performance?

    ReplyDelete
  16. Single-Shot Detection is an object detection network which, like YOLO, only does one network pass to detect the location of and classify objects in the image. However, it manages to be faster than YOLO, and far more accurate - even more accurate than some slower networks, such as the R-CNNs, in some experiments. The secret to SSD's speed is that it does detection on feature maps at several different levels of the network, capturing details at many different scales. Also, unlike other detectors we've studied, SSD uses "default" bounding boxes, with variable aspect ratios, at each level to discretize the bounding box options, improving speed. The "base network" of the model's architecture is a pre-trained VGG-16-style network, which is truncated and extended with the feature-map architecture described above. They also did a decent amount of data augmentation, using a mixture of original images, random sampling, and fixed sampling, in addition to horizontal flipping and some photometric distortion.

    Despite the use of discrete default boxes to speed up the prediction process, I'm confused about what makes SSD a speed improvement on YOLO. It is, after all, considering "an order of magnitude" more detection boxes; shouldn't the feature maps coming from all the levels of the model only make the network more complex?

    Ben Papp

    ReplyDelete
  17. The SSD network is a new state-of-the-art image detection pipeline that uses several simple methods to greatly improve on both accuracy and speed. The network utilizes a set of predetermined potential bounding boxes of various sizes and aspect ratios. Then, using a single deep neural network, it generates a score for each category for each box. At train time, the network uses the difference between these values and the ground truth to backpropagate and update weights. At test time, the network combines these scores to determine the best-guess bounding box and object classification.
    The architecture here is interesting, exhibiting a parallel nature. There is a typical stack of convolutional layers, but each sublayer is additionally passed to a detection layer. This allows the detection layer to be influenced by both the fully convolved high-order feature layers and the initial low-order features.
    Using these methods, the network significantly increases its classification speed, allowing calculations to be done in real time on video.

    What constitutes the difference between a correctly classified image and an incorrectly classified image? How does this compare to human testing?
    Interesting use of data augmentation. How would other comparable networks perform with similar data augmentation?

    Sam Woolf

    ReplyDelete