Deep Learning for Computer Vision Spring 2017: Tues, March 9: You Only Look Once

Sunday, March 5, 2017

You Only Look Once: Unified, Real-Time Object Detection. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. arXiv 2016.

21 comments:

UnknownMarch 5, 2017 at 1:21 PM
Nathan Watts summary:
This paper proposes an object detection pipeline that uses only one step, making it much faster than previous methods, though it is less accurate than state-of-the-art systems. It works by separating the generation of bounding boxes (with confidences) and a class probability map into two parallel streams, which are then multiplied together. The boxes with the highest product of confidence and class probability for each class present are selected as the bounding boxes for that class. The network uses 1x1 convolutions to reduce dimensionality between convolutional layers, similarly to GoogLeNet. This method dramatically outperforms other real-time methods, and the fast implementation is faster than any other method at the time. Additionally, creating an ensemble with Fast R-CNN results in even better performance.
Training is done by pretraining a classifier, then adding additional convolutional and fully connected layers. A leaky ReLU activation function and a momentum-based optimizer are used. Additionally, dropout and data augmentation are used to prevent overfitting.

Questions:
Could recurrence in the final fully connected layers (LSTM for example) be used to improve accuracy for video or the webcam-based demo? Has anyone tried this? (answer, evidently, yes and yes: http://guanghan.info/projects/ROLO/)

How are class probabilities that correspond to a cell within a given bounding box factored into the loss function (i.e., how does that loss function work)? How does it treat cells that are only partially within a given bounding box?
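A minimal sketch of the multiplication step described in this summary, for a single grid cell. The numbers are made up for illustration, and `class_specific_scores` is a hypothetical helper name, not something from the paper or its code.

```python
def class_specific_scores(box_confidences, class_probs):
    """Return scores[b][c] = confidence of box b times conditional probability of class c."""
    return [[conf * p for p in class_probs] for conf in box_confidences]

# One grid cell with B = 2 boxes and C = 3 classes (illustrative values only).
box_confidences = [0.8, 0.1]    # Pr(Object) * IOU for each box
class_probs = [0.7, 0.2, 0.1]   # Pr(Class_i | Object), shared by both boxes in the cell

scores = class_specific_scores(box_confidences, class_probs)

# Pick the (box, class) pair with the highest product.
best_box, best_class = max(
    ((b, c) for b in range(len(scores)) for c in range(len(class_probs))),
    key=lambda bc: scores[bc[0]][bc[1]],
)
```

Because the class probabilities are predicted once per cell, both boxes in a cell share the same class distribution; only the box confidence differs between them.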
Chris MattioliMarch 6, 2017 at 9:44 AM
This paper presents a method of object detection called "You Only Look Once". The premise is that detection is treated as a single regression problem rather than going through a complicated pipeline. It considers the entire input image and learns very generalizable representations of objects. This technique is also extremely fast in practice. Their network architecture was inspired by GoogLeNet and uses 24 conv layers, with 1x1 layers to reduce the feature space. Their ultimate result is class scores and bounding box locations. For their optimization they adjusted their loss function from those we've seen in class. This was because errors in large and small bounding boxes were being treated equally, yet they wanted the penalty to depend on the bounding box size. It's also worth noting that they used dropout and data augmentation techniques.
In terms of limitations, YOLO struggles with detecting nearby objects within the same grid cell area. This is because each grid cell predicts only a limited number of bounding boxes. There is a trade-off in using the YOLO approach, which becomes clear in the results section. The YOLO model sacrifices some localization accuracy for a reduced background detection error. When compared to Fast R-CNN, the localization error is much worse, but the background detection error is much better. They tested the combination of both models to good effect, but the speedup from using YOLO is lost since the R-CNN is also being run.
Question: I'm a little confused about their theoretical approach. I don't understand how they calculate Pr(object) and Pr(class_i). Basically, equation 1 is a little confusing to me. I also have the same question as Nathan: how does it treat cells that are only partially in a given bounding box?
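On the confidence question: the paper defines the confidence as Pr(Object) * IOU, and says the target should be zero when no object exists in the cell and otherwise equal to the IOU between the predicted box and the ground truth. A small sketch of that target follows; the corner-based box format and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes get an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_target(object_in_cell, predicted_box, truth_box):
    """Training target for a box confidence: 0 without an object, else IOU(pred, truth)."""
    if not object_in_cell:
        return 0.0
    return iou(predicted_box, truth_box)
```

So Pr(Object) is not computed separately; the network outputs one confidence number per box, and the training target for that number is built from the IOU as above.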
JonMarch 6, 2017 at 12:32 PM
This paper introduces a novel approach to object detection in images. Object detection comprises two tasks: creating bounding boxes around objects within an image and classifying the objects in these bounding boxes. Rather than solving this problem with a pipeline like traditional methods, the YOLO model combines both tasks into a single regression problem that is solvable by a single CNN.
The YOLO method starts by dividing an image into a grid of cells. Each cell is trained to output sets of bounding box predictions, object classification predictions, and confidence scores for each of its predictions. Only predictions with sufficiently high confidence in both classification and bounding box are kept. The loss function is a combination of the classification error and the bounding box errors (coordinates and IOU-based confidence).
In this paper, a model architecture was chosen that was inspired by GoogLeNet. The model was pretrained for classification on the ImageNet 1000-class competition dataset. To perform detection, additional convolutional and fully connected layers were added and trained.
The YOLO method was compared against other object detection methods such as DPM, R-CNN, Fast R-CNN, Faster R-CNN, Deep MultiBox, and OverFeat. In general, YOLO has the fastest detection while maintaining respectable accuracy. The YOLO model is highly generalizable, performing well at detecting objects in artwork when trained on natural images. YOLO has a particularly low rate of background errors; however, it has a relatively high rate of localization errors and struggles with smaller objects in an image. An ensemble model combining YOLO with Fast R-CNN was built that yielded one of the highest-performing detection methods to date.

Discussion: This method is highly relevant for video recognition tasks. Can the temporal relation between frames of a video be used to further improve prediction?

-Jonathan Hohrath
JorgeMarch 6, 2017 at 5:31 PM
This paper presents a novel system for object detection and classification. The authors propose an architecture that uses concepts used in existing detection systems such as region proposal, bounding box scoring and classification. However, they implemented everything in a single CNN that can be trained as a single model and that performs faster than every other system.

YOLO is basically a CNN with 24 convolutional layers and 2 fully connected layers. They first pretrained the convolutional layers on ImageNet for image classification at half resolution (224 x 224) and then trained for detection at double that resolution (448 x 448). They use a loss function that is basically a sum of squared errors over the locations and sizes of the boxes, but adapted to this task. For example, they account for boxes in which no object was found. The system divides every image into a grid, and each grid cell is responsible for proposing a fixed number of boxes, each with its corresponding confidence.

Using this scheme, YOLO runs faster than every other detection model. Compared to Fast R-CNN, it makes fewer background errors but more localization errors. However, thanks to its speed, it can be run alongside Fast R-CNN to correct a large part of the background errors without adding significant computational load.

Discussion: How do they make sure that each image is divided into an S x S grid and that each cell predicts exactly B bounding boxes? From the paper, it seems that the only thing they implemented in the network was the final layer, which outputs an S x S x (5*B + C) tensor. But doing that does not enforce all the constraints. Do they split the image before inputting it to the CNN?

-- Jorge Sendino
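One way to read the S x S x (5*B + C) output is that the grid exists only in how the final layer's numbers are interpreted (and in how the training targets are built), not as a literal split of the input image. The sketch below reshapes a flat output vector into per-cell predictions. The ordering inside each cell (boxes first, then class probabilities) is an assumption for illustration, since the paper does not spell out the memory layout.

```python
S, B, C = 7, 2, 20        # values the paper uses for PASCAL VOC
CELL_SIZE = B * 5 + C     # 30 numbers per grid cell

def split_output(flat):
    """Reshape a flat list of S*S*(B*5+C) numbers into a grid of per-cell predictions."""
    assert len(flat) == S * S * CELL_SIZE
    grid = []
    for row in range(S):
        grid_row = []
        for col in range(S):
            start = (row * S + col) * CELL_SIZE
            cell = flat[start:start + CELL_SIZE]
            # Assumed layout: B groups of (x, y, w, h, confidence), then C class probabilities.
            boxes = [tuple(cell[b * 5:(b + 1) * 5]) for b in range(B)]
            class_probs = cell[B * 5:]
            grid_row.append({"boxes": boxes, "class_probs": class_probs})
        grid.append(grid_row)
    return grid

flat_output = [0.0] * (S * S * CELL_SIZE)   # stand-in for the last layer's 1470 outputs
grid = split_output(flat_output)
```

Under this reading, "exactly B boxes per cell" is enforced simply by the fixed size of the output vector.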
UnknownMarch 6, 2017 at 6:19 PM
This paper introduces a method of object detection that only requires seeing the entire image once, and is therefore significantly faster than other state-of-the-art networks. An image is split into grid cells, and each cell is responsible for predicting (in the paper’s case) two bounding boxes whose centers lie in the cell, as well as class probabilities for objects within the cell. The network has 24 convolutional layers and 2 fully connected layers. Compared to other object detectors, YOLO has lower accuracy but much higher prediction speed. The majority of its errors are localization errors, which may be because each grid cell can only predict a small number of bounding boxes. As a result, groups of small objects are not correctly detected. In comparison, the Fast R-CNN model makes relatively fewer localization errors but more background errors. The authors combined these two networks, checking whether bounding boxes predicted by Fast R-CNN overlap with those predicted by YOLO. This resulted in one of the highest accuracies among the then state-of-the-art methods, but at the cost of losing YOLO’s speed. YOLO also generalizes to detecting objects in non-photographic artwork. Its high test-time speed also allows for real-time object detection.

Discussion:
How does a single CNN simultaneously predict two things (bounding boxes and class probabilities)?
Why didn’t they combine YOLO with Faster R-CNN, which has higher accuracy and speed than Fast R-CNN?
I would also like to go over the loss function in more detail.
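For going over the loss in more detail, the multi-part loss can be written out as below. This is a transcription of the paper's training objective and is worth checking against the original; here the indicator 1_{ij}^{obj} is 1 when box predictor j in cell i is responsible for an object, 1_{i}^{obj} is 1 when an object appears in cell i, hats denote predictions, and the paper sets \lambda_{coord} = 5 and \lambda_{noobj} = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines penalize box position and size, the next two penalize confidence for boxes with and without objects, and the last line penalizes class probabilities only in cells that contain an object. All terms come from the same output tensor, which is how a single network predicts boxes and classes at once.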
UnknownMarch 6, 2017 at 6:34 PM
Xinmeng Li’s Summary:

This paper introduces a novel method, YOLO, for object detection. It defines object detection as a regression problem, where the outputs are the spatial coordinates of the bounding boxes and the class probabilities of the item in each box. The method has three advantages. First, it is very fast and performs well in real time. Second, it tends to produce fewer false positives on background. Third, it learns more generalizable representations of objects than DPM and R-CNN. The downside of the method is its lower localization accuracy. Although YOLO does not beat Fast R-CNN on overall accuracy, it makes far fewer background errors. The paper claims that combining Fast R-CNN and YOLO boosts performance. YOLO is limited by its unified detection, since each grid cell can only predict two boxes and one class. Therefore YOLO struggles to detect small objects, particularly objects smaller than a grid cell.

Discussion: How much higher is this method’s localization error rate than that of other methods? How is the frame rate related to real-time performance? The paper is an inspiration for using a single neural network for real-time vision tasks. Does it indicate a trend that a single neural network with regression is the better choice for future work?
UnknownMarch 6, 2017 at 7:06 PM
This paper presents a new approach to object detection, YOLO. It uses a single neural network to predict bounding boxes and class probabilities directly from full images in one evaluation, which has several benefits over traditional methods of object detection. It is very fast, since it frames detection as a regression problem and doesn’t need a complex pipeline. It reasons globally about the image when making predictions, so it makes less than half the number of background errors compared to Fast R-CNN. It is also highly generalizable, so it is less likely to break down when applied to new domains or unexpected inputs. For unified detection, the system first divides the input image into an S x S grid. Each grid cell predicts B bounding boxes and confidence scores for those boxes; a confidence score is defined as the product of the probability of an object and the IOU (intersection over union between the predicted box and the ground truth), and each bounding box consists of 5 predictions. Each grid cell also predicts C conditional class probabilities. At test time, a class-specific confidence score is obtained as the product of the conditional class probability and the individual box confidence prediction. The architecture has 24 convolutional layers followed by 2 fully connected layers. During training, the model uses sum-squared error on the output since it is easy to optimize, and it uses dropout and data augmentation to avoid overfitting. Even though it has some limitations, such as treating errors the same in small and large bounding boxes and struggling to generalize to objects in new or unusual aspect ratios or configurations, its performance compares very well with other detection systems and real-time systems.

Question: About the loss function, it’s not good to weight localization error the same as classification error. How can two parameters be used to fix this problem? By increasing the loss from bounding box coordinate predictions and decreasing the loss from confidence predictions for boxes that don’t contain objects.
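On the small-box versus large-box limitation mentioned in this summary: the paper partially addresses it by regressing the square root of the width and height rather than the raw values. The sketch below compares the two penalties for the same absolute width error; the box widths are illustrative fractions of the image width, not values from the paper.

```python
import math

def size_error_plain(w_pred, w_true):
    """Squared error on the raw width."""
    return (w_pred - w_true) ** 2

def size_error_sqrt(w_pred, w_true):
    """Squared error on the square root of the width, as in the paper's loss."""
    return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2

# Same absolute error (0.05 of the image width) on a small box and a large box.
small = (0.10, 0.15)
large = (0.80, 0.85)

plain_small = size_error_plain(*small)
plain_large = size_error_plain(*large)
sqrt_small = size_error_sqrt(*small)
sqrt_large = size_error_sqrt(*large)
```

With the raw widths the two penalties are identical, while with square roots the small box is penalized several times more than the large one, which is the direction the authors wanted.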
UnknownMarch 6, 2017 at 7:58 PM
Sam Burck:

This paper introduces the YOLO (You Only Look Once) network, which aims to perform detection and classification using a single CNN. It uses 24 conv. layers followed by 2 fully connected layers, and it uses a loss function created from summed l2 norms taken with respect to both the presence of classes and the presence of bounding boxes (this is a gross oversimplification). The network splits the image up into a grid of cells, and calculates a confidence score for each potential class and bounding box occurring in each grid cell. This strategy makes the process of classification and detection very simple and efficient, but it gives up some accuracy in the process. While this system can process 45 frames per second on a single (powerful) GPU, it has some built-in constraints, such as only being able to predict a single class and 2 bounding boxes per grid cell. Regardless of these constraints, it's an interesting approach that should be further developed due to its simplicity and performance.

Discussion:

Is it worth trading accuracy for speed? In an application like a self-driving car, will the YOLO system's ability to update its prediction 45 times a second offer an advantage over more accurate systems with lower refresh rates? Should both be used in conjunction?
AnonymousMarch 6, 2017 at 8:05 PM
Jason Krone's Summary:

This paper discusses YOLO, an end-to-end system that uses a CNN to perform both detection and classification. YOLO is an extremely fast system and processes images at 45 frames per second while still maintaining a very high mAP (mean average precision) score. The YOLO CNN architecture consists of 24 convolutional layers and 2 fully connected layers. Generally, the convolutional layers are followed by either a max pooling layer or a 1x1 convolutional layer; however, the last 4 convolutional layers are stacked without any pooling or 1x1 convolutions. The system can be thought of as dividing the input image into an S x S grid of cells. Each grid cell predicts B bounding boxes and a confidence score (P(object) * IOU) for each box, and the cell containing the center of an object is responsible for detecting that object. In addition, each cell predicts C conditional class probabilities P(c_i | object). YOLO’s main flaw is that it makes a large number of localization errors, i.e., it correctly predicts the class but produces a bounding box with .1 < IOU < .5. However, YOLO does succeed in dramatically reducing the number of background errors compared to Fast R-CNN.

Questions:
Why did they normalize bounding box coordinates?
Does the choice of S = 7 grid size impact the neural network architecture anywhere other than the input and output layer?
What is the ground truth output for a grid cell that does not contain the center of an object?
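On the normalization question: the paper parametrizes x and y as offsets within the responsible grid cell and normalizes width and height by the image size, so all four values fall between 0 and 1. A minimal sketch of building that target from a pixel-space box follows; `encode_box` and its argument names are hypothetical, chosen for illustration.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a ground-truth box (pixel center and size) to its responsible grid cell.

    Returns (row, col, x, y, w_norm, h_norm), where x and y are offsets inside
    the cell in [0, 1] and w_norm, h_norm are fractions of the image size.
    """
    gx = cx / img_w * S          # center position measured in cell units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)    # clamp so a center on the right/bottom edge stays in range
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h

# A 200 x 300 pixel box centered at (224, 100) in a 448 x 448 image.
row, col, x, y, w_norm, h_norm = encode_box(224, 100, 200, 300, 448, 448)
```

For a cell that does not contain the center of any object, the paper's loss only asks for its box confidences to be zero; the coordinate and class terms are not penalized for that cell.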
UnknownMarch 6, 2017 at 8:08 PM
In this paper, the authors present an architecture for fast object detection, entitled You Only Look Once, or YOLO for short. It borrows from previous object detection architectures including Fast R-CNN and Deformable Parts Models. These methods first try to find potential bounding boxes that may contain objects and then use vision algorithms to determine class probabilities of those bounding boxes.

YOLO addresses two problems in these types of models.

1. Since these models are broken up into several stages (first bounding box candidate detection, then image recognition), they are slow to run at test time and thus cannot be used in real-time image detection.

2. Because the bounding boxes are trained independently, they can sometimes lose some global relationships between distant locations inside the image.

YOLO gets around this by combining the bounding box detection and the image classification into a single neural network. They split the image up into SxS cells in a grid, and for each cell, they find 2 bounding boxes and class probabilities of the class being in either of the two boxes. Each bounding box is parameterized by 5 values - x,y,w,h, and confidence that it is really an object bounding box. Their loss function balances all of these parameters to try to force the bounding boxes and class probabilities in all grid cells to be interdependent. Interestingly enough, these features exist only in the last layer of the network. The first 20-24 layers are simply convolutions that reduce the dimensionality of the network.

This network has comparable accuracy to many current networks, although it is generally below their performance. However, it is one of the only networks that can be sped up to above 30 frames per second, and it is twice as accurate as the next network that is that fast. In addition, they show that it picks up different qualities of the image than Fast R-CNN, so it can be used in ensemble with Fast R-CNN to create a very fast and accurate classifier.

Questions: In the paper, it is only determining bounding boxes in each grid cell. How does it end up offering object bounding boxes that are in more than one cell? They show examples of it doing this, but I don't see a determination in the paper.

In the loss function, they only penalize classifications where an object is in the given cell, and bounding boxes that are "responsible" for ground truth. It seems to me like you should penalize proportional to the probability distribution, without that ONE function that only turns on if the cell predicts (is that called the dirac delta function?). It seems to me like if you penalize proportionally in the loss function, you'd train faster. If this sacrifices sparsity, then maybe you could have an extra parameter that determines how much you penalize.

-Dylan Cashman
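On boxes that span more than one cell: in the paper only the box center is tied to a cell, while width and height are predicted relative to the whole image, so a box can be far larger than its cell. A sketch of decoding one prediction back to pixel corners follows; `decode_box` is a hypothetical helper and the example values are made up.

```python
def decode_box(row, col, x, y, w, h, img_w, img_h, S=7):
    """Convert a cell-relative prediction to pixel corners (x_min, y_min, x_max, y_max)."""
    cx = (col + x) / S * img_w   # center: cell index plus offset within the cell
    cy = (row + y) / S * img_h
    bw = w * img_w               # size: fraction of the whole image
    bh = h * img_h
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2

# A prediction from the middle cell of a 448 x 448 image.
box = decode_box(row=3, col=3, x=0.5, y=0.5, w=0.6, h=0.4, img_w=448, img_h=448)
cell_size = 448 / 7   # each cell covers 64 x 64 pixels
```

Here the decoded box is about 269 pixels wide even though the cell that predicted it covers only 64 pixels.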
UnknownMarch 6, 2017 at 8:18 PM
The paper introduces a method for object detection which uses a single step instead of a pipeline of predictions. YOLO splits images up into a grid, then makes a fixed number of bounding box predictions and class probability predictions corresponding to the bounding boxes. By treating object prediction as regression and using a single step on the entire image, each prediction takes advantage of global context, and the process is significantly faster than state of the art methods. Additionally, the network does very well at generalizing between domains. The paper shows that YOLO beats the state of the art at object detection on artwork after being trained on natural images. One weakness of YOLO is that it does poorly at localizing small objects. The paper explains that this could be because the loss function doesn't properly reflect how much localization error affects IOU for small objects. YOLO's speed and efficacy make it a great candidate for tasks that require real-time object detection, such as driving cars.

Has there been an effort to retrain YOLO with a loss function that more accurately reflects IOU in localization error?

The paper describes training at a low learning rate, then at a high learning rate, then a low learning rate again. So far we've mostly seen examples of learning rate decay as opposed to growth. Is growing the learning rate initially common practice? Has this been studied in other contexts?

- Jay DeStories
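On the learning rate question: the paper says the rate is slowly raised from 10^-3 to 10^-2 over the first epochs (starting high often made the model diverge due to unstable gradients), then held at 10^-2 for 75 epochs, 10^-3 for 30 epochs, and 10^-4 for 30 epochs. A sketch of such a schedule follows. The number of warm-up epochs and the linear ramp are assumptions, since the paper does not specify them.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for an epoch index starting at 0 (fractional epochs allowed)."""
    if epoch < warmup_epochs:
        # Linear ramp from 1e-3 to 1e-2; ramp shape and length are assumed, not from the paper.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:
        return 1e-3
    return 1e-4

schedule = [learning_rate(e) for e in range(5 + 135)]
```

This kind of initial ramp is usually called learning rate warm-up.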
Sambit PradhanMarch 6, 2017 at 8:34 PM
-- Sambit Pradhan --
The paper presents an ingenious method to detect objects and classify them in images with the “You Only Look Once” process. The core of the method is the integration of classification, region proposal, and bounding box prediction into a single trainable CNN. The CNN is trained by pretraining a classifier and then adding convolutional and fully connected layers. A leaky ReLU activation function is used together with momentum-based optimization. Finally, dropout and data augmentation are used to address overfitting. The method has several advantages compared to other methods. Since the method requires analyzing the source image only once, it is faster. The network performs the analysis by splitting the image into a grid of cells and calculating a confidence score for each probable class for each bounding box occurring in the grid cell. The network has 24 convolutional layers and 2 fully connected layers. For training, the model uses sum-squared error on the output. The main drawback of the method is its lower accuracy when detecting objects that are close to each other within the same grid cell area. The paper compares YOLO with existing networks. The YOLO model trades some localization accuracy for a reduced background detection error. When compared to R-CNN, the localization error is worse; however, the background detection error is better. Overall, the network and the method can handle higher frame rates than existing methods and have great potential for use in real-time computer vision pipelines.
The main disadvantage of the YOLO network is low accuracy under certain conditions. Can the accuracy be improved by using or combining existing models that would shore up the accuracy while affecting the overall performance only marginally?
ATongMarch 6, 2017 at 8:54 PM
The Yolo network is useful for real time object detection. It tackles the problem by designing one network that does localization and classification instead of dividing the task into multiple components. I'm very interested in this philosophy. Often in computer science, we choose to divide a problem into as many small subproblems as are feasible and tackle each individually. In this way we are able to build up with modular blocks better and better systems. The one monolithic network seems to go against this philosophy.

In the future, will we see more large networks that can tackle whole problems? Or will we see more specialized and modular network design? Of the three reasons that the authors provide concerning the benefits of YOLO, the only one I see as inherently a property of one larger network is the first, that YOLO is fast. Even this property seems like it could be overcome with sufficiently specialized networks.

YOLO uses a prior that objects should be distributed throughout the image by using a grid pattern on object centers. Would it make more sense to have, say, a Gaussian distribution or some other prior, essentially putting more likelihood on objects in the center of the image?
UnknownMarch 7, 2017 at 9:11 AM
The YOLO architecture is designed to improve test time rather than accuracy. It achieves real-time object detection at 45 fps.
The key to this success is training the box detector and the object classifier at the same time, in contrast to other algorithms, which train one classifier per object and then scan each object classifier over the image, which, needless to say, takes a long time.
YOLO divides the picture into an S by S grid. Each grid cell predicts conditional class probabilities P(class | object) and, for each box, a confidence that reflects the IOU between that box and any ground truth box.
The architecture of the network consists of 24 convolutional layers followed by 2 fully connected layers.
They use SGD with momentum, a small weight decay, and a varying learning rate for about 135 epochs; the learning rate starts low because of unstable gradients. They use dropout to counter overfitting. They pretrain the convolutional layers on ImageNet and train for detection on PASCAL VOC 2007 and 2012.

Question: The author asserts that they use randomized exposure and saturation of the image by up to a factor of 1.5 in the HSV color space. What does this mean and how could this help?
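One reading of that augmentation, sketched with the standard-library `colorsys` module: convert each pixel to HSV, multiply the saturation and value (exposure) channels by a random factor of at most 1.5, and convert back. Drawing the factor from [1/1.5, 1.5] and using one factor per image are assumptions here; the paper only states the "up to a factor of 1.5" bound.

```python
import colorsys
import random

def jitter_pixel(rgb, sat_scale, exp_scale):
    """Scale saturation and value (exposure) of one RGB pixel with channels in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    s = min(1.0, s * sat_scale)   # clamp so the result stays a valid color
    v = min(1.0, v * exp_scale)
    return colorsys.hsv_to_rgb(h, s, v)

def random_scale(max_factor=1.5, rng=random):
    """Draw a factor in [1/max_factor, max_factor]; this sampling rule is an assumption."""
    factor = rng.uniform(1.0, max_factor)
    return factor if rng.random() < 0.5 else 1.0 / factor

def jitter_image(pixels, max_factor=1.5, rng=random):
    """Apply one random saturation scale and one exposure scale to every pixel."""
    sat_scale = random_scale(max_factor, rng)
    exp_scale = random_scale(max_factor, rng)
    return [jitter_pixel(p, sat_scale, exp_scale) for p in pixels]
```

The intent of this kind of augmentation is to make the detector less sensitive to lighting and color differences between training and test images, since the object labels and boxes are unchanged by it.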

UnknownMarch 7, 2017 at 9:20 AM
Hongyan Wang's summary:

Traditional object detection methods take a classifier for the object and evaluate it at various locations, but these complex pipelines are slow and hard to optimize. YOLO reframes object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.

YOLO is simple and extremely fast. First, it achieves more than twice the mean average precision of other real-time systems. Second, YOLO reasons globally about the image when making predictions. Third, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin on artwork. The main disadvantage of YOLO is that it still lags behind state-of-the-art detection systems in accuracy. In particular, it performs badly on small objects.

The key contribution of YOLO is unified detection. YOLO divides the input image into an S x S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. The box coordinates are regressed directly, and the confidence scores reflect how likely a box is to contain an object and how well it fits.

YOLO has some limitations: YOLO struggles with small objects that appear in groups; YOLO struggles to generalize to objects in new or unusual aspect ratios or configurations.

Compared to other real-time systems, YOLO is much faster and has better performance on different datasets.
UnknownMarch 7, 2017 at 2:05 PM
Summary:

YOLO works like a region proposal network: an image is divided into S x S cells, and each cell proposes B bounding boxes together with class probabilities over C classes.

YOLO then calculates the probability that a bounded region contains an object of a given class. If the probability exceeds some threshold, the bounding box for that cell is output.

The network is trained on a loss function that takes into account false positives, correct class selection, bounding box centers, and bounding box dimensions.

Discussion:

The parameters B and S are 2 and 7, respectively, and the network proposes 98 bounding boxes prior to probabilistic thresholding. How did the authors come to this number? Was it something to do with the data set? Or was it something to do with the amount of computational power they had access to? The speed of the network is closely related to the computational power one has access to. With this in mind, how can we abstract away some architecture and parameter choices to implement new networks like YOLO on faster, newer hardware?

Why choose B to be 2? Can we not have each cell propose one or few bounding boxes for more than one class?

The author mentions that bounding box proposals are restricted to some "spatial" constraints. What are these?!
UnknownMarch 7, 2017 at 3:38 PM
YOLO is an object detection network whose main claim to fame is speed. The acronym stands for You Only Look Once, referencing the fact that, unlike other object detection networks, it only has to do one pass through the network to make predictions, even though many objects might be present. The network divides the input picture into a grid of cells and asks each cell to predict the likelihood of an object, a bounding box for that object (center relative to the cell, height and width relative to the image) with a confidence tied to the intersection over union, and the likely classifications of objects in that cell. It uses pretrained convolutional layers to detect features in an image and uses fully connected layers at the top of the network to form its predictions. Its accuracy is a bit worse than the state of the art, but the power of this network is that it's able to do object detection essentially in real time.

The paper says YOLO is better at detecting people in artwork than other models, but what about this model makes it better at working with unnatural images? Also, wouldn't slicing the image into a grid make it worse at detection of large objects as well? I'm confused about how using the grids doesn't severely influence the feature space. Is the grid of cells idea used at the bottom layer near the image, or at the top layer near the classification?

Ben Papp
UnknownMarch 7, 2017 at 3:58 PM
Amit Patel

The researchers presented a new, extremely fast approach to object detection that uses a single neural network to predict bounding boxes and class probabilities directly from full images in one evaluation. One key characteristic of YOLO is that it reasons globally about an image, since it sees the entire image during training. With all these advantages, one trade-off is that YOLO performs worse in terms of accuracy compared to state-of-the-art detection systems.

Question: In the limitations of YOLO, it is mentioned that, due to the spatial constraints, the model has issues with small objects in groups. Is there a way to make the model perform well with small objects in groups?
UnknownMarch 7, 2017 at 4:07 PM
Justin Lee's Summary:
YOLO is an optimized real-time detection CNN. Traditionally, object detection is simply vanilla classifiers run on different windows of the image. YOLO combines localization and classification into a single regression problem.

This is achieved by splitting the image into an S*S grid. Each grid cell also has B predicted bounding boxes. Each bounding box is a tuple of x, y, width, height, and a confidence score (based on IOU) that the box contains an object. Finally, each grid cell also has the predicted class scores. The prediction is therefore a high-dimensional S*S*(B*5 + C) tensor.

Training starts from a model pretrained on ImageNet. A custom loss function down-weights the loss from bounding boxes with no objects so that irrelevant background error won't overpower the learning process. The predictor with the highest IOU is made responsible for an object, which might otherwise be predicted multiple times from different grid cells and bounding boxes.

Question:
In the paper, the authors mention that the bounding box proposal struggles with small objects. This is possibly due to the discrete S*S partition and the limited number of times the same small object is evaluated across different grid points.
Crazy idea, but is it possible to learn the way in which the image is partitioned, so that the network can adaptively change its resolution in convolution?
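On the same object being predicted by several cells or boxes: at test time the paper notes that non-maximal suppression can be used to fix these multiple detections. Below is a sketch of a common greedy variant for one class. The 0.5 overlap threshold and the corner-based box format are assumptions, not values from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (score, box) pairs for one class.

    Visits boxes from highest to lowest score and keeps a box only if it does
    not overlap an already kept box by more than iou_threshold.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),     # overlaps the first box heavily, so it is suppressed
    (0.7, (20, 20, 30, 30)),   # separate object, so it is kept
]
kept = non_max_suppression(detections)
```

This is a fixed post-processing step applied to the network's output rather than something the network learns.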
UnknownMarch 10, 2017 at 5:48 AM
The goal of the network presented in the YOLO paper is to make object detection occur much more quickly. In order to do this, the network approaches object detection as a regression problem focusing on a slew of potential bounding boxes and corresponding class probabilities. This contrasts with previous detection systems that repurposed classifiers to perform detection, for example, sliding-window approaches.
The paper says it well: “A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance.” This simple implementation is highly successful, and its invention seems to have created a categorical shift in how this problem was considered.

Questions
Is the network’s bounding box accuracy limited by the initial grid of boxes it looks at?
If, for example, two dogs are mostly overlapped, will YOLO be able to give bounding boxes for each separate dog, even though each cell scores high for ‘dog’?

Sam Woolf
