Saturday, March 25, 2017

Tues. March 28: Understanding Deep Image Representations by Inverting Them

Understanding Deep Image Representations by Inverting Them. Aravindh Mahendran, Andrea Vedaldi. CVPR 2015.

21 comments:

  1. "Understanding Deep Image Representations by Inverting Them" is particularly concerned with the answer to the question: how accurately can an image be recreated from its representation. By representation here, they consider several image representations such as HOG, etc. Also, and probably most relevant to this class, is they want to convert CNN representations of images, and they want their inversion (going from representation to image), to be as general as possible. By embarking on this study and further research, the hope is to gain some better insight into the properties of these representations especially those that are learned in deep CNNs. Their inversion technique is the minimization over the image space of an L2 loss between the target representation and the image representation plus a regularization term. Here the regularization term captures a "natural image prior", which is a different function over image space (WxHxC). They considered two such functions, the first being the "alpha norm" which is simply the vectorized (stretched into a H*W*Cx1 vector) mean subtracted image raised to a parameter alpha. The second being the "total variation" which is the the integral of the sum of squared partial derivatives of the image raised to the 1/2 (or beta/2). Since images are more discrete quantities, the integral is replaced by a sum, and the derivatives are replaced by differences. From this they derive their final objective that gets minimized. Here they make some caveats about the constant factors of the regularization and the normalization of the loss. But it's worth mentioning that their final objective includes both regularizers discussed above. I found it interesting that they scaled their input image by the average norm of the images in the training set to compensate for image intensity bias in the first few layers of CNNs. Now equipped with their objective they apply their optimization, via SGD+momentum, to each layer of a CNN (Caffe-Alex model). For the convolutional layers, their inverted representations were quite close to the original image (minus fuzziness), but more importantly the inversions could all be easily identified as the original image. However, for the FC layers, their inversions did not yield images close the original but rather images containing some vaguely identifiable regions. They continued this experiment on patches of images, to similar results. Additionally, they experimented on sets of layers to similar success. Their general conclusion is that, as one moves forward in the network, the more and more abstract aspects of the images are learned.
    Question: in their total variation regularizer, would it not be prudent to consider entire images, i.e., HxWxC, rather than just HxW slices? Or is the intuition that the nature of RGB breaks the "piece-wise constant patches" (page 3) property they're going for?
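
    A rough sketch of the two priors as I read them, in PyTorch (illustrative only, not the authors' code; the tensor layout and default alpha/beta values are my assumptions):

    ```python
    import torch

    def alpha_norm(x, alpha=6.0):
        # R_alpha: alpha-norm of the mean-subtracted image, raised to the power alpha
        return (x - x.mean()).abs().pow(alpha).sum()

    def tv_norm(x, beta=2.0):
        # R_Vbeta: finite-difference total variation; x is assumed to be a (C, H, W) tensor
        dh = x[:, 1:, :] - x[:, :-1, :]      # vertical differences
        dw = x[:, :, 1:] - x[:, :, :-1]      # horizontal differences
        return (dh[:, :, :-1].pow(2) + dw[:, :-1, :].pow(2)).pow(beta / 2).sum()
    ```

    As far as I can tell, the paper applies the finite differences within each H x W slice separately and sums over channels, which is exactly what the question above is about.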

  2. Summary from Jie:

    This paper runs experiments to understand what "features" really are by inverting and visualizing them. The authors consider both the traditional, widely used HOG and SIFT features as well as CNNs, which they call "shallow" and "deep" representations, respectively. They point out that HOG and SIFT can actually be implemented as CNNs. One key contribution of the paper is that the authors introduce a generic way of inverting any representation. The idea is simple: look for the image that minimizes the L2 distance between its own representation and the desired representation. Gradient descent (with momentum) is applied as the optimization scheme. They also add a regularizer and experiment with "image priors" for the inversion. The results show that the first few layers of the CNN preserve the image to a great extent; as the layers go deeper, the inverted image becomes less recognizable because more abstract features are learned.
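
    A minimal sketch of this inversion loop in PyTorch, assuming a model truncated at the layer of interest; the function name, hyperparameters, and regularizer weights below are my own illustrative choices, not the authors' code:

    ```python
    import torch

    def invert(phi, phi0, image_shape, steps=200, lr=0.05,
               lam_alpha=1e-5, lam_tv=1e-4, alpha=6.0, beta=2.0):
        """Find an image whose representation phi(x) is close to the target phi0."""
        x = torch.randn(image_shape, requires_grad=True)        # start from random noise
        opt = torch.optim.SGD([x], lr=lr, momentum=0.9)          # gradient descent with momentum
        for _ in range(steps):
            opt.zero_grad()
            data = (phi(x) - phi0).pow(2).sum() / phi0.pow(2).sum()   # normalized L2 loss
            r_alpha = (x - x.mean()).abs().pow(alpha).sum()            # alpha-norm prior
            dh = x[..., 1:, :] - x[..., :-1, :]
            dw = x[..., :, 1:] - x[..., :, :-1]
            r_tv = (dh[..., :, :-1].pow(2) + dw[..., :-1, :].pow(2)).pow(beta / 2).sum()
            (data + lam_alpha * r_alpha + lam_tv * r_tv).backward()
            opt.step()
        return x.detach()
    ```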

    Question: What are the "abstract features" learned in the deeper layers? Is visualizing even a good way to understand it?

  3. Nathan Watts summary

    This paper presents a method of visualizing image representations by inverting them. This is done by optimizing an objective that finds the image whose representation best matches the representation of the original image produced by the model (using an L2 loss). The quality is improved using several optimization and regularization techniques, which recover some of the low-level information lost by the representation. They also demonstrate how to implement more traditional computer vision descriptors in a CNN framework so they can be differentiated. Additionally, by visualizing different layers of deep models, they show that deeper layers in deep convolutional models have a higher degree of invariance.

    Questions: I don’t understand the notation of the normalized loss/objective function. What does this mean?
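
    For what it's worth, my reading of the normalized loss is just the squared Euclidean distance divided by the squared norm of the target representation, so that errors are comparable across layers with very different magnitudes:

    $$ \ell\big(\Phi(x), \Phi_0\big) \;=\; \frac{\lVert \Phi(x) - \Phi_0 \rVert^2}{\lVert \Phi_0 \rVert^2} $$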

  4. This paper attempts to invert commonly used image representations. Older representations such as HOG and SIFT are much better understood, whereas newer representations that perform better (CNNs) are much less understood. The authors hope to gain a better understanding of CNN image representations not by direct visualization, but by optimizing an inverting function. Their main contribution to this pipeline is the use of an image prior, which they claim helps "recover low level image statistics removed by the representation"; I take this to mean it puts a prior on properties like translation and lighting, since a good image representation should be fairly invariant to these statistics.

    Question: Inversions seem interesting qualitatively, but it seems hard to quantitatively compare image representation inverters. Does it even make sense to talk about a "better" image inverter when a good representation loses so much data? Comparing a shallow representation to a deep one seems difficult as in theory so much more image data can be contained within a deep representation based on number of parameters. Could it be the fault of the inverter that deep representations produce much less accurate reconstructions?

    Alex Tong

  5. Sam Burck:

    This paper introduces a powerful method for image reconstruction. The method is completely independent of the image processing algorithm itself, and thus can be used with a variety of techniques such as HOG, DSIFT, and CNNs. Because the reconstruction technique can be applied to various algorithms, comparing the reconstructed images associated with different algorithms can be informative as to how those algorithms work.

    The method itself involves minimizing a regularized loss between the target representation and the representation of a candidate image, which is initialized as random noise. The loss function used is the Euclidean distance between the two representations. Several regularizers, acting as a natural image prior, were used, including a high-order norm (order 6 in this paper) as well as a more complex term built on the total variation norm. Gradient descent was used to optimize the objective.
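
    As I understand it, the full objective being minimized combines the normalized data term with both priors, each weighted by its own multiplier:

    $$ x^* = \operatorname*{argmin}_{x} \; \frac{\lVert \Phi(x) - \Phi_0 \rVert^2}{\lVert \Phi_0 \rVert^2} \;+\; \lambda_\alpha \, \mathcal{R}_\alpha(x) \;+\; \lambda_{V^\beta} \, \mathcal{R}_{V^\beta}(x) $$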

    This method was able to provide better reconstructions than previously possible for HOG, and when applied to CNNs it is able to visualize what occurs at each layer.

    Discussion: How can we use image reconstruction to design better classifiers?

  6. The researchers present a general method to reconstruct visual information by inverting SIFT, HOG, and convolutional neural network image representations. The method helps answer the question: what does the representation capture? For inverting a deep CNN, the researchers considered AlexNet. Comparing the original image with the image reconstructed from each layer of CNN-A, as the layers go deeper there is a loss of information and more blurriness in the reconstructions, but the image is still quite recognizable.

    Question: How well would the reconstructed images work for data augmentation? Can the reconstructed images be used in training to help classify the normal images better?

  7. This paper introduces a method to understand image representations, both in classic image recognition techniques and in deep CNNs. The idea behind this technique is to compute an approximate inverse of an image representation by solving an optimization problem: the minimization of a loss function combined with a regularizing term, solved using gradient descent.
    The method was applied to DSIFT, HOG, and CNN features. The pre-images computed from HOG and DSIFT features were significantly more accurate than those of other methods, though a bit slower to compute. Image reconstructions were also performed across all layers of the AlexNet CNN. Nearly photographic pre-images were reconstructed from the convolutional layers, with increasing fuzziness as the layer depth increased. Through the fully connected layers, objects in the pre-images were broken into parts at seemingly random locations in the image. Next, the invariance across pre-images computed from multiple reinitializations was analyzed at each layer; larger deformations across pre-images were observed later in the network. Also, reconstructions obtained from subsets of neurons in CNN layers were examined, and it was observed that the effective receptive fields of neurons were sometimes significantly smaller than the theoretical ones. Finally, reconstructions obtained from the two channel groups of AlexNet were analyzed, and it was observed that they specialize, roughly into edges versus colour.
    Discussion: What do they mean when they say you can observe the invariances of image representations by comparing multiple pre-images? Is there a simple example with two image reconstructions?

    -Jonathan Hohrath
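
    A hypothetical sketch of the multiple-pre-image comparison asked about above, reusing an invert() helper like the one sketched in comment 2 (model_up_to_layer, original_image, and invert are all assumptions, not the authors' code):

    ```python
    import torch

    phi = model_up_to_layer            # assumed: the network truncated at the layer of interest
    phi0 = phi(original_image)         # assumed: target representation of a known image

    pre_images = []
    for seed in range(5):
        torch.manual_seed(seed)        # different random starting point for each run
        pre_images.append(invert(phi, phi0, original_image.shape))

    stack = torch.stack(pre_images)
    per_pixel_std = stack.std(dim=0)   # large values mark content the representation leaves unconstrained
    ```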

  8. "Understanding Deep Image Representations by Inverting Them" develops a framework that allows researchers to analyze what kind of the representations are being learned both by HOG/SIFT methods and by CNNs.

    They focus on the idea of "inverting" the image representation, that is, of generating a pre-image whose representation matches the target one. Using gradient descent with momentum, they find the image that minimizes a loss function which is basically a Euclidean distance between the generated image and the target image in representation space. Since the representation is not uniquely invertible (multiple images can have the same representation), they add regularization terms to the loss function. These regularizers are in fact "natural image priors" that push the optimized image toward the set of natural images; in other words, they ensure that the generated image has a natural appearance.

    With this framework, they analyze the AlexNet CNN and find interesting results, such as that the convolutional layers maintain visual meaning, or that the branches of the CNN allocated to different GPUs do indeed specialize in learning different representations of the image. Applied to HOG or SIFT, the method performs better both quantitatively and qualitatively than other existing methods.

    Question: They say that they set lambda to the value that gives better results both quantitatively and qualitatively. In Table 3, they show the errors they get for different values of lambda at different layers and which values of lambda they choose. It seems that for the deeper layers they choose the lambda with the higher error. Isn't this a contradiction?

    More generally, if non-natural images could potentially obtain a very low Euclidean distance (which is the reason they use the priors), does it make sense to fix lambda to the value that minimizes the quantitative error? In other words, are we sure that minimizing the quantitative error also gives an acceptable qualitative result?


  9. The paper measures the amount of information contained in classical (DSIFT, HOG) and CNN-based (AlexNet) representations of an image. The way they do so is by reconstructing the image from its feature representation: the more accurately you can reconstruct an image, the more information its feature representation contains. They invert the image by minimizing a loss function, which they were able to do easily with backprop because they reframed DSIFT and HOG as equivalent CNNs.

    This method of inverting an image proved better than previous methods for classical representations. As for CNNs, it revealed what sort of information is stored in each layer. For instance, later layers store more generalizable, abstract features (e.g. more than one flamingo, in different orientations). Another cool finding was that although in theory the receptive field (which I understood to mean the amount of the original image a small patch in the downstream representation can see) should expand as you get to deeper layers, in practice it really doesn't: the central 5-by-5 square of units always captures the face of the monkey, whether you're in an earlier layer or in the last fully connected layer.

    This is all cool, but how can I use this knowledge to build a better model? Has anyone done it?
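
    A hypothetical sketch of the receptive-field probe described above, as I understand it: keep only a central 5x5 window of units in the target feature map, zero out the rest, and invert that. The helper names (invert, model_up_to_layer, original_image) are assumptions carried over from earlier sketches, not the paper's code.

    ```python
    import torch

    phi0 = model_up_to_layer(original_image)      # feature map, e.g. shape (1, C, H, W)
    h, w = phi0.shape[-2] // 2, phi0.shape[-1] // 2

    mask = torch.zeros_like(phi0)
    mask[..., h - 2:h + 3, w - 2:w + 3] = 1.0     # keep only the central 5x5 spatial window
    recon = invert(model_up_to_layer, phi0 * mask, original_image.shape)
    # In the paper, the region of `recon` that actually lights up is often much
    # smaller than the theoretical receptive field of those central units.
    ```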

  10. The authors introduce a general method to invert image representations (i.e. reconstruct possible input images). They test their algorithm on CNNs, HOG, and DSIFT. They note that representations are thought to collapse differences between images (e.g. illumination), so they should not be uniquely invertible; therefore their algorithm finds a number of possible reconstructions.

    Their method minimizes the loss between the target representation and the representation of a candidate image, which starts out as random noise; the loss is a (normalized) Euclidean distance. They tested numerous regularizers, which act as natural-image priors and turn out to be important for their success. To reduce the loss, they used gradient descent with momentum.

    To use the inversion algorithm, they need to be able to compute derivatives, so to invert DSIFT and HOG they implement them as CNNs. They test their inversion algorithm on a CNN (AlexNet) and on their CNN versions of DSIFT and HOG. They achieve surprisingly low error rates for all three, and much better results than a comparable HOG inverter (error is computed between the inverse and the original).

    They use the inverter on each layer of the CNN and this provides interesting visualizations - they find that all the CNN layers keep a somewhat faithful representation of the image, and that “fuzziness” increases on the deeper layers. The fully connected layers seem to produce a composition of parts of the original image.

    Discussion: It is interesting how the reconstructions seemingly allow us to sort of see what machine learning networks are learning to extract from images - I wonder if a technique like this could be used to help speed up training of a network? What kind of relationship is there between an accurate network and how “invertible” layers are?

  11. The core of this paper seeks to reconstruct images from their learned representations, both from deep networks and from shallow representations such as HOG and SIFT. In writing this paper, the authors created CNN forms of SIFT and HOG. At the core of their algorithm is finding an image which minimizes the loss between its representation and the target representation. Embedded in this loss function, which is based on an L2 distance, are also regularization functions which push the result toward resembling a natural image. The first, described as R-alpha, simply regularizes by the mean-subtracted image raised to the power alpha, to discourage divergence within the image. The second, and apparently more distinctive and important, regularizer is R-V-beta, which encourages uniform patches of pixels by adding loss based on the differences between neighboring pixels in the image. This algorithm succeeds in producing quantitatively more accurate and qualitatively more realistic generated images than its competitors, and gives useful insight into the representations of images at each layer of convolutional networks, with more and more abstract images appearing at the highest levels.

    The results are impressive, but what does making patches more homogeneous counteract within the representation? Simply the tendency of networks to care more about high-frequency details rather than low-frequency ones? Furthermore, to the extent that these generated images help us visualize the representation, doesn't adding these extra regularizers detract from what is really captured at that level?

    Ben Papp
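
    For reference, the two regularizers as I read them from the paper (the image is assumed to be mean-subtracted for the first term):

    $$ \mathcal{R}_\alpha(x) = \lVert x \rVert_\alpha^\alpha, \qquad \mathcal{R}_{V^\beta}(x) = \sum_{i,j} \Big( (x_{i,j+1} - x_{i,j})^2 + (x_{i+1,j} - x_{i,j})^2 \Big)^{\beta/2} $$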

  12. This paper presents a method for visualizing image representations learned in models like HOG and deep convolutional networks by attempting to generate images based on their encodings. The authors invert representations of images by optimizing for images that minimize the difference in representation, while incorporating a regularizer that penalizes non-natural images (e.g. the static-like images we've seen in class that result from directly minimizing representation difference). They use gradient descent to minimize this loss. By visualizing convolutional networks this way, they are able to show the representations at each layer of convolutional nets, demonstrating what kind of information is learned by different parts of a network.

    The paper demonstrates that in CNNs, the effective receptive field is often much smaller than the theoretical one. How should this affect how we design network architectures?

    Jay DeStories

  13. This comment has been removed by the author.

  14. Summary by Jason Krone:

    This paper discusses the recreation of images from image representations generated both by traditional computer vision techniques, such as SIFT and HOG, and by CNNs. The loss function used combines a standard Euclidean loss with a regularization term. For the regularization term, the authors experiment with both an alpha norm and a finite-difference total variation norm. Gradient descent with momentum and decay is used to optimize the loss. The authors also show that DSIFT and HOG feature representations can be implemented as CNNs. Their proposed image reconstruction technique is slower than the alternative method (HOGgle) but significantly more accurate. They found that the HOG representation without gradient orientation is more difficult to invert than the SIFT representation, and they discovered that the CNN representation was not much more difficult to invert than the HOG representation. The authors also examine images reconstructed from feature maps at different layers of the CNN and find that deeper layers yield more general representations of the input image, with objects appearing at multiple positions and scales.

    Questions:
    Could we go into more detail on how DSIFT and HOG can be implemented as CNNs?
    Is it common to think of regularization terms as image priors? What is the intuition behind this?
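
    On the first question, here is a very rough, hypothetical sketch of the idea as I understand it: fixed (not learned) convolutions compute directional gradients, a nonlinearity does a crude form of orientation binning, and another fixed filter pools into cells. The real implementations add bilinear windows and normalization steps; this is only my illustration, not the authors' code.

    ```python
    import math
    import torch
    import torch.nn.functional as F

    def directional_gradients(gray, num_orientations=8):
        # gray: (1, 1, H, W) grayscale image
        kx = torch.tensor([[[[-1.0, 0.0, 1.0]]]])              # horizontal derivative filter
        ky = kx.transpose(-1, -2)                               # vertical derivative filter
        gx = F.conv2d(gray, kx, padding=(0, 1))
        gy = F.conv2d(gray, ky, padding=(1, 0))
        outs = []
        for k in range(num_orientations):
            theta = 2 * math.pi * k / num_orientations
            # project the gradient onto direction theta, then half-wave rectify,
            # a crude stand-in for orientation binning
            outs.append(F.relu(math.cos(theta) * gx + math.sin(theta) * gy))
        return torch.cat(outs, dim=1)                           # (1, K, H, W)

    def pool_into_cells(directional, cell=8):
        # spatial pooling into HOG-style cells via a box filter (average pooling)
        return F.avg_pool2d(directional, kernel_size=cell, stride=cell)
    ```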

  15. This paper presents a key mathematical insight that bridges the gap between qualitative and quantitative understanding of neural networks.

    Prior to this paper, researchers had described in mostly qualitative ways how convolutions at different levels attempt to extract features at different scales. Though it was relatively easy to understand and visualize the gradients that convolutions capture at a local scale, because of the location-invariant nature of convolutions it was difficult to understand what a neural network layer sees at a global scale.

    This paper presents a novel, mathematically grounded regularization term that encourages "piecewise constant patches" in the output of gradient-based visualization methods.

    Question:
    What work has been done since this paper came out? Are similar regularization terms used in generative networks or class-activation visualizations for generic neural networks?

  16. This paper presents a method for inverting an image representation to demonstrate what those representations are seeing in an image. They do this by inverting the embedding function. First, they present a mathematical framing of inversion as an optimization problem: they aim to choose an element of image space that minimizes a certain cost function. That cost function is the Euclidean distance in representation space, summed with two types of regularizers. The first regularizer is the alpha norm of the proposed image, which constrains the image to prevent it from diverging. The second regularizer uses total variation to encourage proposed images to consist of piece-wise constant patches; the total variation norm penalizes the image if it changes drastically within local regions.

    The paper then shows that this inversion technique can work both with deep neural nets and with SIFT and HOG. It does this by reimplementing SIFT and HOG as CNNs, and it performs much better than comparable reconstruction techniques for SIFT and HOG. For CNNs, it isn't very good at reconstructing the original image from deep layers, but that is to be expected - a lot of the information in the image has been compressed away. However, it's really interesting to see the image recovered from different layers. In figure 6, we see that the convolutional layers retain most of the image's features, almost like a JPEG compression, while the fully connected layers really morph the space. They also report the layer-by-layer loss, and it stays relatively consistent through the network, suggesting that the network gradually deforms the image space. This is likely a sign that the CNN they used (AlexNet) is fairly well optimized; if we saw a very non-uniform distribution of loss from layer to layer, I would take it as a sign of a poorly parameterized network.

    Questions:

    I don't really understand what the alpha-normed regularization is doing. We use a norm like that for regularization when we have a weight matrix because we want to constrain weights to be close to 0. But here, we're norming the input image - why would we want the input image to be close to 0?

    This is more of a rough thought, but I was thinking that it's really interesting that the CNN features, in contrast to SIFT or HOG, are dynamically generated by the training set. If we changed the training set, the reconstruction in figure 6 would look very different. I'd love to see the difference in reconstruction if you put different masks on the training set - if we take out all other animals from the training set, and retrain AlexNet, how might it deform? If we had a CNN trained to only output to 2 classes, would it deform much quicker, and would the final layers look more like a simple element from a binary space?

  17. This paper presents a direct analysis of representations by characterising the image information that they retain. First, the authors introduce a method to compute an approximate inverse of an image representation. The method is general enough to invert many representations, including SIFT, HOG, and CNNs. The idea is simple: they formulate the problem as an optimization with a Euclidean-distance loss function and several different regularisers, and use gradient descent with momentum as the optimizer. They then study different image representations, such as DSIFT, HOG, and some deep CNNs, applying the inversion method to the deep representations and obtaining visualisations of the information represented by each layer.

    Question: I think the idea of their inversion method is simple, but I would like to know more details about how it works. A toy example might be helpful for understanding it.
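
    A tiny hypothetical toy example (mine, not the paper's) that shows the mechanics: the "representation" here is just 4x4 average pooling, and gradient descent searches for an image whose pooled version matches the target, with a small TV prior. All hyperparameters are illustrative.

    ```python
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    target = torch.rand(1, 1, 32, 32)               # the image we pretend not to know
    phi = lambda img: F.avg_pool2d(img, 4)          # a deliberately lossy "representation"
    phi0 = phi(target)                              # the only thing the inversion gets to see

    x = torch.randn_like(target, requires_grad=True)        # start from noise
    opt = torch.optim.SGD([x], lr=0.5, momentum=0.9)
    for _ in range(300):
        opt.zero_grad()
        data = (phi(x) - phi0).pow(2).sum() / phi0.pow(2).sum()
        dh = x[..., 1:, :] - x[..., :-1, :]
        dw = x[..., :, 1:] - x[..., :, :-1]
        tv = (dh[..., :, :-1].pow(2) + dw[..., :-1, :].pow(2) + 1e-8).sqrt().sum()
        (data + 1e-3 * tv).backward()
        opt.step()
    # phi(x) should end up close to phi0, while x itself is only a smooth, blurry
    # guess at `target`: the many-to-one behaviour the inversion method exposes.
    ```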

  18. This paper presents a novel way of using an end-to-end, pixel-to-pixel convolutional network to improve semantic segmentation. By making a classification decision at every pixel, the authors were able to improve the quality of image segmentation. One of the biggest hurdles here is balancing the competing forces of localization versus classification, where the former cares about small-scale features and the latter often cares about holistic image detail. This concept is later tested by running masking tests, which have the algorithm classify images with certain sections blocked out, answering the question of whether the network relies on minute details to make larger-scale conclusions.
    The network accomplishes the necessary upsampling by implementing layer fusion, combining a lower-resolution and a higher-resolution layer.

    Why do staged training and all-at-once training give similar results?
    Is it ever possible, or worthwhile, for a convolutional network's output to be bigger than its input image?
    How does back prop work when you are combining layers of different resolutions?
    -Sam Woolf

  19. This paper seeks to visualize the features that common algorithms such as DSIFT, HOG, and CNNs extract. Reconstructing an image from its features should not have a unique answer since, for example, a change of perspective may not change the features or key elements. They define phi(x), a representation of the image, and seek to reconstruct the image by approximately inverting it. The objective function is defined so as to find an image that minimizes the difference between its representation phi(x) and the given target representation; it is regularized in various ways and optimized with momentum.

    Question: I am not yet clear on how they obtain the given image features (phi_0).

  20. The paper introduces a novel method to recreate images from their representations, covering conventional computer vision representations such as HOG and SIFT as well as CNNs. The authors detail a methodology for inverting an image representation by minimizing the L2 distance between the target representation and the representation of a candidate image, regularized with a natural image prior. High-order norm and total variation norm terms are used as this natural image prior. The authors conclude that pre-images computed from HOG and DSIFT features were more accurate than those of other methods.


    AlexNet was used in the experiment to reconstruct the image across all layers of the CNN. The authors conclude that with increasing depth of the convolutional layers the image can still be reconstructed, but with a higher amount of blurring. Furthermore, in the fully connected layers of AlexNet, objects in the images are broken into small parts appearing at different locations in the image. The invariance of the pre-images is calculated across reinitializations at each layer. The paper is interesting because it illustrates how to visualize the image as it is reconstructed from each layer of AlexNet. Though the overall performance of the methodology is slower than other methods, the overall accuracy is significantly higher.

    Discussion question - Could we discuss the "piecewise constant patches" regularization method used in the paper?

  21. This paper presents a method of inverting image representations for the purpose of understanding what information is contained in techniques such as HOG and SIFT (shallow) or even CNNs (deep).
    The authors model finding the representation inverse as a regularized optimization problem. The objective is to find an approximate pre-image that minimizes the L2 difference between its representation and the target representation.

    The interesting part is their use of regularization. They propose the use of R-alpha, which acts as an "image prior", in order to keep the reconstruction from diverging.
    They also use R-V-beta to encourage piece-wise constant patches, to keep the model from spitting out noisy images. The optimization is then run with a momentum update.
