Wednesday, March 15, 2017

Thurs, March 16: Deep Visual-Semantic Alignments for Generating Image Descriptions

Deep Visual-Semantic Alignments for Generating Image Descriptions. Andrej Karpathy and Li Fei-Fei. CVPR 2015.

16 comments:

  1. (Chris Mattioli) "Deep Visual-Semantic Alignments for Generating Image Descriptions" looks at not only identifying parts of images, but generating descriptions of
    those segments. The idea is that humans can describe several parts of an image after only looking at it briefly. To be more specific, they derive a model whose job is
    twofold: reason about the image data and reason about the image descriptions. They start with a DNN that attempts to align sentence snippets in the training data to the
    approximate locations those snippets belong to in the images. They then train an RNN to generate descriptions of images. Section 3.1 details their first approach, in which they
    map the images and words into a common "multimodal embedding" and learn off this representation with an appropriate objective function; the details of this function cover image/sentence alignment. They detect objects in the images using an RCNN and condense them into h-dimensional vectors. For their sentence representations, they also convert to the h-dimensional space using a BRNN (bidirectional recurrent NN). Section 3.2 details the second part of their modeling, which is to put words to the images. They use an RNN framework, though slightly augmented in order to include conditioning on images as well (appropriate since they're taking this multimodal approach). When it comes to their results, their first-part model, the one which aligns descriptions and images, was able to outperform the state of the art. For their second-part model, the one which generates descriptions of images, they evaluated the results in two settings: full frame and region level. Their full-frame results outperform the retrieval baseline and are qualitatively quite good. Their region-level assessment performs better than their ranking baseline.
    Question: In section 3.1.4 they mention the challenge of linking sets of words to a particular segment of the image rather than just a single word. They do so using equations 10, 11, and 12. How are the values of the latent variables, a_{1...n}, learned for each word? In equation 12, it seems like these need to already be decided. Is it just an iteration over all possibilities? They mention dynamic programming, but I'm still a little confused.
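
    A minimal numpy sketch of my reading of the Section 3.1.3 objective (a simplification, not the authors' code): each word is scored against its best-matching region, the per-word scores are summed into an image-sentence score, and a max-margin ranking loss then asks every true image-sentence pair to outscore the mismatched pairs.

    import numpy as np

    def image_sentence_score(V, S):
        """V: (num_regions, h) region embeddings; S: (num_words, h) word embeddings.
        Each word contributes the dot product with its best-matching region."""
        return (S @ V.T).max(axis=1).sum()

    def ranking_loss(region_sets, word_sets):
        """region_sets[k] and word_sets[k] come from the k-th true image-sentence pair."""
        K = len(region_sets)
        scores = np.array([[image_sentence_score(region_sets[k], word_sets[l])
                            for l in range(K)] for k in range(K)])
        diag = np.diag(scores)
        rank_sentences = np.maximum(0, scores - diag[:, None] + 1).sum()  # rank sentences given an image
        rank_images = np.maximum(0, scores - diag[None, :] + 1).sum()     # rank images given a sentence
        return rank_sentences + rank_images - 2 * K                       # drop the k == l terms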

  2. This paper presents a method that combines image classification and natural language processing. The method learns to align the two modalities through a multimodal embedding, combining a convolutional neural network and a bidirectional recurrent neural network. The convolutional neural network handles the image regions, while the bidirectional recurrent neural network handles the sentence representations. The paper also presents a new architecture, a multimodal recurrent neural network, to generate novel region descriptions; the words aligned to each region of the input image are then used as training input for this RNN. The method outperforms retrieval baselines on both full images and region-level annotations. One limitation is that the model generates only one description for one input array of pixels at a fixed resolution. Another is that the approach consists of two separate networks handling images and natural language; the authors suggest that a single model trained end-to-end is the likely direction for future image description work.

    Discussion
    What is the ground truth for region-level annotation? Will it tend to lose a certain amount of information, since if it does not say anything it cannot be wrong? When constructing the multimodal embedding of words and images, what happens when too many objects appear in the same region?
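
    For concreteness, a hedged numpy sketch of the image side of that embedding: the top detected boxes plus the whole frame are each run through a CNN and projected into the shared h-dimensional space. The cnn_features helper below is a hypothetical stand-in for the pre-trained RCNN features, not the authors' code.

    import numpy as np

    H = 1000          # shared embedding size (the paper uses h on this order)
    CNN_DIM = 4096    # assumed CNN feature size (fc7-style)

    rng = np.random.default_rng(0)
    W_m = rng.normal(scale=0.01, size=(H, CNN_DIM))
    b_m = np.zeros(H)

    def cnn_features(crop):
        """Placeholder for the pre-trained CNN; returns a CNN_DIM-dimensional feature vector."""
        return rng.normal(size=CNN_DIM)

    def embed_regions(image, boxes, top_k=19):
        """Embed the whole frame plus the top_k highest-scoring detected boxes."""
        crops = [image] + list(boxes[:top_k])
        return np.stack([W_m @ cnn_features(c) + b_m for c in crops])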

  3. This paper presents a model for image captioning. The model consists of a two-step pipeline. The authors had the intuition that using a single-step model (e.g., a single RNN) would lead to low performance, due to the fact that sentences describing an image are weak labels: they only refer to some parts of the image, which might be hard to detect for a network. Instead, the authors add a previous step to the RNN consisting of annotating each image with snippets that describe each of its parts and then feeding those into the RNN.

    To achieve the first step, they build an alignment system that takes an image and feeds it into an RCNN (without the classification head) to obtain an embedding of the bounding boxes of that image into a latent space. They also feed the sentence that describes the image into an RNN that maps the sentence into another latent space. Trained with the same objective function, these two networks project both the image and the sentence into a similar space. Once both things are embedded, they align every word to a bounding box using a Markov random field.

    When they have every bounding box with its corresponding snippet, they use them to feed a bimodal RNN that generates the final caption. Here, bimodal refers to the fact that the image itself is input into this second RNN after passing it through a pre-trained CNN (e.g., VGG16 or GoogleNet).

    With this model, the authors obtained a state of the art performance in image captioning.

    Q: I have the same question as Chris, and also this one: at training time I understand how they align every bbox with the corresponding snippet and then train the bimodal RNN on each annotated box. However, at test time there is no sentence to align. Do they simply feed the test images directly into the second RNN (the bimodal one), which has already learned from the alignment and is able to generate correct captions?
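
    If that is right, test time might look roughly like the hedged sketch below (placeholder weights and helpers, not the authors' code): the image enters only through a bias at the first step, and words are then sampled until an END token.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def generate(image_feat, params, vocab, start_id, end_id, max_len=20):
        """Greedy decoding; the paper reports that beam search improves on this."""
        W_hx, W_hh, W_hi, W_oh, b_h, b_o, embed = params
        h = np.zeros(W_hh.shape[0])
        word, caption = start_id, []
        for t in range(max_len):
            bias = W_hi @ image_feat if t == 0 else 0.0   # image conditioning only at the first step
            h = np.maximum(0, W_hx @ embed[word] + W_hh @ h + b_h + bias)
            probs = softmax(W_oh @ h + b_o)               # distribution over the vocabulary
            word = int(np.argmax(probs))                  # greedy; sampling is an alternative
            if word == end_id:
                break
            caption.append(vocab[word])
        return " ".join(caption)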

    -- Jorge Sendino

  4. The authors outline a method for producing captions for images. The core tactic is to embed sentence descriptions and image representations in the same space. The embeddings chosen also account for the locality in both image space and sentence space. A bidirectional recurrent neural network is used to embed the sentences, and the RCNN architecture is used to embed the images.

    The score of a proposed sentence is based on the entities within a sentence being found with high probability in given bounding boxes in the image, with the scores being given by the RCNN. Since the locality of an entity in a sentence is nonrandom, they use a Markov Random Field to bias the score towards favoring spatial locality of entities in the sentence.

    To produce a caption, an RNN is used that is conditioned on the input image by biasing in the direction of the image in the embedding space. Then, the RNN proceeds as RNNs are used in NLP sentence generation tasks.

    Question: There's a lot going on in this network, to the point that it's hard to tell why an erroneous caption was produced. I would love to see the more problematic examples: when does the network pick up on relationships between words through the image, and when does it pick up on relationships because of the grammatical rules of a language? I'd also love to hear whether this has been used for other languages.
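
    On the Markov random field mentioned above, a hedged sketch of my reading of its energy (a simplification of Eqs. 10-11, not the authors' code): each word's similarity to its assigned box counts as a unary term, and a bonus beta is added whenever consecutive words share a box, which is what pulls contiguous phrases onto a single region.

    import numpy as np

    def alignment_energy(sims, assignment, beta=1.0):
        """sims: (num_words, num_boxes) word-region similarities.
        assignment: one box index per word."""
        unary = sum(sims[t, a] for t, a in enumerate(assignment))
        smooth = beta * sum(assignment[t] == assignment[t + 1]
                            for t in range(len(assignment) - 1))
        return unary + smooth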

  5. The authors present a novel method of dense image captioning; their network achieves state-of-the-art performance. One of their core improvements is that they can use images whose text labels are not region-annotated. They treat the labeling sentences as weak labels, where sentence chunks refer to specific things in the image at unknown locations. Their model first infers the alignment between segments of sentences and the parts of the image they describe, and then they use this data to train an RNN that will generate textual descriptions. For the first step, they use an RCNN that, instead of classifying images, embeds the bounding boxes into the same space as the labeling sentences, and since both have been mapped to the same space, they train to align the regions and the sentence fragments. Once they have this aligned data, they use it as a training set for a different RNN that actually outputs the caption.
    Their model works reasonably well, and on image-sentence ranking experiments they achieve the best results seen.
    Question:
    I am still a little unclear as to how they map the sentences into the same-dimensional space as the image regions
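
    My rough understanding, as a hedged numpy sketch (a simplification of Sec. 3.1.2, not the authors' code): each word vector is run through a forward and a backward recurrence, and the two hidden states are combined into one h-dimensional vector per word, so words land in the same space as the region embeddings.

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def brnn_embed(word_vecs, W_e, W_f, W_b, W_d, b_e, b_f, b_b, b_d):
        """word_vecs: (T, 300) word2vec-style inputs; returns (T, h) word embeddings.
        Assumes W_e maps the input to h dimensions and the other matrices are (h, h)."""
        T, h = len(word_vecs), W_d.shape[0]
        e = [relu(W_e @ w + b_e) for w in word_vecs]      # per-word input activations
        hf, hb = np.zeros((T, h)), np.zeros((T, h))
        for t in range(T):                                # forward pass over the sentence
            prev = hf[t - 1] if t > 0 else np.zeros(h)
            hf[t] = relu(e[t] + W_f @ prev + b_f)
        for t in reversed(range(T)):                      # backward pass over the sentence
            nxt = hb[t + 1] if t < T - 1 else np.zeros(h)
            hb[t] = relu(e[t] + W_b @ nxt + b_b)
        return np.stack([relu(W_d @ (hf[t] + hb[t]) + b_d) for t in range(T)])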

  6. Sam Burck:

    This paper describes an image captioning approach that differs from other common approaches in that the goal is to train a network to learn "inter-modal correspondences between language and visual data", and use this to construct labels for individual image regions that describe complex image features such as "glass of water with ice and lemon". The approach uses CNNs for image processing and bidirectional RNNs for language processing, and "aligns the two modalities through a multimodal embedding". In doing this, the system is able to extract phrases that describe collections of image elements that are somehow related in the scene. They call this architecture a "multimodal recurrent neural network". The network produces a set of log probabilities over words in the dictionary, and uses these probabilities to choose words to caption the image with. This approach produced state-of-the-art results for image annotation and image search on the Flickr30K set.

    Discussion: I understand what a multimodal network means in this context, but I'm still a bit unclear on how the multimodal embedding itself is constructed.

  7. This paper presents a model to generate language descriptions of images and their regions. The input to the model is a set of images and corresponding sentence descriptions. First, it presents an alignment model that aligns sentence snippets to the visual regions they describe through a multimodal embedding; this model is based on a CNN over image regions, a bidirectional recurrent neural network (BRNN) to compute word representations in the sentence, and a structured objective that aligns image-sentence pairs. Second, it introduces a multimodal recurrent neural network for generating descriptions. The challenge here is that, given an image, the model should predict a variable-sized sequence of outputs: during training, the RNN takes a word and the context from previous time steps and defines a distribution over the next word in the sentence, conditioned on the image information at the first time step; during testing, they find that beam search improves results. Third, the model is optimized with SGD, using mini-batches of 100 image-sentence pairs and momentum of 0.9. In the experiments, the model outperforms retrieval baselines on both full images and a new dataset of region-level annotations. The authors also mention some limitations of this model and point out that going directly from an image-sentence dataset to region-level annotations as part of a single model trained end-to-end remains an open problem.
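
    As a side note on that optimizer choice, a tiny hedged sketch of a generic SGD-with-momentum update (momentum 0.9 as above; this is the generic update rule, not the authors' training code):

    import numpy as np

    def sgd_momentum(params, grads, velocities, lr=1e-3, momentum=0.9):
        """params, grads, velocities: lists of numpy arrays of matching shapes."""
        for p, g, v in zip(params, grads, velocities):
            v *= momentum
            v -= lr * g           # fold the new gradient into the decaying velocity
            p += v                # update the parameter in place
        return params, velocities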

    Question: In the section on decoding text segment alignments to images, how do they align extended, contiguous sequences of words to a single box? Why does treating the true alignments as latent variables in an MRF address the issue that assigning each word to its highest-scoring region leads to words getting scattered inconsistently across different regions?
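
    My guess at the mechanics, as a hedged dynamic-programming sketch (Viterbi along the word chain; my reading of Sec. 3.1.4, not the authors' code): the best score for word t in box j extends the best score for word t-1, with a bonus beta when the box does not change, so runs of words tend to stick to one box rather than scattering.

    import numpy as np

    def decode_alignment(sims, beta=1.0):
        """sims: (num_words, num_boxes) word-region similarities; returns one box index per word."""
        T, B = sims.shape
        score = np.full((T, B), -np.inf)
        back = np.zeros((T, B), dtype=int)
        score[0] = sims[0]
        for t in range(1, T):
            for j in range(B):
                trans = score[t - 1] + beta * (np.arange(B) == j)   # bonus for staying in box j
                back[t, j] = int(np.argmax(trans))
                score[t, j] = trans[back[t, j]] + sims[t, j]
        boxes = [int(np.argmax(score[-1]))]                          # trace back the best chain
        for t in range(T - 1, 0, -1):
            boxes.append(int(back[t, boxes[-1]]))
        return boxes[::-1]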

  8. This paper presents two tools. The first tool takes an image and a sentence description of the image. It then breaks up the sentence into fragments and maps each fragment to the place in the image that the fragment is talking about. The way they do it: they first use an RCNN to get the bounding boxes and the feature map, and they feed the sentence to their bidirectional recurrent network. I need to spend more time looking at the math, but the gist of it seems to be that what we have called forward propagation up to this point happens both forwards and backwards over the sentence. They then get the usual unnormalized log probabilities of each class, apply softmax, and sample from it.

    The second tool (I think) is the tool presented in lecture. It takes an image, gets its feature map from VGGNet, and then feeds the feature map in to compute the first hidden state, which is then used to get the unnormalized log probabilities of words. From these we sample the new word, get the new hidden state, and so on until we draw END.

    I would love to spend some time to understand how the training for the first tool works in detail.

  9. This paper presents a model for generating captions for regions of an image using recurrent neural networks. Interestingly, they use images with captions (and without bounding boxes) for their dataset. Their framework first detects objects in an image using a pre-trained region convolutional neural network, and transforms each of the top 19 detections, plus the entire image, into an h-dimensional vector. Separately, the framework uses a bidirectional recurrent neural network to learn a representation for the image's caption sentence; the BRNN produces an h-dimensional vector for every word in the sentence. The framework then finds an alignment of words to regions by searching for word-region pairs whose dot product is largest, and uses a Markov random field to assign groups of words to image regions. They then use these images, annotated with bounding boxes and corresponding captions, to train a multimodal RNN to generate captions.

    Discussion:
    The paper says they produce 19 bounding boxes in the image representation stage. When and how do they decide which bboxes are relevant and which can be ignored?
    I have seen papers on CNNs that can produce a single complex sentence caption for an image. Is there really a benefit to generating phrases to caption bboxes rather than generating a complex caption for the whole image, or bboxes with one-word categories?

  10. Summary by Jason Krone:

    This paper presents a model for densely annotating the contents of images with high-level descriptions. Using a dataset of image-sentence pairs, they detect objects in every image with an RCNN. Then, for each image, they compute h-dimensional vector representations of the image and of the top 19 detected locations in the image. Next, they project each word in each input sentence into the same h-dimensional space using a bidirectional recurrent neural network, which takes word2vec representations of each word. After this they use a Markov random field to align segments of text to each bounding box. Lastly, based on these text-segment/region pairs, they train an RNN that takes an image and produces descriptions of regions in the image.

    Questions:
    - What is a Markov Random Field?
    - It would be helpful to discuss equation 10 in more detail
    - Why do they clip gradients element-wise? What does this mean? (a small sketch follows below)
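
    On that last point, my understanding as a hedged sketch: element-wise clipping clamps every coordinate of the gradient to a fixed range independently, rather than rescaling the whole gradient by its norm, which keeps a single exploding coordinate from blowing up an RNN update.

    import numpy as np

    def clip_elementwise(grads, clip=5.0):
        """Clamp each gradient coordinate to [-clip, clip]; the threshold here is an assumed value."""
        return [np.clip(g, -clip, clip) for g in grads]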

  11. The authors in the paper present a model that generates dense descriptions of image regions. This model is a combination of convolutional neural networks and recurrent neural networks. A CNN pre-trained on ImageNet is used, together with a bidirectional recurrent neural network that computes word representations, to infer how segments of sentences align with regions of an image. A multimodal recurrent network is then used for generating sentences on full images. Overall, the model reaches state-of-the-art performance on image-sentence ranking experiments.

    Question:
    I am still a bit confused regarding the "effective extension" mentioned in section 3.2.
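
    My reading of that "effective extension", as a hedged training-side sketch (all weights below are placeholders, not the authors' code): the RNN is a plain next-word language model, except that its hidden state also receives an image-dependent bias at the first time step, so the generated sentence is conditioned on the image.

    import numpy as np

    def sequence_loss(word_ids, image_feat, embed, W_hx, W_hh, W_hi, W_oh, b_h, b_o):
        """word_ids: a ground-truth sentence as token indices, including START and END."""
        h = np.zeros(W_hh.shape[0])
        loss = 0.0
        for t in range(len(word_ids) - 1):
            bias = W_hi @ image_feat if t == 0 else 0.0       # the image enters only here
            h = np.maximum(0, W_hx @ embed[word_ids[t]] + W_hh @ h + b_h + bias)
            logits = W_oh @ h + b_o
            logp = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
            loss -= logp[word_ids[t + 1]]                     # cross-entropy on the next word
        return loss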

  12. This paper presents a novel approach to generating descriptions of images and their sub-regions, using both convolutional and recurrent neural networks. The paper uses data of captioned images and develops a strategy for discerning which specific location in an image a caption snippet refers to. First, a network is trained to associate snippets of image-level sentences with regions of an image. Second, a network is trained on those aligned snippets and locations to generate new descriptions.

    Is it typical that methodologies that are pertinent to the image world of ML are also useful in the natural language processing world of ML?

    Sam Woolf

  13. Hongyan Wang's summary:

    This paper presents a method to generate dense descriptions of images. They develop a deep neural network model that infers the latent alignment between segments of sentences and the regions of the image that they describe. They also introduce a multimodal recurrent neural network architecture that takes an input image and generates its description in text. During training, the input to their model is a set of images and their corresponding sentence descriptions. There are two steps: first, they use a model which aligns sentence snippets to the visual regions that they describe through a multimodal embedding; second, they treat these correspondences as training data for a second, multimodal recurrent neural network model that learns to generate the snippets.

    Question: I think I get the general idea of this paper, but a lot of details are still unclear to me. I hope to understand their training pipeline and test pipeline better.

  14. This paper introduces a new algorithm that reasons about both language and images and combines them to generate rich descriptions. It first processes the images through a CNN, and then uses a bidirectional RNN to understand the language. They present a model that aligns sentence snippets to the visual regions. Finally, a multimodal RNN combines the image and natural-language parts.

    I still don't understand how combining snippets would generate a grammatically meaningful sentence, because, as far as I understood, aligning snippets to regions of the picture throws away the natural-language structure, right?

  15. Summary from Jie:

    This paper introduces a model that can generate natural language descriptions of images and their regions. The pipeline is divided into two steps. The first step is an alignment model that can establish correspondences between image regions and sentence snippets; their key insight here is that sentences written by people make frequent references to particular, but unknown, locations in the image. For this part, they used a novel combination of the previously well-known region-based CNN and a bidirectional RNN through a "multimodal embedding". They tested their alignment model alone on commonly used datasets and showed that it outperforms the other state-of-the-art retrieval-based models. The second step is a multimodal RNN that takes the outputs from step 1 and learns to generate novel descriptions for image regions. While previous models achieved this by defining a probability distribution over the next word in a sequence given the current word and context from previous time steps, the authors explore a simple but effective extension that additionally conditions the generative process on the content of an input image. Finally, the authors evaluated the performance on both full-frame and region-level experiments and showed that in both cases the multimodal RNN outperforms retrieval baselines.

    Question: It would be helpful if we could go over the multimodal embedding.

  16. This paper presents a technique for generating more descriptive image detections, capable of producing natural language descriptions of objects rather than just classifications. The model has two steps: a convolutional neural network over image regions and a bidirectional RNN over the text do the initial alignment, and then the two parts are combined by a second, multimodal RNN that generates the descriptions.
