Sunday, April 16, 2017

Tues. April 18: Hierarchical Question-Image Co-Attention for Visual Question Answering

Hierarchical Question-Image Co-Attention for Visual Question Answering. Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh, NIPS 2016.

17 comments:

  1. (By Chris Mattioli) "Hierarchical Question-Image Co-Attention for Visual Question Answering" attacks the problem of visual question answering (VQA). The authors sought to expand the methods/techniques of VQA, which typically focus only on visual attention, by experimenting with what they refer to as "question attention". Formerly, VQA methods focused solely on this visual attention - the "where to look?" aspect of the image. Their idea was that there are features of the question which could also be important to creating an effective model. Specifically, they experimented with a model that is co-attentive to both the "where to look" question and the "which words are important" question. The question side is broken up into three parts - a hierarchy. First they consider the word level, which is an embedding of the question's one-hot-encoded words. Next they consider phrases. In this tier, they represent the phrase feature as the max of 1D convolutions over single-, double-, and triple-word windows of the word vectors at each position in the question (equations 1 and 2). Finally, the question-level tier is the LSTM encoding of the phrase-level features (the output of this max-pooling). Now that they have these features, they can perform co-attention with the visual data. They describe two different methods for doing this. The first is parallel co-attention. This method calculates an affinity matrix which measures similarity between visual features and question features. This matrix is then used as a feature to predict the image and question "attention maps". It's worth noting that this technique is applied to each tier of the hierarchy. The second method is alternating co-attention. This technique alternates between question and visual attention like so: summarize the question into a single vector, attend to the visual information based on this summary, and attend to the question based on the attended visual information. Once one of these two methods is executed, a multi-layer perceptron (MLP) is applied to generate the answer to the question.
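
    To make the hierarchy concrete, here is a minimal PyTorch-style sketch of the word -> phrase -> question encoding as I understand it. All class, variable, and parameter names are my own assumptions for illustration, not the authors' code:

        import torch
        import torch.nn as nn

        class QuestionHierarchy(nn.Module):
            # Illustrative word -> phrase -> question encoder (names are mine, not the paper's).
            def __init__(self, vocab_size, d=512):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, d)            # word level
                # 1-D convolutions over the embedded words with window sizes 1, 2, 3
                self.convs = nn.ModuleList(
                    [nn.Conv1d(d, d, kernel_size=k, padding=k // 2) for k in (1, 2, 3)]
                )
                self.lstm = nn.LSTM(d, d, batch_first=True)         # question level

            def forward(self, tokens):              # tokens: (B, T) integer word ids
                words = self.embed(tokens)          # (B, T, d) word-level features
                x = words.transpose(1, 2)           # (B, d, T) layout expected by Conv1d
                T = words.size(1)
                grams = [torch.tanh(conv(x))[:, :, :T] for conv in self.convs]
                # max over the unigram/bigram/trigram responses at every word position
                phrases = torch.stack(grams, dim=-1).max(dim=-1).values.transpose(1, 2)  # (B, T, d)
                question, _ = self.lstm(phrases)    # hidden state at each step = question-level feature
                return words, phrases, question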

    They showed this methodology improves accuracy over previous VQA approaches on datasets such as VQA and COCO-QA. Most interesting to me, though, was the ablation study: this is where they removed certain steps from their method to see how performance changed. They saw the best results when using all of the tiers I described above; removing any one of them caused a performance drop.

    Questions: I think this was a pretty interesting paper, but I found myself struggling to develop an intuition for some of the equations and why they selected what they did. For example, why does using the affinity matrix as part of the equations in (4) yield increased performance? Why use an LSTM to encode the question feature? My second question is more general to machine learning and CV, and it's really just food for thought: is a 2% increase in accuracy (over current techniques) really something to write home about? In other words, what's the minimum percent increase that's still publishable?

  2. Nathan Watts' summary:
    This paper presents a novel architecture for visual question answering which uses attention mechanisms jointly on both the image and the question, and which not only meets or exceeds state-of-the-art performance but also provides dramatic improvements in interpretability over previous question-answering models.
    The questions are encoded at three levels: word level, using a vector embedding; phrase level, using one-dimensional convolutions; and question level, using an RNN. The phrase-level model uses max pooling across several window sizes to select the appropriate n-gram for each word location. This information, along with the compressed representation of the image created by a convolutional network, is then passed through a “co-attention mechanism,” which selects, either in parallel or in an alternating manner, which data in each of the two representations to attend to in order to produce the output. The selected data are then used to generate an answer. This attention mechanism allows for visualization not only of the importance of specific question words or image features in generating the answer, but also of the relationship between question words and image features. Qualitative results show that the mechanism makes very easily understandable connections between words and image features.
    Performing an ablation study, they found that while removing individual components of the model might slightly improve performance on yes/no question answering, it dramatically reduced performance on numerical and open-ended questions.

    Question: What exactly is an attention mechanism, and how does it work? Is it similar to the localization stuff we’ve seen with CNN visualization?

  3. This paper presents an approach to the problem of visual question answering (VQA) that outperforms the previous state of the art. In addition to visual attention to guide question answering, this paper introduces the concept of question attention to determine how important different parts of a question are. This is done by creating a hierarchy of structure within a question, from words to phrases to the whole question. Specifically, words are embedded into a vector space; unigram, bigram, and trigram filters are convolved across the word vectors that make up the sentence; then max pooling is applied across the n-grams. The question-level features are the LSTM hidden vectors obtained after feeding in the phrase-level embedding at each time step of the sentence. Next, this paper also introduces a co-attention model that balances image attention and question attention when learning answers. The two mechanisms to achieve this are parallel co-attention, where the image attention and question attention are resolved simultaneously, and alternating co-attention, which alternates sequentially between visual and question attention. These techniques were implemented and tested on the VQA and COCO-QA datasets, and the network achieved state-of-the-art results on both. It was discovered that of the word-, phrase-, and question-level embeddings, the question level was the most important for achieving high accuracy.

    Discussion:
    Which of the two co-attention mechanisms was used in the results? Can you go through how the co-attention mechanisms work?

    -Jonathan Hohrath

  4. This paper studies the problem of visual question answering: given an image and a natural-language question about that image, the task is to provide a natural-language answer. The authors propose a novel mechanism that reasons about both visual attention and question attention, building a hierarchy that co-attends to the image and question at three levels. At the word level, they map the words to a vector space through an embedding matrix; at the phrase level, 1-D CNNs are used to capture the information contained in unigrams, bigrams, and trigrams; at the question level, they use an RNN to encode the entire question. At each level, they construct joint question and image co-attention maps, which are used to predict a distribution over the answers. As for the co-attention mechanisms, parallel co-attention attends to the image and question simultaneously, while alternating co-attention sequentially alternates between generating image and question attention. The experiments show that the proposed model improves performance on both the VQA and COCO-QA datasets.

    Q: How do the two mechanisms, parallel co-attention and alternating co-attention, work at the three levels?

  5. The authors present in this paper a novel approach to visual question answering (VQA). They introduce a co-attention system that allows the model to pay attention to the image and the question in a joint manner. Multiple hierarchical question features are used for this purpose.

    This question hierarchy is designed so that the final question features combine information from different word window sizes: unigrams, bigrams, and trigrams. These n-grams are generated by feeding the word embedding vectors through convolutional filters, and the results are then max-pooled and encoded using an LSTM. Using this method, three levels of features are extracted from the question: words, phrases, and sentences.

    Then, the authors combine these features with the ones obtained from the image (using deep nets such as VGG or ResNet) to compute the co-attention features. Two configurations are possible: parallel, in which both modalities' features are used at the same time to generate the attention vectors, and alternating, in which the image attention vector is used to generate the question attention vector. Finally, these attention vectors are combined to obtain the final answer.
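
    As a side note, the image branch is just a pretrained CNN feature map flattened into spatial locations. A small sketch of what that could look like with torchvision (the 448x448 input and the 512x14x14 VGG feature map are from my reading of the paper; the helper itself is hypothetical):

        import torch
        import torchvision.models as models

        # Hypothetical helper: spatial image features from a pretrained VGG-19,
        # reshaped to N locations x d channels for use as the co-attention input V.
        vgg = models.vgg19(pretrained=True).features.eval()   # convolutional layers only

        def image_features(images):                # images: (B, 3, 448, 448), ImageNet-normalized
            with torch.no_grad():
                fmap = vgg(images)                 # (B, 512, 14, 14) for 448x448 inputs
            B, d, h, w = fmap.shape
            return fmap.view(B, d, h * w).transpose(1, 2)      # (B, N=196, d=512)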

    Using the described method and testing it on the VQA and COCO-QA datasets, the authors obtain state-of-the-art performance.

    Questions:
    Figure 3(b) explains how attention vectors from different levels are combined to generate a final answer. Since the final layer is a softmax function, what is produced is a vector of probabilities. The authors actually say that it is a probability vector, but they also say that this is the final answer (supposedly a sentence). If the output is just a probability vector, what is the corresponding sentence?

    Equations 3, 4, and 6 make use of some weight matrices. However, the authors do not explain how these parameters are trained. Is training performed separately for each of the matrices? Do they make public what loss function they use?


  6. Xinmeng's Summary
    The paper proposes a hierarchical co-attention model for visual question answering. The co-attention model responds to different fragments of the question with different image regions. The model represents the question at three hierarchical levels - word level, phrase level, and question level - to capture information and to interpret regions of the image and parts of the question when predicting the answer. The method is implemented as a word-embedding layer, then a convolutional layer with multiple filters of different widths, then a max-pooling layer over the different filters, and finally an LSTM question-encoding layer. The pooling method in the paper differs from those in previous works in that it selects a different n-gram feature at each time step, and the model uses an LSTM to encode the sequence after max-pooling; the corresponding question-level feature is the LSTM hidden vector at time t. Finally, the attended question and image features are recursively combined to output answers. The model was able to improve performance by 1.7% on COCO-QA and by a further 2.1% with ResNet.

    Discussion: Is the improvement significant enough to sell the model?

  7. This paper covers a new method for visual question answering in which the network is able to use the question to direct attention in the vision space, as well as use the image to direct attention in the question space. It also covers a way of breaking down the question into word, phrase, and question levels, to help resolve ambiguities that might arise from analyzing any of these levels alone. The two models for performing co-attention are parallel and alternating; the former builds an affinity matrix which helps translate the vision features into the question space and vice versa simultaneously. The latter computes the attentions sequentially: first it summarizes the question into a single vector, then uses that to direct attention in the vision space, and then uses the attended visual information to direct attention back in the question space. After performing their experiments, the authors also probe the trained network by ablation, that is, by dropping out parts of the model and measuring the difference in performance.
    Discussion:
    Can you discuss the visuals in Figure 4 of the paper? Why do the different levels of the word maps activate so differently in the images?

    -Ben Papp

  8. Nice paper. In essence, they learn the correlation between the question and the image. They do so by first creating feature maps for the question (Q : (T, d), built with 1-D convolutions at the uni-, bi-, and tri-gram level) and the image (V : (N, d), where N is the number of pixels – I wish they had used a different letter than N). Then they multiply these matrices to get the affinity matrix C. From this they compute what they call attention, which is a vector of probabilities (relevance) over the words in the question and over the locations in the image. They propose two ways of doing this: parallel and alternating.

    They use a funky, recursive fully-connected network to identify a word (the answer) via multi-class classification. The recursion took me by surprise; it’s not that I have a better alternative, but it seems magical and they don’t explain their choice.
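
    For what it's worth, my reading of that recursive answer encoder is roughly the following sketch (hypothetical names and sizes, dropout omitted; each input is assumed to be the sum of the attended question and image features at that level):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AnswerMLP(nn.Module):
            # Sketch: recursively merge word-, phrase-, and question-level attended features.
            def __init__(self, d=512, num_answers=1000):
                super().__init__()
                self.Ww = nn.Linear(d, d)
                self.Wp = nn.Linear(2 * d, d)
                self.Ws = nn.Linear(2 * d, d)
                self.classifier = nn.Linear(d, num_answers)

            def forward(self, qv_word, qv_phrase, qv_question):   # each (B, d)
                h_w = torch.tanh(self.Ww(qv_word))
                h_p = torch.tanh(self.Wp(torch.cat([qv_phrase, h_w], dim=1)))
                h_s = torch.tanh(self.Ws(torch.cat([qv_question, h_p], dim=1)))
                return F.log_softmax(self.classifier(h_s), dim=1)  # distribution over candidate answers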

  9. Summary by Jason Krone:

    This paper discusses the use of co-attention in neural networks for the task of visual question answering. Co-attention enables the network to model “where to look” in the image as well as “what words to listen to” in the question. Given a question Q in one-hot encoding and an image V, they first embed the words into a vector space. Next they use a 1-D convolution with a tanh activation to compute phrase features from those word embeddings using adjacent subsequences of 1, 2, or 3 words. After the 1-D convolution is applied, they use max pooling across the different n-grams at each location to obtain phrase-level features, which are then fed through an LSTM to produce a question-level feature. The authors explore two types of attention mechanisms: 1. parallel co-attention, which simultaneously incorporates image and question attention, and 2. alternating co-attention, which alternates between image and question attention. Parallel co-attention uses an affinity matrix C (the product of the question representation Q, the image features V, and a weight matrix Wb, passed through a tanh activation) to compute attention. Specifically, they use C as a feature for predicting image and question attention maps. These attention maps use a tanh activation function and are fed into a softmax that produces attention probabilities. Attention vectors are then formed by summing the image features and the question features weighted by their attention probabilities. Alternating co-attention uses essentially the same transformations but attends to only one modality at a time, guided by a summary of the other. Lastly, to predict answers from the attended features, they use a multi-layer perceptron that combines the word-, phrase-, and sentence-level attention features.
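
    A rough sketch of the parallel co-attention step described above (batch shapes and module names are my own assumptions, meant only to illustrate the affinity-matrix-to-attention flow, not to reproduce the authors' code):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ParallelCoAttention(nn.Module):
            # Illustrative parallel co-attention: affinity matrix -> attention maps -> attended features.
            def __init__(self, d=512, k=256):
                super().__init__()
                self.Wb = nn.Linear(d, d, bias=False)   # affinity weights
                self.Wv = nn.Linear(d, k, bias=False)   # image features -> attention space
                self.Wq = nn.Linear(d, k, bias=False)   # question features -> attention space
                self.whv = nn.Linear(k, 1)              # image attention scores
                self.whq = nn.Linear(k, 1)              # question attention scores

            def forward(self, Q, V):                    # Q: (B, T, d) question, V: (B, N, d) image
                C = torch.tanh(Q @ self.Wb(V).transpose(1, 2))                 # (B, T, N) affinity
                Hv = torch.tanh(self.Wv(V) + C.transpose(1, 2) @ self.Wq(Q))   # (B, N, k)
                Hq = torch.tanh(self.Wq(Q) + C @ self.Wv(V))                   # (B, T, k)
                a_v = F.softmax(self.whv(Hv), dim=1)    # (B, N, 1) image attention
                a_q = F.softmax(self.whq(Hq), dim=1)    # (B, T, 1) question attention
                v_hat = (a_v * V).sum(dim=1)            # (B, d) attended image feature
                q_hat = (a_q * Q).sum(dim=1)            # (B, d) attended question feature
                return v_hat, q_hat

    As I read the paper, this block would be applied once at each level of the question hierarchy (word, phrase, and question).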

    Questions:
    - It seems somewhat odd that the max pooling operation is used to create phrase level features for the question. Would mean pooling not incorporate more information about the entire phrase?
    - What are the main limitations of this system?

  10. In order to answer visual questions, previous models focus on visual attention, i.e. identifying "where to look", but this paper argues that question attention, i.e. "which words to listen to", is equally important. The paper presents a novel multi-modal attention model for answering visual questions with two unique features. First, the model reasons about both visual attention and question attention, with a natural symmetry: the image representation is used to guide question attention, and the question representations are used to guide image attention. Second, the authors propose a hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at three different levels: word level, phrase level, and question level.

  11. The researchers introduce a co-attention model for Visual Question Answering that reasons about both image and question attention. What is novel about their co-attention model is its natural symmetry, in which the image representation is used to guide the question attention and the question representations in turn are used to guide image attention. The architecture features three hierarchical levels: words, phrases, and questions. The hierarchical question encoding involves word embedding, a convolutional layer with multiple filters of different widths, max pooling over the different filters, and finally LSTM question encoding. The researchers also present two different co-attention methods that differ in the order in which the question and image attention maps are generated.

    Question:
    What types of images or questions does the model perform poorly on?

  12. This paper discusses a method for visual question answering. The paper first encodes the words in the question sentence into a vector space. These word vectors are then combined with their neighbors via inner products with learned weight vectors to get a unigram, bigram, and trigram vector for each position in the question. A max-pool is applied to pick the best n-gram at each position, and this list of n-grams forms the representation of the question at the phrase level. This phrase-level representation is passed to an LSTM to get a vector representation of the entire question. The image is likewise transformed into a feature space. In the parallel co-attention model, attention vectors are computed for both the image space and the question space at the word, phrase, and sentence level. In the alternating co-attention model, a summary of the question is combined with the image features to learn the image attention vector, and this is combined again with the question features to learn the question attention vector. The attended vectors at each question level are combined and passed through a multi-layer perceptron to produce a final answer to the question.
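
    Here is one way the alternating scheme might look in code, built around a single attention operation applied three times (module names, shapes, and the weight sharing across the three steps are my own simplifications, not the paper's exact formulation):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AlternatingCoAttention(nn.Module):
            # Illustrative alternating co-attention: attend(Q) -> attend(V | q summary) -> attend(Q | v_hat).
            def __init__(self, d=512, k=256):
                super().__init__()
                self.Wx = nn.Linear(d, k, bias=False)   # features -> attention space
                self.Wg = nn.Linear(d, k, bias=False)   # guidance vector -> attention space
                self.whx = nn.Linear(k, 1)              # attention scores

            def attend(self, X, g=None):                # X: (B, L, d); g: (B, d) guidance or None
                H = self.Wx(X)
                if g is not None:
                    H = H + self.Wg(g).unsqueeze(1)     # broadcast the guidance over all L locations
                a = F.softmax(self.whx(torch.tanh(H)), dim=1)   # (B, L, 1) attention weights
                return (a * X).sum(dim=1)               # (B, d) attended summary

            def forward(self, Q, V):                    # Q: (B, T, d) question, V: (B, N, d) image
                s = self.attend(Q)                      # 1) summarize the question with no guidance
                v_hat = self.attend(V, s)               # 2) attend to the image given that summary
                q_hat = self.attend(Q, v_hat)           # 3) attend back to the question given the image
                return v_hat, q_hat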

    Discussion:
    How do they convert the words to a vector space? The paper says it’s learned end-to-end; does that mean they created the mapping beforehand? If so, how did they create the mapping?
    Similarly, what do they use to convert the image to a feature map?

  13. This paper covers an architecture for VQA, or Visual Question Answering. Unlike other VQA models, this architecture is a co-attention model, which in this case means it relies on both where to look in the image and the question being asked. The authors have two different approaches for combining question and image features: parallel co-attention and alternating co-attention. Parallel co-attention creates an affinity matrix by comparing question and image features for all pairs of question and image locations. Alternating co-attention first summarizes the question into a feature vector, then "attends" to the image based on the question, and then "attends" to the question based on the attended image features. The authors use both VGG and ResNet for image feature extraction. The model mostly attains state-of-the-art results when compared to other VQA algorithms, and has close-to-state-of-the-art results where other models have higher accuracy.

    How are n-grams modeled with 1d convolutions?

  14. This comment has been removed by the author.

  15. The authors introduce a new method for visual question answering. Previously, there were techniques to help decide which parts of the image were important in answering the question; the authors add to this the ability to decide which words and phrases are the most important parts of the question.
    The technique considers multiple levels: the word level; the phrase level, built from single-, double-, and triple-word windows over the word vectors; and the question level (from pooling across the different windows and then using an LSTM).

    They do co-attention between this and the image using parallel co-attention (based on the similarity between visual features and the question) and also alternating co-attention (which alternates between attending to the image and to the question). A perceptron is then used to predict the answer from the attention information. This model improved state-of-the-art results on the VQA and COCO-QA datasets (by 0.2 and 1.7 percent, respectively), and by more when using ResNet.

    Discussion: Can you clarify what to compare the ResNet results to? How much is attributable to their changes? They say another group used ResNet with performance that was 1.8% lower, but they did not test that themselves.

  16. This paper presents two novel ways of performing Visual Question Answering (VQA) that incorporate and "aggregate" features from both the image and the question. The first model, parallel co-attention, computes a matrix product of the visual features, the question features, and a weight matrix to obtain an affinity matrix, which is then used to calculate visual and question attention vectors.

    In the second model, alternating co-attention, the attention vectors are computed sequentially: a hidden summary derived from either the visual or the question features is combined, through an affine transformation, with the other modality's features to compute the question and visual attention vectors in turn.

    Discussion:
    Can the parallel or alternating attention model be abstracted to encapsulate more than two feature spaces? For example, what if we wanted a neural network to answer questions about a video? Could we abstract the idea of the affinity matrix to also include temporal features?

  17. This paper presents a novel approach for Visual Question Answering. Specifically, it focuses on the importance of question attention, or ‘what words to listen to’. This leads to a parallel structure where the network learns both visual attention (“where in the image”) and question attention (“what words are important”), which they call Co-Attention. This is done by embedding the one-hot-encoded question words, encoding them hierarchically (including with a recurrent neural network), and then using this encoding alongside the image features to train the network.
    This training takes two forms, Parallel Co-Attention and Alternating Co-Attention. In both cases, the network treats VQA as a classification task over candidate answers, attending to question words based on the image and to image locations based on the question.

    What other tasks could this network be used for? Can it be used to ask questions of non-image data, such as numerical data?
    How does this network respond to words that dictate specific spatial questions? For example, what color is the bird on the left?

    -Sam Woolf
