CVPR2016 Paper Notes

1. Generation and Comprehension of Unambiguous Object Descriptions

This paper proposes a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and that can also comprehend such an expression to infer which object is being described.

Highlights:
1. A new large-scale dataset for referring expressions.
2. Discriminative training: given a training sample (I, R, S), where I is the image, R the region, and S the sentence, the model is trained to output a high p(S|R, I) while maintaining a low p(S|R', I) for every region R' != R.
3. Semi-supervised training: train a model on a small labeled dataset, then use it to generate descriptions for a large dataset that has only bounding boxes. A separate ensemble model, trained on the same small dataset, verifies the auto-generated descriptions before they are added to the training set.
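The discriminative objective in point 2 can be sketched as a softmax over the candidate regions in the image: the sentence's likelihood under the true region is pushed up relative to all other regions. A minimal numpy sketch (function and variable names are mine, not the paper's):

```python
import numpy as np

def discriminative_loss(log_probs, true_idx):
    """Negative log-softmax over candidate regions.

    log_probs: array of log p(S|R_k, I), one entry per candidate region R_k
    true_idx:  index of the ground-truth region R
    Minimizing this raises p(S|R, I) while lowering p(S|R', I) for R' != R.
    """
    shifted = log_probs - log_probs.max()          # for numerical stability
    log_z = np.log(np.exp(shifted).sum())          # log of partition over regions
    return -(shifted[true_idx] - log_z)

# Toy example: three candidate regions, the first is ground truth.
scores = np.array([2.0, 0.5, -1.0])                # log p(S|R_k, I)
loss = discriminative_loss(scores, true_idx=0)
```

Raising the true region's score (or lowering a distractor's) strictly decreases this loss, which is exactly the "high p(S|R, I), low p(S|R', I)" behavior described above.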


2. Stacked Attention Networks for Image Question Answering

This paper presents stacked attention networks (SANs) that learn to answer natural language questions by querying an image multiple times to infer the answer progressively.

Highlights: 1. Given the image feature matrix V_I and the question feature vector V_Q, the SAN refines the query over multiple attention layers; each layer filters out noise and pinpoints the image regions most relevant to the answer.
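One attention hop can be sketched as follows: project the image features and the current query into a common space, score each region, and add the attended image vector back into the query. This is a simplified numpy sketch with random placeholder weights (names like `attention_hop` and `w_p` are mine), not the trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_hop(V_I, u, W_I, W_Q, w_p):
    """One attention layer of a stacked attention network (sketch).

    V_I: (d, m) image features, one column per region
    u:   (d,) current query vector (initially the question feature V_Q)
    W_I, W_Q: (k, d) projection matrices; w_p: (k,) scoring vector
    Returns the refined query and the attention distribution over regions.
    """
    h = np.tanh(W_I @ V_I + (W_Q @ u)[:, None])    # (k, m) joint features
    p = softmax(w_p @ h)                           # attention over m regions
    v_att = V_I @ p                                # attended image vector
    return u + v_att, p                            # refined query, weights

rng = np.random.default_rng(0)
d, k, m = 8, 6, 5
V_I = rng.standard_normal((d, m))
u0 = rng.standard_normal(d)                        # question feature V_Q
W_I, W_Q = rng.standard_normal((k, d)), rng.standard_normal((k, d))
w_p = rng.standard_normal(k)

u1, p1 = attention_hop(V_I, u0, W_I, W_Q, w_p)     # first query of the image
u2, p2 = attention_hop(V_I, u1, W_I, W_Q, w_p)     # second, stacked hop
```

Stacking a second hop re-queries the image with the refined vector u1, which is the "progressive inference" the paper describes.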

