This paper proposes a method that generates an unambiguous description (known as a referring expression) of a specific object or region in an image, and that can also comprehend such an expression to infer which object is being described.
Highlights: 1. A new large-scale dataset for referring expressions. 2. Discriminative training: given a training sample (I, R, S), where I is the image, R is the region, and S is the sentence, the model is trained to output a high p(S|R,I) while maintaining a low p(S|R',I) whenever R' != R. 3. Semi-supervised training: first train a model on a small dataset, then use this model to generate descriptions for a large dataset with bounding boxes. Another ensemble model, trained on the same small dataset, is used to verify the auto-generated descriptions.
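The discriminative objective in highlight 2 can be illustrated with a minimal sketch. It assumes the objective is a softmax over candidate regions, i.e. the loss is the negative log of p(S|R,I) divided by the sum of p(S|R',I) over all candidate regions R' (including R). The function names and the log-likelihood values are hypothetical, chosen only for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def discriminative_loss(log_p_s_given_region, true_index):
    """Negative log of p(S|R,I) / sum over R' of p(S|R',I).

    log_p_s_given_region[i] is the model's log-likelihood of sentence S
    given candidate region i; true_index marks the region S describes.
    """
    return logsumexp(log_p_s_given_region) - log_p_s_given_region[true_index]

# Hypothetical log-likelihoods of one sentence under three candidate regions.
# An ambiguous sentence fits every region equally well.
ambiguous = [-2.0, -2.0, -2.0]
# A discriminative sentence fits the true region (index 0) much better.
discriminative = [-2.0, -6.0, -6.0]

loss_ambiguous = discriminative_loss(ambiguous, 0)
loss_discriminative = discriminative_loss(discriminative, 0)
print(loss_ambiguous, loss_discriminative)
```

Both sentences have the same likelihood under the true region, so a plain maximum-likelihood loss would not distinguish them. The discriminative loss is lower for the second sentence because it is unlikely under the other regions.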
This paper presents stacked attention networks (SANs) that learn to answer natural language questions by querying an image multiple times to infer the answer progressively.
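Highlights: 1. Given the image feature matrix V_I and the question feature vector V_Q, the SAN uses V_Q as a query to attend over the image regions, combines the attended image vector with the query to form a refined query, and repeats this over several attention layers. Through these successive layers, the SAN filters out noise and pinpoints the regions that are highly relevant to the answer.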
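The multi-step querying can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: each layer scores every image region against the current query with a tanh projection, normalizes the scores with a softmax, and adds the attention-weighted image vector to the query. Bias terms are omitted, and the weight matrices here are hand-picked identity matrices rather than learned parameters. All names (`attention_layer`, `stacked_attention`, `W_I`, `W_Q`, `w_P`) are assumptions made for this example.

```python
import math

def softmax(scores):
    """Convert raw scores to a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_layer(regions, query, W_I, W_Q, w_P):
    """One attention hop: score each region against the query, then refine the query."""
    q_proj = matvec(W_Q, query)
    scores = []
    for v in regions:
        v_proj = matvec(W_I, v)
        h = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(w * hi for w, hi in zip(w_P, h)))
    weights = softmax(scores)
    # Attention-weighted sum of the region vectors.
    attended = [sum(p * v[d] for p, v in zip(weights, regions))
                for d in range(len(query))]
    # The refined query combines the attended image vector and the old query.
    refined = [a + q for a, q in zip(attended, query)]
    return refined, weights

def stacked_attention(regions, question, layers):
    """Apply several attention layers in sequence, starting from the question vector."""
    query = question
    all_weights = []
    for W_I, W_Q, w_P in layers:
        query, weights = attention_layer(regions, query, W_I, W_Q, w_P)
        all_weights.append(weights)
    return query, all_weights

# Toy example: three 2-d region vectors and a 2-d question vector (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
layer = (identity, identity, [1.0, 1.0])
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
question = [1.0, 0.0]
final_query, attention = stacked_attention(regions, question, [layer, layer])
print(final_query, attention)
```

In a full model the final query would feed an answer classifier, and the projections would be trained end to end. The per-layer attention weights returned here are what one would inspect to see which regions each hop focuses on.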