This paper proposes a method that generates an unambiguous description (known as a referring expression) of a specific object or region in an image, and that can also comprehend such an expression to infer which object is being described.
Highlights: 1. A new large-scale dataset for referring expressions. 2. Discriminative training: given a training sample (I, R, S), where I is the image, R is the region, and S is the sentence, the model is trained to output a high p(S|R,I) while maintaining a low p(S|R’,I) whenever R’ != R. 3. Semi-supervised training: first train a model on a small dataset. Then use this model to generate a set of descriptions for a large dataset that has bounding boxes. Finally, train another ensemble model on the same small dataset and use it to verify the auto-generated descriptions.
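The discriminative objective in highlight 2 can be illustrated with a minimal sketch. It assumes the "high p(S|R,I), low p(S|R’,I)" goal is realized as a softmax over candidate regions, which is one common way to do it and not necessarily the exact loss used in the paper. The function names `discriminative_loss` and `comprehend` are hypothetical, and the log-probabilities stand in for the outputs of whatever sentence model is used.

```python
import math


def discriminative_loss(log_probs, true_idx):
    """Loss for one sample (I, R, S) under an assumed softmax formulation.

    log_probs[i] holds log p(S | R_i, I) for each candidate region R_i in
    the image, and true_idx is the index of the region R that the sentence
    actually describes. The loss is
        -log( p(S|R,I) / sum_i p(S|R_i,I) ),
    so it falls when p(S|R,I) rises or when p(S|R',I) drops for R' != R.
    """
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(log_probs)
    log_norm = m + math.log(sum(math.exp(lp - m) for lp in log_probs))
    return log_norm - log_probs[true_idx]


def comprehend(log_probs):
    """Comprehension: return the index of the region that best explains S."""
    return max(range(len(log_probs)), key=lambda i: log_probs[i])


if __name__ == "__main__":
    # Toy scores for three candidate regions; region 0 is the true one.
    scores = [math.log(0.5), math.log(0.25), math.log(0.25)]
    print(discriminative_loss(scores, true_idx=0))
    print(comprehend(scores))
```

Under this formulation the same scores serve both tasks: training pushes the true region's share of the total probability up, and comprehension picks the region with the highest score.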