Internship: Probing joint vision-and-language representations

Apply before: 01/02/2021

How to apply:
To apply, send an email to

Summary:
Joint vision-and-language representations are built by self-supervised training, on large datasets, of models that take both text and images as input. Although such representations can be fine-tuned for tasks such as visual question answering, image captioning, and multimodal understanding, little is known about the information they encode and how it is used across tasks. The objective of this internship is to probe multimodal representations in order to better understand their inner workings. The candidate will propose probes and analysis methods for testing high-level structural and semantic concepts in various vision-and-language representations. In particular, the candidate will explore how data, neural architecture, and self-supervised training tasks affect such probes. The work will be implemented within the MMF framework, which itself relies on PyTorch.
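To illustrate the probing methodology mentioned above, here is a minimal sketch of a linear probe in PyTorch. It assumes access to frozen joint embeddings (random tensors stand in for them here); the dimensions, labels, and concept ("does the caption mention a spatial relation?") are hypothetical placeholders, not part of the internship specification:

```python
import torch
import torch.nn as nn

# Hypothetical setup: frozen joint vision-and-language embeddings of
# dimension EMB_DIM. Random tensors stand in for real model outputs.
EMB_DIM, N_CLASSES, N_SAMPLES = 768, 2, 256
torch.manual_seed(0)

# Synthetic "frozen" features and binary concept labels (placeholders).
embeddings = torch.randn(N_SAMPLES, EMB_DIM)
labels = torch.randint(0, N_CLASSES, (N_SAMPLES,))

# A linear probe: a single trainable layer on top of frozen features.
# If the probe can decode the concept, the representation encodes it.
probe = nn.Linear(EMB_DIM, N_CLASSES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(50):
    optimizer.zero_grad()
    logits = probe(embeddings)   # embeddings are fixed: only the probe trains
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

accuracy = (probe(embeddings).argmax(dim=1) == labels).float().mean().item()
print(f"probe accuracy: {accuracy:.2f}")
```

In practice the embeddings would come from a pretrained multimodal model served through MMF, and probe accuracy would be compared across architectures and pretraining tasks.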