Multimodal learning, i.e. learning simultaneously from different data sources such as audio, text, and images, is a rapidly emerging field of Machine Learning. It can be viewed as learning at a higher level of abstraction, enabling more complicated problems such as generating cartoons from a plot or recognizing speech from lip movement.
In this work, we investigate whether state-of-the-art multimodal learning techniques can solve the problem of recommending the most relevant images for a Wikipedia article. In other words, we need to create a shared text-image representation of the abstract notion an article describes, so that, given only a text description, the machine would "understand" which images visualize the same notion accurately.
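To make the idea of a shared text-image representation concrete, here is a minimal retrieval sketch: candidate images are ranked for an article by cosine similarity between their embeddings and the article's text embedding in a joint space. It uses CLIP from the Hugging Face `transformers` library purely as an illustration; the model choice, the example article text, the image file names, and the ranking scheme are assumptions for this sketch, not necessarily the approach taken in the thesis.

```python
# Illustrative sketch only: rank candidate images for an article text
# by cosine similarity in a shared text-image embedding space.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# The pretrained model is an assumption for this sketch.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical inputs: an article summary and a few candidate image files.
article_text = "Gdansk is a city on the Baltic coast of Poland."
candidate_paths = ["candidate_1.jpg", "candidate_2.jpg", "candidate_3.jpg"]
images = [Image.open(p) for p in candidate_paths]

inputs = processor(
    text=[article_text],
    images=images,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# Embed both modalities in the shared space and L2-normalize.
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)

# Cosine similarity of each candidate image with the article text, highest first.
scores = (image_emb @ text_emb.T).squeeze(-1)
for idx in scores.argsort(descending=True):
    print(candidate_paths[int(idx)], float(scores[idx]))
```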
The dataset was collected from Wikipedia and uploaded to Kaggle using the pyWikiMM library. For more details about dataset collection and structure, please refer to its description on Kaggle.
You can reproduce the results either by setting up the environment locally or by cloning the notebooks from Kaggle. The following Kaggle notebooks are available:
You can find more details about the problem statement and our solution approach in the "Image Recommendation for Wikipedia Articles" thesis.
Special thanks to Miriam Redi for actively mentoring me in this project.