Paper: https://arxiv.org/pdf/1804.01452.pdf
Requirements: Python 3.6, TensorFlow 1.8, wavio, python_speech_features
1) Download the Flickr8k speech caption files and image files.
2) In the data folder, flickr8k.pkl provides the image/caption pairing information. See the main_SISA or MISA Python file for details on how this pickle file is used.
3) Run python main_SISA/MISA.py
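
Before running the main script, it can help to check what flickr8k.pkl actually contains. The following is a minimal sketch, not code from this repository: load_pairs and describe are hypothetical helper names, and nothing is assumed about the pickle's internal layout beyond it being a standard Python pickle.

```python
import pickle


def load_pairs(path):
    """Load a pickle file and return whatever object it holds."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj, max_items=3):
    """Return a short summary of the object's top-level structure."""
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["len"] = len(obj)
    if isinstance(obj, dict):
        # Show only the first few keys; the real key names are unknown here.
        summary["keys"] = list(obj.keys())[:max_items]
    elif isinstance(obj, (list, tuple)):
        summary["head"] = list(obj[:max_items])
    return summary
```

For example, print(describe(load_pairs("data/flickr8k.pkl"))) shows the top-level type and size, assuming the file sits at that path. The main_SISA or MISA file remains the authoritative reference for how the pairs are used.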
Results on the test set, which consists of the last 1000 images and their captions:
R@1: 0.027, R@5: 0.127, R@10: 0.245
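
The "last 1000" test split described above can be sketched as a simple slice. This assumes the image/caption pairs are held in an ordered sequence; split_train_test is a hypothetical helper, not a function from this repository.

```python
def split_train_test(items, n_test=1000):
    """Hold out the last n_test items of an ordered sequence as the test set."""
    if not 0 < n_test < len(items):
        raise ValueError("n_test must be between 1 and len(items) - 1")
    return items[:-n_test], items[-n_test:]
```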
1) Image-to-caption retrieval
2) ...
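
For reference, the R@K figures reported above are recall at K: the fraction of queries whose correct match is ranked among the top K candidates. The sketch below shows one common way to compute it with NumPy. It is an illustration under stated assumptions, not the evaluation code of this repository: recall_at_k is a hypothetical name, the similarity matrix is assumed to be square, and the correct candidate for query i is assumed to be candidate i.

```python
import numpy as np


def recall_at_k(sim, ks=(1, 5, 10)):
    """Compute recall@K from a square similarity matrix.

    sim[i, j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the correct candidate for each query, as a column vector.
    correct = np.diag(sim)[:, None]
    # 0-based rank = number of candidates scoring strictly higher.
    ranks = (sim > correct).sum(axis=1)
    return {k: float((ranks < k).mean()) for k in ks}
```

With image embeddings as queries and caption embeddings as candidates, sim would be the image-by-caption similarity matrix over the test pairs. Ties are resolved in favour of the correct candidate here, which is a design choice of this sketch.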