We need to work in Python notebooks (Google Colab). We will use three public benchmark datasets of social-media posts that pair images with their associated text (e.g., the MVSA dataset). The image and text data are preprocessed separately. Features are then extracted from each modality, and a model based on a NOVEL SIMILARITY METRIC is built on top of them. Finally, the work moves on to the retrieval and recommendation stage.
PS: The novelty of the above is critical.
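A minimal, dependency-free sketch of how these stages could fit together in a Colab notebook cell. Everything in it is an illustrative assumption and not part of the brief: the in-memory `posts` format, the toy bag-of-words text features, the color-histogram image features, and especially `fused_similarity`, which is a plain weighted-cosine baseline and NOT the novel similarity metric the project calls for. In a real pipeline the toy feature extractors would likely be replaced by learned embeddings for each modality, and the placeholder metric is where the novel contribution would go.

```python
import math
import re
from collections import Counter

# --- Text: separate preprocessing + toy bag-of-words features (assumption) ---
def preprocess_text(text):
    """Lowercase, drop URLs and @mentions, keep hashtag words, tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    return re.findall(r"[a-z0-9]+", text)

def text_features(tokens, vocab):
    """Term-count vector over a fixed, sorted vocabulary."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]

# --- Image: separate preprocessing + toy color-histogram features (assumption) ---
def preprocess_image(pixels):
    """Flatten rows of 0-255 RGB tuples and scale each channel to [0, 1]."""
    return [tuple(c / 255.0 for c in px) for row in pixels for px in row]

def image_features(pixels01, bins=4):
    """Per-channel histogram; each channel's bins sum to 1."""
    hist = [0.0] * (3 * bins)
    for px in pixels01:
        for ch, value in enumerate(px):
            b = min(int(value * bins), bins - 1)
            hist[ch * bins + b] += 1.0
    return [h / len(pixels01) for h in hist]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# --- PLACEHOLDER metric: replace with the project's novel similarity metric ---
def fused_similarity(q, d, alpha=0.5):
    """Hypothetical baseline: weighted sum of per-modality cosine similarities."""
    return alpha * cosine(q["img"], d["img"]) + (1 - alpha) * cosine(q["txt"], d["txt"])

def build_corpus(posts):
    """posts: {post_id: (pixels, text)} -> ({post_id: {"img", "txt"}}, vocab)."""
    tokens = {pid: preprocess_text(text) for pid, (_, text) in posts.items()}
    vocab = sorted({t for toks in tokens.values() for t in toks})
    corpus = {pid: {"img": image_features(preprocess_image(pixels)),
                    "txt": text_features(tokens[pid], vocab)}
              for pid, (pixels, _) in posts.items()}
    return corpus, vocab

def retrieve(query, corpus, k=3, sim=fused_similarity):
    """Top-k post ids ranked by similarity to the query (ties broken by id)."""
    scored = sorted((-sim(query, item), pid) for pid, item in corpus.items())
    return [pid for _, pid in scored[:k]]

def recommend(history_ids, corpus, k=3, sim=fused_similarity):
    """Rank unseen posts by mean similarity to the user's history."""
    history = [corpus[i] for i in history_ids]
    scored = sorted(
        (-sum(sim(h, item) for h in history) / len(history), pid)
        for pid, item in corpus.items() if pid not in history_ids)
    return [pid for _, pid in scored[:k]]

# --- Usage on three toy posts (made-up data, not from MVSA) ---
RED, BLUE = (255, 0, 0), (0, 0, 255)
posts = {
    "p1": ([[RED, RED], [RED, RED]], "Sunset at the beach #happy"),
    "p2": ([[RED, RED], [RED, BLUE]], "Beach day, so happy! http://t.co/x"),
    "p3": ([[BLUE, BLUE], [BLUE, BLUE]], "Stuck in traffic @boss #sad"),
}
corpus, vocab = build_corpus(posts)
print(retrieve(corpus["p1"], corpus, k=2))
print(recommend(["p1"], corpus, k=2))
```

This prints `['p1', 'p2']` (the query ranks first against itself, then the other red, beach-themed post) and `['p2', 'p3']`. Passing the metric in as the `sim` parameter keeps retrieval and recommendation independent of it, so a novel metric can be swapped in without touching the rest of the pipeline.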