Image and Video Captioning

Technologies Used

TensorFlow, Python, Computer Vision, NLP, Deep Learning, Transformers, CNNs, RNNs

Project Overview

I worked on this project during my participation in the Artificial Intelligence and Multi-Agent Systems Summer School at the AI-MAS Laboratory, University Politehnica of Bucharest. The focus was on improving an existing image and video captioning model through curriculum learning.

Key Features

  • Exploring different curriculum learning strategies for training the captioning model
  • Ordering training samples by increasing difficulty, measured via caption length, syntactic structure, and vocabulary difficulty (see the sketch after this list)
  • Evaluating comprehensively on standard datasets (Flickr8k, MSVD)
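
As an illustration, here is a minimal sketch of how training examples might be ordered by such a difficulty score. The scoring heuristics and weights below are assumptions for demonstration, not the exact criteria used in the project:

```python
import re

def caption_difficulty(caption, vocab_freq):
    """Heuristic difficulty score for a caption.

    Combines caption length, a crude proxy for syntactic complexity,
    and vocabulary rarity. The weights and heuristics are illustrative
    assumptions, not the project's actual criteria.
    """
    tokens = re.findall(r"[a-z']+", caption.lower())
    length_score = len(tokens)
    # Commas and subordinating words as a rough proxy for syntactic structure.
    syntax_score = caption.count(",") + sum(t in {"and", "which", "while"} for t in tokens)
    # Rare words (low corpus frequency) make a caption harder.
    rarity_score = sum(1.0 / max(vocab_freq.get(t, 1), 1) for t in tokens)
    return length_score + 2.0 * syntax_score + 5.0 * rarity_score

def curriculum_order(samples, vocab_freq):
    """Sort (image_id, caption) pairs from easiest to hardest."""
    return sorted(samples, key=lambda s: caption_difficulty(s[1], vocab_freq))

if __name__ == "__main__":
    # Toy word-frequency table and samples (hypothetical data).
    vocab_freq = {"a": 1000, "dog": 350, "runs": 200, "terrier": 5, "leaps": 8}
    samples = [
        ("img1", "A dog runs."),
        ("img2", "A small terrier leaps over a weathered fence, chasing a ball."),
    ]
    # Feed the sorted samples to the trainer in stages, easiest first.
    for image_id, caption in curriculum_order(samples, vocab_freq):
        print(image_id, caption)
```

In a curriculum schedule, the trainer would start with the lowest-scoring captions and gradually admit harder ones as training progresses.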

Results

The implemented models achieved competitive performance on standard benchmarks, with particular improvements in BLEU scores on the Flickr8k dataset. However, because the model spent longer training on simpler examples, the predicted captions tended to be more generic and less descriptive.

Learning Outcomes

This project provided valuable hands-on experience with cutting-edge deep learning research, combining computer vision and natural language processing. It enhanced my understanding of both fields.