Welcome to VLMs Zero to Hero! This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models. Here are the papers we'll cover along the way:
- Word2Vec: Efficient Estimation of Word Representations in Vector Space (2013) and Distributed Representations of Words and Phrases and their Compositionality (2013)
- Seq2Seq: Sequence to Sequence Learning with Neural Networks (2014)
- Attention Is All You Need (2017)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
- GPT: Improving Language Understanding by Generative Pre-Training (2018)
- AlexNet: ImageNet Classification with Deep Convolutional Neural Networks (2012)
- VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
- ResNet: Deep Residual Learning for Image Recognition (2015)
- Show and Tell: A Neural Image Caption Generator (2014) and Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)
- CLIP: Learning Transferable Visual Models from Natural Language Supervision (2021)
- Scaling Laws for Neural Language Models (2020)
- LoRA: Low-Rank Adaptation of Large Language Models (2021)
- QLoRA: Efficient Finetuning of Quantized LLMs (2023)
- Flamingo: A Visual Language Model for Few-Shot Learning (2022)
- LLaVA: Visual Instruction Tuning (2023)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)
- PaliGemma: A versatile 3B VLM for transfer (2024)
Are there important papers, models, or techniques we missed? Do you have a favorite breakthrough in vision-language research that isn't listed here? We’d love to hear your suggestions!