That's my dog Sumo; the model has never seen him before :)
I've wanted to build multimodal models for a while now, and what better way to start than with Image Captioning, which is kinda like the hello world of multimodal models.
I used the following 2 models:
- ViT Base, patch size = 16, image size = 224
- GPT2 small
- I prepared the architecture almost from scratch.
- I extracted the useful ViT layers from the `timm` package and used them as the encoder with the pretrained weights (a minimal sketch follows right after this list).
- As for GPT2, I coded the entirety from scratch and added a new cross-attention layer in the decoder block to get a standard encoder-decoder transformer (also sketched after this list).
- GPT2 weights were loaded via HuggingFace. Refer to NanoGPT.
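
A minimal sketch of pulling the ViT encoder out of `timm` (the model name and the `forward_features` call are just the standard timm way of doing it, not an exact copy of my code):

```python
import timm
import torch

# Pretrained ViT-Base, patch size 16, image size 224, from timm.
vit = timm.create_model("vit_base_patch16_224", pretrained=True)
vit.reset_classifier(0)  # drop the classification head; we only need features

images = torch.randn(2, 3, 224, 224)          # dummy batch
with torch.no_grad():
    feats = vit.forward_features(images)      # (2, 197, 768): CLS token + 196 patch tokens

print(feats.shape)  # these tokens become the keys/values for cross-attention
```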
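And a rough sketch of a GPT2-style decoder block with the extra cross-attention sublayer; `nn.MultiheadAttention` and the module names here are illustrative stand-ins for the from-scratch implementation:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """GPT2-style block with an added cross-attention sublayer over the ViT tokens."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_cross = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, enc_out, causal_mask):
        # Masked self-attention over the caption tokens (standard GPT2).
        h = self.ln_1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        # New sublayer: queries come from the text, keys/values from the ViT features.
        h = self.ln_cross(x)
        x = x + self.cross_attn(h, enc_out, enc_out, need_weights=False)[0]
        # Position-wise MLP.
        return x + self.mlp(self.ln_2(x))

# Usage: a boolean causal mask that blocks attention to future positions.
T = 16
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
```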
The dataset I used was COCO 2017, with options for Flickr30k and Flickr8k.
- The dataset preparation was also done from scratch: my code goes into detail about how to prepare the labels for causal language modeling, calculating the loss while ignoring special tokens, etc. (see the label-preparation sketch after this list).
- Dynamic padding with a custom collate function that pads sequences to the longest caption in the batch rather than the model's max length (see the collate sketch after this list).
- The training loop was written from scratch; the metric I used was perplexity = e^loss.
- I trained with mixed-precision fp16 using `torch.amp` (see the training-step sketch after this list).
- I initially trained only the randomly initialized cross-attention layers, then in later epochs I finetuned the entire GPT2, and later still the entire ViT-GPT2 model (see the staged finetuning sketch after this list).
- Standard `torch.multinomial` sampling-based generation with temperature control (see the generation sketch after this list).
- Support for deterministic generation with `torch.argmax`.
- The results are good, not great; I only trained on about 30% of the training samples in COCO.
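
A minimal sketch of the label preparation for causal language modeling: the decoder input is the caption shifted by one relative to the targets, and padded/special positions are set to the `ignore_index` of the cross-entropy loss (the pad id here is an assumption):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0           # assumed pad token id
IGNORE_INDEX = -100  # default ignore_index of F.cross_entropy

def make_inputs_and_labels(token_ids):
    """token_ids: (B, T) captions of the form <bos> w1 ... wn <eos> plus padding."""
    inputs = token_ids[:, :-1]                  # model sees all but the last token
    labels = token_ids[:, 1:].clone()           # target is the next token at each step
    labels[labels == PAD_ID] = IGNORE_INDEX     # pad positions contribute no loss
    return inputs, labels

def caption_loss(logits, labels):
    """logits: (B, T, vocab_size), labels: (B, T)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```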
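The dynamic padding is just a custom `collate_fn`: each batch is padded to its own longest caption, not to GPT2's full context length (the field layout is illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed pad token id

def collate_fn(batch):
    """batch: list of (image_tensor, caption_ids) pairs with variable caption lengths."""
    images = torch.stack([img for img, _ in batch])      # (B, 3, 224, 224)
    captions = [ids for _, ids in batch]                 # 1D tensors of different lengths
    captions = pad_sequence(captions, batch_first=True, padding_value=PAD_ID)
    return images, captions                              # captions: (B, longest_in_batch)

# loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, collate_fn=collate_fn)
```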
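A rough sketch of one mixed-precision training step with `torch.amp`, with perplexity computed as e^loss (the model and optimizer are placeholders, and the loss masking matches the sketch above):

```python
import math
import torch
import torch.nn.functional as F

scaler = torch.amp.GradScaler("cuda")  # torch.cuda.amp.GradScaler() on older PyTorch

def train_step(model, optimizer, images, inputs, labels):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in fp16 where it is safe to do so.
    with torch.amp.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(images, inputs)                   # (B, T, vocab_size)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            ignore_index=-100,
        )
    # Scale the loss so fp16 gradients don't underflow, then step and update the scale.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item(), math.exp(loss.item())            # loss, perplexity = e^loss
```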
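The staged finetuning schedule boils down to toggling `requires_grad` on parameter groups; attribute names like `model.gpt2`, `model.vit`, and `block.cross_attn` are assumptions about how the modules might be organized:

```python
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    """Freeze/unfreeze parameter groups for the staged finetuning schedule."""
    if stage == 1:
        # Stage 1: only the freshly initialized cross-attention layers learn.
        set_trainable(model, False)
        for block in model.gpt2.blocks:
            set_trainable(block.cross_attn, True)
    elif stage == 2:
        # Stage 2: unfreeze all of GPT2, keep the ViT encoder frozen.
        set_trainable(model, False)
        set_trainable(model.gpt2, True)
    else:
        # Stage 3: finetune the entire ViT-GPT2 model end to end.
        set_trainable(model, True)
```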
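Finally, a sketch of the generation loop: temperature-scaled `torch.multinomial` sampling, with `torch.argmax` as the deterministic option (the special token ids and the `model(image, tokens)` call are assumptions):

```python
import torch

BOS_ID, EOS_ID = 1, 2  # assumed special token ids

@torch.no_grad()
def generate(model, image, max_len=40, temperature=1.0, greedy=False):
    """Autoregressively decode one caption for a single (1, 3, 224, 224) image."""
    tokens = torch.tensor([[BOS_ID]], device=image.device)
    for _ in range(max_len):
        logits = model(image, tokens)              # (1, T, vocab_size)
        logits = logits[:, -1, :] / temperature    # only the last position matters
        if greedy:
            next_id = torch.argmax(logits, dim=-1, keepdim=True)   # deterministic
        else:
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)      # temperature-controlled sampling
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == EOS_ID:
            break
    return tokens.squeeze(0)
```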
| Epoch | Train Loss | Train Perplexity | Val Loss | Val Perplexity |
|---|---|---|---|---|
| 0 | 5.164732 | 174.990611 | 3.288565 | 26.804375 |
| 1 | 2.668888 | 14.423919 | 2.341017 | 10.391795 |
| 2 | 2.30841 | 10.058415 | 2.201064 | 9.034617 |
| 3 | 2.033982 | 7.64447 | 2.099659 | 8.163385 |
| 4 | 1.855595 | 6.395501 | 2.08667 | 8.058035 |
See more here.
Psalm 32:8