Image Transcription using CLIP + GPT

We do image transcription by alternating between:

Example:

creates the following search tree (after prefiltering 10 out of 10000 proposals at each decision step):

And after a few iterations comes up with:

This photo shows two bodies of a cat (seen on Flickr, April 2011) carrying a knife
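
The prefilter-then-search loop behind the example above could look roughly like the sketch below. It is not the notebook's code: it assumes that a language model proposes scored continuations and that an image-text model scores each candidate transcription against the image. The names `prefilter`, `transcribe`, `propose`, and `image_score` are hypothetical, and the toy stand-ins at the bottom replace the real models so the snippet runs on its own.

```python
import heapq
from typing import Callable, List, Tuple

Proposal = Tuple[str, float]  # (continuation, language-model score); assumed shape


def prefilter(proposals: List[Proposal], k: int) -> List[Proposal]:
    """Keep the k proposals with the highest language-model score."""
    return heapq.nlargest(k, proposals, key=lambda p: p[1])


def transcribe(
    propose: Callable[[str], List[Proposal]],  # hypothetical: text -> scored continuations
    image_score: Callable[[str], float],       # hypothetical: text -> match with the image
    prompt: str,
    steps: int,
    k: int = 10,    # proposals kept per decision step, as in the example above
    beam: int = 2,  # branches of the search tree kept alive (assumed value)
) -> str:
    beams = [prompt]
    for _ in range(steps):
        candidates = []
        for text in beams:
            # Step 1: the language model proposes, and we prefilter to k proposals.
            for token, _ in prefilter(propose(text), k):
                candidates.append(text + token)
        # Step 2: the image scorer decides which branches survive.
        beams = heapq.nlargest(beam, candidates, key=image_score)
    return max(beams, key=image_score)


# Toy stand-ins so the sketch is runnable without any model.
VOCAB = [" cat", " dog", " knife", " a", " with"]


def toy_propose(text: str) -> List[Proposal]:
    # Earlier vocabulary entries get higher scores.
    return [(word, -float(i)) for i, word in enumerate(VOCAB)]


def toy_image_score(text: str) -> float:
    # Pretend the image contains a cat and a knife.
    return float(sum(word in text for word in (" cat", " knife")))


result = transcribe(toy_propose, toy_image_score, "This photo shows", steps=2, k=3)
print(result)
```

In a real setup, `propose` would wrap the language model's next-token distribution and `image_score` would wrap an image-text similarity, both of which are assumptions here.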
