---
title: CLIP-MoCo
keywords: fastai
sidebar: home_sidebar
summary: "**CLIP**: Learning Transferable Visual Models From Natural Language Supervision"
description: "**CLIP**: Learning Transferable Visual Models From Natural Language Supervision"
nb_path: "nbs/21 - clip-moco.ipynb"
---
This module combines CLIP and MoCo to increase the number of negative samples seen by the contrastive loss. It is useful when the compute for large-batch training is not available, e.g. neither GPUs with enough memory to support large batch sizes nor a multi-GPU machine on which to run a distributed InfoNCE loss implementation. Instead of relying only on in-batch negatives, a MoCo-style queue of size `K` holds embeddings from previous batches as additional negatives, and a key encoder updated with momentum `m` keeps those queued embeddings consistent with the current model.
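To make the mechanism concrete, here is a minimal sketch of the MoCo side in isolation, assuming L2-normalized embeddings; the names (`queue`, `momentum_update`, `infonce_with_queue`) are illustrative, not the library's API:

```python
import torch
import torch.nn.functional as F

K, D, m = 4096, 512, 0.999                     # queue size, embedding dim, momentum
queue = F.normalize(torch.randn(K, D), dim=1)  # negatives kept from past batches

@torch.no_grad()
def momentum_update(encoder_q, encoder_k):
    # key encoder slowly tracks the query encoder (EMA of its weights)
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1 - m)

def infonce_with_queue(q, k, queue, t=0.07):
    # q: queries (bs x D), k: positive keys (bs x D), both L2-normalized
    l_pos = (q * k).sum(dim=1, keepdim=True)       # bs x 1, similarity to positive
    l_neg = q @ queue.t()                          # bs x K, similarity to queued negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / t  # the positive is always class 0
    return F.cross_entropy(logits, torch.zeros(len(q), dtype=torch.long))
```

Because the queue supplies `K` negatives per query, the effective number of negatives is decoupled from the batch size, which is why the tiny `bs=4` used below is workable even on CPU.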
`RetrievalAtK` is a useful proxy metric for tracking training performance and convergence: it measures how often an image's matching caption is retrieved within the top `k` most similar text embeddings (with `k="mean"` or `k="median"` reporting the mean or median rank of the correct match instead).
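As a rough illustration (the function below is a hypothetical sketch, not the library's `RetrievalAtK` implementation), retrieval@k can be computed by ranking each image's true caption among all captions by similarity:

```python
import torch
import torch.nn.functional as F

def retrieval_at_k(image_emb, text_emb, k=5):
    # image_emb, text_emb: (n x D); row i of each is a matched pair
    sims = image_emb @ text_emb.t()                # n x n similarity matrix
    ranks = sims.argsort(dim=1, descending=True)   # caption indices, best first
    target = torch.arange(len(sims)).unsqueeze(1)  # the matching caption index
    return (ranks[:, :k] == target).any(dim=1).float().mean().item()

emb = F.normalize(torch.randn(8, 16), dim=1)
retrieval_at_k(emb, emb, k=1)  # identical embeddings retrieve perfectly -> 1.0
```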
```python
from fastai.vision.all import *
from self_supervised.multimodal.clip_moco import *  # import path assumed from the notebook name

# Map MNIST_TINY folder labels ('3'/'7') to caption text
num2txt = {'3': 'three', '7': 'seven'}
def num_to_txt(o): return num2txt[o]
def dummy_targ(o): return 0  # fastai won't call the loss func without a target

path = untar_data(URLs.MNIST_TINY)
items = get_image_files(path)
clip_tokenizer = ClipTokenizer()
# Two inputs (image, tokenized caption) plus the dummy target
tds = Datasets(items, [PILImage.create, [parent_label, num_to_txt], dummy_targ],
               n_inp=2, splits=GrandparentSplitter()(items))
dls = tds.dataloaders(bs=4, after_item=[Resize(224), clip_tokenizer, ToTensor()],
                      after_batch=[IntToFloatTensor()], device='cpu')
```
```python
# ViT-B/32 config matching the tokenizer's context length and vocab size
vitb32_config_dict = vitb32_config(224, clip_tokenizer.context_length, clip_tokenizer.vocab_size)
# K: MoCo queue size, m: momentum for the key-encoder update
clip_model = CLIPMOCO(K=4096, m=0.999, **vitb32_config_dict, checkpoint=False, checkpoint_nchunks=0)
```
```python
# loss_func=noop: the contrastive loss is handled by the CLIPMOCOTrainer callback;
# ShortEpochCallback runs only a tiny fraction of the epoch as a smoke test
learner = Learner(dls, clip_model, loss_func=noop,
                  cbs=[CLIPMOCOTrainer(), ShortEpochCallback(0.001)],
                  metrics=[RetrievalAtK(k=5),
                           RetrievalAtK(k=20),
                           RetrievalAtK(k="mean"),
                           RetrievalAtK(k="median")])
learner.fit(1)
learner.recorder.losses
```