---
title: CLIP-MoCo
keywords: fastai
sidebar: home_sidebar
summary: "**CLIP**: Learning Transferable Visual Models From Natural Language Supervision"
description: "**CLIP**: Learning Transferable Visual Models From Natural Language Supervision"
nb_path: "nbs/21 - clip-moco.ipynb"
---
{% raw %}
{% endraw %}

This module combines CLIP and MoCo to increase the number of negative samples seen by the contrastive loss. It is useful when compute is limited, for example when you don't have GPUs with enough memory to support large batch sizes, or multi-GPU machines that could leverage a distributed InfoNCE loss implementation.
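
To make the idea concrete, here is a minimal sketch of an InfoNCE loss whose negatives come from a MoCo-style FIFO queue of past embeddings rather than from the current batch. This is illustrative only, not the library's implementation; names such as `infonce_with_queue`, `img_q`, `txt_k` and `queue` are made up for the example.

```python
import torch
import torch.nn.functional as F

def infonce_with_queue(img_q, txt_k, queue, temperature=0.07):
    "InfoNCE where the negatives are queued text embeddings from earlier batches."
    img_q, txt_k, queue = (F.normalize(t, dim=-1) for t in (img_q, txt_k, queue))
    pos = (img_q * txt_k).sum(dim=-1, keepdim=True)   # (B, 1) similarity to the paired caption
    neg = img_q @ queue.t()                           # (B, K) similarity to queued negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)  # positives sit at index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def dequeue_and_enqueue(queue, new_keys):
    "FIFO queue update: drop the oldest entries, append the newest key embeddings."
    return torch.cat([queue[new_keys.size(0):], new_keys.detach()])
```

With a queue of size K, each query sees K negatives regardless of the batch size, which is what makes small-batch training viable.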

{% raw %}
{% endraw %} {% raw %}
{% endraw %}

Algorithm

CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)

MoCo: Momentum Contrast for Unsupervised Visual Representation Learning (He et al., 2020)

Tokenizer

{% raw %}

class ClipTokenizer[source]

ClipTokenizer(context_length=77) :: DisplayedTransform

Tokenizer from https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py
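
Since `ClipTokenizer` is a fastai transform, it can be applied directly to a caption string. A small usage sketch (the printed shape assumes the default `context_length=77`):

```python
clip_tokenizer = ClipTokenizer(context_length=77)
tokens = clip_tokenizer("a photo of a three")  # BPE-encode, then pad/truncate to context_length
print(tokens.shape)                            # expected: torch.Size([77])
```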

{% endraw %} {% raw %}
{% endraw %}

Model

{% raw %}

vitb32_config[source]

vitb32_config(input_res, context_length, vocab_size)

ViT-B/32 configuration, uses 32x32 patches
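
For example, building a ViT-B/32 CLIP-MoCo model from this configuration (49,408 is the standard CLIP BPE vocabulary size; see also the full example at the end of this page):

```python
cfg = vitb32_config(input_res=224, context_length=77, vocab_size=49408)
model = CLIPMOCO(K=4096, m=0.999, **cfg, checkpoint=False, checkpoint_nchunks=0)
```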

{% endraw %} {% raw %}
{% endraw %} {% raw %}

vitl14_config[source]

vitl14_config(input_res, context_length, vocab_size)

ViT-L/14 configuration, uses 14x14 patches

{% endraw %} {% raw %}
{% endraw %} {% raw %}

class Bottleneck[source]

Bottleneck(inplanes, planes, stride=1) :: Module

A ResNet bottleneck block as used in CLIP's ModifiedResNet: all convolutions have stride 1, and when stride > 1 an average pool performs the downsampling instead of a strided convolution.

{% endraw %} {% raw %}

class AttentionPool2d[source]

AttentionPool2d(spacial_dim:int, embed_dim:int, num_heads:int, output_dim:int=None) :: Module

Attention pooling over a 2D feature map: spatial positions are flattened, a learned positional embedding is added, and multi-head QKV attention (with the mean-pooled feature as the query) produces a single embedding of size output_dim. Used as the final pooling layer of ModifiedResNet.

{% endraw %} {% raw %}

class ModifiedResNet[source]

ModifiedResNet(layers, output_dim, heads, input_resolution=224, width=64) :: Module

A ResNet class that is similar to torchvision's but contains the following changes:

  • There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
  • Performs anti-aliased strided convolutions, where an avgpool is prepended to convolutions with stride > 1 (see the sketch below)
  • The final pooling layer is a QKV attention instead of an average pool
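
A rough sketch of the anti-aliased downsampling pattern from the second point (illustrative only, not the library's exact layer layout):

```python
import torch.nn as nn

def downsample(in_ch: int, out_ch: int, stride: int) -> nn.Module:
    # average-pool first, then a stride-1 convolution, instead of a single strided convolution
    if stride > 1:
        return nn.Sequential(nn.AvgPool2d(stride),
                             nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1, bias=False))
    return nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1, bias=False)
```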
{% endraw %} {% raw %}

class LayerNorm[source]

LayerNorm(normalized_shape:Union[int, List[int], Size], eps:float=1e-05, elementwise_affine:bool=True) :: LayerNorm

Subclass torch's LayerNorm to handle fp16.
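
The usual trick, as in OpenAI's CLIP code, is to run the normalization in fp32 and cast back to the input dtype. A minimal sketch under an illustrative name:

```python
import torch
import torch.nn as nn

class LayerNormFP32(nn.LayerNorm):
    "LayerNorm that computes in fp32 even when the input is fp16, then casts back."
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        out = super().forward(x.to(torch.float32))
        return out.to(orig_dtype)
```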

{% endraw %} {% raw %}

class QuickGELU[source]

QuickGELU() :: Module

A fast approximation of the GELU activation used by CLIP: x * sigmoid(1.702 * x).

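The activation itself is a one-liner; a sketch under an illustrative name:

```python
import torch
import torch.nn as nn

class QuickGELUSketch(nn.Module):
    "Sigmoid-based GELU approximation: x * sigmoid(1.702 * x)."
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)
```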
{% endraw %} {% raw %}

class ResidualAttentionBlock[source]

ResidualAttentionBlock(d_model:int, n_head:int, attn_mask:Tensor=None) :: Module

A pre-norm transformer block: multi-head self-attention followed by an MLP, each preceded by LayerNorm and wrapped in a residual connection. An optional attn_mask (e.g. a causal mask for the text encoder) is applied inside the attention.

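For intuition, a pre-norm residual attention block looks roughly like this (a sketch with illustrative names; the actual class also supports `attn_mask` and uses QuickGELU in the MLP):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    # x + attn(ln(x)), then x + mlp(ln(x))
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp  = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                  nn.Linear(4 * d_model, d_model))

    def forward(self, x):                    # x: (seq_len, batch, d_model)
        y = self.ln_1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))
        return x
```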
{% endraw %} {% raw %}

class Transformer[source]

Transformer(width:int, layers:int, heads:int, attn_mask:Tensor=None, checkpoint=False, checkpoint_nchunks=2) :: Module

A stack of layers ResidualAttentionBlocks with width width and heads attention heads. Setting checkpoint=True enables gradient checkpointing, splitting the stack into checkpoint_nchunks chunks to reduce activation memory at the cost of extra compute.

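The `checkpoint` flag trades compute for memory via gradient checkpointing. Conceptually (a sketch using `torch.utils.checkpoint`, not necessarily the library's exact call):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

def run_blocks(blocks: nn.Sequential, x, use_checkpoint: bool = False, nchunks: int = 2):
    # with checkpointing, only nchunks sets of activations are stored;
    # the rest are recomputed during the backward pass
    if use_checkpoint:
        return checkpoint_sequential(blocks, nchunks, x)
    return blocks(x)
```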
{% endraw %} {% raw %}

class VisualTransformer[source]

VisualTransformer(input_resolution:int, patch_size:int, width:int, layers:int, heads:int, output_dim:int, **kwargs) :: Module

The Vision Transformer image encoder: a convolutional patch embedding (kernel size and stride equal to patch_size), a class token with learned positional embeddings, a Transformer of the given width, layers and heads, and a final projection to output_dim.

{% endraw %} {% raw %}

class CLIPMOCO[source]

CLIPMOCO(embed_dim:int, image_resolution:int, vision_layers:Union[Tuple[int, int, int, int], int], vision_width:int, vision_patch_size:int, context_length:int, vocab_size:int, transformer_width:int, transformer_heads:int, transformer_layers:int, K=4096, m=0.999, **kwargs) :: Module

A CLIP model (a visual encoder, either ModifiedResNet or VisualTransformer, plus a text Transformer) adapted for MoCo-style training: K is the size of the queue of negative embeddings and m is the momentum used to update the key encoders as an exponential moving average of the query encoders.

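The MoCo part keeps momentum ("key") encoders that follow the trained encoders as an exponential moving average and are never updated by gradients. A sketch of that update with illustrative names:

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q: torch.nn.Module, encoder_k: torch.nn.Module, m: float = 0.999):
    # key encoder parameters <- m * key + (1 - m) * query
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```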
{% endraw %} {% raw %}
{% endraw %}

Metric

Retrieval-at-k is a useful proxy metric for tracking training performance and convergence during contrastive image-text training.

{% raw %}

class RetrievalAtK[source]

RetrievalAtK(k=20, **kwargs) :: AccumMetric

Stores predictions and targets on CPU in accumulate to perform final calculations with func.
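
For reference, a self-contained sketch of how an image-to-text retrieval-at-k score can be computed from paired embeddings (not the library's implementation):

```python
import torch
import torch.nn.functional as F

def retrieval_at_k(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int = 20) -> float:
    "Fraction of images whose paired caption is among their k most similar captions."
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t()                 # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices          # (N, k) best-matching caption indices
    target = torch.arange(len(img_emb))[:, None]
    return (topk == target).any(dim=-1).float().mean().item()
```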

{% endraw %} {% raw %}
{% endraw %}

CLIP-MoCo Callback

{% raw %}

class CLIPMOCOTrainer[source]

CLIPMOCOTrainer(after_create=None, before_fit=None, before_epoch=None, before_train=None, before_batch=None, after_pred=None, after_loss=None, before_backward=None, before_step=None, after_cancel_step=None, after_step=None, after_cancel_batch=None, after_batch=None, after_cancel_train=None, after_train=None, before_validate=None, after_cancel_validate=None, after_validate=None, after_cancel_epoch=None, after_epoch=None, after_cancel_fit=None, after_fit=None) :: Callback

MoCo Loss for CLIP. Can be used with or without DistributedDataParallel.

{% endraw %} {% raw %}
{% endraw %}

Example Usage

{% raw %}
# map MNIST_TINY folder labels ('3', '7') to caption words used as text inputs
num2txt = {'3': 'three', '7': 'seven'}
def num_to_txt(o): return num2txt[o]
def dummy_targ(o): return 0 # loss func is not called without it
{% endraw %} {% raw %}
path = untar_data(URLs.MNIST_TINY)
items = get_image_files(path)
clip_tokenizer = ClipTokenizer()
# two inputs per sample (image, tokenized caption) plus a dummy target that the trainer ignores
tds = Datasets(items, [PILImage.create, [parent_label, num_to_txt], dummy_targ], n_inp=2, splits=GrandparentSplitter()(items))
dls = tds.dataloaders(bs=2, after_item=[Resize(224), clip_tokenizer, ToTensor()], after_batch=[IntToFloatTensor()], device='cpu')
{% endraw %} {% raw %}
vitb32_config_dict = vitb32_config(224, clip_tokenizer.context_length, clip_tokenizer.vocab_size)
clip_model = CLIPMOCO(K=4096, m=0.999, **vitb32_config_dict, checkpoint=False, checkpoint_nchunks=0)
# loss_func=noop: the CLIPMOCOTrainer callback computes the contrastive loss;
# ShortEpochCallback keeps this demo run tiny
learner = Learner(dls, clip_model, loss_func=noop, cbs=[CLIPMOCOTrainer(), ShortEpochCallback(0.001)],
                  metrics=[RetrievalAtK(k=5),
                           RetrievalAtK(k=20),
                           RetrievalAtK(k="mean"),
                           RetrievalAtK(k="median")])
{% endraw %}
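
Training then proceeds like any other fastai `Learner`, for example (a short illustrative run):

```python
learner.fit(1, lr=1e-4)
```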