Tensor Model Parallelism Tutorial

  • The concept of tensor model parallelism can be found here.

  • The source code of this tutorial can be found here.

0. Distributed Launcher

This tutorial must be launched using a distributed launcher.

If you have 4 GPUs:

python -m torch.distributed.launch --nproc_per_node=4 YOUR_SCRIPT.py
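
On recent PyTorch releases, the torchrun entry point (which replaces torch.distributed.launch) should work the same way:

torchrun --nproc_per_node=4 YOUR_SCRIPT.py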

If you have installed DeepSpeed in your environment, the following command works the same way.

deepspeed --num_gpus=4 YOUR_SCRIPT.py

For more information about distributed launchers, refer to:

1. Inference

How to use tensor model parallelism for inference?

1.1. Create model and tokenizer

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

1.2. Parallelize the model

  • tensor_parallel_size must be the same as or smaller than the total number of GPUs.

  • tensor_parallel_size must be a power of 2. (e.g. 2, 4, 8, 16, …)

  • tensor_parallel_size must be a positive number.

  • The hidden size must be divisible by tensor_parallel_size.

  • The number of attention heads must be divisible by tensor_parallel_size. (A small validation sketch follows this list.)
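
If you want to sanity-check these constraints before parallelizing, a helper like the one below can be used. This is a minimal sketch, not part of OSLO: validate_tp_size is a hypothetical name, and n_embd / n_head are the GPT-2 config attributes for the hidden size and the number of attention heads.

import torch

def validate_tp_size(tp_size, config):
    # Hypothetical helper: checks the constraints listed above
    # for a GPT-2 style config (n_embd = hidden size, n_head = attention heads).
    assert tp_size > 0, "tensor_parallel_size must be positive"
    assert tp_size & (tp_size - 1) == 0, "tensor_parallel_size must be a power of 2"
    assert tp_size <= torch.cuda.device_count(), "tensor_parallel_size exceeds the number of GPUs"
    assert config.n_embd % tp_size == 0, "hidden size must be divisible by tensor_parallel_size"
    assert config.n_head % tp_size == 0, "number of heads must be divisible by tensor_parallel_size"

validate_tp_size(4, model.config)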

import oslo

model = oslo.initialize(
    model, config={"model_parallelism": {"enable": True, "tensor_parallel_size": NUM_YOUR_GPUS}}
)

You can also use a JSON file (example for 4 GPUs):

// oslo-config.json

{
    "model_parallelism": {
        "enable": true,
        "tensor_parallel_size": 4
    }
}

Then you can use the JSON file like this:

model = oslo.initialize(model, config="oslo-config.json")

1.3. Do inference as usual

The following is an example of text generation. Tensor model parallelism can also be used for other tasks such as sequence classification or masked language modeling; in all cases, you write the code as usual.

text = "I don't want a lot for Christmas. There is just one thing"
tokens = tokenizer(text, return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**tokens, num_beams=3)[0]))
I don't want a lot for Christmas. There is just one thing I want to ...
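
Note that a distributed launcher runs this script once per GPU, so every process will print the generated text. If you want a single output, you can guard the print with a rank check. This is a minimal sketch using torch.distributed, assuming the default process group was already initialized while parallelizing the model:

import torch.distributed as dist

# All ranks must call generate() because the model is sharded across GPUs;
# only rank 0 prints so the output appears once instead of once per GPU.
output = model.generate(**tokens, num_beams=3)
if dist.get_rank() == 0:
    print(tokenizer.decode(output[0]))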

2. Training

How to use tensor model parallelism for training?

2.1. Initialize some variables

BATCH_SIZE = 4
SEQ_LEN = 64
SAVE_INTERVAL = 50
TRAIN_STEP = 100

2.2. Create model, optimizer and tokenizer

from torch.optim import Adam
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = Adam(model.parameters(), lr=3e-5)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The GPT2 tokenizer doesn't have a pad token,
# so add one (the EOS token) for batch training.
tokenizer.pad_token = tokenizer.eos_token

2.3. Parallelize the model

import oslo

model = oslo.initialize(
    model, config={"model_parallelism": {"tensor_parallel_size": NUM_YOUR_GPUS}}
)

2.4. Load dataset and create dataloader

This tutorial uses the Hugging Face datasets library.

from datasets import load_dataset
from torch.utils.data import DataLoader

datasets = load_dataset("squad").data["train"]["context"]
datasets = [str(_) for _ in datasets[: TRAIN_STEP * BATCH_SIZE]]
dataloader = DataLoader(datasets, batch_size=BATCH_SIZE, shuffle=True)

2.5. Do training as usual

for step, batch in enumerate(dataloader):
    optimizer.zero_grad()

    # Make batch
    input_batch = tokenizer(
        batch,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=SEQ_LEN,
    ).to("cuda")

    # Forward-Backward-Step
    loss = model(**input_batch, labels=input_batch["input_ids"]).loss
    loss.backward()
    optimizer.step()

2.6. Save the parallelized model

We provide the save_parallelized method, which works like save_pretrained in Transformers and accepts the same arguments. Checkpoints such as pytorch_model_tp_0_pp_0.bin will be saved to your local path.

for step, batch in enumerate(dataloader):
    optimizer.zero_grad()

    # Make batch
    input_batch = tokenizer(
        batch,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=SEQ_LEN,
    ).to("cuda")

    # Forward-Backward-Step
    loss = model(**input_batch, labels=input_batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    
    # Save the parallelized model using `save_parallelized`
    if step % SAVE_INTERVAL == 0:
        model.save_parallelized(save_directory="./parallel_ckpt")

    if step > TRAIN_STEP:
        break

3. Merging Checkpoints

How to merge the parallelized checkpoints?

3.1. Create model

Usually we create a GPT2 model like this:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

However, it is fine to create a randomly initialized model, because we will load the local checkpoints after creation. Here's how to create a randomly initialized model:

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config.from_pretrained("gpt2")
model = GPT2LMHeadModel(config)

3.2. Parallelize the model

import oslo

model = oslo.initialize(
    model, config={"model_parallelism": {"enable": True, "tensor_parallel_size": NUM_YOUR_GPUS}}
)

3.3. Load parallelized checkpoints

We provide the from_parallelized method to load parallelized checkpoints. You can load them simply by passing the path where the parallelized checkpoints were saved.

model = model.from_parallelized("./parallel_ckpt")

3.4. Merge parallelized checkpoints

The save_parallelized method has a special argument named merge_checkpoints. If you set this argument to True, the parallelized checkpoints of the model will be saved in merged form. We recommend merging after training is finished because this process is a bit slow.

model.save_parallelized("./merged_ckpt", merge_checkpoints=True)
// merged_ckpt

pytorch_model.bin    config.json
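
Because the merged checkpoint has the standard Transformers layout (pytorch_model.bin and config.json), it should be loadable like any other local checkpoint, without OSLO. A minimal sketch, assuming the merge finished successfully:

from transformers import AutoModelForCausalLM

# Load the merged (non-parallelized) checkpoint as a regular pretrained model.
merged_model = AutoModelForCausalLM.from_pretrained("./merged_ckpt")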

This concludes the tensor model parallelism tutorial. Thank you.