Model Parallelism Tutorial¶
Note that currently OSLO only supports tensor model parallelism.
The details of model parallelism concepts can be found here.
The source code of this tutorial can be found here.
0. Distributed Launcher¶
This tutorial must be launched with a distributed launcher.
If you have 4 GPUs:
python -m torch.distributed.launch --nproc_per_node=4 YOUR_SCRIPT.py
If you have installed DeepSpeed in your environment, the following command works the same way.
deepspeed --num_gpus=4 YOUR_SCRIPT.py
For more information about the distributed launchers, refer to the documentation of each launcher.
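Each process started by these launchers runs your script on one GPU and typically receives its local rank through the LOCAL_RANK environment variable. Below is a minimal sketch of how YOUR_SCRIPT.py could pick its device under that assumption (this helper is an illustration, not part of OSLO):
import os
import torch

# Recent versions of torch.distributed.launch / deepspeed expose the local
# rank of each process via the LOCAL_RANK environment variable
# (assumption; older launchers pass a --local_rank argument instead).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)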
1. Inference¶
How to use model parallelism for inference?
1.1. Create model and tokenizer¶
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
1.2. Parallelize the model¶
There are a few constraints on tensor_parallel_size:
tensor_parallel_size must be equal to or smaller than the total number of GPUs.
tensor_parallel_size must be a power of 2 (e.g. 2, 4, 8, 16, ...).
tensor_parallel_size must be a positive number.
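If you want to check these constraints programmatically before calling oslo.initialize, a minimal sketch could look like this (the checks are only an illustration, not an OSLO API):
import torch

tensor_parallel_size = 4  # the value you plan to pass to OSLO

num_gpus = torch.cuda.device_count()
assert tensor_parallel_size > 0, "tensor_parallel_size must be a positive number"
assert tensor_parallel_size <= num_gpus, "tensor_parallel_size must not exceed the number of GPUs"
assert tensor_parallel_size & (tensor_parallel_size - 1) == 0, "tensor_parallel_size must be a power of 2"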
import oslo
model = oslo.initialize(
    model, config={"model_parallelism": {"tensor_parallel_size": NUM_YOUR_GPUS}}
)
You can also use a JSON file (example for 4 GPUs):
// oslo-config.json
{
  "model_parallelism": {
    "tensor_parallel_size": 4
  }
}
And you can use the json file like this:
model = oslo.initialize(model, config="oslo-config.json")
1.3. Do inference as usual¶
This is an example of text generation. Besides text generation, model parallelism can be used for various tasks such as sequence classification or masked language modeling; in those cases, too, you can write the code as usual.
text = "I don't want a lot for Christmas. There is just one thing"
tokens = tokenizer(text, return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**tokens, num_beams=3)[0]))
I don't want a lot for Christmas. There is just one thing I want to ...
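Note that under a distributed launcher every process runs this script, so the generated text is printed once per GPU. If you want a single output, you can guard the print with the process rank. This is a minimal sketch, assuming the process group has been initialized by the launcher or by oslo.initialize:
import torch.distributed as dist

# All ranks must take part in generation (they hold different tensor-parallel
# shards of the model), but only rank 0 prints the result.
output = model.generate(**tokens, num_beams=3)
if not dist.is_initialized() or dist.get_rank() == 0:
    print(tokenizer.decode(output[0]))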
2. Training¶
How to use model parallelism for training?
2.1. Initialize some variables¶
BATCH_SIZE = 4
SEQ_LEN = 64
SAVE_INTERVAL = 50
TRAIN_STEP = 100
2.2. Load dataset and create data loader¶
In this tutorial, we use the datasets library from Hugging Face.
from datasets import load_dataset
from torch.utils.data import DataLoader
datasets = load_dataset("squad").data["train"]["context"]
datasets = [str(_) for _ in datasets[: TRAIN_STEP * BATCH_SIZE]]
dataloader = DataLoader(datasets, batch_size=BATCH_SIZE, shuffle=True)
2.3. Create model, optimizer and tokenizer¶
from torch.optim import Adam
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = Adam(model.parameters(), lr=3e-5)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Add a pad token for batch training
# (the GPT2 tokenizer doesn't have one).
tokenizer.pad_token = tokenizer.eos_token
2.4. Parallelize the model¶
import oslo
model = oslo.initialize(
    model, config={"model_parallelism": {"tensor_parallel_size": NUM_YOUR_GPUS}}
)
2.5. Do training as usual¶
for step, batch in enumerate(dataloader):
    optimizer.zero_grad()

    # Make batch
    input_batch = tokenizer(
        batch,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=SEQ_LEN,
    ).to("cuda")

    # Forward-Backward-Step
    loss = model(**input_batch, labels=input_batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
2.6. Save the parallelized model¶
OSLO provides the save_parallelized method, which is similar to save_pretrained in Transformers, so it can be used with the same arguments as save_pretrained.
Checkpoints such as pytorch_model_tp_0_pp_0.bin will then be saved to your local path.
for step, batch in enumerate(dataloader):
    optimizer.zero_grad()

    # Make batch
    input_batch = tokenizer(
        batch,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=SEQ_LEN,
    ).to("cuda")

    # Forward-Backward-Step
    loss = model(**input_batch, labels=input_batch["input_ids"]).loss
    loss.backward()
    optimizer.step()

    # Save the parallelized model using `save_parallelized`
    if step % SAVE_INTERVAL == 0:
        model.save_parallelized(save_directory="./parallel_ckpt")

    if step > TRAIN_STEP:
        break
3. Merging Checkpoints¶
How to merge the parallelized checkpoints?
3.1. Create model¶
Usually we create a GPT2 model like this:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")
However, it is also fine to create a randomly initialized model because we will load the local checkpoints after creation. Here's how to create a randomly initialized model:
from transformers import GPT2Config, GPT2LMHeadModel
config = GPT2Config.from_pretrained("gpt2")
model = GPT2LMHeadModel(config)
3.2. Parallelize the model¶
import oslo
model = oslo.initialize(
    model, config={"model_parallelism": {"tensor_parallel_size": NUM_YOUR_GPUS}}
)
3.3. Load parallelized checkpoints¶
OSLO provides the from_parallelized method to load parallelized checkpoints.
You can load them by simply passing the path where the parallelized checkpoints were saved.
model = model.from_parallelized("./parallel_ckpt")
3.4. Merge parallelized checkpoints¶
The save_parallelized method has a special argument named merge_checkpoints.
If you set this argument to True, the parallelized checkpoints of the model will be saved in merged form.
We recommend merging them after training is finished because this process is a bit slow.
model.save_parallelized("./merged_ckpt", merge_checkpoints=True)
// merged_ckpt
pytorch_model.bin config.json
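Since the merged checkpoint has the standard Transformers layout shown above (pytorch_model.bin and config.json), it should be loadable like any other local checkpoint. A minimal sketch of loading it back, assuming no additional files are needed:
from transformers import AutoModelForCausalLM

# Load the merged (no longer parallelized) checkpoint as a regular model.
merged_model = AutoModelForCausalLM.from_pretrained("./merged_ckpt")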
This concludes the model parallelism tutorial. Thank you.