---
title: Using the Low-level fastai API
keywords: fastai
sidebar: home_sidebar
summary: "This notebook demonstrates how we can use Blurr to tackle the General Language Understanding Evaluation (GLUE) benchmark to train, evaluate, and do inference."
description: "This notebook demonstrates how we can use Blurr to tackle the General Language Understanding Evaluation (GLUE) benchmark to train, evaluate, and do inference."
nb_path: "nbs/99c_text-examples-glue-plain-pytorch.ipynb"
---
We'll use the "distilroberta-base" checkpoint for this example, but if you want to try an architecture that returns token_type_ids, you can use something like "bert-base-cased".
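If you're not sure whether a given checkpoint uses token_type_ids, a quick way to check is to tokenize a sentence pair and inspect the returned keys. This is just a minimal sketch for illustration:
from transformers import AutoTokenizer
# BERT-style tokenizers return 'token_type_ids' for sentence pairs; RoBERTa-style tokenizers don't
bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")
roberta_tok = AutoTokenizer.from_pretrained("distilroberta-base")
print(bert_tok("first sentence", "second sentence").keys())     # includes 'token_type_ids'
print(roberta_tok("first sentence", "second sentence").keys())  # no 'token_type_ids'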
task = 'mrpc'
task_meta = glue_tasks[task]
train_ds_name = task_meta['dataset_names']["train"]
valid_ds_name = task_meta['dataset_names']["valid"]
test_ds_name = task_meta['dataset_names']["test"]
task_inputs = task_meta['inputs']
task_target = task_meta['target']
task_metrics = task_meta['metric_funcs']
pretrained_model_name = "distilroberta-base" # bert-base-cased | distilroberta-base
bsz = 16
val_bsz = bsz * 2
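The glue_tasks metadata dict referenced above comes from the notebook's setup code and isn't shown in this section; for 'mrpc' its entry presumably looks something like the hypothetical sketch below (your actual dict may differ).
# Hypothetical sketch of the task metadata consumed above -- NOT the notebook's actual dict.
# MRPC pairs two sentences and labels whether they're paraphrases; accuracy and F1 are its
# official metrics (accuracy/F1Score here are fastai metrics).
glue_tasks = {
    'mrpc': {
        'dataset_names': {'train': 'train', 'valid': 'validation', 'test': 'test'},
        'inputs': ['sentence1', 'sentence2'],
        'target': 'label',
        'metric_funcs': [accuracy, F1Score()],
    }
}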
Let's start by preparing our data. We'll load the MRPC dataset from Hugging Face's datasets library, which caches the data after downloading it via the load_dataset method. For more information on the datasets API, see the Hugging Face documentation.
raw_datasets = load_dataset('glue', task)
print(f'{raw_datasets}\n')
print(f'{raw_datasets[train_ds_name][0]}\n')
print(f'{raw_datasets[train_ds_name].features}\n')
My #1 answer to the question, "Why aren't my transformers training?", is that you likely don't have num_labels set correctly. The default for sequence classification tasks is 2, and even though that's what we need here, let's show how to set it explicitly anyway.
n_lbls = raw_datasets[train_ds_name].features[task_target].num_classes
n_lbls
model_cls = AutoModelForSequenceClassification
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name,
                                                                model_cls=model_cls,
                                                                config_kwargs={'num_labels': n_lbls})
print(hf_arch)
print(type(hf_config))
print(type(hf_tokenizer))
print(type(hf_model))
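As a quick sanity check, the config should now carry the label count we passed in (the exact id2label strings below are just what transformers generates by default):
print(hf_config.num_labels)   # 2
print(hf_config.id2label)     # e.g., {0: 'LABEL_0', 1: 'LABEL_1'}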
Tokenize (and numericalize) the raw text using the datasets.map function, then remove unnecessary and/or problematic attributes from the resulting tokenized dataset (e.g., strings that can't be converted to a tensor).
def tokenize_function(example):
    return hf_tokenizer(*[example[inp] for inp in task_inputs], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(task_inputs + ['idx'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets["train"].column_names
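It's worth peeking at a single tokenized example to confirm only tensor-friendly columns remain; a quick sketch:
# Each example should now contain only 'labels', 'input_ids', and 'attention_mask'
# (plus 'token_type_ids' if your checkpoint uses them)
print(tokenized_datasets[train_ds_name][0].keys())
print(tokenized_datasets[train_ds_name][0]['input_ids'][:10])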
TextBatchCreator augments the default DataCollatorWithPadding collator to return a tuple of inputs/targets. Because Hugging Face returns a BatchEncoding object after the call to DataCollatorWithPadding, this class converts it to a dict so that fastai can put the batches on the correct device for training.
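For intuition, here's a rough sketch of what such a collator boils down to. TupleCollator below is a hypothetical stand-in, not Blurr's actual implementation:
from transformers import DataCollatorWithPadding

class TupleCollator:
    "Hypothetical sketch: pad with DataCollatorWithPadding, then hand fastai a (inputs dict, targets) tuple."
    def __init__(self, hf_tokenizer):
        self.collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

    def __call__(self, features):
        batch = self.collator(features)   # returns a BatchEncoding of padded tensors
        targets = batch.pop('labels')     # pull the targets out of the inputs
        return dict(batch), targets       # plain dict so fastai can move the batch to the right device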
Build the plain ol' PyTorch DataLoaders
data_collator = TextBatchCreator(hf_arch, hf_config, hf_tokenizer, hf_model)
train_dataloader = torch.utils.data.DataLoader(tokenized_datasets[train_ds_name], shuffle=True, batch_size=bsz,
                                               collate_fn=data_collator)
eval_dataloader = torch.utils.data.DataLoader(tokenized_datasets[valid_ds_name], batch_size=val_bsz,
                                              collate_fn=data_collator)
dls = DataLoaders(train_dataloader, eval_dataloader)
for b in dls.train: break
b[0]['input_ids'].shape, b[1].shape, b[0]['input_ids'].device, b[1].device
With our plain ol' PyTorch DataLoaders built, we can now build our Learner and train.
Note: Certain fastai methods like dls.one_batch, get_preds, and dls.test_dl won't work with standard PyTorch DataLoaders ... but we'll show how to remedy that in a moment :)
model = BaseModelWrapper(hf_model)
learn = Learner(dls,
                model,
                opt_func=partial(Adam),
                loss_func=PreCalculatedCrossEntropyLoss(),
                metrics=task_metrics,
                cbs=[BaseModelCallback],
                splitter=blurr_splitter).to_fp16()
learn.freeze()
learn.summary()
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(1, lr_max=2e-3)
learn.unfreeze()
learn.lr_find(start_lr=1e-12, end_lr=2e-3, suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(2, lr_max=slice(2e-5, 2e-4))
How did we do?
val_res = learn.validate()
val_res_d = { 'loss': val_res[0]}
for idx, m in enumerate(learn.metrics): val_res_d[m.name] = val_res[idx+1]
val_res_d
# preds, targs = learn.get_preds() # ... won't work :(
Let's do item inference on an example from our test dataset
raw_test_df = raw_datasets[test_ds_name].to_pandas()
raw_test_df.head()
test_ex_idx = 0
test_ex = raw_test_df.iloc[test_ex_idx][task_inputs].values.tolist()
inputs = hf_tokenizer(*test_ex, return_tensors="pt").to(hf_model.device)  # keep the inputs on the same device as the model
outputs = hf_model(**inputs)
outputs.logits
torch.argmax(torch.softmax(outputs.logits, dim=-1))
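If you want the class name rather than the raw index, you can map the prediction back through the dataset's ClassLabel feature (a small sketch; for MRPC, 0 is 'not_equivalent' and 1 is 'equivalent'):
# Convert the predicted index into the dataset's label string
pred_idx = torch.argmax(torch.softmax(outputs.logits, dim=-1)).item()
raw_datasets[train_ds_name].features[task_target].int2str(pred_idx)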
Let's do batch inference on the entire test dataset
test_dataloader = torch.utils.data.DataLoader(tokenized_datasets[test_ds_name],
shuffle=False, batch_size=val_bsz,
collate_fn=data_collator)
hf_model.eval()
# collect the raw logits and the predicted class for every test example
probs, preds = [], []
for xb, yb in test_dataloader:
    xb = to_device(xb, 'cuda')
    with torch.no_grad():
        outputs = hf_model(**xb)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    probs.append(logits)
    preds.append(predictions)
all_probs = torch.cat(probs, dim=0)
all_preds = torch.cat(preds, dim=0)
print(all_probs.shape, all_preds.shape)
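One caveat: despite the name, probs above collects raw logits. If you actually want probabilities, softmax them after the fact; a quick sketch:
# Turn the collected logits into per-class probabilities
all_probs = torch.softmax(all_probs, dim=-1)
print(all_probs[:5])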
Let's start with a fresh set of huggingface objects
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name,
                                                                model_cls=model_cls,
                                                                config_kwargs={'num_labels': n_lbls})
... and the fix is this simple! Instead of using the PyTorch DataLoaders, let's use the fastai flavor like this ...
train_dataloader = TextDataLoader(tokenized_datasets[train_ds_name],
                                  hf_arch, hf_config, hf_tokenizer, hf_model,
                                  preproccesing_func=preproc_hf_dataset, shuffle=True, batch_size=bsz)
eval_dataloader = TextDataLoader(tokenized_datasets[valid_ds_name],
                                 hf_arch, hf_config, hf_tokenizer, hf_model,
                                 preproccesing_func=preproc_hf_dataset, batch_size=val_bsz)
dls = DataLoaders(train_dataloader, eval_dataloader)
Everything else is the same ... but now we get both our fast.ai AND Blurr features back!
dls.show_batch(dataloaders=dls, trunc_at=500, max_n=2)
b = dls.one_batch()
b[0]['input_ids'].shape, b[1].shape, b[0]['input_ids'].device, b[1].device
model = BaseModelWrapper(hf_model)
learn = Learner(dls,
                model,
                opt_func=partial(Adam),
                loss_func=PreCalculatedCrossEntropyLoss(),
                metrics=task_metrics,
                cbs=[BaseModelCallback],
                splitter=blurr_splitter).to_fp16()
learn.freeze()
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(1, lr_max=2e-3)
learn.unfreeze()
learn.lr_find(start_lr=1e-12, end_lr=2e-3, suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(2, lr_max=slice(2e-5, 2e-4))
learn.show_results(learner=learn, trunc_at=500, max_n=2)
How did we do?
val_res = learn.validate()
val_res_d = { 'loss': val_res[0]}
for idx, m in enumerate(learn.metrics): val_res_d[m.name] = val_res[idx+1]
val_res_d
Now we can use Learner.get_preds()
preds, targs = learn.get_preds()
print(preds.shape, targs.shape)
print(accuracy(preds, targs))
Let's do item inference on an example from our test dataset
raw_test_df = raw_datasets[test_ds_name].to_pandas()
raw_test_df.head()
test_ex_idx = 0
test_ex = raw_test_df.iloc[test_ex_idx][task_inputs].values.tolist()
inputs = hf_tokenizer(*test_ex, return_tensors="pt").to(hf_model.device)  # keep the inputs on the same device as the model
outputs = hf_model(**inputs)
outputs.logits
torch.argmax(torch.softmax(outputs.logits, dim=-1))
Let's do batch inference on the entire test dataset using dls.test_dl
test_dl = dls.test_dl(tokenized_datasets[test_ds_name])
preds = learn.get_preds(dl=test_dl)
preds
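get_preds returns a tuple of predictions and targets; to turn the predictions into hard class labels you can argmax over the last dimension (a sketch, assuming that unpacking):
# Unpack the (predictions, targets) tuple and take the most likely class per example
test_probs, test_targs = preds
test_preds = test_probs.argmax(dim=-1)
test_preds[:10]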
So you can see, with one simple swap of the DataLoader objects, you get back a lot of the nice fastai functionality that folks using the mid/high-level APIs have at their disposal. Nevertheless, if you're hell-bent on using the standard PyTorch DataLoaders, you're still good to go with the fastai Learner, its callbacks, etc.