---
title: GLUE classification tasks
keywords: fastai
sidebar: home_sidebar
summary: "This notebook demonstrates how we can use Blurr to tackle the General Language Understanding Evaluation (GLUE) benchmark tasks."
description: "This notebook demonstrates how we can use Blurr to tackle the General Language Understanding Evaluation (GLUE) benchmark tasks."
nb_path: "nbs/99b_text-examples-glue.ipynb"
---
We'll use the "distilroberta-base" checkpoint for this example, but if you want to try an architecture that returns token_type_ids, you can use something like "bert-base-cased".
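Note that this section assumes the notebook's setup cells have already run. For reference, here is a rough sketch of the imports the code below relies on; the blurr import paths in particular are assumptions (they differ across blurr versions), and glue_tasks and NLP come from earlier cells in the notebook rather than from these imports.
# Assumed setup sketch (not an original cell) -- adjust to your environment.
from functools import partial

import pandas as pd
from datasets import load_dataset, concatenate_datasets
from transformers import AutoConfig, AutoModelForSequenceClassification
from fastai.text.all import *

# blurr's import paths vary by version; recent releases look roughly like the lines below,
# while older ones use blurr.data.all / blurr.modeling.all instead:
# from blurr.text.data.all import *
# from blurr.text.modeling.all import *

# `glue_tasks` (the GLUE task metadata dict) and `NLP` (blurr's helper for grabbing the
# Hugging Face objects) are defined/imported in earlier cells and are not shown here.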
task = "mrpc"
task_meta = glue_tasks[task]
train_ds_name = task_meta["dataset_names"]["train"]
valid_ds_name = task_meta["dataset_names"]["valid"]
test_ds_name = task_meta["dataset_names"]["test"]
task_inputs = task_meta["inputs"]
task_target = task_meta["target"]
task_metrics = task_meta["metric_funcs"]
pretrained_model_name = "distilroberta-base" # bert-base-cased | distilroberta-base
bsz = 16
val_bsz = bsz * 2
Let's start by building our DataBlock. We'll load the MRPC dataset from Hugging Face's datasets library, which will be cached after downloading via the load_dataset method. For more information on the datasets API, see the documentation here.
raw_datasets = load_dataset("glue", task)
print(f"{raw_datasets}\n")
print(f"{raw_datasets[train_ds_name][0]}\n")
print(f"{raw_datasets[train_ds_name].features}\n")
There are a variety of ways we can preprocess the dataset for DataBlock consumption. For example, we could push the data into a DataFrame, add a boolean is_valid column, and use ColSplitter to define our train/validation splits like this:
raw_train_df = pd.DataFrame(raw_datasets[train_ds_name], columns=list(raw_datasets[train_ds_name].features.keys()))
raw_train_df["is_valid"] = False
raw_valid_df = pd.DataFrame(raw_datasets[valid_ds_name], columns=list(raw_datasets[train_ds_name].features.keys()))
raw_valid_df["is_valid"] = True
raw_df = pd.concat([raw_train_df, raw_valid_df])
print(len(raw_df))
raw_df.head()
Another option is to capture the indexes for both the train and validation sets, use the datasets library's concatenate_datasets to put them into a single dataset, and finally use IndexSplitter to define our train/validation splits as such:
n_train, n_valid = raw_datasets[train_ds_name].num_rows, raw_datasets[valid_ds_name].num_rows
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([raw_datasets[train_ds_name], raw_datasets[valid_ds_name]])
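As a quick, optional sanity check (not in the original notebook), you can confirm the concatenated dataset lines up with the index-based split:
# The combined dataset should be the train rows followed by the validation rows,
# matching the index ranges we just built.
assert len(raw_ds) == n_train + n_valid
assert len(train_idxs) == n_train and len(valid_idxs) == n_valid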
How many classes are we working with? Depending on which of the two preprocessing approaches you took above, you can count them in either of the following ways.
n_lbls = raw_df[task_target].nunique()
n_lbls
n_lbls = len(set([item[task_target] for item in raw_ds]))
n_lbls
model_cls = AutoModelForSequenceClassification
config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = n_lbls
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls, config=config)
print(hf_arch)
print(type(hf_config))
print(type(hf_tokenizer))
print(type(hf_model))
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock())
def get_x(r, attr):
    # Return a single text field for single-input tasks, or a tuple of fields for sentence-pair tasks
    return r[attr] if (isinstance(attr, str)) else tuple(r[inp] for inp in attr)
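To make the getter concrete, here's a small assumed example (not an original cell) of what get_x returns for a single sample; for MRPC, task_inputs should be the two sentence columns, so you get back a (sentence1, sentence2) tuple:
# get_x returns the raw string for single-input tasks and a tuple of strings for
# sentence-pair tasks like MRPC.
sample = raw_ds[0]
get_x(sample, attr=task_inputs)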
dblock = DataBlock(blocks=blocks, get_x=partial(get_x, attr=task_inputs), get_y=ItemGetter(task_target), splitter=IndexSplitter(valid_idxs))
dls = dblock.dataloaders(raw_ds, bs=bsz, val_bs=val_bsz)
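As an aside, had we gone with the DataFrame + is_valid route from earlier, a roughly equivalent DataBlock (a sketch, not part of the original notebook) would swap IndexSplitter for ColSplitter and read the label from the DataFrame row:
# Assumed alternative wiring for the DataFrame-based approach shown earlier.
dblock_df = DataBlock(
    blocks=blocks,
    get_x=partial(get_x, attr=task_inputs),
    get_y=ColReader(task_target),
    splitter=ColSplitter(col="is_valid"),
)
dls_df = dblock_df.dataloaders(raw_df, bs=bsz, val_bs=val_bsz)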
b = dls.one_batch()
len(b), b[0]["input_ids"].shape, b[1].shape
if "token_type_ids" in b[0]:
print(
[
(hf_tokenizer.convert_ids_to_tokens(inp_id.item()), inp_id.item(), tt_id.item())
for inp_id, tt_id in zip(b[0]["input_ids"][0], b[0]["token_type_ids"][0])
if inp_id != hf_tokenizer.pad_token_id
]
)
dls.show_batch(dataloaders=dls, max_n=5)
With our DataLoaders built, we can now build our Learner and train. We'll use mixed precision so we can train with bigger batches.
model = BaseModelWrapper(hf_model)
learn = Learner(
    dls,
    model,
    opt_func=partial(Adam),
    loss_func=CrossEntropyLossFlat(),
    metrics=task_metrics,
    cbs=[BaseModelCallback],
    splitter=blurr_splitter,
).to_fp16()
learn.freeze()
learn.summary()
preds = model(b[0])
preds.logits.shape, preds
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(1, lr_max=2e-3)
learn.unfreeze()
learn.lr_find(start_lr=1e-12, end_lr=2e-3, suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(2, lr_max=slice(2e-5, 2e-4))
learn.show_results(learner=learn, max_n=5)
How did we do?
val_res = learn.validate()
val_res_d = {"loss": val_res[0]}
for idx, m in enumerate(learn.metrics):
    val_res_d[m.name] = val_res[idx + 1]
val_res_d
preds, targs, losses = learn.get_preds(with_loss=True)
print(preds.shape, targs.shape, losses.shape)
print(losses.mean(), accuracy(preds, targs))
Let's do item inference on an example from our test dataset
raw_test_df = pd.DataFrame(raw_datasets[test_ds_name], columns=list(raw_datasets[test_ds_name].features.keys()))
raw_test_df.head(10)
learn.blurr_predict(raw_test_df.iloc[9].to_dict())
Let's do batch inference on the entire test dataset
test_dl = dls.test_dl(raw_datasets[test_ds_name])
preds = learn.get_preds(dl=test_dl)
preds
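To turn those probabilities into readable labels, you can map the predicted indexes back through the dataset's ClassLabel feature. This is an assumed sketch (not an original cell); it relies on the CategoryBlock vocab preserving the raw 0/1 ordering of the labels, which holds here because the labels are the integers themselves.
# Assumed post-processing: argmax over the probabilities, then look up the label names
# ("not_equivalent" / "equivalent" for MRPC) on the dataset's ClassLabel feature.
probs = preds[0]
pred_idxs = probs.argmax(dim=-1)
class_lbl = raw_datasets[train_ds_name].features[task_target]
pred_lbl_names = [class_lbl.int2str(int(i)) for i in pred_idxs]
pred_lbl_names[:10]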
With the high-level API, we can create our DataBlock, DataLoaders, and Blearner in one line of code
dl_kwargs = {"bs": bsz, "val_bs": val_bsz}
learn_kwargs = {"metrics": task_metrics}
learn = BlearnerForSequenceClassification.from_data(
    raw_df, pretrained_model_name, text_attr=task_inputs, label_attr=task_target, dl_kwargs=dl_kwargs, learner_kwargs=learn_kwargs
)
learn.fit_one_cycle(1, lr_max=2e-3)
learn.show_results(learner=learn, max_n=5)
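Since the Blearner is still a fastai Learner with blurr's callbacks attached, item inference should work just as it did before; for example (an assumed cell, reusing the test DataFrame built above):
# Assumed: single-item inference with the high-level learner, mirroring the earlier call.
learn.blurr_predict(raw_test_df.iloc[9].to_dict())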
The general flow of this notebook was inspired by Zach Mueller's "Text Classification with Transformers" example that can be found in the wonderful Walk With Fastai docs. Take a look there for another approach to working with fast.ai and Hugging Face on GLUE tasks.