---
title: modeling.question_answering
keywords: fastai
sidebar: home_sidebar
summary: "This module contains custom models, loss functions, custom splitters, etc. for question answering tasks"
description: "This module contains custom models, loss functions, custom splitters, etc. for question answering tasks"
nb_path: "nbs/02b_modeling-question-answering.ipynb"
---
# run on the 2nd GPU (device index 1); adjust or remove if you only have one GPU
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Again, we'll use a subset of the pre-processed SQuAD v2 dataset for our purposes below.
# full dataset
# squad_df = pd.read_csv('./data/task-question-answering/squad_cleaned.csv'); len(squad_df)

# a small sample for demo purposes
squad_df = pd.read_csv('./squad_sample.csv'); len(squad_df)
squad_df.head(2)
pretrained_model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
hf_model_cls = BertForQuestionAnswering
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name,
                                                                               model_cls=hf_model_cls)
# # here's a pre-trained roberta model for squad you can try too
# pretrained_model_name = "ahotrod/roberta_large_squad2"
# hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name,
#                                                                                task=HF_TASKS_AUTO.ForQuestionAnswering)

# # here's a pre-trained xlm model for squad you can try too
# pretrained_model_name = 'xlm-mlm-ende-1024'
# hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name,
#                                                                                task=HF_TASKS_AUTO.ForQuestionAnswering)
squad_df = squad_df.apply(partial(pre_process_squad, hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), axis=1)
max_seq_len = 128
squad_df = squad_df[(squad_df.tokenized_input_len < max_seq_len) & (squad_df.is_impossible == False)]
vocab = list(range(max_seq_len))
# vocab = dict(enumerate(range(max_seq_len)));
trunc_strat = 'only_second' if (hf_tokenizer.padding_side == 'right') else 'only_first'
blocks = (
    HF_TextBlock(hf_arch, hf_tokenizer,
                 hf_batch_tfm=HF_QABatchTransform(hf_arch, hf_tokenizer),
                 max_length=max_seq_len, truncation=trunc_strat,
                 tok_kwargs={ 'return_special_tokens_mask': True }),
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab)
)
def get_x(x):
    # the tokenizer's padding side determines whether the question or the context comes first
    return (x.question, x.context) if (hf_tokenizer.padding_side == 'right') else (x.context, x.question)
dblock = DataBlock(blocks=blocks,
                   get_x=get_x,
                   get_y=[ColReader('tok_answer_start'), ColReader('tok_answer_end')],
                   splitter=RandomSplitter(),
                   n_inp=1)
dls = dblock.dataloaders(squad_df, bs=4)
len(dls.vocab), dls.vocab[0], dls.vocab[1]
dls.show_batch(dataloaders=dls, max_n=2)
Here we create a question/answer specific subclass of HF_BaseModelCallback
in order to grab all of the start and end predictions. We also add a new loss function that can handle multiple targets.

And here we provide a custom loss function for our question answering task, expanding on some techniques learned from here and here.

In fact, this new loss function can be used in many other multi-modal architectures, with any mix of loss functions. For example, it can be amended to include the is_impossible
task, as well as the start/end token tasks in the SQuAD v2 dataset.
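To make the idea concrete, here is a minimal sketch of such a multi-target loss: one loss function per target (here, CrossEntropyLoss over the start and end token positions), with the per-task results summed. The class name and the optional per-task weights are illustrative assumptions, not blurr's actual MultiTargetLoss implementation.

import torch
from torch import nn

class MultiTargetCrossEntropy(nn.Module):
    """Illustrative multi-target loss (NOT blurr's MultiTargetLoss).

    Assumes the model returns a tuple of logits (start_logits, end_logits)
    and the targets arrive as a matching tuple (start_positions, end_positions).
    """
    def __init__(self, weights=None):
        super().__init__()
        self.loss_fn = nn.CrossEntropyLoss()
        self.weights = weights  # optional per-task weights (assumed, for illustration)

    def forward(self, outputs, *targets):
        weights = self.weights or [1.0] * len(targets)
        # sum the per-task losses (e.g., start-token loss + end-token loss)
        return sum(w * self.loss_fn(out, targ)
                   for w, out, targ in zip(weights, outputs, targets))

# quick sanity check: a fake batch of 4 examples over 128 token positions
start_logits, end_logits = torch.randn(4, 128), torch.randn(4, 128)
start_targs, end_targs = torch.randint(0, 128, (4,)), torch.randint(0, 128, (4,))
print(MultiTargetCrossEntropy()((start_logits, end_logits), start_targs, end_targs))

Because fastai calls loss_func(preds, *yb), a loss with this signature lines up naturally with our two CategoryBlock targets.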
model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls,
                model,
                opt_func=partial(Adam, decouple_wd=True),
                cbs=[HF_QstAndAnsModelCallback],
                splitter=hf_splitter)

learn.loss_func = MultiTargetLoss()
learn.create_opt()  # -> will create your layer groups based on your "splitter" function
learn.freeze()
Notice above how I had to assign the loss function after creating the Learner
object. I'm not sure why, but passing MultiTargetLoss
into the Learner constructor directly prevents the learner from being exported later.
print(len(learn.opt.param_groups))
x, y_start, y_end = dls.one_batch()
preds = learn.model(x)
len(preds),preds[0].shape
learn.lr_find(suggestions=True)
learn.fit_one_cycle(3, lr_max=1e-3)
learn.show_results(learner=learn, skip_special_tokens=True, max_n=2)
inf_df = pd.DataFrame.from_dict([{
    'question': 'What did George Lucas make?',
    'context': 'George Lucas created Star Wars in 1977. He directed and produced it.'
}], orient='columns')
learn.blurr_predict(inf_df.iloc[0])
inp_ids = hf_tokenizer.encode('What did George Lucas make?',
                              'George Lucas created Star Wars in 1977. He directed and produced it.')
hf_tokenizer.convert_ids_to_tokens(inp_ids, skip_special_tokens=False)[11:13]
Note that there is currently a bug in fastai v2 (or in how I'm assembling everything) that prevents us from seeing the decoded predictions and probabilities for the "end" token.
inf_df = pd.DataFrame.from_dict([{
    'question': 'When was Star Wars made?',
    'context': 'George Lucas created Star Wars in 1977. He directed and produced it.'
}], orient='columns')
test_dl = dls.test_dl(inf_df)
inp = test_dl.one_batch()[0]['input_ids']
probs, _, preds = learn.get_preds(dl=test_dl, with_input=False, with_decoded=True)
hf_tokenizer.convert_ids_to_tokens(inp.tolist()[0],
                                   skip_special_tokens=False)[torch.argmax(probs[0]):torch.argmax(probs[1])]
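The slice above works, but it's handy to wrap the pattern up. Below is a small, hypothetical helper (not part of blurr) that takes the start/end probability tensors returned by get_preds along with a single example's input IDs, and decodes the most likely answer span. It assumes, as above, that probs is a (start_probs, end_probs) pair.

def extract_answer(hf_tokenizer, input_ids, start_probs, end_probs):
    """Decode the most likely answer span from start/end probabilities.

    `input_ids` is a 1-D list/tensor of token ids for a single example;
    `start_probs`/`end_probs` are 1-D tensors over token positions.
    """
    start_idx = int(torch.argmax(start_probs))
    end_idx = int(torch.argmax(end_probs))
    toks = hf_tokenizer.convert_ids_to_tokens(list(input_ids), skip_special_tokens=False)
    # convert_tokens_to_string re-joins wordpieces (e.g., '##' prefixes) into readable text
    return hf_tokenizer.convert_tokens_to_string(toks[start_idx:end_idx])

extract_answer(hf_tokenizer, inp.tolist()[0], probs[0][0], probs[1][0])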
learn.unfreeze()
learn.fit_one_cycle(3, lr_max=slice(1e-7, 1e-4))
learn.recorder.plot_loss()
learn.show_results(learner=learn, max_n=2)
learn.blurr_predict(inf_df.iloc[0])
preds, pred_classes, probs = learn.blurr_predict(inf_df.iloc[0])
preds
inp_ids = hf_tokenizer.encode('When was Star Wars made?',
                              'George Lucas created Star Wars in 1977. He directed and produced it.')
hf_tokenizer.convert_ids_to_tokens(inp_ids, skip_special_tokens=False)[int(preds[0]):int(preds[1])]
Note that I had to swap out the loss function because of the above-mentioned issue with exporting the model while the MultiTargetLoss
loss function is attached. After loading our inference learner, we put it back and we're good to go!
learn.loss_func = nn.CrossEntropyLoss()
learn.export(fname='q_and_a_learn_export.pkl')
inf_learn = load_learner(fname='q_and_a_learn_export.pkl')
inf_learn.loss_func = MultiTargetLoss()
inf_df = pd.DataFrame.from_dict([{
    'question': 'Who created Star Wars?',
    'context': 'George Lucas created Star Wars in 1977. He directed and produced it.'
}], orient='columns')
inf_learn.blurr_predict(inf_df.iloc[0])
inp_ids = hf_tokenizer.encode('Who created Star Wars?',
                              'George Lucas created Star Wars in 1977. He directed and produced it.')
hf_tokenizer.convert_ids_to_tokens(inp_ids, skip_special_tokens=False)[7:9]