---
title: data.question_answering
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question answering tasks."
description: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question answering tasks."
nb_path: "nbs/01b_data-question-answering.ipynb"
---
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
path = Path('./')
squad_df = pd.read_csv(path/'squad_sample.csv'); len(squad_df)
We've provided a small subset of a pre-processed SQuAD v2 dataset below for demonstration purposes. There is a lot that could be done to make this more robust and fully featured; the idea here is simply to show how things can work for tasks beyond sequence classification.
squad_df.head(2)
task = HF_TASKS_AUTO.QuestionAnswering
pretrained_model_name = 'roberta-base' #'xlm-mlm-ende-1024'
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
The `pre_process_squad` method is structured around how we've set up the SQuAD DataFrame above.
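The core job of such a pre-processing step is to convert a character-level answer span (what SQuAD provides) into token-level start/end positions (what the model predicts). blurr's `pre_process_squad` does this with the HF tokenizer; the sketch below illustrates the mapping with a naive whitespace tokenizer, so `char_span_to_token_span` is a hypothetical stand-in, not the library's implementation.

```python
# Hypothetical sketch: map a character-level answer span to token-level
# start/end indices using a naive whitespace tokenizer. The real
# pre_process_squad uses the HF tokenizer for the architecture in play.
def char_span_to_token_span(context, answer_start_char, answer_text):
    char_to_tok, pos = {}, 0
    for i, tok in enumerate(context.split()):
        start = context.index(tok, pos)          # locate this token in the raw text
        for c in range(start, start + len(tok)):
            char_to_tok[c] = i                   # every char maps to its token index
        pos = start + len(tok)
    tok_start = char_to_tok[answer_start_char]
    tok_end = char_to_tok[answer_start_char + len(answer_text) - 1]
    return tok_start, tok_end

ctx = "The quick brown fox jumps over the lazy dog"
print(char_span_to_token_span(ctx, ctx.index("brown fox"), "brown fox"))  # (2, 3)
```

A real HF tokenizer complicates this with subword pieces and special tokens, but the character-to-token bookkeeping is the same idea.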
squad_df = squad_df.apply(partial(pre_process_squad, hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), axis=1)
max_seq_len = 128
squad_df = squad_df[(squad_df.tok_answer_end < max_seq_len) & (~squad_df.is_impossible)]
vocab = dict(enumerate(range(max_seq_len)))
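The `vocab` here is just the range of possible token positions: predicting the answer's start and end indices is framed as two classification problems over positions `0 … max_seq_len-1`, so each `CategoryBlock` gets that range as its vocab. A toy illustration (using a small `max_seq_len` for readability):

```python
# With span extraction framed as classification over token positions,
# the "categories" are simply the positions themselves.
max_seq_len = 4
vocab = dict(enumerate(range(max_seq_len)))
print(vocab)  # {0: 0, 1: 1, 2: 2, 3: 3}
```

This is also why the filtering step above drops rows whose `tok_answer_end` falls at or beyond `max_seq_len`: those targets would have no corresponding category.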
We'll return a `HF_QuestionAnswerInput` from our custom `HF_BeforeBatchTransform` so that we can customize the show_batch/results methods for this task. By overriding `HF_BeforeBatchTransform`, we can add other inputs to each example for this particular task.
before_batch_tfm = HF_QABeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
max_length=max_seq_len, truncation='only_second',
tok_kwargs={ 'return_special_tokens_mask': True })
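The `truncation='only_second'` argument is worth a note: when tokenizing a `(question, context)` pair, it keeps the first segment (the question) intact and truncates only the second segment (the context) to fit `max_length`. A minimal sketch of that behavior, using hypothetical token lists in place of real tokenizer output:

```python
# Sketch of HF's truncation='only_second' strategy: the question is
# never truncated; only the context is trimmed to fit the length budget.
def truncate_only_second(first_toks, second_toks, max_length):
    budget = max_length - len(first_toks)        # room left for the context
    return first_toks + second_toks[:max(budget, 0)]

question = ['who', 'wrote', 'it', '?']
context = ['ctx'] * 20
print(len(truncate_only_second(question, context, 16)))  # 16
```

This matters for QA because truncating the question would make the task unanswerable, while trimming the tail of a long context usually leaves the answer span intact (and rows whose answers fall past the cutoff were filtered out above).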
blocks = (
HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_QuestionAnswerInput),
CategoryBlock(vocab=vocab),
CategoryBlock(vocab=vocab)
)
dblock = DataBlock(blocks=blocks,
get_x=lambda x: (x.question, x.context),
get_y=[ColReader('tok_answer_start'), ColReader('tok_answer_end')],
splitter=RandomSplitter(),
n_inp=1)
dls = dblock.dataloaders(squad_df, bs=4)
b = dls.one_batch(); len(b), len(b[0]), len(b[1]), len(b[2])
b[0]['input_ids'].shape, b[0]['attention_mask'].shape, b[1].shape, b[2].shape
The `show_batch` method above allows us to create a more interpretable view of our question/answer data.
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)