---
title: text.data.question_answering
keywords: fastai
sidebar: home_sidebar
summary: "Question/Answering tasks are models that require two text inputs (a context that includes the answer and the question). The objective is to predict the start/end tokens of the answer in the context. This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question/answering tasks."
description: "Question/Answering tasks are models that require two text inputs (a context that includes the answer and the question). The objective is to predict the start/end tokens of the answer in the context. This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question/answering tasks."
nb_path: "nbs/14_text-data-question-answering.ipynb"
---
raw_datasets = load_dataset("squad_v2", split=["train[:1000]", "validation[:200]"])
raw_train_ds, raw_valid_ds = raw_datasets[0], raw_datasets[1]
raw_train_df = pd.DataFrame(raw_train_ds)
raw_valid_df = pd.DataFrame(raw_valid_ds)
raw_train_df["is_valid"] = False
raw_valid_df["is_valid"] = True
print(len(raw_train_df))
print(len(raw_valid_df))
raw_train_df.head(2)
raw_valid_df.head(2)
squad_df = pd.concat([raw_train_df, raw_valid_df])
len(squad_df)
squad_df["ans_start_char_idx"] = squad_df.answers.apply(lambda v: v["answer_start"][0] if len(v["answer_start"]) > 0 else "0")
squad_df["answer_text"] = squad_df.answers.apply(lambda v: v["text"][0] if len(v["text"]) > 0 else "")
squad_df["ans_end_char_idx"] = squad_df["ans_start_char_idx"].astype(int) + squad_df["answer_text"].str.len()
print(len(squad_df))
squad_df[squad_df.is_valid == True].head(2)
model_cls = AutoModelForQuestionAnswering
pretrained_model_name = "roberta-base" #'xlm-mlm-ende-1024'
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
max_seq_len = 128
# the "vocab" for the start/end CategoryBlocks is simply the set of possible token positions (0 to max_seq_len-1)
vocab = dict(enumerate(range(max_seq_len)))
With version 2.0.0 of BLURR, we include a Preprocessor for question answering that can either truncate texts or else chunk long documents into multiple examples.
Note: Unlike other NLP tasks in BLURR, extractive question answering requires preprocessing in order to convert our raw start/end character indices into start/end token indices, unless your dataset already includes the latter. Token indices, rather than character indices, will be used as our targets and are dependent on your tokenizer of choice.
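To make that conversion concrete, here is a minimal sketch (not the QAPreprocessor internals) of how a character span can be mapped onto a token span using a fast tokenizer's offset mapping; the question, context, and answer strings below are made up purely for illustration:

# a minimal sketch of char-span -> token-span conversion (assumes hf_tokenizer is a "fast" tokenizer)
question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare."
ans_start_char = context.index("William")
ans_end_char = ans_start_char + len("William Shakespeare")

enc = hf_tokenizer(question, context, return_offsets_mapping=True)
seq_ids = enc.sequence_ids()

start_tok_idx, end_tok_idx = 0, 0
for idx, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
    if seq_ids[idx] != 1:  # only consider tokens belonging to the context (sequence id 1)
        continue
    if tok_start <= ans_start_char < tok_end:
        start_tok_idx = idx
    if tok_start < ans_end_char <= tok_end:
        end_tok_idx = idx + 1  # exclusive end, matching the slicing used in the tests below

print(start_tok_idx, end_tok_idx)
print(hf_tokenizer.decode(enc["input_ids"][start_tok_idx:end_tok_idx]).strip())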
tok_kwargs = {"return_overflowing_tokens": True, "max_length": max_seq_len, "stride": 64}
preprocessor = QAPreprocessor(hf_tokenizer, id_attr="id", tok_kwargs=tok_kwargs)
proc_df = preprocessor.process_df(squad_df)
print(len(proc_df))
proc_df.head(4)
sampled_df = proc_df.sample(n=10)

for row_idx, row in sampled_df.iterrows():
    test_example = row
    inputs = hf_tokenizer(row.proc_question, row.proc_context)

    if test_example.is_answerable:
        # print(test_example.answer_text)
        test_eq(
            test_example.answer_text,
            hf_tokenizer.decode(inputs["input_ids"][test_example.ans_start_token_idx : test_example.ans_end_token_idx]).strip(),
        )
    else:
        test_eq(test_example.ans_start_token_idx, 0)
        test_eq(test_example.ans_end_token_idx, 0)
If you want to remove texts longer than your model can hold (and include only answerable contexts), you can preprocess without returning overflowing tokens and then filter the processed DataFrame:
preprocessor = QAPreprocessor(hf_tokenizer, tok_kwargs={"return_overflowing_tokens": False, "max_length": max_seq_len})
proc2_df = preprocessor.process_df(squad_df)
proc2_df = proc2_df[(proc2_df.ans_end_token_idx < max_seq_len) & (proc2_df.is_answerable)]
print(len(proc2_df))
proc2_df.head(2)
pretrained_model_name = "distilroberta-base"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=AutoModelForQuestionAnswering)
max_seq_len = 128
vocab = dict(enumerate(range(max_seq_len)))
tok_kwargs = {"return_overflowing_tokens": True, "max_length": max_seq_len, "stride": 24}
preprocessor = QAPreprocessor(hf_tokenizer, id_attr="id", tok_kwargs=tok_kwargs)
proc_df = preprocessor.process_df(squad_df)
proc_df.head(1)
before_batch_tfm = QABatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, max_length=max_seq_len)
blocks = (
    TextBlock(batch_tokenize_tfm=before_batch_tfm, input_return_type=QATextInput),
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab),
)
dblock = DataBlock(
    blocks=blocks,
    get_x=lambda x: (x.proc_question, x.proc_context),
    get_y=[ColReader("ans_start_token_idx"), ColReader("ans_end_token_idx")],
    splitter=ColSplitter(),
    n_inp=1,
)
dls = dblock.dataloaders(proc_df, bs=4)
len(dls.train), len(dls.valid)
b = dls.one_batch()
len(b), len(b[0]), len(b[1]), len(b[2])
b[0]["input_ids"].shape, b[0]["attention_mask"].shape, b[1].shape, b[2].shape
b[0]["start_positions"], b[0]["end_positions"]
The show_batch method below allows us to create a more interpretable view of our question/answer data.
dls.show_batch(dataloaders=dls, max_n=4)
As mentioned in the data.core module documentation, BLURR now also allows you to pass extra information alongside your inputs in the form of a dictionary. If we are splitting long documents into chunks but want to predict/aggregate by example (rather than by chunk), we'll need to include a unique identifier for each example. When we look at the modeling.question_answer module, we'll see how the question answering bits can use such an ID for this purpose.
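For instance, with the processed DataFrame from the previous section we can confirm that a single example may span several chunks by grouping on its identifier (this assumes, as with the SQuAD data used here, that the identifier column is named "id"):

# rows that share an "id" are chunks of the same original question/context pair
chunks_per_example = proc_df.groupby("id").size()
print(chunks_per_example.sort_values(ascending=False).head())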
pretrained_model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=AutoModelForQuestionAnswering)
max_seq_len = 128
vocab = dict(enumerate(range(max_seq_len)))
preprocessor = QAPreprocessor(
    hf_tokenizer, id_attr="id", tok_kwargs={"return_overflowing_tokens": True, "max_length": max_seq_len, "stride": 64}
)
proc_df = preprocessor.process_df(squad_df)
proc_df.head(1)
before_batch_tfm = QABatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, max_length=max_seq_len)
blocks = (
    TextBlock(batch_tokenize_tfm=before_batch_tfm, input_return_type=QATextInput),
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab),
)
# since the data is preprocessed, we include a "text" key with the values of our question and context
def get_x(item):
    return {"text": (item.proc_question, item.proc_context), "id": item.id}
dblock = DataBlock(
    blocks=blocks,
    get_x=get_x,
    get_y=[ItemGetter("ans_start_token_idx"), ItemGetter("ans_end_token_idx")],
    splitter=ColSplitter(),
    n_inp=1,
)
dls = dblock.dataloaders(proc_df, bs=4)
len(dls.train), len(dls.valid)
b = dls.one_batch()
len(b), len(b[0]), len(b[1]), len(b[2])
b[0].keys()
b[0]["input_ids"].shape, b[0]["attention_mask"].shape, b[1].shape, b[2].shape
We can see that any additional data is now located in the inputs dictionary:
b[0]["id"]
dls.show_batch(dataloaders=dls, max_n=4)