--- title: data.question_answering keywords: fastai sidebar: home_sidebar summary: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question/answering tasks." description: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question/answering tasks." nb_path: "nbs/01b_data-question-answering.ipynb" ---
{% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti
{% endraw %}

Question/Answering tokenization, batch transform, and DataBlock methods

Question/Answering tasks are models that require two text inputs (a context that includes the answer and the question). The objective is to predict the start/end tokens of the answer in the context)

{% raw %}
path = Path('./')
squad_df = pd.read_csv(path/'squad_sample.csv'); len(squad_df)
1000
{% endraw %}

We've provided a simple subset of a pre-processed SQUADv2 dataset below just for demonstration purposes. There is a lot that can be done to make this much better and more fully functional. The idea here is just to show you how things can work for tasks beyond sequence classification.

{% raw %}
squad_df.head(2)
id title context question answers ds_type answer_text is_impossible
0 56be85543aeaaa14008c9063 Beyoncé Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G... When did Beyonce start becoming popular? {'text': ['in the late 1990s'], 'answer_start': [269]} train in the late 1990s False
1 56be85543aeaaa14008c9065 Beyoncé Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G... What areas did Beyonce compete in when she was growing up? {'text': ['singing and dancing'], 'answer_start': [207]} train singing and dancing False
{% endraw %} {% raw %}
task = HF_TASKS_AUTO.QuestionAnswering

pretrained_model_name = 'roberta-base' #'xlm-mlm-ende-1024'
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
{% endraw %} {% raw %}
{% endraw %} {% raw %}

pre_process_squad[source]

pre_process_squad(row, hf_arch, hf_tokenizer)

{% endraw %}

The pre_process_squad method is structured around how we've setup the squad DataFrame above.

{% raw %}
squad_df = squad_df.apply(partial(pre_process_squad, hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), axis=1)
{% endraw %} {% raw %}
max_seq_len= 128
{% endraw %} {% raw %}
squad_df = squad_df[(squad_df.tok_answer_end < max_seq_len) & (squad_df.is_impossible == False)]
{% endraw %} {% raw %}
vocab = dict(enumerate(range(max_seq_len)))
{% endraw %} {% raw %}
{% endraw %} {% raw %}

class HF_QuestionAnswerInput[source]

HF_QuestionAnswerInput(x, **kwargs) :: HF_BaseInput

{% endraw %}

We'll return a HF_QuestionAnswerInput from our HF_BatchTransform so that we can customize the show_batch/results methods for this task.

{% raw %}
{% endraw %} {% raw %}

class HF_QABatchTransform[source]

HF_QABatchTransform(hf_arch, hf_tokenizer, max_length=None, padding=True, truncation=True, is_pretokenized=False, n_tok_inps=1, hf_input_return_type=HF_QuestionAnswerInput, tok_kwargs={}, **kwargs) :: HF_BatchTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

{% endraw %}

By overriding HF_BatchTransform we can add other inputs to each example for this particular task.

{% raw %}
hf_batch_tfm = HF_QABatchTransform(hf_arch, hf_tokenizer, max_length=max_seq_len, truncation='only_second', 
                                   tok_kwargs={ 'return_special_tokens_mask': True })

blocks = (
    HF_TextBlock(hf_arch, hf_tokenizer, hf_batch_tfm=hf_batch_tfm), 
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab)
)

dblock = DataBlock(blocks=blocks, 
                   get_x=lambda x: (x.question, x.context),
                   get_y=[ColReader('tok_answer_start'), ColReader('tok_answer_end')],
                   splitter=RandomSplitter(),
                   n_inp=1)
{% endraw %} {% raw %}
dls = dblock.dataloaders(squad_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch(); len(b), len(b[0]), len(b[1]), len(b[2])
(3, 5, 4, 4)
{% endraw %} {% raw %}
b[0]['input_ids'].shape, b[0]['attention_mask'].shape, b[1].shape, b[2].shape
(torch.Size([4, 128]), torch.Size([4, 128]), torch.Size([4]), torch.Size([4]))
{% endraw %} {% raw %}
{% endraw %}

The show_batch method above allows us to create a more interpretable view of our question/answer data.

{% raw %}
dls.show_batch(dataloaders=dls, max_n=2)
text start/end answer
0 How many Grammy nominations does Beyonce have? Beyoncé has won 20 Grammy Awards, both as a solo artist and member of Destiny's Child, making her the second most honored female artist by the Grammys, behind Alison Krauss and the most nominated woman in Grammy Award history with 52 nominations. "Single Ladies (Put a Ring on It)" won Song of the Year in 2010 while "Say My Name" and "Crazy in Love" had previously won Best R&B Song. Dangerously in Love, B'Day and I Am... Sasha Fierce have all won Best Contemporary R&B Album. Beyon (59, 61) 52 nominations
1 in September 2010, what career area did Beyonce start exploring? In September 2010, Beyoncé made her runway modelling debut at Tom Ford's Spring/Summer 2011 fashion show. She was named "World's Most Beautiful Woman" by People and the "Hottest Female Singer of All Time" by Complex in 2012. In January 2013, GQ placed her on its cover, featuring her atop its "100 Sexiest Women of the 21st Century" list. VH1 listed her at number 1 on its 100 Sexiest Artists list. Several wax figures of Beyoncé are found at Madame Tussauds Wax Muse (25, 26) modelling
{% endraw %}

Cleanup