---
title: text.data.seq2seq.core
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the core seq2seq (e.g., language modeling, summarization, translation) bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by Hugging Face transformer implementations."
description: "This module contains the core seq2seq (e.g., language modeling, summarization, translation) bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by Hugging Face transformer implementations."
nb_path: "nbs/20_text-data-seq2seq-core.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
 
{% endraw %} {% raw %}
{% endraw %} {% raw %}
What we're running with at the time this documentation was generated:
torch: 1.10.1+cu111
fastai: 2.5.6
transformers: 4.16.2
{% endraw %}

## Setup

{% raw %}
pretrained_model_name = "facebook/bart-large-cnn"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=BartForConditionalGeneration)
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('bart',
 transformers.models.bart.configuration_bart.BartConfig,
 transformers.models.bart.tokenization_bart_fast.BartTokenizerFast,
 transformers.models.bart.modeling_bart.BartForConditionalGeneration)
{% endraw %}

## Preprocessing

Starting with version 2.0, BLURR provides a preprocessing base class that can be used to build preprocessed seq2seq datasets from pandas DataFrames or Hugging Face Datasets. A usage sketch follows the class reference below.

{% raw %}

class Seq2SeqPreprocessor[source]

Seq2SeqPreprocessor(hf_tokenizer:PreTrainedTokenizerBase, batch_size:int=1000, text_attr:str='text', max_input_tok_length:Optional[int]=None, target_text_attr:str='summary', max_target_tok_length:Optional[int]=None, is_valid_attr:Optional[str]='is_valid', tok_kwargs:dict={}) :: Preprocessor

| | Type | Default | Details |
|---|---|---|---|
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| batch_size | int | 1000 | The number of examples to process at a time |
| text_attr | str | text | The attribute holding the text |
| max_input_tok_length | Optional[int] | None | The maximum length (# of tokens) allowed for inputs. Will default to the max length allowed by the model if not provided |
| target_text_attr | str | summary | The attribute holding the summary |
| max_target_tok_length | Optional[int] | None | The maximum length (# of tokens) allowed for targets |
| is_valid_attr | Optional[str] | is_valid | The attribute that should be created if you are processing individual training and validation datasets into a single dataset, indicating to which dataset each example belongs |
| tok_kwargs | dict | None | Tokenization kwargs that will be applied when calling the tokenizer |
{% endraw %} {% raw %}
{% endraw %}
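Below is a minimal usage sketch (not taken from the generated reference above). The DataFrame contents and token-length limits are made up, and the `process_df` call assumes this class follows the same API as BLURR's other `Preprocessor` subclasses; check the source of your installed version if it differs.

```python
# A minimal usage sketch. The DataFrame and the token-length limits are hypothetical,
# and `process_df` is assumed to behave like BLURR's other Preprocessor subclasses.
import pandas as pd
from blurr.text.data.seq2seq.core import Seq2SeqPreprocessor

raw_df = pd.DataFrame(
    {
        "text": ["A long article we want to summarize ...", "Another long article ..."],
        "summary": ["A short summary.", "Another short summary."],
    }
)

preprocessor = Seq2SeqPreprocessor(
    hf_tokenizer,                  # from NLP.get_hf_objects in "Setup" above
    text_attr="text",
    target_text_attr="summary",
    max_input_tok_length=256,      # hypothetical limits; defaults come from the model
    max_target_tok_length=72,
)

proc_df = preprocessor.process_df(raw_df)  # assumed API; returns a preprocessed DataFrame
```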

## Mid-level API

### Base tokenization, batch transform, and DataBlock methods

{% raw %}

class Seq2SeqTextInput[source]

Seq2SeqTextInput(x, **kwargs) :: TextInput

The base representation of your inputs; used by the various fastai show methods

{% endraw %} {% raw %}
{% endraw %}

A Seq2SeqTextInput object is returned from the decodes method of Seq2SeqBatchTokenizeTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. Its value will be your "input_ids".

{% raw %}

class Seq2SeqBatchTokenizeTransform[source]

Seq2SeqBatchTokenizeTransform(hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, include_labels:bool=True, ignore_token_id:int=-100, max_length:int=None, max_target_length:int=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, is_split_into_words:bool=False, tok_kwargs={}, text_gen_kwargs={}, **kwargs) :: BatchTokenizeTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

| | Type | Default | Details |
|---|---|---|---|
| hf_arch | str | | The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.) |
| hf_config | PretrainedConfig | | A specific configuration instance you want to use |
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| hf_model | PreTrainedModel | | A Hugging Face model |
| include_labels | bool | True | To control whether the "labels" are included in your inputs. If they are, the loss will be calculated in the model's forward function and you can simply use PreCalculatedLoss as your Learner's loss function |
| ignore_token_id | int | -100 | The token ID that should be ignored when calculating the loss |
| max_length | int | None | To control the length of the padding/truncation of the input sequence. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See "Everything you always wanted to know about padding and truncation" |
| max_target_length | int | None | To control the length of the padding/truncation of the target sequence. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See "Everything you always wanted to know about padding and truncation" |
| padding | Union[bool, str] | True | To control the padding applied to your hf_tokenizer during tokenization. If None, will default to False or 'do_not_pad'. See "Everything you always wanted to know about padding and truncation" |
| truncation | Union[bool, str] | True | To control the truncation applied to your hf_tokenizer during tokenization. If None, will default to False or 'do_not_truncate'. See "Everything you always wanted to know about padding and truncation" |
| is_split_into_words | bool | False | The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to True if your inputs are pre-tokenized (not numericalized) |
| tok_kwargs | dict | None | Any other keyword arguments you want included when using your hf_tokenizer to tokenize your inputs |
| text_gen_kwargs | dict | None | Any keyword arguments to pass to the hf_model.generate method |
| kwargs | | | No Content |
{% endraw %} {% raw %}
{% endraw %}

We create a subclass of BatchTokenizeTransform for summarization tasks to add decoder_input_ids and labels (if we want Hugging Face to calculate the loss for us) to our inputs during training. See here and here for more information on these additional inputs used in summarization, translation, and conversational training tasks. How they should look for particular architectures can be found by looking at those models' forward function docs (see here for BART, for example).

Note also that labels is simply target_ids shifted to the right by one, since the task is to predict the next token based on the current (and all previous) decoder_input_ids.

And lastly, we also update our targets to just be the input_ids of our target sequence so that fastai's Learner.show_results works (again, almost all the fastai bits require returning a single tensor to work).
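To make the above concrete, here is a rough, framework-agnostic sketch of what such a mini-batch looks like, using only the Hugging Face objects created in the Setup section (this is not BLURR's actual implementation). The example texts are made up, and the comment about `decoder_input_ids` reflects how BART-style models behave when only `labels` are supplied.

```python
# A rough sketch of the tensors that end up in a seq2seq mini-batch.
# Source and target texts are tokenized separately; padded target positions are
# replaced with the ignore index (-100, the `ignore_token_id` default above) so they
# don't contribute to the loss. For BART-style models, when only "labels" are passed,
# the forward pass derives `decoder_input_ids` by shifting the labels one position right.
import torch

src_texts = ["A long article we want to summarize ..."]   # hypothetical inputs
tgt_texts = ["A short summary."]                          # hypothetical targets

inputs = hf_tokenizer(src_texts, padding=True, truncation=True, return_tensors="pt")
targets = hf_tokenizer(tgt_texts, padding=True, truncation=True, return_tensors="pt")

labels = targets["input_ids"].clone()
labels[labels == hf_tokenizer.pad_token_id] = -100   # ignored when computing the loss

batch = {**inputs, "labels": labels}
outputs = hf_model(**batch)   # the loss is pre-calculated because "labels" is included
print(outputs.loss)
```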

{% raw %}

class Seq2SeqBatchDecodeTransform[source]

Seq2SeqBatchDecodeTransform(input_return_type:typing.Type=TextInput, hf_arch:Optional[str]=None, hf_config:Optional[PretrainedConfig]=None, hf_tokenizer:Optional[PreTrainedTokenizerBase]=None, hf_model:Optional[PreTrainedModel]=None, **kwargs) :: BatchDecodeTransform

A class used to cast your inputs as input_return_type for fastai show methods

| | Type | Default | Details |
|---|---|---|---|
| input_return_type | Type | TextInput | Used by typedispatched show methods |
| hf_arch | Optional[str] | None | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_config | Optional[PretrainedConfig] | None | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_tokenizer | Optional[PreTrainedTokenizerBase] | None | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_model | Optional[PreTrainedModel] | None | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| kwargs | | | No Content |
{% endraw %} {% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}

default_text_gen_kwargs[source]

default_text_gen_kwargs(hf_config, hf_model, task=None)

{% endraw %} {% raw %}
default_text_gen_kwargs(hf_config, hf_model)
{'max_length': 142,
 'min_length': 56,
 'do_sample': False,
 'early_stopping': True,
 'num_beams': 4,
 'temperature': 1.0,
 'top_k': 50,
 'top_p': 1.0,
 'repetition_penalty': 1.0,
 'bad_words_ids': None,
 'bos_token_id': 0,
 'pad_token_id': 1,
 'eos_token_id': 2,
 'length_penalty': 2.0,
 'no_repeat_ngram_size': 3,
 'encoder_no_repeat_ngram_size': 0,
 'num_return_sequences': 1,
 'decoder_start_token_id': 2,
 'use_cache': True,
 'num_beam_groups': 1,
 'diversity_penalty': 0.0,
 'output_attentions': False,
 'output_hidden_states': False,
 'output_scores': False,
 'return_dict_in_generate': False,
 'forced_bos_token_id': 0,
 'forced_eos_token_id': 2,
 'remove_invalid_values': False}
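If you only want to tweak a couple of generation settings, one option is to start from these defaults and override the values you care about before passing the dictionary along via the `text_gen_kwargs` argument documented below. The overridden values here are arbitrary examples.

```python
# Start from the model's generation defaults and override a few values
# (the numbers below are arbitrary examples, not recommendations).
text_gen_kwargs = default_text_gen_kwargs(hf_config, hf_model)
text_gen_kwargs["num_beams"] = 2
text_gen_kwargs["max_length"] = 130
```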
{% endraw %} {% raw %}

class Seq2SeqTextBlock[source]

Seq2SeqTextBlock(hf_arch:str=None, hf_config:PretrainedConfig=None, hf_tokenizer:PreTrainedTokenizerBase=None, hf_model:PreTrainedModel=None, batch_tokenize_tfm:Optional[BatchTokenizeTransform]=None, batch_decode_tfm:Optional[BatchDecodeTransform]=None, max_length:int=None, max_target_length=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, input_return_type=Seq2SeqTextInput, dl_type=SortedDL, batch_tokenize_kwargs:dict={}, batch_decode_kwargs:dict={}, tok_kwargs={}, text_gen_kwargs={}, **kwargs) :: TextBlock

The core TransformBlock to prepare your inputs for training in Blurr with fastai's DataBlock API

| | Type | Default | Details |
|---|---|---|---|
| hf_arch | str | None | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_config | PretrainedConfig | None | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_tokenizer | PreTrainedTokenizerBase | None | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_model | PreTrainedModel | None | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| batch_tokenize_tfm | Optional[BatchTokenizeTransform] | None | The before_batch_tfm you want to use to tokenize your raw data on the fly (defaults to an instance of BatchTokenizeTransform) |
| batch_decode_tfm | Optional[BatchDecodeTransform] | None | The batch_tfm you want to use to decode your inputs into a type that can be used in the fastai show methods (defaults to BatchDecodeTransform) |
| max_length | int | None | To control the length of the padding/truncation for the input sequence. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See "Everything you always wanted to know about padding and truncation" |
| max_target_length | | None | To control the length of the padding/truncation for the target sequence. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| padding | Union[bool, str] | True | To control the padding applied to your hf_tokenizer during tokenization. If None, will default to False or 'do_not_pad'. See "Everything you always wanted to know about padding and truncation" |
| truncation | Union[bool, str] | True | To control the truncation applied to your hf_tokenizer during tokenization. If None, will default to False or 'do_not_truncate'. See "Everything you always wanted to know about padding and truncation" |
| input_return_type | _TensorMeta | Seq2SeqTextInput | The return type your decoded inputs should be cast to (used by methods such as show_batch) |
| dl_type | type | SortedDL | The type of DataLoader you want created (defaults to SortedDL) |
| batch_tokenize_kwargs | dict | None | Any keyword arguments you want applied to your batch_tokenize_tfm |
| batch_decode_kwargs | dict | None | Any keyword arguments you want applied to your batch_decode_tfm (will be set as fastai batch_tfms) |
| tok_kwargs | dict | None | Any keyword arguments you want your Hugging Face tokenizer to use during tokenization |
| text_gen_kwargs | dict | None | Any keyword arguments you want applied when generating text (default: default_text_gen_kwargs) |
| kwargs | | | No Content |
{% endraw %} {% raw %}
{% endraw %}

### show_batch

{% raw %}
{% endraw %}
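Below is a hedged sketch of the typical DataBlock setup for summarization, using the objects created in the Setup section. The tiny DataFrame and its "article"/"highlights" columns are made-up stand-ins for a real dataset (e.g., a slice of CNN/DailyMail), and the `dataloaders=dls` argument to `show_batch` follows the pattern used in BLURR's own examples.

```python
# A sketch of wiring Seq2SeqTextBlock into fastai's DataBlock API.
# `df`, its columns, and the batch size are hypothetical.
import pandas as pd
from fastai.data.block import DataBlock
from fastai.data.transforms import ColReader, RandomSplitter
from fastcore.basics import noop

from blurr.text.data.seq2seq.core import Seq2SeqTextBlock

df = pd.DataFrame(
    {
        "article": ["A long article we want to summarize ..."] * 10,
        "highlights": ["A short summary."] * 10,
    }
)

# The first block handles the inputs (tokenized on the fly by the block's batch
# tokenize transform); `noop` is used for the targets since they are assembled
# alongside the inputs.
blocks = (Seq2SeqTextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), noop)

dblock = DataBlock(
    blocks=blocks,
    get_x=ColReader("article"),
    get_y=ColReader("highlights"),
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
)

dls = dblock.dataloaders(df, bs=2)

# Passing the DataLoaders in explicitly lets the typedispatched show_batch find the
# Hugging Face tokenizer it needs to decode the input/target ids for display.
dls.show_batch(dataloaders=dls, max_n=2)
```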