---
title: text.data.core
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to turn your raw datasets into modelable `DataLoaders` for text/NLP tasks"
description: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to turn your raw datasets into modelable `DataLoaders` for text/NLP tasks"
nb_path: "nbs/11_text-data-core.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
 
{% endraw %} {% raw %}
{% endraw %} {% raw %}
What we're running with at the time this documentation was generated:
torch: 1.10.1+cu111
fastai: 2.5.6
transformers: 4.16.2
{% endraw %}

Setup

We'll use a subset of the imdb dataset to demonstrate how to configure BLURR for sequence classification tasks.

{% raw %}
raw_datasets = load_dataset("imdb", split=["train", "test"])
raw_datasets[0] = raw_datasets[0].add_column("is_valid", [False] * len(raw_datasets[0]))
raw_datasets[1] = raw_datasets[1].add_column("is_valid", [True] * len(raw_datasets[1]))

final_ds = concatenate_datasets([raw_datasets[0].shuffle().select(range(1000)), raw_datasets[1].shuffle().select(range(200))])
imdb_df = pd.DataFrame(final_ds)
imdb_df.head()
Reusing dataset imdb (/home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
Loading cached shuffled indices for dataset at /home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-65b5588450d6b196.arrow
text label is_valid
0 This movie was horrible. I swear they didn't even write a script they just kinda winged it through out the whole movie. Ice-T was annoying as hell. *SPOILERS Phht more like reasons not to watch it* They sit down and eat breakfast for 20 minutes. he coulda been long gone. The ground was hard it would of been close to impossible to to track him with out dogs. And when ICE-T is on that Hill and uses that Spaz-15 Assault SHOTGUN like its a sniper rifle (and then cuts down a tree with eight shells?? It would take 1000's of shells to cut down a tree that size.) Shotguns and hand guns are conside... 0 False
1 I have seen this movie at the cinema many years ago, and one thing surprised me so negatively that I could not see any redeeming virtues in the movies: Dennis Quaid was cast as a policeman that never smiles or grin, while his smile and grin are two of his trademarks. Danny Glover was cast as the bad guy, but - again - most viewers' imagination could not go far enough as to believe him in that role. Also, Jared Leto was not believable as the former medicine student. The tension was just not there, since the killer was known very early. The finale was, again, neither dramatic nor tense: nobo... 0 False
2 This is a fantastic series first and foremost. It is very well done and very interesting. As a huge WWII buff, I had learned a lot before seeing this series. One of the best things this has going for it is all the interviews with past individuals back when the war was relatively fresh in their minds, comparatively speaking that is. It is nothing against the men that you see getting interviewed in the programs of today, it is just that most of these men weren't really involved in the upper echelons of what was happening then. One of the best parts is the narrating by Sir Laurence Oliver. I ... 1 False
3 Kurosawa really blew it on this one. Every genius is allowed a failure. The concept is fine but the execution is badly blurred.<br /><br />There is an air of fantasy about this film making it something of an art film. The poverty stricken of Tokyo deserve a fairer and more realistic portrayal. Many of them have interesting stories to tell. A very disappointing film. 0 False
4 MGM were unsure of how to market Garbo when she first arrived in Hollywood. Mayer had a lot of faith in her and her appearance in "Torrent" justified that. She did not speak a word of English so she must have found it difficult to work, also Ricardo Cortez did not make it very easy for her.<br /><br />The torrent of the title is the river Juscar that winds through a sleepy little village in Spain. Leonora (Greta Garbo) hopes someday that her voice will bring great wealth and happiness to her struggling parents. Leonora and Don Rafael (Ricardo Cortez) are in love but he is under his mother'... 1 False
{% endraw %} {% raw %}
labels = raw_datasets[0].features["label"].names
labels
['neg', 'pos']
{% endraw %} {% raw %}
model_cls = AutoModelForSequenceClassification

pretrained_model_name = "roberta-base"  # "bert-base-multilingual-cased"
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(
    pretrained_model_name, model_cls=model_cls, config_kwargs={"num_labels": n_labels}
)

hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('roberta',
 transformers.models.roberta.configuration_roberta.RobertaConfig,
 transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast,
 transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification)
{% endraw %}

Preprocessing

Starting with version 2.0, BLURR provides a preprocessing base class that can be used to build task-specific preprocessed datasets from pandas DataFrames or Hugging Face Datasets.

{% raw %}

class Preprocessor[source]

Preprocessor(hf_tokenizer:PreTrainedTokenizerBase, batch_size:int=1000, text_attr:str='text', text_pair_attr:Optional[str]=None, is_valid_attr:Optional[str]='is_valid', tok_kwargs:dict={})

| Argument | Type | Default | Details |
|---|---|---|---|
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| batch_size | int | 1000 | The number of examples to process at a time |
| text_attr | str | text | The attribute holding the text |
| text_pair_attr | Optional[str] | None | The attribute holding the text_pair |
| is_valid_attr | Optional[str] | is_valid | The attribute that should be created if you are processing individual training and validation datasets into a single dataset; it indicates the dataset each example is associated with |
| tok_kwargs | dict | None | Tokenization kwargs that will be applied when calling the tokenizer |
{% endraw %} {% raw %}
{% endraw %} {% raw %}

class ClassificationPreprocessor[source]

ClassificationPreprocessor(hf_tokenizer:PreTrainedTokenizerBase, batch_size:int=1000, is_multilabel:bool=False, id_attr:Optional[str]=None, text_attr:str='text', text_pair_attr:Optional[str]=None, label_attrs:Union[str, typing.List[str]]='label', is_valid_attr:Optional[str]='is_valid', label_mapping:Optional[List[str]]=None, tok_kwargs:dict={}) :: Preprocessor

| Argument | Type | Default | Details |
|---|---|---|---|
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| batch_size | int | 1000 | The number of examples to process at a time |
| is_multilabel | bool | False | Whether the dataset should be processed for multi-label; if True, will ensure label_attrs are converted to a value of either 0 or 1, indicating the existence of the class in the example |
| id_attr | Optional[str] | None | The unique identifier in the dataset |
| text_attr | str | text | The attribute holding the text |
| text_pair_attr | Optional[str] | None | The attribute holding the text_pair |
| label_attrs | Union[str, List[str]] | label | The attribute(s) holding the label(s) of the example |
| is_valid_attr | Optional[str] | is_valid | The attribute that should be created if you are processing individual training and validation datasets into a single dataset; it indicates the dataset each example is associated with |
| label_mapping | Optional[List[str]] | None | A list indicating the valid labels for the dataset (optional, defaults to the unique set of labels found in the full dataset) |
| tok_kwargs | dict | None | Tokenization kwargs that will be applied when calling the tokenizer |
{% endraw %} {% raw %}
{% endraw %}

Starting with version 2.0, BLURR provides a sequence classification preprocessing class that can be used to preprocess DataFrames or Hugging Face Datasets.

This class can be used for preprocessing both multiclass and multilabel classification datasets, and includes proc_{your_text_attr} and (optional) proc_{your_text_pair_attr} attributes containing your text as modified by tokenization (e.g., if you specify a max_length, proc_{your_text_attr} may contain truncated text).

Note: This class works for both slow and fast tokenizers

Using a DataFrame

{% raw %}
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels, tok_kwargs={"max_length": 24})
proc_df = preprocessor.process_df(imdb_df)
proc_df.columns, len(proc_df)
proc_df.head(2)
proc_text text label is_valid label_name text_start_char_idx text_end_char_idx
0 This movie was horrible. I swear they didn't even write a script they just kinda winged it through out This movie was horrible. I swear they didn't even write a script they just kinda winged it through out the whole movie. Ice-T was annoying as hell. *SPOILERS Phht more like reasons not to watch it* They sit down and eat breakfast for 20 minutes. he coulda been long gone. The ground was hard it would of been close to impossible to to track him with out dogs. And when ICE-T is on that Hill and uses that Spaz-15 Assault SHOTGUN like its a sniper rifle (and then cuts down a tree with eight shells?? It would take 1000's of shells to cut down a tree that size.) Shotguns and hand guns are conside... 0 False neg 0 102
1 I have seen this movie at the cinema many years ago, and one thing surprised me so negatively that I could I have seen this movie at the cinema many years ago, and one thing surprised me so negatively that I could not see any redeeming virtues in the movies: Dennis Quaid was cast as a policeman that never smiles or grin, while his smile and grin are two of his trademarks. Danny Glover was cast as the bad guy, but - again - most viewers' imagination could not go far enough as to believe him in that role. Also, Jared Leto was not believable as the former medicine student. The tension was just not there, since the killer was known very early. The finale was, again, neither dramatic nor tense: nobo... 0 False neg 0 106
{% endraw %}

Using a Hugging Face Dataset

{% raw %}
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)
proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
Dataset({
    features: ['proc_text', 'text', 'label', 'is_valid', 'label_name', 'text_start_char_idx', 'text_end_char_idx'],
    num_rows: 1200
})
{% endraw %}

Mid-level API

Base tokenization, batch transform, and DataBlock methods

{% raw %}

class TextInput[source]

TextInput(x, **kwargs) :: TensorBase

The base representation of your inputs; used by the various fastai show methods

{% endraw %} {% raw %}
{% endraw %}

A TextInput object is returned from the decodes method of BatchDecodeTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. Its value will be your "input_ids".
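For example, you could register your own show_batch for TextInput to customize how batches are displayed. The sketch below is illustrative only: the column layout is made up, and it assumes the transform returned by the first_blurr_tfm helper (documented further down) exposes the tokenizer it was built with.

{% raw %}
import pandas as pd
from fastcore.dispatch import typedispatch
from IPython.display import display


@typedispatch
def show_batch(x: TextInput, y, samples, dataloaders=None, ctxs=None, max_n=6, trunc_at=None, **kwargs):
    # `samples` holds the decoded (input, target) pairs fastai assembles for display
    hf_tokenizer = first_blurr_tfm(dataloaders).hf_tokenizer  # assumption: the Blurr transform carries its tokenizer
    rows = [
        {"text": hf_tokenizer.decode(s[0], skip_special_tokens=True)[:trunc_at], "target": s[1]}
        for s in samples[:max_n]
    ]
    display(pd.DataFrame(rows))
    return ctxs
{% endraw %}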

{% raw %}

class BatchTokenizeTransform[source]

BatchTokenizeTransform(hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, include_labels:bool=True, ignore_token_id:int=-100, max_length:int=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, is_split_into_words:bool=False, tok_kwargs:dict={}, **kwargs) :: Transform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

| Argument | Type | Default | Details |
|---|---|---|---|
| hf_arch | str | | The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.) |
| hf_config | PretrainedConfig | | A specific configuration instance you want to use |
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| hf_model | PreTrainedModel | | A Hugging Face model |
| include_labels | bool | True | Controls whether the "labels" are included in your inputs. If they are, the loss will be calculated in the model's forward function and you can simply use PreCalculatedLoss as your Learner's loss function |
| ignore_token_id | int | -100 | The token ID that should be ignored when calculating the loss |
| max_length | int | None | Controls the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See "Everything you always wanted to know about padding and truncation" |
| padding | Union[bool, str] | True | Controls the padding applied by your hf_tokenizer during tokenization. If None, will default to False or 'do_not_pad'. See "Everything you always wanted to know about padding and truncation" |
| truncation | Union[bool, str] | True | Controls the truncation applied by your hf_tokenizer during tokenization. If None, will default to False or 'do_not_truncate'. See "Everything you always wanted to know about padding and truncation" |
| is_split_into_words | bool | False | The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to True if your inputs are pre-tokenized (not numericalized) |
| tok_kwargs | dict | None | Any other keyword arguments you want included when using your hf_tokenizer to tokenize your inputs |
| kwargs | | | |
{% endraw %} {% raw %}
{% endraw %}

Inspired by this article, BatchTokenizeTransform inputs can come in as raw text, as a list of words (e.g., for tasks like named entity recognition (NER), where you want to predict the label of each token), or as a dictionary that includes extra information you want to use during post-processing (see the sketch below).
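Roughly, the three input forms look like this; the get_x callables below are hypothetical, as are the tokens and id columns they read from:

{% raw %}
# 1. raw text
get_x_text = ColReader("text")


# 2. a list of words, pre-tokenized but not numericalized (e.g., for NER);
#    remember to also set is_split_into_words=True
def get_x_words(item):
    return item["tokens"]


# 3. a dictionary with extra information; "text" is required, "doc_id" is an arbitrary extra key
def get_x_dict(item):
    return {"text": item["text"], "doc_id": item["id"]}
{% endraw %}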

On-the-fly Batch-Time Tokenization:

Part of the inspiration for this derives from the mechanics of Hugging Face tokenizers, in particular their ability to return a collated mini-batch of data given a list of sequences. As such, the collating required for our inputs can be done during tokenization, in a before_batch transform (where we get a list of examples), before our batch transforms run. This allows users of BLURR to have everything done dynamically at batch time, without prior preprocessing, with at least four potential benefits (see the sketch after this list):

  1. Less code
  2. Faster mini-batch creation
  3. Less RAM utilization and time spent tokenizing beforehand (this really helps with very large datasets)
  4. Flexibility
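The core mechanic is simply that a Hugging Face tokenizer can pad, truncate, and collate a list of raw strings into model-ready tensors in a single call; a minimal sketch (illustrative only, not BLURR internals):

{% raw %}
texts = ["I loved this film!", "Meh, not for me."]

# one call pads/truncates the whole list and returns an already collated mini-batch
batch = hf_tokenizer(texts, padding=True, truncation=True, max_length=32, return_tensors="pt")
batch["input_ids"].shape  # e.g., torch.Size([2, seq_len])
{% endraw %}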
{% raw %}

class BatchDecodeTransform[source]

BatchDecodeTransform(input_return_type:typing.Type=TextInput, hf_arch:Optional[str]=None, hf_config:Optional[PretrainedConfig]=None, hf_tokenizer:Optional[PreTrainedTokenizerBase]=None, hf_model:Optional[PreTrainedModel]=None, **kwargs) :: Transform

A class used to cast your inputs as input_return_type for fastai show methods

| Argument | Type | Default | Details |
|---|---|---|---|
| input_return_type | Type | TextInput | Used by typedispatched show methods |
| hf_arch | Optional[str] | None | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_config | Optional[PretrainedConfig] | None | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_tokenizer | Optional[PreTrainedTokenizerBase] | None | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_model | Optional[PreTrainedModel] | None | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| kwargs | | | |
{% endraw %} {% raw %}
{% endraw %}

As of fastai 2.1.5, before-batch transforms no longer have a decodes method ... and so I've introduced a standard batch transform here, BatchDecodeTransform (one that occurs "after" the batch has been created), that will do the decoding for us.

{% raw %}
{% endraw %} {% raw %}

blurr_sort_func[source]

blurr_sort_func(example, hf_tokenizer:PreTrainedTokenizerBase, is_split_into_words:bool=False, tok_kwargs:dict={})

This method is used by SortedDL to sort your dataset by the length of each example after tokenization

| Argument | Type | Default | Details |
|---|---|---|---|
| example | | | |
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| is_split_into_words | bool | False | The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to True if your inputs are pre-tokenized (not numericalized) |
| tok_kwargs | dict | None | Any other keyword arguments you want to include during tokenization |
{% endraw %} {% raw %}

class TextBlock[source]

TextBlock(hf_arch:Optional[str]=None, hf_config:Optional[PretrainedConfig]=None, hf_tokenizer:Optional[PreTrainedTokenizerBase]=None, hf_model:Optional[PreTrainedModel]=None, include_labels:bool=True, ignore_token_id=-100, batch_tokenize_tfm:Optional[BatchTokenizeTransform]=None, batch_decode_tfm:Optional[BatchDecodeTransform]=None, max_length:Optional[int]=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, is_split_into_words:bool=False, input_return_type:typing.Type=TextInput, dl_type:Optional[DataLoader]=None, batch_tokenize_kwargs:dict={}, batch_decode_kwargs:dict={}, tok_kwargs:dict={}, text_gen_kwargs:dict={}, **kwargs) :: TransformBlock

The core TransformBlock to prepare your inputs for training in Blurr with fastai's DataBlock API

| Argument | Type | Default | Details |
|---|---|---|---|
| hf_arch | Optional[str] | None | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to batch_tokenize_tfm) |
| hf_config | Optional[PretrainedConfig] | None | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to batch_tokenize_tfm) |
| hf_tokenizer | Optional[PreTrainedTokenizerBase] | None | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to batch_tokenize_tfm) |
| hf_model | Optional[PreTrainedModel] | None | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to batch_tokenize_tfm) |
| include_labels | bool | True | Controls whether the "labels" are included in your inputs. If they are, the loss will be calculated in the model's forward function and you can simply use PreCalculatedLoss as your Learner's loss function |
| ignore_token_id | int | -100 | The token ID that should be ignored when calculating the loss |
| batch_tokenize_tfm | Optional[BatchTokenizeTransform] | None | The before_batch transform you want to use to tokenize your raw data on the fly (defaults to an instance of BatchTokenizeTransform) |
| batch_decode_tfm | Optional[BatchDecodeTransform] | None | The batch_tfm you want to use to decode your inputs into a type that can be used in the fastai show methods (defaults to BatchDecodeTransform) |
| max_length | Optional[int] | None | Controls the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See "Everything you always wanted to know about padding and truncation" |
| padding | Union[bool, str] | True | Controls the padding applied by your hf_tokenizer during tokenization. If None, will default to False or 'do_not_pad'. See "Everything you always wanted to know about padding and truncation" |
| truncation | Union[bool, str] | True | Controls the truncation applied by your hf_tokenizer during tokenization. If None, will default to False or 'do_not_truncate'. See "Everything you always wanted to know about padding and truncation" |
| is_split_into_words | bool | False | The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to True if your inputs are pre-tokenized (not numericalized) |
| input_return_type | Type | TextInput | The return type your decoded inputs should be cast to (used by methods such as show_batch) |
| dl_type | Optional[DataLoader] | None | The type of DataLoader you want created (defaults to SortedDL) |
| batch_tokenize_kwargs | dict | None | Any keyword arguments you want applied to your batch_tokenize_tfm |
| batch_decode_kwargs | dict | None | Any keyword arguments you want applied to your batch_decode_tfm (will be set as a fastai batch_tfms) |
| tok_kwargs | dict | None | Any keyword arguments you want your Hugging Face tokenizer to use during tokenization |
| text_gen_kwargs | dict | None | Any keyword arguments you want applied when generating text |
| kwargs | | | |
{% endraw %} {% raw %}
{% endraw %}

A basic DataBlock for our inputs, TextBlock is designed with sensible defaults to minimize user effort in defining their transforms pipeline. It handles setting up your BatchTokenizeTransform and BatchDecodeTransform transforms regardless of data source (e.g., this will work with files, DataFrames, whatever).

Note: You must either pass in your own instance of a BatchTokenizeTransform class or the Hugging Face objects returned from NLP.get_hf_objects (e.g., architecture, config, tokenizer, and model). The other args are optional (see the sketch below).
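For example, if you need finer control over batch-time tokenization, you could build the transform yourself and hand it to TextBlock. A rough sketch based on the signatures above (the max_length value is arbitrary):

{% raw %}
# build the batch-time tokenization transform explicitly ...
batch_tok_tfm = BatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, max_length=128)

# ... and let TextBlock wire it into your DataBlock
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())
{% endraw %}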

We also include a blurr_sort_func that works with SortedDL to properly sort based on the number of tokens in each example.
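If you supply your own dl_type, the sort function might be wired up along these lines (a sketch only; TextBlock already does the equivalent by default):

{% raw %}
from functools import partial

from fastai.text.data import SortedDL

# sort examples by tokenized length to minimize padding within a mini-batch
sort_func = partial(blurr_sort_func, hf_tokenizer=hf_tokenizer)
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, dl_type=partial(SortedDL, sort_func=sort_func)), CategoryBlock)
{% endraw %}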

Utility classes and methods

These methods are used internally to get the Blurr transforms associated with your DataLoaders

{% raw %}
{% endraw %} {% raw %}

get_blurr_tfm[source]

get_blurr_tfm(tfms_list:Pipeline, tfm_class:Transform=BatchTokenizeTransform)

Given a fastai DataLoaders' batch transforms, this method can be used to get a transform instance used in your Blurr DataBlock

| Argument | Type | Default | Details |
|---|---|---|---|
| tfms_list | Pipeline | | A list of transforms (e.g., dls.after_batch, dls.before_batch, etc.) |
| tfm_class | Transform | BatchTokenizeTransform | The transform to find |
{% endraw %} {% raw %}
{% endraw %} {% raw %}

first_blurr_tfm[source]

first_blurr_tfm(dls:DataLoaders, tfms:List[Transform]=[<class 'blurr.text.data.core.BatchTokenizeTransform'>, <class 'blurr.text.data.core.BatchDecodeTransform'>])

This convenience method will find the first Blurr transform required for methods such as show_batch and show_results. The returned transform should have everything you need to properly decode and 'show' your Hugging Face inputs/targets

| Argument | Type | Default | Details |
|---|---|---|---|
| dls | DataLoaders | | Your fastai DataLoaders |
| tfms | List[Transform] | (BatchTokenizeTransform, BatchDecodeTransform) | The Blurr transforms to look for, in order |
{% endraw %} {% raw %}
{% endraw %}
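They also come in handy when all you have is a Blurr-built DataLoaders and you need the Hugging Face objects back (e.g., in custom show or inference code). A hedged sketch, assuming the returned transform exposes the objects it was constructed with:

{% raw %}
# dls is a DataLoaders built with Blurr (e.g., via TextBlock)
tfm = first_blurr_tfm(dls)
hf_tokenizer, hf_model = tfm.hf_tokenizer, tfm.hf_model
{% endraw %}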

Mid-level Examples

The following examples demonstrate several approaches to constructing your DataBlock for sequence classification tasks using the mid-level API.

Batch-Time Tokenization

Step 1: Get your Hugging Face objects.

There are a bunch of ways we can get at the four Hugging Face elements we need (e.g., architecture name, tokenizer, config, and model). We can just create them directly, or we can use one of the helper methods available via NLP.

{% raw %}
from transformers import AutoModelForSequenceClassification

model_cls = AutoModelForSequenceClassification

pretrained_model_name = "distilroberta-base"  # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
{% endraw %}
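For completeness, the "create them directly" route mentioned above might look roughly like this, using the standard Hugging Face Auto classes instead of BLURR's helper (with this route you track the architecture name yourself):

{% raw %}
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

hf_arch = "roberta"  # tracked manually when not using NLP.get_hf_objects
hf_config = AutoConfig.from_pretrained(pretrained_model_name)
hf_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
hf_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=hf_config)
{% endraw %}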

Step 2: Create your DataBlock

{% raw %}
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())
{% endraw %}

Step 3: Build your DataLoaders

{% raw %}
dls = dblock.dataloaders(imdb_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
{% endraw %} {% raw %}
b[0]
{% endraw %}

Let's take a look at the actual types represented by our batch

{% raw %}
explode_types(b)
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
{% endraw %}

Using a preprocessed dataset

Preprocessing your raw data is the more traditional approach to using Transformers. It is required, for example, when you want to work with documents longer than your model will allow. A preprocessed dataset is used in the same way a non-preprocessed dataset is.

Step 1a: Get your Hugging Face objects.

{% raw %}
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
{% endraw %}

Step 1b. Preprocess dataset

{% raw %}
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)
proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
{% endraw %}

Step 2: Create your DataBlock

{% raw %}
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ItemGetter("proc_text"), get_y=ItemGetter("label"), splitter=RandomSplitter())
{% endraw %}

Step 3: Build your DataLoaders

{% raw %}
dls = dblock.dataloaders(proc_ds, bs=4)
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
{% endraw %}

Passing extra information

As of v2.0, BLURR also allows you to pass extra information alongside your inputs in the form of a dictionary. If you use this approach, you must assign your text(s) to the text key of that dictionary. This is useful when you split long documents into chunks but want to score/predict per example rather than per chunk (for example, in extractive question answering tasks).

Note: A good place to access this extra information during training/validation is in the before_batch method of a Callback (see the sketch after the example below).

{% raw %}
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)


def get_x(item):
    return {"text": item.text, "another_val": "testing123"}


dblock = DataBlock(blocks=blocks, get_x=get_x, get_y=ColReader("label"), splitter=ColSplitter())
{% endraw %} {% raw %}
dls = dblock.dataloaders(imdb_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
{% endraw %}
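To make the Note above concrete, here is a minimal Callback sketch. It assumes the extra keys supplied by get_x (here another_val) ride along in the batch's inputs dictionary; treat it as illustrative rather than as BLURR's prescribed pattern:

{% raw %}
from fastai.callback.core import Callback


class ExtraInfoCallback(Callback):
    def before_batch(self):
        inputs = self.learn.xb[0]          # the dictionary of model inputs for this mini-batch
        extra = inputs.get("another_val")  # hypothetical extra key supplied by get_x above
        # ... use `extra` here (e.g., stash it for a custom metric or post-processing step)
{% endraw %}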

Low-level API

For working with PyTorch and/or fast.ai Datasets & DataLoaders, the low-level API allows you to get back fast.ai specific features such as show_batch, show_results, etc... when using plain ol' PyTorch Datasets, Hugging Face Datasets, etc...

{% raw %}

class TextBatchCreator[source]

TextBatchCreator(hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, data_collator:typing.Type=None)

A class that can be assigned to a TfmdDL.create_batch method; used in Blurr's low-level API to create batches that can be used in the Blurr library

| Argument | Type | Default | Details |
|---|---|---|---|
| hf_arch | str | | The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.) |
| hf_config | PretrainedConfig | | A specific configuration instance you want to use |
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| hf_model | PreTrainedModel | | A Hugging Face model |
| data_collator | Type | None | Defaults to Hugging Face's DataCollatorWithPadding(tokenizer=hf_tokenizer) |
{% endraw %} {% raw %}
{% endraw %} {% raw %}

class TextDataLoader[source]

TextDataLoader(dataset:Union[Dataset, Datasets], hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, batch_creator:Optional[TextBatchCreator]=None, batch_decode_tfm:Optional[BatchDecodeTransform]=None, input_return_type:typing.Type=TextInput, preproccesing_func:Callable[typing.Union[torch.utils.data.dataset.Dataset, fastai.data.core.Datasets], PreTrainedTokenizerBase, PreTrainedModel, typing.Union[torch.utils.data.dataset.Dataset, fastai.data.core.Datasets]]=None, batch_decode_kwargs:dict={}, bs=64, shuffle=False, num_workers=None, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A transformed DataLoader that works with Blurr. From the fastai docs, a TfmdDL is "a DataLoader that creates Pipeline from a list of Transforms for the callbacks after_item, before_batch and after_batch. As a result, it can decode or show a processed batch."

| Argument | Type | Default | Details |
|---|---|---|---|
| dataset | Union[Dataset, Datasets] | | A standard PyTorch Dataset |
| hf_arch | str | | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_config | PretrainedConfig | | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_model | PreTrainedModel | | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| batch_creator | Optional[TextBatchCreator] | None | An instance of TextBatchCreator or equivalent (defaults to TextBatchCreator) |
| batch_decode_tfm | Optional[BatchDecodeTransform] | None | The batch_tfm used to decode Blurr batches (defaults to BatchDecodeTransform) |
| input_return_type | Type | TextInput | Used by typedispatched show methods |
| preproccesing_func | Callable[[Union[Dataset, Datasets], PreTrainedTokenizerBase, PreTrainedModel], Union[Dataset, Datasets]] | None | (optional) A preprocessing function that will be applied to your dataset |
| batch_decode_kwargs | dict | None | Keyword arguments to be applied to your batch_decode_tfm |

Valid keyword arguments (all passed through to TfmdDL.__init__): bs (int, 64), shuffle (bool, False), num_workers, verbose (bool, False), do_setup (bool, True), pin_memory (bool, False), timeout (int, 0), batch_size, drop_last (bool, False), indexed, n, device, persistent_workers (bool, False), wif, before_iter, after_item, before_batch, after_batch, after_iter, create_batches, create_item, create_batch, retain, get_idxs, sample, shuffle_fn, do_batch.
{% endraw %} {% raw %}
{% endraw %}

Low-level Examples

The following example demonstrates how to use the low-level API with standard PyTorch/Hugging Face/fast.ai Datasets and DataLoaders.

Step 1: Build your datasets

{% raw %}
raw_datasets = load_dataset("glue", "mrpc")
{% endraw %} {% raw %}
def tokenize_function(example):
    return hf_tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
{% endraw %}

Step 2: Dataset pre-processing (optional)

{% raw %}
{% endraw %} {% raw %}

preproc_hf_dataset[source]

preproc_hf_dataset(dataset:Union[Dataset, Datasets], hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel)

This method can be used to preprocess most Hugging Face Datasets for use in Blurr and other training libraries

| Argument | Type | Default | Details |
|---|---|---|---|
| dataset | Union[Dataset, Datasets] | | A standard PyTorch Dataset or fastai Datasets |
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| hf_model | PreTrainedModel | | A Hugging Face model |
{% endraw %}
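You can also call it directly, rather than handing it to TextDataLoader via the preproccesing_func argument as in Step 3 below; a small sketch based on the signature above:

{% raw %}
proc_train_ds = preproc_hf_dataset(tokenized_datasets["train"], hf_tokenizer, hf_model)
{% endraw %}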

Step 3: Build your DataLoaders.

Use TextDataLoader to build Blurr-friendly DataLoaders from your datasets. Passing {'labels': label_names} via batch_decode_kwargs will ensure that your label/target names are displayed in methods like show_batch and show_results (just as with the mid-level API)

{% raw %}
label_names = raw_datasets["train"].features["label"].names

trn_dl = TextDataLoader(
    tokenized_datasets["train"],
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    preproccesing_func=preproc_hf_dataset,
    batch_decode_kwargs={"labels": label_names},
    shuffle=True,
    batch_size=8,
)

val_dl = TextDataLoader(
    tokenized_datasets["validation"],
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    preproccesing_func=preproc_hf_dataset,
    batch_decode_kwargs={"labels": label_names},
    batch_size=16,
)

dls = DataLoaders(trn_dl, val_dl)
{% endraw %} {% raw %}
b = dls.one_batch()
b[0]["input_ids"].shape
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=800)
{% endraw %}

Tests

The tests below ensure the core DataBlock code above works for all pretrained sequence classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.

Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a github issue (or a PR if you'd like to fix it yourself)

{% raw %}
{% endraw %}

The text.data.core module contains the fundamental bits for all data preprocessing tasks