---
title: text.data.core
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to turn your raw datasets into modelable `DataLoaders` for text/NLP tasks"
description: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to turn your raw datasets into modelable `DataLoaders` for text/NLP tasks"
nb_path: "nbs/11_text-data-core.ipynb"
---
import pandas as pd
from datasets import load_dataset, concatenate_datasets

# Grab the IMDB train/test splits and mark the test split as the validation set
raw_datasets = load_dataset("imdb", split=["train", "test"])
raw_datasets[0] = raw_datasets[0].add_column("is_valid", [False] * len(raw_datasets[0]))
raw_datasets[1] = raw_datasets[1].add_column("is_valid", [True] * len(raw_datasets[1]))
# Build a small DataFrame to work with (1,000 training / 200 validation examples)
final_ds = concatenate_datasets([raw_datasets[0].shuffle().select(range(1000)), raw_datasets[1].shuffle().select(range(200))])
imdb_df = pd.DataFrame(final_ds)
imdb_df.head()
labels = raw_datasets[0].features["label"].names
labels
model_cls = AutoModelForSequenceClassification
pretrained_model_name = "roberta-base" # "bert-base-multilingual-cased"
n_labels = len(labels)
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(
pretrained_model_name, model_cls=model_cls, config_kwargs={"num_labels": n_labels}
)
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
Starting with version 2.0, BLURR provides a sequence classification preprocessing class that can be used to preprocess DataFrames or Hugging Face Datasets. This class can be used to preprocess both multiclass and multilabel classification datasets, and it adds `proc_{your_text_attr}` and (optionally) `proc_{your_text_pair_attr}` attributes containing your modified text as a result of tokenization (e.g., if you specify a `max_length`, `proc_{your_text_attr}` may contain truncated text).
Note: This class works for both slow and fast tokenizers.
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels, tok_kwargs={"max_length": 24})
proc_df = preprocessor.process_df(imdb_df)
proc_df.columns, len(proc_df)
proc_df.head(2)
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)
proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
A `TextInput` object is returned from the `decodes` method of `BatchDecodeTransform` as a means to customize `@typedispatch`ed functions like `DataLoaders.show_batch` and `Learner.show_results`. Its value will be your "input_ids".
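As a hedged illustration (the signature and body below are assumptions for this sketch, not BLURR's actual implementation, and the notebook's earlier imports and `hf_tokenizer` are assumed to be in scope), this is the kind of customization a typed input enables:
@typedispatch
def show_batch(x: TextInput, y, samples, dataloaders=None, ctxs=None, max_n=6, trunc_at=None, **kwargs):
    # `x` holds the batch's "input_ids"; decode each example back to text for display
    for ids in x[:max_n]:
        print(hf_tokenizer.decode(ids, skip_special_tokens=True)[:trunc_at])
    return ctxs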
Inspired by this article, `BatchTokenizeTransform` inputs can come in as raw text, a list of words (e.g., tasks like Named Entity Recognition (NER), where you want to predict the label of each token), or as a dictionary that includes extra information you want to use during post-processing.
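For illustration, the three forms might look like this (the values below are made up):
# Three acceptable shapes for `BatchTokenizeTransform` inputs (illustrative values only)
raw_text = "I really enjoyed this movie."                     # plain text
word_list = ["I", "really", "enjoyed", "this", "movie", "."]  # pre-split words (e.g., token classification/NER)
with_extras = {"text": raw_text, "doc_id": 42}                # dict: text plus extra info for post-processing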
On-the-fly Batch-Time Tokenization:
Part of the inspiration for this derives from the mechanics of Hugging Face tokenizers, in particular the fact that a tokenizer can return a collated mini-batch of data given a list of sequences. As such, the collating required for our inputs can be done during tokenization, in a `before_batch_tfms` transform (where we get a list of examples), before our batch transforms run! This allows users of BLURR to have everything done dynamically at batch-time, without prior preprocessing, with at least four potential benefits.
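For reference, here is the tokenizer behavior being leveraged, in a standalone sketch with made-up strings: calling a Hugging Face tokenizer on a list of sequences returns a padded, collated mini-batch.
# Calling the tokenizer on a list of sequences yields an already-collated mini-batch
batch = hf_tokenizer(
    ["a short example", "a second, noticeably longer example sentence"],
    padding=True, truncation=True, return_tensors="pt",
)
batch["input_ids"].shape  # e.g., torch.Size([2, longest_sequence_length])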
As of fastai 2.1.5, before-batch transforms no longer have a `decodes` method ... and so, I've introduced a standard batch transform here, `BatchDecodeTransform` (one that occurs "after" the batch has been created), that will do the decoding for us.
`TextBlock`, a basic `DataBlock` for our inputs, is designed with sensible defaults to minimize the user effort involved in defining their transforms pipeline. It handles setting up your `BatchTokenizeTransform` and `BatchDecodeTransform` transforms regardless of data source (e.g., this will work with files, DataFrames, whatever).
Note: You must either pass in your own instance of a `BatchTokenizeTransform` class or the Hugging Face objects returned from `BLURR.get_hf_objects` (e.g., architecture, config, tokenizer, and model). The other args are optional.
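To make the note above concrete, here is a hedged sketch of the two construction routes; the keyword name `batch_tokenize_tfm` is an assumption for illustration, so check the library signature for the exact spelling.
# Option 1: pass the Hugging Face objects and let TextBlock build the transforms for you
text_block = TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model)

# Option 2 (hypothetical kwarg name): pass in your own pre-configured BatchTokenizeTransform
batch_tok_tfm = BatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model)
text_block = TextBlock(batch_tokenize_tfm=batch_tok_tfm)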
We also include a `blurr_sort_func` that works with `SortedDL` to properly sort based on the number of tokens in each example.
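Below is a hedged sketch of wiring `SortedDL` up with `blurr_sort_func` via the `DataBlock`'s `dl_type` argument; the `partial(...)` keyword names are assumptions for illustration, and `TextBlock` may already configure this for you.
# Group examples of similar token length into the same batch to minimize padding
dblock = DataBlock(
    blocks=(TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock),
    get_x=ColReader("text"),
    get_y=ColReader("label"),
    splitter=ColSplitter(),
    dl_type=partial(SortedDL, sort_func=partial(blurr_sort_func, hf_tokenizer=hf_tokenizer)),
)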
There are a bunch of ways we can get at the four Hugging Face elements we need (e.g., architecture name, tokenizer, config, and model). We can just create them directly, or we can use one of the helper methods available via `NLP`.
from transformers import AutoModelForSequenceClassification
model_cls = AutoModelForSequenceClassification
pretrained_model_name = "distilroberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
b[0]
Let's take a look at the actual types represented by our batch:
explode_types(b)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)
proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ItemGetter("proc_text"), get_y=ItemGetter("label"), splitter=RandomSplitter())
dls = dblock.dataloaders(proc_ds, bs=4)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
As of v2.0, BLURR also allows you to pass extra information alongside your inputs in the form of a dictionary. If you use this approach, you must assign your text(s) to the `text` attribute of the dictionary. This is a useful approach when splitting long documents into chunks but wanting to score/predict by example rather than by chunk (for example, in extractive question answering tasks).
Note: A good place to access this extra information during training/validation is in the `before_batch` method of a `Callback` (see the sketch after the example below).
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
def get_x(item):
    # Everything to be tokenized goes under "text"; any other keys ride along as extra info
    return {"text": item.text, "another_val": "testing123"}
dblock = DataBlock(blocks=blocks, get_x=get_x, get_y=ColReader("label"), splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
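As noted earlier, a convenient place to read these extra values back is a `Callback`. The sketch below is hedged: whether the extra keys surface under their original names in `self.xb[0]` is an assumption for illustration, so inspect `dls.one_batch()` to confirm how they are packaged.
class GrabExtraInfo(Callback):
    def before_batch(self):
        # Hypothetical access path: pull the extra value that `get_x` tucked into the inputs
        extra = self.xb[0].get("another_val", None)
        # ... use `extra` here (e.g., stash it for post-processing after predictions)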
raw_datasets = load_dataset("glue", "mrpc")
def tokenize_function(example):
return hf_tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
We can now build our `DataLoaders`. Use `TextDataLoader` to build BLURR-friendly dataloaders from your pre-tokenized datasets. Passing `{'labels': label_names}` to your `batch_decode_kwargs` will ensure that your label/target names are displayed in methods like `show_batch` and `show_results` (just as it works with the mid-level API).
label_names = raw_datasets["train"].features["label"].names
trn_dl = TextDataLoader(
tokenized_datasets["train"],
hf_arch,
hf_config,
hf_tokenizer,
hf_model,
preproccesing_func=preproc_hf_dataset,
batch_decode_kwargs={"labels": label_names},
shuffle=True,
batch_size=8,
)
val_dl = TextDataLoader(
tokenized_datasets["validation"],
hf_arch,
hf_config,
hf_tokenizer,
hf_model,
preproccesing_func=preproc_hf_dataset,
batch_decode_kwargs={"labels": label_names},
batch_size=16,
)
dls = DataLoaders(trn_dl, val_dl)
b = dls.one_batch()
b[0]["input_ids"].shape
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=800)
The tests below ensure the core DataBlock code above works for all pretrained sequence classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would be required to download.
Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).
The `text.data.core` module contains the fundamental bits for all data preprocessing tasks.