---
title: data.core
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations."
description: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations."
nb_path: "nbs/01_data-core.ipynb"
---
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
An HF_BaseInput object is returned from the decodes method of HF_BatchTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. It represents the "input_ids" of a huggingface sequence as a tensor, with a show method that requires a huggingface tokenizer for proper display.
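To make that concrete, here is a minimal, hypothetical sketch of such a show-able input type (not blurr's actual implementation), assuming fastai's TensorBase and a huggingface tokenizer are available:

from fastai.torch_core import TensorBase

class ShowableInputSketch(TensorBase):
    "Hypothetical stand-in for HF_BaseInput: a tensor of input_ids that knows how to display itself"
    def show(self, hf_tokenizer, **kwargs):
        # decode the ids back into readable text so show_batch/show_results have something to print
        return hf_tokenizer.decode(self.cpu().tolist(), skip_special_tokens=True)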
HF_BatchTransform
was inspired by this article. It handles both the tokenization and numericalization traditionally split apart in the fastai text DataBlock API. For huggingface tokenizers that require a prefix space, it will be included automatically. In its current incarnation, HF_BatchTransform
can be used to tokenize multiple inputs (common in seq2seq models for tasks like summarization) and even apply different tokenizers and arguments to each.
Inputs can come in as a string or a list of tokens, the latter being for tasks like Named Entity Recognition (NER), where you want to predict the label of each token.
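For example, a huggingface tokenizer can handle both forms directly. The sketch below is plain transformers code (not blurr internals) showing a raw string and a pre-tokenized, NER-style input, with add_prefix_space set as RoBERTa-like tokenizers require (is_split_into_words is the name newer transformers versions use for pre-tokenized input):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True)

# a single raw string
enc_str = tok('I really liked this movie')

# a pre-tokenized input (e.g., for NER), where each token will get its own label
enc_toks = tok(['I', 'really', 'liked', 'this', 'movie'], is_split_into_words=True)

enc_str['input_ids'], enc_toks['input_ids']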
Note: The previous version of the library performed the tokenization/numericalization as a type transform when the raw data was read, and included a couple of batch transforms to prepare the data for collation (e.g., to be made into a mini-batch). With this update, everything is done in a single batch transform. Why? Part of the inspiration had to do with the mechanics of the huggingface tokenizer, in particular how, by default, it returns a collated mini-batch of data given a list of sequences. And where do we get a list of examples with fastai? In the batch transforms! So I thought, hey, why not do everything dynamically at batch time? And with a bit of tweaking, I got everything to work pretty well. The result is less code, faster mini-batch creation, less RAM utilization, less time spent tokenizing (which really helps with very large datasets), and more flexibility.
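As a rough illustration of that mechanic, here is plain transformers code (not blurr internals): one call on a list of texts tokenizes, numericalizes, pads, and stacks the whole mini-batch.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('distilbert-base-uncased')
samples = ['a short review', 'a slightly longer review that needs a few more tokens']

# one call returns padded, ready-to-collate tensors for the whole mini-batch
batch = tok(samples, padding=True, truncation=True, return_tensors='pt')
batch['input_ids'].shape, batch['attention_mask'].shape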
A basic wrapper that links default transforms for the data block API
HF_TextBlock
has been dramatically simplified from its predecessor. It handles setting up your HF_BatchTransform transform regardless of data source (e.g., this will work with files, DataFrames, whatever). For it to work, you must either pass in your own instance of an HF_BatchTransform class, or the huggingface architecture and tokenizer via the hf_arch and hf_tokenizer arguments (the other args are optional).
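Conceptually, the block just wires that batch-time tokenization transform into the DataBlock machinery. Here is a hedged sketch of the idea (not blurr's actual source) using fastai's TransformBlock; note how the transform ends up in before_batch, which is where the tests further down look for it:

from fastai.data.block import TransformBlock

def hf_text_block_sketch(hf_batch_tfm):
    # attach the tokenization transform as a `before_batch` transform so it sees
    # the list of raw samples for each mini-batch right before collation
    return TransformBlock(dls_kwargs={'before_batch': hf_batch_tfm})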
path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
imdb_df.head()
There are a bunch of ways we can get at the four huggingface elements we need (i.e., the architecture name, tokenizer, config, and model). We can create them directly, or we can use one of the helper methods available via BLURR_MODEL_HELPER.
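For reference, the "create them directly" route is just the standard transformers Auto classes; the helper below is what the rest of this page uses, since it also hands back the architecture name:

from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

# direct route: build the config, tokenizer, and model yourself
config = AutoConfig.from_pretrained('roberta-base')
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForSequenceClassification.from_pretrained('roberta-base', config=config)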
task = HF_TASKS_AUTO.SequenceClassification
pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
Once you have those elements, you can create your DataBlock as simply as shown below.
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter(col='is_valid'))
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]['input_ids']), b[0]['input_ids'].shape, len(b[1])
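If you want to sanity-check what actually landed in the batch, you can decode the first sequence right back into text with the tokenizer (just a quick check, not part of the DataBlock setup):

# decode the first item's ids back into (roughly) the original review text
hf_tokenizer.decode(b[0]['input_ids'][0], skip_special_tokens=True)[:100]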
Let's take a look at the actual types represented by our batch
explode_types(b)
dls.show_batch(dataloaders=dls, max_n=2)
The tests below ensure that the core DataBlock code above works for all the pretrained sequence classification models available in huggingface. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.
Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a github issue (or a PR if you'd like to fix it yourself)
BLURR_MODEL_HELPER.get_models(task='SequenceClassification')
pretrained_model_names = [
'albert-base-v1',
'facebook/bart-base',
'bert-base-uncased',
'camembert-base',
'distilbert-base-uncased',
'monologg/electra-small-finetuned-imdb',
'flaubert/flaubert_small_cased',
'allenai/longformer-base-4096',
'google/mobilebert-uncased',
'roberta-base',
'xlm-mlm-en-2048',
'xlm-roberta-base',
'xlnet-base-cased'
]
path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
#hide_output
task = HF_TASKS_AUTO.SequenceClassification
bsz = 2
seq_sz = 128
test_results = []
for model_name in pretrained_model_names:
    error = None
    print(f'=== {model_name} ===\n')

    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(model_name, task=task)
    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\n')

    blocks = (
        HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer, padding='max_length', max_length=seq_sz),
        CategoryBlock
    )

    dblock = DataBlock(blocks=blocks,
                       get_x=ColReader('text'),
                       get_y=ColReader('label'),
                       splitter=ColSplitter(col='is_valid'))

    dls = dblock.dataloaders(imdb_df, bs=bsz)
    b = dls.one_batch()

    try:
        print('*** TESTING DataLoaders ***\n')
        test_eq(len(b), 2)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, seq_sz]))
        test_eq(len(b[1]), bsz)

        if (hasattr(hf_tokenizer, 'add_prefix_space')):
            test_eq(dls.before_batch[0].tok_kwargs['add_prefix_space'], True)

        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'PASSED', ''))
        dls.show_batch(dataloaders=dls, max_n=2)

    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'FAILED', err))
The same DataBlock approach also works for multi-label classification. Below, we grab a 1% slice of the civil_comments dataset and round its toxicity scores into binary labels.
raw_data = nlp.load_dataset('civil_comments', split='train[:1%]')
len(raw_data)
toxic_df = pd.DataFrame(raw_data, columns=list(raw_data.features.keys()))
toxic_df.head()
lbl_cols = list(toxic_df.columns[1:]); lbl_cols
toxic_df = toxic_df.round({col: 0 for col in lbl_cols})
toxic_df.head()
task = HF_TASKS_AUTO.SequenceClassification
pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
n_labels = len(lbl_cols)
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name,
task=task,
config_kwargs={'num_labels': n_labels})
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), MultiCategoryBlock(encoded=True, vocab=lbl_cols))
dblock = DataBlock(blocks=blocks,
get_x=ColReader('text'),
get_y=ColReader(lbl_cols),
splitter=RandomSplitter())
dls = dblock.dataloaders(toxic_df, bs=4)
b = dls.one_batch()
len(b), b[0]['input_ids'].shape, b[1].shape
dls.show_batch(dataloaders=dls, max_n=2)