---
title: data.core
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations."
description: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations."
nb_path: "nbs/01_data-core.ipynb"
---
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
HF_TokenizerTransform was inspired by this article. It handles both the tokenization and numericalization traditionally split apart in the fastai text DataBlock API. For huggingface tokenizers that require a prefix space, it will be included automatically.
You can pass a string or a list into this Transform, the latter being for tasks that require two input sequences (e.g., question answering tasks, which require a "context" and a "question" sequence).
To make the tokenization/numericalization process more efficient, this transform has been updated to return a transformers.tokenization_utils_base.BatchEncoding dictionary with all the required transformer inputs (e.g., input_ids, attention_mask, etc.). Previously, it returned only the raw input_ids for each sequence, which were then assembled into the required inputs and padded in a before_batch transform.
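As a rough illustration of that behavior, the sketch below (not part of the original notebook) constructs the transform directly; it assumes HF_TokenizerTransform accepts the same hf_arch/hf_tokenizer keyword arguments that HF_TextBlock does further down, and that hf_arch and hf_tokenizer have already been created via BLURR_MODEL_HELPER.get_hf_objects.
# a minimal sketch, assuming hf_arch and hf_tokenizer are already defined (see below)
tok_tfm = HF_TokenizerTransform(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer, max_length=128)

# a single input sequence returns a BatchEncoding with everything the model needs
enc = tok_tfm('I really liked this movie!')
print(type(enc).__name__)   # BatchEncoding
print(list(enc.keys()))     # e.g. ['input_ids', 'attention_mask', ...]

# a tuple of two sequences covers tasks like question answering (a context and a question)
enc_pair = tok_tfm(('... the context passage ...', 'When was the movie filmed?'))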
An HF_BaseInput object is returned from the decodes method of HF_BatchTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. It encapsulates a list with one item: the input_ids for the sequence.
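To give a feel for what that enables, here is a hedged sketch (not blurr's actual implementation) of dispatching show_batch on HF_BaseInput; the signature and the assumption that hf_tokenizer is in scope are illustrative only.
# illustrative sketch: customize the typedispatched show_batch for HF_BaseInput
@typedispatch
def show_batch(x: HF_BaseInput, y, samples, dataloaders=None, ctxs=None, max_n=6, **kwargs):
    # x[0] holds the batch of input_ids; decode a few back to text and print them with their targets
    for ids, trg in zip(x[0][:max_n], y[:max_n]):
        print(hf_tokenizer.decode(ids, skip_special_tokens=True), '->', trg)
    return ctxs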
HF_TextBlock has been dramatically simplified from its predecessor. It handles setting up your HF_TokenizerTransform and HF_BatchTransform transforms regardless of data source (e.g., this will work with files, DataFrames, whatever you have).
path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
imdb_df.head()
There are a bunch of ways we can get at the four huggingface elements we need (i.e., the architecture name, tokenizer, config, and model). We can create them directly, or we can use one of the helper methods available via BLURR_MODEL_HELPER.
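"Creating them directly" just means using the standard huggingface Auto* classes; a rough sketch is below (the hf_arch string is an assumption about the architecture name blurr expects, derived here from the checkpoint name).
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

pretrained_model_name = "roberta-base"
hf_config = AutoConfig.from_pretrained(pretrained_model_name)
hf_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
hf_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=hf_config)
hf_arch = 'roberta'  # assumption: the architecture name blurr expects for this checkpoint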
task = HF_TASKS_AUTO.SequenceClassification
pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
Once you have those elements, you can create your DataBlock as simply as shown below.
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), CategoryBlock)
dblock = DataBlock(blocks=blocks,
                   get_x=ColReader('text'),
                   get_y=ColReader('label'),
                   splitter=ColSplitter(col='is_valid'))
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch(); len(b), len(b[0]['input_ids']), b[0]['input_ids'].shape, len(b[1])
Let's take a look at the actual types represented by our batch:
explode_types(b)
dls.show_batch(dataloaders=dls, max_n=2)
The tests below ensure the core DataBlock code above works for all pretrained sequence classification models available in huggingface. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.
Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).
BLURR_MODEL_HELPER.get_models(task='SequenceClassification')
pretrained_model_names = [
    'albert-base-v1',
    'facebook/bart-base',
    'bert-base-uncased',
    'camembert-base',
    'distilbert-base-uncased',
    'monologg/electra-small-finetuned-imdb',
    'flaubert/flaubert_small_cased',
    'allenai/longformer-base-4096',
    'google/mobilebert-uncased',
    'roberta-base',
    'xlm-mlm-en-2048',
    'xlm-roberta-base',
    'xlnet-base-cased'
]
path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
#hide_output
task = HF_TASKS_AUTO.SequenceClassification
bsz = 2
test_results = []
for model_name in pretrained_model_names:
    error = None
    print(f'=== {model_name} ===\n')

    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(model_name, task=task)
    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\n')

    blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer, padding='max_length', max_length=128),
              CategoryBlock)

    dblock = DataBlock(blocks=blocks,
                       get_x=ColReader('text'),
                       get_y=ColReader('label'),
                       splitter=ColSplitter(col='is_valid'))

    dls = dblock.dataloaders(imdb_df, bs=bsz)
    b = dls.one_batch()

    try:
        print('*** TESTING DataLoaders ***\n')
        test_eq(len(b), 2)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, 128]))
        test_eq(len(b[1]), bsz)

        if (hasattr(hf_tokenizer, 'add_prefix_space')):
            test_eq(dls.tfms[0].kwargs['add_prefix_space'], True)

        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'PASSED', ''))
        dls.show_batch(dataloaders=dls, max_n=2)

    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'FAILED', err))
The same DataBlock approach also handles multi-label classification. Below we pull 1% of the civil_comments dataset in via the huggingface nlp library and round its toxicity scores into 0/1 labels we can feed to a MultiCategoryBlock.
raw_data = nlp.load_dataset('civil_comments', split='train[:1%]')
len(raw_data)
toxic_df = pd.DataFrame(raw_data, columns=list(raw_data.features.keys()))
toxic_df.head()
lbl_cols = list(toxic_df.columns[1:]); lbl_cols
toxic_df = toxic_df.round({col: 0 for col in lbl_cols})  # round the float toxicity scores to 0/1 so they can serve as multi-label targets
toxic_df.head()
task = HF_TASKS_AUTO.SequenceClassification
pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
n_labels = len(lbl_cols)
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name,
                                                                                task=task,
                                                                                config_kwargs={'num_labels': n_labels})
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), MultiCategoryBlock(encoded=True, vocab=lbl_cols))
dblock = DataBlock(blocks=blocks,
                   get_x=ColReader('text'),
                   get_y=ColReader(lbl_cols),
                   splitter=RandomSplitter())
dls = dblock.dataloaders(toxic_df, bs=4)
b = dls.one_batch()
len(b), b[0]['input_ids'].shape, b[1].shape
dls.show_batch(dataloaders=dls, max_n=2)