---
title: data
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations."
description: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations."
---
```python
#cuda
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
```
The `HF_BaseInput` object is used to encapsulate all the inputs required by whatever huggingface model we are using. We use it as a container for the `input_ids`, `token_type_ids`, and `attention_mask` tensors required by most models, and also as a means to customize `@typedispatch`ed functions like `DataLoaders.show_batch` and `Learner.show_results`.
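For example, because these functions are `@typedispatch`ed, you can change how a batch is displayed simply by defining a version keyed off `HF_BaseInput`. Here's a minimal sketch of the idea (assuming blurr's imports are in scope; this is not the library's actual implementation):

```python
from fastcore.dispatch import typedispatch

# a minimal sketch, NOT blurr's actual implementation: dispatching show_batch
# on the input type lets us decode input_ids back into readable text
@typedispatch
def show_batch(x:HF_BaseInput, y, samples, ctxs=None, max_n=6, **kwargs):
    # assumes x[0] holds the input_ids tensor and that the huggingface tokenizer
    # is passed along in kwargs (as in the show_batch calls further below)
    hf_tokenizer = kwargs['hf_tokenizer']
    for input_ids in x[0][:max_n]:
        print(hf_tokenizer.decode(input_ids, skip_special_tokens=True))
```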
`HF_Tokenizer` complies with the requirements of a basic tokenization function in fastai. See here.
We've updated the `_tokenize` method to operate on either a string or a list (the latter being very handy for tasks like token classification, where each example consists of a list of tokens and a corresponding list of labels).
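Here's the gist of that behavior as a standalone sketch (not the library's exact code):

```python
# sketch: tokenize either a raw string or a pre-split list of words
def tokenize_str_or_list(txt, hf_tokenizer):
    if isinstance(txt, list):
        # pre-split input (e.g., token classification): tokenize each word and flatten
        return [sub_tok for word in txt for sub_tok in hf_tokenizer.tokenize(str(word))]
    return hf_tokenizer.tokenize(txt)
```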
`build_hf_input` uses fastai's `@typedispatch` decorator to provide complete flexibility in terms of how your numericalized tokens are assembled, what you return via `HF_BaseInput`, and what you return as your targets. You can override this implementation as needed by assigning a type to the `task` argument (and optionally the `tokenizer` argument as well). What you return here is what will be fed into your huggingface model.
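As a hedged sketch of the mechanism (the signature below is an assumption for illustration, not the published one), you'd define your own task type and dispatch on it:

```python
import torch
from fastcore.dispatch import typedispatch

class MyCustomTask: pass

# sketch only: dispatching on the task type lets you control exactly what gets
# assembled and returned; build_inputs_with_special_tokens and
# create_token_type_ids_from_sequences are standard huggingface tokenizer methods
@typedispatch
def build_hf_input(task:MyCustomTask, tokenizer, a_tok_ids, b_tok_ids=None, targets=None, **kwargs):
    input_ids = tokenizer.build_inputs_with_special_tokens(a_tok_ids, b_tok_ids)
    token_type_ids = tokenizer.create_token_type_ids_from_sequences(a_tok_ids, b_tok_ids)
    attention_mask = [1] * len(input_ids)
    # wrapping the three tensors in HF_BaseInput like this is an assumption
    # about its constructor
    return HF_BaseInput([torch.tensor(input_ids), torch.tensor(token_type_ids),
                         torch.tensor(attention_mask)]), targets
```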
Currently, we've only implemented building this block from a pandas `DataFrame`. It handles single and multiple text inputs so that it can be used out-of-the-box against any model in the huggingface arsenal (e.g., sequence classification, question answering, summarization, token classification, etc.).
```python
path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')

imdb_df = pd.read_csv(path/'texts.csv')
imdb_df.head()
```
There are a bunch of ways we can get at the four huggingface elements we need (i.e., the architecture name, tokenizer, config, and model). We can create them directly, or we can use one of the helper methods available via `BLURR_MODEL_HELPER`.
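If you'd rather create them directly, the equivalent using the `transformers` auto classes looks something like this (deriving `hf_arch` from the model's class name is just one reasonable approach):

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

pretrained_model_name = "roberta-base"

hf_config = AutoConfig.from_pretrained(pretrained_model_name)
hf_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
hf_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=hf_config)

# e.g., 'roberta' from 'RobertaForSequenceClassification'
hf_arch = hf_model.__class__.__name__.split('For')[0].lower()
```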
```python
task = HF_TASKS_AUTO.ForSequenceClassification

pretrained_model_name = "roberta-base"  # "distilbert-base-uncased", "bert-base-uncased"
config = AutoConfig.from_pretrained(pretrained_model_name)

hf_arch, hf_tokenizer, hf_config, hf_model = BLURR_MODEL_HELPER.get_auto_hf_objects(pretrained_model_name,
                                                                                    task=task,
                                                                                    config=config)
```
Once you have those elements, you can create your `DataBlock` as simply as shown below. Note that you can use multiple columns in your `DataFrame` to make up the single text element required by `HF_TextBlock`.
```python
# single input
blocks = (
    HF_TextBlock.from_df(text_cols_lists=[['text']], hf_arch=hf_arch, hf_tokenizer=hf_tokenizer),
    CategoryBlock
)

dblock = DataBlock(blocks=blocks,
                   get_x=lambda x: x.text0,
                   get_y=ColReader('label'),
                   splitter=ColSplitter(col='is_valid'))

# dblock.summary(imdb_df)
dls = dblock.dataloaders(imdb_df, bs=4)

b = dls.one_batch(); len(b), len(b[0]), len(b[1])
b[0][0].shape, b[0][1].shape, b[0][2].shape, b[1].shape

dls.show_batch(hf_tokenizer=hf_tokenizer, max_n=2)
```
We've provided a simple subset of a pre-processed SQUADv2 dataset below just for demonstration purposes. There is a lot that can be done to make this much better and more fully functional. The idea here is just to show you how things can work for tasks beyond sequence classification.
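For reference, "pre-processed" here means the character-level SQuAD answer spans have already been converted into token-level `answer_start`/`answer_end` positions. A rough sketch of that kind of conversion (the `answer_text` and character-offset arguments are assumptions for illustration):

```python
# sketch: map a character-level answer span to token positions over question + context
def token_span(question, context, answer_text, answer_char_start, hf_tokenizer, n_special=2):
    pre_answer = context[:answer_char_start]
    # n_special accounts for the special tokens inserted before/between the two sequences
    start = len(hf_tokenizer.tokenize(question)) + len(hf_tokenizer.tokenize(pre_answer)) + n_special
    end = start + len(hf_tokenizer.tokenize(answer_text))
    return start, end
```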
```python
path = Path('./')

squad_df = pd.read_csv(path/'squad_sample.csv'); len(squad_df)
squad_df.head(2)

max_seq_len = 512
squad_df = squad_df[(squad_df.answer_end < max_seq_len) & (squad_df.is_impossible == False)]
```
```python
task = HF_TASKS_AUTO.ForQuestionAnswering

pretrained_model_name = "roberta-base"
config = AutoConfig.from_pretrained(pretrained_model_name)

hf_arch, hf_tokenizer, hf_config, hf_model = BLURR_MODEL_HELPER.get_auto_hf_objects(pretrained_model_name,
                                                                                    task=task,
                                                                                    config=config)

# the start/end targets are token positions, so the "vocab" is simply 0 ... max_seq_len-1
vocab = dict(enumerate(range(max_seq_len)))
```
Below we utilize the `@typedispatch` decorator to completely change how we'll tokenize the data for the `ForQuestionAnsweringTask`.
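A hedged sketch of what such an override could look like, keeping the question intact and truncating only the context so the pair fits within `max_seq_len` (the signature and return value are assumptions, not blurr's actual code):

```python
import torch
from fastcore.dispatch import typedispatch

# sketch only: a QA-specific build_hf_input that truncates just the context;
# num_special_tokens_to_add is a standard huggingface tokenizer method
@typedispatch
def build_hf_input(task:ForQuestionAnsweringTask, tokenizer, a_tok_ids, b_tok_ids=None,
                   targets=None, max_seq_len=512, **kwargs):
    room = max_seq_len - len(a_tok_ids) - tokenizer.num_special_tokens_to_add(pair=True)
    input_ids = tokenizer.build_inputs_with_special_tokens(a_tok_ids, b_tok_ids[:room])
    token_type_ids = [0] * len(input_ids)
    attention_mask = [1] * len(input_ids)
    # the HF_BaseInput construction and passing the (start, end) targets
    # through unchanged are assumptions
    return HF_BaseInput([torch.tensor(input_ids), torch.tensor(token_type_ids),
                         torch.tensor(attention_mask)]), targets
```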
And here we demonstrate some more of the extensibility bits of the framework by passing in our own instance of `HF_BatchTransform`.
```python
# (optional): override HF_BatchTransform defaults
hf_batch_tfm = HF_BatchTransform(hf_arch, hf_tokenizer, task=ForQuestionAnsweringTask(),
                                 max_seq_len=128, truncation_strategy=None)

blocks = (
    HF_TextBlock.from_df(text_cols_lists=[['question_text'], ['context']],
                         hf_arch=hf_arch,
                         hf_tokenizer=hf_tokenizer,
                         hf_batch_tfm=hf_batch_tfm),
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab)
)

dblock = DataBlock(blocks=blocks,
                   get_x=lambda x: (x.text0, x.text1),
                   get_y=[ColReader('answer_start'), ColReader('answer_end')],
                   splitter=RandomSplitter(),
                   n_inp=1)

# dblock.summary(squad_df)
dls = dblock.dataloaders(squad_df, bs=4)

b = dls.one_batch(); len(b), len(b[0]), len(b[1]), len(b[2])
b[0][0].shape, b[0][1].shape, b[0][2].shape, b[1].shape, b[2].shape

dls.show_batch(hf_tokenizer=hf_tokenizer, skip_special_tokens=False, max_n=2)
```
```python
# germ_eval_df = pd.read_csv('./data/task-token-classification/germeval2014ner/germeval2014ner_cleaned.csv')
germ_eval_df = pd.read_csv('./germeval2014_sample.csv')
germ_eval_df.head()
```
Note: `n_tokens` represents the number of sub-word tokens the `BertTokenizer` uses to tokenize the `token`. If you use a different tokenizer, you'll need to re-calculate this column, assuming you use the same approach as I do here.
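For example, it could be re-calculated for whatever tokenizer you end up using like so (a one-liner sketch, assuming one word per row as in the raw file and that `hf_tokenizer` is the tokenizer created a few cells below):

```python
# sketch: re-compute the sub-word token count per row for the current tokenizer
germ_eval_df['n_tokens'] = germ_eval_df['token'].apply(lambda t: len(hf_tokenizer.tokenize(str(t))))
```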
```python
germ_eval_df.dropna(inplace=True)
germ_eval_df[germ_eval_df.token.isna()]

labels = sorted(germ_eval_df.tag1.unique())
print(labels)
```
```python
task = HF_TASKS_AUTO.ForTokenClassification

pretrained_model_name = "bert-base-multilingual-cased"
config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = len(labels)

hf_arch, hf_tokenizer, hf_config, hf_model = BLURR_MODEL_HELPER.get_auto_hf_objects(pretrained_model_name,
                                                                                    task=task,
                                                                                    config=config)

hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)

germ_eval_df = germ_eval_df.groupby(by='seq_id').agg(list).reset_index()
germ_eval_df.head()
```
`HF_TokenCategorize` modifies the fastai `Categorize` transform in a couple of ways. First, it allows your targets to consist of a `Category` per token, and second, it uses the idea of an `ignore_token` to mask sub-tokens that don't need a prediction. For example, the targets of special tokens (e.g., pad, cls, sep) are set to `ignore_token`, as are any sub-tokens after the first whenever a single token is split into multiple sub-tokens.
We need a custom `build_hf_input` because we need to align the target tokens with the input tokens (e.g., if there are 512 input tokens, there need to be 512 targets).
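The alignment idea itself, as a standalone sketch (not blurr's actual code):

```python
# sketch: expand one label per word into one label per sub-token, masking
# everything past the first sub-token with the ignore token
def align_targets(words, labels, hf_tokenizer, ignore_token='xIGNx'):
    aligned = []
    for word, label in zip(words, labels):
        n = len(hf_tokenizer.tokenize(str(word)))
        # first sub-token keeps the real label, the rest are ignored
        aligned += [label] + [ignore_token] * (n - 1)
    return aligned
```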
```python
# single input
blocks = (
    HF_TextBlock.from_df(text_cols_lists=[['token']],
                         hf_arch=hf_arch,
                         hf_tokenizer=hf_tokenizer,
                         tok_func_mode='list',
                         task=ForTokenClassificationTask()),
    HF_TokenCategoryBlock(vocab=labels)
)

def get_y(inp):
    return [(label, len(hf_tokenizer.tokenize(str(entity)))) for entity, label in zip(inp.token, inp.tag1)]

dblock = DataBlock(blocks=blocks,
                   get_x=lambda x: x.text0,
                   get_y=get_y,
                   splitter=RandomSplitter())
```
Note that in the example above we had to define a `get_y` in order to return both the entity we want to predict a category for and the number of sub-tokens the `hf_tokenizer` uses to represent it. This is necessary for the input/target alignment discussed above.
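For instance, for a hypothetical grouped row with two tokens, each of which the tokenizer happens to keep whole:

```python
# hypothetical illustration of what get_y returns for one row
row = pd.Series({'token': ['Schartau', 'sagte'], 'tag1': ['B-PER', 'O']})
get_y(row)  # -> [('B-PER', 1), ('O', 1)] if each word maps to a single sub-token
```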
```python
# dblock.summary(germ_eval_df)
dls = dblock.dataloaders(germ_eval_df, bs=4)

b = dls.one_batch()
len(b), b[0][0].shape, b[1].shape

dls.show_batch(hf_tokenizer=hf_tokenizer, max_n=2)
```