--- title: data keywords: fastai sidebar: home_sidebar summary: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations." description: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations." ---
{% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}
#cuda
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti
{% endraw %}

Base tokenization, batch transform, and DataBlock methods

{% raw %}
{% endraw %} {% raw %}

class HF_BaseInput[source]

HF_BaseInput(iterable=()) :: list

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

{% endraw %}

The HF_BaseInput object is used to encapsulate all the inputs required by whatever huggingface model we are using. We use it as a container for the input_ids, token_type_ids, and attention_mask tensors required by most models, and also as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results.
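For instance, you could wrap a tokenizer's encoded tensors in an HF_BaseInput yourself. This is a hedged sketch: encode_plus is standard huggingface API, but the ordering of the tensors is assumed from the batch shapes shown later in this doc, and it presumes an hf_tokenizer like the ones created below.

{% raw %}
# a hedged sketch (tensor ordering assumed): wrap the encoded tensors so the
# @typedispatched show_batch/show_results methods can recognize them
enc = hf_tokenizer.encode_plus('blurr = fastai + huggingface', return_tensors='pt')
hf_input = HF_BaseInput([enc['input_ids'],
                         enc.get('token_type_ids', torch.tensor([[0]])),
                         enc['attention_mask']])
{% endraw %}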

{% raw %}
{% endraw %} {% raw %}

class HF_Tokenizer[source]

HF_Tokenizer(hf_arch, hf_tokenizer, mode='str', list_split_func='split', **kwargs)

huggingface friendly tokenization function.

{% endraw %}

HF_Tokenizer complies with the requirements of a basic tokenization function in fastai. See here.

We've updated the _tokenize method to operate on either a string or a list (the latter being very handy for tasks like token classification, where each example consists of a list of tokens and a corresponding list of labels).
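For example (a hedged sketch; mode='str' is the default per the signature above, and mode='list' is assumed from the token classification example further below):

{% raw %}
# 'str' mode tokenizes raw strings; 'list' mode tokenizes pre-split token lists
tok_func_str = HF_Tokenizer(hf_arch, hf_tokenizer, mode='str')
tok_func_list = HF_Tokenizer(hf_arch, hf_tokenizer, mode='list')
{% endraw %}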

{% raw %}
{% endraw %}

build_hf_input uses fastai's @typedispatched decorator to provide for complete flexibility in terms of how your numericalized tokens are assembled, and also what you return via HF_BaseInput and as your targets. You can override this implementation as needed by assigning a type to the task argument (and optionally the tokenizer argument as well).

What you return here is what will be fed into your huggingface model.
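A hedged sketch of such an override appears below. ForMyCustomTask is a hypothetical task type you'd define yourself, the signature is inferred from the arguments described above, and prepare_for_model is standard huggingface API; this is not the library's exact implementation.

{% raw %}
class ForMyCustomTask: pass  # hypothetical task type you define yourself

@typedispatch
def build_hf_input(task:ForMyCustomTask, tokenizer, a_tok_ids, b_tok_ids=None, targets=None,
                   max_seq_len=512, truncation_strategy='longest_first'):
    # assemble the numericalized tokens into model-ready tensors
    res = tokenizer.prepare_for_model(a_tok_ids, b_tok_ids, max_length=max_seq_len,
                                      pad_to_max_length=True,
                                      truncation_strategy=truncation_strategy,
                                      return_tensors='pt')
    input_ids = res['input_ids'][0]
    attention_mask = res['attention_mask'][0]
    token_type_ids = res.get('token_type_ids', torch.tensor([[0]]))[0]
    # whatever is returned here is what gets fed to your huggingface model
    return HF_BaseInput([input_ids, token_type_ids, attention_mask]), targets
{% endraw %}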

{% raw %}
{% endraw %} {% raw %}

class HF_BatchTransform[source]

HF_BatchTransform(hf_arch, hf_tokenizer, max_seq_len=512, truncation_strategy='longest_first', task=None) :: Transform

Handles everything you need to assemble a mini-batch of inputs and targets

{% endraw %} {% raw %}
{% endraw %} {% raw %}

class HF_TextBlock[source]

HF_TextBlock(tok_tfms, hf_arch, hf_tokenizer, hf_batch_tfm=None, vocab=None, task=None, max_seq_len=512, min_freq=3, max_vocab=60000, special_toks=None) :: TransformBlock

A basic wrapper that links default transforms for the data block API

{% endraw %} {% raw %}

HF_TextBlock.from_df[source]

HF_TextBlock.from_df(text_cols_lists, hf_arch, hf_tokenizer, hf_batch_tfm=None, vocab=None, task=None, tok_func_mode='str', res_col_names=None, max_seq_len=512, tok_func='SpacyTokenizer', rules=None, sep=' ', n_workers=16, mark_fields=None, res_col_name='text', **kwargs)

Creates a HF_TextBlock via a pandas DataFrame

{% endraw %}

Currently, we've only implemented building this block from a pandas DataFrame. It handles single and multiple text inputs so that it can be used out-of-the-box against any model in the huggingface arsenal (e.g., sequence classification, question answering, summarization, token classification, etc.).

Examples

Sequence classification (e.g., models that require a single text input)

{% raw %}
path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
{% endraw %} {% raw %}
imdb_df.head()
label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! False
1 positive This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... False
2 negative Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... False
3 positive Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... False
4 negative This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... False
{% endraw %}

There are a bunch of ways we can get at the four huggingface elements we need (i.e., the architecture name, tokenizer, config, and model). We can create them directly, or we can use one of the helper methods available via BLURR_MODEL_HELPER.
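For example, creating them directly with the standard huggingface auto classes:

{% raw %}
# creating the four elements directly, without BLURR_MODEL_HELPER
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

hf_arch = 'roberta'
hf_config = AutoConfig.from_pretrained('roberta-base')
hf_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
hf_model = AutoModelForSequenceClassification.from_pretrained('roberta-base', config=hf_config)
{% endraw %}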

{% raw %}
task = HF_TASKS_AUTO.ForSequenceClassification

pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
config = AutoConfig.from_pretrained(pretrained_model_name)

hf_arch, hf_tokenizer, hf_config, hf_model = BLURR_MODEL_HELPER.get_auto_hf_objects(pretrained_model_name, 
                                                                                    task=task, 
                                                                                    config=config)
{% endraw %}

Once you have those elements, you can create your DataBlock as simply as shown below. Note that you can use multiple columns in your DataFrame to make up the single text element required by HF_TextBlock.
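For instance, a multi-column variant might look like the following hedged sketch (the column names are hypothetical); the actual single-column IMDB example follows.

{% raw %}
# a hedged sketch (column names hypothetical): both columns are joined to form
# the single text input the model sees
blocks = (
    HF_TextBlock.from_df(text_cols_lists=[['headline', 'body']],
                         hf_arch=hf_arch, hf_tokenizer=hf_tokenizer),
    CategoryBlock
)
{% endraw %}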

{% raw %}
# single input
blocks = (
    HF_TextBlock.from_df(text_cols_lists=[['text']], hf_arch=hf_arch, hf_tokenizer=hf_tokenizer),
    CategoryBlock
)

dblock = DataBlock(blocks=blocks, 
                   get_x=lambda x: x.text0,
                   get_y=ColReader('label'), 
                   splitter=ColSplitter(col='is_valid'))
{% endraw %} {% raw %}
# dblock.summary(imdb_df)
{% endraw %} {% raw %}
dls = dblock.dataloaders(imdb_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch(); len(b), len(b[0]), len(b[1]) 
(2, 3, 4)
{% endraw %} {% raw %}
b[0][0].shape, b[0][1].shape, b[0][2].shape, b[1].shape
(torch.Size([4, 512]),
 torch.Size([4, 1]),
 torch.Size([4, 512]),
 torch.Size([4]))
{% endraw %} {% raw %}
{% endraw %} {% raw %}
dls.show_batch(hf_tokenizer=hf_tokenizer, max_n=2)
text category
0 Raising Victor Vargas: A Review<br /><br />You know, Raising Victor Vargas is like sticking your hands into a big, steaming bowl of oatmeal. It's warm and gooey, but you're not sure if it feels right. Try as I might, no matter how warm and gooey Raising Victor Vargas became I was always aware that something didn't quite feel right. Victor Vargas suffers from a certain overconfidence on the director's part. Apparently, the director thought that the ethnic backdrop of a Latino family on the lower east side, and an idyllic storyline would make the film critic proof. He was right, but it didn't fool me. Raising Victor Vargas is the story about a seventeen-year old boy called, you guessed it, Victor Vargas (Victor Rasuk) who lives his teenage years chasing more skirt than the Rolling Stones could do in all the years they've toured. The movie starts off in `Ugly Fat' Donna's bedroom where Victor is sure to seduce her, but a cry from outside disrupts his plans when his best-friend Harold (Kevin Rivera) comes-a-looking for him. Caught in the attempt by Harold and his sister, Victor Vargas runs off for damage control. Yet even with the embarrassing implication that he's been boffing the homeliest girl in the neighborhood, nothing dissuades young Victor from going off on the hunt for more fresh meat. On a hot, New York City day they make way to the local public swimming pool where Victor's eyes catch a glimpse of the lovely young nymph Judy (Judy Marte), who's not just pretty, but a strong and independent too. The relationship that develops between Victor and Judy becomes the focus of the film. The story also focuses on Victor's family that is comprised of his grandmother or abuelita (Altagracia Guzman), his brother Nino (also played by real life brother to Victor, Silvestre Rasuk) and his sister Vicky (Krystal Rodriguez). The action follows Victor between scenes with Judy and scenes with his family. Victor tries to cope with being an oversexed pimp-daddy, his feelings for Judy and his grandmother's conservative Catholic upbringing.<br /><br />The problems that arise from Raising Victor Vargas are a few, but glaring errors. Throughout the film you get to know certain characters like Vicky, Nino, Grandma, negative
1 THE SHOP AROUND THE CORNER is one of the sweetest and most feel-good romantic comedies ever made. There's just no getting around that, and it's hard to actually put one's feeling for this film into words. It's not one of those films that tries too hard, nor does it come up with the oddest possible scenarios to get the two protagonists together in the end. In fact, all its charm is innate, contained within the characters and the setting and the plot... which is highly believable to boot. It's easy to think that such a love story, as beautiful as any other ever told, *could* happen to you... a feeling you don't often get from other romantic comedies, however sweet and heart-warming they may be. <br /><br />Alfred Kralik (James Stewart) and Clara Novak (Margaret Sullavan) don't have the most auspicious of first meetings when she arrives in the shop (Matuschek & Co.) he's been working in for the past nine years, asking for a job. They clash from the very beginning, mostly over a cigarette box that plays music when it's opened--he thinks it's a ludicrous idea; she makes one big sell of it and gets hired. Their bickering takes them through the next six months, even as they both (unconsciously, of course!) fall in love with each other when they share their souls and minds in letters passed through PO Box 237. This would be a pretty thin plotline to base an entire film on, except that THE SHOP AROUND THE CORNER is expertly fleshed-out with a brilliant supporting cast made up of entirely engaging characters, from the fatherly but lonely Hugo Matuschek (Frank Morgan) himself, who learns that his shop really is his home; Pirovitch (Felix Bressart), Kralik's sidekick and friend who always skitters out of the room when faced with the possibility of being asked for his honest opinion; smarmy pimp-du-jour Vadas (Joseph Schildkraut) who ultimately gets his comeuppance from a gloriously righteous Kralik; and ambitious errand boy Pepi Katona (William Tracy) who wants nothing more than to be promoted to the position of clerk for Matuschek & Co. The unpretentious love story between 'Dear Friends' is played out in this little shop in positive
{% endraw %}

Question Answering (e.g., models that require two text inputs)

We've provided a small sample of a pre-processed SQuAD v2 dataset below just for demonstration purposes. There is a lot that could be done to make this better and more fully functional; the idea here is simply to show how things can work for tasks beyond sequence classification.

{% raw %}
path = Path('./')
squad_df = pd.read_csv(path/'squad_sample.csv'); len(squad_df)
1000
{% endraw %} {% raw %}
squad_df.head(2)
title context question_id question_text is_impossible answer_text answer_start answer_end
0 New_York_City The New York City Fire Department (FDNY), provides fire protection, technical rescue, primary response to biological, chemical, and radioactive hazards, and emergency medical services for the five boroughs of New York City. The New York City Fire Department is the largest municipal fire department in the United States and the second largest in the world after the Tokyo Fire Department. The FDNY employs approximately 11,080 uniformed firefighters and over 3,300 uniformed EMTs and paramedics. The FDNY's motto is New York's Bravest. 56d1076317492d1400aab78c What does FDNY stand for? False New York City Fire Department 4 33
1 Cyprus Following the death in 1473 of James II, the last Lusignan king, the Republic of Venice assumed control of the island, while the late king's Venetian widow, Queen Catherine Cornaro, reigned as figurehead. Venice formally annexed the Kingdom of Cyprus in 1489, following the abdication of Catherine. The Venetians fortified Nicosia by building the Venetian Walls, and used it as an important commercial hub. Throughout Venetian rule, the Ottoman Empire frequently raided Cyprus. In 1539 the Ottomans destroyed Limassol and so fearing the worst, the Venetians also fortified Famagusta and Kyrenia. 572e7f8003f98919007566df In what year did the Ottomans destroy Limassol? False 1539 481 485
{% endraw %} {% raw %}
max_seq_len = 512
{% endraw %} {% raw %}
squad_df = squad_df[(squad_df.answer_end < max_seq_len) & (squad_df.is_impossible == False)]
{% endraw %} {% raw %}
task = HF_TASKS_AUTO.ForQuestionAnswering

pretrained_model_name = "roberta-base"
config = AutoConfig.from_pretrained(pretrained_model_name)

hf_arch, hf_tokenizer, hf_config, hf_model = BLURR_MODEL_HELPER.get_auto_hf_objects(pretrained_model_name, 
                                                                                    task=task, 
                                                                                    config=config)
{% endraw %} {% raw %}
vocab = dict(enumerate(range(max_seq_len)));  # the targets are answer start/end positions, so the 'category' vocab is just the integers 0 ... max_seq_len-1
{% endraw %}

Below we utilize the @typedispatch decorator to completely change how we'll tokenize the data for the ForQuestionAnsweringTask.

{% raw %}
{% endraw %}
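The implementation itself is hidden here, but a hedged sketch of the idea (not the library's exact code) might look like the generic build_hf_input override shown earlier, dispatched on ForQuestionAnsweringTask and truncating only the context so the question is never cut:

{% raw %}
# a hedged sketch (not the library's exact code)
@typedispatch
def build_hf_input(task:ForQuestionAnsweringTask, tokenizer, a_tok_ids, b_tok_ids=None, targets=None,
                   max_seq_len=512, truncation_strategy='only_second'):
    # 'only_second' truncates just the context (the second sequence), keeping
    # the question intact so the answer-position targets stay meaningful
    res = tokenizer.prepare_for_model(a_tok_ids, b_tok_ids, max_length=max_seq_len,
                                      pad_to_max_length=True,
                                      truncation_strategy=truncation_strategy,
                                      return_tensors='pt')
    return HF_BaseInput([res['input_ids'][0],
                         res.get('token_type_ids', torch.tensor([[0]]))[0],
                         res['attention_mask'][0]]), targets
{% endraw %}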

And here we demonstrate more of the extensibility bits of the framework by passing in our own instance of HF_BatchTransform.

{% raw %}
# (optional): override HF_BatchTransform defaults
hf_batch_tfm = HF_BatchTransform(hf_arch, hf_tokenizer, task=ForQuestionAnsweringTask(),
                                 max_seq_len=128, truncation_strategy=None)

blocks = (
    HF_TextBlock.from_df(text_cols_lists=[['question_text'],['context']],
                         hf_arch=hf_arch, 
                         hf_tokenizer=hf_tokenizer, 
                         hf_batch_tfm=hf_batch_tfm),
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab)
)

dblock = DataBlock(blocks=blocks, 
                   get_x=lambda x: (x.text0, x.text1),
                   get_y=[ColReader('answer_start'), ColReader('answer_end')],
                   splitter=RandomSplitter(),
                   n_inp=1)
{% endraw %} {% raw %}
# dblock.summary(squad_df)
{% endraw %} {% raw %}
dls = dblock.dataloaders(squad_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch(); len(b), len(b[0]), len(b[1]), len(b[2])
(3, 3, 4, 4)
{% endraw %} {% raw %}
b[0][0].shape, b[0][1].shape, b[0][2].shape, b[1].shape, b[2].shape
(torch.Size([4, 128]),
 torch.Size([4, 1]),
 torch.Size([4, 128]),
 torch.Size([4]),
 torch.Size([4]))
{% endraw %} {% raw %}
dls.show_batch(hf_tokenizer=hf_tokenizer, skip_special_tokens=False, max_n=2)
text category category_
0 <s> Which type of decoder must support bitrate switching on a per-frame basis?</s></s> A more sophisticated MP3 encoder can produce variable bitrate audio. MPEG audio may use bitrate switching on a per-frame basis, but only layer III decoders must support it. VBR is used when the goal is to achieve a fixed level of quality. The final file size of a VBR encoding is less predictable than with constant bitrate. Average bitrate is VBR implemented as a compromise between the two: the bitrate is allowed to vary for more consistent quality, but is controlled to remain near an average</s> 137 146
1 <s> What is the most common bird found in the Alps?</s></s> Many rodents such as voles live underground. Marmots live almost exclusively above the tree line as high as 2,700 m (8,858 ft). They hibernate in large groups to provide warmth, and can be found in all areas of the Alps, in large colonies they build beneath the alpine pastures. Golden eagles and bearded vultures are the largest birds to be found in the Alps; they nest high on rocky ledges and can be found at altitudes of 2,400 m (7,874 ft). The</s> 469 486
{% endraw %}

Token classification (e.g., NER tasks)

{% raw %}
# germ_eval_df = pd.read_csv('./data/task-token-classification/germeval2014ner/germeval2014ner_cleaned.csv')
germ_eval_df = pd.read_csv('./germeval2014_sample.csv')
germ_eval_df.head()
pos token tag1 tag2 ds_type seq_id n_tokens
0 1 Schartau B-PER O train 1 3
1 2 sagte O O train 1 1
2 3 dem O O train 1 1
3 4 " O O train 1 1
4 5 Tagesspiegel B-ORG O train 1 3
{% endraw %}

Note: n_tokens represents the number of sub-word tokens the BertTokenizer uses to tokenize the token. If you use a different tokenizer, you'll need to re-calculate this column, assuming you use the same approach as I do here.
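For example, with any huggingface tokenizer (a minimal sketch using the hf_tokenizer created below):

{% raw %}
# re-calculate n_tokens as the number of sub-word tokens produced per token
germ_eval_df['n_tokens'] = germ_eval_df['token'].apply(lambda t: len(hf_tokenizer.tokenize(str(t))))
{% endraw %}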

{% raw %}
germ_eval_df.dropna(inplace=True)
germ_eval_df[germ_eval_df.token.isna()]
pos token tag1 tag2 ds_type seq_id n_tokens
{% endraw %} {% raw %}
labels = sorted(germ_eval_df.tag1.unique())
print(labels)
['B-LOC', 'B-LOCderiv', 'B-LOCpart', 'B-ORG', 'B-ORGpart', 'B-OTH', 'B-PER', 'B-PERpart', 'I-LOC', 'I-LOCderiv', 'I-ORG', 'I-ORGpart', 'I-OTH', 'I-PER', 'O']
{% endraw %} {% raw %}
task = HF_TASKS_AUTO.ForTokenClassification

pretrained_model_name = "bert-base-multilingual-cased"
config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = len(labels)

hf_arch, hf_tokenizer, hf_config, hf_model = BLURR_MODEL_HELPER.get_auto_hf_objects(pretrained_model_name, 
                                                                                    task=task, 
                                                                                    config=config)
hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)
('bert',
 transformers.tokenization_bert.BertTokenizer,
 transformers.configuration_bert.BertConfig,
 transformers.modeling_bert.BertForTokenClassification)
{% endraw %} {% raw %}
germ_eval_df = germ_eval_df.groupby(by='seq_id').agg(list).reset_index()
germ_eval_df.head()
seq_id pos token tag1 tag2 ds_type n_tokens
0 1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25] [Schartau, sagte, dem, ", Tagesspiegel, ", vom, Freitag, ,, Fischer, sei, ", in, einer, Weise, aufgetreten, ,, die, alles, andere, als, überzeugend, war, ", .] [B-PER, O, O, O, B-ORG, O, O, O, O, B-PER, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] [train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train] [3, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 1, 1]
1 2 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] [Firmengründer, Wolf, Peter, Bree, arbeitete, Anfang, der, siebziger, Jahre, als, Möbelvertreter, ,, als, er, einen, fliegenden, Händler, aus, dem, Libanon, traf, .] [O, B-PER, I-PER, I-PER, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-LOC, O, O] [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] [train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train] [3, 1, 1, 2, 1, 1, 1, 3, 1, 1, 4, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1]
2 3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] [Ob, sie, dabei, nach, dem, Runden, Tisch, am, 23., April, in, Berlin, durch, ein, pädagogisches, Konzept, unterstützt, wird, ,, ist, allerdings, zu, bezweifeln, .] [O, O, O, O, O, O, O, O, O, O, O, B-LOC, O, O, O, O, O, O, O, O, O, O, O, O] [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] [train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train, train] [1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 3, 1]
3 4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] [Bayern, München, ist, wieder, alleiniger, Top-, Favorit, auf, den, Gewinn, der, deutschen, Fußball-Meisterschaft, .] [B-ORG, I-ORG, O, O, O, O, O, O, O, O, O, B-LOCderiv, O, O] [B-LOC, B-LOC, O, O, O, O, O, O, O, O, O, O, O, O] [train, train, train, train, train, train, train, train, train, train, train, train, train, train] [1, 1, 1, 1, 2, 2, 3, 1, 1, 1, 1, 1, 3, 1]
4 5 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] [Dabei, hätte, der, tapfere, Schlussmann, allen, Grund, gehabt, ,, sich, viel, früher, aufzuregen, .] [O, O, O, O, O, O, O, O, O, O, O, O, O, O] [O, O, O, O, O, O, O, O, O, O, O, O, O, O] [train, train, train, train, train, train, train, train, train, train, train, train, train, train] [1, 1, 1, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1]
{% endraw %} {% raw %}
{% endraw %} {% raw %}

class HF_TokenTensorCategory[source]

HF_TokenTensorCategory(x, **kwargs) :: TensorBase

{% endraw %} {% raw %}
{% endraw %} {% raw %}

class HF_TokenCategorize[source]

HF_TokenCategorize(vocab=None, ignore_token=None, ignore_token_id=None) :: Transform

Reversible transform of a list of category strings to vocab ids

{% endraw %}

HF_TokenCategorize modifies the fastai Categorize transform in a couple of ways. First, it allows your targets to consist of a category per token, and second, it uses an ignore_token to mask sub-tokens that don't need a prediction. For example, the targets of special tokens (e.g., pad, cls, sep) are set to ignore_token, as are all but the first sub-token of any token that the tokenizer splits into more than one.
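As a hedged illustration (the (label, n_subtokens) target form matches the get_y defined further below):

{% raw %}
tok_cat = HF_TokenCategorize(vocab=labels)
# conceptually, a target like [('B-PER', 3), ('O', 1)] -- 'Schartau' split into
# 3 sub-tokens followed by 'sagte' -- encodes to [B-PER, ignore, ignore, O]
{% endraw %}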

{% raw %}
{% endraw %} {% raw %}

HF_TokenCategoryBlock[source]

HF_TokenCategoryBlock(vocab=None, ignore_token=None, ignore_token_id=None)

TransformBlock for single-label categorical targets

{% endraw %} {% raw %}
{% endraw %}

We need a custom build_hf_input because we need to align the target tokens with the input tokens (e.g., if there are 512 input tokens, there need to be 512 targets).
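The custom implementation is hidden here, but the core of the alignment might be sketched as follows (hedged; the ignore id value is an assumption, and the real transform also masks special tokens like cls/sep/pad):

{% raw %}
# a hedged sketch of the alignment idea (not the library's exact code)
def align_token_targets(targets, max_seq_len=512, ignore_id=-100):
    ids = []
    for label_id, n_subtoks in targets:
        # only the first sub-token keeps the label; the rest are masked
        ids += [label_id] + [ignore_id] * (n_subtoks - 1)
    ids += [ignore_id] * (max_seq_len - len(ids))  # pad out to the input length
    return ids[:max_seq_len]
{% endraw %}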

{% raw %}
# single input
blocks = (
    HF_TextBlock.from_df(text_cols_lists=[['token']], 
                         hf_arch=hf_arch, 
                         hf_tokenizer=hf_tokenizer, 
                         tok_func_mode='list', 
                         task=ForTokenClassificationTask()),
    HF_TokenCategoryBlock(vocab=labels)
)

def get_y(inp):
    return [ (label, len(hf_tokenizer.tokenize(str(entity)))) for entity, label in zip(inp.token, inp.tag1) ]

dblock = DataBlock(blocks=blocks, 
                   get_x=lambda x: x.text0,
                   get_y=get_y,
                   splitter=RandomSplitter())
{% endraw %}

Note that in the example above we had to define a get_y in order to return both the entity we want to predict a category for and the number of sub-tokens the hf_tokenizer uses to represent it. This is necessary for the input/target alignment discussed above.

{% raw %}
# dblock.summary(test_df)
{% endraw %} {% raw %}
dls = dblock.dataloaders(germ_eval_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch()
{% endraw %} {% raw %}
len(b), b[0][0].shape, b[1].shape
(2, torch.Size([4, 512]), torch.Size([4, 512]))
{% endraw %} {% raw %}
dls.show_batch(hf_tokenizer=hf_tokenizer, max_n=2)
text category
0 Das SS - Freiwilligen - Grenadier - Regiment 88 wurde im März aus einer Kampfgruppe der SS - Führerschule des Wirtschafts - und Verwaltungsdienstes und Teilen des I. / SS - Polizei - Regiments 34 der Ordnungspolizei, Heeresangehörigen und Volkssturm gebildet. ['O', 'B-ORGpart', 'I-ORGpart', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORGpart', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORGpart', 'I-ORGpart', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
1 Scenes of a Sexual Nature ( GB 2006 ) - Regie : Ed Blum Shortbus ( USA 2006 ) - Regie : John Cameron Mitchell : Film über den gleichnamigen New Yorker Club, der verschiedensten Paaren eine Plattform zur Aufarbeitung ihrer Probleme bietet. ['B-OTH', 'I-OTH', 'I-OTH', 'I-OTH', 'I-OTH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'B-OTH', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-LOCderiv', 'I-LOCderiv', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
{% endraw %}

Cleanup