---
title: data.core
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations."
description: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations."
nb_path: "nbs/01_data-core.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti
{% endraw %}

Base tokenization, batch transform, and DataBlock methods

{% raw %}
{% endraw %} {% raw %}

class HF_TokenizerTransform[source]

HF_TokenizerTransform(hf_arch, hf_tokenizer, max_length=None, padding=True, truncation=True, is_pretokenized=False, **kwargs) :: ItemTransform

huggingface friendly tokenization transform.

{% endraw %}

HF_TokenizerTransform was inspired by this article. It handles both the tokenization and numericalization traditionally split apart in the fastai text DataBlock API. For huggingface tokenizers that require a prefix space, it will be included automatically.

You can pass a string or a list into this Transform, the latter being for tasks that require two input sequences (e.g., question answering tasks require a "context" and a "question" sequence).

In order to make the tokenization/numericalization process more efficient, this transform has been updated to return a transformers.tokenization_utils_base.BatchEncoding dictionary with all the required transformer inputs (e.g., input_ids, attention_mask, etc.). Previously, it returned only the raw input_ids for each sequence, which were then assembled into the required inputs and padded in a before_batch transform.
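
For example, here is a minimal sketch of using the transform on its own (it assumes hf_arch and hf_tokenizer have already been created, e.g. via BLURR_MODEL_HELPER.get_hf_objects as shown further below; in practice HF_TextBlock builds this transform for you):

{% raw %}
tok_tfm = HF_TokenizerTransform(hf_arch, hf_tokenizer, max_length=128, padding='max_length')

# a single sequence (e.g., sequence classification)
inputs = tok_tfm("I really liked this movie!")
print(list(inputs.keys()))  # input_ids, attention_mask, etc. (exact keys depend on the tokenizer)

# a list of two sequences (e.g., question answering: a "context" and a "question")
pair_inputs = tok_tfm(["The fox was quick and brown.", "What color was the fox?"])
{% endraw %}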

{% raw %}
{% endraw %} {% raw %}

class HF_BaseInput[source]

HF_BaseInput(iterable=()) :: list

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.

{% endraw %}

A HF_BaseInput object is returned from the decodes method of HF_BatchTransform as a means of customizing @typedispatched functions like DataLoaders.show_batch and Learner.show_results. It encapsulates a list with one item: the input_ids for the sequence.
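
Because those functions are @typedispatched, you can override how transformer inputs are displayed by dispatching on HF_BaseInput yourself. The sketch below is purely illustrative (it is not blurr's actual show_batch implementation):

{% raw %}
# typedispatch comes from fastcore and is available via the usual fastai imports
@typedispatch
def show_batch(x:HF_BaseInput, y, samples, dataloaders=None, ctxs=None, max_n=6, **kwargs):
    # `samples` holds the already-decoded (text, target) pairs; render a few of them however you like
    for sample in samples[:max_n]: print(sample)
    return ctxs
{% endraw %}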

{% raw %}
{% endraw %} {% raw %}

class HF_BatchTransform[source]

HF_BatchTransform(hf_arch, hf_tokenizer, hf_input_return_type=HF_BaseInput, **kwargs) :: Transform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode HF_TokenizerTransform inputs
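
By default you never create this transform yourself; HF_TextBlock builds one for you behind the scenes. If you do need to customize it, one option (an assumption for illustration, with hf_arch and hf_tokenizer obtained as shown further below) is to construct it and hand it to HF_TextBlock via its hf_batch_tfm argument:

batch_tfm = HF_BatchTransform(hf_arch, hf_tokenizer, hf_input_return_type=HF_BaseInput)
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer, hf_batch_tfm=batch_tfm), CategoryBlock)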

{% endraw %} {% raw %}
{% endraw %} {% raw %}

pad_hf_inputs[source]

pad_hf_inputs(samples, arch, hf_input_idxs=[0], pad_idx=0, pad_first=False)

Add this to your batch transforms if you are using dynamic padding with HF_TokenizerTransform (i.e., padding is set to anything except 'max_length') to ensure all HF tensors are sized to the longest input in the batch.

Note: This is automatically included as necessary by HF_TextBlock
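
For example, leaving padding=True (the default) gives you dynamic padding, and HF_TextBlock will include this transform for you. If you were wiring the pipeline up yourself, a rough (assumed) equivalent would be to pass it as a before_batch transform, using the hf_* objects, dblock, and imdb_df from the sequence classification example below:

# partial comes from functools (re-exported by the usual fastai imports)
manual_pad_tfm = partial(pad_hf_inputs, arch=hf_arch, pad_idx=hf_tokenizer.pad_token_id)
dls = dblock.dataloaders(imdb_df, bs=4, before_batch=[manual_pad_tfm])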

{% endraw %} {% raw %}
{% endraw %} {% raw %}

class HF_TextBlock[source]

HF_TextBlock(hf_arch, hf_tokenizer, hf_tok_tfm=None, max_length=512, padding=True, truncation=True, is_pretokenized=False, hf_batch_tfm=None, hf_input_return_type=HF_BaseInput, hf_input_idxs=[0], dl_type=SortedDL, tok_kwargs={}, batch_kwargs={}, **kwargs) :: TransformBlock

A basic wrapper that links defaults transforms for the data block API

{% endraw %}

HF_TextBlock has been dramatically simplified from its predecessor. It handles setting up your HF_TokenizerTransform and HF_BatchTransform transforms regardless of data source (e.g., it will work with files, DataFrames, etc.).

{% raw %}
{% endraw %}

Sequence classification

Below demonstrates how to construct your DataBlock for a sequence classification task (e.g., a model that requires a single text input)

{% raw %}
path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
{% endraw %} {% raw %}
imdb_df.head()
label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! False
1 positive This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... False
2 negative Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... False
3 positive Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... False
4 negative This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... False
{% endraw %}

There are a bunch of ways we can get at the four huggingface elements we need (the architecture name, config, tokenizer, and model). We can create them directly, as sketched below, or we can use one of the helper methods available via BLURR_MODEL_HELPER.
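
The "create them directly" route just uses the huggingface classes themselves; blurr only needs these four objects, however you obtain them. A rough sketch (the short architecture name is an assumption here, though it matches what BLURR_MODEL_HELPER reports for roberta-base):

{% raw %}
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

hf_arch = 'roberta'  # blurr refers to architectures by their short name
hf_config = AutoConfig.from_pretrained('roberta-base')
hf_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
hf_model = AutoModelForSequenceClassification.from_pretrained('roberta-base', config=hf_config)
{% endraw %}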

{% raw %}
task = HF_TASKS_AUTO.SequenceClassification

pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
{% endraw %}

Once you have those elements, creating your DataBlock is as simple as the following.

{% raw %}
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), CategoryBlock)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('text'), 
                   get_y=ColReader('label'), 
                   splitter=ColSplitter(col='is_valid'))
{% endraw %} {% raw %}
 
{% endraw %} {% raw %}
dls = dblock.dataloaders(imdb_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch(); len(b), len(b[0]['input_ids']), b[0]['input_ids'].shape, len(b[1]) 
(2, 4, torch.Size([4, 177]), 4)
{% endraw %}

Let's take a look at the actual types represented by our batch

{% raw %}
explode_types(b)
{tuple: [dict, fastai.torch_core.TensorCategory]}
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=2)
text category
0 Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! negative
1 Everyday we can watch a great number of film, soap... on tv. Sometimes a miracle happens. A great film, with real feelings, with great actors, with a great realisator-director. For me there are two films that everyone needs to see : the first is the Pacula? "Sophie's choice" with Meryl Streep. The second is "Journey of Hope". As human beings, we need to learn about humility, about love of the others, about acceptation of other civilisation, other way of living. We also have to struggle against racism and fascim. We must avoid judging, criticize; we only have to love our earth companion. This wonderful film, helps us reaching John (Lennon) his dream : Imagine all the people living live in peace. These two films are difficult to see : watch these, but sure you will be hurt, but better. Great film, great actors, terrible story, pain and cry guarantee, but also better understanding of the others. Enjoy it. positive
{% endraw %}

Tests

The tests below ensure that the core DataBlock code above works for all the pretrained sequence classification models available in huggingface. These tests are excluded from the CI workflow because of how long they take to run and the amount of data they would need to download.

Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a github issue (or a PR if you'd like to fix it yourself)

{% raw %}
BLURR_MODEL_HELPER.get_models(task='SequenceClassification')
[transformers.modeling_albert.AlbertForSequenceClassification,
 transformers.modeling_auto.AutoModelForSequenceClassification,
 transformers.modeling_bart.BartForSequenceClassification,
 transformers.modeling_bert.BertForSequenceClassification,
 transformers.modeling_camembert.CamembertForSequenceClassification,
 transformers.modeling_distilbert.DistilBertForSequenceClassification,
 transformers.modeling_electra.ElectraForSequenceClassification,
 transformers.modeling_flaubert.FlaubertForSequenceClassification,
 transformers.modeling_longformer.LongformerForSequenceClassification,
 transformers.modeling_mobilebert.MobileBertForSequenceClassification,
 transformers.modeling_roberta.RobertaForSequenceClassification,
 transformers.modeling_xlm.XLMForSequenceClassification,
 transformers.modeling_xlm_roberta.XLMRobertaForSequenceClassification,
 transformers.modeling_xlnet.XLNetForSequenceClassification]
{% endraw %} {% raw %}
pretrained_model_names = [
    'albert-base-v1',
    'facebook/bart-base',
    'bert-base-uncased',
    'camembert-base',
    'distilbert-base-uncased',
    'monologg/electra-small-finetuned-imdb',
    'flaubert/flaubert_small_cased', 
    'allenai/longformer-base-4096',
    'google/mobilebert-uncased',
    'roberta-base',
    'xlm-mlm-en-2048',
    'xlm-roberta-base',
    'xlnet-base-cased'
]
{% endraw %} {% raw %}
path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
{% endraw %} {% raw %}
#hide_output
task = HF_TASKS_AUTO.SequenceClassification
bsz = 2

test_results = []
for model_name in pretrained_model_names:
    error=None
    
    print(f'=== {model_name} ===\n')
    
    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(model_name, task=task)
    
    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\n')
    
    blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer, padding='max_length', max_length=128), 
              CategoryBlock)

    dblock = DataBlock(blocks=blocks, 
                       get_x=ColReader('text'), 
                       get_y=ColReader('label'), 
                       splitter=ColSplitter(col='is_valid'))
    
    dls = dblock.dataloaders(imdb_df, bs=bsz) 
    b = dls.one_batch()
    
    try:
        print('*** TESTING DataLoaders ***\n')
        test_eq(len(b), 2)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, 128]))
        test_eq(len(b[1]), bsz)

        if (hasattr(hf_tokenizer, 'add_prefix_space')):
            test_eq(dls.tfms[0].kwargs['add_prefix_space'], True)
            
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'PASSED', ''))
        dls.show_batch(dataloaders=dls, max_n=2)
        
    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'FAILED', err))
{% endraw %} {% raw %}
arch tokenizer model_name result error
0 albert AlbertTokenizer albert-base-v1 PASSED
1 bart BartTokenizer facebook/bart-base PASSED
2 bert BertTokenizer bert-base-uncased PASSED
3 camembert CamembertTokenizer camembert-base PASSED
4 distilbert DistilBertTokenizer distilbert-base-uncased PASSED
5 electra ElectraTokenizer monologg/electra-small-finetuned-imdb PASSED
6 flaubert FlaubertTokenizer flaubert/flaubert_small_cased PASSED
7 longformer LongformerTokenizer allenai/longformer-base-4096 PASSED
8 mobilebert MobileBertTokenizer google/mobilebert-uncased PASSED
9 roberta RobertaTokenizer roberta-base PASSED
10 xlm XLMTokenizer xlm-mlm-en-2048 PASSED
11 xlm_roberta XLMRobertaTokenizer xlm-roberta-base PASSED
12 xlnet XLNetTokenizer xlnet-base-cased PASSED
{% endraw %}

Example: Multi-label classification

Below demonstrates how to construct your DataBlock for a multi-label classification task

{% raw %}
raw_data = nlp.load_dataset('civil_comments', split='train[:1%]') 
len(raw_data)
Using custom data configuration default
18049
{% endraw %} {% raw %}
toxic_df = pd.DataFrame(raw_data, columns=list(raw_data.features.keys()))
toxic_df.head()
text toxicity severe_toxicity obscene threat insult identity_attack sexual_explicit
0 This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done! 0.000000 0.000000 0.0 0.0 0.00000 0.000000 0.0
1 Thank you!! This would make my life a lot less anxiety-inducing. Keep it up, and don't let anyone get in your way! 0.000000 0.000000 0.0 0.0 0.00000 0.000000 0.0
2 This is such an urgent design problem; kudos to you for taking it on. Very impressive! 0.000000 0.000000 0.0 0.0 0.00000 0.000000 0.0
3 Is this something I'll be able to install on my site? When will you be releasing it? 0.000000 0.000000 0.0 0.0 0.00000 0.000000 0.0
4 haha you guys are a bunch of losers. 0.893617 0.021277 0.0 0.0 0.87234 0.021277 0.0
{% endraw %} {% raw %}
lbl_cols = list(toxic_df.columns[1:]); lbl_cols
['toxicity',
 'severe_toxicity',
 'obscene',
 'threat',
 'insult',
 'identity_attack',
 'sexual_explicit']
{% endraw %} {% raw %}
toxic_df = toxic_df.round({col: 0 for col in lbl_cols})
toxic_df.head()
text toxicity severe_toxicity obscene threat insult identity_attack sexual_explicit
0 This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done! 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Thank you!! This would make my life a lot less anxiety-inducing. Keep it up, and don't let anyone get in your way! 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 This is such an urgent design problem; kudos to you for taking it on. Very impressive! 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Is this something I'll be able to install on my site? When will you be releasing it? 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 haha you guys are a bunch of losers. 1.0 0.0 0.0 0.0 1.0 0.0 0.0
{% endraw %} {% raw %}
task = HF_TASKS_AUTO.SequenceClassification

pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
n_labels = len(lbl_cols)

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               task=task, 
                                                                               config_kwargs={'num_labels': n_labels})
{% endraw %} {% raw %}
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), MultiCategoryBlock(encoded=True, vocab=lbl_cols))

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('text'), 
                   get_y=ColReader(lbl_cols), 
                   splitter=RandomSplitter())
{% endraw %} {% raw %}
dls = dblock.dataloaders(toxic_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch()
len(b), b[0]['input_ids'].shape, b[1].shape
(2, torch.Size([4, 90]), torch.Size([4, 7]))
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=2)
text None
0 Keeping it simple is a good principle to follow in so many areas.
1 Oregon Live is reporting today that the City Council of Portland just approved an allocation of $30,000 to bus homeless people out of Portland. That's one way to reduce the number of homeless people in the community that maybe Eugene should be looking at. Even at $150 a ticket you could bus 200 people out with that kind of money.
{% endraw %}

Cleanup