--- title: data.core keywords: fastai sidebar: home_sidebar summary: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations." description: "This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way modelable by huggingface transformer implementations." nb_path: "nbs/01_data-core.ipynb" ---
{% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti
{% endraw %}

Base tokenization, batch transform, and DataBlock methods

{% raw %}
{% endraw %} {% raw %}

class HF_BaseInput[source]

HF_BaseInput(x, **kwargs) :: TensorBase

{% endraw %}

An HF_BaseInput object is returned from the decodes method of HF_BatchTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. It represents the "input_ids" of a huggingface sequence as a tensor with a show method that requires a huggingface tokenizer for proper display.
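
For illustration, the show behavior essentially amounts to handing the "input_ids" back to the huggingface tokenizer to decode. Below is a minimal sketch of that idea (not blurr's actual implementation), assuming the roberta-base checkpoint used later in this document:

{% raw %}
from transformers import AutoTokenizer

# a rough sketch only: "showing" an HF_BaseInput boils down to decoding its "input_ids"
# with the huggingface tokenizer that produced them
hf_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

ids = hf_tokenizer('a tiny example', return_tensors='pt')['input_ids'][0]
print(hf_tokenizer.decode(ids, skip_special_tokens=True))  # -> 'a tiny example'
{% endraw %}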

{% raw %}
{% endraw %} {% raw %}

class HF_BatchTransform[source]

HF_BatchTransform(hf_arch, hf_tokenizer, max_length=None, padding=True, truncation=True, is_split_into_words=False, n_tok_inps=1, hf_input_return_type=HF_BaseInput, tok_kwargs={}, **kwargs) :: Transform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

{% endraw %}

HF_BatchTransform was inspired by this article. It handles both the tokenization and numericalization traditionally split apart in the fastai text DataBlock API. For huggingface tokenizers that require a prefix space, it will be included automatically. In its current incarnation, HF_BatchTransform can be used to tokenize multiple inputs (common in seq2seq models for tasks like summarization) and even apply different tokenizers and arguments to each.

Inputs can come in as a string or a list of tokens, the latter being for tasks like Named Entity Recognition (NER), where you want to predict the label of each token.

Note: The previous version of the library performed the tokenization/numericalization as a type transform when the raw data was read, and included a couple of batch transforms to prepare the data for collation (e.g., to be made into a mini-batch). With this update, everything is done in a single batch transform. Why? Part of the inspiration had to do with the mechanics of the huggingface tokenizer, in particular how by default it returns a collated mini-batch of data given a list of sequences. And where do we get a list of examples with fastai? In the batch transforms! So I thought, hey, why not do everything dynamically at batch time? And with a bit of tweaking, I got everything to work pretty well. The result is less code, faster mini-batch creation, lower RAM utilization, less time spent tokenizing (which really helps with very large datasets), and more flexibility.
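
To see why batch time is a natural place for this work, note that a huggingface tokenizer already returns a padded, collated mini-batch when you call it on a list of sequences. A minimal sketch of that behavior (assuming the roberta-base checkpoint used below):

{% raw %}
from transformers import AutoTokenizer

# calling the tokenizer on a *list* of texts yields tensors that are already collated
hf_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
enc = hf_tokenizer(['a short example', 'a second, slightly longer example'],
                   padding=True, truncation=True, max_length=512, return_tensors='pt')

print(list(enc.keys()))        # ['input_ids', 'attention_mask']
print(enc['input_ids'].shape)  # one row per sequence, padded to the longest in the batch
{% endraw %}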

{% raw %}
{% endraw %} {% raw %}

class HF_TextBlock[source]

HF_TextBlock(hf_arch=None, hf_tokenizer=None, hf_batch_tfm=None, max_length=512, padding=True, truncation=True, is_split_into_words=False, n_tok_inps=1, tok_kwargs={}, hf_input_return_type=HF_BaseInput, dl_type=SortedDL, batch_kwargs={}, **kwargs) :: TransformBlock

A basic wrapper that links default transforms for the data block API

{% endraw %}


HF_TextBlock has been dramatically simplified from its predecessor. It handles setting up your HF_BatchTransform transform regardless of data source (e.g., this will work with files, DataFrames, whatever). For it to work, you must either pass in your own instance of a HF_BatchTransform class or the huggingface architecture and tokenizer via the hf_arch and hf_tokenizer arguments (the other args are optional).
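
In other words, either of the following would work (a sketch against the signature above; hf_arch and hf_tokenizer are created in the next section):

{% raw %}
# option 1: pass the architecture and tokenizer and let HF_TextBlock build the batch transform
txt_block = HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer)

# option 2: build (and customize) the HF_BatchTransform yourself and pass it in
batch_tfm = HF_BatchTransform(hf_arch, hf_tokenizer, max_length=256, padding='max_length')
txt_block = HF_TextBlock(hf_batch_tfm=batch_tfm)
{% endraw %}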

{% raw %}
{% endraw %}

Sequence classification

Below demonstrates how to construct your DataBlock for a sequence classification task (e.g., a model that requires a single text input).

{% raw %}
path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
{% endraw %} {% raw %}
imdb_df.head()
label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! False
1 positive This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... False
2 negative Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... False
3 positive Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... False
4 negative This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... False
{% endraw %}

There are a bunch of ways we can get at the four huggingface elements we need (i.e., the architecture name, tokenizer, config, and model). We can just create them directly, or we can use one of the helper methods available via BLURR_MODEL_HELPER.
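
For example, creating them directly with the standard huggingface Auto* classes might look something like this (a sketch only; how blurr derives the architecture name internally may differ, and BLURR_MODEL_HELPER.get_hf_objects in the next cell wraps up the same kind of work for you):

{% raw %}
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

# build the config, tokenizer, and model straight from huggingface; the architecture name
# can be read off the config's model_type (e.g., 'roberta' for roberta-base)
hf_config = AutoConfig.from_pretrained('roberta-base')
hf_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
hf_model = AutoModelForSequenceClassification.from_pretrained('roberta-base', config=hf_config)
hf_arch = hf_config.model_type  # 'roberta'
{% endraw %}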

{% raw %}
task = HF_TASKS_AUTO.SequenceClassification

pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
{% endraw %}

Once you have those elements, creating your DataBlock is as simple as the code below.

{% raw %}
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter(col='is_valid'))
{% endraw %} {% raw %}
dls = dblock.dataloaders(imdb_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch()
{% endraw %} {% raw %}
b = dls.one_batch(); len(b), len(b[0]['input_ids']), b[0]['input_ids'].shape, len(b[1]) 
(2, 4, torch.Size([4, 512]), 4)
{% endraw %}

Let's take a look at the actual types represented by our batch

{% raw %}
explode_types(b)
{tuple: [dict, fastai.torch_core.TensorCategory]}
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=2)
text category
0 Raising Victor Vargas: A Review<br /><br />You know, Raising Victor Vargas is like sticking your hands into a big, steaming bowl of oatmeal. It's warm and gooey, but you're not sure if it feels right. Try as I might, no matter how warm and gooey Raising Victor Vargas became I was always aware that something didn't quite feel right. Victor Vargas suffers from a certain overconfidence on the director's part. Apparently, the director thought that the ethnic backdrop of a Latino family on the lower east side, and an idyllic storyline would make the film critic proof. He was right, but it didn't fool me. Raising Victor Vargas is the story about a seventeen-year old boy called, you guessed it, Victor Vargas (Victor Rasuk) who lives his teenage years chasing more skirt than the Rolling Stones could do in all the years they've toured. The movie starts off in `Ugly Fat' Donna's bedroom where Victor is sure to seduce her, but a cry from outside disrupts his plans when his best-friend Harold (Kevin Rivera) comes-a-looking for him. Caught in the attempt by Harold and his sister, Victor Vargas runs off for damage control. Yet even with the embarrassing implication that he's been boffing the homeliest girl in the neighborhood, nothing dissuades young Victor from going off on the hunt for more fresh meat. On a hot, New York City day they make way to the local public swimming pool where Victor's eyes catch a glimpse of the lovely young nymph Judy (Judy Marte), who's not just pretty, but a strong and independent too. The relationship that develops between Victor and Judy becomes the focus of the film. The story also focuses on Victor's family that is comprised of his grandmother or abuelita (Altagracia Guzman), his brother Nino (also played by real life brother to Victor, Silvestre Rasuk) and his sister Vicky (Krystal Rodriguez). The action follows Victor between scenes with Judy and scenes with his family. Victor tries to cope with being an oversexed pimp-daddy, his feelings for Judy and his grandmother's conservative Catholic upbringing.<br /><br />The problems that arise from Raising Victor Vargas are a few, but glaring errors. Throughout the film you get to know certain characters like Vicky, Nino, Grandma, negative
1 Now that Che(2008) has finished its relatively short Australian cinema run (extremely limited release:1 screen in Sydney, after 6wks), I can guiltlessly join both hosts of "At The Movies" in taking Steven Soderbergh to task.<br /><br />It's usually satisfying to watch a film director change his style/subject, but Soderbergh's most recent stinker, The Girlfriend Experience(2009), was also missing a story, so narrative (and editing?) seem to suddenly be Soderbergh's main challenge. Strange, after 20-odd years in the business. He was probably never much good at narrative, just hid it well inside "edgy" projects.<br /><br />None of this excuses him this present, almost diabolical failure. As David Stratton warns, "two parts of Che don't (even) make a whole". <br /><br />Epic biopic in name only, Che(2008) barely qualifies as a feature film! It certainly has no legs, inasmuch as except for its uncharacteristic ultimate resolution forced upon it by history, Soderbergh's 4.5hrs-long dirge just goes nowhere.<br /><br />Even Margaret Pomeranz, the more forgiving of Australia's At The Movies duo, noted about Soderbergh's repetitious waste of (HD digital storage): "you're in the woods...you're in the woods...you're in the woods...". I too am surprised Soderbergh didn't give us another 2.5hrs of THAT somewhere between his existing two Parts, because he still left out massive chunks of Che's "revolutionary" life! <br /><br />For a biopic of an important but infamous historical figure, Soderbergh unaccountably alienates, if not deliberately insults, his audiences by<br /><br />1. never providing most of Che's story; <br /><br />2. imposing unreasonable film lengths with mere dullard repetition; <br /><br />3. ignoring both true hindsight and a narrative of events; <br /><br />4. barely developing an idea, or a character; <br /><br />5. remaining claustrophobically episodic; <br /><br />6. ignoring proper context for scenes---whatever we do get is mired in disruptive timeshifts; <br /><br />7. linguistically negative
{% endraw %}

Tests

The tests below ensure that the core DataBlock code above works for all pretrained sequence classification models available in huggingface. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would be required to download.

Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a github issue (or a PR if you'd like to fix it yourself).

{% raw %}
BLURR_MODEL_HELPER.get_models(task='SequenceClassification')
[transformers.modeling_albert.AlbertForSequenceClassification,
 transformers.modeling_auto.AutoModelForSequenceClassification,
 transformers.modeling_bart.BartForSequenceClassification,
 transformers.modeling_bert.BertForSequenceClassification,
 transformers.modeling_camembert.CamembertForSequenceClassification,
 transformers.modeling_deberta.DebertaForSequenceClassification,
 transformers.modeling_distilbert.DistilBertForSequenceClassification,
 transformers.modeling_electra.ElectraForSequenceClassification,
 transformers.modeling_flaubert.FlaubertForSequenceClassification,
 transformers.modeling_funnel.FunnelForSequenceClassification,
 transformers.modeling_gpt2.GPT2ForSequenceClassification,
 transformers.modeling_longformer.LongformerForSequenceClassification,
 transformers.modeling_mobilebert.MobileBertForSequenceClassification,
 transformers.modeling_openai.OpenAIGPTForSequenceClassification,
 transformers.modeling_reformer.ReformerForSequenceClassification,
 transformers.modeling_roberta.RobertaForSequenceClassification,
 transformers.modeling_squeezebert.SqueezeBertForSequenceClassification,
 transformers.modeling_xlm.XLMForSequenceClassification,
 transformers.modeling_xlm_roberta.XLMRobertaForSequenceClassification,
 transformers.modeling_xlnet.XLNetForSequenceClassification]
{% endraw %} {% raw %}
pretrained_model_names = [
    'albert-base-v1',
    'facebook/bart-base',
    'bert-base-uncased',
    'camembert-base',
    'distilbert-base-uncased',
    'monologg/electra-small-finetuned-imdb',
    'flaubert/flaubert_small_cased', 
    'allenai/longformer-base-4096',
    'google/mobilebert-uncased',
    'roberta-base',
    'xlm-mlm-en-2048',
    'xlm-roberta-base',
    'xlnet-base-cased'
]
{% endraw %} {% raw %}
path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
{% endraw %} {% raw %}
#hide_output
task = HF_TASKS_AUTO.SequenceClassification
bsz = 2
seq_sz = 128

test_results = []
for model_name in pretrained_model_names:
    error=None
    
    print(f'=== {model_name} ===\n')

    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(model_name, task=task)    
    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\n')
    
    blocks = (
        HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer, padding='max_length', max_length=seq_sz), 
        CategoryBlock
    )

    dblock = DataBlock(blocks=blocks, 
                       get_x=ColReader('text'), 
                       get_y=ColReader('label'), 
                       splitter=ColSplitter(col='is_valid'))
    
    dls = dblock.dataloaders(imdb_df, bs=bsz) 
    b = dls.one_batch()
    
    try:
        print('*** TESTING DataLoaders ***\n')
        test_eq(len(b), 2)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, seq_sz]))
        test_eq(len(b[1]), bsz)

        if (hasattr(hf_tokenizer, 'add_prefix_space')):
            test_eq(dls.before_batch[0].tok_kwargs['add_prefix_space'], True)
            
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'PASSED', ''))
        dls.show_batch(dataloaders=dls, max_n=2)
        
    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'FAILED', err))
{% endraw %} {% raw %}
arch tokenizer model_name result error
0 albert AlbertTokenizer albert-base-v1 PASSED
1 bart BartTokenizer facebook/bart-base PASSED
2 bert BertTokenizer bert-base-uncased PASSED
3 camembert CamembertTokenizer camembert-base PASSED
4 distilbert DistilBertTokenizer distilbert-base-uncased PASSED
5 electra ElectraTokenizer monologg/electra-small-finetuned-imdb PASSED
6 flaubert FlaubertTokenizer flaubert/flaubert_small_cased PASSED
7 longformer LongformerTokenizer allenai/longformer-base-4096 PASSED
8 mobilebert MobileBertTokenizer google/mobilebert-uncased PASSED
9 roberta RobertaTokenizer roberta-base PASSED
10 xlm XLMTokenizer xlm-mlm-en-2048 PASSED
11 xlm_roberta XLMRobertaTokenizer xlm-roberta-base PASSED
12 xlnet XLNetTokenizer xlnet-base-cased PASSED
{% endraw %}
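
For reference, the summary table above can be rebuilt from test_results with something along these lines (a sketch; the cell that produced it is hidden in the notebook, so the column names are taken from the output above):

{% raw %}
# rebuild the summary table from the (arch, tokenizer, model_name, result, error) tuples
# collected in the loop above
test_results_df = pd.DataFrame(test_results, columns=['arch', 'tokenizer', 'model_name', 'result', 'error'])
test_results_df
{% endraw %}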

Cleanup