--- title: text.data.token_classification keywords: fastai sidebar: home_sidebar summary: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for token classification tasks (e.g., Named entity recognition (NER), Part-of-speech tagging (POS), etc...)" description: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for token classification tasks (e.g., Named entity recognition (NER), Part-of-speech tagging (POS), etc...)" nb_path: "nbs/13_text-data-token-classification.ipynb" ---
{% raw %}
{% endraw %} {% raw %}
 
{% endraw %} {% raw %}
{% endraw %} {% raw %}
What we're running with at the time this documentation was generated:
torch: 1.10.1+cu111
fastai: 2.5.6
transformers: 4.16.2
{% endraw %}

Setup

We'll use a subset of conll2003 to demonstrate how to configure your blurr code for token classification.

{% raw %}
raw_datasets = load_dataset("conll2003")
raw_datasets
Reusing dataset conll2003 (/home/wgilliam/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6)
DatasetDict({
    train: Dataset({
        features: ['chunk_tags', 'id', 'ner_tags', 'pos_tags', 'tokens'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['chunk_tags', 'id', 'ner_tags', 'pos_tags', 'tokens'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['chunk_tags', 'id', 'ner_tags', 'pos_tags', 'tokens'],
        num_rows: 3453
    })
})
{% endraw %}

We need to get a list of the distinct entities we want to predict. If they are represented in their raw/readable form as a list in another attribute/column of our dataset, we can build a sorted list of the distinct values like this: labels = sorted(list(set([lbls for sublist in germ_eval_df.labels.tolist() for lbls in sublist]))).
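For example, here is a minimal sketch of that pattern; the toy DataFrame and its labels column below are made up purely for illustration:

{% raw %}
import pandas as pd

# A toy stand-in for a DataFrame whose "labels" column holds a list of
# readable entity labels per example (germ_eval_df above is such a DataFrame)
toy_df = pd.DataFrame({"labels": [["B-PER", "I-PER", "O"], ["O", "B-LOC"], ["B-PER", "O"]]})

# Flatten the per-example label lists and keep the sorted distinct values
labels = sorted(set(lbl for sublist in toy_df.labels.tolist() for lbl in sublist))
print(labels)  # ['B-LOC', 'B-PER', 'I-PER', 'O']
{% endraw %}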

Fortunately, the conll2003 dataset allows us to get at this list directly using the code below.

{% raw %}
print(raw_datasets["train"].features["chunk_tags"].feature.names[:20])
print(raw_datasets["train"].features["ner_tags"].feature.names[:20])
print(raw_datasets["train"].features["pos_tags"].feature.names[:20])
['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP']
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS']
{% endraw %} {% raw %}
labels = raw_datasets["train"].features["ner_tags"].feature.names
labels
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
{% endraw %} {% raw %}
conll2003_df = pd.DataFrame(raw_datasets["train"])
{% endraw %} {% raw %}
model_cls = AutoModelForTokenClassification

pretrained_model_name = "roberta-base"  # "bert-base-multilingual-cased"
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(
    pretrained_model_name, model_cls=model_cls, config_kwargs={"num_labels": n_labels}
)

hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('roberta',
 transformers.models.roberta.configuration_roberta.RobertaConfig,
 transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast,
 transformers.models.roberta.modeling_roberta.RobertaForTokenClassification)
{% endraw %}

Preprocessing

Starting with version 2.0, BLURR provides a token classification preprocessing class that can be used to preprocess DataFrames or Hugging Face Datasets. We also introduce a novel way of handling long documents for this task that ensures the tokens associated with a word are not split across "chunked" documents. See below for an example.

{% raw %}

class TokenClassPreprocessor[source]

TokenClassPreprocessor(hf_tokenizer:PreTrainedTokenizerBase, chunk_examples:bool=False, word_stride:int=2, ignore_token_id:int=-100, label_names:Optional[List[str]]=None, batch_size:int=1000, id_attr:Optional[str]=None, word_list_attr:str='tokens', label_list_attr:str='labels', is_valid_attr:Optional[str]='is_valid', slow_word_ids_func:Optional[typing.Callable]=None, tok_kwargs:dict={}) :: Preprocessor

Type Default Details
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
chunk_examples bool False Set to True if the preprocessor should chunk examples that exceed max_length
word_stride int 2 Like "stride" except for words (not tokens)
ignore_token_id int -100 The token ID that should be ignored when calculating the loss
label_names typing.Optional[typing.List[str]] None The label names (if not specified, they will be built from the DataFrame)
batch_size int 1000 The number of examples to process at a time
id_attr typing.Optional[str] None The unique identifier in the dataset
word_list_attr str tokens The attribute holding the list of words
label_list_attr str labels The attribute holding the list of labels (one for each word in word_list_attr)
is_valid_attr typing.Optional[str] is_valid The attribute that should be created if you are processing individual training and validation datasets into a single dataset; it indicates which dataset each example belongs to
slow_word_ids_func typing.Optional[typing.Callable] None If using a slow tokenizer, you will need to provide a slow_word_ids_func that accepts a tokenizer, example index, and batch encoding as arguments and returns the equivalent of a fast tokenizer's word_ids
tok_kwargs dict None Tokenization kwargs that will be applied when calling the tokenizer
{% endraw %} {% raw %}
{% endraw %}

Labels are IDs

{% raw %}
preprocessor = TokenClassPreprocessor(
    hf_tokenizer,
    chunk_examples=True,
    word_stride=2,
    label_names=labels,
    id_attr="id",
    word_list_attr="tokens",
    label_list_attr="ner_tags",
    tok_kwargs={"max_length": 8},
)
proc_df = preprocessor.process_df(conll2003_df)

print(len(proc_df))
print(preprocessor.label_names)
proc_df.head(4)
61298
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
proc_tokens proc_ner_tags chunk_tags id ner_tags pos_tags tokens
0 [EU, rejects, German, call, to, boycott] [3, 0, 7, 0, 0, 0] [11, 21, 11, 12, 21, 22, 11, 12, 0] 0 [3, 0, 7, 0, 0, 0, 7, 0, 0] [22, 42, 16, 21, 35, 37, 16, 21, 7] [EU, rejects, German, call, to, boycott, British, lamb, .]
1 [to, boycott, British, lamb, .] [0, 0, 7, 0, 0] [11, 21, 11, 12, 21, 22, 11, 12, 0] 0 [3, 0, 7, 0, 0, 0, 7, 0, 0] [22, 42, 16, 21, 35, 37, 16, 21, 7] [EU, rejects, German, call, to, boycott, British, lamb, .]
2 [Peter, Blackburn] [1, 2] [11, 12] 1 [1, 2] [22, 22] [Peter, Blackburn]
3 [BRUSSELS, 1996-08-22] [5, 0] [11, 12] 2 [5, 0] [22, 11] [BRUSSELS, 1996-08-22]
{% endraw %}

Labels are entity names

{% raw %}
conll2003_labeled_df = conll2003_df.copy()
conll2003_labeled_df.ner_tags = conll2003_labeled_df.ner_tags.apply(lambda v: [labels[lbl_id] for lbl_id in v])
conll2003_labeled_df.head(5)
chunk_tags id ner_tags pos_tags tokens
0 [11, 21, 11, 12, 21, 22, 11, 12, 0] 0 [B-ORG, O, B-MISC, O, O, O, B-MISC, O, O] [22, 42, 16, 21, 35, 37, 16, 21, 7] [EU, rejects, German, call, to, boycott, British, lamb, .]
1 [11, 12] 1 [B-PER, I-PER] [22, 22] [Peter, Blackburn]
2 [11, 12] 2 [B-LOC, O] [22, 11] [BRUSSELS, 1996-08-22]
3 [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0] 3 [O, B-ORG, I-ORG, O, O, O, O, O, O, B-MISC, O, O, O, O, O, B-MISC, O, O, O, O, O, O, O, O, O, O, O, O, O, O] [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7] [The, European, Commission, said, on, Thursday, it, disagreed, with, German, advice, to, consumers, to, shun, British, lamb, until, scientists, determine, whether, mad, cow, disease, can, be, transmitted, to, sheep, .]
4 [11, 11, 12, 13, 11, 12, 12, 11, 12, 12, 12, 12, 21, 13, 11, 12, 21, 22, 11, 13, 11, 1, 13, 11, 17, 11, 12, 12, 21, 1, 0] 4 [B-LOC, O, O, O, O, B-ORG, I-ORG, O, O, O, B-PER, I-PER, O, O, O, O, O, O, O, O, O, O, O, B-LOC, O, O, O, O, O, O, O] [22, 27, 21, 35, 12, 22, 22, 27, 16, 21, 22, 22, 38, 15, 22, 24, 20, 37, 21, 15, 24, 16, 15, 22, 15, 12, 16, 21, 38, 17, 7] [Germany, 's, representative, to, the, European, Union, 's, veterinary, committee, Werner, Zwingmann, said, on, Wednesday, consumers, should, buy, sheepmeat, from, countries, other, than, Britain, until, the, scientific, advice, was, clearer, .]
{% endraw %} {% raw %}
preprocessor = TokenClassPreprocessor(
    hf_tokenizer, label_names=labels, id_attr="id", word_list_attr="tokens", label_list_attr="ner_tags", tok_kwargs={"max_length": 8}
)
proc_df = preprocessor.process_df(conll2003_labeled_df)

print(len(proc_df))
print(preprocessor.label_names)
proc_df.head(4)
14041
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
proc_tokens proc_ner_tags chunk_tags id ner_tags pos_tags tokens
0 [EU, rejects, German, call, to, boycott] [B-ORG, O, B-MISC, O, O, O] [11, 21, 11, 12, 21, 22, 11, 12, 0] 0 [B-ORG, O, B-MISC, O, O, O, B-MISC, O, O] [22, 42, 16, 21, 35, 37, 16, 21, 7] [EU, rejects, German, call, to, boycott, British, lamb, .]
1 [Peter, Blackburn] [B-PER, I-PER] [11, 12] 1 [B-PER, I-PER] [22, 22] [Peter, Blackburn]
2 [BRUSSELS, 1996-08-22] [B-LOC, O] [11, 12] 2 [B-LOC, O] [22, 11] [BRUSSELS, 1996-08-22]
3 [The, European, Commission, said, on, Thursday] [O, B-ORG, I-ORG, O, O, O] [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0] 3 [O, B-ORG, I-ORG, O, O, O, O, O, O, B-MISC, O, O, O, O, O, B-MISC, O, O, O, O, O, O, O, O, O, O, O, O, O, O] [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7] [The, European, Commission, said, on, Thursday, it, disagreed, with, German, advice, to, consumers, to, shun, British, lamb, until, scientists, determine, whether, mad, cow, disease, can, be, transmitted, to, sheep, .]
{% endraw %}

Labeling strategies

{% raw %}

class BaseLabelingStrategy[source]

BaseLabelingStrategy(hf_tokenizer:PreTrainedTokenizerBase, label_names:Optional[List[str]], non_entity_label:str='O', ignore_token_id:int=-100)

{% endraw %} {% raw %}
{% endraw %}

Here we include a BaseLabelingStrategy abstract class and several concrete strategies for assigning labels to your tokenized inputs. The "only first token" and "B/I" labeling strategies are discussed in the "Token Classification" section in part 7 of the Hugging Face Transformers course.

{% raw %}

class OnlyFirstTokenLabelingStrategy[source]

OnlyFirstTokenLabelingStrategy(hf_tokenizer:PreTrainedTokenizerBase, label_names:Optional[List[str]], non_entity_label:str='O', ignore_token_id:int=-100) :: BaseLabelingStrategy

Only the first token of a word is associated with the label (all other sub-tokens get the ignore_token_id). Works whether labels are IDs or strings (in the latter case, label_names is used to look up a label's ID)

{% endraw %} {% raw %}

class SameLabelLabelingStrategy[source]

SameLabelLabelingStrategy(hf_tokenizer:PreTrainedTokenizerBase, label_names:Optional[List[str]], non_entity_label:str='O', ignore_token_id:int=-100) :: BaseLabelingStrategy

Every token associated with a given word gets that word's label. Works whether labels are IDs or strings (in the latter case, label_names is used to look up a label's ID)

{% endraw %} {% raw %}

class BILabelingStrategy[source]

BILabelingStrategy(hf_tokenizer:PreTrainedTokenizerBase, label_names:Optional[List[str]], non_entity_label:str='O', ignore_token_id:int=-100) :: BaseLabelingStrategy

If using B/I labels, the first token associated with a given word gets the "B" label while all other tokens of that same word get "I" labels. If "I" labels don't exist, this strategy behaves like OnlyFirstTokenLabelingStrategy. Works whether labels are IDs or strings (in the latter case, label_names is used to look up a label's ID)

{% endraw %} {% raw %}
{% endraw %}
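To make the differences between these strategies concrete, here is a minimal, self-contained sketch (not blurr's actual implementation) of how each one might assign a label ID to every sub-token, given the word index each sub-token maps back to (as reported by a fast tokenizer's word_ids()):

{% raw %}
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
ignore_token_id = -100

# "Wayde Gilliam" tokenized as: <s> ĠWay de ĠGill iam </s>; word_ids maps each
# sub-token back to its word index (None = special token)
word_ids = [None, 0, 0, 1, 1, None]
word_labels = ["B-PER", "I-PER"]  # one label per *word*

def only_first_token(word_ids, word_labels):
    # only the first sub-token of a word keeps the label; the rest are masked
    out, seen = [], set()
    for wid in word_ids:
        if wid is None or wid in seen:
            out.append(ignore_token_id)
        else:
            seen.add(wid)
            out.append(labels.index(word_labels[wid]))
    return out

def same_label(word_ids, word_labels):
    # every sub-token of a word gets that word's label
    return [ignore_token_id if wid is None else labels.index(word_labels[wid]) for wid in word_ids]

def b_i(word_ids, word_labels):
    # later sub-tokens of a "B-" word get the matching "I-" label (if it exists)
    out, seen = [], set()
    for wid in word_ids:
        if wid is None:
            out.append(ignore_token_id)
            continue
        lbl = word_labels[wid]
        if wid in seen and lbl.startswith("B-") and f"I-{lbl[2:]}" in labels:
            lbl = f"I-{lbl[2:]}"
        seen.add(wid)
        out.append(labels.index(lbl))
    return out

print(only_first_token(word_ids, word_labels))  # [-100, 1, -100, 2, -100, -100]
print(same_label(word_ids, word_labels))        # [-100, 1, 1, 2, 2, -100]
print(b_i(word_ids, word_labels))               # [-100, 1, 2, 2, 2, -100]
{% endraw %}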

Reconstructing inputs/labels

The utility methods below allow blurr users to reconstruct the original word/label associations from the input_ids/label associations. For example, these are used in our token classification show_batch method below.

{% raw %}
{% endraw %} {% raw %}
for idx in range(3):
    raw_word_list = conll2003_df.iloc[idx]["tokens"]
    raw_label_list = conll2003_df.iloc[idx]["ner_tags"]

    be = hf_tokenizer(raw_word_list, is_split_into_words=True)
    input_ids = be["input_ids"]
    targ_ids = [-100 if (word_id is None) else raw_label_list[word_id] for word_id in be.word_ids()]

    tok_labels = get_token_labels_from_input_ids(hf_tokenizer, input_ids, targ_ids, labels)

    for tok_label, targ_id in zip(tok_labels, [label_id for label_id in targ_ids if label_id != -100]):
        test_eq(tok_label[1], labels[targ_id])
{% endraw %} {% raw %}

get_token_labels_from_input_ids[source]

get_token_labels_from_input_ids(hf_tokenizer:PreTrainedTokenizerBase, input_ids:List[int], token_label_ids:List[int], vocab:List[str], ignore_token_id:int=-100, ignore_token:str='[xIGNx]')

Given a list of input IDs, the label ID associated to each, and the labels vocab, this method will return a list of tuples whereby each tuple defines the "token" and its label name. For example: [('ĠWay', B-PER), ('de', B-PER), ('ĠGill', I-PER), ('iam', I-PER), ('Ġloves'), ('ĠHug', B-ORG), ('ging', B-ORG), ('ĠFace', I-ORG)]

Type Default Details
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
input_ids typing.List[int] List of input_ids for the tokens in a single piece of processed text
token_label_ids typing.List[int] List of label indices for each token
vocab typing.List[str] List of label names, used to look up the name of each label index
ignore_token_id int -100 The token ID that should be ignored when calculating the loss
ignore_token str [xIGNx] The token used to identify ignored tokens (default: [xIGNx])
{% endraw %} {% raw %}
{% endraw %} {% raw %}
for idx in range(5):
    raw_word_list = conll2003_df.iloc[idx]["tokens"]
    raw_label_list = conll2003_df.iloc[idx]["ner_tags"]

    be = hf_tokenizer(raw_word_list, is_split_into_words=True)
    input_ids = be["input_ids"]
    targ_ids = [-100 if (word_id is None) else raw_label_list[word_id] for word_id in be.word_ids()]

    tok_labels = get_token_labels_from_input_ids(hf_tokenizer, input_ids, targ_ids, labels)
    word_labels = get_word_labels_from_token_labels(hf_arch, hf_tokenizer, tok_labels)

    for word_label, raw_word, raw_label_id in zip(word_labels, raw_word_list, raw_label_list):
        test_eq(word_label[0], raw_word)
        test_eq(word_label[1], labels[raw_label_id])
{% endraw %} {% raw %}

get_word_labels_from_token_labels[source]

get_word_labels_from_token_labels(hf_arch:str, hf_tokenizer:PreTrainedTokenizerBase, tok_labels)

Given a list of tuples where each tuple defines a token and its label, return a list of tuples whereby each tuple defines the "word" and its label. This method assumes that model inputs are a list of words, and in conjunction with the align_labels_with_tokens method, allows the user to reconstruct the original raw inputs and labels.

Type Default Details
hf_arch str The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
tok_labels A list of tuples, where each represents a token and its label (e.g., [('ĠHug', B-ORG), ('ging', B-ORG), ('ĠFace', I-ORG), ...])
{% endraw %}

Mid-level API

{% raw %}

class TokenTensorCategory[source]

TokenTensorCategory(x, **kwargs) :: TensorBase

A Tensor which supports subclass pickling and maintains metadata when casting or after methods

{% endraw %} {% raw %}
{% endraw %} {% raw %}

class TokenCategorize[source]

TokenCategorize(vocab:List[str]=None, ignore_token:str='[xIGNx]', ignore_token_id:int=-100) :: Transform

Reversible transform of a list of category strings to vocab IDs

Type Default Details
vocab typing.List[str] None The unique list of entities (e.g., B-LOC) (default: CategoryMap(vocab))
ignore_token str [xIGNx] The token used to identify ignored tokens (default: xIGNx)
ignore_token_id int -100 The token ID that should be ignored when calculating the loss (default: CrossEntropyLossFlat().ignore_index)
{% endraw %} {% raw %}
{% endraw %}

TokenCategorize modifies the fastai Categorize transform in a couple of ways.

First, it allows your targets to consist of a category per token, and second, it uses an ignore_token_id to mask sub-tokens that don't require a prediction. For example, the targets of special tokens (e.g., pad, cls, sep) are set to ignore_token_id, as are any sub-tokens after the first when a word is split into more than one sub-token.
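The default ignore_token_id of -100 works because it matches the ignore_index used by PyTorch's cross-entropy loss (and hence fastai's CrossEntropyLossFlat), so masked positions simply drop out of the loss. A minimal sketch:

{% raw %}
import torch
import torch.nn.functional as F

# Targets for 4 tokens over 9 labels; special tokens and trailing sub-tokens
# are masked with -100
targs = torch.tensor([-100, 1, -100, 0])
logits = torch.randn(4, 9)

# Positions whose target equals ignore_index contribute nothing to the loss
loss = F.cross_entropy(logits, targs, ignore_index=-100)
{% endraw %}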

{% raw %}
{% endraw %} {% raw %}

TokenCategoryBlock[source]

TokenCategoryBlock(vocab:Optional[List[str]]=None, ignore_token:str='[xIGNx]', ignore_token_id:int=-100)

TransformBlock for per-token categorical targets

Type Default Details
vocab typing.Optional[typing.List[str]] None The unique list of entities (e.g., B-LOC) (default: CategoryMap(vocab))
ignore_token str [xIGNx] The token used to identify ignored tokens (default: xIGNx)
ignore_token_id int -100 The token ID that should be ignored when calculating the loss (default: CrossEntropyLossFlat().ignore_index)
{% endraw %} {% raw %}

class TokenClassTextInput[source]

TokenClassTextInput(x, **kwargs) :: TextInput

The base representation of your inputs; used by the various fastai show methods

{% endraw %} {% raw %}
{% endraw %}

Again, we define a custom class, TokenClassTextInput, for the @typedispatched methods to use so that we can override how token classification inputs/targets are assembled, as well as how the data is shown via methods like show_batch and show_results.

{% raw %}

class TokenClassBatchTokenizeTransform[source]

TokenClassBatchTokenizeTransform(hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, include_labels:bool=True, ignore_token_id:int=-100, labeling_strategy_cls:BaseLabelingStrategy=OnlyFirstTokenLabelingStrategy, target_label_names:Optional[List[str]]=None, non_entity_label:str='O', max_length:Optional[int]=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, is_split_into_words:bool=True, slow_word_ids_func:Optional[typing.Callable]=None, tok_kwargs:dict={}, **kwargs) :: BatchTokenizeTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

Type Default Details
hf_arch str The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)
hf_config PretrainedConfig A specific configuration instance you want to use
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
hf_model PreTrainedModel A Hugging Face model
include_labels bool True Controls whether the "labels" are included in your inputs. If they are, the loss will be calculated in the model's forward function and you can simply use PreCalculatedLoss as your Learner's loss function
ignore_token_id int -100 The token ID that should be ignored when calculating the loss
labeling_strategy_cls BaseLabelingStrategy OnlyFirstTokenLabelingStrategy The labeling strategy you want to apply when associating labels with word tokens
target_label_names typing.Optional[typing.List[str]] None The target label names
non_entity_label str O The label used for non-entity tokens
max_length typing.Optional[int] None To control the length of the padding/truncation. It can be an integer or None,
in which case it will default to the maximum length the model can accept. If the model has no
specific maximum input length, truncation/padding to max_length is deactivated.
See Everything you always wanted to know about padding and truncation
padding typing.Union[bool, str] True To control the padding applied to your hf_tokenizer during tokenization. If None, will default to False or 'do_not_pad'.
See Everything you always wanted to know about padding and truncation
truncation typing.Union[bool, str] True To control truncation applied to your hf_tokenizer during tokenization. If None, will default to
False or do_not_truncate.
See Everything you always wanted to know about padding and truncation
is_split_into_words bool True The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to True
if your inputs are pre-tokenized (not numericalized)
slow_word_ids_func typing.Optional[typing.Callable] None If using a slow tokenizer, you will need to provide a slow_word_ids_func that accepts a tokenizer, example index, and batch encoding as arguments and returns the equivalent of a fast tokenizer's word_ids
tok_kwargs dict None Any other keyword arguments you want included when using your hf_tokenizer to tokenize your inputs
kwargs No Content
{% endraw %} {% raw %}
{% endraw %}

TokenClassBatchTokenizeTransform is used to exclude any of the target's tokens that we don't want to include in the loss calculation (e.g., padding, cls, sep, etc.).

Note also that is_split_into_words defaults to True, since token classification tasks expect a list of words and one label per word.

Examples

Using the mid-level API

Batch-Time Tokenization

Step 1: Get your Hugging Face objects.
{% raw %}
pretrained_model_name = "distilroberta-base"
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(
    pretrained_model_name, model_cls=AutoModelForTokenClassification, config_kwargs={"num_labels": n_labels}
)

hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('roberta',
 transformers.models.roberta.configuration_roberta.RobertaConfig,
 transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast,
 transformers.models.roberta.modeling_roberta.RobertaForTokenClassification)
{% endraw %}
Step 2: Create your DataBlock
{% raw %}
batch_tok_tfm = TokenClassBatchTokenizeTransform(
    hf_arch, hf_config, hf_tokenizer, hf_model, labeling_strategy_cls=BILabelingStrategy, target_label_names=labels
)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=TokenClassTextInput), TokenCategoryBlock(vocab=labels))

dblock = DataBlock(blocks=blocks, get_x=ColReader("tokens"), get_y=ColReader("ner_tags"), splitter=RandomSplitter())
{% endraw %}
Step 3: Build your DataLoaders
{% raw %}
dls = dblock.dataloaders(conll2003_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch()
{% endraw %} {% raw %}
len(b), b[0]["input_ids"].shape, b[1].shape
(2, torch.Size([4, 88]), torch.Size([4, 88]))
{% endraw %} {% raw %}
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=5, trunc_at=20)
word / target label
0 [('15', 'O'), ('-', 'O'), ('Christian', 'B-PER'), ('Cullen', 'I-PER'), (',', 'O'), ('14', 'O'), ('-', 'O'), ('Jeff', 'B-PER'), ('Wilson', 'I-PER'), (',', 'O'), ('13', 'O'), ('-', 'O'), ('Walter', 'B-PER'), ('Little', 'I-PER'), (',', 'O'), ('12', 'O'), ('-', 'O'), ('Frank', 'B-PER'), ('Bunce', 'I-PER'), (',', 'O')]
1 [('In', 'O'), ('New', 'B-LOC'), ('York', 'I-LOC'), (',', 'O'), ('Garret', 'B-PER'), ('Anderson', 'I-PER'), ('and', 'O'), ('Gary', 'B-PER'), ('DiSarcina', 'I-PER'), ('drove', 'O'), ('in', 'O'), ('two', 'O'), ('runs', 'O'), ('apiece', 'O'), ('in', 'O'), ('a', 'O'), ('five-run', 'O'), ('first', 'O'), ('inning', 'O'), ('and', 'O')]
2 [('But', 'O'), ('the', 'O'), ('official', 'O'), (',', 'O'), ('Aryeh', 'B-PER'), ('Shumer', 'I-PER'), (',', 'O'), ('said', 'O'), ('it', 'O'), ('was', 'O'), ('only', 'O'), ('fitting', 'O'), ('that', 'O'), ('Weizman', 'B-PER'), ('and', 'O'), ('Arafat', 'B-PER'), ('should', 'O'), ('talk', 'O'), ('after', 'O'), ('the', 'O')]
3 [('Serbian', 'B-MISC'), ('officials', 'O'), ('have', 'O'), ('denied', 'O'), ('any', 'O'), ('abuses', 'O'), ('occurred', 'O'), ('during', 'O'), ('a', 'O'), ('10-day', 'O'), ('registration', 'O'), ('period', 'O'), ('and', 'O'), ('the', 'O'), ('Bosnian', 'B-MISC'), ('Serbs', 'I-MISC'), (',', 'O'), ('angry', 'O'), ('at', 'O'), ('the', 'O')]
{% endraw %}

Passing extra information

Step 1a: Get your Hugging Face objects.
{% raw %}
pretrained_model_name = "distilroberta-base"
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(
    pretrained_model_name, model_cls=AutoModelForTokenClassification, config_kwargs={"num_labels": n_labels}
)

hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('roberta',
 transformers.models.roberta.configuration_roberta.RobertaConfig,
 transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast,
 transformers.models.roberta.modeling_roberta.RobertaForTokenClassification)
{% endraw %}
Step 1b: Preprocess the dataset
{% raw %}
preprocessor = TokenClassPreprocessor(
    hf_tokenizer,
    label_names=labels,
    id_attr="id",
    word_list_attr="tokens",
    label_list_attr="ner_tags",
    tok_kwargs={"max_length": 128},
)
proc_df = preprocessor.process_df(conll2003_df)
proc_df.head(2)
proc_tokens proc_ner_tags chunk_tags id ner_tags pos_tags tokens
0 [EU, rejects, German, call, to, boycott, British, lamb, .] [3, 0, 7, 0, 0, 0, 7, 0, 0] [11, 21, 11, 12, 21, 22, 11, 12, 0] 0 [3, 0, 7, 0, 0, 0, 7, 0, 0] [22, 42, 16, 21, 35, 37, 16, 21, 7] [EU, rejects, German, call, to, boycott, British, lamb, .]
1 [Peter, Blackburn] [1, 2] [11, 12] 1 [1, 2] [22, 22] [Peter, Blackburn]
{% endraw %}
Step 2: Create your DataBlock
{% raw %}
batch_tok_tfm = TokenClassBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, target_label_names=labels)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=TokenClassTextInput), TokenCategoryBlock(vocab=labels))


def get_x(item):
    return {"id": item.id, "text": item.proc_tokens}


dblock = DataBlock(blocks=blocks, get_x=get_x, get_y=ColReader("proc_ner_tags"), splitter=RandomSplitter())
{% endraw %}
Step 3: Build your DataLoaders
{% raw %}
dls = dblock.dataloaders(proc_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch()
b[0].keys()
dict_keys(['input_ids', 'attention_mask', 'id', 'labels'])
{% endraw %} {% raw %}
len(b), b[0]["input_ids"].shape, b[1].shape
(2, torch.Size([4, 130]), torch.Size([4, 130]))
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=5, trunc_at=20)
word / target label
0 [('MARKET', 'O'), ('TALK', 'O'), ('-', 'O'), ('USDA', 'B-ORG'), ('net', 'O'), ('change', 'O'), ('in', 'O'), ('weekly', 'O'), ('export', 'O'), ('commitments', 'O'), ('for', 'O'), ('the', 'O'), ('week', 'O'), ('ended', 'O'), ('August', 'O'), ('22', 'O'), (',', 'O'), ('includes', 'O'), ('old', 'O'), ('crop', 'O')]
1 [('The', 'O'), ('Brady', 'B-PER'), ('bill', 'O'), (',', 'O'), ('calling', 'O'), ('for', 'O'), ('a', 'O'), ('waiting', 'O'), ('period', 'O'), ('before', 'O'), ('someone', 'O'), ('could', 'O'), ('buy', 'O'), ('a', 'O'), ('gun', 'O'), ('so', 'O'), ('a', 'O'), ('background', 'O'), ('check', 'O'), ('could', 'O')]
2 [('The', 'O'), ('mayor', 'O'), ('of', 'O'), ('Acatepec', 'B-LOC'), (',', 'O'), ('a', 'O'), ('small', 'O'), ('town', 'O'), ('some', 'O'), ('310', 'O'), ('miles', 'O'), ('(', 'O'), ('500', 'O'), ('km', 'O'), (')', 'O'), ('south', 'O'), ('of', 'O'), ('Mexico', 'B-LOC'), ('City', 'I-LOC'), (',', 'O')]
3 [('A', 'O'), ('few', 'O'), ('years', 'O'), ('ago', 'O'), (',', 'O'), ('barter', 'O'), ('deals', 'O'), ('accounted', 'O'), ('for', 'O'), ('up', 'O'), ('to', 'O'), ('25-30', 'O'), ('percent', 'O'), ('of', 'O'), ('Russian', 'B-MISC'), ('exports', 'O'), ('because', 'O'), ('"', 'O'), ('thousands', 'O'), ('(', 'O')]
{% endraw %}

Tests

The tests below ensure that the core DataBlock code above works for all pretrained token classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.

Note: Feel free to modify the code below to test whatever pretrained token classification models you are working with ... and if any of them fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).
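The test harness itself isn't shown here, but a rough sketch of what such a loop might look like, reusing only the APIs demonstrated above (the checkpoint names and the small DataFrame slice are just illustrative), is:

{% raw %}
# Hypothetical list of checkpoints to try; swap in whatever you're working with
pretrained_model_names = ["distilroberta-base", "bert-base-cased"]

results = []
for model_name in pretrained_model_names:
    arch, tok_cls, error = None, None, None
    try:
        arch, config, tokenizer, model = NLP.get_hf_objects(
            model_name, model_cls=AutoModelForTokenClassification, config_kwargs={"num_labels": len(labels)}
        )
        tok_cls = type(tokenizer).__name__

        batch_tok_tfm = TokenClassBatchTokenizeTransform(arch, config, tokenizer, model, target_label_names=labels)
        blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=TokenClassTextInput), TokenCategoryBlock(vocab=labels))
        dblock = DataBlock(blocks=blocks, get_x=ColReader("tokens"), get_y=ColReader("ner_tags"), splitter=RandomSplitter())

        # build DataLoaders on a small slice and make sure a batch comes out the other end
        dls = dblock.dataloaders(conll2003_df.head(128), bs=4)
        b = dls.one_batch()
        result = "PASSED"
    except Exception as err:
        result, error = "FAILED", err

    results.append({"arch": arch, "tokenizer": tok_cls, "model_name": model_name, "result": result, "error": error})

pd.DataFrame(results)
{% endraw %}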

{% raw %}
raw_datasets = load_dataset("conll2003")
conll2003_df = pd.DataFrame(raw_datasets["train"])

labels = raw_datasets["train"].features["ner_tags"].feature.names
Reusing dataset conll2003 (/home/wgilliam/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6)
{% endraw %} {% raw %}
arch tokenizer model_name result error
0 albert AlbertTokenizerFast hf-internal-testing/tiny-albert PASSED
1 bert BertTokenizerFast hf-internal-testing/tiny-bert PASSED
2 big_bird BigBirdTokenizerFast google/bigbird-roberta-base PASSED
3 camembert CamembertTokenizerFast camembert-base PASSED
4 convbert ConvBertTokenizerFast YituTech/conv-bert-base PASSED
5 deberta DebertaTokenizerFast hf-internal-testing/tiny-deberta PASSED
6 bert BertTokenizerFast sshleifer/tiny-distilbert-base-cased PASSED
7 electra ElectraTokenizerFast hf-internal-testing/tiny-electra PASSED
8 funnel FunnelTokenizerFast huggingface/funnel-small-base PASSED
9 gpt2 GPT2TokenizerFast sshleifer/tiny-gpt2 PASSED
10 layoutlm LayoutLMTokenizerFast hf-internal-testing/tiny-layoutlm PASSED
11 longformer LongformerTokenizerFast allenai/longformer-base-4096 PASSED
12 mpnet MPNetTokenizerFast microsoft/mpnet-base PASSED
13 ibert RobertaTokenizerFast kssteven/ibert-roberta-base PASSED
14 mobilebert MobileBertTokenizerFast google/mobilebert-uncased PASSED
15 rembert RemBertTokenizerFast google/rembert PASSED
16 roformer RoFormerTokenizerFast junnyu/roformer_chinese_sim_char_ft_small PASSED
17 roberta RobertaTokenizerFast roberta-base PASSED
18 squeezebert SqueezeBertTokenizerFast squeezebert/squeezebert-uncased PASSED
19 xlm_roberta XLMRobertaTokenizerFast xlm-roberta-base PASSED
20 xlnet XLNetTokenizerFast xlnet-base-cased PASSED
{% endraw %}