---
title: text.data.token_classification
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for token classification tasks (e.g., named entity recognition (NER), part-of-speech tagging (POS), etc.)"
description: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for token classification tasks (e.g., named entity recognition (NER), part-of-speech tagging (POS), etc.)"
nb_path: "nbs/13_text-data-token-classification.ipynb"
---
raw_datasets = load_dataset("conll2003")
raw_datasets
We need to get a list of the distinct entities we want to predict. If they are represented as lists in their raw/readable form in another attribute/column of our dataset, we could build a sorted list of the distinct values with something like this: labels = sorted(list(set([lbls for sublist in germ_eval_df.labels.tolist() for lbls in sublist]))).

Fortunately, the conll2003 dataset allows us to get at this list directly using the code below.
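For instance, given a hypothetical DataFrame whose labels column holds lists of readable tags (standing in for something like the germ_eval_df referenced above), the pattern looks like this:

```python
import pandas as pd

# a toy DataFrame standing in for a dataset whose labels are stored as readable lists
toy_df = pd.DataFrame({"labels": [["B-PER", "O", "B-LOC"], ["O", "B-PER", "I-PER"]]})

# flatten the per-example label lists and keep the sorted, distinct values
distinct_labels = sorted(set(lbl for sublist in toy_df.labels.tolist() for lbl in sublist))
print(distinct_labels)  # ['B-LOC', 'B-PER', 'I-PER', 'O']
```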
print(raw_datasets["train"].features["chunk_tags"].feature.names[:20])
print(raw_datasets["train"].features["ner_tags"].feature.names[:20])
print(raw_datasets["train"].features["pos_tags"].feature.names[:20])
labels = raw_datasets["train"].features["ner_tags"].feature.names
labels
conll2003_df = pd.DataFrame(raw_datasets["train"])
model_cls = AutoModelForTokenClassification
pretrained_model_name = "roberta-base" # "bert-base-multilingual-cased"
n_labels = len(labels)
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(
    pretrained_model_name, model_cls=model_cls, config_kwargs={"num_labels": n_labels}
)
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
Starting with version 2.0, BLURR provides a token classification preprocessing class that can be used to preprocess DataFrames or Hugging Face Datasets. We also introduce a novel way of handling long documents for this task that ensures the tokens associated with a word are not split across "chunked" documents. See below for an example.
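To make the idea concrete, here is a rough sketch of word-aware chunking with an overlapping word stride. This is an illustration only, not BLURR's actual implementation; it assumes the hf_tokenizer and conll2003_df created above.

```python
# Illustration only -- a simplified take on word-aware chunking, not BLURR's implementation.
def chunk_by_words(words, labels, max_length=8, word_stride=2):
    # how many sub-tokens each word produces (ignoring special tokens)
    toks_per_word = [len(hf_tokenizer(w, add_special_tokens=False)["input_ids"]) for w in words]

    chunks, start = [], 0
    while start < len(words):
        n_toks, end = 0, start
        # greedily add whole words until the token budget would be exceeded
        while end < len(words) and n_toks + toks_per_word[end] <= max_length:
            n_toks += toks_per_word[end]
            end += 1
        end = max(end, start + 1)  # always include at least one word
        chunks.append((words[start:end], labels[start:end]))
        if end >= len(words):
            break
        # the next chunk re-includes the last `word_stride` words for context
        start = max(end - word_stride, start + 1)
    return chunks

example = conll2003_df.iloc[0]
for chunk_words, chunk_labels in chunk_by_words(example["tokens"], example["ner_tags"]):
    print(chunk_words, chunk_labels)
```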
preprocessor = TokenClassPreprocessor(
    hf_tokenizer,
    chunk_examples=True,
    word_stride=2,
    label_names=labels,
    id_attr="id",
    word_list_attr="tokens",
    label_list_attr="ner_tags",
    tok_kwargs={"max_length": 8},
)
proc_df = preprocessor.process_df(conll2003_df)
print(len(proc_df))
print(preprocessor.label_names)
proc_df.head(4)
conll2003_labeled_df = conll2003_df.copy()
conll2003_labeled_df.ner_tags = conll2003_labeled_df.ner_tags.apply(lambda v: [labels[lbl_id] for lbl_id in v])
conll2003_labeled_df.head(5)
preprocessor = TokenClassPreprocessor(
    hf_tokenizer, label_names=labels, id_attr="id", word_list_attr="tokens", label_list_attr="ner_tags", tok_kwargs={"max_length": 8}
)
proc_df = preprocessor.process_df(conll2003_labeled_df)
print(len(proc_df))
print(preprocessor.label_names)
proc_df.head(4)
Here we include a BaseLabelingStrategy abstract class and several different strategies for assigning labels to your tokenized inputs. The "only first token" and "B/I" labeling strategies are discussed in the "Token Classification" section in part 7 of the Hugging Face Transformers course.
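As a conceptual sketch (not BLURR's actual classes), the difference between the two strategies shows up when a word is split into multiple sub-tokens: "only first token" labels just the first sub-token and masks the rest with the ignore index, while "B/I" labels the remaining sub-tokens with the word's corresponding I- tag.

```python
# Conceptual sketch of the two labeling strategies (not BLURR's implementation).
# Assumes `hf_tokenizer`, `labels`, and `conll2003_df` from the cells above.
def align_labels(word_ids, word_label_ids, strategy="only_first"):
    aligned, prev_word_id = [], None
    for word_id in word_ids:
        if word_id is None:  # special tokens get the ignore index
            aligned.append(-100)
        elif word_id != prev_word_id:  # the first sub-token of a word keeps its label
            aligned.append(word_label_ids[word_id])
        elif strategy == "only_first":  # remaining sub-tokens are masked ...
            aligned.append(-100)
        else:  # ... or, with the "B/I" strategy, get the word's I- tag
            lbl = labels[word_label_ids[word_id]]
            aligned.append(labels.index("I" + lbl[1:]) if lbl.startswith("B-") else word_label_ids[word_id])
        prev_word_id = word_id
    return aligned

be = hf_tokenizer(conll2003_df.iloc[0]["tokens"], is_split_into_words=True)
print(align_labels(be.word_ids(), conll2003_df.iloc[0]["ner_tags"], strategy="only_first"))
print(align_labels(be.word_ids(), conll2003_df.iloc[0]["ner_tags"], strategy="bi"))
```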
for idx in range(3):
    raw_word_list = conll2003_df.iloc[idx]["tokens"]
    raw_label_list = conll2003_df.iloc[idx]["ner_tags"]
    be = hf_tokenizer(raw_word_list, is_split_into_words=True)
    input_ids = be["input_ids"]
    targ_ids = [-100 if (word_id is None) else raw_label_list[word_id] for word_id in be.word_ids()]
    tok_labels = get_token_labels_from_input_ids(hf_tokenizer, input_ids, targ_ids, labels)
    for tok_label, targ_id in zip(tok_labels, [label_id for label_id in targ_ids if label_id != -100]):
        test_eq(tok_label[1], labels[targ_id])
for idx in range(5):
    raw_word_list = conll2003_df.iloc[idx]["tokens"]
    raw_label_list = conll2003_df.iloc[idx]["ner_tags"]
    be = hf_tokenizer(raw_word_list, is_split_into_words=True)
    input_ids = be["input_ids"]
    targ_ids = [-100 if (word_id is None) else raw_label_list[word_id] for word_id in be.word_ids()]
    tok_labels = get_token_labels_from_input_ids(hf_tokenizer, input_ids, targ_ids, labels)
    word_labels = get_word_labels_from_token_labels(hf_arch, hf_tokenizer, tok_labels)
    for word_label, raw_word, raw_label_id in zip(word_labels, raw_word_list, raw_label_list):
        test_eq(word_label[0], raw_word)
        test_eq(word_label[1], labels[raw_label_id])
TokenCategorize modifies the fastai Categorize transform in a couple of ways. First, it allows your targets to consist of a Category per token, and second, it uses an ignore_token_id to mask sub-tokens that don't need a prediction. For example, the targets of special tokens (e.g., pad, cls, sep) are set to ignore_token_id, as are any subsequent sub-tokens of a word that is broken into more than one sub-token.
Again, we define a custom class, TokenClassTextInput, for the methods decorated with @typedispatch to use so that we can override how token classification inputs/targets are assembled, as well as how the data is shown via methods like show_batch and show_results.
TokenClassBatchTokenizeTransform is used to exclude any of the target's tokens that we don't want included in the loss calculation (e.g., padding, cls, sep, etc.). Note also that we default is_split_into_words to True since token classification tasks expect a list of words and a label for each word.
pretrained_model_name = "distilroberta-base"
n_labels = len(labels)
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(
    pretrained_model_name, model_cls=AutoModelForTokenClassification, config_kwargs={"num_labels": n_labels}
)
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
batch_tok_tfm = TokenClassBatchTokenizeTransform(
    hf_arch, hf_config, hf_tokenizer, hf_model, labeling_strategy_cls=BILabelingStrategy, target_label_names=labels
)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=TokenClassTextInput), TokenCategoryBlock(vocab=labels))
dblock = DataBlock(blocks=blocks, get_x=ColReader("tokens"), get_y=ColReader("ner_tags"), splitter=RandomSplitter())
dls = dblock.dataloaders(conll2003_df, bs=4)
b = dls.one_batch()
len(b), b[0]["input_ids"].shape, b[1].shape
dls.show_batch(dataloaders=dls, max_n=5, trunc_at=20)
pretrained_model_name = "distilroberta-base"
n_labels = len(labels)
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(
    pretrained_model_name, model_cls=AutoModelForTokenClassification, config_kwargs={"num_labels": n_labels}
)
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
preprocessor = TokenClassPreprocessor(
    hf_tokenizer,
    label_names=labels,
    id_attr="id",
    word_list_attr="tokens",
    label_list_attr="ner_tags",
    tok_kwargs={"max_length": 128},
)
proc_df = preprocessor.process_df(conll2003_df)
proc_df.head(2)
batch_tok_tfm = TokenClassBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, target_label_names=labels)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=TokenClassTextInput), TokenCategoryBlock(vocab=labels))
def get_x(item):
    return {"id": item.id, "text": item.proc_tokens}
dblock = DataBlock(blocks=blocks, get_x=get_x, get_y=ColReader("proc_ner_tags"), splitter=RandomSplitter())
dls = dblock.dataloaders(proc_df, bs=4)
b = dls.one_batch()
b[0].keys()
len(b), b[0]["input_ids"].shape, b[1].shape
dls.show_batch(dataloaders=dls, max_n=5, trunc_at=20)
The tests below ensure the core DataBlock code above works for all pretrained token classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.
Note: Feel free to modify the code below to test whatever pretrained token classification models you are working with ... and if any of your pretrained token classification models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).
raw_datasets = load_dataset("conll2003")
conll2003_df = pd.DataFrame(raw_datasets["train"])
labels = raw_datasets["train"].features["ner_tags"].feature.names