---
title: text.modeling.token_classification
keywords: fastai
sidebar: home_sidebar
summary: "This module contains custom models, loss functions, custom splitters, etc. for token classification tasks (e.g., named entity recognition (NER), part-of-speech tagging (POS), etc.). The objective of token classification is to predict the correct label for each token provided in the input. In the computer vision world, this is akin to segmentation tasks, where we attempt to predict the class/label for each pixel in an image."
description: "This module contains custom models, loss functions, custom splitters, etc. for token classification tasks (e.g., named entity recognition (NER), part-of-speech tagging (POS), etc.). The objective of token classification is to predict the correct label for each token provided in the input. In the computer vision world, this is akin to segmentation tasks, where we attempt to predict the class/label for each pixel in an image."
nb_path: "nbs/13_text-modeling-token-classification.ipynb"
---
raw_datasets = load_dataset("conll2003")
labels = raw_datasets["train"].features["ner_tags"].feature.names
print(f"Labels: {labels}")
conll2003_df = pd.DataFrame(raw_datasets["train"])
conll2003_df.head()
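The `ner_tags` column holds integer class ids. Purely for readability, we can map them back to their label names; the extra column below is just illustrative and isn't used for training.

```python
# Illustrative only: add a human-readable view of the integer `ner_tags`
conll2003_df["ner_labels"] = conll2003_df["ner_tags"].apply(lambda tag_ids: [labels[i] for i in tag_ids])
conll2003_df[["tokens", "ner_labels"]].head(2)
```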
model_cls = AutoModelForTokenClassification
pretrained_model_name = "roberta-base"
config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = len(labels)
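Optionally (not required for BLURR, and shown here only as a sketch), you can also record the human-readable label mapping on the config before the model is loaded. `id2label` and `label2id` are standard Hugging Face config attributes, and populating them means the underlying model reports real label names instead of generic `LABEL_n` ids.

```python
# Optional: standard Hugging Face config attributes for the label <-> id mapping
config.id2label = {idx: label for idx, label in enumerate(labels)}
config.label2id = {label: idx for idx, label in enumerate(labels)}
```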
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls, config=config)
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
test_eq(hf_config.num_labels, len(labels))
batch_tok_tfm = TokenClassBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=TokenClassTextInput), TokenCategoryBlock(vocab=labels))
dblock = DataBlock(blocks=blocks, get_x=ColReader("tokens"), get_y=ColReader("ner_tags"), splitter=RandomSplitter())
dls = dblock.dataloaders(conll2003_df, bs=4)
b = dls.one_batch()
dls.show_batch(dataloaders=dls, max_n=2)
model = BaseModelWrapper(hf_model)
learn_cbs = [BaseModelCallback]
fit_cbs = [TokenClassMetricsCallback()]
learn = Learner(dls, model, opt_func=partial(Adam), loss_func=PreCalculatedCrossEntropyLoss(), cbs=learn_cbs, splitter=blurr_splitter)
learn.freeze()
learn.summary()
b = dls.one_batch()
preds = learn.model(b[0])
len(preds), type(preds), preds.keys()
len(b), len(b[0]), b[0]["input_ids"].shape, len(b[1]), b[1].shape
preds.logits.shape
print(preds.logits.view(-1, preds.logits.shape[-1]).shape, b[1].view(-1).shape)
test_eq(preds.logits.view(-1, preds.logits.shape[-1]).shape[0], b[1].view(-1).shape[0])
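To make the shapes above concrete: the token classification loss is just cross entropy over the flattened `(batch * sequence_length)` token positions. Below is a minimal sketch of the equivalent manual calculation, assuming (as is standard for Hugging Face token classification) that padded and special-token positions are labeled with the `-100` ignore index.

```python
import torch.nn.functional as F

# Flatten logits to (bs * seq_len, num_labels) and targets to (bs * seq_len,),
# skipping any position labeled with the -100 ignore index
manual_loss = F.cross_entropy(preds.logits.view(-1, preds.logits.shape[-1]), b[1].view(-1), ignore_index=-100)
print(manual_loss)
```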
print(len(learn.opt.param_groups))
learn.unfreeze()
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(1, lr_max=3e-5, moms=(0.8, 0.7, 0.8), cbs=fit_cbs)
print(learn.token_classification_report)
learn.show_results(learner=learn, max_n=2, trunc_at=10)
The default `Learner.predict` method returns a prediction per subtoken, including the special tokens for each architecture's tokenizer. Starting with version 2.0 of BLURR, we bring token prediction in line with Hugging Face's token classification pipeline, both in terms of supporting the same aggregation strategies via BLURR's `TokenAggregationStrategies` class, and in terms of the output, via BLURR's `@patch`ed `Learner` method, `blurr_predict_tokens`.
res = learn.blurr_predict_tokens(
items=["My name is Wayde and I live in San Diego and using Hugging Face", "Bayern Munich is a soccer team in Germany"],
aggregation_strategy="max",
)
print(len(res))
print(res[1])
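Since BLURR mirrors the Hugging Face token classification pipeline here, you should be able to swap in any of its aggregation strategies. A quick sketch, assuming BLURR accepts the same strategy names as the pipeline (`"simple"`, `"first"`, `"average"`, and `"max"`):

```python
# Compare the Hugging Face-style aggregation strategies on the same input
# (strategy names are assumed to match the HF token classification pipeline)
for strategy in ["simple", "first", "average", "max"]:
    res = learn.blurr_predict_tokens(items=["Bayern Munich is a soccer team in Germany"], aggregation_strategy=strategy)
    print(f"{strategy}: {res[0]}\n")
```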
txt = "Hi! My name is Wayde Gilliam from ohmeow.com. I live in California."
txt2 = "I wish covid was over so I could go to Germany and watch Bayern Munich play in the Bundesliga."
res = learn.blurr_predict_tokens(txt)
print(res)
results = learn.blurr_predict_tokens([txt, txt2])
for res in results:
print(f"{res}\n")
export_fname = "tok_class_learn_export"
learn.export(fname=f"{export_fname}.pkl")
inf_learn = load_learner(fname=f"{export_fname}.pkl")
results = inf_learn.blurr_predict_tokens([txt, txt2])
for res in results:
print(f"{res}\n")
We can do all of the above with much less code using BLURR's high-level `BlearnerForTokenClassification` API.
learn = BlearnerForTokenClassification.from_data(
conll2003_df,
"distilroberta-base",
tokens_attr="tokens",
token_labels_attr="ner_tags",
labels=labels,
dl_kwargs={"bs": 2},
)
learn.unfreeze()
learn.dls.show_batch(dataloaders=learn.dls, max_n=2)
learn.fit_one_cycle(1, lr_max=3e-5, moms=(0.8, 0.7, 0.8), cbs=[BlearnerForTokenClassification.get_metrics_cb()])
learn.show_results(learner=learn, max_n=2, trunc_at=10)
print(learn.token_classification_report)
txt = "Hi! My name is Wayde Gilliam from ohmeow.com. I live in California."
txt2 = "I wish covid was over so I could watch Lewandowski score some more goals for Bayern Munich in the Bundesliga."
results = learn.predict([txt, txt2])
for res in results:
print(f"{res}\n")
The tests below ensure that the token classification training code above works for all pretrained token classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.
Note: Feel free to modify the code below to test whatever pretrained token classification models you are working with, and if any of them fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).
raw_datasets = load_dataset("conll2003")
labels = raw_datasets["train"].features["ner_tags"].feature.names
conll2003_df = pd.DataFrame(raw_datasets["train"])
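Below is a minimal sketch of what such a compatibility loop might look like. The checkpoint names are just examples (swap in whatever you want to test), and a small sample of the data is used to keep each run quick.

```python
# Hypothetical smoke test: try a handful of checkpoints end-to-end
pretrained_model_names = ["bert-base-cased", "distilroberta-base", "xlm-roberta-base"]

# a small, re-indexed sample keeps each run fast
sample_df = conll2003_df.sample(1000, random_state=42).reset_index(drop=True)

for model_name in pretrained_model_names:
    print(f"=== {model_name} ===")
    learn = BlearnerForTokenClassification.from_data(
        sample_df,
        model_name,
        tokens_attr="tokens",
        token_labels_attr="ner_tags",
        labels=labels,
        dl_kwargs={"bs": 2},
    )
    learn.fit_one_cycle(1, lr_max=3e-5, cbs=[BlearnerForTokenClassification.get_metrics_cb()])
    print(learn.token_classification_report)
```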