---
title: modeling.token_classification
keywords: fastai
sidebar: home_sidebar
summary: "This module contains custom models, loss functions, custom splitters, etc... for token classification tasks like named entity recognition."
description: "This module contains custom models, loss functions, custom splitters, etc... for token classification tasks like named entity recognition."
nb_path: "nbs/02a_modeling-token-classification.ipynb"
---
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
The objective of token classification is to predict the correct label for each token in the input. In the computer vision world, this is akin to segmentation tasks, where we attempt to predict the class/label for each pixel in an image. Named entity recognition (NER) is an example of token classification in the NLP space.
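For example, given a whitespace-tokenized sentence, the target is one label per token. Here's a made-up, illustrative pair using the BIO tagging scheme (it isn't drawn from the dataset we'll use below):
# an illustrative example: one BIO label per input token
example_tokens = ['My', 'name', 'is', 'Wayde', 'and', 'I', 'live', 'in', 'San', 'Diego']
example_labels = ['O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
assert len(example_tokens) == len(example_labels)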
df_converters = {'tokens': ast.literal_eval, 'labels': ast.literal_eval, 'nested-labels': ast.literal_eval}
# full nlp dataset
# germ_eval_df = pd.read_csv('./data/task-token-classification/germeval2014ner_cleaned.csv', converters=df_converters)
# demo nlp dataset
germ_eval_df = pd.read_csv('./germeval2014_sample.csv', converters=df_converters)
print(len(germ_eval_df))
germ_eval_df.head()
We are only going to be working with a small sample from the GermEval 2014 dataset ... so the results might not be all that great :).
labels = sorted(list(set([lbls for sublist in germ_eval_df.labels.tolist() for lbls in sublist])))
print(labels)
model_cls = AutoModelForTokenClassification
pretrained_model_name = "bert-base-multilingual-cased"
config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = len(labels)
Notice above how I set the config.num_labels attribute to the number of labels we want our model to be able to predict. The model will build its last layer (the classification head) accordingly; this is essentially transfer learning.
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name,
                                                                   model_cls=model_cls,
                                                                   config=config)
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
test_eq(hf_config.num_labels, len(labels))
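As a further sanity check, we can peek at the model's classification head directly. For BERT-style token classification models the final linear layer is exposed as classifier (note: this attribute name varies across architectures, so the check below is BERT-specific):
# BERT-style models expose the token classification head as `classifier`;
# its output size should equal the number of labels we configured above
print(hf_model.classifier)
test_eq(hf_model.classifier.out_features, len(labels))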
before_batch_tfm = HF_TokenClassBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
                                                     is_split_into_words=True,
                                                     tok_kwargs={ 'return_special_tokens_mask': True })
blocks = (
    HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_TokenClassInput),
    HF_TokenCategoryBlock(vocab=labels)
)
def get_y(inp):
    # return one (label, n_subtokens) tuple per original token so the targets
    # can be expanded to match the subtokens produced by the tokenizer
    return [(label, len(hf_tokenizer.tokenize(str(entity)))) for entity, label in zip(inp.tokens, inp.labels)]
dblock = DataBlock(blocks=blocks,
                   get_x=ColReader('tokens'),
                   get_y=get_y,
                   splitter=RandomSplitter())
We have to define a get_y that creates the same number of labels as there are subtokens for a particular token. For example, my name "Wayde" gets split up into two subtokens, "Way" and "##de". The label for "Wayde" is "B-PER" and we just repeat it for the subtokens. This all gets cleaned up when we show results and get predictions.
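To see this in action, here's a quick illustrative check against the first row of our DataFrame (not part of the original pipeline, just a demonstration):
# words like "Wayde" get split into multiple subtokens by the tokenizer
print(hf_tokenizer.tokenize('Wayde'))
# get_y returns one (label, n_subtokens) pair per original token
row = germ_eval_df.iloc[0]
print(get_y(row)[:5])
test_eq(len(get_y(row)), len(row.tokens))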
dls = dblock.dataloaders(germ_eval_df, bs=2)
dls.show_batch(dataloaders=dls, max_n=2)
model = HF_BaseModelWrapper(hf_model)
learn_cbs = [HF_BaseModelCallback]
fit_cbs = [HF_TokenClassMetricsCallback()]
learn = Learner(dls, model, opt_func=partial(Adam), cbs=learn_cbs, splitter=hf_splitter)
learn.freeze()
b = dls.one_batch()
preds = learn.model(b[0])
len(preds),preds[0].shape
len(b), len(b[0]), b[0]['input_ids'].shape, len(b[1]), b[1].shape
print(preds[0].view(-1, preds[0].shape[-1]).shape, b[1].view(-1).shape)
test_eq(preds[0].view(-1, preds[0].shape[-1]).shape[0], b[1].view(-1).shape[0])
print(len(learn.opt.param_groups))
learn.unfreeze()
learn.lr_find(suggestions=True)
learn.fit_one_cycle(1, lr_max=3e-5, moms=(0.8, 0.7, 0.8), cbs=fit_cbs)
print(learn.token_classification_report)
learn.show_results(learner=learn, max_n=2, trunc_at=10)
res = learn.blurr_predict('My name is Wayde and I live in San Diego'.split())
print(res[0][0])
The default Learner.predict method returns a prediction per subtoken, including the special tokens each architecture's tokenizer adds. blurr_predict_tokens, used below, returns predictions aligned to the original words instead.
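For reference, here's a simplified sketch of the word-alignment idea that blurr_predict_tokens is built around. This is not blurr's actual implementation; it assumes the predictions for special tokens have already been stripped from subtok_labels:
# simplified sketch (NOT blurr's actual code): collapse subtoken predictions
# back to word-level labels by keeping the first subtoken's label per word;
# assumes `subtok_labels` excludes the tokenizer's special tokens
def align_subtoken_preds_to_words(words, subtok_labels):
    word_labels, idx = [], 0
    for word in words:
        n_subtoks = len(hf_tokenizer.tokenize(word))
        word_labels.append(subtok_labels[idx])
        idx += n_subtoks
    return list(zip(words, word_labels))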
txt ="Hi! My name is Wayde Gilliam from ohmeow.com. I live in California."
txt2 = "I wish covid was over so I could go to Germany and watch Bayern Munich play in the Bundesliga."
res = learn.blurr_predict_tokens(txt.split())
for r in res: print(f'{[(tok, lbl) for tok,lbl in zip(r[0],r[1]) ]}\n')
res = learn.blurr_predict_tokens([txt.split(), txt2.split()])
for r in res: print(f'{[(tok, lbl) for tok,lbl in zip(r[0],r[1]) ]}\n')
It's interesting (and very cool) how well this model performs on English even though it was trained on a German corpus.
export_fname = 'tok_class_learn_export'
learn.export(fname=f'{export_fname}.pkl')
inf_learn = load_learner(fname=f'{export_fname}.pkl')
res = inf_learn.blurr_predict_tokens([txt.split(), txt2.split()])
for r in res: print(f'{[(tok, lbl) for tok,lbl in zip(r[0],r[1]) ]}\n')
... and ONNX
# @patch
# def predict_tokens(self:blurrONNX, items, **kargs):
# hf_before_batch_tfm = get_blurr_tfm(self.dls.before_batch)
# return _blurr_predict_tokens(self.predict, items, hf_before_batch_tfm)
# learn.blurr_to_onnx(export_fname, quantize=True)
# onnx_inf = blurrONNX(export_fname)
# res = onnx_inf.predict_tokens(txt.split())
# for r in res: print(f'{[(tok, lbl) for tok,lbl in zip(r[0],r[1]) ]}\n')
The tests below ensure the token classification training code above works for all pretrained token classification models available in huggingface. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would be required to download.
Note: Feel free to modify the code below to test whatever pretrained token classification models you are working with ... and if any of your pretrained token classification models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).
try: del learn; torch.cuda.empty_cache()
except: pass
[ model_type for model_type in BLURR.get_models(task='TokenClassification')
if (not model_type.__name__.startswith('TF')) ]
pretrained_model_names = [
'albert-base-v1',
'bert-base-multilingual-cased',
'camembert-base',
'distilbert-base-uncased',
'google/electra-small-discriminator',
'flaubert/flaubert_small_cased',
'huggingface/funnel-small-base',
'allenai/longformer-base-4096',
'microsoft/mpnet-base',
'google/mobilebert-uncased',
'roberta-base',
'squeezebert/squeezebert-uncased',
'xlm-mlm-en-2048',
'xlm-roberta-base',
'xlnet-base-cased'
]
#hide_output
model_cls = AutoModelForTokenClassification

bsz = 4
seq_sz = 64

test_results = []
for model_name in pretrained_model_names:
    error = None
    print(f'=== {model_name} ===\n')

    config = AutoConfig.from_pretrained(model_name)
    config.num_labels = len(labels)

    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(model_name,
                                                                      model_cls=model_cls,
                                                                      config=config)

    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\n')

    # not all architectures include a native pad_token (e.g., gpt2, ctrl, etc...), so we add one here
    if (hf_tokenizer.pad_token is None):
        hf_tokenizer.add_special_tokens({'pad_token': '<pad>'})
        hf_config.pad_token_id = hf_tokenizer.get_vocab()['<pad>']
        hf_model.resize_token_embeddings(len(hf_tokenizer))

    before_batch_tfm = HF_TokenClassBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
                                                         max_length=seq_sz,
                                                         padding='max_length',
                                                         is_split_into_words=True,
                                                         tok_kwargs={ 'return_special_tokens_mask': True })

    blocks = (
        HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_TokenClassInput),
        HF_TokenCategoryBlock(vocab=labels)
    )

    dblock = DataBlock(blocks=blocks,
                       get_x=ColReader('tokens'),
                       get_y=lambda inp: [
                           (label, len(hf_tokenizer.tokenize(str(entity))))
                           for entity, label in zip(inp.tokens, inp.labels)
                       ],
                       splitter=RandomSplitter())

    dls = dblock.dataloaders(germ_eval_df, bs=bsz)

    model = HF_BaseModelWrapper(hf_model)
    learn = Learner(dls,
                    model,
                    opt_func=partial(Adam),
                    cbs=[HF_BaseModelCallback],
                    splitter=hf_splitter).to_fp16()

    learn.create_opt()  # -> will create your layer groups based on your "splitter" function
    learn.freeze()

    b = dls.one_batch()

    try:
        print('*** TESTING DataLoaders ***')
        test_eq(len(b), 2)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, seq_sz]))
        test_eq(len(b[1]), bsz)

        print('*** TESTING Training/Results ***')
        learn.fit_one_cycle(1, lr_max=3e-5, moms=(0.8, 0.7, 0.8),
                            cbs=[HF_TokenClassMetricsCallback(tok_metrics=['accuracy'])])

        test_results.append((hf_arch, type(hf_tokenizer).__name__, type(hf_model).__name__, 'PASSED', ''))
        learn.show_results(learner=learn, max_n=2, trunc_at=10)
    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, type(hf_model).__name__, 'FAILED', err))
    finally:
        # cleanup
        del learn; torch.cuda.empty_cache()
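Once the loop finishes, the test_results list built above can be summarized in a DataFrame (the column names here are just our own choice):
# summarize which architectures passed/failed
test_results_df = pd.DataFrame(test_results, columns=['arch', 'tokenizer', 'model', 'result', 'error'])
test_results_df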