---
title: modeling.core
keywords: fastai
sidebar: home_sidebar
summary: "This module contains core custom models, loss functions, and a default layer group splitter for use in applying discriminative learning rates to your huggingface models trained via fastai"
description: "This module contains core custom models, loss functions, and a default layer group splitter for use in applying discriminative learning rates to your huggingface models trained via fastai"
nb_path: "nbs/02_modeling-core.ipynb"
---
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Note that HF_BaseModelWrapper includes some nifty code for passing in just the things your model needs, as not all transformer architectures require/use the same information.
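To make the idea concrete, here is a minimal, illustrative sketch (not blurr's actual implementation) of how a wrapper can inspect the underlying model's forward signature and pass along only the batch keys it accepts:

import inspect
import torch.nn as nn

class ForwardArgFilterWrapper(nn.Module):
    "Illustrative sketch only: forward just the inputs the wrapped model's forward() accepts"
    def __init__(self, hf_model):
        super().__init__()
        self.hf_model = hf_model
        # the argument names the underlying transformer's forward() knows about
        self.fwd_args = list(inspect.signature(hf_model.forward).parameters.keys())

    def forward(self, inputs):
        # drop any batch keys the model's forward() can't use
        model_inputs = {k: v for k, v in inputs.items() if k in self.fwd_args}
        return self.hf_model(**model_inputs)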
If you want to let your huggingface model calculate the loss for you, make sure you include the labels argument in your inputs and use HF_PreCalculatedLoss as your loss function. Even though we don't really need a loss function per se, we have to provide a custom loss class/function for fastai to function properly (e.g., one with decodes and activation methods). Why? Because these methods will get called in methods like show_results to get the actual predictions.
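For intuition, a bare-bones sketch of that kind of loss object might look like the following (this is an illustration of the pattern, not blurr's actual HF_PreCalculatedLoss): the loss itself is already computed inside the model's forward pass, so the class mainly exists to expose activation and decodes.

import torch
import torch.nn.functional as F

class PreCalculatedLossSketch:
    "Illustrative sketch only: the real loss comes from the huggingface model itself"
    def __call__(self, inp, targ, **kwargs):
        # the loss was already calculated inside the model (because `labels` was included
        # in the inputs), so nothing meaningful needs to happen here
        return torch.tensor(0.0)

    def activation(self, out):
        # used by fastai (e.g., in show_results/predict) to turn logits into probabilities
        return F.softmax(out, dim=-1)

    def decodes(self, out):
        # used by fastai to turn logits/probabilities into predicted class indices
        return out.argmax(dim=-1)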
We use a Callback for handling what is returned from the huggingface model. The return type is [ModelOutput](https://huggingface.co/transformers/main_classes/output.html#transformers.file_utils.ModelOutput), which makes it easy to return all the goodies we asked for.
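As a rough illustration of the idea (blurr's actual HF_BaseModelCallback does more than this), such a callback can unpack the ModelOutput in after_pred, stash everything the model returned on the Learner, and hand fastai just the logits:

from fastai.callback.core import Callback

class ModelOutputCallbackSketch(Callback):
    "Illustrative sketch only: unpack a huggingface ModelOutput so fastai sees plain tensors"
    def after_pred(self):
        model_output = self.pred                        # a dict-like transformers ModelOutput
        self.learn.blurr_model_outputs = model_output   # keep everything the model returned
        self.learn.pred = model_output['logits']        # what fastai's loss/metrics expect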
Note that your Learner's loss will be set for you only if the huggingface model returns one and you are using the HF_PreCalculatedLoss loss function.

Also note that anything else you asked the model to return (for example, the last hidden state, etc.) will be available to you via the blurr_model_outputs property attached to your Learner. For example, assuming you are using BERT for a classification task, if you have told your HF_BaseModelWrapper instance to return attentions, you'd be able to access them via learn.blurr_model_outputs['attentions'].
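For instance, once a Learner has been built (as below), one way to request attentions is via the standard huggingface config flag, after which the extra outputs can be inspected on the Learner (a hedged sketch; blurr may also expose its own argument for this):

hf_model.config.output_attentions = True   # standard huggingface flag for returning attention weights

# after a forward pass (e.g., a training step or learn.blurr_predict(...)),
# the extra outputs are available on the Learner; for BERT-style models this is one tensor per layer
attns = learn.blurr_model_outputs['attentions']
print(len(attns), attns[0].shape)          # n_layers, then [bs, n_heads, seq_len, seq_len]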
path = untar_data(URLs.IMDB_SAMPLE)
imdb_df = pd.read_csv(path/'texts.csv')
imdb_df.head()
model_cls = AutoModelForSequenceClassification
pretrained_model_name = "roberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)
blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)
dls.show_batch(dataloaders=dls, max_n=2)
model = HF_BaseModelWrapper(hf_model)
learn = Learner(dls,
                model,
                opt_func=partial(OptimWrapper, opt=torch.optim.Adam),
                loss_func=CrossEntropyLossFlat(),
                metrics=[accuracy],
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter)
learn.freeze()
.to_fp16() requires a GPU, so it had to be removed for the tests to run on GitHub. Let's check that we can get predictions.
We have to create our own summary methods above because fastai's version only works where things are represented by a single tensor, whereas in the case of huggingface transformers, a single sequence is represented by multiple tensors (in a dictionary). The change required to make this work is so minor that I think the fastai library can/will hopefully be updated to support this use case.
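To see what that means concretely, peek at a single batch: the inputs are a dictionary of tensors rather than one tensor (the exact keys depend on the tokenizer/architecture):

b = dls.one_batch()
print(list(b[0].keys()))         # e.g., ['input_ids', 'attention_mask'] (model dependent)
print(b[0]['input_ids'].shape)   # torch.Size([bs, seq_len])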
print(len(learn.opt.param_groups))
learn.lr_find(suggestions=True)
learn.fit_one_cycle(1, lr_max=1e-3)
learn.show_results(learner=learn, max_n=2, trunc_at=500)
Same as with summary, we need to replace fastai's Learner.predict method with the one above, which is able to work with inputs that are represented by multiple tensors included in a dictionary.
learn.blurr_predict('I really liked the movie')
learn.blurr_predict(['I really liked the movie', 'I really hated the movie'])
learn.unfreeze()
learn.fit_one_cycle(3, lr_max=slice(1e-7, 1e-4))
learn.recorder.plot_loss()
learn.show_results(learner=learn, max_n=2, trunc_at=500)
learn.blurr_predict("This was a really good movie")
learn.blurr_predict("Acting was so bad it was almost funny.")
export_fname = 'seq_class_learn_export'
learn.export(fname=f'{export_fname}.pkl')
inf_learn = load_learner(fname=f'{export_fname}.pkl')
inf_learn.blurr_predict("This movie should not be seen by anyone!!!!")
Much of the inspiration for the code below comes from Zach Mueller's excellent fastinference library, and in many places I simply adapted his code to work with blurr and the various huggingface transformers tasks.
# import onnxruntime as ort
# from onnxruntime.quantization import quantize_dynamic, QuantType

# @patch
# def blurr_to_onnx(self:Learner, fname='export', path=None, quantize=False, excluded_input_names=[]):
#     """Export model to `ONNX` format"""
#     if (path == None): path = self.path
#
#     dummy_b = self.dls.one_batch()
#
#     # inputs
#     for n in excluded_input_names:
#         if (n in dummy_b[0]): del dummy_b[0][n]
#
#     input_names = list(dummy_b[0].keys())
#     dynamic_axes = { n: {0:'batch_size', 1:'sequence'} for n in input_names if n in self.model.hf_model_fwd_args }
#
#     # outputs
#     output_names = [ f'output_{i}' for i in range(len(dummy_b) - self.dls.n_inp) ]
#     for n in output_names: dynamic_axes[n] = { 0:'batch_size' }
#
#     torch.onnx.export(model=self.model,
#                       args=dummy_b[:self.dls.n_inp],   # everything but the targets
#                       f=self.path/f'{fname}.onnx',     # onnx filename
#                       opset_version=11,                # required, otherwise we get errors
#                       input_names=input_names,         # transformer dictionary keys for input
#                       output_names=output_names,       # one for each target
#                       dynamic_axes=dynamic_axes)       # see above
#
#     if (quantize):
#         quant_model_fpath = self.path/f'{fname}-quant.onnx'
#         quant_model = quantize_dynamic(self.path/f'{fname}.onnx', quant_model_fpath, weight_type=QuantType.QUInt8)
#
#     dls_export = self.dls.new_empty()
#     dls_export.loss_func = self.loss_func
#     dls_export.hf_model_fwd_args = self.model.hf_model_fwd_args # we need this to exclude non-model args in onnx
#
#     torch.save(dls_export, self.path/f'{fname}-dls.pkl', pickle_protocol=2)
# learn.blurr_to_onnx(export_fname, quantize=True)
# class blurrONNX():
#     def __init__(self, fname='export', path=Path('.'), use_quant_version=False):
#         self.fname, self.path = fname, path
#
#         onnx_fname = f'{fname}-quant.onnx' if (use_quant_version) else f'{fname}.onnx'
#         self.ort_session = ort.InferenceSession(str(self.path/onnx_fname))
#
#         self.dls = torch.load(f'{self.path}/{fname}-dls.pkl')
#         self.trg_tfms = self.dls.tfms[self.dls.n_inp:]
#         self.tok_is_split_into_words = self.dls.before_batch[0].is_split_into_words
#         self.hf_model_fwd_args = self.dls.hf_model_fwd_args
#
#     def predict(self, items, rm_type_tfms=None):
#         is_split_str = self.tok_is_split_into_words and isinstance(items[0], str)
#         is_df = isinstance(items, pd.DataFrame)
#         if (not is_df and (is_split_str or not is_listy(items))): items = [items]
#
#         dl = self.dls.test_dl(items, rm_type_tfms=rm_type_tfms, num_workers=0)
#
#         outs = []
#         for b in dl:
#             xb = b[0]
#             inp = self._to_np(xb)
#
#             # remove any args not found in the transformers forward func
#             for k in list(inp.keys()):
#                 if (k not in self.hf_model_fwd_args): del inp[k]
#
#             res = self.ort_session.run(None, inp)
#             tensor_res = [ tensor(r) for r in res ]
#
#             probs = L([ self.dls.loss_func.activation(tr) for tr in tensor_res ])
#             decoded_preds = L([ self.dls.loss_func.decodes(tr) for tr in tensor_res ])
#
#             for i in range(len(xb['input_ids'])):
#                 item_probs = probs.itemgot(i)
#                 item_dec_preds = decoded_preds.itemgot(i)
#                 item_dec_labels = tuplify([ tfm.decode(item_dec_preds[tfm_idx])
#                                             for tfm_idx, tfm in enumerate(self.trg_tfms) ])
#
#                 outs.append((item_dec_labels, item_dec_preds, item_probs))
#
#         return outs
#
#     # ----- utility -----
#     def _to_np(self, xb): return { k: v.cpu().numpy() for k, v in xb.items() }
# onnx_inf = blurrONNX(export_fname)
# onnx_inf.predict(['I really liked the movie'])
# %timeit inf_learn.blurr_predict(['I really liked the movie', 'I hated everything in it'])
# %timeit onnx_inf.predict(['I really liked the movie', 'I hated everything in it'])
# onnx_inf = blurrONNX(export_fname, use_quant_version=True)
# onnx_inf.predict(['I hated everything in it'])
# %timeit inf_learn.blurr_predict(['I really liked the movie', 'I hated everything in it'])
# %timeit onnx_inf.predict(['I really liked the movie', 'I hated everything in it'])
The tests below ensure that the core training code above works for all the pretrained sequence classification models available in huggingface. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.
Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).
try: del learn; torch.cuda.empty_cache()
except: pass
[ model_type for model_type in BLURR.get_models(task='SequenceClassification')
if (not model_type.__name__.startswith('TF')) ]
pretrained_model_names = [
    'albert-base-v1',
    'facebook/bart-base',
    'bert-base-uncased',
    'sshleifer/tiny-ctrl',
    'camembert-base',
    'microsoft/deberta-base',
    'distilbert-base-uncased',
    'monologg/electra-small-finetuned-imdb',
    'flaubert/flaubert_small_cased',
    'huggingface/funnel-small-base',
    'gpt2',
    'allenai/led-base-16384',
    'allenai/longformer-base-4096',
    'sshleifer/tiny-mbart',
    'microsoft/mpnet-base',
    'google/mobilebert-uncased',
    'openai-gpt',
    # 'reformer-enwik8', (see model card; does not work with/require a tokenizer so no bueno here)
    'roberta-base',
    'squeezebert/squeezebert-uncased',
    # 'google/tapas-base', (requires pip install torch-scatter)
    'transfo-xl-wt103',
    'xlm-mlm-en-2048',
    'xlm-roberta-base',
    'xlnet-base-cased'
]
path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
model_cls = AutoModelForSequenceClassification
bsz = 2
seq_sz = 128
test_results = []
for model_name in pretrained_model_names:
    error = None
    print(f'=== {model_name} ===\n')

    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(model_name,
                                                                      model_cls=model_cls,
                                                                      config_kwargs={'num_labels': 2})
    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\nmodel:\t\t{type(hf_model).__name__}\n')

    # not all architectures include a native pad_token (e.g., gpt2, ctrl, etc...), so we add one here
    if (hf_tokenizer.pad_token is None):
        hf_tokenizer.add_special_tokens({'pad_token': '<pad>'})
        hf_config.pad_token_id = hf_tokenizer.get_vocab()['<pad>']
        hf_model.resize_token_embeddings(len(hf_tokenizer))

    blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, max_length=seq_sz, padding='max_length'),
              CategoryBlock)

    dblock = DataBlock(blocks=blocks,
                       get_x=ColReader('text'),
                       get_y=ColReader('label'),
                       splitter=ColSplitter(col='is_valid'))

    dls = dblock.dataloaders(imdb_df, bs=bsz)

    model = HF_BaseModelWrapper(hf_model)
    learn = Learner(dls,
                    model,
                    opt_func=partial(Adam),
                    loss_func=CrossEntropyLossFlat(),
                    metrics=[accuracy],
                    cbs=[HF_BaseModelCallback],
                    splitter=hf_splitter).to_fp16()

    learn.create_opt()  # -> will create your layer groups based on your "splitter" function
    learn.freeze()

    b = dls.one_batch()

    try:
        print('*** TESTING DataLoaders ***')
        test_eq(len(b), bsz)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, seq_sz]))
        test_eq(len(b[1]), bsz)

        # print('*** TESTING One pass through the model ***')
        # preds = learn.model(b[0])
        # test_eq(len(preds[0]), bsz)
        # test_eq(preds[0].shape, torch.Size([bsz, 2]))

        print('*** TESTING Training/Results ***')
        learn.fit_one_cycle(1, lr_max=1e-3)

        test_results.append((hf_arch, type(hf_tokenizer).__name__, type(hf_model).__name__, 'PASSED', ''))
        learn.show_results(learner=learn, max_n=2, trunc_at=250)
    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, type(hf_model).__name__, 'FAILED', err))
    finally:
        # cleanup
        del learn; torch.cuda.empty_cache()
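Once the loop finishes, it can be handy to review the collected results in tabular form, for example by dropping them into a DataFrame (the column names below are just for illustration):

test_results_df = pd.DataFrame(test_results, columns=['arch', 'tokenizer', 'model', 'result', 'error'])
test_results_df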