---
title: blurr
keywords: fastai
sidebar: home_sidebar
summary: "A library that integrates huggingface transformers with version 2 of the fastai framework"
description: "A library that integrates huggingface transformers with version 2 of the fastai framework"
nb_path: "nbs/index.ipynb"
---
You can now pip install blurr via `pip install ohmeow-blurr`.

Or, even better, as this library is under very active development, create an editable install like this:
```bash
git clone https://github.com/ohmeow/blurr.git
cd blurr
pip install -e ".[dev]"
```
The initial release includes everything you need for sequence classification and question answering tasks. Support for token classification and summarization is incoming. Please check the documentation for more thorough examples of how to use this package.
The following two packages need to be installed for blurr to work: fastai (version 2) and huggingface transformers. Once they are in place, import what you need:
```python
import torch
from transformers import *
from fastai.text.all import *

from blurr.data.all import *
from blurr.modeling.all import *
```
```python
path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
```
```python
task = HF_TASKS_AUTO.SequenceClassification

pretrained_model_name = "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
```
```python
blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())

dls = dblock.dataloaders(imdb_df, bs=4)
dls.show_batch(dataloaders=dls, max_n=2)
```
```python
model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls,
                model,
                opt_func=partial(Adam, decouple_wd=True),
                loss_func=CrossEntropyLossFlat(),
                metrics=[accuracy],
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter)

learn.create_opt()
learn.freeze()

learn.fit_one_cycle(3, lr_max=1e-3)

learn.show_results(learner=learn, max_n=2)
```
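Once training finishes you can run inference on raw text. A minimal sketch, assuming the `Learner.blurr_predict` method described in the 12/20/2020 update below (it accepts single or multiple items); the sample reviews are made up:

```python
# Sketch: blurr_predict is noted below (12/20/2020) as accepting single or
# multiple items; these example strings are invented.
learn.blurr_predict("This movie was an absolute delight!")
learn.blurr_predict(["A total bore.", "I would watch it again in a heartbeat."])
```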
### 12/31/2020
The "Goodbye 2020" release with lots of goodies for blurr users:
tokenizer.prepare_seq2seq_batch
.fit_one_cycle
, etc.. callbacks rather than attach them to your Learner
.default_text_gen_kwargs
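A minimal sketch of the latter, reusing the quick-start objects above (that `default_text_gen_kwargs` is importable via `from blurr.modeling.all import *` and the `task` value are assumptions):

```python
# Sketch: per the note above, default_text_gen_kwargs takes a huggingface
# config, model, and an optional task, and returns the default/recommended
# kwargs for text generation models. The task string here is assumed.
text_gen_kwargs = default_text_gen_kwargs(hf_config, hf_model, task='summarization')
print(text_gen_kwargs)
```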
As I'm sure there is plenty I can do to make this library better, please don't hesitate to join in and help the effort by submitting PRs, pointing out problems with my code, or letting me know what and how I can improve things generally. Some models, like mbart and mt5 for example, aren't giving good results, and I'd love to get any and all feedback from the community on how to resolve such issues ... so hit me up, I promise I won't bite :)
### 12/20/2020
* Updated `Learner.blurr_predict` and `Learner.blurr_predict_tokens` to support single or multiple items
* `blurrONNX` provides ONNX-friendly variants of `Learner.blurr_predict` and `Learner.blurr_predict_tokens` in the form of `blurrONNX.predict` and `blurrONNX.predict_tokens` respectively. Like their `Learner` equivalents, these methods support single or multiple items for inference. See the docs/code for examples and the speed gains you get with ONNX.
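Only the `blurrONNX.predict` and `blurrONNX.predict_tokens` method names come from these notes; how a `blurrONNX` object is constructed isn't shown here, so the construction below is purely hypothetical:

```python
# Hypothetical sketch: building blurrONNX from a trained Learner is an
# assumption; only .predict/.predict_tokens come from the notes above.
onnx_learn = blurrONNX(learn)
preds = onnx_learn.predict(["Loved it!", "Fell asleep halfway through."])
```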
### 12/12/2020

* Updated `Learner.blurr_summary` to work with fast.ai >= 2.1.8
* Fixed the setting of `add_prefix_space` in tokenizers created via `BLURR_MODEL_HELPER`
* Fixed `show_results` for tokenizers that add a prefix space

### 11/12/2020
### 11/10/2020
* … `before_batch` transforms.

### 10/08/2020
* `ModelOutput` attributes are assigned to the appropriate fastai bits like `Learner.pred` and `Learner.loss`, and anything else you've requested the huggingface model to return is available via the `Learner.blurr_model_outputs` dictionary (see the next two bullet items)
* … `Learner`. You can get at them via the `Learner.blurr_model_outputs` dictionary if you tell `HF_BaseModelWrapper` to provide them.
* You can pass `model_kwargs` to `HF_BaseModelWrapper` should you need to request a huggingface model to return something specific to its type. These outputs will be available via the `Learner.blurr_model_outputs` dictionary as well.
### 09/16/2020
### 09/07/2020
### 08/20/2020
* `HF_TokenizerTransform` doesn't add any padding tokens; instead, all huggingface inputs are padded simply to the max sequence length in each batch rather than to the max length (passed in and/or acceptable to the model). This should create efficiencies across the board, from memory consumption to GPU utilization. The old tried-and-true method of padding during tokenization requires you to pass `padding='max_length'` to `HF_TextBlock`.
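In code, opting back into fixed-length padding looks something like this (the other `HF_TextBlock` arguments follow the quick-start example above):

```python
# Per the note above: pass padding='max_length' to HF_TextBlock to restore
# the old behavior of padding during tokenization.
blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, padding='max_length'),
          CategoryBlock)
```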
### 08/13/2020
### 07/06/2020
### 06/27/2020
* Updated the `BLURR_MODEL_HELPER.get_hf_objects` method to support a wide range of options in terms of building the necessary huggingface objects (architecture, config, tokenizer, and model). Also added `cache_dir` for saving pre-trained objects in a custom directory.
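A minimal sketch, reusing the quick-start objects above (the directory name is made up):

```python
# Sketch: cache_dir (mentioned above) saves pre-trained objects to a custom
# directory; the path here is invented.
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(
    pretrained_model_name, task=task, cache_dir='./my_pretrained_models')
```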
### 05/23/2020
### 05/17/2020
* `HF_TokenizerTransform` replaces `HF_Tokenizer`, handling the tokenization and numericalization in one place. DataBlock code has been dramatically simplified.
* … `add_prefix_space=True`.
* `HF_BaseModelWrapper` and `HF_BaseModelCallback` are required and work together in order to allow developers to tie into any callback-friendly event exposed by fastai2 and also to pass named arguments to the huggingface models.
* `show_batch` and `show_results` have been updated for Question/Answer and Token Classification models to represent the data and results in a more easily interpretable manner than the defaults.

### 05/06/2020
* A `Learner` object with a `predict_tokens` method used specifically in token classification
* `HF_BaseModelCallback` can be used (or extended) instead of the model wrapper to ensure your inputs into the huggingface model are correct (recommended). See the docs for examples (and thanks to fastai's Sylvain for the suggestion!)
* `HF_Tokenizer` can work with strings or a string representation of a list (the latter helpful for token classification tasks)
* The `show_batch` and `show_results` methods have been updated to allow better control over how huggingface tokenized data is represented in those methods

A word of gratitude to the following individuals, repos, and articles, from which much of this work draws inspiration: