---
title: blurr
keywords: fastai
sidebar: home_sidebar
summary: "A library that integrates huggingface transformers with version 2 of the fastai framework"
description: "A library that integrates huggingface transformers with version 2 of the fastai framework"
nb_path: "nbs/index.ipynb"
---
## Install

You can now pip install blurr via `pip install ohmeow-blurr`.

Or, even better, as this library is under very active development, create an editable install like this:
```
git clone https://github.com/ohmeow/blurr.git
cd blurr
pip install -e ".[dev]"
```
The initial release includes everything you need for sequence classification and question answering tasks. Support for token classification and summarization is incoming. Please check the documentation for more thorough examples of how to use this package.
## How to use

The following two packages need to be installed for blurr to work:

1. fastai (version 2)
2. huggingface transformers
```python
import torch

from transformers import *
from fastai.text.all import *

from blurr.data.all import *
from blurr.modeling.all import *
```
```python
# Grab the IMDB sample dataset provided by fastai
path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
```
```python
# Get the huggingface architecture, config, tokenizer, and model for the task
task = HF_TASKS_AUTO.SequenceClassification

pretrained_model_name = "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
```
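A quick way to sanity-check what came back (the values noted in the comments are what we'd expect for `bert-base-uncased`, not guaranteed output):

```python
# Inspect the objects returned by get_hf_objects (expected values are
# illustrative assumptions for bert-base-uncased)
print(hf_arch)                      # e.g. 'bert'
print(type(hf_tokenizer).__name__)  # e.g. a BERT tokenizer class
print(type(hf_model).__name__)      # e.g. 'BertForSequenceClassification'
```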
```python
# Build the DataBlock and DataLoaders
blocks = (HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), CategoryBlock)

dblock = DataBlock(blocks=blocks,
                   get_x=ColReader('text'),
                   get_y=ColReader('label'),
                   splitter=ColSplitter(col='is_valid'))

dls = dblock.dataloaders(imdb_df, bs=4)
dls.show_batch(hf_tokenizer=hf_tokenizer, max_n=2)
```
```python
# Wrap the huggingface model and build a fastai Learner
model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls,
                model,
                opt_func=partial(Adam, decouple_wd=True),
                loss_func=CrossEntropyLossFlat(),
                metrics=[accuracy],
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter)

learn.create_opt()
learn.freeze()

# Train the (unfrozen) head for 3 epochs
learn.fit_one_cycle(3, lr_max=1e-3)

learn.show_results(hf_tokenizer=hf_tokenizer, max_n=2)
```
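Once trained, standard fastai inference and export should apply. A hedged sketch: whether `learn.predict` round-trips raw text through blurr's transforms in this version is an assumption, and the filename is illustrative.

```python
# Predict on a new review (assumes fastai's Learner.predict works through
# blurr's transforms in this version)
learn.predict("This movie was absolutely wonderful!")

# Export the Learner for later inference (standard fastai; filename illustrative)
learn.export(fname='blurr_imdb_classifier.pkl')
```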
## Updates

**09/07/2020**

**08/20/2020**
* `HF_TokenizerTransform` doesn't add any padding tokens; all huggingface inputs are padded simply to the max sequence length in each batch, rather than to the max length (passed in and/or acceptable to the model). This should create efficiencies across the board, from memory consumption to GPU utilization. The old tried-and-true method of padding during tokenization requires you to pass `padding='max_length'` to `HF_TextBlock`, as sketched below.
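A minimal sketch of opting back into fixed-length padding (the `padding='max_length'` kwarg comes from the entry above; the remaining arguments mirror the earlier example):

```python
# Opt back into padding-to-max-length at tokenization time
# (kwarg per the changelog entry above; other args mirror the earlier example)
blocks = (
    HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer, padding='max_length'),
    CategoryBlock
)
```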
**08/13/2020**
**07/06/2020**

**06/27/2020**
* Updated the `BLURR_MODEL_HELPER.get_hf_objects` method to support a wide range of options in terms of building the necessary huggingface objects (architecture, config, tokenizer, and model). Also added `cache_dir` for saving pre-trained objects in a custom directory; see the sketch below.
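A hedged example of the `cache_dir` option (the kwarg name comes from the entry above; the exact signature may differ by version, and the path is illustrative):

```python
# Cache the downloaded architecture/config/tokenizer/model in a custom
# directory (kwarg name per the changelog; path is illustrative)
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(
    "bert-base-uncased",
    task=HF_TASKS_AUTO.SequenceClassification,
    cache_dir="./pretrained_cache",
)
```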
**05/23/2020**

**05/17/2020**
* `HF_TokenizerTransform` replaces `HF_Tokenizer`, handling the tokenization and numericalization in one place. DataBlock code has been dramatically simplified.
* … `add_prefix_space=True`.
* `HF_BaseModelWrapper` and `HF_BaseModelCallback` are required and work together in order to allow developers to tie into any callback-friendly event exposed by fastai2, and also to pass named arguments to the huggingface models.
* `show_batch` and `show_results` have been updated for Question/Answer and Token Classification models to represent the data and results in a more easily interpretable manner than the defaults.

**05/06/2020**
* A `Learner` object with a `predict_tokens` method used specifically in token classification
* `HF_BaseModelCallback` can be used (or extended) instead of the model wrapper to ensure your inputs into the huggingface model are correct (recommended). See the docs for examples (and thanks to fastai's Sylvain for the suggestion!)
* `HF_Tokenizer` can work with strings or a string representation of a list (the latter helpful for token classification tasks)
* `show_batch` and `show_results` methods have been updated to allow better control over how huggingface tokenized data is represented in those methods

A word of gratitude to the following individuals, repos, and articles, from which much of this work draws inspiration: