---
title: text.modeling.core
keywords: fastai
sidebar: home_sidebar
summary: "This module contains core custom models, loss functions, and a default layer group splitter for use in applying discriminative learning rates to your Hugging Face models trained via fastai"
description: "This module contains core custom models, loss functions, and a default layer group splitter for use in applying discriminative learning rates to your Hugging Face models trained via fastai"
nb_path: "nbs/11_text-modeling-core.ipynb"
---
Note that BaseModelWrapper includes some nifty code for just passing in the things your model needs, as not all transformer architectures require/use the same information.
We use a Callback for handling the ModelOutput returned by Hugging Face transformers. It allows us to associate anything we want from that object to our Learner.
Note that your Learner's loss will be set for you only if the Hugging Face model returns one and you are using the PreCalculatedLoss loss function.
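To make this concrete, a Hugging Face model only returns a loss in its ModelOutput when labels are included in the inputs. Below is a small standalone illustration of that behavior (independent of blurr and of the Learner built later on), using the same distilroberta-base checkpoint as the rest of this example:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

illus_tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
illus_model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base")

# With `labels` present, the model computes and returns its own loss ...
batch = illus_tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
outputs = illus_model(**batch, labels=torch.tensor([1, 0]))
print(outputs.loss)  # ... which is the value blurr reuses via PreCalculatedLoss instead of recomputing it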
Also note that anything else you asked the model to return (for example, last hidden state, etc.) will be available to you via the blurr_model_outputs property attached to your Learner. For example, assuming you are using BERT for a classification task ... if you have told your BaseModelWrapper instance to return attentions, you'd be able to access them via learn.blurr_model_outputs['attentions'].
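As a minimal sketch of that workflow (note: the output_attentions keyword below is an assumption about BaseModelWrapper's arguments, so check the signature in your installed version; hf_model, dls, and the other objects are created in the walkthrough that follows):
# Sketch only: ask the wrapped model to also return its attention weights
wrapped_model = BaseModelWrapper(hf_model, output_attentions=True)  # output_attentions is assumed here
attn_learn = Learner(dls, wrapped_model, loss_func=PreCalculatedCrossEntropyLoss(), cbs=[BaseModelCallback], splitter=blurr_splitter)
attn_learn.validate()  # run a forward pass so the callback can capture the model's outputs
attns = attn_learn.blurr_model_outputs["attentions"]  # anything else the model returned is available the same way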
# Load the IMDB reviews dataset and flag the test split as the validation set
raw_datasets = load_dataset("imdb", split=["train", "test"])
raw_datasets[0] = raw_datasets[0].add_column("is_valid", [False] * len(raw_datasets[0]))
raw_datasets[1] = raw_datasets[1].add_column("is_valid", [True] * len(raw_datasets[1]))

# Take a small random subsample of each split to keep the example quick to run
final_ds = concatenate_datasets([raw_datasets[0].shuffle().select(range(1000)), raw_datasets[1].shuffle().select(range(200))])
imdb_df = pd.DataFrame(final_ds)
imdb_df.head()
labels = raw_datasets[0].features["label"].names
labels
model_cls = AutoModelForSequenceClassification
pretrained_model_name = "distilroberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
set_seed()
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=RandomSplitter(seed=42))
dls = dblock.dataloaders(imdb_df, bs=4)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
set_seed()
model = BaseModelWrapper(hf_model)
learn = Learner(
    dls,
    model,
    opt_func=partial(OptimWrapper, opt=torch.optim.Adam),
    loss_func=PreCalculatedCrossEntropyLoss(),  # or CrossEntropyLossFlat() to have fastai compute the loss from the logits
    metrics=[accuracy],
    cbs=[BaseModelCallback],  # handles the ModelOutput returned by the Hugging Face model
    splitter=blurr_splitter,  # default layer group splitter, enabling discriminative learning rates
)
learn.freeze()
learn.summary()
print(len(learn.opt.param_groups))
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
set_seed()
learn.fit_one_cycle(1, lr_max=1e-3)
learn.show_results(learner=learn, max_n=2, trunc_at=500)
learn.unfreeze()
set_seed()
learn.fit_one_cycle(2, lr_max=slice(1e-7, 1e-4))
learn.recorder.plot_loss()
learn.show_results(learner=learn, max_n=2, trunc_at=500)
learn.blurr_predict("I really liked the movie")
learn.blurr_predict("Acting was so bad it was almost funny.")
learn.blurr_predict(["I really liked the movie", "I really hated the movie"])
Though not useful in sequence classification, we will also add a blurr_generate method to Learner that uses Hugging Face's PreTrainedModel.generate for text generation tasks. For the full list of arguments you can pass in, see here. You can also check out their "How To Generate" notebook for more information about how it all works.
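Since blurr_generate builds on PreTrainedModel.generate, here is a standalone illustration of that underlying Hugging Face call (this is not blurr's API and is unrelated to the sequence classifier trained above; gpt2 is used purely as a generation-capable example):
from transformers import AutoTokenizer, AutoModelForCausalLM

gen_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")

# generate accepts the arguments referred to above (max_length, sampling options, etc.)
input_ids = gen_tokenizer("The movie opens with", return_tensors="pt").input_ids
generated = gen_model.generate(input_ids, max_length=30, do_sample=True, num_return_sequences=2)
print(gen_tokenizer.batch_decode(generated, skip_special_tokens=True))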
export_fname = "seq_class_learn_export"
learn.export(fname=f"{export_fname}.pkl")
inf_learn = load_learner(fname=f"{export_fname}.pkl")
inf_learn.blurr_predict("This movie should not be seen by anyone!!!!")
model_cls = AutoModelForSequenceClassification
pretrained_model_name = "distilroberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
dls = dblock.dataloaders(imdb_df, bs=4)
Instead of constructing our low-level Learner ourselves, we can use the Blearner class, which provides sensible defaults for training.
learn = Blearner(dls, hf_model, metrics=[accuracy])
learn.fit_one_cycle(1, lr_max=1e-3)
learn.show_results(learner=learn, max_n=2, trunc_at=500)
learn.blurr_predict("This was a really good movie")
learn.export(fname=f"{export_fname}.pkl")
inf_learn = load_learner(fname=f"{export_fname}.pkl")
inf_learn.blurr_predict("This movie should not be seen by anyone!!!!")
We also introduce a classification task-specific Blearner that gets you your DataBlock, DataLoaders, and Blearner in one line of code!
learn = BlearnerForSequenceClassification.from_data(
imdb_df, "distilroberta-base", text_attr="text", label_attr="label", dl_kwargs={"bs": 4}
)
learn.fit_one_cycle(1, lr_max=1e-3)
learn.show_results(learner=learn, max_n=2, trunc_at=500)
learn.predict("This was a really good movie")
learn.export(fname=f"{export_fname}.pkl")
inf_learn = load_learner(fname=f"{export_fname}.pkl")
inf_learn.blurr_predict("This movie should not be seen by anyone!!!!")
Thanks to the TextDataLoader, there isn't really anything you have to do to use plain ol' PyTorch or fast.ai Datasets and DataLoaders with Blurr. Let's take a look at fine-tuning a model against GLUE's MRPC dataset ...
model_cls = AutoModelForSequenceClassification
pretrained_model_name = "distilroberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
from datasets import load_dataset
from blurr.text.data.core import preproc_hf_dataset
raw_datasets = load_dataset("glue", "mrpc")
def tokenize_function(example):
    return hf_tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
label_names = raw_datasets["train"].features["label"].names
trn_dl = TextDataLoader(
tokenized_datasets["train"],
hf_arch=hf_arch,
hf_config=hf_config,
hf_tokenizer=hf_tokenizer,
hf_model=hf_model,
preproccesing_func=preproc_hf_dataset,
batch_decode_kwargs={"labels": label_names},
shuffle=True,
batch_size=8,
)
val_dl = TextDataLoader(
tokenized_datasets["validation"],
hf_arch=hf_arch,
hf_config=hf_config,
hf_tokenizer=hf_tokenizer,
hf_model=hf_model,
preproccesing_func=preproc_hf_dataset,
batch_decode_kwargs={"labels": label_names},
batch_size=16,
)
dls = DataLoaders(trn_dl, val_dl)
From here, we can hand these DataLoaders off to a Blearner (here, a BlearnerForSequenceClassification) just as before.
learn = BlearnerForSequenceClassification(dls, hf_model, loss_func=PreCalculatedCrossEntropyLoss())
learn.lr_find()
learn.fit_one_cycle(1, lr_max=1e-3)
learn.unfreeze()
learn.fit_one_cycle(2, lr_max=slice(1e-8, 1e-6))
learn.show_results(learner=learn, max_n=2, trunc_at=500)
The tests below ensure that the core training code above works for all pretrained sequence classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.
Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).