---
title: text.modeling.language_modeling
keywords: fastai
sidebar: home_sidebar
summary: "This module contains custom models, custom splitters, etc... for both causal and MLM language modeling tasks. This includes things like training BERT from scratch or fine-tuning a particular pre-trained LM on your own corpus."
description: "This module contains custom models, custom splitters, etc... for both causal and MLM language modeling tasks. This includes things like training BERT from scratch or fine-tuning a particular pre-trained LM on your own corpus."
nb_path: "nbs/12_text-modeling-language-modeling.ipynb"
---
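The cells below assume the notebook's setup cell has already been run. A minimal sketch of that setup is shown here for reference; the exact Blurr module paths are an assumption based on the library's 2.x layout, so adjust them to your installed version.

```python
import pandas as pd
from functools import partial

from fastai.text.all import *  # untar_data, URLs, DataBlock, ColReader, ColSplitter, Learner, Adam, perplexity, ...
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM

# Blurr imports -- module paths assumed from the 2.x package layout
from blurr.text.data.all import *      # NLP, LMPreprocessor, LMBatchTokenizeTransform, TextBlock, ...
from blurr.text.modeling.all import *  # BaseModelWrapper, BaseModelCallback, BlearnerForLM, blurr_splitter, ...
```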
# grab a small WikiText sample; the CSVs have no header row, so the raw text lives in column 0
wiki_path = untar_data(URLs.WIKITEXT_TINY)
train_df = pd.read_csv(wiki_path / "train.csv", header=None)
valid_df = pd.read_csv(wiki_path / "test.csv", header=None)
train_df["is_valid"] = False
valid_df["is_valid"] = True
df = pd.concat([train_df, valid_df])
print(len(df))
df.head()
In this section, we'll add helpful metrics for calculating accuracy and perplexity for both causal and masked language modeling tasks.
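Since the metrics below lean on perplexity, it's worth noting that perplexity is simply the exponential of the average token-level cross-entropy loss. The snippet below is a standalone illustration of that relationship, not Blurr's implementation.

```python
import torch
import torch.nn.functional as F

# toy batch: 1 sequence, 4 token positions, vocabulary of 10
logits = torch.randn(1, 4, 10)
targets = torch.randint(0, 10, (1, 4))

# average token-level cross-entropy over all positions
ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
ppl = torch.exp(ce)  # perplexity = exp(cross-entropy)
print(f"loss={ce.item():.3f}, perplexity={ppl.item():.3f}")
```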
model_cls = AutoModelForCausalLM
pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
# GPT-2's tokenizer has no pad token by default, so add one to allow padded batches
if hf_tokenizer.pad_token is None:
    hf_tokenizer.pad_token = "[PAD]"
# chunk the raw text (column 0) into fixed-length blocks of 128 tokens for LM training
preprocessor = LMPreprocessor(hf_tokenizer, chunk_size=128, text_attr=0)
proc_df = preprocessor.process_df(train_df, valid_df)
bbtfm = LMBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=CausalLMStrategy)
blocks = (TextBlock(batch_tokenize_tfm=bbtfm, input_return_type=CausalLMTextInput), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader("proc_0"), splitter=ColSplitter(col="is_valid"))
dls = dblock.dataloaders(proc_df, bs=2)
b = dls.one_batch()
b[0]["input_ids"].shape, b[0]["labels"].shape, b[1].shape
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
model = BaseModelWrapper(hf_model)
fit_cbs = [LMMetricsCallback()]
learn = Learner(
    dls,
    model,
    opt_func=partial(Adam),
    # labels are included in the batch, so the Hugging Face model computes the LM loss itself
    # and PreCalculatedCrossEntropyLoss simply returns that value
    loss_func=PreCalculatedCrossEntropyLoss(),
    cbs=[BaseModelCallback],
    metrics=[perplexity],
    splitter=blurr_splitter,
).to_fp16()
learn.freeze()
learn.summary()
# preds = learn.model(b[0])
# len(preds),preds[0], preds[1].shape
print(len(learn.opt.param_groups))
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(1, lr_max=3e-3, cbs=fit_cbs)
learn.show_results(learner=learn, trunc_at=250)
learn.blurr_generate("Blurr is fun to work with because", max_length=50, do_sample=True, top_k=25)
In masked language modeling (MLM), we attempt to predict the masked tokens. In Blurr, the available masking strategies are encapsulated by classes deriving from the BaseLMStrategy base class.
For a list of some of the more common strategies, see Table 3 of the "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" paper. When fine-tuning an MLM model, you'll want to use the same approach as the model's authors if you're trying to reproduce their results ... but our approach here makes it easy to experiment with different strategies regardless.
In the example below, we'll tell Blurr we want to use the BERT-style masking strategy.
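For intuition, the BERT-style strategy selects roughly 15% of the (non-special) tokens as prediction targets; of those, about 80% are replaced with the mask token, about 10% with a random token, and about 10% are left unchanged, while every unselected position gets a label of -100 so the loss ignores it. The sketch below illustrates that idea in plain PyTorch; it is not BertMLMStrategy's actual implementation.

```python
import torch

def bert_style_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    "Illustrative BERT-style masking: ~80% mask token, ~10% random token, ~10% unchanged."
    input_ids, labels = input_ids.clone(), input_ids.clone()

    selected = torch.rand(input_ids.shape) < mlm_prob  # positions to predict (real code also skips special tokens)
    labels[~selected] = -100                           # ignored by the cross-entropy loss

    masked = selected & (torch.rand(input_ids.shape) < 0.8)  # ~80% of selected -> mask token
    input_ids[masked] = mask_token_id

    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)  # ~10% -> random token
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    # the remaining ~10% of selected positions keep their original token
    return input_ids, labels
```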
model_cls = AutoModelForMaskedLM
pretrained_model_name = "distilroberta-base"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
# RoBERTa tokenizers already define a pad token, so this is a no-op here
if hf_tokenizer.pad_token is None:
    hf_tokenizer.pad_token = "[PAD]"
preprocessor = LMPreprocessor(hf_tokenizer, chunk_size=128, text_attr=0)
proc_df = preprocessor.process_df(train_df, valid_df)
bbtfm = LMBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=BertMLMStrategy)
blocks = (TextBlock(batch_tokenize_tfm=bbtfm, input_return_type=MLMTextInput), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader("proc_0"), splitter=ColSplitter(col="is_valid"))
dls = dblock.dataloaders(proc_df, bs=2)
b = dls.one_batch()
b[0]["input_ids"].shape, b[0]["labels"].shape, b[1].shape
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=250)
model = BaseModelWrapper(hf_model)
fit_cbs = [LMMetricsCallback()]
learn = Learner(
    dls,
    model,
    opt_func=partial(Adam, decouple_wd=True),
    loss_func=PreCalculatedCrossEntropyLoss(),
    cbs=[BaseModelCallback],
    metrics=[perplexity],
    splitter=blurr_splitter,
).to_fp16()
learn.freeze()
learn.summary()
print(len(learn.opt.param_groups))
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(1, lr_max=1e-4, cbs=fit_cbs)
learn.show_results(learner=learn, trunc_at=250)
While Learner.blurr_generate works well for causal LMs designed for text generation, it won't work for MLM models designed to predict masked tokens. To accommodate the latter, we add Learner.blurr_fill_mask ...
learn.blurr_fill_mask(f"The best place on earth is {hf_tokenizer.mask_token}.", n_preds=5)
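For intuition, filling a mask boils down to running the masked text through the model and taking the top-k tokens at the mask position. Below is a rough standalone sketch using the Hugging Face objects directly; it is not how blurr_fill_mask is implemented.

```python
import torch

text = f"The best place on earth is {hf_tokenizer.mask_token}."
inputs = {k: v.to(hf_model.device) for k, v in hf_tokenizer(text, return_tensors="pt").items()}

with torch.no_grad():
    logits = hf_model(**inputs).logits  # (1, seq_len, vocab_size)

# locate the mask position and take the 5 highest-probability tokens there
mask_pos = (inputs["input_ids"] == hf_tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].softmax(dim=-1).topk(5)
print(hf_tokenizer.convert_ids_to_tokens(top5.indices[0].tolist()))
```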
We can use BlearnerForLM for either causal or masked language models. With one line of code, we get our DataBlock, DataLoaders, and Blearner with sensible defaults, ready for training.
learn = BlearnerForLM.from_data(df, "gpt2", text_attr=0, dl_kwargs={"bs": 2}).to_fp16()
learn.dls.show_batch(dataloaders=learn.dls, max_n=2, trunc_at=500)
learn.fit_one_cycle(1, lr_max=3e-3, cbs=[BlearnerForLM.get_metrics_cb()])
learn.show_results(learner=learn, trunc_at=250)
learn.blurr_generate("Blurr is fun to work with because", max_length=50, do_sample=True, top_k=25)
learn = BlearnerForLM.from_data(df, "bert-base-cased", lm_strategy_cls=BertMLMStrategy, text_attr=0, dl_kwargs={"bs": 2}).to_fp16()
learn.dls.show_batch(dataloaders=learn.dls, max_n=2, trunc_at=250)
learn.fit_one_cycle(1, lr_max=6e-4, cbs=[BlearnerForLM.get_metrics_cb()])
learn.show_results(learner=learn, trunc_at=250)
batch_tfm = first_blurr_tfm(learn.dls)
learn.blurr_fill_mask(f"The best place on earth is {batch_tfm.hf_tokenizer.mask_token}.", n_preds=5)