---
title: Causal Language Modeling with GPT-2
keywords: fastai
sidebar: home_sidebar
summary: "This notebook demonstrates how we can use Blurr to train, or fine-tune, a causal language model against examples defined in individual files (similar to how the raw wiki-103 data comes). We demonstrate how to use `get_text_files` and create a custom `splitter` function to build our train and validation datasets."
description: "This notebook demonstrates how we can use Blurr to train, or fine-tune, a causal language model against examples defined in individual files (similar to how the raw wiki-103 data comes). We demonstrate how to use `get_text_files` and create a custom `splitter` function to build our train and validation datasets."
nb_path: "nbs/99e_text-examples-causal-lm-gpt2.ipynb"
---
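The cells in this example assume the usual fastai and Blurr setup has already been imported. The sketch below is an assumption rather than the notebook's actual setup cell: the Blurr module paths vary by release (older versions expose `blurr.data.all` / `blurr.modeling.all`), and the `NLP` helper used for `get_hf_objects` is expected to come from those wildcard imports.
# Assumed setup (not part of the original cells); adjust module paths to your Blurr version
from functools import partial
from fastai.text.all import *          # DataBlock, Learner, Adam, perplexity, get_text_files, FuncSplitter, ...
from transformers import AutoModelForCausalLM
from blurr.text.data.all import *      # TextBlock, LMBatchTokenizeTransform, CausalLMStrategy, CausalLMTextInput, ...
from blurr.text.modeling.all import *  # BaseModelWrapper, BaseModelCallback, PreCalculatedLoss, LMMetricsCallback, blurr_splitter, ...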
raw_data_path = Path('./data/task-language-modeling/pt-2/')
raw_data_path.ls()
(raw_data_path/'train').ls()
len((raw_data_path/'train').ls()), len((raw_data_path/'valid').ls())
model_cls = AutoModelForCausalLM
pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
# GPT-2 does not define a pad token, so add one and resize the model's embeddings to match
if hf_tokenizer.pad_token is None:
    hf_tokenizer.add_special_tokens({'pad_token': '<pad>'})
    hf_config.pad_token_id = hf_tokenizer.get_vocab()['<pad>']
    hf_model.resize_token_embeddings(len(hf_tokenizer))
get_wiki_files = partial(get_text_files, folders=['train', 'valid'])
fnames = get_wiki_files(raw_data_path)
fnames[0]
splitter = FuncSplitter(lambda fpath: Path(fpath).parent.name == 'valid')
splitter(fnames)
bbtfm = LMBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=CausalLMStrategy)
blocks = (TextBlock(before_batch_tfm=bbtfm, input_return_type=CausalLMTextInput), noop)
# our DataBlock
dblock = DataBlock(
    blocks=blocks,
    get_x=lambda x: x.read_text(),  # read each text file
    get_items=get_wiki_files,       # grab the text files
    splitter=splitter               # split on parent folder name (validation = 'valid')
)
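If the pipeline misbehaves, fastai's standard `DataBlock.summary` can push a sample through every step and print what each transform produces. An optional debugging call might look like this:
# Optional: step through the pipeline on a couple of items to verify each transform's output
dblock.summary(raw_data_path, bs=2)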
dls = dblock.dataloaders(raw_data_path, bs=2, val_bs=4)
b = dls.one_batch()
b[0]['input_ids'].shape, b[1].shape
dls.show_batch(dataloaders=dls, trunc_at=500, max_n=2)
model = BaseModelWrapper(hf_model)
fit_cbs = [LMMetricsCallback()]
learn = Learner(
    dls,
    model,
    opt_func=partial(Adam),
    loss_func=PreCalculatedLoss(),  # the HF model already computes the LM loss; reuse it rather than recomputing
    cbs=[BaseModelCallback],
    metrics=[perplexity],
    splitter=blurr_splitter
).to_fp16()
# learn.freeze()
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
learn.fit_one_cycle(1, lr_max=3e-3, cbs=fit_cbs)
learn.show_results(learner=learn, max_n=2, trunc_at=500)
learn.blurr_generate('Itália ( ), oficialmente República Italiana', max_length=100, do_sample=True, top_k=25)
This example demonstrates how to train a causal language model where the raw data examples live in individual files (similar to how the standard wikitext-103 dataset is distributed). We also defined a custom `splitter` function so that all the files under /valid go into the validation set and all the files under /train go into the training set.
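To make the split behavior concrete, here is a small, self-contained illustration (with hypothetical file paths) of what the `FuncSplitter` above returns: a tuple of training indices and validation indices, where items whose parent folder is named 'valid' land in the second list.
# Toy illustration with made-up paths; the real notebook uses the wiki files gathered above
from pathlib import Path
from fastai.data.transforms import FuncSplitter

toy_fnames = [Path('data/train/a.txt'), Path('data/train/b.txt'), Path('data/valid/c.txt')]
toy_splitter = FuncSplitter(lambda fpath: Path(fpath).parent.name == 'valid')
train_idxs, valid_idxs = toy_splitter(toy_fnames)
print(train_idxs, valid_idxs)  # roughly: [0, 1] [2]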