---
title: text.data.language_modeling
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for causal and masked language modeling tasks. This includes things like training BERT from scratch or fine-tuning a particular pre-trained LM on your own corpus."
description: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for causal and masked language modeling tasks. This includes things like training BERT from scratch or fine-tuning a particular pre-trained LM on your own corpus."
nb_path: "nbs/12_text-data-language-modeling.ipynb"
---
For this example, we'll use the WIKITEXT_TINY dataset available from fastai. In addition to the Datasets library from Hugging Face, fastai provides a number of smaller datasets that are really useful when experimenting and/or in the early development of your training/validation/inference code.
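The code below assumes pandas and fastai are already imported; the blurr-specific objects used later (NLP, LMPreprocessor, LMBatchTokenizeTransform, the text block and input types, and the strategy classes) are assumed to come from the notebook's earlier import cells, which are not shown here.

import pandas as pd
from fastai.text.all import *  # provides untar_data, URLs, DataBlock, ColReader, ColSplitter, noop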
wiki_path = untar_data(URLs.WIKITEXT_TINY)
wiki_path.ls()
train_df = pd.read_csv(wiki_path / "train.csv", header=None)
valid_df = pd.read_csv(wiki_path / "test.csv", header=None)
print(len(train_df), len(valid_df))
train_df.head()
train_df["is_valid"] = False
valid_df["is_valid"] = True
df = pd.concat([train_df, valid_df])
df.head()
model_cls = AutoModelForCausalLM
pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if hf_tokenizer.pad_token is None:
    hf_tokenizer.pad_token = "[PAD]"
hf_tokenizer.pad_token, hf_tokenizer.pad_token_id
# num_added_toks = hf_tokenizer.add_special_tokens(special_tokens_dict)
# hf_model.resize_token_embeddings(len(hf_tokenizer))
preprocessor = LMPreprocessor(hf_tokenizer, chunk_size=128, text_attr=0)
proc_df = preprocessor.process_df(train_df, valid_df)
print(len(proc_df))
proc_df.head(2)
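As a quick, hedged sanity check on what the preprocessor produced (the "proc_0" column name is taken from the ColReader usage further below, and the chunking behavior is assumed rather than spelled out here), we can tokenize one processed row and confirm it is roughly chunk_size tokens long:

# hedged check: each row of proc_df is assumed to hold one ~128-token chunk of text in "proc_0"
sample_chunk = proc_df.iloc[0]["proc_0"]
print(sample_chunk[:200])
print(len(hf_tokenizer.tokenize(sample_chunk)), "tokens in this chunk")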
Here we include a `BaseLMStrategy` abstract class and several different strategies for building your inputs and targets for causal and masked language modeling tasks. With CLMs the objective is simply to predict the next token, but with MLMs a variety of masking strategies may be used (e.g., mask random tokens, mask random words, mask spans, etc.). A `BertMLMStrategy` is introduced below that follows the "mask random tokens" strategy used in the BERT paper, but users can create their own `BaseLMStrategy` subclass to support any masking strategy they desire.
Follows the masking strategy used in the BERT paper for random token masking
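As an illustration of that extension point, here is a rough sketch of a custom masking strategy. This is not taken from the library: the method name, its signature, and the `hf_tokenizer` attribute are assumptions, so check the `BaseLMStrategy` source for the exact abstract interface before writing your own. The point is simply that the token-masking logic lives in a small, swappable class that produces the masked inputs and the -100-filled labels.

import torch

class SimpleRandomMaskStrategy(BaseLMStrategy):
    # hypothetical subclass - the method name/signature below is assumed, not verified against the source;
    # unlike BertMLMStrategy, it skips BERT's 80/10/10 refinement and doesn't exclude special tokens
    def build_inputs_and_targets(self, input_ids):
        labels = input_ids.clone()
        mask = torch.rand(input_ids.shape) < 0.15          # pick ~15% of positions to mask
        labels[~mask] = -100                               # unmasked positions are ignored by the loss
        input_ids = input_ids.clone()
        input_ids[mask] = self.hf_tokenizer.mask_token_id  # assumed attribute; replace picked positions with [MASK]
        return input_ids, labels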
Again, we define custom classes for the `@typedispatch`ed methods to use so that we can override how both causal and masked language modeling inputs/targets are assembled, as well as how the data is shown via methods like `show_batch` and `show_results`.
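To make the pattern concrete, here is a pared-down sketch of what such a typedispatched override can look like. It is not blurr's implementation: the sample layout and keyword arguments (beyond the dataloaders argument visible in the show_batch calls below) are assumptions, and for brevity it reuses the notebook-level hf_tokenizer rather than pulling the tokenizer off the dataloaders.

# illustrative only - the general shape of a typedispatched override; samples[i][0] is assumed
# to hold decodable input ids
from fastcore.dispatch import typedispatch

@typedispatch
def show_batch(x: CausalLMTextInput, y, samples, dataloaders=None, ctxs=None, max_n=6, trunc_at=None, **kwargs):
    for s in samples[:max_n]:
        print(hf_tokenizer.decode(s[0], skip_special_tokens=True)[:trunc_at])
    return ctxs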
Our `LMBatchTokenizeTransform` allows us to update the input's `labels` and our targets appropriately for any language modeling task.

The `labels` argument allows you to forgo calculating the loss yourself by letting Hugging Face return it for you, should you choose to do that. Label values at padding positions are set to -100 by default (i.e., `CrossEntropyLossFlat().ignore_index`), which prevents the cross entropy loss from scoring predictions for tokens it should ignore, i.e., the padding tokens. For more information on the meaning of this argument, see the Hugging Face glossary entry for "Labels".
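To make the -100 convention concrete, here is a small, self-contained PyTorch check (not blurr-specific) showing that positions labeled -100 contribute nothing to the cross entropy loss:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, 10)               # batch of 1, sequence of 4 tokens, vocab of 10
labels = torch.tensor([[2, 5, -100, -100]])  # last two positions (e.g., padding) are ignored

loss_all = F.cross_entropy(logits.view(-1, 10), labels.view(-1))  # ignore_index defaults to -100
loss_first_two = F.cross_entropy(logits[:, :2].reshape(-1, 10), labels[:, :2].reshape(-1))
print(torch.isclose(loss_all, loss_first_two))  # tensor(True): the -100 positions never enter the loss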
model_cls = AutoModelForCausalLM
pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if hf_tokenizer.pad_token is None:
    hf_tokenizer.pad_token = "[PAD]"
preprocessor = LMPreprocessor(hf_tokenizer, chunk_size=128, text_attr=0)
proc_df = preprocessor.process_df(train_df, valid_df)
print(len(proc_df))
proc_df.head(2)
batch_tok_tfm = LMBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=CausalLMStrategy)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=CausalLMTextInput), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader("proc_0"), splitter=ColSplitter(col="is_valid"))
dls = dblock.dataloaders(proc_df, bs=4)
b = dls.one_batch()
b[0]["input_ids"].shape, b[0]["labels"].shape, b[1].shape
explode_types(b)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
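As a hedged sanity check on the causal LM batch (the label handling described here follows the usual Hugging Face convention and is inferred rather than spelled out above): GPT-2 shifts the labels internally, so "labels" is expected to mirror "input_ids", with any padding positions set to -100.

# hedged check: decode a few input ids and count how many label positions the loss will ignore
inp_ids, lbls = b[0]["input_ids"][0], b[0]["labels"][0]
print(hf_tokenizer.decode(inp_ids[:20]))
print((lbls == -100).sum().item(), "ignored (padding) positions out of", len(lbls))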
model_cls = AutoModelForMaskedLM
pretrained_model_name = "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = NLP.get_hf_objects(pretrained_model_name, model_cls=model_cls)
# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if hf_tokenizer.pad_token is None:
    hf_tokenizer.pad_token = "[PAD]"
preprocessor = LMPreprocessor(hf_tokenizer, chunk_size=128, text_attr=0)
proc_df = preprocessor.process_df(train_df, valid_df)
print(len(proc_df))
proc_df.head(2)
batch_tok_tfm = LMBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=BertMLMStrategy)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=MLMTextInput), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader("proc_0"), splitter=ColSplitter(col="is_valid"))
dls = dblock.dataloaders(proc_df, bs=4)
b = dls.one_batch()
b[0]["input_ids"].shape, b[0]["labels"].shape, b[1].shape
b[0]["input_ids"][0][:20], b[0]["labels"][0][:20], b[1][0][:20]
explode_types(b)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=250)
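And a similar hedged check for the MLM batch: under the BERT-style strategy, "labels" is expected to hold the original token ids only at the positions selected for prediction, with every other position set to -100 (the masked input count can be lower, since BERT's strategy also leaves some selected tokens unchanged or swaps in random ones).

# hedged check: compare the number of [MASK]-ed inputs with the number of scored label positions
inp_ids, lbls = b[0]["input_ids"][0], b[0]["labels"][0]
print((inp_ids == hf_tokenizer.mask_token_id).sum().item(), "positions replaced with", hf_tokenizer.mask_token)
print((lbls != -100).sum().item(), "label positions that contribute to the loss")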