---
title: text.utils
keywords: fastai
sidebar: home_sidebar
summary: "Various text specific utility classes/functions"
description: "Various text specific utility classes/functions"
nb_path: "nbs/01_text-utils.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}
What we're running with at the time this documentation was generated:
torch: 1.10.1+cu111
fastai: 2.5.6
transformers: 4.16.2
{% endraw %} {% raw %}
{% endraw %} {% raw %}

BlurrText[source]

BlurrText(*args, **kwargs)

{% endraw %}

BlurrText is a Singleton (there exists only one instance, and the same instance is returned upon subsequent instantiation requests). You can get at it via the NLP constant below.

{% raw %}
mh = BlurrText()
mh2 = BlurrText()
test_eq(mh, mh2)
{% endraw %}
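For reference, here is a minimal sketch of the singleton pattern described above. This is illustrative only, not blurr's actual Singleton implementation: overriding `__new__` so that repeated construction hands back the same object.

{% raw %}
# Illustrative sketch of the singleton pattern (NOT blurr's actual
# implementation): every call to the constructor returns the same instance.
class SingletonSketch:
    _instance = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a, b = SingletonSketch(), SingletonSketch()
test_eq(a, b)
{% endraw %}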

Provide a global helper constant

Users of this library can simply use NLP to access all the BlurrText capabilities without having to fetch an instance themselves.

{% raw %}
{% endraw %}
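For example, the following check should hold, since BlurrText always returns the same instance (a quick sanity check, assuming NLP was created via BlurrText() as described above):

{% raw %}
# NLP is just the shared BlurrText instance, so constructing another
# BlurrText yields the very same object.
test_eq(NLP, BlurrText())
{% endraw %}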

Here's how you can get at the core Hugging Face objects you need to work with ...

... the task

{% raw %}

BlurrText.get_tasks[source]

BlurrText.get_tasks(arch:str=None)

This method can be used to get a list of all the tasks supported by your transformers install, or just those available to a specific architecture.

{% endraw %} {% raw %}
print(NLP.get_tasks())
print("")
print(NLP.get_tasks("bart"))
['AudioFrameClassification', 'CTC', 'CausalImageModeling', 'CausalLM', 'Classification', 'ConditionalGeneration', 'EntityClassification', 'EntityPairClassification', 'EntitySpanClassification', 'Generation', 'ImageAndTextRetrieval', 'ImageClassification', 'ImageClassificationConvProcessing', 'ImageClassificationFourier', 'ImageClassificationLearned', 'ImagesAndTextClassification', 'LMHead', 'LMHeadModel', 'MaskedImageModeling', 'MaskedLM', 'MultimodalAutoencoding', 'MultipleChoice', 'NextSentencePrediction', 'OpenQA', 'OpticalFlow', 'PreTraining', 'QuestionAnswering', 'QuestionAnsweringSimple', 'RegionToPhraseAlignment', 'SemanticSegmentation', 'SequenceClassification', 'Teacher', 'TokenClassification', 'VisualReasoning', 'XVector', 'merLayer', 'merModel', 'merPreTrainedModel']

['CausalLM', 'ConditionalGeneration', 'QuestionAnswering', 'SequenceClassification']
{% endraw %}

... the architecture

{% raw %}

BlurrText.get_architectures[source]

BlurrText.get_architectures()

{% endraw %} {% raw %}
print(NLP.get_architectures())
['albert', 'bart', 'barthez', 'bartpho', 'beit', 'bert', 'bert_generation', 'bert_japanese', 'bertweet', 'big_bird', 'bigbird_pegasus', 'blenderbot', 'blenderbot_small', 'byt5', 'camembert', 'canine', 'clip', 'convbert', 'cpm', 'ctrl', 'deberta', 'deberta_v2', 'deit', 'detr', 'distilbert', 'dpr', 'electra', 'encoder_decoder', 'flaubert', 'fnet', 'fsmt', 'funnel', 'gpt2', 'gpt_neo', 'gptj', 'herbert', 'hubert', 'ibert', 'imagegpt', 'layoutlm', 'layoutlmv2', 'layoutxlm', 'led', 'longformer', 'luke', 'lxmert', 'm2m_100', 'marian', 'mbart', 'mbart50', 'megatron_bert', 'mluke', 'mmbt', 'mobilebert', 'mpnet', 'mt5', 'nystromformer', 'openai', 'pegasus', 'perceiver', 'phobert', 'prophetnet', 'qdqbert', 'rag', 'realm', 'reformer', 'rembert', 'retribert', 'roberta', 'roformer', 'segformer', 'sew', 'sew_d', 'speech_encoder_decoder', 'speech_to_text', 'speech_to_text_2', 'splinter', 'squeezebert', 'swin', 't5', 'tapas', 'transfo_xl', 'trocr', 'unispeech', 'unispeech_sat', 'vilt', 'vision_encoder_decoder', 'vision_text_dual_encoder', 'visual_bert', 'vit', 'vit_mae', 'wav2vec2', 'wav2vec2_phoneme', 'wavlm', 'xlm', 'xlm_prophetnet', 'xlm_roberta', 'xlnet', 'yoso']
{% endraw %} {% raw %}

BlurrText.get_model_architecture[source]

BlurrText.get_model_architecture(model_name_or_enum)

Get the architecture for a given model name / enum

{% endraw %} {% raw %}
print(NLP.get_model_architecture("RobertaForSequenceClassification"))
roberta
{% endraw %}

... and lastly the models (optionally for a given task and/or architecture)

{% raw %}

BlurrText.get_models[source]

BlurrText.get_models(arch:str=None, task:str=None)

The transformer models available for use (optional: by architecture | task)

{% endraw %} {% raw %}
print(L(NLP.get_models())[:5])
['AdaptiveEmbedding', 'AlbertForMaskedLM', 'AlbertForMultipleChoice', 'AlbertForPreTraining', 'AlbertForQuestionAnswering']
{% endraw %} {% raw %}
print(NLP.get_models(arch="bert")[:5])
['BertForMaskedLM', 'BertForMultipleChoice', 'BertForNextSentencePrediction', 'BertForPreTraining', 'BertForQuestionAnswering']
{% endraw %} {% raw %}
print(NLP.get_models(task="TokenClassification")[:5])
['AlbertForTokenClassification', 'BertForTokenClassification', 'BigBirdForTokenClassification', 'CamembertForTokenClassification', 'CanineForTokenClassification']
{% endraw %} {% raw %}
print(NLP.get_models(arch="bert", task="TokenClassification"))
['BertForTokenClassification']
{% endraw %}
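Since get_models returns plain class-name strings, you can feed its results straight back into get_model_architecture. A small sketch tying the two helpers together:

{% raw %}
# Pick the first model matching an architecture/task pair, then recover
# its architecture from the class name alone.
model_name = NLP.get_models(arch="bert", task="TokenClassification")[0]
print(NLP.get_model_architecture(model_name))  # -> bert
{% endraw %}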

Here we define some helpful enums to make it easier to get at the task and architecture you're looking for.

{% raw %}
{% endraw %} {% raw %}
print("--- all tasks ---")
print(L(HF_TASKS))
--- all tasks ---
[<HF_TASKS_ALL.AudioFrameClassification: 1>, <HF_TASKS_ALL.CTC: 2>, <HF_TASKS_ALL.CausalImageModeling: 3>, <HF_TASKS_ALL.CausalLM: 4>, <HF_TASKS_ALL.Classification: 5>, <HF_TASKS_ALL.ConditionalGeneration: 6>, <HF_TASKS_ALL.EntityClassification: 7>, <HF_TASKS_ALL.EntityPairClassification: 8>, <HF_TASKS_ALL.EntitySpanClassification: 9>, <HF_TASKS_ALL.Generation: 10>, <HF_TASKS_ALL.ImageAndTextRetrieval: 11>, <HF_TASKS_ALL.ImageClassification: 12>, <HF_TASKS_ALL.ImageClassificationConvProcessing: 13>, <HF_TASKS_ALL.ImageClassificationFourier: 14>, <HF_TASKS_ALL.ImageClassificationLearned: 15>, <HF_TASKS_ALL.ImagesAndTextClassification: 16>, <HF_TASKS_ALL.LMHead: 17>, <HF_TASKS_ALL.LMHeadModel: 18>, <HF_TASKS_ALL.MaskedImageModeling: 19>, <HF_TASKS_ALL.MaskedLM: 20>, <HF_TASKS_ALL.MultimodalAutoencoding: 21>, <HF_TASKS_ALL.MultipleChoice: 22>, <HF_TASKS_ALL.NextSentencePrediction: 23>, <HF_TASKS_ALL.OpenQA: 24>, <HF_TASKS_ALL.OpticalFlow: 25>, <HF_TASKS_ALL.PreTraining: 26>, <HF_TASKS_ALL.QuestionAnswering: 27>, <HF_TASKS_ALL.QuestionAnsweringSimple: 28>, <HF_TASKS_ALL.RegionToPhraseAlignment: 29>, <HF_TASKS_ALL.SemanticSegmentation: 30>, <HF_TASKS_ALL.SequenceClassification: 31>, <HF_TASKS_ALL.Teacher: 32>, <HF_TASKS_ALL.TokenClassification: 33>, <HF_TASKS_ALL.VisualReasoning: 34>, <HF_TASKS_ALL.XVector: 35>, <HF_TASKS_ALL.merLayer: 36>, <HF_TASKS_ALL.merModel: 37>, <HF_TASKS_ALL.merPreTrainedModel: 38>]
{% endraw %} {% raw %}
HF_TASKS.Classification
<HF_TASKS_ALL.Classification: 5>
{% endraw %} {% raw %}
{% endraw %} {% raw %}
print(L(HF_ARCHITECTURES)[:5])
[<HF_ARCHITECTURES.albert: 1>, <HF_ARCHITECTURES.bart: 2>, <HF_ARCHITECTURES.barthez: 3>, <HF_ARCHITECTURES.bartpho: 4>, <HF_ARCHITECTURES.beit: 5>]
{% endraw %}
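The enums are standard Python Enums, so a member's .name attribute gives you the string the lookup methods above expect. A sketch, assuming get_models takes the task as a string as shown earlier:

{% raw %}
# Use an enum member's .name to avoid typo-prone task strings.
task = HF_TASKS.TokenClassification
print(NLP.get_models(arch="bert", task=task.name))  # ['BertForTokenClassification']
{% endraw %}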

... and to get all your Hugging Face objects (arch, config, tokenizer, and model)

{% raw %}

BlurrText.get_hf_objects[source]

BlurrText.get_hf_objects(pretrained_model_name_or_path:Union[str, PathLike, NoneType], model_cls:PreTrainedModel, config:Union[PretrainedConfig, str, PathLike]=None, tokenizer_cls:PreTrainedTokenizerBase=None, config_kwargs:dict={}, tokenizer_kwargs:dict={}, model_kwargs:dict={}, cache_dir:Union[str, PathLike]=None)

Given at minimum a pretrained_model_name_or_path and a model_cls (such as AutoModelForSequenceClassification), this method returns all the Hugging Face objects you need to train a model using Blurr

{% endraw %}

How to use:

{% raw %}
from transformers import logging

logging.set_verbosity_error()
{% endraw %} {% raw %}
from transformers import AutoModelForMaskedLM

arch, config, tokenizer, model = NLP.get_hf_objects("bert-base-cased-finetuned-mrpc", model_cls=AutoModelForMaskedLM)

print(arch)
print(type(config))
print(type(tokenizer))
print(type(model))
bert
<class 'transformers.models.bert.configuration_bert.BertConfig'>
<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
<class 'transformers.models.bert.modeling_bert.BertForMaskedLM'>
{% endraw %} {% raw %}
from transformers import AutoModelForQuestionAnswering

arch, tokenizer, config, model = NLP.get_hf_objects("fmikaelian/flaubert-base-uncased-squad", model_cls=AutoModelForQuestionAnswering)

print(arch)
print(type(config))
print(type(tokenizer))
print(type(model))
flaubert
<class 'transformers.models.flaubert.configuration_flaubert.FlaubertConfig'>
<class 'transformers.models.flaubert.tokenization_flaubert.FlaubertTokenizer'>
<class 'transformers.models.flaubert.modeling_flaubert.FlaubertForQuestionAnsweringSimple'>
{% endraw %} {% raw %}
from transformers import BertTokenizer, BertForNextSentencePrediction

arch, config, tokenizer, model = NLP.get_hf_objects(
    "bert-base-cased-finetuned-mrpc", config=None, tokenizer_cls=BertTokenizer, model_cls=BertForNextSentencePrediction
)
print(arch)
print(type(config))
print(type(tokenizer))
print(type(model))
bert
<class 'transformers.models.bert.configuration_bert.BertConfig'>
<class 'transformers.models.bert.tokenization_bert.BertTokenizer'>
<class 'transformers.models.bert.modeling_bert.BertForNextSentencePrediction'>
{% endraw %}
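The *_kwargs arguments let you forward extra options to the underlying from_pretrained calls. A hedged sketch: the kwargs shown below (output_hidden_states and model_max_length) are standard Hugging Face options, but that blurr passes them through unchanged is an assumption here.

{% raw %}
# Forward extra options to the underlying Hugging Face from_pretrained calls.
# The specific kwargs are standard Hugging Face options; that they pass
# through unchanged is an assumption of this sketch.
arch, config, tokenizer, model = NLP.get_hf_objects(
    "bert-base-cased-finetuned-mrpc",
    model_cls=AutoModelForMaskedLM,
    config_kwargs={"output_hidden_states": True},
    tokenizer_kwargs={"model_max_length": 256},
)
print(config.output_hidden_states)  # True
{% endraw %}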