---
title: data.summarization
keywords: fastai
sidebar: home_sidebar
summary: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for summarization tasks using architectures like BART and T5."
description: "This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for summarization tasks using architectures like BART and T5."
nb_path: "nbs/01e_data-summarization.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti
{% endraw %}

Summarization tokenization, batch transform, and DataBlock methods

Summarization tasks attempt to generate a human-understandable and sensible representation of a larger body of text (e.g., capture the meaning of a larger document in 1-3 sentences).

{% raw %}
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv'); len(cnndm_df)
1000
{% endraw %} {% raw %}
cnndm_df.head(2)
article highlights ds_type
0 (CNN) -- Globalization washes like a flood over the world's cultures and economies. Floods can be destructive; however, they can also bring blessings, as the annual floods of the Nile did for ancient Egypt. The world's great universities can be crucial instruments in shaping, in a positive way, humankind's reaction to globalization and the development of humankind itself. Traditionally, universities have been defined and limited by location, creating an academic community and drawing students and scholars to that place. Eventually, some universities began to encourage students to study el... John Sexton: Traditionally, universities have been defined and limited by location .\nGlobal campuses form a network of thought, innovation, he writes .\nFaculty can teach, Sexton says, students can team up in many cities at once .\nSexton: Research, scholarship can be shared and cultural ties made in "century of knowledge" train
1 (CNN) -- Armenian President Robert Kocharian declared a state of emergency Saturday night after a day of clashes between police and protesters, a spokeswoman for the Armenian Foreign Ministry said. Opposition supporters wave an Armenian flag during a protest rally in Yerevan, Armenia, on Saturday. The protesters claim last month's presidential election was rigged. The state of emergency will "hopefully bring some order" to the capital, Yerevan, said Salpi Ghazarian, assistant to the Armenian foreign minister, who spoke to CNN early Sunday. The state of emergency could last until March 20, ... NEW: Protest moves after crackdown at Freedom Square .\nOrder sought after protests over last month's election turn violent .\nDemonstrators say the election was fraudulent .\nState of emergency could last until March 20, official says . train
{% endraw %} {% raw %}
pretrained_model_name = "facebook/bart-large-cnn"

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               model_cls=BartForConditionalGeneration)

hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)
('bart',
 transformers.tokenization_bart.BartTokenizer,
 transformers.configuration_bart.BartConfig,
 transformers.modeling_bart.BartForConditionalGeneration)
{% endraw %} {% raw %}
{% endraw %} {% raw %}

class HF_SummarizationInput[source]

HF_SummarizationInput(x, **kwargs) :: HF_BaseInput

{% endraw %}

We create a subclass of HF_BatchTransform for summarization tasks to add decoder_input_ids and labels to our inputs during training, which in turn allows the Hugging Face model to calculate the loss for us. See here for more information on how these additional inputs are used in summarization and conversational training tasks.

Note also that labels is simply target_ids shifted to the right by one, since the task is to predict the next token based on the current (and all previous) decoder_input_ids.

And lastly, we also update our targets to be just the input_ids of our target sequence so that fastai's Learner.show_results works (again, almost all the fastai bits require a single tensor to be returned in order to work).
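The shift-right relationship described above can be sketched in plain Python. This is only an illustration (the token values are made up, and the helper mirrors but is not the actual Hugging Face internal):

```python
# A minimal sketch of how the decoder inputs relate to the labels: the decoder
# starts from a special start token, and at each position i it reads token i of
# the shifted sequence while being asked to predict token i of the labels.
def shift_tokens_right(token_ids, decoder_start_token_id):
    """Prepend the decoder start token and drop the final target id."""
    return [decoder_start_token_id] + token_ids[:-1]

labels = [713, 16, 10, 4180, 2]                    # hypothetical target ids
decoder_input_ids = shift_tokens_right(labels, 0)  # -> [0, 713, 16, 10, 4180]
```

Because the two sequences are offset by one position, the model's prediction at step i is scored against the very next target token, which is exactly the loss the Hugging Face model computes for us when both tensors are supplied.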

{% raw %}
{% endraw %} {% raw %}

class HF_SummarizationBatchTransform[source]

HF_SummarizationBatchTransform(hf_arch, hf_tokenizer, max_length=None, padding=True, truncation=True, is_split_into_words=False, n_tok_inps=2, hf_input_return_type=HF_SummarizationInput, tok_kwargs={}, **kwargs) :: HF_BatchTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

{% endraw %}

We had to override the decodes method above because, while our inputs and targets are technically the same kind of thing, we update the latter to consist of only the target input_ids so that methods like Learner.show_results work.
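The idea behind that override can be sketched without any fastai machinery. The function name and the dictionary below are illustrative only (this is not the real decodes signature):

```python
# Illustrative sketch: reduce the dictionary the tokenizer produces for the
# targets down to just the `input_ids`, so that fastai methods expecting a
# single tensor-like object per target keep working.
def targets_to_input_ids(tokenized_targets):
    return tokenized_targets['input_ids']

tokenized_targets = {
    'input_ids': [0, 713, 16, 2],    # hypothetical target token ids
    'attention_mask': [1, 1, 1, 1],  # dropped by the decode step
}
targets_to_input_ids(tokenized_targets)  # -> [0, 713, 16, 2]
```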

{% raw %}
blocks = (HF_TextBlock(hf_batch_tfm=HF_SummarizationBatchTransform(hf_arch, hf_tokenizer)), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader('article'), get_y=ColReader('highlights'), splitter=RandomSplitter())
{% endraw %}

Two lines! Notice we pass in noop for our targets (e.g., our summaries) because the batch transform takes care of both our inputs and targets.

{% raw %}
 
{% endraw %} {% raw %}
dls = dblock.dataloaders(cnndm_df, bs=4)
{% endraw %} {% raw %}
b = dls.one_batch()
{% endraw %} {% raw %}
len(b), b[0]['input_ids'].shape, b[1].shape
(2, torch.Size([4, 1024]), torch.Size([4, 69]))
{% endraw %} {% raw %}
{% endraw %} {% raw %}
dls.show_batch(dataloaders=dls, max_n=2)
text target
0 (CNN) -- Home to up to 10 percent of all known species, Mexico is recognized as one of the most biodiverse regions on the planet. The twin threats of climate change and human encroachment on natural environments are, however, threatening the existence of the country's rich wildlife. And there is a great deal to lose. In the United Nations Environment Program (UNEP) World Conservation Monitoring Centre's list of megadiverse countries Mexico ranks 11th. The list represents a group of 17 countries that harbor the majority of the Earth's species and are therefore considered extremely biodiverse. From its coral reefs in the Caribbean Sea to its tropical jungles in Chiapas and the Yucatan peninsula and its deserts and prairies in the north, Mexico boasts an incredibly rich variety of flora and fauna. Some 574 out of 717 reptile species found in Mexico -- the most in any country -- can only be encountered within its borders. It is home to 502 types of mammals, 290 species of birds, 1,150 varieties of birds and 26,000 classifications of plants. Pronatura, a non-profit organization that works to promote conservation and sustainable development in Mexico, has selected six species which it says symbolize the problems faced by the destruction of nature. "These are only some of the species which have some degree of conservation," says Eduardo Cota Corona, Director of Conservation at Pronatura. "However, there is a countless number of species in Mexico which find themselves in danger of extinction." Golden Eagle. It is the country's national symbol yet the Golden Eagle is close to extinction in Mexico. One of the largest raptors or birds of prey in the world, the Golden Eagle's wingspan can reach lengths greater than two metres. Only the Bald Eagle and the California Greater exceed it in size in North America. 
With its powerful hooked bill and long and sharp claws it can sometimes capture prey of a size that is surprising for its size, including crane, wild ungulates and domestic livestock, though more often than not it tends to feed off small mammals such as rabbits, hares, ground squirrels and prairie dogs as well as reptiles and small-to-medium sized birds. Primarily a solitary bird, the Golden Eagle pairs up to breed, building nests made of dry branches in cliffs and escarpments. The female typically lays two eggs which are incubated by both the male and female. Usually, only one of the hatchlings survives. The Golden Eagle can be found in Asia and Europe and mainly in the western part of North America. It was common in Mexico but in recent years has become a rare sight. Its demise has been attributed to the destruction of its habitat and the elimination of its natural prey. Human activity, in the form of hunting, capturing and commercial sale have also contributed to its decline. Pronatura has lobbied for legal protection of this bird that forms part of Mexico's flag and has launched conservation projects in its natural habitat, such as in the Cumbres de Monterrey National Park and the Cuatro CiƩnegas Biosphere Reserve. Gray Whale. Pachico Mayoral, a Mexican fisherman form Baja California, claims to be the first person to have a friendly encounter with a gray whale. Up until then this enormous cetacean -- an adult can reach a length of 16 meters and weigh in at 36 tons - had been known as the devil fish for its aggressive behavior when hunted. The main group of gray whales is found in the northeastern Pacific. Each year a herd of 25,000 whales sets out on what is believed to be the longest migration in the animal kingdom - 12,500 miles - between their feeding grounds in the Bering and Chukchi seas off Alaska and their breeding territory in the warmer waters of the lagoons of Baja California. 
Over its lifetime, it is estimated that an Eastern Pacific gray whale will travel the equivalent of a return trip to the moon. A smaller herd of about 300 gray whales can be found in the Western Pacific between Korea and the Kamchatka peninsula in Russia. Excessive hunting in the 19th century pushed the gray whale to the brink of extinction but protection mandated by the International Whaling Commission in 1946 and the declaration by the Mexican government of Laguna San Ignacio in 1972 as a Gray Whale refuge means that it is one of the few success stories. Pronatura and the Aztec foundation have raised nearly $4 million with which they hope to guarantee the protection of 20,000 hectares of the gray whale's habitat in Baja California and ensure its survival in the years to come. Jaguar. It may be top of the food chain but this doesn't guarantee the survival of the jaguar in Mexico. The largest cat in the Western Hemisphere (it's nearest rival is the puma), the jaguar can be found anywhere from the southern United States to as far south as northern Argentina. In Mexico, it can be found mainly in the tropical forests of Chiapas and the Yucatan Mexico hosts to up to 10 percent of all known species on Earth.\nIt is home to 502 types of mammals, 290 bird species and 26,000 types of plants.\nHuman development and climate change is placing a big strain on its biodiversity.\nThe Golden Eagle is under threat in spite of being the country's national symbol.
1 I have an uncle who has always been a robust and healthy guy. He drank a glass of skim milk every day, bragged about how many pull-ups he was doing and fit into pants he was wearing 20 years before. He didn't take a single medication and retired early. Given that he had no medical problems and ran his own business, he opted to go several years without health insurance. Eventually, when he turned 65, he picked up Medicare. What happened next was a little strange. He fell off the wagon. He exercised only sporadically, and paid hardly any attention to what he was eating. One day, I saw him eat an entire bag of potato chips. He bemoaned the fact that he was forced to buy new, bigger pants, and he stopped drinking his milk. For him, becoming newly insured had nearly the opposite effect on him of what we doctors hope to achieve. He'd become unhealthier. In many ways, my uncle was demonstrating a concept known as the moral hazard. Two economists wrote about this exact scenario in 2006. They found that many men, at the time they obtained Medicare, started behaving badly. Moral, or morale, hazard is a term largely used by economists to describe the actions of people more willing to take risks because they are insulated from the cost of their actions, in this case because of their recently obtained health insurance. In the case of these men, when they got Medicare, they took worse care of themselves; they actually exercised less. Among those who didn't visit the doctor after getting insurance, the effect was dramatic: Their overall physical activity dropped by 40%; they were 16% more likely to smoke cigarettes and 32% more likely to drink alcohol. Even if that seems extreme, it's still worth asking: Does health insurance make us healthier? The past five years have seen a tumultuous battle over Obamacare, or the Affordable Care Act, culminating in the bitter recriminations this fall over lost policies and the disastrous launch of the HealthCare.gov website. 
When I interviewed Health and Human Services Secretary Kathleen Sebelius at the end of October, she downplayed the concerns and seemed certain the site would be up and running by the end of November. The website may be working better now, but to me that's not the most important issue. In my mind, the real suspense comes from whether Obamacare will really make us a healthier America, even if it succeeds in its ambitions to dramatically expand coverage. A healthier America: That is the goal we should share as Americans, but access alone won't get us anywhere close. This past spring, the New England Journal of Medicine followed up on an important experiment in Oregon. The state created a remarkable strategy to do a minimal expansion of Medicaid. It decided to conduct random lottery drawings to allocate the limited spots. While it was controversial in its implementation, the situation was a goldmine for researchers. It offered something very rare in these types of studies: a unique opportunity for researchers to compare the newly insured to their highly similar counterparts, who remained uninsured. The results were surprising, and mostly disappointing. The newly insured Medicaid population did go to the doctor more often, used more preventive health services and received more medications. Problem was, in nearly every area, they weren't any healthier. The scientists sat down with more than 12,000 people and compared some of the most important health indicators. They found having insurance did not improve measures of blood pressure, cholesterol or how well diabetics controlled their blood sugar. Furthermore, the 10-year risk of having a heart attack didn't change in those who had Medicaid. It wasn't at all what the proponents of universal access to health insurance hoped they would see. The results remind me of a column I wrote a few years ago, shortly after my own marriage. 
It seemed like a good time to explore the question of whether marriage was in fact good for one's health. I spent a fair amount of time researching the topic and one of the experts I interviewed gave me an answer I have never forgotten: Marriage is good for your health (long pause)... as long as it's a good marriage. It was a terrific answer, and a metaphor for so many aspects of our lives. As you might imagine, I had quite a bit of fun with that article on marriage, but it taught an important lesson. There is almost always a second beat to any story. Being married all by itself isn't necessarily good or bad for your health. It was the effort required to make it a good or bad marriage that made up the entire difference. The same can be said about health insurance. In this case, I don't mean that "good" or "bad" insurance is the critical factor, but that health insurance alone doesn't lead to better health. None of this works unless we all take personal responsibility, and hold ourselves accountable. To be clear, there will always be some baseline benefit to being insured versus not being insured, even if you account for the moral hazard. A major Institute of Medicine report in 2009 found that uninsured adults are Sanjay Gupta: Moral hazard causes some to neglect health when they get health insurance.\nHe says Obamacare alone won't guarantee good health; personal habits must do that.\nHe says research shows 30 minutes of daily exercise cuts heart attack, stroke risk by a third.\nGupta: It's time to stop playing defense on your health; instead, start optimizing it yourself.
{% endraw %}

Tests

The tests below ensure the core DataBlock code above works for all pretrained summarization models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.

Note: Feel free to modify the code below to test whatever pretrained summarization models you are working with ... and if any of your pretrained summarization models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).

{% raw %}
BLURR_MODEL_HELPER.get_models(task='ConditionalGeneration')
[transformers.modeling_bart.BartForConditionalGeneration,
 transformers.modeling_blenderbot.BlenderbotForConditionalGeneration,
 transformers.modeling_fsmt.FSMTForConditionalGeneration,
 transformers.modeling_mbart.MBartForConditionalGeneration,
 transformers.modeling_pegasus.PegasusForConditionalGeneration,
 transformers.modeling_prophetnet.ProphetNetForConditionalGeneration,
 transformers.modeling_t5.T5ForConditionalGeneration,
 transformers.modeling_xlm_prophetnet.XLMProphetNetForConditionalGeneration]
{% endraw %} {% raw %}
pretrained_model_names = [
    ('facebook/bart-base',BartForConditionalGeneration),
    ('t5-small', T5ForConditionalGeneration),
    ('google/pegasus-cnn_dailymail', PegasusForConditionalGeneration)
]
{% endraw %} {% raw %}
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv')
{% endraw %} {% raw %}
#hide_output
task = HF_TASKS_ALL.ConditionalGeneration
bsz = 2
seq_sz = 256
trg_seq_sz = 40

test_results = []
for model_name, model_cls in pretrained_model_names:
    error=None
    
    print(f'=== {model_name} ===\n')
    
    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(model_name, 
                                                                                   task=task, 
                                                                                   model_cls=model_cls)
    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\n')
    
    hf_batch_tfm = HF_SummarizationBatchTransform(hf_arch, hf_tokenizer, 
                                                  padding='max_length', max_length=[seq_sz, trg_seq_sz])

    blocks = ( 
        HF_TextBlock(hf_arch, hf_tokenizer, hf_batch_tfm=hf_batch_tfm), 
        noop
    )

    def add_t5_prefix(inp): return f'summarize: {inp}' if (hf_arch == 't5') else inp

    dblock = DataBlock(blocks=blocks, 
                       get_x=Pipeline([ColReader('article'), add_t5_prefix]), 
                       get_y=ColReader('highlights'), 
                       splitter=RandomSplitter())

    dls = dblock.dataloaders(cnndm_df, bs=bsz) 
    b = dls.one_batch()
    
    try:
        print('*** TESTING DataLoaders ***\n')
        test_eq(len(b), 2)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, seq_sz]))
        test_eq(len(b[1]), bsz)
        test_eq(b[1].shape, torch.Size([bsz, trg_seq_sz]))

        if (hasattr(hf_tokenizer, 'add_prefix_space')):
            test_eq(dls.before_batch[0].tok_kwargs['add_prefix_space'], True)
            
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'PASSED', ''))
        dls.show_batch(dataloaders=dls, max_n=2)
        
    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'FAILED', err))
{% endraw %} {% raw %}
| | arch | tokenizer | model_name | result | error |
|---|---|---|---|---|---|
| 0 | bart | BartTokenizer | facebook/bart-base | PASSED | |
| 1 | t5 | T5Tokenizer | t5-small | PASSED | |
| 2 | pegasus | PegasusTokenizer | google/pegasus-cnn_dailymail | PASSED | |
{% endraw %}

Cleanup