Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization
A fine-tuned version of allenai/led-large-16384 on the BookSum dataset.
Goal: a model that generalizes well and is useful for summarizing long text in academic and everyday use. It works well on a wide variety of text and can handle inputs of up to 16384 tokens per batch (provided you have the GPU memory for it).
- See the Colab demo linked above or try the demo on Spaces
Note: the hosted inference API is set to generate a maximum of 64 tokens for runtime reasons, so summaries may be truncated depending on the length of the input text. For best results, use Python as shown below.
Usage - Basic
- Use encoder_no_repeat_ngram_size=3 when calling the pipeline object to improve summary quality. This forces the model to use new vocabulary and create an abstractive summary; otherwise, it may compile the best extractive summary from the input provided.
Load the model into a pipeline object:
import torch
from transformers import pipeline

hf_name = 'pszemraj/led-large-book-summary'

# load the summarization pipeline; run on GPU 0 if one is available, otherwise CPU
summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)
- Pass your text to the pipeline object:
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)
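The pipeline returns a list of dictionaries, one per input, with the generated text stored under the summary_text key. Continuing from the snippet above:

print(result[0]["summary_text"])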
Note: the global attention mask should be set when generating to produce the best-quality summaries:
🤗 20-lines of code to reproduce SOTA on Arxiv with Longformer Encoder-Decoder (LED) 🤗
- https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing
- See the generate_answer function in that notebook for more details (note the use of beam search); it is pasted below for easy reference.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

hf_name = "pszemraj/led-large-book-summary"
tokenizer = AutoTokenizer.from_pretrained(hf_name)
model = AutoModelForSeq2SeqLM.from_pretrained(hf_name).to("cuda")

def generate_answer(batch):
    # tokenize up to the model's full 16384-token context window
    inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=16384, return_tensors="pt", truncation=True)
    input_ids = inputs_dict.input_ids.to("cuda")
    attention_mask = inputs_dict.attention_mask.to("cuda")
    # put global attention on the <s> token only
    global_attention_mask = torch.zeros_like(attention_mask)
    global_attention_mask[:, 0] = 1
    # beam search decoding, capped at 512 generated tokens
    predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=512, num_beams=4)
    batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
    return batch
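A rough sketch of applying the function above with the datasets library; the Hub id kmfoda/booksum, the split name, and the column rename are assumptions (BookSum stores the input under chapter, while generate_answer reads article):

from datasets import load_dataset

test_set = load_dataset("kmfoda/booksum", split="test")
test_set = test_set.rename_column("chapter", "article")
summaries = test_set.map(generate_answer, batched=True, batch_size=2)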
Training and evaluation data
- The model was fine-tuned on the BookSum dataset
- During training, the input text was the chapter text and the target output was summary_text (a quick way to inspect a training pair is sketched after this list)
- Eval results can be found here with metrics on the sidebar.
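For reference, a quick way to look at one training pair; the column names chapter and summary_text come from the description above, while the Hub id and split name are assumptions:

from datasets import load_dataset

train_set = load_dataset("kmfoda/booksum", split="train")
example = train_set[0]
print(example["chapter"][:500])   # model input during training
print(example["summary_text"])    # training target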
Training procedure
- Training completed on the BookSum dataset for 13 total epochs
- The final four epochs combined the training and validation sets as 'train' in an effort to increase generalization.
Training hyperparameters
Initial Three Epochs
The following hyperparameters were used during training (a rough mapping to Seq2SeqTrainingArguments is sketched after the list):
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
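For illustration, a sketch of how these settings might be expressed as Seq2SeqTrainingArguments in transformers; the output_dir is an assumption, and the Adam betas/epsilon listed above are the library defaults:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="led-large-book-summary-ft",  # assumption: not specified in the card
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    seed=42,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    # adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8 are already the defaults
)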
In-between Epochs
Unfortunately, complete records for the middle epochs are not on hand; the following should be representative:
- learning_rate: 4e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 6 (in addition to prior model)
Final Two Epochs
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 16
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 2 (in addition to prior model)
Framework versions
- Transformers 4.19.2
- Pytorch 1.11.0+cu113
- Datasets 2.2.2
- Tokenizers 0.12.1
Evaluation results
- ROUGE-1 on kmfoda/booksum test set (self-reported): 31.731
- ROUGE-2 on kmfoda/booksum test set (self-reported): 5.331
- ROUGE-L on kmfoda/booksum test set (self-reported): 16.146
- ROUGE-LSUM on kmfoda/booksum test set (self-reported): 29.088
- loss on kmfoda/booksum test set (self-reported): 4.816
- gen_len on kmfoda/booksum test set (self-reported): 154.904
- ROUGE-1 on samsum test set (self-reported): 33.448
- ROUGE-2 on samsum test set (self-reported): 10.425
- ROUGE-L on samsum test set (self-reported): 24.580
- ROUGE-LSUM on samsum test set (self-reported): 29.823
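These figures are self-reported. As a hedged sketch (not from the original card), similar ROUGE scores could be recomputed with the evaluate library, reusing the summaries object from the map() example above:

import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=summaries["predicted_abstract"],
    references=summaries["summary_text"],
)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum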