Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization
A fine-tuned version of allenai/led-large-16384 on the BookSum dataset.
Goal: a model that generalizes well and is useful for summarizing long text in academic and everyday use. It works well on a wide variety of text and can handle inputs of up to 16384 tokens per batch (provided you have the GPU memory for it).
- See the Colab demo linked above or try the demo on Spaces
Note: the hosted inference API is set to generate a maximum of 64 tokens for runtime reasons, so summaries may be truncated depending on the length of the input text. For best results, use Python as shown below.
Usage - Basic
- Use encoder_no_repeat_ngram_size=3 when calling the pipeline object to improve summary quality. This forces the model to use new vocabulary and create an abstractive summary; otherwise, it may compile the best extractive summary from the input provided.
Load the model into a pipeline object:
import torch
from transformers import pipeline

hf_name = 'pszemraj/led-large-book-summary'

# load the summarization pipeline; run on GPU 0 if one is available, otherwise CPU
summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)
- Pass your text to the pipeline object:
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)
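The pipeline returns a list of dictionaries, one per input, with the generated text stored under the summary_text key. Continuing from the snippet above:

print(result[0]["summary_text"])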
Note: the global attention mask should be set when generating to produce the best-quality summaries:
🤗 20-lines of code to reproduce SOTA on Arxiv with Longformer Encoder-Decoder (LED) 🤗
- https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing
- See the generate_answer function in that notebook for more details (note the use of beam search); it is pasted below for easy reference.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

hf_name = "pszemraj/led-large-book-summary"
tokenizer = AutoTokenizer.from_pretrained(hf_name)
model = AutoModelForSeq2SeqLM.from_pretrained(hf_name).to("cuda")

def generate_answer(batch):
    # tokenize up to the model's full 16384-token context window
    inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=16384, return_tensors="pt", truncation=True)
    input_ids = inputs_dict.input_ids.to("cuda")
    attention_mask = inputs_dict.attention_mask.to("cuda")
    # put global attention on the <s> token only
    global_attention_mask = torch.zeros_like(attention_mask)
    global_attention_mask[:, 0] = 1
    # beam search decoding, capped at 512 generated tokens
    predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=512, num_beams=4)
    batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
    return batch
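A rough sketch of applying the function above with the datasets library; the Hub id kmfoda/booksum, the split name, and the column rename are assumptions (BookSum stores the input under chapter, while generate_answer reads article):

from datasets import load_dataset

test_set = load_dataset("kmfoda/booksum", split="test")
test_set = test_set.rename_column("chapter", "article")
summaries = test_set.map(generate_answer, batched=True, batch_size=2)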
Training and evaluation data
- The model was fine-tuned on the BookSum dataset
- During training, the input text was the chapter text and the target output was summary_text (a quick way to inspect a training pair is sketched after this list)
- Eval results can be found here with metrics on the sidebar.
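For reference, a quick way to look at one training pair; the column names chapter and summary_text come from the description above, while the Hub id and split name are assumptions:

from datasets import load_dataset

train_set = load_dataset("kmfoda/booksum", split="train")
example = train_set[0]
print(example["chapter"][:500])   # model input during training
print(example["summary_text"])    # training target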
Training procedure
- Training completed on the BookSum dataset for 13 total epochs
- The final four epochs combined the training and validation sets as 'train' in an effort to increase generalization.
Training hyperparameters
Initial Three Epochs
The following hyperparameters were used during training (a rough mapping to Seq2SeqTrainingArguments is sketched after the list):
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
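For illustration, a sketch of how these settings might be expressed as Seq2SeqTrainingArguments in transformers; the output_dir is an assumption, and the Adam betas/epsilon listed above are the library defaults:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="led-large-book-summary-ft",  # assumption: not specified in the card
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    seed=42,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    # adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8 are already the defaults
)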
In-between Epochs
Unfortunately, complete records for the middle epochs are not on hand; the following should be representative:
- learning_rate: 4e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 6 (in addition to prior model)
Final Two Epochs
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 16
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 2 (in addition to prior model)
Framework versions
- Transformers 4.19.2
- Pytorch 1.11.0+cu113
- Datasets 2.2.2
- Tokenizers 0.12.1
Evaluation results
- ROUGE-1 on kmfoda/booksum test set (self-reported): 31.731
- ROUGE-2 on kmfoda/booksum test set (self-reported): 5.331
- ROUGE-L on kmfoda/booksum test set (self-reported): 16.146
- ROUGE-LSUM on kmfoda/booksum test set (self-reported): 29.088
- loss on kmfoda/booksum test set (self-reported): 4.816
- gen_len on kmfoda/booksum test set (self-reported): 154.904
- ROUGE-1 on samsum test set (self-reported): 33.448
- ROUGE-2 on samsum test set (self-reported): 10.425
- ROUGE-L on samsum test set (self-reported): 24.580
- ROUGE-LSUM on samsum test set (self-reported): 29.823
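These figures are self-reported. As a hedged sketch (not from the original card), similar ROUGE scores could be recomputed with the evaluate library, reusing the summaries object from the map() example above:

import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=summaries["predicted_abstract"],
    references=summaries["summary_text"],
)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum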