---
language: id
tags:
- pipeline:summarization
- summarization
- bert2bert
datasets:
- id_liputan6
license: apache-2.0
---
# Indonesian BERT2BERT Summarization Model

Finetuned BERT2BERT (encoder-decoder) summarization model for Indonesian, built on BERT-base checkpoints.
## Finetuning Corpus

The `bert2bert-indonesian-summarization` model is based on `cahya/bert-base-indonesian-1.5G` by [cahya](https://huggingface.co/cahya), finetuned on the [id_liputan6](https://huggingface.co/datasets/id_liputan6) dataset.
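For context, a BERT2BERT model ties a BERT encoder and a BERT decoder (with cross-attention added) into a single sequence-to-sequence model. The sketch below shows how such a model is typically initialized from the base checkpoint before finetuning, using the standard `EncoderDecoderModel` API; it illustrates the general setup, not necessarily the exact training configuration used for this model:

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Initialize encoder and decoder from the same BERT checkpoint; the decoder
# gets cross-attention layers and a causal LM head added automatically.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "cahya/bert-base-indonesian-1.5G",
    "cahya/bert-base-indonesian-1.5G",
)

tokenizer = BertTokenizer.from_pretrained("cahya/bert-base-indonesian-1.5G")

# BERT has no native BOS/EOS tokens, so reuse [CLS]/[SEP] for generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```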
## Load Finetuned Model

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("cahya/bert2bert-indonesian-summarization")
# BERT has no BOS/EOS tokens; map them to [CLS]/[SEP] for generation.
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token
model = EncoderDecoderModel.from_pretrained("cahya/bert2bert-indonesian-summarization")
```
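The `pipeline:summarization` tag suggests the model can also be driven through the high-level pipeline API. A minimal sketch (whether `EncoderDecoderModel` is supported by the summarization pipeline depends on your transformers version):

```python
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="cahya/bert2bert-indonesian-summarization",
    tokenizer="cahya/bert2bert-indonesian-summarization",
)

article = "..."  # Indonesian article text to summarize
print(summarizer(article, min_length=20, max_length=80)[0]["summary_text"])
```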
## Code Sample

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("cahya/bert2bert-indonesian-summarization")
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token
model = EncoderDecoderModel.from_pretrained("cahya/bert2bert-indonesian-summarization")

ARTICLE_TO_SUMMARIZE = ""  # fill in the Indonesian article text to summarize

# generate summary
input_ids = tokenizer.encode(ARTICLE_TO_SUMMARIZE, return_tensors='pt')
summary_ids = model.generate(input_ids,
                             min_length=20,
                             max_length=80,
                             num_beams=10,
                             repetition_penalty=2.5,
                             length_penalty=1.0,
                             early_stopping=True,
                             no_repeat_ngram_size=2,
                             use_cache=True,
                             do_sample=True,
                             temperature=0.8,
                             top_k=50,
                             top_p=0.95)
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary_text)
```
Output:

```
```
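Note that BERT-base encoders accept at most 512 tokens, so long articles should be truncated at encoding time. A minimal sketch, assuming the standard 512-token limit applies to this checkpoint:

```python
# Truncate long articles to the encoder's maximum length before generation.
inputs = tokenizer(ARTICLE_TO_SUMMARIZE,
                   return_tensors='pt',
                   truncation=True,
                   max_length=512)
summary_ids = model.generate(inputs.input_ids,
                             attention_mask=inputs.attention_mask,
                             min_length=20,
                             max_length=80,
                             num_beams=10)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```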