---
library_name: transformers
tags:
- text-generation
- paraphrase
- gpt2
- causal-lm
- transformers
- pytorch
license: mit
datasets:
- HHousen/ParaSCI
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# Model Card for `gpt2-parasciparaphrase`

## 🧠 Model Summary

This model is a fine-tuned version of [GPT-2](https://huggingface.co/gpt2) on the [ParaSCI dataset](https://huggingface.co/datasets/HHousen/ParaSCI) for paraphrase generation. It takes a sentence as input and generates a paraphrased version of that sentence.

---

## 📋 Model Details

- **Base model:** GPT-2 (`gpt2`)
- **Task:** Paraphrase generation (causal language modeling)
- **Language:** English
- **Training data:** [HHousen/ParaSCI](https://huggingface.co/datasets/HHousen/ParaSCI)
- **Training steps:** 1 epoch on ~270k examples
- **Precision:** `fp16` mixed precision
- **Hardware used:** Tesla T4 (Kaggle Notebook GPU)
- **Framework:** 🤗 Transformers, PyTorch
- **Trained by:** [Your Name or HF Username]
- **License:** MIT

---

## 💡 Intended Use

### ✅ Direct Use

- Generate paraphrased versions of input English sentences in a general academic/technical writing context.

### 🚫 Out-of-Scope Use

- Not suitable for paraphrasing code, informal language, or non-English text.
- Not tested for fairness, bias, or ethical use in downstream applications.

---

## 📊 Evaluation

- **Qualitative evaluation:** Manual checks indicate coherent paraphrased outputs.
- **Automatic metrics:** Not yet reported.
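Until automatic metrics are reported, a rough lexical-overlap check can flag outputs that merely copy the input. The helper below is a hypothetical sketch, not part of this repository: high overlap with the source sentence suggests the model is echoing rather than rewording.

```python
def lexical_overlap(source: str, paraphrase: str) -> float:
    """Jaccard overlap of lowercased word sets; 1.0 means identical wording."""
    a = set(source.lower().split())
    b = set(paraphrase.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

This is only a sanity check; it says nothing about meaning preservation, for which BLEU/ROUGE or human review would still be needed.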
---

## 🛠 Training Details

- **Dataset:** ParaSCI (`sentence1` → `sentence2`)
- **Preprocessing:** Concatenated prompt `paraphrase this sentence: {sentence1}\n{sentence2}`
- **Tokenizer:** GPT-2 tokenizer with `pad_token = eos_token`
- **Batch size:** 8
- **Epochs:** 1
- **Learning rate:** 5e-5
- **Logging and checkpointing:** Every 500 steps, using Weights & Biases (`wandb`)
- **Max sequence length:** 256 tokens

---

## 🏁 How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-parasciparaphrase")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-parasciparaphrase")

# Match the training setup: GPT-2 has no pad token by default.
tokenizer.pad_token = tokenizer.eos_token

# Use the same prompt format the model saw during training.
input_text = "paraphrase this sentence: AI models can help in automating tasks.\n"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

output = model.generate(
    input_ids,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
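Because a causal LM echoes its prompt, the decoded string contains the instruction and input sentence before the paraphrase. A small post-processing helper (hypothetical, assuming the `prompt\nparaphrase` format described above) keeps only the generated paraphrase:

```python
def extract_paraphrase(decoded: str, prompt: str) -> str:
    """Drop the echoed prompt and keep only the first generated line."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    # The paraphrase was trained to appear on the line after the prompt,
    # so anything past the first newline is likely spillover.
    return decoded.strip().split("\n")[0].strip()
```

For example: `extract_paraphrase(tokenizer.decode(output[0], skip_special_tokens=True), input_text)`.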