FLAN‑T5‑small · PubMedQA (LoRA/QLoRA)

This repository contains a parameter‑efficient fine‑tuning of the FLAN‑T5‑small model for a biomedical question‑answering task.
The model produces one of three answers — yes, no, or maybe — given a question and a short context drawn from biomedical abstracts.
Training uses QLoRA (4‑bit NF4 quantization via bitsandbytes) together with LoRA adapters on the base model, enabling training and inference on low‑VRAM hardware.


Model Details

Model Description

  • Architecture: Encoder–decoder transformer (T5 family, FLAN‑T5‑small)
  • Objective: Sequence‑to‑sequence generation of one of three labels (yes, no, maybe)
  • Parameter‑efficient training: LoRA adapters trained on top of a 4‑bit‑quantized base model (QLoRA)
  • Language: English (biomedical literature)
  • Finetuned from: google/flan‑t5‑small
  • Intended format: Use as LoRA adapters (recommended) or merge into a full model for deployment.


Uses

Direct Use

  • Biomedical yes/no/maybe question answering on short context passages (e.g., sentences or abstracts).
  • Deployment via adapters using the PEFT framework, enabling small checkpoints and flexible precision.

Downstream Use

  • Component in biomedical literature triage systems or heuristic pipelines.
  • Starting point for further PEFT‑style adaptation on related biomedical QA datasets.

Out‑of‑Scope Use

  • Clinical decision making: Not a medical device. Do not use for diagnosis or treatment.
  • Free‑form generation outside the label space (yes, no, maybe).
  • Long document reasoning without retrieval or summarization.

Bias, Risks, and Limitations

  • Domain bias: The model is trained solely on PubMedQA; performance may degrade on layperson or cross‑domain biomedical text.
  • Restricted vocabulary: Optimized to output only yes, no, or maybe.
  • Hallucination: Like all seq2seq models, it can produce incorrect or over‑confident outputs.
  • Safety: Do not rely on the model for clinical advice; human oversight is required.

Recommendations:

  • Clamp outputs to the label set and log confidence proxies (e.g., beam scores); see the sketch below.
  • For longer contexts, consider retrieval‑augmented preprocessing or chunking.
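
One way to implement the first recommendation is sketched below. The predict_with_confidence helper is hypothetical; it assumes a model and tokenizer loaded as in the Getting Started section, and uses beam‑search sequences_scores as a rough confidence proxy.

def predict_with_confidence(question, context, model, tokenizer):
    labels = {"yes", "no", "maybe"}
    prompt = f"question: {question} context: {context}"
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    # Beam search with scores exposed; sequences_scores holds the
    # length-normalized log-probability of the best beam.
    out = model.generate(
        **inputs,
        max_new_tokens=4,
        num_beams=4,
        return_dict_in_generate=True,
        output_scores=True,
    )
    text = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)[0]
    text = text.strip().lower().rstrip(".")
    prediction = text if text in labels else "maybe"  # clamp to the label set
    confidence = out.sequences_scores[0].item()       # higher = less uncertain
    return prediction, confidence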

How to Get Started with the Model

You can load the model using the Transformers and PEFT libraries.
The recommended approach is to load the base model and LoRA adapter separately.

Example: Using Adapters

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model_id = "google/flan-t5-small"
adapter_id = "MileStanislavov/flan-t5-small-pubmedqa-lora"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id, quantization_config=bnb_config)
model = PeftModel.from_pretrained(base_model, adapter_id)

def predict_yes_no_maybe(question, context):
    prompt = f"question: {question} context: {context}"
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=4, num_beams=4, do_sample=False)
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip().lower().replace(".", "")
    return text if text in {"yes", "no", "maybe"} else "maybe"
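
An illustrative call (the question/context pair below is made up for demonstration and is not drawn from PubMedQA):

answer = predict_yes_no_maybe(
    "Does aspirin reduce the risk of recurrent stroke?",
    "In a randomized trial, patients receiving low-dose aspirin had fewer recurrent ischemic events than controls.",
)
print(answer)  # one of "yes", "no", "maybe"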

Example: Using Merged Model

If you have merged the LoRA weights into the base model using merge_and_unload(), load the merged model directly:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("your-username/flan-t5-small-pubmedqa-merged")
tokenizer = AutoTokenizer.from_pretrained("your-username/flan-t5-small-pubmedqa-merged")
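
To produce such a merged checkpoint in the first place, a minimal sketch is shown below. The base model is loaded in full precision here to keep the merge straightforward, and the output path is a placeholder.

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")  # full precision, not 4-bit
merged = PeftModel.from_pretrained(base, "MileStanislavov/flan-t5-small-pubmedqa-lora").merge_and_unload()
merged.save_pretrained("flan-t5-small-pubmedqa-merged")  # placeholder local path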

Training Details

Training Data

  • Dataset: PubMedQA pqa_labeled
  • Preprocessing: Stratified train/validation/test split by label.
    • Inputs formatted as question: <q> context: <context> (truncated to 512 tokens)
    • Targets truncated to 8 tokens
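
A sketch of this formatting, assuming the Hugging Face pqa_labeled schema (fields question, context["contexts"], and final_decision) and an already‑loaded tokenizer; the preprocess helper is illustrative, not the exact training script.

def preprocess(example, tokenizer):
    # Join the abstract passages and apply the "question: ... context: ..." template
    context = " ".join(example["context"]["contexts"])
    model_inputs = tokenizer(
        f"question: {example['question']} context: {context}",
        max_length=512,
        truncation=True,
    )
    # Targets are the yes/no/maybe labels, truncated to 8 tokens
    target = tokenizer(example["final_decision"], max_length=8, truncation=True)
    model_inputs["labels"] = target["input_ids"]
    return model_inputs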

Training Procedure

  • Adapters: LoRA with rank 16, α = 32, dropout 0.05
  • Target Modules: ["q", "k", "v", "o", "wi_0", "wi_1", "wo"] (T5 attention and feed‑forward projections)
  • Quantization: 4‑bit NF4 with double quantization; compute dtype float16 or bfloat16
  • Optimizer: Adafactor, constant learning rate of 2e-4
  • Batching: Effective batch size ≈ 8 (per‑device 4, gradient accumulation 2)
  • Epochs: 3
  • Memory optimizations: Gradient checkpointing enabled; the quantized base model is prepared for k‑bit training with input gradients enabled
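
A sketch of the adapter and trainer configuration these settings imply, using peft and transformers (the output directory is a placeholder; dataset loading and the Trainer call are omitted):

import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig, Seq2SeqTrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small", quantization_config=bnb_config)
base = prepare_model_for_kbit_training(base)  # k-bit prep; also enables input gradients

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "k", "v", "o", "wi_0", "wi_1", "wo"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-pubmedqa-lora",  # placeholder output path
    optim="adafactor",
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    gradient_checkpointing=True,
)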


Evaluation

Evaluation uses the validation and test splits from PubMedQA.
Metrics include accuracy and macro‑F1.
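
A minimal sketch of the metric computation using scikit-learn (the evaluate helper is illustrative; gold and preds are parallel lists of yes/no/maybe labels):

from sklearn.metrics import accuracy_score, f1_score

def evaluate(gold, preds):
    # gold and preds: parallel lists of "yes" / "no" / "maybe" strings
    return {
        "accuracy": accuracy_score(gold, preds),
        "macro_f1": f1_score(gold, preds, average="macro"),
    }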


Environmental Impact

The parameter‑efficient approach (QLoRA + LoRA) significantly reduces compute requirements and energy usage compared to full fine‑tuning.

Estimate CO₂ emissions using the Machine Learning Impact calculator.


Model card authored by Mile Stanislavov.
Please contact Mile Stanislavov for questions.
