BERT Base Uncased - SQuAD 1.1 Baseline

This model is a fine-tuned version of bert-base-uncased on the SQuAD 1.1 dataset for extractive question answering.

Model Description

This is BERT (Bidirectional Encoder Representations from Transformers) fine-tuned on the Stanford Question Answering Dataset (SQuAD 1.1) to perform extractive question answering: finding the answer span within a given context passage.

  • Model Type: Question Answering (Extractive)
  • Base Model: bert-base-uncased
  • Language: English
  • License: Apache 2.0
  • Fine-tuned on: SQuAD 1.1
  • Parameters: 108,893,186 (all trainable)

Intended Use

Primary Use Cases

This model is designed for extractive question answering tasks where:

  • The answer exists as a contiguous span of text within the provided context
  • Questions are factual and answerable from the context
  • The text is in English

Example Usage

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Load model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained("your-username/bert-squad-baseline")
tokenizer = AutoTokenizer.from_pretrained("your-username/bert-squad-baseline")

# Create QA pipeline
qa_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer
)

# Ask a question
context = """
The Amazon rainforest is a moist broadleaf tropical rainforest in the Amazon biome 
that covers most of the Amazon basin of South America. This basin encompasses 
7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) are 
covered by the rainforest.
"""

question = "How large is the Amazon basin?"

result = qa_pipeline(question=question, context=context)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")

Output:

Answer: 7,000,000 km2
Confidence: 0.9234

Direct Model Usage (without pipeline)

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model = AutoModelForQuestionAnswering.from_pretrained("your-username/bert-squad-baseline")
tokenizer = AutoTokenizer.from_pretrained("your-username/bert-squad-baseline")

question = "What is the capital of France?"
context = "Paris is the capital and largest city of France."

# Tokenize
inputs = tokenizer(question, context, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    
# Get answer span
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs.input_ids[0][answer_start:answer_end])
)

print(f"Answer: {answer}")

Training Data

Dataset: SQuAD 1.1

The Stanford Question Answering Dataset (SQuAD) v1.1 consists of questions posed by crowdworkers on a set of Wikipedia articles.

Training Set:

  • Examples: 87,599
  • Average question length: 10.06 words
  • Average context length: 119.76 words
  • Average answer length: 3.16 words

Validation Set:

  • Examples: 10,570
  • Average question length: 10.22 words
  • Average context length: 123.95 words
  • Average answer length: 3.02 words
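
The split sizes above can be reproduced directly from the public dataset. A minimal sketch, assuming the Hugging Face datasets library:

from datasets import load_dataset

# SQuAD 1.1 ships with a train and a validation split
squad = load_dataset("squad")

print(len(squad["train"]))       # 87599 examples
print(len(squad["validation"]))  # 10570 examples

example = squad["train"][0]
print(example["question"])                 # question text
print(example["answers"]["text"][0])       # gold answer span (a substring of the context)
print(example["answers"]["answer_start"])  # character offset(s) of the answer in the context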

Data Preprocessing

  • Tokenizer: bert-base-uncased
  • Max sequence length: 384 tokens
  • Stride: 128 tokens (for handling long contexts)
  • Padding: Maximum length
  • Truncation: Only second sequence (context)

Long contexts are split into multiple features with overlapping windows to ensure answers aren't lost at sequence boundaries.
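
A minimal sketch of this windowing, assuming the bert-base-uncased tokenizer from the Transformers library (question and long_context stand in for one example pair):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    question,                        # placeholder: the question string
    long_context,                    # placeholder: a context longer than the 384-token window
    max_length=384,
    stride=128,                      # 128 tokens of overlap between consecutive windows
    truncation="only_second",        # truncate only the context, never the question
    padding="max_length",
    return_overflowing_tokens=True,  # emit one feature per window
    return_offsets_mapping=True,     # map tokens back to character positions for answer labeling
)

print(len(encoded["input_ids"]))  # number of overlapping features created for this example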

Training Procedure

Training Hyperparameters

  • Base model: bert-base-uncased
  • Optimizer: AdamW
  • Learning rate: 3e-5
  • Learning rate schedule: Linear with warmup
  • Warmup ratio: 0.1 (10% of training steps)
  • Weight decay: 0.01
  • Batch size (train): 8
  • Batch size (eval): 8
  • Number of epochs: 1
  • Mixed precision: FP16 (enabled)
  • Gradient accumulation steps: 1
  • Max gradient norm: 1.0
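
The training script itself is not included in this card; the sketch below shows how these hyperparameters would map onto the Transformers Trainer API (tokenized_train and tokenized_eval are placeholders for the preprocessed SQuAD features with start/end position labels):

from transformers import AutoModelForQuestionAnswering, Trainer, TrainingArguments

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

args = TrainingArguments(
    output_dir="bert-squad-baseline",
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    fp16=True,
    max_grad_norm=1.0,
    eval_strategy="epoch",  # "evaluation_strategy" in older Transformers releases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,  # placeholder: preprocessed training features
    eval_dataset=tokenized_eval,    # placeholder: preprocessed validation features
)
trainer.train()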

Training Environment

  • Hardware: NVIDIA GPU (CUDA enabled)
  • Framework: PyTorch with Transformers library
  • Training time: ~29.5 minutes (1 epoch)
  • Training samples/second: 44.95
  • Total FLOPs: 14,541,777 GF

Training Metrics

  • Final training loss: 1.2236
  • Evaluation strategy: End of epoch
  • Metric for best model: Evaluation loss

Performance

Evaluation Results

Evaluated on SQuAD 1.1 validation set (10,570 examples):

  • Exact Match (EM): 79.45%
  • F1 Score: 87.41%

Metric Explanations

  • Exact Match (EM): Percentage of predictions that match the ground truth answer exactly
  • F1 Score: Token-level F1 score measuring overlap between predicted and ground truth answers
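
A simplified sketch of how both metrics are computed for a single prediction (the official SQuAD evaluation script additionally lowercases, strips punctuation and articles, and takes the maximum score over all gold answers):

from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    # Simplified: the official script normalizes both strings before comparing.
    return prediction.strip() == gold.strip()

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap scores between 0 and 1, e.g. a prediction covering 2 of 5 gold tokens:
print(token_f1("7,000,000 km2", "7,000,000 km2 (2,700,000 sq mi)"))  # 0.571...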

Comparison to BERT Base Performance

  • This model (1 epoch): EM 79.45, F1 87.41, ~29.5 min training
  • BERT Base (original paper, 3 epochs): EM 80.8, F1 88.5, ~2-3 hours training
  • BERT Base (fully trained): EM 81-84, F1 88-91, ~2-3 hours training

Note: This is a baseline model trained for only 1 epoch. Performance can be improved with additional training epochs.

Performance by Question Type

The model performs well on:

  • ✅ Factual questions (What, When, Where, Who)
  • ✅ Short answer spans (1-5 words)
  • ✅ Questions with clear context

May struggle with:

  • ⚠️ Questions requiring reasoning across multiple sentences
  • ⚠️ Very long answer spans
  • ⚠️ Ambiguous questions with multiple valid answers
  • ⚠️ Questions requiring world knowledge not in context

Limitations and Biases

Known Limitations

  1. Extractive Only: Can only extract answers present in the context; cannot generate or synthesize answers
  2. Single Answer: Provides only one answer span, even if multiple valid answers exist
  3. Context Dependency: Requires relevant context; cannot answer from general knowledge
  4. Length Constraints: Limited to 384 tokens per context window
  5. English Only: Trained on English text; not suitable for other languages
  6. Training Duration: Only 1 epoch of training; may underfit compared to longer training

Potential Biases

  • Domain Bias: Trained primarily on Wikipedia articles; may perform worse on other text types (news, technical docs, etc.)
  • Temporal Bias: Training data from 2016; may have outdated information
  • Cultural Bias: Reflects biases present in Wikipedia content
  • Answer Position Bias: May favor answers appearing in certain positions within context
  • BERT Base Biases: Inherits any biases from the pre-trained BERT base model

Out-of-Scope Use

This model should NOT be used for:

  • ❌ Medical, legal, or financial advice
  • ❌ High-stakes decision making
  • ❌ Generative question answering (creating new answers)
  • ❌ Non-English languages
  • ❌ Yes/no or multiple choice questions (without adaptation)
  • ❌ Questions requiring reasoning beyond the context
  • ❌ Real-time fact checking or verification

Technical Specifications

Model Architecture

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings
    (encoder): BertEncoder (12 layers)
    (pooler): BertPooler
  )
  (qa_outputs): Linear(768 -> 2)  # Start and end position logits
)
  • Hidden size: 768
  • Attention heads: 12
  • Intermediate size: 3072
  • Hidden layers: 12
  • Vocabulary size: 30,522
  • Max position embeddings: 512
  • Total parameters: 108,893,186
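
These figures can be checked from the loaded model itself; a quick sketch, reusing the placeholder model ID from the usage examples above:

from transformers import AutoConfig, AutoModelForQuestionAnswering

config = AutoConfig.from_pretrained("your-username/bert-squad-baseline")
print(config.hidden_size, config.num_attention_heads, config.intermediate_size,
      config.num_hidden_layers, config.vocab_size, config.max_position_embeddings)
# 768 12 3072 12 30522 512

model = AutoModelForQuestionAnswering.from_pretrained("your-username/bert-squad-baseline")
print(sum(p.numel() for p in model.parameters()))  # total parameter count (~108.9M)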

Input Format

The model expects tokenized input with:

  • Question and context concatenated with [SEP] token
  • Format: [CLS] question [SEP] context [SEP]
  • Token type IDs to distinguish question (0) from context (1)
  • Attention mask to identify real vs padding tokens
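
A quick sketch of what this looks like for a concrete question/context pair:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer(
    "What is the capital of France?",
    "Paris is the capital and largest city of France.",
    return_tensors="pt",
)

print(tokenizer.decode(inputs["input_ids"][0]))
# [CLS] what is the capital of france? [SEP] paris is the capital and largest city of france. [SEP]
print(inputs["token_type_ids"][0])  # 0 for question tokens, 1 for context tokens
print(inputs["attention_mask"][0])  # 1 for real tokens, 0 for padding (none in this example)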

Output Format

Returns:

  • start_logits: Scores for each token being the start of the answer span
  • end_logits: Scores for each token being the end of the answer span

The predicted answer is the span running from the token with the highest start_logit to the token with the highest end_logit, subject to the constraint that the end position is not before the start position.
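
The direct-usage example above takes an unconstrained argmax over each set of logits, which can occasionally yield an end position before the start. A sketch of a constrained search over valid spans (the 30-token cap is an assumption, matching a common SQuAD setting):

import torch

def best_valid_span(start_logits: torch.Tensor, end_logits: torch.Tensor,
                    max_answer_length: int = 30) -> tuple[int, int]:
    # Pick the (start, end) pair with the highest summed logits,
    # subject to end >= start and a maximum answer length.
    best_score, best_span = float("-inf"), (0, 0)
    for start in range(start_logits.size(0)):
        for end in range(start, min(start + max_answer_length, end_logits.size(0))):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score, best_span = score, (start, end)
    return best_span

# Usage with `inputs` and `outputs` from the direct-usage example:
# start, end = best_valid_span(outputs.start_logits[0], outputs.end_logits[0])
# answer = tokenizer.decode(inputs.input_ids[0][start:end + 1])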

Evaluation Data

SQuAD 1.1 Validation Set

  • 10,570 question-context-answer triples
  • Same source and format as training data
  • Used for final performance evaluation

Environmental Impact

  • Training hardware: 1x NVIDIA GPU
  • Training time: ~29.5 minutes
  • Compute region: Not specified
  • Carbon footprint: Estimated to be minimal due to the short training time

Model Card Authors

[Your Name / Team Name]

Model Card Contact

[Your Email / Contact Information]

Citation

If you use this model, please cite:

@misc{bert-squad-baseline-2025,
  author = {Your Name},
  title = {BERT Base Uncased Fine-tuned on SQuAD 1.1 (Baseline)},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/your-username/bert-squad-baseline}}
}

Original BERT Paper

@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

SQuAD Dataset

@article{rajpurkar2016squad,
  title={SQuAD: 100,000+ Questions for Machine Comprehension of Text},
  author={Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy},
  journal={arXiv preprint arXiv:1606.05250},
  year={2016}
}

Additional Information

Future Improvements

Potential enhancements for this baseline model:

  • 🔄 Train for additional epochs (2-3 epochs recommended)
  • 📈 Increase the effective batch size with gradient accumulation (see the sketch after this list)
  • 🎯 Experiment with alternative learning rate schedules
  • 🔍 Add answer validation/verification
  • 📊 Ensemble with multiple models
  • 🚀 Distillation to smaller model for deployment
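
For example, the batch-size suggestion above could be approximated without additional GPU memory by accumulating gradients across steps; a sketch (the output directory and step counts are illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-squad-improved",  # hypothetical output directory
    per_device_train_batch_size=8,     # still 8 examples per device in memory
    gradient_accumulation_steps=4,     # effective batch size of 8 x 4 = 32
    num_train_epochs=3,                # longer training, as suggested above
    learning_rate=3e-5,
    fp16=True,
)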

Acknowledgments

  • Google Research for BERT
  • Stanford NLP for SQuAD dataset
  • Hugging Face for Transformers library
  • [Your course/institution if applicable]

Last updated: October 2025
Model version: 1.0 (Baseline)
Status: Baseline model - suitable for development/comparison
