|
|
--- |
|
|
language: en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- question-answering |
|
|
- bert |
|
|
- squad |
|
|
- extractive-qa |
|
|
- baseline |
|
|
datasets: |
|
|
- squad |
|
|
metrics: |
|
|
- f1 |
|
|
- exact_match |
|
|
model-index: |
|
|
- name: bert-base-uncased-squad-baseline |
|
|
results: |
|
|
- task: |
|
|
type: question-answering |
|
|
name: Question Answering |
|
|
dataset: |
|
|
name: SQuAD 1.1 |
|
|
type: squad |
|
|
split: validation |
|
|
metrics: |
|
|
- type: exact_match |
|
|
value: 79.45 |
|
|
name: Exact Match |
|
|
- type: f1 |
|
|
value: 87.41 |
|
|
name: F1 Score |
|
|
--- |
|
|
|
|
|
# BERT Base Uncased - SQuAD 1.1 Baseline |
|
|
|
|
|
This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the SQuAD 1.1 dataset for extractive question answering. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is **BERT (Bidirectional Encoder Representations from Transformers)** fine-tuned on the Stanford Question Answering Dataset (SQuAD 1.1) for extractive question answering: finding the answer span within a given context passage.
|
|
|
|
|
- **Model Type:** Question Answering (Extractive) |
|
|
- **Base Model:** `bert-base-uncased` |
|
|
- **Language:** English |
|
|
- **License:** Apache 2.0 |
|
|
- **Fine-tuned on:** SQuAD 1.1 |
|
|
- **Parameters:** 108,893,186 (all trainable) |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
This model is designed for extractive question answering tasks where: |
|
|
- The answer exists as a continuous span of text within the provided context |
|
|
- Questions are factual and answerable from the context |
|
|
- Inputs are English-language text
|
|
|
|
|
### Example Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline |
|
|
|
|
|
# Load model and tokenizer |
|
|
model = AutoModelForQuestionAnswering.from_pretrained("G20-CS4248/bert-baseline-qa") |
|
|
tokenizer = AutoTokenizer.from_pretrained("G20-CS4248/bert-baseline-qa") |
|
|
|
|
|
# Create QA pipeline |
|
|
qa_pipeline = pipeline( |
|
|
"question-answering", |
|
|
model=model, |
|
|
tokenizer=tokenizer |
|
|
) |
|
|
|
|
|
# Ask a question |
|
|
context = """ |
|
|
The Amazon rainforest is a moist broadleaf tropical rainforest in the Amazon biome |
|
|
that covers most of the Amazon basin of South America. This basin encompasses |
|
|
7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) are |
|
|
covered by the rainforest. |
|
|
""" |
|
|
|
|
|
question = "How large is the Amazon basin?" |
|
|
|
|
|
result = qa_pipeline(question=question, context=context) |
|
|
|
|
|
print(f"Answer: {result['answer']}") |
|
|
print(f"Confidence: {result['score']:.4f}") |
|
|
``` |
|
|
|
|
|
**Output:** |
|
|
``` |
|
|
Answer: 7,000,000 km2 |
|
|
Confidence: 0.9234 |
|
|
``` |
|
|
|
|
|
### Direct Model Usage (without pipeline) |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoModelForQuestionAnswering, AutoTokenizer |
|
|
|
|
|
model = AutoModelForQuestionAnswering.from_pretrained("G20-CS4248/bert-baseline-qa") |
|
|
tokenizer = AutoTokenizer.from_pretrained("G20-CS4248/bert-baseline-qa") |
|
|
|
|
|
question = "What is the capital of France?" |
|
|
context = "Paris is the capital and largest city of France." |
|
|
|
|
|
# Tokenize |
|
|
inputs = tokenizer(question, context, return_tensors="pt") |
|
|
|
|
|
# Get predictions |
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
|
|
|
# Get answer span |
|
|
answer_start = torch.argmax(outputs.start_logits) |
|
|
answer_end = torch.argmax(outputs.end_logits) + 1 |
|
|
|
|
|
answer = tokenizer.convert_tokens_to_string( |
|
|
tokenizer.convert_ids_to_tokens(inputs.input_ids[0][answer_start:answer_end]) |
|
|
) |
|
|
|
|
|
print(f"Answer: {answer}") |
|
|
``` |
|
|
|
|
|
## Training Data |
|
|
|
|
|
### Dataset: SQuAD 1.1 |
|
|
|
|
|
The Stanford Question Answering Dataset (SQuAD) v1.1 consists of questions posed by crowdworkers on a set of Wikipedia articles. |
|
|
|
|
|
**Training Set:** |
|
|
- **Examples:** 87,599 |
|
|
- **Average question length:** 10.06 words |
|
|
- **Average context length:** 119.76 words |
|
|
- **Average answer length:** 3.16 words |
|
|
|
|
|
**Validation Set:** |
|
|
- **Examples:** 10,570 |
|
|
- **Average question length:** 10.22 words |
|
|
- **Average context length:** 123.95 words |
|
|
- **Average answer length:** 3.02 words |
|
|
|
|
|
### Data Preprocessing |
|
|
|
|
|
- **Tokenizer:** `bert-base-uncased` |
|
|
- **Max sequence length:** 384 tokens |
|
|
- **Stride:** 128 tokens (for handling long contexts) |
|
|
- **Padding:** Maximum length |
|
|
- **Truncation:** Only second sequence (context) |
|
|
|
|
|
Long contexts are split into multiple features with overlapping windows to ensure answers aren't lost at sequence boundaries. |
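
A minimal sketch of this preprocessing using the tokenizer settings listed above (the exact training script is not reproduced here, so treat this as an approximation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(question, context):
    # Truncate only the context ("only_second") and create overlapping
    # 384-token windows with a 128-token stride so that long contexts
    # become several features instead of losing the answer at a boundary.
    return tokenizer(
        question,
        context,
        max_length=384,
        stride=128,
        truncation="only_second",
        padding="max_length",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
```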
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| **Base model** | bert-base-uncased | |
|
|
| **Optimizer** | AdamW | |
|
|
| **Learning rate** | 3e-5 | |
|
|
| **Learning rate schedule** | Linear with warmup | |
|
|
| **Warmup ratio** | 0.1 (10% of training) | |
|
|
| **Weight decay** | 0.01 | |
|
|
| **Batch size (train)** | 8 | |
|
|
| **Batch size (eval)** | 8 | |
|
|
| **Number of epochs** | 1 | |
|
|
| **Mixed precision** | FP16 (enabled) | |
|
|
| **Gradient accumulation** | 1 | |
|
|
| **Max gradient norm** | 1.0 | |
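
For reference, these settings correspond roughly to the following `TrainingArguments`; this is a reconstruction from the table above (with an illustrative output directory), not the published training script:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-baseline-qa",     # illustrative output path
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    fp16=True,
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,
)
```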
|
|
|
|
|
### Training Environment |
|
|
|
|
|
- **Hardware:** NVIDIA GPU (CUDA enabled) |
|
|
- **Framework:** PyTorch with Transformers library |
|
|
- **Training time:** ~29.5 minutes (1 epoch) |
|
|
- **Training samples/second:** 44.95 |
|
|
- **Total FLOPs:** 14,541,777 GFLOPs (~1.45 × 10^16 FLOPs)
|
|
|
|
|
### Training Metrics |
|
|
|
|
|
- **Final training loss:** 1.2236 |
|
|
- **Evaluation strategy:** End of epoch |
|
|
- **Metric for best model:** Evaluation loss |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Evaluation Results |
|
|
|
|
|
Evaluated on SQuAD 1.1 validation set (10,570 examples): |
|
|
|
|
|
| Metric | Score | |
|
|
|--------|-------| |
|
|
| **Exact Match (EM)** | **79.45%** | |
|
|
| **F1 Score** | **87.41%** | |
|
|
|
|
|
### Metric Explanations |
|
|
|
|
|
- **Exact Match (EM):** Percentage of predictions that match a ground-truth answer exactly, after the standard SQuAD normalization (lowercasing, removing punctuation and articles)
|
|
- **F1 Score:** Token-level F1 score measuring overlap between predicted and ground truth answers |
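
Both scores can be reproduced with the `evaluate` library's SQuAD metric; a small illustrative example (the id and answer below are placeholders):

```python
import evaluate

squad_metric = evaluate.load("squad")

predictions = [{"id": "1", "prediction_text": "Paris"}]
references = [{"id": "1", "answers": {"text": ["Paris"], "answer_start": [0]}}]

print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}
```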
|
|
|
|
|
### Comparison to BERT Base Performance |
|
|
|
|
|
| Model | EM | F1 | Training | |
|
|
|-------|----|----|----------| |
|
|
| **This model (1 epoch)** | 79.45 | 87.41 | 29.5 min | |
|
|
| BERT Base (original paper, 3 epochs) | 80.8 | 88.5 | ~2-3 hours | |
|
|
| BERT Base (fully trained) | 81-84 | 88-91 | ~2-3 hours | |
|
|
|
|
|
**Note:** This is a baseline model trained for only 1 epoch. Performance can be improved with additional training epochs. |
|
|
|
|
|
### Performance by Question Type |
|
|
|
|
|
The model performs well on: |
|
|
- ✅ Factual questions (What, When, Where, Who) |
|
|
- ✅ Short answer spans (1-5 words) |
|
|
- ✅ Questions with clear context |
|
|
|
|
|
May struggle with: |
|
|
- ⚠️ Questions requiring reasoning across multiple sentences |
|
|
- ⚠️ Very long answer spans |
|
|
- ⚠️ Ambiguous questions with multiple valid answers |
|
|
- ⚠️ Questions requiring world knowledge not in context |
|
|
|
|
|
## Limitations and Biases |
|
|
|
|
|
### Known Limitations |
|
|
|
|
|
1. **Extractive Only:** Can only extract answers present in the context; cannot generate or synthesize answers |
|
|
2. **Single Answer:** Provides only one answer span, even if multiple valid answers exist |
|
|
3. **Context Dependency:** Requires relevant context; cannot answer from general knowledge |
|
|
4. **Length Constraints:** Limited to 384 tokens per context window |
|
|
5. **English Only:** Trained on English text; not suitable for other languages |
|
|
6. **Training Duration:** Only 1 epoch of training; may underfit compared to longer training |
|
|
|
|
|
### Potential Biases |
|
|
|
|
|
- **Domain Bias:** Trained primarily on Wikipedia articles; may perform worse on other text types (news, technical docs, etc.) |
|
|
- **Temporal Bias:** Training data from 2016; may have outdated information |
|
|
- **Cultural Bias:** Reflects biases present in Wikipedia content |
|
|
- **Answer Position Bias:** May favor answers appearing in certain positions within context |
|
|
- **BERT Base Biases:** Inherits any biases from the pre-trained BERT base model |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
This model should NOT be used for: |
|
|
- ❌ Medical, legal, or financial advice |
|
|
- ❌ High-stakes decision making |
|
|
- ❌ Generative question answering (creating new answers) |
|
|
- ❌ Non-English languages |
|
|
- ❌ Yes/no or multiple choice questions (without adaptation) |
|
|
- ❌ Questions requiring reasoning beyond the context |
|
|
- ❌ Real-time fact checking or verification |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
``` |
|
|
BertForQuestionAnswering( |
|
|
(bert): BertModel( |
|
|
(embeddings): BertEmbeddings |
|
|
(encoder): BertEncoder (12 layers) |
|
|
|
|
) |
|
|
(qa_outputs): Linear(768 -> 2) # Start and end position logits |
|
|
) |
|
|
``` |
|
|
|
|
|
- **Hidden size:** 768 |
|
|
- **Attention heads:** 12 |
|
|
- **Intermediate size:** 3072 |
|
|
- **Hidden layers:** 12 |
|
|
- **Vocabulary size:** 30,522 |
|
|
- **Max position embeddings:** 512 |
|
|
- **Total parameters:** 108,893,186 |
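
The parameter count can be checked directly (a quick sanity check using the repository name from the usage examples above):

```python
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("G20-CS4248/bert-baseline-qa")

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total:,}  Trainable: {trainable:,}")
# Expected: 108,893,186 for both (all parameters are trainable)
```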
|
|
|
|
|
### Input Format |
|
|
|
|
|
The model expects tokenized input with: |
|
|
- Question and context concatenated with `[SEP]` token |
|
|
- Format: `[CLS] question [SEP] context [SEP]` |
|
|
- Token type IDs to distinguish question (0) from context (1) |
|
|
- Attention mask to identify real vs padding tokens |
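
A quick way to inspect this layout (illustrative; any question/context pair works):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("G20-CS4248/bert-baseline-qa")
enc = tokenizer("What is the capital of France?",
                "Paris is the capital and largest city of France.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])   # 0 for question tokens, 1 for context tokens
print(enc["attention_mask"])   # 1 for real tokens (no padding in this short example)
```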
|
|
|
|
|
### Output Format |
|
|
|
|
|
Returns: |
|
|
- `start_logits`: Scores for each token being the start of the answer span |
|
|
- `end_logits`: Scores for each token being the end of the answer span |
|
|
|
|
|
The predicted answer is the span from the token with the highest `start_logit` to the token with the highest `end_logit` (subject to end >= start).
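
The "direct model usage" example above takes independent argmaxes, which can occasionally produce an invalid span (end before start). A minimal sketch of span selection that enforces the end >= start constraint and a bounded answer length (the helper name and length limit are illustrative):

```python
import torch

def best_span(start_logits: torch.Tensor, end_logits: torch.Tensor,
              max_answer_len: int = 30):
    """Pick the (start, end) pair maximizing start_logit + end_logit,
    subject to end >= start and a bounded span length.
    Expects 1-D logits for a single example, e.g. outputs.start_logits[0]."""
    scores = start_logits[:, None] + end_logits[None, :]   # scores[i, j] = span i..j
    valid = torch.triu(torch.ones_like(scores, dtype=torch.bool))       # end >= start
    valid &= torch.tril(torch.ones_like(scores, dtype=torch.bool),
                        diagonal=max_answer_len - 1)                    # bounded length
    scores = scores.masked_fill(~valid, float("-inf"))
    start, end = divmod(int(scores.argmax()), scores.size(-1))
    return start, end
```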
|
|
|
|
|
## Evaluation Data |
|
|
|
|
|
**SQuAD 1.1 Validation Set** |
|
|
- 10,570 question-context-answer triples |
|
|
- Same source and format as training data |
|
|
- Used for final performance evaluation |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Training hardware:** 1x NVIDIA GPU |
|
|
- **Training time:** ~29.5 minutes |
|
|
- **Compute region:** Not specified |
|
|
- **Carbon footprint:** Not measured; expected to be small given the ~30-minute training run
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
[Your Name / Team Name] |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
[Your Email / Contact Information] |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{bert-squad-baseline-2025, |
|
|
author = {Your Name}, |
|
|
title = {BERT Base Uncased Fine-tuned on SQuAD 1.1 (Baseline)}, |
|
|
year = {2025}, |
|
|
publisher = {HuggingFace}, |
|
|
howpublished = {\url{https://huggingface.co/G20-CS4248/bert-baseline-qa}}
|
|
} |
|
|
``` |
|
|
|
|
|
### Original BERT Paper |
|
|
|
|
|
```bibtex |
|
|
@article{devlin2018bert, |
|
|
title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding}, |
|
|
author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, |
|
|
journal={arXiv preprint arXiv:1810.04805}, |
|
|
year={2018} |
|
|
} |
|
|
``` |
|
|
|
|
|
### SQuAD Dataset |
|
|
|
|
|
```bibtex |
|
|
@article{rajpurkar2016squad, |
|
|
title={SQuAD: 100,000+ Questions for Machine Comprehension of Text}, |
|
|
author={Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy}, |
|
|
journal={arXiv preprint arXiv:1606.05250}, |
|
|
year={2016} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Additional Information |
|
|
|
|
|
### Future Improvements |
|
|
|
|
|
Potential enhancements for this baseline model: |
|
|
- 🔄 Train for additional epochs (2-3 epochs recommended) |
|
|
- 📈 Increase batch size with gradient accumulation |
|
|
- 🎯 Tune the learning rate and warmup schedule
|
|
- 🔍 Add answer validation/verification |
|
|
- 📊 Ensemble with multiple models |
|
|
- 🚀 Distillation to smaller model for deployment |
|
|
|
|
|
### Related Models |
|
|
|
|
|
- [bert-base-uncased](https://huggingface.co/bert-base-uncased) - Base model |
|
|
- [bert-large-uncased-whole-word-masking-finetuned-squad](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad) - Larger BERT variant |
|
|
- [distilbert-base-uncased-distilled-squad](https://huggingface.co/distilbert-base-uncased-distilled-squad) - Smaller, faster variant |
|
|
|
|
|
### Acknowledgments |
|
|
|
|
|
- Google Research for BERT |
|
|
- Stanford NLP for SQuAD dataset |
|
|
- Hugging Face for Transformers library |
|
|
- [Your course/institution if applicable] |
|
|
|
|
|
--- |
|
|
|
|
|
**Last updated:** October 2025 |
|
|
**Model version:** 1.0 (Baseline) |
|
|
**Status:** Baseline model - suitable for development/comparison |