G20-CS4248
/

bert-baseline-qa

@@ -1,199 +1,382 @@
 ---
-library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
 ## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+language: en
+license: apache-2.0
+tags:
+- question-answering
+- bert
+- squad
+- extractive-qa
+- baseline
+datasets:
+- squad
+metrics:
+- f1
+- exact_match
+model-index:
+- name: bert-base-uncased-squad-baseline
+  results:
+  - task:
+      type: question-answering
+      name: Question Answering
+    dataset:
+      name: SQuAD 1.1
+      type: squad
+      split: validation
+    metrics:
+    - type: exact_match
+      value: 79.45
+      name: Exact Match
+    - type: f1
+      value: 87.41
+      name: F1 Score
 ---
+# BERT Base Uncased - SQuAD 1.1 Baseline
+This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the SQuAD 1.1 dataset for extractive question answering.
+## Model Description
+**BERT (Bidirectional Encoder Representations from Transformers)** fine-tuned on the Stanford Question Answering Dataset (SQuAD 1.1) to perform extractive question answering - finding the answer span within a given context passage.
+- **Model Type:** Question Answering (Extractive)
+- **Base Model:** `bert-base-uncased`
+- **Language:** English
+- **License:** Apache 2.0
+- **Fine-tuned on:** SQuAD 1.1
+- **Parameters:** 108,893,186 (all trainable)
+## Intended Use
+### Primary Use Cases
+This model is designed for extractive question answering tasks where:
+- The answer exists as a continuous span of text within the provided context
+- Questions are factual and answerable from the context
+- The text is in English
+### Example Usage
+```python
+from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
+# Load model and tokenizer
+model = AutoModelForQuestionAnswering.from_pretrained("your-username/bert-squad-baseline")
+tokenizer = AutoTokenizer.from_pretrained("your-username/bert-squad-baseline")
+# Create QA pipeline
+qa_pipeline = pipeline(
+    "question-answering",
+    model=model,
+    tokenizer=tokenizer
+)
+# Ask a question
+context = """
+The Amazon rainforest is a moist broadleaf tropical rainforest in the Amazon biome
+that covers most of the Amazon basin of South America. This basin encompasses
+7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) are
+covered by the rainforest.
+"""
+question = "How large is the Amazon basin?"
+result = qa_pipeline(question=question, context=context)
+print(f"Answer: {result['answer']}")
+print(f"Confidence: {result['score']:.4f}")
+```
+**Example output (the exact score may differ):**
+```
+Answer: 7,000,000 km2
+Confidence: 0.9234
+```
+### Direct Model Usage (without pipeline)
+```python
+import torch
+from transformers import AutoModelForQuestionAnswering, AutoTokenizer
+model = AutoModelForQuestionAnswering.from_pretrained("your-username/bert-squad-baseline")
+tokenizer = AutoTokenizer.from_pretrained("your-username/bert-squad-baseline")
+question = "What is the capital of France?"
+context = "Paris is the capital and largest city of France."
+# Tokenize
+inputs = tokenizer(question, context, return_tensors="pt")
+# Get predictions
+with torch.no_grad():
+    outputs = model(**inputs)
+# Get answer span (independent argmax; this does not guarantee end >= start)
+answer_start = torch.argmax(outputs.start_logits)
+answer_end = torch.argmax(outputs.end_logits) + 1  # +1 makes the slice end exclusive
+answer = tokenizer.convert_tokens_to_string(
+    tokenizer.convert_ids_to_tokens(inputs.input_ids[0][answer_start:answer_end])
+)
+print(f"Answer: {answer}")
+```
+## Training Data
+### Dataset: SQuAD 1.1
+The Stanford Question Answering Dataset (SQuAD) v1.1 consists of questions posed by crowdworkers on a set of Wikipedia articles.
+**Training Set:**
+- **Examples:** 87,599
+- **Average question length:** 10.06 words
+- **Average context length:** 119.76 words
+- **Average answer length:** 3.16 words
+**Validation Set:**
+- **Examples:** 10,570
+- **Average question length:** 10.22 words
+- **Average context length:** 123.95 words
+- **Average answer length:** 3.02 words
+### Data Preprocessing
+- **Tokenizer:** `bert-base-uncased`
+- **Max sequence length:** 384 tokens
+- **Stride:** 128 tokens (for handling long contexts)
+- **Padding:** To the maximum sequence length
+- **Truncation:** Only second sequence (context)
+Long contexts are split into multiple features with overlapping windows to ensure answers aren't lost at sequence boundaries.
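The overlapping-window idea can be sketched in plain Python. This is a simplified illustration, not the actual preprocessing code: it assumes the context is already a list of tokens and that the stride is the number of tokens shared by consecutive windows. The names `split_into_windows` and `max_context_tokens` are hypothetical, and the real pipeline also reserves room for the question and the special tokens.

```python
def split_into_windows(context_tokens, max_context_tokens, stride):
    """Split a token list into overlapping windows.

    Consecutive windows share `stride` tokens, so any answer no longer
    than the overlap is fully contained in at least one window.
    Returns a list of (start_offset, window_tokens) pairs.
    """
    if not 0 <= stride < max_context_tokens:
        raise ValueError("stride must be non-negative and smaller than the window size")
    step = max_context_tokens - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        end = min(start + max_context_tokens, len(context_tokens))
        windows.append((start, context_tokens[start:end]))
        if end == len(context_tokens):
            break
        start += step
    return windows


# Toy example: 10 tokens, windows of 4 tokens, overlap of 2 tokens.
tokens = [f"tok{i}" for i in range(10)]
for offset, window in split_into_windows(tokens, max_context_tokens=4, stride=2):
    print(offset, window)
```

With the settings above (384-token sequences, 128-token stride), the same logic applies at a larger scale: each window repeats the last 128 tokens of the previous one.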
+## Training Procedure
+### Training Hyperparameters
+| Parameter | Value |
+|-----------|-------|
+| **Base model** | bert-base-uncased |
+| **Optimizer** | AdamW |
+| **Learning rate** | 3e-5 |
+| **Learning rate schedule** | Linear with warmup |
+| **Warmup ratio** | 0.1 (10% of training) |
+| **Weight decay** | 0.01 |
+| **Batch size (train)** | 8 |
+| **Batch size (eval)** | 8 |
+| **Number of epochs** | 1 |
+| **Mixed precision** | FP16 (enabled) |
+| **Gradient accumulation** | 1 |
+| **Max gradient norm** | 1.0 |
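As a rough illustration of the "linear with warmup" schedule in the table, the sketch below computes the learning rate at a given optimizer step. It assumes the usual shape (linear ramp from 0 to the peak rate, then linear decay to 0); `linear_warmup_lr` is a hypothetical helper, `total_steps = 1000` is a made-up value, and the rounding of the warmup step count may differ from the training framework's.

```python
def linear_warmup_lr(step, total_steps, base_lr=3e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; the real value depends on the number of training features
for step in (0, 50, 100, 550, 1000):
    print(step, f"{linear_warmup_lr(step, total_steps):.2e}")
```

The peak rate of 3e-5 is reached at the end of warmup (10% of the run) and the rate returns to zero at the final step.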
+### Training Environment
+- **Hardware:** NVIDIA GPU (CUDA enabled)
+- **Framework:** PyTorch with Transformers library
+- **Training time:** ~29.5 minutes (1 epoch)
+- **Training samples/second:** 44.95
+- **Total FLOPs:** 14,541,777 GFLOPs (about 1.45 × 10^16 floating-point operations)
+### Training Metrics
+- **Final training loss:** 1.2236
+- **Evaluation strategy:** End of epoch
+- **Metric for best model:** Evaluation loss
+## Performance
+### Evaluation Results
+Evaluated on SQuAD 1.1 validation set (10,570 examples):
+| Metric | Score |
+|--------|-------|
+| **Exact Match (EM)** | **79.45%** |
+| **F1 Score** | **87.41%** |
+### Metric Explanations
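+- **Exact Match (EM):** Percentage of predictions that match a ground-truth answer exactly after normalization (lowercasing and removing punctuation, articles, and extra whitespace)
+- **F1 Score:** Token-level F1 measuring the overlap between the predicted and ground-truth answers; when a question has several reference answers, the best score is used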
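The two metrics can be sketched as follows. This follows the normalization used by the SQuAD v1.1 evaluation script as commonly described (lowercase, strip punctuation and the articles a/an/the, collapse whitespace); it is an illustrative reimplementation, not the script used to produce the scores above, and `best_over_references` is a hypothetical helper name.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, truths):
    """Score against every reference answer and keep the best."""
    return max(metric(prediction, t) for t in truths)


refs = ["7,000,000 km2", "7,000,000 km2 (2,700,000 sq mi)"]
print(best_over_references(exact_match, "7,000,000 km2", refs))
print(best_over_references(f1_score, "the 7,000,000 km2 basin", refs))
```

The dataset-level EM and F1 are the averages of these per-question scores, expressed as percentages.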
+### Comparison to BERT Base Performance
+| Model | EM | F1 | Training time |
+|-------|----|----|----------|
+| **This model (1 epoch)** | 79.45 | 87.41 | 29.5 min |
+| BERT Base (original paper, 3 epochs) | 80.8 | 88.5 | ~2-3 hours |
+| BERT Base (fully trained) | 81-84 | 88-91 | ~2-3 hours |
+**Note:** This is a baseline model trained for only 1 epoch. Performance can be improved with additional training epochs.
+### Performance by Question Type
+The model is expected to perform well on:
+- ✅ Factual questions (What, When, Where, Who)
+- ✅ Short answer spans (1-5 words)
+- ✅ Questions with clear context
+May struggle with:
+- ⚠️ Questions requiring reasoning across multiple sentences
+- ⚠️ Very long answer spans
+- ⚠️ Ambiguous questions with multiple valid answers
+- ⚠️ Questions requiring world knowledge not in context
+## Limitations and Biases
+### Known Limitations
+1. **Extractive Only:** Can only extract answers present in the context; cannot generate or synthesize answers
+2. **Single Answer:** Provides only one answer span, even if multiple valid answers exist
+3. **Context Dependency:** Requires relevant context; cannot answer from general knowledge
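+4. **Length Constraints:** Each input window is limited to 384 tokens (question, context, and special tokens combined); longer contexts are split into overlapping windows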
+5. **English Only:** Trained on English text; not suitable for other languages
+6. **Training Duration:** Only 1 epoch of training; may underfit compared to longer training
+### Potential Biases
+- **Domain Bias:** Trained primarily on Wikipedia articles; may perform worse on other text types (news, technical docs, etc.)
+- **Temporal Bias:** Training data from 2016; may have outdated information
+- **Cultural Bias:** Reflects biases present in Wikipedia content
+- **Answer Position Bias:** May favor answers appearing in certain positions within context
+- **BERT Base Biases:** Inherits any biases from the pre-trained BERT base model
+### Out-of-Scope Use
+This model should NOT be used for:
+- ❌ Medical, legal, or financial advice
+- ❌ High-stakes decision making
+- ❌ Generative question answering (creating new answers)
+- ❌ Non-English languages
+- ❌ Yes/no or multiple choice questions (without adaptation)
+- ❌ Questions requiring reasoning beyond the context
+- ❌ Real-time fact checking or verification
+## Technical Specifications
+### Model Architecture
+```
+BertForQuestionAnswering(
+  (bert): BertModel(
+    (embeddings): BertEmbeddings
+    (encoder): BertEncoder (12 layers)
+  )
+  (qa_outputs): Linear(768 -> 2)  # Start and end position logits
+)
+```
+- **Hidden size:** 768
+- **Attention heads:** 12
+- **Intermediate size:** 3072
+- **Hidden layers:** 12
+- **Vocabulary size:** 30,522
+- **Max position embeddings:** 512
+- **Total parameters:** 108,893,186
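The reported total can be reproduced from the sizes listed above. The sketch assumes standard BERT layer shapes (weights plus biases, LayerNorm with a scale and a bias), a segment vocabulary of 2, and no pooling layer in the count; a pooler would add a further 768 × 768 + 768 parameters.

```python
hidden = 768
layers = 12
intermediate = 3072
vocab = 30522
max_positions = 512
type_vocab = 2  # assumption: standard BERT segment (token type) vocabulary


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and bias

# Word, position, and segment embeddings, followed by a LayerNorm.
embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value, and output projections, followed by a LayerNorm.
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (768 -> 3072 -> 768), followed by a LayerNorm.
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
# Span head: one start logit and one end logit per token.
qa_head = linear(hidden, 2)

total = embeddings + encoder + qa_head
print(f"{total:,}")  # 108,893,186
```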
+### Input Format
+The model expects tokenized input with:
+- Question and context concatenated with `[SEP]` token
+- Format: `[CLS] question [SEP] context [SEP]`
+- Token type IDs to distinguish question (0) from context (1)
+- Attention mask to identify real vs padding tokens
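To make the layout concrete, the sketch below assembles the three input sequences by hand for already-tokenized text. It is illustrative only: in practice the tokenizer produces these tensors, `build_inputs` is a hypothetical helper, and the example tokens are made up rather than real WordPiece output.

```python
def build_inputs(question_tokens, context_tokens, max_length):
    """Assemble tokens, segment ids, and attention mask for one QA example."""
    tokens = ["[CLS]", *question_tokens, "[SEP]", *context_tokens, "[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("sequence too long; split the context into windows first")
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    attention_mask = [1] * len(tokens)
    # Pad to max_length; padding positions get mask 0 and segment id 0.
    pad = max_length - len(tokens)
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, token_type_ids, attention_mask


question = ["what", "is", "the", "capital", "of", "france", "?"]
context = ["paris", "is", "the", "capital", "."]
for row in zip(*build_inputs(question, context, max_length=16)):
    print(row)
```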
+### Output Format
+Returns:
+- `start_logits`: Scores for each token being the start of the answer span
+- `end_logits`: Scores for each token being the end of the answer span
+The predicted answer is the span from the token with the highest `start_logits` score to the token with the highest `end_logits` score, restricted to pairs where end >= start.
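One way to enforce the end >= start constraint is to search over valid (start, end) pairs and keep the pair with the highest combined score. The sketch below does this on plain Python lists; `best_span` and the `max_answer_len=30` cap are assumptions for illustration, not settings taken from this model's evaluation.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len. The end index is inclusive."""
    best = None
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy logits where independent argmax would pick start=2 and end=0 (invalid).
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 0.5, 3.0]
print(best_span(start_logits, end_logits))
```

A fuller implementation would also exclude question and padding positions before searching, so that the span always falls inside the context.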
+## Evaluation Data
+**SQuAD 1.1 Validation Set**
+- 10,570 question-context-answer triples
+- Same source and format as training data
+- Used for final performance evaluation
 ## Environmental Impact
+- **Training hardware:** 1x NVIDIA GPU
+- **Training time:** ~29.5 minutes
+- **Compute region:** Not specified
+- **Carbon footprint:** Not measured; expected to be small given the ~29.5-minute training run on a single GPU
+## Model Card Authors
+[Your Name / Team Name]
+## Model Card Contact
+[Your Email / Contact Information]
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{bert-squad-baseline-2025,
+  author = {Your Name},
+  title = {BERT Base Uncased Fine-tuned on SQuAD 1.1 (Baseline)},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/your-username/bert-squad-baseline}}
+}
+```
+### Original BERT Paper
+```bibtex
+@article{devlin2018bert,
+  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
+  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
+  journal={arXiv preprint arXiv:1810.04805},
+  year={2018}
+}
+```
+### SQuAD Dataset
+```bibtex
+@article{rajpurkar2016squad,
+  title={SQuAD: 100,000+ Questions for Machine Comprehension of Text},
+  author={Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy},
+  journal={arXiv preprint arXiv:1606.05250},
+  year={2016}
+}
+```
+## Additional Information
+### Future Improvements
+Potential enhancements for this baseline model:
+- 🔄 Train for additional epochs (2-3 epochs recommended)
+- 📈 Increase batch size with gradient accumulation
+- 🎯 Tune the learning rate and schedule (the current run already uses linear decay with warmup)
+- 🔍 Add answer validation/verification
+- 📊 Ensemble with multiple models
+- 🚀 Distillation to smaller model for deployment
+### Related Models
+- [bert-base-uncased](https://huggingface.co/bert-base-uncased) - Base model
+- [bert-large-uncased-whole-word-masking-finetuned-squad](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad) - Larger BERT variant
+- [distilbert-base-uncased-distilled-squad](https://huggingface.co/distilbert-base-uncased-distilled-squad) - Smaller, faster variant
+### Acknowledgments
+- Google Research for BERT
+- Stanford NLP for SQuAD dataset
+- Hugging Face for Transformers library
+- [Your course/institution if applicable]
+---
+**Last updated:** October 2025
+**Model version:** 1.0 (Baseline)
+**Status:** Baseline model - suitable for development/comparison