---
license: apache-2.0
language:
- en
datasets:
- wikitext
- glue
pipeline_tag: text-generation
tags:
- transformer
- attention
- mla
- research
---

# Deepseek Tiny V0.1

A 6-layer DeepSeek-V3 model with Multi-head Latent Attention (MLA), trained for research on shared subspaces in Transformer attention mechanisms.

## Model Description

- **Model Type**: Transformer Decoder (DeepSeek-V3 based)
- **Architecture**: 6-layer decoder with Mixture of Experts
- **Parameters**: 16.26M
- **Hidden Size**: 256
- **Attention Heads**: 8
- **Head Dimension**: 32
- **Sequence Length**: 1,024 tokens
- **Query Latent Dimension**: 96
- **Key-Value Latent Dimension**: 64
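
The hyperparameters above can be checked against the checkpoint's configuration. The snippet below is a minimal sketch, assuming the checkpoint loads with the standard `AutoConfig` class and uses the `DeepseekV3Config` field names from `transformers` (e.g. `q_lora_rank` / `kv_lora_rank` for the query and key-value latent dimensions); adjust the attribute names if this checkpoint's config differs.

```python
from transformers import AutoConfig

# Load the configuration and inspect the MLA-related hyperparameters.
config = AutoConfig.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")

print(config.num_hidden_layers)    # 6
print(config.hidden_size)          # 256
print(config.num_attention_heads)  # 8
print(config.q_lora_rank)          # query latent dimension (96 per the card)
print(config.kv_lora_rank)         # key-value latent dimension (64 per the card)
```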

## Performance

- **SST-2 Accuracy**: 87.96%
- **WikiText-103 Perplexity**: 28.89
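
For context, perplexity on WikiText-103 can be estimated along the following lines. This is a rough sketch, assuming the checkpoint loads through the standard `AutoModelForCausalLM` / `AutoTokenizer` classes and using the `wikitext-103-raw-v1` validation split; it scores simple non-overlapping 1,024-token windows, so the result will not exactly match the figure reported above.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ChrisMcCormick/deepseek-tiny-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize the validation split as one long stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    # Score non-overlapping 1,024-token windows (the model's sequence length).
    for start in range(0, ids.size(1), 1024):
        window = ids[:, start : start + 1024]
        if window.size(1) < 2:
            break
        loss = model(window, labels=window).loss  # mean NLL over window.size(1) - 1 targets
        total_nll += loss.item() * (window.size(1) - 1)
        total_tokens += window.size(1) - 1

print("perplexity:", math.exp(total_nll / total_tokens))
```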

## Research Context

This model is part of the [shared-subspaces](https://github.com/chrisjmccormick/shared-subspaces) research project investigating the impact of shared output latent spaces in Transformer attention mechanisms.

## Usage

```python
from transformers import DeepseekV3ForCausalLM, AutoTokenizer

# Load model and tokenizer
model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
tokenizer = AutoTokenizer.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")

# Generate text. `do_sample=True` is required for `temperature` to take effect,
# and `max_new_tokens` bounds the continuation rather than the total length.
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Pre-training Dataset**: WikiText-103
- **Fine-tuning Dataset**: SST-2 (GLUE)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-4 (pre-training), 5e-5 (fine-tuning)
- **Weight Decay**: 0.01 (pre-training), 0.05 (fine-tuning)
- **Precision**: bfloat16
- **Compilation**: torch.compile with inductor backend
- **Training Steps**: 12,500 (pre-training), 1,500 (fine-tuning)
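
For orientation, the pre-training settings above map naturally onto Hugging Face `TrainingArguments`. The sketch below is illustrative only and is not the project's actual training script; the batch size is a placeholder, since it is not stated in this card.

```python
from transformers import TrainingArguments

# Pre-training settings from the list above; batch size is a placeholder.
pretrain_args = TrainingArguments(
    output_dir="deepseek-tiny-pretrain",
    max_steps=12_500,
    learning_rate=5e-4,
    weight_decay=0.01,
    optim="adamw_torch",
    bf16=True,
    torch_compile=True,
    torch_compile_backend="inductor",
    per_device_train_batch_size=32,  # not specified in the card
)
```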

## Limitations

- Small-scale model (16M parameters) intended for research purposes
- Trained on limited data compared to production models
- May require custom loading code for output subspace variants (see the sketch below)
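
If a variant ships its own modeling code on the Hub, loading would typically look like the following. This is a hypothetical sketch: the repo id is a placeholder for whichever variant you are loading, and `trust_remote_code=True` is only needed when the checkpoint actually bundles custom code.

```python
from transformers import AutoModelForCausalLM

# Hypothetical: replace the repo id with the output-subspace variant you want.
model = AutoModelForCausalLM.from_pretrained(
    "ChrisMcCormick/deepseek-tiny-v0.1",
    trust_remote_code=True,  # allow the checkpoint's bundled modeling code to run
)
```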

## Citation

```bibtex
@misc{mccormick2025sharedsubspaces,
  title={Shared Subspaces in Transformer Attention: Investigating Output Latent Spaces},
  author={McCormick, Chris},
  year={2025},
  howpublished={\url{https://github.com/chrisjmccormick/shared-subspaces}}
}
```

## License

Apache 2.0