---
license: apache-2.0
language:
- en
datasets:
- wikitext
- glue
pipeline_tag: text-generation
tags:
- transformer
- attention
- mla
- research
---

# Deepseek Tiny V0.1

A 6-layer DeepSeek-V3 model with Multi-head Latent Attention (MLA), trained for research on shared subspaces in Transformer attention mechanisms.

## Model Description

- **Model Type**: Transformer Decoder (DeepSeek-V3 based)
- **Architecture**: 6-layer decoder with Mixture of Experts
- **Parameters**: 16.26M
- **Hidden Size**: 256
- **Attention Heads**: 8
- **Head Dimension**: 32
- **Sequence Length**: 1,024 tokens
- **Query Latent Dimension**: 96
- **Key-Value Latent Dimension**: 64
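
The hyperparameters above can be checked against the checkpoint's configuration. The snippet below is a minimal sketch, assuming the checkpoint loads with the standard `AutoConfig` class and uses the `DeepseekV3Config` field names from `transformers` (e.g. `q_lora_rank` / `kv_lora_rank` for the query and key-value latent dimensions); adjust the attribute names if this checkpoint's config differs.

```python
from transformers import AutoConfig

# Load the configuration and inspect the MLA-related hyperparameters.
config = AutoConfig.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")

print(config.num_hidden_layers)    # 6
print(config.hidden_size)          # 256
print(config.num_attention_heads)  # 8
print(config.q_lora_rank)          # query latent dimension (96 per the card)
print(config.kv_lora_rank)         # key-value latent dimension (64 per the card)
```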

## Performance

- **SST-2 Accuracy**: 87.96%
- **WikiText-103 Perplexity**: 28.89
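
For context, perplexity on WikiText-103 can be estimated along the following lines. This is a rough sketch, assuming the checkpoint loads through the standard `AutoModelForCausalLM` / `AutoTokenizer` classes and using the `wikitext-103-raw-v1` validation split; it scores simple non-overlapping 1,024-token windows, so the result will not exactly match the figure reported above.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ChrisMcCormick/deepseek-tiny-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize the validation split as one long stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    # Score non-overlapping 1,024-token windows (the model's sequence length).
    for start in range(0, ids.size(1), 1024):
        window = ids[:, start : start + 1024]
        if window.size(1) < 2:
            break
        loss = model(window, labels=window).loss  # mean NLL over window.size(1) - 1 targets
        total_nll += loss.item() * (window.size(1) - 1)
        total_tokens += window.size(1) - 1

print("perplexity:", math.exp(total_nll / total_tokens))
```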

## Research Context

This model is part of the [shared-subspaces](https://github.com/chrisjmccormick/shared-subspaces) research project investigating the impact of shared output latent spaces in Transformer attention mechanisms.

## Usage

```python
from transformers import DeepseekV3ForCausalLM, AutoTokenizer

# Load model and tokenizer
model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
tokenizer = AutoTokenizer.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")

# Generate text. `do_sample=True` is required for `temperature` to take effect,
# and `max_new_tokens` bounds the continuation rather than the total length.
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Pre-training Dataset**: WikiText-103
- **Fine-tuning Dataset**: SST-2 (GLUE)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-4 (pre-training), 5e-5 (fine-tuning)
- **Weight Decay**: 0.01 (pre-training), 0.05 (fine-tuning)
- **Precision**: bfloat16
- **Compilation**: torch.compile with inductor backend
- **Training Steps**: 12,500 (pre-training), 1,500 (fine-tuning)
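
For orientation, the pre-training settings above map naturally onto Hugging Face `TrainingArguments`. The sketch below is illustrative only and is not the project's actual training script; the batch size is a placeholder, since it is not stated in this card.

```python
from transformers import TrainingArguments

# Pre-training settings from the list above; batch size is a placeholder.
pretrain_args = TrainingArguments(
    output_dir="deepseek-tiny-pretrain",
    max_steps=12_500,
    learning_rate=5e-4,
    weight_decay=0.01,
    optim="adamw_torch",
    bf16=True,
    torch_compile=True,
    torch_compile_backend="inductor",
    per_device_train_batch_size=32,  # not specified in the card
)
```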

## Limitations

- Small-scale model (16M parameters) intended for research purposes
- Trained on limited data compared to production models
- May require custom loading code for output subspace variants (see the sketch below)
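
If a variant ships its own modeling code on the Hub, loading would typically look like the following. This is a hypothetical sketch: the repo id is a placeholder for whichever variant you are loading, and `trust_remote_code=True` is only needed when the checkpoint actually bundles custom code.

```python
from transformers import AutoModelForCausalLM

# Hypothetical: replace the repo id with the output-subspace variant you want.
model = AutoModelForCausalLM.from_pretrained(
    "ChrisMcCormick/deepseek-tiny-v0.1",
    trust_remote_code=True,  # allow the checkpoint's bundled modeling code to run
)
```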

## Citation

```bibtex
@misc{mccormick2025sharedsubspaces,
  title={Shared Subspaces in Transformer Attention: Investigating Output Latent Spaces},
  author={McCormick, Chris},
  year={2025},
  howpublished={\url{https://github.com/chrisjmccormick/shared-subspaces}}
}
```

## License

Apache 2.0