Commit bf7364f
Parent(s): 2d7348d

cross referencing other transformer-related implementations
README.md CHANGED

@@ -11,10 +11,12 @@ language: en
 license: mit
 ---

-# DeepSeek Multi-Latent Attention
+# DeepSeek Multi-Head Latent Attention

 This repository provides a PyTorch implementation of the Multi-Head Latent Attention (MLA) mechanism introduced in the DeepSeek-V2 paper. **This is not a trained model, but rather a modular attention implementation** that significantly reduces the KV cache for efficient inference while maintaining model performance through its innovative architecture. It can be used as a drop-in attention module in transformer architectures.

+This repository is part of a series implementing the key architectural innovations from the DeepSeek paper. See the **Related Implementations** section for the complete series.
+
 ## Key Features

 - **Low-Rank Key-Value Joint Compression**: Reduces memory footprint during inference
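For context on the memory saving claimed in this hunk: MLA's low-rank key-value joint compression caches one small latent vector per token and re-expands per-head keys and values from it at attention time. Below is a minimal PyTorch sketch of that idea; the class, projection names, and dimensions are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    """Minimal sketch of MLA-style low-rank key-value joint compression.

    A standard KV cache stores 2 * n_heads * head_dim values per token.
    Here only a small joint latent (kv_latent_dim values per token) is
    cached; keys and values are reconstructed from it at attention time.
    All names and sizes are illustrative, not the repository's API.
    """

    def __init__(self, d_model=512, n_heads=8, kv_latent_dim=64):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.down_proj = nn.Linear(d_model, kv_latent_dim, bias=False)  # compress to joint latent
        self.k_up_proj = nn.Linear(kv_latent_dim, d_model, bias=False)  # expand latent -> keys
        self.v_up_proj = nn.Linear(kv_latent_dim, d_model, bias=False)  # expand latent -> values

    def forward(self, hidden_states, kv_cache=None):
        # Compress the new tokens into the joint latent and append to the cache.
        latent = self.down_proj(hidden_states)             # [batch, seq, kv_latent_dim]
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)  # only the latent is ever cached
        # Reconstruct full per-head keys/values from the cached latent.
        b, s, _ = latent.shape
        k = self.k_up_proj(latent).view(b, s, self.n_heads, self.head_dim)
        v = self.v_up_proj(latent).view(b, s, self.n_heads, self.head_dim)
        return k, v, latent  # `latent` is what persists between decode steps


# With these sizes the cache holds 64 values per token instead of 2 * 512 = 1024.
block = LowRankKVCompression()
k, v, cache = block(torch.randn(1, 10, 512))                  # prefill 10 tokens
k, v, cache = block(torch.randn(1, 1, 512), kv_cache=cache)   # one decode step
print(cache.shape)  # torch.Size([1, 11, 64])
```

In the full MLA design the up-projections can additionally be absorbed into the query and output projections, so the expanded keys and values never need to be materialized during decoding; the cache-size arithmetic stays the same.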
@@ -114,6 +116,18 @@ Key aspects:
 - Position encoding through decoupled RoPE pathway
 - Efficient cache management for both pathways

+## Related Implementations
+
+This repository is part of a series implementing the key architectural innovations from the DeepSeek paper:
+
+1. **[DeepSeek Multi-head Latent Attention](https://huggingface.co/bird-of-paradise/deepseek-mla)** (this repository): Implementation of DeepSeek's MLA mechanism for efficient KV cache usage during inference.
+
+2. **[DeepSeek MoE](https://huggingface.co/bird-of-paradise/deepseek-moe)**: Implementation of DeepSeek's Mixture of Experts architecture that enables efficient scaling of model parameters.
+
+3. **[Transformer Implementation Tutorial](https://huggingface.co/datasets/bird-of-paradise/transformer-from-scratch-tutorial)**: A detailed tutorial on implementing the transformer architecture with explanations of key components.
+
+Together, these implementations cover the core innovations that power DeepSeek's state-of-the-art performance. By combining the MoE architecture with Multi-Head Latent Attention, you can build a complete DeepSeek-style model with improved training efficiency and inference performance.
+
 ## Contributing

 Contributions are welcome! Feel free to:
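The second hunk's context lines mention a decoupled RoPE pathway and efficient cache management for both pathways. The sketch below illustrates one way the two per-token cache entries could be organized: the compressed KV latent for content and a small shared rotary key for position. Names and sizes are assumptions for illustration, not the repository's interface.

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, positions, base=10000.0):
    # Standard rotary position embedding, applied only to the decoupled pathway.
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # [seq, dim/2]
    emb = torch.cat((angles, angles), dim=-1)                 # [seq, dim]
    return x * emb.cos() + rotate_half(x) * emb.sin()

class DecoupledKVCache:
    """Sketch of per-pathway caching: the joint KV latent (content pathway)
    and a small RoPE'd key shared across heads (position pathway) are stored
    separately, so full keys and values are never cached."""

    def __init__(self):
        self.kv_latent = None  # [batch, seq, kv_latent_dim]
        self.k_rope = None     # [batch, seq, rope_dim]

    def update(self, new_latent, new_k_rope):
        cat = lambda old, new: new if old is None else torch.cat([old, new], dim=1)
        self.kv_latent = cat(self.kv_latent, new_latent)
        self.k_rope = cat(self.k_rope, new_k_rope)
        return self.kv_latent, self.k_rope

# Example: prefill 10 tokens, then one decode step.
cache = DecoupledKVCache()
latent = torch.randn(1, 10, 64)                                    # e.g. from the compression sketch above
k_rope = apply_rope(torch.randn(1, 10, 32), torch.arange(10))
cache.update(latent, k_rope)
new_latent = torch.randn(1, 1, 64)
new_k_rope = apply_rope(torch.randn(1, 1, 32), torch.arange(10, 11))
kv_latent, k_rope_all = cache.update(new_latent, new_k_rope)
print(kv_latent.shape, k_rope_all.shape)  # torch.Size([1, 11, 64]) torch.Size([1, 11, 32])
```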