---
language:
- en
tags:
- pytorch
- transformer
- language-model
- mixture-of-experts
- tree-of-thoughts
- neural-memory
datasets:
- openai/gsm8k
- cais/mmlu
- TIGER-Lab/MMLU-Pro
- openai/MMMLU
- MMMU/MMMU
- greengerong/leetcode
- LimYeri/LeetCode_Python_Solutions_v2
- newfacade/LeetCodeDataset
- deepmind/math_dataset
- google/IFEval
- Idavidrein/gpqa
- google/frames-benchmark
- camel-ai/math
- camel-ai/code
- microsoft/SCBench
- princeton-nlp/SWE-bench_Verified
- princeton-nlp/SWE-bench
- wikimedia/wikipedia
- HuggingFace/C4
- SamuelYang/bookcorpus
- sentence-transformers/codesearchnet
- openai/openai_humaneval
license: mit
pipeline_tag: text-generation
---

# VishwamAI

VishwamAI is an enhanced transformer model that combines several cutting-edge techniques to improve reasoning, memory retention, and computational efficiency.

## Model Details

- **Developers**: VishwamAI Team
- **Architecture**: Enhanced Transformer with MoE
- **Release Date**: 2024
- **Languages**: English
- **Framework**: PyTorch
- **License**: MIT
- **Model Type**: Causal Language Model

### Technical Specifications

- Parameters: 671B
- Context Length: 32,768 tokens
- Hidden Size: 8,192
- Attention Heads: 64
- Layers: 120
- Vocabulary Size: 64,000
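
For reference, these specifications can be expressed as a small configuration object. This is an illustrative sketch only; the field names below are assumptions and do not correspond to the project's actual configuration class.

```python
from dataclasses import dataclass

@dataclass
class VishwamAIConfig:
    """Illustrative configuration mirroring the specifications listed above."""
    vocab_size: int = 64_000               # Vocabulary Size
    hidden_size: int = 8_192               # Hidden Size
    num_layers: int = 120                  # Layers
    num_attention_heads: int = 64          # Attention Heads
    max_position_embeddings: int = 32_768  # Context Length (tokens)

config = VishwamAIConfig()
print(config)
```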

## Key Innovations

1. **Differentiable Cache Augmentation**
   - Enhances the transformer's key-value cache with learnable embeddings
   - Enables asynchronous reasoning capabilities
   - Implements a gated memory-update mechanism (see the first sketch after this list)

2. **Neural Long-Term Memory**
   - Memory layers with read/write/forget gates (see the second sketch after this list)
   - Multi-head memory attention mechanisms
   - Hierarchical memory organization

3. **Tree of Thoughts Reasoning**
   - Multi-path reasoning exploration
   - Beam search over solution paths (see the third sketch after this list)
   - Intermediate step evaluation
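
The first sketch illustrates the gated cache-update idea: learnable embeddings are blended into a key-value cache summary through a sigmoid gate. The class name, shapes, and pooling assumptions are illustrative; this is not the model's actual implementation.

```python
import torch
import torch.nn as nn

class CacheAugmenter(nn.Module):
    """Augment a KV-cache summary with learnable embeddings and gate their updates."""

    def __init__(self, hidden_size: int, num_mem_tokens: int = 16):
        super().__init__()
        # Learnable embeddings that augment the transformer's key-value cache
        self.mem_tokens = nn.Parameter(0.02 * torch.randn(num_mem_tokens, hidden_size))
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, cache_summary: torch.Tensor) -> torch.Tensor:
        # cache_summary: (batch, num_mem_tokens, hidden_size), e.g. pooled KV-cache states
        mem = self.mem_tokens.unsqueeze(0).expand_as(cache_summary)
        # Gated update: decide how much new cache content to blend into the memory slots
        g = torch.sigmoid(self.gate(torch.cat([mem, cache_summary], dim=-1)))
        return g * cache_summary + (1.0 - g) * mem
```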
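
The second sketch shows read/write/forget gating for a long-term memory in its simplest form. The single flat memory and the update rule are simplifying assumptions; the model itself uses multi-head memory attention and a hierarchical organization.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy long-term memory with read, write, and forget gates."""

    def __init__(self, hidden_size: int, num_slots: int = 64):
        super().__init__()
        self.register_buffer("memory", torch.zeros(num_slots, hidden_size))
        self.read_gate = nn.Linear(hidden_size, 1)
        self.write_gate = nn.Linear(hidden_size, 1)
        self.forget_gate = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden_size) token or segment representation
        # Read: attention-style similarity against the memory slots
        attn = torch.softmax(x @ self.memory.t(), dim=-1)            # (batch, num_slots)
        read = torch.sigmoid(self.read_gate(x)) * (attn @ self.memory)
        # Write/forget: decay old contents, then add the gated new content
        forget = torch.sigmoid(self.forget_gate(x)).mean()
        write = torch.sigmoid(self.write_gate(x))
        update = (attn.t() @ (write * x)) / x.shape[0]
        # Detach so the stored memory does not retain the autograd graph
        self.memory = (forget * self.memory + update).detach()
        return x + read
```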
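
The third sketch outlines Tree of Thoughts as a beam search over partial reasoning paths. The `propose` and `score` callables are placeholders for whatever generation and evaluation functions the released code provides; only the control flow is shown.

```python
from typing import Callable, List, Tuple

def tree_of_thoughts(
    question: str,
    propose: Callable[[str], List[str]],  # generates candidate next thoughts for a partial path
    score: Callable[[str], float],        # evaluates an intermediate reasoning path
    beam_width: int = 3,
    depth: int = 4,
) -> str:
    """Beam search over reasoning paths; returns the highest-scoring path."""
    beam: List[Tuple[float, str]] = [(0.0, question)]
    for _ in range(depth):
        candidates: List[Tuple[float, str]] = []
        for _, path in beam:
            for thought in propose(path):
                new_path = path + "\n" + thought
                candidates.append((score(new_path), new_path))
        if not candidates:  # no expansions proposed; stop early
            break
        # Keep only the top-scoring partial paths (the beam)
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam[0][1]
```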

## Training Data

The model is being trained on a diverse set of datasets, including:

1. **GSM8K**
   - Grade school math word problems
   - Tests mathematical reasoning capabilities

2. **MMLU (Massive Multitask Language Understanding)**
   - Broad knowledge evaluation
   - Multiple academic and professional domains

3. **MMLU-Pro**
   - Professional and specialized knowledge
   - Advanced reasoning tasks

4. **MMMLU (Multilingual MMLU)**
   - MMLU questions translated into multiple languages
   - Cross-lingual knowledge evaluation

## Training Procedure

### Hardware Requirements

- Minimum: Single NVIDIA A100 (80GB)
- Recommended: Multiple A100s with NVLink
- Distributed Training: Supported via FSDP

### Software Requirements

- PyTorch >= 2.0
- CUDA >= 11.8
- [Optional] NCCL for distributed training

### Optimization

The following techniques keep training tractable (see the sketch after this list):

- FP8 precision training
- Fully Sharded Data Parallel (FSDP)
- Gradient checkpointing
- Mixed precision training
- CPU offloading capabilities
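
As an illustration of how these pieces fit together, the snippet below wraps a model in PyTorch FSDP with bf16 mixed precision, activation (gradient) checkpointing, and optional CPU offloading. It is a generic PyTorch recipe, not the project's training script, and FP8 training additionally requires dedicated kernels (e.g. Transformer Engine) that are not shown here.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, CPUOffload
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

def wrap_for_training(model: torch.nn.Module, offload_to_cpu: bool = False) -> FSDP:
    """Wrap a model with FSDP, bf16 mixed precision, and activation checkpointing."""
    # Assumes torch.distributed.init_process_group("nccl") has already been called
    mp_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    fsdp_model = FSDP(
        model,
        mixed_precision=mp_policy,
        cpu_offload=CPUOffload(offload_params=offload_to_cpu),
        device_id=torch.cuda.current_device(),
    )
    # Recompute activations in the backward pass to trade compute for memory;
    # in practice, pass check_fn=... to checkpoint only the transformer blocks
    apply_activation_checkpointing(fsdp_model)
    return fsdp_model
```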

## Intended Use

This model is designed for:

- Research in language model capabilities
- Development of reasoning-enhanced applications
- Exploration of memory-augmented architectures

### Primary Intended Uses

1. **Research and Development**
   - Study of neural memory mechanisms
   - Investigation of reasoning capabilities
   - Architecture optimization research

2. **Educational Applications**
   - Mathematical problem solving
   - Complex reasoning tasks
   - Knowledge retrieval and application

### Out-of-Scope Uses

- Production deployment (currently in research phase)
- Safety-critical applications
- Real-time applications requiring low latency

## Evaluation Results

The model is currently in the training and evaluation phase. Initial metrics will be published after training completes.

## Limitations

1. **Current Development Status**
   - Training in progress
   - Performance metrics are preliminary
   - Features under active development

2. **Technical Limitations**
   - High computational requirements
   - Large memory footprint
   - Complex deployment needs

3. **Capability Limitations**
   - Reasoning capabilities still being optimized
   - Memory mechanisms under refinement
   - Limited multilingual support

## Bias and Ethics

- The model is currently in the research phase
- A full bias evaluation is pending
- Not recommended for production use
- Safety measures are being implemented

## Environmental Impact

We are working to minimize environmental impact through:

- Efficient training procedures
- Optimized architecture
- Resource-aware deployment options

## Citation

```bibtex
@software{vishwamai2024,
  author = {Kasinadhsarma},
  title = {VishwamAI: Enhanced Transformer with Advanced Reasoning Capabilities},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/VishwamAI/VishwamAI}
}
```

## Example Usage

```python
from vishwamai.model_utils import load_model

# Load the model onto a GPU
model = load_model("vishwamai/model", device="cuda")

# A compatible tokenizer must be loaded separately (not shown here)
input_ids = tokenizer.encode("Solve this problem step by step:", return_tensors="pt").to("cuda")

# Forward pass; generation utilities depend on the released API
output = model(input_ids)
```

## Additional Information

- **Repository**: [GitHub Repository](https://github.com/VishwamAI/VishwamAI)
- **Issues**: [GitHub Issues](https://github.com/VishwamAI/VishwamAI/issues)
- **Documentation**: Under construction; we are actively developing it

## Acknowledgments

This project builds upon several research papers and open-source projects. We thank the authors and contributors of:

- Transformer architectures
- Mixture of Experts implementations
- Tree of Thoughts reasoning
- Neural memory architectures