---
tags:
- kernel
license: apache-2.0
---

# Activation

Activation is a Python package that contains custom CUDA-based activation kernels, primarily targeting AMD GPUs.

- Currently implemented
  - [PolyNorm](https://arxiv.org/html/2411.03884v1)
  - [RMSNorm](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html)
  - **FusedAddRMSNorm**

    A fused operator that combines **residual addition** (`x + residual`) with **RMSNorm** in a single kernel.

    - Instead of:

      ```python
      y = x + residual
      hidden_state = rms_norm(y, weight, eps)
      out = y + some_op(hidden_state)
      ```

    - Fused as:

      ```python
      hidden_state, y = fused_add_rms_norm(x, residual, weight, eps)
      out = y + some_op(hidden_state)
      ```

  - **FusedMulPolyNorm**

    A fused operator that combines **PolyNorm** with an **element-wise multiplication** by a tensor.

    - Instead of:

      ```python
      y = poly_norm(x, weight, bias, eps)
      out = y * a
      ```

    - Fused as:

      ```python
      out = fused_mul_poly_norm(x, a, weight, bias, eps)
      ```

## Usage

```python
import torch
from kernels import get_kernel

activation = get_kernel("motif-technologies/activation")
torch.set_default_device("cuda")

poly_norm = activation.layers.PolyNorm(eps=1e-6)
x = torch.randn(10, 10)

print(poly_norm(x))
```

## Performance

- Test cases are taken from the Motif LLM.
- The results can be reproduced with the provided benchmarking tools; see the [benchmarks README](./benchmarks/README.md) for details on how to use them.
- The benchmark results may show fluctuations, especially in the backward pass and when the dimension size is small.

### RMSNorm

#### H100 Results
**Forward Performance**

![RMSNorm Forward Performance](./benchmarks/plots/h100/rms/plot_rms-fwd-perf.png)

**Backward Performance**

![RMSNorm Backward Performance](./benchmarks/plots/h100/rms/plot_rms-bwd-perf.png)
#### MI250 Results
**Forward Performance**

![RMSNorm Forward Performance](./benchmarks/plots/mi250/rms/plot_rms-fwd-perf.png)

**Backward Performance**

![RMSNorm Backward Performance](./benchmarks/plots/mi250/rms/plot_rms-bwd-perf.png)
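As a quick sanity check, the RMSNorm kernel can be compared against PyTorch's built-in implementation. The snippet below is a minimal sketch, assuming the functional `rms_norm` entry point from the fusion snippets above is exposed on the loaded module; `torch.nn.functional.rms_norm` requires PyTorch >= 2.4:

```python
import torch
from kernels import get_kernel

activation = get_kernel("motif-technologies/activation")
torch.set_default_device("cuda")

hidden = 4096
x = torch.randn(16, hidden)
weight = torch.randn(hidden)

# Custom kernel (assumed functional entry point) vs. PyTorch's reference;
# both compute x / sqrt(mean(x^2) + eps), scaled by weight.
out = activation.rms_norm(x, weight, 1e-6)
ref = torch.nn.functional.rms_norm(x, (hidden,), weight=weight, eps=1e-6)
torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)
```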
---

### FusedAddRMSNorm

> [!NOTE]
> For the fused-operator benchmarks, the **non-fused baseline** was implemented with our **custom kernels**.

#### H100 Results
**Forward Performance**

![FusedAddRMSNorm Forward Performance](./benchmarks/plots/h100/add_rms/plot_add_rms-fwd-perf.png)

**Backward Performance**

![FusedAddRMSNorm Backward Performance](./benchmarks/plots/h100/add_rms/plot_add_rms-bwd-perf.png)
#### MI250 Results
**Forward Performance**

![FusedAddRMSNorm Forward Performance](./benchmarks/plots/mi250/add_rms/plot_add_rms-fwd-perf.png)

**Backward Performance**

![FusedAddRMSNorm Backward Performance](./benchmarks/plots/mi250/add_rms/plot_add_rms-bwd-perf.png)
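The numbers above compare the fused kernel against a non-fused baseline built from the same custom ops. A minimal equivalence sketch of that comparison, assuming the functional entry points from the snippets in the introduction are exposed on the loaded module:

```python
import torch
from kernels import get_kernel

activation = get_kernel("motif-technologies/activation")
torch.set_default_device("cuda")

hidden = 4096
x = torch.randn(16, hidden)
residual = torch.randn(16, hidden)
weight = torch.ones(hidden)
eps = 1e-6

# Non-fused baseline: residual add, then RMSNorm (two kernel launches).
y_ref = x + residual
hidden_ref = activation.rms_norm(y_ref, weight, eps)  # assumed entry point

# Fused: one kernel returns the normalized output and the updated residual.
hidden_fused, y_fused = activation.fused_add_rms_norm(x, residual, weight, eps)

torch.testing.assert_close(hidden_fused, hidden_ref, rtol=1e-3, atol=1e-3)
torch.testing.assert_close(y_fused, y_ref, rtol=1e-3, atol=1e-3)
```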
---

### PolyNorm

#### H100 Results
**Forward Performance**

![PolyNorm Forward Performance](./benchmarks/plots/h100/poly/plot_poly-fwd-perf.png)

**Backward Performance**

![PolyNorm Backward Performance](./benchmarks/plots/h100/poly/plot_poly-bwd-perf.png)
#### MI250 Results
**Forward Performance**

![PolyNorm Forward Performance](./benchmarks/plots/mi250/poly/plot_poly-fwd-perf.png)

**Backward Performance**

![PolyNorm Backward Performance](./benchmarks/plots/mi250/poly/plot_poly-bwd-perf.png)
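For readers unfamiliar with the op being benchmarked: PolyNorm can be written in a few lines of eager PyTorch. The sketch below is a plain reference for readability, not the CUDA kernel, and the parameterization (one weight per normalized power of `x`, plus a scalar bias) is our reading of the linked paper:

```python
import torch

def poly_norm_reference(x: torch.Tensor, weight: torch.Tensor,
                        bias: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Eager-mode PolyNorm: a weighted sum of RMS-normalized powers of x.

    Assumes ``weight`` has three elements (for x**3, x**2, x) and ``bias``
    is a scalar, following arXiv:2411.03884.
    """
    def rms(v: torch.Tensor) -> torch.Tensor:
        # RMS normalization over the last dimension.
        return v * torch.rsqrt(v.pow(2).mean(dim=-1, keepdim=True) + eps)

    return (weight[0] * rms(x ** 3)
            + weight[1] * rms(x ** 2)
            + weight[2] * rms(x)
            + bias)
```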
---

### FusedMulPolyNorm

> [!NOTE]
> For the fused-operator benchmarks, the **non-fused baseline** was implemented with our **custom kernels**.

#### H100 Results
**Forward Performance**

![FusedMulPolyNorm Forward Performance](./benchmarks/plots/h100/mul_poly/plot_mul_poly-fwd-perf.png)

**Backward Performance**

![FusedMulPolyNorm Backward Performance](./benchmarks/plots/h100/mul_poly/plot_mul_poly-bwd-perf.png)
#### MI250 Results
**Forward Performance**

![FusedMulPolyNorm Forward Performance](./benchmarks/plots/mi250/mul_poly/plot_mul_poly-fwd-perf.png)

**Backward Performance**

![FusedMulPolyNorm Backward Performance](./benchmarks/plots/mi250/mul_poly/plot_mul_poly-bwd-perf.png)
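As with FusedAddRMSNorm, the fused kernel should be numerically equivalent to the two-step baseline it replaces. A minimal sketch, again assuming the functional entry points from the introduction are exposed on the loaded module, and that `weight` holds three coefficients with a scalar `bias`:

```python
import torch
from kernels import get_kernel

activation = get_kernel("motif-technologies/activation")
torch.set_default_device("cuda")

hidden = 4096
x = torch.randn(16, hidden)
a = torch.randn(16, hidden)
weight = torch.randn(3)   # assumed: one coefficient per normalized power
bias = torch.zeros(())    # assumed: scalar bias
eps = 1e-6

# Non-fused baseline: PolyNorm, then an element-wise multiply.
ref = activation.poly_norm(x, weight, bias, eps) * a  # assumed entry point

# Fused: both steps in a single kernel launch.
out = activation.fused_mul_poly_norm(x, a, weight, bias, eps)

torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)
```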
## Pre-commit Hooks

This project uses [pre-commit](https://pre-commit.com/) to automatically check and format code before commits.

### Setup

1. Install pre-commit:

   ```bash
   pip install pre-commit
   ```

2. Install the git hooks:

   ```bash
   pre-commit install
   ```

Once installed, the configured hooks will run automatically on each commit.

### Included Hooks

The following tools are run via pre-commit:

- **[yapf](https://github.com/google/yapf)** – Python code formatter
- **[typos](https://github.com/crate-ci/typos)** – Spell checker for common typos
- **[isort](https://github.com/PyCQA/isort)** – Organizes and sorts Python imports
- **[clang-format](https://clang.llvm.org/docs/ClangFormat.html)** – Formats C++/CUDA code (`--style=file`)
- **[pymarkdown](https://github.com/jackdewinter/pymarkdown)** – Lints and auto-fixes Markdown files
- **[actionlint](https://github.com/rhysd/actionlint)** – Validates GitHub Actions workflows

### Usage

- Run all checks on the entire codebase:

  ```bash
  pre-commit run --all-files
  ```

- Run a specific hook (example: isort):

  ```bash
  pre-commit run isort --all-files
  ```