|
--- |
|
tags: |
|
- kernel |
|
license: apache-2.0 |
|
--- |
|
|
|
# Activation |
|
|
|
Activation is a Python package that provides custom CUDA-based activation kernels, primarily targeting AMD GPUs.
|
|
|
Currently implemented:

- [PolyNorm](https://arxiv.org/html/2411.03884v1)
- [RMSNorm](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html)
- **FusedAddRMSNorm** (residual addition fused with RMSNorm, described below)
- **FusedMulPolyNorm** (PolyNorm fused with an element-wise multiplication, described below)

**FusedAddRMSNorm**

A fused operator that combines **residual addition** (`x + residual`) with **RMSNorm** in a single kernel.

Instead of:
|
|
|
```python |
|
y = x + residual |
|
hidden_state = rms_norm(y, weight, eps) |
|
out = y + some_op(hidden_state) |
|
``` |
|
|
|
Fused as:
|
|
|
```python |
|
hidden_state, y = fused_add_rms_norm(x, residual, weight, eps) |
|
out = y + some_op(hidden_state) |
|
``` |
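
To make the semantics concrete, here is a minimal pure-PyTorch sketch of what `fused_add_rms_norm` computes. This is a reference for the math only, not the kernel's implementation; the signature follows the pseudo-code above.

```python
import torch


def fused_add_rms_norm_ref(x: torch.Tensor,
                           residual: torch.Tensor,
                           weight: torch.Tensor,
                           eps: float = 1e-6):
    """Reference semantics: residual addition followed by RMSNorm over the last dim."""
    y = x + residual                                # residual addition
    variance = y.pow(2).mean(dim=-1, keepdim=True)  # mean of squares
    hidden_state = y * torch.rsqrt(variance + eps) * weight
    return hidden_state, y                          # normalized output + updated residual
```

Returning `y` alongside the normalized output lets the caller reuse the updated residual without recomputing the addition, and doing both steps in one kernel avoids an extra round trip of `y` through global memory.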
|
|
|
**FusedMulPolyNorm**

A fused operator that combines **PolyNorm** with an **element-wise multiplication** by a tensor.

Instead of:
|
|
|
```python |
|
y = poly_norm(x, weight, bias, eps) |
|
out = y * a |
|
``` |
|
|
|
Fused as:
|
|
|
```python |
|
out = fused_mul_poly_norm(x, a, weight, bias, eps) |
|
``` |
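
Similarly, here is a pure-PyTorch sketch of the assumed semantics of `poly_norm` and `fused_mul_poly_norm`. Based on the PolyNorm paper linked above and the `(x, weight, bias, eps)` signature, `weight` is assumed to hold three coefficients (one per power of `x`) and `bias` a scalar offset; the term ordering and normalization details are assumptions, so check the kernel source for the exact definition.

```python
import torch


def _rms(u: torch.Tensor, eps: float) -> torch.Tensor:
    # RMS-style normalization over the last dimension
    return u * torch.rsqrt(u.pow(2).mean(dim=-1, keepdim=True) + eps)


def poly_norm_ref(x: torch.Tensor,
                  weight: torch.Tensor,  # assumed shape (3,): one coefficient per power
                  bias: torch.Tensor,    # assumed scalar offset
                  eps: float = 1e-6) -> torch.Tensor:
    """Assumed reference semantics: weighted sum of normalized powers of x plus a bias."""
    return (weight[0] * _rms(x, eps)
            + weight[1] * _rms(x.pow(2), eps)
            + weight[2] * _rms(x.pow(3), eps)
            + bias)


def fused_mul_poly_norm_ref(x, a, weight, bias, eps=1e-6):
    """Reference semantics of the fused op: PolyNorm followed by element-wise multiply by a."""
    return poly_norm_ref(x, weight, bias, eps) * a
```

Fusing the multiplication into the PolyNorm kernel avoids materializing the intermediate `y = poly_norm(x, ...)` tensor.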
|
|
|
## Usage |
|
|
|
```python |
|
import torch |
|
from kernels import get_kernel |
|
|
|
activation = get_kernel("motif-technologies/activation") |
|
|
|
torch.set_default_device("cuda") |
|
poly_norm = activation.layers.PolyNorm(eps=1e-6) |
|
x = torch.randn(10, 10) |
|
|
|
print(poly_norm(x)) |
|
``` |
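
The layer in the snippet above is expected to behave like an ordinary PyTorch module, including in the backward pass; this is an assumption consistent with the backward benchmarks below, so check the package if in doubt. A small self-contained sketch:

```python
import torch
from kernels import get_kernel

activation = get_kernel("motif-technologies/activation")
torch.set_default_device("cuda")

poly_norm = activation.layers.PolyNorm(eps=1e-6)

# Quick check that gradients flow through the custom op
# (the backward kernels are what the backward benchmarks below measure).
x = torch.randn(4, 4096, requires_grad=True)
out = poly_norm(x)
out.sum().backward()
print(x.grad.shape)  # same shape as x
```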
|
|
|
## Performance |
|
- Test cases are from the Motif LLM |
|
- The results can be reproduced with the provided benchmarking tools; see the [benchmarks README](./benchmarks/README.md) for usage details.
|
- The benchmark results may show fluctuations, especially in the backward pass and when the dimension size is small. |
|
|
|
### RMSNorm |
|
|
|
#### H100 Results |
|
|
|
<details> |
|
<summary>Forward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Backward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
#### MI250 Results |
|
|
|
<details> |
|
<summary>Forward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Backward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
--- |
|
|
|
### FusedAddRMSNorm |
|
|
|
> [!NOTE] |
|
> In the fused-operator benchmarks, the **non-fused baseline** is itself implemented with our **custom kernels**.
|
|
|
#### H100 Results |
|
|
|
<details> |
|
<summary>Forward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Backward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
#### MI250 Results |
|
|
|
<details> |
|
<summary>Forward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Backward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
--- |
|
|
|
### PolyNorm |
|
|
|
#### H100 Results |
|
|
|
<details> |
|
<summary>Forward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Backward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
#### MI250 Results |
|
|
|
<details> |
|
<summary>Forward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Backward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
--- |
|
|
|
### FusedMulPolyNorm |
|
|
|
> [!NOTE] |
|
> In the fused-operator benchmarks, the **non-fused baseline** is itself implemented with our **custom kernels**.
|
|
|
#### H100 Results |
|
|
|
<details> |
|
<summary>Forward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Backward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
#### MI250 Results |
|
|
|
<details> |
|
<summary>Forward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Backward Performance</summary> |
|
|
|
 |
|
|
|
</details> |
|
|
|
## Pre-commit Hooks |
|
|
|
This project uses [pre-commit](https://pre-commit.com/) to automatically check and format code before commits. |
|
|
|
### Setup |
|
|
|
1. Install pre-commit: |
|
|
|
```bash |
|
pip install pre-commit |
|
``` |
|
|
|
2. Install the git hooks: |
|
|
|
```bash |
|
pre-commit install |
|
``` |
|
|
|
Once installed, the configured hooks will run automatically on each commit. |
|
|
|
### Included Hooks |
|
|
|
The following tools are run via pre-commit: |
|
|
|
- **[yapf](https://github.com/google/yapf)** – Python code formatter |
|
- **[typos](https://github.com/crate-ci/typos)** – Spell checker for common typos |
|
- **[isort](https://github.com/PyCQA/isort)** – Organizes and sorts Python imports |
|
- **[clang-format](https://clang.llvm.org/docs/ClangFormat.html)** – Formats C++/CUDA code (`--style=file`) |
|
- **[pymarkdown](https://github.com/jackdewinter/pymarkdown)** – Lints and auto-fixes Markdown files |
|
- **[actionlint](https://github.com/rhysd/actionlint)** – Validates GitHub Actions workflows |
|
|
|
### Usage |
|
|
|
- Run all checks on the entire codebase: |
|
|
|
```bash |
|
pre-commit run --all-files |
|
``` |
|
|
|
- Run a specific hook (example: isort): |
|
|
|
```bash |
|
pre-commit run isort --all-files |
|
``` |
|
|