Performance

Hopper numbers are from an H100 SXM5 (80 GB HBM3, ~3 TB/s peak bandwidth, ~989 TFLOPS bf16 tensor core peak). Blackwell numbers are from a B200 (~2250 TFLOPS bf16 tensor core peak, dense). Each kernel is measured with CUDA events amortized over 20–200 iterations after a short warmup.
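
A minimal sketch of that measurement loop, assuming a torch-callable kernel fn (time_kernel is a hypothetical helper; the benchmark scripts below are the authoritative harness):

import torch

def time_kernel(fn, *args, warmup=5, iters=100):
    # Warm up to exclude first-launch and compilation costs.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call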

Reference timings use PyTorch 2.8 / CUDA 12.8 eager or fused kernels, whichever is the strongest publicly available comparison point.

Reproduce Hopper benchmarks with:

python benchmarks/bench_final.py
python benchmarks/bench_hopper_gemm.py --size 8192 --iters 16

Reproduce Blackwell benchmarks with:

python benchmarks/bench_blackwell_gemm.py
python benchmarks/bench_blackwell_kernels.py
python examples/blackwell/tcgen05_suite.py

Compute kernels

Blackwell GEMM (tcgen05, bf16)

Warp-specialized 4-stage pipeline: warp 0 lane 0 issues TMA 2D loads for A and B into a ring buffer; warp 1 lane 0 waits on load barriers and dispatches tcgen05.mma.cta_group::1.kind::f16 (m128n256k16) with a single-commit-per-CTA-tile handshake. Grid is N-major so consecutive CTAs walk N (unit-stride) for L2 reuse. TMEM accumulator is read back in two .32x32b.x128 loads per warp. See examples/blackwell/gemm_highperf_blackwell.py.
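
To illustrate the N-major grid order, here is a small sketch (cta_to_tile is a hypothetical helper, not the pyptx API; BM/BN are the tile sizes implied by the MMA shape). Consecutive CTA ids land on consecutive N tiles, so back-to-back CTAs share the same row of A tiles and can reuse it from L2.

def cta_to_tile(cta_id, N, BM=128, BN=256):
    # N-major order: the N-tile index is the fast, unit-stride one.
    n_tiles = (N + BN - 1) // BN
    m_tile = cta_id // n_tiles    # changes slowly
    n_tile = cta_id % n_tiles     # changes every CTA
    return m_tile, n_tile

# CTAs 0 .. n_tiles-1 all share m_tile == 0 and walk N left to right.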

Shape (M × N × K) 1SM TFLOPS 2SM TFLOPS cuBLAS TFLOPS best / cuBLAS
2048³ 645 649 1006 64%
4096³ 1194 1168 1532 78%
4096 × 8192 × 8192 1240 1268 1652 77%
8192 × 4096 × 8192 1315 1230 1648 80%
8192³ 1240 1046 1610 77%

The 2SM variant (build_gemm_2sm) uses tcgen05.mma.cta_group::2, a cluster-shared mbarrier hand-off (count=2, self + peer via arrive_remote), tcgen05.commit.multicast::cluster for per-slot consumed signaling, and a 5-stage pipeline. Each CTA holds BN/2 cols of B so the extra stages fit inside the dynamic SMEM budget. The 2SM path is useful on selected shapes, but the 1SM kernel remains the default maintained path because it is simpler and usually faster once K gets large. The current remaining gap to cuBLAS is largely in the cluster/store pipeline, not basic tensor-core bring-up.
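
A back-of-the-envelope version of the SMEM argument, with illustrative numbers (the per-stage K depth and the ~227 KiB dynamic SMEM cap are assumptions, not the kernel's actual parameters):

def stage_bytes(BM, BN, BK, elem=2):
    # One bf16 A tile plus one bf16 B tile per pipeline stage.
    return (BM * BK + BN * BK) * elem

full_b = 5 * stage_bytes(128, 256, 64)   # CTA holds all BN columns of B
half_b = 5 * stage_bytes(128, 128, 64)   # CTA holds BN/2 columns (2SM split)
print(full_b // 1024, half_b // 1024)    # 240 KiB vs 160 KiB: only the split
                                         # version fits under ~227 KiB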

Blackwell grouped GEMM (tcgen05, bf16 → f32)

Grouped GEMM reuses the same Blackwell tcgen05.mma mainloop with a G-problem outer loop for MoE-style shapes.
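
A minimal torch baseline of the kind the "vs torch" column compares against (grouped_gemm_ref is a sketch; the benchmark script may use a different torch path):

import torch

def grouped_gemm_ref(A, B):
    # A: (G, M, K) bf16, B: (G, K, N) bf16 -> (G, M, N) f32
    return torch.bmm(A.float(), B.float())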

G M N K pyptx TFLOPS vs torch
8 512 128 512 41.6
8 1024 128 1024 160.1
4 2048 256 2048 401.0 ~10.0×

The large-shape Torch and JAX paths are both correct on the maintained Blackwell grouped GEMM. See benchmarks/bench_blackwell_kernels.py for the current kernel suite.

Hopper GEMM (warp-specialized, bf16)

Hand-written m64n256k16 WGMMA kernel with a producer + 2 consumer warpgroups, 2-CTA clusters with TMA multicast, 3-stage SMEM pipeline, and a Hilbert tile schedule. Transpiled from fast.cu's kernel12 and sugared in examples/hopper/gemm_highperf_hopper.py.
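
For intuition only, the textbook Hilbert index-to-coordinate mapping is sketched below; the kernel's actual schedule lives in the example file, but the point is the same: consecutive CTA ids land on nearby (m, n) tiles, which is what buys the L2 reuse.

def hilbert_d2xy(order, d):
    # Map distance d along a Hilbert curve over a 2^order x 2^order tile grid
    # to (x, y); consecutive d values are always adjacent tiles.
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y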

Size pyptx time pyptx TFLOPS vs cuBLAS
2048³ 0.087 ms 198
4096³ 0.176 ms 783 ~parity
6144³ 0.580 ms 802 +10%
8192³ 1.349 ms 815 +10%

Beats cuBLAS at matrix sizes ≥ 6144.
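
The TFLOPS column is just 2·M·N·K divided by the measured time, e.g. for the 8192³ row:

M = N = K = 8192
t = 1.349e-3                       # seconds, from the table above
print(2 * M * N * K / t / 1e12)    # ≈ 815 TFLOPS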

Grouped GEMM (bf16 → f32, MoE shapes)

Multi-k WGMMA with tile_k=64 (four k=16 WGMMAs per K-loop iteration). Used for MoE where every expert has equal capacity.

G M N K pyptx time pyptx TFLOPS
8 512 64 512 17 µs 16
16 512 64 512 17 µs 32
8 1024 64 1024 17 µs 63
4 2048 64 2048 30 µs 72
8 2048 64 2048 41 µs 104

2.3× faster than the naive single-WGMMA-per-iteration implementation.


Bandwidth-bound kernels

All three norm/activation kernels hit the launch-overhead floor of ~14 µs for small shapes (torch has a ~7 µs floor via direct cuLaunchKernel). At large shapes they saturate HBM3.

RMS norm (f32)

B N pyptx time torch time pyptx GB/s torch GB/s speedup
32 1024 14 µs 27 µs 19 10 1.9×
256 4096 14 µs 28 µs 602 298 2.0×
1024 8192 31 µs 104 µs 2176 644 3.4×
2048 8192 51 µs 199 µs 2647 675 3.9×
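
The GB/s columns are consistent with one read plus one write of the (B, N) f32 tensor (the weight vector is negligible); checking the last row:

B, N, t = 2048, 8192, 51e-6        # time in seconds, from the table above
print(2 * B * N * 4 / t / 1e9)     # ≈ 2630 GB/s, matching 2647 within rounding of the 51 µs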

Layer norm (f32)

Compared against torch.nn.functional.layer_norm (torch's fused kernel, not an eager composition of primitives).

B N pyptx time torch time pyptx GB/s torch GB/s speedup
32 1024 14 µs 7 µs 19 37 0.5×
256 4096 15 µs 8 µs 574 1017 0.6×
1024 8192 30 µs 42 µs 2214 1602 1.4×
2048 8192 54 µs 82 µs 2480 1643 1.5×

Small shapes lose to F.layer_norm because of the pyptx Torch dispatch overhead (~14 µs vs torch's ~7 µs). Large shapes win because the kernel's memory access pattern runs closer to peak bandwidth.

Fused SwiGLU (f32)

M F pyptx time torch time pyptx GB/s torch GB/s speedup
32 1024 14 µs 8 µs 29 48 0.6×
256 4096 14 µs 8 µs 887 1543 0.6×
1024 8192 38 µs 62 µs 2665 1633 1.6×
2048 8192 72 µs 117 µs 2795 1724 1.6×

2795 GB/s is 94% of H100 HBM3 peak. The kernel is bandwidth-bound.
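
That figure matches the minimal traffic for the fusion, assuming the kernel reads the gate and up activations once each and writes one output (3 · M · F · 4 bytes):

M, F, t = 2048, 8192, 72e-6        # time in seconds, from the table above
print(3 * M * F * 4 / t / 1e9)     # ≈ 2796 GB/s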


Flash attention (bf16)

Q-tile parallel, BN=64 multi-k WGMMA for both Q@K^T and P@V, head_dim up to 64. Compared against naive PyTorch (f32 matmul + softmax) and torch.nn.functional.scaled_dot_product_attention (fused FA kernel).
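
A minimal sketch of the naive reference (single head, 2-D tensors assumed; naive_attention is illustrative and the benchmark script is authoritative):

import math
import torch

def naive_attention(q, k, v):
    # q: (M, HD), k: (N, HD), v: (N, HD); bf16 inputs, f32 matmul + softmax
    s = (q.float() @ k.float().T) / math.sqrt(q.shape[-1])
    return torch.softmax(s, dim=-1) @ v.float()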

M N HD pyptx time naive time SDPA time vs naive vs SDPA
512 512 64 19 µs 43 µs 15 µs 2.2× 0.73×
2048 2048 64 52 µs 88 µs 15 µs 1.7× 0.29×
4096 4096 64 88 µs 262 µs 25 µs 3.0× 0.29×
8192 4096 64 88 µs 470 µs 40 µs 5.3× 0.46×

Beats the naive reference by 2-5×. Torch SDPA is faster — it's a production FA3 kernel with warp specialization and 3-stage pipelining. Closing the gap is ongoing work.


Launch overhead

Per-launch dispatch cost for a trivial kernel at shape=(4, 64):

Path Overhead
PyTorch eager, pyptx via C++ extension ~14 µs
PyTorch eager, pyptx via ctypes (no C++ ext) ~34 µs
PyTorch native (torch.add reference) ~7 µs
CUDA graph replay ~4 µs

Install ninja so the C++ torch extension builds — it drops launch overhead from 34 µs to 14 µs by avoiding per-call Python→ctypes marshalling:

pip install ninja

Inside a CUDA graph, the steady-state cost drops to 4 µs on repeated-shape calls ("turbo graph replay" mode).
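
The 4 µs number comes from replaying a captured graph. A generic torch sketch of that pattern follows (torch.add stands in for the pyptx-backed op, and the warmup-then-capture dance is the standard torch.cuda.graph recipe, not necessarily how pyptx wires it up internally):

import torch

x = torch.randn(4, 64, device="cuda")
out = torch.empty_like(x)

# Warm up on a side stream, then capture the launch into a graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    torch.add(x, 1.0, out=out)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    torch.add(x, 1.0, out=out)

# Steady state: copy fresh inputs into the captured buffer and replay.
x.copy_(torch.randn(4, 64, device="cuda"))
g.replay()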


Environment

  • H100 SXM5 80 GB, CUDA 12.8, NVIDIA driver 580.126.09
  • PyTorch 2.8, JAX 0.10, cuda-python 13.2
  • Compiled with ptxas at release optimization level
  • --pyptx-shim C++ extension built via ninja
  • Torch custom_op fast path (torch.library.custom_op)