Getting Started

The fastest way to understand pyptx is:

read a small kernel
inspect the emitted PTX
run it through one runtime path

Mental Model

A @kernel function executes in Python at trace time. During that trace:

reg.* allocates PTX registers
smem.* describes shared-memory allocations and addressing
ptx.* emits PTX instructions and structured control flow

The output of that trace is PTX, not Python bytecode and not an intermediate tensor IR.
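
To make the trace-time idea concrete, here is a toy tracer in plain Python. It is not how pyptx is implemented, and Tracer, alloc, emit, and trace are hypothetical names invented for this sketch. It only shows the shape of the idea: the function runs once as ordinary Python, and its side effects are appended instruction text.

```python
# Toy illustration of trace-time emission. NOT pyptx's implementation:
# Tracer, alloc, emit, and trace are hypothetical names for this sketch.

class Tracer:
    def __init__(self):
        self.lines = []     # emitted instruction text, in program order
        self.next_reg = 0   # counter used to name fresh virtual registers

    def alloc(self):
        """Allocate a fresh virtual register and return its name."""
        name = f"%r{self.next_reg}"
        self.next_reg += 1
        return name

    def emit(self, opcode, *operands):
        """Append one instruction to the output."""
        text = opcode if not operands else f"{opcode} {', '.join(operands)}"
        self.lines.append(text + ";")


def trace(fn):
    """Run fn once as ordinary Python and return everything it emitted."""
    tracer = Tracer()
    fn(tracer)
    return "\n".join(tracer.lines)


def tiny(t):
    tid = t.alloc()
    t.emit("mov.u32", tid, "%tid.x")
    x = t.alloc()
    t.emit("add.u32", x, tid, "1")
    t.emit("ret")


print(trace(tiny))
```

Nothing in tiny computes a value. Calling it only records instructions, which is why the result of a trace is program text rather than data.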

Minimal Workflow

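from pyptx import kernel, reg, ptx
from pyptx.types import u32

@kernel(arch="sm_90a")
def tiny():
    # tid starts as the thread index within the block (%tid.x), held as u32.
    tid = reg.from_(ptx.special.tid.x(), u32)
    x = tid + 1
    # Move x back into tid, then return from the kernel.
    ptx.inst.mov.u32(tid, x)
    ptx.ret()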

Then inspect the PTX:

print(tiny.ptx())

That PTX is the real artifact. The Python function is just the authoring surface.
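
When the PTX gets longer than a screenful, a quick opcode count is often easier to read than the raw text. The helper below is a small stdlib-only sketch. opcode_histogram is a hypothetical name, and SAMPLE_PTX is hand-written for illustration, not actual pyptx output.

```python
from collections import Counter

# Hand-written PTX for illustration only; it is not actual pyptx output.
SAMPLE_PTX = """\
.version 8.0
.target sm_90a
.address_size 64

.visible .entry tiny()
{
    .reg .b32 %r<3>;
    mov.u32 %r1, %tid.x;
    add.u32 %r2, %r1, 1;
    ret;
}
"""


def opcode_histogram(ptx_text):
    """Count instruction opcodes, skipping directives, labels, braces, and comments."""
    counts = Counter()
    for raw in ptx_text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not line or line[0] in ".{}" or line.endswith(":"):
            continue
        tokens = line.rstrip(";").split()
        # Drop an optional guard predicate such as "@%p1" or "@!%p1".
        if tokens[0].startswith("@"):
            tokens = tokens[1:]
        if tokens:
            counts[tokens[0]] += 1
    return counts


for opcode, n in sorted(opcode_histogram(SAMPLE_PTX).items()):
    print(f"{n:4d}  {opcode}")
```

Assuming tiny.ptx() returns the PTX as a string, as the print call above suggests, the same helper can be applied with opcode_histogram(tiny.ptx()).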

A Slightly More Real Example

The pattern looks more interesting once the kernel is called from a framework runtime:

from examples.hopper.rms_norm import build_rms_norm
import torch

k = build_rms_norm(4, 64)

x = torch.randn(4, 64, device="cuda") * 0.3
w = torch.randn(64, device="cuda") * 0.1 + 1.0
out = k(x, w)

The same kernel object can also be used with torch.compile:

@torch.compile
def compiled(x, w):
    return k(x, w)

or with jax.jit:

import jax
import jax.numpy as jnp

@jax.jit
def compiled(x, w):
    return k(x, w)

Mental Checklist

When reading any pyptx kernel, ask:

what are the tensor boundary specs?
what registers are loop-carried state?
what is static Python structure, and what is emitted PTX control flow?
what runtime path is launching the kernel?
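
The third question is the one that most often trips people up, so here is a toy sketch of the difference. It uses a plain list as the emitter rather than pyptx, and unrolled, emitted_loop, and trace are hypothetical names. The PTX strings are illustrative.

```python
# Toy emitter, not pyptx: `emit` is just list.append, and the instruction
# strings are illustrative.

def unrolled(emit, n):
    # A Python `for` runs at trace time, so the body is emitted n times
    # and no loop exists in the output.
    for i in range(n):
        emit(f"add.u32 %acc, %acc, {i};")


def emitted_loop(emit, n):
    # An emitted loop appears once in the output as a label, a compare,
    # and a branch; the iteration happens on the GPU at run time.
    emit("mov.u32 %i, 0;")
    emit("LOOP:")
    emit("add.u32 %acc, %acc, %i;")
    emit("add.u32 %i, %i, 1;")
    emit(f"setp.lt.u32 %p, %i, {n};")
    emit("@%p bra LOOP;")


def trace(fn, n):
    """Run fn once in Python and return the list of emitted lines."""
    lines = []
    fn(lines.append, n)
    return lines


for n in (4, 1000):
    print(n, len(trace(unrolled, n)), len(trace(emitted_loop, n)))
```

The unrolled version grows with n while the emitted loop stays the same size. When reading a kernel, a Python-level loop therefore tells you about code size, and an emitted branch tells you about run-time behavior.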

Hopper vs Blackwell

pyptx targets two architectures:

Hopper (sm_90a) — WGMMA, TMA 2D/3D with multicast, mbarriers, cluster launch. Kernels live in examples/hopper/.
Blackwell (sm_100a) — tcgen05.mma / .ld, TMEM, SMEM and instruction descriptors, 2-SM cooperative MMA via cta_group::2, TMA multicast. Kernels live in examples/blackwell/.

Pick the target with arch="sm_90a" or arch="sm_100a" in the @kernel decorator. For a B200, start with examples/blackwell/tcgen05_suite.py, which exercises every Blackwell primitive (alloc / MMA / ld / commit / fence) in isolation. Then move on to examples/blackwell/gemm_highperf_blackwell.py for the 1+ PFLOP/s 1-SM GEMM.
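
If a script needs to choose the arch string for whatever GPU it lands on, a small lookup keyed on CUDA compute capability is enough. This is a sketch rather than part of pyptx: arch_for_capability is a hypothetical helper, and it assumes H100-class parts report capability 9.0 and B200-class parts report 10.0.

```python
def arch_for_capability(major, minor):
    """Map a CUDA compute capability to one of the two arch strings above."""
    table = {
        (9, 0): "sm_90a",    # Hopper, e.g. H100
        (10, 0): "sm_100a",  # Blackwell, e.g. B200
    }
    try:
        return table[(major, minor)]
    except KeyError:
        raise ValueError(
            f"no target listed for compute capability {major}.{minor}"
        ) from None


print(arch_for_capability(9, 0))
print(arch_for_capability(10, 0))
```

With PyTorch installed, the capability pair can come from torch.cuda.get_device_capability(), for example arch_for_capability(*torch.cuda.get_device_capability()).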

What To Read Next

First Kernel for the basic authoring pattern
JAX Runtime if you want to call kernels from jax.jit
Torch Runtime if you want PyTorch or torch.compile
Examples for real kernels