Transpiler¶
pyptx is a real PTX-to-Python transpiler. Feed it compiled PTX from
nvcc, Triton, Pallas, or any other source; get back Python that uses
the same @kernel / reg / smem / ptx namespaces you would write
by hand, and round-trips byte-identical.
That matters for three workflows:
- port an existing PTX kernel into pyptx
- study how a production kernel maps onto the DSL
- iterate on a kernel with a safe PTX baseline you can diff against
The maintained 815 TFLOPS Hopper GEMM in examples/hopper/gemm_highperf_hopper.py
is exactly this workflow, applied to fast.cu's kernel12:
compile → extract PTX → transpile with --sugar → hand-clean → ship.
The pipeline¶
┌───────┐ ┌─────────┐ ┌───────┐ ┌──────┐
PTX text ─▶│ parse │─▶│ codegen │─▶│ trace │─▶│ emit │─▶ PTX text
└───────┘ └─────────┘ └───────┘ └──────┘
| Stage | Input | Output | Source |
|---|---|---|---|
| parse | PTX text | ir.Module |
pyptx/parser/ (opcode-agnostic recursive descent) |
| sugar (optional) | ir.Module |
rewritten ir.Module |
pyptx/codegen/sugar.py — name demangling, loop raising, mbarrier spin-loop collapse, expression grouping |
| codegen | ir.Module |
Python (pyptx DSL) | pyptx/codegen/codegen.py |
| trace | Python | ir.Module |
pyptx/_trace.py (see How It Works) |
| emit | ir.Module |
PTX text | pyptx/emitter/ |
The round-trip invariant: parse → emit gives back the input bytes
exactly, verified on 218+ real kernels in tests/corpus/.
Quick start¶
python -m pyptx.codegen kernel.ptx --sugar --name my_kernel > my_kernel.py
python my_kernel.py # runs
Or from Python:
from pathlib import Path
from pyptx.codegen import ptx_to_python
source = Path("kernel.ptx").read_text()
python_source = ptx_to_python(source, sugar=True, kernel_name="my_kernel")
print(python_source)
A real before/after¶
Here's a fragment from fast.cu's kernel12, the WGMMA K-loop body. This
is what comes out of nvcc -arch=sm_90a -ptx fast.cu (one of four
sub-k WGMMA calls):
$L__BB13_2:
mov.u64 %rd283, 0;
shl.b32 %r421, %r420, 13;
add.s32 %r422, %r421, %r419;
shl.b32 %r423, %r422, 1;
mov.b64 %rd284, %_ZN3M124smemE;
add.s64 %rd285, %rd284, %r423;
add.s64 %rd286, %rd285, -8192;
and.b64 %rd287, %rd286, 262016;
shr.u64 %rd288, %rd287, 4;
or.b64 %rd289, %rd288, 4611686293305344000;
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
{%f0, %f1, %f2, ..., %f127},
%rd289, %rd300, %p21, 1, 1, 0, 0;
...
with ptx.loop("consumer_pair_loop"):
stage, phase = consumer_pipe.advance()
full.at(stage).wait(phase)
ptx.inst.wgmma.fence.sync.aligned()
a_base = smem_base + (((stage << 13) + lane_mask) << 1)
a_desc = ptx.wgmma.masked_descriptor(
a_base, byte_offset=-8192, mask=262016,
)
for sub_k in range(4):
b_desc = ptx.wgmma.masked_descriptor(
b_base,
byte_offset=b_offsets[sub_k],
mask=b_masks[sub_k],
)
ptx.inst.wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16(
acc_frag, a_desc, b_desc,
0 if (first_scale_zero and sub_k == 0) else 1,
1, 1, 0, 0,
)
ptx.inst.wgmma.commit_group.sync.aligned()
ptx.inst.wgmma.wait_group.sync.aligned(0)
The sugar pass did real work here:
_ZN3M124smemE→smem(demangled).- The spin-loop over
mbarrier.try_wait(not shown) collapsed intofull.at(stage).wait(phase). - The four near-identical WGMMA blocks collapsed into a
range(4)loop with ab_offsetstable. - The address arithmetic for the A descriptor stayed inline — because reordering it past the WGMMA fence would be a correctness bug, and the sugar pass refuses to move instructions across barriers.
The result is byte-identical PTX when re-emitted. The transpiled-and-sugared kernel is 665 lines of pyptx Python (down from ~1070 raw). Same PTX, same 815 TFLOPS.
What sugar does¶
The rewrite passes, in order:
| Pass | Effect |
|---|---|
| Name demangling | _ZN3M124smemE → smem; mangled params → M, N, K, tma_A |
| Label shortening | $L__BB13_2 → BB2 |
| Loop raising | label + ... + bra(label) → with ptx.loop(label, pred=p): |
| Mbarrier collapse | spin-loop block → ptx.mbarrier.wait(addr, phase) |
| WGMMA compact | 128 register operands → [f[i] for i in range(lo, hi)] |
| Parameterized loops | repeated instruction blocks → for _i in range(N) with base + _i*stride |
| Expression grouping | temp chains → with ptx.expr(): blocks with # rd[26] = ... comments |
None of the passes reorder instructions across barriers, WGMMA fences, or mbarrier arrives. PTX instruction order matters for memory consistency on Hopper, and the transpiler refuses to touch it.
Parse / emit without codegen¶
If you want to work one layer below the Python DSL:
from pathlib import Path
from pyptx.parser import parse
from pyptx.emitter import emit
source = Path("kernel.ptx").read_text()
module = parse(source)
assert emit(module) == source # round-trip invariant
This is the path for tooling, normalization passes, or structural tests that don't need to generate Python.
Round-trip testing¶
The transpiler is backed by real round-trip tests:
tests/test_codegen.py— codegen → trace → emittests/test_pipeline.py— single-file pipelinetests/test_pipeline_full.py— multi-file pipelinetests/test_external_corpus.py— 218+ corpus files
All of these assert byte-identical or structural equivalence across the round-trip. If you break one of them, the PR doesn't land.
The usual workflow¶
# 1. Get the PTX out of whatever produced it.
cuobjdump -ptx kernel.cubin > kernel.ptx
# or: nvcc -arch=sm_90a -ptx kernel.cu
# 2. Transpile to pyptx.
python -m pyptx.codegen kernel.ptx --sugar --name my_kernel > my_kernel.py
# 3. Run it. It should work out of the box.
python my_kernel.py
# 4. Hand-clean the pieces you want to own. Keep diffing emitted PTX.
diff <(python -c "from my_kernel import k; print(k.ptx())") kernel.ptx
Step 4 is where the value is. The transpile gives you a correct, editable, emitting-identical-PTX baseline; refactoring from there is much safer than starting from C++ templates or PTX text.