
The CUDA-to-AMD Table I Kept Open While Writing AMD Kernels

A cleaned-up reference I made for myself so I could stop context-switching between CUDA naming and ROCm/HIP naming every ten minutes.

While writing AMD kernels, I kept tripping over the same translations. Not because HIP is unreadable, but because years of CUDA vocabulary are sticky. Shared memory becomes LDS. Warps become wavefronts. Warp masks become 64-bit. Some things map cleanly, and some absolutely do not.

So I made myself a table. I did not need every row every day, but having the whole thing nearby was still useful. It let me keep writing instead of bouncing between docs, API comparison pages, compiler notes, and half-remembered forum answers.

This is the version I would actually put on a second monitor: the rows I reached for most, the ones that changed how I wrote code, and the ones where assuming a perfect one-to-one mapping would have been a mistake.

The missing layer on my first pass was the lower-level vocabulary: gfx90a, gfx942, v_mfma_*, VGPR, SGPR, EXEC, and s_waitcnt. Those are the words that start showing up as soon as the abstraction leaks.

What changed code fastest

warpSize cannot safely be assumed to be 32, and HIP warp masks are 64-bit. That alone is enough to quietly break ports.

What was not worth forcing

Some NVIDIA features simply do not have a clean HIP equivalent. The faster move is admitting that early and redesigning around it.

A quick disclaimer before the table

This is a portability reference, not a claim that every row is a perfect translation. Some entries are exact source-level matches, some are only conceptual analogies, and some are really just "this is the AMD thing you should go read next."

  • If you are porting warp code, check mask widths first.
  • If you are porting profiler or debugger workflows, expect tool bundles rather than one-for-one substitutions.
  • If you are porting Hopper-specific tricks, assume you need a different design until proven otherwise.

Toolchain and Libraries

The first layer was just naming. I wanted one place where the CUDA tool I knew lined up with the ROCm or HIP tool I needed.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| CUDA | ROCm (HIP) | ROCm is the platform. HIP is the CUDA-like programming layer. | Used a lot |
| nvcc | hipcc | Main compiler driver for HIP code. | Used a lot |
| cuBLAS | rocBLAS | BLAS mapping. Useful whenever the kernel was really a library call. | Used sometimes |
| cuDNN | MIOpen | Deep learning primitive library. | Good to know |
| NCCL | RCCL | Collective communication for multi-GPU work. | Good to know |
| Nsight Compute | rocprof / omniperf | Closest profiling workflow for kernel-level performance work. | Used a lot |
| cuda-gdb | rocgdb | GPU debugger. | Used sometimes |
| compute-sanitizer | No single direct equivalent | Think toolkit, not drop-in replacement. Use ROCm GPU AddressSanitizer, ROCR Debug Agent, ROCgdb, and rocprof depending on the bug. | Good to know |

Execution Model

This was the part I kept translating in my head while reading old CUDA code and writing the AMD version.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| Streaming Multiprocessor (SM) | Compute Unit (CU) | Closest execution-unit analogy. | Used a lot |
| Warp | Wavefront | Same broad idea, different width assumptions. | Used a lot |
| Warp size = 32 | Wavefront size = 64 on CDNA / 32 on RDNA | This was the biggest practical gotcha. Do not hardcode 32 in portable HIP code. | Used a lot |
| Shared Memory | LDS | Local Data Share is the AMD term that shows up constantly in docs and ISA notes. | Used a lot |
| Tensor Core | Matrix Core / MFMA | Important when you start thinking about matrix instructions instead of scalar ALU work. | Used sometimes |
| Register File | VGPR + SGPR split | AMD makes the vector/scalar split much more explicit. | Used sometimes |
| Warp divergence | Wavefront divergence | Same performance issue, wider blast radius on wave64 hardware. | Used a lot |
| NVLink | Infinity Fabric / xGMI | Useful context, but not something I needed day to day for kernel authoring. | Good to know |

Architecture Targets and Binary Formats

The first pass of this post underplayed the terms that show up when you compile for real hardware, inspect binaries, or look at docs for a specific accelerator generation. These mattered more than I expected.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| -arch=sm_XX | --offload-arch=gfxYYYZ | This is where names like gfx90a, gfx942, and gfx1100 appear. If you are targeting a specific AMD GPU, you end up learning these quickly. | Used a lot |
| sm_80 (Ampere) | gfx908 / gfx90a | Rough generation-level mapping for MI100 and MI200-class hardware. | Used sometimes |
| sm_90 / sm_90a (Hopper) | gfx940 / gfx941 / gfx942 | The AMD target strings most people run into when they start touching MI300 systems. | Used sometimes |
| PTX | AMDGPU IR / LLVM IR | The intermediate representation story is different, but this is the closest mental analogue. | Good to know |
| SASS | GCN / CDNA / RDNA ISA | This is the layer you are effectively reading when you inspect AMD disassembly. | Used sometimes |
| cubin | Code object (.hsaco) | The binary artifact you actually ship and load. | Used sometimes |
| fatbin | Bundled code objects | Same broad idea: one package containing multiple architecture targets. | Good to know |
| nvdisasm / cuobjdump | llvm-objdump / roc-obj-* | Handy once you want to see what the compiler really emitted. | Used sometimes |

Programming Model

These are the translations that actually touched source code over and over: launches, memory movement, sync, streams, and device properties.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| <<<blocks, threads, shmem, stream>>> | Same syntax or hipLaunchKernelGGL(...) | HIP supports triple-chevron launches, but HIPIFY often emits hipLaunchKernelGGL. | Used a lot |
| cudaMalloc | hipMalloc | Device allocation. | Used a lot |
| cudaFree | hipFree | Device free. | Used a lot |
| cudaMemcpy | hipMemcpy | Basic transfer API. | Used a lot |
| cudaMemcpyAsync | hipMemcpyAsync | Async transfer API. | Used a lot |
| cudaDeviceSynchronize | hipDeviceSynchronize | The global stop-and-wait button. | Used a lot |
| cudaStreamCreate | hipStreamCreate | Stream creation is almost identical. | Used sometimes |
| cudaEventCreate | hipEventCreate | Useful for timing and dependency chains. | Used sometimes |
| cudaGetDeviceProperties | hipGetDeviceProperties | Good for checking limits and confirming architecture properties. | Used sometimes |
| warpSize | warpSize | Same identifier, but the value is not guaranteed to be 32. Query it and write code accordingly. | Used a lot |

Matrix, Dot, and Packed Math Terms

Once you leave basic HIP syntax and start thinking about throughput, the vocabulary shifts toward matrix instructions, packed math, and architecture-specific tensor paths.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| Tensor Core | MFMA / Matrix Core | MFMA is the AMD term I kept seeing over and over in docs, kernels, and ISA notes. | Used a lot |
| mma.sync.* | v_mfma_* | This is the big one. NVIDIA tensor instructions map conceptually to AMD MFMA families, not a cute 1:1 rename. | Used a lot |
| wmma.* | rocWMMA | If you want a higher-level matrix API on AMD, this is usually the thing to read next. | Used sometimes |
| dp4a | v_dot4_i32_i8 | Useful if you end up thinking about int8 dot products instead of just GEMM abstractions. | Good to know |
| FP8 tensor path | CDNA3 FP8 MFMA | Good reminder that MI300-class parts have native FP8 support, but the programming surface still differs. | Good to know |
| add.f16x2 / fma.f16x2 | v_pk_add_f16 / v_pk_fma_f16 | Packed half-precision math shows up once you start reading lower-level AMD ISA tables. | Good to know |
| cp.async | No direct HIP builtin | Still worth keeping nearby because it changes how you reason about ports from modern NVIDIA kernels. | Used sometimes |
| wgmma | No direct source-level equivalent | Closest hardware cousin is MFMA, but not in a way that lets you port the code mechanically. | Good to know |

Warp and Synchronization Intrinsics

This was where a cheat sheet saved the most context switching. The names often look similar, but the mask types and assumptions are not identical.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| __syncthreads() | __syncthreads() | Block-level barrier. Easy one. | Used a lot |
| __syncwarp(mask) | __syncwarp(mask) | Modern HIP supports it. Do not replace it with __syncthreads(); the scope is different. | Used sometimes |
| __shfl_sync(mask, val, lane) | __shfl_sync(mask, val, lane) | Supported in modern HIP. The mask argument is 64-bit. | Used a lot |
| __ballot_sync(mask, pred) | __ballot_sync(mask, pred) | Returns a 64-bit mask in HIP. That matters immediately when porting warp code. | Used a lot |
| __activemask() | __activemask() | Also returns a 64-bit mask in HIP. | Used sometimes |
| __match_any_sync(mask, val) | __match_any_sync(mask, val) | Available in modern HIP. | Good to know |
| __reduce_add_sync(mask, val) | __reduce_add_sync(mask, val) | Available in modern HIP. | Used sometimes |
| atomicAdd | atomicAdd | Mostly familiar, but lower-precision atomics are architecture-dependent. | Used sometimes |

ISA Words That Show Up Once the Abstraction Leaks

These are not the first terms you learn, but they become unavoidable once you read compiler output, inspect disassembly, or chase a performance issue far enough down the stack.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| Registers | VGPR + SGPR | The vector/scalar split is one of the most AMD-specific concepts and it colors how you read almost everything else. | Used a lot |
| Predicate register | VCC / EXEC | AMD divergence control is much more explicit. EXEC masks active lanes and VCC carries compare results. | Used a lot |
| ld.global | global_load_dword / buffer_load_dword | The first global-memory instruction names you start recognizing in AMD disassembly. | Used sometimes |
| ld.shared / st.shared | ds_read_* / ds_write_* | Shared-memory traffic becomes ds_* when you get closer to the ISA. | Used a lot |
| bar.sync | s_barrier | A good anchor point when reading lower-level synchronization behavior. | Used sometimes |
| Implicit scoreboarding | s_waitcnt | This is one of the most important AMD ISA words to recognize. It is everywhere once you inspect emitted code. | Used a lot |
| atomic on shared memory | ds_add_u32 / ds_cmpst_b32 | Another pattern that becomes readable once ds_* stops looking alien. | Good to know |
| .cg / .cs cache hints | glc / slc | Good to know when cache-control language leaks into docs or disassembly. | Good to know |

No Clean One-to-One

A lot of wasted time comes from assuming there must be a direct analogue. For these, the right answer was usually to stop forcing the mapping.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| libcuda | No 1:1 public drop-in | Do not treat libhsa-runtime64 as libcuda. | Good to know |
| cuTENSOR | hipTensor (work in progress) | Useful directionally, not feature-for-feature parity. | Good to know |
| wmma.* | Use rocWMMA | No core HIP builtin that behaves like the CUDA WMMA layer. | Good to know |
| cp.async | No direct HIP builtin | Use ordinary loads plus explicit synchronization or a library pipeline. | Used sometimes |
| TMA | No direct equivalent | Do not expect a Hopper-style bulk tensor copy primitive. | Good to know |
| CUDA Dynamic Parallelism | Not supported in HIP | Device-launched kernels do not port cleanly. | Good to know |
| -maxrregcount | No direct hipcc equivalent | Usually better to reason with launch bounds and backend-level tuning instead. | Good to know |
| compute-sanitizer racecheck | No direct equivalent | Another place where the right mental model is partial coverage, not parity. | Good to know |

What I actually needed most

In practice, most of my time was spent in the middle of this page rather than at the edges. I cared about launch syntax, memory APIs, wavefront width, LDS, shuffles, ballots, profiling, architecture targets like gfx90a, and matrix terms like MFMA. The deeper ISA correspondences were still useful to have around, but mostly as a backstop when something felt off or when I wanted to reason a bit closer to the hardware.

That is the real reason I wanted a table in the first place: not because I planned to memorize all of it, but because I wanted one stable map while my mental model was still switching from CUDA-shaped habits to AMD-shaped ones.

March 2026