
The CUDA-to-AMD Table I Kept Open While Writing AMD Kernels

A cleaned-up reference I made for myself so I could stop context-switching between CUDA naming and ROCm/HIP naming every ten minutes.

While writing AMD kernels, I kept tripping over the same translations. Not because HIP is unreadable, but because years of CUDA vocabulary are sticky. Shared memory becomes LDS. Warps become wavefronts. Warp masks become 64-bit. Some things map cleanly, and some absolutely do not.

So I made myself a table. I did not need every row every day, but having the whole thing nearby was still useful. It let me keep writing instead of bouncing between docs, API comparison pages, compiler notes, and half-remembered forum answers.

This is the version I would actually put on a second monitor: the rows I reached for most, the ones that changed how I wrote code, and the ones where assuming a perfect one-to-one mapping would have been a mistake.

The missing layer on my first pass was the lower-level vocabulary: gfx90a, gfx942, v_mfma_*, VGPR, SGPR, EXEC, and s_waitcnt. Those are the words that start showing up as soon as the abstraction leaks.

What changed code fastest

warpSize cannot safely be assumed to be 32, and HIP warp masks are 64-bit. That alone is enough to quietly break ports.

What was not worth forcing

Some NVIDIA features simply do not have a clean HIP equivalent. The faster move is admitting that early and redesigning around it.

A quick disclaimer before the table

This is a portability reference, not a claim that every row is a perfect translation. Some entries are exact source-level matches, some are only conceptual analogies, and some are really just "this is the AMD thing you should go read next."

  • If you are porting warp code, check mask widths first.
  • If you are porting profiler or debugger workflows, expect tool bundles rather than one-for-one substitutions.
  • If you are porting Hopper-specific tricks, assume you need a different design until proven otherwise.

Toolchain and Libraries

The first layer was just naming. I wanted one place where the CUDA tool I knew lined up with the ROCm or HIP tool I needed.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| CUDA | ROCm (HIP) | ROCm is the platform. HIP is the CUDA-like programming layer. | Used a lot |
| nvcc | hipcc | Main compiler driver for HIP code. | Used a lot |
| cuBLAS | rocBLAS | BLAS mapping. Useful whenever the kernel was really a library call. | Used sometimes |
| cuDNN | MIOpen | Deep learning primitive library. | Good to know |
| NCCL | RCCL | Collective communication for multi-GPU work. | Good to know |
| Nsight Compute | rocprof / omniperf | Closest profiling workflow for kernel-level performance work. | Used a lot |
| cuda-gdb | rocgdb | GPU debugger. | Used sometimes |
| compute-sanitizer | No single direct equivalent | Think toolkit, not drop-in replacement. Use ROCm GPU AddressSanitizer, ROCR Debug Agent, ROCgdb, and rocprof depending on the bug. | Good to know |

Execution Model

This was the part I kept translating in my head while reading old CUDA code and writing the AMD version.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| Streaming Multiprocessor (SM) | Compute Unit (CU) | Closest execution-unit analogy. | Used a lot |
| Warp | Wavefront | Same broad idea, different width assumptions. | Used a lot |
| Warp size = 32 | Wavefront size = 64 on CDNA / 32 on RDNA | This was the biggest practical gotcha. Do not hardcode 32 in portable HIP code. | Used a lot |
| Shared Memory | LDS | Local Data Share is the AMD term that shows up constantly in docs and ISA notes. | Used a lot |
| Tensor Core | Matrix Core / MFMA | Important when you start thinking about matrix instructions instead of scalar ALU work. | Used sometimes |
| Register File | VGPR + SGPR split | AMD makes the vector/scalar split much more explicit. | Used sometimes |
| Warp divergence | Wavefront divergence | Same performance issue, wider blast radius on wave64 hardware. | Used a lot |
| NVLink | Infinity Fabric / xGMI | Useful context, but not something I needed day to day for kernel authoring. | Good to know |

Architecture Targets and Binary Formats

The first pass of this post underplayed the terms that show up when you compile for real hardware, inspect binaries, or look at docs for a specific accelerator generation. These mattered more than I expected.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| -arch=sm_XX | --offload-arch=gfxYYYZ | This is where names like gfx90a, gfx942, and gfx1100 appear. If you are targeting a specific AMD GPU, you end up learning these quickly. | Used a lot |
| sm_80 (Ampere) | gfx908 / gfx90a | Rough generation-level mapping for MI100 and MI200-class hardware. | Used sometimes |
| sm_90 / sm_90a (Hopper) | gfx940 / gfx941 / gfx942 | The AMD target strings most people run into when they start touching MI300 systems. | Used sometimes |
| PTX | AMDGPU IR / LLVM IR | The intermediate representation story is different, but this is the closest mental analogue. | Good to know |
| SASS | GCN / CDNA / RDNA ISA | This is the layer you are effectively reading when you inspect AMD disassembly. | Used sometimes |
| cubin | Code object (.hsaco) | The binary artifact you actually ship and load. | Used sometimes |
| fatbin | Bundled code objects | Same broad idea: one package containing multiple architecture targets. | Good to know |
| nvdisasm / cuobjdump | llvm-objdump / roc-obj-* | Handy once you want to see what the compiler really emitted. | Used sometimes |

Programming Model

These are the translations that actually touched source code over and over: launches, memory movement, sync, streams, and device properties.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| <<<blocks, threads, shmem, stream>>> | Same syntax or hipLaunchKernelGGL(...) | HIP supports triple-chevron launches, but HIPIFY often emits hipLaunchKernelGGL. | Used a lot |
| cudaMalloc | hipMalloc | Device allocation. | Used a lot |
| cudaFree | hipFree | Device free. | Used a lot |
| cudaMemcpy | hipMemcpy | Basic transfer API. | Used a lot |
| cudaMemcpyAsync | hipMemcpyAsync | Async transfer API. | Used a lot |
| cudaDeviceSynchronize | hipDeviceSynchronize | The global stop-and-wait button. | Used a lot |
| cudaStreamCreate | hipStreamCreate | Stream creation is almost identical. | Used sometimes |
| cudaEventCreate | hipEventCreate | Useful for timing and dependency chains. | Used sometimes |
| cudaGetDeviceProperties | hipGetDeviceProperties | Good for checking limits and confirming architecture properties. | Used sometimes |
| warpSize | warpSize | Same identifier, but the value is not guaranteed to be 32. Query it and write code accordingly. | Used a lot |

Matrix, Dot, and Packed Math Terms

Once you leave basic HIP syntax and start thinking about throughput, the vocabulary shifts toward matrix instructions, packed math, and architecture-specific tensor paths.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| Tensor Core | MFMA / Matrix Core | MFMA is the AMD term I kept seeing over and over in docs, kernels, and ISA notes. | Used a lot |
| mma.sync.* | v_mfma_* | This is the big one. NVIDIA tensor instructions map conceptually to AMD MFMA families, not a cute 1:1 rename. | Used a lot |
| wmma.* | rocWMMA | If you want a higher-level matrix API on AMD, this is usually the thing to read next. | Used sometimes |
| dp4a | v_dot4_i32_i8 | Useful if you end up thinking about int8 dot products instead of just GEMM abstractions. | Good to know |
| FP8 tensor path | CDNA3 FP8 MFMA | Good reminder that MI300-class parts have native FP8 support, but the programming surface still differs. | Good to know |
| add.f16x2 / fma.f16x2 | v_pk_add_f16 / v_pk_fma_f16 | Packed half-precision math shows up once you start reading lower-level AMD ISA tables. | Good to know |
| cp.async | No direct HIP builtin | Still worth keeping nearby because it changes how you reason about ports from modern NVIDIA kernels. | Used sometimes |
| wgmma | No direct source-level equivalent | Closest hardware cousin is MFMA, but not in a way that lets you port the code mechanically. | Good to know |

Warp and Synchronization Intrinsics

This was where a cheat sheet saved the most context switching. The names often look similar, but the mask types and assumptions are not identical.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| __syncthreads() | __syncthreads() | Block-level barrier. Easy one. | Used a lot |
| __syncwarp(mask) | __syncwarp(mask) | Modern HIP supports it. Do not replace it with __syncthreads(); the scope is different. | Used sometimes |
| __shfl_sync(mask, val, lane) | __shfl_sync(mask, val, lane) | Supported in modern HIP. The mask argument is 64-bit. | Used a lot |
| __ballot_sync(mask, pred) | __ballot_sync(mask, pred) | Returns a 64-bit mask in HIP. That matters immediately when porting warp code. | Used a lot |
| __activemask() | __activemask() | Also returns a 64-bit mask in HIP. | Used sometimes |
| __match_any_sync(mask, val) | __match_any_sync(mask, val) | Available in modern HIP. | Good to know |
| __reduce_add_sync(mask, val) | __reduce_add_sync(mask, val) | Available in modern HIP. | Used sometimes |
| atomicAdd | atomicAdd | Mostly familiar, but lower-precision atomics are architecture-dependent. | Used sometimes |

ISA Words That Show Up Once the Abstraction Leaks

These are not the first terms you learn, but they become unavoidable once you read compiler output, inspect disassembly, or chase a performance issue far enough down the stack.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| Registers | VGPR + SGPR | The vector/scalar split is one of the most AMD-specific concepts and it colors how you read almost everything else. | Used a lot |
| Predicate register | VCC / EXEC | AMD divergence control is much more explicit. EXEC masks active lanes and VCC carries compare results. | Used a lot |
| ld.global | global_load_dword / buffer_load_dword | The first global-memory instruction names you start recognizing in AMD disassembly. | Used sometimes |
| ld.shared / st.shared | ds_read_* / ds_write_* | Shared-memory traffic becomes ds_* when you get closer to the ISA. | Used a lot |
| bar.sync | s_barrier | A good anchor point when reading lower-level synchronization behavior. | Used sometimes |
| Implicit scoreboarding | s_waitcnt | This is one of the most important AMD ISA words to recognize. It is everywhere once you inspect emitted code. | Used a lot |
| atomic on shared memory | ds_add_u32 / ds_cmpst_b32 | Another pattern that becomes readable once ds_* stops looking alien. | Good to know |
| .cg / .cs cache hints | glc / slc | Good to know when cache-control language leaks into docs or disassembly. | Good to know |

No Clean One-to-One

A lot of wasted time comes from assuming there must be a direct analogue. For these, the right answer was usually to stop forcing the mapping.

| CUDA Term | AMD / ROCm Term | Why It Mattered | How Often I Reached For It |
|---|---|---|---|
| libcuda | No 1:1 public drop-in | Do not treat libhsa-runtime64 as libcuda. | Good to know |
| cuTENSOR | hipTensor (work in progress) | Useful directionally, not feature-for-feature parity. | Good to know |
| wmma.* | Use rocWMMA | No core HIP builtin that behaves like the CUDA WMMA layer. | Good to know |
| cp.async | No direct HIP builtin | Use ordinary loads plus explicit synchronization or a library pipeline. | Used sometimes |
| TMA | No direct equivalent | Do not expect a Hopper-style bulk tensor copy primitive. | Good to know |
| CUDA Dynamic Parallelism | Not supported in HIP | Device-launched kernels do not port cleanly. | Good to know |
| -maxrregcount | No direct hipcc equivalent | Usually better to reason with launch bounds and backend-level tuning instead. | Good to know |
| compute-sanitizer racecheck | No direct equivalent | Another place where the right mental model is partial coverage, not parity. | Good to know |

What I actually needed most

In practice, most of my time was spent in the middle of this page rather than at the edges. I cared about launch syntax, memory APIs, wavefront width, LDS, shuffles, ballots, profiling, architecture targets like gfx90a, and matrix terms like MFMA. The deeper ISA correspondences were still useful to have around, but mostly as a backstop when something felt off or when I wanted to reason a bit closer to the hardware.

That is the real reason I wanted a table in the first place: not because I planned to memorize all of it, but because I wanted one stable map while my mental model was still switching from CUDA-shaped habits to AMD-shaped ones.

March 2026