CUDA: the programming model for NVIDIA GPUs

CUDA as the SIMT programming model for NVIDIA GPUs: grid, block, thread, warp, memory hierarchy, streaming multiprocessors. The toolkit (NVCC, PTX, SASS), the libraries (cuBLAS, cuDNN, NCCL, TensorRT), and the evolution from CUDA 1.0 (2007) to CUDA 12.0 (8 December 2022).

Tags: R&D, AI, Hardware, CUDA, NVIDIA, GPU, SIMT, PTX, cuDNN, cuBLAS, NCCL, TensorRT

Origin: Tesla and CUDA 1.0 (2007)

CUDA (Compute Unified Device Architecture) is the programming model and runtime introduced by NVIDIA with the Tesla architecture (G80 GPU, GeForce 8800 GTX, November 2006) and made publicly available as CUDA 1.0 in June 2007. The stated goal was to expose the GPU as a general-purpose parallel compute device, with a C-derived language and explicit parallel semantics.

The execution model is SIMT (Single Instruction, Multiple Threads): scalar instructions replicated across groups of threads (hardware warps of 32 threads), with divergence handled at warp level.
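A minimal sketch of what warp-level divergence means in practice (the kernel and identifiers are ours, for illustration only): when threads of the same 32-thread warp take different sides of a branch, the hardware executes both paths one after the other, masking off the inactive lanes.

```cuda
#include <cuda_runtime.h>

// Warp divergence sketch: threads of one warp branch on their lane index,
// so the two paths are serialised within the warp.
__global__ void divergent(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        out[i] = 2 * i;        // even lanes execute first...
    } else {
        out[i] = 2 * i + 1;    // ...then odd lanes, while even lanes wait
    }
    // after the branch, the warp reconverges and runs in lockstep again
}
```

Branching on blockIdx rather than threadIdx, where the algorithm allows it, keeps whole warps on a single path and avoids this serialisation.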

Grid, block, thread

A CUDA program is composed of kernels, __global__ functions executed on the device (GPU). Each kernel launch specifies:

  • a grid of blocks (1D/2D/3D)
  • for each block, a thread layout (1D/2D/3D, up to 1024 threads per block)

Blocks and threads expose indices (blockIdx, threadIdx) used to map computation onto data. The hardware assigns whole blocks to a Streaming Multiprocessor (SM) — a unit with registers, shared memory and warp scheduler. Blocks do not communicate except through global memory, while threads within a block share on-chip shared memory.

__global__ void saxpy(int n, float a, const float* x, float* y) {
  // one thread per element: global index from block and thread indices
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];  // guard threads past the end of the array
}
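The host side of this example (a minimal sketch, with the kernel repeated for completeness; error handling omitted, x and y assumed to be device pointers): the grid size is rounded up so every element is covered.

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// 256 threads per block; (n + 255) / 256 blocks = ceil(n / 256),
// so the grid covers every element even when n is not a multiple of 256.
void launch_saxpy(int n, float a, const float* x, float* y) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, x, y);
    cudaDeviceSynchronize();  // block the host until the kernel completes
}
```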

Memory hierarchy

  • Registers — per-thread, fastest
  • Shared memory — per-block, ~100 KB per SM, ~TB/s bandwidth
  • L1/L2 cache — hardware-managed
  • Global memory — HBM/GDDR, hundreds of GB/s to TB/s, high latency
  • Constant and texture memory — caching-optimised read paths

CUDA tuning is largely about mapping data to these tiers and ensuring coalesced accesses to global memory.
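As an illustration of that mapping (a sketch of the classic pattern, with identifiers of our choosing): a block-level sum reduction reads global memory with coalesced accesses, then does all combining in fast per-block shared memory.

```cuda
#include <cuda_runtime.h>

// Each 256-thread block sums 256 consecutive elements of `in`.
// The load is coalesced (thread t reads element block_base + t); the
// tree reduction then runs entirely in on-chip shared memory.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // tile fully populated
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();                    // wait for this pass to finish
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];          // one partial sum per block
}
```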

Toolkit and libraries

The CUDA Toolkit includes:

  • NVCC — C/C++ compiler that splits host and device code, emits PTX (Parallel Thread Execution, intermediate virtual ISA) and SASS (the real ISA of the target GPU)
  • NVRTC — runtime compilation of CUDA sources
  • cuBLAS, cuSPARSE, cuFFT, cuSOLVER — linear algebra, FFT, sparse/dense solvers
  • cuDNN — primitives for neural networks (convolutions, attention, batchnorm)
  • NCCL — collective communication across multi-GPU/multi-node (all-reduce, broadcast)
  • TensorRT — inference optimiser and runtime (INT8/FP8 quantisation, fusion, kernel autotuning)
  • Nsight Systems/Compute — profiling

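The NVCC/PTX/SASS split is visible directly from the command line (a sketch; file names and the compute_80/sm_80 target are illustrative, assuming an Ampere GPU):

```shell
# Emit human-readable PTX (the virtual ISA) for compute capability 8.0
nvcc -arch=compute_80 -ptx saxpy.cu -o saxpy.ptx

# Build a fat binary with SASS for sm_80 plus embedded PTX,
# which the driver can JIT-compile for newer architectures
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 saxpy.cu -o saxpy

# Inspect the actual SASS the GPU executes
cuobjdump --dump-sass saxpy
```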
On top of the toolkit sit PyTorch, TensorFlow, JAX, RAPIDS (cuDF, cuML for data science), Triton (OpenAI's Python DSL for kernels), and llama.cpp / vLLM / TensorRT-LLM for LLMs.

Compute capability and versions

Each NVIDIA GPU exposes a compute capability (e.g. 7.0 for Volta, 8.0 for Ampere A100, 9.0 for Hopper H100, 10.0 for Blackwell) that determines supported instructions, hardware limits and available features. PTX acts as an intermediate layer: PTX code generated for one capability can be JIT-compiled by the driver into SASS for newer GPUs, preserving forward compatibility.
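A program can query the capability of the device it runs on at runtime (a minimal sketch using the CUDA runtime API; the helper name is ours):

```cuda
#include <cuda_runtime.h>

// Returns the compute capability of device 0 as major*10 + minor,
// e.g. 80 for an Ampere A100, 90 for a Hopper H100.
int compute_capability() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    return prop.major * 10 + prop.minor;
}
```

The same cudaDeviceProp structure (prop.multiProcessorCount, prop.sharedMemPerBlock, ...) exposes the per-architecture hardware limits mentioned above.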

Main releases:

  • CUDA 1.0 (2007) — G80, first public release
  • CUDA 4.0 (2011) — unified virtual addressing, GPU peer-to-peer access
  • CUDA 6.0 (2014) — unified memory (single pointer across host/device)
  • CUDA 9.0 (2017) — Volta support, cooperative groups
  • CUDA 11.0 (2020) — Ampere support, TF32, MIG
  • CUDA 12.0 (8 December 2022) — Hopper support, lazy loading, minimum driver 525.60.13, partial C++20 support in device code, compute capability < 3.5 removed

CUDA as lock-in and the alternatives

CUDA is proprietary to NVIDIA: binaries compiled for NVIDIA GPUs do not run on AMD/Intel. Open alternatives (OpenCL, SYCL, AMD’s HIP/ROCm, Intel’s oneAPI) cover overlapping use cases, but as of 2024 the reference AI library stack (cuDNN, NCCL, TensorRT) remains CUDA-first. AMD’s HIP intentionally adopts near-identical syntax to CUDA to ease porting.

The noze context

noze builds on-premise AI stacks in healthcare and R&D: LLM training and inference (Llama, Mistral, BioMistral, Meditron), imaging pipelines (MONAI, nnU-Net), RAG systems. CUDA is the common layer across all of these stacks. Using ready-made NVIDIA NGC Docker images (e.g. pytorch:24.x-py3, tritonserver:24.x) enables reproducible, version-pinned CUDA runtime deployments — a stable requirement for MDR pathways and EU AI Act audits.
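A pinned deployment of this kind might look as follows (the tag and the Python check are illustrative; --gpus all requires the NVIDIA Container Toolkit on the host):

```shell
# Pull a pinned NGC image: the tag fixes the CUDA, cuDNN and NCCL versions
docker pull nvcr.io/nvidia/pytorch:24.05-py3

# Run with GPU access and verify the CUDA runtime seen by PyTorch
docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.05-py3 \
    python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```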


References: CUDA introduced with the Tesla architecture (G80, GeForce 8800 GTX, 8 November 2006). CUDA 1.0 publicly available June 2007. CUDA 12.0 general release 8 December 2022 (NVIDIA release notes, developer.nvidia.com/cuda-toolkit-archive). Licence: proprietary NVIDIA (CUDA Toolkit EULA). Primary sources: CUDA C++ Programming Guide, PTX ISA documentation, GPU architecture whitepapers.
