NVIDIA CUDA has just received the largest update in its history!

Wallstreetcn
2025.12.06 06:40

NVIDIA has released CUDA Toolkit 13.1, calling it the biggest update in 20 years. The release includes the CUDA Tile programming model, runtime API exposure of Green Contexts, cuBLAS emulation capabilities, and a completely rewritten CUDA programming guide. CUDA Tile lets developers write algorithms at a higher level, abstracting away the details of specialized hardware.

A few hours ago, NVIDIA CUDA Toolkit 13.1 was officially released, and NVIDIA stated: "This is the biggest update in 20 years."

This is the largest and most comprehensive update since the birth of the CUDA platform in 2006. It includes:

  • The release of NVIDIA CUDA Tile, a tile-based programming model developed by NVIDIA that can be used to abstract specialized hardware, including tensor cores.
  • Runtime API exposure of Green Contexts (lightweight, concurrently schedulable contexts or execution environments, previously available only through the driver API).
  • Double-precision and single-precision emulation in NVIDIA cuBLAS.
  • A completely rewritten CUDA programming guide designed for both CUDA beginners and advanced programmers.

Now let's take a closer look.

CUDA Tile

CUDA Tile is the core update of NVIDIA CUDA Toolkit 13.1. It is a tile-based programming model that allows algorithms to be written at a higher level and abstracts the details of specialized hardware (such as tensor cores).

NVIDIA's blog explains that CUDA Tile allows developers to write GPU kernel functions at a level above SIMT (Single Instruction, Multiple Threads).

In current SIMT programming, developers typically specify kernel functions by partitioning data and defining the execution path for each thread.
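To make the contrast concrete, here is what that looks like in a conventional SIMT kernel: a generic element-wise addition (our illustration, not taken from the release notes), where the per-thread index arithmetic and bounds check are written by hand.

#include <cuda_runtime.h>

// Classic SIMT style: each thread computes its own global index and
// handles exactly one element of the output.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread data partitioning
    if (i < n)                                      // bounds check written by hand
        c[i] = a[i] + b[i];
}

// The launch configuration (grid/block decomposition) is also chosen explicitly:
//   vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);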

With CUDA Tile, developers can elevate the abstraction level of their code by directly specifying data blocks called "Tiles." They only need to specify the mathematical operations to be performed on these Tiles, and the compiler and runtime environment will automatically determine the best way to distribute the workload across threads.

This Tile model abstracts the low-level details of calling specialized hardware like Tensor Core, and Tile code will be compatible with future GPU architectures.

CUDA 13.1 includes two components for Tile programming:

  • CUDA Tile IR: A new virtual instruction set architecture (ISA) for programming NVIDIA GPUs.
  • cuTile Python: A new domain-specific language (DSL) for writing array- and Tile-based kernel functions in Python.

The compiled Tile path integrates into the full software stack alongside the existing SIMT path.

This is the first release of CUDA Tile, and it comes with the following caveats:

  • CUDA Tile only supports NVIDIA Blackwell (compute capability 10.x and 12.x) series products. Future CUDA versions will expand support for more architectures.
  • The current development focus is on Tile programming for AI algorithms. NVIDIA stated that it will continue to add more features, functionalities, and performance improvements in future CUDA versions.
  • NVIDIA plans to introduce a C++ implementation in the upcoming CUDA version.

Why introduce Tile programming for GPUs?

CUDA provides developers with a Single Instruction Multiple Threads (SIMT) hardware and programming model. This model requires (and allows) developers to exercise fine-grained control over how code executes, offering maximum flexibility and precise targeting of the hardware. However, writing high-performance code often requires significant effort, especially when adapting to multiple GPU architectures.

Although many libraries (such as NVIDIA CUDA-X and NVIDIA CUTLASS) aim to help developers unlock performance, CUDA Tile introduces a new type of GPU programming that is at a higher level than the SIMT hierarchy.

As computational workloads evolve, especially in the AI field, tensors have become a fundamental data type. NVIDIA has developed hardware specifically for processing tensors, such as NVIDIA Tensor Core (TC) and NVIDIA Tensor Memory Accelerator (TMA), which have now become indispensable components of every new GPU architecture.

The more complex the hardware, the more software is needed to help harness these capabilities. CUDA Tile abstracts Tensor Core and its programming model, allowing code written with CUDA Tile to be compatible with current and future Tensor Core architectures.

The Tile-based programming approach allows developers to write algorithms by specifying data blocks (i.e., Tiles) and then defining the computations to be performed on these Tiles. Developers do not need to set execution details of the algorithm at the element level: the compiler and runtime will handle this work.

The following diagram illustrates the conceptual differences between the Tile model introduced with CUDA Tile and the CUDA SIMT model.

The Tile model (left) divides data into multiple blocks, and the compiler maps them to threads. The Single Instruction Multiple Threads (SIMT) model (right) maps data simultaneously to blocks and threads.

This programming paradigm is common in languages like Python, where libraries like NumPy allow developers to specify data types such as matrices and then perform batch operations with simple code.

CUDA Software Updates

Here are the other important software improvements included in this release:

Support for Green Context in the runtime

Green Context in CUDA is a lightweight form of context that serves as an alternative to traditional CUDA contexts, providing developers with finer-grained GPU space partitioning and resource allocation capabilities.

Green Contexts have been available in the driver API since CUDA 12.4; starting with this release, they are also officially available in the runtime API.

Green Context allows users to define and manage independent partitions of GPU resources, primarily Streaming Multiprocessors (SM). You can allocate a specific number of SMs to a particular Green Context and then launch CUDA kernels and manage streams that only run within the resources owned by that context.

A typical scenario: if part of your code is extremely latency-sensitive and needs to run ahead of all other GPU work, you can create a separate Green Context for that code and allocate SMs to it, while assigning the remaining SMs to another Green Context for everything else. This guarantees that SMs are always available for the high-priority computations.
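A driver-API sketch of this flow is shown below. These entry points have been available since CUDA 12.4; the new runtime-API equivalents introduced in 13.1 are not named in this article, so they are not shown, and the exact parameter order here should be checked against the driver API documentation (error handling omitted):

#include <cuda.h>

// Sketch: carve out a Green Context that owns a subset of the device's SMs,
// assuming cuInit() has been called and `dev` is a valid CUdevice.
void createLatencyPartition(CUdevice dev)
{
    // Query the device's SM resource.
    CUdevResource smResource;
    cuDeviceGetDevResource(dev, &smResource, CU_DEV_RESOURCE_TYPE_SM);

    // Split off one group of SMs for latency-critical work; the rest lands in `remaining`.
    CUdevResource latencyGroup, remaining;
    unsigned int numGroups = 1;
    cuDevSmResourceSplitByCount(&latencyGroup, &numGroups, &smResource, &remaining,
                                0 /* useFlags */, 16 /* minimum SMs in the group */);

    // Generate a resource descriptor and create a Green Context over that SM group.
    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &latencyGroup, 1);
    CUgreenCtx latencyCtx;
    cuGreenCtxCreate(&latencyCtx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Streams created from the Green Context run only on the SMs it owns.
    CUstream latencyStream;
    cuGreenCtxStreamCreate(&latencyStream, latencyCtx, CU_STREAM_NON_BLOCKING, 0);
}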

CUDA 13.1 also introduces a more customizable split() API. Developers can use this interface to build SM partitions that previously required multiple API calls to complete, and they can configure work queues to reduce false dependencies that arise when submitting tasks between different Green Contexts.

For more information about these features and Green Context, please refer to the CUDA Programming Guide.

  • CUDA Programming Guide URL: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/green-contexts.html

CUDA Multi-Process Service (MPS) Updates

CUDA 13.1 brings several new features and functionalities for multi-process services. For complete information on these new features, please refer to the MPS documentation. Here are some highlights:

Memory Locality Optimization Partition

Memory locality optimization partition (MLOPart) is a feature available on NVIDIA Blackwell-series GPUs (compute capability 10.0 and 10.3) and newer.

This feature allows users to create CUDA devices specifically optimized for memory locality. MLOPart devices are derived from the same physical GPU but present as multiple independent devices, each with fewer computing resources and less available memory.

On GPUs with compute capability 10.0 and 10.3, each GPU contains two partitions.

When MLOPart is enabled on the GPU, each partition appears as an independent CUDA device, with its corresponding compute and memory resources.
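Because each MLOPart partition presents itself as a regular CUDA device, ordinary device-enumeration code lists it alongside other devices. A minimal sketch using only standard runtime-API calls (nothing MLOPart-specific is required):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // With MLOPart enabled, each memory-locality partition of the physical GPU
    // appears here as its own device, with its own SM count and memory budget.
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s, %d SMs, %.1f GiB\n",
                    i, prop.name, prop.multiProcessorCount,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}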

Currently, MLOPart only supports NVIDIA B200 and NVIDIA B300 series products. Future CUDA releases will add support for NVIDIA GB200 and NVIDIA GB300 series.

Static Streaming Multiprocessor (SM) Partition

As an alternative to the existing dynamic provisioning in MPS, static streaming multiprocessor (SM) partition is a feature for NVIDIA Ampere architecture (compute capability 8.0) and newer GPUs, providing MPS clients with a method to create exclusive SM partitions.

This mode is enabled by starting the MPS control daemon with the -S or --static-partitioning flag, and is primarily aimed at providing deterministic resource allocation and improving isolation between MPS clients. The basic unit of partitioning is a "Chunk", whose size varies with the GPU architecture; for example, on discrete GPUs with Hopper (compute capability 9.0) and newer architectures, a Chunk contains 8 SMs.

Double Precision and Single Precision Emulation in cuBLAS

Although this is not strictly an update of CUDA 13.1, the cuBLAS update in NVIDIA CUDA Toolkit 13.0 introduced new APIs and implementations aimed at enhancing the performance of double precision (FP64) matrix multiplication (matmul).

This is achieved through floating-point (FP) emulation on the Tensor Cores of GPU architectures such as NVIDIA GB200 NVL72 and the NVIDIA RTX PRO 6000 Blackwell Server Edition.

Developer Tools

Developer tools are an important part of the CUDA platform. This release brings several innovations and feature enhancements, including:

CUDA Tile Kernel Profiling in Nsight Compute

  • A new "Result Type" column has been added to the summary page to distinguish between Tile kernels and SIMT kernels.
  • A new "Tile Statistics" section has been added to the details page, summarizing the utilization of Tile dimensions and important pipelines.
  • The source code page supports mapping metrics to high-level cuTile kernel source code.

Figure: Nsight Compute analysis output highlighting the new Tile Statistics section.

The released Nsight Compute also adds support for analyzing CUDA graph nodes in device-launched graphs and improves navigation on the source code page, providing clickable links for compiler-generated and user-generated labels.

Compile-time Patching

NVIDIA Compute Sanitizer 2025.4 adds support for compile-time patching with the NVIDIA CUDA Compiler (NVCC) through the -fdevice-sanitize=memcheck compiler flag. This patching enhances memory error detection capabilities and improves the performance of Compute Sanitizer.

Compile-time instrumentation can directly integrate error detection into NVCC, resulting in faster execution and capturing more subtle memory issues (such as illegal access between adjacent allocations) through advanced base-and-bounds analysis. This means developers can debug memory issues without sacrificing speed, run more tests, and maintain productivity. Currently, this feature only supports the memcheck tool.

To use this new feature, compile your code with the following NVCC flag:

nvcc -fdevice-sanitize=memcheck -o myapp myapp.cu

Then run your application using the memcheck tool:

compute-sanitizer --tool memcheck myapp

NVIDIA Nsight Systems

NVIDIA Nsight Systems 2025.6.1 is released in sync with CUDA Toolkit 13.1, bringing several new tracing features:

  • System-level CUDA tracing: --cuda-trace-scope can enable tracing across process trees or the entire system.
  • CUDA host function tracing: Added support for tracing CUDA Graph host function nodes and cudaLaunchHostFunc(), which execute on the host and block streams (a minimal example follows this list).
  • CUDA hardware tracing: Hardware-based tracing is now the default mode when supported; use --trace=cuda-sw to revert to software mode.
  • The Green Context timeline line will now display SM allocation in the tooltip, helping users understand GPU resource utilization.
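For context, cudaLaunchHostFunc() enqueues a CPU callback into a stream: it runs after all previously submitted work in that stream completes and blocks subsequent work in the stream until it returns, which is exactly the behavior the new tracing makes visible. A minimal illustration (ours, not taken from the release notes):

#include <cstdio>
#include <cuda_runtime.h>

// Host callback: runs on a CPU thread once all prior work in the stream has finished;
// later work in the stream waits until it returns.
void logStageDone(void* userData)
{
    std::printf("stage '%s' finished\n", static_cast<const char*>(userData));
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... enqueue kernels or copies on `stream` here ...

    // This is the kind of call that Nsight Systems 2025.6.1 can now trace.
    cudaLaunchHostFunc(stream, logStageDone, const_cast<char*>("preprocessing"));

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}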

Math Library

New features in the core CUDA toolkit math library include:

  • NVIDIA cuBLAS: A brand-new experimental API that supports grouped GEMM on Blackwell GPUs for the FP8 and BF16/FP16 data types. For these data types, grouped GEMM support in CUDA Graphs enables device-side shape specification without host synchronization, achieving up to 4x speedups over multi-stream GEMM implementations in MoE use cases.
  • NVIDIA cuSPARSE: A new sparse matrix-vector multiplication (SpMVOp) API that offers performance improvements over the CsrMV API. This API supports CSR format, 32-bit indexing, double precision, and user-defined suffixes.
  • NVIDIA cuFFT: A brand new API called cuFFT device API that provides host functions for querying or generating device function code and database metadata in C++ header files. This API is designed for the cuFFTDx library and can generate cuFFTDx code blocks by querying cuFFT, which can be linked with cuFFTDx applications to enhance performance.

Performance updates have also been released for the new Blackwell architecture; details for selected key APIs follow.

cuBLAS Blackwell Performance

CUDA Toolkit 12.9 introduced block-scaled FP4 and FP8 matrix multiplication on the NVIDIA Blackwell platform. CUDA 13.1 improves performance for these data types as well as BF16. Figure 2 shows the speedup ratios on the NVIDIA Blackwell and Hopper platforms.

cuSOLVER Blackwell Performance

CUDA 13.1 continues to optimize the batch SYEVD and GEEV APIs for eigenvalue decomposition, bringing significant performance enhancements.

Among them, batch SYEV (cusolverDnXsyevBatched) is a unified batch version of the SYEV routine in cuSOLVER, used for computing eigenvalues and eigenvectors of symmetric/Hermitian matrices, making it very suitable for scenarios requiring parallel solving of a large number of small matrices.

Figure 3 shows the test results with a batch size of 5,000 (matrix row sizes from 24 to 256). Compared to the NVIDIA L40S, the NVIDIA Blackwell RTX Pro 6000 Server Edition achieved approximately 2 times the speedup, which aligns with the expected memory bandwidth improvement.

For both single-precision complex and single-precision real matrices, when the number of rows N = 5, the speedup ratio is about 1.5×, gradually increasing as the number of rows increases, reaching 2.0× at N = 250.

Figure 4 shows the performance speedup ratio of cusolverDnXgeev (GEEV), which is used to compute eigenvalues and eigenvectors of general (non-symmetric) dense matrices. GEEV is a hybrid CPU/GPU algorithm. A single CPU thread is responsible for performing efficient early reduction in the QR algorithm, while the GPU handles the rest. The figure displays the relative performance speedup ratio for matrix sizes ranging from 1,024 to 32,768.

When the matrix row size n = 5000, the speedup ratio is about 1.0, gradually increasing as the matrix size grows, reaching approximately 1.7 at n = 30000.

NVIDIA CUDA Core Computing Library

The NVIDIA CUDA Core Computing Library (CCCL) brings multiple innovations and enhancements to CUB.

Deterministic Floating-Point Reductions

Due to the non-associative nature of floating-point addition, cub::DeviceReduce historically could only guarantee bitwise identical results on the same GPU for each run. This was implemented as a two-pass algorithm.

As part of CUDA 13.1, NVIDIA CCCL 3.1 provides two additional floating-point determinism options, allowing you to trade off between determinism and performance:

  • No guarantee: A single-pass reduction using atomic operations; results are not guaranteed to be bitwise identical.
  • GPU-to-GPU: Based on the reproducible reduction work presented by Kate Clark at NVIDIA GTC 2024; results are always bitwise identical, even across different GPUs.

The determinism option is selected through a flag when the algorithm is invoked, roughly as sketched below.
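The article does not reproduce the code, so the following is only a rough sketch; the names used here (cuda::execution::require, cuda::execution::determinism::gpu_to_gpu) are assumptions about the CCCL 3.1 requirement-style interface and should be checked against the CCCL documentation.

#include <cub/cub.cuh>

// NOTE: rough sketch only. The requirement/environment names below are assumed
// from CCCL's determinism work and may not match the shipped CCCL 3.1 API exactly.
void deterministic_sum(const float* d_in, float* d_out, int num_items)
{
    // Assumed: request bitwise-reproducible (GPU-to-GPU) results via an execution requirement.
    auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);

    // Assumed: an overload of the algorithm that accepts the requirement object and
    // handles temporary storage internally (see the single-phase API below).
    cub::DeviceReduce::Sum(d_in, d_out, num_items, env);
}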

More convenient single-phase CUB API

Almost all CUB algorithms require temporary storage as an intermediate buffer. In the past, users had to query and allocate this storage through a two-phase calling pattern (illustrated below), which was cumbersome and error-prone if the arguments passed to the two calls were inconsistent.
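For context, the classic two-phase pattern that the new overloads replace looks like this (standard CUB usage, shown here for illustration):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Classic two-phase CUB call: first query the temporary-storage size,
// then allocate the buffer and invoke the algorithm again.
void sum_two_phase(const float* d_in, float* d_out, int num_items)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Phase 1: with d_temp_storage == nullptr, this only writes temp_storage_bytes.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    // Phase 2: allocate the buffer and run the reduction for real. Any mismatch in the
    // arguments between the two calls is the classic source of bugs mentioned above.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}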

CCCL 3.1 adds new overloads for some CUB algorithms that accept memory resources, allowing users to skip the temporary storage query/allocation/release pattern.

Risk warning and disclaimer

The market has risks, and investment should be cautious. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article are suitable for their specific circumstances. Investment based on this article is at the user's own risk.