Usage

Installation

To use nvcc4jupyter, first install it using pip:

(venv) $ pip install nvcc4jupyter

Load the Extension

Now we need to load the IPython extension to be able to use its cell and line magic commands:

%load_ext nvcc4jupyter

Hello World

We will use the cuda cell magic command to run a simple hello world program.

%%cuda
#include <stdio.h>

__global__ void hello(){
    printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main(){
    hello<<<2, 2>>>();
    cudaDeviceSynchronize();
}

Groups

Now we will demonstrate a more complex scenario that uses source file groups. If you want to split your code into multiple source files, either for code reuse or just to have an easier to read project, you want to use groups. A group of source files will be compiled together. Because of this, you can include headers from the same group and use the code defined in other “.cu” files. There is also a special group named “shared” whose files will be compiled together with all other groups, which is a great feature for error handling code as we’ll show now:

%%cuda_group_save --group shared --name "error_handling.h"
// error checking macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

Now we can use that error handling macro in this vector addition program but also in other programs that we define in other Jupyter cells:

%%cuda
#include <stdio.h>
#include "error_handling.h"

const int DSIZE = 4096;
const int block_size = 256;

// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < ds) {
        C[idx] = A[idx] + B[idx];
    }
}

int main(){
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // allocate space for vectors in host memory
    h_A = new float[DSIZE];
    h_B = new float[DSIZE];
    h_C = new float[DSIZE];

    // initialize vectors in host memory to random values (except for the
    // result vector whose values do not matter as they will be overwritten)
    for (int i = 0; i < DSIZE; i++) {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // allocate space for vectors in device memory
    cudaMalloc(&d_A, DSIZE*sizeof(float));
    cudaMalloc(&d_B, DSIZE*sizeof(float));
    cudaMalloc(&d_C, DSIZE*sizeof(float));
    cudaCheckErrors("cudaMalloc failure"); // error checking

    // copy vectors A and B from host to device:
    cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    cudaCheckErrors("cudaMemcpy H2D failure");

    // launch the vector adding kernel
    vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_B, d_C, DSIZE);
    cudaCheckErrors("kernel launch failure");

    // wait for the kernel to finish execution
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel execution failure");

    cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cudaMemcpy D2H failure");

    printf("A[0] = %f\n", h_A[0]);
    printf("B[0] = %f\n", h_B[0]);
    printf("C[0] = %f\n", h_C[0]);
    return 0;
}

Above we use the cuda magic command which saves the code in the cell to an anonymous source file group, compiles, and executes that code. This only allows us to have one source file (besides the ones in the “shared” group). In order to have multiple source files we need to use the cuda_group_save and cuda_group_run magics.

First, we save the vector addition function to its own file:

%%cuda_group_save --name "vector_add.cu" --group "vector_add"
// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < ds) {
        C[idx] = A[idx] + B[idx];
    }
}

Now we create a header file so the main cuda file knows the signature of “vadd”:

%%cuda_group_save --name "vector_add.h" --group "vector_add"
__global__ void vadd(const float *A, const float *B, float *C, int ds);

To tie it all together, we save the main cuda file, which includes our vector addition code:

%%cuda_group_save --name "main.cu" --group "vector_add"
#include <stdio.h>
#include "error_handling.h"
#include "vector_add.h"

const int DSIZE = 4096;
const int block_size = 256;

int main(){
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // allocate space for vectors in host memory
    h_A = new float[DSIZE];
    h_B = new float[DSIZE];
    h_C = new float[DSIZE];

    // initialize vectors in host memory to random values (except for the
    // result vector whose values do not matter as they will be overwritten)
    for (int i = 0; i < DSIZE; i++) {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // allocate space for vectors in device memory
    cudaMalloc(&d_A, DSIZE*sizeof(float));
    cudaMalloc(&d_B, DSIZE*sizeof(float));
    cudaMalloc(&d_C, DSIZE*sizeof(float));
    cudaCheckErrors("cudaMalloc failure"); // error checking

    // copy vectors A and B from host to device:
    cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    cudaCheckErrors("cudaMemcpy H2D failure");

    // launch the vector adding kernel
    vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_B, d_C, DSIZE);
    cudaCheckErrors("kernel launch failure");

    // wait for the kernel to finish execution
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel execution failure");

    cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cudaMemcpy D2H failure");

    printf("A[0] = %f\n", h_A[0]);
    printf("B[0] = %f\n", h_B[0]);
    printf("C[0] = %f\n", h_C[0]);
    return 0;
}

Now we can compile all the source files in the group and execute the main function with the following command:

%cuda_group_run --group "vector_add"

Profiling

Another important feature of nvcc4jupyter is its integration with the NVIDIA Nsight Compute / NVIDIA Nsight Systems profilers, which you need to make sure are installed and the executables can be found in a directory in your PATH environment variable.

To profile using Nsight Compute with custom arguments:

%cuda_group_run --group "vector_add" --profile --profiler-args "--section SpeedOfLight"

Running the cell above will compile and execute the vector addition code in the “vector_add” group and profile it, keeping only the metrics from the “SpeedOfLight” section. The output will contain something similar to:

Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name               Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency          cycle/nsecond         4.65
SM Frequency            cycle/usecond       544.31
Elapsed Cycles                  cycle        2,145
Memory Throughput                   %         3.19
DRAM Throughput                     %         3.19
Duration                      usecond         3.94
L1/TEX Cache Throughput             %         6.67
L2 Cache Throughput                 %         1.98
SM Active Cycles                cycle       383.65
Compute (SM) Throughput             %         1.19
----------------------- ------------- ------------

To profile using Nsight Systems with custom arguments:

%cuda_group_run --group "vector_add" --profiler nsys --profile --profiler-args "profile --stats=true"

Running the cell above will compile and execute the vector addition code in the “vector_add” group and profile it with Nsight Systems. The output will contain multiple tables, one of which will look similar to this:

[5/8] Executing 'cuda_api_sum' stats report

Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)           Name
--------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  ----------------------
    77.3      200,844,276          1  200,844,276.0  200,844,276.0  200,844,276  200,844,276          0.0  cudaMalloc
    22.6       58,594,762          2   29,297,381.0   29,297,381.0   29,153,999   29,440,763    202,772.8  cudaMemcpy
     0.1          305,450          1      305,450.0      305,450.0      305,450      305,450          0.0  cudaLaunchKernel
     0.0            1,970          1        1,970.0        1,970.0        1,970        1,970          0.0  cuModuleGetLoadingMode

Compiler arguments

In the same way profiler arguments can be passed to the profiling tool, compiling arguments can be passed to nvcc:

%cuda_group_run --group "vector_add" --compiler-args "--optimize 3"

Running the cell above will compile and execute the vector addition code in the “vector_add” group. During compilation, nvcc receives the “--optimize” option which specifies the optimization level for host code.

Set default arguments

In the case where you execute multiple magic commands with the same compiler or profiler arguments you can avoid writing them every time by setting the default arguments:

from nvcc4jupyter import set_defaults
set_defaults(compiler_args="--optimize 3", profiler_args="--section SpeedOfLight")

The same effect can be achieved by running “set_defaults” once for each config due to the fact that the default value is not changed if an a value is not given to the “set_defaults” function.

from nvcc4jupyter import set_defaults
set_defaults(compiler_args="--optimize 3")
set_defaults(profiler_args="--section SpeedOfLight")

Now we can run the following cell without specifying the compiler and profiler arguments once again.

%cuda_group_run --group "vector_add" --profile