
Debugging Tips & Information

sean-dougherty edited this page Jan 31, 2015 · 17 revisions

This document provides information on how to debug your homework within cuda-edu. It currently assumes that you'll be using a command-line debugger. In the future, a similar document will be made for GUI debuggers.

Printing from Your Kernel

Sometimes it's useful to put a trace/print in your kernel, which can be done via such mechanisms as wbLog(), printf(), or cout. If you do put a trace in your kernel, keep in mind you'll want to remove it prior to submitting to the real WebGPU. If your trace code is left in your kernel, it will most likely cause a compiler error.

Also, if you're going to print from your kernel, it would be a good idea to use a single OS thread so that your output doesn't get garbled. This can be accomplished with the -s flag of the run script.

threadIdx & blockIdx

Due to limitations of some platforms, the threadIdx and blockIdx values have to be stored in local variables that cuda-edu injects into your __global__ and __device__ functions. The local variables don't receive their true values until you enter your function. So, if you were to set a breakpoint on your kernel and then use your debugger to look at the values of threadIdx/blockIdx, their contents would be garbage. Once you step to your first statement, their values should become valid.

To clarify: the first statement of your kernel will see the correct values for threadIdx/blockIdx. Your debugger just won't show the correct values until execution reaches that first statement.

Pointer Guards

When cuda-edu converts your Cuda source code into C++, it inserts code that protects your pointers with "guards". Guards intercept your requests to dereference a pointer and query the cuda-edu memory system to determine if the dereferenced address is valid. A declaration in your original code like:

float *host_src;

will be converted into something like:

edu::guard::ptr_guard_t<float> host_src;

When you look at host_src in your debugger, it'll look something like the following:

{ptr = 0x8066018, buf = {addr = 0x8066018, len = 256, space = edu::mem::MemorySpace_Host, alloced = true}}

The meanings of the fields are as follows:

  • ptr: The actual pointer value, just as it would be in your untranslated code.
  • buf: The buffer within which the pointed-to memory resides (e.g. something you malloc'ed).
  • buf::addr: Starting address of buffer.
  • buf::len: Length of the buffer in bytes.
  • buf::space: Where the buffer resides (host/device).
  • buf::alloced: Whether or not this is a dynamically allocated buffer (versus a global variable).

Array Guards

Similar to pointers, cuda-edu will wrap your arrays in guards. For example, the following declaration in your original code:

float x[4][4];

would be converted by cuda-edu into:

edu::guard::array2_guard_t<float, 4, 4> x;

When printed out in your debugger, it will look something like the following:

{data2 = {{data = {0, 0, 0, 0}}, {data = {0, 0, 0, 0}}, {data = {0, 0, 0, 0}}, {data = {0, 0, 0, 0}}}}

data2 is the 2-dimensional data in the array guard, and each data field is a 1-dimensional array guard. The array elements are laid out in memory exactly as if the guards weren't present. That is, the guards don't affect the memory footprint.

Cuda Thread Context Switching

cuda-edu uses "fibers" to emulate cuda threads. Each cuda block is executed by a single operating system thread, and within that thread are as many fibers as there are cuda threads for that cuda block. Each fiber has its own stack and can store the values of its registers (its context) when it's not currently executing. When one cuda thread pauses execution and another takes its place as the executing thread, that is called a context switch.

A cuda thread context switch will occur under two circumstances: your code has called __syncthreads(), or a value has been written to a buffer (e.g. shared memory). When a cuda thread reaches an invocation of __syncthreads(), it enters a Sync state and pauses its execution until all cuda threads in the block have also entered a Sync state. When a cuda thread writes to a buffer, like shared memory, it enters a SyncWarp state until all other threads in its warp have also entered a SyncWarp state. The warp syncing is necessary because, on a real GPU, the threads in a warp execute in parallel. It is necessary for cuda-edu to pause a thread that has written to a buffer so that the other threads in its warp can see the new value, as if they're all executing at the same time.

This context switching is problematic for a debugger if you're single-stepping through your kernel because the debugger doesn't expect the stack frame to change from one line to the next.