I have a very large CUDA kernel that does a lot of work, for example:
__global__ void bigkernel(args)
{
    func1();
    func2();
    func3();
    func4();
    func5();
    ....
}
I want to profile each of those functions and visualize them in Nsight.
When I run this in Nsight, it only shows bigkernel as a whole, not the details of func1()
and the other functions.
Right now I use the built-in clock64() to time each of the functions, and a structure to store the results:
struct time_stuff
{
uint64_t start, end,
func1, func2, func3, func4,...
};
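For reference, here is a minimal sketch of the clock64() approach described above. It is not my real kernel: the `TimeStuff` layout, the `Stage` enum, the bodies of `func1` to `func3`, the `data` buffer and the launch configuration are all placeholders invented for illustration. One thread per block records the cycle count before and after each function, and the host copies the records back and prints them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stage indices: one entry per function being timed.
enum Stage { FUNC1 = 0, FUNC2, FUNC3, NUM_STAGES };

// One record per thread block (illustrative layout, not the original struct).
struct TimeStuff {
    long long start, end;          // raw clock64() values for the whole kernel
    long long stage[NUM_STAGES];   // cycles spent in each function
};

// Placeholder device functions standing in for the real func1(), func2(), ...
__device__ void func1(float *x) { x[threadIdx.x] += 1.0f; }
__device__ void func2(float *x) { x[threadIdx.x] *= 2.0f; }
__device__ void func3(float *x) { x[threadIdx.x] -= 3.0f; }

__global__ void bigkernel(float *data, TimeStuff *times)
{
    float *x = data + blockIdx.x * blockDim.x;

    long long t0 = clock64();
    func1(x);
    __syncthreads();               // wait until the whole block finished func1
    long long t1 = clock64();
    func2(x);
    __syncthreads();
    long long t2 = clock64();
    func3(x);
    __syncthreads();
    long long t3 = clock64();

    // Only thread 0 of each block writes its measurements.
    if (threadIdx.x == 0) {
        TimeStuff &t = times[blockIdx.x];
        t.start = t0;
        t.end = t3;
        t.stage[FUNC1] = t1 - t0;
        t.stage[FUNC2] = t2 - t1;
        t.stage[FUNC3] = t3 - t2;
    }
}

int main()
{
    const int blocks = 4, threads = 128;   // arbitrary sizes for the sketch

    float *d_data = nullptr;
    TimeStuff *d_times = nullptr;
    cudaMalloc(&d_data, blocks * threads * sizeof(float));
    cudaMemset(d_data, 0, blocks * threads * sizeof(float));
    cudaMalloc(&d_times, blocks * sizeof(TimeStuff));

    bigkernel<<<blocks, threads>>>(d_data, d_times);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    TimeStuff h_times[blocks];
    cudaError_t err = cudaMemcpy(h_times, d_times, sizeof(h_times),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print one line per block: cycles per function and in total.
    for (int b = 0; b < blocks; ++b) {
        printf("block %d: func1=%lld func2=%lld func3=%lld total=%lld cycles\n",
               b, h_times[b].stage[FUNC1], h_times[b].stage[FUNC2],
               h_times[b].stage[FUNC3], h_times[b].end - h_times[b].start);
    }

    cudaFree(d_data);
    cudaFree(d_times);
    return 0;
}
```

The differences within one thread are the meaningful numbers here. clock64() reads a per-multiprocessor counter, so raw start values from blocks that ran on different SMs are not directly comparable.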
To visualize the results I use Python, but I would like to know whether there is a better method.
I can use Nsight Compute and Nsight Systems to understand my program and how it affects the functions, but using clock64() seems the easiest.
This is how I currently run Nsight Systems:
nsys profile --trace=nvtx,cuda --sample=cpu -o cu_trace ./cu_alg /datasets/collisions.txt 15000000 \\s 96 128