Hello, after about half a year, I finally found the solution to the biggest optimization issue I have had so far in the Darknet on OpenCL project. It was tough to solve, and I even wrote a post on the AMD Community forum. Early this morning, I posted the solution shown below in that thread.
CPU CODE THAT INVOKES THE GPU CODE AND USES BOTH GLOBAL AND LOCAL THREAD SPACES!
void fast_mean_gpu(cl_mem_ext x, int batch, int filters, int spatial, cl_mem_ext mean)
{
    // I changed this to be dynamic, but it can be a const such as 16 or 8.
    int tuning = filters / 4;
    dim2 dimGridG1;
    // this is the very important change: a multiplication of threads!
    dimGridG1 = dim2_create(tuning, filters);
    dim2 dimGridL1;
    // this change defines the local space expected for the calculation!
    dimGridL1 = dim2_create(tuning, 1);
    opencl_kernel_local(opencl_fast_mean_kernel[opencl_device_id_t],
                        dimGridG1, dimGridL1, 14,
                        &tuning, sizeof(cl_int),
                        // passing NULL avoids any constant on the local GPU sums array size!
                        NULL, tuning*sizeof(cl_float),
                        &filters, sizeof(cl_int),
                        &batch, sizeof(cl_int),
                        &spatial, sizeof(cl_int),
                        &x.mem, sizeof(cl_mem),
                        &mean.mem, sizeof(cl_mem));
}
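For readers who do not use my darknet wrappers, here is a minimal sketch of what I assume this launch boils down to in plain OpenCL host code; the function name fast_mean_enqueue and the assumption that the kernel and command queue were already created elsewhere are mine, not part of the repository.

/* Plain OpenCL host-side sketch of the same launch (assumptions as stated above). */
#include <CL/cl.h>

void fast_mean_enqueue(cl_command_queue queue, cl_kernel kernel,
                       cl_int filters, cl_int batch, cl_int spatial,
                       cl_mem x, cl_mem mean)
{
    cl_int tuning = filters / 4;               /* dynamic local reduction width */

    /* arguments in the same order as fast_mean_kernel expects them */
    clSetKernelArg(kernel, 0, sizeof(cl_int), &tuning);
    clSetKernelArg(kernel, 1, tuning * sizeof(cl_float), NULL); /* __local sums[tuning] */
    clSetKernelArg(kernel, 2, sizeof(cl_int), &filters);
    clSetKernelArg(kernel, 3, sizeof(cl_int), &batch);
    clSetKernelArg(kernel, 4, sizeof(cl_int), &spatial);
    clSetKernelArg(kernel, 5, sizeof(cl_mem), &x);
    clSetKernelArg(kernel, 6, sizeof(cl_mem), &mean);

    /* global space: tuning x filters work-items; local space: one work-group of
     * tuning work-items per filter, so each group reduces one filter on its own */
    size_t global[2] = { (size_t)tuning, (size_t)filters };
    size_t local[2]  = { (size_t)tuning, 1 };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
}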
GPU CODE THAT IS ACCELERATED WITHOUT ATOMICS, THANKS TO A DATA MODEL THAT WINS ON PERFORMANCE!
__kernel void fast_mean_kernel(int tuning, __local float *sums, int filters, int batch, int spatial, __global float *x, __global float *mean)
{
    // that was changed from index 0 to index 1!
    int i = get_global_id(1);
    // that has to be the first index of the global thread space!
    int t = get_global_id(0);
    sums[t] = 0;
    int j, k;
    for (j = 0; j < batch; ++j) {
        for (k = t; k < spatial; k += tuning) {
            int index = j * filters * spatial + i * spatial + k;
            // I tried atomic add, but it needs data model optimization!
            sums[t] += x[index];
        }
    }
    // sums lives in __local memory, so a local memory fence is the right barrier here
    barrier(CLK_LOCAL_MEM_FENCE);
    if (t == tuning-1) {
        mean[i] = 0;
        int s;
        for (s = 0; s < tuning; ++s) {
            mean[i] += sums[s];
        }
        mean[i] /= (spatial * batch);
    }
}
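To see why no atomics are needed, here is a small CPU reference of the same computation (a sketch; the function name fast_mean_cpu is mine): every filter i gets its own work-group, so the tuning partial sums in __local memory are private to that group, and the last work-item simply folds them into mean[i].

/* CPU reference that makes the data layout explicit: x is laid out as
 * x[batch][filters][spatial], and mean[i] is the average of filter i over
 * all batches and spatial positions. Names here are illustrative only. */
#include <stddef.h>

void fast_mean_cpu(const float *x, int batch, int filters, int spatial, float *mean)
{
    for (int i = 0; i < filters; ++i) {
        float sum = 0.f;
        for (int j = 0; j < batch; ++j) {
            for (int k = 0; k < spatial; ++k) {
                sum += x[(size_t)j * filters * spatial + (size_t)i * spatial + k];
            }
        }
        mean[i] = sum / (spatial * batch);
    }
}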
Thanks for reading! Here you have YOLO2 training on VOC that works as expected, and the LOSS is going down, with no more math mistakes. ;-),
p ;).
Good morning Piotr,
I have been following your blog for some time (thanks for your efforts!) but have not had a chance to try your code. I will start now (well…maybe tomorrow). I will be using a Khadas VIM3, which uses an Amlogic ARM-based processor and an on-chip GPU.
A couple of questions. I noticed that you were able to run this on a Beagleboard. Have you tried training a “standard” model (i.e. not “tiny Darknet”) on the Beagleboard? I have found anecdotally that training on these small boards has caused them to crash. From my (limited) knowledge of OpenCL, large models would slow it down, but wouldn’t memory get swapped out in the case that the model/training grows too large to hold in local memory? If that is the case, then it would seem that memory shouldn’t be a problem as long as there is a large enough swap space…but there could still be a memory leak somewhere.
My next question is about this blog post’s topic, the “fine tuning problem”. You seemed to indicate that there was some math problem. Was this problem in the OpenCL code, in the original Darknet code…or somewhere else? If I start with your latest version, will these changes be reflected in the code?
Best regards
Hello User1234,
My lesson learned is that the Beagleboard has a DSP, not a GPU. My work is an OpenCL version of the original darknet, which was written in CUDA. That makes it quite unique for devices like the Khadas VIM3 and others with a Mali GPU and OpenCL support. My code is up to date at https://github.com/sowson/darknet ;-).
This particular change reduced YOLO2 training on VOC from 108 hours to 44 hours, so it was both a big optimization and a math fix. The first, longer training run was incorrect from a math perspective.
Thanks for your comment, and enjoy!