GPU OpenCL Fine-Tuning Problem Solution

Hello! After about half a year I have finally found the solution to the biggest optimization issue I have had so far in the Darknet on OpenCL project. It was really hard to solve, and I even posted about it on the AMD Community forum. Early this morning I posted the solution there; it is shown below.

CPU CODE THAT INVOKES THE GPU KERNEL AND USES BOTH GLOBAL AND LOCAL THREAD SPACES!

void fast_mean_gpu(cl_mem_ext x, int batch, int filters, int spatial, cl_mem_ext mean)
{
    // now computed dynamically, but a constant such as 16 or 8 also works
    int tuning = filters / 4;
    dim2 dimGridG1;
    // the key change: the global space is a multiplication, tuning * filters threads!
    dimGridG1 = dim2_create(tuning, filters);
    dim2 dimGridL1;
    // the local space groups the tuning threads that cooperate on one filter!
    dimGridL1 = dim2_create(tuning, 1);
    opencl_kernel_local(
        opencl_fast_mean_kernel[opencl_device_id_t],
        dimGridG1, dimGridL1, 14,
        &tuning, sizeof(cl_int),
        // passing NULL with a size allocates the __local sums array
        // dynamically, so no constant size is hard-coded on the GPU side!
        NULL, tuning*sizeof(cl_float),
        &filters, sizeof(cl_int),
        &batch, sizeof(cl_int),
        &spatial, sizeof(cl_int),
        &x.mem, sizeof(cl_mem),
        &mean.mem, sizeof(cl_mem));
}

GPU CODE THAT IS ACCELERATED WITHOUT ATOMICS, THANKS TO A DATA-MODEL PERFORMANCE WIN!

__kernel void fast_mean_kernel(int tuning, __local float *sums, int filters, int batch, int spatial, __global float *x, __global float *mean)
{
    // the filter index now comes from dimension 1 instead of dimension 0!
    int i = get_global_id(1);
    // the reduction index comes from dimension 0, the first global dimension!
    int t = get_global_id(0);
    sums[t] = 0;
    int j,k;
    for (j = 0; j < batch; ++j) {
        for (k = t; k < spatial; k += tuning) {
            int index = j * filters * spatial + i * spatial + k;
            // I tried an atomic add, but it would need a different data layout to be fast!
            sums[t] += x[index];
        }
    }
    // sums lives in __local memory, so a local memory fence is the correct one here
    barrier(CLK_LOCAL_MEM_FENCE);
    if (t == tuning-1) {
        mean[i] = 0;
        int s;
        for(s = 0; s < tuning; ++s) {
            mean[i] += sums[s];
        }
        mean[i] /= (spatial * batch);
    }
}

Thanks for reading! Here is YOLO2 training on VOC that now works as expected: the LOSS is going down, and there are no math mistakes anymore. ;-)

YOLO2-VOC-TRAINING

p ;).

2 Replies to “GPU OpenCL Fine-Tuning Problem Solution”

  1. Good morning Piotr,

    I have been following your blog for some time (thanks for your efforts!) but have not had a chance to try your code. I will start from now (well…maybe from tomorrow). I will be using a Khadas VIM3 which uses an Amlogic ARM-based processor and on-chip GPU.

    A couple of questions. I noticed that you were able to run this on a Beagleboard. Have you tried training a “standard” (i.e. not “tiny Darknet”) model using the Beagleboard? I have found anecdotally that training on these small boards has caused them to crash. From my (limited) knowledge of OpenCL, large models would slow it down, but wouldn’t memory get swapped out in the case that the model/training grows too large to hold in local memory? If so, then memory shouldn’t be a problem given a large enough swap space…but there could still be a memory leak somewhere.

    My next question is about this blog’s topic, the “fine tuning problem”. You seemed to indicate that there was some math problem. Was this problem part of the OpenCL code, the original Darknet code…or somewhere else? If I start with your latest version, will these changes be reflected in the code?

    Best regards

    • Hello User1234,

      My lesson learned is that the Beagleboard has a DSP, not a GPU. And my work is an OpenCL version of the original Darknet, which was written in CUDA. That makes my work quite unique for a device like the Khadas VIM3 and others with a Mali GPU and OpenCL support. My code is up to date at https://github.com/sowson/darknet ;-).

      This particular change reduced YOLO2 training on VOC from 108 hours to 44 hours, so it was both a big optimization and a math fix. The first, longer training was incorrect from a math perspective.

      Thanks for your comment, and enjoy!
