GPU OpenCL Fine-Tuning Problem Solution

Hello. After about half a year, I have finally found the solution to the biggest optimization issue I have had so far in the Darknet on OpenCL port. It was really hard to solve, and I even asked about it in an AMD Community post. Very early this morning I posted the solution shown below to that thread.

CPU CODE THAT INVOKES THE GPU CODE AND USES BOTH GLOBAL AND LOCAL THREAD SPACES!

void fast_mean_gpu(cl_mem_ext x, int batch, int filters, int spatial, cl_mem_ext mean)
{
    // Work-items per filter. Computed dynamically here, but a constant such as 16 or 8 also works.
    int tuning = filters / 4;
    // Guard against filters < 4, which would otherwise give a zero-sized work-group.
    if (tuning < 1) tuning = 1;
    // Global space: tuning work-items for each filter (tuning * filters threads in total).
    dim2 dimGridG1 = dim2_create(tuning, filters);
    // Local space: one work-group of tuning work-items per filter.
    dim2 dimGridL1 = dim2_create(tuning, 1);
    opencl_kernel_local(
        opencl_fast_mean_kernel[opencl_device_id_t],
        dimGridG1, dimGridL1, 14,
        &tuning, sizeof(cl_int),
        // Local sums array sized at run time (NULL pointer plus byte size), so the kernel needs no compile-time constant.
        NULL, tuning*sizeof(cl_float),
        &filters, sizeof(cl_int),
        &batch, sizeof(cl_int),
        &spatial, sizeof(cl_int),
        &x.mem, sizeof(cl_mem),
        &mean.mem, sizeof(cl_mem));
}
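
The choice of `tuning` above decides the shape of both thread spaces, so it may help to see the arithmetic on its own. The following is only a sketch under assumptions: `pick_tuning` and `num_groups` are hypothetical helpers that are not part of Darknet, and `max_wg` stands for whatever the device reports as `CL_DEVICE_MAX_WORK_GROUP_SIZE`. The divisibility check reflects the OpenCL 1.x rule that each global size must be a multiple of the matching local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: work-items per filter, following the filters / 4 rule
 * from the host code, clamped to [1, max_wg] (max_wg is an assumed device limit). */
size_t pick_tuning(size_t filters, size_t max_wg)
{
    size_t t = filters / 4;
    if (t < 1) t = 1;           /* filters < 4 would give an empty work-group */
    if (t > max_wg) t = max_wg; /* stay within the device limit */
    return t;
}

/* Hypothetical helper: number of work-groups along one dimension.
 * OpenCL 1.x requires the global size to be divisible by the local size. */
size_t num_groups(size_t global, size_t local)
{
    assert(local > 0 && global % local == 0);
    return global / local;
}
```

With global sizes `(tuning, filters)` and local sizes `(tuning, 1)`, this gives one work-group along dimension 0 and `filters` work-groups along dimension 1, that is, one work-group per filter.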

GPU CODE THAT IS ACCELERATED WITHOUT ATOMICS: THE DATA MODEL WINS THE PERFORMANCE!

__kernel void fast_mean_kernel(int tuning, __local float *sums, int filters, int batch, int spatial, __global float *x, __global float *mean)
{
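    // Filter index: changed from dimension 0 to dimension 1 of the global thread space.
    int i = get_global_id(1);
    // Lane within the filter's work-group: dimension 0 of the global thread space.
    // It equals get_local_id(0), because the global and local sizes in dimension 0 are both tuning.
    int t = get_global_id(0);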
    sums[t] = 0;
    int j,k;
    for (j = 0; j < batch; ++j) {
        for (k = t; k < spatial; k += tuning) {
            int index = j * filters * spatial + i * spatial + k;
            // I tried atomic add first; giving each work-item its own sums slot makes it unnecessary.
            sums[t] += x[index];
        }
    }
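    // Required by the OpenCL memory model: make every partial sum visible to work-item 0.
    // Skipping it may appear to work on some hardware, but that is not guaranteed.
    barrier(CLK_LOCAL_MEM_FENCE);
    if (t == 0) {
        // Accumulate in a private variable and write global memory once.
        float total = 0;
        int s;
        for(s = 0; s < tuning; ++s) {
            total += sums[s];
        }
        mean[i] = total / (spatial * batch);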
    }
}
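
To check the kernel's result, a plain CPU version of the same data model can be useful. The sketch below is an assumption-laden reference, not code from Darknet: `fast_mean_cpu` is a hypothetical name, and the two outer loops stand in for the work-groups (one per filter) and the work-items (one per `sums` slot). It uses the same strided indexing as the kernel, so each slot collects every `tuning`-th spatial element before slot totals are combined.

```c
#include <assert.h>

/* Hypothetical CPU reference for the kernel's per-filter mean.
 * x is laid out as [batch][filters][spatial]; sums must hold tuning floats. */
void fast_mean_cpu(const float *x, int batch, int filters, int spatial,
                   int tuning, float *sums, float *mean)
{
    assert(tuning > 0);
    for (int i = 0; i < filters; ++i) {        /* stands in for one work-group */
        for (int t = 0; t < tuning; ++t) {     /* stands in for one work-item */
            sums[t] = 0.0f;
            for (int j = 0; j < batch; ++j) {
                for (int k = t; k < spatial; k += tuning) {
                    int index = j * filters * spatial + i * spatial + k;
                    sums[t] += x[index];       /* private slot, no atomics */
                }
            }
        }
        /* All partial sums are complete here; on the GPU this point
         * corresponds to the work-group barrier. */
        float total = 0.0f;
        for (int s = 0; s < tuning; ++s) {
            total += sums[s];
        }
        mean[i] = total / (float)(spatial * batch);
    }
}
```

Comparing `mean` from this function with the buffer read back from the GPU, within a small floating-point tolerance, is one way to catch a missing synchronization or a wrong index.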

Thanks for reading! Here is YOLO2 training on VOC, which now works as expected: the loss is going down and there are no more math mistakes. ;-)

YOLO2-VOC-TRAINING

p ;).
