GPU Computing on OpenCL

Hi, today I have something charming for all readers. The subject is GPU Computing on OpenCL.

Abstract: Have you ever wanted to sum 10 million random numbers in an array in C? Did you know that a GPU can do it about 20k times faster than a CPU? This post is about GPU computing, and all examples are written in C with OpenCL 1.2+. Basic knowledge of C is necessary to understand the content. From the technical standpoint, the goal of each example is to compute the same thing on both the CPU and the GPU, compare the two results, and discuss them, so expect each example in both flavors. All code examples are available at https://github.com/sowson/gpucomp. After those simple examples, which should make you comfortable with OpenCL, comes another important project of mine, https://github.com/sowson/darknet. It is a fork of https://github.com/pjreddie/darknet, a very successful implementation of a Convolutional Neural Network (CNN) engine with the famous Yolo1, Yolo2, and Yolo3 models, including Yolo3-SPP, which has more than 15.4k stars and 9.4k forks on GitHub. The original project is written in CUDA, not OpenCL, and it is faster than my OpenCL implementation, but the OpenCL version runs not only on NVIDIA hardware but also on Intel, AMD, and ARM-based platforms. Some examples of computation and training will be presented on CIFAR-10, an industry-standard benchmark for validating CNN engines. This post also touches on image classification for images and video files. Gartner’s publications for business make it clear that image classification on pictures and video streams is now one of the hottest research disciplines in artificial intelligence.
Introduction: Have you ever wanted to accelerate one of your CPU-bound C programs by, for example, 20k times? Did you know that Darknet, the Convolutional Neural Network (CNN) engine written in C that uses CUDA to run on the GPU, has more than 15.4k stars and more than 9.4k forks on GitHub? Did you know that GPU or FPGA accelerators can make this happen through the OpenCL library? Are you looking for a good starting point to become an artificial intelligence specialist who takes detection algorithms from minutes to nanoseconds? If you answered yes to any of those questions, this post is definitely for you! Modern research and development units in many industries need the fastest possible computation for AI-driven decision-making. For example, Gartner researchers, who provide predictions and summaries for many kinds of business, say that deep learning on CNNs is the hottest subject for researchers around the globe. All of these computations can be done on GPUs instead of CPUs. Many solutions based on robots and autonomous agents need GPU computing to accelerate their daily analysis. Airlines, car manufacturers, the entertainment industry, and even the marketing units of many companies are looking for AI engineers who can write efficient algorithms that improve decision-making every day. If you want to become one of them, or hire one, this post is also for you! What I expect from you is basic knowledge of the C language. With CLion, an excellent IDE for C programmers from JetBrains® that runs on Apple® macOS, GNU/Linux, and Microsoft® Windows, you will be able to follow the examples in this post easily. All examples will be tested on the following computers.
The first is a PC workstation running GNU/Linux Ubuntu 18.04 with an Intel® i7-5960X CPU and two NVIDIA® Titan RTX GPU cards. The second is an Apple® MacBook Pro 2018 13″ with an Intel® i7-8559U CPU, an Intel® Iris Plus Graphics 655 GPU, and a Sonnet Breakaway Puck AMD® Radeon 570 eGPU. The third is an Asus® Tinker Board S, a small form-factor single-board computer with a Mali-T760 GPU. All of them support at least OpenCL 1.2. Why do I believe OpenCL 1.2 and the C language are a perfect combination? First, every device and test environment listed here supports them. They may seem old, but they are also very powerful, and the same code can run even faster on the GPU than on the CPU, which is what this post is all about. The CPUs, GPUs, and computers used here are generally available in the mainstream consumer market; nothing is reserved for big data centers or supercomputing laboratories. I believe this matters as well: it brings the power of computing, algorithms, deep learning, and Convolutional Neural Networks (CNN) to research with a limited budget, which is important for the readers of this post.
First Impression: Let me show you a comparison of two simple implementations that compute the sum of all elements of an array of 10 million random integers. I believe this example will show you the power of GPU computing and give you a first impression of it. All examples were well tested, because I want to give you a good foundation for learning GPU computing. Now we can answer the question from the abstract about summing 10 million random numbers: how do you do it in C on the CPU, and how do you accelerate it on the GPU? Let’s go for the first friendly and clean C code example.

//CPU (Intel i7-5960X)
int sum()
{
    int sum = 0;
    const int* set = _gen_randoms;
    int i = 0;
    for(i = 0; i < N; ++i) {
        sum += set[i];
    }
    return sum;
}
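//GPU (NVIDIA Titan RTX)
// atom_xchg comes from the
// cl_khr_global_int32_base_atomics extension.
inline void sum(
__global int* a,
int v)
{
    int s = v; // amount still to be added
    int n = 0;
    int o = 0;
    do {
        // take the current total out, leaving 0
        n = s + atom_xchg(a, o);
        // put the new total back; s receives whatever
        // another work-item stored in the meantime
        s = o + atom_xchg(a, n);
    }
    while (s != o); // repeat until nothing was lost
}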
__kernel void sum_kernel(
__global int *set,
__global int* out)
{
    int i = (get_group_id(0) +
             get_group_id(1) *
             get_num_groups(0))
           * get_local_size(0) +
             get_local_id(0);
    sum(out, set[i]);
}
RESULT
EX01: SUM INT
SUM of 10 000 000 random numbers:
Compare 1:
Exec of sum_cpu took 20 000 ticks
with result 25 002 358.
Exec of sum_gpu took 0 ticks
with result 25 002 358.

When you look at the EX01 listing above, you may think it is too little code, and you are correct. All the OpenCL plumbing, such as the context, the queue, and the kernel compilation, is left out; you will find those missing pieces in the examples repository at https://github.com/sowson/gpucomp. The intention here is to show you the essence of the OpenCL code: the first part is pure C code for the CPU, and the second is the pure kernel code for the GPU. Be aware that the GPU code uses atomic functions that some OpenCL implementations do not provide. For example, on the Asus Tinker Board S, the Mali GPU implementation you may use there does not have those atomic extensions. That is no blocker for using OpenCL for Convolutional Neural Networks in general, and the ex06 listings further below give a more realistic computation example of that kind.
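Before moving on, a side note for devices without the atomic extensions. One common alternative (not the approach used in the kernel above) is to let every work-group add up only its own elements into its own slot and to add the slots together at the end, so that no two groups ever write to the same location. The sketch below is a plain C emulation of that idea on the CPU, not OpenCL code and not code from the repository. The names global_index, sum_by_groups, and partial are hypothetical, and the index formula is simply the one from sum_kernel written with ordinary integers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the flat index that sum_kernel computes for one
   work-item, with plain integers instead of get_group_id() and friends. */
size_t global_index(size_t group_x, size_t group_y,
                    size_t num_groups_x, size_t local_size,
                    size_t local_id)
{
    return (group_x + group_y * num_groups_x) * local_size + local_id;
}

/* CPU emulation of an atomics-free sum: each work-group adds only the
   elements owned by its work-items into its own slot of partial[],
   and the slots are added together afterwards.
   partial must have room for num_groups_x * num_groups_y entries. */
long long sum_by_groups(const int *set, size_t n,
                        size_t num_groups_x, size_t num_groups_y,
                        size_t local_size, long long *partial)
{
    assert(num_groups_x > 0 && num_groups_y > 0 && local_size > 0);
    long long total = 0;
    for (size_t gy = 0; gy < num_groups_y; ++gy) {
        for (size_t gx = 0; gx < num_groups_x; ++gx) {
            size_t group = gx + gy * num_groups_x;
            partial[group] = 0;
            for (size_t lid = 0; lid < local_size; ++lid) {
                size_t i = global_index(gx, gy, num_groups_x,
                                        local_size, lid);
                if (i < n) { /* the grid may be larger than the data */
                    partial[group] += set[i];
                }
            }
        }
    }
    /* final pass: add the per-group slots */
    for (size_t g = 0; g < num_groups_x * num_groups_y; ++g) {
        total += partial[g];
    }
    return total;
}
```

For example, 10 elements covered by a 3 x 2 grid of groups with 2 work-items each leave the last group without any data, which is why the bounds check on i is there. On a real device the inner loop would run in parallel inside each group, and the final pass could run on the host or in a second kernel.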

// CPU Neural Network Computation
void ex06_net_cpu() {
    network net = *ex06_net;
    int i;
    layer I = net.L[0];
    for(i = 0; i < I.n; ++i)
    {
        I.N[i] = (float)(rand() % 4 + 1);
    }
    clock_t t = ex06_benchmark_start();
    for(i = 1; i < net.l; ++i)
    {
        layer L = net.L[i];
        layer B = net.L[i-1];
        int j;
        int w = 0;
        for(j = 0; j < L.n; ++j) {
            float sum = .0f;
            int k;
            for(k = 0; k < L.b; ++k, ++w) {
                sum += B.N[k] * L.W[w];
            }
            L.N[j] = L.A(sum);
        }
    }
    layer O = net.L[net.l - 1];
    ex06_benchmark_stop(t, "net_cpu");
    printf(" with result %.4f.\n", O.N[0]);
}
// GPU Kernel for Neural Network Computation
static const char* const ex06_net_kernel_source
= CONVERT_KERNEL_TO_STRING(
        typedef enum {
            LINEAR,
            TANH
        } activation;
        float activation_linear(float x);
        float activation_tanh(float x);
        float activation_linear(float x)
        {return x;}
        float activation_tanh(float x)
        {return (2.f/(1 + exp(-2.f*x)) - 1);}
        __kernel void net_kernel(
        int b,
        __global float *B,
        __global float* W,
        __global float* N,
        int a)
        {
            int j =
            (get_group_id(0) +
            get_group_id(1) *
            get_num_groups(0)) *
            get_local_size(0) +
            get_local_id(0);
            // start at this neuron's own weights, as the CPU loop does
            int w = j * b;
            int k = 0;
            N[j] = 0.f;
            for(k = 0; k < b; ++k, ++w) {
                N[j] += B[k] * W[w];
            }
            N[j] = a == LINEAR
            ? activation_linear(N[j]) :
                   a == TANH
            ? activation_tanh(N[j])   :
                   0;
        }
);
// GPU Neural Network Computation
void ex06_net_gpu()
{
    network net = *ex06_net;
    int i;
    layer I = net.L[0];
    for(i = 0; i < I.n; ++i)
    {
        I.N[i] = (float)(rand() % 4 + 1);
    }
    opencl_push_array(I.Ng, I.N);
    clock_t t = ex06_benchmark_start();
    for(i = 1; i < net.l; ++i)
    {
        layer L = net.L[i];
        layer B = net.L[i-1];
        dim2 dimGrid = opencl_gridsize(L.n);
        opencl_kernel(
        ex06_net_kernel[0], dimGrid, 10,
        &L.b, sizeof(cl_int),
        &B.Ng.org, sizeof(cl_mem),
        &L.Wg.org, sizeof(cl_mem),
        &L.Ng.org, sizeof(cl_mem),
        &L.a, sizeof(cl_int)
        );
    }
    layer O = net.L[net.l - 1];
    ex06_benchmark_stop(t, "net_gpu");
    opencl_pull_array(O.Ng, O.N);
    printf(" with result %.4f.\n", O.N[0]);
}
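
A note on weight indexing may help when comparing the two versions. The CPU loop walks the weights with a single running index w, so neuron j ends up reading W[j*b] through W[j*b + b - 1]. A work-item that handles neuron j on its own has no such shared counter and needs to start from the offset j * b to read the same weights. The sketch below is a minimal, CPU-only illustration of that equivalence. The function names are hypothetical, and a linear activation stands in for the L.A function pointer.

```c
#include <assert.h>

/* Hypothetical activation, standing in for the L.A function pointer. */
float activation_linear_cpu(float x) { return x; }

/* Sequential form, as in the CPU loop: one running index w walks all
   n * b weights in order. */
void forward_running_index(const float *B, const float *W, float *N,
                           int n, int b)
{
    int w = 0;
    for (int j = 0; j < n; ++j) {
        float sum = 0.f;
        for (int k = 0; k < b; ++k, ++w) {
            sum += B[k] * W[w];
        }
        N[j] = activation_linear_cpu(sum);
    }
}

/* Per-neuron form, as a single work-item would run it: neuron j starts
   at its own offset j * b instead of relying on a shared counter. */
void forward_one_neuron(const float *B, const float *W, float *N,
                        int j, int n, int b)
{
    assert(b > 0);
    if (j >= n) {
        return; /* extra work-items beyond the layer do nothing */
    }
    float sum = 0.f;
    for (int k = 0; k < b; ++k) {
        sum += B[k] * W[j * b + k];
    }
    N[j] = activation_linear_cpu(sum);
}
```

Calling forward_one_neuron once for every j, in any order, fills N with the same values as forward_running_index. The j >= n check is an assumption about how the grid is sized: if the grid is rounded up past the number of neurons, the extra work-items have to be told to do nothing.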

I included the above example to show you how easily you can model a neural network of any kind, for example a perceptron or another architecture. I deliberately did not provide the result of the computation; please run those benchmarks on your own. With this post, I would like to encourage you to do more GPU computing experiments. My journey started with https://github.com/sowson/darknet, and there are already examples of using it, so I will add only the results of two training runs on CIFAR-10 below.
[Figure: CIFAR-10 training results, CUDA version of Darknet]

[Figure: CIFAR-10 training results, OpenCL version of Darknet]

Quite similar. There are some differences, however: OpenCL is slower than CUDA, but the shape of the learning curve looks very similar. There is one more thing you may check on your own: the validation of the computation results. I am leaving you with this exercise. Enjoy!
p ;).
