Mac Moria Miner Project Failure

Hello! Every failure is a lesson, so let me describe my experiment and the story behind it… I dream about the most robust possible macOS machine for GPU computing. Let’s start with the hardware… The way it looks is not really a design, so let’s call it the look and feel ;-). My apologies to Apple, please forgive me. This was only for tests and for my Ph.D. studies, which I plan to try again in the future. Do not look at it if you can help it… pls :D.

FROM THE TOP, it looks like four regions and… a powerful PSU ;-).
MMM01

FROM THE FRONT, well… you can see the special risers, which are x16 to x16 capable; since I use 4 GPUs here, each one works in x8 Gen3 mode.
MMM02

POWERED ON… I like those beautiful RED Radeon logos, which look like product names, but they can be disabled… in Windows, if someone needs that trick… ;-).
MMM03

So, let’s test it! :D I am using the LuxMark benchmark… it runs on OpenCL, and for my Ph.D. research I like the macOS OpenCL implementation. I was super happy when I saw the results of this computation, which you can find below.

MMM04

Nice, isn’t it? ;-) That is what I thought until I started to test Darknet on OpenCL and… well, let me show you, step by step, the idea of training a tiny YOLOv4 CNN model.

MMM05

So what is the failure, then? It detected all 4 GPUs and ran excellently… well.

MMM06

The speed was reasonable at this point, and I was pretty happy until…

MMM07

Until the moment the calculation goes to NaN. Then the game is over.

MMM08

NaN means “not a number”, and once it shows up, the GPUs are no longer computing correctly. Why did this project fail? Because the most potent hardware possible does not calculate correctly. Why? Maybe the GPUs are faulty, which I doubt; maybe Apple shipped a bad driver in an early version and did not have the hardware to test it :D; or perhaps I simply did something wrong with OpenCL in my project, which I doubt as well :P. In the end, I have two points: do not do this at home, and especially do not spend half a year of savings on hardware that does not work because of, I am sure, bad drivers!
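One way to catch the exact moment a buffer goes bad is to scan it on the host after pulling it back from the GPU. The following is only a minimal sketch: `first_non_finite` is a hypothetical helper name, not a Darknet function, and it assumes the data has already been copied into a plain `float` array.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not part of Darknet): returns the index of the
   first NaN or +/-Inf in data[0..n-1], or -1 if every value is finite. */
static long first_non_finite(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        /* isfinite() is false for NaN and for both infinities. */
        if (!isfinite(data[i])) {
            return (long)i;
        }
    }
    return -1;
}
```

Calling it on the weights or the loss after every iteration, and stopping at the first non-negative result, would at least show which buffer breaks first and on which device.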

I tried to examine the drivers… in the way shown below.

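// Runs a chain of math built-ins; each step feeds the previous result into the next one.
// With N = 1, only work-item 0 runs the chain, and it writes elements 0..6,
// so input and output must each hold at least 7 floats.
// The expected buffer is passed in but not used inside the kernel.
__kernel void test_kernel(int N, __global float *input, __global float *output, __global float *expected)
{
    // Flatten the 2D work-group grid into a linear work-item index.
    int index = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0);

    // Skip work-items beyond the requested range.
    if (index >= N) return;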

    output[index] = sqrt(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = log(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = pow(input[index], output[index-2]);

    index += 1;
    input[index] = output[index-1];
    output[index] = -exp(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = fabs(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = sin(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = cos(input[index]);
}

And I invoke it this way.

void opencl_cpu_gpu_test()
{
    int N = 1;
    int s = 7;

    float* input = (float*)calloc(s, sizeof(float));
    float* output = (float*)calloc(s, sizeof(float));
    float* expected = (float*)calloc(s, sizeof(float));

    ///*
    input[0] = 2.f;
    output[0] = 0;
    expected[0] = 1.4142135381698608398438f;
    expected[1] = 0.3465735614299774169922f;
    expected[2] = 0.2234459370374679565430f;
    expected[3] = -1.2503780126571655273438f;
    expected[4] = 1.2503780126571655273438f;
    expected[5] = 0.9491037726402282714844f;
    expected[6] = 0.5824118852615356445312f;
    //*/

    /*
    input[0] = 5.f;
    output[0] = 0;
    expected[0] = 2.2360680103302001953125f;
    expected[1] = 0.8047189712524414062500f;
    expected[2] = 0.6151968240737915039062f;
    expected[3] = -1.8500206470489501953125f;
    expected[4] = 1.8500206470489501953125f;
    expected[5] = 0.9612694978713989257812f;
    expected[6] = 0.5724795460700988769531f;
    */

    cl_mem_ext input_gpu = opencl_make_array(input, s);
    cl_mem_ext output_gpu = opencl_make_array(output, s);
    cl_mem_ext expected_gpu = opencl_make_array(expected, s);

    printf("\n");

    printf("TEST CPU:\n");
    int index = 0;
    output[index] = sqrtf(input[index]);
    printf("sqrt(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    input[index] = output[index-1];
    output[index] = logf(input[index]);
    printf("log(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    input[index] = output[index-1];
    output[index] = powf(input[index], output[index-2]);
    printf("pow(%.22f, %.22f) = %.22f", input[index], output[index-2], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    input[index] = output[index-1];
    output[index] = -expf(input[index]);
    printf("exp(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    input[index] = output[index-1];
    output[index] = fabsf(input[index]);
    printf("fabs(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    input[index] = output[index-1];
    output[index] = sinf(input[index]);
    printf("sin(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    input[index] = output[index-1];
    output[index] = cosf(input[index]);
    printf("cos(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    sleep(1);

    printf("\n");

    index = 0;
    output[0] = 0;
    output[1] = 0;
    printf("TEST GPU:\n");
    test_kernel_gpu(N, input_gpu, output_gpu, expected_gpu);
    opencl_pull_array(input_gpu, input, s);
    opencl_pull_array(output_gpu, output, s);
    printf("sqrt(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    printf("log(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    printf("pow(%.22f, %.22f) = %.22f", input[index], output[index-2], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    printf("exp(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    printf("fabs(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    printf("sin(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    index += 1;
    printf("cos(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");
    sleep(1);

    printf("\n");

    opencl_free(input_gpu);
    opencl_free(output_gpu);
    opencl_free(expected_gpu);
}

Unfortunately, this does not look like a math problem. It looks more like a “processing” issue.

MMM09

So, I am still trying, but it looks like those drivers have an issue because, on a “GREEN” GPU, everything in OpenCL is fine.

piotr@moria darknet-phd % ./darknet -i 0
Device IDs: 5
Device ID: 0
Device name: Intel(R) UHD Graphics 630
Device vendor: Intel Inc.
Device OpenCL availability: OpenCL 1.2 
Device opencl used: 1.2(Jul  9 2021 22:05:41)
Device double precision: NO
Device max group size: 256
Device address bits: 64

TEST CPU:
sqrt(2.0000000000000000000000) = 1.4142135381698608398438 PASS
log(1.4142135381698608398438) = 0.3465735614299774169922 PASS
pow(0.3465735614299774169922, 1.4142135381698608398438) = 0.2234459370374679565430 PASS
exp(0.2234459370374679565430) = -1.2503780126571655273438 PASS
fabs(-1.2503780126571655273438) = 1.2503780126571655273438 PASS
sin(1.2503780126571655273438) = 0.9491037726402282714844 PASS
cos(0.9491037726402282714844) = 0.5824118852615356445312 PASS

TEST GPU:
sqrt(2.0000000000000000000000) = 1.4142135381698608398438 PASS
log(1.4142135381698608398438) = 0.3465735614299774169922 PASS
pow(0.3465735614299774169922, 1.4142135381698608398438) = 0.2234459221363067626953 FAIL
exp(0.2234459221363067626953) = -1.2503780126571655273438 PASS
fabs(-1.2503780126571655273438) = 1.2503780126571655273438 PASS
sin(1.2503780126571655273438) = 0.9491037726402282714844 PASS
cos(0.9491037726402282714844) = 0.5824118256568908691406 FAIL

piotr@moria darknet-phd % ./darknet -i 1
Device IDs: 5
Device ID: 1
Device name: AMD Radeon RX 6900 XT Compute Engine
Device vendor: AMD
Device opencl availability: OpenCL 1.2 
Device opencl used: 1.2 (Jul  9 2021 21:55:26)
Device double precision: YES
Device max group size: 256
Device address bits: 32

TEST CPU:
sqrt(2.0000000000000000000000) = 1.4142135381698608398438 PASS
log(1.4142135381698608398438) = 0.3465735614299774169922 PASS
pow(0.3465735614299774169922, 1.4142135381698608398438) = 0.2234459370374679565430 PASS
exp(0.2234459370374679565430) = -1.2503780126571655273438 PASS
fabs(-1.2503780126571655273438) = 1.2503780126571655273438 PASS
sin(1.2503780126571655273438) = 0.9491037726402282714844 PASS
cos(0.9491037726402282714844) = 0.5824118852615356445312 PASS

TEST GPU:
sqrt(2.0000000000000000000000) = 1.4142135381698608398438 PASS
log(1.4142135381698608398438) = 0.3465735614299774169922 PASS
pow(0.3465735614299774169922, 1.4142135381698608398438) = 0.2234459370374679565430 PASS
exp(0.2234459370374679565430) = -1.2503780126571655273438 PASS
fabs(-1.2503780126571655273438) = 1.2503780126571655273438 PASS
sin(1.2503780126571655273438) = 0.9491037130355834960938 FAIL
cos(0.9491037130355834960938) = 0.5824118852615356445312 PASS
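A note on the PASS/FAIL check itself: it uses exact `==`, so any difference in the last bit counts as FAIL. The three failing GPU values in the logs above (pow and cos on the Intel device, sin on the AMD device) are each exactly one representable float away from the CPU value. As far as I know, the OpenCL specification allows built-ins such as `pow`, `sin` and `cos` an error of a few ULPs, so a tolerance-based check may separate ordinary rounding from real faults. This is a minimal sketch; the names `ulp_distance` and `nearly_equal` and the choice of tolerance are assumptions, not part of the original test.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto an ordered integer line so that
   adjacent representable floats differ by exactly 1. */
static int64_t float_to_ordered(float x)
{
    int32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* copy the bits without aliasing issues */
    /* Negative floats are sign-magnitude; mirror them below zero. */
    return bits < 0 ? (int64_t)INT32_MIN - bits : (int64_t)bits;
}

/* Number of representable floats between a and b (0 means identical,
   +0.0 and -0.0 count as identical). NaN inputs are not allowed. */
static int64_t ulp_distance(float a, float b)
{
    assert(!isnan(a) && !isnan(b));
    int64_t d = float_to_ordered(a) - float_to_ordered(b);
    return d < 0 ? -d : d;
}

/* Returns 1 if a and b are within max_ulps of each other, 0 otherwise.
   A NaN on either side always compares as different. */
static int nearly_equal(float a, float b, int64_t max_ulps)
{
    if (isnan(a) || isnan(b)) return 0;
    return ulp_distance(a, b) <= max_ulps;
}
```

Replacing `output[index] == expected[index]` with something like `nearly_equal(output[index], expected[index], 4)` would turn the three FAIL lines into PASS. It would not explain the NaN during training, so that problem stays open.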

Btw, yes, this was done on macOS Monterey ;-). And the machine reaches 234 MH/s in ETH mining :P.

p ;).
