Hello! Every failure is a lesson, so let me describe my experiment and the story behind it… I dream about the most robust possible GPU-computing macOS machine. Let’s start with the hardware… The design, meaning the way it looks, is not really a design, so let’s call it the look and feel ;-). I am sorry, Apple, please forgive me. This was only for tests and for my future (second try) Ph.D. studies. Do not look at it if you can help it… pls :D.
ON TOP, it looks like four regions and… a powerful PSU ;-).
ON FRONT, well… you may see the special risers. They are x16 to x16 capable, but here, since I use 4 GPUs, each one works in x8 mode, Gen3.
ON POWER… I like those beautiful RED Radeon logos and product names, but they can be disabled… in Windows, if someone needs that trick… ;-).
So, let’s test it! :D. I am using the LuxMark benchmark… it is OpenCL-based, and for my Ph.D. research I like the macOS OpenCL implementation. I was super happy when I saw the results of this computation, which you can find below.
Nice, isn’t it? ;-). That is what I thought until I started to test Darknet on OpenCL and… well, let me show you, step by step, the idea of training a tiny YOLOv4 AI/ML CNN model.
So what is the failure, then? It detected all 4 GPUs and ran excellently… well.
The speed was reasonable at this point, and I was pretty happy until…
Until the moment the calculation goes to NaN, and then the game is over.
NaN means “Not a Number”, and once it shows up, the GPUs are producing wrong results. Why did this project fail? Because the most potent possible hardware is not calculating correctly. Why? Maybe the GPUs are faulty, which I doubt; maybe Apple shipped a bad driver in an early version and did not have the hardware to test it :D; or perhaps I simply did something wrong with OpenCL in my project, which I doubt as well :P. In the end, I have two points: do not do this at home, and especially do not spend half a year of savings on hardware that does not work because of, I am sure, the wrong drivers!
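Since a single NaN poisons every value computed from it, it helps to find where it first appears. Here is a minimal sketch in C, assuming the values have already been copied back to host memory as a plain float array; `first_non_finite` is a hypothetical helper for illustration, not part of Darknet or OpenCL.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: returns the index of the first NaN or infinity
 * in values[0..n), or -1 if every element is finite. */
long first_non_finite(const float *values, size_t n)
{
    assert(values != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        /* isfinite() is false for NaN, +inf and -inf */
        if (!isfinite(values[i])) {
            return (long)i;
        }
    }
    return -1;
}
```

Calling it on the weights or on a layer output after each training iteration would show the first iteration and position where a value stops being finite, which narrows down where to look.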
I tried to examine the drivers… with the test kernel shown below.
__kernel void test_kernel(int N, __global float *input, __global float *output, __global float *expected)
{
    int index = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0);

    if (index >= N) return;

    output[index] = sqrt(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = log(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = pow(input[index], output[index-2]);

    index += 1;
    input[index] = output[index-1];
    output[index] = -exp(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = fabs(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = sin(input[index]);

    index += 1;
    input[index] = output[index-1];
    output[index] = cos(input[index]);
}
And it is invoked this way.
void opencl_cpu_gpu_test()
{
    int N = 1;
    int s = 7;

    float* input = (float*)calloc(s, sizeof(float));
    float* output = (float*)calloc(s, sizeof(float));
    float* expected = (float*)calloc(s, sizeof(float));

    ///*
    input[0] = 2.f;
    output[0] = 0;
    expected[0] = 1.4142135381698608398438f;
    expected[1] = 0.3465735614299774169922f;
    expected[2] = 0.2234459370374679565430f;
    expected[3] = -1.2503780126571655273438f;
    expected[4] = 1.2503780126571655273438f;
    expected[5] = 0.9491037726402282714844f;
    expected[6] = 0.5824118852615356445312f;
    //*/

    /*
    input[0] = 5.f;
    output[0] = 0;
    expected[0] = 2.2360680103302001953125f;
    expected[1] = 0.8047189712524414062500f;
    expected[2] = 0.6151968240737915039062f;
    expected[3] = -1.8500206470489501953125f;
    expected[4] = 1.8500206470489501953125f;
    expected[5] = 0.9612694978713989257812f;
    expected[6] = 0.5724795460700988769531f;
    */

    cl_mem_ext input_gpu = opencl_make_array(input, s);
    cl_mem_ext output_gpu = opencl_make_array(output, s);
    cl_mem_ext expected_gpu = opencl_make_array(expected, s);

    printf("\n");
    printf("TEST CPU:\n");

    int index = 0;
    output[index] = sqrtf(input[index]);
    printf("sqrt(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    input[index] = output[index-1];
    output[index] = logf(input[index]);
    printf("log(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    input[index] = output[index-1];
    output[index] = powf(input[index], output[index-2]);
    printf("pow(%.22f, %.22f) = %.22f", input[index], output[index-2], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    input[index] = output[index-1];
    output[index] = -expf(input[index]);
    printf("exp(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    input[index] = output[index-1];
    output[index] = fabsf(input[index]);
    printf("fabs(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    input[index] = output[index-1];
    output[index] = sinf(input[index]);
    printf("sin(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    input[index] = output[index-1];
    output[index] = cosf(input[index]);
    printf("cos(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    sleep(1);
    printf("\n");

    index = 0;
    output[0] = 0;
    output[1] = 0;

    printf("TEST GPU:\n");
    test_kernel_gpu(N, input_gpu, output_gpu, expected_gpu);
    opencl_pull_array(input_gpu, input, s);
    opencl_pull_array(output_gpu, output, s);

    printf("sqrt(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    printf("log(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    printf("pow(%.22f, %.22f) = %.22f", input[index], output[index-2], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    printf("exp(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    printf("fabs(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    printf("sin(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    index += 1;
    printf("cos(%.22f) = %.22f", input[index], output[index]);
    printf(" %s\n", output[index] == expected[index] ? "PASS" : "FAIL");

    sleep(1);
    printf("\n");

    opencl_free(input_gpu);
    opencl_free(output_gpu);
    opencl_free(expected_gpu);
}
Unfortunately, it is not the math. It looks more like a “processing” issue.
So, I am still trying, but it looks like those drivers have an issue, because on the “GREEN” GPU everything in OpenCL is fine.
piotr@moria darknet-phd % ./darknet -i 0
Device IDs: 5
Device ID: 0
Device name: Intel(R) UHD Graphics 630
Device vendor: Intel Inc.
Device OpenCL availability: OpenCL 1.2
Device opencl used: 1.2(Jul 9 2021 22:05:41)
Device double precision: NO
Device max group size: 256
Device address bits: 64

TEST CPU:
sqrt(2.0000000000000000000000) = 1.4142135381698608398438 PASS
log(1.4142135381698608398438) = 0.3465735614299774169922 PASS
pow(0.3465735614299774169922, 1.4142135381698608398438) = 0.2234459370374679565430 PASS
exp(0.2234459370374679565430) = -1.2503780126571655273438 PASS
fabs(-1.2503780126571655273438) = 1.2503780126571655273438 PASS
sin(1.2503780126571655273438) = 0.9491037726402282714844 PASS
cos(0.9491037726402282714844) = 0.5824118852615356445312 PASS

TEST GPU:
sqrt(2.0000000000000000000000) = 1.4142135381698608398438 PASS
log(1.4142135381698608398438) = 0.3465735614299774169922 PASS
pow(0.3465735614299774169922, 1.4142135381698608398438) = 0.2234459221363067626953 FAIL
exp(0.2234459221363067626953) = -1.2503780126571655273438 PASS
fabs(-1.2503780126571655273438) = 1.2503780126571655273438 PASS
sin(1.2503780126571655273438) = 0.9491037726402282714844 PASS
cos(0.9491037726402282714844) = 0.5824118256568908691406 FAIL

piotr@moria darknet-phd % ./darknet -i 1
Device IDs: 5
Device ID: 1
Device name: AMD Radeon RX 6900 XT Compute Engine
Device vendor: AMD
Device opencl availability: OpenCL 1.2
Device opencl used: 1.2 (Jul 9 2021 21:55:26)
Device double precision: YES
Device max group size: 256
Device address bits: 32

TEST CPU:
sqrt(2.0000000000000000000000) = 1.4142135381698608398438 PASS
log(1.4142135381698608398438) = 0.3465735614299774169922 PASS
pow(0.3465735614299774169922, 1.4142135381698608398438) = 0.2234459370374679565430 PASS
exp(0.2234459370374679565430) = -1.2503780126571655273438 PASS
fabs(-1.2503780126571655273438) = 1.2503780126571655273438 PASS
sin(1.2503780126571655273438) = 0.9491037726402282714844 PASS
cos(0.9491037726402282714844) = 0.5824118852615356445312 PASS

TEST GPU:
sqrt(2.0000000000000000000000) = 1.4142135381698608398438 PASS
log(1.4142135381698608398438) = 0.3465735614299774169922 PASS
pow(0.3465735614299774169922, 1.4142135381698608398438) = 0.2234459370374679565430 PASS
exp(0.2234459370374679565430) = -1.2503780126571655273438 PASS
fabs(-1.2503780126571655273438) = 1.2503780126571655273438 PASS
sin(1.2503780126571655273438) = 0.9491037130355834960938 FAIL
cos(0.9491037130355834960938) = 0.5824118852615356445312 PASS
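A side note on the FAIL lines above: the test compares floats with ==, so it flags any difference, even one in the very last bit. Below is a minimal sketch in C of measuring the difference in ULPs (units in the last place) instead. The helper names `float_to_ordered`, `ulp_distance` and `within_ulps`, and the `max_ulps` tolerance parameter, are hypothetical and only for illustration; they are not part of Darknet or OpenCL, and the sketch assumes 32-bit IEEE 754 floats.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that grows monotonically
 * with the float's value (assumes 32-bit IEEE 754 floats). */
int64_t float_to_ordered(float x)
{
    int32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* safe type punning */
    /* Negative floats have the sign bit set; mirror them below zero. */
    return bits < 0 ? (int64_t)INT32_MIN - (int64_t)bits : (int64_t)bits;
}

/* Number of representable floats between a and b (0 means identical). */
int64_t ulp_distance(float a, float b)
{
    assert(!isnan(a) && !isnan(b));
    int64_t d = float_to_ordered(a) - float_to_ordered(b);
    return d < 0 ? -d : d;
}

/* Tolerant comparison: 1 if actual is within max_ulps of expected.
 * NaN never matches anything. */
int within_ulps(float actual, float expected, int64_t max_ulps)
{
    if (isnan(actual) || isnan(expected)) {
        return 0;
    }
    return ulp_distance(actual, expected) <= max_ulps;
}
```

With the values from the log, `ulp_distance(0.2234459370374679565430f, 0.2234459221363067626953f)` gives 1, and the failing sin and cos results are also exactly 1 ULP away from the CPU values. As far as I know, OpenCL 1.2 permits up to 16 ULP of error for pow and 4 ULP for sin and cos in single precision, so last-bit differences like these would not by themselves explain the NaN during training.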
Btw, yes, all of that was done on macOS Monterey ;-). And in ETH mining it reaches 234 MH/s :P.
p ;).