Our ACL 1.0 Beta 2 release contains several new features (which you can read more about here). One of those is AutoGemm, a new approach for achieving peak GEMM (GEneral Matrix-Matrix multiplication) performance on GPUs.
AutoGemm is a suite of Python scripts that:
- generates thousands of optimized GEMM OpenCL™ kernels,
- benchmarks these kernels for a particular GPU to determine which are the fastest, and
- automatically chooses the optimal kernel within clBLAS for peak performance.
OpenCL Kernel Generation
Though GEMM is a single function, it has to handle a large variety of inputs:
- 4 precisions (float, double, complex float, complex double)
- 9 transpose combinations (matrix A and B can be transposed, conjugate transposed or not transposed)
- M, N, and K can be any size, which affects the behavior and performance of the function
- and more…
Rather than writing one GPU kernel to address every possible combination, AutoGemm writes a different kernel specifically designed to achieve peak performance for each possible input combination.
Additionally, it can write multiple kernels that target the same input combination but perform differently on different GPUs (for example, large versus small GPUs).
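To give a feel for how quickly the kernel count grows, here is a minimal Python sketch that enumerates one kernel name per input combination. The naming scheme, the `enumerate_kernels` helper, and the idea of varying a tile shape per target GPU are assumptions made for illustration; they are not AutoGemm's actual identifiers or parameters.

```python
from itertools import product

# Hypothetical labels; AutoGemm's real identifiers may differ.
PRECISIONS = ["s", "d", "c", "z"]  # float, double, complex float, complex double
TRANSPOSES = ["N", "T", "C"]       # not transposed, transposed, conjugate transposed


def kernel_name(precision, trans_a, trans_b, tile_m, tile_n):
    """Build an illustrative kernel name for one input combination and tile shape."""
    return f"{precision}gemm_{trans_a}{trans_b}_{tile_m}x{tile_n}"


def enumerate_kernels(tile_shapes):
    """Return one kernel name per (precision, transA, transB, tile shape)."""
    return [
        kernel_name(p, ta, tb, tm, tn)
        for p, ta, tb in product(PRECISIONS, TRANSPOSES, TRANSPOSES)
        for tm, tn in tile_shapes
    ]


# Three assumed tile shapes, e.g. for small, medium, and large GPUs.
names = enumerate_kernels([(16, 16), (32, 32), (96, 96)])
print(len(names))  # 4 precisions * 9 transpose pairs * 3 tile shapes
print(names[0])
```

Even with only three tile shapes, the 4 precisions and 9 transpose combinations already yield over a hundred kernels; adding further tuning parameters is what pushes the total into the thousands.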
Benchmarking Kernels
After generating thousands of kernels, AutoGemm automatically benchmarks them on a given device and produces a kind of “look-up table” recording which kernel is ideal for which input parameters. Doing so improves performance dramatically, because different GPUs can have different numbers of registers, global memory bandwidth and latency, local memory bandwidth and latency, and local memory size. We believe in enabling peak performance for all.
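The benchmarking step can be sketched roughly as follows. This is not AutoGemm's code: `build_lookup_table`, the `measure` callback, and the toy cost model are all hypothetical stand-ins, and a real benchmark would enqueue each OpenCL kernel on the device and time it rather than evaluate a formula.

```python
def build_lookup_table(kernels, sizes, measure):
    """Map each problem size to the name of the fastest candidate kernel.

    kernels: dict of name -> object representing a kernel (hypothetical)
    sizes:   iterable of problem sizes to benchmark
    measure: function (kernel, size) -> cost in seconds (lower is better)
    """
    table = {}
    for size in sizes:
        timings = {name: measure(k, size) for name, k in kernels.items()}
        table[size] = min(timings, key=timings.get)
    return table


def toy_cost(tile):
    """Assumed cost model: large tiles have more fixed overhead but less
    work per element. It only exists so the example is deterministic."""
    return lambda size: tile + (size * size) / tile


# Stand-ins for two compiled kernels with different tile sizes.
kernels = {"tile16": toy_cost(16), "tile96": toy_cost(96)}

table = build_lookup_table(
    kernels,
    sizes=[8, 32, 64, 1024],
    measure=lambda kernel, size: kernel(size),
)
print(table)
```

Under this toy model the small tile wins for the small sizes and the large tile wins for the large ones, which is the kind of device-specific crossover the look-up table is meant to capture.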
Choosing Optimal Kernel
The look-up table gets compiled into the clBLAS library for fast execution during runtime. Thus, the clBLAS library can select and enqueue the fastest GPU kernel with low overhead and high performance.
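One low-overhead way to consult such a table at runtime is a binary search over size thresholds, sketched below in Python for readability. The thresholds, kernel names, and the choice to bucket on the smaller of M and N are assumptions for illustration; the compiled clBLAS table may be organized quite differently.

```python
from bisect import bisect_right

# Hypothetical size buckets: size < 32, 32 <= size < 512, size >= 512.
THRESHOLDS = [32, 512]
KERNELS = ["sgemm_NN_16x16", "sgemm_NN_32x32", "sgemm_NN_96x96"]


def select_kernel(m, n):
    """Pick a kernel name from the bucket containing min(m, n)."""
    size = min(m, n)
    return KERNELS[bisect_right(THRESHOLDS, size)]


print(select_kernel(8, 8))
print(select_kernel(32, 4096))
print(select_kernel(4096, 4096))
```

Because the table is fixed at build time, the lookup costs only a few comparisons per GEMM call, which is negligible next to the kernel launch itself.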
Customizability
For an application with unique GEMM requirements (such as very small or very skinny matrices), AutoGemm can be customized to generate application-specific kernels for additional performance.
Performance
In clBLAS 2.8.0, AutoGemm has significantly improved GEMM performance over the 2.6.0 release. The following table shows the speedup of AutoGemm over the 2.6.0 implementation, averaged over matrix sizes M, N, K = 8…5760.
| | Mean Speedup | Median Speedup |
| SGEMM | 38% | 11% |
| DGEMM | 23% | 14% |
| CGEMM | 45% | 49% |
| ZGEMM | 94% | 223% |
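The mean and median can differ considerably when a few matrix sizes see much larger gains than the rest. The sketch below shows one plausible way to compute both figures; the speedup formula is an assumption (the post does not state how the percentages were derived), and the throughput pairs are made-up numbers for illustration, not AMD measurements.

```python
from statistics import mean, median


def speedup_percent(old_gflops, new_gflops):
    """Assumed definition: relative throughput gain, in percent."""
    return (new_gflops / old_gflops - 1.0) * 100.0


# Made-up (old, new) throughput pairs, one per hypothetical matrix size.
samples = [(100.0, 110.0), (200.0, 220.0), (50.0, 100.0)]

speedups = [speedup_percent(old, new) for old, new in samples]
mean_speedup = mean(speedups)
median_speedup = median(speedups)
print(f"mean {mean_speedup:.0f}%, median {median_speedup:.0f}%")
```

Here a single size with a large gain pulls the mean well above the median, which is why reporting both gives a fuller picture than either alone.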
The graphs below show AutoGemm’s performance (green) on an AMD FirePro™ S9150 GPU versus cuBLAS 7.5 performance (blue) on a Tesla K40, both with GPU clocks pinned to maximum settings.



For more information regarding AutoGemm functionality, see the AutoGemm entry of the clBLAS wiki: https://github.com/clMathLibraries/clBLAS/wiki/AutoGemm
For more information regarding AutoGemm performance, see the raw performance numbers checked into the clBLAS repository: https://github.com/clMathLibraries/clBLAS/tree/master/doc/performance/clBLAS_2.7.1/S9150

David Tanner is a Senior Software Development Engineer on the AMD Compute Libraries team.
Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
OpenCL is a trademark of Apple Inc. used by permission by Khronos.
System configuration for all testing: AMD FirePro™ S9150: CentOS 6.6 Linux64, 14.502 AMD Driver, Intel Xeon E5-2640 v3 CPU, 125 GB RAM. K40: OpenSUSE 13.2 Linux64, 352.39 Nvidia Driver, Intel Core i5-4690K CPU, 16 GB RAM



