372x472_green2 Beta 2 of the clSPARSE library introduces sparse matrix–sparse matrix multiplication (SpGEMM) function, which currently supports the single-precision CSR format for sparse storage. Weifeng Liu developed and donated the SpGEMM algorithm (https://github.com/bhSPARSE). In this blog I discuss the algorithm implemented in the clSPARSE library and compare the performance of clSPARSE running on an AMD GPU with that of cuSPARSE on an Nvidia GPU.

The SpGEMM operation multiplies a matrix A of size m x k with a matrix B of size k x n, yielding a resulting matrix C of size m x n. It finds many applications in graph theory, linear solvers and sparse signal processing. Since data in these fields is naturally sparse, we must store the matrices in sparse format to save space and reduce computation costs—hence the need for an efficient SpGEMM algorithm.

The algorithm faces three primary challenges:

It is impossible to know a priori how much memory the resulting matrix requires, as the number of nonzero entries (nnz) in the sparse matrix C is unknown.
Parallel insert operations at random positions in the resulting sparse matrix are costly and dominate the execution time.
Load balancing is difficult owing to the diverse sparsity structures of the input matrices.

SpGEMM Algorithm

The clSPARSE SpGEMM algorithm comprises four stages that address the challenges mentioned above.

Stage 1

Previous SpGEMM algorithms employed four different solutions to determine how much memory the resulting matrix C requires.

1) Precise method,

2) Probabilistic method,

3) Upper-bound method, and

4) Progressive method.

In the precise method, a simplified SpGEMM (SpM-SpM) operation is pre-computed using the same computational pattern. Sparse Boolean matrices replace the actual matrices, and instead of floating operations, the algorithm performs Boolean operations. The Nvidia cuSPARSE library employs this kind of SpGEMM method. Even though this method exactly pre-computes the size of matrix C, it’s relatively expensive because it executes the SpGEMM operation twice in the same pattern.

The probabilistic method is based on random sampling; it estimates an imprecise nnz for C. Imprecise estimation and memory reallocation are the primary reasons why clSPARSE SpGEMM doesn’t use this method.

The upper-bound method calculates the upper bound of the number of nonzero entries in matrix C. It then allocates memory to an intermediate matrix that stores the candidate nonzero entries generated by the arithmetic operations. The method sorts the intermediate matrix by rows and columns and compresses it into matrix C by eliminating entries in duplicate positions. The problem with this approach is that the intermediate matrix may be too large to fit in the device global memory, reducing overall performance.

The progressive method first allocates memory of fixed size, then it starts the sparse-matrix computation and reallocates the buffer if more space is required.

The upper-bound method sacrifices space efficiency to improve performance, whereas the progressive method is good at saving space. The clSPARSE SpGEMM approach combines both of these methods. It creates an array U of size m corresponding to number of rows in C. A GPU wave item (thread) computes the upper bound of the nnz for each row in C and stores it in each entry of array U.

Stage 2

Binning deals with load balancing and memory pre-allocation. The computational complexity of clSPARSE SpM-SpM depends on the upper bound of nnz for each row in C. The algorithm groups all of these rows into 38 bins and classifies the bins into five super bin groups. The bins contain the entry indices for array U. The first bin group contains indices of rows for which nnz is 0. The second bin group contains indices of rows for which nnz is 1. To achieve better load balancing, matrix computation uses different techniques for each bin group.

The third bin group comprises 31 bins that contain row indices for which nnz is 2 to 32. This range aligns with the wavefront size of AMD GPUs. To maximize instruction throughput, each work item processes each row. Since each bin contains the rows of the same upper bound, the bins can be executed separately on the GPU using 31 different kernel programs for efficient load balancing and resource utilization.

The rows for which nnz is 33–64, 65–128, 129–256 and 257–512 constitute a fourth bin group consisting of four bins. This kind of grouping ensures that a wavefront can efficiently process each row and that the row’s nnz is small enough to exploit local memory. Here, the GPU’s low-level scheduling subsystem naturally guarantees load balancing.

The fifth bin group contains indices for rows with nnz greater than 512. This group is computationally demanding, since nnz can be very large, and there is no guarantee that it will fit in scratchpad memory. The hybrid method described in stage 1 allocates memory for the temporary sparse matrix in CSR format, using the notation C~. For row indices belonging to bin groups 1–4, the required memory is calculated on the basis of the upper-bound method; for bin group 5, it’s based on the progressive method. To exploit cache subsystems, the CPU executes all of stage 2.

Stage 3

In stage 3, the algorithm analyzes the nnz values in the temporary matrix and determines the nnz of matrix C. For bin groups 1–2, it updates the numbers of the corresponding nnz entries directly. For the rows in bin groups 3–5, the algorithm uses three different approaches: the heap method, bitonic ESC and the merge method, respectively. The goals of these methods are the same: sorting the nonzero values on the basis of their indices and combining the duplicate indices by summing their values.

In the heap method, one wave item processes one row of bin group 3. For that row, it creates an empty index-value-pair max-heap for an upper bound of size nnz. The heap resides in scratchpad memory and collects all candidate nonzero entries. A heap-sort operation then uses the column indices as the key to generate an ordered sequence of index-value pairs. The algorithm eliminates duplicate entries from the sorted sequence and finally moves the resultant sequence from scratchpad memory to the temporary matrix C~ in global memory. We now know the exact nnz values for each row in bin group 3, and we can update these values in matrix C.

In the case of bitonic ESC, which processes rows in bin group 4, a local-memory array holds all candidate nonzero entries. A basic bitonic algorithm sorts the array, and a prefix-sum scan compresses duplicate entries. The output of this compressed sorted sequence resides in the temporary matrix C~, and nnz entries are updated. Each bin from this bin group has a dedicated kernel.

Finally, for bin group 5, parallel insert operations are expensive owing to the size of the nnz values. The algorithm therefore converts insert operations into parallel merge operations that combine ordered sequences to ensure that the final sequence is ordered and duplicate free. This step assumes that the input matrices in CSR format are ordered by column index. According to the CSR standard, the column indices in each row need not be sorted, so this step is a precondition to using the SpM-SpM library routine. The clSPARSE routines for loading matrices from disk do guarantee this sorted property.

Each GPU wave group works on each row and does following steps:

Compute the candidate nonzero value and its location. Store these values in thread registers—call it the input sequence.
Search the result sequence in scratchpad memory for duplicate locations using a binary search.
Fuse values at the same location.
For non-duplicate entries, copy the input sequence to the result sequence.
Merge the two sequences in one continuous memory space.

After all input sequences merge, the resulting sequence is saved to the temporary matrix C~, and the number of nonzero entries in the row is updated. Whenever merge operations are disallowed because of fixed scratchpad memory, the current computation position undergoes bookkeeping and the result sequence from the scratchpad memory dumps to global memory. Then the host allocates more global memory (2x each time) and relaunches the kernel with a 2x-larger scratchpad memory. The relaunched kernel retrieves bookkeeping positions, loads the existing results to the scratchpad memory and continues computation. This extra overhead can be hidden because it is relatively small compared with a large computation and because global memory accesses are mostly coalesced.

Stage 4

This stage calculates the nnz entries for all rows in the temporary matrix C~, and it allocates the memory for matrix C. It copies all nonzero entries from C~ to C. For the rows in bin group 0, no copy is required; for bin group 2, each row uses one thread. All other groups use one wave group for each row. This stage completes the SpGEMM computation.

Performance

We compared the performance of clSPARSE’s SpGEMM against that of Nvidia’s cuSPARSE SpGEMM. We chose 21 sparse matrices covering diverse sparsity structures; all these matrices are available for download from the University of Florida Sparse Matrix Collection. Without sacrificing generality, we evaluated only multiplication of a sparse square matrix with itself (C = A * A). This approach offers a relatively fair platform for benchmarking the SpGEMM algorithms. The source code for the benchmark program that generated these numbers is available at the clSPARSE GitHub repository.

Figure 1. Comparison of clSPARSE and cuSPARSE SpGEMM.

Figure 1 shows the relative speedup of clSPARSE running on an AMD FirePro™ W9100 Professional Graphics card, compared with cuSPARSE running on an Nvidia Tesla K40. Performance is in megaflops and is the maximum number of operations divided by the elapsed CPU wall-clock time in milliseconds. The timestamps recorded in SpGEMM correspond to blocking API calls. To calculate the speedup we divide the W9100’s performance by the K40’s. On a very few matrices, cuSPARSE outperforms clSPARSE (values less than 1). For all other matrices, clSPARSE is clearly superior—more than 3.8x in the best cases.

The benchmark system details are the following:

clSPARSE SpGEMM on Ubuntu 14.04.3 LTS Linux64, AMD FirePro™ driver version 14.502.1040, running on AMD FirePro™ W9100Professional Graphics card, with AMD A10-7850K Radeon R7 and 4GB RAM
cuSPARSE SpGEMM on OpenSuse 13.2 Linux64, Nvidia driver version 352.39 and Cuda 5 Toolkit, running on Nvidia Tesla® K40, with Intel i5-4690K CPU and 16GB RAM

Conclusion

The AMD library team continues to raise the bar and improve the capabilities of the AMD Compute Libraries (ACL). The ACL components are all open-source projects; you can download the code at the clMath GitHub repository. The algorithms that this blog discusses are, of course, in the clSPARSE component of the library. Happy coding!

Kiran Varaganti is a Senior Member of Technical Staff Software Engineer on the AMD Compute Libraries team. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

General Sparse Matrix–Sparse Matrix Multiplication (SpGEMM)