Benchmarking sparse eigensolvers on CPU and GPU

Fábio Cruz, Maria Castro, and Luís Sarmento (Authors)
David Carvalho (Reviewer and Editor)

As part of our mission to solve high-impact, real-world problems using scientific computing and AI, at Inductiva we spend a lot of time benchmarking numerical methods and simulators on different hardware platforms.

Houston, we have an eigenproblem!

Finding eigenvalues and eigenvectors is a very common operation in many fields of science, because the numerical solution to many practical problems can be formulated as an eigenproblem.

Lately, we have been investigating the performance of several eigensolvers, i.e. routines for finding the eigenvalues and eigenvectors of a matrix (or some variation of this formulation).

Formally, an eigenproblem consists in finding the pairs \((\lambda, v_\lambda)\) that solve the equation:

\[A v_\lambda = \lambda v_\lambda \ ,\]

where \(A\) is a square matrix, \(\lambda\) is an eigenvalue of \(A\), and \(v_\lambda\) is the corresponding eigenvector.
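
To make the definition concrete, here is a minimal sketch in Python that verifies the eigenpairs of a small symmetric matrix (the matrix itself is just a hypothetical example):

# Verify that each eigenpair satisfies A v = lambda v.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Compute all eigenpairs of the (dense) symmetric matrix A.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Each eigenvector is a column of the returned matrix.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)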

It is quite fast to solve an eigenproblem for small matrices (e.g. a few hundred rows), but it becomes extremely challenging for very large matrices \(A\), with millions or billions of rows. Naturally, such very large matrices are exactly the ones that scientists and engineers have to deal with when addressing problems of realistic scales.

Even though increasing the size of the matrix increases the number of entries (quadratically), handling larger matrices is possible if such entries are correlated in some way. For instance, even though the identity matrix with \(10^6\) rows has \(10^{12}\) entries, its diagonal structure (i.e., the fact that its elements \(A_{ij}\) are 1 for \(i = j\) and 0 otherwise) renders it as easy to solve as the 2 × 2 identity matrix.
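
As a small illustration of how much structure helps, the sketch below (a hypothetical example, not part of the benchmark) builds a diagonal matrix with \(10^6\) rows in SciPy's sparse format: it stores only about \(10^6\) non-zero entries instead of \(10^{12}\), and its eigenvalues can be read off the diagonal with no iterative solver at all.

# Hypothetical illustration: structure makes a huge matrix trivial to handle.
import numpy as np
import scipy.sparse

n = 10**6
# A sparse diagonal matrix stores ~10^6 entries instead of 10^12.
D = scipy.sparse.diags(np.arange(1, n + 1, dtype=float))
print(D.nnz)  # 1000000 stored non-zero entries

# For a diagonal matrix, the eigenvalues are simply the diagonal entries.
eigenvalues = D.diagonal()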

Not surprisingly, the structure of \(A\) has a key role in the complexity of an eigenproblem. In practice, such structure is defined by how scientists and engineers model the specific problem they have at hand.

In some cases, the matrix \(A\) has a regular internal structure, with a number of “tiled patterns” repeating internally. In many cases, the matrix \(A\) is also an extremely sparse matrix, with only a few non-zero entries. This is such a common case that having good solvers for sparse matrices has been the subject of years of intense research in the scientific computing community.

Solvers, please?

There are many good numerical solvers for large sparse matrices out there (e.g. Eigen, SLEPc). Most of these solvers deal with scale by distributing the load over several CPU machines, but they are massive software systems that are not easy to set up and maintain. In this benchmark, we focus instead on solvers that run on a single CPU or GPU.

One very common option for solving sparse eigenproblems, at least in Python, is the eigensolver provided by SciPy. SciPy’s sparse eigensolver is based on the ARPACK library, which implements the Implicitly Restarted Arnoldi Method (IRAM) for solving eigenproblems of large sparse or unstructured matrices. For Hermitian matrices (i.e. symmetric matrices, if the matrix entries are real-valued), it uses a variant of the Lanczos algorithm.

Computationally, this solver uses all the CPU threads available on the machine, and with a good processor (e.g. a recent Intel Core i9 or AMD Threadripper), that is actually a lot of firepower.

Here is a short code snippet that computes the eigenpairs of a real-valued Hermitian matrix using SciPy.

# Import the SciPy sparse module and its linear algebra routines.
import scipy.sparse
import scipy.sparse.linalg

# Initialize a 10 x 10 random sparse matrix with 10% of non-zero entries.
matrix = scipy.sparse.random(10, 10, density=0.1)

# Make sure it is Hermitian (symmetric, since the entries are real-valued).
matrix = matrix + matrix.transpose()

# Compute eigenpairs (by default, eigsh returns the k=6 eigenvalues of
# largest magnitude and the corresponding eigenvectors).
eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(matrix)

Despite being so convenient to use, the SciPy solver runs only on the CPU. Fortunately, there are also GPU-based solvers for sparse eigenproblems. A very interesting one is provided by the CuPy Python package, which mimics the API of SciPy. Below we show how the previous code translates to CuPy.

# Import the SciPy-compatible modules of CuPy.
import cupyx.scipy as _scipy
import cupyx.scipy.sparse
import cupyx.scipy.sparse.linalg

# Initialize a 10 x 10 random sparse matrix with 10% of non-zero entries,
# stored directly in GPU memory (CSR format).
matrix = _scipy.sparse.random(10, 10, density=0.1, format="csr")

# Make sure it is Hermitian (symmetric, since the entries are real-valued).
matrix = matrix + matrix.transpose().tocsr()

# Compute eigenpairs.
eigenvalues, eigenvectors = _scipy.sparse.linalg.eigsh(matrix)
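
One practical detail: the eigenvalues and eigenvectors returned by CuPy live in GPU memory as CuPy arrays. If you need them back on the host, e.g. for NumPy-based post-processing, you can copy them explicitly. A minimal sketch, assuming the variables from the snippet above:

# Copy the results from GPU memory back to host (NumPy) arrays.
import cupy as cp

eigenvalues_host = cp.asnumpy(eigenvalues)
eigenvectors_host = cp.asnumpy(eigenvectors)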

CuPy’s eigensolver is built on top of NVIDIA’s CUDA Toolkit and implements the Jacobi eigenvalue algorithm to find the eigenvalues and eigenvectors of Hermitian matrices.

Since even a “cheap” GPU has thousands of CUDA cores, the hope is that, as the size of the matrix increases, the extremely high level of parallelism of the GPU will compensate for the overhead of moving data between CPU and GPU memory.

Both SciPy and CuPy are very convenient to install and use (as opposed to the other very large scale solvers we mentioned before), so we decided to do a little benchmarking between the eigensolvers provided by the CuPy and SciPy packages.

We set out to answer the question:

Do GPUs solve sparse eigenproblems faster than CPUs?

This is not a trivial question, because GPUs are not necessarily good at dealing with sparse structures. It is well known that they excel at processing dense data, but handling sparse structures has always been challenging for GPUs. To get to the bottom of this question, below we put SciPy and CuPy face to face to see how they fare in solving increasingly large sparse eigenproblems.

So, who wins? Let the contest begin!

Clash of Pythans

We benchmark the performance of eigensolvers using sparse matrices of increasing size and a realistic structure that mimics the ones found in some common problems.

We use as problem inputs tridiagonal matrices \(A\) whose diagonals are normally distributed random numbers. The matrices are Hermitian, which in this case, since the entries are real-valued, simply means that \(A\) is symmetric. To avoid any abnormally difficult or easy cases that could arise from generating these random matrices, we repeat each test (both for CuPy and SciPy) 100 times and report the mean time required by the CPU or GPU to solve the eigenproblem.

The comparison involves challenging both solvers, on different hardware configurations, with increasingly large matrices. We start with small square matrices of size 10 (i.e. 10 rows × 10 columns) and increase the size of the matrix by a factor of 10 at each step, ultimately trying to solve the largest eigenproblem that each specific hardware configuration allows for. The hard limit is the memory (RAM or GPU memory) available in each configuration.
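
For concreteness, the sketch below shows one way to generate such a test matrix and time a single solve with SciPy. It is a simplified illustration of the procedure described above; the matrix size and timing details are assumptions, not the exact benchmark code.

# Simplified sketch of one benchmark run (illustrative, not the exact code).
import time

import numpy as np
import scipy.sparse
import scipy.sparse.linalg

n = 10**5  # one of the matrix sizes used in the benchmark

# Tridiagonal real symmetric matrix with normally distributed diagonals.
main_diag = np.random.randn(n)
off_diag = np.random.randn(n - 1)
matrix = scipy.sparse.diags(
    [off_diag, main_diag, off_diag], offsets=[-1, 0, 1], format="csr"
)

# Time one solve; the benchmark repeats this 100 times and averages.
start = time.perf_counter()
eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(matrix)
elapsed = time.perf_counter() - start
print(f"Solved size {n} in {elapsed:.3f} s")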

The CPU-based systems in which the SciPy solver is tested are:

  • CPU1: Intel Core i7-8750H (released in Q2 2018, 8th gen., 2.2 GHz, 6 cores, 12 threads, 16 GB RAM)
  • CPU2: Intel Core i5-12400F (released in Q1 2022, 12th gen., 4.4 GHz, 6 cores, 12 threads, 64 GB RAM)
  • CPU3: Intel Core i7-12700 (released in Q1 2022, 12th gen., 4.9 GHz, 10 cores, 20 threads, 32 GB RAM)
  • CPU4: Intel Core i9-12900K (released in Q4 2021, 12th gen., 5.2 GHz, 16 cores, 24 threads, 64 GB RAM)

These systems have an increasing number of cores and threads and come from successively newer generations. We know SciPy makes use of all the CPU cores available, so more cores and a higher clock frequency should mean faster solving times. The available RAM also varies, so we have a good mix of common CPU configurations.

The GPUs in which the CuPy solver is tested are:

  • GPU1: NVIDIA GTX 1060 (laptop GPU, released in 2016, 6 GB VRAM)
  • GPU2: NVIDIA RTX 3070 (8 GB VRAM)
  • GPU3: NVIDIA RTX 3090 (24 GB VRAM)
  • GPU4: NVIDIA RTX 4090 (24 GB VRAM)

Again, these represent increasingly powerful options, both in terms of number of cores and available memory. The NVIDIA GTX 1060 is a quite modest laptop-level GPU from 2016, while the NVIDIA RTX 4090 is the latest generation of gaming GPUs by NVIDIA.

Table 1 shows the mean execution time over 100 runs of the SciPy solver on each CPU, for increasing matrix sizes. All values are given in seconds; note that the powers of 10 change significantly across rows and columns.

| Matrix size | CPU1: i7-8750H | CPU2: i5-12400F | CPU3: i7-12700 | CPU4: i9-12900K |
|---|---|---|---|---|
| 10¹ | 4.64 × 10⁻⁴ | 1.97 × 10⁻⁴ | 1.88 × 10⁻⁴ | 1.61 × 10⁻⁴ |
| 10² | 4.31 × 10⁻³ | 2.00 × 10⁻³ | 1.51 × 10⁻³ | 1.49 × 10⁻³ |
| 10³ | 3.74 × 10⁻² | 6.42 × 10⁻³ | 6.49 × 10⁻³ | 5.78 × 10⁻³ |
| 10⁴ | 1.26 × 10⁻¹ | 2.97 × 10⁻² | 3.13 × 10⁻² | 2.60 × 10⁻² |
| 10⁵ | 1.58 × 10⁰ | 3.16 × 10⁻¹ | 4.02 × 10⁻¹ | 4.12 × 10⁻¹ |
| 10⁶ | 1.96 × 10¹ | 6.84 × 10⁰ | 8.34 × 10⁰ | 6.72 × 10⁰ |
| 10⁷ | 1.91 × 10² | 9.04 × 10¹ | 9.46 × 10¹ | 9.05 × 10¹ |

Likewise, Table 2 shows the mean execution time over 100 runs for the CuPy solver on each GPU. All values are given in seconds.

| Matrix size | GPU1: GTX 1060 | GPU2: RTX 3070 | GPU3: RTX 3090 | GPU4: RTX 4090 |
|---|---|---|---|---|
| 10¹ | 1.30 × 10⁻² | 2.29 × 10⁻² | 1.19 × 10⁻² | 1.91 × 10⁻² |
| 10² | 9.58 × 10⁻² | 3.03 × 10⁻² | 2.65 × 10⁻² | 2.29 × 10⁻² |
| 10³ | 1.21 × 10⁻¹ | 3.23 × 10⁻² | 3.03 × 10⁻² | 2.42 × 10⁻² |
| 10⁴ | 1.64 × 10⁻¹ | 4.11 × 10⁻² | 3.88 × 10⁻² | 3.15 × 10⁻² |
| 10⁵ | 2.84 × 10⁻¹ | 1.26 × 10⁻¹ | 9.78 × 10⁻² | 5.65 × 10⁻² |
| 10⁶ | 2.67 × 10⁰ | 7.13 × 10⁻¹ | 6.78 × 10⁻¹ | 3.56 × 10⁻¹ |
| 10⁷ | 2.95 × 10¹ | 8.21 × 10⁰ | 4.96 × 10⁰ | 3.18 × 10⁰ |

For a better comparison, the mean values of both CPU and GPU execution times are represented graphically as a function of matrix size in Figure 1. Observe the logarithmic scales used in the plot.

Figure 1: Mean time required by CPUs and GPUs to solve a sparse eigenproblem of varying matrix size. CPUs use SciPy and GPUs use CuPy.

Our results strikingly show that GPUs perform poorly for small matrices (size < \(10^4\)), but clearly outperform CPUs for matrix sizes > \(10^4\).

Note that our timings do not include the transfer time between RAM and GPU memory. Even so, there is significant overhead in the GPU process, which is observable for small matrices. Naturally, the devices with higher computing capacity (CPU4 and GPU4) outperform their counterparts, although the difference is clearly more noticeable for the GPUs. In fact, CPUs 2-4 show very similar performance for all matrix sizes. GPUs 2-4, in turn, show significant differences, especially for sizes > \(10^4\).

Another clear outcome of our experiment is that choosing the computational stack to solve a sparse eigenproblem can result in dramatically different execution times. In our results, there are matrix sizes for which no GPU outperforms the worst CPU (e.g. matrix size \(10^2\)) and vice-versa (e.g. matrix size \(10^6\)).

Figure 2: Speedup factors relative to CPU2 for varying matrix size.

To quantify this impact on the workflow of scientists and engineers, Figure 2 shows the speedup factors obtained by solving the eigenproblem on each GPU relative to CPU2, a mid-range CPU commonly found in personal desktop computers. With GPUs 2-4, the speedup factor is above 10x for matrix sizes > \(10^6\), reaching a whopping 30x for GPU4.
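
For instance, at matrix size \(10^7\), CPU2 takes roughly 90 s while GPU4 takes about 3.2 s (Tables 1 and 2), a speedup of nearly 30x.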

That’s all for now, folks!

Stay tuned for more benchmarks on scientific computing tools. If you are interested in seeing a particular scientific computing tool benchmarked, drop us a line at contact@inductiva.ai.
