In my previous post, I tried to explain how to use SIMD instructions for a really simple (and artificial) example: just adding numbers in two vectors together. In this post, I’d like to take this just a little bit further and talk about matrix multiplication. In this post, we’re using gcc’s auto-vectorization. We’ll vectorize this ourselves in my next post.

If you’re here, you probably know what matrix multiplication is. It’s got a lot of uses, including graphics and neural networks.

We’ll keep our implementation simple by only supporting square matrices with n dividable by 16 (in the case of AVX). Our example will use n=1024. So before we do the vectorized implementation, let’s look at a general (“naive”) example:

#include <stdio.h> int main(int argc, char **argv) { float matrix_a[1024][1024]; float matrix_b[1024][1024]; float result_matrix[1024][1024]; // initialize arrays for (int i = 0; i < 1024; i++) { for (int j = 0; j < 1024; j++) { matrix_a[i][j] = 0.1f; matrix_b[i][j] = 0.2f; result_matrix[i][j] = 0.0f; } } for (i = 0; i < 1024; i++) { // iterate over rows of matrix A/result matrix for (j = 0; j < 1024; j++) { // iterate over columns matrix B/result matrix for (k = 0; k < 1024; k++) { // iterate over columns of matrix A and rows of matrix B result_matrix[i][j] += matrix_a[i][k]*matrix_b[k][j] } } } // output for (int i = 0; i < 1024; i++) { for (int j = 0; j < 1024; j++) { printf("%f ", result_matrix[i][j]); } printf("\n"); } }

To compile and run, execute the following commands:

$ gcc -Wall -o mm mm.c $ ulimit -s 16384 $ time ./mm > mm_output real 0m20.189s user 0m20.016s sys 0m0.072s

(Note that we are allocating the arrays on the stack rather than using malloc, so we need to raise the stack size a bit, otherwise we get an immediate segmentation fault.)

The reason matrix multiplication code can look a bit mysterious is that there are a lot of things that can be optimized. However, there is only one optimization that is required to get vectorization to work at all.

As you can see, when we access matrix_a, we access matrix_a[i][0], then matrix_a[i][1], matrix_a[i][2], matrix_a[i][3], and so on until we have hit the end. This is nice and sequential memory access, and is much faster than haphazard (“random”) accesses. In matrix_b, we have somewhat haphazard accesses. The first access is matrix_b[0][j], the second access is (in our example) 1024 bytes away from the first, matrix_b[1][j], then another 1024 bytes away at matrix_b[2][j], etc. There is a 1024 byte gap between every access. This kind of access is slow. It ruins the CPU’s caching system. This is why matrix_b will often be transposed in matrix multiplication code. If you transpose the matrix, the rows will be the columns and the columns the rows, thus you get nice and sequential access to matrix_b. (In our demonstration code, we are using square matrices with the same values everywhere, so we don’t actually have to do any copying work, as matrix_b is the same transposed or not. So all we have to do is swap the indices.)

result[i][j] += matrix_a[i][k]*matrix_b[j][k]

So what kind of speed-up does this get us? The naive implementation takes 19-21 seconds on my system. The implementation with the transposed matrix takes 4 seconds! That’s a 5x speed-up!

Next, we’ll try to parallelize the outer for-loop using OpenMP. With OpenMP we just have to add #pragma omp parallel for in front of the loop, like this:

#pragma omp parallel for for (int i = 0; i < 1024; i++) {

And then compile and run like this:

$ gcc -fopenmp -Wall -o mmT mmT.c $ ulimit -s 16384 $ time ./mmT > /dev/null real 0m2.939s user 0m9.984s sys 0m0.016s

Next, we’ll ask gcc to auto-vectorize! Curiously enough, gcc didn’t autovectorize the version with the transposed loop, so I’ve gathered results for -O4 without autovectorization for non-transposed, -O4 with autovectorization for non-transposed, and -O4 transposed:

-O4 with SSE autovectorization | -O4 with AVX autovectorization | -O4 without autovectorization | |

Straight | 2.99 | 1.527 | 8.921 |

Transposed | n/a | n/a | 1.565 |

And here are the commands and some example output:

$ # -O4, no auto-vectorization, straight $ gcc -g -O4 -fopt-info-optall-optimized -fno-tree-vectorize -o mm mm.c mm.c:9:9: note: Loop 7 distributed: split to 1 loops and 1 library calls. $ time ./mm > /dev/null real 0m8.921s user 0m8.912s sys 0m0.004s $ # -O4, SSE auto-vectorization, straight $ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -o mm mm.c mm.c:9:9: note: Loop 7 distributed: split to 1 loops and 1 library calls. mm.c:18:9: note: loop vectorized mm.c:9:9: note: loop vectorized $ # -O4, AVX auto-vectorization, straight $ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -mavx -o mm mm.c $ # -O4, no auto-vectorization, transformed $ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -o mmT mmT.c

Let’s add OpenMP into the mix:

-O4 with AVX autovectorization and OpenMP | -O4 with OpenMP | |

Straight | 1.18 | 5.568 |

Transposed | n/a | 0.882 |

Just asking OpenMP to parallelize the i-loop makes the auto-vectorization break, but we can work around that by manually splitting the matrix multiplication into chunks. This is the full code:

#include <stdio.h> #define N 1024 float matrix_a[N][N]; float matrix_b[N][N]; float result_matrix[N][N]; void chunked_mm(int chunk, int n_chunks) { for (int i = chunk*(N/n_chunks); i < (chunk+1)*(N/n_chunks); i++) { for (int j = 0; j < N; j++) { for (int k = 0; k < N; k++) { result_matrix[i][j] += matrix_a[i][k] * matrix_b[k][j]; } } } } int main(int argc, char **argv) { for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { matrix_a[i][j] = 0.1f; matrix_b[i][j] = 0.2f; result_matrix[i][j] = 0.0f; } } #pragma omp parallel for for (int chunk = 0; chunk < 4; chunk++) { chunked_mm(chunk, 4); } for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { printf("%f ", result_matrix[i][j]); } printf("\n"); } }

Compile and run:

$ gcc -g -O4 -fopenmp -fopt-info-optall-optimized -ftree-vectorize -mavx -o mm_autovectorized_openmp mm_autovectorized_openmp.c mm_autovectorized_openmp.c:11:9: note: loop vectorized mm_autovectorized_openmp.c:11:9: note: loop vectorized mm_autovectorized_openmp.c:21:9: note: Loop 4 distributed: split to 1 loops and 1 library calls. mm_autovectorized_openmp.c:21:9: note: loop vectorized mm_autovectorized_openmp.c:21:9: note: loop peeled for vectorization to enhance alignment mm_autovectorized_openmp.c:21:9: note: loop turned into non-loop; it never loops. mm_autovectorized_openmp.c:21:9: note: loop with 7 iterations completely unrolled mm_autovectorized_openmp.c:19:5: note: loop turned into non-loop; it never loops. mm_autovectorized_openmp.c:19:5: note: loop with 7 iterations completely unrolled $ time ./mm_autovectorized_openmp > /dev/null real 0m1.182s user 0m3.036s sys 0m0.012s

From the user time being larger than the real time, we can tell that this was indeed running in multiple threads. Enclosing the parallel loop with something like:

for (int loop = 0; loop < 10; loop++) { #pragma omp parallel for for (int chunk = 0; chunk < 8; chunk++) { chunked_mm(chunk, 8); } }

allows us a better measurement of how much improvement we get.

time ./mm_autovectorized_openmp > /dev/null real 0m6.649s user 0m23.572s sys 0m0.012s

Anyway, the transpose still beats gcc’s auto-vectorization of the non-transposed code. I wish I could get gcc to auto-vectorize the transposed code, but alas.

In the next post we’ll vectorize this ourselves!