In my previous post, I tried various things to improve the performance of a matrix multiplication using compiler features.
# 20 seconds
gcc Wall o mm mm.c
# 1.182 seconds
gcc g O4 fopenmp foptinfooptalloptimized ftreevectorize mavx o mm_autovectorized_openmp mm_autovectorized_openmp.c
However, O4 fopenmp using transposed matrices turned out faster (0.882 seconds) than O4 fopenmp and autovectorization using untransposed matrices. I couldn’t get autovectorization to work with the transposed matrices.
In this post, we’ll use simple SIMD instructions to optimize this further. It builds up on my post from two days ago, where I explain how to use SIMD instructions for a very simple and synthetic example.
Note that much more can be done to optimized matrix multiplication than is described in this post. This post just explains the very basics. If you need more advanced algorithms, maybe look through these three links:
https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0 High Performance Matrix Multiplication
https://news.ycombinator.com/item?id=17164737 Hacker News discussion of above post
https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf Anatomy of HighPerformance Matrix Multiplication (academic paper)
Using transposed matrices makes vectorizing matrix multiplication quite easy. Why? Well, remember that in our simple example, there were three steps. The first step requires that the data to be loaded is laid out sequentially in memory.
 Loading data into SIMD registers
 Performing operations on corresponding operands in two SIMD registers
 Storing the result
Step 1: Loading data
Remember that the data load wanted a memory address where the four (or eight) float values were stored sequentially. Well, if we just transpose the matrix before we start doing stuff, we can just load the matrix B floats sequentially. So the code looks almost the same as in the baby steps post. To make things a bit easier, we will be using SSE for now.
va = _mm_loadu_ps(&(matrix_a[i][k]));
vb = _mm_loadu_ps(&(matrix_b[j][k]));
Step 2: Doing the calculations
All right. We have our floats loaded into two registers. In SSE, we have four floats per register:
Register 1 (va) 
0.1 
0.1 
0.1 
0.1 
Register 2 (vb) 
0.2 
0.2 
0.2 
0.2 
The first step is to multiply. In the baby steps post, we used _mm_add_ps to perform addition. Well, multiplication uses an intrinsic with a similar name: _mm_mul_ps. (The AVX version is _mm256_mul_ps.) So if we do:
vresult = _mm_mul_ps(va, vb)
And we get:
vresult 
0.02 
0.02 
0.02 
0.02 
Great! Now we just need to add the contents of vresult together! Unfortunately, there is no SIMD instruction that would add every component together to give us 0.08 as the output, given the above vresult as its only input.
From SSE3, there exists _mm_hadd_ps however, the “horizontal add” instruction (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&expand=2777), which takes two registers as input (you can use the same registers), and computes:
dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
Here’s an example:
va 
0.1 
0.2 
0.3 
0.4 
vb 
0.5 
0.6 
0.7 
0.8 
vresult 
0.3 
0.7 
1.1 
1.5 
Sorry for the weird color scheme. Maybe you can already see that this is a bit odd – why does it want two registers as input, for starters? We wanted 0.1+0.2+0.3+0.4, which should be 1. Well, let’s see what happens when we use the same register for both inputs, and perform this operation twice!
va 
0.1 
0.2 
0.3 
0.4 
va 
0.1 
0.2 
0.3 
0.4 
vresult 
0.3 
0.7 
0.3 
0.7 
vresult 
0.3 
0.7 
0.3 
0.7 
vresult 
0.3 
0.7 
0.3 
0.7 
vresult (new) 
1 
1 
1 
1 
Yay, we did it! We got 1, which is the result of 0.1+0.2+0.3+0.4. (This works for SSE. We will talk about AVX later.) Here’s the code:
vresult = _mm_hadd_ps(vresult, vresult);
vresult = _mm_hadd_ps(vresult, vresult);
Step 3: Storing the result
Step 3 involves storing the result. We can of course just store the four bytes into an array as before, but as they’re all the same, we’re really only interested in one of them. We could use _mm_extract_ps, which is capable of extracting any of the four floats. But we can do slightly better, we can just cast, which will get us the lowest float in the 128bit register. There is an intrinsic for this type of cast, _mm_cvtss_f32, so we can just write:
result[i][j] += _mm_cvtss_f32(vresult);
And that’s (assuming SSE3) four suboperations of the matrix multiplication done in one go! Because we’re doing four ks at once, we have to change the inner loop to reflect that:
for (int k = 0; k < 1024; k += 4) {
...
}
So let’s see the code. In this example I’ve also decided to use malloc instead of stack arrays (except for result), so matrix_a[i][k] turns into matrix_a+(i*1024)+k.
#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
float *matrix_a = malloc(1024*1024*sizeof(float));
float *matrix_b = malloc(1024*1024*sizeof(float));
float result[1024][1024];
__m128 va, vb, vresult;
// initialize matrix_a and matrix_b
for (int i = 0; i < 1048576; i++) {
*(matrix_a+i) = 0.1f;
*(matrix_b+i) = 0.2f;
}
// initialize result matrix
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
result[i][j] = 0;
}
}
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
for (int k = 0; k < 1024; k += 4) {
// load
va = _mm_loadu_ps(matrix_a+(i*1024)+k); // matrix_a[i][k]
vb = _mm_loadu_ps(matrix_b+(j*1024)+k); // matrix_b[j][k]
// multiply
vresult = _mm_mul_ps(va, vb);
// add
vresult = _mm_hadd_ps(vresult, vresult);
vresult = _mm_hadd_ps(vresult, vresult);
// store
result[i][j] += _mm_cvtss_f32(vresult);
}
}
}
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
printf("%f ", result[i][j]);
}
printf("\n");
}
return 0;
}
gcc O4 foptinfooptalloptimized msse3 o sse_mm_unaligned sse_mm_unaligned.c
time ./sse_mm_unaligned > /dev/null
real 0m1.054s
user 0m1.044s
sys 0m0.008s
And the run time is about 1.054 seconds using a single thread. Note that we have to pass msse3 to gcc, as vanilla SSE does not support the horizontal add instruction.
AVX
As mentioned earlier, the doublehadd method does not work for the AVX _mm256_hadd_ps intrinsic (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_hadd_ps&expand=2778), which works like this:
dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
dst[159:128] := a[191:160] + a[159:128]
dst[191:160] := a[255:224] + a[223:192]
dst[223:192] := b[191:160] + b[159:128]
dst[255:224] := b[255:224] + b[223:192]
Here’s a vavbtable that shows what happens with AVX:
va 
0.1 
0.2 
0.3 
0.4 
0.5 
0.6 
0.7 
0.8 
vb 
0.9 
1 
1.1 
1.2 
1.3 
1.4 
1.5 
1.6 
vresult 
0.3 
0.7 
1.9 
1.3 
1.1 
1.5 
2.7 
3.1 
Here’s the first vava table of the doublehadd method:
va 
0.1 
0.2 
0.3 
0.4 
0.5 
0.6 
0.7 
0.8 
va 
0.1 
0.2 
0.3 
0.4 
0.5 
0.6 
0.7 
0.8 
vresult 
0.3 
0.7 
0.3 
0.7 
1.1 
1.5 
1.1 
1.5 
And the second vresultvresult table:
vresult 
0.3 
0.7 
0.3 
0.7 
1.1 
1.5 
1.1 
1.5 
vresult 
0.3 
0.7 
0.3 
0.7 
1.1 
1.5 
1.1 
1.5 
vresult (new) 
1 
1 
1 
1 
2.6 
2.6 
2.6 
2.6 
As you can see, we do not reach our expected result of 3.6 (0.1+0.2+…+0.8). (It’s just like it’s doing two SSE hadds completely independent from each other.) There are various ways to get out of this problem, e.g. extract the two 128bit halves from the 256bit register, and then use SSE instructions. This is how you extract:
vlow = _mm256_extractf128_ps(va, 0);
vhigh = _mm256_extractf128_ps(va, 1);
The second argument indicates with half you want.
As an aside: instead of extracting the lower 128 bits and putting them in a register, we can also use a cast, _mm256_castps256_ps128 (https://software.intel.com/enus/node/524181).
The lower 128bits of the source vector are passed unchanged to the result. This intrinsic does not introduce extra moves to the generated code.
Anyway, let’s go with the extracted values first. So we have the following situation:
vlow 
0.1 
0.2 
0.3 
0.4 
vhigh 
0.5 
0.6 
0.7 
0.8 
And we want to add all these eight values together. So why don’t we just simply use our trusty _mm_add_ps(vlow, vhigh) first? This way we can do four of eight required additions, leaving us with the following 128bit register:
And now we want to add up horizontally, so we use the double_mm_hadd_ps method described above:
vresult 
0.6 
0.8 
1 
1.2 
vresult 
0.6 
0.8 
1 
1.2 
vresult 
1.4 
2.2 
1.4 
2.2 
vresult 
1.4 
2.2 
1.4 
2.2 
vresult 
1.4 
2.2 
1.4 
2.2 
vresult 
3.6 
3.6 
3.6 
3.6 
#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
float *matrix_a = malloc(1024*1024*sizeof(float));
float *matrix_b = malloc(1024*1024*sizeof(float));
float result[1024][1024];
__m256 va, vb, vtemp;
__m128 vlow, vhigh, vresult;
// initialize matrix_a and matrix_b
for (int i = 0; i < 1048576; i++) {
*(matrix_a+i) = 0.1f;
*(matrix_b+i) = 0.2f;
}
// initialize result matrix
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
result[i][j] = 0;
}
}
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
for (int k = 0; k < 1024; k += 8) {
// load
va = _mm256_loadu_ps(matrix_a+(i*1024)+k); // matrix_a[i][k]
vb = _mm256_loadu_ps(matrix_b+(j*1024)+k); // matrix_b[j][k]
// multiply
vtemp = _mm256_mul_ps(va, vb);
// add
// extract higher four floats
vhigh = _mm256_extractf128_ps(vtemp, 1); // high 128
// add higher four floats to lower floats
vresult = _mm_add_ps(_mm256_castps256_ps128(vtemp), vhigh);
// horizontal add of that result
vresult = _mm_hadd_ps(vresult, vresult);
// another horizontal add of that result
vresult = _mm_hadd_ps(vresult, vresult);
// store
result[i][j] += _mm_cvtss_f32(vresult);
}
}
}
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
printf("%f ", result[i][j]);
}
printf("\n");
}
return 0;
}
$ gcc O4 foptinfooptalloptimized mavx o avx256_mm_unaligned avx256_mm_unaligned.c
$ time ./avx256_mm_unaligned > /dev/null
real 0m0.912s
user 0m0.904s
sys 0m0.004s
That is… a tiny bit faster. (Note that I’m running everything multiple times to make sure the difference isn’t just due to change.) However, with AVX we are supposed to get twice the FLOPs, right? We’ll look at other optimizations of the vectorization in a later post. Before that, let’s add OpenMP into the mix.
OpenMP
Unfortunately, OpenMP’s #pragma omp parallel for sometimes doesn’t appear to do what you need it to do. Sticking this in front of the outer (i) loop reduces performance by half! However, we can be sure that this isn’t the processor “oversubscribing” the SIMD units, because if we run two instances of our program at the same time, both finish with almost the same run time we see with just a single instance:
$ time (./avx256_mm_unaligned & ./avx256_mm_unaligned; wait) > /dev/null
real 0m1.001s
user 0m0.988s
sys 0m0.008s
So we’ll use the same chunking trick that we used last time, and our result gets a little better: 0.753 seconds:
#include <x86intrin.h> // Need this in order to be able to use the AVX "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>
#include <stdlib.h>
float *matrix_a;
float *matrix_b;
float result[1024][1024];
void chunked_mm(int chunk, int n_chunks) {
__m256 va, vb, vtemp;
__m128 vlow, vhigh, vresult;
for (int i = chunk*(1024/n_chunks); i < (chunk+1)*(1024/n_chunks); i++) {
for (int j = 0; j < 1024; j++) {
for (int k = 0; k < 1024; k += 8) {
// load
va = _mm256_loadu_ps(matrix_a+(i*1024)+k); // matrix_a[i][k]
vb = _mm256_loadu_ps(matrix_b+(j*1024)+k); // matrix_b[j][k]
// multiply
vtemp = _mm256_mul_ps(va, vb);
// add
// extract higher four floats
vhigh = _mm256_extractf128_ps(vtemp, 1); // high 128
// add higher four floats to lower floats
vresult = _mm_add_ps(_mm256_castps256_ps128(vtemp), vhigh);
// horizontal add of that result
vresult = _mm_hadd_ps(vresult, vresult);
// another horizontal add of that result
vresult = _mm_hadd_ps(vresult, vresult);
// store
result[i][j] += _mm_cvtss_f32(vresult);
}
}
}
}
int main(int argc, char **argv) {
// initialize matrix_a and matrix_b
matrix_a = malloc(1024*1024*sizeof(float));
matrix_b = malloc(1024*1024*sizeof(float));
for (int i = 0; i < 1048576; i++) {
*(matrix_a+i) = 0.1f;
*(matrix_b+i) = 0.2f;
}
// initialize result matrix
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
result[i][j] = 0;
}
}
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 4; i++) {
chunked_mm(i, 4);
}
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
printf("%f ", result[i][j]);
}
printf("\n");
}
return 0;
}
$ gcc fopenmp O4 mavx o avx256_mm_unaligned_openmp avx256_mm_unaligned_openmp.c
$ time ./avx256_mm_unaligned_openmp > /dev/null
real 0m0.753s
user 0m1.332s
sys 0m0.008s
To be honest, with a 2 core/4 thread system, I would have expected better. Running multiple instances doesn’t increase the run time, and the previous version took only 1.27 times as long as this.
Reevaluating our performance measurements
Array initialization will always take the same small amount of time, but printf(“%f”, …) takes a nonconstant amount of time and depends on the values. Let’s see what kind of timing we get when we change this to an %x format string.
printf("%x ", *(unsigned int*)&result[i][j]);
time ./avx256_mm_unaligned > /dev/null
real 0m0.488s
user 0m0.480s
sys 0m0.004s
time ./avx256_mm_unaligned_openmp > /dev/null
real 0m0.277s
user 0m0.832s
sys 0m0.008s
That sounds much better, both in absolute terms and in OpenMP terms. By the way, if we remove the matrix multiplication and only leave initialization and output, we still get an execution time of about 0.111 seconds. So it’s reasonably safe to say that our matrix multiplication takes about 0.377 seconds on a single thread. (I feel like I shot myself in the foot for measuring this using shell’s time, rather than embedding the measurement in the code itself…)
Aligned accesses
To allow the use of the aligned _mm256_load_ps, allocate your memory like this:
matrix_a = aligned_alloc(ALIGNMENT, 1024*1024*sizeof(float));
matrix_b = aligned_alloc(ALIGNMENT, 1024*1024*sizeof(float));
Unfortunately, I didn’t notice a significant difference. (You may be able to shave off a few percent.)
Results
Here are the results, again:

AVX, no OpenMP 
AVX, OpenMP 
SSE, no OpenMP 
Run time 
0.488 
0.277 
0.59 
Minus init/output 
0.377 
0.166 
0.479 