## Factors of the first 1,000,000 integer numbers

A quick Google search only brought up lists up to e.g. 1,000, so here’s a CSV with the factors for the first 1,000,000 integers. (Note that from a programming perspective, you’re most likely better off calculating the factors yourself rather than parsing a CSV file. So it’s not quite clear if this list is actually useful.) Feel free to do with it whatever you want.

1000000_factors.gz (23 MB)

Here’s an extract:

```1,1
2,1
3,1
4,1,2
5,1
6,1,2,3
7,1
8,1,2,4
9,1,3
10,1,2,5
11,1
12,1,2,3,4,6
13,1
14,1,2,7
15,1,3,5
16,1,2,4,8
17,1
18,1,2,3,6,9
19,1
20,1,2,4,5,10
21,1,3,7
22,1,2,11
23,1
24,1,2,3,4,6,8,12
25,1,5
26,1,2,13
27,1,3,9
28,1,2,4,7,14
29,1
30,1,2,3,5,6,10,15
31,1
32,1,2,4,8,16
33,1,3,11
34,1,2,17
35,1,5,7
36,1,2,3,4,6,9,12,18
37,1
38,1,2,19
39,1,3,13
40,1,2,4,5,8,10,20
41,1
42,1,2,3,6,7,14,21
43,1
44,1,2,4,11,22
45,1,3,5,9,15
46,1,2,23
47,1
48,1,2,3,4,6,8,12,16,24
49,1,7
50,1,2,5,10,25```

## Matrix multiplication using the FMA instruction

In my previous post, we did matrix multiplication using regular SSE/AVX instructions. In this post, we’ll implement matrix multiplication using the FMA (fused multiply-add) instruction, which takes three arguments and is able to multiply and add at the same time. (Think c = a*b + c.)

If you got here directly without reading the previous posts, note that this is just a somewhat naive implementation using inverted matrices (but without doing the inversion ourselves). (The intent of this article series is to show how to use SIMD instructions.) My previous post had a few links for people who need something really optimized.

Note that we’re using “FMA3” instructions, rather than FMA4 instructions, which only seem to be supported on some AMD processors. (The number indicates the number of arguments passed to the instruction. In the case of FMA4, the formula would be a = b*c + d.)

The first table below shows the operation of the first FMA instruction, where a is still 0 (as there is nothing to add on the very first instruction), and the second table the following FMA instruction, where the “addend” is the result in the first table.

 Addend (a) 0 0 0 0 Factor 1 (b) 0.1 0.1 0.1 0.1 Factor 2 (c) 0.1 0.1 0.1 0.1 Result (a) 0.01 0.01 0.01 0.01
 Addend (a) 0.01 0.01 0.01 0.01 Factor 1 (b) 0.1 0.1 0.1 0.1 Factor 2 (c) 0.1 0.1 0.1 0.1 Result (a) 0.02 0.02 0.02 0.02

Here’s the code. I changed the square matrix size to 2048 to make measuring a bit easier.

```#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>

#define N 2048

float *matrix_a;
float *matrix_b;
float result[N][N];

void chunked_mm(int chunk, int n_chunks) {
__m256 va, vb, vc;
for (int i = chunk*(N/n_chunks); i < (chunk+1)*(N/n_chunks); i++) {
for (int j = 0; j < N; j++) {
float buffer = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
for (int k = 0; k < N; k += 8) {

}
_mm256_storeu_ps(buffer, vc);
result[i][j] = buffer + buffer + buffer + buffer + buffer + buffer + buffer + buffer;
//result[i][j] = buffer + buffer + buffer + buffer;
}
}
}

int main(int argc, char **argv) {
// initialize matrix_a and matrix_b
matrix_a = malloc(N*N*sizeof(float));
matrix_b = malloc(N*N*sizeof(float));

for (int i = 0; i < N*N; i++) {
*(matrix_a+i) = 0.1f;
*(matrix_b+i) = 0.2f;
}
// initialize result matrix
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
result[i][j] = 0.0f;
}
}

#pragma omp parallel for
for (int i = 0; i < 4; i++) {
chunked_mm(i, 4);
}

for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
//printf("%f ", result[i][j]);
printf("%x ", *(unsigned int*)&result[i][j]);
}
printf("\n");
}

return 0;
}```

### Performance

Since the CPU used in the previous articles doesn’t support FMA (and we changed N), I’m re-benchmarking the AVX256 version on the new processor.

AVX256: 1.25 seconds
FMA: 1 second

Unfortunately this is borrowed hardware so I can’t play around with this too much, but the above result is pretty consistent.

## Matrix multiplication using SIMD instructions

In my previous post, I tried various things to improve the performance of a matrix multiplication using compiler features.

```# 20 seconds
gcc -Wall -o mm mm.c

# 1.182 seconds
gcc -g -O4 -fopenmp -fopt-info-optall-optimized -ftree-vectorize -mavx -o mm_autovectorized_openmp mm_autovectorized_openmp.c```

However, -O4 -fopenmp using transposed matrices turned out faster (0.882 seconds) than -O4 -fopenmp and auto-vectorization using untransposed matrices. I couldn’t get auto-vectorization to work with the transposed matrices.

In this post, we’ll use simple SIMD instructions to optimize this further. It builds up on my post from two days ago, where I explain how to use SIMD instructions for a very simple and synthetic example.

Note that much more can be done to optimize matrix multiplication than is described in this post. This post just explains the very basics. If you need more advanced algorithms, maybe look through these three links:

https://news.ycombinator.com/item?id=17164737 Hacker News discussion of above post

https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf Anatomy of High-Performance Matrix Multiplication (academic paper)

Here’s an interesting article that implements high-performance matrix multiplication in just 100 lines, using FMA3: https://cs.stanford.edu/people/shadjis/blas.html

Using transposed matrices makes vectorizing matrix multiplication quite easy. Why? Well, remember that in our simple example, there were three steps. The first step requires that the data to be loaded is laid out sequentially in memory.

2. Performing operations on corresponding operands in two SIMD registers
3. Storing the result

Remember that the data load wanted a memory address where the four (or eight) float values were stored sequentially. Well, if we just transpose the matrix before we start doing stuff, we can just load the matrix B floats sequentially. So the code looks almost the same as in the baby steps post. To make things a bit easier, we will be using SSE for now.

```va = _mm_loadu_ps(&(matrix_a[i][k]));

### Step 2: Doing the calculations

All right. We have our floats loaded into two registers. In SSE, we have four floats per register:

 Register 1 (va) 0.1 0.1 0.1 0.1 Register 2 (vb) 0.2 0.2 0.2 0.2

The first step is to multiply. In the baby steps post, we used _mm_add_ps to perform addition. Well, multiplication uses an intrinsic with a similar name: _mm_mul_ps. (The AVX version is _mm256_mul_ps.) So if we do:

` vresult = _mm_mul_ps(va, vb)`

And we get:

 vresult 0.02 0.02 0.02 0.02

Great! Now we just need to add the contents of vresult together! Unfortunately, there is no SIMD instruction that would add every component together to give us 0.08 as the output, given the above vresult as its only input.

From SSE3, there exists _mm_hadd_ps however, the “horizontal add” instruction (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&expand=2777), which takes two registers as input (you can use the same registers), and computes:

```dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]```

Here’s an example:

 va 0.1 0.2 0.3 0.4 vb 0.5 0.6 0.7 0.8 vresult 0.3 0.7 1.1 1.5

Sorry for the weird color scheme. Maybe you can already see that this is a bit odd – why does it want two registers as input, for starters? We wanted 0.1+0.2+0.3+0.4, which should be 1. Well, let’s see what happens when we use the same register for both inputs, and perform this operation twice!

 va 0.1 0.2 0.3 0.4 va 0.1 0.2 0.3 0.4 vresult 0.3 0.7 0.3 0.7
 vresult 0.3 0.7 0.3 0.7 vresult 0.3 0.7 0.3 0.7 vresult (new) 1 1 1 1

Yay, we did it! We got 1, which is the result of 0.1+0.2+0.3+0.4. (This works for SSE. We will talk about AVX later.) Here’s the code:

```vresult = _mm_hadd_ps(vresult, vresult);

### Step 3: Storing the result

Step 3 involves storing the result. We can of course just store the four bytes into an array as before, but as they’re all the same, we’re really only interested in one of them. We could use _mm_extract_ps, which is capable of extracting any of the four floats. But we can do slightly better, we can just cast, which will get us the lowest float in the 128-bit register. There is an intrinsic for this type of cast, _mm_cvtss_f32, so we can just write:

`result[i][j] += _mm_cvtss_f32(vresult);`

And that’s (assuming SSE3) four sub-operations of the matrix multiplication done in one go! Because we’re doing four ks at once, we have to change the inner loop to reflect that:

```for (int k = 0; k < 1024; k += 4) {
...
}```

So let’s see the code. In this example I’ve also decided to use malloc instead of stack arrays (except for result), so matrix_a[i][k] turns into matrix_a+(i*1024)+k.

```#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
float *matrix_a = malloc(1024*1024*sizeof(float));
float *matrix_b = malloc(1024*1024*sizeof(float));
float result;
__m128 va, vb, vresult;

// initialize matrix_a and matrix_b
for (int i = 0; i < 1048576; i++) {
*(matrix_a+i) = 0.1f;
*(matrix_b+i) = 0.2f;
}
// initialize result matrix
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
result[i][j] = 0;
}
}

for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
for (int k = 0; k < 1024; k += 4) {

// multiply
vresult = _mm_mul_ps(va, vb);

// store
result[i][j] += _mm_cvtss_f32(vresult);
}
}
}

for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
printf("%f ", result[i][j]);
}
printf("\n");
}

return 0;
}```
```gcc -O4 -fopt-info-optall-optimized -msse3 -o sse_mm_unaligned sse_mm_unaligned.c
time ./sse_mm_unaligned > /dev/null

real    0m1.054s
user    0m1.044s
sys     0m0.008s```

And the run time is about 1.054 seconds using a single thread. Note that we have to pass -msse3 to gcc, as vanilla SSE does not support the horizontal add instruction.

### AVX

As mentioned earlier, the double-hadd method does not work for the AVX _mm256_hadd_ps intrinsic (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_hadd_ps&expand=2778), which works like this:

```dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
dst[159:128] := a[191:160] + a[159:128]
dst[191:160] := a[255:224] + a[223:192]
dst[223:192] := b[191:160] + b[159:128]
dst[255:224] := b[255:224] + b[223:192]```

Here’s a va-vb-table that shows what happens with AVX:

 va 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 vb 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 vresult 0.3 0.7 1.9 1.3 1.1 1.5 2.7 3.1

Here’s the first va-va table of the double-hadd method:

 va 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 va 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 vresult 0.3 0.7 0.3 0.7 1.1 1.5 1.1 1.5

And the second vresult-vresult table:

 vresult 0.3 0.7 0.3 0.7 1.1 1.5 1.1 1.5 vresult 0.3 0.7 0.3 0.7 1.1 1.5 1.1 1.5 vresult (new) 1 1 1 1 2.6 2.6 2.6 2.6

As you can see, we do not reach our expected result of 3.6 (0.1+0.2+…+0.8). (It’s just like it’s doing two SSE hadds completely independent from each other.) There are various ways to get out of this problem, e.g. extract the two 128-bit halves from the 256-bit register, and then use SSE instructions. This is how you extract:

```vlow = _mm256_extractf128_ps(va, 0);
vhigh = _mm256_extractf128_ps(va, 1);```

The second argument indicates with half you want.

As an aside: instead of extracting the lower 128 bits and putting them in a register, we can also use a cast, _mm256_castps256_ps128 (https://software.intel.com/en-us/node/524181).

The lower 128-bits of the source vector are passed unchanged to the result. This intrinsic does not introduce extra moves to the generated code.

Anyway, let’s go with the extracted values first. So we have the following situation:

 vlow 0.1 0.2 0.3 0.4 vhigh 0.5 0.6 0.7 0.8

And we want to add all these eight values together. So why don’t we just simply use our trusty _mm_add_ps(vlow, vhigh) first? This way we can do four of eight required additions, leaving us with the following 128-bit register:

 vresult 0.6 0.8 1 1.2

And now we want to add up horizontally, so we use the double-_mm_hadd_ps method described above:

 vresult 0.6 0.8 1 1.2 vresult 0.6 0.8 1 1.2 vresult 1.4 2.2 1.4 2.2
 vresult 1.4 2.2 1.4 2.2 vresult 1.4 2.2 1.4 2.2 vresult 3.6 3.6 3.6 3.6
```#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
float *matrix_a = malloc(1024*1024*sizeof(float));
float *matrix_b = malloc(1024*1024*sizeof(float));
float result;
__m256 va, vb, vtemp;
__m128 vlow, vhigh, vresult;

// initialize matrix_a and matrix_b
for (int i = 0; i < 1048576; i++) {
*(matrix_a+i) = 0.1f;
*(matrix_b+i) = 0.2f;
}
// initialize result matrix
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
result[i][j] = 0;
}
}

for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
for (int k = 0; k < 1024; k += 8) {

// multiply
vtemp = _mm256_mul_ps(va, vb);

// extract higher four floats
vhigh = _mm256_extractf128_ps(vtemp, 1); // high 128
// add higher four floats to lower floats
// horizontal add of that result
// another horizontal add of that result

// store
result[i][j] += _mm_cvtss_f32(vresult);
}
}
}

for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
printf("%f ", result[i][j]);
}
printf("\n");
}

return 0;
}```
```\$ gcc -O4 -fopt-info-optall-optimized -mavx -o avx256_mm_unaligned avx256_mm_unaligned.c
\$ time ./avx256_mm_unaligned > /dev/null

real    0m0.912s
user    0m0.904s
sys     0m0.004s```

That is… a tiny bit faster. (Note that I’m running everything multiple times to make sure the difference isn’t just due to change.) However, with AVX we are supposed to get twice the FLOPs, right? We’ll look at other optimizations of the vectorization in a later post. Before that, let’s add OpenMP into the mix.

### OpenMP

Unfortunately, OpenMP’s #pragma omp parallel for sometimes doesn’t appear to do what you need it to do. Sticking this in front of the outer (i) loop reduces performance by half! However, we can be sure that this isn’t the processor “oversubscribing” the SIMD units, because if we run two instances of our program at the same time, both finish with almost the same run time we see with just a single instance:

```\$ time (./avx256_mm_unaligned & ./avx256_mm_unaligned; wait) > /dev/null
real    0m1.001s
user    0m0.988s
sys     0m0.008s```

So we’ll use the same chunking trick that we used last time, and our result gets a little better: 0.753 seconds:

```#include <x86intrin.h> // Need this in order to be able to use the AVX "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>
#include <stdlib.h>

float *matrix_a;
float *matrix_b;
float result;

void chunked_mm(int chunk, int n_chunks) {
__m256 va, vb, vtemp;
__m128 vlow, vhigh, vresult;
for (int i = chunk*(1024/n_chunks); i < (chunk+1)*(1024/n_chunks); i++) {
for (int j = 0; j < 1024; j++) {
for (int k = 0; k < 1024; k += 8) {

// multiply
vtemp = _mm256_mul_ps(va, vb);

// extract higher four floats
vhigh = _mm256_extractf128_ps(vtemp, 1); // high 128
// add higher four floats to lower floats
// horizontal add of that result
// another horizontal add of that result

// store
result[i][j] += _mm_cvtss_f32(vresult);
}
}
}
}

int main(int argc, char **argv) {
// initialize matrix_a and matrix_b
matrix_a = malloc(1024*1024*sizeof(float));
matrix_b = malloc(1024*1024*sizeof(float));
for (int i = 0; i < 1048576; i++) {
*(matrix_a+i) = 0.1f;
*(matrix_b+i) = 0.2f;
}
// initialize result matrix
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
result[i][j] = 0;
}
}

for (int i = 0; i < 4; i++) {
chunked_mm(i, 4);
}

for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
printf("%f ", result[i][j]);
}
printf("\n");
}

return 0;
}```
```\$ gcc -fopenmp -O4 -mavx -o avx256_mm_unaligned_openmp avx256_mm_unaligned_openmp.c
\$ time ./avx256_mm_unaligned_openmp > /dev/null

real    0m0.753s
user    0m1.332s
sys     0m0.008s```

To be honest, with a 2 core/4 thread system, I would have expected better. Running multiple instances doesn’t increase the run time, and the previous version took only 1.27 times as long as this.

### Re-evaluating our performance measurements

Array initialization will always take the same small amount of time, but printf(“%f”, …) takes a non-constant amount of time and depends on the values. Let’s see what kind of timing we get when we change this to an %x format string.

`printf("%x ", *(unsigned int*)&result[i][j]);`
```time ./avx256_mm_unaligned > /dev/null

real    0m0.488s
user    0m0.480s
sys     0m0.004s

time ./avx256_mm_unaligned_openmp > /dev/null

real    0m0.277s
user    0m0.832s
sys     0m0.008s```

That sounds much better, both in absolute terms and in OpenMP terms. By the way, if we remove the matrix multiplication and only leave initialization and output, we still get an execution time of about 0.111 seconds. So it’s reasonably safe to say that our matrix multiplication takes about 0.377 seconds on a single thread. (I feel like I shot myself in the foot for measuring this using shell’s time, rather than embedding the measurement in the code itself…)

### Aligned accesses

To allow the use of the aligned _mm256_load_ps, allocate your memory like this:

```    matrix_a = aligned_alloc(ALIGNMENT, 1024*1024*sizeof(float));
matrix_b = aligned_alloc(ALIGNMENT, 1024*1024*sizeof(float));```

Unfortunately, I didn’t notice a significant difference. (You may be able to shave off a few percent.)

### Results

Here are the results, again:

 AVX, no OpenMP AVX, OpenMP SSE, no OpenMP Run time 0.488 0.277 0.59 Minus init/output 0.377 0.166 0.479

## Matrix multiplication using gcc’s auto-vectorization

In my previous post, I tried to explain how to use SIMD instructions for a really simple (and artificial) example: just adding numbers in two vectors together. In this post, I’d like to take this just a little bit further and talk about matrix multiplication. In this post, we’re using gcc’s auto-vectorization. We’ll vectorize this ourselves in my next post.

If you’re here, you probably know what matrix multiplication is. It’s got a lot of uses, including graphics and neural networks.

We’ll keep our implementation simple by only supporting square matrices with n dividable by 16 (in the case of AVX). Our example will use n=1024. So before we do the vectorized implementation, let’s look at a general (“naive”) example:

```#include <stdio.h>

int main(int argc, char **argv) {
float matrix_a;
float matrix_b;
float result_matrix;

// initialize arrays
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
matrix_a[i][j] = 0.1f;
matrix_b[i][j] = 0.2f;
result_matrix[i][j] = 0.0f;
}
}

for (i = 0; i < 1024; i++) { // iterate over rows of matrix A/result matrix
for (j = 0; j < 1024; j++) { // iterate over columns matrix B/result matrix
for (k = 0; k < 1024; k++) { // iterate over columns of matrix A and rows of matrix B
result_matrix[i][j] += matrix_a[i][k]*matrix_b[k][j]
}
}
}

// output
for (int i = 0; i < 1024; i++) {
for (int j = 0; j < 1024; j++) {
printf("%f ", result_matrix[i][j]);
}
printf("\n");
}
}```

To compile and run, execute the following commands:

```\$ gcc -Wall -o mm mm.c
\$ ulimit -s 16384
\$ time ./mm > mm_output

real    0m20.189s
user    0m20.016s
sys     0m0.072s```

(Note that we are allocating the arrays on the stack rather than using malloc, so we need to raise the stack size a bit, otherwise we get an immediate segmentation fault.)

The reason matrix multiplication code can look a bit mysterious is that there are a lot of things that can be optimized. However, there is only one optimization that is required to get vectorization to work at all.

As you can see, when we access matrix_a, we access matrix_a[i], then matrix_a[i], matrix_a[i], matrix_a[i], and so on until we have hit the end. This is nice and sequential memory access, and is much faster than haphazard (“random”) accesses. In matrix_b, we have somewhat haphazard accesses. The first access is matrix_b[j], the second access is (in our example) 1024 bytes away from the first, matrix_b[j], then another 1024 bytes away at matrix_b[j], etc. There is a 1024 byte gap between every access. This kind of access is slow. It ruins the CPU’s caching system. This is why matrix_b will often be transposed in matrix multiplication code. If you transpose the matrix, the rows will be the columns and the columns the rows, thus you get nice and sequential access to matrix_b. (In our demonstration code, we are using square matrices with the same values everywhere, so we don’t actually have to do any copying work, as matrix_b is the same transposed or not. So all we have to do is swap the indices.)

`            result[i][j] += matrix_a[i][k]*matrix_b[j][k]`

So what kind of speed-up does this get us? The naive implementation takes 19-21 seconds on my system. The implementation with the transposed matrix takes 4 seconds! That’s a 5x speed-up!

Next, we’ll try to parallelize the outer for-loop using OpenMP. With OpenMP we just have to add #pragma omp parallel for in front of the loop, like this:

```    #pragma omp parallel for
for (int i = 0; i < 1024; i++) {```

And then compile and run like this:

```\$ gcc -fopenmp -Wall -o mmT mmT.c
\$ ulimit -s 16384
\$ time ./mmT > /dev/null

real    0m2.939s
user    0m9.984s
sys     0m0.016s```

Next, we’ll ask gcc to auto-vectorize! Curiously enough, gcc didn’t autovectorize the version with the transposed loop, so I’ve gathered results for -O4 without autovectorization for non-transposed, -O4 with autovectorization for non-transposed, and -O4 transposed:

 -O4 with SSE autovectorization -O4 with AVX autovectorization -O4 without autovectorization Straight 2.99 1.527 8.921 Transposed n/a n/a 1.565

And here are the commands and some example output:

```\$ # -O4, no auto-vectorization, straight
\$ gcc -g -O4 -fopt-info-optall-optimized -fno-tree-vectorize -o mm mm.c
mm.c:9:9: note: Loop 7 distributed: split to 1 loops and 1 library calls.
\$ time ./mm > /dev/null

real    0m8.921s
user    0m8.912s
sys     0m0.004s
\$ # -O4, SSE auto-vectorization, straight
\$ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -o mm mm.c
mm.c:9:9: note: Loop 7 distributed: split to 1 loops and 1 library calls.
mm.c:18:9: note: loop vectorized
mm.c:9:9: note: loop vectorized
\$ # -O4, AVX auto-vectorization, straight
\$ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -mavx -o mm mm.c
\$ # -O4, no auto-vectorization, transformed
\$ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -o mmT mmT.c```

Let’s add OpenMP into the mix:

 -O4 with AVX autovectorization and OpenMP -O4 with OpenMP Straight 1.18 5.568 Transposed n/a 0.882

Just asking OpenMP to parallelize the i-loop makes the auto-vectorization break, but we can work around that by manually splitting the matrix multiplication into chunks. This is the full code:

```#include <stdio.h>

#define N 1024

float matrix_a[N][N];
float matrix_b[N][N];
float result_matrix[N][N];

void chunked_mm(int chunk, int n_chunks) {
for (int i = chunk*(N/n_chunks); i < (chunk+1)*(N/n_chunks); i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++) {
result_matrix[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
}
}

int main(int argc, char **argv) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
matrix_a[i][j] = 0.1f;
matrix_b[i][j] = 0.2f;
result_matrix[i][j] = 0.0f;
}
}
#pragma omp parallel for
for (int chunk = 0; chunk < 4; chunk++) {
chunked_mm(chunk, 4);
}

for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
printf("%f ", result_matrix[i][j]);
}
printf("\n");
}
}```

Compile and run:

```\$ gcc -g -O4 -fopenmp -fopt-info-optall-optimized -ftree-vectorize -mavx -o mm_autovectorized_openmp mm_autovectorized_openmp.c
mm_autovectorized_openmp.c:11:9: note: loop vectorized
mm_autovectorized_openmp.c:11:9: note: loop vectorized
mm_autovectorized_openmp.c:21:9: note: Loop 4 distributed: split to 1 loops and 1 library calls.
mm_autovectorized_openmp.c:21:9: note: loop vectorized
mm_autovectorized_openmp.c:21:9: note: loop peeled for vectorization to enhance alignment
mm_autovectorized_openmp.c:21:9: note: loop turned into non-loop; it never loops.
mm_autovectorized_openmp.c:21:9: note: loop with 7 iterations completely unrolled
mm_autovectorized_openmp.c:19:5: note: loop turned into non-loop; it never loops.
mm_autovectorized_openmp.c:19:5: note: loop with 7 iterations completely unrolled
\$ time ./mm_autovectorized_openmp > /dev/null

real    0m1.182s
user    0m3.036s
sys     0m0.012s```

From the user time being larger than the real time, we can tell that this was indeed running in multiple threads. Enclosing the parallel loop with something like:

```for (int loop = 0; loop < 10; loop++) {
#pragma omp parallel for
for (int chunk = 0; chunk < 8; chunk++) {
chunked_mm(chunk, 8);
}
}```

allows us a better measurement of how much improvement we get.

```time ./mm_autovectorized_openmp > /dev/null

real    0m6.649s
user    0m23.572s
sys     0m0.012s```

Anyway, the transpose still beats gcc’s auto-vectorization of the non-transposed code. I wish I could get gcc to auto-vectorize the transposed code, but alas.

In the next post we’ll vectorize this ourselves!