In my previous post, we did matrix multiplication using regular SSE/AVX instructions. In this post, we’ll implement matrix multiplication using the FMA (fused multiply-add) instruction, which takes three arguments and is able to multiply and add at the same time. (Think c = a*b + c.)

If you got here directly without reading the previous posts, note that this is just a somewhat naive implementation using inverted matrices (but without doing the inversion ourselves). (The intent of this article series is to show how to use SIMD instructions.) My previous post had a few links for people who need something really optimized.

Note that we’re using “FMA3” instructions, rather than FMA4 instructions, which only seem to be supported on some AMD processors. (The number indicates the number of arguments passed to the instruction. In the case of FMA4, the formula would be a = b*c + d.)

The first table below shows the operation of the first FMA instruction, where a is still 0 (as there is nothing to add on the very first instruction), and the second table the following FMA instruction, where the “addend” is the result in the first table.

Addend (a) |
0 | 0 | 0 | 0 |

Factor 1 (b) |
0.1 | 0.1 | 0.1 | 0.1 |

Factor 2 (c) |
0.1 | 0.1 | 0.1 | 0.1 |

Result (a) |
0.01 | 0.01 | 0.01 | 0.01 |

Addend (a) |
0.01 | 0.01 | 0.01 | 0.01 |

Factor 1 (b) |
0.1 | 0.1 | 0.1 | 0.1 |

Factor 2 (c) |
0.1 | 0.1 | 0.1 | 0.1 |

Result (a) |
0.02 | 0.02 | 0.02 | 0.02 |

Here’s the code. I changed the square matrix size to 2048 to make measuring a bit easier.

#include <x86intrin.h> #include <stdio.h> #include <stdlib.h> #define N 2048 float *matrix_a; float *matrix_b; float result[N][N]; void chunked_mm(int chunk, int n_chunks) { __m256 va, vb, vc; for (int i = chunk*(N/n_chunks); i < (chunk+1)*(N/n_chunks); i++) { for (int j = 0; j < N; j++) { float buffer[8] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f }; vc = _mm256_loadu_ps(buffer); for (int k = 0; k < N; k += 8) { // load va = _mm256_loadu_ps(matrix_a+(i*N)+k); // matrix_a[i][k] vb = _mm256_loadu_ps(matrix_b+(j*N)+k); // matrix_b[j][k] // fused multiply and add vc = _mm256_fmadd_ps(va, vb, vc); } //vc = _mm256_hadd_ps(vc, vc); _mm256_storeu_ps(buffer, vc); result[i][j] = buffer[0] + buffer[1] + buffer[2] + buffer[3] + buffer[4] + buffer[5] + buffer[6] + buffer[7]; //result[i][j] = buffer[0] + buffer[2] + buffer[4] + buffer[6]; } } } int main(int argc, char **argv) { // initialize matrix_a and matrix_b matrix_a = malloc(N*N*sizeof(float)); matrix_b = malloc(N*N*sizeof(float)); for (int i = 0; i < N*N; i++) { *(matrix_a+i) = 0.1f; *(matrix_b+i) = 0.2f; } // initialize result matrix for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { result[i][j] = 0.0f; } } #pragma omp parallel for for (int i = 0; i < 4; i++) { chunked_mm(i, 4); } for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { //printf("%f ", result[i][j]); printf("%x ", *(unsigned int*)&result[i][j]); } printf("\n"); } return 0; }

### Performance

Since the CPU used in the previous articles doesn’t support FMA (and we changed N), I’m re-benchmarking the AVX256 version on the new processor.

AVX256: 1.25 seconds

FMA: 1 second

Unfortunately this is borrowed hardware so I can’t play around with this too much, but the above result is pretty consistent.