Matrix multiplication using the FMA instruction

In my previous post, we did matrix multiplication using regular SSE/AVX instructions. In this post, we’ll implement matrix multiplication using the FMA (fused multiply-add) instruction, which takes three arguments and is able to multiply and add at the same time. (Think c = a*b + c.)

If you got here directly without reading the previous posts, note that this is just a somewhat naive implementation using inverted matrices (but without doing the inversion ourselves). (The intent of this article series is to show how to use SIMD instructions.) My previous post had a few links for people who need something really optimized.

Note that we’re using “FMA3” instructions, rather than FMA4 instructions, which only seem to be supported on some AMD processors. (The number indicates the number of arguments passed to the instruction. In the case of FMA4, the formula would be a = b*c + d.)

The first table below shows the operation of the first FMA instruction, where a is still 0 (as there is nothing to add on the very first instruction), and the second table the following FMA instruction, where the “addend” is the result in the first table.

Addend (a) 0 0 0 0
Factor 1 (b) 0.1 0.1 0.1 0.1
Factor 2 (c) 0.1 0.1 0.1 0.1
Result (a) 0.01 0.01 0.01 0.01
Addend (a) 0.01 0.01 0.01 0.01
Factor 1 (b) 0.1 0.1 0.1 0.1
Factor 2 (c) 0.1 0.1 0.1 0.1
Result (a) 0.02 0.02 0.02 0.02

Here’s the code. I changed the square matrix size to 2048 to make measuring a bit easier.

#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>

#define N 2048

float *matrix_a;
float *matrix_b;
float result[N][N];

void chunked_mm(int chunk, int n_chunks) {
    __m256 va, vb, vc;
    for (int i = chunk*(N/n_chunks); i < (chunk+1)*(N/n_chunks); i++) {
        for (int j = 0; j < N; j++) {
            float buffer[8] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
            vc = _mm256_loadu_ps(buffer);
            for (int k = 0; k < N; k += 8) {
                // load
                va = _mm256_loadu_ps(matrix_a+(i*N)+k); // matrix_a[i][k]
                vb = _mm256_loadu_ps(matrix_b+(j*N)+k); // matrix_b[j][k]

                // fused multiply and add
                vc = _mm256_fmadd_ps(va, vb, vc);
            }
            //vc = _mm256_hadd_ps(vc, vc);
            _mm256_storeu_ps(buffer, vc);
            result[i][j] = buffer[0] + buffer[1] + buffer[2] + buffer[3] + buffer[4] + buffer[5] + buffer[6] + buffer[7];
            //result[i][j] = buffer[0] + buffer[2] + buffer[4] + buffer[6];
        }
    }
}

int main(int argc, char **argv) {
    // initialize matrix_a and matrix_b
    matrix_a = malloc(N*N*sizeof(float));
    matrix_b = malloc(N*N*sizeof(float));

    for (int i = 0; i < N*N; i++) {
        *(matrix_a+i) = 0.1f;
        *(matrix_b+i) = 0.2f;
    }
    // initialize result matrix
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            result[i][j] = 0.0f;
        }
    }

    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        chunked_mm(i, 4);
    }
    
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            //printf("%f ", result[i][j]);
            printf("%x ", *(unsigned int*)&result[i][j]);
        }
        printf("\n");
    }
    
    return 0;
}

Performance

Since the CPU used in the previous articles doesn’t support FMA (and we changed N), I’m re-benchmarking the AVX256 version on the new processor.

AVX256: 1.25 seconds
FMA: 1 second

Unfortunately this is borrowed hardware so I can’t play around with this too much, but the above result is pretty consistent.

2 thoughts on “Matrix multiplication using the FMA instruction”

  1. This is a great introduction to SIMD. Thank you!

    Can you talk about how the implementation here compare to OpenBLAS, and the techniques used in OpenBLAS for computing gemv and gemm?

Leave a Reply

Your email address will not be published. Required fields are marked *